[클러스터링] DBSCAN

K-means가 거리에 따라 클러스터링을 했다면, 데이터 포인트 밀도에 따라 클러스터링을 해주는 알고리즘이 있다.

밀도 기반 군집 분석인 DBSCAN에 대해 정리해보고자 한다.

DBSCAN (Density Based Spatial Clustering of Applications with Noise)

DBSCAN은 두 가지의 하이퍼파라미터를 갖는다.

epsilon: MinPts 가 들어있는 범위를 결정 짓는 radius의 크기
MinPts: 만약, Epsilon 안에 MinPts 만큼의 데이터가 있으면 해당 포인트는 core point가 된다.

DBSCAN Point

core point: MinPts의 개수를 갖고있는 중간 지점의 point
border point: core point의 epsilon 안에 있지만, MinPts 만큼 갖고 있지 못하는 point
noise point: core, border도 아닌 point

DBSCAN Algorithm

위의 세 가지 포인트를 나누는 알고리즘은 아래와 같다.

포인트 한 점에 대해 core point인지 확인
- 아니면, 0과 같은 null label 부여 → 그리고 다음 포인트로 넘어감
- 맞으면, 새로운 클러스터 할당 ( cluster += 1)
(1)에서 새로 배정된 core point 주변의 포인트들을 같은 클러스터로 군집화
확장되다가 border point를 만나면, 클러스터의 확장이 멈춤
모든 포인트 방문할 때까지 위 과정 반복
null label로 남은 것들은 noise point로 할당.

DBSCAN R code

'''
DBSCAN
'''
# 각 데이터 포인트에 대해 특정 반경 eps 내에 있는 포인트 수 계산
# minPts보다 많은 이웃이 있는 포인트를 core point로 정의
# core point들을 연결해 클러스터를 생성 
expand_cluster <- function(data, labels, neighbors, cluster, eps, minPts, dist_matrix) {
  i <- 1
  # 일단 
  # neighbors: epsilon 안에 있는 데이터 포인트 index
  while (i <= length(neighbors)) {
    neighbor <- neighbors[i]
    # Border points 
    # 처음에 노이즈라고 판단되었지만, neighbor가 된 경우 
    if (labels[neighbor] == -1) {
      labels[neighbor] <- cluster
      
      # 아직 방문 안했으면, 
    } else if (labels[neighbor] == 0) {
      labels[neighbor] <- cluster # core point cluster에 배정 
      # neighbor의 neighbor
      new_neighbors <- which(dist_matrix[neighbor, ] < eps)
      
        # neighbor가 core point가 되는지 확인
        if (length(new_neighbors) >= minPts) {
          # 원래 있던 이웃에 이어붙임 
          neighbors <- unique(c(neighbors, new_neighbors))
        }
      }
    
    # 단계 별로 시각화 
    plot(data, col = labels + 1L, pch = 19, main = paste("DBSCAN Clustering - Cluster", cluster), xlab = "", ylab = "")
    Sys.sleep(0.04)  
    
    i <- i + 1
  }
  return(labels)  # 레이블 반환
}

dbscan <- function(data, eps, minPts) {
  labels <- rep(0, nrow(data))  # 초기 레이블 할당
  cluster <- 0  # 클러스터 초기화
  dist_matrix <- as.matrix(dist(data))  # 거리 행렬 계산
  
  for (i in 1:nrow(data)) {
    # 이미 할당된 데이터 확인
    if (labels[i] != 0) next
    
    neighbors <- which(dist_matrix[i, ] < eps)  # epsilon 이내 indexing 
    
    # minPts 미만이면, 노이즈 처리
    if (length(neighbors) < minPts) {
      labels[i] <- -1 
      next
    }
    # minPts 이상이면, 새 클러스터 할당
    cluster <- cluster + 1
    labels[i] <- cluster  # 클러스터 레이블 할당
    
    # epsilon 안에 있는 데이터 포인트 하나씩 방문하고 라벨 상태 update 
    labels <- expand_cluster(data, labels, neighbors, cluster, eps, minPts, dist_matrix)
  }
  # --- 데이터 포인트의 밀집이 끊긴 경우 ---
  
  return(labels)  # 최종 레이블 반환
}

DBSCAN 장단점

장점: noise 처리에 능하고 군집 개형이 flexible 하다.

단점: 밀도가 다른 데이터가 혼재해 있을 시 epslion과 minPts 설정이 어렵다.

'Machine learning > Unsupervised Learning' 카테고리의 다른 글

[클러스터링] T-SNE (0)	2024.01.23
[차원축소] PCA(주성분분석) (0)	2024.01.22
추천시스템 쉽게 이해하기 (Content-based, Collaborative filtering) (2)	2023.12.05
[클러스터링] CH-index (0)	2023.10.11
[클러스터링] K-means (0)	2023.10.10

Wave to AI

[클러스터링] DBSCAN

DBSCAN (Density Based Spatial Clustering of Applications with Noise)

DBSCAN Point

DBSCAN Algorithm

DBSCAN R code

DBSCAN 장단점

'Machine learning > Unsupervised Learning' 카테고리의 다른 글

티스토리툴바

[클러스터링] DBSCAN

DBSCAN (Density Based Spatial Clustering of Applications with Noise)

DBSCAN Point

DBSCAN Algorithm

DBSCAN R code

DBSCAN 장단점

'Machine learning > Unsupervised Learning' 카테고리의 다른 글

'Machine learning/Unsupervised Learning' Related Articles

티스토리툴바