
[Machine Learning] Unsupervised Learning - Clustering | K-Means | Anomaly Detection

paka_corn 2023. 5. 29. 04:11

Unsupervised Learning - Clustering 

Unsupervised Learning - Unsupervised learning uses unlabeled data. The training examples do not have targets or labels "y". Recall the T-shirt example: the data was height and weight, but there was no target size.

Clustering : find data points that are related or similar to each other

- mostly used in marketing | segmentation | tracking | anomaly detection

- different clusters should be dissimilar from each other

- Clustering Algorithms : K-Means | Mean Shift | Gaussian Mixture Model | DBSCAN

 

[ K-means ]

K-means : K-means will repeatedly do two different things

(1) it assigns points to cluster centroids (centroid = center)

(2) it moves the cluster centroids

- If each example x is a vector of 5 numbers, then each cluster centroid is also going to be a vector of 5 numbers.

- The number of cluster assignment variables is equal to the number of training examples (one assignment per example).

 

Step 1) Assign each point to its closest centroid

Step 2) Recompute the centroids -> reassign each point to its new closest centroid

=> Apply these two steps until there are no further changes

=> K-means can arrive at different solutions depending on initialization. After running repeated trials, choose the solution with the lowest cost.
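Below is a minimal NumPy sketch of these two steps, just to make them concrete. The toy data, function name, and variable names are illustrative, not from the original post.

```python
import numpy as np

def run_kmeans(X, K, n_iters=10, seed=0):
    rng = np.random.default_rng(seed)
    m = X.shape[0]
    # Randomly initialize the K centroids by picking K distinct training examples
    centroids = X[rng.choice(m, size=K, replace=False)].copy()
    for _ in range(n_iters):
        # Step 1: assign each point to its closest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)  # shape (m, K)
        c = dists.argmin(axis=1)                                               # c[i] in {0, ..., K-1}
        # Step 2: recompute each centroid as the mean of the points assigned to it
        for k in range(K):
            if np.any(c == k):
                centroids[k] = X[c == k].mean(axis=0)
    return centroids, c

# Two obvious groups -> K = 2 should recover them
X = np.array([[1.0, 1.0], [1.2, 0.8], [5.0, 5.0], [5.2, 4.8]])
centroids, c = run_kmeans(X, K=2)
print(centroids)  # one centroid near (1.1, 0.9), the other near (5.1, 4.9)
print(c)          # which centroid each example was assigned to
```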

 

 

K-means algorithm 

: Randomly initialize K cluster centroids 

  • The K-means algorithm will always converge to some final set of means for the centroids.
  • However, the converged solution may not always be ideal and depends on the initial setting of the centroids.
  • Therefore, in practice the K-means algorithm is usually run a few times with different random initializations.
  • One way to choose between these different solutions from different random initializations is to choose the one with the lowest cost function value (distortion), as in the sketch below.
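A small sketch of the "run it several times and keep the lowest-distortion run" idea, assuming scikit-learn is available (the toy data is made up):

```python
import numpy as np
from sklearn.cluster import KMeans

# Toy data with three rough groups (made up for illustration)
X = np.array([[1.0, 1.0], [1.2, 0.8], [5.0, 5.0], [5.2, 4.8], [9.0, 1.0], [8.8, 1.2]])

# n_init=10 repeats the random initialization 10 times and keeps the best run
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print(km.cluster_centers_)  # centroids of the lowest-distortion run
print(km.inertia_)          # distortion: sum of squared distances to the closest centroid
```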

=> K : the number of clusters we want to find

 ** K should be smaller than m (the number of training examples)

 

=> The dimension of each centroid μ_k = the dimension of the examples x^(i)

- c^(i) : which centroid example x^(i) is assigned to

=> If you are running K-means with K = 3 clusters, then each c^(i) should be 1, 2, or 3.

---> c^(i) describes which centroid example x^(i) is assigned to. If K = 3, then c^(i) would be one of 1, 2, or 3, assuming counting starts at 1.

 

Therefore, if we run K-means and compute the value of the cost function (distortion) after each iteration:

-> The cost will either decrease or stay the same after each iteration; it should never increase.
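For reference, a minimal sketch of that cost (distortion), assuming X, centroids, and assignments c shaped like the K-means sketch above:

```python
import numpy as np

def distortion(X, centroids, c):
    # J = (1/m) * sum_i || x(i) - mu_{c(i)} ||^2
    return np.mean(np.sum((X - centroids[c]) ** 2, axis=1))
```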

 

 

How to Choose the Value of K ? 

Way 1. The Elbow Method is one option!

Elbow method - plots the cost function against the number of clusters K. The ‘bend’ in the cost curve can suggest a natural value for K. Note that this feature may not exist or be significant in some data sets.

=> However, the right K is often ambiguous! The elbow can be hard to find.
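A sketch of the Elbow Method, assuming scikit-learn and matplotlib are available; the blob data is made up so that the bend should show up near K = 3:

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Three blobs of made-up 2-D points, so the "elbow" should appear around K = 3
X = np.vstack([rng.normal(loc=center, scale=0.5, size=(50, 2))
               for center in ([0, 0], [5, 5], [9, 0])])

ks = range(1, 9)
costs = [KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_ for k in ks]

plt.plot(list(ks), costs, marker="o")
plt.xlabel("K (number of clusters)")
plt.ylabel("Cost (distortion)")
plt.show()
```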

 

Way 2. Evaluate K-means based on how well it performs for that later (downstream) purpose.

 

 

 

 

[ Anomaly Detection ]

Anomaly Detection : find data points that are unusual or very different from the normal data

- Uses density estimation

- Used for fraud detection

 

Gaussian Distribution

Gaussian Distribution = Normal Distribution

Using 1/m versus 1/(m-1) in the variance estimate does not make a big difference in the result!
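A minimal sketch of fitting a Gaussian to one feature with the 1/m estimate (the feature values are made up). Note that np.var uses 1/m by default; ddof=1 would give the 1/(m-1) version:

```python
import numpy as np

x = np.array([4.9, 5.1, 5.0, 4.8, 5.2])   # made-up values of one feature

mu = x.mean()                              # mu = (1/m) * sum_i x(i)
var = x.var()                              # sigma^2 = (1/m) * sum_i (x(i) - mu)^2
p = np.exp(-(x - mu) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)  # Gaussian density p(x; mu, sigma^2)

print(mu, var)
print(p)
```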

 

 

Anomaly Detection Algorithm
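A minimal sketch of the density-estimation idea, assuming each feature is modeled with its own independent Gaussian (the data, helper names, and the threshold epsilon are all made up): estimate μ_j and σ_j^2 per feature, take p(x) as the product of the per-feature densities, and flag x as an anomaly when p(x) < ε.

```python
import numpy as np

def fit_gaussians(X):
    # Estimate mu_j and sigma_j^2 for every feature j (1/m version)
    return X.mean(axis=0), X.var(axis=0)

def p_of_x(x, mu, var):
    # p(x) = product over features j of the Gaussian density p(x_j; mu_j, sigma_j^2)
    densities = np.exp(-(x - mu) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)
    return np.prod(densities)

X_train = np.array([[1.0, 2.0], [1.1, 1.9], [0.9, 2.1], [1.0, 2.05]])  # "normal" examples (made up)
mu, var = fit_gaussians(X_train)

epsilon = 1e-3                       # threshold; in practice chosen with a labeled validation set
x_new = np.array([3.0, 0.5])         # a point far from the normal data
print(p_of_x(x_new, mu, var) < epsilon)  # True -> flagged as an anomaly
```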
