[Machine Learning] Unsupervised Learning - Clustering | K-Means

Code&Data Insights

[Machine Learning] Unsupervised Learning - Clustering | K-Means | Anomaly Detection

paka_corn 2023. 5. 29. 04:11

Unsupervised Learning - Clustering

Unsupervised Learning - Unsupervised learning uses unlabeled data: the training examples have no targets or labels "y". Recall the T-shirt example, where the data was height and weight, with no target size.

Clustering : find data points that are related or similar to one another

- mostly used in marketing | segmentation | tracking | anomaly detection

- different clusters must be dissimilar from each other

- Clustering Algorithm : K-Means | Mean Shift | Gaussian Mixture Model | DBSCAN

[ K-means ]

K-means : K-means repeatedly does two different things

(1) It assigns points to cluster centroids (centroid : the center of a cluster)

(2) It moves cluster centroids

- If each example x is a vector of 5 numbers, then each cluster centroid is also going to be a vector of 5 numbers.

- The number of cluster assignment variables is equal to the number of training examples.

Step 1) Assign each point to its closest centroid

Step 2) Recompute the centroids -> Reassign each point to its new closest centroid

=> Repeat these two steps until there are no further changes

=> K-means can arrive at different solutions depending on initialization. After running repeated trials, choose the solution with the lowest cost.
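The two steps above can be sketched in a few lines of NumPy. This is only a minimal illustration, not a reference implementation: the function name `kmeans`, the `max_iters` and `seed` parameters, and the toy data are all assumptions made for the example, and the centroids are initialized by picking K distinct training examples.

```python
import numpy as np

def kmeans(X, K, max_iters=100, seed=0):
    """Minimal K-means sketch: alternate assignment and centroid update."""
    rng = np.random.default_rng(seed)
    # Initialize centroids as K distinct training examples (requires K <= m).
    centroids = X[rng.choice(len(X), size=K, replace=False)].astype(float)
    labels = None
    for _ in range(max_iters):
        # Step 1: assign each point to its closest centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)
        # Stop when no assignment changes.
        if labels is not None and np.array_equal(new_labels, labels):
            break
        labels = new_labels
        # Step 2: move each centroid to the mean of its assigned points.
        for k in range(K):
            members = X[labels == k]
            if len(members) > 0:  # keep the old centroid if a cluster is empty
                centroids[k] = members.mean(axis=0)
    return centroids, labels

# Toy data (made up for illustration): two well-separated groups of points.
X = np.array([[1.0, 1.0], [1.2, 0.8], [0.9, 1.1],
              [8.0, 8.0], [8.2, 7.9], [7.8, 8.1]])
centroids, labels = kmeans(X, K=2)
print(centroids)
print(labels)
```

Note that `labels` has one entry per training example and each row of `centroids` has the same dimension as an example, which matches the two remarks above.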

K-means algorithm

: Randomly initialize K cluster centroids

The $K$-means algorithm will always converge to some final set of means for the centroids.
However, the converged solution may not always be ideal and depends on the initial setting of the centroids.
Therefore, in practice the K-means algorithm is usually run a few times with different random initializations.
One way to choose between these different solutions from different random initializations is to choose the one with the lowest cost function value (distortion).
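A rough sketch of this "several random initializations, keep the lowest cost" idea is shown below. The helper names (`assign`, `run_kmeans`, `distortion`, `best_of_n`), the number of trials, and the synthetic data are assumptions for illustration only; distortion is taken here to be the mean squared distance between each point and its assigned centroid.

```python
import numpy as np

def assign(X, centroids):
    # Index of the closest centroid for every point.
    dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    return dists.argmin(axis=1)

def run_kmeans(X, K, rng, max_iters=100):
    # Random initialization: pick K distinct training examples.
    centroids = X[rng.choice(len(X), size=K, replace=False)].astype(float)
    for _ in range(max_iters):
        labels = assign(X, centroids)
        new_centroids = centroids.copy()
        for k in range(K):
            if np.any(labels == k):
                new_centroids[k] = X[labels == k].mean(axis=0)
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return centroids, assign(X, centroids)

def distortion(X, centroids, labels):
    # Mean squared distance between each point and its assigned centroid.
    return np.mean(np.sum((X - centroids[labels]) ** 2, axis=1))

def best_of_n(X, K, n_trials=10, seed=0):
    rng = np.random.default_rng(seed)
    best = None
    for _ in range(n_trials):
        centroids, labels = run_kmeans(X, K, rng)
        cost = distortion(X, centroids, labels)
        if best is None or cost < best[0]:
            best = (cost, centroids, labels)
    return best

# Synthetic data (an assumption for the example): three Gaussian blobs.
data_rng = np.random.default_rng(42)
X = np.vstack([data_rng.normal(loc, 0.5, size=(30, 2))
               for loc in ([0, 0], [5, 5], [0, 5])])
cost, centroids, labels = best_of_n(X, K=3)
print("lowest distortion over 10 trials:", cost)
```

Running more trials costs more compute but can only keep or lower the best distortion found, since earlier trials are never discarded in favor of worse ones.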

=> K : the number of clusters we want to find

** K should be smaller than m (the number of training examples)

=> The dimension of each centroid $\mu_k$ = the dimension of the examples

- $c^{(i)}$ : which centroid example $x^{(i)}$ is assigned to.

=> If you are running K-means with $K = 3$ clusters, then each $c^{(i)}$ should be 1, 2, or 3 (assuming counting starts at 1).

If we run K-means and compute the value of the cost function after each iteration:

-> The cost will either decrease or stay the same after each iteration; it never increases.
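For reference, the cost function meant here is usually written as the distortion below. The notation is an assumption following the common convention: $x^{(i)}$ is the $i$-th of $m$ training examples, $c^{(i)}$ is the index of the centroid it is assigned to, and $\mu_k$ is the $k$-th centroid.

```latex
J\left(c^{(1)},\dots,c^{(m)},\mu_1,\dots,\mu_K\right)
  = \frac{1}{m}\sum_{i=1}^{m}\left\lVert x^{(i)} - \mu_{c^{(i)}} \right\rVert^{2}
```

The assignment step minimizes $J$ over the $c^{(i)}$ with the centroids held fixed, and the move step minimizes $J$ over the $\mu_k$ with the assignments held fixed, so neither step can raise the cost.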

How to Choose the Value of K ?

Way 1. The elbow method is one option.

Elbow method - plots the cost function against the number of clusters K. The ‘bend’ in the cost curve can suggest a natural value for K. Note that this feature may not exist or be significant in some data sets.

=> However, the right K is often ambiguous, and the elbow can be hard to find.

Way 2. Evaluate K-means based on how well it performs for the later (downstream) purpose
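As a rough sketch of the elbow method, the snippet below computes the cost for K = 1 to 6 on synthetic data and prints the values. The helper `kmeans_cost`, the choice of 5 restarts per K, and the three-blob data set are assumptions for illustration; in practice the costs would be plotted against K and inspected by eye.

```python
import numpy as np

def kmeans_cost(X, K, rng, max_iters=100):
    """One K-means run; returns the final distortion (mean squared distance)."""
    centroids = X[rng.choice(len(X), size=K, replace=False)].astype(float)
    for _ in range(max_iters):
        d2 = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        # New centroid = mean of assigned points (unchanged if cluster is empty).
        new = np.array([X[labels == k].mean(axis=0) if np.any(labels == k)
                        else centroids[k] for k in range(K)])
        if np.allclose(new, centroids):
            break
        centroids = new
    d2 = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    return d2.min(axis=1).mean()

# Synthetic data (an assumption for the example): three Gaussian blobs.
data_rng = np.random.default_rng(1)
X = np.vstack([data_rng.normal(loc, 0.5, size=(40, 2))
               for loc in ([0, 0], [6, 0], [3, 5])])

rng = np.random.default_rng(0)
ks = list(range(1, 7))
# For each K, keep the lowest cost over a few random initializations.
costs = [min(kmeans_cost(X, K, rng) for _ in range(5)) for K in ks]

for K, c in zip(ks, costs):
    print(f"K={K}: cost={c:.3f}")
```

On data like this the cost typically drops sharply up to the true number of groups and flattens afterwards, but on real data the curve is often smooth, which is the ambiguity noted above.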

[ Anomaly Detection ]

Anomaly Detection : find the data points that are unusual, i.e. that differ from the rest of the data

- Use Density Estimation

- Used for Fraud Detection

Gaussian Distribution

Gaussian Distribution = Normal Distribution

Using 1/m or 1/(m-1) when estimating the variance doesn't make a big difference to the result!
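To make the 1/m versus 1/(m-1) remark concrete, here is a small sketch that estimates the mean and variance of a Gaussian from samples both ways. The sample size, the true mean and standard deviation, and the helper `gaussian_pdf` are assumptions chosen for the example.

```python
import numpy as np

# Synthetic 1-D samples (an assumption for the example).
rng = np.random.default_rng(0)
x = rng.normal(loc=5.0, scale=2.0, size=1000)
m = len(x)

mu = x.mean()
var_m = np.sum((x - mu) ** 2) / m          # 1/m     (same as np.var(x))
var_m1 = np.sum((x - mu) ** 2) / (m - 1)   # 1/(m-1) (same as np.var(x, ddof=1))

def gaussian_pdf(x, mu, var):
    # Density of a normal distribution with mean mu and variance var.
    return np.exp(-(x - mu) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)

print("variance with 1/m:    ", var_m)
print("variance with 1/(m-1):", var_m1)
print("density at the mean:  ", gaussian_pdf(mu, mu, var_m))
```

The two estimates differ by a factor of m/(m-1), which is very close to 1 once m is reasonably large; the gap only matters for very small data sets.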

Anomaly Detection Algorithm
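One common way to build such an algorithm from density estimation is sketched below: fit an independent Gaussian to each feature, multiply the per-feature densities to get p(x), and flag an example as anomalous when p(x) falls below a threshold. This is a sketch under stated assumptions, not necessarily the exact procedure the post had in mind: the function names, the synthetic data, and the threshold `epsilon` are all hypothetical, and in practice the threshold would be tuned on a labeled validation set.

```python
import numpy as np

def fit_gaussian(X):
    # Per-feature mean and variance (1/m version).
    mu = X.mean(axis=0)
    var = X.var(axis=0)
    return mu, var

def density(X, mu, var):
    # p(x) = product over features of the 1-D Gaussian density.
    p = np.exp(-(X - mu) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)
    return p.prod(axis=1)

# Synthetic "normal" training data (an assumption for the example).
rng = np.random.default_rng(0)
X_train = rng.normal(loc=[10.0, 50.0], scale=[1.0, 5.0], size=(500, 2))
mu, var = fit_gaussian(X_train)

# Two new examples: one typical, one far from the training data.
X_new = np.array([[10.2, 49.0], [17.0, 90.0]])
epsilon = 1e-4  # hypothetical threshold, not a recommended value
is_anomaly = density(X_new, mu, var) < epsilon
print(is_anomaly)
```

Here the first example lies near the bulk of the training data and gets a relatively high density, while the second is many standard deviations away in both features and is flagged.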
