Code&Data Insights
[Machine Learning] Unsupervised Learning - Clustering | K-Means | Anomaly Detection
paka_corn 2023. 5. 29. 04:11
Unsupervised Learning - Clustering
Unsupervised Learning - Unsupervised learning uses unlabeled data. The training examples do not have targets or labels "y". Recall the T-shirt example. The data was height and weight but no target size.
Clustering : group data points that are related or similar to each other
- mostly used in marketing | segmentation | tracking | anomaly detection
- points in different clusters should be dissimilar
- Clustering algorithms : K-Means | Mean Shift | Gaussian Mixture Model | DBSCAN
[ K-means ]
k-means : k-means repeatedly does two things
(1) It assigns points to cluster centroids (centroid : the center of a cluster)
(2) It moves cluster centroids
- If each example x is a vector of 5 numbers, then each cluster centroid is also going to be a vector of 5 numbers.
- The number of cluster assignment variables is equal to the number of training examples.
Step 1) Assign each point to its closest centroid
Step 2) Recompute each centroid as the mean of the points assigned to it -> go back to Step 1 and reassign each point to its new closest centroid
=> Repeat these two steps until there are no further changes
=> K-means can arrive at different solutions depending on initialization. After running repeated trials, choose the solution with the lowest cost.
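The two steps above can be sketched in a few lines of plain Python. This is only a minimal sketch under some assumptions: points are tuples of floats, the starting centroids are passed in by hand (so the result is repeatable), cluster indices start at 0 instead of 1, and the helper names `closest_centroid`, `mean_point` and `k_means` are made up for this example.

```python
import math

def closest_centroid(point, centroids):
    """Return the index of the centroid nearest to `point` (Euclidean distance)."""
    distances = [math.dist(point, c) for c in centroids]
    return distances.index(min(distances))

def mean_point(points):
    """Coordinate-wise mean of a non-empty list of equal-length tuples."""
    return tuple(sum(coords) / len(points) for coords in zip(*points))

def k_means(points, centroids, max_iters=100):
    """Alternate the two steps until the assignments stop changing."""
    assignments = None
    for _ in range(max_iters):
        # Step 1: assign each point to its closest centroid.
        new_assignments = [closest_centroid(p, centroids) for p in points]
        if new_assignments == assignments:
            break  # nothing changed, so we are done
        assignments = new_assignments
        # Step 2: move each centroid to the mean of the points assigned to it.
        new_centroids = []
        for k, old in enumerate(centroids):
            members = [p for p, a in zip(points, assignments) if a == k]
            # Assumption: a centroid with no points simply stays where it was.
            new_centroids.append(mean_point(members) if members else old)
        centroids = new_centroids
    return centroids, assignments

points = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0), (8.0, 8.0), (9.0, 8.0), (8.0, 9.0)]
initial = [(0.0, 0.0), (10.0, 10.0)]  # hand-picked starting centroids
centroids, assignments = k_means(points, initial)
print(assignments)  # [0, 0, 0, 1, 1, 1]
```

Each centroid has the same number of coordinates as the points, and there is exactly one assignment per point, which matches the two notes in the list above.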
K-means algorithm
: Randomly initialize K cluster centroids
- The K-means algorithm will always converge to some final set of means for the centroids.
- However, the converged solution may not always be ideal and depends on the initial setting of the centroids.
- Therefore, in practice the K-means algorithm is usually run a few times with different random initializations.
- One way to choose between these different solutions from different random initializations is to choose the one with the lowest cost function value (distortion).
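A rough sketch of "run it a few times and keep the cheapest result" could look like this. It uses 1-D data to stay short, and picks K random training points as the starting centroids (one common choice, assumed here). The function names, the number of runs and the seed are all assumptions for illustration.

```python
import random

def distortion(points, centroids, assignments):
    """Average squared distance between each point and its assigned centroid."""
    total = sum((p - centroids[a]) ** 2 for p, a in zip(points, assignments))
    return total / len(points)

def k_means_1d(points, k, rng, max_iters=100):
    """One K-means run on 1-D data, starting from k randomly chosen points."""
    centroids = rng.sample(points, k)  # random initialization
    assignments = None
    for _ in range(max_iters):
        new_assignments = [
            min(range(k), key=lambda j: (p - centroids[j]) ** 2) for p in points
        ]
        if new_assignments == assignments:
            break
        assignments = new_assignments
        for j in range(k):
            members = [p for p, a in zip(points, assignments) if a == j]
            if members:  # assumption: an empty cluster keeps its old centroid
                centroids[j] = sum(members) / len(members)
    return centroids, assignments

def best_of_n_runs(points, k, n_runs=10, seed=0):
    """Run K-means n_runs times and keep the run with the lowest distortion."""
    rng = random.Random(seed)
    best = None
    for _ in range(n_runs):
        centroids, assignments = k_means_1d(points, k, rng)
        cost = distortion(points, centroids, assignments)
        if best is None or cost < best[0]:
            best = (cost, centroids, assignments)
    return best

points = [1.0, 1.5, 2.0, 10.0, 10.5, 11.0, 20.0, 20.5, 21.0]
cost, centroids, assignments = best_of_n_runs(points, k=3)
print(round(cost, 4), sorted(centroids))
```

With this data a single run can get stuck: if all three starting centroids land in the first group (1.0, 1.5, 2.0), the run ends with two centroids sharing that group and one centroid covering everything else. Keeping the lowest-cost run is a simple guard against that.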
=> K : the number of clusters we want to find
** K should be smaller than m (the number of training examples)
=> The dimension of each centroid μ_k = the dimension of the examples
- c^(i) : the index of the centroid that example x^(i) is assigned to
=> If you are running K-means with K = 3 clusters, then each c^(i) should be 1, 2, or 3 (assuming counting starts at 1).
Therefore, if we run K-means and compute the value of the cost function after each iteration,
-> the cost will either decrease or stay the same after each iteration. It never increases.
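For reference, the cost (distortion) is usually written as below. The notation is the conventional one and is assumed here: x^(i) is the i-th training example, c^(i) is the index of the centroid it is assigned to, μ_k is centroid k, and m is the number of examples.

```latex
J\left(c^{(1)},\dots,c^{(m)},\mu_1,\dots,\mu_K\right)
  = \frac{1}{m}\sum_{i=1}^{m}\left\lVert x^{(i)} - \mu_{c^{(i)}}\right\rVert^{2}
```

Step 1 picks the assignments c^(i) that minimize J while the centroids are held fixed, and Step 2 picks the centroids μ_k that minimize J while the assignments are held fixed. Neither step can make J larger, which is why the cost only goes down or stays the same.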
How to Choose the Value of K ?
Way 1. The elbow method is one option.
Elbow method - plots the cost function against the number of clusters K. The ‘bend’ in the cost curve can suggest a natural value for K. Note that this bend may not exist or be significant in some data sets.
=> However, the right K is often ambiguous, and the elbow can be hard to find.
Way 2. Evaluate K-means based on how well it performs for the later (downstream) purpose the clusters are used for. In the T-shirt example, K = 3 (S, M, L) versus K = 5 is a trade-off between better fit and the cost of making more sizes.
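For Way 1, the elbow is normally judged by eye on a plot, but the idea can be roughly sketched in code. The "bend" measure below (how much the drop in cost shrinks at each K) is only a heuristic assumed for this example, not a standard rule, and the cost values are made up purely to show a curve with a clear elbow.

```python
def pick_elbow(costs):
    """Return the K where the cost curve bends the most.

    `costs` maps K -> cost. The bend at K is measured as
    (drop from the previous K to K) - (drop from K to the next K).
    This heuristic is an assumption for illustration only.
    """
    ks = sorted(costs)
    if len(ks) < 3:
        raise ValueError("need costs for at least three values of K")
    best_k, best_bend = None, float("-inf")
    for prev_k, k, next_k in zip(ks, ks[1:], ks[2:]):
        bend = (costs[prev_k] - costs[k]) - (costs[k] - costs[next_k])
        if bend > best_bend:
            best_k, best_bend = k, bend
    return best_k

# Made-up cost values, only to illustrate the shape of the curve.
example_costs = {1: 100.0, 2: 60.0, 3: 20.0, 4: 17.0, 5: 15.0}
print(pick_elbow(example_costs))  # 3
```

On real data the curve is often smooth, so a number picked this way should be treated as a hint at most.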
[ Anomaly Detection ]
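Anomaly Detection : find the data points that are unusual, i.e. unlike the rest of the data
- Uses density estimation : model the probability p(x) of the normal data, then flag points where p(x) is very low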
- Used for Fraud Detection
Gaussian Distribution
Gaussian Distribution = Normal Distribution
Anomaly Detection Algorithm
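A minimal sketch of density-based anomaly detection with a single Gaussian feature might look like this: estimate the mean and variance from the training data, compute the density p(x) for a new point, and flag it when p(x) is below a threshold. The threshold `epsilon`, the function names and the sample values are all assumptions for illustration. In practice the threshold would need to be tuned.

```python
import math

def fit_gaussian(values):
    """Estimate the mean and variance of a 1-D feature from training data."""
    m = len(values)
    mu = sum(values) / m
    var = sum((v - mu) ** 2 for v in values) / m
    return mu, var

def gaussian_pdf(x, mu, var):
    """Density of the normal distribution with mean mu and variance var at x."""
    return math.exp(-((x - mu) ** 2) / (2 * var)) / math.sqrt(2 * math.pi * var)

def is_anomaly(x, mu, var, epsilon):
    """Flag x as an anomaly when its density falls below the threshold."""
    return gaussian_pdf(x, mu, var) < epsilon

train = [9.8, 10.1, 10.0, 9.9, 10.2, 10.0, 9.7, 10.3]  # illustrative values
mu, var = fit_gaussian(train)
epsilon = 0.01  # hypothetical threshold
print(is_anomaly(10.05, mu, var, epsilon))  # False
print(is_anomaly(14.0, mu, var, epsilon))   # True
```

A point close to the mean has a high density and passes, while a point many standard deviations away has a density near zero and is flagged.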