Machine Learning: Unsupervised Learning

Unsupervised Learning deals with finding patterns in unlabeled data. Rather than predicting a target value, it helps us cluster the data by some pattern; from these clusters we can draw conclusions about the data and make business decisions.
Types of Unsupervised Learning are:
  1. Clustering: It involves finding natural clusters in a dataset, if they exist. The criterion for clustering can be very simple, for example gender, or as complex as purchase preferences. There are different types of clustering that can be utilized: K-Means Clustering, Hierarchical Clustering, and Probabilistic Clustering.
  2. Data Compression: This is one of the goals that can be achieved using unsupervised learning. Since the amount of data is increasing day by day, we require more and more storage space to hold it. Compression can be achieved by a process called Dimensionality Reduction. Popular algorithms for dimensionality reduction are Principal Component Analysis (PCA) and Linear Discriminant Analysis (LDA).
  3. Autoencoders: In deep learning, autoencoders are an example of unsupervised learning. Just like data compression algorithms, they try to represent the original data with a smaller set of features. They have been discussed briefly in the Neural Network post on this blog.
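As a minimal sketch of dimensionality reduction, here is PCA implemented directly with NumPy's SVD on a synthetic dataset (the data and dimensions are made up for illustration): 5 features that are really mixtures of 2 underlying factors get compressed down to 2 components with almost no loss.

```python
import numpy as np

# Toy dataset: 100 samples, 5 features that are linear mixtures of 2 hidden factors.
rng = np.random.default_rng(0)
base = rng.normal(size=(100, 2))
X = base @ rng.normal(size=(2, 5)) + rng.normal(scale=0.01, size=(100, 5))

# PCA via SVD: center the data, then project onto the top-k right singular vectors.
Xc = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
k = 2
X_reduced = Xc @ Vt[:k].T          # compressed representation (100 x 2)
X_restored = X_reduced @ Vt[:k]    # approximate reconstruction (100 x 5)

# Nearly all variance lives in the first two components, so the
# reconstruction error is tiny relative to the data's total variance.
rel_err = np.linalg.norm(Xc - X_restored) / np.linalg.norm(Xc)
print(X_reduced.shape)
```

The compressed 2-column matrix is what you would store or feed to a downstream model in place of the original 5 columns.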

Clustering

Clustering, as mentioned above, deals with grouping data, or finding clusters in it, using some criterion. It is widely used in the retail sector nowadays to serve customers better and to increase profits.

For example, in a supermarket we see some items being sold together with an additional discount on the combination. Through clustering we can see which items are bought together most frequently, and then sell them as a bundle instead of separately. Various brands do this so that the customer buys the other item from their brand instead of any other brand that might be available.

The quality of clustering depends on the algorithm, distance function, and the application.
Some major clustering approaches are: 
1. Partitioning-based: Constructs various partitions and then evaluates them based on certain criteria.
2. Hierarchical: Creates a hierarchical decomposition of the set of objects using some criterion.
3. Model-based: Hypothesises a model for each cluster and finds the best-fitting model for the data.
4. Density-based: Guided by connectivity and density functions; the density of points over a region drives the formation of a cluster. Example: DBSCAN.
5. Graph-theoretic: Clusters on a graph using the weights of the edges between connected nodes.
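To see why density-based clustering matters, here is a small sketch using scikit-learn's DBSCAN on the classic "two moons" dataset, a shape that partitioning methods like K-means handle poorly but density-based methods handle well (the `eps` and `min_samples` values here are just illustrative choices for this toy data):

```python
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

# Two interleaved half-moon shapes: non-globular clusters.
X, _ = make_moons(n_samples=200, noise=0.05, random_state=0)

# eps: neighborhood radius; min_samples: points needed to form a dense region.
labels = DBSCAN(eps=0.2, min_samples=5).fit_predict(X)

# DBSCAN discovers the cluster count from density; -1 marks noise points.
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
print(n_clusters)
```

Note that we never told DBSCAN how many clusters to find; the density of the points determined that on its own.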

K-Means Clustering

K-means is a partitioning-based clustering algorithm. In this approach, we first define K centers in the dataset, one for each cluster. The points near each center are then associated with it. After this step, we recalculate the centroids of the clusters and the centers are moved. The process is repeated until there is no further significant movement. The algorithm aims to minimize an objective function known as the squared error function, given by:

    J(V) = Σ (i = 1 to c) Σ (j = 1 to ci) (||xj − vi||)²

where,
    '||xj − vi||' is the Euclidean distance between data point xj and cluster center vi,
    'ci' is the number of data points in the ith cluster,
    'c' is the number of cluster centers.

Steps for K-means Clustering:

Let X = {x1, x2, x3, ..., xn} be the set of data points and V = {v1, v2, ..., vc} be the set of cluster centers.
1) Randomly select ‘c’ cluster centers.
2) Calculate the distance between each data point and cluster centers.
3) Assign each data point to the cluster center whose distance from it is the minimum over all the cluster centers.
4) Recalculate the new cluster centers using:

       vi = (1 / ci) Σ (j = 1 to ci) xj

where 'ci' represents the number of data points in the ith cluster.

5) Recalculate the distance between each data point and the newly obtained cluster centers.
6) If no data point was reassigned then stop, otherwise repeat from step 3).
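The steps above can be sketched in a few lines of NumPy. This is an illustrative from-scratch implementation (the function name and the two-blob test data are made up for the example), not a production one; in practice you would reach for scikit-learn's KMeans.

```python
import numpy as np

def kmeans(X, c, max_iter=100, seed=0):
    """K-means following the steps above (illustrative sketch)."""
    rng = np.random.default_rng(seed)
    # 1) Randomly select c cluster centers from the data points.
    centers = X[rng.choice(len(X), size=c, replace=False)]
    assign = np.full(len(X), -1)
    for _ in range(max_iter):
        # 2-3) Assign each point to its nearest center (Euclidean distance).
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        new_assign = dists.argmin(axis=1)
        # 6) Stop when no data point is reassigned.
        if np.array_equal(new_assign, assign):
            break
        assign = new_assign
        # 4) Recalculate each center as the mean of its assigned points.
        for i in range(c):
            if np.any(assign == i):
                centers[i] = X[assign == i].mean(axis=0)
    return centers, assign

# Two well-separated blobs around (0, 0) and (5, 5); K-means recovers them.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.3, (50, 2)), rng.normal(5, 0.3, (50, 2))])
centers, labels = kmeans(X, c=2)
print(np.round(sorted(centers[:, 0])))  # approx [0, 5]
```

Because the blobs are distinct and well separated, the algorithm converges to the true centers here, which matches the advantage noted below.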

There are a number of advantages to this approach:

1) Fast, robust, and easy to understand.
2) Relatively efficient: O(tknd), where n is the number of objects, k the number of clusters, d the dimensionality of each object, and t the number of iterations. Normally, k, t, d << n.
3) Gives the best results when the clusters in the data set are distinct or well separated from each other.

Disadvantages:

1. It fails on non-linearly separable data (clusters that are not globular).
2. It fails to handle noisy data.
3. It might get stuck on local optima.
4. Euclidean distance measures can unequally weight underlying factors.


For Hierarchical Clustering, you can refer to this link for a detailed explanation. It's quite neat.

Thanks for reading. :)
