Unsupervised Machine Learning

Unsupervised Machine Learning

Unsupervised machine learning algorithms infer patterns from a dataset without reference to known or labeled, outcomes. Unlike supervised machine learning, unsupervised machine learning methods cannot be directly applied to regression or a classification problem because you have no idea what the values for the output data might be, making it impossible for you to train the algorithm the way you normally would. Unsupervised learning can instead be used to discover the underlying structure of the data.


Why?

It purports to uncover previously unknown patterns in data, but most of the time these patterns are poor approximations of what supervised machine learning can achieve. Additionally, since you do not know what the outcomes should be, there is no way to determine how accurate they are, making supervised machine learning more applicable to real-world problems.


The best time to use unsupervised machine learning is when you do not have data on desired outcomes, such as determining a target market for an entirely new product that your business has never sold before. However, if you are trying to get a better understanding of your existing consumer base, supervised learning is the optimal technique. It is a machine learning technique in which the users do not need to supervise the model. Instead, it allows the model to work on its own to discover patterns and information that was previously undetected. It mainly deals with the unlabelled data.


Unsupervised Learning Algorithms

It allows users to perform more complex processing tasks compared to supervised learning. Although, unsupervised learning can be more unpredictable compared with other natural learning methods. Unsupervised learning algorithms include clustering, anomaly detection, neural networks, etc.


Why?

Here, are prime reasons for using Unsupervised Learning:

It finds all kinds of unknown patterns in data.

What methods help you to find features that can be useful for categorization.

It is taken place in real-time, so all the input data to be analyzed and labeled in the presence of learners.

It is easier to get unlabeled data from a computer than labeled data, which needs manual intervention.


Types of Unsupervised Learning

That problem further grouped into clustering and association problems.


Clustering

It is an important concept when it comes to unsupervised learning. It mainly deals with finding a structure or pattern in a collection of uncategorized data. Clustering algorithms will process your data and find natural clusters(groups) if they exist in the data. You can also modify how many clusters your algorithms should identify. It allows you to adjust the granularity of these groups.


There are different types of clustering you can utilize:


Exclusive (partitioning)

In this clustering method, Data are grouped in such a way that one data can belong to one cluster only.


Example: K-means


Agglomerative

In this clustering technique, every data is a cluster. The iterative unions between the two nearest clusters reduce the number of clusters.


Example: Hierarchical clustering


Overlapping

In this technique, fuzzy sets are used to cluster data. Each point may belong to two or more clusters with separate degrees of membership.


Here, data will be associated with appropriate membership value. Example: Fuzzy C-Means


Probabilistic

This technique uses probability distribution to create the clusters


Example: Following keywords


"man's shoe."

"women's shoe."

"women's glove."

"man's glove."

can be clustered into two categories "shoe" and "glove" or "man" and "women."


Clustering Types

Hierarchical clustering

K-means clustering

K-NN (k nearest neighbors)

Principal Component Analysis

Singular Value Decomposition

Independent Component Analysis

Hierarchical Clustering:

Hierarchical clustering is an algorithm that builds a hierarchy of clusters. It begins with all the data which is assigned to a cluster of their own. Here, two close clusters are going to be in the same cluster. This algorithm ends when there is only one cluster left.


K-means Clustering

K means it is an iterative clustering algorithm that helps you to find the highest value for every iteration. Initially, the desired number of clusters is selected. In this clustering method, you need to cluster the data points into k groups. A larger k means smaller groups with more granularity in the same way. A lower k means larger groups with less granularity.


The output of the algorithm is a group of "labels." It assigns data points to one of the k groups. In k-means clustering, each group is defined by creating a centroid for each group. The centroids are like the heart of the cluster, which captures the points closest to them and adds them to the cluster.


K-mean clustering further defines two subgroups:


Agglomerative clustering

Dendrogram


Agglomerative clustering:

This type of K-means clustering starts with a fixed number of clusters. It allocates all data into the exact number of clusters. This clustering method does not require the number of clusters K as an input. The agglomeration process starts by forming each data as a single cluster.


This method uses some distance measure, reduces the number of clusters (one in each iteration) by merging process. Lastly, we have one big cluster that contains all the objects.


Dendrogram:

In the Dendrogram clustering method, each level will represent a possible cluster. The height of the dendrogram shows the level of similarity between two join clusters. The closer to the bottom of the process they are more similar clusters which are finding of the group from dendrogram which is not natural and mostly subjective.


K- Nearest neighbors

It is the simplest of all machine learning classifiers. It differs from other machine learning techniques, in that it doesn't produce a model. It is a simple algorithm that stores all available cases and classifies new instances based on a similarity measure.


It works very well when there is a distance between examples. The learning speed is slow when the training set is large, and the distance calculation is nontrivial.


Principal Components Analysis:

In case you want a higher-dimensional space. You need to select a basis for that space and only the 200 most important scores of that basis. This base is known as a principal component. The subset you select constitutes is a new space that is small in size compared to original space. It maintains as much of the complexity of data as possible.


Association

Association rules allow you to establish associations amongst data objects inside large databases. This unsupervised technique is about discovering interesting relationships between variables in large databases. For example, people that buy a new home most likely to buy new furniture.


Other Examples:


A subgroup of cancer patients grouped by their gene expression measurements

Groups of shopper based on their browsing and purchasing histories

Movie group by the rating given by movies viewers


Applications of unsupervised machine learning

Some applications of unsupervised machine learning techniques are:


Clustering automatically split the dataset into groups base on their similarities

Anomaly detection can discover unusual data points in your dataset. It is useful for finding fraudulent transactions

Association mining identifies sets of items which often occur together in your dataset

Latent variable models are widely used for data preprocessing. Like reducing the number of features in a dataset or decomposing the dataset into multiple components

Disadvantages of Unsupervised Learning

You cannot get precise information regarding data sorting, and the output as data used in unsupervised learning is labeled and not known

Less accuracy of the results is because the input data is not known and not labeled by people in advance. This means that the machine requires to do this itself.

The spectral classes do not always correspond to informational classes.

The user needs to spend time interpreting and label the classes which follow that classification.

Spectral properties of classes can also change over time so you can't have the same class information while moving from one image to another.

Summary

It is a machine learning technique, where you do not need to supervise the model.

It helps you to finds all kinds of unknown patterns in data.

Clustering and Association are two types of Unsupervised learning.

Four types of clustering methods are

 1) Exclusive 

2) Agglomerative 

3) Overlapping 

4) Probabilistic.

Important clustering types are: 

1)Hierarchical clustering 

2) K-means clustering 

3) K-NN 

4) Principal Component Analysis 

5) Singular Value Decomposition 

6) Independent Component Analysis.

Association rules allow you to establish associations amongst data objects inside large databases.

In Supervised learning, Algorithms are trained using labeled data while in Unsupervised Learning Algorithms are used against data that is not labeled.

Anomaly detection can discover important data points in your dataset which is useful for finding fraudulent transactions.

The biggest drawback of Unsupervised learning is that you cannot get precise information regarding data sorting.

Comments