Dimensionality Reduction, PCA Intro
As a data scientist, you always meet a situation, when you have a lot of data with a lot of features. You don't even know you are going to use them all or not. Is this data structured enough and which features are valuable for my problem.
So in general, you get data set from one engineering team, after that, you add another data coming from other team and so on. As a result, you've got a dataset with a lot of features, which also may not be useful, cause you get it from engineering them and they can give you everything.
Trying to fit your model with the given dataset might be slow, can take a lot of memory even can be harmful and can affect your learning process. So you need to get rid of unuseful and invaluable features from your data, in other words, you want to reduce the dimensionality of your data.
Let's take a look a the example. Let's assume you've got a dataset with a lot of features
You can realize which columns to take and how valuable they are. You need to get the most valuable columns and remove other fields.
Let's see what we can do about that.
Now we are going to take a look simple examples to understand how we can reduce the dimension.
We've got 2D points on the plane and we want to get 1D points.
If we draw a line between these points and project all points on that line, we will get a good approximation of the points, but only on the line we've found. It means we are going to have new points on the line, which will be just one dimension.
We are going to have something like this
Taking all projections (locations) on the green line we will have new points from 1D
With the same approach, we can approximate points from 3D to 2D, by finding the best plane for these 3D points. All we need is to find that plane, project all points on it, summarize them and take the average. The best approximation will be the plane, which has the minimum mean projection error.
It looks good, but how are going to make it?
PCA (Principal Components Analysis)
Now let's formulate this approach a let's take a look at how PCA works and how we are going to find these planes, components.
Trying reducing from n-dimension to k-dimension, means to find k vectors to the proper direction, on which mean projection error is small. Before PCA algorithm introduction, we need to make some data preprocessing
Let's assume we've got a training dataset with m examples
First, we normalize this example, so they will have 0 means. Also, we will calculate the variance of the points.
Next, we will need to calculate the "covariance" matrix of that input matrix. In other words, we will compute the covariance matrix for each example and after that, we will take the mean of them.
Now we need to find the vectors, which will represent new axis or dimensions for our give dataset. To find them, we will use SVD (Singular Value Decomposition), which is a known approach in linear algebra and is implemented in many languages and libraries. You can read more about SVD here. By using this, we will get 3 new matrices
Matrix U contains all necessary vectors we are looking for. If our X matrix is p by p, then U will also be p by p. These vectors are the representative vectors for our new dimensionality. By choosing the first k columns, we will get the vectors we need. Having a new matrix for converting our X points, we will do just dot product between these 2 matrices.
Matrix Z will be our new points with k dimension. U reduce will be the matrix, which contains all necessary vectors. This is how PCA works.
The Intuition of PCA (Principal Components Analysis)
Now let's understand the intuition of the algorithm and what it actually tries to do. We've got our dataset X which has n columns and m rows. Let's take every column and define as new X
When we try to find a new dimension, we basically trying to find some parameters, which is mean normalized. For every feature in our dataset has some correlations with others and for every new dimension vector, we want to find something else, which is not correlated with previous ones. What it means.
It means the math expectation of Z on the first axis will be almost the same as the mathematical expectation of X. By collecting these Z vectors, we will get a matrix and each column will represent a new dimension. As we've already said X matrix and u vectors are mean normalized, we want to maximize the variance of the new vector.
In other words, we want to maximize the variance for every feature of the Z matrix. Each dimension or principal component must be not correlated with others. Let's take a look at these points.
It has 2 components, green and blue line. You can see that points for the green line have high variance and high correlation, but the blue line has not a high correlation with the green line and it represents another component. These 2 lines are the principal components of our points and we can choose one of them, to reduce the dimensionality.
How to choose the number of Components in PCA
Talking about the PCA algorithm, we realize we will have to choose the number of components, which is on us and we don't know which value of k will be the best choice. First lest define a metric, which will represent how much we have changed the distribution of the points.
Where X approx is the approximation of the points by using Z vectors. In other words, when we moved the points on the line, by projecting them, which are the new coordinates of that new, moving points.
This quantity tells us, how much did we changed or moved our points? what is the distance between our real and approximated points? That is the projection distance. We will take the mean projection error on all examples and will divide on average vectors size of our dataset. If we changed them just a little bit, then it's good, we have changed it just 1%, the other 99% remains the same.
Then let's describe how we can choose the value of k. We can do it iteratively, by running the PCA algorithm every time with different k values. First, we can set k = 1
- Compute covariance matrix
- Find principal components
- Choose the first k vectors
- Compute the moving quantity
- If it's less than 0.01, it's ok, if not k += 1 and go to 1
Until we get less than 0.01, but this does not look good in terms of implementation. There is another way of choosing the number of components in PCA.
We've talked about SVD (Singular value decomposition) and said it returns 3 matrices. The second matrix S is a diagonal matrix and in many libraries, it returns just a vector, the element of diagonal. If the following condition is true, then your algorithm is finished.
By using this formula, you can find the value of k, where this condition is true.
PCA in Python
We described an approach of dimensionality reduction and talked about an algorithm called Principal Components Analysis for making it real. Now let see how we can do it in python. If you use Numpy, there are a couple of actions you will have to know.
import numpy as np X = load_data() # m*n matrix Covariance_X = X.dot(X.T) # computing covariance matrix [U, S, V] = np.linalg.svd(Cov_X) # U is the matrix with principal components # S is diagonal Matrix for choosing best value for k U_reduce = U[:, k] # will be the matrix for k components
If you are not using numpy, or your dataset is large enough and you want to have better API, then you will have to use sklearn package. It has good implementations of algorithms and the functional API is better.
from sklearn.decomposition import PCA k = 2 X = load_data() pca = PCA(n_components=k) principal_components = pca.fit_transform(X)
PCA use cases
By knowing about the PCA and understanding how useful it can be, some machine learning engineers misunderstand PCA use cases and always try to use them.
Bad Use PCA (Principal Components Analysis) To Prevent Overfitting.
PCA is not made for solving overfitting issue, there is a regularization approach, which is made just for it. Also by applying PCA on your dataset, you lose some of the information from it, but trying to keep the variance and it will not help you to solve overfitting issue
Other Bad Use of PCA
When you try to learn a model for your data, don't start from applying PCA on it. That is not a good experience. Start to fit your model on the dataset you've got, then try to understand issues. Maybe your X contains valuable features and you don't have to get rid of them. After some experiments, if you have to make dimensionality reduction, then use PCA.
Good Use of PCA Is To Reduce The Memory Usage
One of the main reasons, why we need to use PCA. If your data is too large in terms of the features and you've got problems to fit them on the memory, then using PCA is a good idea.
Another Good Usage of PCA Is Increasing The Spead of Training
Second reason when you need to use dimensionality reduction. If your learning process takes too much time, then maybe using PCA will make your process faster, cause computing gradients will be faster, every step of the gradient descent algorithm will be faster as well.