This could mean a lot of things in different contexts, but in the context of machine learning it is a simple yet important concept. I am going to describe it with an example:
- Assume that you have plotted a square in the plane with side length n, and then you partition each side of this square into n mutually exclusive line segments of length 1. By doing this you create a checkerboard of n^2 unit squares.
- Then scatter N uniformly generated points described by coordinates (x, y) in the big square.
- The density of the data points in this space (big square) is N / n^2.
- We have on average N / n^2 points in each unit square.
- Repeat the plotting again but this time set d = 3 (i.e. 3-dimensional space) and partition the cube with side length n into n^3 unit cubes.
- Then we scatter N uniformly generated points with coordinates (x, y, z) in the big cube.
- This time the density of the data points in this space (big cube) is N / n^3.
- We have on average N / n^3 points in each unit cube.
- As you can see, as d gets bigger and bigger with a limited number of existing data points N, many of these squares and cubes (hyper-cubes in higher dimensions) will be empty (pay attention to the exponential growth of the number of hyper-cubes, i.e. n^d), whereas the number of data points does not grow exponentially (as shown in the figure below). This results in a sparse representation of the data points (on average N / n^d data points for each hyper-cube!) and can cripple many algorithms, because these algorithms usually need a sufficient number of data points in a space with dimension d; but, as you will experience in the real world, most of the time the set of data points is not big enough!
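The collapse in density described above can be sketched in a few lines (the values n = 10 and N = 10,000 are arbitrary choices for illustration):

```python
# Illustrate how the average number of points per unit hyper-cube
# shrinks as the dimension d grows, for a fixed data set size N.
n = 10       # side length of the big hyper-cube (arbitrary choice)
N = 10_000   # fixed number of data points (arbitrary choice)

for d in range(1, 7):
    num_cells = n ** d       # the number of unit hyper-cubes grows exponentially
    density = N / num_cells  # average points per unit hyper-cube
    print(f"d = {d}: {num_cells:>9,} cells, {density} points per cell")
```

Already at d = 6 there are a million cells sharing only ten thousand points, so almost every cell is empty.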
Many methods have been introduced through the years for dimensionality reduction. Among them, we can consider PCA (Principal Component Analysis) and its derivatives the most common ones. In the simple version of PCA, we calculate the covariance matrix C of the data points (which is symmetric and positive semi-definite, so its eigenvalues are non-negative) and then decompose it as C = V Λ V^T, where
Λ is a diagonal matrix of the sorted eigenvalues of C and V is the matrix of corresponding eigenvectors.
After finding the eigenvectors (which are mutually perpendicular), it is an easy task to transform the old data to the new vector space, knowing that a possibly huge amount of the information (one could say variation, or variance) of the original data is preserved in just a few components of the transformed points.
Here the real numbers representing the eigenvalues are associated with the amount of information that the corresponding dimensions contain, so after the calculations we take the k dimensions with the biggest eigenvalues out of all d dimensions and dump the rest!
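The decomposition described above can be sketched directly with NumPy (the toy data and variable names here are made up for illustration):

```python
import numpy as np

# Toy data: 6 points in 2-D (made-up values for illustration)
X = np.array([[-1.0, -1.0], [-2.0, -1.0], [-3.0, -2.0],
              [ 1.0,  1.0], [ 2.0,  1.0], [ 3.0,  2.0]])

# Center the data, then compute the covariance matrix C
Xc = X - X.mean(axis=0)
C = np.cov(Xc, rowvar=False)

# Eigendecomposition: C = V @ diag(eigvals) @ V.T.
# np.linalg.eigh is meant for symmetric matrices and returns eigenvalues
# in ascending order, so we reorder them to descending.
eigvals, V = np.linalg.eigh(C)
order = np.argsort(eigvals)[::-1]
eigvals, V = eigvals[order], V[:, order]

# Keep only the k = 1 direction with the biggest eigenvalue
k = 1
X_reduced = Xc @ V[:, :k]  # project each point onto the top component

print(eigvals)     # non-negative, sorted in descending order
print(X_reduced)   # the 6 points, now described by a single number each
```

Note that the eigenvalues come out non-negative, exactly as the positive semi-definiteness of the covariance matrix promises.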
Note: From a representational point of view, by applying PCA we reduce the dimensions, but this does not mean that we dumped the irrelevant features and preserved the relevant ones. The main benefits of applying PCA are
- to reduce the computational cost, subject to the condition that most of the information is preserved,
- and to alleviate the problem of not having enough data points in a relatively big space.
Despite the benefits of PCA, we should not forget that the chosen set of dimensions cannot guarantee better performance from a predictor's point of view.
```python
import numpy as np
from sklearn.decomposition import PCA

X = np.array([[-1, -1], [-2, -1], [-3, -2], [1, 1], [2, 1], [3, 2]])
pca = PCA(n_components=2)
pca.fit(X)
print(pca.explained_variance_ratio_)
```
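To actually perform the reduction rather than just inspect the variance ratios, the same data can be projected onto a single component (the choice n_components=1 is arbitrary, for illustration):

```python
import numpy as np
from sklearn.decomposition import PCA

X = np.array([[-1, -1], [-2, -1], [-3, -2], [1, 1], [2, 1], [3, 2]])

# Keep only the first principal component (k = 1)
pca = PCA(n_components=1)
X_reduced = pca.fit_transform(X)  # shape (6, 1): one coordinate per point

print(X_reduced.shape)
print(pca.explained_variance_ratio_)  # fraction of the variance kept
```

For this toy data the first component alone captures almost all of the variance, which is exactly the situation where dumping the remaining dimension costs very little.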