
DOC Clarify components_ attributes in PCA #20340


Merged: 5 commits merged into scikit-learn:main on Jul 23, 2021

Conversation

@nannau (Contributor) commented Jun 23, 2021

Reference Issues/PRs

None as far as I can tell.

What does this implement/fix? Explain your changes.

Greetings! Long time sklearn user, first contribution attempt on this project.

Discussions of PCA typically use discipline-dependent language that can be confusing. For users who would like to go beyond just using sklearn's PCA for dimensionality reduction, enhancing the documentation of the components_ attribute of a fitted PCA object would help clarify what it is in common linear-algebra terms. I've often found myself wondering what components_ actually is, and "principal axes", as written in the current documentation, is not entirely clear. This PR simply adds a line to sklearn/decomposition/_pca.py that describes what sklearn.decomposition.PCA.components_ actually is.

From this:

    components_ : ndarray of shape (n_components, n_features)
        Principal axes in feature space, representing the directions of
        maximum variance in the data. The components are sorted by
        ``explained_variance_``.

To this:

    components_ : ndarray of shape (n_components, n_features)
        Principal axes in feature space, representing the directions of
        maximum variance in the data. Equivalently, principal axes are
        the eigenvectors of the input data's covariance matrix. The components
        are sorted by ``explained_variance_``.

Any other comments?

I could be incorrect about this, and maybe that is why it isn't explicitly mentioned. If I am incorrect, though, then other people may share the same confusion and end up using components_ as the eigenvectors. If my reasoning is correct, simply using the word "eigenvectors" somewhere in the documentation here (as is already done for explained_variance_ and eigenvalues) could help users.

In this line of the main _pca.py file, if A is our (centered) input data, the eigenvectors of AᵀA make up the columns of V, which is the variable used for components_ (see this link). AᵀA itself is, up to the 1/(n - 1) factor, the variance-covariance matrix, so its eigendecomposition yields the same eigenvectors as the covariance matrix. In practice, these vectors can differ by sign (not magnitude), and it's not entirely clear to me why. Intuitively, the only reason for this seems to be the convention chosen for the direction of each component (I can't point to any math to back this up, but it might have something to do with this line and the minutiae of the solver being used).
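
A quick sanity check of that relationship, sketched on the same iris data that the example below uses: for centered A, AᵀA / (n - 1) is exactly the matrix that np.cov builds.

import numpy as np
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler

# Centered (and scaled) iris data, as in the example below
A = StandardScaler().fit_transform(load_iris()["data"])
n_samples = A.shape[0]

# For centered A, A^T A / (n - 1) is the sample covariance matrix
print(np.allclose(A.T @ A / (n_samples - 1), np.cov(A, rowvar=False)))  # expected: True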

A quick test on the iris toy dataset below checks that the eigenvectors of the covariance matrix and pca.components_ are identical in magnitude.

import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
from sklearn.datasets import load_iris

x = load_iris()["data"]

# Center (and scale) the data
x = StandardScaler().fit_transform(x)

assert np.allclose(x.mean(axis=0), 0.0)

# Eigendecomposition
###################
# Compute the covariance matrix
x_cov = np.cov(x, rowvar=False)

# Find eigenvalues and eigenvectors of the covariance matrix
eig_values, eig_vectors = np.linalg.eig(x_cov)

# Sort by decreasing eigenvalue
idx = np.argsort(eig_values)[::-1]

eig_values_sorted = eig_values[idx]
eig_vectors_sorted = eig_vectors[:, idx]

# Equivalent to pca.transform
scores = np.dot(x, eig_vectors_sorted)

# Now use the sklearn API (SVD)
###################
pca = PCA()
# Project the (already centered) data onto the principal axes
sklearn_scores = pca.fit_transform(x)

# Compare eigenvalues
print(np.allclose(eig_values_sorted, pca.explained_variance_))  # True

# Compare magnitudes of the eigenvectors to components_
print(np.allclose(np.abs(eig_vectors_sorted), np.abs(pca.components_.T)))  # True

# Compare magnitudes of the projections onto the eigenvectors
print(np.allclose(np.abs(sklearn_scores), np.abs(scores)))  # True
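
As a further sketch along the same lines (using the variables defined above), the two sets of vectors agree exactly, not just in magnitude, once each eigenvector is flipped to match the sign of the corresponding sklearn component:

# Flip each eigenvector so that it points in the same direction as the
# corresponding sklearn component, then compare exactly (not just by |.|)
signs = np.sign(np.sum(eig_vectors_sorted * pca.components_.T, axis=0))
print(np.allclose(eig_vectors_sorted * signs, pca.components_.T))  # expected: True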

I'd be very interested in hearing what the maintainers think of this change, or if there are other ways we could communicate this particular attribute.

Thanks for your time!

@NicolasHug (Member) left a comment

Hi @nannau , thanks for the PR and for your suggestion

Equivalently, principal axes are the eigenvectors of the input data's covariance matrix

Your addition above is only true if the data is centered; it's not true in general (that's why we center the data internally before computing the SVD). Edit: never mind, the covariance matrix is computed on centered data, so that's true.
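
A minimal sketch of why the centering matters here, assuming the raw (uncentered) iris data: the right singular vectors of the uncentered matrix are generally not the covariance eigenvectors.

import numpy as np
from sklearn.datasets import load_iris

X = load_iris()["data"]                       # raw, uncentered data
evals, evecs = np.linalg.eigh(np.cov(X, rowvar=False))
evecs = evecs[:, np.argsort(evals)[::-1]]     # sort by decreasing eigenvalue

_, _, Vt_raw = np.linalg.svd(X, full_matrices=False)                       # no centering
_, _, Vt_centered = np.linalg.svd(X - X.mean(axis=0), full_matrices=False)

print(np.allclose(np.abs(Vt_raw), np.abs(evecs.T)))       # expected: False
print(np.allclose(np.abs(Vt_centered), np.abs(evecs.T)))  # expected: True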

Also on a more general note: there are about 5 billion ways to explain and understand PCA. You can see it from a purely linear algebra perspective, you can define it as a minimization problem, you can look at it from a probabilistic perspective... It's quite fascinating. But it also means that there's not going to be one way for us to properly explain it so that it clicks for all users.

@nannau (Contributor, Author) commented Jun 24, 2021

Hi @NicolasHug - thanks for your response. Yes, I agree with most of your above comment, and yes it would only be true if centered.

I suppose this line:

there are about 5 billion ways to explain and understand PCA.

is what motivated this change in the first place.

Since sklearn is already using eigenvalue as a descriptor of explained_variance_, isn't this already consistent with a linear-algebra interpretation?

    explained_variance_ : ndarray of shape (n_components,)
        The amount of variance explained by each of the selected components.
        The variance estimation uses `n_samples - 1` degrees of freedom.
        Equal to n_components largest eigenvalues
        of the covariance matrix of X.

I don't mean to imply that one interpretation is superior to another; however, the documentation should reflect the linear algebra the module is actually doing, especially given an inline comment that refers to both sets of singular vectors as eigenvectors.

In keeping more closely with what's used in explained_variance_, perhaps the following is better to describe components_?

    components_ : ndarray of shape (n_components, n_features)
        Principal axes in feature space, representing the directions of
        maximum variance in the data. Equivalently, principal axes are
        the eigenvectors of the covariance matrix of X. The components
        are sorted by ``explained_variance_``.

Thanks again for the reply!

@nannau (Contributor, Author) commented Jul 15, 2021

Hi @NicolasHug - just wondering if you saw my previous comment. It would be great to hear your thoughts if you have the time!

@NicolasHug (Member) left a comment

Fair points @nannau :)

LGTM, pinging @ogrisel @glemaitre for a quick second round maybe?

@glemaitre (Member)

If we go in this direction, I am wondering if we should first mention that the principal components are the singular vectors of the centered input data (because we use an SVD in the end), which are parallel to the eigenvectors of the input data's covariance matrix.
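
A minimal sketch of "parallel" in this sense, assuming the iris data: each right singular vector of the centered input has cosine ±1 with the corresponding covariance eigenvector.

import numpy as np
from sklearn.datasets import load_iris

X = load_iris()["data"]
Xc = X - X.mean(axis=0)                                  # centered input data

_, _, Vt = np.linalg.svd(Xc, full_matrices=False)        # right singular vectors (rows)
evals, evecs = np.linalg.eigh(np.cov(X, rowvar=False))
evecs = evecs[:, np.argsort(evals)[::-1]]                # covariance eigenvectors (columns)

# Cosine between each singular vector and the matching eigenvector: +/-1 (parallel)
cosines = np.sum(Vt * evecs.T, axis=1)
print(np.round(cosines, 6))   # expected: entries of +/-1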

@nannau (Contributor, Author) commented Jul 15, 2021

Great, thanks @NicolasHug! @glemaitre, yes I agree. :) Ultimately, it would just be nice to not have to dig around the source code to figure out what's being done. To that end, how does this sound?

    components_ : ndarray of shape (n_components, n_features)
        Principal axes in feature space, representing the directions of
        maximum variance in the data. Equivalently, the right singular 
        vectors of the centered input data, parallel to its eigenvectors.
        The components are sorted by ``explained_variance_``.

I think the note about being parallel to the eigenvectors would be helpful for anyone who does a similar analysis to the one above, i.e. finding that the sklearn PCA components match the eigenvectors in magnitude but not necessarily in sign.

Can we identify the other modules with a components_ attribute that I can/should add this to as well? I'm afraid I'm less familiar with those modules.

@glemaitre (Member)

I can think of:

  • IncrementalPCA
  • TruncatedSVD

Otherwise, I am not sure there are other places where it makes sense to change the documentation.

@glemaitre changed the title from "Clarifies what sklearn.decomposition.PCA.components_ actually is" to "DOC Clarify components_ attributes in PCA" on Jul 22, 2021
nannau and others added 4 commits July 22, 2021 15:10
TruncatedSVD does not center the data by default, and according to the example, the explained_variance_ratio is not necessarily sorted by decreasing explained variance. The description of components_ is therefore simplified to just be the right singular vectors of the input data.
@nannau (Contributor, Author) commented Jul 22, 2021

I've added the identical definition of components_ above to:

  • PCA
  • IncrementalPCA

TruncatedSVD is missing a description for components_. According to the existing documentation, the input data is not centered by default, and according to the example, the explained_variance_ratio is not necessarily sorted by decreasing explained variance. I took the liberty of just adding that these are the right singular vectors of the input data.
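
A rough check of that wording, sketched on the iris data (using the arpack solver so the comparison is exact up to sign):

import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import TruncatedSVD

X = load_iris()["data"]                                   # not centered

svd = TruncatedSVD(n_components=2, algorithm="arpack")
svd.fit(X)

# Right singular vectors of the *uncentered* input data
_, _, Vt = np.linalg.svd(X, full_matrices=False)

print(np.allclose(np.abs(svd.components_), np.abs(Vt[:2])))  # expected: True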

@glemaitre merged commit ec758d8 into scikit-learn:main on Jul 23, 2021
@glemaitre (Member)

Thanks @nannau. Merging then.

@nannau deleted the documentation/pca-eigenvector branch on Jul 23, 2021
@nannau (Contributor, Author) commented Jul 23, 2021

Thanks for the help, @glemaitre and @NicolasHug!

TomDLT pushed a commit to TomDLT/scikit-learn that referenced this pull request Jul 29, 2021
samronsin pushed a commit to samronsin/scikit-learn that referenced this pull request Nov 30, 2021