
DOC Clarify components_ attributes in PCA #20340


Merged: 5 commits merged into scikit-learn:main on Jul 23, 2021

Conversation

@nannau (Contributor) commented Jun 23, 2021

Reference Issues/PRs

None as far as I can tell.

What does this implement/fix? Explain your changes.

Greetings! Long time sklearn user, first contribution attempt on this project.

Discussions of PCA typically use discipline-dependent language that can be confusing. For users who would like to go beyond just using sklearn's PCA for dimensionality reduction, enhancing the documentation of the components_ attribute of a fitted PCA object would help clarify what it is in common linear-algebra terms. I've often found myself wondering what components_ actually is, and "principal axes", as written in the current documentation, is not entirely clear. This PR simply adds a line to sklearn/decomposition/_pca.py that describes what sklearn.decomposition.PCA.components_ actually is.

From this:

    components_ : ndarray of shape (n_components, n_features)
        Principal axes in feature space, representing the directions of
        maximum variance in the data. The components are sorted by
        ``explained_variance_``.

To this:

    components_ : ndarray of shape (n_components, n_features)
        Principal axes in feature space, representing the directions of
        maximum variance in the data. Equivalently, principal axes are
        the eigenvectors of the input data's covariance matrix. The components
        are sorted by ``explained_variance_``.

Any other comments?

I could be incorrect about this, and maybe that is why it isn't explicitly mentioned. If I am incorrect, though, then other people may share the same confusion and end up using components_ as the eigenvectors. If my reasoning is correct, simply using the word "eigenvectors" somewhere in the documentation here (as is already done for explained_variance_ and eigenvalues) could help users.

In this line of the main _pca.py file, if A is our (centered) input data, the eigenvectors of AᵀA make up the columns of V, which is the variable used for components_ (see this link). AᵀA itself is, up to the 1/(n - 1) factor, the variance-covariance matrix, so its eigendecomposition yields the same eigenvectors as the covariance matrix. In practice, these vectors can differ by sign (not magnitude), and it's not entirely clear to me why. Intuitively, the only reason for this seems to be the convention chosen for the direction of each component (I can't point to any math to back this up, but it might have something to do with this line and the minutiae of the solver being used).
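
A quick sanity check of that relationship, sketched on the same iris data that the example below uses: for centered A, AᵀA / (n - 1) is exactly the matrix that np.cov builds.

import numpy as np
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler

# Centered (and scaled) iris data, as in the example below
A = StandardScaler().fit_transform(load_iris()["data"])
n_samples = A.shape[0]

# For centered A, A^T A / (n - 1) is the sample covariance matrix
print(np.allclose(A.T @ A / (n_samples - 1), np.cov(A, rowvar=False)))  # expected: True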

A quick test on the iris toy dataset below checks that the eigenvectors of the covariance matrix and pca.components_ are identical in magnitude.

import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
from sklearn.datasets import load_iris

x = load_iris()["data"]

# Center (and scale) the data
x = StandardScaler().fit_transform(x)

assert np.allclose(x.mean(axis=0), 0.0)

# Eigendecomposition
###################
# Compute the covariance matrix
x_cov = np.cov(x, rowvar=False)

# Find eigenvalues and eigenvectors of the covariance matrix
eig_values, eig_vectors = np.linalg.eig(x_cov)

# Sort by decreasing eigenvalue
idx = np.argsort(eig_values)[::-1]

eig_values_sorted = eig_values[idx]
eig_vectors_sorted = eig_vectors[:, idx]

# Equivalent to pca.transform
scores = np.dot(x, eig_vectors_sorted)

# Now use the sklearn API (SVD)
###################
pca = PCA()
# Project the (already centered) data onto the principal axes
sklearn_scores = pca.fit_transform(x)

# Compare eigenvalues
print(np.allclose(eig_values_sorted, pca.explained_variance_))  # True

# Compare magnitudes of the eigenvectors to components_
print(np.allclose(np.abs(eig_vectors_sorted), np.abs(pca.components_.T)))  # True

# Compare magnitudes of the projections onto the eigenvectors
print(np.allclose(np.abs(sklearn_scores), np.abs(scores)))  # True
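
As a further sketch along the same lines (using the variables defined above), the two sets of vectors agree exactly, not just in magnitude, once each eigenvector is flipped to match the sign of the corresponding sklearn component:

# Flip each eigenvector so that it points in the same direction as the
# corresponding sklearn component, then compare exactly (not just by |.|)
signs = np.sign(np.sum(eig_vectors_sorted * pca.components_.T, axis=0))
print(np.allclose(eig_vectors_sorted * signs, pca.components_.T))  # expected: True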

I'd be very interested in hearing what the maintainers think of this change, or if there are other ways we could communicate this particular attribute.

Thanks for your time!

@NicolasHug (Member) left a comment

Hi @nannau , thanks for the PR and for your suggestion

Equivalently, principal axes are the eigenvectors of the input data's covariance matrix

Your addition above is only true if the data is centered; it's not true in general (that's why we center the data internally before computing the SVD). Edit: never mind, the covariance matrix is computed on centered data, so that's true.
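
A minimal sketch of why the centering matters here, assuming the raw (uncentered) iris data: the right singular vectors of the uncentered matrix are generally not the covariance eigenvectors.

import numpy as np
from sklearn.datasets import load_iris

X = load_iris()["data"]                       # raw, uncentered data
evals, evecs = np.linalg.eigh(np.cov(X, rowvar=False))
evecs = evecs[:, np.argsort(evals)[::-1]]     # sort by decreasing eigenvalue

_, _, Vt_raw = np.linalg.svd(X, full_matrices=False)                       # no centering
_, _, Vt_centered = np.linalg.svd(X - X.mean(axis=0), full_matrices=False)

print(np.allclose(np.abs(Vt_raw), np.abs(evecs.T)))       # expected: False
print(np.allclose(np.abs(Vt_centered), np.abs(evecs.T)))  # expected: True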

Also on a more general note: there are about 5 billion ways to explain and understand PCA. You can see it from a purely linear algebra perspective, you can define it as a minimization problem, you can look at it from a probabilistic perspective... It's quite fascinating. But it also means that there's not going to be one way for us to properly explain it so that it clicks for all users.

@nannau (Contributor, Author) commented Jun 24, 2021

Hi @NicolasHug - thanks for your response. Yes, I agree with most of your above comment, and yes it would only be true if centered.

I suppose this line:

there are about 5 billion ways to explain and understand PCA.

is what motivated this change in the first place.

Since sklearn is already using eigenvalue as a descriptor of explained_variance_, isn't this already consistent with a linear-algebra interpretation?

    explained_variance_ : ndarray of shape (n_components,)
        The amount of variance explained by each of the selected components.
        The variance estimation uses `n_samples - 1` degrees of freedom.
        Equal to n_components largest eigenvalues
        of the covariance matrix of X.

I don't mean to imply that one interpretation is superior to another; however, the documentation should reflect the linear algebra the module is actually doing, especially given an inline comment that refers to both sets of singular vectors as eigenvectors.

In keeping more closely with what's used in explained_variance_, perhaps the following is better to describe components_?

    components_ : ndarray of shape (n_components, n_features)
        Principal axes in feature space, representing the directions of
        maximum variance in the data. Equivalently, principal axes are
        the eigenvectors of the covariance matrix of X. The components
        are sorted by ``explained_variance_``.

Thanks again for the reply!

@nannau (Contributor, Author) commented Jul 15, 2021

Hi @NicolasHug - just wondering if you saw my previous comment. It would be great to hear your thoughts if you have the time!

@NicolasHug (Member) left a comment

Fair points @nannau :)

LGTM, pinging @ogrisel @glemaitre for a quick second round maybe?

@glemaitre (Member)

If we go in this direction, I am wondering if we should first mention that the principal components are the singular vectors of the centered input data (because we use an SVD in the end), which are parallel to the eigenvectors of the input data's covariance matrix.
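
A minimal sketch of "parallel" in this sense, assuming the iris data: each right singular vector of the centered input has cosine ±1 with the corresponding covariance eigenvector.

import numpy as np
from sklearn.datasets import load_iris

X = load_iris()["data"]
Xc = X - X.mean(axis=0)                                  # centered input data

_, _, Vt = np.linalg.svd(Xc, full_matrices=False)        # right singular vectors (rows)
evals, evecs = np.linalg.eigh(np.cov(X, rowvar=False))
evecs = evecs[:, np.argsort(evals)[::-1]]                # covariance eigenvectors (columns)

# Cosine between each singular vector and the matching eigenvector: +/-1 (parallel)
cosines = np.sum(Vt * evecs.T, axis=1)
print(np.round(cosines, 6))   # expected: entries of +/-1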

@nannau (Contributor, Author) commented Jul 15, 2021

Great, thanks @NicolasHug! @glemaitre, yes I agree. :) Ultimately, it would just be nice to not have to dig around the source code to figure out what's being done. To that end, how does this sound?

    components_ : ndarray of shape (n_components, n_features)
        Principal axes in feature space, representing the directions of
        maximum variance in the data. Equivalently, the right singular 
        vectors of the centered input data, parallel to its eigenvectors.
        The components are sorted by ``explained_variance_``.

I think the note about being parallel to the eigenvectors would be helpful for anyone who does a similar analysis to the one above, i.e. finding that the sklearn PCA components match the eigenvectors in magnitude but not necessarily in sign.

Can we identify the other modules with a components_ attribute that I can/should add this to as well? I'm afraid I'm less familiar with those modules.

@glemaitre (Member)

I can think of:

  • IncrementalPCA
  • TruncatedSVD

Otherwise, I am not sure there are other places where it makes sense to change the documentation.

@glemaitre changed the title from "Clarifies what sklearn.decomposition.PCA.components_ actually is" to "DOC Clarify components_ attributes in PCA" on Jul 22, 2021
nannau and others added 4 commits July 22, 2021 15:10
TruncatedSVD does not center the data by default, and according to the example, the explained_variance_ratio is not necessarily sorted by decreasing explained variance. The description of components_ is therefore simplified to just be the right singular vectors of the input data.
@nannau (Contributor, Author) commented Jul 22, 2021

I've added the identical definition of components_ above to:

  • PCA
  • IncrementalPCA

TruncatedSVD is missing a description for components_. According to the existing documentation, the input data is not centered by default, and according to the example, the explained_variance_ratio is not necessarily sorted by decreasing explained variance. I took the liberty of just adding that these are the right singular vectors of the input data.
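
A rough check of that wording, sketched on the iris data (using the arpack solver so the comparison is exact up to sign):

import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import TruncatedSVD

X = load_iris()["data"]                                   # not centered

svd = TruncatedSVD(n_components=2, algorithm="arpack")
svd.fit(X)

# Right singular vectors of the *uncentered* input data
_, _, Vt = np.linalg.svd(X, full_matrices=False)

print(np.allclose(np.abs(svd.components_), np.abs(Vt[:2])))  # expected: True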

@glemaitre merged commit ec758d8 into scikit-learn:main on Jul 23, 2021
@glemaitre (Member)

Thanks @nannau. Merging then.

@nannau deleted the documentation/pca-eigenvector branch on Jul 23, 2021
@nannau (Contributor, Author) commented Jul 23, 2021

Thanks for the help, @glemaitre and @NicolasHug!

TomDLT pushed a commit to TomDLT/scikit-learn that referenced this pull request Jul 29, 2021
samronsin pushed a commit to samronsin/scikit-learn that referenced this pull request Nov 30, 2021