DOC Clarify components_ attributes in PCA #20340
Conversation
Hi @nannau, thanks for the PR and for your suggestion:
Equivalently, principal axes are the eigenvectors of the input data's covariance matrix
Your addition above is only true if the data is centered. It's not true in general (that's why we center the data internally before computing the SVD). Never mind: the covariance matrix is computed on centered data, so that's true.
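For reference, the relation is immediate once the data is centered: if $X$ is the centered $n \times p$ data matrix with SVD $X = U S V^\top$, then

$$
\frac{X^\top X}{n-1} \;=\; V \, \frac{S^2}{n-1} \, V^\top ,
$$

so the right singular vectors (the rows of components_) are eigenvectors of the sample covariance matrix, and the eigenvalues $S^2/(n-1)$ are what explained_variance_ stores.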
Also on a more general note: there are about 5 billion ways to explain and understand PCA. You can see it from a purely linear algebra perspective, you can define it as a minimization problem, you can look at it from a probabilistic perspective... It's quite fascinating. But it also means that there's not going to be one way for us to properly explain it so that it clicks for all users.
Hi @NicolasHug - thanks for your response. Yes, I agree with most of your above comment, and yes it would only be true if centered. I suppose this line:
Is what motivated this change in the first place. Since sklearn is already using eigenvalues as a descriptor of explained_variance_, isn't this already consistent with a linear-algebra interpretation?
I don't mean to imply that one interpretation is superior to another; however, the documentation should reflect the linear algebra the module is actually doing, especially with an inline comment that refers to both of the singular vectors as eigenvectors, and in keeping more closely with what's used for explained_variance_.
Thanks again for the reply!
Hi @NicolasHug - just wondering if you saw my previous comment. It would be great to hear your thoughts if you have the time!
Fair points @nannau :)
LGTM, pinging @ogrisel @glemaitre for a quick second round maybe?
If we go in this direction, I am wondering if we should first mention that the principal components are the singular vectors of the centered input data (because we are using some SVD at the end) that are parallel to the eigenvectors of the input data's covariance matrix.
Great, thanks @NicolasHug! @glemaitre, yes I agree. :) Ultimately, it would just be nice to not have to dig around the source code to figure out what's being done. To that end, how does this sound?
I think the note about being parallel to the eigenvectors would be helpful for anyone who does a similar analysis to what I've done above, i.e. finding that sklearn PCA is equivalent to the eigenvectors in magnitude but not in sign. Can we identify the other modules that have a components_ attribute?
I can think of:
Otherwise, I am not sure that there are other places where it makes sense to change the documentation.
TruncatedSVD does not center the data by default, and according to the example, explained_variance_ratio_ is not necessarily sorted by decreasing explained variance. The description of components_ is therefore simplified to just be the right singular vectors of the input data.
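To make that concrete, here is a minimal sketch on the iris data (not code from this PR): TruncatedSVD fit on the raw data gives different axes from PCA, but the two agree up to sign once the data is centered by hand.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA, TruncatedSVD

X = load_iris().data

# PCA always centers the data before computing the SVD.
pca = PCA(n_components=2).fit(X)

# TruncatedSVD works on the raw (uncentered) data ...
svd_raw = TruncatedSVD(n_components=2).fit(X)

# ... but matches PCA (up to sign) when the data is centered manually.
svd_centered = TruncatedSVD(n_components=2).fit(X - X.mean(axis=0))

print(np.allclose(np.abs(pca.components_), np.abs(svd_raw.components_)))       # expected: False
print(np.allclose(np.abs(pca.components_), np.abs(svd_centered.components_)))  # expected: True
```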
I've added the identical above definition of
TruncatedSVD is missing a description for
Thanks @nannau. Merging then.
Thanks for the help, @glemaitre and @NicolasHug!
…edSVD (scikit-learn#20340) Co-authored-by: Nic Annau <nannau@uvic.ca>
Reference Issues/PRs
None as far as I can tell.
What does this implement/fix? Explain your changes.
Greetings! Long time sklearn user, first contribution attempt on this project.
PCA typically contains language that is discipline-dependent and can be confusing. For users who would like to go beyond just using sklearn's PCA for dimensionality reduction, a documentation enhancement to the `components_` attribute of a "fit" PCA object would help clarify what it is in terms of common linear algebra language. I've often found myself wondering what `components_` actually is, and "principal axes" as written in the current documentation is not entirely clear. This PR simply adds a line that describes what `sklearn.decomposition.PCA.components_` actually is in `sklearn/decomposition/_pca.py`.

From this:

To this:
Any other comments?
I could be incorrect about this, and so maybe this is why it isn't explicitly mentioned. However, if I am incorrect, then I think other people may have a similar confusion and end up using `components_` as the eigenvectors. If my reasoning is correct, then simply using eigenvectors somewhere in the documentation here (as is done with `explained_variance_` and eigenvalues) could help users.

In this line of the main `_pca.py` file, if A is our input data, the eigenvectors of AᵀA make up the columns of V, which is the variable used for `components_` (see this link). AᵀA itself is proportional to the variance-covariance matrix (for centered data), so its eigendecomposition yields the same eigenvectors as the covariance matrix. In practice, it seems that these vectors differ by sign (not magnitude), and it's not entirely clear to me why. Intuitively, it seems the only reason for this is just in the definition of the direction of the rotation for each component (although I can't point to any math to back this up, it might have something to do with this line and the minutiae of the solver being used).

A quick test of whether the eigenvectors of the covariance matrix match in magnitude is done on the iris toy dataset below.
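Something along these lines (a sketch of such a check, comparing against `numpy.linalg.eigh` on the sample covariance matrix) shows the agreement up to sign:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X = load_iris().data
pca = PCA().fit(X)

# Eigendecomposition of the sample covariance matrix (np.cov uses ddof=1).
eigvals, eigvecs = np.linalg.eigh(np.cov(X.T))

# eigh returns eigenvalues in ascending order; reverse to match PCA's ordering.
eigvals, eigvecs = eigvals[::-1], eigvecs[:, ::-1]

# The principal axes agree up to sign ...
print(np.allclose(np.abs(pca.components_), np.abs(eigvecs.T)))  # expected: True

# ... and the covariance eigenvalues match explained_variance_ directly.
print(np.allclose(pca.explained_variance_, eigvals))            # expected: True
```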
I'd be very interested in hearing what the maintainers think of this change, or if there are other ways we could communicate this particular attribute.
Thanks for your time!