[WIP] Improve spectral clustering implementation #10739
Conversation
This pull request introduces 2 alerts when merging eb7a35d into ec691e9 - view the new alerts on lgtm.com.
(Comment posted by lgtm.com)
sklearn/cluster/spectral.py
Outdated
@@ -263,7 +263,7 @@ def spectral_clustering(affinity, n_clusters=8, n_components=None,
     maps = spectral_embedding(affinity, n_components=n_components,
                               eigen_solver=eigen_solver,
                               random_state=random_state,
-                             eigen_tol=eigen_tol, drop_first=False)
+                             eigen_tol=eigen_tol, drop_first=True)
Not a SpectralClustering expert at all, but according to the comment above, there seems to be a reason for drop_first=False. What do you make of this comment?
# The first eigen vector is constant only for fully connected graphs
# and should be kept for spectral clustering (drop_first = False)
# See spectral_embedding documentation.
I have no idea why the comment says drop_first should be False.
Everything I have read about SpectralClustering says that the eigenvector corresponding to eigenvalue 0 should be dropped, and that is the first eigenvector. It is also reasonable from a physics point of view.
Take the description on Wikipedia (https://en.wikipedia.org/wiki/Spectral_clustering) for example:
"The eigenvectors that are relevant are the ones that correspond to the smallest several eigenvalues of the Laplacian, except for the smallest eigenvalue, which will have a value of 0."
Anyway, my test example given above is simple and obvious. I give the detailed test code below.
An obvious reason to drop the first eigenvector is that its elements are identical, for example [4, 4, 4, 4, 4, 4]. As I said in the issue, in physics it represents a translational motion of the system; it gives no information about how to divide the nodes into clusters.
@jmargeta may know this property of the first eigenvector well, judging from #9062.
In contrast, the eigenvector related to the second smallest eigenvalue in my example is something like [4, 4, 3, -4, -4, -3]: nodes in different clusters have different signs. Not dropping the first eigenvector only affects the stability of the k-means result.
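To make this concrete, here is a minimal sketch (my own illustration, not code from this PR) that computes the eigenvectors of the unnormalized Laplacian L = D - W for the 6-node example graph used in this thread; the node numbering and the choice of the unnormalized Laplacian are only for the example.

import numpy as np

# example affinity: two triangles (weight 100) joined by one weak edge (weight 1)
W = np.zeros((6, 6))
edges = [(0, 1, 100), (0, 2, 100), (1, 2, 100),
         (2, 3, 1),
         (3, 4, 100), (3, 5, 100), (4, 5, 100)]
for i, j, w in edges:
    W[i, j] = W[j, i] = w

L = np.diag(W.sum(axis=1)) - W          # unnormalized graph Laplacian
eigvals, eigvecs = np.linalg.eigh(L)    # eigenvalues in ascending order

print(np.round(eigvals[:2], 4))         # the first eigenvalue is 0
print(np.round(eigvecs[:, 0], 3))       # constant vector, no cluster information
print(np.round(eigvecs[:, 1], 3))       # signs separate nodes {0, 1, 2} from {3, 4, 5}

The first eigenvector has the same value at every node, while the second changes sign exactly between the two triangles.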
I agree with @sky88088 that for fully connected graphs the first eigenvector is constant and should be dropped, since it can actually hurt clustering performance when included.
However, when the graph contains disconnected regions, isn't it precisely the first (best splitting) eigenvector that you would want to keep?
Extra graphical intuition here: https://www.cs.cmu.edu/~aarti/Class/10701/slides/Lecture21_2.pdf
A suggestion I have seen for the case of several disconnected components is to apply spectral clustering to each component individually, as in the sketch below.
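A rough sketch of that per-component suggestion (my own illustration, not code from this PR; the helper name cluster_per_component and the decision to request n_clusters inside every component are assumptions made only for the example):

import numpy as np
from scipy.sparse.csgraph import connected_components
from sklearn.cluster import spectral_clustering

def cluster_per_component(affinity, n_clusters):
    affinity = np.asarray(affinity)
    n_comp, comp_labels = connected_components(affinity, directed=False)
    labels = np.empty(affinity.shape[0], dtype=int)
    offset = 0
    for c in range(n_comp):
        mask = comp_labels == c
        if mask.sum() <= n_clusters:
            # component too small to split further: one label for all of its nodes
            labels[mask] = offset
            offset += 1
        else:
            sub = affinity[np.ix_(mask, mask)]
            labels[mask] = offset + spectral_clustering(sub, n_clusters=n_clusters)
            offset += n_clusters
    return labels

How the overall cluster budget should be divided across components is exactly the open question; the sketch simply asks for n_clusters within each component.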
I do agree with @jmargeta.
However, even when the graph contains disconnected regions, I think one only gets the right result by setting n_components to exactly the number of disconnected regions, and in practice it is hard for users to make that choice. If I'm wrong, please correct me.
So the first eigenvector hurts clustering performance in most cases, and it is hard to use even in the cases where it is useful.
Maybe we should check the connectivity of the input matrix before applying spectral_embedding to decide whether to drop the first eigenvector, or raise an error in the disconnected case and drop the first eigenvector otherwise.
Scikit-learn is a wonderful tool also because it makes many of the hard choices for its users out of the box. Getting this right is important, and thanks for looking into this @sky88088.
I do not know whether there is any consensus on what to do with disconnected regions.
- If we set n_components to the number of disconnected regions, would there be any need for clustering? (What about assigning a unique label to each connected component?)
- What about isolated outliers? Do we count them as valid connected components?
- ...
Dropping the first eigenvector should only hurt performance with disconnected graphs, so a connectivity test to decide whether to drop it seems to me like a good step forward.
I think we now agree that the first eigenvector should be dropped in most cases.
The second improvement is a little more complicated. I don't know of any variant of spectral clustering or spectral embedding that uses the eigenvalues to weight the eigenvectors. I learned the idea from elastic network models (https://en.wikipedia.org/wiki/Gaussian_network_model) in physics, which are essentially the same thing as spectral embedding.
Whether to drop the first eigenvector is now a parameter of the function. I don't know whether I should add a new parameter to decide whether to weight the eigenvectors. In my view this will almost always improve the embedding, so there would be no need for users to choose. However, unlike a classification task, it may not be easy to judge which embedding result is better.
In my example, this improvement makes the result robust and independent of the choice of the number of components, but my example is simple and maybe not so convincing.
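For reference, here is a rough sketch of how such a weighting could look (my own interpretation, assuming a 1 / eigenvalue scaling in the spirit of the Gaussian network model; the scheme actually implemented in this PR may differ, and the function name weighted_spectral_labels is hypothetical):

import numpy as np
from scipy.sparse.csgraph import laplacian
from sklearn.cluster import KMeans

def weighted_spectral_labels(affinity, n_clusters, n_components):
    affinity = np.asarray(affinity, dtype=float)
    lap = laplacian(affinity, normed=True)
    eigvals, eigvecs = np.linalg.eigh(lap)    # eigenvalues in ascending order
    # drop the first eigenvector (eigenvalue 0), keep the next n_components
    vals = eigvals[1:n_components + 1]
    vecs = eigvecs[:, 1:n_components + 1]
    embedding = vecs / vals                   # amplify slow, cluster-revealing modes
    return KMeans(n_clusters=n_clusters, n_init=10).fit_predict(embedding)

With this kind of weighting, adding extra components changes the embedding only slightly, since modes with larger eigenvalues are strongly damped; that is the robustness to n_components described above.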
The Gaussian network model indeed seems similar and could certainly be an interesting source of inspiration for improvements to spectral embedding/clustering. In my opinion (I am not a spectral expert by any means, nor a core sklearn developer), we have to be careful when introducing the weighting, though, if for nothing else then for the sake of reproducible research.
While it might improve spectral clustering's behavior (which as a user I would love to see, of course), it would also mean that the implementation diverges from its definition in the original spectral clustering paper (Shi and Malik 2000?). People using it in their research could unknowingly be citing something else (it is unlikely that they would reference it as "spectral clustering with enhancements introduced in sklearn commit eb7a35d").
@sky88088, @glemaitre, @lesteve what do you think?
Reproducible research is important and I had not taken it into account, so I think adding a new parameter to decide whether to introduce the weighting is necessary. I have updated my PR to add this new parameter.
As for the connectivity test of the input graph, I found that spectral_embedding uses a private function for this task.
scikit-learn/sklearn/manifold/spectral_embedding_.py
Lines 69 to 89 in 2aba6e2
def _graph_is_connected(graph):
    """ Return whether the graph is connected (True) or Not (False)

    Parameters
    ----------
    graph : array-like or sparse matrix, shape: (n_samples, n_samples)
        adjacency matrix of the graph, non-zero weight means an edge
        between the nodes

    Returns
    -------
    is_connected : bool
        True means the graph is fully connected and False means not
    """
    if sparse.isspmatrix(graph):
        # sparse graph, find all the connected components
        n_connected_components, _ = connected_components(graph)
        return n_connected_components == 1
    else:
        # dense graph, find all connected components start from node 0
        return _graph_connected_component(graph, 0).sum() == graph.shape[0]
However, we would need to run this test in spectral_clustering, before applying spectral_embedding, to decide whether to drop the first eigenvector. I'm wondering whether sklearn has a public function for this task, or whether this private function could be moved to some utils module?
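For what it's worth, a minimal sketch of what such a check could look like using only SciPy's public API (the helper name _affinity_is_connected is hypothetical and only for illustration):

from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import connected_components

def _affinity_is_connected(affinity):
    # works for both dense arrays and sparse matrices
    n_components, _ = connected_components(csr_matrix(affinity), directed=False)
    return n_components == 1

# spectral_clustering could then decide:
# drop_first = _affinity_is_connected(affinity)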
Your PR broke quite a few existing tests; you will need to look at the failures (see this Travis build for example) and fix the tests. See this for how to run the tests on your machine.
But we wouldn't change the defaults without a deprecation period, so the test failures are probably irrelevant.
I'm sorry that I didn't give the test code at the beginning.

from scipy.sparse import coo_matrix
from sklearn.cluster import spectral_clustering

# affinity between nodes
row = [0, 0, 1, 2, 3, 3, 4]
col = [1, 2, 2, 3, 4, 5, 5]
val = [100, 100, 100, 1, 100, 100, 100]
coo = coo_matrix((val + val, (row + col, col + row)), shape=(6, 6))
print(coo.todense())

labels = spectral_clustering(coo, n_clusters=2, n_components=1)
# clustering result, should be [0 0 0 1 1 1] or [1 1 1 0 0 0]
print(labels)

For the original spectral_clustering, the result is unstable no matter how many components are used. With the first eigenvector dropped, labels = spectral_clustering(coo, n_clusters=2, n_components=1) always gives the right answer, but n_components must be 1. After also weighting the eigenvectors by their related eigenvalues, it works stably no matter how many components are used, for example labels = spectral_clustering(coo, n_clusters=2, n_components=4). I use this as a test example because it is easy to figure out which result is reasonable.
This pull request introduces 2 alerts when merging 741ec13 into 2aba6e2 - view the new alerts on lgtm.com.
(Comment posted by lgtm.com)
Oh, it seems that I used rebase to update my PR and lost all the comments based on the last version... I summarize the opinions from the lost discussion here.
I found that spectral_embedding uses a private function to check the connectivity of the input (scikit-learn/sklearn/manifold/spectral_embedding_.py, lines 69 to 89 in 2aba6e2).
I'm wondering whether sklearn has a public method for this task, or whether this private function could be moved to some utils module.
What do you think of these, @lesteve @jmargeta @glemaitre @jnothman?
The issue appears to be fixed already, so maybe the PR and the issue can be closed? Running the provided code gives the correct answer for any n_components < 4 (it is unclear why one would want to use a larger n_components).
@lobpcg I just saw you commented somewhere about dropping the first eigenvector if and only if the graph is connected. That sounds like the right thing to me but I can't find the comment right now. Are we implementing that?
docs say:
but it looks like we never drop it. I guess the assumption is that the discretization method doesn't mind a constant vector?
also
with the above example doesn't give the result I'm expecting (which would be one constant eigenvector and one eigenvector separating the two clusters, which is what the other solvers give).
The "constant" vector is actually needed in #12316 implements an alternative method for choosing the clusters from the eigenvectors, outperforming both k-means and discretization method.s The constant vector presence should surely make no difference in k-means and discretization methods Unrelated to this issue, there is some mystery to me why there is this distinction in the code whether or not the graph is fully connected... |
Most likely - could you please give #13393 a try? The trouble may also be related to scipy/scipy#10621, since AMG is actually lobpcg with preconditioning.
@lobpcg when you say #13393, do you mean adding a small value to the diagonal? That didn't seem to help. To me it looks more like the eigenvalues are inverted, as you suggested in #10720 (comment) - the 5th eigenvector is as expected (wow, that seems bad).
Yes, in this case the issue is most likely just the same as in #10720 (comment) - thanks for reminding me of my own recent post!
@lobpcg yeah, it's a bit tricky to keep track of all of these, thanks for all your input!
Reference Issues/PRs
Fixes #10736
What does this implement/fix? Explain your changes.
This PR improves the spectral clustering implementation.
The first eigenvector is now dropped in spectral clustering, and the eigenvectors are weighted by their related eigenvalues for the embedding.
Any other comments?
Given a simple affinity matrix like
[ (1, 2, 100), (1, 3, 100), (2, 3, 100), (3, 4, 1), (4, 5, 100), (4, 6, 100), (5, 6, 100) ]
If the number of clusters is set to 2, spectral clustering now stably produces the clusters (1, 2, 3) and (4, 5, 6).