[WIP] Improve spectral clustering implementation #10739

Closed

wants to merge 1 commit into from

Conversation

@sky88088 commented Mar 1, 2018

Reference Issues/PRs

Fixes #10736

What does this implement/fix? Explain your changes.

This PR improves the spectral clustering implementation.

The first eigenvector is now dropped in spectral clustering, and the eigenvectors are weighted by their associated eigenvalues for the embedding.

Any other comments?

Given a simple affinity matrix like

[ (1, 2, 100), (1, 3, 100), (2, 3, 100), (3, 4, 1), (4, 5, 100), (4, 6, 100), (5, 6, 100) ]

if the number of clusters is set to 2, spectral clustering can now stably produce the clusters (1, 2, 3) and (4, 5, 6).
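For concreteness, here is a minimal sketch (mine, not part of the PR's diff) of how these 1-indexed (i, j, weight) triplets translate into the symmetric affinity matrix used in the test code later in this thread:

from scipy.sparse import coo_matrix

# 1-indexed (i, j, weight) triplets from the description above
triplets = [(1, 2, 100), (1, 3, 100), (2, 3, 100),
            (3, 4, 1), (4, 5, 100), (4, 6, 100), (5, 6, 100)]
row = [i - 1 for i, j, w in triplets]  # convert nodes to 0-indexed
col = [j - 1 for i, j, w in triplets]
val = [w for i, j, w in triplets]
# add each edge in both directions to make the matrix symmetric
affinity = coo_matrix((val + val, (row + col, col + row)), shape=(6, 6))
print(affinity.todense())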

@sklearn-lgtm

This pull request introduces 2 alerts when merging eb7a35d into ec691e9 - view on lgtm.com

new alerts:

  • 2 for Potentially uninitialized local variable

Comment posted by lgtm.com

@@ -263,7 +263,7 @@ def spectral_clustering(affinity, n_clusters=8, n_components=None,
     maps = spectral_embedding(affinity, n_components=n_components,
                               eigen_solver=eigen_solver,
                               random_state=random_state,
-                              eigen_tol=eigen_tol, drop_first=False)
+                              eigen_tol=eigen_tol, drop_first=True)
Member

Not a SpectralClustering expert at all, but according to the comment above, there seems to be a reason for drop_first=False:

# The first eigen vector is constant only for fully connected graphs
# and should be kept for spectral clustering (drop_first = False)
# See spectral_embedding documentation.

What do you make of this comment?

@sky88088 (Author) Mar 2, 2018

I have no idea why the comment says drop_first should be False.

All the information about spectral clustering I have read shows that the eigenvector corresponding to eigenvalue 0, which is the first eigenvector, should be dropped. This is also reasonable from a physics point of view.

Take the Wikipedia article (https://en.wikipedia.org/wiki/Spectral_clustering) for example:
The eigenvectors that are relevant are the ones that correspond to the smallest several eigenvalues of the Laplacian, except for the smallest eigenvalue, which will have a value of 0.

Anyway, my test example given above is simple and obvious. I give detailed test code below.

@sky88088 (Author)

An obvious reason to drop the first eigenvector is that its elements are identical, for example [4, 4, 4, 4, 4, 4]. As I said in the issue, it represents a translational motion of the system in physics. It gives no information about how to divide the nodes into clusters.

@jmargeta may know this property of the eigenvector well, according to #9062

In contrast, the eigenvector related to the second smallest eigenvalue in my example is something like [4, 4, 3, -4, -4, -3]: nodes in different clusters have different signs. Not dropping the first eigenvector only affects the stability of the k-means result.
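Here is a minimal numerical check of these two claims (my own sketch, not part of the PR): on this connected example graph, the first eigenvector of the unnormalized Laplacian is constant, and the second splits the two clusters by sign.

import numpy as np
from scipy.sparse import coo_matrix
from scipy.sparse.csgraph import laplacian

row = [0, 0, 1, 2, 3, 3, 4]
col = [1, 2, 2, 3, 4, 5, 5]
val = [100, 100, 100, 1, 100, 100, 100]
A = coo_matrix((val + val, (row + col, col + row)), shape=(6, 6)).toarray()

L = laplacian(A.astype(float))     # unnormalized Laplacian, L = D - A
eigvals, eigvecs = np.linalg.eigh(L)
print(np.round(eigvals, 3))        # smallest eigenvalue is ~0
print(np.round(eigvecs[:, 0], 3))  # first eigenvector: all entries equal
print(np.round(eigvecs[:, 1], 3))  # second: opposite signs for {0,1,2} and {3,4,5}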

@jmargeta (Contributor) Mar 2, 2018

I agree with @sky88088 that for fully connected graphs the first eigenvector is constant and should be dropped, as it can actually hurt clustering performance when included.
However, when the graph contains disconnected regions, isn't it precisely the first - best splitting - eigenvector that you would want to keep?
Extra graphical intuition here: https://www.cs.cmu.edu/~aarti/Class/10701/slides/Lecture21_2.pdf

A suggestion I have seen in the case of several disconnected components is to apply spectral clustering to each component individually, as sketched below.
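A rough sketch of that per-component suggestion (my illustration only; cluster_per_component is a hypothetical helper, not sklearn API):

import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import connected_components
from sklearn.cluster import spectral_clustering

def cluster_per_component(affinity, n_clusters=2):
    # hypothetical helper: run spectral clustering on each connected
    # component of the affinity graph separately
    affinity = csr_matrix(affinity)
    n_comp, comp_labels = connected_components(affinity, directed=False)
    labels = np.empty(affinity.shape[0], dtype=int)
    offset = 0
    for c in range(n_comp):
        mask = comp_labels == c
        if mask.sum() <= n_clusters:
            # component too small to split: one cluster per node
            sub_labels = np.arange(mask.sum())
        else:
            sub = affinity[mask][:, mask]
            sub_labels = spectral_clustering(sub, n_clusters=n_clusters)
        labels[mask] = sub_labels + offset  # keep labels globally unique
        offset += sub_labels.max() + 1
    return labels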

@sky88088 (Author) Mar 2, 2018

I do agree with @jmargeta.

However, even when the graph contains disconnected regions, I think one can only get the right result by setting n_components exactly to the number of disconnected regions. In that case it is hard for users to make the right choice in practice. If I'm wrong, please correct me.

So the first eigenvector hurts the performance of clustering in most cases, and it is hard to use even when it is useful.

Maybe we should check the connectivity of the input matrix before applying spectral_embedding, to decide whether to drop the first eigenvector; or just raise an error in the disconnected case and drop the first eigenvector otherwise. A sketch of the first option follows.
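A sketch of that connectivity check (my illustration, not the PR's actual diff; should_drop_first is a hypothetical helper):

from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import connected_components

def should_drop_first(affinity):
    # the first eigenvector is constant, and safe to drop, only when
    # the affinity graph is a single connected component
    n_comp, _ = connected_components(csr_matrix(affinity), directed=False)
    return n_comp == 1

# spectral_clustering could then call spectral_embedding with
# drop_first=should_drop_first(affinity)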

Contributor

Scikit-learn is a wonderful tool also because it makes many of the hard choices for its users out of the box. Getting this right is important, and thanks for looking into this, @sky88088.

I do not know whether there is any consensus on what to do with disconnected regions.

  • If we set n_components to the number of disconnected regions, would there be any need for clustering? (What about assigning a unique label to each connected component instead?)
  • What about isolated outliers? Do we count them as valid connected components?
  • ...

Dropping the first eigenvector should only hurt performance on disconnected graphs, so a connectivity test to decide whether to drop it seems to me like a good step forward.

@sky88088 (Author)

I think we now share the view that the first eigenvector should be dropped in most cases.

The second improvement is a little more complicated. I don't know whether any variant of spectral clustering or spectral embedding uses eigenvalues to weight the eigenvectors. I learned this idea from elastic network models (https://en.wikipedia.org/wiki/Gaussian_network_model) in physics; it is actually almost the same thing as spectral embedding.

Whether to drop the first eigenvector is now a parameter of the function. I don't know if I should add a new parameter to decide whether to weight the eigenvectors. In my view, this will almost always improve the embedding, so there would be no need for users to choose. However, unlike a classification task, it may not be easy to judge which embedding result is better.

In my example, this improvement makes the result robust and independent of the choice of the number of components. But my example is simple and maybe not so convincing.
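For concreteness, a minimal sketch of the weighting idea as I read it (the exact scheme is not spelled out here; this assumes a GNM-style 1/sqrt(eigenvalue) scaling on a dense affinity, so that high-frequency modes contribute less):

import numpy as np
from scipy.sparse.csgraph import laplacian

def weighted_spectral_embedding(affinity, n_components=2):
    # hypothetical helper: eigenvalue-weighted spectral embedding
    L = laplacian(np.asarray(affinity, dtype=float))
    eigvals, eigvecs = np.linalg.eigh(L)
    # drop the first (constant, eigenvalue ~ 0) eigenvector
    vals = eigvals[1:n_components + 1]
    vecs = eigvecs[:, 1:n_components + 1]
    return vecs / np.sqrt(vals)  # larger eigenvalue -> smaller weight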

Contributor

The Gaussian network model indeed seems similar and could certainly be an interesting source of inspiration for improvements to spectral embedding/clustering. In my opinion (I am not a spectral expert by any means, nor a core sklearn developer), we have to be careful when introducing the weighting, though, if for nothing else then for the sake of reproducible research.

While it might improve spectral clustering's behavior (which, as a user, I would love to see, of course), it would also mean that the implementation diverges from its definition in the original spectral clustering paper (Shi and Malik 2000?). People using it in their research could unknowingly be citing something else (it is unlikely they would reference it as "spectral clustering with enhancements introduced in sklearn commit eb7a35d").

@sky88088, @glemaitre, @lesteve, what do you think?

@sky88088 (Author) Mar 6, 2018

Reproducible research is important, and I hadn't taken it into account. I think adding a new parameter to decide whether to introduce the weighting is therefore necessary. I have updated my PR to add this new parameter.

For the connectivity test of the input graph, I found that spectral_embedding uses a private function for this task:

def _graph_is_connected(graph):
    """ Return whether the graph is connected (True) or Not (False)

    Parameters
    ----------
    graph : array-like or sparse matrix, shape: (n_samples, n_samples)
        adjacency matrix of the graph, non-zero weight means an edge
        between the nodes

    Returns
    -------
    is_connected : bool
        True means the graph is fully connected and False means not
    """
    if sparse.isspmatrix(graph):
        # sparse graph, find all the connected components
        n_connected_components, _ = connected_components(graph)
        return n_connected_components == 1
    else:
        # dense graph, find all connected components start from node 0
        return _graph_connected_component(graph, 0).sum() == graph.shape[0]

However, we would need this test in spectral_clustering, before applying spectral embedding, to decide whether to drop the first eigenvector. I'm wondering whether sklearn has any public function for this task, or whether this private function could be moved to some utils module?

@lesteve (Member) commented Mar 1, 2018

Your PR broke quite a few existing tests, you will need to look at the failures (see this Travis build for example), and fix the tests.

See this for example for how to run the tests on your machine.

@jnothman (Member) commented Mar 1, 2018 via email

@sky88088 (Author) commented Mar 2, 2018

I'm sorry that I didn't give the test code at the beginning.

from scipy.sparse import coo_matrix
from sklearn.cluster import spectral_clustering

# affinity between nodes
row = [0, 0, 1, 2, 3, 3, 4]
col = [1, 2, 2, 3, 4, 5, 5]
val = [100, 100, 100, 1, 100, 100, 100]

# add each edge in both directions to make the matrix symmetric
coo = coo_matrix((val + val, (row + col, col + row)), shape=(6, 6))
print(coo.todense())

labels = spectral_clustering(coo, n_clusters=2, n_components=1)
# clustering result, should be [0 0 0 1 1 1] or [1 1 1 0 0 0]
print(labels)

With the original spectral_clustering, the result is unstable, no matter how many components are used.
After setting drop_first=True,

labels = spectral_clustering(coo, n_clusters=2, n_components=1)

always gives the right answer, but n_components must be 1.

After weighting the eigenvectors according to their related eigenvalues, it works stably no matter how many components are used. For example:

labels = spectral_clustering(coo, n_clusters=2, n_components=4)

I use this as a test example because it is easy to figure out which result is reasonable.

@sky88088 sky88088 changed the title Improve spectral clustering implementation [WIP] Improve spectral clustering implementation Mar 5, 2018
@sklearn-lgtm

This pull request introduces 2 alerts when merging 741ec13 into 2aba6e2 - view on lgtm.com

new alerts:

  • 2 for Potentially uninitialized local variable

Comment posted by lgtm.com

@sky88088 (Author) commented Mar 6, 2018

Oh, it seems that I used rebase to update my PR and lost all the comments based on the last version…

I summarize the opinions from the lost discussion here.

  1. The first eigenvector hurts the performance of clustering when the affinity matrix has no disconnected regions. Maybe we should check the connectivity of the input matrix to decide whether to drop it.

I found that spectral embedding uses a private function to check the connectivity of the input:

def _graph_is_connected(graph):
    """ Return whether the graph is connected (True) or Not (False)

    Parameters
    ----------
    graph : array-like or sparse matrix, shape: (n_samples, n_samples)
        adjacency matrix of the graph, non-zero weight means an edge
        between the nodes

    Returns
    -------
    is_connected : bool
        True means the graph is fully connected and False means not
    """
    if sparse.isspmatrix(graph):
        # sparse graph, find all the connected components
        n_connected_components, _ = connected_components(graph)
        return n_connected_components == 1
    else:
        # dense graph, find all connected components start from node 0
        return _graph_connected_component(graph, 0).sum() == graph.shape[0]

I'm wondering whether sklearn has a public method for this task, or whether this private function could be moved to some utils module.

  2. The idea of weighting the eigenvectors is inspired by the Gaussian network model (https://en.wikipedia.org/wiki/Gaussian_network_model), which is almost the same as spectral embedding. I added a new parameter to spectral clustering to decide whether to use the weighting, for the sake of compatibility.

What do you think of these, @lesteve @jmargeta @glemaitre @jnothman?

@lobpcg (Contributor) commented Aug 13, 2019

The issue appears to be fixed already, so maybe the PR and the issue can be closed? Running the provided code gives the correct answer for any n_components < 4 (it is unclear why one would want to use a larger n_components):

import numpy as np
from scipy.sparse import coo_matrix
from sklearn.cluster import spectral_clustering

# affinity between nodes
row = [0, 0, 1, 2, 3, 3, 4]
col = [1, 2, 2, 3, 4, 5, 5]
val = [100, 100, 100, 1, 100, 100, 100]

coo = coo_matrix((val + val, (row + col, col + row)), shape=(6, 6))
print(coo.todense())

for n_c in np.arange(1, 4):
    labels = spectral_clustering(coo, n_clusters=2, n_components=n_c)
    # clustering result, should be [0 0 0 1 1 1] or [1 1 1 0 0 0]
    print(labels)

[[  0 100 100   0   0   0]
 [100   0 100   0   0   0]
 [100 100   0   1   0   0]
 [  0   0   1   0 100 100]
 [  0   0   0 100   0 100]
 [  0   0   0 100 100   0]]
[1 1 1 0 0 0]
[1 1 1 0 0 0]
[0 0 0 1 1 1]

@amueller (Member)

@lobpcg I just saw you commented somewhere about dropping the first eigenvector if and only if the graph is connected. That sounds like the right thing to me but I can't find the comment right now. Are we implementing that?

@amueller (Member)

docs say:

    # The first eigen vector is constant only for fully connected graphs
    # and should be kept for spectral clustering (drop_first = False)
    # See spectral_embedding documentation.

but it looks like we never drop it. I guess the assumption is that the discretization method doesn't mind a constant vector?
I agree we should close this and the issue, though.
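A quick sanity check of that assumption (my sketch): appending a constant coordinate leaves all pairwise distances unchanged, so k-means recovers the same partition (labels may be permuted).

import numpy as np
from sklearn.cluster import KMeans

fiedler = np.array([[4.0], [4.0], [3.0], [-4.0], [-4.0], [-3.0]])
# prepend a constant column, mimicking the undropped first eigenvector
with_const = np.hstack([np.full((6, 1), 0.408), fiedler])

km = KMeans(n_clusters=2, n_init=10, random_state=0)
print(km.fit_predict(fiedler))     # e.g. [1 1 1 0 0 0]
print(km.fit_predict(with_const))  # same split of {0,1,2} vs {3,4,5}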

@amueller (Member)

Also, eigen_solver='amg' seems to give garbage, but I figure that's related to #13393?

spectral_embedding(coo, n_components=2, random_state=0, drop_first=False, eigen_solver='amg')

with the above example doesn't give the result I'm expecting (which would be one constant eigenvector and one eigenvector separating the two clusters, which is what the other solvers give).

@lobpcg (Contributor) commented Aug 13, 2019

The "constant" vector is actually needed in #12316 implements an alternative method for choosing the clusters from the eigenvectors, outperforming both k-means and discretization method.s The constant vector presence should surely make no difference in k-means and discretization methods

Unrelated to this issue, there is some mystery to me why there is this distinction in the code whether or not the graph is fully connected...

@lobpcg (Contributor) commented Aug 13, 2019

> Also, eigen_solver='amg' seems to give garbage, but I figure that's related to #13393?
>
> spectral_embedding(coo, n_components=2, random_state=0, drop_first=False, eigen_solver='amg')
>
> with the above example doesn't give the result I'm expecting (which would be one constant eigenvector and one eigenvector separating the two clusters, which is what the other solvers give).

Most likely. Could you please give #13393 a try? The trouble may also be related to scipy/scipy#10621, since AMG is actually LOBPCG with preconditioning.

@amueller (Member) commented Aug 13, 2019

@lobpcg when you say #13393, do you mean adding a small value to the diagonal? That didn't seem to help. To me it looks more like the eigenvalues are inverted, as you suggested in #10720 (comment): the 5th eigenvector is as expected (wow, that seems bad).

@lobpcg (Contributor) commented Aug 13, 2019

> @lobpcg when you say #13393, do you mean adding a small value to the diagonal? That didn't seem to help. To me it looks more like the eigenvalues are inverted, as you suggested in #10720 (comment): the 5th eigenvector is as expected (wow, that seems bad).

Yes, in this case the issue is most likely just the same as in #10720 (comment). Thanks for reminding me of my own recent post!

@amueller (Member)

@lobpcg yeah it's a bit tricky to keep track of all of these, thanks for all your input!

@cmarmo (Contributor) commented Sep 2, 2020

Closing this following @amueller's comment.
Thanks @lobpcg for checking this.
