[WIP] Improve spectral clustering implementation #10739
Conversation
This pull request introduces 2 alerts when merging eb7a35d into ec691e9 - view the new alerts on lgtm.com.
(Comment posted by lgtm.com)
sklearn/cluster/spectral.py
Outdated
@@ -263,7 +263,7 @@ def spectral_clustering(affinity, n_clusters=8, n_components=None,
     maps = spectral_embedding(affinity, n_components=n_components,
                               eigen_solver=eigen_solver,
                               random_state=random_state,
-                             eigen_tol=eigen_tol, drop_first=False)
+                             eigen_tol=eigen_tol, drop_first=True)
Not a SpectralClustering expert at all, but according to the comment above, there seems to be a reason for drop_first=False. What do you make of this comment?
# The first eigen vector is constant only for fully connected graphs
# and should be kept for spectral clustering (drop_first = False)
# See spectral_embedding documentation.
I have no idea why the comment says drop_first should be False.
Everything I have read about SpectralClustering says that the eigenvector corresponding to eigenvalue 0 should be dropped, and that is the first eigenvector. It is also reasonable from a physics point of view.
Take the description on Wikipedia (https://en.wikipedia.org/wiki/Spectral_clustering) for example:
"The eigenvectors that are relevant are the ones that correspond to the smallest several eigenvalues of the Laplacian, except for the smallest eigenvalue, which will have a value of 0."
Anyway, my test example given above is simple and obvious. I give the detailed test code below.
An obvious reason to drop the first eigenvector is that its elements are identical, for example [4, 4, 4, 4, 4, 4]. As I said in the issue, in physics it represents a translational motion of the system; it gives no information about how to divide the nodes into clusters.
@jmargeta may know this property of the first eigenvector well, judging from #9062.
In contrast, the eigenvector related to the second smallest eigenvalue in my example is something like [4, 4, 3, -4, -4, -3]: nodes in different clusters have different signs. Not dropping the first eigenvector only affects the stability of the k-means result.
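To make this concrete, here is a minimal sketch (my own illustration, not code from this PR) that computes the eigenvectors of the unnormalized Laplacian L = D - W for the 6-node example graph used in this thread; the node numbering and the choice of the unnormalized Laplacian are only for the example.

import numpy as np

# example affinity: two triangles (weight 100) joined by one weak edge (weight 1)
W = np.zeros((6, 6))
edges = [(0, 1, 100), (0, 2, 100), (1, 2, 100),
         (2, 3, 1),
         (3, 4, 100), (3, 5, 100), (4, 5, 100)]
for i, j, w in edges:
    W[i, j] = W[j, i] = w

L = np.diag(W.sum(axis=1)) - W          # unnormalized graph Laplacian
eigvals, eigvecs = np.linalg.eigh(L)    # eigenvalues in ascending order

print(np.round(eigvals[:2], 4))         # the first eigenvalue is 0
print(np.round(eigvecs[:, 0], 3))       # constant vector, no cluster information
print(np.round(eigvecs[:, 1], 3))       # signs separate nodes {0, 1, 2} from {3, 4, 5}

The first eigenvector has the same value at every node, while the second changes sign exactly between the two triangles.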
I agree with @sky88088 that for fully connected graphs the first eigenvector is constant and should be dropped, since it can actually hurt clustering performance when included.
However, when the graph contains disconnected regions, isn't it precisely the first (best splitting) eigenvector that you would want to keep?
Extra graphical intuition here: https://www.cs.cmu.edu/~aarti/Class/10701/slides/Lecture21_2.pdf
A suggestion I have seen for the case of several disconnected components is to apply spectral clustering to each component individually, as in the sketch below.
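A rough sketch of that per-component suggestion (my own illustration, not code from this PR; the helper name cluster_per_component and the decision to request n_clusters inside every component are assumptions made only for the example):

import numpy as np
from scipy.sparse.csgraph import connected_components
from sklearn.cluster import spectral_clustering

def cluster_per_component(affinity, n_clusters):
    affinity = np.asarray(affinity)
    n_comp, comp_labels = connected_components(affinity, directed=False)
    labels = np.empty(affinity.shape[0], dtype=int)
    offset = 0
    for c in range(n_comp):
        mask = comp_labels == c
        if mask.sum() <= n_clusters:
            # component too small to split further: one label for all of its nodes
            labels[mask] = offset
            offset += 1
        else:
            sub = affinity[np.ix_(mask, mask)]
            labels[mask] = offset + spectral_clustering(sub, n_clusters=n_clusters)
            offset += n_clusters
    return labels

How the overall cluster budget should be divided across components is exactly the open question; the sketch simply asks for n_clusters within each component.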
I do agree with @jmargeta.
However, even when the graph contains disconnected regions, I think one only gets the right result by setting n_components to exactly the number of disconnected regions, and in practice it is hard for users to make that choice. If I'm wrong, please correct me.
So the first eigenvector hurts clustering performance in most cases, and it is hard to use even in the cases where it is useful.
Maybe we should check the connectivity of the input matrix before applying spectral_embedding to decide whether to drop the first eigenvector, or raise an error in the disconnected case and drop the first eigenvector otherwise.
Scikit-learn is a wonderful tool also because it makes many of the hard choices for its users out of the box. Getting this right is important, and thanks for looking into this @sky88088.
I do not know whether there is any consensus on what to do with disconnected regions.
- If we set n_components to the number of disconnected regions, would there be any need for clustering? (What about assigning a unique label to each connected component?)
- What about isolated outliers? Do we count them as valid connected components?
- ...
Dropping the first eigenvector should only hurt performance with disconnected graphs, so a connectivity test to decide whether to drop it seems to me like a good step forward.
I think we now agree that the first eigenvector should be dropped in most cases.
The second improvement is a little more complicated. I don't know of any variant of spectral clustering or spectral embedding that uses the eigenvalues to weight the eigenvectors. I learned the idea from elastic network models (https://en.wikipedia.org/wiki/Gaussian_network_model) in physics, which are essentially the same thing as spectral embedding.
Whether to drop the first eigenvector is now a parameter of the function. I don't know whether I should add a new parameter to decide whether to weight the eigenvectors. In my view this will almost always improve the embedding, so there would be no need for users to choose. However, unlike a classification task, it may not be easy to judge which embedding result is better.
In my example, this improvement makes the result robust and independent of the choice of the number of components, but my example is simple and maybe not so convincing.
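For reference, here is a rough sketch of how such a weighting could look (my own interpretation, assuming a 1 / eigenvalue scaling in the spirit of the Gaussian network model; the scheme actually implemented in this PR may differ, and the function name weighted_spectral_labels is hypothetical):

import numpy as np
from scipy.sparse.csgraph import laplacian
from sklearn.cluster import KMeans

def weighted_spectral_labels(affinity, n_clusters, n_components):
    affinity = np.asarray(affinity, dtype=float)
    lap = laplacian(affinity, normed=True)
    eigvals, eigvecs = np.linalg.eigh(lap)    # eigenvalues in ascending order
    # drop the first eigenvector (eigenvalue 0), keep the next n_components
    vals = eigvals[1:n_components + 1]
    vecs = eigvecs[:, 1:n_components + 1]
    embedding = vecs / vals                   # amplify slow, cluster-revealing modes
    return KMeans(n_clusters=n_clusters, n_init=10).fit_predict(embedding)

With this kind of weighting, adding extra components changes the embedding only slightly, since modes with larger eigenvalues are strongly damped; that is the robustness to n_components described above.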
The Gaussian network model indeed seems similar and could certainly be an interesting source of inspiration for improvements to spectral embedding/clustering. In my opinion (I am not a spectral expert by any means, nor a core sklearn developer), we have to be careful when introducing the weighting, though, if for nothing else then for the sake of reproducible research.
While it might improve spectral clustering's behavior (which as a user I would love to see, of course), it would also mean that the implementation diverges from its definition in the original spectral clustering paper (Shi and Malik 2000?). People using it in their research could unknowingly be citing something else (it is unlikely that they would reference it as "spectral clustering with enhancements introduced in sklearn commit eb7a35d").
@sky88088, @glemaitre, @lesteve what do you think?
Reproducible research is important and I had not taken it into account, so I think adding a new parameter to decide whether to introduce the weighting is necessary. I have updated my PR to add this new parameter.
As for the connectivity test of the input graph, I found that spectral_embedding uses a private function for this task.
scikit-learn/sklearn/manifold/spectral_embedding_.py
Lines 69 to 89 in 2aba6e2
def _graph_is_connected(graph):
    """ Return whether the graph is connected (True) or Not (False)

    Parameters
    ----------
    graph : array-like or sparse matrix, shape: (n_samples, n_samples)
        adjacency matrix of the graph, non-zero weight means an edge
        between the nodes

    Returns
    -------
    is_connected : bool
        True means the graph is fully connected and False means not
    """
    if sparse.isspmatrix(graph):
        # sparse graph, find all the connected components
        n_connected_components, _ = connected_components(graph)
        return n_connected_components == 1
    else:
        # dense graph, find all connected components start from node 0
        return _graph_connected_component(graph, 0).sum() == graph.shape[0]
However, we would need to run this test in spectral_clustering, before applying spectral_embedding, to decide whether to drop the first eigenvector. I'm wondering whether sklearn has a public function for this task, or whether this private function could be moved to some utils module?
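For what it's worth, a minimal sketch of what such a check could look like using only SciPy's public API (the helper name _affinity_is_connected is hypothetical and only for illustration):

from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import connected_components

def _affinity_is_connected(affinity):
    # works for both dense arrays and sparse matrices
    n_components, _ = connected_components(csr_matrix(affinity), directed=False)
    return n_components == 1

# spectral_clustering could then decide:
# drop_first = _affinity_is_connected(affinity)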
Your PR broke quite a few existing tests; you will need to look at the failures (see this Travis build for example) and fix the tests. See this for how to run the tests on your machine.
But we wouldn't change the defaults without a deprecation period, so the test failures are probably irrelevant.
I'm sorry that I didn't give the test code at the beginning.

from scipy.sparse import coo_matrix
from sklearn.cluster import spectral_clustering

# affinity between nodes
row = [0, 0, 1, 2, 3, 3, 4]
col = [1, 2, 2, 3, 4, 5, 5]
val = [100, 100, 100, 1, 100, 100, 100]
coo = coo_matrix((val + val, (row + col, col + row)), shape=(6, 6))
print(coo.todense())

labels = spectral_clustering(coo, n_clusters=2, n_components=1)
# clustering result, should be [0 0 0 1 1 1] or [1 1 1 0 0 0]
print(labels)

For the original spectral_clustering, the result is unstable no matter how many components are used. With the first eigenvector dropped, labels = spectral_clustering(coo, n_clusters=2, n_components=1) always gives the right answer, but n_components must be 1. After also weighting the eigenvectors by their related eigenvalues, it works stably no matter how many components are used, for example labels = spectral_clustering(coo, n_clusters=2, n_components=4). I use this as a test example because it is easy to figure out which result is reasonable.
This pull request introduces 2 alerts when merging 741ec13 into 2aba6e2 - view the new alerts on lgtm.com.
(Comment posted by lgtm.com)
Oh, it seems that I used rebase to update my PR and lost all the comments based on the last version... I summarize the opinions from the lost discussion here.
I found that spectral_embedding uses a private function to check the connectivity of the input (scikit-learn/sklearn/manifold/spectral_embedding_.py, lines 69 to 89 in 2aba6e2).
I'm wondering whether sklearn has a public method for this task, or whether this private function could be moved to some utils module.
What do you think of these, @lesteve @jmargeta @glemaitre @jnothman?
The issue appears to be fixed already, so maybe the PR and the issue can be closed? Running the provided code gives the correct answer for any n_components < 4 (it is unclear why one would want to use a larger n_components).
@lobpcg I just saw you commented somewhere about dropping the first eigenvector if and only if the graph is connected. That sounds like the right thing to me but I can't find the comment right now. Are we implementing that?
docs say:
but it looks like we never drop it. I guess the assumption is that the discretization method doesn't mind a constant vector?
also
with the above example doesn't give the result I'm expecting (which would be one constant eigenvector and one eigenvector separating the two clusters, which is what the other solvers give).
The "constant" vector is actually needed in #12316 implements an alternative method for choosing the clusters from the eigenvectors, outperforming both k-means and discretization method.s The constant vector presence should surely make no difference in k-means and discretization methods Unrelated to this issue, there is some mystery to me why there is this distinction in the code whether or not the graph is fully connected... |
Most likely - could you please give #13393 a try? The trouble may also be related to scipy/scipy#10621, since AMG is actually lobpcg with preconditioning.
@lobpcg when you say #13393, do you mean adding a small value to the diagonal? That didn't seem to help. To me it looks more like the eigenvalues are inverted, as you suggested in #10720 (comment) - the 5th eigenvector is as expected (wow, that seems bad).
Yes, in this case the issue is most likely just the same as in #10720 (comment) - thanks for reminding me of my own recent post!
@lobpcg yeah, it's a bit tricky to keep track of all of these, thanks for all your input!
Reference Issues/PRs
Fixes #10736
What does this implement/fix? Explain your changes.
This PR improves the spectral clustering implementation.
The first eigenvector is now dropped in spectral clustering, and the eigenvectors are weighted by their related eigenvalues for the embedding.
Any other comments?
Given a simple affinity matrix like
[ (1, 2, 100), (1, 3, 100), (2, 3, 100), (3, 4, 1), (4, 5, 100), (4, 6, 100), (5, 6, 100) ]
If the number of clusters is set to 2, spectral clustering now stably produces the clusters (1, 2, 3) and (4, 5, 6).