
I'm trying to use scikit-learn to cluster text documents. On the whole I find my way around, but I run into problems with specific issues. Most of the examples I found illustrate clustering with scikit-learn using k-means as the clustering algorithm. Adapting these k-means examples to my setting works in principle. However, k-means is not suitable because I don't know the number of clusters. From what I have read so far (please correct me if needed), DBSCAN or MeanShift seem to be more appropriate in my case. The scikit-learn website provides examples for each clustering algorithm. The problem is that with both DBSCAN and MeanShift I get errors I cannot comprehend, let alone solve.

My minimal code is as follows:

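from sklearn.cluster import DBSCAN
from sklearn.feature_extraction.text import TfidfVectorizer

# Collect the preprocessed documents; [database] is a placeholder for the real source
docs = []
for item in [database]:
    docs.append(item)

# Build the sparse TF-IDF document-term matrix
vectorizer = TfidfVectorizer(min_df=1)
X = vectorizer.fit_transform(docs)

X = X.todense() # <-- This line was needed to resolve the issue

db = DBSCAN(eps=0.3, min_samples=10).fit(X)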
...

(My documents are already processed, i.e., stopwords have been removed and a Porter stemmer has been applied.)

When I run this code, I get the following error when instantiating DBSCAN and calling fit():

...
File "/usr/local/lib/python2.7/dist-packages/sklearn/cluster/dbscan_.py", line 248, in fit
clust = dbscan(X, **self.get_params())
File "/usr/local/lib/python2.7/dist-packages/sklearn/cluster/dbscan_.py", line 86, in dbscan
n = X.shape[0]
IndexError: tuple index out of range

Clicking on the line in dbscan_.py that throws the error, I noticed the following lines:

...
X = np.asarray(X)
n = X.shape[0]
...

When I use these two lines directly in my code for testing, I get the same error. I don't really know what np.asarray(X) is doing here, but after that call X.shape is (). Hence X.shape[0] fails, whereas before the call X.shape[0] correctly refers to the number of documents. Out of curiosity, I removed X = np.asarray(X) from dbscan_.py. When I do this, some heavy computation starts, but after a few seconds I get another error:

...
File "/usr/lib/python2.7/dist-packages/scipy/sparse/csr.py", line 214, in extractor
(min_indx,max_indx) = check_bounds(indices,N)
File "/usr/lib/python2.7/dist-packages/scipy/sparse/csr.py", line 198, in check_bounds
max_indx = indices.max()
File "/usr/lib/python2.7/dist-packages/numpy/core/_methods.py", line 17, in _amax
out=out, keepdims=keepdims)
ValueError: zero-size array to reduction operation maximum which has no identity

In short, I have no clue how to get DBSCAN working, or what I might have missed, in general.

1 Answer

The implementation in sklearn seems to assume you are dealing with a finite vector space, and it wants to find the dimensionality of your data set. Text data is commonly represented as sparse vectors, but with the same dimensionality.

Your input data probably is not a data matrix, but the sklearn implementations need it to be one.
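
To illustrate the IndexError from the question, here is a minimal sketch of what np.asarray does to a SciPy sparse matrix. The small matrix is made up for illustration and only stands in for the TF-IDF output; the behaviour shown is what I would expect from NumPy and SciPy, not something taken from the asker's setup.

```python
import numpy as np
from scipy.sparse import csr_matrix

# Hypothetical stand-in for the TF-IDF output: 3 documents, 4 terms.
X = csr_matrix([[0.0, 1.0, 0.0, 2.0],
                [0.0, 0.0, 3.0, 0.0],
                [4.0, 0.0, 0.0, 0.0]])

# np.asarray does not convert the sparse matrix to numbers; it wraps the
# sparse object itself in a 0-dimensional object array.
wrapped = np.asarray(X)

# An explicit conversion yields a regular 2-D ndarray.
dense = X.toarray()

print(X.shape)        # shape of the sparse matrix
print(wrapped.shape)  # empty tuple, so wrapped.shape[0] raises IndexError
print(dense.shape)    # same shape as the sparse matrix
```

This is why X.shape[0] works before the call and fails after it: the array no longer has a first axis to index.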

You'll need to find a different implementation. Maybe try the implementation in ELKI, which is very fast and should not have this limitation.

You'll need to spend some time understanding similarity first. For DBSCAN, you need to choose epsilon in a way that makes sense for your data. There is no rule of thumb; this is domain-specific. Therefore, you first have to figure out which similarity threshold means that two documents are similar.
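
As a rough sketch of what choosing epsilon means in practice: more recent scikit-learn releases, as far as I know, accept a sparse matrix in DBSCAN.fit and support metric="cosine", so epsilon can be read as a cosine-distance threshold between TF-IDF vectors. The toy documents and the values eps=0.3 and min_samples=2 below are assumptions picked for illustration only, not recommendations for real data.

```python
from sklearn.cluster import DBSCAN
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpus: two pairs of identical documents and one outlier.
docs = [
    "cat sat mat",
    "cat sat mat",
    "stock market fell",
    "stock market fell",
    "quantum physics lecture",
]

# Sparse TF-IDF matrix; no conversion to a dense array is done here.
X = TfidfVectorizer(min_df=1).fit_transform(docs)

# eps is a cosine-distance threshold: two documents count as neighbours
# when their distance is at most 0.3. Both values are illustrative.
db = DBSCAN(eps=0.3, min_samples=2, metric="cosine").fit(X)

labels = db.labels_  # one label per document; -1 marks noise
print(labels)
```

Identical documents should end up in the same cluster, while the document that shares no terms with the others is labelled as noise. On real text the threshold has to come from inspecting which distances correspond to documents you would call similar.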

Mean Shift may actually need your data to be a vector space of fixed dimensionality.
