In [441]:
%matplotlib inline

Transductive Learning with scikit-learn

Date: 18/Dec/2015 Author: Chirag Nagpal

Clustering is expensive, especially when our dataset contains millions of datapoints. Recomputing the clusters every time we receive new data is thus, in many cases, intractable. With more data, there is also the risk of degrading the existing clustering.

One solution to this problem is to first infer the target classes using some unsupervised learning algorithm and then fit a classifier on the inferred targets, treating the task as a supervised problem. This is known as transductive learning.
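Before building it step by step, here is a minimal sketch of the recipe, assuming scikit-learn's DBSCAN and SVC (the concrete choices used later in this notebook); the helper name fit_transductive is hypothetical:

from sklearn.cluster import DBSCAN
from sklearn.svm import SVC

def fit_transductive(X):
    labels = DBSCAN().fit(X).labels_  # 1. unsupervised step: infer targets from clusters
    clf = SVC().fit(X, labels)        # 2. supervised step: fit a classifier to those targets
    return clf                        # 3. clf.predict(new_X) then avoids reclustering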

Let us first create some synthetic data

We create a synthetic dataset using the sklearn.datasets.make_blobs interface, which generates 2D data with 3 clusters. To make the classification task more challenging, we add some random noise using the numpy.random.rand interface. The data is normalised using StandardScaler.

In [442]:
import numpy as np
import matplotlib.pyplot as plt
from sklearn import cluster, datasets
from sklearn.preprocessing import StandardScaler

n_samples = 5000

colors = np.array([x for x in 'bgrcmykbgrcmykbgrcmykbgrcmyk'])

blobs = datasets.make_blobs(n_samples=3*n_samples, random_state=8)
noise = np.random.rand(n_samples, 2)

noise = StandardScaler().fit_transform(noise)
dataset = blobs

X, y = dataset
# normalize dataset for easier parameter selection
X = StandardScaler().fit_transform(X)
X = np.concatenate((X, noise), axis=0)
plt.scatter(X[:, 0], X[:, 1], color="black", s=2)
plt.show()

Clustering the data

We fit DBSCAN to our data. DBSCAN performs well on such datasets since it explicitly identifies outliers as noise rather than forcing them into a cluster.

In [443]:
dbscan = cluster.DBSCAN(eps=0.2)  # eps assumed to suit the scaled data; tune as needed
dbscan.fit(X)
y_pred = dbscan.labels_.astype(int)
plt.scatter(X[:, 0], X[:, 1], color=colors[y_pred].tolist(), s=2)
plt.plot()
Out[443]:
[]
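DBSCAN marks the samples it considers outliers with the label -1, which is why noise shows up as its own class below. As a quick sanity check (an illustrative addition, not part of the original run), the cluster sizes can be inspected directly from the labels:

import numpy as np

labels, counts = np.unique(dbscan.labels_, return_counts=True)
print(dict(zip(labels, counts)))  # the -1 entry counts the noise points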

Fitting a classifier to the results from Clustering

We take the labels from the clustering and fit a classifier to them, with the samples marked as noise during the clustering phase treated as a separate class. Here, we use a one-vs-rest SVM with an RBF kernel.

In [444]:
from sklearn import svm
from sklearn.cross_validation import cross_val_score
from sklearn.ensemble import RandomForestClassifier

clf = svm.SVC(kernel='rbf', decision_function_shape='ovr', degree=10)  # degree is ignored by the RBF kernel
#clf = RandomForestClassifier(n_estimators=100)

scores = cross_val_score(clf, X, y_pred, cv=3)
print "Classifier accuracy with respect to clustering:", np.mean(scores)
clf.fit(X, y_pred)
Classifier accuracy with respect to clustering: 0.98420207115
Out[444]:
SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape='ovr', degree=10, gamma='auto', kernel='rbf',
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False)
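With decision_function_shape='ovr', the fitted SVC returns one decision value per class, so decision_function yields an array of shape (n_samples, n_classes); the visualisation below plots each column as a separate panel. A quick shape check (illustrative):

print(clf.decision_function(X[:5]).shape)  # (5, n_classes): one ovr score per class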

Visualising the classifier decision boundary

In [445]:
h = .02  # step size of the decision-surface mesh

x_min, x_max = X[:, 0].min() - .5, X[:, 0].max() + .5
y_min, y_max = X[:, 1].min() - .5, X[:, 1].max() + .5
xx, yy = np.meshgrid(np.arange(x_min, x_max, h), np.arange(y_min, y_max, h))

Z = clf.decision_function(np.c_[xx.ravel(), yy.ravel()])
n_clusters = len(set(y_pred))  # includes the noise label (-1) as its own class

plt.figure(figsize=(n_clusters * 2 + 3, 2.5))
plt.subplots_adjust(left=.02, right=.98, bottom=.001, top=.96, wspace=.05,
                    hspace=.01)

for i in range(n_clusters):
    Zz = Z[:, i].reshape(xx.shape)
    plt.subplot(1, n_clusters, i + 1)  # subplot indices are 1-based
    plt.scatter(X[:, 0], X[:, 1],  s=1)
    plt.contourf(xx, yy, Zz, alpha=.4)

    plt.xticks(())
    plt.yticks(())
plt.show()

We now generate new random data and infer its classes

In [446]:
X_new = StandardScaler().fit_transform(np.random.rand(n_samples*2,2))

We first infer the classes using the trained classifier

In [447]:
import time

t0 = time.time()
y_pred = clf.predict(X_new)
t1 = time.time()

classifier_t = t1-t0

plt.scatter(X_new[:, 0], X_new[:, 1], color=colors[y_pred].tolist(), s=5)
plt.text(.99, .01, ('%.5fs' % (classifier_t)).lstrip('0'),
                 transform=plt.gca().transAxes, size=15,
                 horizontalalignment='right')
plt.plot()
Out[447]:
[]

We then infer by reclustering the original and new data together

In [448]:
X_new = np.concatenate((X, X_new), axis=0)
t0 = time.time()
dbscan.fit(X_new)
t1 = time.time()

cluster_t = t1-t0

y_pred_new = dbscan.labels_.astype(int)
plt.scatter(X_new[:, 0], X_new[:, 1], color=colors[y_pred_new].tolist(), s=10)
plt.text(.99, .01, ('%.5fs' % (cluster_t)).lstrip('0'),
                 transform=plt.gca().transAxes, size=15,
                 horizontalalignment='right')
plt.plot()
Out[448]:
[]

Comparing the results of the classifier and clustering

In [453]:
from sklearn.metrics import accuracy_score
print "Classifier accuracy with respect to reclustering is:", accuracy_score(y_pred_new[-(n_samples*2):], y_pred)*100, "%"
print "Classifier speed with respect to reclustering is:", cluster_t/classifier_t,"x"
Classifier accuracy with respect to reclustering is: 97.69 %
Classifier speed with respect to reclustering is: 2.31713583166 x
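Note that the accuracy above assumes DBSCAN assigns the same integer ids to the same clusters on the second run, which is not guaranteed in general. A label-permutation-invariant comparison, such as the adjusted Rand index, sidesteps this (an illustrative addition):

from sklearn.metrics import adjusted_rand_score

# 1.0 means identical partitions, regardless of how the cluster ids are numbered
print(adjusted_rand_score(y_pred_new[-(n_samples*2):], y_pred))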