
How do I get the original indices of the data when using train_test_split()?

What I have is the following:

from sklearn.model_selection import train_test_split  # sklearn.cross_validation was removed in scikit-learn 0.20

import numpy as np

data = np.reshape(np.random.randn(20), (10, 2))  # 10 training examples

labels = np.random.randint(2, size=10) # 10 labels

x1, x2, y1, y2 = train_test_split(data, labels, test_size=0.2)

But this does not return the indices into the original data. One workaround is to attach the indices to the data (e.g. data = [(i, d) for i, d in enumerate(data)]), pass that through train_test_split, and then unpack it again. Is there a cleaner solution?

1 Answer


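You can use pandas DataFrames or Series, whose index is preserved by the split, but if you want to stay with NumPy you can pass an additional array of indices: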

from sklearn.model_selection import train_test_split

import numpy as np

n_samples, n_features, n_classes = 10, 2, 2

data = np.random.randn(n_samples, n_features)  # 10 training examples

labels = np.random.randint(n_classes, size=n_samples)  # 10 labels

indices = np.arange(n_samples)

x1, x2, y1, y2, idx1, idx2 = train_test_split(

    data, labels, indices, test_size=0.2)
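
As a quick sanity check that idx1 and idx2 really point back into the original arrays, here is a minimal sketch. The fixed seed and random_state=0 are assumptions added here only so the run is reproducible; they are not part of the snippet above.

```python
import numpy as np
from sklearn.model_selection import train_test_split

n_samples, n_features, n_classes = 10, 2, 2

# Fixed seed: an assumption for reproducibility, not required by the technique.
rng = np.random.RandomState(0)
data = rng.randn(n_samples, n_features)
labels = rng.randint(n_classes, size=n_samples)
indices = np.arange(n_samples)

# All three arrays are shuffled and split with the same permutation.
x1, x2, y1, y2, idx1, idx2 = train_test_split(
    data, labels, indices, test_size=0.2, random_state=0)

# Indexing the original arrays with the returned indices reproduces each split.
assert np.array_equal(data[idx1], x1)
assert np.array_equal(data[idx2], x2)
assert np.array_equal(labels[idx1], y1)
assert np.array_equal(labels[idx2], y2)
```

Because train_test_split applies one shuffle to every array it receives, the index array stays aligned with the data and labels without any manual bookkeeping.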

...
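
For the pandas route mentioned at the top of this answer, a minimal sketch could look like the following. The column names f0 and f1, the Series name, and the fixed seed are all hypothetical choices made for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Fixed seed: an assumption for reproducibility only.
rng = np.random.RandomState(0)

# Hypothetical column and series names; the default RangeIndex 0..9
# plays the role of the "original indices".
df = pd.DataFrame(rng.randn(10, 2), columns=["f0", "f1"])
labels = pd.Series(rng.randint(2, size=10), name="label")

x1, x2, y1, y2 = train_test_split(df, labels, test_size=0.2, random_state=0)

# The split pieces keep their original row labels, so the indices
# can be read straight off the results.
train_idx = x1.index.to_numpy()
test_idx = x2.index.to_numpy()
```

Here no separate index array is needed, since x1.index and x2.index already hold the original row positions.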