
I've tried to use SMOTETomek to correct very imbalanced classes.

My data set has the following class counts, for example:

In [123]:
data['CON_CHURN_TOTAL'].value_counts()

Out[123]:
0    100
1     10
Name: CON_CHURN_TOTAL, dtype: int64

I wanted to use SMOTETomek to under-sample class 0 and over-sample class 1 to reach a ratio of 80:20. However, I cannot find the correct way to specify the dictionary. Of course, in the full code the 80:20 counts will be calculated from the number of rows.

When I try:

from imblearn.combine import SMOTETomek

smt = SMOTETomek(ratio={1:20, 0:80})

I get this error:

ValueError: With over-sampling methods, the number of samples in a class should be greater or equal to the original number of samples. Originally, there is 100 samples and 80 samples are asked.

But this method is supposed to do both under- and over-sampling at the same time.

Unfortunately, the documentation is currently unavailable (404 error).

1 Answer


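SMOTETomek over-samples with SMOTE and then only removes Tomek links as a cleaning step. As far as I can tell, the dictionary you pass is checked as an over-sampling target, so it cannot ask for fewer samples than a class already has; that is why you get the ValueError. If you also want to under-sample a class down to a fixed count, you could pipeline two samplers: an over-sampler for the minority class and an under-sampler for the majority class.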

Refer to the code below:

from sklearn.datasets import load_breast_cancer
import pandas as pd
from imblearn.pipeline import make_pipeline
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import NearMiss

# Example data: 212 samples of class 0 and 357 samples of class 1
data = load_breast_cancer()
X = pd.DataFrame(data=data.data, columns=data.feature_names)

# Target number of samples per class after resampling
count_class_0 = 300
count_class_1 = 300

pipe = make_pipeline(
    SMOTE(sampling_strategy={0: count_class_0}),     # over-sample class 0 up to 300
    NearMiss(sampling_strategy={1: count_class_1}),  # under-sample class 1 down to 300
)

X_smt, y_smt = pipe.fit_resample(X, data.target)
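For the 100 : 10 data in the question, the same idea could be wrapped in a small helper that decides which classes go to the over-sampler and which go to the under-sampler. This is only a sketch: `split_targets` and `resample_to_counts` are hypothetical names of my own, not part of imbalanced-learn, and the choice of SMOTE plus NearMiss is just carried over from the pipeline above.

```python
from collections import Counter


def split_targets(y, desired):
    """Split desired per-class counts into over- and under-sampling dicts."""
    current = Counter(y)
    # Classes that must grow go to the over-sampler.
    over = {c: n for c, n in desired.items() if n > current[c]}
    # Classes that must shrink go to the under-sampler.
    under = {c: n for c, n in desired.items() if n < current[c]}
    return over, under


def resample_to_counts(X, y, desired):
    """Hypothetical helper: over-sample, then under-sample, to reach `desired`."""
    # Imported lazily so split_targets stays usable without imbalanced-learn.
    from imblearn.over_sampling import SMOTE
    from imblearn.pipeline import make_pipeline
    from imblearn.under_sampling import NearMiss

    over, under = split_targets(y, desired)
    steps = []
    if over:
        steps.append(SMOTE(sampling_strategy=over))
    if under:
        steps.append(NearMiss(sampling_strategy=under))
    if not steps:
        # Every class already has the requested count.
        return X, y
    return make_pipeline(*steps).fit_resample(X, y)


# Class counts from the question: 100 rows of class 0, 10 rows of class 1.
y = [0] * 100 + [1] * 10
over, under = split_targets(y, {0: 80, 1: 20})
print(over)   # {1: 20}
print(under)  # {0: 80}
```

With your real data you would call something like resample_to_counts(X, data['CON_CHURN_TOTAL'], {0: 80, 1: 20}), with the two counts computed from the number of rows as you described.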
