
I am working on a sequence labeling problem with imbalanced classes, and I would like to use sample_weight to address the imbalance. If I train the model for about 10 epochs, I get great results. If I train for more epochs, val_loss keeps dropping, but the results get worse. I'm guessing the model just predicts more of the dominant class to the detriment of the smaller classes.

The model has two inputs, one for word embeddings and one for character embeddings, and the output for each word is one of 7 possible classes, numbered 0 to 6.

With the padding, the shape of my input layer for word embeddings is (3000, 150) and the input layer for character embeddings is (3000, 150, 15). I use a 0.3 split for testing and training data, which means X_train is (2000, 150) for word embeddings and (2000, 150, 15) for char embeddings. y contains the correct class for each word, encoded as a one-hot vector of dimension 7, so its shape is (3000, 150, 7). y is likewise split into a training and a testing set. Each input is then fed into a Bidirectional LSTM.

The output is a matrix with one of the 7 categories assigned for each word of the 2000 training samples, so the size is (2000, 150, 7).

At first, I simply tried to define sample_weight as an np.array of length 7 containing the weights for each class:

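import numpy as np
from collections import Counter

# Class index of every word, recovered from its one-hot vector
count = [list(array).index(1) for arrays in y for array in arrays]
count = dict(Counter(count))
count[0] = 0  # class 0 is given zero weight
total = sum(count[key] for key in count)
# Relative frequency of each class
count = {key: count[key] / total for key in count}
category_weights = np.zeros(7)
for f in count:
    category_weights[f] = count[f]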

But I get the following error: ValueError: Found a sample_weight array with shape (7,) for an input with shape (2000, 150, 7). sample_weight cannot be broadcast.

Looking at the docs, it looks like I should instead be passing a 2D array with shape (samples, sequence_length). So I create a (3000, 150) array with a concatenation of the weights of every word of each sequence:

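# frequency: per-class weight lookup (presumably the values computed above)
weights = []
for sample in y:
    current_weight = []
    for line in sample:
        # Look up the weight of this word's class
        current_weight.append(frequency[list(line).index(1)])
    weights.append(current_weight)
weights = np.array(weights)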

and pass that to the fit function through the sample_weight parameter after having added the sample_weight_mode="temporal" option in compile().

I first got an error telling me the dimension was wrong. After generating the weights for only the training samples, however, I end up with a (2000, 150) array that I can use to fit my model.
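The nested loops above can also be collapsed into one NumPy indexing step. This is only a minimal sketch on toy data: the small shapes, the random labels, the values in class_weights and the train_idx split are all illustrative assumptions, not the real dataset.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples, seq_len, n_classes = 6, 5, 7

# Illustrative stand-in for the one-hot labels described in the question.
labels = rng.integers(0, n_classes, size=(n_samples, seq_len))
y = np.eye(n_classes)[labels]                # shape (6, 5, 7)

# Hypothetical per-class weights; index i holds the weight of class i.
class_weights = np.array([0.0, 1.0, 2.0, 2.0, 5.0, 5.0, 10.0])

# argmax recovers the class index of every word; integer-array indexing
# then looks up the weight for each of them in one go.
weights = class_weights[y.argmax(axis=-1)]   # shape (6, 5)

# The weights have to be split with the same indices as X and y, so that
# the array passed to fit() matches the training targets row for row.
train_idx = np.arange(4)
weights_train = weights[train_idx]           # shape (4, 5)
```

The resulting array has one weight per word, which is the (samples, sequence_length) layout that sample_weight_mode="temporal" expects.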

Is this a proper way to define sample_weight, or am I doing it all wrong? I can't say I've noticed any improvement from adding the weights, so I must have missed something.

1 Answer


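In general, class_weight is the usual way to balance classes during training.

You pass it a dictionary that maps each of your 7 class indices to a weight.

If you want to give each sample (or, with sample_weight_mode="temporal", each timestep) its own weight, use sample_weight instead. Keep in mind that Keras versions that use sample_weight_mode reject class_weight for targets with 3 or more dimensions, such as your (2000, 150, 7) y, so for per-word labels the temporal sample_weight array you built is the option that applies.

Also, you cannot combine the two, because sample_weight overrides class_weight.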
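One thing worth checking in the question's code: the weights computed there are class frequencies, so the dominant class receives the largest weight, which would explain seeing no improvement. Balancing usually calls for the inverse. Below is a rough sketch, not a definitive recipe. It assumes class 0 is padding (the question zeroes its count) and uses the common n_samples / (n_classes * count) heuristic; the helper name inverse_frequency_weights and the toy labels are made up for illustration.

```python
import numpy as np

def inverse_frequency_weights(y_onehot, pad_class=0):
    """Per-class weights inversely proportional to class frequency.

    Hypothetical helper: the padding class and any class that never
    occurs get weight 0.
    """
    n_classes = y_onehot.shape[-1]
    labels = y_onehot.argmax(axis=-1).ravel()
    counts = np.bincount(labels, minlength=n_classes).astype(float)
    counts[pad_class] = 0.0                  # ignore padding positions
    present = counts > 0
    weights = np.zeros(n_classes)
    # total / (number of present classes * count): rare classes weigh more
    weights[present] = counts.sum() / (present.sum() * counts[present])
    return weights

# Toy labels: class 1 dominates, classes 2 and 3 are rare, 0 is padding.
labels = np.array([[1, 1, 1, 1, 2, 0],
                   [1, 1, 1, 3, 0, 0]])
y = np.eye(7)[labels]                        # one-hot, shape (2, 6, 7)

class_weights = inverse_frequency_weights(y)
# Per-word weights with shape (samples, sequence_length), the layout
# expected when compiling with sample_weight_mode="temporal".
sample_weights = class_weights[y.argmax(axis=-1)]
```

Computed on the training labels only, an array like sample_weights is what would go to fit() through the sample_weight parameter.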
