I am training a simple Keras model for an NLP classification task with the code below. The variable names for the training, validation, and test sets are self-explanatory. The dataset has 19 classes, so the final layer of the network has 19 outputs, and the labels are one-hot encoded.

    from keras.models import Sequential
    from keras.layers import Embedding, LSTM, Dropout, BatchNormalization, Dense
    from keras.utils import np_utils

    nb_classes = 19

    # Frozen pre-trained embeddings -> LSTM -> dense layer -> 19-way output
    model1 = Sequential()
    model1.add(Embedding(nb_words,
                         EMBEDDING_DIM,
                         weights=[embedding_matrix],
                         input_length=MAX_SEQUENCE_LENGTH,
                         trainable=False))
    model1.add(LSTM(num_lstm, dropout=rate_drop_lstm, recurrent_dropout=rate_drop_lstm))
    model1.add(Dropout(rate_drop_dense))
    model1.add(BatchNormalization())
    model1.add(Dense(num_dense, activation=act))
    model1.add(Dropout(rate_drop_dense))
    model1.add(BatchNormalization())
    model1.add(Dense(nb_classes, activation='sigmoid'))

    model1.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

    # One-hot encode all labels
    ytrain_enc = np_utils.to_categorical(train_labels)
    yval_enc = np_utils.to_categorical(val_labels)
    ytestenc = np_utils.to_categorical(test_labels)

    model1.fit(train_data, ytrain_enc,
               validation_data=(val_data, yval_enc),
               epochs=200,
               batch_size=384,
               shuffle=True,
               verbose=1)

After the first epoch, training prints the following output:

    Epoch 1/200
    216632/216632 [==============================] - 2442s - loss: 0.1427 - acc: 0.9443 - val_loss: 0.0526 - val_acc: 0.9826

Evaluating the model on the test set also reports an accuracy of about 0.98:

    model1.evaluate(test_data, y=ytestenc, batch_size=384, verbose=1)

Because the labels are one-hot encoded, I need a vector of predicted class indices so that I can generate a confusion matrix and similar reports. I compute it and check the accuracy by hand:

    PREDICTED_CLASSES = model1.predict_classes(test_data, batch_size=384, verbose=1)
    temp = sum(test_labels == PREDICTED_CLASSES)
    temp/len(test_labels)

    0.83

So the predicted classes are only 83% accurate, yet model1.evaluate reports 98% accuracy. What am I doing wrong here? Is binary_crossentropy an appropriate loss for one-hot encoded categorical labels? Is sigmoid an appropriate activation for the prediction layer? Or does Keras compute accuracy differently when it evaluates a model? Please suggest what could be wrong. This is my first attempt at building a deep model, so I don't have a good sense of where the problem lies.
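A gap of this kind can be reproduced without the model in a small NumPy sketch. It rests on an assumption that the output above does not confirm by itself: that `metrics=['accuracy']` combined with `binary_crossentropy` is scored per output unit (each of the 19 sigmoid outputs thresholded at 0.5 and compared with the one-hot label), whereas the manual check compares one predicted class per sample. The helper names `binary_accuracy` and `categorical_accuracy` and the toy predictions are made up for illustration and are not taken from the real run.

```python
import numpy as np

def binary_accuracy(y_true_onehot, y_prob, threshold=0.5):
    """Fraction of individual output units whose thresholded value matches the label."""
    y_hat = (y_prob > threshold).astype(int)
    return float(np.mean(y_hat == y_true_onehot))

def categorical_accuracy(y_true_onehot, y_prob):
    """Fraction of samples whose highest-scoring class is the labelled class."""
    predicted = np.argmax(y_prob, axis=1)
    actual = np.argmax(y_true_onehot, axis=1)
    return float(np.mean(predicted == actual))

nb_classes = 19

# Hypothetical toy data: four samples, the last one is misclassified.
true_labels = np.array([0, 1, 2, 3])
predicted_labels = np.array([0, 1, 2, 5])

y_true = np.eye(nb_classes, dtype=int)[true_labels]   # one-hot labels

# Fake sigmoid outputs: every unit low, the predicted class high.
y_prob = np.full((len(true_labels), nb_classes), 0.01)
y_prob[np.arange(len(true_labels)), predicted_labels] = 0.9

print("per-unit (binary) accuracy:", binary_accuracy(y_true, y_prob))
print("per-sample (class) accuracy:", categorical_accuracy(y_true, y_prob))
```

In this toy case one of four samples is wrong, so the per-sample accuracy is 0.75, while the per-unit accuracy is 74/76 (about 0.97) because the many correctly "off" units dominate the average. If the assumption holds for the real model, the 0.98 and 0.83 figures would be measuring two different things rather than contradicting each other.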