You can build **a simple generator** on top of your initial idea: an **LSTM network** wired to pre-trained **word2vec** embeddings and trained to predict the next word in a sentence.

**Gensim Word2Vec**

Your code syntax is fine, but you should increase the number of training iterations. The default `iter=5` is quite low for this task; at least 100 iterations gives the embeddings a much better chance to stabilize.

**For example:**

```python
import gensim

word_model = gensim.models.Word2Vec(sentences, size=100, min_count=1,
                                    window=5, iter=100)
pretrained_weights = word_model.wv.syn0
vocab_size, embedding_size = pretrained_weights.shape
print('Result embedding shape:', pretrained_weights.shape)

print('Checking similar words:')
for word in ['model', 'network', 'train', 'learn']:
    most_similar = ', '.join('%s (%.2f)' % (similar, dist)
                             for similar, dist in word_model.most_similar(word)[:8])
    print(' %s -> %s' % (word, most_similar))

def word2idx(word):
    return word_model.wv.vocab[word].index

def idx2word(idx):
    return word_model.wv.index2word[idx]
```

The resulting embedding matrix is saved in the `pretrained_weights` array, which has shape `(vocab_size, embedding_size)`.
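As a side note, the snippet above uses the gensim 3.x attribute names. If you are on gensim ≥ 4.0, the same objects are exposed under renamed attributes; a minimal sketch of the equivalents (untested against your exact version):

```python
# gensim >= 4.0 equivalents of the calls above
word_model = gensim.models.Word2Vec(sentences, vector_size=100, min_count=1,
                                    window=5, epochs=100)
pretrained_weights = word_model.wv.vectors        # was wv.syn0
vocab_size, embedding_size = pretrained_weights.shape

def word2idx(word):
    return word_model.wv.key_to_index[word]       # was wv.vocab[word].index

def idx2word(idx):
    return word_model.wv.index_to_key[idx]        # was wv.index2word[idx]
```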

**Keras model**

The loss function in your code seems invalid: predicting the next word is a classification task, so the loss should be **categorical_crossentropy** or **sparse_categorical_crossentropy**. I'd use the latter, because it avoids one-hot encoding of the labels, which is pretty expensive for a big vocabulary.

**For example:**

```python
from keras.models import Sequential
from keras.layers import Embedding, LSTM, Dense, Activation

model = Sequential()
model.add(Embedding(input_dim=vocab_size, output_dim=embedding_size,
                    weights=[pretrained_weights]))
model.add(LSTM(units=embedding_size))
model.add(Dense(units=vocab_size))
model.add(Activation('softmax'))
model.compile(optimizer='adam', loss='sparse_categorical_crossentropy')
```

**Data preparation**

If you use the `sparse_categorical_crossentropy` loss, both the sentences and the labels must be word indices. Short sentences must be padded with zeros to the common length.

```python
import numpy as np

# max_sentence_len: length of the longest sentence (shorter ones are zero-padded)
train_x = np.zeros([len(sentences), max_sentence_len], dtype=np.int32)
train_y = np.zeros([len(sentences)], dtype=np.int32)
for i, sentence in enumerate(sentences):
    for t, word in enumerate(sentence[:-1]):
        train_x[i, t] = word2idx(word)
    train_y[i] = word2idx(sentence[-1])
```
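With `train_x` and `train_y` prepared, training is the usual Keras `fit` call; a minimal sketch (the batch size and number of epochs below are arbitrary placeholders, tune them for your corpus):

```python
model.fit(train_x, train_y, batch_size=128, epochs=20)
```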

**Sample generation**

The trained model outputs a vector of probabilities, from which the next word is drawn and appended to the input. The generated text comes out better and more diverse if the next word is *sampled* from that distribution rather than picked as the argmax.

Here is an example of temperature-based random sampling:

```python
def sample(preds, temperature=1.0):
    if temperature <= 0:
        return np.argmax(preds)
    # rescale log-probabilities by temperature and renormalize
    preds = np.asarray(preds).astype('float64')
    preds = np.log(preds) / temperature
    exp_preds = np.exp(preds)
    preds = exp_preds / np.sum(exp_preds)
    # draw one sample from the resulting distribution
    probas = np.random.multinomial(1, preds, 1)
    return np.argmax(probas)

def generate_next(text, num_generated=10):
    word_idxs = [word2idx(word) for word in text.lower().split()]
    for i in range(num_generated):
        prediction = model.predict(x=np.array(word_idxs))
        idx = sample(prediction[-1], temperature=0.7)
        word_idxs.append(idx)
    return ' '.join(idx2word(idx) for idx in word_idxs)
```
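A usage sketch that produces output in the format shown below, using the seed phrases from the sample output (assuming all seed words are in the vocabulary):

```python
for seed in ['deep convolutional', 'simple and effective', 'a nonconvex', 'a']:
    print('%s... -> %s' % (seed, generate_next(seed)))
```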

**Output:**

```
deep convolutional... -> deep convolutional arithmetic initialization step unbiased effectiveness
simple and effective... -> simple and effective family of variables preventing compute automatically
a nonconvex... -> a nonconvex technique compared layer converges so independent onehidden markov
a... -> a function parameterization necessary both both intuitions with technique valpola utilizes
```

Hope this answer helps.