Restore original text from Keras’s IMDb dataset

I want to restore IMDb's original text from Keras’s IMDB dataset.

First, when I load Keras’s IMDB dataset, it returns each review as a sequence of word indices.

>>> from keras.datasets import imdb
>>> (X_train, y_train), (X_test, y_test) = imdb.load_data()
>>> X_train[0]

I found the imdb.get_word_index() method, which returns a word-to-index dictionary like {'create': 984, 'make': 94, …}. To convert back, I created an index-to-word dictionary.

>>> word_index = imdb.get_word_index()

>>> index_word = {v:k for k,v in word_index.items()}

Then, I tried to restore the original text like the following.

>>> ' '.join(index_word.get(w) for w in X_train[5])

"the effort still been that usually makes for of finished sucking ended cbc's an because before if just though something know novel female i i slowly lot of above freshened with connect in of script their that out end his deceptively i i"

I’m not good at English, but I can tell that this sentence is strange.

Why does this happen? How can I restore the original text?

1 Answer

I think you need more details about the parameters of imdb.load_data(). I will try to explain them as follows:

start_char: int. The start of each sequence is marked with this character. It defaults to 1, because 0 is usually the padding character.

oov_char: int. Words that were cut out because of the num_words or skip_top limit are replaced with this character. It defaults to 2.

index_from: int. Actual words are indexed with this index and higher. It defaults to 3.
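To make these parameters concrete, here is a small sketch of how they shape an encoded review. It does not call Keras at all: the word ranks are made up, and encode() is a hypothetical helper that only mimics the documented behavior, so the exact cutoff rule inside Keras may differ.

```python
# Toy word ranks (assumption: invented for illustration; real ranks
# come from imdb.get_word_index()).
toy_word_index = {"the": 1, "film": 2, "was": 3, "brilliant": 4}


def encode(words, word_index, start_char=1, oov_char=2,
           index_from=3, num_words=None):
    """Hypothetical helper: encode words roughly as load_data() describes."""
    encoded = [start_char]  # every sequence begins with the start marker
    for word in words:
        rank = word_index.get(word)
        # Real words are shifted so they start at index_from + 1.
        idx = None if rank is None else rank + index_from
        # Unknown words, or words beyond the vocabulary limit, become oov_char.
        if idx is None or (num_words is not None and idx >= num_words):
            encoded.append(oov_char)
        else:
            encoded.append(idx)
    return encoded


review = ["the", "film", "was", "brilliant"]
encoded = encode(review, toy_word_index)
limited = encode(review, toy_word_index, num_words=6)
print(encoded)  # [1, 4, 5, 6, 7]
print(limited)  # [1, 4, 5, 2, 2]
```

Note that "the" has rank 1 in the toy dictionary but appears as 4 in the sequence, which is exactly the shift introduced by index_from.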

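The word indices in the dictionary returned by get_word_index() start from 1, but the sequences returned by load_data() are shifted by index_from (3 by default).

In those sequences, indices 1 and 2 stand for <START> and <UNKNOWN>, and 0 is assumed to be used for <PADDING>. Looking up the shifted indices directly in the unshifted dictionary therefore returns the wrong words, which is why your restored text looks scrambled.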
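The mismatch can be reproduced without downloading the dataset. The sketch below uses a made-up word index and a hand-written sequence (both are assumptions for illustration) to compare the naive reverse lookup with one that accounts for the offset.

```python
# Toy data (assumption: invented ranks, not the real IMDB vocabulary).
toy_word_index = {"the": 1, "film": 2, "was": 3, "brilliant": 4,
                  "and": 5, "a": 6, "of": 7}
INDEX_FROM = 3

# "<START> the film was brilliant" as load_data() would encode it
# with start_char=1 and index_from=3.
sequence = [1, 4, 5, 6, 7]

# Naive reverse mapping: ignores the offset, so every word is wrong.
naive = {rank: word for word, rank in toy_word_index.items()}
naive_text = " ".join(naive.get(i, "?") for i in sequence)

# Offset-aware reverse mapping with the three reserved indices added.
shifted = {rank + INDEX_FROM: word for word, rank in toy_word_index.items()}
shifted.update({0: "<PAD>", 1: "<START>", 2: "<UNK>"})
fixed_text = " ".join(shifted.get(i, "<UNK>") for i in sequence)

print(naive_text)  # the brilliant and a of
print(fixed_text)  # <START> the film was brilliant
```

The naive version still produces real English words, just the wrong ones, which matches the symptom in the question.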

Code for solution:

import keras

NUM_WORDS=1000 # only use top 1000 words

INDEX_FROM=3   # word index offset

train,test = keras.datasets.imdb.load_data(num_words=NUM_WORDS, index_from=INDEX_FROM)

train_x,train_y = train

test_x,test_y = test

word_to_id = keras.datasets.imdb.get_word_index()

# shift every index by INDEX_FROM so it matches the sequences from load_data()
word_to_id = {k:(v+INDEX_FROM) for k,v in word_to_id.items()}

word_to_id["<PAD>"] = 0

word_to_id["<START>"] = 1

word_to_id["<UNK>"] = 2

id_to_word = {value:key for key,value in word_to_id.items()}

print(' '.join(id_to_word[i] for i in train_x[0]))  # `i` avoids shadowing the built-in id()


"<START> this film was just brilliant casting <UNK> <UNK> story

 direction <UNK> really <UNK> the part they played and you could just

 imagine being there robert <UNK> is an amazing actor ..."

Hope this answer helps.
