Restore original text from Keras’s imdb dataset

Question

asked Jul 6, 2019 in Machine Learning by Anurag (33.1k points)

I want to restore the original review text from Keras’s IMDB dataset.

First, when I load Keras’s IMDB dataset, it returns sequences of word indices.

>>> from keras.datasets import imdb
>>> (X_train, y_train), (X_test, y_test) = imdb.load_data()
>>> X_train[0]

I found the imdb.get_word_index() method, which returns a word-to-index dictionary like {'create': 984, 'make': 94, …}. To convert back, I created an index-to-word dictionary.

>>> word_index = imdb.get_word_index()
>>> index_word = {v:k for k,v in word_index.items()}

Then I tried to restore the original text as follows.

>>> ' '.join(index_word.get(w) for w in X_train[5])
"the effort still been that usually makes for of finished sucking ended cbc's an because before if just though something know novel female i i slowly lot of above freshened with connect in of script their that out end his deceptively i i"

My English is not good, but I can tell that this sentence is strange.

Why does this happen? How can I restore the original text?

1 Answer

Anurag · Answer 1 · 2019-07-08T05:38:51+0000

I think you need more details about the parameters that load_data uses. I will try to explain them as follows:

start_char: int. The start of each sequence is marked with this character. It is set to 1 because 0 is usually the padding character.

oov_char: int. Words that were cut out because of the num_words or skip_top limit are replaced with this character.

index_from: int. Index actual words with this index and higher.

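It looks like the word indices in your dictionary start from 1.

However, the sequences returned by load_data reserve indices 1 and 2 for <START> and <UNKNOWN> (and assume you will use 0 for <PADDING>). Actual words are therefore shifted up by index_from, which is 3 by default, so the dictionary from get_word_index() has to be shifted by the same amount before you invert it.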
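To see why the unshifted lookup produces scrambled text, here is a small self-contained sketch that does not download anything. The toy word_index, its words, and the encode helper are made up for illustration; encode only mimics the shift described above and is not Keras's actual implementation.

```python
# Toy stand-in for imdb.get_word_index(): word -> rank, starting at 1.
# The words and ranks are invented for this example.
word_index = {"this": 1, "film": 2, "was": 3, "just": 4,
              "brilliant": 5, "a": 6, "the": 7, "of": 8}

START_CHAR = 1   # marks the start of a sequence
INDEX_FROM = 3   # actual words are indexed from here upwards


def encode(words):
    """Mimic the described shift: prepend start_char, add index_from to each rank."""
    return [START_CHAR] + [word_index[w] + INDEX_FROM for w in words]


encoded = encode(["this", "film", "was", "just", "brilliant"])

# Naive decoding: invert the raw dictionary without applying the offset.
naive = {rank: word for word, rank in word_index.items()}
naive_text = " ".join(naive.get(i, "?") for i in encoded)

# Shifted decoding: apply the same offset and add the reserved tokens.
shifted = {rank + INDEX_FROM: word for word, rank in word_index.items()}
shifted.update({0: "<PAD>", 1: "<START>", 2: "<UNK>"})
shifted_text = " ".join(shifted.get(i, "<UNK>") for i in encoded)

print(naive_text)    # every word is off by three ranks
print(shifted_text)  # the original sentence, prefixed by <START>
```

The naive lookup still returns real words, just the wrong ones, which is why the restored text in the question looks like English but makes no sense.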

Code for solution:

import keras

NUM_WORDS = 1000  # only use the top 1000 words
INDEX_FROM = 3    # word index offset

train, test = keras.datasets.imdb.load_data(num_words=NUM_WORDS, index_from=INDEX_FROM)
train_x, train_y = train
test_x, test_y = test

# get_word_index() starts at 1, so shift it to match the loaded sequences
word_to_id = keras.datasets.imdb.get_word_index()
word_to_id = {k: (v + INDEX_FROM) for k, v in word_to_id.items()}
word_to_id["<PAD>"] = 0
word_to_id["<START>"] = 1
word_to_id["<UNK>"] = 2

# invert the mapping and decode the first review
id_to_word = {value: key for key, value in word_to_id.items()}
print(' '.join(id_to_word[i] for i in train_x[0]))

Output:

"<START> this film was just brilliant casting <UNK> <UNK> story
direction <UNK> really <UNK> the part they played and you could just
imagine being there robert <UNK> is an amazing actor ..."
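If you want to decode many reviews, the same idea can be wrapped in two small helpers. This is only a sketch: build_id_to_word and decode_review are hypothetical names, not part of Keras, and the toy dictionary below stands in for the result of get_word_index(). Using .get with a fallback means an index that is missing from the table decodes to <UNK> instead of raising a KeyError.

```python
def build_id_to_word(word_index, index_from=3):
    """Invert a word -> rank dictionary, shifting ranks by index_from.

    Hypothetical helper; assumes ranks start at 1 and that 0, 1 and 2
    are reserved for padding, start and unknown tokens.
    """
    id_to_word = {rank + index_from: word for word, rank in word_index.items()}
    id_to_word[0] = "<PAD>"
    id_to_word[1] = "<START>"
    id_to_word[2] = "<UNK>"
    return id_to_word


def decode_review(sequence, id_to_word):
    """Turn a list of integer ids back into a space-separated string."""
    return " ".join(id_to_word.get(i, "<UNK>") for i in sequence)


# Toy data for illustration; with the real dataset you would pass the
# dictionary returned by get_word_index() and a sequence such as train_x[0].
toy_index = {"this": 1, "film": 2, "was": 3, "brilliant": 4}
table = build_id_to_word(toy_index)
decoded = decode_review([1, 4, 5, 6, 2, 7, 0, 99], table)
print(decoded)
```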

Hope this answer helps.
