in Machine Learning by (19k points)

I'm building a model that converts a string to another string using recurrent layers (GRUs). I have tried both a Dense layer and a TimeDistributed(Dense) layer as the last-but-one layer, but I don't understand the difference between the two when using return_sequences=True, especially since they seem to have the same number of parameters.

My simplified model is the following:

InputSize = 15
MaxLen = 64
HiddenSize = 16

inputs = keras.layers.Input(shape=(MaxLen, InputSize))
x = keras.layers.GRU(HiddenSize, return_sequences=True)(inputs)
x = keras.layers.TimeDistributed(keras.layers.Dense(InputSize))(x)
predictions = keras.layers.Activation('softmax')(x)

The summary of the network is:

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
input_1 (InputLayer)         (None, 64, 15)            0         
_________________________________________________________________
gru_1 (GRU)                  (None, 64, 16)            1536      
_________________________________________________________________
time_distributed_1 (TimeDist (None, 64, 15)            255       
_________________________________________________________________
activation_1 (Activation)    (None, 64, 15)            0         
=================================================================

This makes sense to me: my understanding of TimeDistributed is that it applies the same layer at every timestep, so the Dense layer has 16*15+15=255 parameters (weights + biases).
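As a sanity check, both counts can be reproduced by hand (assuming the single-bias GRU parameter formula that matches the 1536 figure in the summary; newer Keras versions with reset_after=True count the GRU biases differently):

InputSize, HiddenSize = 15, 16
# GRU: 3 gates, each with input weights, recurrent weights, and a bias
gru_params = 3 * (InputSize * HiddenSize + HiddenSize * HiddenSize + HiddenSize)
# Dense: one weight per (input, output) pair, plus one bias per output
dense_params = HiddenSize * InputSize + InputSize
print(gru_params)    # 1536
print(dense_params)  # 255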

However, if I switch to a simple Dense layer:

inputs = keras.layers.Input(shape=(MaxLen, InputSize))
x = keras.layers.GRU(HiddenSize, return_sequences=True)(inputs)
x = keras.layers.Dense(InputSize)(x)
predictions = keras.layers.Activation('softmax')(x)

I still only have 255 parameters:

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
input_1 (InputLayer)         (None, 64, 15)            0         
_________________________________________________________________
gru_1 (GRU)                  (None, 64, 16)            1536      
_________________________________________________________________
dense_1 (Dense)              (None, 64, 15)            255       
_________________________________________________________________
activation_1 (Activation)    (None, 64, 15)            0         
=================================================================

I wonder if this is because Dense() will only use the last dimension in the shape, and effectively treat everything else as a batch-like dimension. But then I'm no longer sure what the difference is between Dense and TimeDistributed(Dense).
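A quick numpy sketch of what I suspect is going on, with the kernel contracting only the last axis (W and b here are just stand-ins for the layer's weights, not taken from a real model):

import numpy as np

x = np.random.random((2, 64, 16))  # (batch, timesteps, features), as output by the GRU
W = np.random.random((16, 15))     # Dense kernel
b = np.random.random((15,))        # Dense bias
y = np.dot(x, W) + b               # contracts only the last axis
print(y.shape)                     # (2, 64, 15) -- time acts like a batch dimension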

1 Answer

by (33.1k points)

TimeDistributed(Dense) applies the same Dense layer to every timestep during GRU/LSTM cell unrolling, so the loss is computed between the predicted label sequence and the actual label sequence.

With return_sequences=False, the Dense layer is applied only once, to the output of the last cell. This is the usual setup when RNNs are used for classification problems. With return_sequences=True, the Dense layer is applied to every timestep, just like TimeDistributed(Dense).

So both of your models are the same. However, if you change your second model to return_sequences=False, the Dense layer will be applied only to the last cell's output. Try it and the fit will throw an error, because Y would then have to be of size [batch_size, InputSize]: the task is no longer sequence-to-sequence but sequence-to-label.

from keras.models import Sequential
from keras.layers import Dense, Activation, TimeDistributed, GRU
import numpy as np

InputSize = 15
MaxLen = 64
HiddenSize = 16
OutputSize = 8
n_samples = 1000

# Sequence-to-sequence: Dense wrapped in TimeDistributed
model1 = Sequential()
model1.add(GRU(HiddenSize, return_sequences=True, input_shape=(MaxLen, InputSize)))
model1.add(TimeDistributed(Dense(OutputSize)))
model1.add(Activation('softmax'))
model1.compile(loss='categorical_crossentropy', optimizer='rmsprop')

# Sequence-to-sequence: plain Dense, applied to the last axis of the 3D output
model2 = Sequential()
model2.add(GRU(HiddenSize, return_sequences=True, input_shape=(MaxLen, InputSize)))
model2.add(Dense(OutputSize))
model2.add(Activation('softmax'))
model2.compile(loss='categorical_crossentropy', optimizer='rmsprop')

# Sequence-to-label: Dense applied once, to the last cell's output
model3 = Sequential()
model3.add(GRU(HiddenSize, return_sequences=False, input_shape=(MaxLen, InputSize)))
model3.add(Dense(OutputSize))
model3.add(Activation('softmax'))
model3.compile(loss='categorical_crossentropy', optimizer='rmsprop')

X = np.random.random([n_samples, MaxLen, InputSize])
Y1 = np.random.random([n_samples, MaxLen, OutputSize])  # sequence targets
Y2 = np.random.random([n_samples, OutputSize])          # single-label targets

model1.fit(X, Y1, batch_size=128, epochs=1)
model2.fit(X, Y1, batch_size=128, epochs=1)
model3.fit(X, Y2, batch_size=128, epochs=1)

model1.summary()
model2.summary()
model3.summary()

In the code example above, model1 and model2 are sequence-to-sequence models, while model3 is a sequence-to-label model.
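As a quick check (assuming Keras exposes the compiled shapes via model.output_shape), the output shapes line up with the two kinds of targets:

print(model1.output_shape)  # (None, 64, 8)
print(model2.output_shape)  # (None, 64, 8)
print(model3.output_shape)  # (None, 8)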


Hope this answer helps you!            
