
I'm building a model that converts a string to another string using recurrent layers (GRUs). I have tried both a Dense and a TimeDistributed(Dense) layer as the last-but-one layer, but I don't understand the difference between the two when using return_sequences=True, especially as they seem to have the same number of parameters.

My simplified model is the following:

InputSize = 15
MaxLen = 64
HiddenSize = 16

inputs = keras.layers.Input(shape=(MaxLen, InputSize))
x = keras.layers.GRU(HiddenSize, return_sequences=True)(inputs)
x = keras.layers.TimeDistributed(keras.layers.Dense(InputSize))(x)
predictions = keras.layers.Activation('softmax')(x)

The summary of the network is:

_________________________________________________________________
Layer (type)                 Output Shape              Param #
=================================================================
input_1 (InputLayer)         (None, 64, 15)            0
_________________________________________________________________
gru_1 (GRU)                  (None, 64, 16)            1536
_________________________________________________________________
time_distributed_1 (TimeDist (None, 64, 15)            255
_________________________________________________________________
activation_1 (Activation)    (None, 64, 15)            0
=================================================================

This makes sense to me as my understanding of TimeDistributed is that it applies the same layer at all timepoints, and so the Dense layer has 16*15+15=255 parameters (weights+biases).
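As a sanity check, both counts can be reproduced from the standard layer formulas. A minimal sketch (the GRU figure assumes the older reset_after=False convention, which matches the 1536 shown above):

InputSize, HiddenSize = 15, 16

# Dense: one weight per (input, output) pair, plus one bias per output unit.
dense_params = HiddenSize * InputSize + InputSize
print(dense_params)  # 255

# GRU: three gates, each with an input kernel, a recurrent kernel, and a bias.
gru_params = 3 * (InputSize * HiddenSize + HiddenSize * HiddenSize + HiddenSize)
print(gru_params)    # 1536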

However, if I switch to a simple Dense layer:

inputs = keras.layers.Input(shape=(MaxLen, InputSize))
x = keras.layers.GRU(HiddenSize, return_sequences=True)(inputs)
x = keras.layers.Dense(InputSize)(x)
predictions = keras.layers.Activation('softmax')(x)

I still only have 255 parameters:

_________________________________________________________________
Layer (type)                 Output Shape              Param #
=================================================================
input_1 (InputLayer)         (None, 64, 15)            0
_________________________________________________________________
gru_1 (GRU)                  (None, 64, 16)            1536
_________________________________________________________________
dense_1 (Dense)              (None, 64, 15)            255
_________________________________________________________________
activation_1 (Activation)    (None, 64, 15)            0
=================================================================

I wonder if this is because Dense() will only use the last dimension in the shape, and effectively treat everything else as a batch-like dimension. But then I'm no longer sure what the difference is between Dense and TimeDistributed(Dense).
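One way to check this equivalence empirically is to copy the weights of a plain Dense layer into a TimeDistributed(Dense) layer and compare outputs. A minimal sketch (assuming a working Keras installation; the shapes mirror the GRU output above):

import numpy as np
from keras.models import Model
from keras.layers import Input, Dense, TimeDistributed

inputs = Input(shape=(64, 16))  # e.g. the GRU output: (timesteps, HiddenSize)
plain = Model(inputs, Dense(15)(inputs))
wrapped = Model(inputs, TimeDistributed(Dense(15))(inputs))
wrapped.set_weights(plain.get_weights())  # both hold one 16x15 kernel + 15 biases

x = np.random.random((2, 64, 16))
print(np.allclose(plain.predict(x), wrapped.predict(x)))  # True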

1 Answer


TimeDistributed(Dense) applies the same Dense layer to every time step during the unrolling of the GRU/LSTM cells, so the error function is computed between the predicted label sequence and the actual label sequence.

With return_sequences=False, the Dense layer is applied only once, to the output of the last cell. This is usually the case when RNNs are used for classification problems. With return_sequences=True, the Dense layer is applied to every timestep, just like TimeDistributed(Dense).

So your two models are the same. However, if you change your second model to return_sequences=False, the Dense layer will be applied only at the last cell. Try changing it: fitting will then throw an error, because Y would need to be of size [Batch_size, OutputSize]; it is no longer a sequence-to-sequence problem but a full sequence-to-label problem.

from keras.models import Sequential
from keras.layers import Dense, Activation, TimeDistributed, GRU
import numpy as np

InputSize = 15
MaxLen = 64
HiddenSize = 16
OutputSize = 8
n_samples = 1000

# model1: sequence-to-sequence, Dense wrapped in TimeDistributed
model1 = Sequential()
model1.add(GRU(HiddenSize, return_sequences=True, input_shape=(MaxLen, InputSize)))
model1.add(TimeDistributed(Dense(OutputSize)))
model1.add(Activation('softmax'))
model1.compile(loss='categorical_crossentropy', optimizer='rmsprop')

# model2: sequence-to-sequence, plain Dense (equivalent to model1)
model2 = Sequential()
model2.add(GRU(HiddenSize, return_sequences=True, input_shape=(MaxLen, InputSize)))
model2.add(Dense(OutputSize))
model2.add(Activation('softmax'))
model2.compile(loss='categorical_crossentropy', optimizer='rmsprop')

# model3: sequence-to-label, Dense applied to the last GRU output only
model3 = Sequential()
model3.add(GRU(HiddenSize, return_sequences=False, input_shape=(MaxLen, InputSize)))
model3.add(Dense(OutputSize))
model3.add(Activation('softmax'))
model3.compile(loss='categorical_crossentropy', optimizer='rmsprop')

X = np.random.random([n_samples, MaxLen, InputSize])
Y1 = np.random.random([n_samples, MaxLen, OutputSize])  # sequence targets
Y2 = np.random.random([n_samples, OutputSize])          # single-label targets

model1.fit(X, Y1, batch_size=128, epochs=1)
model2.fit(X, Y1, batch_size=128, epochs=1)
model3.fit(X, Y2, batch_size=128, epochs=1)

model1.summary()
model2.summary()
model3.summary()

In the above code example, the architectures of model1 and model2 are simple sequence-to-sequence models, while model3 is a full sequence-to-label model.
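
To see the shape mismatch described above, try fitting the sequence-to-label model on the sequence targets. A quick sketch reusing the variables from the code block (the exact message varies by Keras version, but it is a ValueError about the target shape):

# model3 expects targets of shape (n_samples, OutputSize); the sequence
# targets Y1 carry an extra time dimension, so fitting fails.
try:
    model3.fit(X, Y1, batch_size=128, epochs=1)
except ValueError as e:
    print(e)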


Hope this answer helps you!            
