
I'm building a model that converts a string to another string using recurrent layers (GRUs). I have tried both a Dense and a TimeDistributed(Dense) layer as the last-but-one layer, but I don't understand the difference between the two when using return_sequences=True, especially as they seem to have the same number of parameters.

My simplified model is the following:

InputSize = 15
MaxLen = 64
HiddenSize = 16

inputs = keras.layers.Input(shape=(MaxLen, InputSize))
x = keras.layers.GRU(HiddenSize, return_sequences=True)(inputs)
x = keras.layers.TimeDistributed(keras.layers.Dense(InputSize))(x)
predictions = keras.layers.Activation('softmax')(x)

The summary of the network is:

_________________________________________________________________
Layer (type)                 Output Shape              Param #
=================================================================
input_1 (InputLayer)         (None, 64, 15)            0
_________________________________________________________________
gru_1 (GRU)                  (None, 64, 16)            1536
_________________________________________________________________
time_distributed_1 (TimeDist (None, 64, 15)            255
_________________________________________________________________
activation_1 (Activation)    (None, 64, 15)            0
=================================================================

This makes sense to me as my understanding of TimeDistributed is that it applies the same layer at all timepoints, and so the Dense layer has 16*15+15=255 parameters (weights+biases).
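As a sanity check, both counts can be reproduced from the standard layer formulas. A minimal sketch (the GRU figure assumes the older reset_after=False convention, which matches the 1536 shown above):

InputSize, HiddenSize = 15, 16

# Dense: one weight per (input, output) pair, plus one bias per output unit.
dense_params = HiddenSize * InputSize + InputSize
print(dense_params)  # 255

# GRU: three gates, each with an input kernel, a recurrent kernel, and a bias.
gru_params = 3 * (InputSize * HiddenSize + HiddenSize * HiddenSize + HiddenSize)
print(gru_params)    # 1536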

However, if I switch to a simple Dense layer:

inputs = keras.layers.Input(shape=(MaxLen, InputSize))
x = keras.layers.GRU(HiddenSize, return_sequences=True)(inputs)
x = keras.layers.Dense(InputSize)(x)
predictions = keras.layers.Activation('softmax')(x)

I still only have 255 parameters:

_________________________________________________________________
Layer (type)                 Output Shape              Param #
=================================================================
input_1 (InputLayer)         (None, 64, 15)            0
_________________________________________________________________
gru_1 (GRU)                  (None, 64, 16)            1536
_________________________________________________________________
dense_1 (Dense)              (None, 64, 15)            255
_________________________________________________________________
activation_1 (Activation)    (None, 64, 15)            0
=================================================================

I wonder if this is because Dense() will only use the last dimension in the shape, and effectively treat everything else as a batch-like dimension. But then I'm no longer sure what the difference is between Dense and TimeDistributed(Dense).
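One way to check this equivalence empirically is to copy the weights of a plain Dense layer into a TimeDistributed(Dense) layer and compare outputs. A minimal sketch (assuming a working Keras installation; the shapes mirror the GRU output above):

import numpy as np
from keras.models import Model
from keras.layers import Input, Dense, TimeDistributed

inputs = Input(shape=(64, 16))  # e.g. the GRU output: (timesteps, HiddenSize)
plain = Model(inputs, Dense(15)(inputs))
wrapped = Model(inputs, TimeDistributed(Dense(15))(inputs))
wrapped.set_weights(plain.get_weights())  # both hold one 16x15 kernel + 15 biases

x = np.random.random((2, 64, 16))
print(np.allclose(plain.predict(x), wrapped.predict(x)))  # True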

1 Answer


TimeDistributed(Dense) applies the same Dense layer to every time step during the unrolling of the GRU/LSTM cells, so the error function is computed between the predicted label sequence and the actual label sequence.

With return_sequences=False, the Dense layer is applied only once, to the output of the last cell. This is usually the case when RNNs are used for classification problems. With return_sequences=True, the Dense layer is applied to every timestep, just like TimeDistributed(Dense).

So your two models are the same. However, if you change your second model to return_sequences=False, the Dense layer will be applied only at the last cell. Try changing it: fitting will then throw an error, because Y would need to be of size [Batch_size, OutputSize]; it is no longer a sequence-to-sequence problem but a full sequence-to-label problem.

from keras.models import Sequential
from keras.layers import Dense, Activation, TimeDistributed, GRU
import numpy as np

InputSize = 15
MaxLen = 64
HiddenSize = 16
OutputSize = 8
n_samples = 1000

# model1: sequence-to-sequence, Dense wrapped in TimeDistributed
model1 = Sequential()
model1.add(GRU(HiddenSize, return_sequences=True, input_shape=(MaxLen, InputSize)))
model1.add(TimeDistributed(Dense(OutputSize)))
model1.add(Activation('softmax'))
model1.compile(loss='categorical_crossentropy', optimizer='rmsprop')

# model2: sequence-to-sequence, plain Dense (equivalent to model1)
model2 = Sequential()
model2.add(GRU(HiddenSize, return_sequences=True, input_shape=(MaxLen, InputSize)))
model2.add(Dense(OutputSize))
model2.add(Activation('softmax'))
model2.compile(loss='categorical_crossentropy', optimizer='rmsprop')

# model3: sequence-to-label, Dense applied to the last GRU output only
model3 = Sequential()
model3.add(GRU(HiddenSize, return_sequences=False, input_shape=(MaxLen, InputSize)))
model3.add(Dense(OutputSize))
model3.add(Activation('softmax'))
model3.compile(loss='categorical_crossentropy', optimizer='rmsprop')

X = np.random.random([n_samples, MaxLen, InputSize])
Y1 = np.random.random([n_samples, MaxLen, OutputSize])  # sequence targets
Y2 = np.random.random([n_samples, OutputSize])          # single-label targets

model1.fit(X, Y1, batch_size=128, epochs=1)
model2.fit(X, Y1, batch_size=128, epochs=1)
model3.fit(X, Y2, batch_size=128, epochs=1)

model1.summary()
model2.summary()
model3.summary()

In the above code example, the architectures of model1 and model2 are simple sequence-to-sequence models, while model3 is a full sequence-to-label model.
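
To see the shape mismatch described above, try fitting the sequence-to-label model on the sequence targets. A quick sketch reusing the variables from the code block (the exact message varies by Keras version, but it is a ValueError about the target shape):

# model3 expects targets of shape (n_samples, OutputSize); the sequence
# targets Y1 carry an extra time dimension, so fitting fails.
try:
    model3.fit(X, Y1, batch_size=128, epochs=1)
except ValueError as e:
    print(e)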


Hope this answer helps you!            
