TimeDistributed(Dense) vs Dense in Keras - Same number of parameters

Question

asked Jul 19, 2019 in Machine Learning by ParasSharma1 (19k points)

I'm building a model that converts a string to another string using recurrent layers (GRUs). I have tried both a Dense and a TimeDistributed(Dense) layer as the last-but-one layer, but I don't understand the difference between the two when using return_sequences=True, especially as they seem to have the same number of parameters.

My simplified model is the following:

InputSize = 15
MaxLen = 64
HiddenSize = 16

inputs = keras.layers.Input(shape=(MaxLen, InputSize))
x = keras.layers.recurrent.GRU(HiddenSize, return_sequences=True)(inputs)
x = keras.layers.TimeDistributed(keras.layers.Dense(InputSize))(x)
predictions = keras.layers.Activation('softmax')(x)

The summary of the network is:

_________________________________________________________________

Layer (type) Output Shape Param #

=================================================================

input_1 (InputLayer) (None, 64, 15) 0

_________________________________________________________________

gru_1 (GRU) (None, 64, 16) 1536

_________________________________________________________________

time_distributed_1 (TimeDist (None, 64, 15) 255

_________________________________________________________________

activation_1 (Activation) (None, 64, 15) 0

=================================================================

This makes sense to me as my understanding of TimeDistributed is that it applies the same layer at all timepoints, and so the Dense layer has 16*15+15=255 parameters (weights+biases).

However, if I switch to a simple Dense layer:

inputs = keras.layers.Input(shape=(MaxLen, InputSize))
x = keras.layers.recurrent.GRU(HiddenSize, return_sequences=True)(inputs)
x = keras.layers.Dense(InputSize)(x)
predictions = keras.layers.Activation('softmax')(x)
I still only have 255 parameters:

_________________________________________________________________

Layer (type) Output Shape Param #

=================================================================

input_1 (InputLayer) (None, 64, 15) 0

_________________________________________________________________

gru_1 (GRU) (None, 64, 16) 1536

_________________________________________________________________

dense_1 (Dense) (None, 64, 15) 255

_________________________________________________________________

activation_1 (Activation) (None, 64, 15) 0

=================================================================

I wonder if this is because Dense() will only use the last dimension in the shape, and effectively treat everything else as a batch-like dimension. But then I'm no longer sure what the difference is between Dense and TimeDistributed(Dense).

def build(self, input_shape):
    assert len(input_shape) >= 2
    input_dim = input_shape[-1]
    self.kernel = self.add_weight(shape=(input_dim, self.units),

It also uses keras.dot to apply the weights:

def call(self, inputs):
output = K.dot(inputs, self.kernel)

The docs of keras.dot implies that it works fine on n-dimensional tensors. I wonder if its exact behavior means that Dense() will in effect be called at every time step. If so, the question still remains what TimeDistributed() achieves in this case.

1 Answer

Anurag · Answer 1 · 2019-07-20T05:06:42+0000

Time Distributed Dense applies the same dense layer to every time step during GRU/LSTM Cell unrolling. That’s why the error function will be between the predicted label sequence and the actual label sequence.

Using return_sequences=False, the Dense layer will get applied only once in the last cell. This is normally the case when RNNs are used for classification problems.

If return_sequences=True, then the Dense layer is used to apply at every timestep just like TimeDistributedDense.

In your models both are the same, but if u change your second model to "return_sequences=False", then the Dense will be applied only at the last cell.

You should try changing it and the model will throw it as error because then the Y will be of size [Batch_size, InputSize], it is no more a sequence to sequence but a full sequence to label problem.

For example:

from keras.models import Sequential
from keras.layers import Dense, Activation, TimeDistributed
from keras.layers.recurrent import GRU
import numpy as np
InputSize = 15
MaxLen = 64
HiddenSize = 16
OutputSize = 8
n_samples = 1000
model1 = Sequential()
model1.add(GRU(HiddenSize, return_sequences=True, input_shape=(MaxLen, InputSize)))
model1.add(TimeDistributed(Dense(OutputSize)))
model1.add(Activation('softmax'))
model1.compile(loss='categorical_crossentropy', optimizer='rmsprop')

model2 = Sequential()
model2.add(GRU(HiddenSize, return_sequences=True, input_shape=(MaxLen, InputSize)))
model2.add(Dense(OutputSize))
model2.add(Activation('softmax'))
model2.compile(loss='categorical_crossentropy', optimizer='rmsprop')
model3 = Sequential()
model3.add(GRU(HiddenSize, return_sequences=False, input_shape=(MaxLen, InputSize)))
model3.add(Dense(OutputSize))
model3.add(Activation('softmax'))
model3.compile(loss='categorical_crossentropy', optimizer='rmsprop')
X = np.random.random([n_samples,MaxLen,InputSize])
Y1 = np.random.random([n_samples,MaxLen,OutputSize])
Y2 = np.random.random([n_samples, OutputSize])
model1.fit(X, Y1, batch_size=128, nb_epoch=1)
model2.fit(X, Y1, batch_size=128, nb_epoch=1)
model3.fit(X, Y2, batch_size=128, nb_epoch=1)
print(model1.summary())
print(model2.summary())
print(model3.summary())

Study Artificial Intelligence and Machine Learning for more details on Keras. One can also get certified in Machine Learning Certification.

You can see that the architecture of model1 and model 2 is sequence to sequence, but model 3 is a complete sequence to label model.

TimeDistributed(Dense) vs Dense in Keras - Same number of parameters

1 Answer

Related questions

Browse Categories