
I'm building a model that converts a string to another string using recurrent layers (GRUs). I have tried both a Dense and a TimeDistributed(Dense) layer as the last-but-one layer, but I don't understand the difference between the two when using return_sequences=True, especially as they seem to have the same number of parameters.

My simplified model is the following:

InputSize = 15
MaxLen = 64
HiddenSize = 16

inputs = keras.layers.Input(shape=(MaxLen, InputSize))
x = keras.layers.recurrent.GRU(HiddenSize, return_sequences=True)(inputs)
x = keras.layers.TimeDistributed(keras.layers.Dense(InputSize))(x)
predictions = keras.layers.Activation('softmax')(x)

The summary of the network is:

_________________________________________________________________
Layer (type)                 Output Shape              Param #
=================================================================
input_1 (InputLayer)         (None, 64, 15)            0
_________________________________________________________________
gru_1 (GRU)                  (None, 64, 16)            1536
_________________________________________________________________
time_distributed_1 (TimeDist (None, 64, 15)            255
_________________________________________________________________
activation_1 (Activation)    (None, 64, 15)            0
=================================================================

This makes sense to me as my understanding of TimeDistributed is that it applies the same layer at all timepoints, and so the Dense layer has 16*15+15=255 parameters (weights+biases).
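As a quick sanity check, both parameter counts can be reproduced by hand (assuming the classic GRU formulation used here, i.e. one bias vector per gate):

InputSize, HiddenSize = 15, 16

# Dense applied per timestep: one (16 x 15) kernel plus 15 biases, shared across all timesteps.
dense_params = HiddenSize * InputSize + InputSize
print(dense_params)   # 255

# GRU: 3 gates, each with input weights (15 x 16), recurrent weights (16 x 16) and 16 biases.
gru_params = 3 * (InputSize * HiddenSize + HiddenSize * HiddenSize + HiddenSize)
print(gru_params)     # 1536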

However, if I switch to a simple Dense layer:

inputs = keras.layers.Input(shape=(MaxLen, InputSize))
x = keras.layers.recurrent.GRU(HiddenSize, return_sequences=True)(inputs)
x = keras.layers.Dense(InputSize)(x)
predictions = keras.layers.Activation('softmax')(x)

I still only have 255 parameters:

_________________________________________________________________
Layer (type)                 Output Shape              Param #
=================================================================
input_1 (InputLayer)         (None, 64, 15)            0
_________________________________________________________________
gru_1 (GRU)                  (None, 64, 16)            1536
_________________________________________________________________
dense_1 (Dense)              (None, 64, 15)            255
_________________________________________________________________
activation_1 (Activation)    (None, 64, 15)            0
=================================================================

I wonder if this is because Dense() will only use the last dimension in the shape and effectively treat everything else as a batch-like dimension. But then I'm no longer sure what the difference is between Dense and TimeDistributed(Dense).

Looking at the source, Dense.build() only uses the last dimension of the input shape to create its kernel:

def build(self, input_shape):
    assert len(input_shape) >= 2
    input_dim = input_shape[-1]
    self.kernel = self.add_weight(shape=(input_dim, self.units),
                                  ...)

It also uses K.dot to apply the weights in call():

def call(self, inputs):
    output = K.dot(inputs, self.kernel)

The docs of K.dot imply that it works fine on n-dimensional tensors. I wonder if its exact behavior means that Dense() will, in effect, be applied at every time step. If so, the question remains: what does TimeDistributed() achieve in this case?
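That reading can be checked outside Keras: a dot product between a 3D tensor and a 2D kernel contracts over the last axis only, which is the same as applying the kernel separately at every timestep. A minimal NumPy sketch of that behaviour (NumPy standing in here for the Keras backend):

import numpy as np

batch, timesteps, in_dim, out_dim = 2, 64, 16, 15
x = np.random.random((batch, timesteps, in_dim))
kernel = np.random.random((in_dim, out_dim))

# np.dot with a 3D left operand contracts over its last axis,
# analogous to what K.dot does inside Dense.call().
y_all_at_once = np.dot(x, kernel)                    # shape (2, 64, 15)

# The same computation, applied explicitly one timestep at a time.
y_per_step = np.stack([x[:, t, :].dot(kernel) for t in range(timesteps)], axis=1)

print(np.allclose(y_all_at_once, y_per_step))        # True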

1 Answer


TimeDistributed(Dense) applies the same Dense layer to every time step during GRU/LSTM cell unrolling, so the loss is computed between the predicted label sequence and the actual label sequence.

With return_sequences=False, the Dense layer is applied only once, to the output of the last cell. This is normally the case when RNNs are used for classification problems.

With return_sequences=True, the Dense layer is applied at every timestep, just like TimeDistributed(Dense).

In your models the two are the same, but if you change your second model to return_sequences=False, the Dense layer will be applied only at the last cell.

If you try that change, fitting will throw an error, because Y would then have to be of size [Batch_size, InputSize]; it is no longer a sequence-to-sequence problem but a sequence-to-label problem.

For example:

from keras.models import Sequential
from keras.layers import Dense, Activation, TimeDistributed, GRU
import numpy as np

InputSize = 15
MaxLen = 64
HiddenSize = 16
OutputSize = 8
n_samples = 1000

# Sequence-to-sequence: TimeDistributed(Dense) applied at every timestep.
model1 = Sequential()
model1.add(GRU(HiddenSize, return_sequences=True, input_shape=(MaxLen, InputSize)))
model1.add(TimeDistributed(Dense(OutputSize)))
model1.add(Activation('softmax'))
model1.compile(loss='categorical_crossentropy', optimizer='rmsprop')

# Sequence-to-sequence: a plain Dense on the 3D GRU output behaves the same way.
model2 = Sequential()
model2.add(GRU(HiddenSize, return_sequences=True, input_shape=(MaxLen, InputSize)))
model2.add(Dense(OutputSize))
model2.add(Activation('softmax'))
model2.compile(loss='categorical_crossentropy', optimizer='rmsprop')

# Sequence-to-label: Dense applied only to the last GRU output.
model3 = Sequential()
model3.add(GRU(HiddenSize, return_sequences=False, input_shape=(MaxLen, InputSize)))
model3.add(Dense(OutputSize))
model3.add(Activation('softmax'))
model3.compile(loss='categorical_crossentropy', optimizer='rmsprop')

X = np.random.random([n_samples, MaxLen, InputSize])
Y1 = np.random.random([n_samples, MaxLen, OutputSize])  # sequence targets
Y2 = np.random.random([n_samples, OutputSize])          # single-label targets

model1.fit(X, Y1, batch_size=128, epochs=1)
model2.fit(X, Y1, batch_size=128, epochs=1)
model3.fit(X, Y2, batch_size=128, epochs=1)

print(model1.summary())
print(model2.summary())
print(model3.summary())


You can see that model1 and model2 are sequence-to-sequence architectures, while model3 is a sequence-to-label model.
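If you want to confirm numerically that model1 and model2 compute the same function, you can copy the weights from one into the other (the GRU and Dense layers have identical weight shapes) and compare predictions. A short check, assuming the code above has already been run; it is not part of the original answer:

# Share weights between the two models and compare their outputs.
model2.set_weights(model1.get_weights())

preds1 = model1.predict(X[:4])
preds2 = model2.predict(X[:4])
print(preds1.shape, preds2.shape)   # (4, 64, 8) (4, 64, 8)
print(np.allclose(preds1, preds2))  # True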
