Gated Recurrent Unit (GRU)


If you work with deep learning on sequential data like time series, speech, or text, it is important to understand the GRU (Gated Recurrent Unit). A GRU is a type of Recurrent Neural Network (RNN) that solves some of the major limitations of traditional RNNs. It uses gates to control how information is passed along and remembered, which helps the network handle long sequences better and makes it useful for many machine learning tasks.

In this blog, we will discuss everything about GRU, how it works, where you can use it, and how you can implement it easily in Python. So let’s get started!


What is GRU?

GRU, which stands for Gated Recurrent Unit, is an improved version of the RNN (Recurrent Neural Network). It was introduced in 2014 by Kyunghyun Cho and his team. A GRU is a type of neural network used to process time-series or sequential data. GRUs are very similar to Long Short-Term Memory (LSTM) networks and, like LSTMs, use gates to control the flow of information. They are newer than LSTMs and offer some improvements, most notably a simpler architecture.

[Diagram: an LSTM cell (left) compared with a GRU cell (right)]

In the above diagram, you can see the difference between the architectures of LSTM and GRU. On the left side, the LSTM takes three inputs: the previous hidden state (Ht-1), the previous cell state (Ct-1), and the current input (xt). It then produces two outputs: the new hidden state (Ht) and the new cell state (Ct). On the right side, the GRU has a simpler structure: it takes only two inputs, the previous hidden state (Ht-1) and the current input (xt), and produces the new hidden state (Ht). In short, GRUs consist of fewer components than LSTMs, which makes them easier and faster to train.


The Architecture of Gated Recurrent Unit (GRU)

Now we will talk about the architecture of the GRU and how it works. A GRU cell is broadly similar to an LSTM cell or a basic RNN cell.

[Diagram: a GRU cell with its reset and update gates]


Here, at each time step t, the GRU takes the input xt and the hidden state from the previous step, Ht-1. It processes these two inputs and produces a new hidden state Ht, which is then passed to the next step in the sequence.

Unlike an LSTM, which has three gates, a GRU comprises two primary gates, namely the update gate and the reset gate. These two gates help the GRU to decide what information it should forget and what it should remember to keep moving through the sequence.

Reset Gate (Short-Term Memory)

The reset gate helps the model decide how much past information it should forget. When the reset gate value is close to 0, most of the previous hidden state is ignored and the GRU focuses on the current input. This is useful when the new input matters more than the past, such as a sudden change in a time series.

The formula for the reset gate is given below:

rt = σ(Wr · [ht-1, xt] + br)

Here, 

  • rt denotes the reset gate output at time t.
  • ht-1 denotes the previous hidden state.
  • xt denotes the current input.
  • Wr and br denote the weights and bias of the reset gate.
  • σ denotes the sigmoid activation function.

Update Gate

The update gate controls how much of the previous hidden state is carried forward to the next time step. If the update gate is close to 1, most of the past information is kept; if it is close to 0, the GRU relies mostly on the new information. This gate helps the GRU maintain long-term memory by deciding what to remember and what to update.

The formula for the update gate is: 

zt = σ(Wz · [ht-1, xt] + bz)

Here,

  • zt denotes the update gate output at time t.
  • ht-1 denotes the previous hidden state.
  • xt denotes the current input.
  • Wz and bz denote the weights and bias for the update gate.
  • σ is the sigmoid activation function.

How Does a GRU Work?

In order to understand the working of GRU, imagine it to be a smart filter that learns what information to remember and what to forget. The working process of GRU is given below:

Prepare the Inputs

For a GRU to work properly, it requires two pieces of information. First, it takes the current input, denoted by xt. Second, it takes the previous hidden state, denoted by ht-1, which carries the memory from the previous step. Both inputs are vectors and help the GRU understand the current context based on what has happened so far.

Gate Calculations

A GRU has two gates: the reset gate and the update gate. (The forgetting role that a separate forget gate plays in an LSTM is handled here mainly by the reset gate.) These gates control the way information flows through the network. To calculate the value of each gate, the GRU combines the current input xt and the previous hidden state ht-1 using that gate's own set of weights, so the computation is customized for each gate. A sigmoid activation function is then applied, squashing the values into a range between 0 and 1; this tells the gate how much information to pass through or block. This process lets the GRU decide what to remember and what to forget at each step.

Now, let us see the functioning of these gates in detail. In order to find the hidden state Ht in GRU, you have to follow a two-step process. The first step is to generate the candidate hidden state, which is shown below:


Candidate Hidden State

The candidate hidden state is an important part of how a GRU updates its memory at each time step. To calculate the candidate hidden state (ĥt), the GRU uses the reset gate to decide how much of the previous hidden state (ht-1) to consider. If the reset gate is close to 0, the GRU ignores most of the past hidden state and focuses on the new input. It then combines the current input (xt) with the reset version of the past hidden state and applies a tanh activation function, which squashes the values into a range between -1 and 1. This keeps the values stable and makes them easier for the network to learn from.

The formula to calculate the candidate hidden state is given below:

ĥt = tanh(xt · Uh + (rt ⊙ ht-1) · Wh)

Here,

  • ĥt denotes the candidate hidden state.
  • xt denotes the current input.
  • ht-1 denotes the previous hidden state.
  • rt is the reset gate value.
  • Uh and Wh are the weight matrices.
  • · denotes matrix multiplication, and ⊙ denotes element-wise multiplication.
  • tanh is the activation function.

The candidate hidden state is not the final output. It is a suggestion that will be used along with the previous hidden state with the help of the update gate. This helps to decide how much new information should be included in the final hidden state (ht).

Hidden State

The hidden state is the final output produced by the GRU at each time step. To calculate the hidden state (ht), the GRU mixes two things:

1. The previous hidden state (ht-1), which is composed of the memory of the previous steps.

2. The candidate hidden state (ĥt ), which consists of the new information based on the current input.

The update gate decides how much of the previous hidden state and how much of the candidate hidden state should be used. If the update gate leans towards the previous hidden state, the model remembers more of the past input. If it leans towards the candidate hidden state, the model focuses more on the new input.

The formula for the hidden state is:

ht = (1 – zt) ⊙ ĥt + zt ⊙ ht-1

Here,

  • ht denotes the new hidden state.
  • zt is the update gate value.
  • ĥt  is the candidate hidden state.
  • ht-1 is the previous hidden state.
  • ⊙ denotes element-wise multiplication.

This mix of new and old information helps GRU to learn about both short and long-term patterns. This makes it effective for various tasks like language modeling, translation, and predicting time series.
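To make these equations concrete, below is a minimal NumPy sketch of a single GRU step using the formulas above. The sizes, random weights, and input are illustrative assumptions, not trained values:

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

hidden_size, input_size = 4, 3
rng = np.random.default_rng(0)

# Illustrative random weights; a trained GRU would learn these.
W_r = rng.normal(size=(hidden_size, hidden_size + input_size))  # reset gate weights
W_z = rng.normal(size=(hidden_size, hidden_size + input_size))  # update gate weights
b_r = np.zeros(hidden_size)
b_z = np.zeros(hidden_size)
U_h = rng.normal(size=(input_size, hidden_size))   # input weights for the candidate
W_h = rng.normal(size=(hidden_size, hidden_size))  # hidden weights for the candidate

h_prev = np.zeros(hidden_size)     # previous hidden state ht-1
x_t = rng.normal(size=input_size)  # current input xt

concat = np.concatenate([h_prev, x_t])              # [ht-1, xt]
r_t = sigmoid(W_r @ concat + b_r)                   # reset gate
z_t = sigmoid(W_z @ concat + b_z)                   # update gate
h_cand = np.tanh(x_t @ U_h + (r_t * h_prev) @ W_h)  # candidate hidden state
h_t = (1 - z_t) * h_cand + z_t * h_prev             # new hidden state
print(h_t)
```

Running this computation once per time step, feeding each new h_t back in as h_prev, is exactly what a GRU layer does internally across a sequence.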

GRU vs LSTM

GRUs are faster and simpler because they combine the roles of the forget gate and the input gate into one gate called the update gate. Unlike LSTMs, GRUs do not have a distinct cell state that stores information; they keep everything in the hidden state. This makes GRUs simpler than LSTMs and faster to run, particularly when dealing with large amounts of data.

Given below are the differences between GRU and LSTM in tabular format:

Feature | GRU (Gated Recurrent Unit) | LSTM (Long Short-Term Memory)
Gates | 2 gates: reset gate and update gate | 3 gates: input gate, forget gate, and output gate
Cell State | Does not use a separate cell state | Uses a separate cell state to store memory
Hidden State | Stores all information directly in the hidden state | Splits information between the hidden state and the cell state
Complexity | Simpler and faster to compute | More complex due to the extra gate and state
Training Time | Faster training because of fewer operations | Slower training due to more calculations
Memory Control | Less control over long-term memory | Better control of long-term memory
Performance | Performs well on smaller datasets and quick tasks | Performs better on longer sequences or more complex data
Popularity | Used when speed and simplicity are important | Used when accuracy and deep memory are required

Implementing GRU in Python

Now, we will talk about the implementation of the GRU model in Python using Keras. Below is the step-by-step process:

Step 1: Importing Libraries

At first, you have to import some of the important libraries to implement the GRU model. They are as follows:

1. NumPy: It is used for handling numerical data and manipulating arrays.

2. Pandas: It is used for data manipulation and reading datasets (CSV files).

3. MinMaxScaler: It is used for normalization of the dataset.

4. TensorFlow: It is used for building and training the GRU model.

5. Adam: It is the optimization algorithm used during the training process of the model.

Code:

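A minimal sketch of these imports, assuming TensorFlow 2.x with its bundled Keras:

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import GRU, Dense
from tensorflow.keras.optimizers import Adam
```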

Explanation:

The above Python code sets up everything that is needed to build a GRU-based deep learning model for processing scaled data, like a time series. It produces no output on its own, since it only loads the libraries used in the later steps.

Step 2: Loading the Dataset

The dataset used here is a time-series dataset that contains daily temperature data, i.e., a forecasting dataset. It covers 8,000 days, starting from January 1, 2010, and is used to forecast temperatures. You can download the dataset from the given link.

Code:

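A minimal sketch of this step; the file name temperature.csv is an assumption for illustration:

```python
# Read the CSV file, parse the 'Date' column as dates, and set it as the index.
# 'temperature.csv' is an assumed file name for illustration.
df = pd.read_csv('temperature.csv', parse_dates=['Date'], index_col='Date')
print(df.head())
```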

Output:

[Screenshot: the first few rows of the loaded dataset]

Explanation:

The above code reads the temperature data from a CSV file. It parses the 'Date' column as actual date values, sets that column as the index, and then prints the first few rows.

Step 3: Preprocessing the Data

Here, you have to scale the data so that all values fall between 0 and 1 by using MinMaxScaler. This helps to make sure that no feature dominates another and helps the neural network learn more efficiently.

Code:

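A minimal sketch of the scaling step; the column name Temp is an assumption:

```python
# Scale the temperature values into the range [0, 1].
scaler = MinMaxScaler(feature_range=(0, 1))
scaled_data = scaler.fit_transform(df[['Temp']])  # 'Temp' is an assumed column name
```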

Explanation:

The above Python code scales all the temperature values to a range between 0 and 1 so that they are ready for training a neural network. It does not generate any output because the data is only transformed and stored, not printed.

Step 4: Preparing Data for GRU

Here, you have to create a function to get your data ready for training the model. The create_dataset() function breaks the data into small windows of a given length. This helps the model learn from past values and predict the next one. After that, X.reshape() changes the shape of the input data into 3 dimensions: samples, time steps, and features.

Code:

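A minimal sketch of the windowing step described above:

```python
def create_dataset(data, time_steps=100):
    # Split the series into (100-step input window, next value) pairs.
    X, y = [], []
    for i in range(len(data) - time_steps):
        X.append(data[i:i + time_steps])  # past 100 values as input
        y.append(data[i + time_steps])    # the following value as the target
    return np.array(X), np.array(y)

X, y = create_dataset(scaled_data, time_steps=100)
X = X.reshape(X.shape[0], X.shape[1], 1)  # (samples, time steps, features)
```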

Explanation:

The above code prepares the time-series data for training by creating input-output pairs using a sliding window of 100 time steps. For every 100 days of temperature data, the function sets the value of the next day as the target to be predicted. After that, it reshapes the input (X) into a 3D format: samples, time steps, and features.

Step 5: Building the GRU Model

Now, you have to define your GRU model with the following components:

1. GRU(units=50): It is used to add a GRU layer with 50 units (neurons).

2. return_sequences: When set to True, the GRU layer returns the entire sequence, which is required only when stacking multiple GRU layers. With a single GRU layer feeding a Dense layer, the default (False) is used so that only the final output is returned.

3. Dense(units=1): It is the final layer of the model that provides you with one predicted value, like the temperature for the next day.

4. Adam(): It is an adaptive optimizer that is widely used in deep learning.

Code:

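A minimal sketch of the model described above, using a single GRU layer that returns only its final output:

```python
model = Sequential([
    GRU(units=50, input_shape=(100, 1)),  # returns only the final output by default
    Dense(units=1)                        # one predicted value: the next temperature
])
model.compile(optimizer=Adam(), loss='mean_squared_error')
model.summary()
```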

Output:

[Screenshot: output of building the GRU model]

Explanation:

The above Python code builds a GRU-based model with the help of the Sequential API. First, a GRU layer with 50 units is added, which returns only the final output. After that, a Dense layer with 1 unit is added, which predicts the next temperature value. Finally, the model is compiled using the Adam optimizer, with Mean Squared Error as the loss function.

Step 6: Training the Model

Here, the GRU model is trained using model.fit() on the input data. You have to set epochs=10, which means the model will go through the entire dataset 10 times, and batch_size=32, which means the model will process 32 samples at a time when updating its weights.

Code:

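A minimal sketch of the training call:

```python
# Train for 10 passes over the data, updating weights every 32 samples.
model.fit(X, y, epochs=10, batch_size=32)
```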

Output:

[Screenshot: training progress for the 10 epochs]

Explanation:

The above code trains the GRU model on the input data (X, y) for 10 epochs, using 32 samples at a time to update the model's weights.

Step 7: Making Predictions

Now it is time to make predictions with the GRU model. To do this, the code selects the last 100 temperature values from the dataset as the input sequence. This input is then reshaped into a 3D format, because the GRU model needs the data as 1 sample, 100 time steps, and 1 feature (temperature). Finally, model.predict() is used to predict the next temperature value based on this input.

Code:

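A minimal sketch of the prediction step:

```python
# Take the last 100 scaled values and reshape to (1 sample, 100 time steps, 1 feature).
last_sequence = scaled_data[-100:].reshape(1, 100, 1)
predicted_scaled = model.predict(last_sequence)
```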

Output:

[Screenshot: output of the prediction step]

Explanation:

The above code reshapes the last 100 data points to fit the input format of the GRU model and uses the trained model to predict the next temperature value.

Step 8: Inverse Transforming the Predictions

Here, inverse transforming the predictions means changing the scaled values (which were between 0 and 1) back to their original range. You can do this by using scaler.inverse_transform(), so that the predicted values make sense in the real world.

Code:

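A minimal sketch of the inverse transform:

```python
# Map the scaled prediction back to the original temperature range.
predicted_temp = scaler.inverse_transform(predicted_scaled)
print(predicted_temp)
```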

Output:

[Screenshot: the prediction in the original temperature scale]

Explanation:

The call to scaler.inverse_transform() maps the scaled prediction back to the original temperature range, so the predicted value is a real temperature rather than a number between 0 and 1.

Advantages of GRU

1. Faster Training: The training process of GRUs is comparatively quicker than LSTMs. This is because they have fewer gates and a simpler structure compared to LSTMs.

2. Fewer Parameters: Since only two gates are used by GRUs (reset and update), they need fewer parameters compared to LSTMs.

3. Handles Long-Term Dependencies: GRUs are good at remembering important information for a long period of time. This can be helpful in tasks like language translation and time-series forecasting.

4. Simpler Architecture: GRUs consist of fewer components than LSTMs and are easier to implement and debug, especially if you are new to deep learning.

5. Good Performance: Although GRUs have a simple structure, they often perform on par with LSTMs while using less memory and processing power.

Disadvantages of GRU

1. Not as Powerful for Complex Tasks: GRUs can forget important information if the sequence is too long.

2. No Separate Memory Cell: Unlike LSTMs, GRUs don't have a separate cell state. This can limit their ability to remember older data.

3. Too Simple for Some Tasks: Since GRUs have fewer gates, they can miss complex patterns that LSTMs catch more easily.

4. Slower than Basic RNNs: Even though GRUs are faster than LSTMs, they are slower than simple RNNs.


Conclusion

GRU is a powerful and efficient neural network that offers a great balance between performance and simplicity. It solves the limitations of traditional RNNs with the help of its reset and update gates. GRUs are faster and easier to train, require fewer resources, and perform well in tasks like forecasting, language modeling, and speech recognition. GRUs can be a good choice whether you are a beginner or working on a real-world project, because they provide a simpler model for handling sequential data. To learn more about RNNs, go through our blog and enroll in our Machine Learning Course.

Gated Recurrent Unit (GRU) – FAQs

Q1. Can GRUs be used for real-time predictions?

Yes. GRUs are lightweight and fast, which makes them suitable for real-time tasks like live forecasting.

Q2. Do GRUs work with multivariate time-series data?

Yes, GRUs can handle multiple input features by adjusting the model's input shape.

Q3. Is it okay to use GRU without scaling the data?

No, it is not recommended. This is because scaling helps the model to learn faster and more accurately.

Q4. Can GRUs be used in both classification and regression tasks?

Yes, you can use GRUs for both classification and regression tasks. This can be done by changing the final layer and loss function.

Q5. Are GRUs suitable for small datasets?

Yes, GRUs perform well with small datasets. This is because they have a simpler structure with fewer parameters.

About the Author

Principal Data Scientist, Accenture

Meet Akash, a Principal Data Scientist with expertise in advanced analytics, machine learning, and AI-driven solutions. With a master's degree from IIT Kanpur, Akash combines technical knowledge with industry insights to deliver impactful, scalable models for complex business challenges.
