If you are into deep learning and want to work with sequential data like time series, speech, or text, it is important to understand the GRU (Gated Recurrent Unit). A GRU is a type of Recurrent Neural Network (RNN) that addresses some of the major limitations of traditional RNNs. It uses gates to control how information is passed along and remembered, which helps the network handle long sequences much better and makes it useful for many machine learning tasks.
In this blog, we will discuss everything about GRU, how it works, where you can use it, and how you can implement it easily in Python. So let’s get started!
What is GRU?
GRU, which stands for Gated Recurrent Unit, is an improved version of the RNN (Recurrent Neural Network). It was introduced in 2014 by Kyunghyun Cho and his team. A GRU is a type of neural network used to process time-series or sequential data. GRUs are very similar to Long Short-Term Memory (LSTM) networks and, like LSTMs, use gates to control the flow of information. They are newer than LSTMs, incorporate some improvements, and have a simpler architecture.

In the above diagram, you can see the difference between the architectures of LSTM and GRU. On the left side, the LSTM takes three inputs: the previous hidden state (Ht-1), the previous cell state (Ct-1), and the current input (xt). It then produces two outputs: the new hidden state (Ht) and the new cell state (Ct). On the right side, the GRU has a simpler structure: it takes only two inputs, the previous hidden state (Ht-1) and the current input (xt), and produces the new hidden state (Ht). In short, GRUs consist of fewer components than LSTMs, which makes the models easier and faster to train.
The Architecture of Gated Recurrent Unit (GRU)
Now we will talk about the architecture of a GRU and how it works. A GRU cell is broadly similar to an LSTM cell or a basic RNN cell.
At each time step t, the GRU takes the input xt and the hidden state from the previous step, Ht-1. It processes these two inputs and produces a new hidden state Ht, which is then passed to the next step in the sequence.
Unlike an LSTM, which has three gates, a GRU has two primary gates: the update gate and the reset gate. These two gates help the GRU decide what information to forget and what to remember as it moves through the sequence.
Reset Gate (Short Term Memory)
The reset gate in a GRU helps the model decide how much past information it should forget. When the reset gate value is close to 0, most of the previous hidden state is ignored and the GRU focuses more on the current input. This is useful when the new input is more significant than the past input, for example when there is a sudden change in a time series.
The formula for the reset gate is given below:
rt = σ(Wr ⋅ [ht-1, xt] + br)
Here,
- rt denotes the reset gate output at time t.
- ht-1 is used to denote the previous hidden state.
- xt denotes the current input.
- Wr and br denote the weights and bias of the reset gate.
- σ is used to denote the sigmoid activation function.
Update Gate
The update gate controls how much of the previous hidden state should be carried forward to the next time step. If the update gate value is close to 1, most of the past information is kept; if it is close to 0, the GRU relies more on the new information. This gate helps the GRU maintain long-term memory by deciding what to keep and what to update.
The formula for the update gate is:
zt = σ(Wz ⋅ [ht-1, xt] + bz)
Here,
- zt denotes the update gate output at time t.
- ht-1 denotes the previous hidden state.
- xt denotes the current input.
- Wz and bz denote the weights and bias for the update gate.
- σ is the sigmoid activation function.
How Does a GRU Work?
In order to understand the working of GRU, imagine it to be a smart filter that learns what information to remember and what to forget. The working process of GRU is given below:
A GRU needs two pieces of information to do its work. First, it takes the current input, denoted xt. Second, it takes the previous hidden state ht-1, which carries the memory from the last step. Both inputs are vectors and help the GRU understand the current context based on what has happened so far.
Gate Calculations
In a GRU, there are two gates: the reset gate and the update gate. (A "forget gate" is sometimes mentioned as well, but in a GRU that role is handled by the update gate rather than by a separate gate.) These gates control the way information flows through the network. To calculate the value of each gate, the GRU combines the current input xt and the previous hidden state ht-1. Each gate has its own set of weights, so the input and hidden state are transformed separately for each gate. A sigmoid activation function is then applied, which squashes the values into the range between 0 and 1 and tells the gate how much information to pass through or block. This process helps the GRU decide what to remember and what to forget at each step.
Now, let us see how these gates work together. To find the hidden state ht in a GRU, you follow a two-step process. The first step is to generate the candidate hidden state, which is shown below:
Candidate Hidden State
The candidate hidden state is an important part of how a GRU updates its memory at each time step. To calculate the candidate hidden state (ĥt), the GRU uses the reset gate to decide how much of the previous hidden state (ht-1) should be considered. If the reset gate is close to 0, the GRU ignores the past and focuses on the new input. It then combines the current input (xt) with the reset version of the past hidden state and applies a tanh activation function, which squashes the values into the range between -1 and 1. This keeps the output in a stable range that the network can learn from more easily.
The formula to calculate the candidate hidden state is given below:
ĥt = tanh(xt ⋅ Uh + (rt ⊙ ht-1) ⋅ Wh)
Here,
- ĥt denotes the candidate hidden state.
- xt denotes the current input.
- ht-1 denotes the previous hidden state.
- rt is the reset gate value.
- Uh and Wh are the weight matrices.
- ⋅ denotes matrix multiplication, and ⊙ denotes element-wise multiplication.
- tanh is the activation function.
The candidate hidden state is not the final output. It is a suggestion that is combined with the previous hidden state by the update gate, which decides how much of the new information should be included in the final hidden state (ht).
Hidden State
The hidden state is the final output produced by the GRU at each time step. To calculate the hidden state (ht), the GRU mixes two things. They are as follows:
1. The previous hidden state (ht-1), which is composed of the memory of the previous steps.
2. The candidate hidden state (ĥt ), which consists of the new information based on the current input.
The update gate decides how much of the previous hidden state and how much of the candidate hidden state should be used. If the update gate leans towards the previous hidden state, the model remembers more of the past. If it leans towards the candidate hidden state, the model focuses more on the new input.
The formula for the hidden state is:
ht = (1 − zt) ⊙ ĥt + zt ⊙ ht-1
Here,
- ht denotes the new hidden state.
- zt is the update gate value.
- ĥt is the candidate hidden state.
- ht-1 is the previous hidden state.
- ⊙ denotes element-wise multiplication.
This mix of new and old information helps GRU to learn about both short and long-term patterns. This makes it effective for various tasks like language modeling, translation, and predicting time series.
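To make the formulas above concrete, here is a minimal NumPy sketch of a single GRU step. The dimensions, random weights, and function names are made up purely for illustration; this is just the equations written out, not production code:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Toy sizes and randomly initialised weights, purely for illustration.
input_size, hidden_size = 3, 4
rng = np.random.default_rng(0)
Wr = rng.normal(size=(hidden_size, hidden_size + input_size)); br = np.zeros(hidden_size)
Wz = rng.normal(size=(hidden_size, hidden_size + input_size)); bz = np.zeros(hidden_size)
Uh = rng.normal(size=(hidden_size, input_size))   # weights applied to the current input
Wh = rng.normal(size=(hidden_size, hidden_size))  # weights applied to the reset-scaled previous state

def gru_step(x_t, h_prev):
    concat = np.concatenate([h_prev, x_t])            # [ht-1, xt]
    r_t = sigmoid(Wr @ concat + br)                   # reset gate
    z_t = sigmoid(Wz @ concat + bz)                   # update gate
    h_cand = np.tanh(Uh @ x_t + Wh @ (r_t * h_prev))  # candidate hidden state
    h_t = (1 - z_t) * h_cand + z_t * h_prev           # final hidden state
    return h_t

x_t = rng.normal(size=input_size)
h_prev = np.zeros(hidden_size)
print(gru_step(x_t, h_prev))
```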
GRU vs LSTM
GRUs are faster and simpler because they combine the forget gate and the input gate of an LSTM into a single update gate. Unlike LSTMs, GRUs do not have a separate cell that stores information; they keep everything in the hidden state. This makes GRUs simpler than LSTMs and faster to run, particularly when you deal with large amounts of data.
Given below are the differences between GRU and LSTM in tabular format:
| Feature | GRU (Gated Recurrent Unit) | LSTM (Long Short-Term Memory) |
|---|---|---|
| Gates | Consists of 2 gates: reset gate and update gate | Consists of 3 gates: input gate, forget gate, and output gate |
| Cell State | Does not use a separate cell state | Uses a separate cell state to store memory |
| Hidden State | Stores all information directly in the hidden state | Splits information between the hidden state and the cell state |
| Complexity | Simpler and faster to compute | More complex due to the extra gate and state |
| Training Time | Faster training because of fewer operations | Slower training due to more calculations |
| Memory Control | Less control over long-term memory | Better control of long-term memory |
| Performance | Performs well on smaller datasets and quick tasks | Performs better on longer sequences or more complex data |
| Typical Use | Used when speed and simplicity are important | Used when accuracy and deep memory are required |
Implementing GRU in Python
Now, we will talk about the implementation of the GRU model in Python using Keras. Below is the step-by-step process:
Step 1: Importing Libraries
At first, you have to import some of the important libraries to implement the GRU model. They are as follows:
1. NumPy: It is used for handling numerical data and manipulating arrays.
2. Pandas: It is used for data manipulation and reading datasets (CSV files).
3. MinMaxScaler: It is used for normalization of the dataset.
4. TensorFlow: It is used for building and training the GRU model.
5. Adam: It is the optimization algorithm used during the training process of the model.
Code:
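A minimal sketch of these imports might look like the following, assuming a standard TensorFlow/Keras setup (the Sequential, GRU, and Dense imports are included here because they are used in the later steps):

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import GRU, Dense
from tensorflow.keras.optimizers import Adam
```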
Explanation:
The above Python code sets up everything needed to build a GRU-based deep learning model for processing scaled data, such as a time series. It produces no output on its own because it only imports the required libraries.
Step 2: Loading the Dataset
The dataset that you are going to use is a time-series dataset that contains daily temperature data, i.e., a forecasting dataset. It covers 8,000 days, starting from January 1, 2010, and is used to forecast temperatures. You can download the dataset from the given link.
Code:
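A sketch of this step, assuming the downloaded file is saved locally as 'temperature.csv' with 'Date' and 'Temperature' columns (hypothetical names; adjust them to match the actual dataset):

```python
# Read the CSV, parse the 'Date' column as dates, and use it as the index.
df = pd.read_csv('temperature.csv', parse_dates=['Date'], index_col='Date')
print(df.head())
```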
Output:
Explanation:
The above code reads the temperature data from a CSV file. It treats the 'Date' column as actual date values, sets that column as the index, and then prints the first few rows.
Step 3: Preprocessing the Data
Here, you have to scale the data so that all values fall between 0 and 1 by using MinMaxScaler. This makes sure that no feature dominates the others and helps the neural network learn more efficiently.
Code:
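A sketch of the scaling step, assuming the temperature column is named 'Temperature' as above:

```python
# Scale the temperature values into the [0, 1] range.
scaler = MinMaxScaler(feature_range=(0, 1))
scaled_data = scaler.fit_transform(df[['Temperature']])
```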
Explanation:
The above Python code scales all the temperature values to the range between 0 and 1 so that they are ready for training a neural network. It does not generate any output because the transformed data is only stored, not printed.
Step 4: Preparing Data for GRU
Here, you have to create a function to get your data ready for training the model. The create_dataset() function is used to break the data into small parts of a given length. This will help the model to learn from past values and help it to predict the next one. After that, the X.reshape() changes the shape of the input data into 3 dimensions. They are samples, time steps, and features.
Code:
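A sketch of this step, assuming the scaled array from the previous step is called scaled_data:

```python
def create_dataset(data, time_steps=100):
    # Build sliding windows of `time_steps` values as inputs (X)
    # and the value immediately after each window as the target (y).
    X, y = [], []
    for i in range(len(data) - time_steps):
        X.append(data[i:i + time_steps, 0])
        y.append(data[i + time_steps, 0])
    return np.array(X), np.array(y)

X, y = create_dataset(scaled_data, time_steps=100)

# Reshape the input into 3D: (samples, time steps, features), as required by the GRU layer.
X = X.reshape(X.shape[0], X.shape[1], 1)
```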
Explanation:
The above code prepares the time-series data for training by creating input-output pairs with a sliding window of 100 time steps. For every window of 100 days of temperature data, the function sets the value of the next day as the target to be predicted. It then reshapes the input (X) into a 3D format: samples, time steps, and features.
Step 5: Building the GRU Model
Now, you have to define your GRU model with the following components:
1. GRU(units=50): It is used to add a GRU layer with 50 units (neurons).
2. return_sequences: Set this to True only when stacking multiple GRU layers, so that each layer passes the full sequence to the next one. For a single GRU layer followed by a Dense layer, as in this model, the default value of False is used so that only the final output is returned.
3. Dense(units=1): It is the final layer of the model that provides you with one predicted value, like the temperature for the next day.
4. Adam(): It is an adaptive optimizer that is commonly used in deep learning.
Code:
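A minimal sketch of the model definition, using a single GRU layer followed by a Dense output layer (window length 100 and 1 feature, matching the data prepared above):

```python
model = Sequential()
model.add(GRU(units=50, input_shape=(100, 1)))  # single GRU layer, returns only its final output
model.add(Dense(units=1))                       # one predicted value, e.g. the next day's temperature
model.compile(optimizer=Adam(), loss='mean_squared_error')
model.summary()
```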
Output:
Explanation:
The above Python code builds a GRU-based model with the Sequential API. First, a GRU layer with 50 units is added, which returns only its final output. After that, a Dense layer with 1 unit is added, which predicts the next temperature value. Finally, the model is compiled using the Adam optimizer, with Mean Squared Error as the loss function.
Step 6: Training the Model
Here, the GRU model is trained using model.fit() with the help of the input data. You have to set epochs=10, which means that the model will go through the entire dataset 10 times, and batch_size=32, which means that the model will train using 32 samples at a time to update the weights of the model.
Code:
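A sketch of the training call, using the X and y arrays prepared earlier:

```python
# Train for 10 epochs, updating the weights after every batch of 32 samples.
model.fit(X, y, epochs=10, batch_size=32)
```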
Output:
Explanation:
The above code trains the GRU model on the input data (X, y) for 10 epochs, using 32 samples at a time to update the model's weights.
Step 7: Making Predictions
Now, it is time for you to make predictions using the GRU model. To do this, the code selects the last 100 temperature values from the dataset as the input sequence. This input is then reshaped into a 3D format, because the GRU model expects the data as 1 sample, 100 time steps, and 1 feature (temperature). Finally, model.predict() is used to predict the next temperature value based on this input.
Code:
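A sketch of the prediction step, assuming scaled_data and model from the previous steps:

```python
# Take the last 100 scaled values and reshape them to (1 sample, 100 time steps, 1 feature).
last_sequence = scaled_data[-100:].reshape(1, 100, 1)
predicted_scaled = model.predict(last_sequence)
```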
Output:
Explanation:
The above code is used to reshape the last 100 data points in order to fit the input format of the GRU model. It uses the trained model to predict the next value of the temperature.
Here, inverse transforming the prediction means changing the scaled value (which was between 0 and 1) back to its original range. You can do this with scaler.inverse_transform(), so that the predicted value makes sense in the real world.
Code:
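A sketch of the inverse transformation, assuming the predicted_scaled value from the previous step:

```python
# Convert the scaled prediction back to the original temperature units and print it.
predicted_temperature = scaler.inverse_transform(predicted_scaled)
print(predicted_temperature)
```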
Output:
Explanation:
The above code converts the scaled prediction back to the original temperature range using scaler.inverse_transform() and prints the result, so that the predicted value is meaningful.
Advantages of GRU
1. Faster Training: The training process of GRUs is comparatively quicker than LSTMs. This is because they have fewer gates and a simpler structure compared to LSTMs.
2. Fewer Parameters: Since only two gates are used by GRUs (reset and update), they need fewer parameters compared to LSTMs.
3. Handles Long-Term Dependencies: GRUs are good at remembering important information for a long period of time. This can be helpful in tasks like language translation and time-series forecasting.
4. Simpler Architecture: GRUs consist of fewer components than LSTMs and are easier to implement and debug, especially if you are new to deep learning.
5. Good Performance: Although GRUs have a simple structure, they often perform comparably to LSTMs while using less memory and processing power.
Disadvantages of GRU
1. Not as Powerful for Complex Tasks: GRUs can forget important information if the sequence is too long.
2. No separate Memory Cell: Unlike LSTMs, GRUs don’t have a separate cell state. This can limit their ability to remember older data.
3. Too Simple for Some Tasks: Since GRUs have fewer gates, they can miss complex patterns that LSTMs capture more easily.
4. Slower than Basic RNNs: Even though GRUs are faster than LSTMs, they are slower than simple RNNs.
Conclusion
GRU is a powerful and efficient neural network. It provides a great balance between performance and simplicity, and it solves the limitations of traditional RNNs with the help of its reset and update gates. GRUs are faster and easier to train, require fewer resources, and perform well in tasks like forecasting, language modeling, and speech recognition. GRUs can be a good choice whether you are a beginner or working on a real-world project, because they provide a simpler model for handling sequential data. To learn more about RNNs, go through our blog and enroll in our Machine Learning Course.
Gated Recurrent Unit (GRU) – FAQs
Q1. Can GRUs be used for real-time predictions?
Yes. GRUs are lightweight and fast, which makes them suitable for real-time tasks like live forecasting.
Q2. Do GRUs work with multivariate time-series data?
Yes, GRUs can handle multiple input features by adjusting the input shape (the number of features per time step).
Q3. Is it okay to use GRU without scaling the data?
No, it is not recommended. This is because scaling helps the model to learn faster and more accurately.
Q4. Can GRUs be used in both classification and regression tasks?
Yes, you can use GRUs for both classification and regression tasks. This can be done by changing the final layer and loss function.
Q5. Are GRUs suitable for small datasets?
Yes, GRUs perform well with small datasets. This is because they have a simpler structure with fewer parameters.