This is mainly due to the sequential computation in the LSTM layer. Remember that an LSTM processes its input one time step at a time: the hidden state at time t depends on the hidden state at time t-1, so you must wait for step t-1 to finish before you can compute step t.
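To make that dependency concrete, here is a minimal NumPy sketch of an LSTM forward pass. It is only an illustration, not how any particular framework implements it: the function name `lstm_forward`, the weight names `W`, `U`, `b`, and the gate ordering are all assumptions made for this example. The input projection for every time step can be done in one large matrix multiply, but the loop over t cannot be parallelized across time because each iteration reads the `h` and `c` produced by the previous one.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_forward(x, W, U, b):
    """Hypothetical single-layer LSTM forward pass.

    x: (T, D) input sequence, W: (D, 4H) input weights,
    U: (H, 4H) recurrent weights, b: (4H,) bias.
    Returns the hidden states for all time steps, shape (T, H).
    """
    T = x.shape[0]
    H = U.shape[0]
    h = np.zeros(H)  # hidden state h_{t-1}
    c = np.zeros(H)  # cell state c_{t-1}

    # This part does not depend on h, so all T steps are computed at once.
    x_proj = x @ W + b  # (T, 4H)

    hs = []
    for t in range(T):
        # This part needs h from the previous iteration: step t must wait for t-1.
        gates = x_proj[t] + h @ U
        i, f, o, g = np.split(gates, 4)  # assumed gate order: input, forget, output, candidate
        i, f, o = sigmoid(i), sigmoid(f), sigmoid(o)
        g = np.tanh(g)
        c = f * c + i * g
        h = o * np.tanh(c)
        hs.append(h)
    return np.stack(hs)

# Small usage example with random weights (sizes chosen arbitrarily).
rng = np.random.default_rng(0)
T, D, H = 5, 3, 4
x = rng.normal(size=(T, D))
W = rng.normal(size=(D, 4 * H))
U = rng.normal(size=(H, 4 * H))
b = np.zeros(4 * H)
hs = lstm_forward(x, W, U, b)
print(hs.shape)  # (5, 4)
```

With a small hidden size, each iteration of that loop is a tiny amount of work, which is why most of the GPU's cores have nothing to do while it runs.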
That's a poor fit for a GPU, which is made up of many small cores designed to run computations in parallel. A sequential computation can't fully use that computing power, which is why the GPU load stays around 10%-20% most of the time.
During backpropagation, however, the GPU can run the derivative computations in parallel, so the GPU load peaks around 80%.