3,172 view times

# Building your Recurrent Neural Network – Step by Step

## 1.1 – RNN cell

A recurrent neural network can be seen as the repeated use of a single cell. You are first going to implement the computations for a single time-step. The following figure describes the operations for a single time-step of an RNN cell.

#### rnn cell versus rnn_cell_forward

• Note that an RNN cell outputs the hidden state a⟨t⟩a⟨t⟩.
• The rnn cell is shown in the figure as the inner box which has solid lines.
• The function that we will implement, rnn_cell_forward, also calculates the prediction ŷ ⟨t⟩y^⟨t⟩
• The rnn_cell_forward is shown in the figure as the outer box that has dashed lines.

Exercise: Implement the RNN-cell described in Figure (2).

Instructions:

1. Compute the hidden state with tanh activation: $$a^{\langle t \rangle} = \tanh(W_{aa} a^{\langle t-1 \rangle} + W_{ax} x^{\langle t \rangle} + b_a)$$.
2. Using your new hidden state $$a^{\langle t \rangle}$$, compute the prediction $$\hat{y}^{\langle t \rangle} = softmax(W_{ya} a^{\langle t \rangle} + b_y)$$. We provided the function softmax.
3. Store $$(a^{\langle t \rangle}, a^{\langle t-1 \rangle}, x^{\langle t \rangle}, parameters)$$ in a cache.
4. Return $$a^{\langle t \rangle} , \hat{y}^{\langle t \rangle}$$ and cache

• numpy.tanh
• We’ve created a softmax function that you can use. It is located in the file ‘rnn_utils.py’ and has been imported.
• For matrix multiplication, use numpy.dot
# GRADED FUNCTION: rnn_cell_forward

def rnn_cell_forward(xt, a_prev, parameters):
"""
Implements a single forward step of the RNN-cell as described in Figure (2)

Arguments:
xt -- your input data at timestep "t", numpy array of shape (n_x, m).
a_prev -- Hidden state at timestep "t-1", numpy array of shape (n_a, m)
parameters -- python dictionary containing:
Wax -- Weight matrix multiplying the input, numpy array of shape (n_a, n_x)
Waa -- Weight matrix multiplying the hidden state, numpy array of shape (n_a, n_a)
Wya -- Weight matrix relating the hidden-state to the output, numpy array of shape (n_y, n_a)
ba --  Bias, numpy array of shape (n_a, 1)
by -- Bias relating the hidden-state to the output, numpy array of shape (n_y, 1)
Returns:
a_next -- next hidden state, of shape (n_a, m)
yt_pred -- prediction at timestep "t", numpy array of shape (n_y, m)
cache -- tuple of values needed for the backward pass, contains (a_next, a_prev, xt, parameters)
"""

# Retrieve parameters from "parameters"
Wax = parameters["Wax"]
Waa = parameters["Waa"]
Wya = parameters["Wya"]
ba = parameters["ba"]
by = parameters["by"]

### START CODE HERE ### (≈2 lines)
# compute next activation state using the formula given above
a_next = np.tanh(np.dot(Waa,a_prev)+np.dot(Wax,xt)+ba)
# compute output of the current cell using the formula given above
yt_pred = softmax(np.dot(Wya,a_next)+by)
### END CODE HERE ###

# store values you need for backward propagation in cache
cache = (a_next, a_prev, xt, parameters)

return a_next, yt_pred, cache


## 1.2 – RNN forward pass

• A recurrent neural network (RNN) is a repetition of the RNN cell that you’ve just built.
• If your input sequence of data is 10 time steps long, then you will re-use the RNN cell 10 times.
• Each cell takes two inputs at each time step:
• $$a^{\langle t-1 \rangle}$$: The hidden state from the previous cell.
• $$x^{\langle t \rangle}$$: The current time-step’s input data.
• It has two outputs at each time step:
• A hidden state ($$a^{\langle t \rangle}$$)
• A prediction ($$y^{\langle t \rangle}$$)
• The weights and biases $$(W_{aa}, b_{a}, W_{ax}, b_{x})$$ are re-used each time step.
• They are maintained between calls to rnn_cell_forward in the ‘parameters’ dictionary.

Exercise: Code the forward propagation of the RNN described in Figure (3).

Instructions:

• Create a 3D array of zeros, a of shape $$(n_{a}, m, T_{x})$$ that will store all the hidden states computed by the RNN.
• Create a 3D array of zeros, $$\hat{y}$$, of shape $$(n_{y}, m, T_{x})$$ that will store the predictions.
• Note that in this case, $$T_{y} = T_{x}$$ (the prediction and input have the same number of time steps).
• Initialize the 2D hidden state a_next by setting it equal to the initial hidden state, $$a_{0}$$.
• At each time step t:
• Get $$x^{\langle t \rangle}$$, which is a 2D slice of x for a single time step t.
• $$x^{\langle t \rangle}$$ has shape $$(n_{x}, m)$$
• x has shape $$(n_{x}, m, T_{x})$$
• Update the 2D hidden state $$a^{\langle t \rangle}$$ (variable name a_next), the prediction $$\hat{y}^{\langle t \rangle}$$ and the cache by running rnn_cell_forward.
• $$a^{\langle t \rangle}$$ has shape $$(n_{a}, m)$$
• Store the 2D hidden state in the 3D tensor a, at the $$t^{th}$$ position.
• a has shape $$(n_{a}, m, T_{x})$$
• Store the 2D $$\hat{y}^{\langle t \rangle}$$ prediction (variable name yt_pred) in the 3D tensor $$\hat{y}_{pred}$$ at the $$t^{th}$$ position.
• $$\hat{y}^{\langle t \rangle}$$ has shape $$(n_{y}, m)$$
• $$\hat{y}$$ has shape $$(n_{y}, m, T_x)$$
• Append the cache to the list of caches.
• Return the 3D tensor a and $$\hat{y}$$, as well as the list of caches.

• np.zeros
• If you have a 3 dimensional numpy array and are indexing by its third dimension, you can use array slicing like this: var_name[:,:,i].
# GRADED FUNCTION: rnn_forward

def rnn_forward(x, a0, parameters):
"""
Implement the forward propagation of the recurrent neural network described in Figure (3).

Arguments:
x -- Input data for every time-step, of shape (n_x, m, T_x).
a0 -- Initial hidden state, of shape (n_a, m)
parameters -- python dictionary containing:
Waa -- Weight matrix multiplying the hidden state, numpy array of shape (n_a, n_a)
Wax -- Weight matrix multiplying the input, numpy array of shape (n_a, n_x)
Wya -- Weight matrix relating the hidden-state to the output, numpy array of shape (n_y, n_a)
ba --  Bias numpy array of shape (n_a, 1)
by -- Bias relating the hidden-state to the output, numpy array of shape (n_y, 1)

Returns:
a -- Hidden states for every time-step, numpy array of shape (n_a, m, T_x)
y_pred -- Predictions for every time-step, numpy array of shape (n_y, m, T_x)
caches -- tuple of values needed for the backward pass, contains (list of caches, x)
"""

# Initialize "caches" which will contain the list of all caches
caches = []

# Retrieve dimensions from shapes of x and parameters["Wya"]
n_x, m, T_x = x.shape
n_y, n_a = parameters["Wya"].shape

### START CODE HERE ###

# initialize "a" and "y_pred" with zeros (≈2 lines)
a = np.zeros((n_a,m,T_x))
y_pred = np.zeros((n_y,m,T_x))

# Initialize a_next (≈1 line)
a_next = a0

# loop over all time-steps of the input 'x' (1 line)
for t in range(T_x):
# Update next hidden state, compute the prediction, get the cache (≈2 lines)
xt = None
a_next, yt_pred, cache = rnn_cell_forward(x[:,:,t], a_next, parameters)
# Save the value of the new "next" hidden state in a (≈1 line)
a[:,:,t] = a_next
# Save the value of the prediction in y (≈1 line)
y_pred[:,:,t] = yt_pred
# Append "cache" to "caches" (≈1 line)
caches.append(cache)

### END CODE HERE ###

# store values needed for backward propagation in cache
caches = (caches, x)

return a, y_pred, caches


### 2.1 – LSTM cell

Exercise: Implement the LSTM cell described in the Figure (4).

Instructions:

1. Concatenate the hidden state $$a^{\langle t-1 \rangle}$$ and input $$x^{\langle t \rangle}$$ into a single matrix:

$$concat = \begin{bmatrix} a^{\langle t-1 \rangle} \ x^{\langle t \rangle} \end{bmatrix}$$

1. Compute all the formulas 1 through 6 for the gates, hidden state, and cell state.
2. Compute the prediction $$y^{\langle t \rangle}$$.

• You can use numpy.concatenate. Check which value to use for the axis parameter.
• The functions sigmoid() and softmax are imported from rnn_utils.py.
• numpy.tanh
• Use np.dot for matrix multiplication.
• Notice that the variable names Wibi refer to the weights and biases of the update gate. There are no variables named “Wu” or “bu” in this function.

### 2.2 – Forward pass for LSTM

Now that you have implemented one step of an LSTM, you can now iterate this over this using a for-loop to process a sequence of $$T_x$$ inputs.

Exercise: Implement lstm_forward() to run an LSTM over $$T_x$$ time-steps.

Instructions

• Get the dimensions $$n_x, n_a, n_y, m, T_x$$ from the shape of the variables: x and parameters.
• Initialize the 3D tensors a, c and y.
• a: hidden state, shape $$(n_{a}, m, T_{x})$$
• c: cell state, shape $$(n_{a}, m, T_{x})$$
• y: prediction, shape $$(n_{y}, m, T_{x})$$ (Note that $$T_{y} = T_{x}$$ in this example).
• Note Setting one variable equal to the other is a “copy by reference”. In other words, don’t do c = a’, otherwise both these variables point to the same underlying variable.
• Initialize the 2D tensor $$a^{\langle t \rangle}$$
• $$a^{\langle t \rangle}$$ stores the hidden state for time step t. The variable name is a_next.
• $$a^{\langle 0 \rangle}$$, the initial hidden state at time step 0, is passed in when calling the function. The variable name is a0.
• $$a^{\langle t \rangle}$$ and $$a^{\langle 0 \rangle}$$ represent a single time step, so they both have the shape $$(n_{a}, m)$$
• Initialize $$a^{\langle t \rangle}$$ by setting it to the initial hidden state ($$a^{\langle 0 \rangle}$$) that is passed into the function.
• Initialize $$c^{\langle t \rangle}$$ with zeros.
• The variable name is c_next.
• $$c^{\langle t \rangle}$$ represents a single time step, so its shape is $$(n_{a}, m)$$
• Note: create c_next as its own variable with its own location in memory. Do not initialize it as a slice of the 3D tensor c. In other words, don’t do c_next = c[:,:,0].
• For each time step, do the following:
• From the 3D tensor x, get a 2D slice $$x^{\langle t \rangle}$$ at time step t.
• Call the lstm_cell_forward function that you defined previously, to get the hidden state, cell state, prediction, and cache.
• Store the hidden state, cell state and prediction (the 2D tensors) inside the 3D tensors.
• Also append the cache to the list of caches.
# GRADED FUNCTION: lstm_forward

def lstm_forward(x, a0, parameters):
"""
Implement the forward propagation of the recurrent neural network using an LSTM-cell described in Figure (4).

Arguments:
x -- Input data for every time-step, of shape (n_x, m, T_x).
a0 -- Initial hidden state, of shape (n_a, m)
parameters -- python dictionary containing:
Wf -- Weight matrix of the forget gate, numpy array of shape (n_a, n_a + n_x)
bf -- Bias of the forget gate, numpy array of shape (n_a, 1)
Wi -- Weight matrix of the update gate, numpy array of shape (n_a, n_a + n_x)
bi -- Bias of the update gate, numpy array of shape (n_a, 1)
Wc -- Weight matrix of the first "tanh", numpy array of shape (n_a, n_a + n_x)
bc -- Bias of the first "tanh", numpy array of shape (n_a, 1)
Wo -- Weight matrix of the output gate, numpy array of shape (n_a, n_a + n_x)
bo -- Bias of the output gate, numpy array of shape (n_a, 1)
Wy -- Weight matrix relating the hidden-state to the output, numpy array of shape (n_y, n_a)
by -- Bias relating the hidden-state to the output, numpy array of shape (n_y, 1)

Returns:
a -- Hidden states for every time-step, numpy array of shape (n_a, m, T_x)
y -- Predictions for every time-step, numpy array of shape (n_y, m, T_x)
c -- The value of the cell state, numpy array of shape (n_a, m, T_x)
caches -- tuple of values needed for the backward pass, contains (list of all the caches, x)
"""

# Initialize "caches", which will track the list of all the caches
caches = []

### START CODE HERE ###
Wy = parameters['Wy'] # saving parameters['Wy'] in a local variable in case students use Wy instead of parameters['Wy']
# Retrieve dimensions from shapes of x and parameters['Wy'] (≈2 lines)
n_x, m, T_x = x.shape
n_y, n_a = Wy.shape

# initialize "a", "c" and "y" with zeros (≈3 lines)
a = np.zeros((n_a, m, T_x))
c = a
y = np.zeros((n_y, m, T_x))

# Initialize a_next and c_next (≈2 lines)
a_next = a0
c_next = np.zeros(a_next.shape)

# loop over all time-steps
for t in range(T_x):
# Get the 2D slice 'xt' from the 3D input 'x' at time step 't'
xt = None
# Update next hidden state, next memory state, compute the prediction, get the cache (≈1 line)
a_next, c_next, yt, cache = lstm_cell_forward(x[:,:,t], a_next, c_next, parameters)
# Save the value of the new "next" hidden state in a (≈1 line)
a[:,:,t] = a_next
# Save the value of the prediction in y (≈1 line)
y[:,:,t] = yt
# Save the value of the next cell state (≈1 line)
c[:,:,t]  = c_next
# Append the cache into caches (≈1 line)
caches.append(cache)

### END CODE HERE ###

# store values needed for backward propagation in cache
caches = (caches, x)

return a, y, c, caches


### 3.1 – Basic RNN backward pass

We will start by computing the backward pass for the basic RNN-cell and then in the following sections, iterate through the cells.

Figure 6: RNN-cell’s backward pass. Just like in a fully-connected neural network, the derivative of the cost function J backpropagates through the time steps of the RNN by following the chain-rule from calculus. Internal to the cell, the chain-rule is also used to calculate $$(\frac{\partial J}{\partial W_{ax}},\frac{\partial J}{\partial W_{aa}},\frac{\partial J}{\partial b})$$ to update the parameters $$(W_{ax}, W_{aa}, b_a)$$. The operation can utilize the cached results from the forward path.

Recall from lecture, the shorthand for the partial derivative of cost relative to a variable is dVariable. For example, $$\frac{\partial J}{\partial W_{ax}}$$ is $$dW_{ax}$$. This will be used throughout the remaining sections.

Figure 7: This implementation of rnn_cell_backward does not include the output dense layer and softmax which are included in rnn_cell_forward.
$$da_{next}$$ is $$\frac{\partial{J}}{\partial a^{\langle t \rangle}}$$ and includes loss from previous stages and current stage output logic. This addition will be part of your implementation of rnn_backward

##### Equations

To compute the rnn_cell_backward you can utilize the following equations. It is a good exercise to derive them by hand. Here, ∗∗ denotes element-wise multiplication while the absence of a symbol indicates matrix multiplication.

$$a^{\langle t \rangle} = \tanh(W_{ax} x^{\langle t \rangle} + W_{aa} a^{\langle t-1 \rangle} + b_{a})\tag{-}$$

$$\displaystyle \frac{\partial \tanh(x)} {\partial x} = 1 – \tanh^2(x) \tag{-}$$

$$\displaystyle {dW_{ax}} = da_{next} * ( 1-\tanh^2(W_{ax}x^{\langle t \rangle}+W_{aa} a^{\langle t-1 \rangle} + b_{a}) ) x^{\langle t \rangle T}\tag{1}$$

$$\displaystyle dW_{aa} = da_{next} * (( 1-\tanh^2(W_{ax}x^{\langle t \rangle}+W_{aa} a^{\langle t-1 \rangle} + b_{a}) ) a^{\langle t-1 \rangle T}\tag{2}$$

$$\displaystyle db_a = da_{next} * \sum_{batch}( 1-\tanh^2(W_{ax}x^{\langle t \rangle}+W_{aa} a^{\langle t-1 \rangle} + b_{a}) )\tag{3}$$

$$\displaystyle dx^{\langle t \rangle} = da_{next} * { W_{ax}}^T ( 1-\tanh^2(W_{ax}x^{\langle t \rangle}+W_{aa} a^{\langle t-1 \rangle} + b_{a}) )\tag{4}$$

$$\displaystyle da_{prev} = da_{next} * { W_{aa}}^T ( 1-\tanh^2(W_{ax}x^{\langle t \rangle}+W_{aa} a^{\langle t-1 \rangle} + b_{a}) )\tag{5}$$

def rnn_cell_backward(da_next, cache):
"""
Implements the backward pass for the RNN-cell (single time-step).

Arguments:
da_next -- Gradient of loss with respect to next hidden state
cache -- python dictionary containing useful values (output of rnn_cell_forward())

Returns:
dx -- Gradients of input data, of shape (n_x, m)
da_prev -- Gradients of previous hidden state, of shape (n_a, m)
dWax -- Gradients of input-to-hidden weights, of shape (n_a, n_x)
dWaa -- Gradients of hidden-to-hidden weights, of shape (n_a, n_a)
dba -- Gradients of bias vector, of shape (n_a, 1)
"""

# Retrieve values from cache
(a_next, a_prev, xt, parameters) = cache

# Retrieve values from parameters
Wax = parameters["Wax"]
Waa = parameters["Waa"]
Wya = parameters["Wya"]
ba = parameters["ba"]
by = parameters["by"]

### START CODE HERE ###
# compute the gradient of the loss with respect to z (optional) (≈1 line)
dz = (1- a_next**2) * da_next

# compute the gradient of the loss with respect to Wax (≈2 lines)
dxt = np.dot(Wax.T, dz)
dWax = np.dot(dz, xt.T)

# compute the gradient with respect to Waa (≈2 lines)
da_prev = np.dot(Waa.T, dz)
dWaa = np.dot(dz, a_prev.T)

# compute the gradient with respect to b (≈1 line)
dba = np.sum(dz, 1, keepdims=True)

### END CODE HERE ###

# Store the gradients in a python dictionary
gradients = {"dxt": dxt, "da_prev": da_prev, "dWax": dWax, "dWaa": dWaa, "dba": dba}


def rnn_backward(da, caches):
"""
Implement the backward pass for a RNN over an entire sequence of input data.

Arguments:
da -- Upstream gradients of all hidden states, of shape (n_a, m, T_x)
caches -- tuple containing information from the forward pass (rnn_forward)

Returns:
dx -- Gradient w.r.t. the input data, numpy-array of shape (n_x, m, T_x)
da0 -- Gradient w.r.t the initial hidden state, numpy-array of shape (n_a, m)
dWax -- Gradient w.r.t the input's weight matrix, numpy-array of shape (n_a, n_x)
dWaa -- Gradient w.r.t the hidden state's weight matrix, numpy-arrayof shape (n_a, n_a)
dba -- Gradient w.r.t the bias, of shape (n_a, 1)
"""

### START CODE HERE ###

# Retrieve values from the first cache (t=1) of caches (≈2 lines)
(caches, x) = caches
(a1, a0, x1, parameters) = caches[0]

# Retrieve dimensions from da's and x1's shapes (≈2 lines)
n_a, m, T_x = da.shape
n_x, m = x1.shape

# initialize the gradients with the right sizes (≈6 lines)
dx = np.zeros((n_x, m, T_x))
dWax = np.zeros((n_a, n_x))
dWaa = np.zeros((n_a, n_a))
dba = np.zeros((n_a, 1))
da0 = np.zeros((n_a, m))
da_prevt = np.zeros((n_a, m))

# Loop through all the time steps
for t in reversed(range(T_x)):
# Compute gradients at time step t. Choose wisely the "da_next" and the "cache" to use in the backward propagation step. (≈1 line)
gradients = rnn_cell_backward(da[:,:, t] + da_prevt, caches[t])
# Retrieve derivatives from gradients (≈ 1 line)
# Increment global derivatives w.r.t parameters by adding their derivative at time-step t (≈4 lines)
dx[:, :, t] = dxt
dWax += dWaxt
dWaa += dWaat
dba += dbat

# Set da0 to the gradient of a which has been backpropagated through all time-steps (≈1 line)
da0 = da_prevt
### END CODE HERE ###

# Store the gradients in a python dictionary
gradients = {"dx": dx, "da0": da0, "dWax": dWax, "dWaa": dWaa,"dba": dba}

##################################################
def lstm_cell_backward(da_next, dc_next, cache):
"""
Implement the backward pass for the LSTM-cell (single time-step).

Arguments:
da_next -- Gradients of next hidden state, of shape (n_a, m)
dc_next -- Gradients of next cell state, of shape (n_a, m)
cache -- cache storing information from the forward pass

Returns:
dxt -- Gradient of input data at time-step t, of shape (n_x, m)
da_prev -- Gradient w.r.t. the previous hidden state, numpy array of shape (n_a, m)
dc_prev -- Gradient w.r.t. the previous memory state, of shape (n_a, m, T_x)
dWf -- Gradient w.r.t. the weight matrix of the forget gate, numpy array of shape (n_a, n_a + n_x)
dWi -- Gradient w.r.t. the weight matrix of the update gate, numpy array of shape (n_a, n_a + n_x)
dWc -- Gradient w.r.t. the weight matrix of the memory gate, numpy array of shape (n_a, n_a + n_x)
dWo -- Gradient w.r.t. the weight matrix of the output gate, numpy array of shape (n_a, n_a + n_x)
dbf -- Gradient w.r.t. biases of the forget gate, of shape (n_a, 1)
dbi -- Gradient w.r.t. biases of the update gate, of shape (n_a, 1)
dbc -- Gradient w.r.t. biases of the memory gate, of shape (n_a, 1)
dbo -- Gradient w.r.t. biases of the output gate, of shape (n_a, 1)
"""

# Retrieve information from "cache"
(a_next, c_next, a_prev, c_prev, ft, it, cct, ot, xt, parameters) = cache

### START CODE HERE ###
# Retrieve dimensions from xt's and a_next's shape (≈2 lines)
n_x, m = xt.shape
n_a, m = a_next.shape

# Compute gates related derivatives, you can find their values can be found by looking carefully at equations (7) to (10) (≈4 lines)
dot = da_next * np.tanh(c_next) * ot * (1 - ot)
dcct = (dc_next * it + ot * (1 - np.square(np.tanh(c_next))) * it * da_next) * (1 - np.square(cct))
dit = (dc_next * cct + ot * (1 - np.square(np.tanh(c_next))) * cct * da_next) * it * (1 - it)
dft = (dc_next * c_prev + ot *(1 - np.square(np.tanh(c_next))) * c_prev * da_next) * ft * (1 - ft)

# Code equations (7) to (10) (≈4 lines)
##dit = None
##dft = None
##dot = None
##dcct = None
concat = np.concatenate((a_prev, xt), axis=0)

# Compute parameters related derivatives. Use equations (11)-(14) (≈8 lines)
dWf = np.dot(dft, concat.T)
dWi = np.dot(dit, concat.T)
dWc = np.dot(dcct, concat.T)
dWo = np.dot(dot, concat.T)
dbf = np.sum(dft, axis=1 ,keepdims = True)
dbi = np.sum(dit, axis=1, keepdims = True)
dbc = np.sum(dcct, axis=1,  keepdims = True)
dbo = np.sum(dot, axis=1, keepdims = True)

# Compute derivatives w.r.t previous hidden state, previous memory state and input. Use equations (15)-(17). (≈3 lines)
da_prev = np.dot(parameters['Wf'][:, :n_a].T, dft) + np.dot(parameters['Wi'][:, :n_a].T, dit) + np.dot(parameters['Wc'][:, :n_a].T, dcct) + np.dot(parameters['Wo'][:, :n_a].T, dot)
dc_prev = dc_next * ft + ot * (1 - np.square(np.tanh(c_next))) * ft * da_next
dxt = np.dot(parameters['Wf'][:, n_a:].T, dft) + np.dot(parameters['Wi'][:, n_a:].T, dit) + np.dot(parameters['Wc'][:, n_a:].T, dcct) + np.dot(parameters['Wo'][:, n_a:].T, dot)
### END CODE HERE ###

gradients = {"dxt": dxt, "da_prev": da_prev, "dc_prev": dc_prev, "dWf": dWf,"dbf": dbf, "dWi": dWi,"dbi": dbi,
"dWc": dWc,"dbc": dbc, "dWo": dWo,"dbo": dbo}

#########################################################
def lstm_backward(da, caches):

"""
Implement the backward pass for the RNN with LSTM-cell (over a whole sequence).

Arguments:
da -- Gradients w.r.t the hidden states, numpy-array of shape (n_a, m, T_x)
caches -- cache storing information from the forward pass (lstm_forward)

Returns:
dx -- Gradient of inputs, of shape (n_x, m, T_x)
da0 -- Gradient w.r.t. the previous hidden state, numpy array of shape (n_a, m)
dWf -- Gradient w.r.t. the weight matrix of the forget gate, numpy array of shape (n_a, n_a + n_x)
dWi -- Gradient w.r.t. the weight matrix of the update gate, numpy array of shape (n_a, n_a + n_x)
dWc -- Gradient w.r.t. the weight matrix of the memory gate, numpy array of shape (n_a, n_a + n_x)
dWo -- Gradient w.r.t. the weight matrix of the save gate, numpy array of shape (n_a, n_a + n_x)
dbf -- Gradient w.r.t. biases of the forget gate, of shape (n_a, 1)
dbi -- Gradient w.r.t. biases of the update gate, of shape (n_a, 1)
dbc -- Gradient w.r.t. biases of the memory gate, of shape (n_a, 1)
dbo -- Gradient w.r.t. biases of the save gate, of shape (n_a, 1)
"""

# Retrieve values from the first cache (t=1) of caches.
(caches, x) = caches
(a1, c1, a0, c0, f1, i1, cc1, o1, x1, parameters) = caches[0]

### START CODE HERE ###
# Retrieve dimensions from da's and x1's shapes (≈2 lines)
n_a, m, T_x = da.shape
n_x, m = x1.shape

# initialize the gradients with the right sizes (≈12 lines)
dx = np.zeros((n_x, m, T_x))
da0 = np.zeros((n_a, m))
da_prevt = np.zeros(da0.shape)
dc_prevt = np.zeros(da0.shape)
dWf = np.zeros((n_a, n_a + n_x))
dWi = np.zeros(dWf.shape)
dWc = np.zeros(dWf.shape)
dWo = np.zeros(dWf.shape)
dbf = np.zeros((n_a, 1))
dbi = np.zeros(dbf.shape)
dbc = np.zeros(dbf.shape)
dbo = np.zeros(dbf.shape)

# loop back over the whole sequence
for t in reversed(range(T_x)):
# Compute all gradients using lstm_cell_backward
gradients = lstm_cell_backward(da[:, :, t], dc_prevt, caches[t])
`