
Character level language model – Dinosaurus Island

1.2 – Overview of the model

Your model will have the following structure:

  • Initialize parameters
  • Run the optimization loop
    • Forward propagation to compute the loss function
    • Backward propagation to compute the gradients with respect to the loss function
    • Clip the gradients to avoid exploding gradients
    • Using the gradients, update your parameters with the gradient descent update rule.
  • Return the learned parameters

Figure 1: Recurrent Neural Network, similar to what you had built in the previous notebook “Building a Recurrent Neural Network – Step by Step”.

  • At each time-step, the RNN tries to predict the next character given the previous characters.
  • The dataset \(\mathbf{X} = (x^{\langle 1 \rangle}, x^{\langle 2 \rangle}, …, x^{\langle T_x \rangle})\) is a list of characters in the training set.
  • \(\mathbf{Y} = (y^{\langle 1 \rangle}, y^{\langle 2 \rangle}, …, y^{\langle T_x \rangle})\) is the same list of characters but shifted one character forward.
  • At every time-step t, \(y^{\langle t \rangle} = x^{\langle t+1 \rangle}\): the label at time t is the same as the input at time t + 1 (see the short sketch below).
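For instance, here is a minimal sketch of this shift for one training name, using the same construction that the model() code further below uses (the name "turiasaurus" and the a–z plus "\n" vocabulary are illustrative assumptions):

# Illustrative only: a hypothetical name and an a-z + "\n" vocabulary.
chars = sorted(["\n"] + list("abcdefghijklmnopqrstuvwxyz"))
char_to_ix = {ch: i for i, ch in enumerate(chars)}

name = "turiasaurus"                               # hypothetical training example
X = [None] + [char_to_ix[ch] for ch in name]       # None stands in for x<1> = zero vector
Y = X[1:] + [char_to_ix["\n"]]                     # Y is X shifted left by one, ending with "\n"

assert all(Y[t] == X[t + 1] for t in range(len(X) - 1))   # y<t> = x<t+1>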

2.1 – Clipping the gradients in the optimization loop

In this section you will implement the clip function that you will call inside of your optimization loop.

Exploding gradients

  • When gradients are very large, they’re called “exploding gradients.”
  • Exploding gradients make training difficult, because the updates may be so large that they “overshoot” the optimal parameter values.

Recall that your overall loop structure usually consists of:

  • forward pass,
  • cost computation,
  • backward pass,
  • parameter update.

Before updating the parameters, you will perform gradient clipping to make sure that your gradients are not “exploding.”

Gradient clipping

In the exercise below, you will implement a function clip that takes in a dictionary of gradients and returns a clipped version of gradients if needed.

  • There are different ways to clip gradients.
  • We will use a simple element-wise clipping procedure, in which every element of the gradient vector is clipped to lie within some range [-N, N].
  • For example, if N = 10 (see the short numpy illustration after this list):
    • The range is [-10, 10]
    • If any component of the gradient vector is greater than 10, it is set to 10.
    • If any component of the gradient vector is less than -10, it is set to -10.
    • If any components are between -10 and 10, they keep their original values.
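A quick numpy illustration of exactly this case (the values are arbitrary):

import numpy as np

g = np.array([-15.0, -3.2, 0.0, 7.5, 42.0])
np.clip(g, -10, 10, out=g)   # element-wise clip to the range [-10, 10], done in place
# g is now: [-10., -3.2, 0., 7.5, 10.]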

Figure 2: Visualization of gradient descent with and without gradient clipping, in a case where the network is running into “exploding gradient” problems.

Exercise: Implement the function below to return the clipped gradients of your dictionary gradients.

  • Your function takes in a maximum threshold and returns the clipped versions of the gradients.
  • You can check out numpy.clip.
    • You will need to use the argument out = ....
    • Using the out parameter allows you to update a variable “in place”.
    • If you don’t use the out argument, the clipped result is stored in the loop variable gradient but does not update the gradient variables dWax, dWaa, dWya, db, dby.

Example (np.clip):

>>> a = np.arange(10)
>>> np.clip(a, 1, 8)
array([1, 1, 2, 3, 4, 5, 6, 7, 8, 8])
>>> a
array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
>>> np.clip(a, 3, 6, out=a)
array([3, 3, 3, 3, 4, 5, 6, 6, 6, 6])
>>> a = np.arange(10)
>>> a
array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
>>> np.clip(a, [3, 4, 1, 1, 1, 4, 4, 4, 4, 4], 8)
array([3, 4, 2, 3, 4, 5, 6, 7, 8, 8])
##################################################
# GRADED FUNCTION: clip

import numpy as np   # numpy is used by every code cell below

def clip(gradients, maxValue):
    '''
    Clips the gradients' values between minimum and maximum.
    
    Arguments:
    gradients -- a dictionary containing the gradients "dWaa", "dWax", "dWya", "db", "dby"
    maxValue -- everything above this number is set to this number, and everything less than -maxValue is set to -maxValue
    
    Returns: 
    gradients -- a dictionary with the clipped gradients.
    '''
    
    dWaa, dWax, dWya, db, dby = gradients['dWaa'], gradients['dWax'], gradients['dWya'], gradients['db'], gradients['dby']
   
    ### START CODE HERE ###
    # clip to mitigate exploding gradients, loop over [dWax, dWaa, dWya, db, dby]. (≈2 lines)
    for gradient in [dWax, dWaa, dWya, db, dby]:
        np.clip(gradient, -maxValue, maxValue, out=gradient)
    ### END CODE HERE ###
    
    gradients = {"dWaa": dWaa, "dWax": dWax, "dWya": dWya, "db": db, "dby": dby}
    
    return gradients
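As a quick sanity check of clip (the shapes and seed here are illustrative, not the assignment's official test cell):

import numpy as np

np.random.seed(3)
gradients = {
    "dWax": np.random.randn(5, 3) * 10,
    "dWaa": np.random.randn(5, 5) * 10,
    "dWya": np.random.randn(2, 5) * 10,
    "db":   np.random.randn(5, 1) * 10,
    "dby":  np.random.randn(2, 1) * 10,
}

clipped = clip(gradients, 10)
# After clipping, every entry lies in [-10, 10].
assert all(np.all(np.abs(g) <= 10) for g in clipped.values())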

2.2 – Sampling

Now assume that your model is trained. You would like to generate new text (characters). The process of generation is explained in the picture below:

Figure 3: In this picture, we assume the model is already trained. We pass in \(x^{\langle 1\rangle} = \vec{0}\) at the first time step, and have the network sample one character at a time.

Exercise: Implement the sample function below to sample characters. You need to carry out 4 steps:

  • Step 1: Input the “dummy” vector of zeros \(x^{\langle 1 \rangle} = \vec{0}\).
    • This is the default input before we’ve generated any characters.
      We also set \(a^{\langle 0 \rangle} = \vec{0}\)
  • Step 2: Run one step of forward propagation to get \(a^{\langle 1 \rangle}\) and \(\hat{y}^{\langle 1 \rangle}\). Here are the equations:

hidden state:
$$ a^{\langle t+1 \rangle} = \tanh(W_{ax} x^{\langle t+1 \rangle } + W_{aa} a^{\langle t \rangle } + b)\tag{1}$$

activation:
$$ z^{\langle t + 1 \rangle } = W_{ya} a^{\langle t + 1 \rangle } + b_y \tag{2}$$

prediction:
$$ \hat{y}^{\langle t+1 \rangle } = softmax(z^{\langle t + 1 \rangle })\tag{3}$$

  • Details about \(\hat{y}^{\langle t+1 \rangle }\):
    • \(\hat{y}^{\langle t+1 \rangle }\) is a (softmax) probability vector: its entries are between 0 and 1 and sum to 1.
    • \(\hat{y}^{\langle t+1 \rangle}_i\) represents the probability that the character indexed by “i” is the next character.
    • We have provided a softmax() function that you can use (a minimal version is sketched just before the code below).
  • Step 3: Sampling:
    • Now that we have \(\hat{y}^{\langle t+1 \rangle}\), we want to select the next letter in the dinosaur name. If we always select the most probable character, the model will always generate the same name given a starting letter.
      • To make the results more interesting, we will use np.random.choice to select a next letter that is likely, but not always the same.
    • Sampling is the selection of a value from a group of values, where each value has a probability of being picked.
    • Sampling allows us to generate random sequences of values.
    • Pick the next character’s index according to the probability distribution specified by \(\hat{y}^{\langle t+1 \rangle }\).
    • This means that if \(\hat{y}^{\langle t+1 \rangle }_i = 0.16\), you will pick the index “i” with 16% probability.
    • You can use np.random.choice. Example of how to use np.random.choice():

      np.random.seed(0)
      probs = np.array([0.1, 0.0, 0.7, 0.2])
      idx = np.random.choice([0, 1, 2, 3], p = probs)

    • This means that you will pick the index (idx) according to the distribution: \(P(index = 0) = 0.1, P(index = 1) = 0.0, P(index = 2) = 0.7, P(index = 3) = 0.2\).
    • Note that the value passed to p must be a 1D vector.
    • Also notice that \(\hat{y}^{\langle t+1 \rangle}\), which is y in the code, is a 2D array; you can flatten it with y.ravel() before passing it to p.
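The softmax() helper comes with the assignment's utility file and is not reproduced in this post. A minimal numerically stable version would look roughly like this (an assumption about that helper, not necessarily the course's exact code):

import numpy as np

def softmax(z):
    # Subtract the max for numerical stability, then normalize the exponentials
    # so the resulting column sums to 1.
    e = np.exp(z - np.max(z))
    return e / e.sum(axis=0)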
# GRADED FUNCTION: sample

def sample(parameters, char_to_ix, seed):
    """
    Sample a sequence of characters according to a sequence of probability distributions output by the RNN

    Arguments:
    parameters -- python dictionary containing the parameters Waa, Wax, Wya, by, and b. 
    char_to_ix -- python dictionary mapping each character to an index.
    seed -- used for grading purposes. Do not worry about it.

    Returns:
    indices -- a list of length n containing the indices of the sampled characters.
    """
    
    # Retrieve parameters and relevant shapes from "parameters" dictionary
    Waa, Wax, Wya, by, b = parameters['Waa'], parameters['Wax'], parameters['Wya'], parameters['by'], parameters['b']
    vocab_size = by.shape[0]
    n_a = Waa.shape[1]
    
    ### START CODE HERE ###
    # Step 1: Create a zero vector x that can be used as the one-hot vector
    # representing the first character (initializing the sequence generation). (≈1 line)
    x = np.zeros((vocab_size, 1))
    # Step 1': Initialize a_prev as zeros (≈1 line)
    a_prev = np.zeros((n_a, 1))
    
    # Create an empty list of indices, this is the list which will contain the list of indices of the characters to generate (≈1 line)
    indices = []
    
    # idx is the index of the one-hot vector x that is set to 1
    # All other positions in x are zero.
    # We will initialize idx to -1
    idx = -1 
    
    # Loop over time-steps t. At each time-step:
    # sample a character from a probability distribution 
    # and append its index (`idx`) to the list "indices". 
    # We'll stop if we reach 50 characters 
    # (which should be very unlikely with a well trained model).
    # Setting the maximum number of characters helps with debugging and prevents infinite loops. 
    counter = 0
    newline_character = char_to_ix['\n']
    
    while (idx != newline_character and counter != 50):
        
        # Step 2: Forward propagate x using the equations (1), (2) and (3)
        a = np.tanh(np.dot(Wax, x) + np.dot(Waa, a_prev) + b)
        z = np.dot(Wya, a) + by
        y = softmax(z)
        
        # for grading purposes
        np.random.seed(counter+seed) 
        
        # Step 3: Sample the index of a character within the vocabulary from the probability distribution y
        # (see additional hints above)
        idx = np.random.choice(list(range(vocab_size)), p = y.ravel())

        # Append the index to "indices"
        indices.append(idx)
        
        # Step 4: Overwrite the input x with one that corresponds to the sampled index `idx`.
        # (see additional hints above)
        x = np.zeros((vocab_size, 1))
        x[idx] = 1
        
        # Update "a_prev" to be "a"
        a_prev = a
        
        # for grading purposes
        seed += 1
        counter +=1
        
    ### END CODE HERE ###

    if (counter == 50):
        indices.append(char_to_ix['\n'])
    
    return indices
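An illustrative call, with randomly initialized parameters and a 27-character vocabulary (a hedged usage sketch, not the assignment's grading cell; it assumes numpy and a softmax() helper such as the sketch above are in scope):

np.random.seed(2)
vocab_size, n_a = 27, 100

chars = sorted(["\n"] + list("abcdefghijklmnopqrstuvwxyz"))
char_to_ix = {ch: i for i, ch in enumerate(chars)}
ix_to_char = {i: ch for ch, i in char_to_ix.items()}

parameters = {
    "Wax": np.random.randn(n_a, vocab_size),
    "Waa": np.random.randn(n_a, n_a),
    "Wya": np.random.randn(vocab_size, n_a),
    "b":   np.random.randn(n_a, 1),
    "by":  np.random.randn(vocab_size, 1),
}

indices = sample(parameters, char_to_ix, seed=0)
print("sampled indices:", indices)
print("sampled text:", "".join(ix_to_char[i] for i in indices))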
##########################################
# GRADED FUNCTION: optimize

def optimize(X, Y, a_prev, parameters, learning_rate = 0.01):
    """
    Execute one step of the optimization to train the model.
    
    Arguments:
    X -- list of integers, where each integer is a number that maps to a character in the vocabulary.
    Y -- list of integers, exactly the same as X but shifted one index to the left.
    a_prev -- previous hidden state.
    parameters -- python dictionary containing:
                        Wax -- Weight matrix multiplying the input, numpy array of shape (n_a, n_x)
                        Waa -- Weight matrix multiplying the hidden state, numpy array of shape (n_a, n_a)
                        Wya -- Weight matrix relating the hidden-state to the output, numpy array of shape (n_y, n_a)
                        b --  Bias, numpy array of shape (n_a, 1)
                        by -- Bias relating the hidden-state to the output, numpy array of shape (n_y, 1)
    learning_rate -- learning rate for the model.
    
    Returns:
    loss -- value of the loss function (cross-entropy)
    gradients -- python dictionary containing:
                        dWax -- Gradients of input-to-hidden weights, of shape (n_a, n_x)
                        dWaa -- Gradients of hidden-to-hidden weights, of shape (n_a, n_a)
                        dWya -- Gradients of hidden-to-output weights, of shape (n_y, n_a)
                        db -- Gradients of bias vector, of shape (n_a, 1)
                        dby -- Gradients of output bias vector, of shape (n_y, 1)
    a[len(X)-1] -- the last hidden state, of shape (n_a, 1)
    """
    
    ### START CODE HERE ###
    
    # Forward propagate through time (≈1 line)
    loss, cache = rnn_forward(X, Y, a_prev, parameters)
    
    # Backpropagate through time (≈1 line)
    gradients, a = rnn_backward(X, Y, parameters, cache)
    
    # Clip your gradients between -5 (min) and 5 (max) (≈1 line)
    gradients = clip(gradients, maxValue=5)
    
    # Update parameters (≈1 line)
    parameters = update_parameters(parameters, gradients, learning_rate)
    
    ### END CODE HERE ###
    
    return loss, gradients, a[len(X)-1]
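optimize() calls the helpers rnn_forward, rnn_backward and update_parameters that ship with the assignment and are not reproduced in this post. For orientation, the gradient descent update rule that update_parameters applies amounts to roughly the following (a sketch under that assumption, with a hypothetical name, not the course's exact helper):

def update_parameters_sketch(parameters, gradients, lr):
    # Hypothetical stand-in for the course's update_parameters helper:
    # plain gradient descent, parameter <- parameter - lr * gradient.
    for name in ["Wax", "Waa", "Wya", "b", "by"]:
        parameters[name] -= lr * gradients["d" + name]
    return parameters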
####################################################
# GRADED FUNCTION: model

def model(data, ix_to_char, char_to_ix, num_iterations = 35000, n_a = 50, dino_names = 7, vocab_size = 27):
    """
    Trains the model and generates dinosaur names. 
    
    Arguments:
    data -- text corpus
    ix_to_char -- dictionary that maps the index to a character
    char_to_ix -- dictionary that maps a character to an index
    num_iterations -- number of iterations to train the model for
    n_a -- number of units of the RNN cell
    dino_names -- number of dinosaur names you want to sample at each iteration. 
    vocab_size -- number of unique characters found in the text (size of the vocabulary)
    
    Returns:
    parameters -- learned parameters
    """
    
    # Retrieve n_x and n_y from vocab_size
    n_x, n_y = vocab_size, vocab_size
    
    # Initialize parameters
    parameters = initialize_parameters(n_a, n_x, n_y)
    
    # Initialize loss (this is required because we want to smooth our loss)
    loss = get_initial_loss(vocab_size, dino_names)
    
    # Build list of all dinosaur names (training examples).
    with open("dinos.txt") as f:
        examples = f.readlines()
    examples = [x.lower().strip() for x in examples]
    
    # Shuffle list of all dinosaur names
    np.random.seed(0)
    np.random.shuffle(examples)
    
    # Initialize the hidden state of your RNN
    a_prev = np.zeros((n_a, 1))
    
    # Optimization loop
    for j in range(num_iterations):
        
        ### START CODE HERE ###
        
        # Set the index `idx` (see instructions above)
        idx = j % len(examples)
        
        # Set the input X (see instructions above):
        # prepend None (interpreted by rnn_forward as the zero vector x<1>)
        # to the character indices of the idx-th training example.
        single_example = examples[idx]
        single_example_chars = [c for c in single_example]
        single_example_ix = [char_to_ix[c] for c in single_example_chars]
        X = [None] + single_example_ix
        
        # Set the labels Y (see instructions above):
        # Y is X shifted one position to the left, with the newline index appended.
        ix_newline = char_to_ix["\n"]
        Y = X[1:] + [ix_newline]
        
        # Perform one optimization step: Forward-prop -> Backward-prop -> Clip -> Update parameters
        # Choose a learning rate of 0.01
        curr_loss, gradients, a_prev = optimize(X, Y, a_prev, parameters, learning_rate = 0.01)
        
        ### END CODE HERE ###
        
        # Smooth the loss with a running (exponentially weighted) average so the reported loss curve is less noisy.
        loss = smooth(loss, curr_loss)

        # Every 2000 iterations, generate `dino_names` sample names with sample() to check whether the model is learning properly
        if j % 2000 == 0:
            
            print('Iteration: %d, Loss: %f' % (j, loss) + '\n')
            
            # The number of dinosaur names to print
            seed = 0
            for name in range(dino_names):
                
                # Sample indices and print them
                sampled_indices = sample(parameters, char_to_ix, seed)
                print_sample(sampled_indices, ix_to_char)
                
                seed += 1  # To get the same result (for grading purposes), increment the seed by one. 
      
            print('\n')
        
    return parameters
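For completeness, the surrounding notebook builds the vocabulary from dinos.txt and then calls model() roughly like this (a sketch consistent with the file and dictionaries used above; the exact setup cell may differ):

data = open("dinos.txt", "r").read().lower()
chars = sorted(list(set(data)))                    # "\n" plus the 26 lowercase letters
char_to_ix = {ch: i for i, ch in enumerate(chars)}
ix_to_char = {i: ch for i, ch in enumerate(chars)}

parameters = model(data, ix_to_char, char_to_ix)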
