
# Sequence model

Examples of sequence data: speech, text, music, DNA sequences.

One way to build a neural-network model over word input is a classic MLP: map every word to a one-hot vector over the dictionary, giving a sparse input. But there are two problems: (1) inputs and outputs have different lengths across examples, which hurts model performance; (2) an MLP cannot learn relationships across positions — features learned at one word position are not shared with other positions.

## Recurrent NN

The images on the left and right are different renderings of the same idea. An RNN feeds each word $$x^i$$ into the network and produces a prediction $$y^i$$ and an extracted feature (hidden state) $$a^i$$, which is fed into the network at the next time step. At the beginning, researchers usually feed a random vector or a zero vector as the initial feature $$a^0$$. One shortcoming of this RNN is that when it makes a prediction at $$x^i$$, it only considers the information (words) before $$x^i$$ and ignores the information in $$x^{i+1} \dots x^n$$. However, when we analyze a sentence, the words from the whole sentence around the current position should be included.
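The step just described can be sketched in NumPy. This is a minimal illustration, not a full implementation; the weight names ($$W_{aa}, W_{ax}, W_{ya}$$) follow the course notation, and the dimensions are made up:

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over a vector."""
    e = np.exp(z - z.max())
    return e / e.sum()

def rnn_step(a_prev, x_t, W_aa, W_ax, W_ya, b_a, b_y):
    """One forward step of a vanilla RNN cell.

    a_prev : hidden state from the previous time step (a^{i-1})
    x_t    : current input word vector (x^i)
    Returns the new hidden state a^i and the prediction y^i.
    """
    a_t = np.tanh(W_aa @ a_prev + W_ax @ x_t + b_a)  # new hidden state
    y_t = softmax(W_ya @ a_t + b_y)                  # prediction at this step
    return a_t, y_t
```

Running the whole sequence is just calling `rnn_step` in a loop, starting from $$a^0$$ (zeros or random) and threading each returned state into the next call.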

$$W_{aa}$$ and $$W_{ax}$$ can be stacked together into one matrix so as to vectorize and speed up the algorithm. Why does NumPy vectorization give a speedup? Briefly: everything inside NumPy is written in C, whereas the same statement written in an interpreted language like Python expands into many more instructions.
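The stacking trick can be verified numerically: computing $$W_{aa}a + W_{ax}x$$ with two products gives the same result as one product with the stacked matrix $$W_a = [W_{aa}\,|\,W_{ax}]$$ applied to the concatenated vector $$[a; x]$$. The dimensions below are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(42)
n_a, n_x = 5, 3
W_aa = rng.normal(size=(n_a, n_a))
W_ax = rng.normal(size=(n_a, n_x))
a = rng.normal(size=n_a)
x = rng.normal(size=n_x)

# Two separate matrix-vector products...
separate = W_aa @ a + W_ax @ x

# ...equal one product with the stacked matrix W_a = [W_aa | W_ax]
W_a = np.concatenate([W_aa, W_ax], axis=1)
stacked = W_a @ np.concatenate([a, x])

assert np.allclose(separate, stacked)
```

One large matrix product lets the C-level BLAS routine do all the work in a single call instead of several.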

RNN architectures:

- many-to-many (input and output both consist of many words, e.g. translating a sentence)
- many-to-one (input consists of many words but the output is a single number, e.g. a 1–5 rating predicted from a review sentence)
- one-to-one
- one-to-many (e.g. music generation)

## Model building

Training:

1 Tokenize the sentences: form a vocabulary by mapping each word to a one-hot vector, and add an end-of-sentence (EOS) token (e.g. the punctuation mark "." serves as the label marking the end of a sentence).

2 Build up the architecture in the figure, then feed in the training set and labels.

3 Build up a loss function for each label and sum them together over the time steps.
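Written out, step 3 is the per-step cross-entropy loss summed over the sequence (using the document's superscript-index notation, with $$T_y$$ the output length):

$$\mathcal{L}^{t}(\hat{y}^{t}, y^{t}) = -\sum_i y_i^{t} \log \hat{y}_i^{t}, \qquad \mathcal{L} = \sum_{t=1}^{T_y} \mathcal{L}^{t}(\hat{y}^{t}, y^{t})$$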

Sampling:

4 The difference from the training step is that $$y^1 \dots y^n$$ are not the words of a sentence from the training set, but words generated (sampled) from the feature $$a^i$$ of the earlier steps, each fed back in as the next input.

5 Stop when a length threshold is reached, or when EOS is sampled.
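Steps 4 and 5 can be sketched as a sampling loop. Here `step_fn` is a placeholder standing in for one RNN forward step (it would wrap something like the `rnn_step` above); the interface is an assumption for illustration:

```python
import numpy as np

def sample(step_fn, a0, x0, eos_id, max_len=50, rng=None):
    """Generate token ids until EOS is sampled or a length threshold is hit.

    step_fn(a_prev, x_t) -> (a_t, probs) is one RNN forward step.
    The sampled token is fed back in as the next input (one-hot).
    """
    rng = rng or np.random.default_rng()
    a, x, out = a0, x0, []
    for _ in range(max_len):
        a, probs = step_fn(a, x)
        tok = rng.choice(len(probs), p=probs)  # sample, not argmax
        out.append(tok)
        if tok == eos_id:                      # EOS ends the sentence
            break
        x = np.zeros_like(x0)
        x[tok] = 1.0                           # feed the sampled word back in
    return out
```

Sampling from the softmax distribution (rather than always taking the argmax) is what makes each generated sentence different.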

## Vanishing gradients with RNN

Gradient explosion: gradient clipping (when the gradient is bigger than a threshold, rescale it down based on the maximum allowed value).

Gradient vanishing: the gated recurrent unit (GRU), an RNN unit that is a similar, simpler basic block than the LSTM.

The main refinement is a sigmoid function that gates how much of the earlier layers' state is let through. (I think the sigmoid activation plays a key role as a sifter in many units: the SE block, and some attention modules in GANs and CNNs.)

Because of its smaller structure (fewer gates and parameters than the LSTM), the GRU is more flexible and computationally efficient.
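A minimal sketch of the simplified GRU (the version without a relevance gate); the sigmoid update gate decides, per dimension, how much of the old memory cell to overwrite. Weight shapes and names are assumptions following the course-style notation:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_step(c_prev, x_t, W_c, W_u, b_c, b_u):
    """Simplified GRU step (no relevance gate)."""
    cx = np.concatenate([c_prev, x_t])
    c_tilde = np.tanh(W_c @ cx + b_c)        # candidate new memory
    gamma_u = sigmoid(W_u @ cx + b_u)        # update gate, in (0, 1)
    c_t = gamma_u * c_tilde + (1 - gamma_u) * c_prev
    return c_t
```

When the gate saturates near 0, the old memory `c_prev` passes through almost unchanged, which is exactly how the GRU keeps gradients from vanishing over long sequences.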

## BRNN

Drawback: we need the whole sequence before making a prediction anywhere. So in applications like speech recognition, we need to wait for the person to stop talking and capture the entire utterance before we can actually process the speech.
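The reason the whole sequence is needed is visible in a sketch of the bidirectional pass: the backward activations are computed right-to-left, so the last input must arrive before any prediction can be made. Here the step functions are placeholders standing in for RNN/GRU cells:

```python
import numpy as np

def brnn_forward(xs, step_fwd, step_bwd, a0_f, a0_b):
    """Bidirectional pass: requires the *whole* sequence xs up front."""
    T = len(xs)
    af, ab = [a0_f], [a0_b]
    for t in range(T):                # left-to-right activations
        af.append(step_fwd(af[-1], xs[t]))
    for t in reversed(range(T)):      # right-to-left activations
        ab.append(step_bwd(ab[-1], xs[t]))
    ab = ab[:0:-1]                    # reorder so ab[t] aligns with xs[t]
    # the feature at position t combines both directions
    return [np.concatenate([af[t + 1], ab[t]]) for t in range(T)]
```

The prediction at each position would then be computed from the concatenated feature, so every $$y^i$$ depends on words both before and after position $$i$$.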