
Various sequence-to-sequence architectures

Image captioning: a pretrained CNN extracts the image features, which are then fed into an RNN that generates the text output.

Machine translation can be viewed as a conditional language model. The main difference from a plain language model is the green part in the figure below: in a translation task, the input first goes through some NN layers (the encoder), and the rest of the network then generates the predicted translation.

Another difference is that we do not want a randomly generated result; we need to define an objective that picks the best output.
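Written out, "the best output" is usually taken to mean the most probable translation given the input. A sketch in the notation used later in this post, where \(T_y\) (an assumed symbol) is the output length:

\[
\hat{y} = \arg\max_{y^{<1>},\dots,y^{<T_y>}} P\left(y^{<1>},\dots,y^{<T_y>} \mid x\right)
\]

The number of possible sentences is far too large to enumerate, which is why an approximate search such as beam search is needed.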

Beam search

Greedy search is not good enough, since at each step it only keeps the single most probable next word. Beam search is an alternative.

French: “Jane visite l’Afrique en septembre.”
Translation 1: Jane is visiting Africa in September.
Translation 2: Jane is going to be visiting Africa in September.

For beam search, we need to set the beam width B (in the figure, B=3).

First, calculate \(P(y^{<1>}|x)\) and keep the three words with the highest probability: {in, jane, september}. Then, for each of them, calculate \(P(y^{<1>},y^{<2>}|x)\) and keep the three word pairs with the highest probability. We get {in september, jane is, jane visits}. The third word is obtained in the same way (second figure above).

In short, the difference between beam search and greedy search is that in every step (iteration) beam search considers more candidate results (e.g. with B=10, it keeps the 10 most probable results). If B=1, beam search is the same as greedy search.
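The steps above can be sketched in a few lines of Python. This is only a minimal illustration, not the course's implementation: `beam_search`, `toy_model`, and `TOY_TABLE` are hypothetical names, and the probabilities in the table are made up so that greedy search and beam search disagree.

```python
import math

def beam_search(step_log_probs, beam_width, max_len, eos="<eos>"):
    """Return the (words, log-probability) pair with the highest score.

    step_log_probs(prefix) must return a dict mapping each possible
    next word to its log-probability given the prefix.
    """
    beams = [((), 0.0)]   # partial hypotheses: (prefix, cumulative log-prob)
    finished = []         # hypotheses that already ended with eos
    for _ in range(max_len):
        # Expand every kept hypothesis by every possible next word.
        candidates = []
        for prefix, score in beams:
            for word, logp in step_log_probs(prefix).items():
                candidates.append((prefix + (word,), score + logp))
        # Keep only the B most probable expansions.
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for prefix, score in candidates[:beam_width]:
            if prefix[-1] == eos:
                finished.append((prefix, score))
            else:
                beams.append((prefix, score))
        if not beams:
            break
    finished.extend(beams)  # hypotheses cut off at max_len
    return max(finished, key=lambda c: c[1])

# Made-up conditional probabilities P(next word | prefix), for illustration only.
TOY_TABLE = {
    (): {"jane": 0.6, "in": 0.4},
    ("jane",): {"is": 0.55, "visits": 0.45},
    ("in",): {"september": 1.0},
    ("jane", "is"): {"<eos>": 1.0},
    ("jane", "visits"): {"<eos>": 1.0},
    ("in", "september"): {"<eos>": 1.0},
}

def toy_model(prefix):
    return {w: math.log(p) for w, p in TOY_TABLE[prefix].items()}

greedy_words, greedy_score = beam_search(toy_model, beam_width=1, max_len=3)
beam_words, beam_score = beam_search(toy_model, beam_width=2, max_len=3)
```

With B=1 the search commits to "jane" at the first step and ends with a total probability of 0.33, while B=2 also keeps "in" alive and finds a sequence with probability 0.4. A real decoder would get `step_log_probs` from the RNN rather than from a lookup table.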

Error analysis:

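1. Beam search tries to find the best \(\hat{y}\) that satisfies the equation in the third figure. Let \(y^*\) be the human reference translation. If \(P(\hat{y}|x)<P(y^*|x)\), beam search failed to find an output that the RNN itself considers more probable, so beam search needs to be refined.
2. If \(P(\hat{y}|x) \geq P(y^*|x)\), the RNN gives the worse output a probability at least as high as the reference, so the RNN does not predict the best output and we need to refine the RNN.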
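The two cases can be turned into a small bookkeeping helper that is run over the mistakes in a dev set. This is a sketch under the assumption that the model can score both sentences and return \(\log P(\hat{y}|x)\) and \(\log P(y^*|x)\); the function names are hypothetical.

```python
from collections import Counter

def attribute_error(logp_beam_output, logp_reference):
    """Decide which component to blame for one mistranslated example."""
    if logp_beam_output < logp_reference:
        # The RNN prefers the reference, but beam search did not find it.
        return "beam search"
    # The RNN scores the worse output at least as high as the reference.
    return "RNN"

def error_breakdown(examples):
    """examples: iterable of (logp_beam_output, logp_reference) pairs."""
    return Counter(attribute_error(b, r) for b, r in examples)

# Made-up log-probabilities for three dev-set mistakes.
breakdown = error_breakdown([(-7.0, -5.0), (-4.0, -6.0), (-3.0, -3.0)])
```

Whichever component collects most of the blame is the one worth spending time on.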


BLEU (bilingual evaluation understudy) score is an evaluation metric for how good a machine translation model is.

Precision: the fraction of words in the machine translation output that appear in the reference (ground truth). In the figure, the generated sentence contains seven ‘the’, and each one appears in the ground-truth sentence, so the precision is 7/7. Modified precision credits each word only up to the maximum number of times it appears in a single ground-truth sentence. In the figure, ‘the’ appears at most 2 times in a ground-truth sentence. Thus, the maximum credit for ‘the’ is 2 and the modified precision is 2/7.
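The 7/7 versus 2/7 arithmetic can be checked with a short sketch. The sentences below are assumed to be the ones in the figure (they are the well-known example from the BLEU paper), and `unigram_precisions` is a hypothetical helper name.

```python
from collections import Counter

def unigram_precisions(candidate, references):
    """Return (plain precision, modified precision) for single words."""
    cand = candidate.lower().split()
    refs = [r.lower().split() for r in references]

    # Plain precision: every output word found in any reference counts.
    ref_vocab = {w for r in refs for w in r}
    plain = sum(1 for w in cand if w in ref_vocab) / len(cand)

    # Modified precision: credit for a word is capped by its highest
    # count in any single reference.
    clipped = 0
    for word, count in Counter(cand).items():
        max_ref_count = max(Counter(r)[word] for r in refs)
        clipped += min(count, max_ref_count)
    return plain, clipped / len(cand)

plain, modified = unigram_precisions(
    "the the the the the the the",
    ["The cat is on the mat", "There is a cat on the mat"],
)
```

Here `plain` is 7/7 and `modified` is 2/7, because ‘the’ occurs at most twice in one reference.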

Bigram: a pair of words appearing next to each other.

n-gram precision:
When we calculate the BLEU score over groups of two words (e.g. ‘the cat’, ‘cat the’, …), we call them bigrams; groups of three words are trigrams, and groups of n words are n-grams.


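\(Count^{clip}_{w_i,j}\): the clipped (truncated) count of word \(w_i\) for the j-th reference, i.e. \(\min(Count_{w_i}, Ref_j\_Count_{w_i})\).
\(Count_{w_i}\): the number of times word \(w_i\) appears in the output sentence (e.g. for \(w_i\)=‘the’ the count is 3 in the figure).
\(Ref_j\_Count_{w_i}\): the number of times word \(w_i\) appears in the j-th reference (e.g. in reference 1, the count of ‘the’ is 2).
\(Count^{clip}\): the final clipped count of the word over all references, i.e. the maximum of \(Count^{clip}_{w_i,j}\) over j.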
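These quantities map directly onto a few lines of Python. The sketch below is an illustration, not a reference implementation: the function names are hypothetical, and the example sentences are assumed to be the ones from the course figure.

```python
from collections import Counter

def ngrams(tokens, n):
    """All groups of n adjacent tokens, in order."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def clipped_counts(candidate, references, n):
    count = Counter(ngrams(candidate, n))                      # Count
    ref_counts = [Counter(ngrams(r, n)) for r in references]   # Ref_j_Count
    clip = {}
    for gram, c in count.items():
        per_ref = [min(c, rc[gram]) for rc in ref_counts]      # Count^clip for each j
        clip[gram] = max(per_ref)                              # Count^clip
    return count, clip

def modified_precision(candidate, references, n):
    count, clip = clipped_counts(candidate, references, n)
    total = sum(count.values())
    return sum(clip.values()) / total if total else 0.0

candidate = "the cat the cat on the mat".split()
references = [
    "the cat is on the mat".split(),
    "there is a cat on the mat".split(),
]
uni_count, uni_clip = clipped_counts(candidate, references, 1)
p1 = modified_precision(candidate, references, 1)
p2 = modified_precision(candidate, references, 2)
```

For this example ‘the’ has a count of 3 and a clipped count of 2, the unigram precision is 5/7, and the bigram precision is 4/6.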

Modified n-gram precision on blocks of text
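For a whole test corpus, the BLEU paper sums the clipped counts over all output sentences before dividing. Written in the notation above, with \(C\) ranging over the output (candidate) sentences:

\[
p_n = \frac{\sum_{C \in \text{Candidates}} \; \sum_{n\text{-gram} \in C} Count^{clip}(n\text{-gram})}{\sum_{C' \in \text{Candidates}} \; \sum_{n\text{-gram}' \in C'} Count(n\text{-gram}')}
\]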

Brevity penalty: since a very short output can reach a high precision too easily, a penalty is applied to encourage longer sentences.
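As defined in the BLEU paper, with \(c\) the length of the output and \(r\) the effective reference length, the brevity penalty is:

\[
BP = \begin{cases} 1 & \text{if } c > r \\ e^{\,1 - r/c} & \text{if } c \leq r \end{cases}
\]

It is combined with the modified n-gram precisions \(p_n\) into the final score, where the weights \(w_n\) are commonly \(1/N\) with \(N=4\):

\[
BLEU = BP \cdot \exp\left(\sum_{n=1}^{N} w_n \log p_n\right)
\]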

Attention model

Here, I want to express my great gratitude to Andrew Ng. I learned a lot from his courses about TensorFlow, CNNs, NLP, and machine learning. All of these build the groundwork for my future study. Thanks a lot!


  1. https://www.aclweb.org/anthology/P02-1040.pdf
  2. Andrew Ng's course

