Jun 01, 2021
This article is reproduced from the Zhihu personal column of Charles (Bai Lu).
Guilt over not posting for days led me to write an article today.
Like the previous post, "Python Plays CartPole," this is a simple example from PyTorch's official tutorials.
To show my sincerity, I'll once again cover the basic models used in this article in some depth: Seq2Seq and the Attention mechanism.
The content will still be quite long.
I hope it helps newcomers to NLP/deep learning.
Without further ado, let's get straight into the main topic.
Baidu Cloud download link:
https://pan.baidu.com/s/1y3KcMboz_xZJ9Afh5nRkUw
Password: qvhd
Official English tutorial link:
http://pytorch.org/tutorials/intermediate/seq2seq_translation_tutorial.html
In addition:
Students who have difficulty reading English material need not worry; I have translated this tutorial into Chinese and included it in the related files.
Development tools
System: Windows10
Python version: 3.6.4
Related modules:
torch module;
numpy module;
matplotlib module;
and some Python's own modules.
The PyTorch version is:
0.3.0
Install Python, add it to the PATH environment variable, and pip install the required modules.
Additional notes:
PyTorch does not support direct pip installations for the time being.
There are two options:
(1) Install Anaconda3 first, then install PyTorch inside the Anaconda3 environment (there a direct pip install works);
(2) Install from a pre-compiled whl file; the download link is:
https://pan.baidu.com/s/1dF6ayLr#list/path=%2Fpytorch
PS:
Some of the content draws on related blog posts and books.
(1) Single-layer network
The structure of a single-layer network looks like the following:
The input x is transformed as Wx + b and passed through the activation function f to get the output y.
Students who have a preliminary understanding of machine learning/deep learning will recognize that this is in fact a single-layer perceptron.
For convenience, let's draw it like this (please forgive my poor drawing skills):
x is the input vector, y is the output vector, and the arrow represents a transformation, i.e. y = f(Wx + b).
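To make this concrete, here is a minimal sketch of y = f(Wx + b) in PyTorch. The sizes (4 inputs, 3 outputs) and the choice of sigmoid as f are arbitrary assumptions for illustration, and it is written against a recent PyTorch API (under the 0.3.0 version used in this article you would additionally wrap tensors in torch.autograd.Variable):

import torch
import torch.nn as nn

# A single-layer network: y = f(Wx + b), with f = sigmoid here
layer = nn.Linear(4, 3)    # implements the affine map Wx + b
f = nn.Sigmoid()           # the activation function f

x = torch.randn(1, 4)      # one input vector x (batch of 1)
y = f(layer(x))            # y = f(Wx + b)
print(y)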
(2) Classic RNN
In practice, we encounter a lot of sequential data:
X1,X2,X3,X4...
For example, in our machine translation model, X1 can be thought of as the first word, X2 can be thought of as the second word, and so on.
The original neural networks did not handle sequential data well, so the savior RNN appeared. It introduces the concept of a hidden state h, uses h to extract features from the sequence, and then converts them into outputs. Here is a detailed description of how it is calculated (h0 in the figure below is an initial hidden state; for simplicity, assume it is a reasonable value chosen for the specific model):
where:
h1 = f(P·x1 + Q·h0 + b)
Again, all letters are vectors, and arrows represent a transformation of vectors.
h2 is calculated similarly to h1. The same parameters P, Q, and b are used at each step, which means the parameters are shared across steps:
where:
h2 = f(P·x2 + Q·h1 + b)
And so on (remember, the parameters are shared!). This calculation can continue indefinitely (it is not limited to the length of 4 shown in the figure).
So how do we get the RNN's output?
The output of the RNN is computed from h:
where:
y1 = Softmax(W·h1 + c)
Similarly, we get y2, y3, y4, ...:
Of course, as before, the parameters W and c here are shared.
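As a sanity check, here is a sketch of the classic RNN computation in plain numpy, using this article's parameter names: P, Q, b for the hidden-state update and W, c for the output. All sizes are illustrative assumptions:

import numpy as np

def softmax(v):
    e = np.exp(v - v.max())
    return e / e.sum()

hidden_size, input_size, output_size = 5, 3, 4

# Shared parameters: the SAME P, Q, b, W, c are used at every step
P = np.random.randn(hidden_size, input_size)
Q = np.random.randn(hidden_size, hidden_size)
b = np.random.randn(hidden_size)
W = np.random.randn(output_size, hidden_size)
c = np.random.randn(output_size)

xs = [np.random.randn(input_size) for _ in range(4)]  # x1..x4
h = np.zeros(hidden_size)                             # h0

for x in xs:
    h = np.tanh(P @ x + Q @ h + b)   # h_t = f(P*x_t + Q*h_{t-1} + b)
    y = softmax(W @ h + c)           # y_t = Softmax(W*h_t + c)
    print(y)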
This is the most classic RNN structure, and we can see that it has a fatal drawback:
The input and output sequences must be equal length!
This shortcoming means the classic RNN has a narrower range of applications than you might expect.
(3) Improving the classic RNN
Scenario 1 (input N, output 1):
Suppose the task requires us to input a sequence and output a single value.
Then we simply apply the output transformation to the last h:
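A minimal numpy sketch of this N-to-1 variant (same assumed notation and sizes as the sketch above): run the whole sequence first, then apply the output transformation only to the last h:

import numpy as np

hidden_size, input_size, output_size = 5, 3, 1

P = np.random.randn(hidden_size, input_size)
Q = np.random.randn(hidden_size, hidden_size)
b = np.random.randn(hidden_size)
W = np.random.randn(output_size, hidden_size)
c = np.random.randn(output_size)

xs = [np.random.randn(input_size) for _ in range(4)]  # the input sequence
h = np.zeros(hidden_size)
for x in xs:                        # consume the whole sequence first...
    h = np.tanh(P @ x + Q @ h + b)
y = W @ h + c                       # ...then transform only the last h
print(y)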
Scenario 2 (input 1, output N):
What happens when the input is a single number, not a sequence?
We can feed the input only at the start of the sequence:
Of course, you can also feed the input information x into every stage:
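And a sketch of the 1-to-N variant in which the single input x is fed at every stage (sizes again assumed):

import numpy as np

hidden_size, input_size, output_size = 5, 3, 4

P = np.random.randn(hidden_size, input_size)
Q = np.random.randn(hidden_size, hidden_size)
b = np.random.randn(hidden_size)
W = np.random.randn(output_size, hidden_size)
c = np.random.randn(output_size)

x = np.random.randn(input_size)     # a single input, not a sequence
h = np.zeros(hidden_size)
for _ in range(4):                  # produce an output sequence y1..y4
    h = np.tanh(P @ x + Q @ h + b)  # the same x is fed at every stage
    y = W @ h + c
    print(y)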
Scenario 3 (input N, output M):
This is one of the most important variants of RNN, and this structure is also known as:
The Encoder-Decoder model, or Seq2Seq model.
Our machine translation model is based on it.
The Seq2Seq structure first encodes the input data into a context vector c:
where:
c = h4, or c = V(h4), or c = V(h1, h2, h3, h4)
That is, the context vector c can simply equal the last hidden state, or it can be obtained by applying a transformation V to the last hidden state, or by applying a transformation to all of the hidden states, and so on.
The RNN structure above is generally referred to as the Encoder.
Once we have c, we need another RNN to decode it: the Decoder. One option is to feed c into the Decoder as the initial state h'0:
Of course, you can also use c as an input to the Decoder at every step:
One more note:
For the missing parts (e.g. some blue squares have no x input), you can treat the missing input as 0 and plug it into the classic RNN formulas listed earlier; the other cases are analogous.
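Putting the pieces together, here is a minimal Seq2Seq sketch in PyTorch, using the first option above: the Encoder's last hidden state serves as the context c, which becomes the Decoder's initial state h'0. The dimensions, sequence lengths, and the use of plain nn.RNN modules are illustrative assumptions, not the tutorial's exact code:

import torch
import torch.nn as nn

input_size, hidden_size, output_size = 8, 16, 10

encoder = nn.RNN(input_size, hidden_size)    # the Encoder RNN
decoder = nn.RNN(output_size, hidden_size)   # the Decoder RNN
out_layer = nn.Linear(hidden_size, output_size)

src = torch.randn(5, 1, input_size)   # source sequence, length N = 5
_, c = encoder(src)                   # context c = last hidden state

h = c                                 # c as the Decoder's initial h'0
y = torch.zeros(1, 1, output_size)    # a start-of-sequence placeholder
for _ in range(3):                    # target sequence, length M = 3
    out, h = decoder(y, h)
    y = out_layer(out)                # this step's output, next step's input
    print(y.shape)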
(4) Attention mechanism
In the Encoder-Decoder structure, the Encoder encodes the entire input sequence into a single semantic feature c and then decodes from it. When the input sequence is long, c may not be able to hold all of the information in the input sequence.
The Attention mechanism solves this problem well: it feeds a different c into the Decoder at each step:
Each c is generated from the hidden states h in the Encoder:
ci = ai1·h1 + ai2·h2 + ... + ain·hn
aij represents the relevance between hj, the Encoder's stage-j hidden state, and stage i of the Decoder.
So how are the weights aij determined? They are generally considered to depend on the Encoder's stage-j hidden state and the Decoder's stage-(i-1) hidden state.
For example, suppose we want to calculate the weights a1j:
Then we calculate the weights a2j:
And so on.
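Here is a minimal numpy sketch of this attention computation: score each Encoder hidden state hj against the Decoder's previous hidden state, softmax the scores into the weights aij, and take the weighted sum to get ci. The dot-product score used here is an assumption; real models use various learned scoring functions:

import numpy as np

def softmax(v):
    e = np.exp(v - v.max())
    return e / e.sum()

hidden_size, n = 5, 4
hs = np.random.randn(n, hidden_size)   # Encoder hidden states h1..hn
s_prev = np.random.randn(hidden_size)  # Decoder's stage-(i-1) hidden state

scores = hs @ s_prev   # relevance of each hj to the current Decoder stage
a = softmax(scores)    # the weights ai1..ain, which sum to 1
c_i = a @ hs           # ci = ai1*h1 + ai2*h2 + ... + ain*hn
print(a, c_i)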
(5) The final task: translating French into English
With the groundwork above, I believe everyone can understand the official tutorial.
Here I only give a brief introduction; for the detailed modeling and implementation process, refer to my translated copy of the official documents.
The Encoder network is:
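For reference, here is a sketch in the spirit of the official tutorial's EncoderRNN: each input word index is embedded and fed through a GRU one word at a time. It is written against a recent PyTorch API (under 0.3.0 you would use torch.autograd.Variable and torch.LongTensor), so treat it as a sketch rather than the tutorial's exact code:

import torch
import torch.nn as nn

class EncoderRNN(nn.Module):
    # Embed each input word index, then run it through a GRU
    def __init__(self, input_size, hidden_size):
        super(EncoderRNN, self).__init__()
        self.hidden_size = hidden_size
        self.embedding = nn.Embedding(input_size, hidden_size)
        self.gru = nn.GRU(hidden_size, hidden_size)

    def forward(self, word_idx, hidden):
        embedded = self.embedding(word_idx).view(1, 1, -1)
        output, hidden = self.gru(embedded, hidden)
        return output, hidden

encoder = EncoderRNN(input_size=100, hidden_size=32)  # vocab of 100 words
hidden = torch.zeros(1, 1, 32)                        # initial hidden state
out, hidden = encoder(torch.tensor([3]), hidden)      # one word at a time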
The Decoder network is:
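And a sketch in the spirit of the tutorial's attention Decoder (reconstructed from memory, so consult the tutorial or the translated documents for the authoritative version): the current word embedding and the previous hidden state produce attention weights over the Encoder outputs, whose weighted sum is combined with the embedding before the GRU step:

import torch
import torch.nn as nn
import torch.nn.functional as F

class AttnDecoderRNN(nn.Module):
    def __init__(self, hidden_size, output_size, max_length):
        super(AttnDecoderRNN, self).__init__()
        self.embedding = nn.Embedding(output_size, hidden_size)
        self.attn = nn.Linear(hidden_size * 2, max_length)   # attention weights
        self.attn_combine = nn.Linear(hidden_size * 2, hidden_size)
        self.gru = nn.GRU(hidden_size, hidden_size)
        self.out = nn.Linear(hidden_size, output_size)

    def forward(self, word_idx, hidden, encoder_outputs):
        embedded = self.embedding(word_idx).view(1, 1, -1)
        # Attention weights from the current embedding and previous hidden state
        attn_weights = F.softmax(
            self.attn(torch.cat((embedded[0], hidden[0]), 1)), dim=1)
        # Weighted sum of the Encoder outputs: the context for this step
        attn_applied = torch.bmm(attn_weights.unsqueeze(0),
                                 encoder_outputs.unsqueeze(0))
        output = torch.cat((embedded[0], attn_applied[0]), 1)
        output = F.relu(self.attn_combine(output).unsqueeze(0))
        output, hidden = self.gru(output, hidden)
        output = F.log_softmax(self.out(output[0]), dim=1)
        return output, hidden, attn_weights

decoder = AttnDecoderRNN(hidden_size=32, output_size=100, max_length=10)
enc_outs = torch.zeros(10, 32)   # placeholder Encoder outputs
hidden = torch.zeros(1, 1, 32)   # would be the Encoder's last hidden state
out, hidden, attn = decoder(torch.tensor([[0]]), hidden, enc_outs)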
The encoder's last hidden state is used as the decoder's initial hidden state. The weight calculation of the attention mechanism is similar to that described in (4). The structure of the GRU network is:
I will not introduce the GRU's structure in detail here; the article would get so long that I doubt anyone would finish it, so let's leave it at that.
In the related files I also provide 4 relevant papers for interested readers to study. (T_T They are in pure English.)
The results are shown below.
Run the Translation.py file in the cmd window.
Error curve:
Output of cmd window during training:
Model testing:
As a comparison:
It's exactly the same as the previous test result, right?!
Of course, some translation results are not very satisfactory, because the model and training data are too simple. (T_T No examples of those here.)
Attention diagram of the last four sentences:
That's all~~~
Interested students can further modify the model for better results; of course, you can also find other datasets and build, say, a Chinese-English translation model.