Jun 01, 2021
This article is reproduced from the WeChat official account: front-end troy students
An old question can be answered in a new way! This article does not cover the network stage that follows entering a URL; it only looks, from an algorithmic point of view, at how the page is parsed once it has been received. The process is divided into the following steps:
1. Building the DOM tree
2. Style calculation
3. Generating the layout tree (Layout Tree)
Because browsers cannot directly understand HTML strings, this stream of bytes is transformed into a meaningful and easy-to-manipulate data structure: the DOM tree. The DOM tree is essentially a multi-way tree with document as its root node.
So how is it parsed?
First, let's be clear: HTML grammar is not a context-free grammar.
Here it is worth explaining what a context-free grammar is.
Compiler theory in computer science gives a very clear definition:
If every production rule of a formal grammar G = (N, Σ, P, S) has the form V -> w, where V ∈ N and w ∈ (N ∪ Σ)*, then G is called a context-free grammar.
In the standard reading of G = (N, Σ, P, S), N is the set of nonterminals, Σ is the set of terminals, P is the set of production rules, and S ∈ N is the start symbol.
Put more plainly, a context-free grammar is one in which the left-hand side of every production is a single nonterminal.
If that is still a little confusing, an example will help.
For example:
A -> B
In this grammar the left-hand side of every production is a single nonterminal, so it is a context-free grammar. In this case, xBy can always be reduced to xAy.
Let's take a look at a counter-example:
aA -> B
Aa -> B
This is not a context-free grammar. When we encounter B, we cannot tell whether it can be reduced to A; that depends on whether a appears to its left or right. In other words, the result depends on context.
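The rule of thumb above (a single nonterminal on the left of every production) can be checked mechanically. The following is a minimal sketch, assuming productions are written as (left side, right side) tuples of symbols; the function name is_context_free and the symbol sets are hypothetical and only mirror the two examples above.

```python
def is_context_free(productions, nonterminals):
    """Return True if every production has exactly one nonterminal on its left side."""
    return all(
        len(lhs) == 1 and lhs[0] in nonterminals
        for lhs, _rhs in productions
    )


# Assumption for illustration: A and B are nonterminals, a is a terminal.
NONTERMINALS = {"A", "B"}

# A -> B
context_free = [(("A",), ("B",))]

# aA -> B and Aa -> B
context_sensitive = [(("a", "A"), ("B",)), (("A", "a"), ("B",))]

print(is_context_free(context_free, NONTERMINALS))       # True
print(is_context_free(context_sensitive, NONTERMINALS))  # False
```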
As for why HTML is not context-free: the first thing to note is that standard, well-formed HTML syntax does fit a context-free grammar. What makes HTML non-context-free is its handling of non-standard syntax.
A single counter-example is enough to show this.
For example, when the parser scans a form tag, a context-free grammar would simply create the DOM object for the form. In a real HTML5 scenario, however, the parser looks at the context of the form: if the parent tag of the form tag is also a form, the current form tag is skipped; otherwise the DOM object is created.
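The context check for form can be sketched as follows. This is a simplified illustration, not the parser's real logic: the function handle_start_tag and the list-based stack are hypothetical, and the sketch checks every open ancestor rather than only the direct parent, which is an assumption.

```python
def handle_start_tag(tag, open_elements):
    """Create an element for tag unless it is a form nested inside an open form.

    Returns True if the element is created, False if the tag is skipped.
    open_elements is the list of currently open tag names (the context).
    """
    if tag == "form" and "form" in open_elements:
        # The decision depends on the surrounding context, not on the tag alone.
        return False
    open_elements.append(tag)
    return True


stack = ["html", "body"]
print(handle_start_tag("form", stack))  # True: no open form yet
print(handle_start_tag("div", stack))   # True
print(handle_start_tag("form", stack))  # False: already inside a form
```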
Ordinary programming languages are context-free, whereas HTML is not. This is why an HTML parser cannot be built with the parsing techniques used for ordinary programming languages and needs a different approach.
The HTML5 specification describes the parsing algorithm in detail. The algorithm is divided into two phases: tokenization and tree construction.
These two phases correspond to lexical analysis and syntax analysis.
The tokenization algorithm takes HTML text as input and outputs HTML tags (tokens); the component that does this is called the tag generator (tokenizer).
It is implemented with a finite state machine: when one or more characters are received in the current state, the machine moves to the next state.
<html>
<body>
Hello sanyuan
</body>
</html>
The simple example above is used to walk through the tokenization process.
When a < is encountered, the state becomes tag open.
Receiving a character in [a-z] moves the machine into the tag name state.
This state is kept until a > is encountered, which indicates that the tag name has been fully recorded; the state then changes back to the data state.
The body tag that follows is handled in the same way.
At this point both the html and body tags have been recorded.
On reaching the > in <body>, the machine enters the data state and stays in it while receiving the characters Hello sanyuan.
Then it receives the < in </body>, goes back to the tag open state, and after receiving the next character, /, creates an end tag token.
It then enters the tag name state and, on encountering >, returns to the data state.
</html> is then processed in the same way.
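The walk-through above can be sketched as a small state machine. This is a minimal illustration that uses only the three states named in the text (data, tag open, tag name); the function name tokenize and the token tuples are hypothetical, attributes are not handled, and whitespace-only text is dropped for brevity.

```python
def tokenize(html):
    """Return a list of (kind, value) tokens for a tiny subset of HTML."""
    tokens = []
    state = "data"
    text = ""       # characters collected while in the data state
    name = ""       # tag name collected while in the tag name state
    is_end = False  # set when "/" follows "<"

    for ch in html:
        if state == "data":
            if ch == "<":
                if text.strip():
                    tokens.append(("text", text.strip()))
                text = ""
                state = "tag open"
            else:
                text += ch
        elif state == "tag open":
            if ch == "/":
                is_end = True
            elif ch.isalpha():
                name = ch
                state = "tag name"
        elif state == "tag name":
            if ch == ">":
                tokens.append(("end tag" if is_end else "start tag", name))
                name, is_end = "", False
                state = "data"
            else:
                name += ch
    return tokens


source = "<html>\n<body>\nHello sanyuan\n</body>\n</html>"
for token in tokenize(source):
    print(token)
```

For the example document this yields two start tags, one text token, and two end tags, in that order.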
As mentioned earlier, the DOM tree is a multi-way tree with document as its root node. So the parser first creates a document object. The tag generator sends information about each tag to the tree builder. When the tree builder receives a tag, it creates the corresponding DOM object.
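After creating this DOM object, the tree builder does two things:
1. It adds the DOM object to the DOM tree.
2. It pushes the corresponding tag onto the stack of open elements (open here is the opposite of closed, as in a closing tag).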
Take the same example again:
<html>
<body>
Hello sanyuan
</body>
</html>
At first, the state is the initial state.
When the tree builder receives the html tag from the tag generator, the state changes to before html. It creates an HTMLHtmlElement DOM element, adds it to the document root object, and pushes it onto the stack of open elements.
The state then automatically changes to before head. At this point the body tag arrives from the tag generator, which indicates that there is no head, so the tree builder automatically creates an HTMLHeadElement and adds it to the DOM tree.
It now enters the in head state and then jumps directly to after head.
Now the tag generator passes the body tag; the tree builder creates an HTMLBodyElement, inserts it into the DOM tree, and pushes it onto the stack of open elements.
The state then changes to in body, and the next series of characters is received: Hello sanyuan.
When the first character is received, a Text node is created, the character is inserted into it, and the Text node is inserted under the body element in the DOM tree. As later characters are received, they are appended to the Text node.
Now the tag generator passes the body end tag, and the state changes to after body.
Finally, the tag generator passes the html end tag, and the state changes to after after body, which indicates that parsing ends here.
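The tree-building steps above can be sketched with a stack of open elements. This is a simplified illustration under stated assumptions: the names Node, build_tree, and outline are hypothetical, the insertion modes (before html, in body, and so on) are collapsed into a single loop, and only the implicit head creation from the walk-through is modeled.

```python
class Node:
    def __init__(self, name, text=""):
        self.name = name
        self.text = text
        self.children = []


def build_tree(tokens):
    """Build a DOM-like tree from (kind, value) tokens."""
    document = Node("document")
    open_elements = [document]  # stack of open elements, document at the bottom
    seen_head = False

    for kind, value in tokens:
        parent = open_elements[-1]
        if kind == "start tag":
            if value == "head":
                seen_head = True
            elif value == "body" and not seen_head:
                # No head tag arrived, so a head element is created implicitly.
                parent.children.append(Node("head"))
                seen_head = True
            node = Node(value)
            parent.children.append(node)  # 1. add to the tree
            open_elements.append(node)    # 2. push onto the open-element stack
        elif kind == "text":
            last = parent.children[-1] if parent.children else None
            if last is not None and last.name == "#text":
                last.text += value        # later characters extend the Text node
            else:
                parent.children.append(Node("#text", value))
        elif kind == "end tag":
            # Pop open elements until the matching one has been removed.
            while len(open_elements) > 1:
                if open_elements.pop().name == value:
                    break
    return document


def outline(node):
    """Return a nested (label, children) view of the tree for inspection."""
    label = node.text if node.name == "#text" else node.name
    return (label, [outline(child) for child in node.children])


tokens = [
    ("start tag", "html"),
    ("start tag", "body"),
    ("text", "Hello sanyuan"),
    ("end tag", "body"),
    ("end tag", "html"),
]
print(outline(build_tree(tokens)))
```

The printed outline shows document at the root, html beneath it, and an implicitly created head next to body, which contains the text.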
Speaking of the HTML5 specification, it has to be said that it has a powerful tolerance strategy and is very fault tolerant. Opinions on this are mixed, but I think a senior front-end engineer needs to know what the HTML parser does for fault tolerance.
Below are some classic fault tolerance examples from WebKit; you are welcome to add others.
1. Using </br> instead of <br>
if (t->isCloseTag(brTag) && m_document->inCompatMode()) {
    reportError(MalformedBRError);
    t->beginTag = true;
}
It is always converted to the <br> form.
2. Stray (nested) tables
<table>
<table>
<tr><td>inner table</td></tr>
</table>
<tr><td>outer table</td></tr>
</table>
WebKit automatically converts this to:
<table>
<tr><td>outer table</td></tr>
</table>
<table>
<tr><td>inner table</td></tr>
</table>
3. Nested form elements
In this case the inner form is simply ignored.
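For CSS styles, there are generally three sources: stylesheets referenced by link tags, styles inside style tags, and the inline style attribute on elements.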
First, the browser cannot work with CSS text directly, so the first thing the rendering engine does when it receives CSS text is convert it into a structured object: styleSheets.
This formatting process is quite complex, and different browsers use different optimization strategies, so it is not covered here.
The final structure can be viewed in the browser console through document.styleSheets. This structure contains the three CSS sources above and provides the basis for the later style operations.
Some CSS values are not easy for the rendering engine to work with, so they need to be standardized before the style is calculated, for example em -> px, red -> #ff0000, bold -> 700, and so on.
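The three conversions above can be sketched as a small normalization function. This is a minimal illustration, not how any engine implements it: the name normalize, the lookup tables, and the 16px base font size are assumptions, and em is resolved against a single fixed font size for simplicity.

```python
NAMED_COLORS = {"red": "#ff0000"}
FONT_WEIGHTS = {"normal": "400", "bold": "700"}


def normalize(prop, value, font_size_px=16.0):
    """Convert a declared CSS value into a standardized form."""
    number = value[:-2]
    if value.endswith("em") and number.replace(".", "", 1).isdigit():
        # em is relative to a font size; 16px is an assumed base here.
        return f"{float(number) * font_size_px:g}px"
    if prop == "color":
        return NAMED_COLORS.get(value, value)
    if prop == "font-weight":
        return FONT_WEIGHTS.get(value, value)
    return value


print(normalize("width", "2em"))         # 32px
print(normalize("color", "red"))         # #ff0000
print(normalize("font-weight", "bold"))  # 700
```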
Once the styles have been formatted and standardized, the specific style information for each node can be calculated.
The calculation itself is not complicated; it mainly follows two rules: inheritance and cascading.
Each child node inherits its parent's style properties by default. If a property is not found on the parent node, the browser's default style, also known as the user agent style, is used.
This is the inheritance rule, and it is easy to understand.
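The inheritance rule as described can be sketched as a three-step lookup: the node's own declarations first, then the parent's computed style, then the user agent default. The name compute_style and the UA_DEFAULTS values are assumptions for illustration; real engines inherit only certain properties (color is one of them), which this sketch does not model.

```python
# Assumed user agent defaults, for illustration only.
UA_DEFAULTS = {"color": "#000000", "font-size": "16px"}


def compute_style(own, parent_computed):
    """Resolve each property: own value, else parent's, else the UA default."""
    return {
        prop: own.get(prop, parent_computed.get(prop, default))
        for prop, default in UA_DEFAULTS.items()
    }


body_style = compute_style({"color": "#ff0000"}, {})
p_style = compute_style({"font-size": "12px"}, body_style)

print(body_style)  # {'color': '#ff0000', 'font-size': '16px'}
print(p_style)     # {'color': '#ff0000', 'font-size': '12px'}
```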
Then there are the cascading rules. The defining feature of CSS is cascading: the final style depends on the combined effect of many properties, and some cascading behavior can look quite strange. Readers of the book "CSS World" will know this well. The specific cascading rules belong to an in-depth study of the CSS language and are not covered here.
It is worth noting, however, that once the styles have been calculated, the computed values can be read through window.getComputedStyle(element), which makes it convenient for JS to obtain the calculated style.
Now that the DOM tree and the DOM styles have been generated, the next step is to determine the position of each element through the browser's layout system, that is, to generate a layout tree (Layout Tree).
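The general work of layout tree generation is as follows:
1. Traverse the nodes of the generated DOM tree and add them to the layout tree.
2. Calculate the coordinates of the layout tree nodes.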
It is important to note that the layout tree contains only visible elements: the head tag and elements with display: none set are not placed in the layout tree.
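The filtering rule above can be sketched as a recursive traversal of the DOM tree. This is a minimal illustration that represents nodes as plain dictionaries; the function name build_layout_tree and the dictionary shape are hypothetical, and the coordinate calculation is left out.

```python
def build_layout_tree(node):
    """Return a layout node for a DOM node, or None if it is not rendered."""
    hidden = node.get("style", {}).get("display") == "none"
    if node["tag"] == "head" or hidden:
        return None  # the whole subtree is left out of the layout tree
    children = (build_layout_tree(child) for child in node.get("children", []))
    return {
        "tag": node["tag"],
        "children": [child for child in children if child is not None],
    }


dom = {
    "tag": "html",
    "children": [
        {"tag": "head", "children": [{"tag": "title"}]},
        {
            "tag": "body",
            "children": [
                {"tag": "p"},
                {"tag": "div", "style": {"display": "none"}},
            ],
        },
    ],
}
print(build_layout_tree(dom))
```

Only html, body, and p survive in the result; head and the hidden div are skipped together with their descendants.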
Some people say that a Render Tree is generated first, but that was the case before 2016. The Chrome team has since done a lot of refactoring, and there is no longer a step that generates a Render Tree. The layout tree carries very complete information and fully covers the role of the Render Tree.
The details of layout are not covered here because they are too complex, and going through them one by one would make the article bloated. Most of the time we only need to know what layout does. If you want to dig into the principles and see how it is done, I highly recommend the Renren FED team's article on how the browser performs layout, seen from the Chrome source code.
To recap the main thread of this section: the DOM tree is built, the styles are calculated, and the layout tree is generated.