May 31, 2021
XPath, the XML Path Language, is a language for locating parts of an XML document (XML being a subset of the Standard Generalized Markup Language). XPath is based on XML's tree structure and provides the ability to find nodes in the data tree. It was originally proposed as a common syntax model shared by XPointer and XSL, but developers quickly adopted it as a small query language.
Simply put, XPath is a language for finding information in XML and HTML documents, and it can be used to traverse the elements and attributes in those documents.
Here are two of the most widely used and convenient tools:
The installation package for the Chrome plug-in XPath Helper can also be found and installed through plug-in partner; the installation steps are as follows:
In XPath, there are seven types of nodes: element, attribute, text, namespace, processing-instruction, comment, and document (root) nodes. XML documents are treated as trees of nodes. The root of the tree is called the document node or root node.
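As a rough illustration of these node types, the sketch below uses lxml (the parser this article installs further down) on a tiny made-up document. The document and the variable names are assumptions for illustration only, and namespace nodes are left out to keep it short.

```python
from lxml import etree

# Made-up document containing several node types
xml = '<root id="r"><!-- a comment --><?target data?><item>hello</item></root>'
root = etree.fromstring(xml)

# "*" matches element nodes only
elements = [e.tag for e in root.xpath('//*')]
# "@id" matches attribute nodes named id
attributes = root.xpath('//@id')
# text() matches text nodes
texts = root.xpath('//text()')
# comment() and processing-instruction() match their own node types
comments = [c.text for c in root.xpath('//comment()')]
instructions = [pi.target for pi in root.xpath('//processing-instruction()')]

print(elements)      # ['root', 'item']
print(attributes)    # ['r']
print(texts)         # ['hello']
print(comments)      # [' a comment ']
print(instructions)  # ['target']
```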
XPath uses path expressions to select nodes or node sets in an XML document. Nodes are selected by following a path or step.
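A minimal sketch of a few path expressions, run with lxml on a made-up bookstore document (the document content and variable names are assumptions, not part of the original article):

```python
from lxml import etree

# A small, made-up XML document used only for this illustration
xml = """
<bookstore>
  <book><title lang="en">Harry Potter</title><price>29.99</price></book>
  <book><title lang="zh">Journey to the West</title><price>39.95</price></book>
</bookstore>
"""
root = etree.fromstring(xml)

# "/" walks down from the root one step at a time
titles_abs = root.xpath('/bookstore/book/title')
# "//" selects matching nodes anywhere in the document
titles_any = root.xpath('//title')
# "@" selects an attribute, text() selects text nodes
langs = root.xpath('//title/@lang')
prices = root.xpath('//price/text()')

print(len(titles_abs), len(titles_any))  # 2 2
print(langs)                             # ['en', 'zh']
print(prices)                            # ['29.99', '39.95']
```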
How to use it:
Use // to select elements anywhere in the page, then write the tag name, and then add a predicate to filter the matches, such as:
//title[@lang='en']
Points to be aware of: to match an attribute whose value only needs to contain a given string, use the contains() function:
//title[contains(@lang,'en')]
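To see how the two predicates above differ, here is a small sketch using lxml on a made-up document in which one lang value is exactly "en" and another merely contains it. The sample data is an assumption for illustration.

```python
from lxml import etree

# Made-up sample: "en" matches exactly once, but is contained in two values
xml = """
<books>
  <title lang="en">A</title>
  <title lang="en-US">B</title>
  <title lang="zh">C</title>
</books>
"""
root = etree.fromstring(xml)

# Exact match on the attribute value
exact = root.xpath("//title[@lang='en']/text()")
# Substring match on the attribute value
partial = root.xpath("//title[contains(@lang,'en')]/text()")

print(exact)    # ['A']
print(partial)  # ['A', 'B']
```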
lxml is an HTML/XML parser whose primary function is to parse HTML/XML and extract data from it.
As a third-party library, it is installed with pip using the following command:
pip install lxml
Basic use:
The following is the HTML file used in the cases below; readers can save it as hello.html and practice along.
<div>
<ul>
<li class="im-0"><a href="link1.html"> first </a></li>
<li class="im-1"><a href="link2.html"> second </a></li>
<li class="im-active"><a href="link3.html"> third </a></li>
<li class="im-1"><a href="link4.html"> fourth </a></li>
<li class="im-0"><a href="link5.html"> fifth </a></li>
</ul>
</div>
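Case 1: Parse a string into an HTML document
from lxml import etree
# The sample HTML from above, abbreviated here to two list items
text = '''
<div><ul>
<li class="im-0"><a href="link1.html"> first </a></li>
<li class="im-1"><a href="link2.html"> second </a></li>
</ul></div>
'''
html = etree.HTML(text)  # parse the string with the HTML parser
print(html)
# Serialize the HTML document to a string
result = etree.tostring(html).decode('utf-8')
print(result)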
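The HTML parser behind etree.HTML is tolerant of broken markup and wraps fragments in html and body elements. The sketch below shows this on a made-up fragment with a deliberately unclosed li tag; the fragment and variable names are assumptions for illustration.

```python
from lxml import etree

# Made-up fragment: the second <li> is never closed
broken = '<ul><li>first</li><li>second</ul>'
root = etree.HTML(broken)

# The parser adds <html> and <body> and closes the open <li>
fixed = etree.tostring(root).decode('utf-8')
items = root.xpath('//li/text()')

print(root.tag)  # html
print(fixed)
print(items)     # ['first', 'second']
```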
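Case 2: Read HTML code from a file:
from lxml import etree
# Read the file; the HTML parser tolerates markup that is not well-formed XML
html = etree.parse('hello.html', etree.HTMLParser())
# Serialize the HTML document to a string
result = etree.tostring(html).decode('utf-8')
print(result)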
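One thing worth knowing when reading files: without an explicit parser, etree.parse treats the input as XML and rejects markup that is not well formed. The sketch below writes a made-up file (sample.html in a temporary directory, both assumptions) and compares the default parser with etree.HTMLParser.

```python
import os
import tempfile
from lxml import etree

with tempfile.TemporaryDirectory() as tmp:
    # Hypothetical file with an unclosed <li>, written just for this demo
    path = os.path.join(tmp, 'sample.html')
    with open(path, 'w', encoding='utf-8') as f:
        f.write('<ul><li>first</li><li>second</ul>')

    # The default parser expects well-formed XML and raises an error
    try:
        etree.parse(path)
        xml_ok = True
    except etree.XMLSyntaxError:
        xml_ok = False

    # The HTML parser recovers from the missing closing tag
    tree = etree.parse(path, etree.HTMLParser())
    items = tree.xpath('//li/text()')

print(xml_ok)  # False
print(items)   # ['first', 'second']
```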
Case 3: Use XPath syntax in lxml
from lxml import etree
html = etree.parse('hello.html', etree.HTMLParser())
# Get all li tags:
# result = html.xpath('//li')
# print(result)
# for i in result:
#     print(etree.tostring(i))
# Get the class attribute values of all li elements:
# result = html.xpath('//li/@class')
# print(result)
# Get the a tag under li whose href is link1.html:
# result = html.xpath('//li/a[@href="link1.html"]')
# print(result)
# Get all span tags under li tags:
# result = html.xpath('//li//span')
# print(result)
# Get all class attributes inside the a tags under li tags:
# result = html.xpath('//li/a//@class')
# print(result)
# Get the href attribute value of the a tag in the last li:
# result = html.xpath('//li[last()]/a/@href')
# print(result)
# Get the content of the penultimate li element:
# result = html.xpath('//li[last()-1]/a')
# print(result)
# print(result[0].text)
# A second way to get the content of the penultimate li element:
result = html.xpath('//li[last()-1]/a/text()')
print(result)
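For readers who want to check the results without creating hello.html, the sketch below inlines a copy of the sample list as a string and runs several of the queries from Case 3. Parsing from a string with etree.HTML and the variable names are choices made for this illustration.

```python
from lxml import etree

# A copy of the sample list, inlined so the snippet runs without hello.html
text = '''
<div>
<ul>
<li class="im-0"><a href="link1.html"> first </a></li>
<li class="im-1"><a href="link2.html"> second </a></li>
<li class="im-active"><a href="link3.html"> third </a></li>
<li class="im-1"><a href="link4.html"> fourth </a></li>
<li class="im-0"><a href="link5.html"> fifth </a></li>
</ul>
</div>
'''
html = etree.HTML(text)

li_count = len(html.xpath('//li'))
classes = html.xpath('//li/@class')
last_href = html.xpath('//li[last()]/a/@href')
penultimate = html.xpath('//li[last()-1]/a/text()')
spans = html.xpath('//li//span')

print(li_count)     # 5
print(classes)      # ['im-0', 'im-1', 'im-active', 'im-1', 'im-0']
print(last_href)    # ['link5.html']
print(penultimate)  # [' fourth ']
print(spans)        # [] because the sample contains no span tags
```

Note that text() keeps the surrounding spaces from the markup; calling .strip() on each result removes them.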
What is learned on paper always feels shallow; to truly understand something, you have to practice it yourself. No pains, no gains.
Recommended lessons: Python static crawlers, Python Scrapy web crawlers.