
Python crawlers: a quick start with XPath syntax


May 31, 2021


What is XPath?

XPath, the XML Path Language, is a language for locating parts of an XML document (XML being a subset of SGML, the Standard Generalized Markup Language). Built on XML's tree structure, XPath provides the ability to find nodes in that tree. XPath was originally proposed as a common syntax shared by XPointer and XSL, but it was quickly adopted by developers as a small query language in its own right.

Simply put: XPath is a language for finding information in XML and HTML documents, and it can be used to traverse the elements and attributes of those documents.

XPath development tools

Here are two small browser tools that are widely used and convenient for writing and testing XPath expressions:

  • the Chrome extension XPath Helper (installing it from the Chrome Web Store may require a VPN in some regions);
  • the Firefox extension Try XPath.

If you cannot reach the Chrome Web Store, the XPath Helper extension can also be installed from a downloaded package obtained through an extension download site. The steps are as follows:

  1. Open the download site and download the extension package;
  2. Extract the extension contents to the desktop; a new folder will appear there;
  3. Move the folder to wherever you want to keep it;
  4. Open Google Chrome, go to Extensions, turn on Developer mode, choose "Load unpacked", and select that folder.

XPath node

In XPath there are seven kinds of nodes: element, attribute, text, namespace, processing-instruction, comment, and document (root) nodes. An XML document is treated as a node tree, and the root of the tree is called the document node or root node.
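To make these node types concrete, here is a minimal sketch using the lxml library introduced later in this article; the XML document and the names in it are made up purely for illustration.

from lxml import etree

# a tiny, made-up XML document containing several kinds of nodes
xml = '''<bookstore>
    <!-- inventory -->
    <book lang="en"><title>XPath Basics</title></book>
</bookstore>'''

root = etree.fromstring(xml)          # the parsed tree; <bookstore> is the root element
print(root.xpath('//book'))           # element nodes
print(root.xpath('//book/@lang'))     # attribute nodes -> ['en']
print(root.xpath('//title/text()'))   # text nodes      -> ['XPath Basics']
print(root.xpath('//comment()'))      # comment nodes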

XPath syntax

XPath uses path expressions to select nodes or node sets in an XML document. Nodes are selected by following a path or step.

How to use it:

Use // to select elements anywhere in the page, follow it with the tag name, and then add a predicate to filter the results, for example:

//title[@lang='en']

Points to be aware of:

  • The difference between / and //: / selects only direct child nodes, while // selects all descendants (children, grandchildren, and so on). In practice // is used more often, but it depends on the situation (see the sketch after this list);
  • contains: sometimes an attribute holds more than one value, or you only know part of it, in which case you can use the contains() function, as follows:

//title[contains(@lang,'en')]

  • Subscripts in predicates start at 1, not 0.
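These points can be verified with a few lines of lxml (the library introduced in the next section). A minimal sketch, using a made-up snippet purely for illustration:

from lxml import etree

# made-up snippet used only to illustrate the three points above
html = etree.HTML('<ul><li lang="en-US">a</li><li lang="fr">b</li></ul>')

print(html.xpath('/html/body/ul/li'))              # / walks direct children level by level
print(html.xpath('//li'))                          # // finds li elements anywhere in the document
print(html.xpath('//li[contains(@lang, "en")]'))   # contains() matches part of an attribute value
print(html.xpath('//li[1]/text()'))                # ['a'] -- predicate subscripts start at 1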

lxml library

lxml is an HTML/XML parser; its main job is to parse HTML/XML and extract data from it.

lxml is a third-party library and is installed with pip, using the following command:

pip install lxml

Basic use:

The following HTML is used in the examples below; you can save it to a file (for example hello.html) and practice along.

<div>
    <ul>
        <li class="im-0"><a href="link1.html"> first </a></li>
        <li class="im-1"><a href="link2.html"> second </a></li>
        <li class="im-active"><a href="link3.html"> third </a></li>
        <li class="im-1"><a href="link4.html"> fourth </a></li>
        <li class="im-0"><a href="link5.html"> fifth </a></li>
    </ul>
</div>

Case 1: Parse a string into an HTML document

from lxml import etree

# the sample HTML snippet from above
text = '''<div>
    <ul>
        <li class="im-0"><a href="link1.html"> first </a></li>
        <li class="im-1"><a href="link2.html"> second </a></li>
        <li class="im-active"><a href="link3.html"> third </a></li>
        <li class="im-1"><a href="link4.html"> fourth </a></li>
        <li class="im-0"><a href="link5.html"> fifth </a></li>
    </ul>
</div>'''

html = etree.HTML(text)   # parse the string into an Element
print(html)

# serialize the Element back into an HTML string
# (note: the HTML parser completes missing tags such as <html> and <body>)
result = etree.tostring(html).decode('utf-8')
print(result)

Case 2: Read the HTML code from a file

from lxml import etree

# read the file saved earlier; the HTMLParser tolerates HTML that is not well-formed XML
html = etree.parse('hello.html', etree.HTMLParser())

# serialize the parsed tree back into an HTML string
result = etree.tostring(html).decode('utf-8')
print(result)

Case 3: Use XPath syntax in lxml

from lxml import etree

html = etree.parse('hello.html', etree.HTMLParser())

# Get all li tags:
# result = html.xpath('//li')
# print(result)
# for i in result:
#     print(etree.tostring(i))

# Get the class attribute values of all li elements:
# result = html.xpath('//li/@class')
# print(result)

# Get the a tag under li whose href attribute equals https://www.w3cschool.cn/ :
# result = html.xpath('//li/a[@href="https://www.w3cschool.cn/"]')
# print(result)

# Get all span tags under the li tags:
# result = html.xpath('//li//span')
# print(result)

# Get all class attribute values of the a tags under li:
# result = html.xpath('//li/a//@class')
# print(result)

# Get the href attribute value of the last li:
# result = html.xpath('//li[last()]/a/@href')
# print(result)

# Get the contents of the penultimate li element:
# result = html.xpath('//li[last()-1]/a')
# print(result)
# print(result[0].text)

# Get the contents of the penultimate li element using the text() function:
result = html.xpath('//li[last()-1]/a/text()')
print(result)
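Putting it all together, a typical crawler first downloads a page and then extracts data from it with XPath. Below is a minimal sketch that pairs lxml with the third-party requests library (installed with pip install requests); the URL, headers, and expressions are placeholders to be adapted to whatever page you actually want to crawl.

import requests
from lxml import etree

# placeholder URL -- replace it with the page you actually want to crawl
url = 'https://example.com/'
headers = {'User-Agent': 'Mozilla/5.0'}    # many sites refuse requests without a User-Agent

response = requests.get(url, headers=headers, timeout=10)
response.raise_for_status()

html = etree.HTML(response.text)           # parse the downloaded page

# placeholder expressions -- adjust them to the structure of the target page
titles = html.xpath('//h1/text()')
links = html.xpath('//a/@href')

print(titles)
print(links[:10])

Whatever the target site, the pattern stays the same: fetch the page, parse it with etree.HTML, and extract what you need with XPath expressions.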

Summary

What is learned on paper always feels shallow; to truly understand something, you must practice it yourself. Every bit of effort brings its own harvest.

Recommended courses: Python static web crawlers, Python Scrapy web crawlers.