Python3 XML parsing


What is XML?

XML stands for eXtensible Markup Language, a subset of the Standard Generalized Markup Language (SGML). It is a markup language used to tag electronic documents so that they have structure. You can learn more from the XML tutorial on this site.

XML is designed to transfer and store data.

XML is a set of rules for defining semantic tags that divide a document into parts and identify those parts.

It is also a meta-markup language: it defines the syntax used to define other semantic, structured markup languages specific to a particular domain.


Parsing XML with Python

DOM and SAX are the common XML programming interfaces. They process XML files in different ways and therefore suit different situations.

Python offers three ways to parse XML: SAX, DOM, and ElementTree.

1. SAX (Simple API for XML)

The Python standard library includes a SAX parser. SAX uses an event-driven model: as it parses the XML, it fires events one at a time and calls user-defined callback functions to process the file.

2. DOM (Document Object Model)

The XML data is parsed into a tree held in memory, and the XML is manipulated by operating on that tree.
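As a rough illustration of the tree-based approach, the sketch below uses xml.dom.minidom from the standard library. The inline SAMPLE string and the movie_titles helper are hypothetical stand-ins chosen for this example; a real program would more likely call minidom.parse on a file.

```python
from xml.dom import minidom

# Hypothetical inline document so the sketch runs without any file on disk.
SAMPLE = """<collection shelf="New Arrivals">
<movie title="Enemy Behind">
   <type>War, Thriller</type>
</movie>
<movie title="Transformers">
   <type>Anime, Science Fiction</type>
</movie>
</collection>"""


def movie_titles(xml_text):
    """Parse the whole document into an in-memory tree, then walk the tree."""
    dom = minidom.parseString(xml_text)
    root = dom.documentElement  # the <collection> element
    return [m.getAttribute("title") for m in root.getElementsByTagName("movie")]


titles = movie_titles(SAMPLE)
print(titles)  # ['Enemy Behind', 'Transformers']
```

Because the entire tree is built before any query runs, this style is convenient for small documents but costs memory on large ones.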

The sample XML file used in this section, movies.xml, is as follows:

<collection shelf="New Arrivals">
<movie title="Enemy Behind">
   <type>War, Thriller</type>
   <format>DVD</format>
   <year>2003</year>
   <rating>PG</rating>
   <stars>10</stars>
   <description>Talk about a US-Japan war</description>
</movie>
<movie title="Transformers">
   <type>Anime, Science Fiction</type>
   <format>DVD</format>
   <year>1989</year>
   <rating>R</rating>
   <stars>8</stars>
   <description>A scientific fiction</description>
</movie>
<movie title="Trigun">
   <type>Anime, Action</type>
   <format>DVD</format>
   <episodes>4</episodes>
   <rating>PG</rating>
   <stars>10</stars>
   <description>Vash the Stampede!</description>
</movie>
<movie title="Ishtar">
   <type>Comedy</type>
   <format>VHS</format>
   <rating>PG</rating>
   <stars>2</stars>
   <description>Viewable boredom</description>
</movie>
</collection>

Parsing XML with SAX in Python

SAX is an event-driven API.

Parsing XML documents with SAX involves two parts: the parser and the event handler.

The parser reads the XML document and sends events to the event handler, such as element-start and element-end events.

The event handler, in turn, responds to those events and processes the XML data passed along with them.

SAX is a good fit in the following situations:

  • Processing large documents.
  • Needing only part of a file, or only specific information from it.
  • Wanting to build your own object model.

To handle XML with SAX in Python, first import the parse function from xml.sax and the ContentHandler class from xml.sax.handler.
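A minimal sketch of those two imports in use is shown below. The MovieCounter class and the inline xml_bytes document are hypothetical, introduced only to keep the example self-contained; the in-memory stream stands in for movies.xml.

```python
import io
from xml.sax import parse
from xml.sax.handler import ContentHandler


class MovieCounter(ContentHandler):
    """Hypothetical handler that counts <movie> start tags."""

    def __init__(self):
        super().__init__()
        self.count = 0

    def startElement(self, name, attrs):
        # Called by the parser once for every start tag it reads.
        if name == "movie":
            self.count += 1


# Inline stand-in for movies.xml (an assumption made for this sketch).
xml_bytes = b'<collection><movie title="Enemy Behind"/><movie title="Ishtar"/></collection>'

counter = MovieCounter()
parse(io.BytesIO(xml_bytes), counter)  # parse accepts a file name or a stream
print(counter.count)  # 2
```

The handler never sees the whole document at once; it only reacts to events as the parser reports them.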

ContentHandler class methods

characters(content) method

When it is called:

Starting from the beginning of a line, if there are characters before a tag is encountered, content holds those characters.

From one tag until the next tag is encountered, if there are characters in between, content holds those characters.

From a tag until a line-end character is encountered, if there are characters in between, content holds those characters.

The tag can be either a start tag or an end tag.
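Since characters can fire more than once for a single run of text (and also fires for whitespace between tags), one common pattern is to buffer the pieces and join them when the element closes. The sketch below assumes a hypothetical TextCollector handler that is only interested in type elements.

```python
import xml.sax
from xml.sax.handler import ContentHandler


class TextCollector(ContentHandler):
    """Hypothetical handler that gathers the text inside each <type> element."""

    def __init__(self):
        super().__init__()
        self.in_type = False
        self.buffer = []
        self.types = []

    def startElement(self, name, attrs):
        if name == "type":
            self.in_type = True
            self.buffer = []

    def characters(self, content):
        # The parser may deliver one run of text in several chunks, so append.
        if self.in_type:
            self.buffer.append(content)

    def endElement(self, name):
        if name == "type":
            self.types.append("".join(self.buffer))
            self.in_type = False


handler = TextCollector()
xml.sax.parseString(b"<movie>\n   <type>War, Thriller</type>\n</movie>", handler)
print(handler.types)  # ['War, Thriller']
```

The in_type flag keeps the newline and indentation around the tags out of the result, because those characters arrive while no type element is open.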

startDocument() method

Called when the document starts.

endDocument() method

Called when the parser reaches the end of the document.

startElement(name, attrs) method

Called when an XML start tag is encountered; name is the tag name and attrs is a dictionary-like object holding the tag's attributes.

endElement(name) method

Called when an XML end tag is encountered.
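To see the order in which these callbacks fire, the following sketch records each event in a list. EventLogger and the one-movie inline document are hypothetical, chosen only for illustration.

```python
import xml.sax
from xml.sax.handler import ContentHandler


class EventLogger(ContentHandler):
    """Hypothetical handler that records every document and element event."""

    def __init__(self):
        super().__init__()
        self.events = []

    def startDocument(self):
        self.events.append("startDocument")

    def endDocument(self):
        self.events.append("endDocument")

    def startElement(self, name, attrs):
        # attrs maps attribute names to values; copy it into a plain dict.
        self.events.append(("start", name, dict(attrs.items())))

    def endElement(self, name):
        self.events.append(("end", name))


logger = EventLogger()
xml.sax.parseString(b'<movie title="Ishtar"><stars>2</stars></movie>', logger)
for event in logger.events:
    print(event)
```

The start and end events nest in the same way as the tags in the document, which is what lets a handler track its current position with a simple flag or stack.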


make_parser method

The following method creates a new parser object and returns it.

xml.sax.make_parser([parser_list])

Description of the parameters:

  • parser_list - optional; a list of names of parser modules to try before the default parsers
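A short sketch of the parser object in use follows, assuming the default parser that ships with the standard library. The TitleHandler class and the inline document are hypothetical.

```python
import io
import xml.sax
from xml.sax.handler import ContentHandler, feature_namespaces


class TitleHandler(ContentHandler):
    """Hypothetical handler that collects the title attribute of each <movie>."""

    def __init__(self):
        super().__init__()
        self.titles = []

    def startElement(self, name, attrs):
        if name == "movie":
            self.titles.append(attrs["title"])


parser = xml.sax.make_parser()                # no parser_list: use the default parser
parser.setFeature(feature_namespaces, False)  # turn off namespace processing
handler = TitleHandler()
parser.setContentHandler(handler)             # attach the event handler
parser.parse(io.BytesIO(b'<collection><movie title="Trigun"/></collection>'))
print(handler.titles)  # ['Trigun']
```

Creating the parser explicitly is mainly useful when features need to be configured before parsing; otherwise the module-level parse function does the same setup in one call.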

parse method

The following function creates a SAX parser and uses it to parse an XML document:

xml.sax.parse(xmlfile, contenthandler[, errorhandler])

Description of the parameters:

  • xmlfile - the XML file name
  • contenthandler - must be a ContentHandler object
  • errorhandler - if specified, must be a SAX ErrorHandler object
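The sketch below passes all three arguments, using an in-memory stream in place of a file name. RecordingErrorHandler and try_parse are hypothetical names for this example; the handler notes the message of a fatal error and then re-raises it, which is an assumption about how an application might want to react, not a required pattern.

```python
import io
import xml.sax
from xml.sax.handler import ContentHandler, ErrorHandler


class RecordingErrorHandler(ErrorHandler):
    """Hypothetical error handler: remember the message, then re-raise."""

    def __init__(self):
        self.messages = []

    def fatalError(self, exception):
        self.messages.append(exception.getMessage())
        raise exception


def try_parse(data):
    """Return (succeeded, recorded_messages) for a bytes document."""
    errors = RecordingErrorHandler()
    try:
        xml.sax.parse(io.BytesIO(data), ContentHandler(), errors)
        return True, errors.messages
    except xml.sax.SAXParseException:
        return False, errors.messages


ok_good, _ = try_parse(b"<collection><movie/></collection>")
ok_bad, messages = try_parse(b"<collection><movie></collection>")  # unclosed <movie>
print(ok_good, ok_bad, len(messages))
```

A plain ContentHandler instance ignores every event, so this example exercises only the well-formedness check and the error path.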