xmlproc: A Python XML parser.

An overview of xmlproc

xmlproc has been designed in a highly modular fashion, with the intention that it should be possible to reuse the modules in different contexts and applications. Emphasis has been placed on flexibility, leading to a rather large and perhaps bewildering interface. This document attempts to explain the different pieces and how they fit together.

The command-line parsers

xmlproc ships with two command-line parser applications that can be used to parse XML documents into ESIS or canonical XML, and also to verify that documents are well-formed or valid. The diagram below shows how these applications use the xmlproc APIs.

[Diagram #1]

As the diagram shows, one application (xpcmd.py) uses the object called XMLProcessor. This is the well-formedness parser, which does not read the external DTD and does not validate. The second application (xvcmd.py) uses an object called XMLValidator, which in turn uses the well-formedness parser XMLProcessor for basic parsing, but provides full validation on top of this.
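The layering described above can be pictured with a small, self-contained sketch. Nothing here is xmlproc code: SketchProcessor, SketchValidator, Recorder and feed_events are hypothetical stand-ins, and the idea that the validator places itself between the basic parser and the application is an assumption made for illustration only.

```python
class SketchProcessor:
    """Hypothetical stand-in for a non-validating parser."""

    def __init__(self):
        self.app = None

    def set_application(self, app):
        self.app = app

    def feed_events(self, names):
        # A real parser would tokenize XML; here element names are supplied directly.
        for name in names:
            self.app.handle_start_tag(name, {})


class SketchValidator:
    """Hypothetical stand-in for a validating parser built on SketchProcessor."""

    def __init__(self, allowed_names):
        self.allowed = set(allowed_names)
        self.errors = []
        self.app = None
        self.parser = SketchProcessor()
        # Assumption: the validator sits between the basic parser and the application.
        self.parser.set_application(self)

    def set_application(self, app):
        self.app = app

    def handle_start_tag(self, name, attrs):
        # The validation layer: check the element, then pass the event on unchanged.
        if name not in self.allowed:
            self.errors.append("element %r not declared" % name)
        self.app.handle_start_tag(name, attrs)

    def feed_events(self, names):
        # Basic parsing is delegated to the wrapped processor.
        self.parser.feed_events(names)


class Recorder:
    """Toy application that remembers every start tag it receives."""

    def __init__(self):
        self.seen = []

    def handle_start_tag(self, name, attrs):
        self.seen.append(name)


recorder = Recorder()
validator = SketchValidator(allowed_names=["doc", "para"])
validator.set_application(recorder)
validator.feed_events(["doc", "para", "bogus"])
```

In this sketch the application still sees every event, while the validator collects one error for the undeclared element. Both stand-ins expose the same set_application and feed_events methods, so either could be handed to the same calling code.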

The command-line interfaces are documented here.

The parsing API

If you want to use xmlproc in an application of your own, you do basically what I did with xpcmd.py and xvcmd.py: you use the xmlproc modules in your application to provide it with XML parsing functionality. On top of that you must build whatever you want to use xmlproc for yourself. (In the case of xvcmd.py and xpcmd.py this is simply interpreting command-line options and outputting parse results to the console.)

The diagram below shows the main objects involved in this API:

[Diagram #2]

The central object is the Parser object, which can be either an XMLProcessor or an XMLValidator (they have the same interface, but only the latter validates, and the former is faster). When it is created, the Parser creates default versions of the four objects shown in the diagram. It also provides methods, such as set_application, that make it use objects you provide instead of these defaults. (See the documentation for details.)

The roles of these four objects are:

Application
This object receives all data events produced while parsing the document, through methods such as handle_start_tag, handle_end_tag, handle_data and so on.
ErrorHandler
This object receives all error events.
PubIdResolver
Whenever an entity reference is encountered, this object will be given the public identifier and system identifier (file name/URL) of the entity and asked to supply the system identifier to be used.
InputSourceFactory
Once the PubIdResolver has returned the correct file name/URL for the entity, the InputSourceFactory will be asked to create a file-like object from which the entity contents can be read.
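The hand-off between the last two objects can be sketched as follows. This is a minimal illustration, not xmlproc's own code: the class names, the method names resolve and create_input_source, the load_entity helper, and the sample identifiers are all hypothetical. Entity contents are served from memory so that the sketch runs without any files or network access.

```python
import io


class SketchPubIdResolver:
    """Hypothetical resolver: maps public identifiers to system identifiers."""

    def __init__(self, catalog):
        self.catalog = dict(catalog)

    def resolve(self, pubid, sysid):
        # Use the catalog entry when the public identifier is known,
        # otherwise keep the system identifier given in the document.
        return self.catalog.get(pubid, sysid)


class SketchInputSourceFactory:
    """Hypothetical factory: turns a system identifier into a file-like object."""

    def __init__(self, documents):
        self.documents = dict(documents)

    def create_input_source(self, sysid):
        # A real factory would open a file or URL here.
        return io.StringIO(self.documents[sysid])


def load_entity(resolver, factory, pubid, sysid):
    """Step 1: ask the resolver which identifier to use. Step 2: open and read it."""
    resolved = resolver.resolve(pubid, sysid)
    source = factory.create_input_source(resolved)
    try:
        return source.read()
    finally:
        source.close()


resolver = SketchPubIdResolver({"-//EXAMPLE//ENTITIES Demo//EN": "local/demo.ent"})
factory = SketchInputSourceFactory({
    "local/demo.ent": "<!ENTITY greeting 'hello'>",
    "other.ent": "<!ENTITY other 'x'>",
})

# The public identifier is known, so the remote system identifier is remapped.
remapped = load_entity(resolver, factory,
                       "-//EXAMPLE//ENTITIES Demo//EN",
                       "http://example.invalid/demo.ent")
# No public identifier: the system identifier is used as given.
direct = load_entity(resolver, factory, None, "other.ent")
```

Keeping the two steps separate means a catalog lookup can be replaced without touching how resources are opened, and the reverse.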

So, to act on the document content, make an Application object and tell the parser to use it. To control error reporting, make an ErrorHandler. To control the resolution of public identifiers (and also to remap system identifiers), make a PubIdResolver. To add support for new kinds of URLs or to provide your own support for a class of URLs, make an InputSourceFactory.
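As a rough illustration of the first two roles, the sketch below shows an application object and an error handler receiving events. It does not import xmlproc: the replay function is a hypothetical stand-in for the parser, and the method signatures (handle_start_tag(name, attrs), handle_end_tag(name), handle_data(data, start, end), and an error(msg) method on the handler) are assumptions that should be checked against the xmlproc API documentation.

```python
class OutlineApplication:
    """Builds an indented outline from the data events it receives."""

    def __init__(self):
        self.depth = 0
        self.lines = []

    def handle_start_tag(self, name, attrs):
        self.lines.append("  " * self.depth + name)
        self.depth += 1

    def handle_end_tag(self, name):
        self.depth -= 1

    def handle_data(self, data, start, end):
        # Assumed convention: the text is the slice data[start:end].
        text = data[start:end].strip()
        if text:
            self.lines.append("  " * self.depth + repr(text))


class CollectingErrorHandler:
    """Stores error messages instead of printing them."""

    def __init__(self):
        self.messages = []

    def error(self, msg):
        self.messages.append(msg)


def replay(events, app, err):
    """Hypothetical driver standing in for the parser: dispatches prepared events."""
    open_tags = []
    for kind, value in events:
        if kind == "start":
            open_tags.append(value)
            app.handle_start_tag(value, {})
        elif kind == "data":
            app.handle_data(value, 0, len(value))
        elif kind == "end":
            if not open_tags or open_tags[-1] != value:
                # Error events go to the error handler, not to the application.
                err.error("unexpected end tag: " + value)
                continue
            open_tags.pop()
            app.handle_end_tag(value)


app = OutlineApplication()
err = CollectingErrorHandler()
replay([("start", "doc"), ("start", "title"), ("data", "Hello"),
        ("end", "title"), ("end", "para"), ("end", "doc")], app, err)
```

After the replay, app.lines holds the outline of the document and err.messages holds the single complaint about the stray end tag, which shows how data events and error events travel through separate objects.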

Example


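from xml.parsers.xmlproc import xmlproc

class MyApplication(xmlproc.Application):
    pass  # Override handle_start_tag, handle_data and so on here

# Use xmlval.XMLValidator (from the xmlval module) instead if you want to validate
p = xmlproc.XMLProcessor()
p.set_application(MyApplication())  # Send data events to our application
p.parse_resource("foo.xml")         # File name/URL of the document to parse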

07.feb.99 04:29, Lars Marius Garshol, larsga@ifi.uio.no.