The itools.html package includes a parser for HTML documents. Its programming interface is similar, though not identical, to that of the XML parser in the itools.xml package (see Section 16.1).
Example:
>>> from itools.html import HTMLParser
>>> from itools.xml import START_ELEMENT, END_ELEMENT, TEXT
>>>
>>> data = 'Hello <em>Baby</em>'
>>> for type, value, line in HTMLParser(data):
...     if type == START_ELEMENT:
...         tag_uri, tag_name, attributes = value
...         print 'START TAG :', tag_name
...     elif type == END_ELEMENT:
...         tag_uri, tag_name = value
...         print 'END TAG :', tag_name
...     elif type == TEXT:
...         print 'TEXT :', value
...
TEXT : Hello
START TAG : em
TEXT : Baby
END TAG : em
This example just prints a message to the standard output each time the start of an element, the end of an element or a text node is found.
The parser returns a list of events, where every event is a tuple of three values: the event type, the value (which depends on the event type) and the line number. The events implemented are:
Event            Value
DOCUMENT_TYPE    value
START_ELEMENT    (tag URI, tag name, attributes)
END_ELEMENT      (tag URI, tag name)
TEXT             value
COMMENT          value
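Because every event has the same three-part shape, code that consumes events can be written and tested without running the parser. The following is a minimal sketch under stated assumptions: the constants are local stand-ins for the ones imported from itools.xml, the events list is written by hand to mirror the session shown earlier, and collect_text is a hypothetical helper, not part of itools.

```python
# Stand-in constants for this sketch only. In real code they are
# imported from itools.xml; the values used here are arbitrary.
START_ELEMENT, END_ELEMENT, TEXT, COMMENT = range(4)

# Hand-written events in the (type, value, line) shape, mirroring what
# the earlier session reports for 'Hello <em>Baby</em>'.
# None is a placeholder for the tag URI.
events = [
    (TEXT, 'Hello ', 1),
    (START_ELEMENT, (None, 'em', {}), 1),
    (TEXT, 'Baby', 1),
    (END_ELEMENT, (None, 'em'), 1),
]


def collect_text(events):
    """Join the values of all TEXT events and ignore every other event."""
    return ''.join(value for type_, value, line in events if type_ == TEXT)


plain = collect_text(events)
```

With the parser available, the same helper should accept HTMLParser(data) in place of the hand-written list, since it relies only on the event type and value.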
All values (text nodes, comments, attribute values, etc.) are returned as byte strings, in the source encoding.
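Since the values are byte strings, the caller is responsible for decoding them before treating them as text. A small sketch, assuming the source document was encoded as UTF-8; the byte string below is a made-up example value, not actual parser output.

```python
# Assumption: the source document was encoded as UTF-8.
source_encoding = 'utf-8'

# A byte string such as the parser might return for a text node.
raw_value = b'Caf\xc3\xa9'

# Decode once, at the boundary, and work with text from then on.
text_value = raw_value.decode(source_encoding)
```

Note that the two forms differ in length: the byte string holds five bytes, while the decoded text holds four characters.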
Element attributes are returned as a dictionary that maps each attribute name to its value.
For example, when processing the HTML fragment:
<a href="http://www.gnu.org/"
   title="GNU's Not Unix">GNU</a>
The parser will return the attributes this way:
{'href': 'http://www.gnu.org/',
 'title': "GNU's Not Unix"}
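Since the attributes arrive as an ordinary dictionary, the usual dictionary operations apply. The sketch below uses the dictionary shown above as a literal rather than taking it from the parser; the 'target' attribute is only an illustration of a missing key.

```python
# The attributes dictionary from the example above, written as a literal.
attributes = {'href': 'http://www.gnu.org/',
              'title': "GNU's Not Unix"}

# Direct lookup for an attribute known to be present.
href = attributes['href']

# dict.get avoids a KeyError for attributes the tag may omit.
target = attributes.get('target')    # absent here, so None
title = attributes.get('title', '')  # present, so the default is unused
```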