Part 3: Understanding what was fetched

(Earlier posts in this series: here
and here)

There are a lot of ways to extract data from a HTML file. You can
do simple string
(by the way, why is the python documentation for
basic string objects hidden under the non-descript heading
”Sequence types”, and why is there no reference to that part of
the documentation from the separate
string module
, which hardly does anything?) and rexep
, or you can use more
. Funnily enough, there are two of these in the Python
standard library, and both of them are callback based — why no
tree-based interface? If the HTML code is modern and well-formed,
you can even use a vast
XML tools (and if
it’s not, you can fix it with HTML Tidy).

I ended up using the BaseHTMLProcessor
approach from Dive
Into Python.
, which has a whole
devoted to the art of HTML parsing. Basically, you
subclass BaseHTMLProcessor, implementing callbacks for various
tags, which are called as these tags are encountered in the
document. Your class is responsible for keeping track of whatever
state (ie what depth you are in the document, what tags were
encountered before this one, and so on) that needs to be kept.

There are some things that are cumbersome with this approach. For
example, automatic HTML entity resolving would be good. The HTML
represents a single header with the string ”räksmörgås” (a common
test phrase for
swedish programmers), and so it should only result in three
callbacks: start_h1, handle_data (which should
be called with the string ”räksmörgås”), and end_h1.

Instead, the following callbacks are called:

  • start_h1
  • handle_data (called with the string ‘R‘)
  • handle_entityref (called with the string ‘auml‘)
  • handle_data (called with the string ‘ksm‘)
  • handle_entityref (called with the string ‘ouml‘)

…you get the idea. There exists a
that helps with the entity resolving, but for the HTML
case, this could have been solved at a lower-level stage.

Still, for the parsing problems I have, the
callback-based/keep-your-own-goddam-state-approach works. Most of
the time I’m just concerned with finding the elements
in a table
, meaning I have to keep track of what cells I’ve
seen and when a new table row starts, things like that. As I go
along, build up a list of mappings or something similar, and then
just use that list once done. The calling code gets quite nice and

cl = SFSChangelogExtractor()
cl.feed(open("downloaded/lawinfo/%s.html" % self.basefile).read())
for c in cl.changelog:
    if c.item('SFS-nummer') == current_transitional_id: ...

(Note that the ‘c’ object here is not a standard dictionary, but a
mapping-ish object that also keeps track of the order keys have
been inserted. That’s why it’s c.item('SFS-nummer') and
not c['SFS-nummer']. That, and the fact that I was too
lazy to implement the special
needed to do a proper Duck Typed

The one exception is the problem of finding all the plaintext in a
law text like this
, but it’s even easier: Just increment a counter whenever a
<pre> tag is encountered, decrement it when seing
</pre>. In handle_entityref and handle_text, just
check if the counter is > 0 and if so, append the text to a StringIO