Part 3: Understanding what was fetched

(Earlier posts in this series: here and here)

There are a lot of ways to extract data from an HTML file. You can do simple string searching (by the way, why is the Python documentation for basic string objects hidden under the nondescript heading “Sequence types”, and why is there no reference to that part of the documentation from the separate string module, which hardly does anything?) and regexp munging, or you can use more sophisticated HTML parsers. Funnily enough, there are two of these in the Python standard library, and both of them are callback-based. Why no tree-based interface? If the HTML code is modern and well-formed, you can even use a vast array of XML tools (and if it’s not, you can fix it with HTML Tidy).

I ended up using the BaseHTMLProcessor approach from Dive Into Python, which has a whole chapter devoted to the art of HTML parsing. Basically, you subclass BaseHTMLProcessor and implement callbacks for various tags, which are called as these tags are encountered in the document. Your class is responsible for keeping track of whatever state needs to be kept (i.e. what depth you are at in the document, what tags were encountered before this one, and so on).
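To make the callback style concrete, here is a minimal sketch. It does not use BaseHTMLProcessor itself; as an assumption for illustration it uses html.parser.HTMLParser from the current Python 3 standard library, which follows the same callback idea. The class name HeadingCollector and the sample markup are made up.

```python
from html.parser import HTMLParser


class HeadingCollector(HTMLParser):
    """Hypothetical example: collects the text of every <h1> element."""

    def __init__(self):
        super().__init__()
        self.in_h1 = False   # state the parser will not track for us
        self.headings = []

    def handle_starttag(self, tag, attrs):
        if tag == "h1":
            self.in_h1 = True
            self.headings.append("")

    def handle_endtag(self, tag):
        if tag == "h1":
            self.in_h1 = False

    def handle_data(self, data):
        # Only text seen while inside an <h1> is kept.
        if self.in_h1:
            self.headings[-1] += data


parser = HeadingCollector()
parser.feed("<html><body><h1>First</h1><p>text</p><h1>Second</h1></body></html>")
parser.close()
print(parser.headings)
```

The parser only reports events; remembering that we are inside an h1 is entirely up to the subclass.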

There are some things that are cumbersome with this approach. For example, automatic HTML entity resolving would be good. The HTML fragment “<h1>r&auml;ksm&ouml;rg&aring;s</h1>” represents a single header with the string “räksmörgås” (a common test phrase for Swedish programmers), and so it should only result in three callbacks: start_h1, handle_data (which should be called with the string “räksmörgås”), and end_h1.

Instead, the following callbacks are called:

  • start_h1
  • handle_data (called with the string ‘r’)
  • handle_entityref (called with the string ‘auml’)
  • handle_data (called with the string ‘ksm’)
  • handle_entityref (called with the string ‘ouml’)

…you get the idea. There exists a mapping that helps with the entity resolving, but for the HTML case, this could have been solved at a lower-level stage.
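As a rough sketch of what resolving the entities yourself can look like: the example below assumes Python 3's html.parser.HTMLParser with convert_charrefs=False (so that handle_entityref is still called) and the name2codepoint mapping from html.entities. Whether that is the same mapping I had in mind above is an assumption, and the class name EntityResolvingParser is made up.

```python
from html.entities import name2codepoint
from html.parser import HTMLParser


class EntityResolvingParser(HTMLParser):
    """Hypothetical example: logs callbacks and glues text and entities together."""

    def __init__(self):
        # convert_charrefs=False keeps the old behaviour of separate
        # handle_entityref callbacks for things like &auml;
        super().__init__(convert_charrefs=False)
        self.events = []
        self.text = []

    def handle_starttag(self, tag, attrs):
        self.events.append("start_" + tag)

    def handle_endtag(self, tag):
        self.events.append("end_" + tag)

    def handle_data(self, data):
        self.events.append("data")
        self.text.append(data)

    def handle_entityref(self, name):
        self.events.append("entityref")
        codepoint = name2codepoint.get(name)
        # Unknown entities are passed through unchanged.
        self.text.append(chr(codepoint) if codepoint is not None else "&%s;" % name)


p = EntityResolvingParser()
p.feed("<h1>r&auml;ksm&ouml;rg&aring;s</h1>")
p.close()
resolved = "".join(p.text)
print(p.events)
print(resolved)
```

The callbacks still arrive in fragments, but the handler can stitch them back into one string.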

Still, for the parsing problems I have, the callback-based/keep-your-own-goddam-state approach works. Most of the time I’m just concerned with finding the elements in a table, meaning I have to keep track of what cells I’ve seen, when a new table row starts, and things like that. As I go along, I build up a list of mappings or something similar, and then just use that list once done. The calling code gets quite nice and simple:

cl = SFSChangelogExtractor()
cl.feed(open("downloaded/lawinfo/%s.html" % self.basefile).read())
for c in cl.changelog:
    if c.item('SFS-nummer') == current_transitional_id: ...
  

(Note that the ‘c’ object here is not a standard dictionary, but a mapping-ish object that also keeps track of the order in which keys have been inserted. That’s why it’s c.item('SFS-nummer') and not c['SFS-nummer']. That, and the fact that I was too lazy to implement the special methods needed to do a proper duck-typed dictionary.)
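For the curious, the table bookkeeping can be sketched roughly like this. It is not the actual SFSChangelogExtractor: the class name TableExtractor, the assumed layout (one header row of th cells followed by rows of td cells) and the sample values are all invented for illustration, and it uses Python 3's html.parser.HTMLParser with plain dicts, which keep insertion order in Python 3.7 and later.

```python
from html.parser import HTMLParser


class TableExtractor(HTMLParser):
    """Hypothetical example: turns table rows into a list of dicts."""

    def __init__(self):
        super().__init__()
        self.headers = []
        self.rows = []
        self._cells = None           # cells of the row in progress, None outside <tr>
        self._cell_tag = None        # "th" or "td" while inside a cell
        self._row_is_header = False
        self._buf = []

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._cells = []
            self._row_is_header = False
        elif tag in ("th", "td") and self._cells is not None:
            self._cell_tag = tag
            self._buf = []

    def handle_data(self, data):
        # Text between cells (newlines, indentation) is ignored.
        if self._cell_tag is not None:
            self._buf.append(data)

    def handle_endtag(self, tag):
        if tag in ("th", "td") and self._cell_tag == tag:
            self._cells.append("".join(self._buf).strip())
            if tag == "th":
                self._row_is_header = True
            self._cell_tag = None
        elif tag == "tr" and self._cells is not None:
            if self._row_is_header:
                self.headers = self._cells
            else:
                self.rows.append(dict(zip(self.headers, self._cells)))
            self._cells = None


# Invented sample markup, not taken from a real downloaded page.
sample = """<table>
<tr><th>SFS-nummer</th><th>Rubrik</th></tr>
<tr><td>0000:1</td><td>Example A</td></tr>
<tr><td>0000:2</td><td>Example B</td></tr>
</table>"""

tx = TableExtractor()
tx.feed(sample)
tx.close()
for row in tx.rows:
    print(row["SFS-nummer"], row["Rubrik"])
```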

The one exception is the problem of finding all the plaintext in a law text like this one, but it’s even easier: just increment a counter whenever a <pre> tag is encountered, and decrement it when seeing </pre>. In handle_entityref and handle_data, just check if the counter is > 0 and if so, append the text to a StringIO object.
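A minimal sketch of the counter trick, again assuming Python 3's html.parser.HTMLParser rather than the code I actually run. With its default convert_charrefs=True, entities arrive already resolved in handle_data, so no separate handle_entityref is needed here. The class name PreTextExtractor and the sample markup are made up.

```python
import io
from html.parser import HTMLParser


class PreTextExtractor(HTMLParser):
    """Hypothetical example: collects all text found inside <pre> elements."""

    def __init__(self):
        super().__init__()   # entities are resolved before handle_data is called
        self.pre_depth = 0
        self.out = io.StringIO()

    def handle_starttag(self, tag, attrs):
        if tag == "pre":
            self.pre_depth += 1

    def handle_endtag(self, tag):
        # Guard against a stray </pre> pushing the counter below zero.
        if tag == "pre" and self.pre_depth > 0:
            self.pre_depth -= 1

    def handle_data(self, data):
        if self.pre_depth > 0:
            self.out.write(data)


extractor = PreTextExtractor()
extractor.feed("<p>skip</p><pre>1 &sect; F&ouml;rsta</pre><p>skip</p><pre>2 &sect;</pre>")
extractor.close()
plaintext = extractor.out.getvalue()
print(plaintext)
```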
