Part 2: Fetching stuff from the web
Tuesday, December 14th, 2004 (A continuation of the series started in this post)
Now, the first thing that needs to be done is to actually get the text of the laws from the web. Before that can be done, a list of available laws must be fetched. In Swedish law, most laws have nice IDs known as “SFS-nummer”, usually of the form “yyyy:nnn”, where yyyy is the year the law was issued and nnn is incremented for each law passed that year. Some of the older laws don’t strictly follow this convention, and can have IDs like “1736:0123 2” or “1844:50 s.2”.
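A minimal sketch, in present-day Python 3, of how such IDs could be picked apart. The pattern is an assumption based only on the examples above (a four-digit year, a colon, a serial number, and an optional trailing suffix), and the names `SFS_RE` and `parse_sfs` are made up for illustration:

```python
import re

# Hypothetical pattern: four-digit year, colon, serial number, and an
# optional free-form suffix such as "2" or "s.2" seen on some older laws.
SFS_RE = re.compile(r"^(?P<year>\d{4}):(?P<serial>\d+)(?:\s+(?P<suffix>\S.*))?$")


def parse_sfs(sfs_id):
    """Split an SFS number into (year, serial, suffix); suffix may be None."""
    match = SFS_RE.match(sfs_id.strip())
    if match is None:
        raise ValueError("not a recognisable SFS number: %r" % sfs_id)
    # The serial stays a string so that leading zeros ("0123") survive.
    return int(match.group("year")), match.group("serial"), match.group("suffix")
```

Calling `parse_sfs("1844:50 s.2")` splits the ID into the year 1844, the serial "50" and the suffix "s.2". Older IDs that deviate further from the convention would need a looser pattern.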
To get a list of all laws passed between two years, one can use this form from Regeringskansliets Rättsdatabaser. It uses the normal GET method, so it’s quite easy to construct a URL that will return all laws between, say, 1600 and 1850 (line breaks inserted for readability):
http://62.95.69.15/cgi-bin/thw?${HTML}=sfsr_lst&
${OOHTML}=sfsr_dok&${SNHTML}=sfsr_err&${MAXPAGE}=26&
${BASE}=SFSR&${FORD}=FIND&ÅR=FRÅN+1600&
ÅR=TILL+1850
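The URL above can also be assembled from its parts. This is a sketch in present-day Python 3, not the code used for this project. The host and parameter names are copied from the URL; the function name `build_list_url`, the reading of “+” as an encoded space, and the ISO-8859-1 encoding of “Å” are assumptions:

```python
from urllib.parse import urlencode

# Host and parameter names are copied from the URL above; the ISO-8859-1
# encoding of "Å" is an assumption about what the server expects.
BASE_URL = "http://62.95.69.15/cgi-bin/thw"


def build_list_url(from_year, to_year):
    """Return the search URL listing all laws between two years."""
    params = [
        ("${HTML}", "sfsr_lst"),
        ("${OOHTML}", "sfsr_dok"),
        ("${SNHTML}", "sfsr_err"),
        ("${MAXPAGE}", "26"),
        ("${BASE}", "SFSR"),
        ("${FORD}", "FIND"),
        # The same key appears twice, so a list of pairs is used, not a dict.
        ("ÅR", "FRÅN %d" % from_year),
        ("ÅR", "TILL %d" % to_year),
    ]
    # safe="${}" leaves the template-style keys unescaped; spaces become "+".
    return BASE_URL + "?" + urlencode(params, safe="${}", encoding="iso-8859-1")
```

Note that this percent-encodes the “Å” (as `%C5`) instead of sending it raw, which is the safer choice for a URL but may or may not be what this particular server wants.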
To fetch this, I used urllib2. Now for a little aside rant: Why are there two urllibs in the standard distribution? I understand that basic urllib has a simple interface and urllib2 a more complex one, but would it be so hard to design a single module that lets you do simple things easily, and progress onto hard things? In the Perl world, you can start with LWP::Simple and then go on to more advanced stuff, but with Python it’s either simple urllib requests with practically no control at all, or urllib2 with its way-too-complex system of chained handlers. I will return to this rant in a little bit, but for now let’s have some useful content. This is the code used to fetch stuff:
import urllib

url = "http://62.95.69.15/...ÅR=FRÅN+%s&ÅR=TILL+%s" % (1600,1850)
sock = urllib.urlopen(url)
html = sock.read()
So, as long as your needs are simple, like just wanting to do a simple GET or POST, urllib and/or urllib2 will work. However, I encountered a more complex scenario when I wanted to download court verdicts from Domstolsväsendets rättsinformation: This is a web app that relies on HTTP POSTs, HTTP redirects, session tracking cookies, frames and JavaScript links, and is, in general, incredibly fragile. The slightest “error” in what you send and the server answers with a “500 Server Error: java.lang.NullPointerException”.
The first problem was that the application requires cookie support, which urllib2 doesn’t have (as it doesn’t have any concept of state between requests). At first I thought I could fix the cookie support by reading and setting headers, the way it was done back in the day when men were men and knew the cookie spec by heart. Turns out the web service sets a cookie when you issue a search request, but the answer to the search request is an HTTP redirect. To get the resulting list of matches, you need to present that same cookie that was set.
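The “reading and setting headers” approach boils down to turning the Set-Cookie values of one response into the Cookie header of the next request. A rough sketch in present-day Python 3, assuming the raw Set-Cookie values are already in hand; the function name and the `JSESSIONID` cookie in the usage note are hypothetical:

```python
from http.cookies import SimpleCookie


def cookie_header_from(set_cookie_values):
    """Build a Cookie request header value from Set-Cookie response values."""
    jar = SimpleCookie()
    for value in set_cookie_values:
        jar.load(value)  # parses name=value plus attributes such as Path
    # Only name=value pairs go back to the server; attributes are dropped.
    return "; ".join("%s=%s" % (name, morsel.value) for name, morsel in jar.items())
```

Given a response carrying `Set-Cookie: JSESSIONID=abc123; Path=/lagrummet`, the follow-up request would send `Cookie: JSESSIONID=abc123`. This ignores expiry, domain and path matching, which is exactly the bookkeeping a real cookie jar does for you.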
Now, let’s continue the rant: urllib2 blindly follows the redirect, giving us no chance to set the Cookie header. From the documentation, it appears that it should be possible to override this behaviour by subclassing HTTPRedirectHandler and passing the instance to build_opener, which creates a chain of instances of BaseHandler or a subclassed class. Reading the documentation for urllib2 makes me think that someone’s OO/design patterns fetish was not kept properly in check. Anyway, I could not get that to work.
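For reference, here is roughly what that subclassing approach looks like in present-day Python 3, where urllib2 lives on as `urllib.request`. This is a sketch under that assumption, not the code that failed back then, and it has not been tried against this site. The names `NoRedirectHandler` and `build_manual_opener` are invented for the example:

```python
import urllib.request


class NoRedirectHandler(urllib.request.HTTPRedirectHandler):
    """Refuse to follow redirects so the caller can inspect the 3xx response."""

    def redirect_request(self, req, fp, code, msg, headers, newurl):
        # Returning None means "do not follow"; the opener then raises
        # HTTPError, whose .headers carry Set-Cookie and Location.
        return None


def build_manual_opener():
    """Return an opener that surfaces redirects instead of following them."""
    # A subclass of a default handler replaces that default in the chain.
    return urllib.request.build_opener(NoRedirectHandler)
```

With such an opener, a redirecting request ends in an `urllib.error.HTTPError` that can be caught, and the cookie and the redirect target can be read from its headers before issuing the next request by hand.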
Another thing that bugs me about urllib2 is that it has no support for the Robots Exclusion Standard (RES). Right now, neither Regeringskansliets databaser nor Domstolsväsendets rättsinformation has a /robots.txt, but if they put one in tomorrow I think I should respect it.
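Respecting a robots.txt can be done by hand, by checking each URL against the rules before fetching it. A small sketch using the robots.txt parser in the standard library (`urllib.robotparser` in present-day Python 3); the function name `is_allowed`, the user agent and the rules in the usage note are all made up, since neither site has a robots.txt today:

```python
from urllib.robotparser import RobotFileParser


def is_allowed(robots_lines, user_agent, url):
    """Check a URL against robots.txt rules given as a list of lines."""
    parser = RobotFileParser()
    parser.parse(robots_lines)
    return parser.can_fetch(user_agent, url)
```

For instance, with the hypothetical rules `["User-agent: *", "Disallow: /cgi-bin/"]`, a request for anything under `/cgi-bin/` would be refused while other paths would be allowed. A real scraper would fetch the rules from the site’s /robots.txt instead of hardcoding them.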
I did briefly use ClientCookie, which is an add-on module for urllib2 that provides automatic cookie support, and it did solve my first problem. Although I did not try it, it can also be used to provide RES support, proper Referer setting, and some other goodies. It seems that at least the cookie handling functionality of ClientCookie has been folded into urllib2 in Python 2.4, which is a good thing.
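The cookie support that went into Python 2.4 (as `cookielib`) survives in present-day Python 3 as `http.cookiejar`. A minimal sketch of an opener that remembers cookies between requests, assuming Python 3; `build_cookie_opener` is a name invented here:

```python
import http.cookiejar
import urllib.request


def build_cookie_opener():
    """Return (opener, jar); the jar collects cookies across requests."""
    jar = http.cookiejar.CookieJar()
    opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))
    return opener, jar
```

Every request made through `opener.open(...)` then stores cookies the server sets and sends them back on later requests, including the request that follows a redirect.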
However, some time after I first got some code to work with the site, they changed something around and made it even more fragile. No matter what I did, I couldn’t get the site to respond with anything other than a “500 Server Error”, even though I checked the client-server communication when using IE (with the excellent Fiddler utility), and replicated the behaviour down to the exact header level.
So, I remembered that Erik had told me about the virtues of webscraping using IE and COM automation. Since I’m only running the code on Windows machines, giving up platform independence wasn’t that big a deal, and the rich COM support in both Python and IE made it quite easy (after installing pywin32 for COM support). Here’s the basic code:
from time import sleep
from win32com.client import Dispatch

# startdate (a date object) and year are assumed to be set by the caller
ie = Dispatch("InternetExplorer.Application")
ie.Visible = 1
ie.Navigate("http://www.rattsinfosok.dom.se/lagrummet/index.jsp")
while ie.Busy: sleep(1)  # wait for the page to finish loading
# fill in the search form found in the frame at index 1, then submit it
ie.Document.frames(1).Document.forms(0).all.item("txtAvgDatumFran").value = startdate.strftime("%Y-%m-%d")
ie.Document.frames(1).Document.forms(0).all.item("txtAvgDatumTill").value = "%s-12-31" % year
ie.Document.frames(1).Document.forms(0).all.item("slctDomstol").value = "ALLAMYND"
ie.Document.frames(1).Document.forms(0).all.item("buttonSok").click()
while ie.Busy: sleep(1)  # wait for the result list to load
html = ie.Document.frames(1).Document.body.outerHTML.encode('iso-8859-1')
With a web app as dependent on JavaScript and browser behaviour as this one, you can really save yourself a whole lot of trouble if your code can use the web browser instead of trying to emulate it. For one thing, behaviour implemented in JavaScript (onclick handlers and the like) is reproduced correctly without any extra work.
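One weak spot in the code above is the bare `while ie.Busy: sleep(1)` loop, which spins forever if the browser hangs. A possible refinement, sketched in present-day Python 3: the helper name `wait_until_idle` and its default values are assumptions, and it relies only on the object having a `Busy` attribute.

```python
import time


def wait_until_idle(browser, timeout=30.0, poll=0.5):
    """Poll browser.Busy until it is false; raise if timeout seconds pass."""
    deadline = time.monotonic() + timeout
    while browser.Busy:
        if time.monotonic() >= deadline:
            raise TimeoutError("browser still busy after %.1f seconds" % timeout)
        time.sleep(poll)
```

Each `while ie.Busy: sleep(1)` line could then become `wait_until_idle(ie)`, so a stuck page load surfaces as an exception instead of a silent hang.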
Well, that’s all for now about fetching stuff from the web. Next installment will center around making sense of what we’ve just fetched, i.e. parsing HTML and stuff.