Part 2: Fetching stuff from the web

(A continuation of the series started in this post)

Now, the first thing that needs to be done is to actually get
the text of the laws from the web. Before that can be done, a
list of available laws must be fetched. In Swedish law, most
laws have nice IDs known as ”SFS-nummer”, usually of the form
”yyyy:nnn”, where yyyy is the year it was issued, and nnn is
incremented for each law passed that year. Some of the older
laws don’t strictly follow this convention, and can have IDs
like ”1736:0123 2” or ”1844:50 s.2”.
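
As an illustration, a regular expression that accepts both the
normal IDs and the older odd ones could look something like this
(a quick sketch, not the code actually used):

      import re

      # matches "1960:729" as well as oddballs like "1736:0123 2"
      # and "1844:50 s.2" (illustration only)
      sfs_id = re.compile(r"^\d{4}:\d+( (s\.)?\d+)?$")

      for nr in ("1960:729", "1736:0123 2", "1844:50 s.2"):
          assert sfs_id.match(nr)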

To get a list of all laws passed between two years, one can use
the search form at Regeringskansliets databaser. It uses the
normal GET method, and so it’s quite easy to construct a URL that
will return all laws between, say, 1600 and 1850 (linebreaks
inserted for readability):

      ${HTML}=sfsr_lst&

To fetch this, I used urllib2. Now for a little aside rant: Why
are there two urllibs in the standard distribution? I understand
that basic urllib has a simple interface and urllib2 a more
complex one, but would it be so hard to design a single module
that lets you do simple things easily and progress onto harder
things? In the Perl world, you can start with LWP::Simple and
then go on to more advanced stuff, but with Python it’s either
simple urllib requests with practically no control at all, or
urllib2 with its way-too-complex system of chained handlers. I
will return to this rant in a little bit, but for now let’s have
some useful content. This is the code used to fetch stuff:

url = "ÅR=FRÅN+%s&ÅR=TILL+%s" % (1600,1850)
sock = urllib.urlopen(url)
html =
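
For comparison, the urllib2 version of the same simple fetch is
not much worse (a sketch, reusing the same partial url as above):

      import urllib2

      req = urllib2.Request(url)   # same partial url as above
      sock = urllib2.urlopen(req)
      html = sock.read()
      sock.close()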

So, as long as your needs are simple, like just wanting to do a
simple GET or POST, urllib and/or urllib2 will work. However, I
encountered a more complex scenario when I wanted to download
court verdicts from Domstolsväsendets
rättsinformation: This is a web app that relies on HTTP POSTs,
HTTP redirects, session tracking cookies, frames, javascript
links and is, in general, incredibly fragile. The slightest
”error” in what you send and the server answers with a
500 Server Error: java.lang.NullPointerException.

The first problem was that the application requires cookie
support, which urllib2 doesn’t have (as it doesn’t have any
concept of state between requests). At first I thought I could fix
the cookie support by reading and setting headers, the way it was
done back in the day when men were men and knew the cookie
specification by heart. Turns out the web service sets a cookie
when you issue a search request, but the answer from the search
request is an HTTP redirect. To get the resulting list of matches,
you need to present that same cookie that was set.
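
Just to show what that old-school, header-level approach looks
like, here is a rough sketch using httplib. The host name, path
and form field below are made up; the real service’s details are
not reproduced here:

      import httplib, urllib, urlparse

      # made-up host, path and form field
      conn = httplib.HTTPConnection("example.com")
      body = urllib.urlencode({"freetext": "something"})
      headers = {"Content-Type": "application/x-www-form-urlencoded"}
      conn.request("POST", "/search", body, headers)
      resp = conn.getresponse()                 # comes back as a 302
      cookie = resp.getheader("set-cookie")     # session cookie set here
      location = resp.getheader("location")     # the result list is here
      resp.read()

      # present that same cookie when following the redirect by hand
      # (stripping of cookie attributes omitted for brevity)
      path = urlparse.urlparse(location)[2] or "/"
      conn.request("GET", path, None, {"Cookie": cookie})
      result = conn.getresponse().read()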

Now, let’s continue the rant: urllib2 blindly follows the
redirect, giving us no chance to set the Cookie header. From the
documentation, it appears that it should be possible to override
this behaviour by subclassing HTTPRedirectHandler and passing
the instance to build_opener, which creates a chain of instances
of BaseHandler or its subclasses. Reading the documentation
for urllib2 makes me think that someone’s OO/design patterns
fetish was not kept properly in check. Anyway, I could not get
that to work.
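
For the record, the kind of thing the documentation seems to be
hinting at looks roughly like the sketch below. Take it as an
illustration of the idea, not as code I can vouch for against
this particular site (searchurl and postdata are placeholders):

      import urllib, urllib2

      class DontFollowRedirects(urllib2.HTTPRedirectHandler):
          # hand back the 3xx response instead of following it, so
          # the caller can look at Set-Cookie and Location itself
          def http_error_302(self, req, fp, code, msg, headers):
              resp = urllib.addinfourl(fp, headers, req.get_full_url())
              resp.code = code
              return resp
          http_error_301 = http_error_303 = http_error_307 = http_error_302

      opener = urllib2.build_opener(DontFollowRedirects())
      resp = opener.open(searchurl, postdata)   # placeholders
      cookie = resp.info().getheader("set-cookie")
      location = resp.info().getheader("location")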

Another thing that bugs me about urllib2 is that it has no
built-in support for the Robots Exclusion Standard
(RES). Right now, neither Regeringskansliets databaser
nor Domstolsväsendets rättsinformation has a
/robots.txt, but if they put one up tomorrow I think I
should respect it.
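
The standard library does at least ship a robotparser module, so
a do-it-yourself check on top of urllib isn’t much code; something
along these lines (the site and user-agent string are made up):

      import robotparser, urllib

      rp = robotparser.RobotFileParser()
      rp.set_url("http://example.com/robots.txt")
      rp.read()
      if rp.can_fetch("lawfetcher", url):    # made-up user-agent
          html = urllib.urlopen(url).read()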

I did briefly use ClientCookie,
which is an add-on module for urllib2 that provides automatic
cookie support, and it did solve my first problem. Although I
did not try it, it can also be used to provide RES support,
proper Referer setting, and some other goodies. It seems that
at least the cookie handling functionality of ClientCookie has
been folded into urllib2 in Python 2.4, which is a good thing.
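
For reference, the Python 2.4 spelling of automatic cookie
handling looks something like this (a sketch, using the same
placeholder searchurl and postdata as in the redirect sketch
above):

      import cookielib, urllib2

      cj = cookielib.CookieJar()
      opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))
      resp = opener.open(searchurl, postdata)  # cookie survives the redirect
      html = resp.read()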

However, some time after I first got some code to work with the
site, they changed something around and made it even more
fragile. No matter what I did, I couldn’t get the site to
respond with anything other than a ”500 Server Error”,
even though I checked the client-server communication when using
IE (with the excellent Fiddler utility),
and replicated the behaviour down to the exact header level.

So, I remembered that Erik had told me about
the virtues of webscraping using IE
and COM automation. Since I’m only running the code on
Windows machines, giving up platform independence wasn’t that
big a deal, and the rich COM support in both Python and IE made
it quite easy (after installing pywin32 for
COM support). Here’s the basic code:

      from time import sleep
      from win32com.client import Dispatch

      ie = Dispatch("InternetExplorer.Application")
      ie.Visible = 1
      ie.Navigate(searchpage)   # placeholder for the search page URL
      while ie.Busy: sleep(1)
      ie.Document.frames(1).Document.forms(0).all.item("txtAvgDatumFran").value = startdate.strftime("%Y-%m-%d")
      ie.Document.frames(1).Document.forms(0).all.item("txtAvgDatumTill").value = "%s-12-31" % year
      ie.Document.frames(1).Document.forms(0).all.item("slctDomstol").value = "ALLAMYND"
      ie.Document.frames(1).Document.forms(0).submit()   # submit the search form
      while ie.Busy: sleep(1)
      html = ie.Document.frames(1).Document.body.outerHTML.encode('iso-8859-1')

With a javascript- and browser-behaviour-dependent web app
such as this, you can really save yourself a whole lot of
trouble if your code can use an actual web browser instead of
trying to emulate one. For one thing,
behaviour implemented in javascript (OnClick-handlers and the
like) is reproduced correctly without any extra work.
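
For example, following one of the javascript-only links in the
result list boils down to finding the element and clicking it,
and letting IE run whatever OnClick handler is attached (a
sketch; the link index is made up):

      # click a link in the result frame and let IE execute its
      # javascript; index 0 is just an example
      ie.Document.frames(1).Document.links(0).click()
      while ie.Busy: sleep(1)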

Well, that’s all for now about fetching stuff from the web. Next
installment will center around making sense of what we’ve just
fetched, i.e. parsing HTML and stuff.