Part 3: Understanding what was fetched

(Earlier posts in this series: here
and here)

There are a lot of ways to extract data from a HTML file. You can
do simple string
searching
(by the way, why is the python documentation for
basic string objects hidden under the nondescript heading
”Sequence types”, and why is there no reference to that part of
the documentation from the separate
string module
, which hardly does anything?) and regexp
munging
, or you can use more
sophisticated
HTML
parsers
. Funnily enough, there are two of these in the Python
standard library, and both of them are callback based — why no
tree-based interface? If the HTML code is modern and well-formed,
you can even use a vast
array
of
XML tools (and if
it’s not, you can fix it with HTML Tidy).

I ended up using the BaseHTMLProcessor
approach from Dive
Into Python, which has a whole
chapter
devoted to the art of HTML parsing. Basically, you
subclass BaseHTMLProcessor, implementing callbacks for various
tags, which are called as these tags are encountered in the
document. Your class is responsible for keeping track of whatever
state needs to be kept (i.e. what depth you are at in the document,
what tags were encountered before this one, and so on).

There are some things that are cumbersome with this approach. For
example, automatic HTML entity resolving would have been nice to
have. The HTML fragment
<h1>r&auml;ksm&ouml;rg&aring;s</h1>
represents a single header with the string ”räksmörgås” (a common
test phrase for
Swedish programmers), and so it should only result in three
callbacks: start_h1, handle_data (which should
be called with the string ”räksmörgås”), and end_h1.

Instead, the following callbacks are called:

  • start_h1
  • handle_data (called with the string ‘r‘)
  • handle_entityref (called with the string ‘auml‘)
  • handle_data (called with the string ‘ksm‘)
  • handle_entityref (called with the string ‘ouml‘)

…you get the idea. There exists a
mapping
that helps with the entity resolving, but for the HTML
case, this could have been solved at a lower-level stage.
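
If you want the three-callback behaviour, you can of course do the
resolving yourself in handle_entityref, using that mapping. Something
like the following minimal sketch (built on sgmllib, which
BaseHTMLProcessor itself subclasses; the class name is made up):

import htmlentitydefs
import sgmllib

class TextCollector(sgmllib.SGMLParser):
    def __init__(self):
        sgmllib.SGMLParser.__init__(self)
        self.text = []

    def handle_data(self, data):
        self.text.append(data)

    def handle_entityref(self, ref):
        # entitydefs maps e.g. 'auml' to the latin-1 byte '\xe4', so the
        # entity turns back into ordinary character data
        self.handle_data(htmlentitydefs.entitydefs.get(ref, '&%s;' % ref))

Feeding it the fragment above and joining self.text gives back
”räksmörgås” (as latin-1 bytes).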

Still, for the parsing problems I have, the
callback-based/keep-your-own-goddam-state-approach works. Most of
the time I’m just concerned with finding the elements
in a table
, meaning I have to keep track of what cells I’ve
seen and when a new table row starts, things like that. As I go
along, I build up a list of mappings or something similar, and then
just use that list once I'm done. The calling code gets quite nice and
simple:

cl = SFSChangelogExtractor()
cl.feed(open("downloaded/lawinfo/%s.html" % self.basefile).read())
for c in cl.changelog:
    if c.item('SFS-nummer') == current_transitional_id: ...
  

(Note that the ‘c’ object here is not a standard dictionary, but a
mapping-ish object that also keeps track of the order in which keys
have been inserted. That’s why it’s c.item('SFS-nummer') and
not c['SFS-nummer']. That, and the fact that I was too
lazy to implement the special
methods
needed to do a proper Duck Typed
dictionary.)
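
For the curious, the extractor class itself looks roughly like this
(a heavily simplified sketch, not the real SFSChangelogExtractor: it
uses plain dicts instead of the ordered mapping-ish object, and the
column names are made up):

import sgmllib

class TableExtractor(sgmllib.SGMLParser):
    fields = ['SFS-nummer', 'Rubrik']     # hypothetical column order

    def __init__(self):
        sgmllib.SGMLParser.__init__(self)
        self.changelog = []
        self.row = []
        self.in_cell = False

    def start_tr(self, attrs):
        self.row = []

    def end_tr(self):
        if self.row:
            self.changelog.append(dict(zip(self.fields, self.row)))

    def start_td(self, attrs):
        self.in_cell = True
        self.row.append('')

    def end_td(self):
        self.in_cell = False

    def handle_data(self, text):
        if self.in_cell:
            self.row[-1] += text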

The one exception is the problem of finding all the plaintext in a
law text like this
one
, but it’s even easier: just increment a counter whenever a
<pre> tag is encountered, decrement it when seeing
</pre>. In handle_entityref and handle_data, just
check if the counter is > 0 and if so, append the text to a StringIO
object.
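
In code, that approach is something like this minimal sketch (class
and attribute names are mine):

import sgmllib
from StringIO import StringIO

class PlaintextExtractor(sgmllib.SGMLParser):
    def __init__(self):
        sgmllib.SGMLParser.__init__(self)
        self.pre_depth = 0           # how many <pre> levels we're inside
        self.buf = StringIO()

    def start_pre(self, attrs):
        self.pre_depth += 1

    def end_pre(self):
        self.pre_depth -= 1

    def handle_data(self, text):
        if self.pre_depth > 0:
            self.buf.write(text)

    # handle_entityref would do the same, after resolving the entity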

Part 2: Fetching stuff from the web

(A continuation of the series started in this post)

Now, the first thing that needs to be done is to actually get
the text of the laws from the web. Before that can be done, a
list of available laws must be fetched. In Swedish law, most
laws have nice IDs known as ”SFS-nummer”, usually of the form
”yyyy:nnn”, where yyyy is the year it was issued, and nnn is
incremented for each law passed that year. Some of the older
laws don’t strictly follow this convention, and can have IDs
like ”1736:0123 2” or ”1844:50 s.2”.
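
(Just to illustrate the format, here is a deliberately loose regex
that accepts both the normal form and the two oddballs above. This is
only an illustration; the real set of exceptions is messier than
this.)

import re

# "yyyy:nnn", optionally followed by a suffix like " 2" or " s.2"
sfsnr = re.compile(r'^\d{4}:\d+( s?\.?\d+)?$')

for t in ("1960:729", "1736:0123 2", "1844:50 s.2"):
    print t, bool(sfsnr.match(t))     # all three match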

To get a list of all laws passed between two years, one can use this form from
Regeringskansliets
Rättsdatabaser
. It uses the normal GET method, and so it’s
quite easy to construct a URL that will return all laws between,
say, 1600 and 1850 (linebreaks inserted for readability):

http://62.95.69.15/cgi-bin/thw?${HTML}=sfsr_lst&
${OOHTML}=sfsr_dok&${SNHTML}=sfsr_err&${MAXPAGE}=26&
${BASE}=SFSR&${FORD}=FIND&ÅR=FRÅN+1600&
ÅR=TILL+1850

To fetch this, I used urllib2. Now for a little aside rant: Why
are there two urllibs in the standard distribution? I understand
that basic urllib has a simple interface and urllib2 a more
complex one, but would it be so hard to design a single module
that lets you do simple things easily, and progress onto hard
things? In the Perl world, you can start with LWP::Simple and
then go on to more advanced stuff, but with python it’s either
simple urllib requests with practically no control at all, or
urllib2 with its way-too-complex system of chained handlers. I
will return to this rant in a little bit, but for now let’s have
some useful content. This is the code used to fetch stuff:

url = "http://62.95.69.15/...ÅR=FRÅN+%s&ÅR=TILL+%s" % (1600,1850)
sock = urllib.urlopen(url)
html = sock.read()
  

So, as long as your needs are simple, like just wanting to do a
simple GET or POST, urllib and/or urllib2 will work. However, I
encountered a more complex scenario when I wanted to download
court verdicts from Domstolsväsendets
rättsinformation
: This is a web app that relies on HTTP
POSTs, HTTP redirects, session tracking cookies, frames,
javascript links and is, in general, incredibly fragile. The
slightest ”error” in what you send, and the server answers with a
500 Server Error: java.lang.NullPointerException.

The first problem was that the application requires cookie
support, which urllib2 doesn’t have (as it doesn’t have any
concept of state between requests). At first I thought I could fix
the cookie support by reading and setting headers, the way it was
done back in the day when men were men and knew the cookie
spec
by heart. Turns out the web service sets a cookie when
you issue a search request, but the answer from the search request
is an HTTP redirect. To get the resulting list of matches, you need
to present that same cookie that was set.

Now, let’s continue the rant: urllib2 blindly follows the
redirect, giving us no chance to set the Cookie header. From the
documentation, it appears that it should be possible to override
this behaviour by subclassing HTTPRedirectHandler and passing
the instance to build_opener, which creates a chain of BaseHandler
(or subclass) instances. Reading the documentation
for urllib2 makes me think that someone’s OO/design patterns
fetish was not kept properly in check. Anyway, I could not get
that to work.
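
For reference, the approach the documentation hints at looks roughly
like the sketch below (just a sketch; as I said, I never got it to
behave against the actual site):

import urllib2

class RememberingRedirectHandler(urllib2.HTTPRedirectHandler):
    # Instead of blindly following the redirect, remember where it
    # points and hand back the original response, so the caller can
    # pick up the Set-Cookie header before requesting the new location.
    def http_error_302(self, req, fp, code, msg, headers):
        self.location = headers.get('Location')
        return fp

    http_error_301 = http_error_303 = http_error_307 = http_error_302

opener = urllib2.build_opener(RememberingRedirectHandler())
# opener.open(...) would then return the un-followed redirect response,
# Set-Cookie header and all.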

Another thing that bugs me about urllib2 is that it has no
support for the Robots Exclusion Standard
(RES). Right now, neither Regeringskansliets databaser
nor Domstolsväsendets rättsinformation has a
/robots.txt, but if they put one in tomorrow I think I
should respect it.

I did briefly use ClientCookie,
which is an add-on module for urllib2 that provides automatic
cookie support, and it did solve my first problem. Although I
did not try it, it can also be used to provide RES support,
proper Referer setting, and some other goodies. It seems that
at least the cookie handling functionality of ClientCookie has
been folded into urllib2 in Python 2.4, which is a good thing.
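
In Python 2.4, the same thing looks something like this, with just the
standard library (a minimal sketch):

import cookielib, urllib2

# The CookieJar remembers Set-Cookie headers between requests and
# sends them back automatically, which is exactly what the
# search/redirect dance above needs.
cj = cookielib.CookieJar()
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))
urllib2.install_opener(opener)
# urllib2.urlopen(...) will now carry cookies across requests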

However, some time after I first got some code to work with the
site, they changed something around and made it even more
fragile. No matter what I did, I couldn’t get the site to
respond with anything other than a ”500 Server Error”,
even though I checked the client-server communication when using
IE (with the excellent Fiddler utility),
and replicated the behaviour down to the exact header level.

So, I remembered that Erik had told me about
the virtues of webscraping using IE
and COM automation
. Since I’m only running the code on
windows machines, giving up platform independence wasn’t that
big a deal, and the rich COM support in both Python and IE made
it quite easy (after installing pywin32 for
COM support). Here’s the basic code:

      from time import sleep
      from win32com.client import Dispatch

      ie = Dispatch("InternetExplorer.Application")
      ie.Visible = 1
      ie.Navigate("http://www.rattsinfosok.dom.se/lagrummet/index.jsp")
      while ie.Busy: sleep(1)        # wait for the page to load
      # startdate and year are set by the surrounding code
      ie.Document.frames(1).Document.forms(0).all.item("txtAvgDatumFran").value = startdate.strftime("%Y-%m-%d")
      ie.Document.frames(1).Document.forms(0).all.item("txtAvgDatumTill").value = "%s-12-31" % year
      ie.Document.frames(1).Document.forms(0).all.item("slctDomstol").value = "ALLAMYND"
      ie.Document.frames(1).Document.forms(0).all.item("buttonSok").click()
      while ie.Busy: sleep(1)        # wait for the search to finish
      html = ie.Document.frames(1).Document.body.outerHTML.encode('iso-8859-1')
  

With a javascript- and browser-behaviour-dependent web app
such as this, you can really save yourself a whole lot of
trouble if your code can use the web browser instead of
trying to emulate it. For one thing,
behaviour implemented in javascript (OnClick-handlers and the
like) is reproduced correctly without any extra work.

Well, that’s all for now about fetching stuff from the web. Next
installment will center around making sense of what we’ve just
fetched, i.e. parsing HTML and stuff.

Lagen.nu behind the scenes

Now that lagen.nu has been out for
some time, it might be a good
idea to write down what I’ve learned from it so far, in blog
form. Much of the discussion will be centered around python, a
language I’m far from proficient in, but it’s possible that
someone will learn at least something from it.

First, take a look at this
post
that explains what lagen.nu is, from a user
perspective.

This post is about how the site is produced. When I started out, I
had no clear idea of what I wanted to do, other than to download
the text of all Swedish laws and convert it to some sort of nice
HTML. I knew I wanted to do as much as possible with static HTML
files, and I had a hunch that XML would be involved in some way.

So, essentially, the code only needs to run off-line, with no
GUI required.

I thought about doing this in C#, since it would be a good
experience to build a project in a language for which expertise is
highly sought after. But since I’m no longer
programming for food
(actually I am, for another four days,
but still), I took the opportunity to do it in python, a language which I’ve
always liked but never become friends with.

From a high level, the code does the following:

  • Finds out what laws are available
  • Downloads the law text HTML documents
  • Converts the text to XML
  • Transforms the XML to HTML

There are some extra steps involved in creating the front page,
RSS feeds, and handling the verdicts database, but these are the
main steps.

The result of the program is a tree with static HTML files, ready
for deployment.

I started out by looking
for a good Python IDE
. I did not find one, and settled for Emacs
with python-mode.

Once set up with a recent version of python-mode, properly
configured, I had a nice light-weight development
environment. Here’s my minimal configuration (this goes into your
.emacs file):

(autoload 'python-mode "python-mode" "Python Mode." t)
(add-to-list 'auto-mode-alist '("\\.py\\'" . python-mode))
(add-to-list 'interpreter-mode-alist '("python" . python-mode))
(setq py-python-command "C:\\Python23\\python.exe")
  

My code lives in classes, and to test things out, I have code at
the end of the main code file that looks sort of like the
following:

if __name__ == "__main__":
    vc = VerdictCollection()
    vc.get(2004,refreshidx=True)
  

(That is, if I want to test the get method of the
VerdictCollection class). To test the code, I just press
C-c C-c in the python editor window. The entire python buffer gets
sent to the python shell, and the last part (after if __name__
== "__main__":
) executes.

Things that are good about this environment:

  • Free, in both senses of the word
  • The indentation support really works, which is quite important with python
  • Reasonably fast edit-run cycle
  • The interactive python shell

Things that are bad:

  • I can’t debug stuff. It seems like it should be
    possible
    , but I have no pdb.exe, which seems to be a
    requirement. In particular, it would be nice to be able to
    automatically start debugging when an unhandled exception is
    raised.
  • Copy and paste from the *Python* buffer has character set
    problems. For example, if my code outputs a § sign, and I cut’n
    paste it into another file, emacs will complain:

    These default  coding systems were tried: 
    iso-latin-1-dos
    However, none of them safely encodes the target text.

    This is bogus, since the § sign is perfectly legal in latin-1.

I use the standard python.org distribution of Python 2.3 (I
haven’t gotten around to upgrading to 2.4 yet), not the ActiveState
one
. I tried it, and liked the fact that the win32com module is
bundled, but the python.org version is a leaner download and has a
more usable HTML help application (particularly the good index).

To get a grip on how to do things with python, I’ve used the
online version of Mark Pilgrim’s Dive Into
Python
, as well as the Python
cookbook
. This, together with the reference manual, (the eff-bot
guide to) The Standard Python Library
and Text Processing in Python has
been all I need so far.

Legal document standards, part 2

Rasmus blogs
about
open standards and open access for law texts, a topic of
great interest to me as of late (or ‘obsession’, apparently :-). We
both agree that there is too little of that, and that the
much-heralded open government could do better in this area.

Anyway, I came across an interesting
report
, a summary from a conference held about two years ago. It
featured various people working with legal information systems, both
in the government and private companies, sharing their views on
standardisation of document formats and systems. There are views from
the people behind Rixlex, Infodata and Notisum, amongst others, but
also an interesting view into the state of legal information standards
in Norway. They seem to be way ahead of Sweden in this area.

The general consensus seemed to be ”standardization is good, and we
should do it”, but with no real commitments or timetables. Maybe there
have been developments that I don’t know about since then. This was,
after all, two years ago.

Meanwhile, if you want to do interesting stuff today with
the body of Swedish law, such as making a WAP version or performing
graph analysis of all references contained in the 7500+ texts, just download my completely
non-{standardized,documented} XML version
and go nuts!

There are now at least three document standards, or efforts to
create such, for marking up law texts and other legal documents on my
radar: uscfrag
(mentioned earlier), used
by Cornell University for marking up US Code, LegalXML, which seems to be
US-centric, and LEXML, which
appears to be more EU-centric. It even has its own SourceForge page!

I had no idea so much was going on in so many committees when I
started working on the XMLization of Swedish law. In a way I’m glad
that I didn’t, since I probably would have spent too much time
adhering to these emerging standards and too little time, you know,
getting things done. Or worse, just waited for them to actually finish.

Lagen.nu: The first month

A month ago, I announced
the existence
of lagen.nu. I thought I
could do an ”alpha” release, without doing much PR for it, but the web
does not work that way. Several bloggers relayed the
news
, Google indexed the site, and the hits started trickling
in. These are the numbers for the last 24 hours:

log entries after filtering:      2679
page views:                       1197
Unique visitors:                   582
Unique 2+ page visitors:           181
Unique IP addresses:               567
Search engine referers:            585
Internal searches:                 100
Bookmarked frontpage visitors:      46
Bookmarked frontpage visits:        58

These might not be earth-shattering numbers, but I’m very happy
with them, particularly as they are steadily increasing. Pretty soon,
the site will go from being hosted under my desk to being
professionally handled by the very nice people at Mr Friday.

Furthermore, specific laws on the site have been referenced in
various web discussion forums when discussing Swedish law, which is
exactly how I hoped the site would be used. And I must mention the
very blog-esque button that Kadrik made (look for the
yellow and blue button in the left hand column). Neat!

But other than obsessively tracking referers and analyzing web
server logs, I’ve been slacking pretty much this month. Some work has
been done on the text->XML conversion, more work has been going into
trimming the update process (currently done each Wednesday, when new
laws generally become available), and I’ve started tackling the
problem of recognizing tabular data in plaintext and formatting it
appropriately. I’m also working on a series of postings detailing the
tech used behind the scenes, which you might be interested in if you
want to do some large-scale web scraping with Python.

The pace will probably pick up in December. The 17th is my last day
at work, and then I’ll have almost a month of free time before school
starts
(that is IF I get in… keep your fingers
crossed). If you have any comments on the site, or suggestions on how
to make it more useful, please mail me (or comment on this
post). This goes double for those of you working in the legal
profession or studying law.

Lagen.nu is now public

Finally, after many a late night of coding, lagen.nu is now usable enough for me to
make it public. For new viewers, this is the hobby project I’ve been
working on during weekends and evenings for the last few months, and
it’s basically the entire body of Swedish law, nicely formatted,
hyperreferenced and linkable.

As an example, take a look at the official online edition of the Swedish copyright law (or the more nicely formatted version at Notisum) and contrast it with the one at lagen.nu.

Things to notice:

  • The table of contents, simplifying navigation in large law
    texts.
  • The length and simplicity of the URL, making it feasible to input from memory, as long as you know the ID of the law (later on I will add some nice vhosting magic so you can input http://upphovsrätts.lagen.nu/, but that’s not finished yet).
  • The ”changelog” for each section, detailing when that law was
    changed, with links to transitional regulations and the preparation
    documents that led up to the law.
  • Links to precedent-setting verdicts for each section, where such are
    available. This is particularly important, since many parts of law are
    hard to interpret without knowledge of legal practice.
  • The fact that each and every paragraph and section is directly
    linkable. For example, to refer to the second paragraph in section 26
    g of the copyright law, just use the URL http://lagen.nu/1960:729#P26gS2. Purple
    numbers
    are used to make it easier to discover and create such
    direct links.
  • (This is my favorite part). Wherever there are inline references
    in the text to another paragraph or another law, direct links are
    created, so that you can quickly click around and find out the exact
    text of the referenced paragraphs. Or, for even quicker access, just
    hover with the pointer over the link to get a tooltip containing the
    first ten words or so.

Also, there are of course RSS feeds: one for news about new features on the site,
and one that contains all new and changed laws, as
that information becomes public. I hope that the latter one in
particular will become useful for anyone that wants to keep up to date
on Swedish law.

Now, if you don’t have an interest in Swedish law, lagen.nu won’t be of
much interest to you, but if you do, I hope that you will take a look
at it, and if you have any feedback at all (feature suggestions, bug
reports, lavish praise), please do mail it to me.

I have really no idea of what will become of the project, but since
I’m planning on starting law school at the start of next year, I will
probably add whatever features could be useful for a student of
Swedish law.

I’m not dead yet!

Not much activity on the blog as of late, but this is mainly
because I’ve been working evenings and weekends putting lagen.nu in shape. I’ve only got three
TODO items left (and they’re really minor) before I can let the site
go public in an alpha version.

Lagen.nu won’t be of much use to you unless you speak Swedish and are
interested in Swedish law, but if you do, it will be THE reference
site for you. I’m really happy about how it’s shaping up,
particularly how fun it is to hack on it. I think a key aspect of it
is the total feature-driven approach I’ve been taking. Basically, I’ve
just completely disregarded everything I’ve learned about planning
ahead, writing maintainable code, finding out the Right Way to do
things, and just coded away with no plan in sight.

And yet, I’ve still learned a few interesting lessons. I hope to find
the time and energy to write some of them down here, but the main
lesson is the one outlined above.

”EBNF Rocks!”, or undoing ten years of regex-inflicted damage

I’ve been parsing text with regexes for longer than I’ve been
programming for a living. If there was some text extraction that
needed to be done, regexes were my first choice. Later on, I moved on
to HTML and XML parsers when I could, but there would often be some
sort of regexp magic operating on the text nodes.

But the problem of parsing Swedish law texts, looking for internal
and external references, finally proved too tough for my beloved
regexp hammer. As an example, here is a sample paragraph: § 5 1982:713:

Bestämmelserna om livförsäkring, med undantag för 1 kap. 8 a §
samt 7 kap. 22, 23 och 26 §§, får tillämpas för skadeförsäkringar som
avses i 2 kap. 3 a § första stycket klasserna 1 och 2 samt för
avgångsbidragsförsäkringar.

(by the way, the linked HTML page was generated before I implemented
the EBNF-based parsing described below. As you can see, there are quite a lot of places where my previous naive regex solution got it wrong.)

Quick primer on Swedish and Swedish law texts: ”1 kap.” means
Chapter 1, ”8 a §” means ”section 8 a”, ”första stycket” means ”first
paragraph”, and the rest of the stuff is not really important. To
create the HTML pages I first create an XML document from the plaintext law, where I try to
find as much structure as possible. The most difficult part is finding
internal references. The end result I’m looking for is the
following:

Bestämmelserna om livförsäkring, med undantag för 
<ilink chapter="1" section="8 a">1 kap. 8 a §</ilink> samt 
<ilink chapter="7">7 kap.</ilink>
<ilink chapter="7" section="22">22</ilink>,
<ilink chapter="7" section="23">23</ilink> och
<ilink chapter="7" section="26">26 §§</ilink>, får
tillämpas för skadeförsäkringar som avses i
<ilink chapter="2" section="3 a" piece="1">2 kap. 3 a § första stycket</ilink>
klasserna 1 och 2 samt för avgångsbidragsförsäkringar.

This is a difficult problem to solve with regexes, partly because you
need to keep a lot of state while simultaneously iterating over an
unknown number of references (eg, when you reach ”26” and
create an <ilink> tag, you must remember that the
current chapter in this context is ”7”, and you must handle
the case when there isn’t a current chapter), and partly because some
patterns are subpatterns of larger patterns (eg, there is a rule for
the simple case ”26 §”, but this matches a subset of the
larger case ”22, 23 och 26 §§”). Since there are so many rules
(and there are many more than I’ve shown in this simple example), a
solution that’s based on running several regexp transformations over
the same text quickly becomes unmanageable.

Well, after reading Chapter 4, ”Parsers and State-machines” from Text Processing in Python, I started
thinking about the problem from a more structured parsing view. I’ve
never really worked with EBNF parsers before. I could read an EBNF
grammar, but since I’ve never written a compiler or interpreter, it
didn’t really occur to me how useful they can be for other situations
as well.

The following is a grammar that can parse the above paragraph:

root               ::= (refs/ref/plain)+
refs               ::= (ChapterSectionRefs/SectionRefs)
ChapterSectionRefs ::= ChapterRef, wc, SectionRefs
ChapterRef         ::= ChapterRefID, c"kap."
ChapterRefID       ::= number, wc, (char, wc)?
SectionRefs        ::= (IntervalOrSingle,Comma,wc)*, IntervalOrSingle, wc, And, wc, LastSectionRefID, wc, DoubleSectionMark

IntervalOrSingle   ::= (IntervalSection/SingleSectionRefID)
SingleSectionRefID ::= number
FirstSectionRefID  ::= SingleSectionRefID
LastSectionRefID   ::= number
IntervalSection    ::= SingleSectionRefID, Hyphen, SingleSectionRefID

And                ::= 'och'
Or                 ::= 'eller'
Comma              ::= ','
Hyphen             ::= '-'
DoubleSectionMark  ::= '§§'

plain        ::= (wcs/word/number/punctuation)
wc           ::= [ \t\n\r\f\v]
wcs          ::= wc+
char         ::= [a-zA-ZåäöÅÄÖ]
word         ::= char+
digit        ::= [0-9]
number       ::= digit+
punctuation  ::= [][!@#$%^&()+=|\{}:;<>,.?/§"]

If you have basic knowledge of regexes, particularly bracket
expressions and repetition operators, the above grammar shouldn’t be
that difficult to understand. Just start at the top of the text to be
parsed, and try to satisfy the top rule (”root”), moving down
to the sub-rules (that’s probably not the correct technical term) as
you go. ‘(a/b/c)‘ means ”first try to satisfy a, if not successful try
b, if not successful try c”, and ‘a, b, c‘ means ”satisfy a and then b
and then c”.

The end result, when you feed this grammar and some law text into
SimpleParse, is a
result tree, which is an easily navigable data structure that I can use
to create the output text. For the example above, it will return a
‘plain‘ node, then a ‘refs‘ node, then another
‘plain‘ node, another ‘refs‘ and finally a last
‘plain‘ node. The ‘plain‘ nodes are everything that
the parser couldn’t match with the more specific rules (note that
‘refs‘ and ‘ref‘ come before ‘plain‘ in
the root rule), and I just print them out as-is.

For a ‘refs‘ node, I can dive into its subtree, where
I’ll find a single ‘ChapterSectionRefs‘ node, which in turn
will have three nodes: ‘ChapterRef‘, ‘wc‘, and
‘SectionRefs‘. I’m sure you can see how this continues.

Writing the code that takes this parse tree and constructs the XML
I want is not trivial, but compared to the regex mess I was in before,
it’s a walk in the park.
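
To give a feel for it, the first step looks roughly like the sketch
below (assuming the SimpleParse/mx.TextTools combination from that
era; the file names are made up):

from simpleparse import generator
from mx.TextTools import TextTools

grammar = open("lawref.def").read()    # the EBNF declaration shown above
text = open("lawtext.txt").read()      # the paragraph to be parsed

parser = generator.buildParser(grammar).parserbyname("root")
success, children, nextchar = TextTools.tag(text, parser)

# Each result node is a (tag, start, stop, subtree) tuple of offsets
# into the original text
for tag, start, stop, subtree in children:
    if tag == "plain":
        print text[start:stop],        # unmatched text goes out as-is
    elif tag == "refs":
        pass                           # walk 'subtree', emit <ilink> elements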

So, after ten years with my regex hammer, during which every text
parsing problem looked like a nail, I now have an EBNF
screwdriver. Better late than never!

mod_rewrite, win32 and colons

I’ve been spending my evenings divided between frantically studying history and working on
my new project. The
latter is soon to debut on lagen.nu
(but it will be in Swedish only).

In Swedish law, all laws are uniquely identified by what’s known as
”SFS-nummer”, a string consisting of the year the law was enacted,
followed by a colon, followed by an index number for that year (there
are some exceptions to this rule, but I’m ignoring them for now). For
example, the Swedish copyright law is known as 1960:729.

Lagen.nu will contain all Swedish laws together with all manner of
cross-referencing goodness. So, wouldn’t it be great if there were some
URL-rewriting magic at work, so that instead of going to
http://lagen.nu/1960/729.html, you could just go to http://lagen.nu/1960:729?
Turns out this is fairly simple with apache and mod_rewrite:


RewriteEngine On
RewriteRule ^(\d+):(\d+)$ /$1/$2.html [L]

Well, on my Unix box, that is. On Windows, things are a little more
complicated, and since my development machine is a WinXP laptop, I ran
into these complications. I run Apache on the laptop, to have the same
environment on both computers. As you may know, NTFS has a
little-known feature called Alternate Data
Streams
, which are specified by appending a colon and the stream
name to the file name. That’s why colons aren’t allowed in filenames (at
least I think that’s why…).

Anyway, Apache has
a problem
with this on win32. Even though we never want apache to
look at the disk for a file named 1960:729, somewhere deep in
the apache core the incoming URL (or at least part of it) is tested
for filename validity, and fails, resulting in a permission denied
error.

So, what to do? IIS to the rescue! It turns out that there is an IIS
plugin closely modelled after mod_rewrite called ISAPI_Rewrite, which is
closed-source but free as in beer in its lite version. Good enough for
me. I had some problems with it (couldn’t use \d to match
just digits, I had to use RewriteRule ([^:]*):(.*)
$1/$2.html
), but otherwise it works just fine. Might be worth a look
if you’re on IIS but want to have more control over your URLs.

Update: just discovered an interesting bug (I think?) in IE:
if you have a URL of the form <a href="1960:729">, IE
assumes that it’s not a relative URL, but instead an absolute URL to
the server 1960 on port 729. Heh.

Learning python… again

An interesting side effect of deciding to quit
programming for food
is that I’ve started to program for fun
again. Since a set of laws is, at its core, just a gigantic set of
large documents, much interesting stuff can be done by textual
processing of this body of work. And so I’m writing some python code
to fetch all current Swedish laws, the preparation documents for them,
and verdicts referencing them, in order to semantically mark them all
up and cross-reference the hell out of them. It’s been surprisingly
fun so far.

So, why python? Since I’m more proficient in C# and Perl, it would
make more sense to use one of those languages. Well, C# in particular
is a very sensible language, with a very useful class library. But,
you know, it’s just not that fun. There’s something about the
whitespace-sensitivity of python that just feels good. And since I no
longer have to worry about marketable programming skills, that’s what
I’m going to use.

But one aspect of the sensibility of C# and the .Net programming
platform is that the tools are really really (really) good. Visual
Studio’s integration of the class libraries (incl intellisense),
documentation and debugging is first-class. Sure, the editor might not
be as powerful as Emacs, but that’s really only a small part of the
puzzle.

And so I’ve been searching for a good IDE for python
development. So far I’ve tried, and rejected, the following:

PythonWin
Free with a download of ActivePython.
No integrated class browsing or intellisense-like features, and no
easy access to documentation.
Visual Python
This is a plugin to Visual Studio, which to me makes a lot of
sense, and it does a lot of things right. What brings it down for me
is the lack of help and autocompletion for built-in types like strings
and the lack of an Immediate pane during debugging. It’s also way too
expensive for me to buy personally.
Komodo
This is a standalone commercial IDE with an affordable personal
use price. Here, the main dealbreaker is the lack of class library
integration and a slow editor. The class browser also leaves much to
be desired (it’s more of a symbol browser, really).

So, if none of these passed my test, what am I using? Well, I gave up
on the whole IDE thing and decided to just use Emacs. With some help
from Pontus, I got
python-mode.el to behave well enough that I’m comfortable getting
some stuff done. The py-help-at-point command does most of what
integrated documentation would do, and for the time being I’ll have to
browse the class library over here with
Mozilla instead. And use printf-debugging. It worked ten years ago, it
should work now as well.

I do intend to test WingIDE and BlackAdder,
but since they’re both commercial and not that affordable for a
student, they have to be really really good for me to choose them over
my current Emacs solution.