Archive for 2004

Part 8: Finishing touches

måndag, december 20th, 2004

(The last in a series of blog posts about the tech behind lagen.nu. The other parts are here: first, second, third, fourth, fifth, sixth and seventh)

URL design: Insired both by “Cool URIs don’t change” and the general REST emphasis on sensible URL design, I designed a URL scheme for lagen.nu that maps closely to how swedish laws are identified, and one that is future-proof in that it hides implementation details. Since I hope that many people will find it useful to link to individual pages on lagen.nu, it’s very important that those links doesn’t break.

To illustrate: the generated html files are placed in directories named html/year/id.html. Suppose that I just were to let Apache (or any other webserver) serve content straight from that directory. For example, the copyright law has SFS id 1960:729, and so with this scheme the URL would be http://lagen.nu/html/1960/729.html (or http://lagen.nu/1960/729.html, depending on what I would set DocumentRoot to). This would work fine, the URL’s would be easy to understand, and people would link to all those individual pages from all over the web.

But now suppose I wake up one day and decide to stick all data in a big database, and build a PHP frontend? The URL’s would change, and probably be on the form http://lagen.nu/showlaw.php?sfs=1960:729. Again, nice short URL, easy to understand… but now the links from all over are broken.

So, mod_rewrite to the rescue: With just the simple rule:

    RewriteRule ([^:]*):(.*) /html/$1/$2.html
  

, the resource found at URL http://lagen.nu/html/1960/729.html is now also available at http://lagen.nu/1960:729. This is even nicer to read, futureproof, and enables someone that knows the ID of any law to go straight to the page for that law quickly.

As a added bonus, it makes the text and XML version of the laws easily available too: During generation, I put these versions in sibling directories to html/, named text/ and xml/, respectively, and then use the following rules:

    RewriteRule ([^:]*):(.*).xml	/xml/$1/$2.xml
    RewriteRule ([^:]*):(.*).txt	/text/$1/$2.txt
  

There are other parts, such as the index pages and the about pages, where the underlying flat-file nature of the site shines through, such as http://lagen.nu/om/me.html, but those pages are not as likely to be linked to. Still, if I should decide to change it at some point, I’ll probably make some mod_rewrite based backwards compatibility hack. Also, if you want to do this on a Win32 platform, you’re out of luck. See my previous post for alternatives.

Update functionality: As the body of swedish law is always changing, I had to plan ahead of how to keep the site up to date. New laws are usually published every Wednesday, and it’s out of the question to download every page from Regeringskansliets rättsdatabaser once a week. So instead, I store the highest previously-fetched ID, and from the update routine, I try fetching laws, increasing that ID, until I finally get a “law not found” error. The laws can be either “base constitutions”, i.e. new laws, or “change constitutions”, laws that specify that some other, older, law should change (think “source code file” and “patch”, respectively).

If it’s a base constituion it’s pretty simple, just download it and process it from start to finish. If it’s a change constitution, however, we find out wich base constitution it changes, fetch that, see if it has been updated (”patched”), and if so, store the old versions of that law somewhere, then do the normal regeneration process. In this way, I can, over time, build up an archive of old law revisions, so that I can tell how the law looked at a particular date. For now, I have only a few months worth of history, but the value of this will grow as the time goes on. In particular, it would be cool to be able to do CVS-style diffs between arbitrary revisions of a law, or to be able to link to a law as it looked at a particular moment of time.

Front page: With the information we gather during the update process, we can build a list of recently changed laws, and put it on the front page. Similarly, we build a list of recent new court verdicts, and also one with site news. All these are published in a side bar on the front page, while the main content area is filled with a static listing of some of the most important laws. The different parts of the side bar leads to different news pages, which details site, law and verdict news in greater detail.

RSS feeds: Hey, it’s 2004, how could any self-respecting new site not have a RSS feed? I generate feeds for all three news types (site, law and verdict) using PyRSS2Gen, a nice little lib for creating RSS 2.0 feeds. I haven’t tried them out much, but feedvalidator says they’re OK, and they work fine in all RSS readers I’ve tested with (although Opera tends to show the raw HTML instead of rendering it, which probably indicates that I should include it in some other way than just escaping it and showing it in the description tag. Maybe it would be useful to use a service like Feedburner for this.

Conclusion: This marks the end of this eight-part posting. I hope that you’ve picked up a trick or two. As should be apparent, I am no python wizard or even a programming guru in general, but over the years I’ve found a style of development that works for me in a single-developer context (doing multiple-developer projects is a totally different thing), mostly centered around the XP tenet “Do the simplest thing that could possibly work”, or in my own words: “Don’t try to find the right way of doing something — if you just work away, the right way will find you”.

Part 7: Regression testing

söndag, december 19th, 2004

(A series of blog posts about the tech behind lagen.nu. Earlier parts are here: first, second, third, fourth, fifth and sixth)

Like most developers that have been Test infected, I try to create regression tests whenever I can. A project like lagen.nu, which has no GUI, no event handling, no databases, just operations on text files, is really well suited for automated regression testing. However, when I started out, I didn’t do test-first programming since I didn’t really have any idea of what I was doing. As things solidified, I encountered a particular section of the code that lended itself very nicely to regression testing.

Now, when I say regression testing, I don’t neccesarily mean unit testing. I’m not so concerned with testing classes at the method level as my “API” is really document oriented; a particular text document sent into the program should result in the return of a particular XML document. Basically, there are only two methods that I’m testing:

  • The lawparser.parse() method: Given a section of law text, returns the same section with all references marked up, as described in part 5.
  • Law._txt_to_xml(), which, given a entire law as plaintext, returns the xml version, as described in part 4

Since both these tests operate in the fashion “Send in really big string, compare result with the expected other even bigger string”, I found that pyunit didn’t work that well for me, as it’s more centered around testing lots of methods in lots of classes, where the testdata is so small that it’s comfortable having them inside the test code.

Instead, I created my tests in the form of a bunch of text files. For lawparser.parse, each file is just two paragraphs, the first being the indata, and the second being the expected outdata:

    Vid ändring av en bolagsordning eller av en beviljad koncession
    gäller 3 § eller 4 a § i tillämpliga delar.

    Vid ändring av en bolagsordning eller av en beviljad koncession
    gäller <link section="3">3 §</link> eller <link section="4a">4 a §</link> i tillämpliga delar.
  

The test runner then becomes trivial:

def runtest(filename,verbose=False,quiet=False):
    (test,answer) = open(filename).read().split("\n\n", 1)
    p = LawParser(test,verbose)
    res = p.parse()
    if res.strip() == answer.strip():
        print "Pass: %s" % filename
        return True
    else:
        print "FAIL: %s" % filename
        if not quiet:
            print "----------------------------------------"
            print "EXPECTED:"
            print answer
            print "GOT:"
            print res
            print "----------------------------------------"
            return False
  

Similarly, the code to test Law._txt_to_xml() is also pretty trivial. There are two differences: Since the indata is larger and already split up in paragraphs, the indata and expected result for a particular test is stored in separate files. This also lets me edit the expected results file using nXML mode in Emacs.

Comparing two XML documents is also a little trickier, in that they can be equivalent, but still not match byte-for-byte (since there can be semantically insignificant whitespace and similar stuff). To avoid getting false alarms, I put both the expected result file, as well as the actual result, trough tidy. This ensures that their whitespacing will be equivalent, as well as easy to read. Also, a good example of piping things to and from a command in python:

def tidy_xml_string(xmlstring):
    """Neatifies a XML string and returns it"""
    (stdin,stdout) = os.popen2("tidy -q -n -xml --indent auto --char-encoding latin1")
    stdin.write(xmlstring)
    stdin.close()
    return stdout.read()
  

If the two documents still don’t match, it can be difficult to pinpoint the exact place where they match. I could dump the results to file and run command-line diff on them, but since there exists a perfectly good diff implementation in the python standard libraries I used that one instead:

    from difflib import Differ
    differ = Differ()
    diff = list(differ.compare(res.splitlines(), answer.splitlines()))
    print "\n".join(diff)+"\n"
  

The result is even easier to read than standard diff output, since it points out the position on the line as well (maybe there’s a command line flag for diff that does this?):

[...]
      suscipit non, venenatis ac, dictum ut, nulla. Praesent
      mattis.</p>
    </section>
-   <section id="1" element="2">
?                   ^^^

+   <section id="1" moment="2">
?                   ^^

      <p>Sed semper, ante non vehicula lobortis, leo urna sodales
      justo, sit amet mattis felis augue sit amet felis. Ut quis
[...]
  

So, that’s basically my entire test setup for now. I need to build more infrastructure for testing the XSLT transform and the HTML parsing code, but these two areas are the trickiest.

Since I can run these test methods without having a expected return value, they are very useful as the main way of developing new functionality: I specify the indata, and let the test function just print the outdata. I can then work on new functionality without having to manually specifying exactly how I want the outdata to look (because this is actually somewhat difficult for large documents), I just hack away until it sort of looks like I want, and then just cut’n paste the outdata to the “expected result” file.

Part 6: Transforming with XSLT

lördag, december 18th, 2004

Now, after having downloaded, analysed and XML-ified the lawtext, theres just one step left: Convert them to HTML. Enter XSLT.

XSLT is a complex beast, and I find that it’s better to approach it as a programming language, albeit with an unusual syntax, than a stylesheet language. I have one master stylesheet that I use for the transformation (now over 600 lines of XSLT code), and during devlopment I found out quite a few tricks. By the way, if you want to see the original XML code for any law on lagen.nu, just add the suffix “.xml” to the url, ie http://lagen.nu/1960:729.xml.

Creating useful tooltips: An important aspect of the work done with lagen.nu is that I’m trying to not only produce more astethically pleasing layouts of the text, I try to cross-reference as much as possible, to really take advantage of the medium.

As one example: The internal references that were discussed in the last post, should of course be transformed to ordinary hypertext links, on the form <a href=”#P26c”>26 c §</a>, so that you quickly can jump around in the document and follow the references. But we can do better than that, by extracting the first 20 or so words from section 26 c, and stick them into the title attribute of the a tag:

    <a href="#P26c" title="Ägaren till en byggnad eller ett
  bruksföremål">26 c §</a>
  

Now, if you hover with the mouse over the link (assuming you have browser hard/software where these things are possible), the title text will be shown as a tooltip. This makes it even easier to make sense of a section, since you get an instant reminder of what the referenced section says — in many cases you don’t even have to click on the link.

So, initially, my plan was to have things like these tooltip strings prepared in the XML document, and just do a very simple transform into HTML. But as the work progressed, I found that I was almost always able to do it in XSLT instead. This is the relevant part of the link template:

<xsl:attribute name="title">
  <xsl:variable name="text">
    <xsl:choose>
	<xsl:when test="$hasChapters and $sectionOneCount = 1">
	  <xsl:value-of select="/law/chapter/section[@id=$section]/p[position() = $piece]"/>
	</xsl:when>
	<xsl:when test="$chapter != ''"><xsl:value-of select="/law/chapter[@id=$chapter]/section[@id=$section]/p[position() = $piece]"/></xsl:when>
	<xsl:otherwise><xsl:value-of select="/law/section[@id=$section]/p[position() = $piece]"/></xsl:otherwise> 
    </xsl:choose>
  </xsl:variable>
  
  <xsl:variable name="real-width">
    <xsl:call-template name="tune-width">
	<xsl:with-param name="txt" select="$text"/>
	<xsl:with-param name="width" select="160"/>
	<xsl:with-param name="def" select="160"/>
    </xsl:call-template>
  </xsl:variable>
  <xsl:value-of select="normalize-space(substring($text, 1, $real-width - 1))" />
</xsl:attribute>
  

The variables $chapter, $section, and $piece gets their values earlier in the template, and are set to the place the link goes to. $hasChapters and $sectionOneCount are set globally for the document and are used to tell what kind of section numbering this particular lawtext is using. There are three variants commonly used:

  • No chapters, just simple ascending section numbering
  • Chapters with restarting section numbering (ie, regardless of the number of sections in chapter 1, the first section in chapter 2 will be numbered ‘1 §’ — ie, we need to know the chapter as well as the section — just ‘17 §’ is ambigious)
  • Chapters with continous section numbering (ie, if the last section in chapter 1 is ‘16 §’, the first section of chapter 2 will be numbered ‘17 §’ — ie, the section number is never needed to unambigosly determine what ‘17 §’ refers to).

The code constructs a XPath expression and finds the node that the link refers to, and stores it in the variable text. Then, it trims the string down to a suitable length (max 160 chars, in this case) by using the user-defined function tune-witdh, together with normalize-space and substring. tune-width ensures that we end the string on a word boundary. The result of all this is assigned to the title attribute.

Generating a TOC: Again, if you look at the swedish copyright law, you will notice a big blue link titled “Visa innehållsförteckning”. Clicking on this yields (if you have a browser that supports javascript and DOM level 1) a table of contents (TOC), generated from chapter headings and other headlines. It starts out as hidden, since it usually is in the way, but sometimes it’s very useful.

The XML document in itself do not contain any TOC data. To generate the TOC, we use a number of mode-specific templates that extract the relevant information from headlines contained in the document, all triggered by a <xsl:apply-templates> call:

<div id="toc" style="display:none;">
  <xsl:apply-templates select="law/chapter[(headline or section)]" mode="toc"/>
</div>

...

<xsl:template match="headline" mode="toc">
  <xsl:if test="@level = '1'">
	<div class="l1">
	  <a>
	    <xsl:attribute name="href">
	      <xsl:choose>
		<xsl:when test="substring(.,1,5) = 'AVD. '">#R<xsl:value-of select="@id"/></xsl:when>
		<xsl:when test="../../chapter">#K<xsl:value-of select="../@id"/></xsl:when>
		<xsl:otherwise>#R<xsl:value-of select="@id"/></xsl:otherwise>
	      </xsl:choose>
	    </xsl:attribute>
	    <xsl:value-of select="."/>
	  </a>
	</div>
  </xsl:if>
  <xsl:if test="@level = '2'">
	<div class="l2">
	  <a href="#R{@id}"><xsl:value-of select="."/></a>
	</div>
  </xsl:if>
</xsl:template>
<xsl:template match="section" mode="toc"/>

<xsl:template match="changes" mode="toc"/>
<xsl:template match="appendix" mode="toc"/>
<xsl:template match="preamble" mode="toc"/>
  

Things to note: The text-to-XML conversion is responsible for determining the ‘level’ of the headlines. Level 1 headlines are usually associated with a chapter (though not always), and we use some tests to determine if that is the case. If so, the resulting link uses a “#K<number>” anchor fragment, otherwise a “#R<number>” fragment. “K” is for “kapitel” (chapter), while “R” is for “rubrik” (headline). Not strictly neccesary, but I prefer a link that explicitly says “this link goes to chapter 4″, rather than “this link goes to the 14th headline”, particularly as the number of headlines can change in the future, which would make the link point to the wrong place.

I have a number of “empty” templates. They are needed, since if I don’t have them, the base template kicks in and just copies the entire text, which seriously messes up the TOC. Now, I should be able to limit that with the select attribute of the <xsl:apply-templates> tag, but I have been unsuccesful (the reason I select both headlines AND sections, then do nothing with the sections, is that I’ve been experimenting with using the first lines of each section in the TOC as well, but that came out too messy).

Accessing data in other documents: To properly understand a section in a law, it helps to read court verdicts that reference it. In part 2, I described how to fetch data from Domstolsväsendets rättsinformation, which has this data. Basically, I have a python code snipped that goes through all 1800+ verdicts, uses the parser that was mentioned in part 5 to find references to laws and sections therein, and then generates a “cache.xml” file, that contains references to all verdicts, sorted by law, chapter and section, like this:

<?xml version="1.0" encoding="iso-8859-1"?>
<verdicts>
<law id="1736:0123_2">
<chapter id="9">
  <section id="5">
    <verdictref caseid="NJA 2002 s. 577 (alt. NJA 2002:70)" casenumber="T3933-01" id="2002/758" verdictdate="2002-11-22">
      Gäldenär, som hade flera skulder till en borgenär, har ansetts ha rätt att destinera sin be
    </verdictref>
  </section>
</chapter>
<chapter id="10">
  <section id="1">
    <verdictref caseid="NJA 2000 s. 667 (alt. NJA 2000:97)" casenumber="T4689-98" id="2000/772" verdictdate="2000-12-18">
      Sedan A och B var för sig ställt pantsäkerhet för C:s lån i en bank, har banken gjort sig f
    </verdictref>
    <verdictref caseid="NJA 2000 s. 88 (alt. NJA 2000:13)" casenumber="Ö3863-98" id="2000/923" verdictdate="2000-02-23">
      Ett bolags upplåtelse av företagshypotek har ansetts bli sakrättsligt gällande när företags
    </verdictref>
  </section>
  [...]
  

In the top level of the stylesheet, I select the relevant nodeset from this cache document, and store it in a variable. To be able to select, I first need to know the id of the law (the ‘SFS-nummer’), and so I pass it as a command line parameter:

<xsl:param name="lawid"/>
<xsl:variable name="relevantVerdicts"
  select="document('generated/verdict-xml/cache.xml')//law[@id=$lawid]"/>
  
Then, as I process each section, I check the nodeset to see if there’s any verdicts relevant to the section:
<xsl:variable name="sectionChapterVerdicts" select="$relevantVerdicts//chapter[@id=$chapterid]/section[@id=$sectionid]/verdictref"/>
<xsl:if test="$sectionChapterVerdicts">
<div class="metadata">
<xsl:for-each select="$sectionChapterVerdicts">
  <xsl:variable name="linktext">
    <xsl:choose>
      <xsl:when test="@caseid"><xsl:value-of select="@caseid"/></xsl:when>
      <xsl:when test="@casenumber">Målnummer <xsl:value-of select="@casenumber"/></xsl:when>
      <xsl:when test="@diaryid">Diarienummer <xsl:value-of select="@diaryid"/></xsl:when>
      <xsl:otherwise>Vägledande dom</xsl:otherwise>
    </xsl:choose>
  </xsl:variable>    
  <xsl:variable name="real-width">
    <xsl:call-template name="tune-width">
      <xsl:with-param name="txt" select="."/>
      <xsl:with-param name="width" select="80"/>
      <xsl:with-param name="def" select="80"/>
    </xsl:call-template>
  </xsl:variable>
  <a href="/dom/{@id}"><xsl:value-of select="$linktext"/></a>: <xsl:value-of select="normalize-space(substring(., 1, $real-width - 1))" />... (<xsl:value-of select="@verdictdate"/>)<br/>
</xsl:for-each>
</div>
</xsl:if>
  

Again, a little extra work is done to make sure the explaining text is cut at a word boundary (we could‘ve done that in the python code that generates cache.xml), and also that the text for the actual link makes sense. You see, different courts have different systems for assigning ID’s to cases: HD (the swedish supreme court) uses the page number as the case is published in that year’s anthology of relevant verdicts (Nytt Juridiskt Arkiv, NJA), other uses a court-specific identifier, and some use a derivative of the date when it was handled. Since these identifiers represent different things, they are also represented with different attributes in the XML file, and a little <xsl:choose> trickery selects the most appropriate ID.

Optimization hints: Some of the law texts are quite big, and processing them can take a long time. To speed things up, here are some of the things I did:

  • Don’t use the python interface to libxsl! I started out with this, but it turned out to take twice or even three times as long for a transformation, as compared to running command-line xsltproc (win32 binaries) through an os.system() call.
  • Use the –profile switch to xsltproc. It’s pointless to optimize if you don’t know where the bottlenecks are, and with xsltproc it’s so easy to find them.
  • Store frequently-referenced nodes, nodesets and values in variables, instead of selecting them through XPath queries all the time. Yeah, this one is kind of obvious, but it really helps, particularly in those “inner” templates that are used all the time.

Just by using the above tips, the transformation of my standard test case (1960:729) went from 90 seconds to three. Still, I have other test cases that still takes several minutes, so clearly there’s still some work to do…

Part 5: Finding and formatting inline law references

lördag, december 18th, 2004

(Earlier posts: 1, 2, 3, 4)

The second part of the convert-things-to-xml approach deals with finding inline references to other paragraphs in the law text. I’ve written about it in part in this blog post, but to recap:

Swedish law text contains lots and lots of internal and external references to other sections. These references have a semi-standardized format, but they are clearly meant to be parsed by humans, not machines.

The simplest case is a single reference to a single section (example from 4 § Försäkringsrörelselagen (1982:713)):

    Vid ändring av en bolagsordning eller av en beviljad koncession
    gäller 3 § i tillämpliga delar.
  

Here, the string ‘3 §’ signifies a reference to the third section in the current section of the current law. If we can identify and mark that reference up, we can make the ‘3 §’ into a hyperlink leading to the definition of section 3. You know, the stuff hyperlinking was designed for. Currently, this gets transformed into the following XML (how the XML gets transformed into clickable HTML is the subject of a later article):

    Vid ändring av en bolagsordning eller av en beviljad koncession
    gäller <link section=”3″>3 §</link> i tillämpliga delar.
  

For cases like that, the transformation is trivial, and could be done with regexps or just simple string matching. But for cases like this (Patentbesvärsrätten 96-837):

    14 § 1 st. 6) och 6 § varumärkeslagen (1960:644)
  

or better yet, this (49 a § URL, my personal favourite):

    Bestämmelserna i 2 § andra och tredje styckena, 3, 7--9 och 11 §§,
    12 § första stycket, 13, 15, 16, 18--20 och 23 §§, 24 § första
    stycket, 25--26 b, 26 d -- 26 f, 26 i -- 28, 31--38, 41, 42 och
    50--52 §§ skall tillämpas på bilder som avses i denna paragraf.
  

things get way more complicated. Enter EBNF-grammar-based parsing with the dynamic duo that is SimpleParse and mxTextTools. Also, the book Text Processing in Python by David Mertz should be mentioned, as it helped me in the right direction when I realized regexes weren’t going to cut it.

My previous post describes the actual EBNF grammar and how SimpleParse is used to build a parse tree from it, so you might want to read that too.

However, that is really only half of the problem. After having a tree of parsed tokens, that might (for a somewhat complicated scenario) look like the following:

refs': '14 § 1 st. 6) och 6 § varumärkes|lagen (1960:644)'
    'ExternalRefs': '14 § 1 st. 6) och 6 § varumärkes|lagen (1960:644)'
        'MultipleGenericRefs': '14 § 1 st. 6) och 6 §'
            'GenericRefs': '14 § 1 st. 6)'
                'GenericRef': '14 § 1 st. 6)'
                    'SectionPieceRef': '14 § 1 st. 6)'
                        'SectionRef': '14 §'
                            'SectionRefID': '14'
                                'number': '14'
                            'Whitespace': ' '
                        'Whitespace': ' '
                        'PieceRef': '1 st. 6)'
                            'PieceRefID': '1'
                                'ordinal': '1'
                            'Whitespace': ' '
                            'PieceOrPieces': 'st.'
                            'Whitespace': ' '
                            'ItemRef': '6)'
                                'ItemRefID': '6'
                                    'number': '6'
                                'RightParen': ')'
            'WAndOrW': ' och '
                'Whitespace': ' '
                'AndOr': 'och'
                    'And': 'och'
                'Whitespace': ' '
            'GenericRefs': '6 §'
                'GenericRef': '6 §'
                    'SectionRef': '6 §'
                        'SectionRefID': '6'
                            'number': '6'
                        'Whitespace': ' '
        'Whitespace': ' '
        'ExternalLawRef': 'varumärkes|lagen (1960:644)'
            'NamedExternalLawRef': 'varumärkes|lagen (1960:644)'
                'word': 'varumärkes'
                'Pipe': '|'
                'LawSynonyms': 'lagen'
                'Whitespace': ' '
                'LawRef': '(1960:644)'
                    'LeftParen': '('
                    'LawRefID': '1960:644'
                        'digit': '1'
                        'digit': '9'
                        'digit': '6'
                        'digit': '0'
                        'Colon': ':'
                        'digit': '6'
                        'digit': '4'
                        'digit': '4'
                    'RightParen': ')'
  

, how do we turn it into the following XML?

    <link law="1960:644" section="14" piece="1" item="6">
      14 § 1 st. 6)
    </link>
    och 
    <link law="1960:644" section="6">
      6 §
    </link>
    varumärkes|lagen (
    <link law="1960:644">
      1960:644
    </link>
    )
  

Turns out this is a problem that can be solved in a rather generic manner with a small amount of planning and a little recursiveness. Basically, all tokens that end in ‘Ref‘ should generally end up formatted as a <link>. All tokens underneath such tokens that ends in ‘RefID‘ are used to find the attributes for these tags. Start at the root, then recurse depth-first over all child nodes until done. Sometimes there are notes that end in Ref underneath other nodes also ending in Ref, in those cases it’s the top node that is turned into a <link> tag

Of course, there are exceptions and things to be aware of. Two of these are illustrated in the above example. To correctly insert the law reference (the SFS id ‘1960:644′) for tag to be produced from the MultipleGenericRefs token ‘14 § 1 st. 6) och 6 §‘, we have to plan ahead, when dealing with the parent node (ExternalRefs). The formatter has built-in knowledge of the special handling needed, and, when encountering a ExternalRefs node, finds the ExternalLawRef node child, stores the underlying LawRefID value, and later picks this value up when formatting the tags for the two GenericRef tokens.

To make this looping and recursing easier, I build a OO wrapper around the array-based parse tree that mxTextTools return. This was the subject of another post.

Note also that the ExternalLawRef token did not result in a <link> tag, but the underlying LawRefID token did. This was a consious decision (I thought it looked better that way), and was implemented by creating a number of extra subroutines in the formatter. Basically, the main function acts as a generic dispatcher, by looping over all subtokens in a tree, and for each token, checks if there’s a corresponding format_tokenname() function. If so, call that, otherwise call a generic formatter (which may, recursivly, make use of the generic dispatcher). The code is pretty simple, but is indicative of how neat stuff like this can be done in dynamic langugages:

        try:
            formatter = getattr(self,"format_"+part.tag)
            res = formatter(part)
        except AttributeError:
            res = self.format_tokentree(part)
  

The other wierd thing is that ‘|’ sign in the term ‘varumärkes|lagen‘. This is swedish for ‘the trademark law’, but since we like to write words together, this creates an interesting challenge for creating the EBNF gramar. Basically, I cannot find a way to match a word that ends in a specific suffix, such as ‘lagen‘. The resulting parser is always ‘greedy’, so there seem to be no way of matching these words without matching all words. So, to fix this, I preprocess the text before putting it through the parser, using normal python regular expressions, which can be non-greedy, to put in that ‘|’ sign, which solves the problem. Then, after retrieving the string from the tag formatter, I remove those signs.

One particular satisfying thing of the problem described in this post is that how well it lends itselfs to automated regression testing. Any new feature can easily be specified in a test case before coding begins, and after it’s done, it’s easy to verify that nothing that previously worked has been broken. More on regression testing in a later post.

That’s a load off my {mind,chest,back}

torsdag, december 16th, 2004

We’re taking a break from our “the tech behind lagen.nu series (1, 2, 3, 4), to bring you this information:

For those of my readers that cannot read swedish, see, or discern text in really blurry pictures, the above images tells that I am now officially qualified for law school, starting 2005-01-16. I had good reason to assume that I would qualify, but it’s still nice to have it in black on white. Especially considering that tomorrow is my last day at work. Now I can look forward to a full month of vacation without any clouds over my head.

On a side note, the picture was taken with my new toy, a Sony Ericsson P910i. This little device really is starting to grow on me. More about it in a later post, as well.

Part 4: Converting stuff to XML

torsdag, december 16th, 2004

(If you missed part 1-3, they are here, here and here)

If you look at a sample law text as they are presented in SFST, they are 100% plaintext. In order to convert them to semantically sensible XML, we must look for patterns in the plaintext, to identify things like headlines, start of sections and references.

I did this with a two-part approach. First; I break down the text into it’s individual paragraphs and determine what each paragraph is. This analysis operate on a ‘block level’ — either a block of text is a headline, part of an ordered list, the start of a section etc, or it isn’t. A block can’t be half headline and half ordered list.

Secondly, if the paragraph can contain references to other parts of the law (or parts of other laws, for that matter), I analyze the text in greater detail to find and resolve these references. This step operates on a ‘token’ or character level – a block of text can contain zero, one or many of these references.

The first part is fairly easy, at least conceptually. I wrote a bunch of functions like is_chapter(p), is_section(p), is_headline(p), where each individual function just performs a simple test. These are then used from a simple loop that uses a bunch of local variables to keep track of what kind of structures we’ve encountered so far – a so-called state machine.

These functions started out as very simple regexp-based inline expressions, but as I encountered more and more exceptions to my simple rules, their complexity grew. The body of current swedish law is over 250 years old, and consistency has not been the law makers forte. Here’s an example of how to recognize the start of a section:

    re_SectionId       = re.compile(r'^(\d+ ?\w?) §[ \.]') # used for both match+sub
    re_SectionIdOld    = re.compile(r'^§ (\d+ ?\w?).')     # as used in eg 1810:0926

    def is_section(self,p):
        section_id = self.section_id(p)
        if section_id == None:
            return False
        if section_id == '1':
            if self.verbose: print "is_section: The section numbering's restarting"
            return True
        # now, if this sectionid is less than last section id, the
        # section is probably just a reference and not really the
        # start of a new section. One example of that is
        # /1991:1469#K1P7S1. We use util.numsort to get section id's
        # like "26 g" correct.
        a = [self.current_section,section_id]
        
        if a == util.numsort(a):
            # ok, the sort order's still the same, which means the potential new section has a larger ID
            if self.verbose: print "is_section: '%s' looks like the start of the section, and it probably is (%s < %s)" % (
                p[:30], self.current_section, section_id)
            return True
        else:
            if self.verbose: print "is_section: Even though '%s' looks like the start of the section, the numbering's wrong (%s > %s)" % (
                p[:30], self.current_section, section_id)
            return False
    
    def section_id(self,p):
        match = self.re_SectionId.match(p)
        if match:
            return match.group(1).replace(" ","")
        else:
            match = self.re_SectionIdOld.match(p)
            if match:
                return match.group(1).replace(" ","")
            else:
                return None
  

So, the start of a section looks like ‘<number> [letter] §’, unless it looks like ‘§ <number> [letter]’, and as long as the section id is larger than the previous section id, unless it’s restarting at 1, and also taking into account that section ids can contain an optional letter (like ‘26 g’). Simple as that.

For example, below is the start of the swedish copyright law (which has, during development, been my foremost test case). Now, if you don’t read Swedish, just know that the first paragraph signifies the start of chapter one (”1 Kap.”), the second the start of a section(”1 §”), and then there are three items in an ordered list, and finally a plain ol’ paragraph of text. (By the way, for my swedish readers: The swedish legal term ‘paragraf’ is NOT the same as the english typographical term ‘paragraph’, but rather translates into ’section’. When the english word ‘paragraph’ is used, it is in the same sense as the swedish word ’stycke’)

1 Kap. Upphovsrättens föremål och innehåll

1 § Den som har skapat ett litterärt eller konstnärligt verk har
upphovsrätt till verket oavsett om det är

1. skönlitterär eller beskrivande framställning i skrift eller tal,

2. datorprogram,

[…other items in the ordered list omitted for brevity…]

Till litterära verk hänförs kartor, samt även andra i teckning eller
grafik eller i plastisk form utförda verk av beskrivande art.

[…more things omitted for brevity…]

2 § Upphovsrätt innefattar, med de inskränkningar som nedan stadgas,
uteslutande rätt att förfoga över verket genom att framställa exemplar
därav och genom att göra det tillgängligt för allmänheten, i
ursprungligt eller ändrat skick, i översättning eller bearbetning, i
annan litteratur- eller konstart eller i annan teknik.
  

So, to make sense of this, the following state transitions are done:

  • For the first paragraph, is_chapter() will return True, so we transition to the in_chapter state. This will emit a <chapter id=”1″> tag to the result, along with the text.
  • For the second paragraph, is_section() will return True, transitioning us to the in_section stage as well. It’s important to realize that these states are mostly independent; we can be in a chapter w/o being in a section, and vice versa (some laws have only sections, no chapters). This transition will of course emit a a <chapter id=”1″> tag to the result.
  • For the third paragraph, is_ordered_list() will return True, transitioning us to the in_ordered_list state, and emitting <ol><li>1. skönlitterär eller […]</li> to the result. A couple of things to note about this:
    • These tags may look like HTML tags, but they’re not. They do, in this case, share the same semantics, though.
    • A HTML ordered list (<ol>) keeps track of it’s numbering, and any user-agent should add the numbers to the displayed result. Since this is not HTML, we don’t do that, and instead keep the number that was in the original text. Mostly because we’re not sure that no ordered list in the entire body of swedish law doesn’t contain sequence gaps or things like ‘26 g’.
  • For the fourth, is_ordered_list() will again return true, but now we’re already in the in_ordered_list state, so we don’t emit the initial <ol> tag.
  • Then we omit some boring junk, and then we encounter a normal paragraph. There is no function named is_ordinary_paragraph(), this is just infered from the fact that none of our other test matched. Now, the start of a normal paragraph is implicit evidence that our ordered list must have ended, and so the code to transition into in_normal_paragraph state checks to see if we’re in the in_ordered_list state, and if so, emits the trailing </ol>. After that, the normal paragraph gets emitted as <p>Till litterära verk […]</p>.
  • Similarly, when we encounter the start of a new section (”2 § Upphovsrätt innefattar“), we transition out of any ordered lists we might be in, out of the in_section state, emit </section>, and then transition into the same state again.

The largest part of this high-level parsing was to find all the different structures and discover the implicit state transitions that needed to take place. One interesting problem is determining wheter a given paragraph is a headline or just a really short ordinary paragraph. Well, it turns out that a headline never ends with a period, and a ordinary paragraph always does. Except for headlines that end in “m.m.” (swedish for “etc.”). And ordinary paragraph that end with “,”, “och” or “samt”. But a headline is always followed by a section, so we can peek ahead to see if the next paragraph matches is_section(). Except for those cases where a headline is followed by ANOTHER headline, which is then followed by a section…

A lot of work, and there are still a lot of places where we break (for example, things like definition lists, nested ordered lists, and tables, are not yet supported). Furthermore, tweaking to satisfy one case can easily make other cases break. I will return to this problem in the part about regression testing.

Some other things: As you may have guessed by the usage of the term “emit”, I construct the XML by hand in a single pass. This means that I write XML data straight to a file, without using DOM or anything similar. It was just easier to start out that way. I am thinking of overhauling the state machine mechanism a bit, and I might take a DOM approach then.

Some other parts of the code that constructs XML documents (like the code that handle court verdicts) use DOM, and it works pretty fine, with one small exception: It does not play nice with xml fragments in string form. I have a separate class to find law references in text (to be covered in deeper detail in the next post), and the interface of that class is a plain string one: it returns strings like “<link law=”1960:644″ piece=”2″ section=”1″>1 § 2 st.</link> varumärkeslagen“. Since there isn’t any method on individual element objects like ele.ParseAndAppendMixedContent(), I first have to create a ‘fake’ XML document and transform it into a node tree with parseString (from minidom):

    s = '<?xml version="1.0" encoding="iso-8859-1"?><%s>%s</%s>' % (
                        element, string_containing_mixed_xml_content, element)
    fragment = parseString(s)
    subele = fragment.documentElement
    ele.appendChild(subele)
  

Another drawback of writing XML the raw way is that there is no guarantee that your output will be valid. Special care is taken to escape things like < and &, and to ensure the document is well-formed, it’s also run through HTML Tidy (with the options -q -n -xml –indent auto –char-encoding latin1), which, despite it’s name, is an excellent tool to tidy up XML as well.

Part 3: Understanding what was fetched

onsdag, december 15th, 2004

(Earlier posts in this series: here and here)

There are a lot of ways to extract data from a HTML file. You can do simple string searching (by the way, why is the python documentation for basic string objects hidden under the non-descript heading “Sequence types”, and why is there no reference to that part of the documentation from the separate string module, which hardly does anything?) and rexep munging, or you can use more sophisticated HTML parsers. Funnily enough, there are two of these in the Python standard library, and both of them are callback based — why no tree-based interface? If the HTML code is modern and well-formed, you can even use a vast array of XML tools (and if it’s not, you can fix it with HTML Tidy).

I ended up using the BaseHTMLProcessor approach from Dive Into Python., which has a whole chapter devoted to the art of HTML parsing. Basically, you subclass BaseHTMLProcessor, implementing callbacks for various tags, which are called as these tags are encountered in the document. Your class is responsible for keeping track of whatever state (ie what depth you are in the document, what tags were encountered before this one, and so on) that needs to be kept.

There are some things that are cumbersome with this approach. For example, automatic HTML entity resolving would be good. The HTML fragment “<h1>r&auml;ksm&ouml;rg&aring;s</h1gt;” represents a single header with the string “räksmörgås” (a common test phrase for swedish programmers), and so it should only result in three callbacks: start_h1, handle_data (which should be called with the string “räksmörgås“), and end_h1.

Instead, the following callbacks are called:

  • start_h1
  • handle_data (called with the string ‘R‘)
  • handle_entityref (called with the string ‘auml‘)
  • handle_data (called with the string ‘ksm‘)
  • handle_entityref (called with the string ‘ouml‘)

…you get the idea. There exists a mapping that helps with the entity resolving, but for the HTML case, this could have been solved at a lower-level stage.

Still, for the parsing problems I have, the callback-based/keep-your-own-goddam-state-approach works. Most of the time I’m just concerned with finding the elements in a table, meaning I have to keep track of what cells I’ve seen and when a new table row starts, things like that. As I go along, build up a list of mappings or something similar, and then just use that list once done. The calling code gets quite nice and simple:

cl = SFSChangelogExtractor()
cl.feed(open("downloaded/lawinfo/%s.html" % self.basefile).read())
for c in cl.changelog:
    if c.item('SFS-nummer') == current_transitional_id: ...
  

(Note that the ‘c’ object here is not a standard dictionary, but a mapping-ish object that also keeps track of the order keys have been inserted. That’s why it’s c.item(’SFS-nummer’) and not c[’SFS-nummer’]. That, and the fact that I was too lazy to implement the special methods needed to do a proper Duck Typed dictionary.)

The one exception is the problem of finding all the plaintext in a law text like this one, but it’s even easier: Just increment a counter whenever a <pre> tag is encountered, decrement it when seing </pre>. In handle_entityref and handle_text, just check if the counter is > 0 and if so, append the text to a StringIO object.

Part 2: Fetching stuff from the web

tisdag, december 14th, 2004

(A continuation of the series started in this post)

Now, the first thing that needs to be done is to actually get the text of the laws from the web. Before that can be done, a list of available laws must be fetched. In Swedish law, most laws have nice ID’s known as “SFS-nummer”, usually on the form “yyyy:nnn”, where yyyy is the year it was issued, and nnn is incremented for each law passed that year. Some of the older laws don’t strictly follow this convention, and can have ID’s like “1736:0123 2″ or “1844:50 s.2″.

To get a list of all laws passed between two years, one can use this form from Regeringskansliets Rättsdatabaser. It uses the normal GET method, and so it’s quite easy to construct a URL that will return all laws between, say, 1600 and 1850 (linebreaks inserted for readability):

http://62.95.69.15/cgi-bin/thw?${HTML}=sfsr_lst&
${OOHTML}=sfsr_dok&${SNHTML}=sfsr_err&${MAXPAGE}=26&
${BASE}=SFSR&${FORD}=FIND&ÅR=FRÅN+1600&
ÅR=TILL+1850

To fetch this, I used urllib2. Now for a little aside rant: Why are there two urllibs in the standard distribution? I understand that basic urllib has a simple interface and urllib2 a more complex, but would it be so hard to design a single module that lets you do simple things easy, and progress onto hard things? In the Perl world, you can start with LWP::Simple and then go on to more advanced stuff, but with python it’s either simple urllib requests with practially no control at all, or urllib2 with it’s way-too-complex system of chained handlers. I will return to this rant in a little bit, but for now let’s have some useful content. This is the code used to fetch stuff:

url = "http://62.95.69.15/...ÅR=FRÅN+%s&ÅR=TILL+%s" % (1600,1850)
sock = urllib.urlopen(url)
html = sock.read()
  

So, as long as your needs are simple, like just wanting to do a simple GET or POST, urllib and/or urllib2 will work. However, I encountered a more complex scenario when I wanted to download court verdicts from Domstolsväsendets rättsinformation: This is a web app that relies on HTTP posts, HTTP redirects, session tracking cookies, frames, javascript links and is, in general, incredibly fragile. The slightest “error” in what you send and the server answers with a “500 Server Error: java.lang.NullPointerException

The first problem was that the application requires cookie support, which urllib2 doesn’t have (as it doesn’t have any concept of state between requests). At first I thought I could fix the cookie support by reading and setting headers, the way it was done back in the day when men were men and knew the cookie spec by heart. Turns out the web service sets a cookie when you issue a search request, but the answer from the search request is a HTTP redirect. To get the resulting list of matches, you need to present that same cookie that was set.

Now, let’s continue the rant: urllib2 blindly follows the redirect, giving us no chance to set the Cookie header. From the documentation, it appears that it should be possible to override this behaviour by subclassing HTTPRedirectHandler and passing the instance to build_opener, which creates a chain of instances of BaseHandler or a subclassed class. Reading the documentation for urllib2 makes me think that someones OO/design patterns fetish was not kept properly in check. Anyway, I could not get that to work.

Another thing that bugs me about urllib2 is that is has no support for implementing Robots Exclusion Standard (RES) support. Right now, neither Regeringskansliets databaser or Domstolsväsendets rättsinformation has a /robots.txt, but if they put one in tomorrow I think I should respect it.

I did briefly use ClientCookie, which is an add-on module for urllib2 that provides automatic cookie support, and it did solve my first problem. Although I did not try it, it can also be used to provide RES support, proper Referer setting, and some other goodies. It seems that at least the cookie handling functionality of ClientCookie has been folded into urllib2 in Python 2.4, which is a good thing.

However, some time after I first got some code to work with the site, they changed something around and made it even more fragile. No matter what I did, I couldn’t get the site to respond with anything other than a “500 Server Error“, even though I checked the client-server communication when using IE (with the excellent Fiddler utility), and replicated the behaviour down to the exact header level.

So, I remembered that Erik had told me about the virtues of webscraping using IE and COM automation. Since I’m only running the code on windows machines, giving up platform independence wasn’t that big a deal, and the rich COM support in both Python and IE made it quite easy (after installing pywin32 for COM support). Here’s the basic code:

      from win32com.client import Dispatch
      ie = Dispatch("InternetExplorer.Application")
      ie.Visible = 1
      ie.Navigate("http://www.rattsinfosok.dom.se/lagrummet/index.jsp")
      while ie.Busy: sleep(1)
      ie.Document.frames(1).Document.forms(0).all.item("txtAvgDatumFran").value = startdate.strftime("%Y-%m-%d")
      ie.Document.frames(1).Document.forms(0).all.item("txtAvgDatumTill").value = "%s-12-31" % year
      ie.Document.frames(1).Document.forms(0).all.item("slctDomstol").value = "ALLAMYND"
      ie.Document.frames(1).Document.forms(0).all.item("buttonSok").click()
      while ie.Busy: sleep(1)
      html = ie.Document.frames(1).Document.body.outerHTML.encode('iso-8859-1')
  

With such a javascript- and browser behaviour dependent web app such as this, you can really save yourself a whole lot of trouble if your code can use that web browser instead of trying to emulate that web browser. For one thing, behaviour implemented in javascript (OnClick-handlers and the like) is reproduced correctly without any extra work.

Well, that’s all for now about fetching stuff from the web. Next installment will center around making sense of what we’ve just fetched, i.e. parsing HTML and stuff.

Lagen.nu behind the scenes

måndag, december 13th, 2004

Now that lagen.nu has been out for some time, it might be a good idea to write down what I’ve learned from it so far, in blog form. Much of the discussion will be centered around python, a language I’m far from proficient in, but it’s possible that someone will learn at least something from it.

First, take a look at this post that explains what lagen.nu is, from a user perspective.

This post is about how the site is produced. When I started out, I had no clear idea of what I wanted to do, other than to download the text of all swedish laws and convert it to some sort of nice HTML. I knew I wanted to do as much as possible with static HTML files, and I had a hunch that XML would be involved in some way.

So, essentially, the code only needs to run off-line, with no GUI required.

I thought about doing this in C#, since it would be a good experience building project in a language for which expertise is highly sought after. But since I’m no longer programming for food (actually I am, for another four days, but still), I took the opportunity to do it in python, a language which I’ve always liked but never become friends with.

From a high level, the code does the following:

  • Finds out what laws are available
  • Downloads the law text HTML documents
  • Converts the text to XML
  • Transforms the XML to HTML

There are some extra steps involved in creating the front page, RSS feeds, and handling the verdicts database, but these are the main steps.

The result of the program is a tree with static HTML files, ready for deployment.

I started out by looking for a good Python IDE. I did not find it, and settled for Emacs with python-mode.

Once set up with a recent version of python-mode, properly configured, I had a nice light-weight development environment. Here’s my minimal configuration (this goes into your .emacs file):

(autoload 'python-mode "python-mode" "Python Mode." t)
(add-to-list 'auto-mode-alist '("\.py'" . python-mode))
(add-to-list 'interpreter-mode-alist '("python" . python-mode))
(setq py-python-command "C:\Python23\python.exe")
  

My code lives in classes, and to test things out, I have code at the end of the main code file that looks sort of like the following:

if __name__ == "__main__":
    vc = VerdictCollection()
    vc.get(2004,refreshidx=True)
  

(That is, if I want to test the get method of the VerdictCollection class). To test the code, I just press C-c C-c in the python editor window. The entire python buffer gets sent to the python shell, and the last part (after if __name__ == “__main__”:) executes.

Things that are good about this environment:

  • Free, in both senses of the word
  • The intendation support really works, which is quite important with python
  • Reasonably fast edit-run cycle
  • The interactive python shell

Things that are bad:

  • I can’t debug stuff. It seems like it should be possible, but I have no pdb.exe, which seems to be a requirement. In particular, it would be nice to be able to automatically start debugging when an unhandled exception is raised.
  • Copy and paste from the *Python* buffer has character set problems. For example, if my code outputs a § sign, and I cut’n paste it into another file, emacs will complain:
    These default  coding systems were tried: 
    iso-latin-1-dos
    However, none of them safely encodes the target text.
    This is bogus, since the § sign is perfectly legal in latin-1.

I use the standard python.org distribution of Python 2.3 (I haven’t gotten around to upgrading to 2.4 yet), not the ActiveState one. I tried it, and like the fact that the win32com module is bundled, but the python.org version is a leaner download and has a more usable HTML help application (particularly the good index).

To get a grip of how to do things with python, I’ve used the online version of Mark Pilgrim’s Dive Into Python, as well as the Python cookbook. This, together with the reference manual, (the eff-bot guide to) The Standard Python Library and Text Processing in Python has been all I need so far.

Legal document standards, part 2

fredag, december 3rd, 2004

Rasmus blogs about open standards and open access for law texts, a topic of great interest to me as of lately (or ‘obsession’, apparently :-). We both agree that there are too little of that, and that the much-heralded open governement could do better in this area.

Anyway, I came across an interesting report, a summary from a conference held about two years ago. It featured various people working with legal information systems, both in the government and private companies, sharing their views on standardisation of document formats and systems. There are views from the people behind Rixlex, Infodata and Notisum, amongst others, but also an interesting view into the state of legal information standards in Norway. They seem to be way ahead of Sweden in this area.

The general consensus seemed to be “standardization is good, and we should do it”, but with no real commitments or timeplans. Maybe there has been developments that I don’t know about since then. This was, after all, two years ago.

Meanwhile, if you want to do interesting stuff today with the body of swedish law, such as making a WAP version or performing graph analysis of all references contained in the 7500+ texts, just download my completely non-{standardized,documented} XML version and go nuts!

There are now at least three document standards, or efforts to create such, for marking up law texts and other legal doucments on my radar: uscfrag (mentioned earlier), used by Cornell University for marking up US Code, LegalXML which seems to be US-centric, and LEXML, which appears to be more EU-centric. It even has it’s own Sourceforge page!

I had no idea so much was going on in so many committees when I started working on the XMLization of swedish law. In a way I’m glad that I didn’t, since I probably would have focused too much on adhereing to these emerging standards and less time to, you know, getting things done. Or worse, just waited for them to actually finish.