Part 6: Transforming with XSLT

Now that the law texts have been downloaded, analysed and XML-ified, there’s just one step left: converting them to HTML. Enter XSLT.

XSLT is a complex beast, and I find that it’s better to approach it as a programming language, albeit one with an unusual syntax, than as a stylesheet language. I have one master stylesheet that I use for the transformation (now over 600 lines of XSLT code), and during development I picked up quite a few tricks. By the way, if you want to see the original XML code for any law on lagen.nu, just add the suffix “.xml” to the URL, e.g. http://lagen.nu/1960:729.xml.

Creating useful tooltips: An important aspect of the work on lagen.nu is that I’m not only trying to produce more aesthetically pleasing layouts of the text; I also try to cross-reference as much as possible, to really take advantage of the medium.

As one example: the internal references that were discussed in the last post should of course be transformed to ordinary hypertext links of the form <a href="#P26c">26 c §</a>, so that you can quickly jump around in the document and follow the references. But we can do better than that, by extracting the first 20 or so words from section 26 c and sticking them into the title attribute of the a tag:

    <a href="#P26c" title="Ägaren till en byggnad eller ett
  bruksföremål">26 c §</a>
  

Now, if you hover with the mouse over the link (assuming you have browser hard/software where these things are possible), the title text will be shown as a tooltip. This makes it even easier to make sense of a section, since you get an instant reminder of what the referenced section says — in many cases you don’t even have to click on the link.

So, initially, my plan was to have things like these tooltip strings prepared in the XML document, and just do a very simple transform into HTML. But as the work progressed, I found that I was almost always able to do it in XSLT instead. This is the relevant part of the link template:

<xsl:attribute name="title">
  <!-- Pick the XPath that matches this law's numbering scheme -->
  <xsl:variable name="text">
    <xsl:choose>
      <xsl:when test="$hasChapters and $sectionOneCount = 1">
        <xsl:value-of select="/law/chapter/section[@id=$section]/p[position() = $piece]"/>
      </xsl:when>
      <xsl:when test="$chapter != ''">
        <xsl:value-of select="/law/chapter[@id=$chapter]/section[@id=$section]/p[position() = $piece]"/>
      </xsl:when>
      <xsl:otherwise>
        <xsl:value-of select="/law/section[@id=$section]/p[position() = $piece]"/>
      </xsl:otherwise>
    </xsl:choose>
  </xsl:variable>

  <!-- Find a cut-off point on a word boundary, at most 160 characters in -->
  <xsl:variable name="real-width">
    <xsl:call-template name="tune-width">
      <xsl:with-param name="txt" select="$text"/>
      <xsl:with-param name="width" select="160"/>
      <xsl:with-param name="def" select="160"/>
    </xsl:call-template>
  </xsl:variable>
  <xsl:value-of select="normalize-space(substring($text, 1, $real-width - 1))" />
</xsl:attribute>
  

The variables $chapter, $section and $piece get their values earlier in the template, and identify the place the link points to. $hasChapters and $sectionOneCount are set globally for the document and tell us what kind of section numbering this particular law text uses. There are three variants in common use:

  • No chapters, just simple ascending section numbering
  • Chapters with restarting section numbering (i.e., regardless of the number of sections in chapter 1, the first section in chapter 2 will be numbered ‘1 §’, so we need to know the chapter as well as the section; just ‘17 §’ is ambiguous)
  • Chapters with continuous section numbering (i.e., if the last section in chapter 1 is ‘16 §’, the first section of chapter 2 will be numbered ‘17 §’, so the chapter number is never needed to unambiguously determine what ‘17 §’ refers to).
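To make the three variants concrete, here is a minimal Python sketch (not part of the stylesheet) of how a law document could be classified. It assumes the element and attribute names that appear in the XPath expressions above (law, chapter, section, id); the function name numbering_scheme is made up for illustration, and the decision rule simply mirrors the $hasChapters / $sectionOneCount test.

```python
import xml.etree.ElementTree as ET


def numbering_scheme(law_xml: str) -> str:
    """Classify a <law> document by how its sections are numbered.

    Hypothetical helper: mirrors the idea behind $hasChapters and
    $sectionOneCount, not the actual lagen.nu code.
    """
    root = ET.fromstring(law_xml)
    chapters = root.findall("chapter")
    if not chapters:
        return "no chapters"

    # Count how many sections numbered '1' exist across all chapters.
    section_one_count = sum(
        1
        for chapter in chapters
        for section in chapter.findall("section")
        if section.get("id") == "1"
    )

    # Exactly one '1 §' in the whole law means numbering never restarts.
    if section_one_count == 1:
        return "continuous"
    return "restarting"
```

With restarting numbering, a reference has to carry the chapter id as well; in the other two cases the section id alone identifies the target.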

The XSLT code picks the XPath expression that fits the numbering scheme, finds the node that the link refers to, and stores its text in the variable text. Then it trims the string down to a suitable length (max 160 chars, in this case) by using the named template tune-width, together with normalize-space and substring. tune-width ensures that we end the string on a word boundary. The result of all this is assigned to the title attribute.
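The tune-width template itself isn’t shown here, but the idea of cutting at a word boundary can be sketched in a few lines of Python. This is an illustration under my own assumptions, not a port of the template: the function name trim_to_word_boundary is made up, and it collapses whitespace before cutting, whereas the stylesheet calls normalize-space after substring.

```python
def trim_to_word_boundary(text: str, max_chars: int = 160) -> str:
    """Shorten text to at most max_chars without splitting a word."""
    # Collapse runs of whitespace, roughly what normalize-space() does.
    collapsed = " ".join(text.split())
    if len(collapsed) <= max_chars:
        return collapsed

    # Look for the last space at or before the cut-off position.
    cut = collapsed.rfind(" ", 0, max_chars + 1)
    if cut == -1:
        # A single very long word: fall back to a hard cut.
        return collapsed[:max_chars]
    return collapsed[:cut]
```

For example, trimming “Ägaren till en byggnad eller ett bruksföremål” to 20 characters gives “Ägaren till en” rather than a string that ends in the middle of “byggnad”.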

Generating a TOC: Again, if you look at the Swedish copyright law, you will notice a big blue link titled “Visa innehållsförteckning”. Clicking on it yields (if you have a browser that supports JavaScript and DOM level 1) a table of contents (TOC), generated from chapter headings and other headlines. It starts out hidden, since it’s usually in the way, but sometimes it’s very useful.

The XML document itself does not contain any TOC data. To generate the TOC, we use a number of mode-specific templates that extract the relevant information from the headlines contained in the document, all triggered by an <xsl:apply-templates> call:

<div id="toc" style="display:none;">
  <xsl:apply-templates select="law/chapter[(headline or section)]" mode="toc"/>
</div>

...

<xsl:template match="headline" mode="toc">
  <xsl:if test="@level = '1'">
    <div class="l1">
      <a>
        <xsl:attribute name="href">
          <xsl:choose>
            <xsl:when test="substring(.,1,5) = 'AVD. '">#R<xsl:value-of select="@id"/></xsl:when>
            <xsl:when test="../../chapter">#K<xsl:value-of select="../@id"/></xsl:when>
            <xsl:otherwise>#R<xsl:value-of select="@id"/></xsl:otherwise>
          </xsl:choose>
        </xsl:attribute>
        <xsl:value-of select="."/>
      </a>
    </div>
  </xsl:if>
  <xsl:if test="@level = '2'">
    <div class="l2">
      <a href="#R{@id}"><xsl:value-of select="."/></a>
    </div>
  </xsl:if>
</xsl:template>
<xsl:template match="section" mode="toc"/>

<xsl:template match="changes" mode="toc"/>
<xsl:template match="appendix" mode="toc"/>
<xsl:template match="preamble" mode="toc"/>
  

Things to note: The text-to-XML conversion is responsible for determining the ‘level’ of the headlines. Level 1 headlines are usually associated with a chapter (though not always), and we use some tests to determine if that is the case. If so, the resulting link uses a “#K<number>” anchor fragment, otherwise a “#R<number>” fragment. “K” is for “kapitel” (chapter), while “R” is for “rubrik” (headline). This is not strictly necessary, but I prefer a link that explicitly says “this link goes to chapter 4”, rather than “this link goes to the 14th headline”, particularly as the number of headlines can change in the future, which would make the link point to the wrong place.

I have a number of “empty” templates. They are needed because without them the built-in template rules kick in and just copy the entire text, which seriously messes up the TOC. I should be able to prevent that with the select attribute of the <xsl:apply-templates> tag, but so far I have been unsuccessful. (The reason I select both headlines AND sections, and then do nothing with the sections, is that I’ve been experimenting with using the first lines of each section in the TOC as well, but that came out too messy.)

Accessing data in other documents: To properly understand a section in a law, it helps to read court verdicts that reference it. In part 2, I described how to fetch data from Domstolsväsendets rättsinformation, which has this data. Basically, I have a Python code snippet that goes through all 1800+ verdicts, uses the parser that was mentioned in part 5 to find references to laws and the sections therein, and then generates a “cache.xml” file that contains references to all verdicts, sorted by law, chapter and section, like this:

<?xml version="1.0" encoding="iso-8859-1"?>
<verdicts>
<law id="1736:0123_2">
<chapter id="9">
  <section id="5">
    <verdictref caseid="NJA 2002 s. 577 (alt. NJA 2002:70)" casenumber="T3933-01" id="2002/758" verdictdate="2002-11-22">
      Gäldenär, som hade flera skulder till en borgenär, har ansetts ha rätt att destinera sin be
    </verdictref>
  </section>
</chapter>
<chapter id="10">
  <section id="1">
    <verdictref caseid="NJA 2000 s. 667 (alt. NJA 2000:97)" casenumber="T4689-98" id="2000/772" verdictdate="2000-12-18">
      Sedan A och B var för sig ställt pantsäkerhet för C:s lån i en bank, har banken gjort sig f
    </verdictref>
    <verdictref caseid="NJA 2000 s. 88 (alt. NJA 2000:13)" casenumber="Ö3863-98" id="2000/923" verdictdate="2000-02-23">
      Ett bolags upplåtelse av företagshypotek har ansetts bli sakrättsligt gällande när företags
    </verdictref>
  </section>
  [...]
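The script that writes cache.xml isn’t reproduced in this post, but the grouping step can be sketched roughly as follows. Everything here is an assumption made for illustration: the input is taken to be a list of dicts with the keys law, chapter, section, verdict, date and summary, the helper names are made up, and the other attributes seen above (caseid, casenumber) as well as laws without chapters are left out for brevity.

```python
import re
import xml.etree.ElementTree as ET


def natural_key(ident):
    """Sort key that puts '9' before '10' and '26' before '26c'."""
    return [(0, int(part), "") if part.isdigit() else (1, 0, part)
            for part in re.findall(r"\d+|\D+", ident)]


def get_or_create(parent, tag, ident):
    """Return the child <tag id=ident>, creating it if it is missing."""
    for child in parent.findall(tag):
        if child.get("id") == ident:
            return child
    return ET.SubElement(parent, tag, id=ident)


def build_cache(references):
    """Group verdict references under law/chapter/section elements."""
    root = ET.Element("verdicts")
    ordered = sorted(
        references,
        key=lambda r: (natural_key(r["law"]),
                       natural_key(r["chapter"]),
                       natural_key(r["section"])),
    )
    for ref in ordered:
        law = get_or_create(root, "law", ref["law"])
        chapter = get_or_create(law, "chapter", ref["chapter"])
        section = get_or_create(chapter, "section", ref["section"])
        verdict = ET.SubElement(section, "verdictref",
                                id=ref["verdict"], verdictdate=ref["date"])
        verdict.text = ref["summary"]
    return root
```

The resulting tree could then be written out with ET.ElementTree(root).write("cache.xml", encoding="iso-8859-1"); the file name and encoding match the example above.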
  

In the top level of the stylesheet, I select the relevant nodeset from this cache document, and store it in a variable. To be able to select, I first need to know the id of the law (the ‘SFS-nummer’), and so I pass it as a command line parameter:

<xsl:param name="lawid"/>
<xsl:variable name="relevantVerdicts"
  select="document('generated/verdict-xml/cache.xml')//law[@id=$lawid]"/>
  
Then, as I process each section, I check the nodeset to see if there are any verdicts relevant to the section:

<xsl:variable name="sectionChapterVerdicts" select="$relevantVerdicts//chapter[@id=$chapterid]/section[@id=$sectionid]/verdictref"/>
<xsl:if test="$sectionChapterVerdicts">
<div class="metadata">
<xsl:for-each select="$sectionChapterVerdicts">
  <xsl:variable name="linktext">
    <xsl:choose>
      <xsl:when test="@caseid"><xsl:value-of select="@caseid"/></xsl:when>
      <xsl:when test="@casenumber">Målnummer <xsl:value-of select="@casenumber"/></xsl:when>
      <xsl:when test="@diaryid">Diarienummer <xsl:value-of select="@diaryid"/></xsl:when>
      <xsl:otherwise>Vägledande dom</xsl:otherwise>
    </xsl:choose>
  </xsl:variable>    
  <xsl:variable name="real-width">
    <xsl:call-template name="tune-width">
      <xsl:with-param name="txt" select="."/>
      <xsl:with-param name="width" select="80"/>
      <xsl:with-param name="def" select="80"/>
    </xsl:call-template>
  </xsl:variable>
  <a href="/dom/{@id}"><xsl:value-of select="$linktext"/></a>: <xsl:value-of select="normalize-space(substring(., 1, $real-width - 1))" />... (<xsl:value-of select="@verdictdate"/>)<br/>
</xsl:for-each>
</div>
</xsl:if>
  

Again, a little extra work is done to make sure the explanatory text is cut at a word boundary (we could’ve done that in the Python code that generates cache.xml), and also that the text for the actual link makes sense. You see, different courts have different systems for assigning IDs to cases: HD (the Swedish Supreme Court) uses the page number at which the case is published in that year’s anthology of relevant verdicts (Nytt Juridiskt Arkiv, NJA), others use a court-specific identifier, and some use a derivative of the date when the case was handled. Since these identifiers represent different things, they are also represented with different attributes in the XML file, and a little <xsl:choose> trickery selects the most appropriate ID.

Optimization hints: Some of the law texts are quite big, and processing them can take a long time. To speed things up, here are some of the things I did:

  • Don’t use the Python interface to libxslt! I started out with it, but a transformation turned out to take two or even three times as long as running command-line xsltproc (win32 binaries) through an os.system() call.
  • Use the --profile switch to xsltproc. It’s pointless to optimize if you don’t know where the bottlenecks are, and with xsltproc it’s so easy to find them.
  • Store frequently-referenced nodes, nodesets and values in variables, instead of selecting them through XPath queries all the time. Yeah, this one is kind of obvious, but it really helps, particularly in those “inner” templates that are used all the time.

Just by using the above tips, the transformation of my standard test case (1960:729) went from 90 seconds to three. Still, I have other test cases that take several minutes, so clearly there’s some work left to do…
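For reference, the command-line pieces mentioned in this post (the lawid parameter, the --profile switch, and running xsltproc as an external process) fit together roughly like this. It’s a sketch, not my actual build script: the function names and file names are made up, and it uses subprocess.run where I used os.system().

```python
import subprocess


def xsltproc_command(stylesheet, source, output, lawid, profile=False):
    """Build the xsltproc argument list for transforming one law."""
    cmd = ["xsltproc"]
    if profile:
        # Prints per-template timing information after the run.
        cmd.append("--profile")
    # --stringparam passes the value to <xsl:param name="lawid"/> as a string.
    cmd += ["--stringparam", "lawid", lawid]
    cmd += ["-o", output, stylesheet, source]
    return cmd


def transform(stylesheet, source, output, lawid, profile=False):
    """Run xsltproc as an external process; raises if it exits non-zero."""
    subprocess.run(
        xsltproc_command(stylesheet, source, output, lawid, profile),
        check=True,
    )
```

A call such as transform("law.xsl", "1960-729.xml", "1960-729.html", "1960:729", profile=True) would then run one transformation with profiling turned on (the file names are hypothetical).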
