Now that the law text has been downloaded, analysed and XML-ified,
there's just one step left: converting it to HTML. Enter XSLT.
XSLT is a complex beast, and I find that it’s better to approach
it as a programming language, albeit one with an unusual syntax, than
as a stylesheet language. I have one master stylesheet that I use for
the transformation (now over 600 lines of XSLT code), and during
development I picked up quite a few tricks. By the way, if you want
to see the original XML code for any law on lagen.nu, just add the
suffix “.xml” to the URL, e.g. http://lagen.nu/1960:729.xml.
Creating useful tooltips: An important aspect of the work
done with lagen.nu is that I’m trying not only to produce more
aesthetically pleasing layouts of the text, but also to
cross-reference as much as possible, to really take advantage of
the medium.
As one example: the internal references that were discussed in the
last post should of course be transformed to ordinary hypertext
links, of the form <a href="#P26c">26 c
§</a>, so that you can quickly jump around in the
document and follow the references. But we can do better than
that, by extracting the first 20 or so words from section 26 c
and sticking them into the title attribute of the a tag:
<a href="#P26c" title="Ägaren till en byggnad eller ett
bruksföremål">26 c §</a>
Now, if you hover the mouse over the link (assuming your
browser and hardware make such things possible), the
title text will be shown as a tooltip. This makes it even
easier to make sense of a section, since you get an instant
reminder of what the referenced section says. In many cases you
don’t even have to click on the link.
So, initially, my plan was to have things like these tooltip
strings prepared in the XML document, and just do a very simple
transform into HTML. But as the work progressed, I found that I
was almost always able to do it in XSLT instead. This is the
relevant part of the link template:
<xsl:attribute name="title">
  <!-- Pick the target paragraph; the path depends on the numbering scheme -->
  <xsl:variable name="text">
    <xsl:choose>
      <!-- Chapters with continuous numbering: the section id alone is enough -->
      <xsl:when test="$hasChapters and $sectionOneCount = 1">
        <xsl:value-of select="/law/chapter/section[@id=$section]/p[position() = $piece]"/>
      </xsl:when>
      <!-- Chapters with restarting numbering: the chapter id is needed as well -->
      <xsl:when test="$chapter != ''">
        <xsl:value-of select="/law/chapter[@id=$chapter]/section[@id=$section]/p[position() = $piece]"/>
      </xsl:when>
      <!-- No chapters at all -->
      <xsl:otherwise>
        <xsl:value-of select="/law/section[@id=$section]/p[position() = $piece]"/>
      </xsl:otherwise>
    </xsl:choose>
  </xsl:variable>
  <!-- Find a cut-off point of at most 160 characters that falls on a word boundary -->
  <xsl:variable name="real-width">
    <xsl:call-template name="tune-width">
      <xsl:with-param name="txt" select="$text"/>
      <xsl:with-param name="width" select="160"/>
      <xsl:with-param name="def" select="160"/>
    </xsl:call-template>
  </xsl:variable>
  <xsl:value-of select="normalize-space(substring($text, 1, $real-width - 1))" />
</xsl:attribute>
The variables $chapter, $section and $piece get their values
earlier in the template, and identify the place the link points
to. $hasChapters and $sectionOneCount are set globally for the
document and are used to tell what kind of section numbering this
particular law text is using. There are three variants commonly
used:
- No chapters, just simple ascending section numbering
- Chapters with restarting section numbering (i.e., regardless of
the number of sections in chapter 1, the first section in chapter
2 will be numbered ‘1 §’, so we need to know the chapter as
well as the section; just ‘17 §’ is ambiguous)
- Chapters with continuous section numbering (i.e., if the last
section in chapter 1 is ‘16 §’, the first section of chapter 2
will be numbered ‘17 §’, so the chapter number is never needed
to unambiguously determine what ‘17 §’ refers to).
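To make the three variants concrete, here is a minimal Python sketch of how a document could be classified. It is not taken from the stylesheet: the function name numbering_scheme and the sample document are made up for illustration, and the only things borrowed from the text are the law/chapter/section element names and the idea of counting how many sections are numbered 1.

```python
import xml.etree.ElementTree as ET

def numbering_scheme(law):
    """Classify a <law> element by how its sections are numbered."""
    chapters = law.findall("chapter")
    if not chapters:
        return "flat"  # no chapters, simple ascending numbering
    # Count the sections numbered "1" across all chapters
    # (the same idea as the $sectionOneCount variable).
    section_one_count = sum(
        1
        for chapter in chapters
        for section in chapter.findall("section")
        if section.get("id") == "1"
    )
    # Exactly one "1 §" means the numbering never restarts.
    return "continuous" if section_one_count == 1 else "restarting"

# Invented sample: two chapters that both start at section 1.
SAMPLE = """<law>
  <chapter id="1"><section id="1"/><section id="2"/></chapter>
  <chapter id="2"><section id="1"/></chapter>
</law>"""

print(numbering_scheme(ET.fromstring(SAMPLE)))
```

With the sample above, both chapters contain a section 1, so the document is classified as using restarting numbering.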
The link template constructs an XPath expression, finds the node that the
link refers to, and stores its text in the variable text. Then
it trims the string down to a suitable length (max 160 chars, in
this case) by using the user-defined
template tune-width, together with normalize-space and
substring. tune-width ensures that we end the string on a word
boundary. The result of all this is assigned to the title
attribute.
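The tune-width template itself is not shown here, so the following Python sketch only illustrates the general idea of cutting a string at a word boundary. The function name tooltip_text is hypothetical, and unlike the XSLT above it collapses whitespace before cutting rather than after.

```python
def tooltip_text(text, max_chars=160):
    """Collapse whitespace and cut the text at a word boundary within max_chars."""
    collapsed = " ".join(text.split())
    if len(collapsed) <= max_chars:
        return collapsed
    cut = collapsed[:max_chars]
    # If the cut lands in the middle of a word, back up to the previous space.
    if collapsed[max_chars] != " " and " " in cut:
        cut = cut[: cut.rindex(" ")]
    return cut.rstrip()

print(tooltip_text("Ägaren till en byggnad eller ett bruksföremål", 20))
```

With a limit of 20 characters the cut would fall inside “byggnad”, so the sketch backs up and returns “Ägaren till en”.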
Generating a TOC: Again, if you look at the Swedish
copyright law, you will notice a big blue link titled “Visa
innehållsförteckning”. Clicking on this yields (if you have a
browser that supports JavaScript and DOM level 1) a table of
contents (TOC), generated from chapter headings and other
headlines. It starts out hidden, since it usually is in the
way, but sometimes it’s very useful.
The XML document itself does not contain any TOC data. To
generate the TOC, we use a number of mode-specific templates that
extract the relevant information from headlines contained in the
document, all triggered by an <xsl:apply-templates> call:
<div id="toc" style="display:none;">
  <xsl:apply-templates select="law/chapter[(headline or section)]" mode="toc"/>
</div>
...
<xsl:template match="headline" mode="toc">
  <xsl:if test="@level = '1'">
    <div class="l1">
      <a>
        <xsl:attribute name="href">
          <xsl:choose>
            <xsl:when test="substring(.,1,5) = 'AVD. '">#R<xsl:value-of select="@id"/></xsl:when>
            <xsl:when test="../../chapter">#K<xsl:value-of select="../@id"/></xsl:when>
            <xsl:otherwise>#R<xsl:value-of select="@id"/></xsl:otherwise>
          </xsl:choose>
        </xsl:attribute>
        <xsl:value-of select="."/>
      </a>
    </div>
  </xsl:if>
  <xsl:if test="@level = '2'">
    <div class="l2">
      <a href="#R{@id}"><xsl:value-of select="."/></a>
    </div>
  </xsl:if>
</xsl:template>
<xsl:template match="section" mode="toc"/>
<xsl:template match="changes" mode="toc"/>
<xsl:template match="appendix" mode="toc"/>
<xsl:template match="preamble" mode="toc"/>
Things to note: The text-to-XML conversion is responsible for
determining the ‘level’ of the headlines. Level 1 headlines are
usually associated with a chapter (though not always), and we use
some tests to determine if that is the case. If so, the resulting
link uses a “#K<number>” anchor fragment, otherwise a
“#R<number>” fragment. “K” is for “kapitel” (chapter), while
“R” is for “rubrik” (headline). Not strictly necessary, but I
prefer a link that explicitly says “this link goes to chapter 4”,
rather than “this link goes to the 14th headline”, particularly as
the number of headlines can change in the future, which would make
the link point to the wrong place.
I have a number of “empty” templates. They are needed, since if I
don’t have them, the built-in template rules kick in and just copy the
entire text, which seriously messes up the TOC. Now, I should be
able to limit that with the select attribute of the
<xsl:apply-templates> tag, but I have been unsuccessful. (The
reason I select both headlines AND sections, then do nothing with
the sections, is that I’ve been experimenting with using the first
lines of each section in the TOC as well, but that came out too
messy.)
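The anchor rules in the headline template can be restated as a small Python sketch. This is only an illustration of the logic, not a replacement for the XSLT: the function name toc_entries and the sample headlines are invented, and it assumes that headlines sit directly inside chapter elements and carry level and id attributes, as the template above suggests.

```python
import xml.etree.ElementTree as ET

def toc_entries(law):
    """Return (level, href, title) tuples for level 1 and 2 headlines in chapters."""
    entries = []
    for chapter in law.findall("chapter"):
        for headline in chapter.findall("headline"):
            level = headline.get("level")
            if level not in ("1", "2"):
                continue  # the template only handles these two levels
            title = " ".join("".join(headline.itertext()).split())
            if level == "1" and not title.startswith("AVD. "):
                # A chapter headline: link to the chapter itself ("kapitel").
                href = "#K" + chapter.get("id")
            else:
                # Any other headline: link to the headline ("rubrik").
                href = "#R" + headline.get("id")
            entries.append((level, href, title))
    return entries

# Invented sample document, only for demonstrating the function.
SAMPLE = """<law>
  <chapter id="1">
    <headline level="1" id="1">1 kap. Inledande bestämmelser</headline>
    <section id="1"><p>Text</p></section>
    <headline level="2" id="2">Definitioner</headline>
  </chapter>
</law>"""

for level, href, title in toc_entries(ET.fromstring(SAMPLE)):
    print(level, href, title)
```

Sections are simply never visited here, which is the effect the empty templates achieve in the stylesheet.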
Accessing data in other documents: To properly understand a
section in a law, it helps to read court verdicts that reference
it. In part 2, I described how to fetch data from Domstolsväsendets
rättsinformation, which has this data. Basically, I have a
Python code snippet that goes through all 1800+ verdicts, uses the
parser that was mentioned in part 5 to find references
to laws and sections therein, and then generates a “cache.xml” file that
contains references to all verdicts, sorted by law, chapter and
section, like this:
<?xml version="1.0" encoding="iso-8859-1"?>
<verdicts>
  <law id="1736:0123_2">
    <chapter id="9">
      <section id="5">
        <verdictref caseid="NJA 2002 s. 577 (alt. NJA 2002:70)" casenumber="T3933-01" id="2002/758" verdictdate="2002-11-22">
          Gäldenär, som hade flera skulder till en borgenär, har ansetts ha rätt att destinera sin be
        </verdictref>
      </section>
    </chapter>
    <chapter id="10">
      <section id="1">
        <verdictref caseid="NJA 2000 s. 667 (alt. NJA 2000:97)" casenumber="T4689-98" id="2000/772" verdictdate="2000-12-18">
          Sedan A och B var för sig ställt pantsäkerhet för C:s lån i en bank, har banken gjort sig f
        </verdictref>
        <verdictref caseid="NJA 2000 s. 88 (alt. NJA 2000:13)" casenumber="Ö3863-98" id="2000/923" verdictdate="2000-02-23">
          Ett bolags upplåtelse av företagshypotek har ansetts bli sakrättsligt gällande när företags
        </verdictref>
      </section>
[...]
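The script that generates cache.xml is not shown in this post, so the following is only a sketch of how such a grouped file could be assembled with Python's standard library. The function names, the shape of the input dictionaries, and the assumption that every reference carries a chapter id are mine; the element and attribute names are taken from the sample above, and the entries are grouped in input order rather than sorted.

```python
import xml.etree.ElementTree as ET

def find_or_create(parent, tag, id_):
    """Return the child <tag id="..."> of parent, creating it if it is missing."""
    for child in parent.findall(tag):
        if child.get("id") == id_:
            return child
    return ET.SubElement(parent, tag, {"id": id_})

def build_cache(refs):
    """Group verdict references under law, chapter and section elements."""
    root = ET.Element("verdicts")
    for ref in refs:
        law = find_or_create(root, "law", ref["law"])
        chapter = find_or_create(law, "chapter", ref["chapter"])
        section = find_or_create(chapter, "section", ref["section"])
        verdict = ET.SubElement(section, "verdictref", ref["attrs"])
        verdict.text = ref["summary"]
    return root

# Hypothetical parser output, reusing ids from the sample above.
REFS = [
    {"law": "1736:0123_2", "chapter": "9", "section": "5",
     "attrs": {"casenumber": "T3933-01", "id": "2002/758",
               "verdictdate": "2002-11-22"},
     "summary": "Gäldenär, som hade flera skulder till en borgenär"},
    {"law": "1736:0123_2", "chapter": "10", "section": "1",
     "attrs": {"casenumber": "T4689-98", "id": "2000/772",
               "verdictdate": "2000-12-18"},
     "summary": "Sedan A och B var för sig ställt pantsäkerhet"},
    {"law": "1736:0123_2", "chapter": "10", "section": "1",
     "attrs": {"casenumber": "Ö3863-98", "id": "2000/923",
               "verdictdate": "2000-02-23"},
     "summary": "Ett bolags upplåtelse av företagshypotek"},
]

print(ET.tostring(build_cache(REFS), encoding="unicode"))
```

The second and third references end up under the same section element, which is what makes the later per-section lookup cheap.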
In the top level of the stylesheet, I select the relevant nodeset
from this cache document, and store it in a variable. To be able
to select, I first need to know the id of the law (the
‘SFS-nummer’), and so I pass it as a command line parameter:
<xsl:param name="lawid"/>
<xsl:variable name="relevantVerdicts"
select="document('generated/verdict-xml/cache.xml')//law[@id=$lawid]"/>
Then, as I process each section, I check the nodeset to see if
there are any verdicts relevant to the section:
<xsl:variable name="sectionChapterVerdicts" select="$relevantVerdicts//chapter[@id=$chapterid]/section[@id=$sectionid]/verdictref"/>
<xsl:if test="$sectionChapterVerdicts">
  <div class="metadata">
    <xsl:for-each select="$sectionChapterVerdicts">
      <xsl:variable name="linktext">
        <xsl:choose>
          <xsl:when test="@caseid"><xsl:value-of select="@caseid"/></xsl:when>
          <xsl:when test="@casenumber">Målnummer <xsl:value-of select="@casenumber"/></xsl:when>
          <xsl:when test="@diaryid">Diarienummer <xsl:value-of select="@diaryid"/></xsl:when>
          <xsl:otherwise>Vägledande dom</xsl:otherwise>
        </xsl:choose>
      </xsl:variable>
      <xsl:variable name="real-width">
        <xsl:call-template name="tune-width">
          <xsl:with-param name="txt" select="."/>
          <xsl:with-param name="width" select="80"/>
          <xsl:with-param name="def" select="80"/>
        </xsl:call-template>
      </xsl:variable>
      <a href="/dom/{@id}"><xsl:value-of select="$linktext"/></a>: <xsl:value-of select="normalize-space(substring(., 1, $real-width - 1))" />... (<xsl:value-of select="@verdictdate"/>)<br/>
    </xsl:for-each>
  </div>
</xsl:if>
Again, a little extra work is done to make sure the explanatory
text is cut at a word boundary (we could’ve done that in
the Python code that generates cache.xml), and also that the text
for the actual link makes sense. You see, different courts have
different systems for assigning IDs to cases: HD (the Swedish
supreme court) uses the page number at which the case is published in
that year’s anthology of relevant verdicts (Nytt Juridiskt Arkiv,
NJA), others use a court-specific identifier, and some use a
derivative of the date when the case was handled. Since these
identifiers represent different things, they are also represented
with different attributes in the XML file, and a little
<xsl:choose> trickery selects the most appropriate ID.
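For readers more at home in Python than XSLT, the lookup and the link-text fallback can be sketched roughly as follows. The function names are hypothetical, the first verdict is copied from the cache.xml sample (with its summary left out), and the second verdict is invented to show the fallback to the case number.

```python
import xml.etree.ElementTree as ET

CACHE = """<verdicts>
  <law id="1736:0123_2">
    <chapter id="10">
      <section id="1">
        <verdictref caseid="NJA 2000 s. 667 (alt. NJA 2000:97)" casenumber="T4689-98" id="2000/772" verdictdate="2000-12-18"/>
        <verdictref casenumber="T0000-00" id="0000/000" verdictdate="2000-01-01"/>
      </section>
    </chapter>
  </law>
</verdicts>"""

def verdicts_for(cache, law_id, chapter_id, section_id):
    """Return the verdictref elements filed under one section of one law."""
    return [
        verdict
        for law in cache.iter("law") if law.get("id") == law_id
        for chapter in law.iter("chapter") if chapter.get("id") == chapter_id
        for section in chapter.iter("section") if section.get("id") == section_id
        for verdict in section.findall("verdictref")
    ]

def link_text(verdict):
    """Pick the most specific identifier present, in the same order as the xsl:choose."""
    if "caseid" in verdict.attrib:
        return verdict.get("caseid")
    if "casenumber" in verdict.attrib:
        return "Målnummer " + verdict.get("casenumber")
    if "diaryid" in verdict.attrib:
        return "Diarienummer " + verdict.get("diaryid")
    return "Vägledande dom"

cache = ET.fromstring(CACHE)
for verdict in verdicts_for(cache, "1736:0123_2", "10", "1"):
    print("/dom/" + verdict.get("id"), link_text(verdict), verdict.get("verdictdate"))
```

Like the XSLT tests, the checks look at whether an attribute is present, not whether it has a non-empty value.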
Optimization hints: Some of the law texts are quite big, and
processing them can take a long time. To speed things up, here are
some of the things I did:
- Don’t use the Python interface to libxslt! I started out with
this, but it turned out to take twice or even three times as long
for a transformation, as compared to running command-line xsltproc (win32 binaries)
through an os.system() call.
- Use the --profile switch to xsltproc. It’s pointless to
optimize if you don’t know where the bottlenecks are, and with
xsltproc it’s so easy to find them.
- Store frequently-referenced nodes, nodesets and values in
variables, instead of selecting them through XPath queries all the
time. Yeah, this one is kind of obvious, but it really helps,
particularly in those “inner” templates that are used all the
time.
Just by using the above tips, the transformation of my standard
test case (1960:729) went from 90 seconds to three. Still, I have
other test cases that take several minutes, so clearly there’s
still some work to do…
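As a rough sketch of the command-line approach, this is how xsltproc could be driven from Python while passing the lawid parameter and optionally turning on profiling. It uses subprocess.run instead of the os.system() call mentioned above, and the function names and file names are made up for the example.

```python
import subprocess

def xsltproc_command(stylesheet, source, output, law_id, profile=False):
    """Build the xsltproc argument list for transforming one law."""
    cmd = ["xsltproc"]
    if profile:
        cmd.append("--profile")  # ask xsltproc for per-template timing information
    # --stringparam passes law_id to <xsl:param name="lawid"/> as a string.
    cmd += ["--stringparam", "lawid", law_id, "-o", output, stylesheet, source]
    return cmd

def transform(stylesheet, source, output, law_id, profile=False):
    """Run xsltproc as a separate process; raises CalledProcessError on failure."""
    subprocess.run(
        xsltproc_command(stylesheet, source, output, law_id, profile),
        check=True,
    )

# Hypothetical file names, shown only to illustrate the resulting command line.
print(" ".join(xsltproc_command("law.xsl", "1960-729.xml", "1960-729.html",
                                "1960:729", profile=True)))
```

Passing the arguments as a list avoids shell quoting problems with law ids and file names.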