(Earlier posts: 1, 2, 3, 4)
The second part of the convert-things-to-xml approach deals with
finding inline references to other paragraphs in the law
text. I’ve written about it in part in this
blog post, but to recap:
Swedish law text contains lots and lots of internal and external
references to other sections. These references have a
semi-standardized format, but they are clearly meant to be parsed
by humans, not machines.
The simplest case is a single reference to a single section
(example from 4 §
Försäkringsrörelselagen (1982:713)):
Vid ändring av en bolagsordning eller av en beviljad koncession
gäller 3 § i tillämpliga delar.
Here, the string ’3 §’ signifies a reference to the third section
in the current chapter of the current law. If we can identify and
mark that reference up, we can make the ’3 §’ into a hyperlink
leading to the definition of section 3. You know, the stuff
hyperlinking was designed for. Currently, this gets transformed
into the following XML (how the XML gets transformed into
clickable HTML is the subject of a later article):
Vid ändring av en bolagsordning eller av en beviljad koncession
gäller <link section="3">3 §</link> i tillämpliga delar.
For cases like that, the transformation is trivial, and could be
done with regexps or just simple string matching. But for cases
like this (Patentbesvärsrätten
96-837):
14 § 1 st. 6) och 6 § varumärkeslagen (1960:644)
or better yet, this (49
a § URL, my personal favourite):
Bestämmelserna i 2 § andra och tredje styckena, 3, 7--9 och 11 §§,
12 § första stycket, 13, 15, 16, 18--20 och 23 §§, 24 § första
stycket, 25--26 b, 26 d -- 26 f, 26 i -- 28, 31--38, 41, 42 och
50--52 §§ skall tillämpas på bilder som avses i denna paragraf.
things get way more complicated. Enter EBNF-grammar-based parsing
with the dynamic duo that is SimpleParse and mxTextTools. Also,
the book Text Processing in
Python by David Mertz should be mentioned, as it helped me in
the right direction when I realized regexes weren’t going to cut
it.
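For comparison, the trivial single-section case really can be handled with a regexp. The following is a minimal sketch, not the code actually used: the pattern and the function name link_simple_sections are assumptions made for illustration. It only recognises a plain number followed by a single § sign, which is exactly why it falls short for the more complicated references above.

```python
import re

# Hypothetical pattern: a section number followed by a single "§" sign.
# The negative lookahead skips "§§", which introduces a list of sections.
SIMPLE_SECTION = re.compile(r"\b(\d+) §(?!§)")


def link_simple_sections(text):
    """Wrap every simple 'N §' reference in a <link> element."""
    return SIMPLE_SECTION.sub(
        lambda m: '<link section="%s">%s</link>' % (m.group(1), m.group(0)),
        text,
    )


print(link_simple_sections("gäller 3 § i tillämpliga delar."))
```

Run on the longer examples, a pattern like this either misses references entirely (ranges and lists ending in §§) or drops the piece, item and law information that belongs to them.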
My previous
post describes the actual EBNF grammar and how SimpleParse is
used to build a parse tree from it, so you might want to read that too.
However, that is really only half of the problem. Given a
tree of parsed tokens that, for a somewhat complicated
scenario, might look like the following:
'refs': '14 § 1 st. 6) och 6 § varumärkes|lagen (1960:644)'
  'ExternalRefs': '14 § 1 st. 6) och 6 § varumärkes|lagen (1960:644)'
    'MultipleGenericRefs': '14 § 1 st. 6) och 6 §'
      'GenericRefs': '14 § 1 st. 6)'
        'GenericRef': '14 § 1 st. 6)'
          'SectionPieceRef': '14 § 1 st. 6)'
            'SectionRef': '14 §'
              'SectionRefID': '14'
                'number': '14'
              'Whitespace': ' '
            'Whitespace': ' '
            'PieceRef': '1 st. 6)'
              'PieceRefID': '1'
                'ordinal': '1'
              'Whitespace': ' '
              'PieceOrPieces': 'st.'
              'Whitespace': ' '
              'ItemRef': '6)'
                'ItemRefID': '6'
                  'number': '6'
                'RightParen': ')'
      'WAndOrW': ' och '
        'Whitespace': ' '
        'AndOr': 'och'
          'And': 'och'
        'Whitespace': ' '
      'GenericRefs': '6 §'
        'GenericRef': '6 §'
          'SectionRef': '6 §'
            'SectionRefID': '6'
              'number': '6'
            'Whitespace': ' '
    'Whitespace': ' '
    'ExternalLawRef': 'varumärkes|lagen (1960:644)'
      'NamedExternalLawRef': 'varumärkes|lagen (1960:644)'
        'word': 'varumärkes'
        'Pipe': '|'
        'LawSynonyms': 'lagen'
        'Whitespace': ' '
        'LawRef': '(1960:644)'
          'LeftParen': '('
          'LawRefID': '1960:644'
            'digit': '1'
            'digit': '9'
            'digit': '6'
            'digit': '0'
            'Colon': ':'
            'digit': '6'
            'digit': '4'
            'digit': '4'
          'RightParen': ')'
, how do we turn it into the following XML?
<link law="1960:644" section="14" piece="1" item="6">
14 § 1 st. 6)
</link>
och
<link law="1960:644" section="6">
6 §
</link>
varumärkes|lagen (
<link law="1960:644">
1960:644
</link>
)
Turns out this is a problem that can be solved in a rather generic
manner with a small amount of planning and a little
recursion. Basically, every token whose name ends in ’Ref’
should generally end up formatted as a <link>. All
tokens underneath such a token whose names end in ’RefID’ are
used to find the attributes for that tag. Start at the root,
then recurse depth-first over all child nodes until
done. Sometimes there are nodes that end in Ref
underneath other nodes also ending in Ref; in those cases
it’s the top node that is turned into a <link> tag.
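The general rule can be sketched in a few lines of Python. This is an illustration under assumptions, not the actual formatter: the Node class is a hypothetical stand-in for the parse tree, and deriving the attribute name by stripping ’RefID’ from the token name and lowercasing it is a guess that happens to fit the example output.

```python
class Node:
    """Hypothetical parse tree node: a tag, the matched text, and children."""

    def __init__(self, tag, text, children=()):
        self.tag = tag
        self.text = text
        self.children = list(children)


def find_attributes(node):
    """Collect (name, value) pairs from every ...RefID token below node."""
    attrs = []
    for child in node.children:
        if child.tag.endswith("RefID"):
            # Assumed naming rule: 'SectionRefID' -> 'section'
            attrs.append((child.tag[:-len("RefID")].lower(), child.text))
        else:
            attrs.extend(find_attributes(child))
    return attrs


def format_tree(node):
    """Depth-first walk: the topmost ...Ref node becomes a <link>."""
    if node.tag.endswith("Ref"):
        attrs = "".join(' %s="%s"' % pair for pair in find_attributes(node))
        return "<link%s>%s</link>" % (attrs, node.text)
    if not node.children:
        return node.text
    return "".join(format_tree(child) for child in node.children)


# A trimmed-down version of the MultipleGenericRefs subtree shown earlier.
example = Node("MultipleGenericRefs", "14 § 1 st. 6) och 6 §", [
    Node("GenericRefs", "14 § 1 st. 6)", [
        Node("GenericRef", "14 § 1 st. 6)", [
            Node("SectionRefID", "14"),
            Node("PieceRefID", "1"),
            Node("ItemRefID", "6"),
        ]),
    ]),
    Node("WAndOrW", " och "),
    Node("GenericRefs", "6 §", [
        Node("GenericRef", "6 §", [Node("SectionRefID", "6")]),
    ]),
])
print(format_tree(example))
```

Because the walk returns as soon as it hits a node ending in Ref, any Ref nodes nested below it only contribute attributes and never produce tags of their own.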
Of course, there are exceptions and things to be aware of. Two of
these are illustrated in the above example. To correctly insert
the law reference (the SFS id ’1960:644’) into the tags produced
from the MultipleGenericRefs token ’14 § 1 st. 6) och
6 §’, we have to plan ahead when dealing with the parent
node (ExternalRefs). The formatter has built-in knowledge
of the special handling needed: when it encounters an
ExternalRefs node, it finds the ExternalLawRef child
node, stores the underlying LawRefID value, and later
picks this value up when formatting the tags for the two
GenericRef tokens.
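One way to picture this planning ahead is to pass the law id down the recursion as extra context. The sketch below is an assumption about how it could be done, not the real implementation: Node, find_first and format_node are hypothetical names, and the tree is heavily simplified.

```python
class Node:
    """Hypothetical parse tree node: a tag, the matched text, and children."""

    def __init__(self, tag, text, children=()):
        self.tag = tag
        self.text = text
        self.children = list(children)


def find_first(node, tag):
    """Depth-first search for the first descendant with the given tag."""
    for child in node.children:
        if child.tag == tag:
            return child
        found = find_first(child, tag)
        if found is not None:
            return found
    return None


def format_node(node, law=None):
    if node.tag == "ExternalRefs":
        # Look ahead: remember the law id before formatting any children.
        law_node = find_first(node, "LawRefID")
        law = law_node.text if law_node is not None else None
    if node.tag == "GenericRef":
        attrs = [("law", law)] if law else []
        attrs += [(c.tag[:-len("RefID")].lower(), c.text)
                  for c in node.children if c.tag.endswith("RefID")]
        attr_text = "".join(' %s="%s"' % pair for pair in attrs)
        return "<link%s>%s</link>" % (attr_text, node.text)
    if not node.children:
        return node.text
    return "".join(format_node(child, law) for child in node.children)


tree = Node("ExternalRefs", "14 § och 6 § varumärkeslagen (1960:644)", [
    Node("GenericRef", "14 §", [Node("SectionRefID", "14")]),
    Node("WAndOrW", " och "),
    Node("GenericRef", "6 §", [Node("SectionRefID", "6")]),
    Node("Whitespace", " "),
    Node("ExternalLawRef", "varumärkeslagen (1960:644)", [
        Node("word", "varumärkeslagen ("),
        Node("LawRefID", "1960:644"),
        Node("RightParen", ")"),
    ]),
])
print(format_node(tree))
```

The law id appears last in the text but is needed by the first link, which is why it has to be looked up at the ExternalRefs level before any of the children are formatted.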
To make this looping and recursing easier, I built an OO wrapper
around the array-based parse tree that mxTextTools returns. This
was the subject of another
post.
Note also that the ExternalLawRef token did not result in
a <link> tag, but the underlying LawRefID
token did. This was a conscious decision (I thought it looked
better that way), and was implemented by creating a number of
extra subroutines in the formatter. Basically, the main function
acts as a generic dispatcher: it loops over all subtokens in a
tree and, for each token, checks if there’s a corresponding
format_tokenname() function. If so, it calls that;
otherwise it calls a generic formatter (which may, recursively, make
use of the generic dispatcher). The code is pretty simple, but is
indicative of how neat stuff like this can be done in dynamic
languages:
    try:
        formatter = getattr(self, "format_" + part.tag)
        res = formatter(part)
    except AttributeError:
        res = self.format_tokentree(part)
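To show the dispatcher in context, here is a self-contained sketch. Only format_tokentree and the format_ prefix come from the snippet above; the Node and LinkFormatter classes, the dispatch method and format_LawRefID are assumptions for illustration. It uses the three-argument form of getattr instead of try/except, so that an AttributeError raised inside a formatter is not mistaken for a missing formatter.

```python
class Node:
    """Hypothetical parse tree node: a tag, the matched text, and children."""

    def __init__(self, tag, text, children=()):
        self.tag = tag
        self.text = text
        self.children = list(children)


class LinkFormatter:
    def dispatch(self, part):
        # Use format_<tag> if this class defines it, else the generic one.
        formatter = getattr(self, "format_" + part.tag, self.format_tokentree)
        return formatter(part)

    def format_tokentree(self, part):
        """Generic formatter: emit leaf text, or dispatch each child."""
        if not part.children:
            return part.text
        return "".join(self.dispatch(child) for child in part.children)

    def format_LawRefID(self, part):
        """Special case: the law id itself becomes the link."""
        return '<link law="%s">%s</link>' % (part.text, part.text)


law_ref = Node("ExternalLawRef", "varumärkeslagen (1960:644)", [
    Node("word", "varumärkeslagen"),
    Node("Whitespace", " "),
    Node("LawRef", "(1960:644)", [
        Node("LeftParen", "("),
        Node("LawRefID", "1960:644"),
        Node("RightParen", ")"),
    ]),
])
print(LinkFormatter().dispatch(law_ref))
```

Adding special handling for another token type then only means adding another format_ method; the dispatcher itself stays unchanged.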
The other weird thing is that ’|’ sign in the term
’varumärkes|lagen’. This is Swedish for ’the trademark
law’, but since we like to write compound words together, this
creates an interesting challenge for the EBNF grammar. Basically, I
cannot find a way to match a word that ends in a specific suffix,
such as ’lagen’. The resulting parser is always ’greedy’,
so there seems to be no way of matching these words without
matching all words. So, to fix this, I preprocess the text
before putting it through the parser, using normal Python regular
expressions, which can be non-greedy, to put in that ’|’
sign, which solves the problem. Then, after retrieving the string
from the tag formatter, I remove those signs.
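The pre- and postprocessing step could look roughly like this. It is a sketch under assumptions: the pattern, the function names, and the restriction to the single suffix ’lagen’ are all made up for illustration, and the real code presumably handles more suffixes.

```python
import re

# Non-greedy prefix followed by the suffix 'lagen' at the end of a word.
# \w+? requires at least one character, so the bare word 'lagen' is skipped.
LAW_WORD = re.compile(r"\b(\w+?)(lagen)\b")


def mark_law_suffix(text):
    """Insert '|' before the suffix so the grammar can match it separately."""
    return LAW_WORD.sub(r"\1|\2", text)


def unmark_law_suffix(text):
    """Remove the markers again (assumes '|' never occurs in the input)."""
    return text.replace("|", "")


marked = mark_law_suffix("14 § 1 st. 6) och 6 § varumärkeslagen (1960:644)")
print(marked)
print(unmark_law_suffix(marked))
```

A purely textual rule like this will also split ordinary words that happen to end in ’lagen’, such as ’bolagen’, so a real implementation would need some way to rule those out.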
One particularly satisfying thing about the problem described in this
post is how well it lends itself to automated regression
testing. Any new feature can easily be specified in a test case
before coding begins, and after it’s done, it’s easy to verify
that nothing that previously worked has been broken. More on
regression testing in a later post.