Part 4: Converting stuff to XML
(If you missed parts 1-3, they are here, here and here.)
If you look at a sample law text as presented in SFST, you will see that it is 100% plaintext. To convert these texts to semantically sensible XML, we must look for patterns in the plaintext that identify things like headlines, the start of sections, and references.
I did this with a two-part approach. First, I break the text down into its individual paragraphs and determine what each paragraph is. This analysis operates on a ‘block level’: a block of text either is a headline, part of an ordered list, the start of a section and so on, or it isn’t. A block can’t be half headline and half ordered list.
Secondly, if the paragraph can contain references to other parts of the law (or parts of other laws, for that matter), I analyze the text in greater detail to find and resolve these references. This step operates on a ‘token’ or character level – a block of text can contain zero, one or many of these references.
The first part is fairly easy, at least conceptually. I wrote a bunch of functions like is_chapter(p), is_section(p), is_headline(p), where each individual function just performs a simple test. These are then used from a simple loop that uses a bunch of local variables to keep track of what kind of structures we’ve encountered so far – a so-called state machine.
These functions started out as very simple regexp-based inline expressions, but as I encountered more and more exceptions to my simple rules, their complexity grew. The body of current Swedish law is over 250 years old, and consistency has not been the lawmakers’ forte. Here’s an example of how to recognize the start of a section:
re_SectionId = re.compile(r'^(\d+ ?\w?) §[ \.]') # used for both match+sub
re_SectionIdOld = re.compile(r'^§ (\d+ ?\w?).') # as used in eg 1810:0926
def is_section(self, p):
    section_id = self.section_id(p)
    if section_id is None:
        return False
    if section_id == '1':
        if self.verbose: print "is_section: The section numbering's restarting"
        return True
    # now, if this sectionid is less than last section id, the
    # section is probably just a reference and not really the
    # start of a new section. One example of that is
    # /1991:1469#K1P7S1. We use util.numsort to get section id's
    # like "26 g" correct.
    a = [self.current_section, section_id]
    if a == util.numsort(a):
        # ok, the sort order's still the same, which means the potential new section has a larger ID
        if self.verbose: print "is_section: '%s' looks like the start of the section, and it probably is (%s < %s)" % (
            p[:30], self.current_section, section_id)
        return True
    else:
        if self.verbose: print "is_section: Even though '%s' looks like the start of the section, the numbering's wrong (%s > %s)" % (
            p[:30], self.current_section, section_id)
        return False
def section_id(self, p):
    match = self.re_SectionId.match(p)
    if match:
        return match.group(1).replace(" ", "")
    else:
        match = self.re_SectionIdOld.match(p)
        if match:
            return match.group(1).replace(" ", "")
        else:
            return None
So, the start of a section looks like ‘<number> [letter] §’, unless it looks like ‘§ <number> [letter]’, and as long as the section id is larger than the previous section id, unless it’s restarting at 1, and also taking into account that section ids can contain an optional letter (like ‘26 g’). Simple as that.
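To make the numbering rule concrete, here is a small standalone sketch in Python 3 (the code above is Python 2). It reuses the two regular expressions from the post, but sort_key() and follows() are hypothetical stand-ins of my own: I am only assuming that util.numsort orders ids by their numeric part first and their optional letter second.

```python
import re

# Same patterns as in the post.
RE_SECTION_ID = re.compile(r'^(\d+ ?\w?) §[ \.]')
RE_SECTION_ID_OLD = re.compile(r'^§ (\d+ ?\w?).')


def section_id(p):
    """Return the section id at the start of paragraph p, or None."""
    match = RE_SECTION_ID.match(p) or RE_SECTION_ID_OLD.match(p)
    return match.group(1).replace(" ", "") if match else None


def sort_key(sid):
    """Hypothetical stand-in for util.numsort: '26g' becomes (26, 'g')."""
    m = re.match(r'(\d+)(\w?)$', sid)
    if not m:
        return (0, sid)  # unexpected format: sort it first
    return (int(m.group(1)), m.group(2))


def follows(current, candidate):
    """True if candidate may start a new section after current."""
    # A restart at '1' is always accepted, as in is_section() above.
    return candidate == '1' or sort_key(current) <= sort_key(candidate)
```

Comparing the ids as plain strings would put "10" before "9", which is why the numeric part is converted to an integer before comparing.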
For example, below is the start of the Swedish copyright law (which has, during development, been my foremost test case). Now, if you don’t read Swedish, just know that the first paragraph signifies the start of chapter one (“1 Kap.”), the second the start of a section (“1 §”), and then there are three items in an ordered list, and finally a plain ol’ paragraph of text. (By the way, for my Swedish readers: the Swedish legal term ‘paragraf’ is NOT the same as the English typographical term ‘paragraph’, but rather translates into ‘section’. When the English word ‘paragraph’ is used, it is in the same sense as the Swedish word ‘stycke’.)
1 Kap. Upphovsrättens föremål och innehåll
1 § Den som har skapat ett litterärt eller konstnärligt verk har upphovsrätt till verket oavsett om det är
1. skönlitterär eller beskrivande framställning i skrift eller tal,
2. datorprogram,
[…other items in the ordered list omitted for brevity…]
Till litterära verk hänförs kartor, samt även andra i teckning eller grafik eller i plastisk form utförda verk av beskrivande art.
[…more things omitted for brevity…]
2 § Upphovsrätt innefattar, med de inskränkningar som nedan stadgas, uteslutande rätt att förfoga över verket genom att framställa exemplar därav och genom att göra det tillgängligt för allmänheten, i ursprungligt eller ändrat skick, i översättning eller bearbetning, i annan litteratur- eller konstart eller i annan teknik.
So, to make sense of this, the following state transitions are done:
- For the first paragraph, is_chapter() will return True, so we transition to the in_chapter state. This will emit a <chapter id="1"> tag to the result, along with the text.
- For the second paragraph, is_section() will return True, transitioning us to the in_section state as well. It’s important to realize that these states are mostly independent; we can be in a chapter without being in a section, and vice versa (some laws have only sections, no chapters). This transition will of course emit a <section id="1"> tag to the result.
- For the third paragraph, is_ordered_list() will return True, transitioning us to the in_ordered_list state, and emitting <ol><li>1. skönlitterär eller […]</li> to the result. A couple of things to note about this:
  - These tags may look like HTML tags, but they’re not. They do, in this case, share the same semantics, though.
  - An HTML ordered list (<ol>) keeps track of its numbering, and any user agent should add the numbers to the displayed result. Since this is not HTML, we don’t do that, and instead keep the number that was in the original text, mostly because we can’t be sure that no ordered list anywhere in the body of Swedish law contains sequence gaps or numbers like ‘26 g’.
- For the fourth paragraph, is_ordered_list() will again return True, but now we’re already in the in_ordered_list state, so we don’t emit the initial <ol> tag.
- Then we omit some boring junk, and then we encounter a normal paragraph. There is no function named is_ordinary_paragraph(); this is just inferred from the fact that none of our other tests matched. Now, the start of a normal paragraph is implicit evidence that our ordered list must have ended, and so the code to transition into the in_normal_paragraph state checks to see if we’re in the in_ordered_list state, and if so, emits the trailing </ol>. After that, the normal paragraph gets emitted as <p>Till litterära verk […]</p>.
- Similarly, when we encounter the start of a new section (“2 § Upphovsrätt innefattar”), we transition out of any ordered lists we might be in, out of the in_section state, emit </section>, and then transition into the same state again.
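The transitions above can be sketched as a very small state machine. This is a simplified Python 3 illustration, not the actual lagen.nu code: the three regular expressions are reduced versions of the real tests (the section test skips the numbering check), convert() and leave() are names I made up, and the <headline> tag is a guess, since only <chapter>, <section>, <ol>, <li> and <p> are mentioned in the text.

```python
import re
from xml.sax.saxutils import escape

# Simplified, hypothetical versions of is_chapter/is_section/is_ordered_list.
RE_CHAPTER = re.compile(r'^(\d+) Kap\.')
RE_SECTION = re.compile(r'^(\d+ ?\w?) §[ \.]')
RE_LIST_ITEM = re.compile(r'^\d+\. ')


def convert(paragraphs):
    """Turn a list of plaintext paragraphs into a list of XML lines."""
    out = []
    # One flag per structure; the key doubles as the tag name.
    state = {'chapter': False, 'section': False, 'ol': False}

    def leave(*names):
        # Emit closing tags for whichever of the named states are active.
        for name in names:
            if state[name]:
                out.append('</%s>' % name)
                state[name] = False

    for p in paragraphs:
        text = escape(p)  # escape &, < and > in the running text
        chapter = RE_CHAPTER.match(p)
        section = RE_SECTION.match(p)
        if chapter:
            leave('ol', 'section', 'chapter')
            state['chapter'] = True
            out.append('<chapter id="%s">' % chapter.group(1))
            out.append('<headline>%s</headline>' % text)  # tag name is a guess
        elif section:
            leave('ol', 'section')
            state['section'] = True
            out.append('<section id="%s">' % section.group(1).replace(' ', ''))
            out.append('<p>%s</p>' % text)
        elif RE_LIST_ITEM.match(p):
            if not state['ol']:
                state['ol'] = True
                out.append('<ol>')
            out.append('<li>%s</li>' % text)  # keep the original number
        else:
            # No test matched: an ordinary paragraph, which also
            # implicitly ends any ordered list we were in.
            leave('ol')
            out.append('<p>%s</p>' % text)
    leave('ol', 'section', 'chapter')  # close whatever is still open
    return out
```

Closing tags are always emitted innermost first (list, then section, then chapter), which is what keeps the output properly nested even though the states themselves are tracked independently.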
The largest part of this high-level parsing was finding all the different structures and discovering the implicit state transitions that needed to take place. One interesting problem is determining whether a given paragraph is a headline or just a really short ordinary paragraph. Well, it turns out that a headline never ends with a period, and an ordinary paragraph always does. Except for headlines that end in “m.m.” (Swedish for “etc.”). And ordinary paragraphs that end with “,”, “och” or “samt”. But a headline is always followed by a section, so we can peek ahead to see if the next paragraph matches is_section(). Except for those cases where a headline is followed by ANOTHER headline, which is then followed by a section…
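The headline rules, including the peek-ahead, could look roughly like this in Python 3. It is a sketch of the heuristics as described, not the real implementation: looks_like_headline() and the index-based is_headline() signature are my own assumptions, and the section test is the simplified regexp without the numbering check.

```python
import re

RE_SECTION = re.compile(r'^(\d+ ?\w?) §[ \.]')


def is_section(p):
    """Simplified section test (no numbering check)."""
    return RE_SECTION.match(p) is not None


def looks_like_headline(p):
    """Text-only test: does p end like a headline rather than like prose?"""
    words = p.split()
    if not words:
        return False
    if words[-1].endswith('m.m.'):
        return True   # "etc." headline, despite the trailing period
    if words[-1].endswith(('.', ',')):
        return False  # ordinary paragraph
    if words[-1] in ('och', 'samt'):
        return False  # ordinary paragraph leading into a list
    return True


def is_headline(paragraphs, i):
    """Decide whether paragraphs[i] is a headline by peeking ahead."""
    if is_section(paragraphs[i]) or not looks_like_headline(paragraphs[i]):
        return False
    # Skip over any further headline candidates that follow directly...
    j = i + 1
    while (j < len(paragraphs)
           and looks_like_headline(paragraphs[j])
           and not is_section(paragraphs[j])):
        j += 1
    # ...and require that the run of headlines ends in a section start.
    return j < len(paragraphs) and is_section(paragraphs[j])
```

Passing the whole paragraph list plus an index, instead of a single paragraph, is what makes the look-ahead possible; a test that only sees one paragraph can never resolve the headline-followed-by-headline case.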
A lot of work, and there are still a lot of places where we break (for example, things like definition lists, nested ordered lists, and tables, are not yet supported). Furthermore, tweaking to satisfy one case can easily make other cases break. I will return to this problem in the part about regression testing.
Some other things: As you may have guessed by the usage of the term “emit”, I construct the XML by hand in a single pass. This means that I write XML data straight to a file, without using DOM or anything similar. It was just easier to start out that way. I am thinking of overhauling the state machine mechanism a bit, and I might take a DOM approach then.
Some other parts of the code that construct XML documents (like the code that handles court verdicts) use DOM, and it works pretty well, with one small exception: it does not play nice with XML fragments in string form. I have a separate class to find law references in text (to be covered in deeper detail in the next post), and the interface of that class is a plain string one: it returns strings like “<link law="1960:644" piece="2" section="1">1 § 2 st.</link> varumärkeslagen”. Since there isn’t any method on individual element objects like ele.ParseAndAppendMixedContent(), I first have to create a ‘fake’ XML document and transform it into a node tree with parseString (from minidom):
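from xml.dom.minidom import parseString
# wrap the mixed content in a complete, parseable document
s = '<?xml version="1.0" encoding="iso-8859-1"?><%s>%s</%s>' % (
    element, string_containing_mixed_xml_content, element)
fragment = parseString(s)
# the root element of the fake document is the node tree we want
subele = fragment.documentElement
ele.appendChild(subele)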
Another drawback of writing XML the raw way is that there is no guarantee that your output will be valid. Special care is taken to escape things like < and &, and to ensure the document is well-formed, it’s also run through HTML Tidy (with the options -q -n -xml --indent auto --char-encoding latin1), which, despite its name, is an excellent tool for tidying up XML as well.
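As a rough illustration of the escaping step, here is a Python 3 sketch using only the standard library. The helper names paragraph_xml() and is_well_formed() are hypothetical, and the parse-based check is just a quick way to catch broken output from within Python; it is not a replacement for the Tidy run described above.

```python
from xml.dom.minidom import parseString
from xml.parsers.expat import ExpatError
from xml.sax.saxutils import escape


def paragraph_xml(text):
    """Wrap running text in <p>, escaping &, < and > first."""
    return '<p>%s</p>' % escape(text)


def is_well_formed(xml_text):
    """Return True if xml_text parses as a complete XML document."""
    try:
        parseString(xml_text)
    except ExpatError:
        return False
    return True
```

Running a check like this on each finished document makes a forgotten closing tag or an unescaped ampersand show up immediately instead of in a downstream consumer.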
Tags: lagen.nu, rättsinformation, tillståndsmaskiner, XML