(If you missed parts 1-3, they are here,
here
and here)
If you look at a sample law text as it is presented in SFST,
it is 100% plaintext. In order to convert it to semantically
sensible XML, we must look for patterns in the plaintext to
identify things like headlines, starts of sections and references.
I did this with a two-part approach. First, I break down the text
into its individual paragraphs and determine what each paragraph
is. This analysis operates on a ‘block level’: either a block of
text is a headline, part of an ordered list, the start of a
section etc, or it isn’t. A block can’t be half headline and half
ordered list.
Secondly, if the paragraph can contain references to other parts
of the law (or parts of other laws, for that matter), I analyze
the text in greater detail to find and resolve these
references. This step operates on a ‘token’ or character level –
a block of text can contain zero, one or many of these references.
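The block-splitting half of the first step could be sketched roughly like this. It assumes that paragraphs in the plaintext are separated by blank lines and that lines inside a paragraph are hard-wrapped; the post doesn't say how SFST actually delimits them, and split_paragraphs is a hypothetical name.

```python
import re


def split_paragraphs(text):
    """Split plaintext into blocks separated by one or more blank lines.

    Assumption: blank lines delimit paragraphs in the source text.
    """
    blocks = re.split(r"\n\s*\n", text.strip())
    # Join the hard-wrapped lines of each block into one line of text.
    return [" ".join(line.strip() for line in block.splitlines())
            for block in blocks]
```

Each returned string is then one ‘block’ that the is_*() tests can classify as a whole.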
The first part is fairly easy, at least conceptually. I wrote a
bunch of functions like is_chapter(p),
is_section(p), is_headline(p), where each
individual function just performs a simple test. These are then
used from a simple loop that uses a bunch of local variables to
keep track of what kind of structures we’ve encountered so far –
a so-called state machine.
These functions started out as very simple regexp-based inline
expressions, but as I encountered more and more exceptions to my
simple rules, their complexity grew. The body of current Swedish
law is over 250 years old, and consistency has not been the
lawmakers’ forte. Here’s an example of how to recognize the
start of a section:
    import re

    # The two patterns and the methods below all live on the parser class.
    re_SectionId = re.compile(r'^(\d+ ?\w?) §[ \.]') # used for both match+sub
    re_SectionIdOld = re.compile(r'^§ (\d+ ?\w?).') # as used in eg 1810:0926

    def is_section(self, p):
        section_id = self.section_id(p)
        if section_id is None:
            return False
        if section_id == '1':
            if self.verbose: print("is_section: The section numbering's restarting")
            return True
        # now, if this sectionid is less than last section id, the
        # section is probably just a reference and not really the
        # start of a new section. One example of that is
        # /1991:1469#K1P7S1. We use util.numsort to get section id's
        # like "26 g" correct.
        a = [self.current_section, section_id]
        if a == util.numsort(a):
            # ok, the sort order's still the same, which means the
            # potential new section has a larger ID
            if self.verbose: print("is_section: '%s' looks like the start of the section, and it probably is (%s < %s)" % (
                p[:30], self.current_section, section_id))
            return True
        else:
            if self.verbose: print("is_section: Even though '%s' looks like the start of the section, the numbering's wrong (%s > %s)" % (
                p[:30], self.current_section, section_id))
            return False

    def section_id(self, p):
        # try the modern form first, then the old '§ 1.' form
        match = self.re_SectionId.match(p)
        if match:
            return match.group(1).replace(" ", "")
        else:
            match = self.re_SectionIdOld.match(p)
            if match:
                return match.group(1).replace(" ", "")
            else:
                return None
So, the start of a section looks like ‘<number> [letter] §’,
unless it looks like ‘§ <number> [letter]’, and as long as
the section id is larger than the previous section id, unless it’s
restarting at 1, and also taking into account that section ids can
contain an optional letter (like ‘26 g’). Simple as that.
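To try these rules out in isolation, here is a small standalone sketch in Python 3. The two regexps are the ones from the post; everything else is an assumption made for illustration. In particular, sort_key is a hypothetical stand-in for util.numsort (whose real implementation isn't shown), and current_section is passed in as an argument instead of living on the parser object.

```python
import re

RE_SECTION_ID = re.compile(r'^(\d+ ?\w?) §[ \.]')
RE_SECTION_ID_OLD = re.compile(r'^§ (\d+ ?\w?).')


def section_id(p):
    """Return the section id ('1', '26g', ...) if p starts like a section."""
    for pattern in (RE_SECTION_ID, RE_SECTION_ID_OLD):
        match = pattern.match(p)
        if match:
            return match.group(1).replace(" ", "")
    return None


def sort_key(sid):
    """Hypothetical replacement for util.numsort: number first, then letter."""
    m = re.match(r'(\d+)(\w?)', sid)
    return (int(m.group(1)), m.group(2))


def is_section(p, current_section):
    sid = section_id(p)
    if sid is None:
        return False
    if sid == '1' or current_section is None:
        return True  # numbering restarts, or this is the first section seen
    # A smaller id than the current one is probably just a reference.
    return sort_key(sid) >= sort_key(current_section)


print(is_section("26 h § Text", "26g"))  # larger id, accepted
print(is_section("7 § gäller även", "26g"))  # smaller id, rejected
```

Comparing on a (number, letter) tuple is what lets ‘26 h’ sort after ‘26 g’ while ‘7’ still sorts before both, which a plain string comparison would get wrong.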
For example, below is the start of the Swedish copyright law
(which has, during development, been my foremost test case). Now,
if you don’t read Swedish, just know that the first paragraph
signifies the start of chapter one (“1 Kap.”), the second the
start of a section (“1 §”), and then there are three items in an
ordered list, and finally a plain ol’ paragraph of text. (By the
way, for my Swedish readers: The Swedish legal term ‘paragraf’ is
NOT the same as the English typographical term ‘paragraph’, but
rather translates into ‘section’. When the English word
‘paragraph’ is used, it is in the same sense as the Swedish word
‘stycke’.)
1 Kap. Upphovsrättens föremål och innehåll
1 § Den som har skapat ett litterärt eller konstnärligt verk har
upphovsrätt till verket oavsett om det är
1. skönlitterär eller beskrivande framställning i skrift eller tal,
2. datorprogram,
[…other items in the ordered list omitted for brevity…]
Till litterära verk hänförs kartor, samt även andra i teckning eller
grafik eller i plastisk form utförda verk av beskrivande art.
[…more things omitted for brevity…]
2 § Upphovsrätt innefattar, med de inskränkningar som nedan stadgas,
uteslutande rätt att förfoga över verket genom att framställa exemplar
därav och genom att göra det tillgängligt för allmänheten, i
ursprungligt eller ändrat skick, i översättning eller bearbetning, i
annan litteratur- eller konstart eller i annan teknik.
So, to make sense of this, the following state transitions take place:
- For the first paragraph, is_chapter() will return True,
  so we transition to the in_chapter state. This will emit a
  <chapter id="1"> tag to the result, along with
  the text.
- For the second paragraph, is_section() will return
  True, transitioning us to the in_section state as
  well. It’s important to realize that these states are mostly
  independent; we can be in a chapter without being in a section,
  and vice versa (some laws have only sections, no chapters). This
  transition will of course emit a <section id="1"> tag
  to the result.
- For the third paragraph, is_ordered_list() will return
  True, transitioning us to the in_ordered_list state, and
  emitting <ol><li>1. skönlitterär eller
  […]</li> to the result. A couple of things to
  note about this:
  - These tags may look like HTML tags, but they’re not. They do,
    in this case, share the same semantics, though.
  - An HTML ordered list (<ol>) keeps track of
    its numbering, and any user-agent should add the numbers to
    the displayed result. Since this is not HTML, we don’t do
    that, and instead keep the number that was in the original
    text, mostly because we can’t be sure that every ordered list
    in the entire body of Swedish law is free of sequence gaps
    or things like ‘26 g’.
- For the fourth paragraph, is_ordered_list() will again return
  True, but now we’re already in the in_ordered_list
  state, so we don’t emit the initial <ol> tag.
- Then we omit some boring junk, and then we encounter a normal
  paragraph. There is no function named
  is_ordinary_paragraph(); this is just inferred from the
  fact that none of our other tests matched. Now, the start of a
  normal paragraph is implicit evidence that our ordered list must
  have ended, and so the code to transition into the
  in_normal_paragraph state checks to see if we’re in the
  in_ordered_list state, and if so, emits the trailing
  </ol>. After that, the normal paragraph gets
  emitted as <p>Till litterära verk […]</p>.
- Similarly, when we encounter the start of a new section (“2
  § Upphovsrätt innefattar”), we transition out of any
  ordered lists we might be in, out of the in_section state, emit
  </section>, and then transition into the same state
  again.
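The transitions above can be sketched as a small loop. This is a simplified illustration, not the actual parser: only the section regexp comes from the post, the chapter and list-item patterns are guesses, the section test skips the numbering check, and the state is kept in a dict where the real code uses local variables.

```python
import re
from xml.sax.saxutils import escape

# Only RE_SECTION mirrors a pattern from the post; the other two are guesses.
RE_CHAPTER = re.compile(r'^(\d+ ?\w?) [Kk]ap\.')
RE_SECTION = re.compile(r'^(\d+ ?\w?) §[ \.]')
RE_LIST_ITEM = re.compile(r'^\d+\. ')


def parse(paragraphs):
    """Turn a list of paragraph strings into a list of emitted XML chunks."""
    out = []
    state = {'chapter': False, 'section': False, 'ordered_list': False}

    def leave(name, closing_tag):
        # Emit the closing tag only if we are actually in that state.
        if state[name]:
            out.append(closing_tag)
            state[name] = False

    for p in paragraphs:
        text = escape(p)
        chapter = RE_CHAPTER.match(p)
        section = RE_SECTION.match(p)
        if chapter:
            leave('ordered_list', '</ol>')
            leave('section', '</section>')
            leave('chapter', '</chapter>')
            state['chapter'] = True
            cid = chapter.group(1).replace(' ', '')
            out.append('<chapter id="%s">%s' % (cid, text))
        elif section:
            leave('ordered_list', '</ol>')
            leave('section', '</section>')
            state['section'] = True
            sid = section.group(1).replace(' ', '')
            out.append('<section id="%s">%s' % (sid, text))
        elif RE_LIST_ITEM.match(p):
            if not state['ordered_list']:
                out.append('<ol>')
                state['ordered_list'] = True
            out.append('<li>%s</li>' % text)  # keep the original number
        else:
            # Nothing else matched, so this is an ordinary paragraph,
            # which implicitly ends any open ordered list.
            leave('ordered_list', '</ol>')
            out.append('<p>%s</p>' % text)

    # End of input: close whatever is still open, innermost first.
    leave('ordered_list', '</ol>')
    leave('section', '</section>')
    leave('chapter', '</chapter>')
    return out
```

Note how every branch first leaves the states it implicitly terminates; that is where most of the ‘discovered’ transitions end up living.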
The largest part of this high-level parsing was to find all the
different structures and discover the implicit state transitions
that needed to take place. One interesting problem is determining
whether a given paragraph is a headline or just a really short
ordinary paragraph. Well, it turns out that a headline never ends
with a period, and an ordinary paragraph always does. Except for
headlines that end in “m.m.” (Swedish for “etc.”). And ordinary
paragraphs that end with “,”, “och” or “samt”. But a headline is
always followed by a section, so we can peek ahead to see if the
next paragraph matches is_section(). Except for those cases where
a headline is followed by ANOTHER headline, which is then followed
by a section…
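Put together, the headline rules might look something like the sketch below. It is an assumption-laden illustration: could_be_headline and the index-based is_headline signature are hypothetical, the section test ignores numbering, and the lookahead accepts any run of consecutive headline candidates, which is slightly more permissive than the single extra headline described above.

```python
import re

RE_SECTION = re.compile(r'^\d+ ?\w? §[ \.]')


def could_be_headline(p):
    """Textual test only: does p end the way a headline would?"""
    if not p.strip() or RE_SECTION.match(p):
        return False
    if p.endswith('m.m.'):
        return True   # ends in a period, but is still a headline
    if p.endswith('.') or p.endswith(','):
        return False  # ordinary paragraph
    if p.split()[-1] in ('och', 'samt'):
        return False  # ordinary paragraph that runs on into a list
    return True


def is_headline(paragraphs, i):
    """Peek ahead: a headline must lead, possibly via more headlines,
    to the start of a section."""
    if not could_be_headline(paragraphs[i]):
        return False
    j = i + 1
    while j < len(paragraphs) and could_be_headline(paragraphs[j]):
        j += 1
    return j < len(paragraphs) and RE_SECTION.match(paragraphs[j]) is not None
```

The textual test alone is not enough, which is why the second function needs access to the following paragraphs and not just the current one.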
A lot of work, and there are still a lot of places
where we break (for example, things like definition lists, nested
ordered lists, and tables, are not yet supported). Furthermore,
tweaking to satisfy one case can easily make other cases break. I
will return to this problem in the part about regression testing.
Some other things: As you may have guessed by the usage of the
term “emit”, I construct the XML by hand in a single pass. This
means that I write XML data straight to a file, without using DOM
or anything similar. It was just easier to start out that way. I
am thinking of overhauling the state machine mechanism a bit, and
I might take a DOM approach then.
Some other parts of the code that construct XML documents (like
the code that handles court verdicts) use DOM, and it works pretty
well, with one small exception: It does not play nice with XML
fragments in string form. I have a separate class to find law
references in text (to be covered in deeper detail in the next
post), and the interface of that class is a plain string one: it
returns strings like ‘<link law="1960:644" piece="2"
section="1">1 § 2 st.</link> varumärkeslagen’. Since
there isn’t any method on individual element objects like
ele.ParseAndAppendMixedContent(), I first have to create
a ‘fake’ XML document and transform it into a node tree with
parseString (from minidom):
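    from xml.dom.minidom import parseString

    # wrap the mixed-content string in a throwaway document so it can be parsed
    s = '<?xml version="1.0" encoding="iso-8859-1"?><%s>%s</%s>' % (
        element, string_containing_mixed_xml_content, element)
    fragment = parseString(s)
    subele = fragment.documentElement
    ele.appendChild(subele)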
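A variation on the same trick, as a self-contained Python 3 sketch: here the fragment's children are copied straight into an existing element instead of appending the wrapper element itself. The helper name append_mixed_content and the wrapper tag are hypothetical; parseString and importNode are regular minidom calls.

```python
from xml.dom.minidom import parseString


def append_mixed_content(ele, xml_fragment):
    """Parse a string of mixed text/element content and append it to ele."""
    # Wrap in a throwaway root so the fragment parses as a document.
    wrapper = parseString('<wrapper>%s</wrapper>' % xml_fragment).documentElement
    for child in list(wrapper.childNodes):
        # importNode copies the node into ele's own document first.
        ele.appendChild(ele.ownerDocument.importNode(child, True))


doc = parseString('<p/>')
append_mixed_content(
    doc.documentElement,
    '<link law="1960:644" piece="2" section="1">1 § 2 st.</link> varumärkeslagen')
print(doc.documentElement.toxml())
```

The fragment still has to be well-formed on its own; a stray < or & in the string makes parseString raise an exception.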
Another drawback of writing XML the raw way is that there is no
guarantee that your output will be well-formed. Special care is
taken to escape things like < and &, and as an extra safeguard
the document is also run through HTML Tidy (with the
options -q -n -xml --indent auto --char-encoding latin1),
which, despite its name, is an excellent tool for tidying up XML
as well.
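For the escaping part, the standard library already has what is needed. The following is a minimal sketch of a hand-written ‘emit’ helper, assuming output goes to a file-like object; emit_element is a hypothetical name and not a function from the actual code.

```python
from xml.sax.saxutils import escape, quoteattr


def emit_element(out, tag, text, **attrs):
    """Write <tag attr="...">text</tag> to out, escaping text and attributes."""
    # quoteattr adds the surrounding quotes and escapes the value.
    attr_str = ''.join(' %s=%s' % (name, quoteattr(value))
                       for name, value in attrs.items())
    # escape handles &, < and > in character data.
    out.write('<%s%s>%s</%s>\n' % (tag, attr_str, escape(text), tag))
```

This only protects text and attribute values; it does nothing to keep opening and closing tags balanced, which is why a well-formedness check on the finished file is still worthwhile.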