”EBNF Rocks!”, or undoing ten years of regex-inflicted damage

I’ve been parsing text with regexes for longer than as I’ve been
programming for a living. If there was some text extraction that
needed to be done, regexes was my first choice. Later on, I moved on
to HTML and XML parsers when I could, but there would often be some
sort of regexp magic operating on the text nodes.

But the problem of parsing swedish law texts, looking for internal
and external references, finally proved too tough for my beloved
regexp hammer. As an example, here is a sample paragraph: § 5 1982:713:

Bestämmelserna om livförsäkring, med undantag för 1 kap. 8 a §
samt 7 kap. 22, 23 och 26 §§, får tillämpas för skadeförsäkringar som
avses i 2 kap. 3 a § första stycket klasserna 1 och 2 samt för
avgångsbidragsförsäkringar.

(by the way, the linked HTML page is generated before I implemented
the EBNF based parsing described below. As you can see, there are quite a lot of places where my previous naive regex solution got it wrong.)

Quick primer on swedish and swedish law texts: ”1 kap.” means
Chapter 1, ”8 a §” means ”section 8 a”, ”första stycket” means ”first
paragraph” and the rest of the stuff is not really important. To
create the HTML pages I first create a XML document from the plaintext law, where I try to
find as much structure as possible. The most difficult part is finding
internal references. The end result I’m looking for is the
following:

Bestämmelserna om livförsäkring, med undantag för 
<ilink chapter="1" section="8 a">1 kap. 8 a §</ilink> samt 
<ilink chapter="7">7 kap.</ilink>
<ilink chapter="7" section="22">22</ilink>,
<ilink chapter="7" section="23">23</ilink> och
<ilink chapter="7" section="26">26 §§</ilink>, får
tillämpas för skadeförsäkringar som avses i
<ilink chapter="2" section="3 a" piece="1">2 kap. 3 a § första stycket</ilink>
klasserna 1 och 2 samt för avgångsbidragsförsäkringar.

This is a difficult problem to solve with rexes, partly because you
need to keep a lot of state while simultaneously iterating over an
unknown number of references (eg, when you reach ”26” and
create a <ilink> tag, you must remember that the
current chapter in this context is ”7”, and you must handle
the case when there isn’t a current chapter), partly because some
patterns are subpatterns of larger patterns (eg, there is a rule for
the simple case ”26 §”, but this matches a subset of the
larger case ”22, 23 och 26 §§”. Since there are so many rules
(and there are many more than I’ve shown in this simple example) a
solution that’s based on running several regexp transformations over
the same text quickly becomes unmanageble.

Well, after reading Chapter 4, ”Parsers and State-machines” from Text Processing in Python, I started
thinking about the problem from a more structured parsing view. I’ve
never really worked with EBNF parsers before. I could read a EBNF
grammar, but since I’ve never written a compiler or interpreter, it
didn’t really occur to me how useful they can be for other situations
as well.

The following is a grammar that can parse the above paragraph:

root               ::= (refs/ref/plain)+
refs               ::= (ChapterSectionRefs/SectionRefs)
ChapterSectionRefs ::= ChapterRef, wc, SectionRefs
ChapterRef         ::= ChapterRefID, c"kap."
ChapterRefID       ::= number, wc, (char, wc)?
SectionRefs        ::= (IntervalOrSingle,Comma,wc)*, IntervalOrSingle, wc, And, wc, LastSectionRefID, wc, DoubleSectionMark

IntervalOrSingle   ::= (IntervalSection/SingleSectionRefID)
SingleSectionRefID ::= number
FirstSectionRefID  ::= SingleSectionRefID
LastSectionRefID   ::= number
IntervalSection    ::= SingleSectionRefID, Hyphen, SingleSectionRefID

And                ::= 'och'
Or                 ::= 'eller'
Comma              ::= ','
Hyphen             ::= '-'
DoubleSectionMark  ::= '§§'

plain        ::= (wcs/word/number/punctuation)
wc           ::= [ \t\n\r\f\v]
wcs          ::= wc+
char         ::= [a-zA-ZåäöÅÄÖ]
word         ::= char+
digit        ::= [0-9]
number       ::= digit+
punctuation  ::= [][!@#$%^&()+=|\{}:;<>,.?/§"]

If you have basic knowledge of regexes, particularly bracket
expressions and repetition operators, the above grammar shouldn’t be
that difficult to understand. Just start at the top of the text to be
parsed, and try to satisfy the top rule (”root”), moving down
to the sub-rules (that’s probably not the correct technical term) as
you go. ‘(a/b/c)‘ means ”first try to satisfy a, if not successful try
b, if not successful try c” and ‘a, b, c‘ means ”satisfy a and then b
and then c”

The end result, when you feed this grammar and some law text into
SimpleParse, is a
result tree, which is a easily navigable data structure that I can use
to create the output text. For the example above, it will return a
plain‘ node, then a ‘refs‘ node, then another
plain‘ node, another ‘refs‘ and finally a last
plain‘ node. The ‘plain‘ nodes are everything that
the parser couldn’t match with the more specific rules (note that the
refs‘ and ‘ref‘ comes before ‘plain‘ in
the root rule), and I just print out them as-is.

For a ‘refs‘ node, I can dive into it’s subtree, where
I’ll find a single ‘ChapterSectionRefs‘ node, which in turn
will have three nodes: ‘ChapterRef‘, ‘wc‘,
SectionRefs‘. I’m sure you can see how this contiues.

Writing the code that takes this parse tree and constructs the XML
I want is not trivial, but compared to the regex mess I was in before,
it’s a walk in the park.

So, after ten years with my regex hammer, during which every text
parsing problem looked like a nail, I now have a EBNF
screwdriver. Better late than never!