“EBNF Rocks!”, or undoing ten years of regex-inflicted damage

I’ve been parsing text with regexes for longer than as I’ve been programming for a living. If there was some text extraction that needed to be done, regexes was my first choice. Later on, I moved on to HTML and XML parsers when I could, but there would often be some sort of regexp magic operating on the text nodes.

But the problem of parsing swedish law texts, looking for internal and external references, finally proved too tough for my beloved regexp hammer. As an example, here is a sample paragraph: § 5 1982:713:

Bestämmelserna om livförsäkring, med undantag för 1 kap. 8 a §
samt 7 kap. 22, 23 och 26 §§, får tillämpas för skadeförsäkringar som
avses i 2 kap. 3 a § första stycket klasserna 1 och 2 samt för
avgångsbidragsförsäkringar.

(by the way, the linked HTML page is generated before I implemented the EBNF based parsing described below. As you can see, there are quite a lot of places where my previous naive regex solution got it wrong.)

Quick primer on swedish and swedish law texts: “1 kap.” means Chapter 1, “8 a §” means “section 8 a”, “första stycket” means “first paragraph” and the rest of the stuff is not really important. To create the HTML pages I first create a XML document from the plaintext law, where I try to find as much structure as possible. The most difficult part is finding internal references. The end result I’m looking for is the following:

Bestämmelserna om livförsäkring, med undantag för 
<ilink chapter="1" section="8 a">1 kap. 8 a §</ilink> samt 
<ilink chapter="7">7 kap.</ilink>
<ilink chapter="7" section="22">22</ilink>,
<ilink chapter="7" section="23">23</ilink> och
<ilink chapter="7" section="26">26 §§</ilink>, får
tillämpas för skadeförsäkringar som avses i
<ilink chapter="2" section="3 a" piece="1">2 kap. 3 a § första stycket</ilink>
klasserna 1 och 2 samt för avgångsbidragsförsäkringar.

This is a difficult problem to solve with rexes, partly because you need to keep a lot of state while simultaneously iterating over an unknown number of references (eg, when you reach “26” and create a <ilink> tag, you must remember that the current chapter in this context is “7“, and you must handle the case when there isn’t a current chapter), partly because some patterns are subpatterns of larger patterns (eg, there is a rule for the simple case “26 §“, but this matches a subset of the larger case “22, 23 och 26 §§“. Since there are so many rules (and there are many more than I’ve shown in this simple example) a solution that’s based on running several regexp transformations over the same text quickly becomes unmanageble.

Well, after reading Chapter 4, “Parsers and State-machines” from Text Processing in Python, I started thinking about the problem from a more structured parsing view. I’ve never really worked with EBNF parsers before. I could read a EBNF grammar, but since I’ve never written a compiler or interpreter, it didn’t really occur to me how useful they can be for other situations as well.

The following is a grammar that can parse the above paragraph:

root               ::= (refs/ref/plain)+
refs               ::= (ChapterSectionRefs/SectionRefs)
ChapterSectionRefs ::= ChapterRef, wc, SectionRefs
ChapterRef         ::= ChapterRefID, c"kap."
ChapterRefID       ::= number, wc, (char, wc)?
SectionRefs        ::= (IntervalOrSingle,Comma,wc)*, IntervalOrSingle, wc, And, wc, LastSectionRefID, wc, DoubleSectionMark

IntervalOrSingle   ::= (IntervalSection/SingleSectionRefID)
SingleSectionRefID ::= number
FirstSectionRefID  ::= SingleSectionRefID
LastSectionRefID   ::= number
IntervalSection    ::= SingleSectionRefID, Hyphen, SingleSectionRefID

And                ::= 'och'
Or                 ::= 'eller'
Comma              ::= ','
Hyphen             ::= '-'
DoubleSectionMark  ::= '§§'

plain        ::= (wcs/word/number/punctuation)
wc           ::= [ \t\n\r\f\v]
wcs          ::= wc+
char         ::= [a-zA-ZåäöÅÄÖ]
word         ::= char+
digit        ::= [0-9]
number       ::= digit+
punctuation  ::= [][!@#$%^&()+=|\{}:;<>,.?/§"]

If you have basic knowledge of regexes, particularly bracket expressions and repetition operators, the above grammar shouldn’t be that difficult to understand. Just start at the top of the text to be parsed, and try to satisfy the top rule (”root“), moving down to the sub-rules (that’s probably not the correct technical term) as you go. ‘(a/b/c)‘ means “first try to satisfy a, if not successful try b, if not successful try c” and ‘a, b, c‘ means “satisfy a and then b and then c”

The end result, when you feed this grammar and some law text into SimpleParse, is a result tree, which is a easily navigable data structure that I can use to create the output text. For the example above, it will return a ‘plain‘ node, then a ‘refs‘ node, then another ‘plain‘ node, another ‘refs‘ and finally a last ‘plain‘ node. The ‘plain‘ nodes are everything that the parser couldn’t match with the more specific rules (note that the ‘refs‘ and ‘ref‘ comes before ‘plain‘ in the root rule), and I just print out them as-is.

For a ‘refs‘ node, I can dive into it’s subtree, where I’ll find a single ‘ChapterSectionRefs‘ node, which in turn will have three nodes: ‘ChapterRef‘, ‘wc‘, ‘SectionRefs‘. I’m sure you can see how this contiues.

Writing the code that takes this parse tree and constructs the XML I want is not trivial, but compared to the regex mess I was in before, it’s a walk in the park.

So, after ten years with my regex hammer, during which every text parsing problem looked like a nail, I now have a EBNF screwdriver. Better late than never!

Tags: , , ,

Leave a Reply