Quickies of the day

Not much time for interesting posts of my own these days, busy
studying history for my final oral exam on Wednesday. Meanwhile,
here’s what interested me in the blogosphere today.

  • Interesting
    blog entry
    from Frederik Lundh on the role of Trackers in the
    Widget Construction Set
    for Python. I might need to write some GUI code soon, maybe this would
    be the way to go.
  • Aaron Wormus worries
    about the embedded database SQLite and bad security practises (Hint:
    Keep the SQLite database file OUT of the web root)
  • Brad Adams posts slides
    and demos
    from his BorCon presentation. I
    think Borland might have a winner on its hand with the next version of
    Delphi, since it seems it will allow both classic Win32 programming
    and .Net programming, filling the niche that VB left with
    VB.Net. There’s a lot of people that wants to write code that runs on
    machines without the .Net runtime, and currently the only Microsoft
    option for that is C++. At the same time, people want to have the
    opportunity to do .Net development as well. Delphi’s might be the best
    option for these people.
  • Sam Ruby notices some character set mismatch in a blog post from
    Cory
    Doctorow
    , and dissects it in great detail. Almost as required a
    reading as Joel Spolsky’s much linked developers
    guide to Unicode
    .
  • David Warnock suggests
    that python is a good language for teaching kids how to code. I’m
    wondering how good Squeak
    works for this, seemingly as it’s designed for this and similar
    purposes?
  • Edward W. Felten suggests,
    like Lessig
    did
    a few weeks ago, that online porn shold be labeled with an
    ”adultsonly” or ”porn” tag. I said
    it then, and I’ll say it again: Take a look at PICS, the w3c standard that’s
    designed precisely for this and similar problems. Maybe the problem is
    that there is no ”standard” rating vocabulary with
    definitions for ”porn” or ”adultsonly”. Or maybe it’s the fact that
    only XML standard geeks can understand that last document. I think
    PICS need an evangelist quick, or others will reimplement it,
    badly.

Python is starting to grow on me

Ok, so I’m busily parsing
lawtext with EBNF grammars
, but the resulting parse tree is
somewhat cumbersome to work with. Each node is a 4-tuple,
[tagname,startindex,endindex,childnodes], where childnodes is a list
of similar 4-tuples:

  for n in r[3]:
      print "%s: %s" % (n[0],indata[n[1]:n[2])
      if n[0] == 'refs':
          for sn in n[3]:
              print "  %s: %s" % (sn[0],indata[sn[1]:sn[2]])

So, late at night I begin to ponder if it would be possible to write
some sort of OO-wrapper around this structure of tuples in tuples, so
that I could write somewhat more readable code:

  for n in r.nodes:
      print "%s: %s" % (n.tag,n.text)
      if n.tag == 'refs':
          for sn in n.nodes:
              print "  %s: %s" % (sn.tag,sn.text)

Turns out it takes less than 20 lines of code:

class NodeTree:
    def __init__(self,root,data,offset=0,isRoot=True):
        self.data = data
        self.root = root
        self.isRoot = isRoot
        self.offset = offset

    def __getattr__(self,name):
        if name == "text":
            return self.data
        elif name == "tag":
            return (self.isRoot and 'root' or self.root[0])
        elif name == "nodes":
            res = []
            for p in (self.isRoot and self.root[1] or self.root[3]):
                res.append(NodeTree(p,self.data[p[1]-self.offset:p[2]-self.offset],p[1],False))
            return res
        else:
            raise AttributeError

A good python programmer could probably trim down the above to ten
lines, but still. I don’t know if it’s the language, or the particular
problem I’m trying to solve, but I’ve been programming a whole lot
more recursively lately.

”EBNF Rocks!”, or undoing ten years of regex-inflicted damage

I’ve been parsing text with regexes for longer than as I’ve been
programming for a living. If there was some text extraction that
needed to be done, regexes was my first choice. Later on, I moved on
to HTML and XML parsers when I could, but there would often be some
sort of regexp magic operating on the text nodes.

But the problem of parsing swedish law texts, looking for internal
and external references, finally proved too tough for my beloved
regexp hammer. As an example, here is a sample paragraph: § 5 1982:713:

Bestämmelserna om livförsäkring, med undantag för 1 kap. 8 a §
samt 7 kap. 22, 23 och 26 §§, får tillämpas för skadeförsäkringar som
avses i 2 kap. 3 a § första stycket klasserna 1 och 2 samt för
avgångsbidragsförsäkringar.

(by the way, the linked HTML page is generated before I implemented
the EBNF based parsing described below. As you can see, there are quite a lot of places where my previous naive regex solution got it wrong.)

Quick primer on swedish and swedish law texts: ”1 kap.” means
Chapter 1, ”8 a §” means ”section 8 a”, ”första stycket” means ”first
paragraph” and the rest of the stuff is not really important. To
create the HTML pages I first create a XML document from the plaintext law, where I try to
find as much structure as possible. The most difficult part is finding
internal references. The end result I’m looking for is the
following:

Bestämmelserna om livförsäkring, med undantag för 
<ilink chapter="1" section="8 a">1 kap. 8 a §</ilink> samt 
<ilink chapter="7">7 kap.</ilink>
<ilink chapter="7" section="22">22</ilink>,
<ilink chapter="7" section="23">23</ilink> och
<ilink chapter="7" section="26">26 §§</ilink>, får
tillämpas för skadeförsäkringar som avses i
<ilink chapter="2" section="3 a" piece="1">2 kap. 3 a § första stycket</ilink>
klasserna 1 och 2 samt för avgångsbidragsförsäkringar.

This is a difficult problem to solve with rexes, partly because you
need to keep a lot of state while simultaneously iterating over an
unknown number of references (eg, when you reach ”26” and
create a <ilink> tag, you must remember that the
current chapter in this context is ”7”, and you must handle
the case when there isn’t a current chapter), partly because some
patterns are subpatterns of larger patterns (eg, there is a rule for
the simple case ”26 §”, but this matches a subset of the
larger case ”22, 23 och 26 §§”. Since there are so many rules
(and there are many more than I’ve shown in this simple example) a
solution that’s based on running several regexp transformations over
the same text quickly becomes unmanageble.

Well, after reading Chapter 4, ”Parsers and State-machines” from Text Processing in Python, I started
thinking about the problem from a more structured parsing view. I’ve
never really worked with EBNF parsers before. I could read a EBNF
grammar, but since I’ve never written a compiler or interpreter, it
didn’t really occur to me how useful they can be for other situations
as well.

The following is a grammar that can parse the above paragraph:

root               ::= (refs/ref/plain)+
refs               ::= (ChapterSectionRefs/SectionRefs)
ChapterSectionRefs ::= ChapterRef, wc, SectionRefs
ChapterRef         ::= ChapterRefID, c"kap."
ChapterRefID       ::= number, wc, (char, wc)?
SectionRefs        ::= (IntervalOrSingle,Comma,wc)*, IntervalOrSingle, wc, And, wc, LastSectionRefID, wc, DoubleSectionMark

IntervalOrSingle   ::= (IntervalSection/SingleSectionRefID)
SingleSectionRefID ::= number
FirstSectionRefID  ::= SingleSectionRefID
LastSectionRefID   ::= number
IntervalSection    ::= SingleSectionRefID, Hyphen, SingleSectionRefID

And                ::= 'och'
Or                 ::= 'eller'
Comma              ::= ','
Hyphen             ::= '-'
DoubleSectionMark  ::= '§§'

plain        ::= (wcs/word/number/punctuation)
wc           ::= [ \t\n\r\f\v]
wcs          ::= wc+
char         ::= [a-zA-ZåäöÅÄÖ]
word         ::= char+
digit        ::= [0-9]
number       ::= digit+
punctuation  ::= [][!@#$%^&()+=|\{}:;<>,.?/§"]

If you have basic knowledge of regexes, particularly bracket
expressions and repetition operators, the above grammar shouldn’t be
that difficult to understand. Just start at the top of the text to be
parsed, and try to satisfy the top rule (”root”), moving down
to the sub-rules (that’s probably not the correct technical term) as
you go. ‘(a/b/c)‘ means ”first try to satisfy a, if not successful try
b, if not successful try c” and ‘a, b, c‘ means ”satisfy a and then b
and then c”

The end result, when you feed this grammar and some law text into
SimpleParse, is a
result tree, which is a easily navigable data structure that I can use
to create the output text. For the example above, it will return a
plain‘ node, then a ‘refs‘ node, then another
plain‘ node, another ‘refs‘ and finally a last
plain‘ node. The ‘plain‘ nodes are everything that
the parser couldn’t match with the more specific rules (note that the
refs‘ and ‘ref‘ comes before ‘plain‘ in
the root rule), and I just print out them as-is.

For a ‘refs‘ node, I can dive into it’s subtree, where
I’ll find a single ‘ChapterSectionRefs‘ node, which in turn
will have three nodes: ‘ChapterRef‘, ‘wc‘,
SectionRefs‘. I’m sure you can see how this contiues.

Writing the code that takes this parse tree and constructs the XML
I want is not trivial, but compared to the regex mess I was in before,
it’s a walk in the park.

So, after ten years with my regex hammer, during which every text
parsing problem looked like a nail, I now have a EBNF
screwdriver. Better late than never!

Learning python… again

An interesting side effect of deciding to quit
programming for food
is that I’ve started to program for fun
again. Since a set of laws is, at it’s core, just a gigantic set of
large documents, much interesting stuff can be done by textual
processing of this body of work. And so I’m writing some python code
to fetch all current swedish laws, the preparation documents for them,
and verdicts referencing them, in order to semantically mark them all
up and cross-reference the hell out of them. It’s been surprisingly
fun so far.

So, why python? Since I’m more proficient in C# or Perl, it would
make more sense to use any of these languages. Well, C# in particular
is a very sensible language, with a very useful class library. But,
you know, it’s just not that fun. There’s something about the
whitespace-sensitivity of python that just feels good. And since I no
longer have to worry about marketable programming skills, that’s what
I’m going to use.

But one aspect of the sensibility of C# and the .Net programming
platform is that the tools are really really (really) good. Visual
Studio’s integration of the class libraries (incl intellisense),
documentation and debugging is first-class. Sure, the editor might not
be as powerful as Emacs, but that’s really only a small part of the
puzzle.

And so I’ve been searching for a good IDE for python
development. So far I’ve tried, and rejected, the following:

PythonWin
Free with a download of ActivePython.
No integrated class browsing or intellisense-like features, and no
easy access to documentation.
Visual Python
This is a plugin to Visual Studio, which to me makes a lot of
sense, and it does a lot of things right. What brings it down for me
is the lack of help and autocompletion for built-in types like strings
and the lack of an Immediate pane during debugging. It’s also way to
expensive for me to buy personally.
Komodo
This is a standalone commercial IDE with an affordable personal
use price. Here, the main dealbreaker is the lack of class library
integration and a slow editor. The class browser also leaves much to
be desired (it’s more of a symbol browser, relly).

So, if none of these passed my test, what am I using? Well, I gave up
on the whole IDE thing and decided to just use Emacs. With some help
from Pontus, I got
python-mode.el to behave enough that I’m comfortable enough to get
some stuff done. The py-help-at-point command does most of what
integrated documentation would do, and for the time being I’ll have to
browse the class library over here with
Mozilla instead. And use printf-debugging. It worked ten years ago, it
should work now as well.

I do intend to test WingIDE and BlackAdder,
but since they’re both commercial and not that affordable for a
student, they have to be really really good for me to choose them over
my current Emacs solution.

State of alternative languages on the CLR?

A big selling point in the initial marketing of the .Net platform was that it was supposed to enable code written in different language to interoperate smoothly. I always looked at that claim with a fair amount of suspicion, probably since I remember what it was like to develop classic ASP with Perl. If you remember, ASP was supposed to be usable with any language that supported Active Scripting, and ActiveState brought out a Win32 version of Perl that did. However, I never got it to run as smoothly as with VBScript (If I recall correctly, I had the most problems with getting ADO calls to work right), and since it was a ”unsupported” language, I couldn’t find much help on the net. I was hoping that this time around, Microsoft would try harder to make programming in third-party languages a reality.

Now, while the CLR is designed to run IL code which is language-agnostic in theory, it’s seems obvious to me that the CTS was designed for imperative, object oriented language with strong(-ish) typing, and in particular the C# language. It do not lend itself readily to typeless,dynamic, functional and/or logical languages. Articles like Dynamic languages and virtual machines by Jon Udell and ”Ignoring the Scripts” by Larry O’Brien state that noone is using dynamic languages on the .Net platform to any large degree.

However, while I was researching Smalltalk for the IDG.se column, I stumbled across #Smalltalk that seem to be a fairly useful Smalltalk implementation that compiles to IL code. Also of interest is S#.Net, a dialect of Smalltalk-98 that also runs on the CLR.

Some of the other languages that have been made to run on the CLR are:

All these languages are pretty far from being strongly-typed imperative languages, so it seems that it is indeed possible to use more dynamic languages on the .Net platform. Indeed, Jim Hugunin, the author of IronPython (and Jython, the Python-on-JVM implementation) notes that while his initial intention was to write an article titled ”Why .NET is a terrible platform for dynamic languages”, he ended up with the conclusion that the CLR is indeed a good platform for dynamic languages. My question is: Is anyone using these languages on the .Net platform in real projects? I’d be very interested to hear any success stories.

Quickies