(The last in a series of blog posts about the tech behind lagen.nu. The other parts are here: first, second, third, fourth, fifth, sixth and seventh)
URL design: Inspired both by ”Cool URIs don’t
change” and the general REST
emphasis on sensible URL design, I designed a URL scheme for
lagen.nu that maps closely to how Swedish laws are identified, and
one that is future-proof in that it hides implementation
details. Since I hope that many people will find it useful to link
to individual pages on lagen.nu, it’s very important that those
links don’t break.
To illustrate: the generated html files are placed in directories
named html/year/id.html. Suppose that I
were to just let Apache (or any other webserver) serve content straight
from that directory. For example, the copyright law has SFS id
1960:729, and so with this scheme the URL would be
http://lagen.nu/html/1960/729.html (or
http://lagen.nu/1960/729.html, depending on what I would
set DocumentRoot to). This would work fine, the URLs would be
easy to understand, and people would link to all those individual pages from
all over the web.
But now suppose I wake up one day and decide to stick all data in a big
database, and build a PHP frontend? The URLs would change, and probably be
of the form
http://lagen.nu/showlaw.php?sfs=1960:729. Again, nice
short URL, easy to understand… but now the links from all over
are broken.
So, mod_rewrite
to the rescue: With just this one simple rule,
RewriteRule ([^:]*):(.*) /html/$1/$2.html
the resource found at URL
http://lagen.nu/html/1960/729.html is now also available
at http://lagen.nu/1960:729. This is even nicer to read,
future-proof, and lets anyone who knows the ID of a law
go straight to the page for that law.
As an added bonus, it makes the text and XML version of the laws
easily available too: During generation, I put these versions in
sibling directories to html/, named text/ and
xml/, respectively, and then use the following rules:
RewriteRule ([^:]*):(.*).xml /xml/$1/$2.xml
RewriteRule ([^:]*):(.*).txt /text/$1/$2.txt
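To make the mapping concrete, here is a small Python sketch that mimics what the three rules do. It is only an illustration, not the Apache configuration itself: the name `map_path` is made up, the patterns are anchored and the dots escaped (the rules above do neither), and trying the more specific .xml and .txt patterns first is an assumption about rule order.

```python
import re

# (pattern, replacement) pairs, most specific first. The anchoring, the
# escaped dots and the ordering are assumptions of this sketch.
RULES = [
    (re.compile(r"^/?([^:/]+):([^/]+)\.xml$"), r"/xml/\1/\2.xml"),
    (re.compile(r"^/?([^:/]+):([^/]+)\.txt$"), r"/text/\1/\2.txt"),
    (re.compile(r"^/?([^:/]+):([^/]+)$"), r"/html/\1/\2.html"),
]


def map_path(path):
    """Return the file path for a public URL path, or None if no rule matches."""
    for pattern, replacement in RULES:
        if pattern.match(path):
            return pattern.sub(replacement, path)
    return None


print(map_path("/1960:729"))      # /html/1960/729.html
print(map_path("/1960:729.xml"))  # /xml/1960/729.xml
print(map_path("/om/me.html"))    # None
```

The point of the exercise is that only the right-hand side of each pair knows about the directory layout; the public URL on the left can stay the same forever.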
There are other parts, such as the index pages and the about
pages, where the underlying flat-file nature of the site shines
through, such as http://lagen.nu/om/me.html, but those
pages are not as likely to be linked to. Still, if I should decide
to change it at some point, I’ll probably make some mod_rewrite-based
backwards-compatibility hack. Also, if you want to do this
on a Win32 platform, you’re out of luck. See my previous
post for alternatives.
Update functionality: As the body of Swedish law is always
changing, I had to plan ahead for how to keep the site up to
date. New laws are usually published every Wednesday, and it’s out
of the question to download every page from Regeringskansliets rättsdatabaser
once a week. So instead, I store the highest previously-fetched
ID, and from the update routine, I try fetching laws, increasing
that ID, until I finally get a ”law not found” error. The laws can
be either ”base constitutions”, i.e. new laws, or ”change
constitutions”, laws that specify that some other, older, law
should change (think ”source code file” and ”patch”,
respectively).
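The "count upwards until it fails" idea can be sketched in a few lines of Python. This is a rough sketch and not the actual update routine: `LawNotFound`, `fetch_new_laws` and the `fetch_law` callback are hypothetical names, and it assumes that IDs within a year are numbered consecutively without gaps (it also ignores the change of year).

```python
class LawNotFound(Exception):
    """Raised by the (hypothetical) fetcher when an ID does not exist."""


def fetch_new_laws(year, last_number, fetch_law):
    """Fetch year:last_number+1, +2, ... until one is missing.

    fetch_law is a caller-supplied function that takes an SFS id such as
    "2004:101" and returns the law text, or raises LawNotFound.
    Returns the fetched laws and the new highest number to store.
    """
    fetched = {}
    number = last_number
    while True:
        sfs_id = "%d:%d" % (year, number + 1)
        try:
            fetched[sfs_id] = fetch_law(sfs_id)
        except LawNotFound:
            break  # nothing newer has been published yet
        number += 1
    return fetched, number


# A fake fetcher standing in for the real download code.
def fake_fetch(sfs_id):
    known = {"2004:101": "a base constitution", "2004:102": "a change constitution"}
    if sfs_id not in known:
        raise LawNotFound(sfs_id)
    return known[sfs_id]


laws, highest = fetch_new_laws(2004, 100, fake_fetch)
print(sorted(laws), highest)  # ['2004:101', '2004:102'] 102
```

Passing the fetcher in as a parameter keeps the loop testable without hitting the real database.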
If it’s a base constitution, it’s pretty simple: just
download it and process it from start to finish. If it’s a change
constitution, however, we find out which base constitution it
changes, fetch that, see if it has been updated (”patched”), and
if so, store the old versions of that law somewhere, then do the
normal regeneration process. In this way, I can, over time, build
up an archive of old law revisions, so that I can tell how the law
looked at a particular date. For now, I have only a few months’
worth of history, but the value of this will grow as time goes
on. In particular, it would be cool to be able to do CVS-style
diffs between arbitrary revisions of a law, or to be able to link
to a law as it looked at a particular moment in time.
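As a thought experiment, the archive and the CVS-style diff could look something like the Python sketch below. None of this is how the site actually stores things: the archive here is just a dict in memory, all function names are invented, dates are assumed to be ISO strings (so they sort as text), and revisions are assumed to be added in chronological order.

```python
import difflib


def archive_if_changed(archive, sfs_id, date, text):
    """Store (date, text) as a new revision unless the text is unchanged.

    archive maps an SFS id to a list of (date, text) tuples, oldest first.
    Returns True if a new revision was stored.
    """
    history = archive.setdefault(sfs_id, [])
    if history and history[-1][1] == text:
        return False
    history.append((date, text))
    return True


def revision_at(archive, sfs_id, date):
    """Return the text as it looked on the given date, or None."""
    text = None
    for rev_date, rev_text in archive.get(sfs_id, []):
        if rev_date <= date:
            text = rev_text
    return text


def diff_revisions(old_text, new_text):
    """Return a unified diff between two revisions as a single string."""
    lines = difflib.unified_diff(
        old_text.splitlines(), new_text.splitlines(),
        fromfile="before", tofile="after", lineterm="")
    return "\n".join(lines)


archive = {}
archive_if_changed(archive, "1960:729", "2004-09-01", "1 §\nold wording")
archive_if_changed(archive, "1960:729", "2004-11-01", "1 §\nnew wording")
old = revision_at(archive, "1960:729", "2004-10-15")
new = revision_at(archive, "1960:729", "2004-12-01")
print(diff_revisions(old, new))
```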
Front page: With the information we gather during the
update process, we can build a list of recently changed laws, and
put it on the front page. Similarly, we build a list of recent new
court verdicts, and also one with site news. All these are
published in a side bar on the front page, while the main content
area is filled with a static listing of some of the most important
laws. The different parts of the side bar lead to different news
pages, which cover site, law and verdict news in greater detail.
RSS feeds: Hey, it’s 2004, how could any self-respecting
new site not have an RSS feed? I generate feeds for all three news
types (site, law and verdict) using PyRSS2Gen,
a nice little lib for creating RSS 2.0 feeds. I haven’t tried them
out much, but feedvalidator says they’re OK, and they work fine in
all RSS readers I’ve tested with (although Opera tends to show the
raw HTML instead of rendering it, which probably indicates that I
should include it in some other way than just escaping it and
showing it in the description tag). Maybe it would be
useful to use a service like Feedburner for
this.
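For readers who want to see what "escaping it and showing it in the description tag" amounts to, here is a minimal sketch that builds an RSS 2.0 feed with nothing but Python's standard library. Note that this is not PyRSS2Gen and not the code behind the real feeds: `build_rss` is a made-up helper, and the feed title and item text are invented sample data.

```python
import xml.etree.ElementTree as ET


def build_rss(title, link, description, items):
    """Build a bare-bones RSS 2.0 document and return it as a string.

    items is a list of dicts with 'title', 'link' and 'description' keys.
    Descriptions may contain HTML; ElementTree escapes it on output.
    """
    rss = ET.Element("rss", version="2.0")
    channel = ET.SubElement(rss, "channel")
    for tag, text in (("title", title), ("link", link),
                      ("description", description)):
        ET.SubElement(channel, tag).text = text
    for item in items:
        node = ET.SubElement(channel, "item")
        for tag in ("title", "link", "description"):
            ET.SubElement(node, tag).text = item[tag]
    return ET.tostring(rss, encoding="unicode")


feed = build_rss(
    "Law news (sample)", "http://lagen.nu/", "Recently changed laws",
    [{"title": "1960:729 changed",
      "link": "http://lagen.nu/1960:729",
      "description": "<p>Changed by a new law.</p>"}])
print(feed)
```

In the output the paragraph tags show up as `&lt;p&gt;`, which is exactly the escaped form that some readers render as HTML and others show raw.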
Conclusion: This marks the end of this eight-part posting. I
hope that you’ve picked up a trick or two. As should be apparent,
I am no Python wizard or even a programming guru in general, but
over the years I’ve found a style of development that works for me
in a single-developer context (doing multiple-developer projects
is a totally different thing), mostly centered around the XP tenet
”Do the simplest thing that could possibly work”, or in my own
words: ”Don’t try to find the right way of doing something — if
you just work away, the right way will find you”.