The legality of screen scraping

Niklas Lundblad directs me to a couple of interesting legislative proposals concerning computer crime (one of them actually uses the phrase ”crimes in cyberspace”, very 1995-ish retro). Both Ds 2005:5 and Ds 2005:6 are intended as first steps in implementing recent European legislation (particularly the Council of Europe ”Convention on Cybercrime”, ETS No. 185) in Swedish law.

Unfortunately, both documents are very thin when it comes to defining what should be regarded as ”illegal access to information systems”. Swedish law in this area has never been well defined, and I was hoping that the lawmakers would take this opportunity to clarify it.

The issue I’m most concerned with is screen scraping. I like screen scraping. I think screen scraping is cool. I wrote my first screen scraping program the same week I got my first job, almost ten years ago: a simple script that automatically downloaded the latest Dilbert cartoon and emailed it to me as an attachment.
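
For the curious, here is a minimal sketch of what that script did, in modern Python. The original predates this by a decade and was certainly written differently; the strip URL and mail addresses below are made up for illustration.

    import smtplib
    import urllib.request
    from email.message import EmailMessage

    # Hypothetical location of the day's strip; the real URL differed.
    STRIP_URL = "http://www.example.com/dilbert/today.gif"

    image_data = urllib.request.urlopen(STRIP_URL).read()

    # Build a mail with the strip as an attachment and send it to myself.
    msg = EmailMessage()
    msg["Subject"] = "Today's Dilbert"
    msg["From"] = "me@example.com"
    msg["To"] = "me@example.com"
    msg.set_content("Strip attached.")
    msg.add_attachment(image_data, maintype="image", subtype="gif",
                       filename="dilbert.gif")

    with smtplib.SMTP("localhost") as smtp:
        smtp.send_message(msg)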

Now, it’s easy to see why a content provider would be opposed to screen scraping. When I got my Dilbert strip in my inbox, there was no advertising attached to it, and thus I was depriving United Media of ad revenue.

Years later, I got involved in the XMLTV project while playing around with a homebrew HTPC. For Swedish TV listings, there was a simple program that would fetch listings from dagenstv.com and re-format them into the XMLTV format. One day, their service started serving up a hostilely worded text file when I ran this program. Basically, dagenstv.com had changed their web server configuration so that requests from a certain User-agent (our screen scraper) would get a stern warning that we were doing illegal things and that our IP address had been logged, or something like that.
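
Their actual block was presumably a few lines of web server configuration; the idea itself is simple enough to sketch as a minimal Python handler. The blocked User-agent string here is made up for illustration.

    from http.server import BaseHTTPRequestHandler, HTTPServer

    BLOCKED = "xmltv-grabber"  # hypothetical scraper User-agent

    class ListingsHandler(BaseHTTPRequestHandler):
        def do_GET(self):
            if BLOCKED in self.headers.get("User-Agent", ""):
                # The scraper gets the stern warning...
                body = b"Scraping this site is not permitted. Your IP has been logged."
            else:
                # ...while ordinary browsers get the listings.
                body = b"<html><body>TV listings here</body></html>"
            self.send_response(200)
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)

    HTTPServer(("", 8000), ListingsHandler).serve_forever()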

Now, were we doing something illegal? Keep in mind that each user would run this program on his or her own computer; we never redistributed the content. Whether or not we could have done that legally in Sweden is another question, one that could perhaps be answered by pondering URL 49 § (section 49 of the Swedish Copyright Act) and related materials. It’s an interesting question in its own right, just not the subject of today’s blog post.

Was the mere act of accessing the site with a tool other than the one the site owner intended, thereby gaining access to digital data in a non-approved way, illegal? For me as a programmer, this feels like an absurd question. I’m only sending humble GET requests; if the site owner doesn’t want me to have the information, then don’t send it! But with my law student glasses on, this could be considered computer infringement, as per the wording in BrB 4:9 c: ”Den som […] olovligen bereder sig tillgång till upptagning för automatisk databehandling […] döms för dataintrång till böter eller fängelse i högst två år.” (A rough translation: ”Someone who […] without permission gets hold of a recording for automatic data processing […] is to be sentenced for computer infringement to a fine or imprisonment of at most two years”.)
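
To make concrete just how humble such a request is, this is everything our grabber would have sent and received; the host, path, and User-agent string are illustrative, not dagenstv.com’s actual ones.

    import http.client

    # One complete request/response cycle: ask politely, get an answer.
    conn = http.client.HTTPConnection("www.example.com")
    conn.request("GET", "/listings/today.html",
                 headers={"User-Agent": "xmltv-grabber/0.1"})
    response = conn.getresponse()
    print(response.status, response.reason)  # e.g. 200 OK
    html = response.read()  # whatever the server chose to send back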

Dagenstv.com could be said to have given users with normal web browsers implicit permission to access the data, but probably not us with our screen scraper. If we had asked the site owners, they would very likely have said ”no”, and therefore they could well argue that we were getting hold of a ”recording” without permission.

(As an aside: the use of ”recording” (”upptagning” in Swedish) in the quoted law text is interesting in its own right. The legislation was originally written with telephone wiretapping, opening of letters, and similar things in mind, and was then ”adapted” (in the loosest sense of the word) for the digital age.)

I would prefer that questions like these were solved by technical, not legal, means. Dagenstv.com used one such means (the User-agent discrimination) to block our screen scraper. We could have changed our program to masquerade as a normal Internet Explorer browser, but that would only have escalated things into a pointless arms race. Someone wrote a different script that fetched the data from another site instead, and that was the end of it. Furthermore, if we had bypassed dagenstv.com’s User-agent check, we would essentially have been saying: ”Even though we’ve been told in no uncertain terms that what we’re doing is not permitted by the site owners, we’re choosing to ignore that and circumvent the access control.” Had we done that, dagenstv.com would really have been right in saying we were getting hold of data without permission.
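
For the record, the circumvention we decided against would have been a one-line change, something like the following sketch (illustrative URL; the User-agent string is a typical Internet Explorer one from that era):

    import urllib.request

    # Pretend to be a normal browser instead of identifying ourselves.
    req = urllib.request.Request(
        "http://www.example.com/listings/today.html",
        headers={"User-Agent": "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)"},
    )
    html = urllib.request.urlopen(req).read()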

But there’s a lot to be said for screen scraping. lagen.nu could not exist without screen scraping. A lot of really cool web services over the years have been made possible by screen scraping. It enabled loose coupling years before anyone talked about web services. It’s the basis for a lot of interesting research and data mining. And sometimes it just enables plain cool stuff.

Furthermore, it would be wrong to assume that all content providers are opposed to screen scraping. For example, what is the one thing that distinguishes forward-thinking web companies? They provide APIs to their services (Amazon, LiveJournal, Yahoo, Google, Flickr), enabling anyone to build cool applications on top of their data, just like we wanted to build a cool HTPC application using data from dagenstv.com. By providing APIs, smart web sites remove the need for actual screen scraping (which, in all fairness, is a messy and seldom very interesting technological challenge, and in any case only a means to an end), but enable and encourage the same kinds of applications. These APIs (and their XML-RPC/SOAP-based underpinnings) did not emerge from a vacuum. People have been screen-scraping Amazon.com for their own little needs since it launched. Smart service providers realise that it’s better to work with all this creativity than against it.
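
To show the difference in practice, here is a hedged sketch of what the API route looks like in the XML-RPC style, using Python’s standard library; the endpoint and method name are hypothetical, since every provider documents its own.

    import xmlrpc.client

    # One structured call replaces a whole scrape-and-parse dance.
    server = xmlrpc.client.ServerProxy("http://api.example.com/RPC2")
    listings = server.tv.getListings("2005-03-15")  # hypothetical method
    for programme in listings:
        print(programme["channel"], programme["title"])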

If web site providers choose to do what dagenstv.com did, then fine. They’ve stated their intent; it’s their service, their rules, and they’re entitled to take their ball and go home. But before a site owner puts such a block in place (which could also be done through a robots.txt file), screen scraping should in no way be considered unlawful computer infringement.
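
A well-behaved scraper can honour such a robots.txt block with a few lines; this sketch uses Python’s standard robots.txt parser, with an illustrative URL and User-agent.

    import urllib.robotparser

    # Fetch and parse the site's robots.txt before scraping anything.
    rp = urllib.robotparser.RobotFileParser()
    rp.set_url("http://www.example.com/robots.txt")
    rp.read()

    if rp.can_fetch("xmltv-grabber", "http://www.example.com/listings/today.html"):
        print("Scraping permitted; go ahead.")
    else:
        print("The site owner has said no; stop here.")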

It turns out that the Council of Europe convention that this new legislation is to implement provides for exactly these kinds of distinctions, under Article 2 (my emphasis):

Article 2 – Illegal access

Each Party shall adopt such legislative and other measures as may be necessary to establish as criminal offences under its domestic law, when committed intentionally, the access to the whole or any part of a computer system without right. A Party may require that the offence be committed *by infringing security measures*, with the intent of obtaining computer data or other dishonest intent, or in relation to a computer system that is connected to another computer system.

Knowingly circumventing an access control system by, for example, changing the User-agent string might be considered infringing security measures (weak as they are), but an unassuming GET request could, under this definition, never be considered illegal access. I hope that Sweden takes this opportunity to better define what should count as illegal access.

Another aside: since a lot of my current activities, and thus my blog writing, revolve around Swedish law, it’s sometimes difficult to write in English, as there are a lot of precise Swedish legal terms that I’m not comfortable translating. For anyone versed in Swedish law, posts about it in English are probably much harder to read. Furthermore, most of these posts are probably of limited interest to non-Swedes.

Therefore, I’m considering switching the language of this blog to
Swedish. If you don’t understand Swedish, but would like to continue
reading this blog, please say so in the comments. Thank you.

Switched over

After some general fiddling on and off for a few days, I’ve finally moved the blog over from Blosxom to WordPress. As part of the switch, the site is now running on my new and oh-so-cute Mac mini, which, piece by piece, is taking over duties from my trusty old P-166. The Mac mini has enough processing power to run a dynamic PHP/MySQL-driven app like WordPress. Still, things are not quite as zippy as I would have hoped, but they will have to do for now.

This also means that all my trusty readers can comment on my insightfulness once again (unless they get caught in the shiny new spam filters, that is).

The main part of the import was done following Mark Nozell’s instructions; I then semi-manually imported the comments, both those that stemmed from my Blosxom writeback plugin and older ones from back when I was running dasBlog.

For some nice layout, I used the theme ”Journalized Sand” by Mike Little, with some minor tweaks. One particularly nice touch with this theme is that even though it has a left-hand navigation bar, it places the main content first in the HTML file, which means the interesting stuff comes first when browsing with Lynx, on a cell phone, and so on. Good stuff.

I’m using WordPress’s URL-rewriting features to make old permalink URLs still point to the right place; even the oldest BlogX-based permalink URLs should still work, I hope. I’ll be scanning my server logs for 404s. If you’re reading this through an RSS/Atom reader, at least some of it must have worked.

So, this is the fourth blog engine in a year (BlogX -> dasBlog -> Blosxom -> WordPress). I wonder if I’ll be happy with this one? At least it has a healthy user base, which is always good.