The legality of screenscraping

Niklas Lundblad directs me to a couple of interesting proposals for pending laws
regarding computer crime (one of them actually uses the phrase ”crimes
in cyberspace”, which is very 1995-ish retro). Both Ds 2005:5 and Ds 2005:6 are
intended to be the first steps in implementing recent European legislation
(particularly the ”Convention on Cybercrime, ETS No. 185”) in Swedish law.

Unfortunately, both documents are very thin when it comes to
defining what should be regarded as ”illegal access to information
systems”. Swedish law has not been well defined in this area
before, and I was hoping that maybe the lawmakers would take the
opportunity to clarify this.

The issue I’m mostly concerned with is screen scraping. I like screen scraping. I think screen scraping is
cool. I wrote my first screen scraping program the same week I got my
first job, almost ten years ago (it was a simple script to
automatically download the latest Dilbert cartoon and email it as an
attachment to myself).

Now, it’s easy to see why a content provider would be opposed to
screen scraping. When I got my Dilbert strip in my inbox, there was no
advertising attached to it, so I was depriving United Media of ad
revenue.

Years later, I was involved in the XMLTV project while
playing around with a homebrew HTPC. For Swedish TV listings, there
was a simple program that would fetch TV listings from dagenstv.com and
re-format them into the XMLTV format. One day, their service started
to serve up a very hostilely worded text file when I ran this
program. Basically, dagenstv.com had changed their web server
configuration so that requests with a certain User-agent (that of our
screen scraper program) would get a stern warning that we were doing
illegal things and that our IP address had been logged, or something
like that.
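
To make the mechanism concrete, here is a minimal sketch of this kind of User-agent discrimination, written as a small Python web server. It is not dagenstv.com's actual configuration, which I never saw; the marker string `example-tv-scraper`, the warning text, the 403 status code and all function names are assumptions made up for illustration.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical values: neither the scraper's real User-agent string nor the
# site's real warning text is known, so these are placeholders.
BLOCKED_AGENT_MARKER = "example-tv-scraper"
WARNING_TEXT = "Automated access is not permitted. Your IP address has been logged.\n"
LISTINGS_HTML = "<html><body>TV listings would go here</body></html>\n"


def choose_response(user_agent):
    """Return (status, content_type, body) based on the User-agent header."""
    if BLOCKED_AGENT_MARKER in (user_agent or "").lower():
        # The status code is an assumption; the point is only that the
        # scraper gets a warning instead of the listings.
        return 403, "text/plain; charset=utf-8", WARNING_TEXT
    return 200, "text/html; charset=utf-8", LISTINGS_HTML


class ListingsHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        status, content_type, body = choose_response(self.headers.get("User-Agent"))
        payload = body.encode("utf-8")
        self.send_response(status)
        self.send_header("Content-Type", content_type)
        self.send_header("Content-Length", str(len(payload)))
        self.end_headers()
        self.wfile.write(payload)


def serve(port=8000):
    """Run the sketch on localhost; the port number is an arbitrary choice."""
    HTTPServer(("127.0.0.1", port), ListingsHandler).serve_forever()
```

Calling `serve()` starts the server locally. A request whose User-agent contains the marker gets the warning text, while any other client (including one that sends no User-agent at all) gets the listings page, which shows how thin this kind of check is.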

Now, were we doing something illegal? Keep in mind that each user
would run this program on his or her own computer; we never
redistributed the content. Whether or not we could have done that
legally in Sweden is another question, one that maybe could be
answered by pondering URL 49 § (section 49 of the Swedish Copyright Act)
and related materials. It’s an interesting question in its own
right, just not the subject of today’s blog post.

Was the mere act of accessing the site with a different tool than
the site owner intended, thus gaining access to digital data in a
non-approved way, illegal? For me as a programmer, this feels like an
absurd question. I’m only sending humble GET requests. If the site
owner doesn’t want me to have the
information, then don’t send it! But with my law-student glasses on,
this could be considered computer infringement, as per the wording
in BrB 4:9 c: ”Den som
[…] olovligen bereder sig tillgång till upptagning för automatisk
databehandling […] döms för dataintrång till böter eller fängelse i
högst två år.” (A rough translation would be ”Someone who gets
hold of a recording for automatic data processing without permission
is to be sentenced for computer infringement to a fine or to prison for no
more than two years”.)

Dagenstv.com could be said to have given users with normal web
browsers implicit permission to access the data, but probably not to
us with our screen scraper. If we had asked the site owners, they
would very likely have said ”no”, and therefore, they could well argue
that we were getting hold of a ”recording” without permission.

(As an aside: the use of ”recording” (”upptagning” in Swedish) in
the quoted law text is interesting in its own right. The
legislation was originally written with telephone wiretapping, opening
of letters, and similar things in mind, then ”adapted” (used in the
loosest of senses) into the digital age.)

I would prefer that questions like these were solved by technical,
not legal, means. Dagenstv.com used one such means (the User-agent
discrimination) to block our screen scraper. We could have changed our
program to masquerade as a normal Internet Explorer browser, but that
would only escalate into a pointless arms race. Someone wrote a
different script that fetched the data from another site instead, and
that was the end of it. Furthermore, if we had bypassed dagenstv.com’s
User-agent check, we would have essentially said ”Even though we’ve
been told in no uncertain terms that what we’re doing is not permitted
by the site owners, we’re choosing to ignore that and circumvent the
access control”. If we had done that, dagenstv.com would really be
right in saying we were getting hold of data without permission.

But there’s a lot to be said for screen scraping. lagen.nu could not exist without screen
scraping. A lot of really cool web services over the years have been
made possible by screen scraping. It enabled loose coupling
years before anyone had talked about web services. It’s
the basis for a lot of interesting research and data mining. And
sometimes it just enables plain cool stuff.

Furthermore, it would be wrong to assume that all content providers
are opposed to screen scraping. For example, what is the one thing
that distinguishes forward-thinking web companies? They provide APIs
to their services (Amazon, LiveJournal, Yahoo, Google, Flickr), enabling
anyone to build cool applications on top of their data, just like we
wanted to build a cool HTPC application using data from
dagenstv.com. By providing APIs, smart web sites remove the need for
actual screen scraping (which, in all fairness, is a messy and seldom
very interesting technological challenge, and furthermore only a means
to an end), but enable and encourage the same kinds of applications.
These APIs (and the XML-RPC/SOAP-based underpinnings) did not emerge
from a vacuum. People have been screen-scraping Amazon.com for their
own little needs since it was launched. Smart service providers
realise that it’s better to work with all this creativity than against
it.

If web site providers choose to do what dagenstv.com did, then
fine. They’ve stated their intent, it’s their service, their rules,
they’re entitled to take their ball and go home. But before a site
owner puts such a block in place (which could also be done through a
robots.txt file), screen scraping should in no way be considered
unlawful computer infringement.
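
For the robots.txt route, a scraper that wants to respect the site owner's stated intent could check the file before fetching anything. The following is a minimal sketch using Python's standard `urllib.robotparser`. The robots.txt content, the `example-tv-scraper` agent name, the `tv.example` host and the `may_fetch` helper are all hypothetical, made up for illustration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: it shuts out one named scraper completely and
# keeps every other client out of /private/ only.
EXAMPLE_ROBOTS_TXT = """\
User-agent: example-tv-scraper
Disallow: /

User-agent: *
Disallow: /private/
"""


def may_fetch(robots_txt, user_agent, url):
    """Return True if the given robots.txt text lets user_agent fetch url."""
    parser = RobotFileParser()
    # parse() takes the file's lines; nothing is downloaded here.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    url = "http://tv.example/listings"
    print(may_fetch(EXAMPLE_ROBOTS_TXT, "example-tv-scraper/0.1", url))
    print(may_fetch(EXAMPLE_ROBOTS_TXT, "Mozilla/5.0", url))
```

In a real scraper the text would come from the site's own `/robots.txt`, and the scraper would need to send an honest User-agent string for the check to mean anything.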

It turns out that the European convention that this new legislation
is to implement provides for
these kinds of distinctions, under article 2 (my emphasis):

Article 2 – Illegal access

Each Party shall adopt such legislative and other measures as may be
necessary to establish as criminal offences under its domestic law,
when committed intentionally, the access to the whole or any part of a
computer system without right. A Party may require that the offence be
committed by *infringing security measures*, with the intent of
obtaining computer data or other dishonest intent, or in relation to a
computer system that is connected to another computer system.

Knowingly circumventing an access control system by, for example,
changing the User-agent string, might be considered infringing
security measures (weak as they are), but an unassuming GET request
could, with this definition, never be considered illegal access. I
hope that Sweden takes this opportunity to better define what should
be considered illegal access.

Another aside: Since a lot of my current activities, and thus my blog
writing, revolve around Swedish law, it’s sometimes difficult to write
in English, as there are a lot of precise Swedish legal terms that I’m
not comfortable translating. For anyone versed in Swedish law, posts
about it in English are probably way harder to read. Furthermore, most
of these posts are probably of limited interest to non-Swedes.

Therefore, I’m considering switching the language of this blog to
Swedish. If you don’t understand Swedish, but would like to continue
reading this blog, please say so in the comments. Thank you.

5 responses to “The legality of screenscraping”

  1. Hey there,

    I have run into a similar problem to the one you described and wanted to express my appreciation for your clear thoughts on the matter. There is a website in the US called Craigslist (cragslist.com). Some guy created http://www.housingmaps.com by essentially scraping data from Craigslist and using the Google Maps API to provide a visual tool for the Craigslist real estate listings.

    I had thought of the same thing a while back, but using MLS data (MLS = multilisting, a regional consolidated data source for real estate professionals). I made such an application, and it went so well I thought I would contact the people I was getting the MLS data from and offer them free advertising in exchange for an actual data feed, which they provide commercially. They were not very nice to me, and sort of freaked out. I have been trying to find another way to get this data, another *legal* way, but it doesn’t seem likely. They have incredibly detailed rules regarding the use of their data, even for their paying customers.

    Anyway, that was probably not very interesting to read, but I just wanted to thank you for the article. It gave some perspective to things I had been considering.

    Morgan

  2. I appreciated your comments here in English. I do a lot of statistical analysis of publicly available data, and find that some US state web sites encourage this by including downloadable .csv or database files. Others create complex forms for accessing the data, seemingly to discourage automatic access. I am learning screen-scraping in Perl, but am concerned about where to draw the line in my use. Clearly, if there is a ”terms and conditions” statement that speaks to the use of an automatic tool, then it would be illegal (unethical?), and I think your working definition of not circumventing (albeit weak) security measures is a good one.

  3. I like your blog but I don’t understand Swedish 🙁 You say it’s probably not that interesting to non-Swedes, but I don’t agree.

    Sweden is in the global spotlight for copyright and digital rights laws. Sweden is doing a lot of great things to protect consumers and is also not making vague blanket laws that are clearly written by bureaucrats who don’t understand the technology. It is really useful for people from other countries to see and understand what’s going on in Sweden and use it as an example to try to follow.
