The legality of screenscraping

Niklas Lundblad directs me to a couple of interesting propositions about pending laws regarding computer crime (one of them actually uses the phrase “crimes in cyberspace”, which is very 1995-ish retro). Both Ds 2005:5 and Ds 2005:6 are intended to be the first steps in implementing recent European legislation (particularly the Council of Europe “Convention on Cybercrime”, ETS no. 185) in Swedish law.

Unfortunately, both documents are very thin when it comes to defining what should be regarded as “illegal access to information systems”. Swedish law has not been well defined in this area before, and I was hoping that maybe the lawmakers would take the opportunity to clarify this.

The issue I’m mostly concerned with is screen scraping. I like screen scraping. I think screen scraping is cool. I wrote my first screen scraping program the same week I got my first job, almost ten years ago (it was a simple script to automatically download the latest Dilbert cartoon and email it as an attachment to myself).

Now, it’s easy to see why a content provider would be opposed to screen scraping. When I got my Dilbert strip in my inbox, there was no advertising attached to it, so I was depriving United Media of ad revenue.

Years later, I was involved in the XMLTV project while playing around with a homebrew HTPC. For Swedish TV listings, there was a simple program that would fetch listings from dagenstv.com and reformat them into the XMLTV format. One day, the service started serving up a very hostilely worded text file when I ran this program. Basically, dagenstv.com had changed their web server configuration so that requests with a certain User-agent (that of our screen scraper program) would get a stern warning that we were doing illegal things and that our IP address had been logged, or something like that.
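For illustration, a User-agent check of the kind described above could look roughly like this. It is only a sketch in Python under my own assumptions: the scraper name `example-tv-grabber`, the function `respond`, the 403 status and the wording of the warning are all made up, and none of it is dagenstv.com's actual configuration.

```python
# Hypothetical sketch of User-agent based blocking; every name here is invented.
BLOCKED_AGENT_SUBSTRINGS = ("example-tv-grabber",)  # assumed scraper name

WARNING_TEXT = (
    "Automated access is not permitted. Your IP address has been logged.\n"
)


def respond(headers, remote_addr):
    """Return (status, content_type, body) for a request to the listings page."""
    user_agent = headers.get("User-Agent", "").lower()
    if any(s in user_agent for s in BLOCKED_AGENT_SUBSTRINGS):
        # A real server would write remote_addr to its log at this point.
        return 403, "text/plain", WARNING_TEXT
    # Everyone else gets the normal page (placeholder body).
    return 200, "text/html", "<html>...TV listings...</html>"
```

The point of the sketch is how little the server knows: it only sees the string the client chooses to send, which is why this kind of check is a statement of intent more than a real barrier.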

Now, were we doing something illegal? Keep in mind that each user would run this program on his or her own computer; we never redistributed the content. Whether or not we could have done that legally in Sweden is another question, one that maybe could be answered by pondering 49 § of URL (upphovsrättslagen, the Swedish Copyright Act) and related materials. It’s an interesting question in its own right, just not the subject of today’s blog post.

Was the mere act of accessing the site with a different tool than the site owner intended, thus gaining access to digital data in a non-approved way, illegal? For me as a programmer, this feels like an absurd question. I’m only sending humble GET requests; if the site owner doesn’t want me to have the information, then don’t send it! But with my law-student glasses on, this could be considered computer infringement, as per the wording in BrB 4:9 c: “Den som […] olovligen bereder sig tillgång till upptagning för automatisk databehandling […] döms för dataintrång till böter eller fängelse i högst två år.” (A rough translation would be “Someone who, without permission, gets access to a recording for automatic data processing is to be sentenced for computer infringement to a fine or imprisonment for at most two years”.)

Dagenstv.com could be said to have given users with normal web browsers implicit permission to access the data, but probably not to us with our screen scraper. If we had asked the site owners, they would very likely have said “no”, and therefore they could well argue that we were getting hold of a “recording” without permission.

(As an aside: the use of “recording” (“upptagning” in Swedish) in the quoted law text is interesting in its own right. The legislation was originally written with telephone wiretapping, opening of letters, and similar things in mind, then “adapted” (used in the loosest of senses) to the digital age.)

I would prefer that questions like these were solved by technical, not legal, means. Dagenstv.com used one such means (the User-agent discrimination) to block our screen scraper. We could have changed our program to masquerade as a normal Internet Explorer browser, but that would only have escalated into a pointless arms race. Instead, someone wrote a different script that fetched the data from another site, and that was the end of it. Furthermore, if we had bypassed dagenstv.com’s User-agent check, we would essentially have said “Even though we’ve been told in no uncertain terms that what we’re doing is not permitted by the site owners, we’re choosing to ignore that and circumvent the access control”. Had we done that, dagenstv.com would really have been right in saying we were getting hold of data without permission.

But there’s a lot to be said for screen scraping. lagen.nu could not exist without screen scraping. A lot of really cool web services over the years have been made possible by screen scraping. It has enabled loose coupling years before anyone had talked about web services. It’s the basis for a lot of interesting research and data mining. And sometimes it just enables plain cool stuff.

Furthermore, it would be wrong to assume that all content providers are opposed to screen scraping. For example, what is the one thing that distinguishes forward-thinking web companies? They provide APIs to their services (Amazon, LiveJournal, Yahoo, Google, Flickr), enabling anyone to build cool applications on top of their data, just like we wanted to build a cool HTPC application using data from dagenstv.com. By providing APIs, smart web sites remove the need for actual screen scraping (which, in all fairness, is a messy and seldom very interesting technological challenge, and furthermore only a means to an end), but enable and encourage the same kinds of applications. These APIs (and the XML-RPC/SOAP-based underpinnings) did not emerge from a vacuum. People have been screen-scraping Amazon.com for their own little needs since it was launched. Smart service providers realise that it’s better to work with all this creativity than against it.

If web site providers choose to do what dagenstv.com did, then fine. They’ve stated their intent, it’s their service, their rules, they’re entitled to take their ball and go home. But until a site owner puts such a block in place (which could also be done through a robots.txt file), screen scraping should in no way be considered unlawful computer infringement.
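A well-behaved scraper can check for such a stated intent before fetching anything. The following is a minimal sketch using Python's standard `urllib.robotparser`; the robots.txt content, the scraper name `example-tv-grabber` and the helper `may_fetch` are assumptions made up for the example, not anything a real site publishes.

```python
from urllib.robotparser import RobotFileParser

# Assumed robots.txt content; a real scraper would download it from the site.
ROBOTS_TXT = """\
User-agent: example-tv-grabber
Disallow: /

User-agent: *
Disallow:
"""


def may_fetch(user_agent, url, robots_txt=ROBOTS_TXT):
    """Return True if robots_txt allows user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


print(may_fetch("example-tv-grabber/0.1", "http://example.com/listings"))
print(may_fetch("SomeBrowser/1.0", "http://example.com/listings"))
```

With the rules assumed here, the named scraper is refused while other clients are allowed, which is the same "their service, their rules" signal as the User-agent block, only stated up front.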

It turns out that the convention that this new legislation is to implement provides for these kinds of distinctions, under article 2 (my emphasis):

Article 2 - Illegal access

Each Party shall adopt such legislative and other measures as may be necessary to establish as criminal offences under its domestic law, when committed intentionally, the access to the whole or any part of a computer system without right. A Party may require that the offence be committed by infringing security measures, with the intent of obtaining computer data or other dishonest intent, or in relation to a computer system that is connected to another computer system.

Knowingly circumventing an access control system by, for example, changing the User-agent string might be considered infringing security measures (weak as they are), but an unassuming GET request could, under this definition, never be considered illegal access. I hope that Sweden takes this opportunity to better define what should be considered illegal access.
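To make "an unassuming GET request" concrete, here is a small sketch of what I mean, using Python's standard `urllib.request`. The User-agent string, the URLs and the helper `build_request` are hypothetical; the only point is that the request openly says what it is instead of pretending to be a browser.

```python
import urllib.request

# Hypothetical identifier; a real tool would use its own name and contact URL.
HONEST_USER_AGENT = "example-tv-grabber/0.1 (+http://example.com/about)"


def build_request(url):
    """Build a plain GET request that openly identifies the scraper."""
    return urllib.request.Request(url, headers={"User-Agent": HONEST_USER_AGENT})


request = build_request("http://example.com/listings")
print(request.get_method(), request.get_header("User-agent"))
```

Nothing is sent until the request is passed to `urllib.request.urlopen`. Swapping the header for a browser's string would be a one-line change, which is exactly why the distinction has to rest on intent and not on technical difficulty.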

Another aside: Since a lot of my current activities, and thus my blog writing, revolve around Swedish law, it’s sometimes difficult to write in English, as there are a lot of precise Swedish legal terms that I’m not comfortable translating. For anyone versed in Swedish law, posts about it in English are probably way harder to read. Furthermore, most of these posts are probably of limited interest to non-Swedes.

Therefore, I’m considering switching the language of this blog to Swedish. If you don’t understand Swedish, but would like to continue reading this blog, please say so in the comments. Thank you.


4 Responses to “The legality of screenscraping”

  1. Morgan Whitney Says:

    Hey there,

    I have run into a similar problem to the one you described and wanted to express my appreciation for your clear thoughts on the matter. There is a website in the US called Craigslist (cragslist.com). Some guy created http://www.housingmaps.com by essentially scraping data from Craigslist and using the Google Maps API to provide a visual tool for the Craigslist real estate listings.

    I had thought of the same thing a while back, but using MLS data (MLS = multilisting, a regional consolidated data source for real estate professionals). I made such an application, and it went so well I thought I would contact the people I was getting the MLS data from and offer them free advertising in exchange for an actual data feed, which they provide commercially. They were not very nice to me, and sort of freaked out. I have been trying to find another way to get this data, another *legal* way, but it doesn’t seem likely. They have incredibly detailed rules regarding the use of their data, even for their paying customers.

    Anyway, that was probably not very interesting to read, but I just wanted to thank you for the article. It gave some perspective to things I had been considering.

    Morgan

  2. Jeremy Stanley Says:

    I appreciated your comments here in English. I do a lot of statistical analysis of publicly available data, and find that some US state web sites encourage this by including downloadable .csv or database files. Others create complex forms for accessing the data, seemingly to discourage automatic access. I am learning screen-scraping in Perl, but am concerned about where to draw the line in my use. Clearly, if there is a “terms and conditions” statement that speaks to the use of an automatic tool, then it would be illegal (unethical?), and I think your working definition of not circumventing (albeit weak) security measures is a good one.

  3. Ny tv-widget - Page 8 - 99mac Says:

    […] Originally written by gusax840: Yes, you surely have to subscribe to Showview to get the codes officially. What I meant was that the codes themselves must come from some mathematical formula. http://en.wikipedia.org/wiki/Showview#Algorithms I see that http://dagenstv.com/ shows something that I assume are ShowView numbers. It is technically possible to extract such information from there, but perhaps illegal, or at least unwanted by the people behind the site. […]

  4. SpaceMonkey Says:

    I like your blog but I don’t understand Swedish :-( You say it’s probably not that interesting to non-Swedes, but I don’t agree.

    Sweden is in the global spotlight for copyright and digital rights laws. Sweden is doing a lot of great things to protect consumers, and is also not making vague blanket laws that are clearly written by bureaucrats who don’t understand the technology. It is really useful for people from other countries to see and understand what’s going on in Sweden and use it as an example to try to achieve.
