Is the 27 club a myth? Maybe not.

A recent study published in the British Medical Journal, and widely linked (NYT, Jezebel, Andrew Sullivan), claims that there is no statistical support for the assumption that rock stars die more frequently at the age of 27 (aka the 27 club meme).

However, the study used a sampling scheme that excluded four of the most well-known members of the 27 club (Robert Johnson, Jimi Hendrix, Janis Joplin and Jim Morrison). Furthermore, it only included 71 musicians who had actually died. This seems to me to be a very small sample. Can we procure a better sample?

Using DBpedia (the structured database containing facts gathered from Wikipedia), I selected a list of dead musicians born in 1900 or later using a fairly simple SPARQL query:

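PREFIX dbo: <http://dbpedia.org/ontology/>
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>
SELECT DISTINCT ?person ?birth ?death WHERE {
     ?person dbo:birthDate ?birth .
     ?person dbo:deathDate ?death .
     ?person rdf:type ?musician .
     # Keep only persons whose type IRI mentions "Musicians"
     FILTER (regex(str(?musician), "Musicians"))
     FILTER (?birth >= "1900-01-01"^^xsd:date)
}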

This yields a list of almost 2500 persons. Plotting a histogram of the age at death of these musicians gives the following view:


It seems to me that there is a small but significant spike at 27 (a 154% increased chance of death compared to the year before), with a secondary spike at 32 (a 78% increased chance). This shows the value of selecting a good sample when doing statistical analysis.
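For anyone who wants to redo the arithmetic without Excel, here is a minimal sketch of one plausible reading of those percentages: count deaths per age and compare each age with the one before. The record shape and all function names are my own assumptions for illustration, not what was used to produce the chart above.

```typescript
// Hypothetical record shape: ISO dates as found in the ?birth and ?death columns.
interface Lifespan {
  birth: string; // "YYYY-MM-DD"
  death: string; // "YYYY-MM-DD"
}

// Age in whole years at death: subtract the years, then take one off
// if the person died before reaching their birthday that year.
function ageAtDeath({ birth, death }: Lifespan): number {
  const [by, bm, bd] = birth.split("-").map(Number);
  const [dy, dm, dd] = death.split("-").map(Number);
  const hadBirthday = dm > bm || (dm === bm && dd >= bd);
  return dy - by - (hadBirthday ? 0 : 1);
}

// Number of deaths per age.
function histogram(people: Lifespan[]): Map<number, number> {
  const counts = new Map<number, number>();
  for (const p of people) {
    const age = ageAtDeath(p);
    counts.set(age, (counts.get(age) ?? 0) + 1);
  }
  return counts;
}

// Percentage increase in deaths at `age` relative to `age - 1`.
// Returns undefined when nobody in the sample died at `age - 1`.
function increaseOverPreviousYear(
  counts: Map<number, number>,
  age: number
): number | undefined {
  const prev = counts.get(age - 1) ?? 0;
  if (prev === 0) return undefined;
  return (((counts.get(age) ?? 0) - prev) / prev) * 100;
}
```

Feeding the query results into histogram() and then calling increaseOverPreviousYear(counts, 27) gives the relative size of the spike at 27 for whatever sample you choose.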

Furthermore, it shows the value of structured data. I put together the above charts by fiddling with SPARQL queries for half an hour, followed by about an hour of fiddling with Excel. I believe my criterion (“Dead musician famous enough to have a Wikipedia article”) is better than the one used in the study (“Artists with a number one UK album”). But if you disagree, you can easily modify the SPARQL query to work with a different study sample. Let me know your results!

The Cookie directive and HTML5

In 2002, the European Community introduced Directive 2002/58/EC, commonly known as the Directive on privacy and electronic communications. Amongst other provisions, it contains subarticle 5(3), which has earned it the nickname “the cookie directive”: the subarticle states that information may be stored in or retrieved from end-user computers only if the user is made aware of this and is given the opportunity to refuse the storage or retrieval. In 2009 the directive was amended (by 2009/136/EC) so that storage or retrieval is only permitted if the user has given his or her consent.

The full text of the amended subarticle is as follows:

3. Member States shall ensure that the storing of information, or the gaining of access to information already stored, in the terminal equipment of a subscriber or user is only allowed on condition that the subscriber or user concerned has given his or her consent, having been provided with clear and comprehensive information, in accordance with Directive 95/46/EC, inter alia, about the purposes of the processing. This shall not prevent any technical storage or access for the sole purpose of carrying out the transmission of a communication over an electronic communications network, or as strictly necessary in order for the provider of an information society service explicitly requested by the subscriber or user to provide the service.

This regulation is generally understood to apply to the HTTP State Management Mechanism (RFC 6265, earlier RFC 2965 and RFC 2109), most commonly known as “cookies”. In fact, preamble 25 of 2002/58/EC and preamble 66 of 2009/136/EC explicitly mention cookies as one example of such mechanisms. National regulations, and guidelines in particular, have focused on this particular mechanism for storing and accessing information on end-user computers over a network.

But the directive text can clearly apply to mechanisms other than HTTP cookies. Mechanisms that permit similar storage and retrieval of information include the Local Shared Object mechanism found in Flash, the userData functionality in Internet Explorer and, more recently, a variety of mechanisms being defined and implemented under the HTML5 umbrella.

Two questions are therefore interesting:

  1. Are there mechanisms in HTML5 that allow user tracking (including by third parties) in a way that is not subject to the consent requirement?
  2. Are there mechanisms in HTML5 that raise no privacy concerns, yet are subject to the consent requirement?

The first question is the most sensitive, and the hardest to answer. But consider a JavaScript file that is served by a third-party ad network and is included by a number of unrelated content sites. Suppose such a script:

  1. Generates a local GUID on the client (i.e. an identifier that the ad network did not choose).
  2. Stores this GUID in local storage.
  3. Sends this GUID (and, presumably, some other information, such as the URL of the page embedding the script) back to the ad network using a background XMLHttpRequest.

(Steps 1-2 are skipped if the GUID is already present in local storage.)
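The three steps can be sketched roughly as follows. This is a minimal illustration under my own assumptions, not code from any real ad network: the storage key, the function names and the injected report callback are all hypothetical. Storage and reporting are passed in as parameters so that the logic can be read (and run) outside a browser.

```typescript
// Minimal subset of the Web Storage API; window.localStorage satisfies it.
interface KeyValueStore {
  getItem(key: string): string | null;
  setItem(key: string, value: string): void;
}

// Hypothetical storage key; any name would do.
const STORAGE_KEY = "visitorId";

// Step 1: build a random, version-4-style GUID on the client.
// The ad network's server plays no part in choosing the value.
function generateGuid(): string {
  return "xxxxxxxx-xxxx-4xxx-yxxx-xxxxxxxxxxxx".replace(/[xy]/g, (c) => {
    const r = Math.floor(Math.random() * 16);
    const v = c === "x" ? r : (r & 0x3) | 0x8;
    return v.toString(16);
  });
}

// Runs steps 1-3. `report` stands in for the background request
// that sends the GUID and the embedding page's URL to the ad network.
function track(
  store: KeyValueStore,
  pageUrl: string,
  report: (guid: string, pageUrl: string) => void
): string {
  let guid = store.getItem(STORAGE_KEY);
  if (guid === null) {
    guid = generateGuid(); // step 1
    store.setItem(STORAGE_KEY, guid); // step 2
  }
  report(guid, pageUrl); // step 3
  return guid;
}
```

In a browser, store would be window.localStorage, pageUrl would be location.href, and report would issue the XMLHttpRequest to the ad network. Note that nothing in track() receives a value from the network: the identifier originates entirely on the client, which is the point the argument below turns on.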

Such a script has the same ability to track a user’s movement across sites, and to assign a user (or rather his/her computer) a permanent identifier. But does it require consent according to article 5(3)? One way to argue that it does not is to take note of preamble 66: “Third parties may wish to store information on the equipment of a user, or gain access to information already stored, for a number of purposes”. It may be argued that in steps 1-2 it is not the third party (the ad network) that stores the information (indeed, the ad network does not know what information is stored). If the third party hasn’t stored the information, then the gaining of information in step 3 might not be subject to the rule either, since the wording seems to require that information gained by a third party must have been previously stored by the same party. (If there is no requirement that the information gained must have been stored by the same party, one must note that every third party whose resources are included by a web page automatically gains access to a lot of information, such as the User-Agent string, and ask whether that gaining of information is subject to the directive as well.)

I will concede that this argument is not strong, as it rests on the assumption that steps 1-2 do not constitute information storage by the third party, even though the third party is responsible for sending the JavaScript code that ultimately results in information being stored. It seems functionally equivalent to traditional HTTP cookie-based storage of information. But the difference is that using this method, the third party does not specify what information should be stored. Could this not be significant?

The second question seems easier to answer. Consider offline web applications. These are web pages that contain a reference to all resources (HTML, JavaScript, CSS) they require in order to work. A browser supporting offline applications will download all these resources so that the application works even if there’s no internet connection. Note that if the browser does not support offline apps, they still work; they just require you to be online. A simple example containing a version of the Halma game is described by Mark Pilgrim.

This mechanism causes information to be stored on the end user’s computer. This storage is not strictly necessary in order to provide the service (remember, the app works without the mechanism if the user is online; offline support is just a nice-to-have). No information is ever accessed by the provider of the game, but access is not a requirement of the directive: storing information is enough. Thus, consent is needed. And yet there are no privacy concerns (no personally identifiable information is ever retrieved).

The aim of article 5(3) was to regulate certain usages of cookies perceived to be illegitimate. But it was written to be technology-neutral, as new techniques similar to HTTP cookies were sure to be created after the directive (the diabolical evercookie uses 12 additional mechanisms, including a brilliantly twisted way of storing information in the user’s browser history of visited URLs). The problem is that such mechanisms are only similar, not identical. This makes writing technology-neutral legislation really difficult.

Thesis: Appendices and backmatter

The rest of the thesis consists of two appendices (the first describing the system prototype in detail, including how to run it yourself, the second describing the “gold standard” tests we’ve evaluated the system against) and the bibliography.

That is all for this time! If you’ve read all chapters so far, I’d really appreciate your comments and suggestions for improvement.

Download appendices and backmatter here.

Thesis chapter 3: Information retrieval

Relevance can be interpreted in many ways, from subjective to objective. Which interpretations are built into traditional information retrieval systems, and what properties do these manifestations of relevance have? The use of IR for legal information has a long history. How does legal information retrieval correspond to the legal method, and can we improve on this correspondence, e.g. by creating a relevance ranking function more in line with what is considered legally relevant?

Download chapter 3 here.