Saturday, November 26, 2011

Churnalism: DEFRA churn - the Guardian is in the lead!

(UPDATE: 27/11/11 - raw churnalism data available HERE - JSON format, zipped, 448kb)

After the churn analysis of the Environment Agency press releases (please read that article for more details and important caveats if you haven't read one of my churn posts before), I followed up with DEFRA. I will be making the raw data publicly available tomorrow for both the Environment Agency and DEFRA churn analyses.

This time I was able to construct the spiders and process the data faster - I also avoided (most of) the unicode problems that plagued me with the EA data, so this analysis can be considered slightly more accurate and slightly less forgiving of the media organisations (though it still strictly follows the rules I set down previously, such as no editing of the press release to remove extraneous information). Some difficult characters inevitably still slipped through and, combined with the lack of editing of the press releases, this means the data will naturally favour the media organisations. As I said in previous posts, when I make the raw data publicly available, churn analysis by other people will very likely improve upon my methods and yield results more detrimental for the media.

In any case - onto the results. The summary is presented below (click for full size image):

"Quality" press churn results

Summary of results:

- A total of 386 press releases were analysed, from 13th May 2010 to 24th November 2011. These generated 1959 detectable cases of churn. Again, there is probably a lot of interesting data within the "detectable" category that deserves analysis at a later date. For now it is discarded.

- Out of those, 173 were classified as "significant" and 18 as "major".

- The Guardian was the leader in both categories by a long way - accounting for 19.65% of significant churn and 27.78% of major churn.

- The BBC followed close behind in terms of significant churn with 16.18%, though for major churn it was beaten into third place by both the Independent and the Daily Mail with a joint 16.67%.

- The Independent came third in the significant churn classification.

- A common factor in the most highly churned articles both in this analysis and the previous two appears to be lack of a named author in most cases (though see one of the exceptions detailed below). This suggests the media organisations are aware that what they are doing is not kosher.

- Continuing a theme from the last two churn analyses, the tabloids consistently embarrass the so-called "quality press". This time I pulled out the statistics for the UK's major tabloids for comparison (click for full size image):

Tabloid press churn results

When I first started these analyses I fully expected to see a much higher showing of churn by the tabloids. It is interesting to see the contrast. Also, across the churn analyses done so far, the Mirror has consistently had the highest percentage of churn among the tabloids.

As usual I select a few of the more egregious cases of churn for your entertainment (and importantly - provide a manual submission to the churnalism database so they can be seen visually):

'Gloucestershire Old Spots pork protected by Europe'
An absolutely cracking BBC 79% cut and paste job on - er - crackling.

'Bonfire of the Quangos'
Remember that list of Quangos that were to go? Completely cut and pasted from a press release. This one is particularly fascinating because in the two worst cases what was cut and pasted was the list provided in the press release. The press release also included several paragraphs laying out a context, and these were not copied across. If the original had consisted of just the list, both would have scored close to 100% pastes.....
The pastes are so large in any case that the churnalism engine falls over when the 'view' button is clicked to see the visualised version. Be warned if you click it, your browser may hang.

'New service for householders to stop unwanted advertising mail'
Absolute carnage on the churning front here with the majority of the main media outlets represented. The Guardian appeared to like this story so much they cut and pasted it twice - and this time each article has a named author. Where the hell was the editor?

Friday, November 25, 2011

It's mob rule at the Guardian....

(This blogpost should perhaps also be titled - 'What I did/didn't/did say at the Guardian today....')

There's nothing quite like rank hypocrisy to boil my piss. However, to ensure it is fully evaporated in anger, combine rank hypocrisy with crass stupidity, naked opportunism, complete resistance to facts or reason and censorship.

For that was the bread and butter of Leo "bless 'im" Hickman's disgraceful piece of yellow bellied journalism at the Guardian today.

Hickman decided it was time to form a posse comitatus to try tracking down the source of the climategate emails, laughably using the README textfile included in the latest tranche of releases as the primary source of evidence.

This was one of those pieces - especially as it was in the comment is free if you agree section - that really reveals the Guardian's true colours. Numerous commentators including me (prior to the first round of censorship - sorry - 'comment adjustment') attempted to point out the Guardian's and Hickman's rank hypocrisy on this issue. The most striking and obvious example was the paper's massive support for Wikileaks; however, there were many other examples, including the anonymous Enron whistleblower, as another commenter pointed out. As was repeated again and again, it appeared that all leakers were equal but some were more equal than others in the Guardian's eyes.

This was of course brushed off by Hickman and his part-time principle party of followers in the comments section.

Next I pointed out (prior to 'comment adjustment') that claiming it was the work of a hacker was still just an assumption. Hickman replied to me directly on that and similarly brushed it off. He claimed it was irrelevant. The poor dear didn't seem to realise that if he assumed it was the work of a hacker and in fact it was a leaker then his "investigation" would lead him down all sorts of blind alleys, not least because the MO and levels of access would be completely different (not to mention the trail of evidence left behind).

There was a plethora of delightfully dense comments in support of Hickman et al, along with some stunning leaps of reasoning. These people were also apparently immune to criticism because they "knew" what they were claiming was true, especially regarding the "hacker" claim. Many made completely ill-informed statements about this, showing that i) they knew nothing about IT security and ii) they couldn't even be bothered to use Google to check details. After all, the difference between an internal security breach and a carefully coordinated external breach is vast. Pointman gave an excellent overview after the first climategate - here. Moreover they absolutely did not care about their ignorance. What a familiar pattern, eh? No wonder they were immediately supportive of the "scientists" at the heart of the climategate storm - they're just like them!

There were some absolute crackers amongst the received wisdom of this bunch of easily led zealots and I highly recommend you read through the comments - well those that are left - as it is a laugh a minute.

Komment Macht Frei

Speaking of the comments - when the piece first appeared this morning, it was absolute devastation from the moderator. ALL of my comments bar the first one were censored, as were numerous other comments by others. I had no clue why they'd been removed beyond the fact that we all seemed to disagree intensely with Hickman.

Now I should point out something important here for Guardian watchers - they have two types of post moderation. There is the one we're all familiar with - where the boilerplate 'this comment was moderated because it breached our community (puke) standards' appears - but there's also a much more insidious type, and I only noticed it because I've been paying a lot of attention to their censorship pattern over the last couple of years - it's what I call "nuking". In this case they remove all evidence that the comment was ever there. It's particularly chilling for freedom of speech because, aside from the fact that one can't assess the general level of censorship by looking at the comments, if it's *your* comment that disappears in this way it's only your word that it was ever there in the first place....

Now bizarrely, after the comments spilled over onto two pages I happened to click back to the first page to see what else had been censored and was surprised to see that most of my previously "moderated" comments had reappeared (except for the "nuked" ones). I don't know if this is a bug in their software or a disagreement between moderators but it adds even more to the general sense of confusion and latent fear of arbitrary censorship that completely fucks any meaningful contribution over there.

Another important point to be aware of is this: one way to guarantee being censored on the Guardian is to make a reference to your, or someone else's, having been censored. You will immediately be censored, and they often use the "nuke" option too.

The Guardian is - as a media institution - utterly reprehensible. Most other media outlets are of course too, across the political spectrum. But none outside the BBC attempt to present themselves so often as the default "good guys", nor do their followers similarly regard it as received wisdom...

The climategate 'gait' or the 'out of context paradox'

There's a regular pattern that occurs in any discussion of climategate (1 or 2). It is inconsistent but also entirely consistent with the unthinking nature of many of those who promulgate it:
i) They assert that the emails were "taken out of context"
ii) Responder says that they are not.
iii) A request is then made for evidence.
iv) Responder invites them to read the emails - there are numerous complete email chains, supporting claims against the "scientists" that ONLY MAKE SENSE IN CONTEXT. But the trick is you have to actually read the emails....

A modern day climate "scientist"
Now given how unambiguous some of the exchanges are (in particular those that involve purposefully frustrating FOI inquiries and deleting emails....) one is then prompted to ask exactly what standard of evidence is required. For the evidence before us, if for example we stick with complete email chains rather than individual comments, is a magnitude higher than the typical standard accepted in the vast majority of journalism that we ever read or see. It means that - to be consistent - if one were to completely reject these email chains as sufficient evidence, one would have to throw out almost every received opinion on any quoted person in the press one has ever encountered. Will the zealots do that? Of course they won't. But of course consistency is in the same disused box in their basement as a regard for truth....

One final delicious irony of this of course is that 'The Team' will surely be scratching their heads now, trying to remember what on earth was said to whom. But because they very likely deleted these emails after they had been copied from the mailserver, they have only one place to go to check.....

Thursday, November 24, 2011

Churnalism: Churning 'Frack Off'

In several comments on my previous piece of work on churnalism I read at Biased BBC, the activist group 'Frack Off' were mentioned and questions were asked whether they had any detectable churn in the media as online links were often found to them at the BBC and the Guardian sites by Biased BBC readers.

I decided to have a look.

The group is very new and they have only released four official press releases. This meant I could work fast with this data as I wouldn't need to write specialised spiders for gathering, analysing or submitting the data as it could be done manually with such a small data set.

Again I find myself being surprised by the results:

A reminder on the scoring criteria:

>=100 is classified as "detectable" churn. I usually discard these results and they will always be discarded for comparing one set of data to another (e.g. this analysis to the previous environment agency analysis), however it still yields a rich seam of data and as this data set is so small compared to the previous one I decided to take some time to look into some of these.
>=500 is classified as "significant" churn.
>=1000 is classified as "major" churn - in these cases the articles simply could not have been written without cutting and pasting the bulk of its material from the press release.

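As a minimal sketch, the three scoring bands above can be written as a small function. The thresholds (100, 500, 1000) are the ones stated in this post; the function name `classify_churn` and the idea of returning `None` for scores under 100 are assumptions made purely for illustration, not part of the Churnalism API.

```python
from typing import Optional

# Thresholds as stated in the post. The names are hypothetical.
DETECTABLE = 100
SIGNIFICANT = 500
MAJOR = 1000


def classify_churn(score: int) -> Optional[str]:
    """Map a churn score to the category used in these analyses.

    Returns None when the score falls below the "detectable" threshold.
    """
    if score >= MAJOR:
        return "major"
    if score >= SIGNIFICANT:
        return "significant"
    if score >= DETECTABLE:
        return "detectable"
    return None


if __name__ == "__main__":
    # Scores quoted elsewhere in this post.
    for score in (96, 436, 448, 521, 870):
        print(score, classify_churn(score))
```

Checking from the highest band downwards means each score lands in exactly one category, so a score of 870 counts as "significant" however close it comes to "major".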

- Out of four press releases, three have generated a total of 13 articles with detectable churn according to my criteria (score of >= 100 from the Churnalism engine). 14 were originally found but I removed one Guardian article as it was detected twice (probably because of the similar screeds issued by 'Frack Off' in their press releases).

- Out of those, 5 were significant churn (score of >= 500) - and as they were only a handful I have entered them manually into the Churnalism database so you can see the side by side comparisons yourself. 2 were from the Guardian, 1 from the Mirror, 1 from the Times and 1 from the Daily Mail.

- Several of the remaining 8 articles with detectable churn come very close to the >= 500 criterion (details below) and indeed in a couple of cases the Churnalism engine considers them to be significant enough to display when manually input (an API input by contrast gives an exact score; the churnalism site seems to have a less forgiving standard than myself and will display many articles with a score between 400-500 as significant).

Further comments:

Now, even being generous to the media organisations and 'Frack Off', this means 50% of their press releases are being significantly churned - and primarily by the Guardian and the Mirror (one of the Mirror's contributions came close to a "major" ( >= 1000) piece of churn with a score of 870). I suspect that were I less forgiving with my methodology, removing extraneous elements in the initial press release (links, contact information etc) and not having to remove some characters to ease processing (e.g. single quotes), this may well have scored as 'major churn'. You can eyeball it yourself in any case here.

I say that 50% assessment is 'being generous' to 'Frack Off' also because it includes a press release they issued yesterday so it will not have had time to percolate through the media yet. If the press release yesterday results in any significant churn (which I will check again in a week or two), that percentage will climb to 75%.

Further details on the data for each press release:

26th October 2011 press release: - submitting it to the API yielded 3 detectable cases of churn and one borderline (score 96), so I looked at them in more detail. One of these articles is from the Guardian and already highlighted by the Churnalism engine as containing significant churn from one of the other FO press releases. Two are Telegraph articles, one by Louise Gray - both name check the activist group. The other is a Times article and unfortunately I can't verify anything beyond the paywall.

2nd November 2011 press release - 7 detectable cases of churn - 4 were "significant" churn and a further 2 came very close to being considered "significant" by my scoring criteria - one from the Independent (score: 436) and one from the Guardian (score: 448). Notice how the Independent article also cites: Chris Huhne, the WWF, Friends of the Earth and Greenpeace. Cuadrilla Resources - the company responsible for the Fracking discussed at the heart of the article - gets one response of similar length to the others, along with a neutral response from the shadow energy secretary, Tom Greatrex. Cuadrilla don't get a single response in the Guardian article and, in mentioning the independent report commissioned by Cuadrilla, the article fails to mention that the report concluded that another earthquake incident was unlikely.

The Churnalism engine considers significant one of the Guardian articles that doesn't hit my threshold (500): it scores 448, yet the manual search still lists it as an entry the engine considers significant. At the same time, the Daily Mail entry with a score of 521 isn't listed on the manual search, so this balances out.

See visual breakdowns of significant / detectable churn for this press release here.

3rd November 2011 press release - four detectable cases of churn, one from the Guardian classifies as "significant".

See visual breakdowns of significant / detectable churn for this press release here.

23rd November 2011 press release - no detectable churn via either manual entry or the more sensitive API system. This could well change however as it was only yesterday this PR was issued.

Final comments:

I find these results very concerning. A small single-issue activist group that has only existed for a matter of months should not be generating such a significant amount of churn with just four press releases. The Guardian and the Mirror in particular appear to be giving the group a free ride (though also see the commentary above on one of the higher scoring Independent articles). The Environment Agency at least has some kind of mandate and does carry out a wide variety of tasks - to that extent it's no surprise some of its press releases are churned (though this is no excuse for the journalists concerned, or indeed for any EA employees who are aware of and have no issue with any kind of symbiotic relationship here).

This is extremely dangerous, especially for such an important issue, and shows how groups like 'Frack Off' can be so polarising. On a personal level I do believe there are legitimate environmental concerns regarding Shale Gas. Unlike 'Frack Off' however I consider these issues surmountable. They make clear from their website that they want this to be a polarising issue regardless of the facts. They primarily cite Gasland - a "documentary" which is itself making rational discussion of the issue of Fracking all but impossible. It is primarily a series of anecdotes that are themselves seriously problematic as evidence. 'Frack Off' et al should be citing careful and replicable research such as this for discussion. Why don't they? And moreover, why doesn't the media, who are apparently falling over themselves to repeat 'Frack Off' claims without doing the very research most of us expect of them? My guess is that the research isn't nearly alarmist enough for them and has plenty of caveats that prevent them from easily reporting it as such.

And whilst I am sure many 'Frack Off' adherents are, as we speak, screaming that the unfolding "Climategate 2" saga is all about emails "being taken out of context", they themselves have a website about Fracking that not only cites factually incorrect and misleading material but - crucially - misses the all important context that Shale Gas could be an enormous energy (not to mention political) game changer. Meanwhile all of the companies involved in prospecting and mining are listed as "bad guys". It's utterly juvenile.

But what is truth to people like this? Or indeed to the mass media, in whom we trust that its flame is continually nurtured, not murdered as it seems to be?

Tuesday, November 22, 2011

Ammo: Churnalism - Churning the Environment Agency

(UPDATE 27/11/11 : raw churnalism data now available HERE (JSON format, zipped,  2.1MB))

My work on bot-writing and churnalism has finally started to bear fruit. And it's not often I write a blog here that is an "exclusive", however this is one of the few!

A large part of my research over the last year has focused on the nature of digital and virtual technologies and in particular on the nature of censorship and propaganda online. A perennial concern, as outlined in my Privacy 2.0 post is the arrival of 'mass dataveillance', where the monitoring of an individual is far less important - in both import and consequences - than the gathering of masses of data about lots of individuals. This is because those masses of data are able to reveal patterns and links that the individual members of the data set are very likely unaware of.

However, this is something that works for us as much as against us if we're willing to use the freely available data in the public domain. It's not just large corporations of whom one is suspicious who can gather, analyse and deploy the masses of data.

Nick Davies, in his excellent book, 'Flat Earth News', popularised the term "Churnalism". His book, and the research it is based on was an absolutely damning indictment of the UK media, with similar implications for the entirety of Western media itself. Davies said journalists: “....are no longer gathering news but are reduced instead to passive processors of whatever material comes their way, churning out stories, whether real event or PR artifice, important or trivial, true or false”.

His book was substantially based on research he commissioned from the University of Cardiff. In the course of that research, amongst other things, what they found "suggests that 60% of press articles and 34% of broadcast stories come wholly or mainly from one of these ‘pre-packaged’ sources." - a phenomenon that many of my regular readers will no doubt be familiar with and one that has now sadly started becoming commonplace in other areas too - we've been moving ineluctably from just 'journalism by press release' to also 'science by press release'.

Upon reading this research over a year ago, one of my first thoughts having had years of programming experience was, 'that could be automated!'. And not only could it be automated, but with enough data, direct patterns of bias and influence could also be detected.  Lo and behold, around that time, the wonderful churnalism engine appeared.

This site plugs into the Journalisted database, where every single press item is archived online. The churnalism engine uses an algorithm that enables anyone to manually copy and paste text from any source (though most usually a press release) and compare it quickly to the entire Journalisted database. It is then able to report back any cases that are likely to have been cut and pasted, along with an estimate of the proportions copied. It means one can effectively trace the provenance of many news stories back to press releases and also assess how much has been copied directly into the article - articles that we are so often led to believe are supposed to uphold high journalistic standards.

I decided that it should be possible to combine the churnalism utility with masses of data in a way that would not have been possible even a decade ago. The ability to write programs ('spiders' or 'bots' in this context) that could gather all of the press releases from a single organisation and then submit them to the churnalism facility, combined with cheap and readily available computing power means this is an entirely achievable goal. 

So I programmed spiders that were able to gather every single one of the Environment Agency's press releases, filter out the formatting tags (from the web page) to get the original text, submit them to the churnalism engine and then store and analyse the results.
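The "filter out the formatting tags" step might look something like the sketch below. This is only an illustration using Python's standard-library `html.parser`; the post does not say which language or tools the spiders used, and the names `TextExtractor` and `strip_tags`, the choice of block-level tags, and the sample HTML are all assumptions.

```python
from html.parser import HTMLParser

# Tags treated as block-level here, so that text from adjacent blocks
# does not run together. The exact set is an assumption for this sketch.
BLOCK_TAGS = {"p", "div", "br", "li", "tr", "h1", "h2", "h3", "h4"}
SKIPPED_TAGS = {"script", "style"}


class TextExtractor(HTMLParser):
    """Collect the visible text of a page, dropping tags and scripts."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in SKIPPED_TAGS:
            self._skip_depth += 1
        elif tag in BLOCK_TAGS:
            self._parts.append(" ")

    def handle_endtag(self, tag):
        if tag in SKIPPED_TAGS and self._skip_depth:
            self._skip_depth -= 1
        elif tag in BLOCK_TAGS:
            self._parts.append(" ")

    def handle_data(self, data):
        if not self._skip_depth:
            self._parts.append(data)

    def text(self) -> str:
        # Collapse runs of whitespace into single spaces.
        return " ".join("".join(self._parts).split())


def strip_tags(html: str) -> str:
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()


if __name__ == "__main__":
    sample = "<div><h1>Flood warning</h1><p>River levels are <b>rising</b>.</p></div>"
    print(strip_tags(sample))
```

Note that a sketch like this keeps everything visible on the page, including repeated titles and contact details, which matches the "no editing of the press release" rule described in these posts.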

The data shows several clear patterns:

- The so-called "quality press" are the worst offenders for churning Environment Agency press releases - whilst there were many entries from the tabloids and local papers, their cutting and pasting was less egregious than the "quality press".

- The BBC is by far and away the worst offender for simply repeating whatever the Environment Agency claimed in its press releases. Out of the 393 articles where "significant" churn had taken place, the BBC were responsible for 44%. Likewise for the 49 articles that had "major" churn (meaning in most cases they were almost complete cut and pastes of the press releases), the BBC was responsible for 30.6%.

- I was able to grab a total of 1962 press releases from the Environment Agency giving almost complete coverage of their press release output for the last two years. This total was originally slightly higher, however since the first pass of my bots (to gather the links before downloading with the second pass), the EA has inexplicably removed 10 press releases. I also found a handful of duplicates. The lowest level of granularity I was willing to accept as "detectable" churn (see below) yielded 5089 articles.


- I had set the bar very high for counting articles as "churnalism". The churnalism API uses a "scoring" system that identifies how many 15-character chunks had been copied and/or pasted. It represents a compromise between the lengths of the source and end articles - so, for example, if a large proportion of the original press release has been copied, but this represents a lower proportion of the end article, this will still yield a high score. And vice versa (Louise Gray's articles in the Telegraph often followed this latter pattern for example - with her article being much shorter than the original press release it is copied from).

- Setting the filter on the data I gathered to a score of >=100 yielded 5089 articles. I regard this as an acceptable baseline for "detectable" churn, though for current purposes I am discarding this larger data set. One of my hypotheses is that any distinct patterns should be visible throughout and indeed this appears to be the case - those I counted as "detectable" were made up of 1983 articles from the BBC - a very similar proportion to those in the higher scoring category (38.9%).

- My methodology (to be detailed in a much longer post later on) has massively favoured the media organisations, primarily in three respects:

i) for submitting the press releases to the churnalism engine, I did not edit the press releases to remove anything extraneous - so they often included repeated titles, contact details etc that would lower the percentage of hits in the churnalism engine.

ii) the original 'screen scrapes' of the website press releases included many characters that were difficult to work with programmatically. Single inverted commas (single quotes) cause problems for database processing so these were often removed. Also the data regularly contained unicode - and not all of this would be correctly re-encoded when sent to the churnalism web server. This further reduced the percentage of hits.

iii) Finally - as I already mentioned I set the bar very high for what I counted as cases of definite "churnalism". I decided upon three categories of scoring: 1) "detectable" churn - with scores of 100 or more (in practice this would mean maybe a paragraph had been copied) 2) "significant" churn - with scores of 500 or more (in practice this means more than one paragraph had been copied) and finally 3) "major" churn - a score of 1000 or more meaning the majority, or substantive minority of either/both press release and final article had been copied.

With these three aspects in mind, I consider it almost certain that when other people make use of my data (which I will make publicly available, along with a longer article detailing technical issues) they will find a noticeably higher proportion of churnalism. There would also be a case for lowering the significance score of 500. I also think there is probably a lot of useful information to be found in the 5089 articles with "detectable" churn, however I won't be drawing any conclusions from this larger data set at present.
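To make the idea of shared 15-character chunks more concrete, here is a minimal sketch. It is emphatically not the churnalism engine's actual algorithm or scoring formula, neither of which is described in this post: the function names `chunks` and `overlap`, the sliding-window approach and the lower-casing and whitespace normalisation are all assumptions for illustration. It does show why the same overlap can be a large share of the press release yet a smaller share of the article (or vice versa).

```python
def chunks(text: str, size: int = 15) -> set:
    """Return the set of overlapping `size`-character windows in `text`."""
    normalised = " ".join(text.lower().split())
    return {normalised[i:i + size] for i in range(len(normalised) - size + 1)}


def overlap(press_release: str, article: str, size: int = 15):
    """Count shared chunks and express them as a share of each text.

    Returns (shared_count, percent_of_release, percent_of_article).
    """
    release_chunks = chunks(press_release, size)
    article_chunks = chunks(article, size)
    shared = release_chunks & article_chunks
    pct_release = 100 * len(shared) / len(release_chunks) if release_chunks else 0.0
    pct_article = 100 * len(shared) / len(article_chunks) if article_chunks else 0.0
    return len(shared), pct_release, pct_article


if __name__ == "__main__":
    # Invented sample text, purely for demonstration.
    release = "The agency has completed a new flood defence scheme."
    story = release + " Locals welcomed it."
    print(overlap(release, story))
```

In the sample, every chunk of the (invented) press release appears in the longer story, so the overlap is 100% of the release but less than 100% of the article.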

With those caveats and details in mind, here is the summary of the final results, calculated after any duplicates and outliers had been removed (click for larger image):

It should be noted that this data set alone is comparable in size to that used by the research that formed the basis of Nick Davies' 'Flat Earth News'. It took a team of several researchers many months to pore over the same amount of press releases and articles and come to the conclusions he presents in the book. It took me a week to write the spiders, and I'll be doing this again and again for different organisations and likely revealing more hidden patterns in the data. And yes, I'll be reporting them here first and sharing the data publicly!

For those interested, I aim to give access to the data on my site by next week. I am revamping the site this weekend as I finally have some substantial research to use it to showcase!

For the most appalling cases of churn I submitted them manually to the Churnalism site (a different process to submitting them in an automated fashion to the API). Doing this enables me to save individual examples in the site database and it yields a very useful side by side comparison so it is possible for one to see visually (and compare manually) the worst cases.

For your amusement, amongst my favourites were:

- "Glaciergate should not distract us from climate battle"
Here the chairman of the Environment Agency is asked to write a piece for the BBC. It repeats exactly the majority of a press release issued by the Environment Agency two months beforehand claiming to quote the chairman by declaring what he is going to say at a forthcoming event. Yes I had trouble getting my head around that too.

- "Llamas help protect an ice-age fish"
The infamous (and crazy) Llamas protect fish from climate change press release in fully churned glory.

- "Flood defence project gets small seal of approval"
An absolutely stunning 94% paste job by the BBC. Even the churnalism engine struggles to represent it visually - make sure you click through to the BBC article itself. You can see from eyeballing it and comparing it to the submitted press release that it is a straight cut and paste.

Friday, November 18, 2011

Tuesday, November 15, 2011

Ammo: Privacy 2.0

For an explanation of the 'Ammo' prefix, please see here.

My research work is now reaching the stage where I'm able to usefully share parts of it (updates to the site with actual working parts are coming soon too...). This is the first of those contributions and is a necessary prelude to some upcoming posts that will require reading this first.

I was prompted to write this blog entry today by the very sad news of the death of one of the co-founders of Diaspora, 22-year old Ilya Zhitomirskiy. I'll be writing on Diaspora soon and making important contrasts with Facebook, Google+ and Twitter. I think Diaspora is incredibly useful and will argue why in a later post - including recommending why you should switch from the latter three technologies to Diaspora.

First though, it's important to understand Jonathan Zittrain's concept of 'Privacy 2.0'. And why is this included in the 'Ammo' series? Simply because our conceptions of privacy are outdated - scholars and technologists such as Zittrain and his peers (such as Morozov, Benkler etc) are beginning to provide immensely useful analyses and concepts for understanding the brave new digital world we find ourselves in. The powers that be, including the mass media, have found themselves flat-footed by the new social media technologies. And it's more than that they are vested interests trying (and failing) to protect their turf. It is that they simply don't have the conceptual know-how to even begin to grasp this new domain and the promise (and pitfalls) it offers. Staying ahead of their (albeit very slow) curve will arm you (and help to protect you against the mendacious individuals who have already grasped it...).

Below is a summary of Zittrain's 'Privacy 2.0' concept. It is written in a very academic style and I make no apologies for that - it is one component amongst many in my current academic toolbox and so is expressed that way. I hope you find it useful:

'Privacy 2.0'

Instead of simply ‘privacy’, this particular concept appends the ‘2.0’ to reflect the new era of digital and internet privacy problems that are still being addressed using concepts, practices and legal precedents that can be said to apply to an ‘earlier’ conception of privacy – ‘privacy 1.0’.

The “generative” technologies that form the basis of digital, networking and internet devices and behaviours put old problems of privacy into new and often unexpected configurations. In both the digital and internet landscapes, broadly understood, there are enormous amounts of uncoordinated actions by actors that can be combined in new and often unpredictable ways thanks to these same technologies.

Limitations on actors to preserve freedom have, up until very recently, focused on constraining institutional actors (governments, large corporations etc.). New privacy problems, however, go beyond this traditional paradigm, which centres on the collation of data in centralised databases logging potentially sensitive information on individuals. Whilst this is still an issue within the purview of ‘Privacy 2.0’, it is only a small part of a much wider breed of new problems. More modern legislation in the UK, such as the Data Protection Act, recognises this to a limited degree, yet still largely targets the same institutional actors as previous ‘Privacy 1.0’ legislation. The precedent-setting Privacy Act of 1974 in the U.S. remained limited to public institutions. The 1998 UK Data Protection Act recognises part of the new privacy problem by casting the net wider to “data controllers” generally and investing them with legal responsibilities. The fears motivating both pieces of legislation, however, originate from the idea of “mass dataveillance” – i.e. de facto surveillance via centralised data collection. Solutions such as restraint, disclosure and encryption are appropriate for these ‘Privacy 1.0’ concerns but extremely limited for the new generative technologies.

The generative mosaic

'Generative mosaic' is a term coined by Jonathan Zittrain in ‘The Future of the Internet’ that I think elegantly expresses the data mining privacy issues now coming to the fore.

Certain datasets collected on individuals, even if only focused on one aspect of their behaviour, allow patterns to be mined from the data that the individual themselves may have no awareness of (and thus may also never have cause to complain if such – potentially advantageous – information is used against them).

Such data can be immensely powerful even when only gathered for a very narrow range of behaviour. For example, Amazon were able to roll out differential pricing of their products according to past customer behaviour. They were caught when some individuals deleted their browser cookies and discovered that the advertised price would change (the site no longer having a reference point for the individual’s previous behaviour).

This example brings home the fact that data mining can very quickly produce tangible results with a comprehensive enough data set. Wonga currently claim to have a “market leading” bespoke algorithm for predicting whether a customer is likely to default on their loan. The standard assumed default rate in the retail loan sector is 10%. Wonga claim that their default rate remains in single figures, which is particularly astonishing considering their risky lending sector (short term loans of hundreds of pounds at an astronomical interest rate). Their two primary sources of information are a set of approximately thirty questions asked on the initial application followed by “thousands” of online data points. The fact that Wonga have monetized such information so efficiently demonstrates that there are hidden behavioural cues in people’s data available online that most are not aware of.

(see my earlier blog 'The Rights of Wonga' for more information on this).

The new threats to privacy

Compounded with the ‘generative mosaic’ problem is the fact that government and corporate databases are increasingly less threatening privacy concerns than those created by our ‘generative’ digital technologies and those who use them (virtually everyone in the first world and ever increasing numbers in the second and third worlds).

Ever cheaper processors, networks and sensor technology have created billions of constant data gatherers worldwide. Further, the flow of data from (and to and between) said data gatherers is not generally impeded by gatekeepers – unlike the relatively restrained government and corporate sectors.

A key feature of “Web 2.0” is peer production, and the rise of the ‘prosumer’ – people who are constantly consuming and producing new content, often ‘remixing’ the content they have consumed. The process is chaotic, ever changing and usually without gatekeepers. As a result, the surveillers (and ‘sousveillers’ to add Steve Mann’s lexicon) are us. Government and corporate actors and their intermediaries represent an ever shrinking portion of the ‘Privacy 2.0’ landscape.

From Intellectual Property and Copyright to ‘Privacy 2.0’

“The intellectual property conflicts raised by the generative Internet, where people can still copy large amounts of copyrighted music without fear of repercussion, are rehearsals for the problems of Privacy 2.0” – Jonathan Zittrain, ‘The Future of the Internet’, p.210.

Whilst intellectual property and copyright issues generated by modern digital technologies and environments (‘Web 2.0’) are at the forefront, with various legal scholars such as Yochai Benkler and Lawrence Lessig at the coal face philosophising the quintessential issues and concepts, the impact of ‘Privacy 2.0’ is yet to be truly felt or understood. We are – as Zittrain puts it – effectively ‘all on notice’ as anyone can become a YouTube superstar in minutes.

Daniel Solove, in ‘The Future of Reputation’, considers the impact this can have, highlighting examples such as the ‘bus uncle’ of Hong Kong and ‘dog poo girl’ of South Korea. Incidents which, whilst public, would have remained relatively ephemeral in the past can now be recorded and spread virally across the globe, often with undesirable results due to a mass public reaction that would never before have been possible. ‘Bus uncle’ was the victim of a targeted attack at his workplace and ‘dog poo girl’ left her job, both as a result of the firestorms that followed the videos. Lives are easily ruined in these cases because the total outrage generated is completely disproportionate to the social norm (or possibly, law) violated at the time. And as Zittrain puts it, “…ridicule or mere celebrity can be as chilling as outright disapprobation”.

A debate that regularly resurfaces as a result of this scrutiny concerns the idea of the ‘participatory panopticon’. Popularised by science fiction authors such as David Brin (in non-fiction works), the proposal is that total surveillance would not be a problem if it was comprehensive and equal: any and all surveillers could themselves be surveilled. Steve Mann has frequently carried out ‘sousveillance’ interventions to test social norms in situations of surveillance and data sharing, often finding that a Brin-style participatory panopticon may not be remotely as welcome as Brin and others suppose.

A strong counter-argument to the idea of the participatory panopticon, which itself rests on an assumption of inevitability and technological determinism (c.f. the quote from McNealy – similar sentiments have been expressed by other prominent figures in IT and data mining industries such as ex-Google CEO Eric Schmidt), is the charge that such extensive scrutiny renders us all into automatons – hence the “chilling” effect referred to by Zittrain above. Indeed, Zittrain compares the situation to that of politicians whenever they are in the public eye. He implies that they have been the first to understand and adapt to “Privacy 2.0” in their public behaviour (even if they are disastrous at articulating and applying these concepts), and as such we should regard their behaviour in modern times as the canary in the coal mine:

“Ubiquitous sensors threaten to push everyone toward treating each public encounter as if it were a press conference, creating fewer spaces in which citizens can express their private selves.”
(Zittrain, ‘The Future of the Internet’, p.212).

Public statements by politicians cleave to an uncontroversial and bland centre ground. This isn’t just a result of realpolitik; it is also a direct result of ubiquitous media coverage (increasingly now an activity of citizens – especially those from opposing camps) and the ease with which a sentence can be taken out of context. This has a chilling effect that stifles behavioural outliers, of which politicians are only the most prominent example. Speech and behaviour in the past were only subject to the disapprobation of a relatively small group. Now the exposure group can potentially be society-wide in seconds.

New conceptions of ‘Public’ and ‘Private’

It isn’t just legal conceptions of the ‘Privacy 1.0’ type that lag behind. It is also a whole raft of concepts – one of the most important pairings in this case being our notions of ‘public’ and ‘private’, which still inform debates on privacy using ‘Privacy 1.0’ understandings. The most ubiquitous uses of these terms are not subtle enough to capture what we as individuals may want in privacy terms in the ‘Privacy 2.0’ world. Typically we use notions of ‘public’, ‘private’ and ‘in private’ that forget this. Whilst behaviour ‘in public’ is technically open to the public eye, it is usually only a small number of eyewitnesses, often strangers, who observe it and it remains ephemeral. Once such behaviour can be recorded and broadcast, what were previously private public spaces become public public spaces.

The principle of freedom of speech framed in the U.S. Constitution assumed that private conversations in public spaces would not become public broadcasts. There were no means at the time to effect this. Now that those means do exist, we are effectively naked in the eyes of the law, with no defence and a potentially chilling effect on our behaviour that may leave new or radical behaviours or speech only to those who are completely disconnected from existing norms and so do not fear them.

Combining this ‘conceptual slippage’ (generally ignored – commercial interests are rarely threatened in the same way by ‘Privacy 2.0’ as they are by ‘Web 2.0’ and its attendant dilution of Intellectual Property and Copyright) with the mass generation of fresh data and ever more convenient and accurate means of tagging and identifying (including the kind of facial recognition being pioneered by Facebook, Google and others) creates the perfect ‘Privacy 2.0’ storm.

Generative ‘mash-ups’ of data and the vast array of tools and APIs available for processing them mean it is increasingly trivial to find answers to questions such as ‘where was person x on date y’. And the answers will increasingly be coming from the general public, not from government or corporate surveillance.
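To make the 'where was person x on date y' point concrete, here is a minimal sketch in Python of how such a mash-up might work. Everything in it is an assumption for illustration: the three sources, the `Sighting` record and the sample entries are invented, and no real site or API is queried. The point is only that once separate, individually harmless records share a name and a date, joining them is trivial.

```python
from collections import defaultdict
from datetime import date
from typing import NamedTuple


class Sighting(NamedTuple):
    """One hypothetical public record placing a person somewhere on a day."""
    person: str
    day: date
    place: str
    source: str  # invented label for where the record came from


# Invented sample records standing in for three unrelated public sources.
PHOTO_TAGS = [
    Sighting("person_x", date(2011, 11, 5), "Trafalgar Square", "photo tag"),
]
CHECK_INS = [
    Sighting("person_x", date(2011, 11, 5), "King's Cross", "check-in"),
    Sighting("person_y", date(2011, 11, 5), "Leeds", "check-in"),
]
STATUS_UPDATES = [
    Sighting("person_x", date(2011, 11, 6), "Brighton", "status update"),
]


def build_mosaic(*sources):
    """Merge independent record lists into one index keyed by (person, day)."""
    mosaic = defaultdict(list)
    for records in sources:
        for s in records:
            mosaic[(s.person, s.day)].append((s.place, s.source))
    return mosaic


def where_was(mosaic, person, day):
    """Return every (place, source) pair recorded for a person on a given day."""
    return sorted(mosaic.get((person, day), []))


mosaic = build_mosaic(PHOTO_TAGS, CHECK_INS, STATUS_UPDATES)
print(where_was(mosaic, "person_x", date(2011, 11, 5)))
```

None of the three lists says much on its own; it is the merged index that answers the question, and each extra source passed to `build_mosaic` sharpens the picture further.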

For a practical application of this, see for example Tom Owad’s application for identifying subversives on Amazon via their wishlists using a bot that queries the site. A mashup combining this with photo recognition technologies would be relatively straightforward and itself could be combined and recombined endlessly with other mashups to provide a much sharper slice of someone’s life than is now visible through even the most invasive state database systems in the world.

The ‘peer production’ technologies have, effectively, been disruptive for Intellectual Property and Copyright, whilst restrictive for privacy. Even were it possible to circumscribe a database describing the total picture of an individual that is accessible digitally, this would be of little use, because what the database itself contains changes rapidly, often from one moment to the next. The emergent and inscrutable nature of the outputs of these technologies means that the current ‘Privacy 1.0’ concepts and legal structure we operate within, from the government to the corporate to the academic worlds, are urgently in need of revision.

Friday, November 11, 2011

They Shall Not Grow Old

Those of us in generations 'jones', 'x' and 'millennial' will be the last to have known veterans from either of the World Wars. Something about that fact makes me profoundly sad, although I'm not entirely sure why. I think part of it may be to do with the fact that it only took 10 years for the upcoming generations that will follow us to forget who Osama Bin Laden was...

In Flanders fields the poppies blow
Between the crosses, row on row,
That mark our place; and in the sky
The larks, still bravely singing, fly
Scarce heard amid the guns below.

We are the dead. Short days ago
We lived, felt dawn, saw sunset glow,
Loved, and were loved, and now we lie
In Flanders fields.

Take up our quarrel with the foe:
To you from failing hands we throw
The torch; be yours to hold it high.
If ye break faith with us who die
We shall not sleep, though poppies grow
In Flanders fields.
– Lt.-Col. John McCrae