Tuesday, January 17, 2012

OPERATION BLACKOUT - STOP SOPA / PIPA


This post is in support of the anti-SOPA / PIPA actions that are about to start in the U.S.  Details of the "internet strike" are here. This blog, and the handful of sites I administer will also be joining the strike, which now includes big hitters such as Google, Wikipedia and Reddit.

Whilst SOPA has (for now) been put on ice, its sister bill, PIPA is still alive in the Senate. SOPA could quite easily return as it is on hold. Both need to be chopped up and burned, never to see the light of day again. Just like the Digital Economy Act here in the UK before it, though a much more far reaching version, it hands far too much power to copyright holders and will likely be used to censor huge swathes of the net. I say this as someone who supports IP / copyright in some form, though one who is still undecided on what the solutions might be - many aspects of the issue are intractable. What I do know is that concentrating power in the hands of copyright holders, who - in the case of SOPA / PIPA supporters - represent the old school mass media interests who really should realise that their time to die is long past and are desperately clinging on, is a terrifying and deeply wrong turn of events.

This video explains the mortal danger SOPA / PIPA poses to all of us, not just citizens of the U.S.

And if you think it doesn't directly affect Britons, just consider the fate of Richard O'Dwyer - if SOPA or PIPA pass, expect to see dozens if not hundreds more cases like his, and for a much milder "crime". It will surely include many of us in the blogosphere.

In solidarity with the sites that will be going into "blackout", all of my other posts will revert to 'draft' and be inaccessible for the duration of the strike - disappearing suddenly into the ether in exactly the same way that many sites will if SOPA / PIPA pass.

Join the strike! If you have no sites to 'black out' then help to raise awareness if you can.

And one last thing that has been lost in the drama of SOPA / PIPA is that the DEA has now officially been ruled compatible with EU law. Expect a fight on our hands very shortly on our own shores as it is enforced.....

Saturday, December 24, 2011

Some Christmas Cheer.....

We all know that the Euro is going down and quite possibly the EU with it. After we've all recharged our batteries over Christmas though, let's be sure to give it a helping hand in the New Year:




Saturday, November 26, 2011

Churnalism: DEFRA churn - the Guardian is in the lead!

(UPDATE: 27/11/11 - raw churnalism data available HERE - JSON format, zipped, 448kb)

After the churn analysis of the Environment Agency press releases (please read that article for more details and important caveats if you haven't read one of my churn posts before), I followed up with DEFRA. I will be making the raw data publicly available tomorrow for both the Environment Agency and DEFRA churn analyses.

This time I was able to construct the spiders and process the data faster - I also avoided (most of) the unicode problems that plagued me with the EA data, so this analysis can be considered slightly more accurate and slightly less forgiving of the media organisations (though it still strictly follows the rules I set down previously, such as no editing of the press release to remove extraneous information). Along with inevitable issues with some difficult characters slipping through and no editing of the press releases, it still means the data will naturally favour the media organisations. As I said on previous posts, when I make the raw data publicly available, churn analysis by other people will very likely improve upon my methods and yield results more detrimental for the media.

In any case - onto the results. The summary is presented below (click for full size image):

"Quality" press churn results






Summary of results:

- A total of 386 press releases were analysed, from 13th May 2010 to 24th November 2011. These generated 1959 detectable cases of churn. Again, there is probably a lot of interesting data within the "detectable" category that deserves analysis at a later date. For now it is discarded.

- Out of those, 173 were classified as "significant" and 18 as "major".

- The Guardian was the leader in both categories by a long way - accounting for 19.65% of significant churn and 27.78% of major churn.

- The BBC followed close behind in terms of significant churn with 16.18%, though for major churn was beaten into third place by both the Independent and the Daily Mail with a joint 16.67%.

- The Independent came third in the significant churn classification.

- A common factor in the most highly churned articles both in this analysis and the previous two appears to be lack of a named author in most cases (though see one of the exceptions detailed below). This suggests the media organisations are aware that what they are doing is not kosher.

- Continuing a theme from the last two churn analyses, the tabloids consistently embarrass the so called "quality press". This time I pulled out the statistics for the UK's major tabloids for comparison (click for full size image):

Tabloid press churn results
When I first started these analyses I fully expected to see a much higher showing of churn by the tabloids. It is interesting to see the contrast. Also out of the churn analyses done so far, it is consistently the Mirror out of the tabloids that has the highest percentage of churn.

As usual I select a few of the more egregious cases of churn for your entertainment (and importantly - provide a manual submission to the churnalism database so they can be seen visually):

'Gloucestershire Old Spots pork protected by Europe'
An absolutely cracking BBC 79% cut and paste job on - er - crackling.

'Bonfire of the Quangos'
Remember that list of Quangos that were to go? Completely cut and pasted from a press release. This one is particularly fascinating because in the two worst cases the cut and paste was the list provided in the press release. It actually included several paragraphs laying out a context that was not cut and pasted across. If it had just been the list in the original both would have scored close to 100% pastes.....
The pastes are so large in any case that the churnalism engine falls over when the 'view' button is clicked to see the visualised version. Be warned if you click it, your browser may hang.

'New service for householders to stop unwanted advertising mail'
Absolute carnage on the churning front here with the majority of the main media outlets represented. The Guardian appeared to like this story so much they cut and pasted it twice - and this time each article has a named author. Where the hell was the editor?


Friday, November 25, 2011

It's mob rule at the Guardian....

(This blogpost should perhaps also be titled - 'What I did/didn't/did say at the Guardian today....')

There's nothing quite like rank hypocrisy to boil my piss. However, to ensure it is fully evaporated in anger, combine rank hypocrisy with crass stupidity, naked opportunism, complete resistance to facts or reason and censorship.

For that was the bread and butter of Leo "bless 'im" Hickman's disgraceful piece of yellow bellied journalism at the Guardian today.

Hickman decided it was time to form a posse comitatus to try tracking down the source of the climategate emails, laughably using the README textfile included in the latest tranche of releases as the primary source of evidence.

This was one of those pieces - especially as it was in the comment is free if you agree section - that really reveals the Guardian's true colours. Numerous commentators including me (prior to the first round of censorship - sorry - 'comment adjustment') attempted to point out the Guardian's and Hickman's rank hypocrisy on this issue. The most striking and obvious example having been the paper's massive support for Wikileaks, however there were many other examples, including the anonymous Enron whistleblower, as another commenter pointed out. As was repeated again and again, it appeared that all leakers were equal but some were more equal than others in the Guardian's eyes.

This was of course brushed off by Hickman and his part-time principle party of followers in the comments section.

Next I pointed out (prior to 'comment adjustment') that claiming it was the work of a hacker was still just an assumption. Hickman replied to me directly on that and similarly brushed it off. He claimed it was irrelevant. The poor dear didn't seem to realise that if he assumed it was the work of a hacker and in fact it was a leaker then his "investigation" would lead him down to all sorts of blind alleys, not least because the MO and levels of access would be completely different (not to mention the trail of evidence left behind).

There were a plethora of delightfully dense comments in support of Hickman et al and stunning leaps of reasoning. These people were also apparently immune to criticism because they "knew" what they were claiming was true, especially regarding the "hacker" claim. Many pronounced completely ill-informed statements about this, showing that i) they knew nothing about IT security and ii) they couldn't even be bothered to use Google to check details. After all, the difference between an internal security breach and a carefully coordinated external breach is vast. Pointman gave an excellent overview after the first climategate - here. Moreover they absolutely did not care about their ignorance. What a familiar pattern, eh? No wonder they were immediately supportive of the "scientists" at the heart of the climategate storm - they're just like them!

There were some absolute crackers amongst the received wisdom of this bunch of easily led zealots and I highly recommend you read through the comments - well those that are left - as it is a laugh a minute.

Komment Macht Frei

Speaking of the comments - when the piece first appeared this morning, it was absolute devastation from the moderator. ALL of my comments bar the first one were censored, as were numerous other comments by others. I had no clue why they'd been removed beyond the fact that we all seemed to disagree intensely with Hickman.

Now I should point out something important here for Guardian watchers - they have two types of post moderation. There is the one we're all familiar with - where the boilerplate 'this comment was moderated because it breached our community (puke) standards' appears - but there's also a much more insidious type and I only noticed it because I've been paying a lot of attention to their censorship pattern over the last couple of years - it's what I call "nuking". In this case they remove all evidence that the comment was ever there. It's particularly chilling for freedom of speech because aside from the fact that by looking at the comments one can't actually assess the general level of censorship, if it's *your* comment that disappears in this way it's only your word that it was ever there in the first place....

Now bizarrely, after the comments spilled over onto two pages I happened to click back to the first page to see what else had been censored and was surprised to see that most of my previously "moderated" comments had reappeared (except for the "nuked" ones). I don't know if this is a bug in their software or a disagreement between moderators but it adds even more to the general sense of confusion and latent fear of arbitrary censorship that completely fucks any meaningful contribution over there.

Another important point to be aware of is this: one way to guarantee being censored on the Guardian is to make a reference to your own, or someone else's, having been censored. You will immediately be censored, and they often use the "nuke" option too.

The Guardian is  - as a media institution - utterly reprehensible. Most other media outlets are of course too, across the political spectrum. But none outside the BBC attempt to present themselves so often as the default "good guys", nor do their followers similarly regard it as received wisdom...

The climategate 'gait' or the 'out of context paradox'

There's a regular pattern that occurs in any discussion of climategate (1 or 2). It is inconsistent but also entirely consistent with the unthinking nature of many of those who promulgate it:
i) They assert that the emails were "taken out of context"
ii) Responder says that they are not.
iii) A request is then made for evidence.
iv) Responder invites them to read the emails - there are numerous complete email chains, supporting claims against the "scientists" that ONLY MAKE SENSE IN CONTEXT. But the trick is you have to actually read the emails....

A modern day climate "scientist"
Now given how unambiguous some of the exchanges are (in particular those that involve purposefully frustrating FOI inquiries and deleting emails....) one is then prompted to ask exactly what standard of evidence is required. For the evidence before us, if for example we stick with complete email chains rather than individual comments, is an order of magnitude higher than the typical standard accepted in the vast majority of journalism that we ever read or see. It means that - to be consistent - if one were to completely reject these email chains as sufficient evidence, one would have to throw out almost every received opinion on any quoted person in the press one has ever encountered. Will the zealots do that...no of course they won't. But of course consistency is in the same disused box in their basement as a regard for truth....

One final delicious irony of this of course is that 'The Team' will surely be scratching their heads now, trying to remember what on earth was said to whom. But because they very likely deleted these emails after they had been copied from the mailserver, they have only one place to go to check.....

Thursday, November 24, 2011

Churnalism: Churning 'Frack Off'

In several comments on my previous piece of work on churnalism I read at Biased BBC, the activist group 'Frack Off' were mentioned and questions were asked whether they had any detectable churn in the media as online links were often found to them at the BBC and the Guardian sites by Biased BBC readers.

I decided to have a look.

The group is very new and they have only released four official press releases. This meant I could work fast with this data as I wouldn't need to write specialised spiders for gathering, analysing or submitting the data as it could be done manually with such a small data set.

Again I find myself being surprised by the results:

A reminder on the scoring criteria:

>=100 is classified as "detectable" churn. I usually discard these results and they will always be discarded for comparing one set of data to another (e.g. this analysis to the previous environment agency analysis), however it still yields a rich seam of data and as this data set is so small compared to the previous one I decided to take some time to look into some of these.
>=500 is classified as "significant" churn.
>=1000 is classified as "major" churn - in these cases the articles simply could not have been written without cutting and pasting the bulk of its material from the press release.
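
As a minimal sketch of the three thresholds above (the function name and the "none" label for scores under 100 are my own illustrative choices, not anything supplied by the Churnalism engine), the classification could be written as:

```python
def classify_churn(score):
    """Map a churnalism score onto the categories used in this post."""
    if score >= 1000:
        return "major"
    if score >= 500:
        return "significant"
    if score >= 100:
        return "detectable"
    # Hypothetical label for anything below the "detectable" baseline.
    return "none"


# Scores quoted elsewhere in this post: 96 (borderline), 448, 521 and 870.
for score in (96, 448, 521, 870):
    print(score, classify_churn(score))
```

Checking from the highest threshold downwards means each score lands in exactly one category, so a "major" article is never double counted as "significant".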


Results:

- Out of four press releases, three have generated a total of 13 articles with detectable churn according to my criteria (score of >= 100 from the Churnalism engine). 14 were originally found but I removed one Guardian article as it was detected twice (probably because of the similar screeds issued by 'Frack Off' in their press releases).

- Out of those, 5 were significant churn (score of >= 500) - and as they were only a handful I have entered them manually into the Churnalism database so you can see the side by side comparisons yourself. 2 were from the Guardian, 1 from the Mirror, 1 from the Times and 1 from the Daily Mail.

- Several of the remaining 8 articles with detectable churn come very close to the >= 500 criterion (details below) and indeed in a couple of cases the Churnalism engine considers them to be significant enough to display when manually input (an API input by contrast gives an exact score; the Churnalism engine seems to have a less forgiving standard than myself and will display many articles with a score between 400-500 as significant).

Further comments:

Now, even being generous to the media organisations and 'Frack Off', this means 50% of their press releases are being significantly churned - and primarily by the Guardian and the Mirror. One of the Mirror's contributions came close to a "major" (>= 1000) piece of churn with a score of 870. I suspect that were I less forgiving with my methodology, removing extraneous elements in the initial press release (links, contact information etc), and did not have to remove some characters to ease processing (e.g. single quotes), this may well have scored as 'major churn'. You can eyeball it yourself in any case here.

I say that 50% assessment is 'being generous' to 'Frack Off' also because it includes a press release they issued yesterday so it will not have had time to percolate through the media yet. If the press release yesterday results in any significant churn (which I will check again in a week or two), that percentage will climb to 75%.

Further details on the data for each press release:

26th October 2011 press release - submitting it to the API yielded 3 detectable cases of churn and one borderline (score 96), so I looked at them in more detail. One of these articles is from the Guardian and already highlighted by the Churnalism engine as containing significant churn from one of the other FO press releases. Two are Telegraph articles, one by Louise Gray - both name check the activist group. The other is a Times article and unfortunately I can't verify anything beyond the paywall.

2nd November 2011 press release - 7 detectable cases of churn - 4 were "significant" churn and a further 2 came very close to being considered "significant" by my scoring criteria - one from the Independent (score: 436) and one from the Guardian (score: 448). Notice how the Independent article also cites: Chris Huhne, the WWF, Friends of the Earth and Greenpeace. Cuadrilla Resources - the company responsible for the Fracking discussed at the heart of the article - get one response of similar length to the others, along with a neutral response from the shadow energy secretary, Tom Greatrex. Cuadrilla don't get a single response in the Guardian article and, in mentioning the independent report commissioned by Cuadrilla, the article fails to mention that the report concluded that another earthquake incident was unlikely.

The Churnalism engine considers one of the Guardian articles significant even though it doesn't hit my threshold (500) - it scores 448 - and the manual search on Churnalism.com displays it accordingly. At the same time, the Daily Mail entry with a score of 521 isn't listed on the manual search, so this balances out.

See visual breakdowns of significant / detectable churn for this press release here.

3rd November 2011 press release - four detectable cases of churn, one from the Guardian classifies as "significant".

See visual breakdowns of significant / detectable churn for this press release here.


23rd November 2011 press release - no detectable churn via either manual entry to Churnalism.com or to the more sensitive API system. This could well change however as it was only yesterday this PR was issued.
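
The 50% and 75% figures quoted earlier follow directly from these per-release counts. A small sketch of the arithmetic (the dictionary and function names are illustrative only, and the zero for the 26th October release is inferred from the totals rather than stated outright):

```python
# "Significant" churn counts per press release, as reported above.
significant_by_release = {
    "26 Oct 2011": 0,  # inferred: 3 detectable cases, none counted as significant
    "2 Nov 2011": 4,
    "3 Nov 2011": 1,
    "23 Nov 2011": 0,
}


def share_churned(counts):
    """Percentage of press releases with at least one significant article."""
    churned = sum(1 for n in counts.values() if n > 0)
    return 100.0 * churned / len(counts)


current = share_churned(significant_by_release)

# Hypothetical scenario: the 23 November release later picks up
# one significantly churned article.
what_if = {**significant_by_release, "23 Nov 2011": 1}
projected = share_churned(what_if)

print(current, projected)
```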

Final comments:

I find these results very concerning. A small single-issue activist group that has only existed for a matter of months should not be generating such a significant amount of churn with just four press releases. The Guardian and the Mirror in particular appear to be giving the group a free ride (though also see the commentary above on one of the higher scoring Independent articles). The Environment Agency at least has some kind of mandate and does carry out a wide variety of tasks - to that extent it's no surprise some of its press releases are churned (though this is no excuse for the journalists concerned, or indeed for any EA employees who are aware of and have no issue with any kind of symbiotic relationship here).

This is extremely dangerous, especially for such an important issue and shows how groups like 'Frack Off' can be so polarising. On a personal level I do believe there are legitimate environmental concerns regarding Shale Gas. Unlike 'Frack Off' however I consider these issues surmountable. They make clear from their website that they want this to be a polarising issue regardless of the facts. They primarily cite Gasland - a "documentary" which is itself making rational discussion of the issue of Fracking all but impossible. It is primarily a series of anecdotes that are themselves seriously problematic as evidence. 'Frack Off' et al should be citing careful and replicable research such as this for discussion. Why don't they? And moreover why doesn't the media, who are apparently falling over themselves to repeat 'Frack Off' claims without doing the very research most of us expect of them? My guess is that the research isn't nearly alarmist enough for them and has plenty of caveats that prevent them from easily reporting it as such.

And whilst I am sure many 'Frack Off' adherents are, as we speak, screaming that the unfolding "Climategate 2" saga is all about emails "being taken out of context", they themselves have a website about Fracking that not only cites factually incorrect and misleading material but - crucially - misses the all important context that Shale Gas could be an enormous energy (not to mention political) game changer. Meanwhile all of the companies involved in prospecting and mining are listed as "bad guys". It's utterly juvenile.

But what is truth to people like this? Or indeed to the mass media, whom we trust to continually nurture its flame, not murder it as they seem to be doing?

Tuesday, November 22, 2011

Ammo: Churnalism - Churning the Environment Agency

(UPDATE 27/11/11 : raw churnalism data now available HERE (JSON format, zipped,  2.1MB))

My work on bot-writing and churnalism has finally started to bear fruit. And it's not often I write a blog here that is an "exclusive", however this is one of the few!

A large part of my research over the last year has focused on the nature of digital and virtual technologies and in particular on the nature of censorship and propaganda online. A perennial concern, as outlined in my Privacy 2.0 post is the arrival of 'mass dataveillance', where the monitoring of an individual is far less important - in both import and consequences - than the gathering of masses of data about lots of individuals. This is because those masses of data are able to reveal patterns and links that the individual members of the data set are very likely unaware of.

However, this is something that works for us as much as against us if we're willing to use the freely available data in the public domain. It's not just large corporations of whom one is suspicious who can gather, analyse and deploy the masses of data.

Nick Davies, in his excellent book, 'Flat Earth News', popularised the term "Churnalism". His book, and the research it is based on was an absolutely damning indictment of the UK media, with similar implications for the entirety of Western media itself. Davies said journalists: “....are no longer gathering news but are reduced instead to passive processors of whatever material comes their way, churning out stories, whether real event or PR artifice, important or trivial, true or false”.

His book was substantially based on research he commissioned from the University of Cardiff. In the course of that research, amongst other things, what they found "suggests that 60% of press articles and 34% of broadcast stories come wholly or mainly from one of these ‘pre-packaged’ sources." - a phenomenon that many of my regular readers will no doubt be familiar with and one that has now sadly started becoming commonplace in other areas too - we've been moving ineluctably from just 'journalism by press release' to also 'science by press release'.

Upon reading this research over a year ago, one of my first thoughts having had years of programming experience was, 'that could be automated!'. And not only could it be automated, but with enough data, direct patterns of bias and influence could also be detected.  Lo and behold, around that time, the wonderful churnalism engine appeared.

This site plugs into the Journalisted database, where every single press item is archived online. The churnalism engine uses an algorithm that enables anyone to manually copy and paste text from any source (though most usually a press release) and compare it quickly to the entire Journalisted database. It is then able to report back any cases that are likely to have been cut and pasted, along with an estimate of the proportions copied. It means one can effectively trace the provenance of many news stories back to press releases and also assess how much has been copied directly into the article - articles that we are so often led to believe are supposed to uphold high journalistic standards.

I decided that it should be possible to combine the churnalism utility with masses of data in a way that would not have been possible even a decade ago. The ability to write programs ('spiders' or 'bots' in this context) that could gather all of the press releases from a single organisation and then submit them to the churnalism facility, combined with cheap and readily available computing power means this is an entirely achievable goal. 

So I programmed spiders that were able to gather every single one of the Environment Agency's press releases, filter out the formatting tags (from the web page) to get the original text, submit them to the churnalism engine and then store and analyse the results.
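
For anyone wanting to try something similar, the tag-filtering step might look roughly like the sketch below. It is not the code my spiders used: the class and function names are made up for illustration, it relies only on Python's standard html.parser module, and the gathering and submission steps are left out entirely.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect the visible text of a page, ignoring script and style blocks."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self._chunks.append(data)

    def text(self):
        # Join the pieces with spaces and collapse runs of whitespace.
        # Note: this also puts a space where an inline tag splits a word.
        return " ".join(" ".join(self._chunks).split())


def strip_tags(html):
    """Return the plain text of an HTML fragment."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()


# Invented sample page, not a real press release.
sample = (
    "<html><body><h1>Flood  warning</h1>"
    "<p>Rivers are &amp; remain high.</p>"
    "<script>x=1</script></body></html>"
)
print(strip_tags(sample))
```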

The data shows several clear patterns:

- The so called "quality press" are the worst offenders for churning Environment Agency press releases - whilst there were many entries from the tabloids and local papers, their cutting and pasting was less egregious than the "quality press".

- The BBC is by far and away the worst offender for simply repeating whatever the Environment Agency claimed in its press releases. Out of the 393 articles where "significant" churn had taken place, the BBC were responsible for 44%. Likewise for the 49 articles that had "major" churn (meaning in most cases they were almost complete cut and pastes of the press releases), the BBC was responsible for 30.6%.

- I was able to grab a total of 1962 press releases from the Environment Agency giving almost complete coverage of their press release output for the last two years. This total was originally slightly higher, however since the first pass of my bots (to gather the links before downloading with the second pass), the EA has inexplicably removed 10 press releases. I also found a handful of duplicates. The lowest level of granularity I was willing to accept as "detectable" churn (see below) yielded 5089 articles.

Details:

- I had set the bar very high for counting articles as "churnalism". The churnalism API uses a "scoring" system that identifies how many 15-character chunks had been copied and/or pasted. It represents a compromise between the lengths of the source and end articles - so, for example, if a large proportion of the original press release has been copied, but this represents a lower proportion of the end article, this will still yield a high score. And vice versa (Louise Gray's articles in the Telegraph often followed this latter pattern for example - with her article being much shorter than the original press release it is copied from).

- Setting the filter on the data I gathered to a score of >=100 yielded 5089 articles. I regard this as an acceptable baseline for "detectable" churn, though for current purposes I am discarding this larger data set. One of my hypotheses is that any distinct patterns should be visible throughout and indeed this appears to be the case - those I counted as "detectable" were made up of 1983 articles from the BBC - a very similar proportion to those in the higher scoring category (38.9%).

- My methodology (to be detailed in a much longer post later on censoring.me) has massively favoured the media organisations, primarily in three respects:

i) for submitting the press releases to the churnalism engine, I did not edit the press releases to remove anything extraneous - so they often included repeated titles, contact details etc that would lower the percentage of hits in the churnalism engine.

ii) the original 'screen scrapes' of the website press releases included many characters that were difficult to work with programmatically. Single quotes cause problems for database processing so these were often removed. Also the data regularly contained unicode - and not all of this would be correctly re-encoded when sent to the churnalism web server. This further reduced the percentage of hits.

iii) Finally - as I already mentioned I set the bar very high for what I counted as cases of definite "churnalism". I decided upon three categories of scoring: 1) "detectable" churn - with scores of 100 or more (in practice this would mean maybe a paragraph had been copied) 2) "significant" churn - with scores of 500 or more (in practice this means more than one paragraph had been copied) and finally 3) "major" churn - a score of 1000 or more meaning the majority, or substantive minority of either/both press release and final article had been copied.

With these three aspects in mind, I consider it almost certain that when other people make use of my data (which I will make publicly available, along with a longer article detailing technical issues) they will find a noticeably higher proportion of churnalism. There would also be a case for lowering the significance score of 500.  I also think there is probably a lot of useful information to be found in the 5089 articles with "detectable" churn, however I won't be drawing any conclusions from this larger data set at present.
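
To give a feel for what counting 15-character chunks involves, here is a rough sketch. To be clear, this is not the Churnalism engine's actual scoring formula, which I am not reproducing here; it simply counts the overlapping 15-character windows that two texts share and reports that as a proportion of each text. The function names and sample sentences are invented for illustration.

```python
CHUNK = 15  # window length, matching the 15-character chunks described above


def chunks(text, size=CHUNK):
    """Set of all overlapping character windows of the given size."""
    return {text[i:i + size] for i in range(len(text) - size + 1)}


def overlap(press_release, article):
    """Count shared windows and express them as a share of each text."""
    pr, art = chunks(press_release), chunks(article)
    shared = pr & art
    return {
        "shared": len(shared),
        "pct_of_release": 100.0 * len(shared) / len(pr) if pr else 0.0,
        "pct_of_article": 100.0 * len(shared) / len(art) if art else 0.0,
    }


# Invented sentences, purely to exercise the functions.
release = "The Environment Agency has completed a new flood defence scheme."
article = "A new flood defence scheme has opened, the council said."
result = overlap(release, article)
print(result)
```

Reporting the overlap against both texts separately shows why a short article lifted from a long press release (or the reverse) can still register strongly on one side even when the other proportion is low.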

With those caveats and details in mind, here is the summary of the final results, calculated after any duplicates and outliers had been removed (click for larger image):






It should be noted that this data set alone is comparable in size to that used by the research that formed the basis of Nick Davies' 'Flat Earth News'. It took a team of several researchers many months to pore over the same amount of press releases and articles and come to the conclusions he presents in the book. It took me a week to write the spiders, and I'll be doing this again and again for different organisations and likely revealing more hidden patterns in the data. And yes, I'll be reporting them here first and sharing the data publicly!

For those interested, I aim to give access to the data on censoring.me by next week, which I am revamping this weekend as I finally have some substantial research to use the site to showcase!

For the most appalling cases of churn I submitted them manually to the Churnalism site (a different process to submitting them in an automated fashion to the API). Doing this enables me to save individual examples in the Churnalism.com site database and it yields a very useful side by side comparison so it is possible for one to see visually (and compare manually) the worst cases.

For your amusement, amongst my favourites were:

- "Glaciergate should not distract us from climate battle"
Here the chairman of the Environment Agency is asked to write a piece for the BBC. It repeats exactly the majority of a press release issued by the Environment Agency two months beforehand claiming to quote the chairman by declaring what he is going to say at a forthcoming event. Yes I had trouble getting my head around that too.

- "Llamas help protect an ice-age fish"
The infamous (and crazy) Llamas protect fish from climate change press release in fully churned glory.

- "Flood defence project gets small seal of approval"
An absolutely stunning 94% paste job by the BBC. Even the churnalism engine struggles to represent it visually - make sure you click through to the BBC article itself. You can see from eyeballing it and comparing it to the submitted press release that it is a straight cut and paste.



Friday, November 18, 2011