Tuesday, November 22, 2011

Ammo: Churnalism - Churning the Environment Agency

(UPDATE 27/11/11 : raw churnalism data now available HERE (JSON format, zipped,  2.1MB))

My work on bot-writing and churnalism has finally started to bear fruit. It's not often that I write a post here that is an "exclusive", but this is one of the few!

A large part of my research over the last year has focused on the nature of digital and virtual technologies and in particular on the nature of censorship and propaganda online. A perennial concern, as outlined in my Privacy 2.0 post is the arrival of 'mass dataveillance', where the monitoring of an individual is far less important - in both import and consequences - than the gathering of masses of data about lots of individuals. This is because those masses of data are able to reveal patterns and links that the individual members of the data set are very likely unaware of.

However, this works for us as much as against us, if we're willing to use the data freely available in the public domain. It's not just large corporations of whom one is suspicious that can gather, analyse and deploy masses of data.

Nick Davies, in his excellent book, 'Flat Earth News', popularised the term "Churnalism". His book, and the research it is based on, was an absolutely damning indictment of the UK media, with similar implications for the entirety of Western media itself. Davies said journalists: “....are no longer gathering news but are reduced instead to passive processors of whatever material comes their way, churning out stories, whether real event or PR artifice, important or trivial, true or false”.

His book was substantially based on research he commissioned from the University of Cardiff. Amongst other things, what the researchers found "suggests that 60% of press articles and 34% of broadcast stories come wholly or mainly from one of these ‘pre-packaged’ sources." This is a phenomenon that many of my regular readers will no doubt be familiar with, and one that has sadly started becoming commonplace in other areas too - we've been moving ineluctably from just 'journalism by press release' to also 'science by press release'.

Upon reading this research over a year ago, one of my first thoughts, having had years of programming experience, was 'that could be automated!'. And not only could it be automated, but with enough data, direct patterns of bias and influence could also be detected. Lo and behold, around that time, the wonderful churnalism engine appeared.

This site plugs into the Journalisted database, where every single press item is archived online. The churnalism engine uses an algorithm that lets anyone manually copy and paste text from any source (though most usually a press release) and compare it quickly to the entire Journalisted database. It then reports back any cases that are likely to have been cut and pasted, along with an estimate of the proportions copied. This means one can effectively trace the provenance of many news stories back to press releases and also assess how much has been copied directly into the article - articles that we are so often led to believe uphold high journalistic standards.
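To make the idea of "proportions copied" concrete, here is a minimal sketch in Python. It is not the churnalism engine's actual algorithm, which I have not seen. It simply assumes a sliding window of 15-character chunks (the chunk length mentioned in the details further down) and reports what share of each text is made up of chunks found in the other. The function names `chunks` and `overlap` are my own, purely for illustration.

```python
def chunks(text, size=15):
    """Return the set of all overlapping `size`-character windows in text.

    Case and runs of whitespace are normalised first, so trivial
    reformatting does not hide a copied passage.
    """
    normalised = " ".join(text.lower().split())
    return {normalised[i:i + size] for i in range(len(normalised) - size + 1)}


def overlap(press_release, article, size=15):
    """Return (share of press release reused, share of article that is reused)."""
    pr = chunks(press_release, size)
    art = chunks(article, size)
    shared = pr & art
    pr_share = len(shared) / len(pr) if pr else 0.0
    art_share = len(shared) / len(art) if art else 0.0
    return pr_share, art_share


if __name__ == "__main__":
    release = "The agency has completed the new flood defence scheme."
    story = "Local news: " + release + " Residents welcomed it."
    print(overlap(release, story))
```

Reporting two shares rather than one reflects the point that a short article can be almost entirely lifted from a long press release, and vice versa.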

I decided that it should be possible to combine the churnalism utility with masses of data in a way that would not have been possible even a decade ago. The ability to write programs ('spiders' or 'bots' in this context) that could gather all of the press releases from a single organisation and then submit them to the churnalism facility, combined with cheap and readily available computing power, means this is an entirely achievable goal.

So I programmed spiders that were able to gather every single one of the Environment Agency's press releases, filter out the formatting tags (from the web page) to get the original text, submit them to the churnalism engine and then store and analyse the results.
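For readers curious about the "filter out the formatting tags" step, the following is a minimal sketch of one way to do it, using only Python's standard library. It is an illustration under my own assumptions, not the code of the spiders themselves: the class name `TextExtractor` and the function `html_to_text` are hypothetical, and the fetching, submission and storage steps are deliberately left out.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, skipping the contents of <script> and <style>."""

    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep text only when we are not inside a skipped element.
        if not self._skip_depth:
            self.parts.append(data)


def html_to_text(html):
    """Return the text content of an HTML fragment with whitespace collapsed."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()  # flush any text still buffered by the parser
    return " ".join(" ".join(parser.parts).split())


if __name__ == "__main__":
    page = "<h1>Flood warning</h1>\n<p>River levels are rising.</p>"
    print(html_to_text(page))
```

Joining the pieces with spaces is a simplification: text split by inline tags (bold, links) can pick up a stray space before punctuation, which would need tidying before comparison.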

The data shows several clear patterns:

- The so-called "quality press" are the worst offenders for churning Environment Agency press releases - whilst there were many entries from the tabloids and local papers, their cutting and pasting was less egregious than that of the "quality press".

- The BBC is by far and away the worst offender for simply repeating whatever the Environment Agency claimed in its press releases. Out of the 393 articles where "significant" churn had taken place, the BBC was responsible for 44%. Likewise for the 49 articles that had "major" churn (meaning in most cases they were almost complete cut and pastes of the press releases), the BBC was responsible for 30.6%.

- I was able to grab a total of 1962 press releases from the Environment Agency, giving almost complete coverage of their press release output for the last two years. This total was originally slightly higher; however, since the first pass of my bots (to gather the links before downloading with the second pass), the EA has inexplicably removed 10 press releases. I also found a handful of duplicates. The lowest level of granularity I was willing to accept as "detectable" churn (see below) yielded 5089 articles.

Details:

- I had set the bar very high for counting articles as "churnalism". The churnalism API uses a "scoring" system that identifies how many 15-character chunks had been copied and/or pasted. It represents a compromise between the lengths of the source and end articles - so, for example, if a large proportion of the original press release has been copied, but this represents a lower proportion of the end article, this will still yield a high score. And vice versa (Louise Gray's articles in the Telegraph often followed this latter pattern for example - with her article being much shorter than the original press release it is copied from).

- Setting the filter on the data I gathered to a score of >=100 yielded 5089 articles. I regard this as an acceptable baseline for "detectable" churn, though for current purposes I am discarding this larger data set. One of my hypotheses is that any distinct patterns should be visible throughout and indeed this appears to be the case - those I counted as "detectable" were made up of 1983 articles from the BBC - a very similar proportion to those in the higher scoring category (38.9%).

- My methodology (to be detailed in a much longer post later on censoring.me) has massively favoured the media organisations, primarily in three respects:

i) for submitting the press releases to the churnalism engine, I did not edit the press releases to remove anything extraneous - so they often included repeated titles, contact details etc that would lower the percentage of hits in the churnalism engine.

ii) the original 'screen scrapes' of the website press releases included many characters that were difficult to work with programmatically. Single commas cause problems for database processing so these were often removed. Also the data regularly contained unicode - and not all of this would be correctly re-encoded when sent to the churnalism web server. This further reduced the percentage of hits.

iii) Finally - as I already mentioned I set the bar very high for what I counted as cases of definite "churnalism". I decided upon three categories of scoring: 1) "detectable" churn - with scores of 100 or more (in practice this would mean maybe a paragraph had been copied) 2) "significant" churn - with scores of 500 or more (in practice this means more than one paragraph had been copied) and finally 3) "major" churn - a score of 1000 or more meaning the majority, or substantive minority of either/both press release and final article had been copied.
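The three score bands above, and the kind of "share by outlet" arithmetic used in this post, can be sketched in a few lines of Python. This is only an illustration of the thresholds as stated (100, 500 and 1000); the names `churn_category` and `outlet_share` are hypothetical, and the sketch assigns each article to the highest band it reaches, which is my assumption rather than something specified above.

```python
# Score bands as described in the post, highest first.
THRESHOLDS = [("major", 1000), ("significant", 500), ("detectable", 100)]


def churn_category(score):
    """Return the highest band a score reaches, or None if below 100."""
    for name, minimum in THRESHOLDS:
        if score >= minimum:
            return name
    return None


def outlet_share(outlet_count, total):
    """Return one outlet's articles as a percentage of the total."""
    return 100.0 * outlet_count / total


if __name__ == "__main__":
    for score in (80, 100, 750, 1200):
        print(score, churn_category(score))
    # BBC articles among all "detectable" articles, using the figures above.
    print(outlet_share(1983, 5089))
```

Lowering the 500 threshold, as suggested below, would only require changing one number in `THRESHOLDS`.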

With these three aspects in mind, I consider it almost certain that when other people make use of my data (which I will make publicly available, along with a longer article detailing technical issues) they will find a noticeably higher proportion of churnalism. There would also be a case for lowering the significance score of 500. I also think there is probably a lot of useful information to be found in the 5089 articles with "detectable" churn; however, I won't be drawing any conclusions from this larger data set at present.

With those caveats and details in mind, here is the summary of the final results, calculated after any duplicates and outliers had been removed (click for larger image):






It should be noted that this data set alone is comparable in size to that used by the research that formed the basis of Nick Davies' 'Flat Earth News'. It took a team of several researchers many months to pore over the same number of press releases and articles and come to the conclusions he presents in the book. It took me a week to write the spiders, and I'll be doing this again and again for different organisations and likely revealing more hidden patterns in the data. And yes, I'll be reporting them here first and sharing the data publicly!

For those interested, I aim to give access to the data by next week on censoring.me, which I am revamping this weekend now that I finally have some substantial research for the site to showcase!

I submitted the most appalling cases of churn manually to the Churnalism site (a different process to submitting them in an automated fashion to the API). Doing this enables me to save individual examples in the Churnalism.com site database, and it yields a very useful side-by-side comparison so that one can see visually (and compare manually) the worst cases.

For your amusement, amongst my favourites were:

- "Glaciergate should not distract us from climate battle"
Here the chairman of the Environment Agency is asked to write a piece for the BBC. It repeats exactly the majority of a press release issued by the Environment Agency two months beforehand, which claims to quote the chairman by declaring what he is going to say at a forthcoming event. Yes, I had trouble getting my head around that too.

- "Llamas help protect an ice-age fish"
The infamous (and crazy) "Llamas protect fish from climate change" press release in fully churned glory.

- "Flood defence project gets small seal of approval"
An absolutely stunning 94% paste job by the BBC. Even the churnalism engine struggles to represent it visually - make sure you click through to the BBC article itself. You can see from eyeballing it and comparing it to the submitted press release that it is a straight cut and paste.



10 comments:

Anonymous said...

Good job. Bad luck with the timing. Not going to get the attention you deserve with ClimateGate2 breaking.

donwreford said...

Churning is good for us, After extensive research into my body and mind, I have decided that I am a machine, furthermore I come from the first Country that produced the machine on a scale as never before, as a result of this internalization psychically driven repetitious function within, that now becoming the driving force of my existence, the reader has already preempted where we are going, that one is no longer responsible nor guilty of any misnomers of being on this Planet, anymore than a machine is guilty for breakdown.
The sheer relief knowing you are free to be or do what ever you like, and no repercussions other than the status-quo judgement of what they think you should be.
Similar to Gods judgment ideology and programme, except one significant difference, God no longer exists in this situation, or at least has been relegated to becoming a back seat driver.

Daedalus X. Parrot said...

Excellent work Katabasis. Look forward to further developments from you.

Do you propose to analyse press releases from other lobby groups and politicised think tanks (of all colours)?

Re: Environment Agency, I notice from the leaked ClimateGate II emails on Delingpole's blog that civil servants in their parent organisation, DEFRA, are right in there amongst the other AGW crooks. Here's a quote from one taxpayer funded mandarin:

"I can’t overstate the HUGE amount of political interest in the project as a message that the Government can give on climate change to help them tell their story. They want the story to be a very strong one and don’t want to be made to look foolish."

Anonymous said...

Ah... The Environment Agency losing a press release? 'gor blimey.

The chaps over at Avoncliff Mill caught them tinkering with dishonest notices on their web site about their statutory duty earlier this year.

The water permitting team there have been up to no good - to put it very mildly...

That the EA's press releases are uncritically churned like this is really adding insult to injury.

Katabasis said...

@Anonymous 12:15:

- Many thanks. I was mildly annoyed at first that my timing was so bad, however I think it may actually be a good thing in the medium term. The "Climategate 2" issue will probably hit a lag in another week or two.

This work will probably provide a much needed boost if it is cited in a week or two. There's also more on the way (in fact I've just posted the results of another churn exercise).

@donwreford

- I think we might be talking about completely different types of churn.

Katabasis said...

@Daedalus

Yes I do - I did originally have a list of organisations to work through giving them the 'churn-spider' treatment, however since Climategate 2 has broken it makes sense to temporarily redirect my time to organisations that are likely to exhibit similar churning patterns. It could certainly be one of those extra bullets that helps turn the tide of the debate back to rationality in any case....

DEFRA are next on my list now I've done 'Frack Off' (who I will be revisiting). I've already done an initial pass with my link-spider that has gathered all of their press release links. I'll start processing them this weekend or beginning of next...

@Anonymous 1:04

That's an interesting read. It's good to make these kinds of links. I have almost all of their press releases in text format - would you like me to run a keyword scan to see if I can find anything related to the Avoncliff Mill case?

You may also like to know that they're not above deleting press releases. In between the first scan by my bots (to grab the links) and the second to screen scrape every single press release, they deleted 10 press releases. I only managed to recover 4. I should go back to those deletions at some point to see if there is any reason behind it....

Gordon the Fence Post Tortoise said...

It's anon@0104 from yesterday...

We've got a real problem with the water permit teams at the EA and they are backed up by their 150+ strong PR teams and yes - there are that many...

Unfortunately we can't write up the entire story as we are in Judicial Review and walking on eggshells.

The illegally awarded licence will be quashed at the High Court to terminate the Judicial Review but that's a tactic - to avoid having to go in front of a judge to explain the rest of their toxic doings. They've delayed a viable (without subsidy!) hydro power scheme for almost two years already.

What is abundantly clear to us is that the EA has acted wholly illegally in relation to our water licences and show every indication that they are going to persist in their wilful bad behaviour and indulge in arbitrary diktat, dissembling and so on.

Some officials placed a bunch of public notices on EA web sites at the end of 2011 which were wholly false - somebody at the EA eventually figured out that the notices looked very dodgy to anybody who could read a calendar and took them down. They ignored our complaints about the web content.

I doubt there's anything about Avoncliff Mill as they are extremely keen to keep it out of the limelight. They contrived to get a local paper article spiked a couple of months back; we understand they pressured the editor. The article was laid up for half the front page with two photos and literally at the last moment ended up mid page single column on the fold on page 7 and about 2-1/2" high between the double glazing ads....

Gordon the Fence Post Tortoise said...

typo = oops EA false web content was up for 6 months Nov 2010 till end March 2011.

Content persisted on their "mirrored" Welsh Environment Agency site for another couple of months until somebody noticed it....

Sres said...

I knew there was a good reason I followed you on twitter many moons ago.

Keep up the good work, it's interesting and highlights what a complete and utter joke the BBC are as well as how lazy journos have become since the advent of the internet.

Inside the Environment Agency said...

No surprises there.