(UPDATE 27/11/11 : raw churnalism data now available HERE (JSON format, zipped, 2.1MB))
My work on bot-writing and churnalism has finally started to bear fruit. It's not often I write a post here that is an "exclusive", but this is one of the few!
A large part of my research over the last year has focused on the nature of digital and virtual technologies, and in particular on the nature of censorship and propaganda online. A perennial concern, as outlined in my Privacy 2.0 post, is the arrival of 'mass dataveillance', where the monitoring of an individual is far less important - in both import and consequences - than the gathering of masses of data about lots of individuals. This is because those masses of data are able to reveal patterns and links that the individual members of the data set are very likely unaware of.
However, this is something that works for us as much as against us if we're willing to use the freely available data in the public domain. It's not only the large corporations one is suspicious of that can gather, analyse and deploy masses of data.
Nick Davies, in his excellent book 'Flat Earth News', popularised the term "churnalism". His book, and the research it is based on, was an absolutely damning indictment of the UK media, with similar implications for Western media as a whole. Davies said journalists “...are no longer gathering news but are reduced instead to passive processors of whatever material comes their way, churning out stories, whether real event or PR artifice, important or trivial, true or false”.
His book was substantially based on research he commissioned from the University of Cardiff. Amongst other things, what that research found "suggests that 60% of press articles and 34% of broadcast stories come wholly or mainly from one of these ‘pre-packaged’ sources." This is a phenomenon that many of my regular readers will no doubt be familiar with, and one that has sadly become commonplace in other areas too: we've been moving ineluctably from just 'journalism by press release' to 'science by press release' as well.
Upon reading this research over a year ago, one of my first thoughts, having had years of programming experience, was 'that could be automated!'. And not only could it be automated, but with enough data, direct patterns of bias and influence could also be detected. Lo and behold, around that time, the wonderful churnalism engine appeared.
This site plugs into the Journalisted database, where every single press item is archived online. The churnalism engine uses an algorithm that lets anyone copy and paste text from any source (most usually a press release) and compare it quickly against the entire Journalisted database. It then reports back any articles that are likely to have been cut and pasted, along with an estimate of the proportions copied. This means one can effectively trace the provenance of many news stories back to press releases, and also assess how much has been copied directly into the article - articles that we are so often led to believe uphold high journalistic standards.
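To give a feel for how this kind of comparison can work, here is a minimal sketch in Python. It is not the churnalism engine's actual algorithm, which I have not seen; it simply compares overlapping 15-character windows (the chunk size the churnalism API's scoring is based on) and reports what fraction of each text is shared. The function names `chunks` and `overlap` and the whitespace and lower-casing normalisation are my own assumptions.

```python
def chunks(text, size=15):
    """Return the set of all overlapping `size`-character windows in `text`."""
    # Assumption: lower-case and collapse whitespace so that trivial
    # reformatting does not hide an otherwise identical passage.
    normalised = " ".join(text.lower().split())
    return {normalised[i:i + size] for i in range(len(normalised) - size + 1)}


def overlap(press_release, article, size=15):
    """Return (share of the release found in the article,
    share of the article found in the release), each between 0.0 and 1.0."""
    release_chunks = chunks(press_release, size)
    article_chunks = chunks(article, size)
    shared = release_chunks & article_chunks
    cut = len(shared) / len(release_chunks) if release_chunks else 0.0
    pasted = len(shared) / len(article_chunks) if article_chunks else 0.0
    return cut, pasted
```

Reporting the two fractions separately matters: a short article can consist almost entirely of copied text while using only a small part of a long press release, and the reverse can also happen.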
I decided that it should be possible to combine the churnalism utility with masses of data in a way that would not have been possible even a decade ago. The ability to write programs ('spiders' or 'bots' in this context) that can gather all of the press releases from a single organisation and then submit them to the churnalism facility, combined with cheap and readily available computing power, means this is an entirely achievable goal.
So I programmed spiders that were able to gather every single one of the Environment Agency's press releases, filter out the formatting tags (from the web page) to get the original text, submit them to the churnalism engine and then store and analyse the results.
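The tag-filtering step is the easiest part to illustrate. The following is a rough sketch of that one step only, using Python's standard `html.parser` module; it is not the code my spiders actually ran, and the class and function names (`TextExtractor`, `strip_tags`) are made up for this example.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect the text of a page, ignoring <script> and <style> contents."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self._parts.append(data)

    def text(self):
        # Join the pieces with spaces, then collapse the whitespace left
        # behind by the removed markup. Caveat: a tag in the middle of a
        # word (or just before punctuation) will introduce a stray space.
        return " ".join(" ".join(self._parts).split())


def strip_tags(html):
    """Return the visible text of an HTML fragment as a single line."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()
```

For example, `strip_tags("<p>Rivers are <b>rising</b></p>")` gives `"Rivers are rising"`. A real spider would also need to pick out just the press release body from the rest of the page, which this sketch does not attempt.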
The data shows several clear patterns:
- The so-called "quality press" are the worst offenders for churning Environment Agency press releases - whilst there were many entries from the tabloids and local papers, their cutting and pasting was less egregious than that of the "quality press".
- The BBC is by far and away the worst offender for simply repeating whatever the Environment Agency claimed in its press releases. Out of the 393 articles where "significant" churn had taken place, the BBC was responsible for 44%. Likewise for the 49 articles that had "major" churn (meaning in most cases they were almost complete cut and pastes of the press releases), the BBC was responsible for 30.6%.
- I was able to grab a total of 1962 press releases from the Environment Agency, giving almost complete coverage of their press release output for the last two years. This total was originally slightly higher, however since the first pass of my bots (to gather the links before downloading with the second pass), the EA has inexplicably removed 10 press releases. I also found a handful of duplicates. The lowest level of granularity I was willing to accept as "detectable" churn (see below) yielded 5089 articles.
- I had set the bar very high for counting articles as "churnalism". The churnalism API uses a "scoring" system that identifies how many 15-character chunks had been copied and/or pasted. It represents a compromise between the lengths of the source and end articles - so, for example, if a large proportion of the original press release has been copied, but this represents a lower proportion of the end article, this will still yield a high score. And vice versa (Louise Gray's articles in the Telegraph often followed this latter pattern for example - with her article being much shorter than the original press release it is copied from).
- Setting the filter on the data I gathered to a score of >=100 yielded 5089 articles. I regard this as an acceptable baseline for "detectable" churn, though for current purposes I am discarding this larger data set. One of my hypotheses is that any distinct patterns should be visible throughout and indeed this appears to be the case - those I counted as "detectable" were made up of 1983 articles from the BBC - a very similar proportion to those in the higher scoring category (38.9%).
- My methodology (to be detailed in a much longer post later on censoring.me) has massively favoured the media organisations, primarily in three respects:
i) for submitting the press releases to the churnalism engine, I did not edit the press releases to remove anything extraneous - so they often included repeated titles, contact details etc that would lower the percentage of hits in the churnalism engine.
ii) the original 'screen scrapes' of the website press releases included many characters that were difficult to work with programmatically. Single commas cause problems for database processing so these were often removed. Also the data regularly contained unicode - and not all of this would be correctly re-encoded when sent to the churnalism web server. This further reduced the percentage of hits.
iii) Finally - as I already mentioned I set the bar very high for what I counted as cases of definite "churnalism". I decided upon three categories of scoring: 1) "detectable" churn - with scores of 100 or more (in practice this would mean maybe a paragraph had been copied) 2) "significant" churn - with scores of 500 or more (in practice this means more than one paragraph had been copied) and finally 3) "major" churn - a score of 1000 or more meaning the majority, or substantive minority of either/both press release and final article had been copied.
With these three aspects in mind, I consider it almost certain that when other people make use of my data (which I will make publicly available, along with a longer article detailing technical issues) they will find a noticeably higher proportion of churnalism. There would also be a case for lowering the significance score of 500. I also think there is probably a lot of useful information to be found in the 5089 articles with "detectable" churn, however I won't be drawing any conclusions from this larger data set at present.
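For anyone wanting to try different cut-offs on the data, the banding described above is simple to express in code. This is only a sketch: the thresholds (100, 500, 1000) are the ones given above, but the record layout of `(outlet, score)` pairs and the function names are assumptions made for this example, not the format of my data files.

```python
from collections import Counter

# Minimum score for each category, as described in the post.
THRESHOLDS = {"detectable": 100, "significant": 500, "major": 1000}


def highest_category(score):
    """Return the highest category a score reaches, or None if below 100."""
    best = None
    for label, minimum in sorted(THRESHOLDS.items(), key=lambda item: item[1]):
        if score >= minimum:
            best = label
    return best


def outlet_shares(records, category):
    """Percentage of articles per outlet among those scoring at or above
    the given category's threshold. `records` is a list of (outlet, score)
    pairs (a hypothetical layout)."""
    minimum = THRESHOLDS[category]
    counts = Counter(outlet for outlet, score in records if score >= minimum)
    total = sum(counts.values())
    return {outlet: 100.0 * n / total for outlet, n in counts.items()} if total else {}
```

Lowering the "significant" threshold, as suggested above, is then just a matter of changing one number in `THRESHOLDS` and re-running the shares.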
With those caveats and details in mind, here is the summary of the final results, calculated after any duplicates and outliers had been removed (click for larger image):
It should be noted that this data set alone is comparable in size to that used by the research that formed the basis of Nick Davies' 'Flat Earth News'. It took a team of several researchers many months to pore over the same number of press releases and articles and come to the conclusions he presents in the book. It took me a week to write the spiders, and I'll be doing this again and again for different organisations and likely revealing more hidden patterns in the data. And yes, I'll be reporting them here first and sharing the data publicly!
For those interested, I aim to give access to the data on censoring.me by next week. I am revamping the site this weekend, as I finally have some substantial research to showcase on it!
I submitted the most appalling cases of churn manually to the Churnalism site (a different process to submitting them in an automated fashion to the API). Doing this lets me save individual examples in the Churnalism.com site database, and it yields a very useful side-by-side comparison, so the worst cases can be seen visually and compared by hand.
For your amusement, amongst my favourites were:
- "Glaciergate should not distract us from climate battle"
Here the chairman of the Environment Agency is asked to write a piece for the BBC. It repeats exactly the majority of a press release issued by the Environment Agency two months beforehand, which claims to quote the chairman by declaring what he is going to say at a forthcoming event. Yes, I had trouble getting my head around that too.
- "Llamas help protect an ice-age fish"
The infamous (and crazy) "llamas protect fish from climate change" press release, in fully churned glory.
- "Flood defence project gets small seal of approval"
An absolutely stunning 94% paste job by the BBC. Even the churnalism engine struggles to represent it visually - make sure you click through to the BBC article itself. You can see from eyeballing it and comparing it to the submitted press release that it is a straight cut and paste.