Andrew Trench notes from the revolution

2 Apr 2013

And now for something completely different… sensor journalism

I came across a couple of articles which left me twitching with excitement over some new experiments which combine data journalism with the world of physical computing.

I'm not sure if it would be classified as a "field" of journalism yet, but its practitioners are calling it "sensor journalism" and it essentially boils down to creating original data for journalism using sensors and other physical computing technologies.

26 Mar 2013

Machine-learning – the final newsroom frontier?

I was fortunate enough to attend the recent NICAR 2013 conference in Louisville, Kentucky (thanks to a generous bursary from Wits University), and my eyes were opened to some cutting-edge technologies being deployed in some US newsrooms.

While news proprietors in South Africa - along with proprietors elsewhere in the world - grapple with their business models and declining audiences, others are turning to frontline technologies to bring new methods of news-generation into their operations.

11 Feb 2013

Sentiment analysis as a social barometer for journalists

I spent some time over the weekend fiddling with some code and a sentiment analysis API and looking at the raging #StopRape discussion on Twitter.

Some months ago we tried an experiment using sentiment analysis around The Spear artwork controversy which I detailed here. This time round I thought I would turn the sentiment analysis project into an interactive visualisation using some of my work-in-progress "skills" with D3, the amazing JavaScript library by Mike Bostock.

It took a bit of learning to get it into shape, but this is what I did:

1. I wrote a Python script to grab all the tweets from Twitter's search API which had the #StopRape hashtag.

Lesson 1: I learned that when you do this on a trending hashtag you will also suck in an enormous amount of "hashtag spam" from the datastream. In fact, almost a third of all the tweets I first grabbed were spam which, it appears, is cleverly filtered out when you view a hashtagged timeline using Twitter's website or some of the third-party clients.

21 Jan 2013

Analysing the flow of social conversations with Python and D3

There's a great book by Matthew Russell called Mining the Social Web which I have been reading for several months now. The book hits the spot for me on multiple fronts because I am interested in network analysis, programming and news, which are at the core of Russell's book.

I'm fascinated with trying to gain insight into how social networks (in their broadest sense) operate, as this is something that occupies a large part of our time as an investigative unit. Gaining access to networks of power or understanding the linkages between key players is often at the heart of breaking open a major story, or at least understanding who you need to be looking at.

I've been hacking out various Python scripts over the years which allow me to rapidly trawl through public information sources and to attempt some programmatic mapping of relationships and connections using brilliant Python libraries like networkx, which can produce various graphs and apply useful algorithms like degree centrality, betweenness and so on.
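To make "degree centrality" a little more concrete, here is a minimal sketch in plain Python of what the measure computes: each node's number of direct connections divided by the maximum it could have. The names in the edge list are invented purely for illustration, and the sketch assumes a simple undirected graph with no self-loops; networkx offers the same measure as `networkx.degree_centrality`.

```python
from collections import defaultdict


def degree_centrality(edges):
    """Degree centrality for an undirected graph given as (a, b) pairs.

    A node's score is its number of distinct neighbours divided by (n - 1),
    where n is the number of nodes, so a node linked to everyone scores 1.0.
    """
    neighbours = defaultdict(set)
    for a, b in edges:
        neighbours[a].add(b)
        neighbours[b].add(a)
    n = len(neighbours)
    if n <= 1:
        return {node: 0.0 for node in neighbours}
    return {node: len(adj) / (n - 1) for node, adj in neighbours.items()}


# Hypothetical relationships, invented for this example only.
edges = [
    ("minister", "tycoon"),
    ("minister", "fixer"),
    ("minister", "lawyer"),
    ("fixer", "lawyer"),
]
scores = degree_centrality(edges)

# Print the best-connected players first.
for name, score in sorted(scores.items(), key=lambda item: -item[1]):
    print(f"{name}: {score:.2f}")
```

The person with the highest score is the one directly connected to the most other players, which is often a reasonable first place to look.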

For months I have also dabbled in D3, the powerful JavaScript library by Mike Bostock (now of the New York Times team) but, boy, has it been tough! I find JavaScript challenging compared to the easy elegance of Python. It's like trying to write in Latin if you only know English! But a couple of weeks back I put my head down after fooling around with one of Russell's Python scripts designed to mine Twitter's search API and build a relatively interactive visualisation using the Protovis JavaScript library.

It worked pretty smoothly, but Protovis had some severe limitations in terms of functionality and was a major hurdle for me to get over since my JavaScript is so poor.

Instead I decided to try and remake the visualisation using D3 which has more scope for interactivity that I can see - and is also significantly better documented. After two weeks of missioning I finally worked it out and you can see the results over here (click image below) where my wife, The Grubstreet Gal, has used it to represent a Twitter debate over the National Press Club's controversial decision to name the rhino as newsmaker of the year.

I'm really pleased at how it worked out.

Basically what happens here is this:

1. My Python script (with major acknowledgements to Russell's open source original) checks in with the Twitter search API using an entered search term or hashtag;

2. It downloads about 600 messages in a couple of minutes (depending on the popularity of the discussion) and then parses each message looking for the "RT" or "via" phrases using some clever regular expressions;

3. Once it finds a tweet that has been retweeted, it analyses it to identify the source of the tweet and attempts to extract that information (this is not 100% accurate, as you will see in the visualisation, but I would say it is around 98% good, because there can be misspellings in retweets or shortenings of words and so on);

4. As it locates these entities the relationships, or edges, between them are recorded in a networkx DiGraph (directed graph) which starts to give the network of the conversation some shape;

5. I also included code to capture the original tweet and save it as an attribute of each node so that on mouseover on the links between two parties in the interactive you can see what the original tweet was. This was one of the most difficult things for me to work out how to do and special thanks to www.d3noob.org and his excellent D3 guide for showing me the way!

6. Next the DiGraph is output as a json file which can then be imported as a data source to D3 which uses Bostock's genius to represent the network.
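The steps above can be sketched roughly as follows. This is not my actual script: it is a stripped-down illustration using only the Python standard library, with a plain dict standing in for the networkx DiGraph. The regular expression, the function names and the sample tweets are all assumptions made for the example, and the index-based "nodes"/"links" layout is an assumption about the shape of data a D3 force layout will accept.

```python
import json
import re

# Matches "RT @user" or "via @user". An assumed pattern for illustration,
# not the exact expression used in Russell's script or mine.
RT_PATTERN = re.compile(r"\b(?:RT|via)\s*@(\w+)", re.IGNORECASE)


def retweet_sources(text):
    """Return the usernames a tweet credits with 'RT @name' or 'via @name'."""
    return RT_PATTERN.findall(text)


def build_graph(tweets):
    """Build a node-link dict from (author, text) pairs.

    Each link runs from the credited source to the retweeter and carries
    the tweet text, so it can be shown on mouseover in the visualisation.
    """
    nodes = []
    index = {}  # lower-cased username -> position in the nodes list
    links = []

    def node_id(name):
        key = name.lower()
        if key not in index:
            index[key] = len(nodes)
            nodes.append({"name": name})
        return index[key]

    for author, text in tweets:
        for source in retweet_sources(text):
            links.append({
                "source": node_id(source),
                "target": node_id(author),
                "tweet": text,
            })
    return {"nodes": nodes, "links": links}


# Made-up tweets, for illustration only.
sample_tweets = [
    ("alice", "RT @bob: Rhino named newsmaker of the year"),
    ("carol", "Surprising choice (via @bob)"),
    ("dave", "No retweet marker here"),
]
graph = build_graph(sample_tweets)
graph_json = json.dumps(graph, indent=2)
print(graph_json)
```

With a real networkx DiGraph, the helper `networkx.readwrite.json_graph.node_link_data` produces a comparable node-link structure ready for `json.dumps`.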

You will see that it can take a little while for the interactive to settle down thanks to the complex physics involved in calculating the relationships and positions between the nodes.

This was a major learning curve for me and I can't wait to tackle a few more of these. I'd like to apply some of this to natural language text processing, for example, to experiment with programmatically trying to visualise linkages between people in a corpus of text like a news archive or something. It would also be very handy for producing illustrations of power elites and so on, on the fly.

Would love to hear your views and happy to share my (very hacky) code with anyone who is interested.

9 Nov 2012

SA’s Internet mystery: where are the missing millions? [map, data]

South Africa's internet numbers don't add up.

Last week I was fortunate to get access to 2011 Census data in Pretoria for the release of the survey. While giving myself a repetitive strain injury pulling as much data as possible from the survey (not all the data is widely available yet), one of the things I did was run a query looking at internet access in South Africa.

This is the first time this question has been asked in the Census and the results are quite startling because they appear to be in stark contrast with some of the accepted measures of the size of the Internet audience in South Africa. (Click around map below for council level data)

According to Census 2011:

* There are 5,231,629 households with Internet access in South Africa;

* Of these, 2,434,236 accessed the Net via their cellphones;

* Some 1,261,368 accessed the Net from home;

* Some 694,117 households had Internet access from work;

Well, households don't tell us too much about total Internet users.

But since I have municipal level data I also pulled total populations for each municipality and then calculated average household size for each municipality and then used that figure to calculate how many people in total in each municipality might have internet access.

The resulting figure suggests that the internet population in South Africa, according to Census figures, is about 17.4 million - which is significantly higher than other estimations.
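For anyone who wants to check the method, here is a minimal Python sketch of the per-municipality calculation, using the City of Tshwane figures from the ranking table in this post. The function and variable names are my own inventions for the example; the method is simply households with access multiplied by average household size.

```python
def estimate_internet_population(households_with_access, total_households,
                                 total_population):
    """Estimate people with internet access in one municipality.

    Assumes every household, online or not, is the average size for the
    municipality, which is the simplification used in this post.
    """
    avg_household_size = total_population / total_households
    people_with_access = households_with_access * avg_household_size
    penetration_pct = 100 * people_with_access / total_population
    return avg_household_size, people_with_access, penetration_pct


# City of Tshwane row from the ranking table.
avg_size, people, pct = estimate_internet_population(
    households_with_access=483742,
    total_households=956995,
    total_population=2921488,
)
print(f"average household size: {avg_size:.2f}")
print(f"estimated people with access: {people:,.0f}")
print(f"penetration: {pct:.0f}%")
```

This reproduces the table's 1,476,754 people and 51% for Tshwane. Note that with this method the penetration percentage works out the same as the share of households with access, because the average household size cancels out.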

Some other figures:
* Google Public Data (citing World Bank as a source) puts internet penetration at around 21% as of 2011, or about 10.8m using our new 51,7m population estimate (see chart below);
* An authoritative report by Arthur Goldstuck's World Wide Worx and the howzit MSN online portal recently put the number at 8,5m (http://www.timeslive.co.za/local/2012/05/10/number-of-south-african-internet-users-grows), and;
* Internet World Stats says we have 6,8m Internet users and 4,9m of them are on Facebook (http://www.internetworldstats.com/africa.htm)

So it's easy to see that there is a yawning chasm between the internet penetration suggested by the Census and other specialist research.

Of course, my number could be overstated as I have applied the average size of household calculation to the number of households with internet access in each municipal area which doesn't take into account things like babies and kids who don't access the Net and so on.

My rudimentary skills also don't allow me to calculate an alternate possible weighting for households based on income, urban location and so on so please feel free to provide an alternate analysis.

But even taking that into account it would seem, if we accept the accuracy of our Census data, that SA's internet population is massively larger than we may think. StatsSA's own calculation based on the 2011 Census suggests that 35.2% of households have some access to the internet. There are 15,6m households in SA with an average of 3.4 people per household, so do the math.
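Doing that math explicitly, as a quick Python sketch (the variable names are mine, and the inputs are just the three rounded figures quoted above):

```python
# Rounded national figures quoted in the paragraph above.
total_households = 15_600_000
share_with_access = 0.352      # 35.2% of households
avg_household_size = 3.4

households_online = total_households * share_with_access
people_online = households_online * avg_household_size

print(f"households with access: {households_online:,.0f}")
print(f"implied people with access: {people_online:,.0f}")
```

That comes to roughly 5.5 million households and about 18.7 million people, which is in the same ballpark as my municipality-level estimate and carries the same caveat about assuming average-sized households.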

If these figures are even vaguely right it would mean that online audiences left print behind some time ago.

This raises some thoughts from a media point of view.
1. If this figure approximates the true size of the internet audience in SA does it account for the rapidly accelerating decline of print newspaper circulation, particularly of dailies?
2. Why is advertising spend moving so slowly to the web (still only around 2.7% of total adspend in SA)?
3. What does an audience of this assumed size offer new and emerging publishers as well as established media players?

MY TOP 10 RANKING OF INTERNET PENETRATION BY MUNICIPALITY

| Name | Households with internet access (total) | Households with no internet access | Total households | Total population | Average household size | People with access (based on average household size) | Internet penetration (% of total population) |
|---|---|---|---|---|---|---|---|
| City of Tshwane | 483742 | 466326 | 956995 | 2921488 | 3 | 1476754 | 51 |
| City of Cape Town | 538551 | 569194 | 1114367 | 3740026 | 3 | 1807479 | 48 |
| City of Johannesburg | 743440 | 795447 | 1550475 | 4434827 | 3 | 2126463 | 48 |
| uMhlathuze | 42650 | 48280 | 91592 | 334459 | 4 | 155742 | 47 |
| Stellenbosch | 20134 | 24459 | 44937 | 155733 | 3 | 69776 | 45 |
| Ekurhuleni | 449130 | 624808 | 1080842 | 3178470 | 3 | 1320772 | 42 |
| Emfuleni | 93127 | 133074 | 227141 | 721663 | 3 | 295879 | 41 |
| Ethekwini | 407064 | 586670 | 1001885 | 3442361 | 3 | 1398625 | 41 |
| Metsimaholo | 19009 | 27677 | 46982 | 149108 | 3 | 60329 | 40 |
| Tlokwe City Council | 21496 | 32852 | 54650 | 162762 | 3 | 64021 | 39 |

Google on SA Internet Access
