Friday, April 26, 2013

Using DataThief to rebuild misleading figures



Have you ever looked at a hard-to-read graph and wished there was a way to figure out the precise values of the data? Or maybe you wanted to extract the data so that you could do your own analysis (or at least produce a clearer graph)? You’re in luck!

DataThief (http://www.datathief.org/) is a program that lets you take an image of a graph or chart and extract the underlying values. To show how useful this can be, I’m starting with a misleading graph I found (http://heavylifting.blogspot.com/2009/08/another-bad-graph.html) and recreating it to be more informative and honest. Here's the misleading graph:

By having an absolute value on one y-axis, and a percentage on the other y-axis, this graph creates the false impression that unemployment and lack of insurance are both sharply increasing (and that the rate of unemployment has surpassed the rate of lacking insurance). Let’s see if we can do better by producing a more meaningful graph.

Begin by opening DataThief and importing a screen capture of the graph. Put the three circles with an X through them on the origin (blue), the top of the y-axis (red), and the right end of the x-axis (green). Now enter coordinate values for each point as shown below. For the x-axis, I decided on months as units, with March 2007 being “0” and January 2009 being “22”.

Finally, pick one of the lines, and put the remaining three circles with a + through them on the beginning (green), end (red), and anywhere in between (blue) on the line you want to extract. If you don't have those three circles, hit the button at the top right that shows a solid line graph (shown in dark gray below). If you are working with a bar chart, skip to the end of this post. Note the color you have defined to trace (to the left of the start / end / color buttons); for lines that aren't an entirely uniform color (this one had a bit of shading), try to pick the most representative color within the line. It should look like this (note the location of the 6 reference points below):


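For the curious, the three axis reference points are enough to define a mapping from pixel positions in the image to data values. Here is a minimal Python sketch of one way such a mapping could work. This is my own illustration, not DataThief's actual code: the function name and the pixel coordinates are made up, and only the data values (months 0 to 22, and 4 to 8 on the y-axis) come from the setup above.

```python
def make_pixel_to_data(origin, x_ref, y_ref):
    """Build a converter from pixel coordinates to data coordinates.

    Each argument is ((pixel_x, pixel_y), (data_x, data_y)) for one of the
    three axis reference points. Hypothetical helper, not part of DataThief.
    """
    (ox, oy), (dx0, dy0) = origin
    (xx, xy), (dx1, dy1) = x_ref
    (yx, yy), (dx2, dy2) = y_ref

    # Pixel-space vectors along the x-axis and the y-axis.
    ux, uy = xx - ox, xy - oy
    vx, vy = yx - ox, yy - oy
    det = ux * vy - uy * vx
    if det == 0:
        raise ValueError("reference points must not lie on one line")

    def convert(px, py):
        rx, ry = px - ox, py - oy
        # Solve (rx, ry) = a * u + b * v for the fractions a and b.
        a = (rx * vy - ry * vx) / det
        b = (ux * ry - uy * rx) / det
        x = dx0 + a * (dx1 - dx0) + b * (dx2 - dx0)
        y = dy0 + a * (dy1 - dy0) + b * (dy2 - dy0)
        return x, y

    return convert


# Made-up pixel positions; image y grows downward, as in most screenshots.
convert = make_pixel_to_data(
    origin=((100, 400), (0, 4)),   # blue: origin
    x_ref=((540, 400), (22, 4)),   # green: right end of the x-axis
    y_ref=((100, 80), (0, 8)),     # red: top of the y-axis
)
print(convert(320, 240))  # the middle of the plot area
```

Because the mapping uses all three points, it still works if the screenshot is slightly rotated or skewed, which is presumably why the program asks for three points rather than two.
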
After you hit the trace button (the one with three points in black, yellow, and red), the software will try to trace the line, but it has a hard time with thick lines like this. I switched to point mode (the button showing 4 points on a graph), then tweaked the settings via the settings tab right above the graph to get it to work properly. I reduced how many points it extracts (by switching "all points" to "output distance" with a value of 1 "on the traced path"), and made a few manual corrections to individual points that weren't quite right, after which I had this:



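I don't know exactly how DataThief implements the "output distance" setting, but the general idea of thinning a dense trace can be sketched in a few lines of Python. This is an assumption-laden illustration: the function name is mine, and the rule (keep a point each time the distance walked along the path reaches the chosen spacing) is just one plausible reading of the setting.

```python
import math


def thin_path(points, spacing):
    """Keep the first point, then one point each time the distance
    travelled along the path since the last kept point reaches `spacing`.

    Hypothetical helper for illustration; not DataThief's algorithm.
    """
    if not points:
        return []
    kept = [points[0]]
    travelled = 0.0
    previous = points[0]
    for point in points[1:]:
        travelled += math.dist(previous, point)
        previous = point
        if travelled >= spacing:
            kept.append(point)
            travelled = 0.0
    return kept


# A made-up dense trace along a flat stretch of the line.
dense = [(0, 0), (0.25, 0), (0.5, 0), (0.75, 0), (1, 0), (1.5, 0), (2, 0)]
print(thin_path(dense, spacing=1))
```

With a spacing of 1 and months as the x-axis unit, this kind of thinning leaves roughly one point per month, which matches what I was after here.
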
It looked to me like I had a point on each of the actual data points on the graph, but if not, it's easy to keep tweaking them and add, move, or remove points to fully capture the source data. From there we can export the traced data as a text file, and repeat with the second data series. Note that we need to replace the values for the y-axis (in the upper left of DataThief) before tracing the second line, since the second series uses different units (replace 8 with 50, and 4 with 44).
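
If you forget to change the y-axis values before tracing the second line, you don't necessarily have to retrace it: the exported numbers can be rescaled afterwards, since both axes are linear. A small Python sketch of that arithmetic follows; the function name is hypothetical, and the only values taken from this graph are the axis limits (4 and 8 on one axis, 44 and 50 on the other).

```python
def rescale(value, old_low, old_high, new_low, new_high):
    """Linearly map a value read against one axis scale onto another.

    Illustrative helper; assumes both axes are linear (not logarithmic).
    """
    fraction = (value - old_low) / (old_high - old_low)
    return new_low + fraction * (new_high - new_low)


# A point traced as "6" against the 4-to-8 axis sits halfway up the plot,
# so on the 44-to-50 axis it corresponds to 47.
print(rescale(6, old_low=4, old_high=8, new_low=44, new_high=50))
```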

In Excel I multiplied the values for “Uninsured Americans” by a million to get the true number (the graph reports them in millions of Americans). I then got some estimates of the US population for January 2008 and January 2009 (http://www.usnews.com/opinion/articles/2008/12/31/us-population-2009-305-million-and-counting), and used those to calculate an average growth per month, the projected baseline population in March 2007, and the projected population for each of our data points. This let me calculate the percentage of Americans who are uninsured, which can be compared directly to the percentage of Americans who are unemployed.
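
I did this step in Excel, but the same calculation can be sketched in Python. Treat this as an illustration under stated assumptions: the function names are mine, population growth is assumed to be linear, and the population and uninsured figures in the example are round placeholders rather than the numbers I actually used.

```python
MONTHS_MAR_2007_TO_JAN_2008 = 10  # month 0 is March 2007
MONTHS_JAN_2008_TO_JAN_2009 = 12


def population_model(pop_jan_2008, pop_jan_2009):
    """Return a function giving the projected population for a month index,
    where month 0 is March 2007 and month 22 is January 2009.

    Assumes linear growth between the two estimates (hypothetical helper).
    """
    growth_per_month = (pop_jan_2009 - pop_jan_2008) / MONTHS_JAN_2008_TO_JAN_2009
    baseline = pop_jan_2008 - MONTHS_MAR_2007_TO_JAN_2008 * growth_per_month
    return lambda month: baseline + month * growth_per_month


def percent_uninsured(uninsured_millions, month, population_at):
    """Convert a traced value in millions of people into a percentage."""
    return 100 * uninsured_millions * 1_000_000 / population_at(month)


# Placeholder inputs for illustration only.
population_at = population_model(pop_jan_2008=300_000_000, pop_jan_2009=303_000_000)
print(population_at(0))                          # projected March 2007 baseline
print(percent_uninsured(45, 10, population_at))  # 45 million uninsured in Jan 2008
```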

A graph of the resulting data reveals a different pattern than what we saw before: lack of insurance is increasing very slightly (from ~15.2% to ~16.1%) as unemployment increases more rapidly (from ~4.4% to ~7.6%). Note that the percentage of people who are unemployed never surpasses the percentage of people who are uninsured (contrary to how the original graph made it appear):

There are two important considerations before using this software. First, these values will only be approximate, so if possible it’s always better to get the underlying data from the person who created the first figure. Second, it is possible that the data you are extracting is copyrighted, and that your reuse of their data may violate the data license. Use at your own risk! A third potential problem is that you may find yourself sucked into "fixing" misleading graphs you find on the internet, which is a task you will never complete.

Note that despite the name, DataThief is shareware; if you find it useful, please put your thieving on hold long enough to buy a $25 license.

NOTE: If you're working with a bar chart or another figure where tracing a line isn't necessary, start by setting your three axis points and entering the value range for x and y: update Ref 0 with 0,[high range of the bar chart], set Ref 1 to 0,0, and set Ref 2 to [any value],0. Then simply drag one of the circles with a + in it to the end of each bar and record the value displayed as you drag it. I find this quick and easy, especially for a small chart with just a few bars.