Have you ever looked at a hard-to-read graph and wished that
there was a way to figure out what the precise values of the data were? Or
maybe you wanted to extract the data so that you could do your own analysis (or
at least produce a clearer graph)? You’re in luck!
DataThief (http://www.datathief.org/) is a program that lets
you take an image of a graph or chart and extract the underlying values. To
show how useful this can be, I’m starting with a misleading graph I found (http://heavylifting.blogspot.com/2009/08/another-bad-graph.html)
and recreating it to be more informative and honest. Here's the misleading graph:
By having an absolute value on one y-axis, and a percentage
on the other y-axis, this graph creates the false impression that unemployment
and lack of insurance are both sharply increasing (and that the rate of unemployment
has surpassed the rate of lacking insurance). Let’s see if we can do better by
producing a more meaningful graph.
Begin by opening DataThief and importing a screen capture of
the graph. Put the three circles with an X through them on the origin (blue),
top of the y-axis (red), and right of the x-axis (green). Now enter coordinate
values for each point as shown below. For the x-axis, I decided on months as
units, with March 07 being “0” and Jan 09 being “22”. Finally, pick one of the
lines and put the remaining three circles (the ones with a + through them) at the beginning
(green), the end (red), and anywhere in between (blue) on the line you want to
extract. If you don't see those three circles, click the button at the top right that shows a solid line graph (shown in dark gray below). If you are working with a bar chart, skip to the note at the end of this post. Note the color you have defined to trace (shown to the left of the start / end / color buttons); for lines that aren't an entirely uniform color (this one had a bit of
shading), try to pick the most representative color within the line. It should look like this (note the location of the six reference points below):
After you hit the trace button (the one with three points in black, yellow, and red), the software will try to
trace the line, but it has a hard time with thick lines like this one. I switched to point mode (the button showing four points on a graph), then tweaked
the settings via the settings tab right above the graph to get it to work properly. I reduced how many points it extracts (by switching "all points" to "output distance" with a value of 1 "on the traced path") and
made a few manual corrections to individual points that weren't quite right, after which I had this:
It looked to me like I had a point on each of the actual data points on the graph, but if not, it's easy to keep tweaking: add, move, or remove points until they fully capture the source data. From there we can export the traced data as a text file and
repeat with the second data series. Note that we need to replace the values for
the y-axis (in the upper left of DataThief) before tracing the second line, since the second series uses different
units (replace 8 with 50, and 4 with 44).
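If you are curious what those axis values are doing, the idea can be sketched in a few lines of Python. This is only an illustration of how three reference points can turn a pixel position into a data value. It is an assumption about the general technique, not DataThief's actual code, and the pixel coordinates and the function name make_pixel_to_data are made up for the example.

```python
def make_pixel_to_data(origin, y_top, x_right):
    """Build a pixel -> data converter from three reference points.

    Each argument is ((pixel_x, pixel_y), (data_x, data_y)).
    Hypothetical helper for illustration; not part of DataThief.
    """
    (p0, d0), (p1, d1), (p2, d2) = origin, y_top, x_right
    # Pixel-space vectors along the y-axis (u) and the x-axis (v).
    ux, uy = p1[0] - p0[0], p1[1] - p0[1]
    vx, vy = p2[0] - p0[0], p2[1] - p0[1]
    det = ux * vy - uy * vx
    if det == 0:
        raise ValueError("reference points must not lie on one line")

    def convert(px, py):
        dx, dy = px - p0[0], py - p0[1]
        # Solve dx,dy = a*u + b*v for the fractions a (up the y-axis)
        # and b (along the x-axis) using Cramer's rule.
        a = (dx * vy - dy * vx) / det
        b = (ux * dy - uy * dx) / det
        x = d0[0] + a * (d1[0] - d0[0]) + b * (d2[0] - d0[0])
        y = d0[1] + a * (d1[1] - d0[1]) + b * (d2[1] - d0[1])
        return x, y

    return convert


# Made-up pixel positions for the three axis markers (image y grows downward).
ORIGIN_PX, Y_TOP_PX, X_RIGHT_PX = (100, 400), (100, 100), (540, 400)

# First series: y-axis runs from 4 to 8, x-axis from month 0 to month 22.
unemployment = make_pixel_to_data(
    (ORIGIN_PX, (0, 4)), (Y_TOP_PX, (0, 8)), (X_RIGHT_PX, (22, 4))
)
# Second series: same pixels, but the y-axis now runs from 44 to 50.
uninsured = make_pixel_to_data(
    (ORIGIN_PX, (0, 44)), (Y_TOP_PX, (0, 50)), (X_RIGHT_PX, (22, 44))
)

print(unemployment(320, 250))  # a pixel halfway along both axes
print(uninsured(320, 250))
```

The same pixel lands on 6 with the first set of y-axis values and on 47 with the second, which is why the y-axis values have to be updated before tracing the second series.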
In Excel I multiplied
the values for “Uninsured Americans” by a million to get the true number (they are reported in the graph with a unit of millions of Americans). I
then got some estimates of US population for Jan 2008 and 2009 (http://www.usnews.com/opinion/articles/2008/12/31/us-population-2009-305-million-and-counting),
and used those to calculate an average growth per month, the projected baseline
population in Mar 2007, and the projected population for each of our data points. This
let me calculate the percentage
of Americans who are uninsured, which can be compared directly to the percentage
of Americans who are unemployed.
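I did this in Excel, but the same arithmetic can be sketched in Python for anyone who prefers a script. The population figures and traced values below are round placeholders chosen for illustration, not the estimates from the linked article or the numbers I extracted, and the function names are my own invention. The month numbering follows the x-axis used above (March 07 = 0, so Jan 08 = 10 and Jan 09 = 22).

```python
JAN_2008_MONTH = 10  # months after March 2007
JAN_2009_MONTH = 22


def make_population_model(pop_jan_2008, pop_jan_2009):
    """Return a function giving a straight-line population estimate per month."""
    # Average growth per month between the two known estimates.
    growth_per_month = (pop_jan_2009 - pop_jan_2008) / (
        JAN_2009_MONTH - JAN_2008_MONTH
    )
    # Project backwards from Jan 2008 to the March 2007 baseline (month 0).
    baseline_mar_2007 = pop_jan_2008 - growth_per_month * JAN_2008_MONTH

    def population_at(month):
        return baseline_mar_2007 + growth_per_month * month

    return population_at


def percent_uninsured(month, uninsured_millions, population_at):
    """Convert a traced value in millions of people to a percentage."""
    uninsured_people = uninsured_millions * 1_000_000
    return 100.0 * uninsured_people / population_at(month)


# Placeholder inputs: swap in real population estimates and traced points.
population_at = make_population_model(303_000_000, 306_000_000)
traced_points = [(0, 45.0), (22, 49.0)]  # (month, uninsured in millions)

for month, millions in traced_points:
    pct = percent_uninsured(month, millions, population_at)
    print(f"month {month}: {pct:.1f}% uninsured")
```

A straight-line projection is a simplification, but over a span of less than two years it should be close enough given that the traced values are only approximate anyway.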
A graph of the resulting data reveals a different pattern
than what we saw before: lack of insurance is increasing very slightly (from
~15.2% to ~16.1%) while unemployment increases more rapidly (from ~4.4% to ~7.6%).
Note that the percentage of people who are unemployed never surpasses the percentage who are
uninsured (contrary to how the original graph made it appear):
There are two important considerations before using this
software. First, these values will only be approximate, so if possible it’s
always better to get the underlying data from the person who created the first
figure. Second, it is possible that the data you are extracting is copyrighted,
and that your reuse of their data may violate the data license. Use at your own
risk! A third potential problem is that you may find yourself sucked into "fixing" misleading graphs you find on the internet, which is a task you will never complete.
Note that despite the name, DataThief is shareware; if you
find it useful, please put your thieving on hold long enough to buy a $25
license.
NOTE: If you're working with a bar chart or other figure where tracing a line isn't necessary, start by setting your three axis points and entering the value range for x and y: set Ref 0 to 0,[high range of the bar chart], Ref 1 to 0,0, and Ref 2 to [any value],0. Then simply drag one of the circles with a + in it to the end of each bar and record the value displayed as you drag. For a small chart with just a few bars, I find this quick and easy.