Problem Management — principles
Statistics

Statistics can perhaps best be summarized as true lies, since it is so easy to present them in a way that far more closely represents the views of the presenter than reality. Never accept any statistic at face value; be sure you understand where the data came from and how the result was derived; and always question the motives of the presenters.


If you have ever answered a survey or compiled experimental data for a course project you already know how unreliable the raw data can be. Perhaps you answered a few questions not quite as truthfully as you could have, and perhaps you adjusted a few of the obviously wrong experimental results. Even if you personally would never do such a thing, you almost certainly realize that many other people would.

Ancel Keys threw away data that contradicted his theory that cholesterol caused heart disease, and Andrew Wakefield invented data to support his theory that vaccination caused autism. These are two famous instances of scientific fraud. Think how many others have never been caught, or were caught but were not prominent enough to be widely reported.

And even when questions are answered honestly and completely, perhaps the questions themselves were somewhat biased.


But even supposing that the data do accurately reflect reality, they can be analyzed and presented in many different ways.

Suppose a headline reads "Veterans' benefits increase 20%". That could show how well the government is honouring its veterans. But if the increase were over a ten-year period during which inflation was 25%, it actually represents a drop in support. Then again, if the number of living veterans decreased by 40% over that same period, the amount per veteran really did increase. Unless most of the increase went to widows' pensions, in which case the living veterans received much less.
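The arithmetic is easy to check. Here is a minimal Python sketch using the invented figures from the example above (a 20% nominal increase, 25% inflation, 40% fewer living veterans):

    # Invented figures from the headline example above.
    nominal_increase = 1.20    # benefits rose 20% in dollar terms
    inflation = 1.25           # prices rose 25% over the same period
    veterans_remaining = 0.60  # 40% fewer living veterans

    # Total spending, adjusted for inflation: a real-terms cut.
    real_total = nominal_increase / inflation - 1
    print(f"real change in total spending: {real_total:+.1%}")     # -4.0%

    # Spending per living veteran, adjusted for inflation: a real increase.
    real_per_vet = nominal_increase / inflation / veterans_remaining - 1
    print(f"real change per living veteran: {real_per_vet:+.1%}")  # +60.0%

The same headline number supports opposite conclusions depending on which denominator you choose.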

Any statistic presented as a simple number is almost always misleading and generally useless, except for the purposes of the person presenting the statistic. Without the full context, it could mean almost anything.

Even raw statistics can produce confusing results. In a taste comparison between Pepsi® and Coke®, can the results be affected by which of the two products is tried first?

39 people tried Pepsi first, and 39 people tried Coke first:

First tasted   Preferred Pepsi   Preferred Coke
Pepsi                19                20
Coke                 20                19

Consider the overall results. Only 48.7% of those served Pepsi first preferred it, while 51.3% of those served Coke first preferred the Pepsi. So obviously it would be good (for Pepsi) if the testers always offered the Coke first.

But perhaps knowing the sex of the tasters might enable the testers to improve the results even more:

Female tasters:

First tasted   Preferred Pepsi   Preferred Coke
Pepsi                 9                 5
Coke                 14                 9

Among females, 64.3% preferred Pepsi if they tried it first, while only 60.9% preferred it if they tried the Coke first. So for females, it's better to serve the Pepsi first.

Male tasters:

First tasted   Preferred Pepsi   Preferred Coke
Pepsi                10                15
Coke                  6                10

And among males, 40.0% preferred Pepsi if they tried it first while only 37.5% preferred it if they tried the Coke first. So for males, it's better to serve the Pepsi first.

We can only conclude that it's best to serve Pepsi first if we know the sex of the taster, and to serve Coke first if we don't know. Yet how can the tester's knowledge possibly affect the results?

Obviously it can't. (This situation is known as Simpson's Paradox.)
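If you would rather see the paradox happen than take it on faith, here is a minimal Python sketch using the (made-up) counts from the tables above:

    # Taste-test counts from the tables above (invented data).
    # Each entry is (preferred Pepsi, preferred Coke).
    counts = {
        ("female", "Pepsi first"): (9, 5),
        ("female", "Coke first"):  (14, 9),
        ("male",   "Pepsi first"): (10, 15),
        ("male",   "Coke first"):  (6, 10),
    }

    def pepsi_share(pepsi, coke):
        return pepsi / (pepsi + coke)

    # Per-group results: serving Pepsi first wins for both sexes.
    for (sex, order), (p, c) in counts.items():
        print(f"{sex:6} {order}: {pepsi_share(p, c):.1%} preferred Pepsi")

    # Aggregated results: the preference flips to serving Coke first.
    for order in ("Pepsi first", "Coke first"):
        p = sum(v[0] for k, v in counts.items() if k[1] == order)
        c = sum(v[1] for k, v in counts.items() if k[1] == order)
        print(f"overall {order}: {pepsi_share(p, c):.1%} preferred Pepsi")

Run it and you get 64.3% and 40.0% for Pepsi-first within the two groups (against 60.9% and 37.5% for Coke-first), yet only 48.7% overall. The aggregation itself produces the reversal.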

Note that these strange results are not because of the small sample size. The 39 could just as well represent 39 thousand or 39 million people.

Also note that the numbers given here are made up; they aren't actual survey results.

Another classic example of this paradox involves acceptance rates and discrimination. 800 men and 800 women applied to a university, which then accepted 380 men but only 313 women. The difference (47.5% vs 39.1%) seems significant enough to be clear evidence of bias against female applicants.

But consider each department individually:

             Men              Women
Physics     60/110 (55%)      6/10  (60%)
Chemistry   92/190 (48%)     22/40  (55%)
Biology    100/170 (59%)     60/100 (60%)
Medicine   117/290 (40%)    105/250 (42%)
Nursing     11/40  (28%)    120/400 (30%)
Total      380/800 (47.5%)  313/800 (39.1%)

The figures add up to the same totals, yet in every individual department there was a higher acceptance rate for women than for men. If anything, the bias is in the other direction.
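Again, the reversal is easy to verify. A short Python sketch over the table above:

    # Accepted/applied per department, from the table above:
    #                 men          women
    admissions = {
        "Physics":   ((60, 110),  (6, 10)),
        "Chemistry": ((92, 190),  (22, 40)),
        "Biology":   ((100, 170), (60, 100)),
        "Medicine":  ((117, 290), (105, 250)),
        "Nursing":   ((11, 40),   (120, 400)),
    }

    # Per-department rates: women are accepted more often everywhere.
    for dept, ((m_in, m_app), (w_in, w_app)) in admissions.items():
        print(f"{dept:9}: men {m_in/m_app:.0%}, women {w_in/w_app:.0%}")

    # Aggregate rates: men come out ahead overall.
    m_in  = sum(m[0] for m, _ in admissions.values())
    m_app = sum(m[1] for m, _ in admissions.values())
    w_in  = sum(w[0] for _, w in admissions.values())
    w_app = sum(w[1] for _, w in admissions.values())
    print(f"Total    : men {m_in/m_app:.1%}, women {w_in/w_app:.1%}")

The resolution is visible in the table itself: women applied disproportionately to the departments with the lowest acceptance rates (650 of the 800 applied to Medicine or Nursing, against 330 of the men), so their higher per-department rates still yield a lower overall rate.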

The mathematics justifying police profiling is equally flawed. (This refers to using profiling for random checks, not to searching for suspects matching a specific description.)

Consider two groups of easily identifiable people, A and B, each containing some dangerous people who are not obviously so.

     % of population   % dangerous
A          90%             0.1%
B          10%             0.5%

We can quickly, and correctly, conclude that we are five times as likely to select a dangerous person if we randomly select from group B as from group A. We can legitimately deduce that there is some systemic difference between the two groups.

Police often, but incorrectly, use this result to justify spending most of their time monitoring group B. Other people often, and again incorrectly, use this result to justify not hiring, or not associating with, people from group B.

Why incorrectly?

Try answering the following two questions:

  1. In any arbitrarily selected violent incident, would the perpetrator more likely be from group A or from group B?
  2. If one person is selected from each group, how much less dangerous is one likely to be than the other?

Unless you actually did the math, the real answers might surprise you:

  1. 0.09% of the population are dangerous As, while 0.05% of the population are dangerous Bs. So the perpetrator is 80% more likely to be from group A than from group B.
  2. 99.9% of group A and 99.5% of group B are not dangerous people. So the person from group A is only 0.4 percentage points less likely to be dangerous than the person from group B. If you were hiring one of these two people, you would consider a difference of less than half a point in any other factor to be insignificant.

If police randomly questioned people, regardless of which group they belonged to, they would be bothering an innocent person 99.86% of the time. They would not continue doing this for long.

So how does the fact that they would be wrong only 99.50% of the time, if they restricted it to group B, in any way justify doing so?
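All of these figures follow mechanically from the little table above. A minimal Python sketch of the arithmetic:

    # Group shares and danger rates from the table above (invented data).
    share     = {"A": 0.90, "B": 0.10}    # fraction of the population
    dangerous = {"A": 0.001, "B": 0.005}  # dangerous fraction of each group

    # Question 1: which group is a random perpetrator likely to come from?
    in_pop = {g: share[g] * dangerous[g] for g in share}
    print(f"dangerous As: {in_pop['A']:.2%} of population")  # 0.09%
    print(f"dangerous Bs: {in_pop['B']:.2%} of population")  # 0.05%
    print(f"A:B ratio = {in_pop['A'] / in_pop['B']:.1f}")    # 1.8 (80% more)

    # Question 2: the absolute difference between two individuals.
    print(f"difference: {dangerous['B'] - dangerous['A']:.1%}")  # 0.4 points

    # Innocent-stop rates: random checks vs. checking only group B.
    print(f"random checks: {1 - sum(in_pop.values()):.2%} innocent")  # 99.86%
    print(f"only group B:  {1 - dangerous['B']:.2%} innocent")        # 99.50%

Restricting the checks to group B reduces the wasted stops by barely a third of a percentage point; the stop is still wasted 99.5% of the time.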

Statistics can also be affected by the process that obtains them.

Suppose a boss decides to keep careful track of how his employees are doing, and whenever someone's performance falls below a certain level, he reprimands them. Whenever this happens, the statistics for that employee almost always improve over the next few days.

Now suppose another boss does the same thing, but rather than reprimanding, he waits for great performance and offers praise. Whenever this happens, the statistics for that employee almost always get worse over the next few days.

The obvious conclusion is that employees react positively to criticism but slack off whenever praised, and the inevitable result is that managers tend to criticize far more than they praise.

Of course such conclusions and decisions are totally wrong. The observed results have very little to do with the boss's actions or human nature, and everything to do with when the criticism or praise was given.

People's performances can fluctuate randomly from day to day, so it's not surprising that the average after an especially bad day would be better, nor that the average after an especially good day would be worse. The same results would have been observed even if the boss hadn't bothered to offer any criticism or praise at all. The observed effects were simply an artifact of the choice of the test point trigger. If one observes the landscape only when in a valley, the view is of nothing but mountains; if one looks only when a peak is reached, every direction is downhill.
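This effect (regression to the mean) is easy to reproduce. Here is a minimal Python simulation in which daily performance is pure noise and feedback has no effect whatsoever; the mean of 100, the reprimand threshold of 85, and the praise threshold of 115 are all invented for illustration:

    import random

    # Each day's performance is independent noise around a mean of 100;
    # nobody reacts to praise or criticism at all.
    random.seed(1)
    scores = [random.gauss(100, 10) for _ in range(100_000)]

    # The day after each reprimand-worthy day, and after each praise-worthy day.
    after_reprimand = [scores[i + 1] for i in range(len(scores) - 1)
                       if scores[i] < 85]
    after_praise = [scores[i + 1] for i in range(len(scores) - 1)
                    if scores[i] > 115]

    # Both averages come out near 100: "improvement" after a reprimand,
    # "slacking off" after praise, with no human behaviour involved.
    print(sum(after_reprimand) / len(after_reprimand))
    print(sum(after_praise) / len(after_praise))

The day after a sub-85 day averages about 100, which looks like the reprimand worked; the day after a 115-plus day also averages about 100, which looks like the praise backfired.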

Now even when a statistic is understood in its full context, it might not be as meaningful as it appears. Physicists and engineers know that it would be misleading to express their results with more precision than their original measurements, but people who present statistics are under no such constraint. A statistic like 71.4286% is very precise, and that can lead people to think it is equally accurate. In fact, it could be the result of five out of seven responses, which is nowhere near that accurate if the sample of seven is supposed to represent a much larger population.
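Here is a rough Python sketch of how wide the uncertainty behind that 71.4286% really is, using a simple normal-approximation (Wald) interval, which is itself only a crude estimate at a sample size of seven:

    import math

    # 5 "yes" answers out of 7 respondents, quoted as 71.4286%.
    yes, n = 5, 7
    p = yes / n

    # Rough 95% margin of error (normal approximation; crude for n = 7).
    margin = 1.96 * math.sqrt(p * (1 - p) / n)
    print(f"{p:.4%} +/- {margin:.1%}")  # 71.4286% +/- 33.5%

Six decimal digits of precision, with a margin of error of more than thirty percentage points.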

And remember, 27.358% of all statistics are made up.