Using statistics to catch cheats and criminals

“If your experiment needs statistics, you ought to have done a better experiment,” Ernest Rutherford once declared. But when you work at the frontier of detection, as astronomers and particle physicists often do, you rely on statistical analysis to extract results. Indeed, if your experiment doesn’t need statistics, then you might be too far from the frontier to make an important discovery.

Despite such statistical triumphs as last year’s discovery of the Higgs boson, Rutherford’s disdain for—or at least suspicion of—statistics remains widespread. A recent statistical analysis demonstrated that visiting your doctor every year for a checkup doesn’t significantly prolong life. Of course, the practice doesn’t harm any individual patient, but its prevalence in the US raises the total cost of medical care, which harms society. Will the study make a difference? I doubt it.

Ernest Rutherford (1871–1937) and his coworkers discovered the atomic nucleus and the proton. They also performed the first experiments that transmuted one element into another. To learn more about Rutherford, visit the online exhibition Rutherford’s Nuclear World, which is hosted by AIP’s Center for the History of Physics. CREDIT: AIP Emilio Segrè Visual Archives (gift of Otto Hahn and Lawrence Badash)

I’m not sure what evidence would convince physicians to refrain from insisting on annual checkups, but they and anyone else who is skeptical of statistical analysis might be persuaded by a simmering scandal that boiled over recently in Atlanta, Georgia.

On 29 March the superintendent of the Atlanta school district, Beverly Hall, and 34 other educators were indicted in what a New York Times news story characterized as “the most widespread public school cheating scandal in memory.”

According to the indictment, the 35 educators conspired to raise students’ test scores by altering the tests after the students had taken them. Meeting in secret and wearing gloves to avoid leaving incriminating fingerprints, groups of teachers at various schools rubbed out wrong answers and replaced them with the correct ones.

Besides acclaim for appearing to fix badly performing schools, the conspirators also received cash bonuses. Hall’s totaled $500 000, according to the Times. One school, Parks Middle School, “improved” so much that it forfeited $750 000 in state and federal aid.

To gather evidence of a conspiracy that might convince a jury, Georgia state investigator Richard Hyde persuaded one of the teachers who was allegedly part of the scheme to wear a secret recording device. But evidence of a different kind had come to light five years earlier. In December 2008, the Atlanta Journal-Constitution drew attention to what seemed like suspiciously large and abrupt jumps in test scores. That initial investigation expanded into a five-year project in which three reporters and two database specialists gathered and analyzed test scores from 69 000 schools in 14 743 districts in 49 states.

The scores from Atlanta and a few other districts stuck out as anomalous. As reported last June, some of those school districts are taking advantage of the Atlanta Journal-Constitution study to identify cheating educators.
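To make "stuck out as anomalous" concrete, here is a minimal sketch of one common way to flag outliers. It is not the newspaper's method, which isn't described here. The function name, the threshold, and the six schools with their score changes are all invented for illustration.

```python
from statistics import median

def flag_anomalous_gains(gains, threshold=3.5):
    """Return names whose gain is an upward outlier by the modified z-score.

    gains: dict mapping a school name to its year-over-year score change.
    The median and the median absolute deviation (MAD) are used because
    they are distorted less by the outliers than the mean and standard
    deviation would be.
    """
    values = list(gains.values())
    med = median(values)
    mad = median(abs(v - med) for v in values)
    if mad == 0:
        return []  # no spread, so nothing to compare against
    return [name for name, v in gains.items()
            if 0.6745 * (v - med) / mad > threshold]

# Hypothetical data: score changes, in points, for six made-up schools.
example_gains = {"A": 1.2, "B": -0.8, "C": 2.1, "D": 0.4, "E": 31.0, "F": -1.5}
flagged = flag_anomalous_gains(example_gains)
print(flagged)
```

A flag of this kind is only a prompt for closer scrutiny. A big jump in scores can have innocent explanations.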

Organized crime and electoral fraud

Similar statistical investigations can be found on the arXiv e-print server. Last month two physicists, Salvatore Catanese and Giacomo Fiumara, and mathematician Emilio Ferrara, all from the University of Messina in Sicily, demonstrated that they could pick out organized criminal activity from cell phone records by looking for statistically anomalous behavior.

My favorite example—because it’s so similar to the Atlanta cheating scandal—was the study posted last year by Dmitry Kobak of the electrical and electronic engineering department of Imperial College London and two unaffiliated coauthors, Sergey Shpilkin and Maxim Pshenichnikov. Here’s the abstract:

Here we perform a statistical analysis of the official data from recent Russian parliamentary and presidential elections (held on December 4th, 2011 and March 4th, 2012, respectively). A number of anomalies are identified that persistently skew the results in favour of the pro-government party, United Russia (UR), and its leader Vladimir Putin. The main irregularities are: (i) remarkably high correlation between turnout and voting results; (ii) a large number of polling stations where the UR/Putin results are given by a round number of percent; (iii) constituencies showing improbably low or (iv) anomalously high dispersion of results across polling stations; (v) substantial difference between results at paper-based and electronic polling stations. These anomalies, albeit less prominent in the presidential elections, hardly conform to the assumptions of fair and free voting. The approaches proposed here can be readily extended to quantify fingerprints of electoral fraud in any other problematic elections.
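Two of the irregularities in the abstract, (i) and (ii), are easy to sketch in code. What follows is a rough illustration, not the authors' analysis. A serious test would compare the counts with what fair elections are expected to produce. The station data, the function names, and the tolerance are all made up.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def round_percent_fraction(shares, tol=0.05):
    """Fraction of percentages that lie within tol of a whole number."""
    near = sum(1 for s in shares if abs(s - round(s)) <= tol)
    return near / len(shares)

# Hypothetical polling stations:
# (registered voters, ballots cast, votes for the leading party)
stations = [(1000, 520, 234), (800, 480, 240), (1200, 900, 630),
            (950, 900, 810), (1100, 600, 275)]

turnout = [100 * cast / registered for registered, cast, _ in stations]
share = [100 * votes / cast for _, cast, votes in stations]

r = pearson(turnout, share)                   # irregularity (i)
round_frac = round_percent_fraction(share)    # irregularity (ii)
print(round(r, 3), round_frac)
```

In this toy data set, four of the five stations report a whole-number percentage, and turnout and vote share rise together.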

As for Rutherford, I remain puzzled by his attitude toward statistics. The famous experiment that Hans Geiger and Ernest Marsden performed in 1909 at the University of Manchester under his direction revealed the existence of the atomic nucleus—after Geiger and Marsden had laboriously tallied the rare backward reflections of alpha particles from gold foil.

The ten thousand Kims

When Koreans marry, the wife retains her name, which is entered into the husband’s copy of his family genealogy (jokbo or chokpo in Korean). The practice, which reflects the Confucian reverence for one’s ancestors, has continued for centuries. As you might expect, jokbos are of great interest to historians. Less obviously, they provide a means for three physicists to test their statistical theories.

The physicists are Seung Ki Baek and Petter Minnhagen of Umeå University in Sweden and Beom Jun Kim of Sungkyunkwan University in South Korea. In a paper posted on the arXiv e-print server, they describe their analysis of women’s names recorded in 10 jokbos that go back 480 years.

Baek, Minnhagen, and Kim divided the jokbos’ 480-year span into 30-year intervals and for each interval tallied the number of women who joined the 10 families M, the number of different family names that those women possessed N, and the number of women who possessed the most common family name kmax.
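The tally for a single interval can be sketched in a few lines. The list of names below is invented, and the function name is mine, not the authors'.

```python
from collections import Counter

def tally_names(family_names):
    """Return (M, N, k_max) for one interval's list of family names.

    M: number of women recorded
    N: number of different family names among them
    k_max: number of women who have the most common family name
    """
    counts = Counter(family_names)
    M = len(family_names)
    N = len(counts)
    k_max = max(counts.values()) if counts else 0
    return M, N, k_max

# Hypothetical entries for one 30-year interval.
sample = ["Kim", "Lee", "Kim", "Park", "Choi", "Kim", "Lee"]
M, N, k_max = tally_names(sample)
print(M, N, k_max)
```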

KimHangul.jpg

The physicists wondered whether the changing values of M, N, and kmax could be reproduced by the random group formation (RGF) model. As a starting point, the RGF model assumes that groups (in this case groups of N women with the same family name) form through a mixing process that maximizes the entropy of a probability distribution (in this case the probability, P_M(k), that a randomly chosen woman from a population of M has a family name that occurs k times).

The number and size of groups predicted by the RGF model depend on the sample size, which is what you’d expect for family names in real life. As more generations are recorded in the jokbos, the number and frequency of different family names increase. What makes the RGF distinctive is its history independence: For any generation, the frequency distribution of family names retains the same dependence on sample size.

That history independence might seem implausible, given how much famines, wars, industrial revolutions, and other traumas transform societies. To get the idea across, Baek, Minnhagen, and Kim make a comparison with the frequency distribution of words used by an author throughout his or her oeuvre. Because of its length and breadth, Leo Tolstoy’s 1440-page novel War and Peace has a different word-frequency distribution than does his 76-page novella The Death of Ivan Ilyich. Nevertheless, you can think of the two distributions as being drawn from the same single and very large “meta-book” that characterizes the novelist’s choice and use of words. Likewise, Korean family names in the jokbos are drawn from the same “meta-registry” that reflects Korea’s enduring culture, provided the RGF model applies, that is.

In fact, it turns out that the RGF model does reproduce how N has varied with M and other patterns derived from the jokbos. What is the origin of the model’s success? Baek, Minnhagen, and Kim speculate that the answer lies in the stability of Korean culture:

It seems that some core of the Korean culture has remained intact over at least 1500 years and as both the population and occupied area expanded, it basically swallowed other cultural influences without compromising its core.

One of the RGF model’s predictions is that kmax, the number of women who have the most frequently occurring family name, is proportional to M, the sample size (not the case, according to the RGF model, for other family names). Kim is the most common name in the jokbos and, indeed, in Korea. (“Kim” is the name that appears in the accompanying photo.) By applying the RGF model, Baek, Minnhagen, and Kim estimate that in AD 500 Korea was home to 10 000 Kims.

Charles Day

Gnomes

When I asked a friend to suggest topics for this blog, she replied by email, “Would love to see a blog about gnomes,” and appended a link to a web page entitled “Physics doesn’t exist, it’s all about Gnomes.” You might have encountered the heterodox gnome theory before. The section on electricity reads:

Inside cables there are hundreds of tiny gnomes “high-fiving” each other and running around swapping messages. This transfer of messages allows things to work, e.g. the gnomes in a plug socket tell the gnomes in the wire, who eventually tell the gnomes in (say) a kettle to fart in the water allowing it to boil.

Of course, physics does exist and it isn’t all about gnomes, but the notion that some physical phenomena arise from the collective action of tiny particles lies at the heart of many branches of physics.

In 1662 Robert Boyle published his experimental discovery that the pressure and volume of a gas are inversely proportional to each other at a fixed temperature. Seventy-six years later, Daniel Bernoulli derived the same law by assuming that gases consist of molecules (but not gnomes) whizzing about in all directions and applying Newtonian mechanics to their motion.
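For readers who want the argument spelled out, here is the standard textbook version of the kinetic derivation, not Bernoulli's own notation. The symbols are my assumptions: N molecules of mass m in a cubic box of side L and volume V = L^3, with v_x the velocity component perpendicular to one wall.

```latex
\begin{align}
  % One molecule hits a given wall once every 2L/v_x and delivers
  % momentum 2 m v_x each time, so its average force on that wall is
  F_{\text{one}} &= \frac{2 m v_x}{2L / v_x} = \frac{m v_x^2}{L} \\
  % Summing over N molecules and dividing by the wall area L^2:
  P &= \frac{N m \langle v_x^2 \rangle}{L \cdot L^2}
     = \frac{N m \langle v_x^2 \rangle}{V} \\
  % Motion in all directions is equally likely:
  \langle v_x^2 \rangle &= \tfrac{1}{3} \langle v^2 \rangle
  \quad \Rightarrow \quad
  P V = \tfrac{1}{3} N m \langle v^2 \rangle
\end{align}
```

If the mean-square speed is fixed at a fixed temperature, the right-hand side is constant and the product PV is constant too, which is Boyle's law.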

As a physics student, I remember my first encounter with Ludwig Boltzmann’s derivation of the entropy of gas in terms of the statistics of its constituent molecules. It felt like an epiphany!

Gnome.jpg

If I’d remained an astrophysicist, I don’t think I’d have encountered further and more recent attempts to relate thermodynamic laws to the behavior of molecules. But now that I’m a science writer I see quite a few of them, especially in biological physics.

For Physics Today’s May 2003 issue I wrote a news story about a paper by Françoise Brochard-Wyart and her colleagues. The paper tackled the problem of how pores open and close in cell membranes or, rather, in simple stand-ins for cells called artificial vesicles.

Poking a hole in the membrane of a living cell or artificial vesicle entails forcing apart the lipids and other molecules that constitute the membrane. Line tension—the one-dimensional analogue of surface tension—resists the formation of the hole and will reseal the membrane once the poker is withdrawn.

Through experiment and theory, Brochard-Wyart and her team found that they could control the rate at which pores reseal by adding certain molecules to the solution that surrounded their artificial vesicles. What’s more, the relation between the line tension and the molecules’ concentration followed a thermodynamic law that J. Willard Gibbs had derived a century earlier.

Much of modern biology is focused on identifying the molecular origin of biological processes. When I asked Brochard-Wyart about the philosophy behind her thermodynamic approach, she reminded me that thermodynamical laws are universal: “Even less general, microscopic models must obey thermodynamics—if they are right.”

Charles Day

Three degrees of Tania Mallet

The notion that everyone on Earth is connected to anyone else on Earth by at most six interpersonal links is popularly known as six degrees of Kevin Bacon. Because the phrase doesn’t identify six as an upper limit, I’d come to think of it as an average. Playing yesterday on a website called the Oracle of Bacon soon disabused me.

The oracle makes use of the Internet Movie Database to identify how many degrees of professional separation lie between Bacon and anyone else in the movie and TV industries. Being competitive, I tried to pick actors whose Bacon number (the smallest number of professional links to Bacon) exceeded six. I couldn’t.

Tania_Mallet.jpg

As a fan of James Bond, I knew that Tania Mallet (shown here) is a former model who made only one movie, Goldfinger. Her Bacon number is 3.

Assuming that contemporary Hollywood is professionally and socially distant from Japan in the 1910s and 20s, I next tried linking Bacon to Matsunosuke Onoe. Born in 1875, Onoe was Japan’s first movie superstar. Amazingly, his Bacon number is 4, just one higher than Mallet’s. The Korean actor and director Woon-gyu Na (1902–37) has a Bacon number of 5.

As a matter of fact, the Oracle of Bacon will evaluate the degrees of separation between any two actors. The results will be similar. Onoe, for example, has a Mallet number of 3. Evidently, the movie industry is supremely interconnected.
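Counting degrees of separation amounts to finding the shortest path in a graph whose nodes are actors and whose links join actors who appeared in the same film. I don't know how the Oracle of Bacon does it internally. The sketch below uses a breadth-first search, the textbook approach, on invented films and placeholder actors.

```python
from collections import deque
from itertools import combinations

def build_costar_graph(casts):
    """Map each actor to the set of actors who share at least one film."""
    graph = {}
    for cast in casts.values():
        for actor in cast:
            graph.setdefault(actor, set())
        for a, b in combinations(cast, 2):
            graph[a].add(b)
            graph[b].add(a)
    return graph

def degrees_of_separation(graph, start, goal):
    """Smallest number of co-star links from start to goal, or None."""
    if start == goal:
        return 0
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        actor, dist = queue.popleft()
        for other in graph.get(actor, ()):
            if other == goal:
                return dist + 1
            if other not in seen:
                seen.add(other)
                queue.append((other, dist + 1))
    return None  # the two actors are not connected at all

# Hypothetical films and casts; A to E are placeholders, not real actors.
casts = {"Film 1": ["A", "B"], "Film 2": ["B", "C"],
         "Film 3": ["C", "D"], "Film 4": ["E"]}
graph = build_costar_graph(casts)
print(degrees_of_separation(graph, "A", "D"))
print(degrees_of_separation(graph, "A", "E"))
```

Because the search explores the graph one link at a time, the first route it finds to the goal is guaranteed to be a shortest one.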

For physicists, the challenge and attraction of studying networks lie in identifying generic behavior. Bacon has made more than 60 movies; Mallet just one. An ecosystem of interconnected Bacons and Mallets yields a network whose topology resembles that of the airline route system or the national grid.

To be useful, or even to qualify as scientific, a theoretical model should make predictions. As Mark Newman described in “The physics of networks” (Physics Today, November 2008, page 33), network theory can predict how networks respond to the removal or addition of nodes. In an ecological network, that translates to predicting the fate of an ecosystem if a species becomes extinct, and to devising strategies for safeguarding the ecosystem.
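A toy calculation gives the flavor of what removing a node can do. It is not taken from Newman's article. The six-node network and the choice of the largest connected cluster as the measure of damage are my assumptions.

```python
def largest_component(graph, removed=frozenset()):
    """Size of the largest connected cluster once `removed` nodes are gone."""
    seen = set(removed)  # treat removed nodes as already visited
    best = 0
    for node in graph:
        if node in seen:
            continue
        # Depth-first search to collect everything reachable from node.
        stack = [node]
        seen.add(node)
        size = 0
        while stack:
            current = stack.pop()
            size += 1
            for neighbor in graph[current]:
                if neighbor not in seen:
                    seen.add(neighbor)
                    stack.append(neighbor)
        best = max(best, size)
    return best

# Hypothetical network: one well-connected hub and five sparsely linked nodes.
network = {
    "hub": {"a", "b", "c", "d"},
    "a": {"hub"}, "b": {"hub"}, "c": {"hub"},
    "d": {"hub", "e"}, "e": {"d"},
}

print(largest_component(network))            # intact network
print(largest_component(network, {"a"}))     # lose a peripheral node
print(largest_component(network, {"hub"}))   # lose the hub
```

Losing a peripheral node barely matters, whereas losing the hub breaks the network into fragments.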

Not being an actor, I can’t use the Oracle of Bacon to evaluate my own Bacon number, but I already know my Newman number. He and I lived next to each other in the same building in Sidney Sussex College, Cambridge.

Charles Day

Thanks to AIP’s Chris Iannicello for telling me about the Oracle of Bacon. He posted a link to it on another network, Facebook.