Highlights: Superforecasting by Phil Tetlock

From the book cover:

In a landmark, twenty-year study, Wharton professor Philip Tetlock showed that the average expert was only slightly better at predicting the future than a layperson using random guesswork. Tetlock’s latest project – an unprecedented, government-funded forecasting tournament involving over a million individual predictions – has since shown that there are, however, some people with real, demonstrable foresight. These are ordinary people, from former ballroom dancers to retired computer programmers, who have an extraordinary ability to predict the future with a degree of accuracy 60% greater than average. They are superforecasters.

In Superforecasting, Tetlock and his co-author Dan Gardner offer a fascinating insight into what we can learn from this elite group. They show the methods used by these superforecasters which enable them to outperform even professional intelligence analysts with access to classified data. And they offer practical advice on how we can all use these methods for our own benefit – whether in business, in international affairs, or in everyday life.

Cummings Summary

Dominic Cummings (regardless of what you think of him) has a pretty good summary of it in a Spectator column from 2016.

My highlights follow…

Laplace’s Demon vs Lorenz Cloud

Will we ever be able to predict everything?

Lorenz poured cold rainwater on that dream. If the clock symbolizes perfect Laplacean predictability, its opposite is the Lorenzian cloud. High school science tells us that clouds form when water vapor coalesces around dust particles. This sounds simple but exactly how a particular cloud develops—the shape it takes—depends on complex feedback interactions among droplets. To capture these interactions, computer modelers need equations that are highly sensitive to tiny butterfly-effect errors in data collection. So even if we learn all that is knowable about how clouds form, we will not be able to predict the shape a particular cloud will take. We can only wait and see. In one of history’s great ironies, scientists today know vastly more than their colleagues a century ago, and possess vastly more data-crunching power, but they are much less confident in the prospects for perfect predictability.

The staggering size of the intelligence community in the US

By one rough estimate, the United States has twenty thousand intelligence analysts assessing everything from minute puzzles to major events such as the likelihood of an Israeli sneak attack on Iranian nuclear facilities or the departure of Greece from the eurozone.

National Intelligence Estimates are the consensus view of the Central Intelligence Agency, the National Security Agency, the Defense Intelligence Agency, and thirteen other agencies. Collectively, these agencies are known as the intelligence community, or IC. The exact numbers are classified, but by one rough estimate the IC has a budget of more than $50 billion and employs one hundred thousand people.

The scientific method started getting applied to medicine SO late!

I knew that medicine was effectively BS for most of human history… but didn’t realise it went on for that long, and that the BS came so close to ending 2 centuries earlier…

The cure for this plague of certainty came tantalizingly close to discovery in 1747, when a British ship’s doctor named James Lind took twelve sailors suffering from scurvy, divided them into pairs, and gave each pair a different treatment: vinegar, cider, sulfuric acid, seawater, a bark paste, and citrus fruit. It was an experiment born of desperation. Scurvy was a mortal threat to sailors on long-distance voyages and not even the confidence of physicians could hide the futility of their treatments. So Lind took six shots in the dark—and one hit. The two sailors given the citrus recovered quickly. But contrary to popular belief, this was not a eureka moment that ushered in the modern era of experimentation. “Lind was behaving in what sounds a modern way, but had no full understanding of what he was doing,” noted Druin Burch. “He failed so completely to make sense of his own experiment that even he was left unconvinced of the exceptional benefits of lemons and limes.” For years thereafter, sailors kept getting scurvy and doctors kept prescribing worthless medicine. Not until the twentieth century did the idea of randomized trial experiments, careful measurement, and statistical power take hold. “Is the application of the numerical method to the subject-matter of medicine a trivial and time-wasting ingenuity as some hold, or is it an important stage in the development of our art, as others proclaim it,” the Lancet asked in 1921. The British statistician Austin Bradford Hill responded emphatically that it was the latter, and laid out a template for modern medical investigation.

Thank you for your service Mr Codman…

A century ago, as physicians were slowly professionalizing and medicine was on the cusp of becoming scientific, a Boston doctor named Ernest Amory Codman had an idea similar in spirit to forecaster scorekeeping. He called it the End Result System. Hospitals should record what ailments incoming patients had, how they were treated, and—most important—the end result of each case. These records should be compiled and statistics released so consumers could choose hospitals on the basis of good evidence. Hospitals would respond to consumer pressure by hiring and promoting doctors on the same basis. Medicine would improve, to the benefit of all. “Codman’s plan disregarded a physician’s clinical reputation or social standing as well as bedside manner or technical skills,” noted the historian Ira Rutkow. “All that counted were the clinical consequences of a doctor’s effort.” Today, hospitals do much of what Codman demanded, and more, and physicians would find it flabbergasting if anyone suggested they stop. But the medical establishment saw it differently when Codman first proposed the idea. Hospitals hated it. They would have to pay for record keepers. And the physicians in charge saw nothing in it for them. They were already respected. Keeping score could only damage their reputations. Predictably, Codman got nowhere. So he pushed harder—and alienated his colleagues so badly he was booted from Massachusetts General Hospital. Codman opened his own little hospital, where he personally paid for statistics to be compiled and published, and he continued to publicize his ideas, using increasingly intemperate means. At a meeting of a local medical society in 1915, Codman unfurled an enormous cartoon that poked fun at various dignitaries, including the president of Harvard University. Codman was suspended from the medical society and lost his Harvard teaching post. The status quo seemed unassailable.
But “the hue and cry over Codman’s cartoon created a nationwide buzz,” Rutkow wrote. “Medical efficiency and the end result system were suddenly the hot topic of the day. As the profession and the public learned of Codman’s ideas a growing number of hospitals across the country implemented his scheme. Codman became a sought-after speaker and when the fledgling American College of Surgeons formed a commission on hospital standardization, he was appointed its first chairman.” Much that Codman advocated was never adopted—he was an inexhaustible idealist—but ultimately his core insight won.

Let’s hope we get to “evidence-based forecasting” faster than medicine did…

Consumers of forecasting will stop being gulled by pundits with good stories and start asking pundits how their past predictions fared—and reject answers that consist of nothing but anecdotes and credentials. Just as we now expect a pill to have been tested in peer-reviewed experiments before we swallow it, we will expect forecasters to establish the accuracy of their forecasting with rigorous testing before we heed their advice. And forecasters themselves will realize, as Dan Drezner did, that these higher expectations will ultimately benefit them, because it is only with the clear feedback that comes from rigorous testing that they can improve their foresight. It could be huge—an “evidence-based forecasting” revolution similar to the “evidence-based medicine” revolution, with consequences every bit as significant.

Wouldn’t it be great if every pundit or writer you read had their Brier score on predictions in the same domain published? In the absence of that, I will continue to try to listen mostly to people whose livelihood depends on being right rather than sounding smart!
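
For the curious, here is a minimal sketch of how such a score could be computed. It assumes the original two-category formulation of the Brier score for yes/no questions, where 0 is perfect, 2 is the worst possible and always answering 50% earns 0.5. The function name and the sample track records are made up for illustration.

```python
def brier_score(forecasts, outcomes):
    """Mean two-category Brier score for yes/no questions.

    forecasts: probabilities (0.0 to 1.0) that each event happens.
    outcomes:  booleans, True if the event actually happened.
    Lower is better: 0.0 is perfect, 2.0 is confidently wrong every time.
    """
    if not forecasts or len(forecasts) != len(outcomes):
        raise ValueError("need at least one forecast and one outcome per forecast")
    total = 0.0
    for p, happened in zip(forecasts, outcomes):
        o = 1.0 if happened else 0.0
        # Squared error on the "yes" side plus squared error on the "no" side.
        total += (p - o) ** 2 + ((1.0 - p) - (1.0 - o)) ** 2
    return total / len(forecasts)


# Hypothetical track records on the same four questions.
results = [True, False, False, True]
pundit = brier_score([0.9, 0.8, 0.3, 0.6], results)
hedger = brier_score([0.5, 0.5, 0.5, 0.5], results)
print(f"pundit: {pundit:.3f}, always-50%: {hedger:.3f}")
```

The score only means something over many questions, and only when comparing forecasters on similar questions, which is why "in the same domain" matters.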

System 1 takes the decision, system 2 checks it

From Magnus Carlsen:

“Often, I cannot explain a certain move, only know that it feels right, and it seems that my intuition is right more often than not,” observed the Norwegian prodigy Magnus Carlsen, the world chess champion and the highest-ranked player in history. “If I study a position for an hour then I am usually going in loops and I’m probably not going to come up with something useful. I usually know what I am going to do after 10 seconds; the rest is double-checking.” Carlsen respects his intuition, as well he should, but he also does a lot of “double-checking” because he knows that sometimes intuition can let him down and conscious thought can improve his judgment.

Reminds me of a Go player I read saying something similar (and AlphaGo seemed to have this same “intuition”).

The mistake of expecting certainty in forecasts

Even sophisticated thinkers fall for it. In 2012, when the Supreme Court was about to release its long-awaited decision on the constitutionality of Obamacare, prediction markets—markets that let people bet on possible outcomes—pegged the probability of the law being struck down at 75%. When the court upheld the law, the sagacious New York Times reporter David Leonhardt declared that “the market—the wisdom of the crowds—was wrong.” The prevalence of this elementary error has a terrible consequence. Consider that if an intelligence agency says there is a 65% chance that an event will happen, it risks being pilloried if it does not—and because the forecast itself says there is a 35% chance it will not happen, that’s a big risk. So what’s the safe thing to do? Stick with elastic language. Forecasters who use “a fair chance” and “a serious possibility” can even make the wrong-side-of-maybe fallacy work for them: If the event happens, “a fair chance” can retroactively be stretched to mean something considerably bigger than 50%—so the forecaster nailed it. If it doesn’t happen, it can be shrunk to something much smaller than 50%—and again the forecaster nailed it. With perverse incentives like these, it’s no wonder people prefer rubbery words over firm numbers.

I remember the same happened to Nate Silver, who had predicted Hillary would be more likely to win and was lynched for that on election day, even though he had forecast a non-zero chance of Trump winning too.
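
The fair way to judge a "75%" forecaster is across many forecasts: about three in four of their 75% calls should come true. A rough sketch of that check, with a hypothetical helper name and an invented track record:

```python
from collections import defaultdict


def calibration_table(forecasts, outcomes):
    """Map each stated probability (as a whole percentage) to the share of
    those forecasts where the event actually happened."""
    buckets = defaultdict(list)
    for p, happened in zip(forecasts, outcomes):
        buckets[round(p * 100)].append(1 if happened else 0)
    return {pct: sum(hits) / len(hits) for pct, hits in sorted(buckets.items())}


# Invented record: twenty "75%" calls (15 came true), eight "25%" calls (2 came true).
forecasts = [0.75] * 20 + [0.25] * 8
outcomes = [True] * 15 + [False] * 5 + [True] * 2 + [False] * 6
table = calibration_table(forecasts, outcomes)
for pct, observed in table.items():
    print(f"said {pct}%: happened {observed:.0%} of the time")
```

In this made-up record the forecaster was on the "wrong side of maybe" seven times and is still perfectly calibrated.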

Pundits who get it wrong a lot

On the pundits – Kudlows, Krugmans, Friedmans – whose career success is independent of the accuracy of their predictions:

Not that being wrong hurt Kudlow’s career. In January 2009, with the American economy in a crisis worse than any since the Great Depression, Kudlow’s new show, The Kudlow Report, premiered on CNBC. That too is consistent with the EPJ data, which revealed an inverse correlation between fame and accuracy: the more famous an expert was, the less accurate he was. That’s not because editors, producers, and the public go looking for bad forecasters. They go looking for hedgehogs, who just happen to be bad forecasters. Animated by a Big Idea, hedgehogs tell tight, simple, clear stories that grab and hold audiences. As anyone who has done media training knows, the first rule is “keep it simple, stupid.” Better still, hedgehogs are confident. With their one-perspective analysis, hedgehogs can pile up reasons why they are right—“furthermore,” “moreover”—without considering other perspectives and the pesky doubts and caveats they raise. And so, as EPJ showed, hedgehogs are likelier to say something definitely will or won’t happen. For many audiences, that’s satisfying. People tend to find uncertainty disturbing and “maybe” underscores uncertainty with a bright red crayon. The simplicity and confidence of the hedgehog impairs foresight, but it calms nerves—which is good for the careers of hedgehogs.

and are particularly prone to hindsight bias:

Kahneman and other pioneers of modern psychology have revealed that our minds crave certainty and when they don’t find it, they impose it. In forecasting, hindsight bias is the cardinal sin. Recall how experts stunned by the Gorbachev surprise quickly became convinced it was perfectly explicable, even predictable, although they hadn’t predicted it.

but even when consistently wrong, pundits are still useful to superforecasters for the questions they raise through their analysis:

Friedman’s conclusion? “If today’s falloff in oil prices is sustained, we’ll also be in for a lot of surprises”—particularly in the petro-states of Venezuela, Iran, and Russia. Here was a vague warning of unspecified surprises in unspecified time frames. As a forecast, that’s not terribly helpful. This sort of thing is why some people see Friedman as a particularly successful and slippery pundit who has mastered the art of appearing to go out on a limb without ever venturing out. But the same column can be read less as a forecast than an attempt to draw the attention of forecasters to something they should be thinking about. In other words, it is a question, not an answer. Whether superforecasters can outpredict Friedman is both unknown and, for present purposes, beside the point. Superforecasters and superquestioners need to acknowledge each other’s complementary strengths, not dwell on each other’s alleged weaknesses.

The Good Judgment Project (forecasting tournaments)

The tournaments were sponsored by IARPA:

In 2006 the Intelligence Advanced Research Projects Activity (IARPA) was created. Its mission is to fund cutting-edge research with the potential to make the intelligence community smarter and more effective. As its name suggests, IARPA was modeled after DARPA, the famous defense agency whose military-related research has had a huge influence on the modern world.

Example questions:

Will the president of Tunisia flee to a cushy exile in the next month? Will an outbreak of H5N1 in China kill more than ten in the next six months? Will the euro fall below $1.20 in the next twelve months?

“Will either the French or Swiss inquiries find elevated levels of polonium in the remains of Yasser Arafat’s body?”

Common fallacy: luck vs number of rolls of the dice

A variant of this fallacy is to single out an extraordinarily successful person, show that it was extremely unlikely that the person could do what he or she did, and conclude that luck could not be the explanation. This often happens in news coverage of Wall Street. Someone beats the market six or seven years in a row, journalists profile the great investor, calculate how unlikely it is to get such results by luck alone, and triumphantly announce that it’s proof of skill. The mistake? They ignore how many other people were trying to do what the great man did. If it’s many thousands, the odds of someone getting that lucky shoot up.

This is why it was important to see if the same superforecasters could consistently (over several tournaments) repeat their Brier scores.
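
A back-of-the-envelope version of the Wall Street example, as a sketch only. It assumes beating the market in any given year is an independent coin flip (50%), which is a simplification I am adding, and the 10,000 investors figure is an arbitrary stand-in for "many thousands".

```python
def chance_of_streak(years, p_beat_market=0.5):
    """Probability that one investor beats the market `years` times in a row by luck."""
    return p_beat_market ** years


def expected_lucky_investors(n_investors, years, p_beat_market=0.5):
    """Average number of investors who would show such a streak by luck alone."""
    return n_investors * chance_of_streak(years, p_beat_market)


def chance_someone_gets_lucky(n_investors, years, p_beat_market=0.5):
    """Probability that at least one of n investors shows the streak by luck alone."""
    p_single = chance_of_streak(years, p_beat_market)
    return 1.0 - (1.0 - p_single) ** n_investors


years, investors = 7, 10_000  # assumed numbers, for illustration only
print(f"one investor:  {chance_of_streak(years):.4f}")
print(f"expected lucky: {expected_lucky_investors(investors, years):.1f}")
print(f"at least one:   {chance_someone_gets_lucky(investors, years):.4f}")
```

Under these assumptions a seven-year streak is under 1% likely for any one person, yet dozens of such streaks are expected in the crowd. Only a repeat performance starts to separate skill from luck.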

Fermi thinking

Found to be common among fox-like superforecasters.


How many piano tuners are there in Chicago? Don’t even think about letting Google find the answer for you. The Italian American physicist Enrico Fermi—a central figure in the invention of the atomic bomb—concocted this little brainteaser decades before the invention of the Internet. And Fermi’s students did not have the Chicago yellow pages at hand. They had nothing. And yet Fermi expected them to come up with a reasonably accurate estimate.

Breaking down the answer:

What Fermi understood is that by breaking down the question, we can better separate the knowable and the unknowable. So guessing—pulling a number out of the black box—isn’t eliminated. But we have brought our guessing process out into the light of day where we can inspect it. And the net result tends to be a more accurate estimate than whatever number happened to pop out of the black box when we first read the question.

Superforecasters do this a lot and combine it with Bayesian updating.
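
The piano-tuner breakdown can be sketched in a few lines. Every number below is an illustrative guess of mine, not an authoritative figure. The point is that each guess becomes a named parameter you can inspect and challenge.

```python
def fermi_piano_tuners(
    population=2_500_000,        # guess: people in Chicago
    people_per_piano=100,        # guess: one household piano per 100 people
    institution_multiplier=2.0,  # guess: schools, bars, churches double the count
    tunings_per_year=1.0,        # guess: each piano tuned about once a year
    hours_per_tuning=2.0,        # guess: time per tuning
    tuner_hours_per_year=1_600,  # guess: working hours left after travel and holidays
):
    """Estimate the number of piano tuners from a chain of explicit guesses."""
    pianos = population / people_per_piano * institution_multiplier
    tuning_hours_needed = pianos * tunings_per_year * hours_per_tuning
    return tuning_hours_needed / tuner_hours_per_year


estimate = fermi_piano_tuners()
print(f"roughly {estimate:.0f} piano tuners")

# Challenge one guess at a time to see how much the answer depends on it.
print(f"if pianos are twice as rare: {fermi_piano_tuners(people_per_piano=200):.0f}")
```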

The importance of base rates (starting with the outside view)

My question: How likely is it that the Renzettis have a pet? To answer that, most people would zero in on the family’s details. “Renzetti is an Italian name,” someone might think. “So are ‘Frank’ and ‘Camila.’ That may mean Frank grew up with lots of brothers and sisters, but he’s only got one child. He probably wants to have a big family but he can’t afford it. So it would make sense that he compensated a little by getting a pet.” Someone else might think, “People get pets for kids and the Renzettis only have one child, and Tommy isn’t old enough to take care of a pet. So it seems unlikely.” This sort of storytelling can be very compelling, particularly when the available details are much richer than what I’ve provided here. But superforecasters wouldn’t bother with any of that, at least not at first. The first thing they would do is find out what percentage of American households own a pet. Statisticians call that the base rate—how common something is within a broader class. Daniel Kahneman has a much more evocative visual term for it. He calls it the “outside view”—in contrast to the “inside view,” which is the specifics of the particular case. A few minutes with Google tells me about 62% of American households own pets. That’s the outside view here. Starting with the outside view means I will start by estimating that there is a 62% chance the Renzettis have a pet. Then I will turn to the inside view—all those details about the Renzettis—and use them to adjust that initial 62% up or down.
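
The passage does not say how to do the "adjust up or down" step. One common way is Bayes' rule in odds form, sketched below. The 62% base rate comes from the passage above. The likelihood ratios are invented placeholders for how much more (or less) likely a piece of evidence is if the family has a pet.

```python
def update_probability(prior, likelihood_ratio):
    """Bayesian update in odds form: posterior odds = prior odds * likelihood ratio."""
    if not 0.0 < prior < 1.0:
        raise ValueError("prior must be strictly between 0 and 1")
    if likelihood_ratio <= 0.0:
        raise ValueError("likelihood ratio must be positive")
    prior_odds = prior / (1.0 - prior)
    posterior_odds = prior_odds * likelihood_ratio
    return posterior_odds / (1.0 + posterior_odds)


estimate = 0.62  # outside view: share of American households with a pet

# Inside view: hypothetical evidence, each with a made-up likelihood ratio.
evidence = {
    "house with a yard": 1.5,      # assumed to be more common among pet owners
    "child too young to help": 0.8,  # assumed to be slightly less common among them
}
for clue, ratio in evidence.items():
    estimate = update_probability(estimate, ratio)
    print(f"after '{clue}': {estimate:.2f}")
```

A ratio of 1.0 leaves the estimate unchanged, so weak or irrelevant details barely move you off the base rate.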

Conversations with Tyler interview

Worth a listen… goes even more in-depth than the book, e.g. on policy applications of Tetlock’s research. Some classic Tyler grilling.

📚 About “Highlights”

I’m trying to write up raw notes/highlights on books I’ve recently finished (for lack of having time to write proper reviews). This pushes me to reflect a bit on what I’ve learned and have notes to go back to. It may also be of use to you, Dear Reader, if you are curious about the book! 🙂

