Probably Overthinking It: 2014

Wednesday, December 17, 2014

Statistics tutorials at PyCon 2015

I am happy to announce that I will offer two statistics tutorials at PyCon 2015 on April 9 in Montreal. In the morning session I am teaching Bayesian Statistics Made Simple, which I have taught several times before, including the last three PyCons. In the afternoon I am offering a new tutorial, Statistical Inference with Computational Methods.

The whole tutorial schedule is here, along with registration information. And here are some details about the tutorials:

Bayesian statistics made simple

Audience level:: Intermediate
Category:: Science

Description

An introduction to Bayesian statistics using Python. Bayesian statistics are usually presented mathematically, but many of the ideas are easier to understand computationally. People who know Python can get started quickly and use Bayesian analysis to solve real problems. This tutorial is based on material and case studies from Think Bayes (O’Reilly Media).

Abstract

Bayesian statistical methods are becoming more common and more important, but there are not many resources to help beginners get started. People who know Python can use their programming skills to get a head start.

I will present simple programs that demonstrate the concepts of Bayesian statistics, and apply them to a range of example problems. Participants will work hands-on with example code and practice on example problems.

Attendees should have at least basic level Python and basic statistics. If you learned about Bayes’s theorem and probability distributions at some time, that’s enough, even if you don’t remember it!

Attendees should bring a laptop with Python and matplotlib. You can work in any environment; you just need to be able to download a Python program and run it. I will provide code to help attendees get set up ahead of time.

More information and registration here.

Statistical inference with computational methods

Audience level:: Intermediate
Category:: Science

Description

Statistical inference is a fundamental tool in science and engineering, but it is often poorly understood. This tutorial uses computational methods, including Monte Carlo simulation and resampling, to explore estimation, hypothesis testing and statistical modeling. Attendees will develop understanding of statistical concepts and learn to use real data to answer relevant questions.

Abstract

Do you know the difference between standard deviation and standard error? Do you know what statistical test to use for any occasion? Do you really know what a p-value is? How about a confidence interval?

Most students don’t really understand these concepts, even after taking several statistics classes. The problem is that these classes focus on mathematical methods that bury the concepts under a mountain of details.

This tutorial uses Python to implement simple statistical experiments that develop deep understanding. Attendees will learn about resampling and related tools that use random simulation to perform statistical inference, including estimation and hypothesis testing. We will use pandas, which provides structures for data analysis, along with NumPy and SciPy.

I will present examples using real-world data to answer relevant questions. The tutorial material is based on my book, Think Stats, a class I teach at Olin College, and my blog, “Probably Overthinking It.”

More information and registration here.

Thursday, December 4, 2014

The Rock Hyrax Problem

This is the third of a series of articles about Bayesian analysis. The previous article is here.

Earlier this semester I posed this problem to my Bayesian statistics class at Olin College:

Suppose I capture and tag 10 rock hyraxes. Some time later, I capture another 10 hyraxes and find that two of them are already tagged. How many hyraxes are there in this environment?

This is an example of a mark and recapture experiment, which you can read about on Wikipedia. The Wikipedia page also includes the photo of a tagged hyrax shown above.

As always with problems like this, we have to make some modeling assumptions.

1) For simplicity, you can assume that the environment is reasonably isolated, so the number of hyraxes does not change between observations.

2) And you can assume that each hyrax is equally likely to be captured during each phase of the experiment, regardless of whether it has been tagged. In reality, it is possible that tagged animals would avoid traps in the future, or possible that the same behavior that got them caught the first time makes them more likely to be caught again. But let's start simple.

My solution to this problem uses the computation framework from my book, Think Bayes. The framework is described in this notebook. If you have read Think Bayes or attended one of my workshops, you might want to attempt this problem before you look at my solution.

If you solve this problem analytically, or use MCMC, and you want to share your solution, please let me know and I will post it here.

And when you are ready, you can see my solution in this notebook.

I will post more of the exercises from my class over the next few weeks.

UPDATE December 5, 2014: João Neto posted a solution to this problem in BUGS using a Jeffrey's prior.

Tuesday, November 25, 2014

The World Cup Problem Part 2: Germany v. Argentina

This is the second of two articles about Bayesian analysis applied to World Cup soccer. The previous article is here.

Earlier this semester I posed this problem to my Bayesian statistics class at Olin College:

In the final match of the 2014 FIFA World Cup, Germany defeated Argentina 1-0. How much evidence does this victory provide that Germany had the better team? What is the probability that Germany would win a rematch?

Before you can answer a question like this, you have to make some modeling decisions. As I suggested to my class, scoring in games like soccer and hockey can be well modeled by a Poisson process, which assumes that each team, against a given opponent, will score goals at some goal-scoring rate, λ, and that this rate is stationary; in other words, the probability of scoring a goal is about the same at any point during the game.

If this model holds, we expect the distribution of time between goals to be exponential, and the distribution of goals per game to be Poisson.

My solution to this problem uses the computation framework from my book, Think Bayes. The framework is described in this notebook. If you have read Think Bayes or attended one of my workshops, you might want to attempt this problem before you look at my solution.

If you solve this problem analytically, or use MCMC, and you want to share your solution, please let me know and I will post it here.

And when you are ready, you can see my solution in this notebook.

I will post more of the exercises from my class over the next few weeks.

One technical note: Last week on reddit, someone asked about my use of "fake data" to construct the prior distribution. This move is not as bogus as it sounds; it is just a convenient way to construct a prior that has the right mean (or whatever other statistics are known) and a shape that is consistent with background information.

UPDATE November 25, 2014: Cameron Davidson-Pilon has once again risen to the challenge and implemented a solution using PyMC. You can see his solution here.

If you want to learn more about PyMC, an excellent place to start is Cameron's online book Bayesian Methods for Hackers.

Tuesday, November 18, 2014

The World Cup Problem: Germany v. Brazil

Earlier this semester I posed this problem to my Bayesian statistics class at Olin College:

In the 2014 FIFA World Cup, Germany played Brazil in a semifinal match. Germany scored after 11 minutes and again at the 23 minute mark. At that point in the match, how many goals would you expect Germany to score after 90 minutes? What was the probability that they would score 5 more goals (as, in fact, they did)?

Before you can answer a question like this, you have to make some modeling decisions. As I suggested to my class, scoring in games like soccer and hockey can be well modeled by a Poisson process, which assumes that each team, against a given opponent, will score goals at some goal-scoring rate, λ, and that this rate is stationary; in other words, the probability of scoring a goal is about the same at any point during the game.

If this model holds, we expect the distribution of time between goals to be exponential, and the distribution of goals per game to be Poisson.

My solution to this problem uses the computation framework from my book, Think Bayes. The framework is described in this notebook. If you have read Think Bayes or attended one of my workshops, you might want to attempt this problem before you look at my solution.

If you solve this problem analytically, or use MCMC, and you want to share your solution, please let me know and I will post it here.

And when you are ready, you can see my solution in this notebook.

I will post more of the exercises from my class over the next few weeks. Coming next: The World Cup Problem Part II: Germany v. Argentina.

UPDATE November 19, 2014: Cameron Davidson-Pilon kindly (and quickly!) responded to my request for a solution to this problem using PyMC. You can see his solution here. If you want to learn more about PyMC, an excellent place to start is Cameron's online book Bayesian Methods for Hackers.

Saturday, October 4, 2014

On efficient algorithms for finding the goddamn endnotes

Introduction

In many recent books, the goddamn endnotes are numbered sequentially within each chapter, but chapter numbers seldom appear in the header on each page. So a reader who wants to find a goddamn endnote typically has to search backward to figure out which chapter they are reading before they can find the goddamn endnote. Several simple changes can make this process more efficient and less infuriating.

In this paper, we present a novel algorithm for finding a goddamn endnote (FGE) in O(1) time; that is, time bounded by a constant regardless of the size or number of chapters.

Background

The most widely deployed algorithm for FGE requires O(n+m) time; that is, time proportional to either n, the number of pages in a chapter, or m, the number of chapters, whichever is larger. The following pseudocode sketches this algorithm:

1) Store the endnote number you want to find as e.
2) Figure out what chapter you are reading by searching for the beginning of the chapter, and store the result as c.
3) Find the footnotes at the end of the chapter and find the footnotes for chapter c.
4) Look up footnote e.

The most common implementation of Step 2 is linear in n, the number of pages in the chapter. Some readers can execute a binary search by comparing chapter titles in the header, reducing search time to O(log n).

Similarly, the most common implementation of Step 3 is linear in m, the number of chapters, but some readers perform a binary search.

Assuming that the number of footnotes in a chapter is bounded by a constant, we consider Step 4 to be constant time.

Alternatives

1) Put the goddamn chapter number (GCN) in the goddamn header. This simple change immediately reduces Step 2 to constant time.

2) Number the goddamn endnotes from the beginning to the end of the book. Do not start over at the beginning of each goddamn chapter. This change also reduces Step 2 to constant time, but with this change Step 4 may no longer be considered constant time.

3) Along with the endnote number, e, also include p, the page where the goddamn endnote appears. This change makes Step 2 constant time; and Step 3, although technically log time, practically constant time for most readers.

4) Put the goddamn endnotes at the bottom of the goddamn page (cf. footnotes). For a bounded page size, looking at the bottom of the page is a constant time operation.

Recommendations

We recommend that publishers replace existing linear-time algorithms with any of several easy-to-implement, efficient alternatives, goddammit.

Monday, September 29, 2014

Two hour marathon in 2041

Today Dennis Kimetto ran the Berlin Marathon in 2:02:57, shaving 26 seconds off the previous world record. That means it's time to check in on the world record progression and update my update from last year of my article from two years ago. The following is a revised version of that article, including the new data point.

Abstract: I propose a model that explains why world record progressions in running speed are improving linearly, and should continue as long as the population of potential runners grows exponentially. Based on recent marathon world records, I extrapolate that we will break the two-hour barrier in 2041.

-----

Let me start with the punchline:

The blue points show the progression of marathon records since 1970, including the new mark. The blue line is a least-squares fit to the data, and the red line is the target pace for a two-hour marathon, 13.1 mph. The blue line hits the target pace in 2041.

In general, linear extrapolation of a time series is a dubious business, but in this case I think it is justified:

1) The distribution of running speeds is not a bell curve. It has a long tail of athletes who are much faster than normal runners. Below I propose a model that explains this tail, and suggests that there is still room between the fastest human ever born and the fastest possible human.

2) I’m not just fitting a line to arbitrary data; there is theoretical reason to expect the progression of world records to be linear, which I present below. And since there is no evidence that the curve is starting to roll over, I think it is reasonable to expect it to continue for a while.

3) Finally, I am not extrapolating beyond reasonable human performance. The target pace for a two-hour marathon is 13.1 mph, which is slower than the current world record for the half marathon (58:23 minutes, or 13.5 mph). It is unlikely that the current top marathoners will be able to maintain this pace for two hours, but we have no reason to think that it is beyond theoretical human capability.

My model, and the data it is based on, are below.

-----

In April 2011, I collected the world record progression for running events of various distances and plotted speed versus year (here's the data, mostly from Wikipedia). The following figure shows the results:

You might notice a pattern; for almost all of these events, the world record progression is a remarkably straight line. I am not the first person to notice, but as far as I know, no one has proposed an explanation for the shape of this curve.

Until now -- I think I know why these lines are straight. Here are the pieces:

1) Each person's potential is determined by several factors that are independent of each other; for example, your VO2 max and the springiness of your tendons are probably unrelated.

2) Each runner's overall potential is limited by their weakest factor. For example, if there are 10 factors, and you are really good at 9 of them, but below average on the 10th, you will probably not be a world-class runner.

3) As a result of (1) and (2), potential is not normally distributed; it is long-tailed. That is, most people are slow, but there are a few people who are much faster.

4) Runner development has the structure of a pipeline. At each stage, the fastest runners are selected to go on to the next stage.

5) If the same number of people went through the pipeline each year, the rate of new records would slow down quickly.

6) But the number of people in the pipeline actually grows exponentially.

7) As a result of (5) and (6) the rate of new records is linear.

8) This model suggests that linear improvement will continue as long as the world population grows exponentially.

Let's look at each of those pieces in detail:

Physiological factors that determine running potential include VO2 max, anaerobic capacity, height, body type, muscle mass and composition (fast and slow twitch), conformation, bone strength, tendon elasticity, healing rate and probably more. Psychological factors include competitiveness, persistence, tolerance of pain and boredom, and focus.

Most of these factors have a large component that is inherent, they are mostly independent of each other, and any one of them can be a limiting factor. That is, if you are good at all of them, and bad at one, you will not be a world-class runner. To summarize: there is only one way to be fast, but there are a lot of ways to be slow.

As a simple model of these factors, we can generate a random person by picking N random numbers, where each number is normally-distributed under a logistic transform. This yields a bell-shaped distribution bounded between 0 and 1, where 0 represents the worst possible value (at least for purposes of running speed) and 1 represents the best possible value.

Then to represent the running potential of each person, we compute the minimum of these factors. Here's what the code looks like:

def GeneratePerson(n=10):
factors = [random.normalvariate(0.0, 1.0) for i in range(n)]
logs = [Logistic(x) for x in factors]
return min(logs)

Yes, that's right, I just reduced a person to a single number. Cue the humanities majors lamenting the blindness and arrogance of scientists. Then explain that this is supposed to be an explanatory model, so simplicity is a virtue. A model that is as rich and complex as the world is not a model.

Here's what the distribution of potential looks like for different values of N:

When N=1, there are many people near the maximum value. If we choose 100,000 people at random, we are likely to see someone near 98% of the limit. But as N increases, the probability of large values drops fast. For N=5, the fastest runner out of 100,000 is near 85%. For N=10, he is at 65%, and for N=50 only 33%.

In this kind of lottery, it takes a long time to hit the jackpot. And that's important, because it suggests that even after 7 billion people, we might not have seen anyone near the theoretical limit.

Let's see what effect this model has on the progression of world records. Imagine that we choose a million people and test them one at a time for running potential (and suppose that we have a perfect test). As we perform tests, we keep track of the fastest runner we have seen, and plot the "world record" as a function of the number of tests.

Here's the code:

def WorldRecord(m=100000, n=10):
data = []
best = 0.0
for i in xrange(m):
person = GeneratePerson(n)
if person > best:
best = person
data.append(i/m, best))
return data

And here are the results with M=100,000 people and the number of factors N=10:

The x-axis is the fraction of people we have tested. The y-axis is the potential of the best person we have seen so far. As expected, the world record increases quickly at first and then slows down.

In fact, the time between new records grows geometrically. To see why, consider this: if it took 100 people to set the current record, we expect it to take 100 more to exceed the record. Then after 200 people, it should take 200 more. After 400 people, 400 more, and so on. Since the time between new records grows geometrically, this curve is logarithmic.

So if we test the same number of people each year, the progression of world records is logarithmic, not linear. But if the number of tests per year grows exponentially, that's the same as plotting the previous results on a log scale. Here's what you get:

That's right: a log curve on a log scale is a straight line. And I believe that that's why world records behave the way they do.

This model is unrealistic in some obvious ways. We don't have a perfect test for running potential and we don't apply it to everyone. Not everyone with the potential to be a runner has the opportunity to develop that potential, and some with the opportunity choose not to.

But the pipeline that selects and trains runners behaves, in some ways, like the model. If a person with record-breaking potential is born in Kenya, where running is the national sport, the chances are good that he will be found, have opportunities to train, and become a world-class runner. It is not a certainty, but the chances are good.

If the same person is born in rural India, he may not have the opportunity to train; if he is in the United States, he might have options that are more appealing.

So in some sense the relevant population is not the world, but the people who are likely to become professional runners, given the talent. As long as this population is growing exponentially, world records will increase linearly.

That said, the slope of the line depends on the parameter of exponential growth. If economic development increases the fraction of people in the world who have the opportunity to become professional runners, these curves could accelerate.

So let's get back to the original question: when will a marathoner break the 2-hour barrier? Before 1970, marathon times improved quickly; since then the rate has slowed. But progress has been consistent for the last 40 years, with no sign of an impending asymptote. So I think it is reasonable to fit a line to recent data and extrapolate. Here are the results [based on 2011 data; see above for the update]:

The red line is the target: 13.1 mph. The blue line is a least squares fit to the data. So here's my prediction: there will be a 2-hour marathon in 2041. I'll be 76, so my chances of seeing it are pretty good. But that's a topic for another article (see Chapter 1 of Think Bayes).

Thursday, September 25, 2014

Bayesian election forecasting

Last week Nate Silver posted this article explaining how the FiveThirtyEight Senate forecast model works. If you are familiar with Silver's work, you probably know that (1) he has been notably successful at predicting outcomes of elections, and (2) he is an advocate for Bayesian statistics. In last week's article, Silver provides a high-level view of his modeling philosophy and some details about how his model works, but he didn't say much that is explicitly Bayesian.

His first principle of modeling hints at some Bayesian ideas:

Principle 1: A good model should be probabilistic, not deterministic.
The FiveThirtyEight model produces probabilistic forecasts as opposed to hard-and-fast predictions.[...] the FiveThirtyEight model might estimate the Democrat has a 20 percent chance of winning the Senate race in Kentucky. My view is that this is often the most important part of modeling — and often the hardest part.

Silver suggests that it is more useful to report the probability of winning than the margin of victory. Bayesian models are generally good at this kind of thing; classical statistical methods are not. But Silver never makes the connection between this principle and Bayesian methods; in fact the article doesn't mention Bayes at all.

And that's fine; it was not central to the point of his article. But since I am teaching my Bayesian statistics class this semester, I will take this opportunity to fill in some details. I don't know anything about Silver's model other than what's in his article, but I think it is a good guess that there is something in there similar to what follows.

But before I get into it, here's an outline:

I present an example problem and formulate a solution using a Bayesian framework.
I develop Python code to compute a solution; if you don't speak Python, you can skip this part.
I show results for an update with a single poll result.
I show how to combine results from a different polls.

My example supports Silver's argument that it is more useful to predict the probability of winning than the margin of victory: after the second update, the predicted margin of victory decreases, but the probability of winning increases. In this case, predicting only the margin of victory would misrepresent the effect of the second poll.

Formulating the problem

Here's the exercise I presented to my class:

Exercise 1: The polling company Strategic Vision reports that among likely voters, 53% intend to vote for your favorite candidate and 47% intend to vote for the opponent (let's ignore undecided voters for now). Suppose that, based on past performance, you estimate that the distribution of error for this company has mean 1.1 percentage point (in favor of your candidate) and standard deviation 3.7 percentage points. What is the probability that the actual fraction of likely voters who favor your candidate is less than 50%?

Strategic Vision is an actual polling company, but other than that, everything about this example is made up. Also, the standard deviation of the error is probably lower than what you would see in practice.

To solve this problem, we can treat the polling company like a measurement instrument with known error characteristics. If we knew the actual fraction of the electorate planning to vote for your candidate, which I'll call A for "actual", and we knew the distribution of the error, ε, we could compute the distribution of the measured value, M:

M = A + ε

But in this case we want to solve the inverse problem: given a measurement M and the distribution of ε, compute the posterior distribution of A. As always with this kind of problem, we need three things:

1) A prior distribution for A,
2) Data that allow us to improve the estimate of A, and
3) A likelihood function that computes the probability of the data for hypothetical values of A.

Once you have these three things, the Bayesian framework does the rest.

Implementing the solution

To demonstrate, I use the Suite class from thinkbayes2, which is a Python module that goes with my book, Think Bayes. The Suite class is documented here, but I will explain what you need to know below. You can download the code from this file in this GitHub repository.

I'll start by defining a new class called Electorate that inherits methods from thinkbayes2.Suite:

class Electorate(thinkbayes2.Suite):
"""Represents hypotheses about the state of the electorate."""

As a starting place, I'll create a uniform prior distribution. In practice this would not be a good choice, for reasons I'll explain soon, but it will allow me to make a point.

hypos = range(0, 101)

suite = Electorate(hypos)

hypos is a list of integers from 0 to 100, representing the percentage of the electorate planning to vote for your candidate.

I'll represent the data with a tuple of three values: the estimated bias of the polling company in percentage points, the standard deviation of their previous errors (also in percentage points), and the result of the poll:

data = 1.1, 3.7, 53

suite.Update(data)

When we call Update, it loops through the hypotheses and computes the likelihood of the data under each hypothesis; we have to provide a Likelihood function that does this calculation:

class Electorate(thinkbayes2.Suite):

"""Represents hypotheses about the electorate."""

def Likelihood(self, data, hypo):

"""Likelihood of the data under the hypothesis.

hypo: fraction of the population

data: poll results

"""

bias, std, result = data

error = result - hypo

like = thinkbayes2.EvalNormalPdf(error, bias, std)

return like

Likelihood unpacks the given data into bias, std, and result. Given a hypothetical value for A, it computes the hypothetical error. For example, if hypo is 50 and result is 53, that means the poll is off by 3 percentage points. The resulting likelihood is the probability that we would be off by that much, given the bias and standard deviation of the poll.

We estimate this probability by evaluating the normal/Gaussian distribution with the given parameters. I am assuming that the distribution of errors is approximately normal, which is probably not a bad assumption when the probabilities are near 50%.

One technical detail: The result of EvalNormalPdf is actually a probability density, not a probability. But the result from Likelihood doesn't actually have to be a probability; it only has to be proportional to a probability, so a probability density will do the job nicely.

The results

And that's it -- we've solved the problem! Here are the results:

The prior distribution is uniform from 0 to 100. The mean of the posterior is 51.9, which makes sense because the result is 53 and the known bias is 1.1, so the posterior mean is (53 - 1.1). The standard deviation of the posterior is 3.7, the same as the standard deviation of the error.

To compute the probability of losing the election (if it were held today), we can loop through the hypotheses and add up the probability of all values less than 50%. The Suite class provides ProbLess, which does that calculation. The result is 0.26, which means your candidate is a 3:1 favorite.

In retrospect we could have computed this posterior analytically with a lot less work, which is the point I wanted to make by using a uniform prior. But in general it's not quite so simple, as we can see by incorporating a second poll:

Exercise 2: What if another poll comes out from Research 2000 showing that 49% of likely voters intend to vote for your candidate, but past poll show that this company's results tend to favor the opponent by 2.3 points, and their past errors (after correcting for this bias) have standard deviation 4.1 points. Now what should you believe?

The second update looks just like the first:

data = -2.3, 4.1, 49
suite.Update(data)

The bias is negative now because this polling company (in my fabricated world) tends to favor the opponent. Here are the results after the second update:

The mean of the new posterior is 51.6, slightly lower than the mean after the first update, 51.9. The two polls are actually consistent with each other after we correct for the biases of the two companies.

The predicted margin of victory is slightly smaller, but the uncertainty of the prediction is also smaller. Based on the second update, the probability is 0.22, which means your candidate is now nearly a 4:1 favorite.

Again, this example demonstrates Silver's point: predicting the probability of winning is more meaningful that predicting the margin of error. And that's exactly the kind of thing Bayesian models are good for.

One more technical note: This analysis is based on the assumption that errors in one poll iare independent of errors in another. It seems likely that in practice there is correlation between polls; in that case we could extend this solution to model the errors with a joint distribution that includes the correlation.

Sunday, September 14, 2014

Regression with Python, pandas and StatsModels

I was at Boston Data-Con 2014 this morning, which was a great event. The organizer, John Verostek, seems to have created this three-day event single-handedly, so I am hugely impressed.

Imran Malek started the day with a very nice iPython tutorial. The description is here, and his slides are here. He grabbed passenger data from the MBTA and generated heat maps showing the number of passengers at each stop in the system during each hour. The tutorial covered a good range of features, and it seemed like many of the participants were able to download the data and follow along in iPython.

And Imran very kindly let me use his laptop to project slides for my talk, which was next. The description of my talk is here:

Regression is a powerful tool for fitting data and making predictions. In this talk I present the basics of linear regression and logistic regression and show how to use them in Python. I demonstrate pandas, a Python module that provides structures for data analysis, and StatsModels, a module that provides tools for regression and other statistical analysis.

As an example, I will use data from the National Survey of Family Growth to generate predictions for the date of birth, weight, and sex of an expected baby. This presentation is based on material from the recent revision of Think Stats, an introduction to data analysis and statistics with Python.

This talk is appropriate for people with no prior experience with regression. Basic familiarity with Python is recommended but not required.

And here are my slides:

The material for this talk is from the second edition of Think Stats, which is in production now and scheduled for release in early November. My draft is available here, and you can pre-order a paper copy here.

As I expected, I prepared way more material than I could present. The audience had some excellent questions, so we spent more time on linear regression and did not get to logistic regression.

The nice people at O'Reilly Media sent over 25 copies of my book, Think Python, so we had a book signing after the talk. I had a chance to chat with everyone who got a book, which is always a lot of fun.

I believe video of the talk will be available soon. I will post it here when it is.

Many thanks to John Verostek for organizing this conference, and to the sponsors for breakfast and lunch. Looking forward to next year!

Friday, August 29, 2014

New study: vaccines prevent disease and death

According to an exciting new study, childhood vaccines are almost miraculously effective at preventing suffering and death due to infectious disease.

Sadly, that is not actually a headline, because it doesn't generate clicks. What does generate clicks? This: Journal questions validity of autism and vaccine study (CNN Health).

If, at this point, headlines like this make you roll your eyes and click on more interesting things, like "Teen catches one-in-2 million lobster," let me assure you that you are absolutely right. In fact, if you want to skip the rest of this post, and read something that will make you happy to live in the 21st century, I recommend this list of vaccine-preventable diseases.

But if you have already heard about this new paper and you are curious to know why it is, from a statistical point of view, completely bogus, read on!

Let me start by reviewing the basics of statistical hypothesis testing. Suppose you see an apparent effect like a difference in risk of autism between groups of children who are vaccinated at different ages. You might wonder whether the effect you see in your selected sample is likely to exist in the general population, or whether it might appear by chance in your sample only, and not in the general population.

To answer at least part of that question, you can compute a p-value, which is the probability of seeing a difference as big as the one you saw if, in fact, there is no difference between the groups. If this value is small, you can conclude that the difference is unlikely to be an artifact of random sampling. And if you are willing to cut a couple of logical corners, you can conclude that the effect is more likely to exist in the general population.

In many fields it is common to use 5% as a magical threshold for statistical significance. If the p-value is less than 5%, the effect is considered statistically significant and, more importantly, fit for publication. If it comes in at 6%, it doesn't count.

This is, of course, arbitrary and mostly silly, but it does have one virtue. If we follow this process with care, it should yield a known and small false positive rate: if, in fact, there is no difference between the groups, we should expect to conclude, incorrectly, that the apparent difference is significant about 5% of the time.

But that expectation only holds if we apply hypothesis tests very carefully. In practice, people don't, and there is accumulating evidence that the false positive rate in published research is much higher than 5%, possibly higher than 50%.

There are lots of ways to mess up hypothesis testing, but one of the most common is performing multiple tests. Every time you perform a test you have a 5% chance of generating a false positive, so if you perform 20 independent tests, you should expect about 1 false positive. This problem is ably demonstrated by this classic xkcd cartoon:

If you are not already familiar with xkcd, you're welcome.

So let's get back to the paper reported in the CNN article. Just looking at the results that appear in the tables, this paper reports 35 p-values. Five of them are identified as significant at the 5% level. That's more than the expected number of false positives, but they might still be due to chance (especially since the tests are not independent).

Fortunately, there is a simple process that corrects for multiple tests, the Holm-Bonferroni method. You can get the details from the Wikipedia article, but I'll give you a quick-and dirty version here: if you perform n tests, the lowest p-value needs to be below 0.05 / n to be considered statistically significant.

Since the paper reports 35 tests, their threshold is 0.05 / 35, which is 0.0014. Since their lowest p-value is 0.0019, it should not be considered statistically significant, and neither should any of the other test results.

And I am being generous by assuming that the authors only performed 35 tests. It is likely that they performed many more and chose carefully which ones to report. And I assume that they are doing everything else right as well.

But even with the benefit of the doubt, these results are not statistically significant. Given how hard the authors apparently tried to find evidence that vaccines cause autism, the fact that they failed should be considered evidence that vaccines do NOT cause autism.

Before I read this paper, I was nearly certain that vaccines do not cause autism. After reading this paper, I am very slightly more certain that vaccines do not cause autism. And by the way, I am also nearly certain that vaccines do prevent suffering and death due to infectious disease.

Now go check out the video of that blue lobster.

UPDATE: Or go read this related article by my friend Ted Bunn.

Friday, August 22, 2014

An exercise in hypothesis testing

I've just turned in the manuscript for the second edition of Think Stats. If you're dying to get your hands on a copy, you can pre-order one here.

Most of the book is about computational methods, but in the last chapter I break out some analytic methods, too. In the last section of the book, I explain the underlying philosophy:

This book focuses on computational methods like resampling and permutation. These methods have several advantages over analysis:

They are easier to explain and understand. For example, one of the most difficult topics in an introductory statistics class is hypothesis testing. Many students don’t really understand what p-values are. I think the approach I presented in Chapter 9—simulating the null hypothesis and computing test statistics—makes the fundamental idea clearer.

They are robust and versatile. Analytic methods are often based on assumptions that might not hold in practice. Computational methods require fewer assumptions, and can be adapted and extended more easily.

They are debuggable. Analytic methods are often like a black box: you plug in numbers and they spit out results. But it’s easy to make subtle errors, hard to be confident that the results are right, and hard to find the problem if they are not. Computational methods lend themselves to incremental development and testing, which fosters confidence in the results.

But there is one drawback: computational methods can be slow. Taking into account these pros and cons, I recommend the following process:

Use computational methods during exploration. If you find a satisfactory answer and the run time is acceptable, you can stop.

If run time is not acceptable, look for opportunities to optimize. Using analytic methods is one of several methods of optimization.

If replacing a computational method with an analytic method is appropriate, use the computational method as a basis of comparison, providing mutual validation between the computational and analytic results.

For the vast majority of problems I have worked on, I didn’t have to go past Step 1.

The last exercise in the book is based on a question my colleague, Lynn Stein, asked me for a paper she was working on:

In a recent paper², Stein et al. investigate the effects of an intervention intended to mitigate gender-stereotypical task allocation within student engineering teams. Before and after the intervention, students responded to a survey that asked them to rate their contribution to each aspect of class projects on a 7-point scale.

Before the intervention, male students reported higher scores for the programming aspect of the project than female students; on average men reported a score of 3.57 with standard error 0.28. Women reported 1.91, on average, with standard error 0.32.

Question 1: Compute a 90% confidence interval and a p-value for the gender gap (the difference in means).

After the intervention, the gender gap was smaller: the average score for men was 3.44 (SE 0.16); the average score for women was 3.18 (SE 0.16). Again, compute the sampling distribution of the gender gap and test it.

Question 2: Compute a 90% confidence interval and a p-value for the change in gender gap.

[2] Stein et al. “Evidence for the persistent effects of an intervention to mitigate gender-sterotypical task allocation within student engineering teams,” Proceedings of the IEEE Frontiers in Education Conference, 2014.

In the book I present ways to do these computations, and I will post my "solutions" here soon. But first I want to pose these questions as a challenge for statisticians and people learning statistics. How would you approach these problems?

The reason I ask: Question 1 is pretty much a textbook problem; you can probably find an online calculator to do it for you. But you are less likely to find a canned solution to Question 2, so I am curious to see how people go about it. I hope to post some different solutions soon.

By the way, this is not meant to be a "gotcha" question. If some people get it wrong, I am not going to make fun of them. I am looking for different correct approaches; I will ignore mistakes, and only point out incorrect approaches if they are interestingly incorrect.

You can post a solution in the comments below, or discuss it on reddit.com/r/statistics, or if you don't want to be influenced by others, send me email at downey at allendowney dot com.

UPDATE August 26, 2014

The discussion of this question on reddit.com/r/statistics was as interesting as I hoped. People suggested several very different approaches to the problem. The range of responses extends from something like, "This is a standard problem with a known, canned answer," to something like, "There are likely to be dependencies among the values that make the standard model invalid, so the best you can do is an upper bound." In other words, the problem is either trivial or impossible!

The approach I had in mind is to compute sampling distributions for the gender gap and the change in gender gap using normal approximations, and then use the sampling distributions to compute standard errors, confidence intervals, and p-values.

I used a simple Python class that represents a normal distribution. Here is the API:

class Normal(object):
"""Represents a Normal distribution"""

def __init__(self, mu, sigma2, label=''):
"""Initializes a Normal object with given mean and variance."""

def __add__(self, other):
"""Adds two Normal distributions."""

def __sub__(self, other):
"""Subtracts off another Normal distribution."""

def __mul__(self, factor):
"""Multiplies by a scalar."""

def Sum(self, n):
"""Returns the distribution of the sum of n values."""

def Prob(self, x):
"""Cumulative probability of x."""

def Percentile(self, p):
"""Inverse CDF of p (0 - 100)."""

The implementation of this class is here.

Here's a solution that uses the Normal class. First we make normal distributions that represent the sampling distributions of the estimated means, using the given means and sampling errors. The variance of the sampling distribution is the sampling error squared:

male_before = normal.Normal(3.57, 0.28**2)
male_after = normal.Normal(3.44, 0.16**2)

female_before = normal.Normal(1.91, 0.32**2)
female_after = normal.Normal(3.18, 0.16**2)

Now we compute the gender gap before the intervention, and print the estimated difference, p-value, and confidence interval:

diff_before = female_before - male_before
print('mean, p-value', diff_before.mu, 1-diff_before.Prob(0))
print('CI', diff_before.Percentile(5), diff_before.Percentile(95))

The estimated gender gap is -1.66 with SE 0.43, 90% CI (-2.3, -0.96) and p-value 5e-5. So that's statistically significant.

Then we compute the gender gap after intervention and the change in gender gap:

diff_after = female_after - male_after

diff = diff_after - diff_before

print('mean, p-value', diff.mu, diff.Prob(0))

print('CI', diff.Percentile(5), diff.Percentile(95))

The estimated change is 1.4 with SE 0.48, 90% CI (0.61, 2.2) and p-value 0.002. So that's statistically significant, too.

This solution is based on a two assumptions:

1) It assumes that the sampling distribution of the estimated means is approximately normal. Since the data are on a Likert scale, the variance and skew are small, so the sum of n values converges to normal quickly. The samples sizes are in the range of 30-90, so the normal approximation is probably quite good.

This claim is based on the Central Limit Theorem, which only applies if the samples are drawn from the population independently. In this case, there are dependencies within teams: for example, if someone on a team does a larger share of a task, the rest of the team necessarily does less. But the team sizes are 2-4 people and the sample sizes are much larger, so these dependencies have short enough "range" that I think it is acceptable to ignore them.

2) Every time we subtract two distributions to get the distribution of the difference, we are assuming that values are drawn from the two distributions independently. In theory, dependencies within teams could invalidate this assumption, but I don't think it's likely to be a substantial effect.

3) As always, remember that the standard error (and confidence interval) indicate uncertainty due to sampling, but say nothing about measurement error, sampling bias, and modeling error, which are often much larger sources of uncertainty.

The approach I presented here is a bit different from what's presented in most introductory stats classes. If you have taken (or taught) a stats class recently, I would be curious to know what you think of this problem. After taking the class, would you be able to solve problems like this? Or if you are teaching, could your students do it?

Your comments are welcome.

Friday, July 18, 2014

More likely to be killed by a terrorist

I am working on the second edition of Think Stats, adding chapters on some topics that didn't make it into the first edition, including survival analysis. Here is the draft of the new chapter; this post contains some of the highlights.

Survival analysis

Survival analysis is a way to describe how long things last. It is often used to study human lifetimes, but it also applies to “survival” of mechanical and electronic components, or more generally to intervals in time before an event.

If someone you know has been diagnosed with a life-threatening disease, you might have seen a “5-year survival rate,” which is the probability of surviving five years after diagnosis. That estimate and related statistics are the result of survival analysis.

As a more cheerful example, I will use data from the National Survey of Family Growth (NSFG) to quantify how long respondents “survive” until they get married for the first time. The range of respondents’ ages is 14 to 44 years, so the dataset provides a snapshot of women at different stages in their lives, in the same way that a medical cohort might include patients at difference stages of disease.

For women who have been married, the dataset includes the date of their first marriage and their age at the time. For women who have not been married, we know their age when interviewed, but have no way of knowing when or if they will get married.

Since we know the age at first marriage for some women, it might be tempting to exclude the rest and compute the distribution of the known data. That is a bad idea. The result would be doubly misleading: (1) older women would be overrepresented, because they are more likely to be married when interviewed, and (2) married women would be overrepresented! In fact, this analysis would lead to the conclusion that all women get married, which is obviously incorrect.

Kaplan-Meier estimation

In this example it is not only desirable but necessary to include observations of unmarried women, which brings us to one of the central algorithms in survival analysis, Kaplan-Meier estimation.
The general idea is that we can use the data to estimate the hazard function, then convert the hazard function to a survival function. To estimate the hazard function, we consider, for each age, (1) the number of women who got married at that age and (2) the number of women “at risk” of getting married, which includes all women who were not married at an earlier age.

The details of the algorithm are in the book; we'll skip to the results:

The top graph shows the estimated hazard function; it is low in the teens, higher in the 20s, and declining in the 30s. It increases again in the 40s, but that is an artifact of the estimation process; as the number of respondents “at risk” decreases, a small number of women getting married yields a large estimated hazard. The survival function will smooth out this noise.

The bottom graph shows the survival function, which shows for each age the fraction of people who are still unmarried at that age. The curve is steepest between 25 and 35, when most women get married. Between 35 and 45, the curve is nearly flat, indicating that women who do not marry before age 35 are unlikely to get married before age 45.

A curve like this was the basis of a famous magazine article in 1986; Newsweek reported that a 40-year old unmarried woman was “more likely to be killed by a terrorist” than get married. These statistics were widely reported and became part of popular culture, but they were wrong then (because they were based on faulty analysis) and turned out to be even more wrong (because of cultural changes that were already in progress and continued). In 2006, Newsweek ran an another article admitting that they were wrong.

I encourage you to read more about this article, the statistics it was based on, and the reaction. It should remind you of the ethical obligation to perform statistical analysis with care, interpret the results with appropriate skepticism, and present them to the public accurately and honestly.

Cohort effects

One of the challenges of survival analysis, and one of the reasons Newsweek was wrong, is that different parts of the estimated curve are based on different groups of respondents. The part of the curve at time t is based on respondents whose age was at least t when they were interviewed. So the leftmost part of the curve includes data from all respondents, but the rightmost part includes only the oldest respondents.

If the relevant characteristics of the respondents are not changing over time, that’s fine, but in this case it seems likely that marriage patterns are different for women born in different generations. We can investigate this effect by grouping respondents according to their decade of birth. Groups like this, defined by date of birth or similar events, are called cohorts, and differences between the groups are called cohort effects.

To investigate cohort effects in the NSFG marriage data, I gathered Cycle 5 data from 1995, Cycle 6 data from 2002, the Cycle 7 data from 2006–2010. In total these datasets include 30,769 respondents.
I divided respondents into cohorts by decade of birth and estimated the survival curve for each cohort:

Several patterns are visible:

Women born in the 50s married earliest, with successive cohorts marrying later and later, at least until age 30 or so.

Women born in the 60s follow a surprising pattern. Prior to age 25, they were marrying at slower rates than their predecessors. After age 25, they were marrying faster. By age 32 they had overtaken the 50s cohort, and at age 44 they are substantially more likely to have married. Women born in the 60s turned 25 between 1985 and 1995. Remembering that the Newsweek article was published in 1986, it is tempting to imagine that the article triggered a marriage boom. That explanation would be too pat, but it is possible that the article and the reaction to it were indicative of a mood that affected the behavior of this cohort.

The pattern of the 70s cohort is similar. They are less likely than their predecessors to be married before age 25, but at age 35 they have caught up with both of the previous cohorts.

Women born in the 80s are even less likely to marry before age 25. What happens after that is not clear; for more data, we have to wait for the next cycle of the NSFG, coming in late fall 2014.

Expected remaining lifetime

Given a survival curve, we can compute the expected remaining lifetime as a function of current age. For example, given the survival function of pregnancy length, we can compute the expected time until delivery.

For example, the following figure (left) shows the expecting remaining pregnancy length as a function of the current duration. During Week 0, the expected remaining duration is about 34 weeks. That’s less than full term (39 weeks) because terminations of pregnancy in the first trimester bring the average down.

The curve drops slowly during the first trimester: after 13 weeks, the expected remaining lifetime has dropped by only 9 weeks, to 25. After that the curve drops faster, by about a week per week.

Between Week 37 and 42, the curve levels off between 1 and 2 weeks. At any time during this period, the expected remaining lifetime is the same; with each week that passes, the destination gets no closer. Processes with this property are called “memoryless,” because the past has no effect on the predictions. This behavior is the mathematical basis of the infuriating mantra of obstetrics nurses: “any day now.”

The figure also shows the median remaining time until first marriage, as a function of age. For an 11 year-old girl, the median time until first marriage is about 14 years. The curve decreases until age 22 when the median remaining time is about 7 years. After that it increases again: by age 30 it is back where it started, at 14 years.

For the marriage data I used median rather than mean because the dataset includes women who are unmarried at age 44. The survival curve is cut off at about 20%, so we can't compute a mean. But the median is well defined as long as more than 50% of the remaining values are known.

Based on this data, young women have decreasing remaining "lifetimes". Mechanical components with this property are called NBUE for "new better than used in expectation," meaning that a new part is expected to last longer.

Women older than 22 have increasing remaining time until first marriage. Components with this property are UBNE for "used better than new in expectation." That is, the older the part, the longer it is expected to last. Newborns and cancer patients are also UBNE; their life expectancy increases the longer they live. Also, people learning to ride a motorcycle.

Details of the calculations in this article are in Think Stats, Chapter 13. The code is in survival.py.

Thursday, July 10, 2014

Bayesian solution to the Lincoln index problem

Last year my occasional correspondent John D. Cook wrote an excellent blog post about the Lincoln index, which is a way to estimate the number of errors in a document (or program) by comparing results from two independent testers. Here's his presentation of the problem:

Suppose you have a tester who finds 20 bugs in your program. You want to estimate how many bugs are really in the program. You know there are at least 20 bugs, and if you have supreme confidence in your tester, you may suppose there are around 20 bugs. But maybe your tester isn't very good. Maybe there are hundreds of bugs. How can you have any idea how many bugs there are? There’s no way to know with one tester. But if you have two testers, you can get a good idea, even if you don’t know how skilled the testers are.

Then he presents the Lincoln index, an estimator "described by Frederick Charles Lincoln in 1930," where Wikpedia's use of "described" is a hint that the index is another example of Stigler's law of eponymy.

Suppose two testers independently search for bugs. Let k1 be the number of errors the first tester finds and k2 the number of errors the second tester finds. Let c be the number of errors both testers find. The Lincoln Index estimates the total number of errors as k1 k2 / c [I changed his notation to be consistent with mine].

So if the first tester finds 20 bugs, the second finds 15, and they find 3 in common, we estimate that there are about 100 bugs.

Of course, whenever I see something like this, the idea that pops into my head is that there must be a (better) Bayesian solution! And there is. You can read and download my solution here.

I represent the data using the tuple (k1, k2, c), where k1 is the number of bugs found by the first tester, k2 is the number of bugs found by the second, and c is the number they find in common.

The hypotheses are a set of tuples (n, p1, p2), where n is the actual number of errors, p1 is the probability that the first tester finds any given bug, and p2 is the probability for the second tester.

Now all I need is a likelihood function:

class Lincoln(thinkbayes.Suite, thinkbayes.Joint):

def Likelihood(self, data, hypo):
"""Computes the likelihood of the data under the hypothesis.

hypo: n, p1, p2
data: k1, k2, c
"""
n, p1, p2 = hypo
k1, k2, c = data

part1 = choose(n, k1) * binom(k1, n, p1)
part2 = choose(k1, c) * choose(n-k1, k2-c) * binom(k2, n, p2)
return part1 * part2

Where choose evaluates the binomial coefficient,

\tbinom nk

, and binom evaluates the rest of the binomial pmf:

def binom(k, n, p):
return p**k * (1-p)**(n-k)

And that's pretty much it. Here's the code that builds and updates the suite of hypotheses:

data = 20, 15, 3

probs = numpy.linspace(0, 1, 101)

hypos = []

for n in range(32, 350):

for p1 in probs:

for p2 in probs:

hypos.append((n, p1, p2))

suite = Lincoln(hypos)

suite.Update(data)

The suite contains the joint posterior distribution for (n, p1, p2), but p1 and p2 are nuisance parameters; we only care about the marginal distribution of n. Lincoln inherits Marginal from Joint, so we can get the marginal distribution like this:

n_marginal = suite.Marginal(0)

Where 0 is the index of n in the tuple. And here's what the distribution looks like:

The lower bound is 32, which is the total number of bugs found by the two testers. I set the upper bound at 350, which chops off a little bit of the tail.

The maximum likelihood estimate in this distribution is 72; the mean is 106. So those are consistent with the Lincoln index, which is 100. But as usual, it is more useful to have the whole posterior distribution, not just a point estimate. For example, this posterior distribution can be used as part of a risk-benefit analysis to guide decisions about how much effort to allocate to finding additional bugs.

This solution generalizes to more than 2 testers, but figuring out the likelihood function, and evaluating it quickly, becomes increasingly difficult. Also, as the number of testers increases, so does the number of dimensions in the space of hypotheses. With two testers there are about 350 * 100 * 100 hypotheses. On my non-very-fast desktop computer, that takes about a minute. I could speed it up by evaluating the likelihood function more efficiently, but each new tester would multiply the run time by 100.

The library I used in my solution is thinkbayes.py, which is described in my book, Think Bayes. Electronic copies are available under a Creative Commons licence from Green Tea Press. Hard copies are published by O'Reilly Media and available from Amazon.com and other booksellers.

I believe that the approach to Bayesian statistics I present in Think Bayes is a good way to solve problems like this. I cite as evidence that this example, from the time I read John's article, to the time I pressed "Publish" on this post, took me about 3 hours.