Tuesday, December 8, 2015

Many rules of statistics are wrong

There are two kinds of people who violate the rules of statistical inference: people who don't know them and people who don't agree with them.  I'm the second kind.

The rules I hold in particular contempt are:

The interpretation of p-values: Suppose you are testing a hypothesis, H, so you've defined a null hypothesis, H0, and computed a p-value, which is the probability, under H0, of an effect as big as the one observed.

According to the conventional wisdom of statistics, if the p-value is small, you are allowed to reject the null hypothesis and declare that the observed effect is "statistically significant".  But you are not allowed to say anything about H, not even that it is more likely in light of the data.

I disagree.  If we were really not allowed to say anything about H, significance testing would be completely useless, but in fact it is only mostly useless.  As I explained in this previous article, a small p-value indicates that the observed data are unlikely under the null hypothesis.  Assuming that they are more likely under H (which is almost always the case), you can conclude that the data are evidence in favor of H and against H0.  Or, equivalently, that the probability of H, after seeing the data, is higher than it was before.  And it is reasonable to conclude that the apparent effect is probably not due to random sampling, but might have explanations other than H.
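To put numbers on the Bayesian version of this argument, here is a toy calculation (the likelihoods are invented for illustration): if the observed data are twice as likely under H as under H0, the posterior probability of H is higher than the prior, no matter where the prior starts.

    # Toy Bayesian update; the likelihoods here are made-up numbers.
    prior_prob = 0.1                   # prior probability of H
    like_h0 = 0.04                     # probability of the data under H0
    like_h = 0.08                      # probability of the data under H

    prior_odds = prior_prob / (1 - prior_prob)
    bayes_factor = like_h / like_h0    # likelihood ratio in favor of H
    post_odds = prior_odds * bayes_factor
    post_prob = post_odds / (1 + post_odds)
    print(post_prob)                   # about 0.18, up from the prior 0.1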

Correlation does not imply causation: If this slogan is meant as a reminder that correlation does not always imply causation, that's fine.  But based on responses to some of my previous work, many people take it to mean that correlation provides no evidence in favor of causation, ever.

I disagree.  As I explained in this previous article, correlation between A and B is evidence of some causal relationship between A and B, because you are more likely to observe correlation if there is a causal relationship than if there isn't.  The problem with using correlation to infer causation is that it does not distinguish among three possible relationships: A might cause B, B might cause A, or any number of other factors, C, might cause both A and B.

So if you want to show that A causes B, you have to supplement correlation with other arguments that distinguish among possible relationships.  Nevertheless, correlation is evidence of causation.

Regression provides no evidence of causation: This rule is similar to the previous one, but generalized to include regression analysis.  I posed this question to the reddit stats forum: the consensus view among the people who responded is that regression doesn't say anything about causation, ever.  (More about that in this previous article.)

I disagree.  I think regression provides evidence in favor of causation for the same reason correlation does, but in addition, it can distinguish among different explanations for correlation.  Specifically, if you think that a third factor, C, might cause both A and B, you can try adding a variable that measures C as an independent variable.  If the apparent relationship between A and B is substantially weaker after the addition of C, or if it changes sign, that's evidence that C is a confounding variable.

Conversely, if you add control variables that measure all the plausible confounders you can think of, and the apparent relationship between A and B survives each challenge substantially unscathed, that outcome should increase your confidence that either A causes B or B causes A, and decrease your confidence that confounding factors explain the relationship.
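A quick simulation illustrates both behaviors; this sketch is mine, not part of any of the studies discussed here.  In the first scenario, C causes both A and B and there is no direct effect; in the second, A actually causes B and C is irrelevant.

    import numpy as np
    import pandas as pd
    import statsmodels.formula.api as smf

    np.random.seed(17)
    n = 10000
    c = np.random.normal(size=n)

    # Scenario 1: C causes both A and B; no direct effect of A on B.
    df1 = pd.DataFrame(dict(c=c,
                            a=c + np.random.normal(size=n),
                            b=c + np.random.normal(size=n)))

    # Scenario 2: A actually causes B; C is unrelated.
    a = np.random.normal(size=n)
    df2 = pd.DataFrame(dict(c=c, a=a, b=a + np.random.normal(size=n)))

    for df in [df1, df2]:
        uncontrolled = smf.ols('b ~ a', data=df).fit().params['a']
        controlled = smf.ols('b ~ a + c', data=df).fit().params['a']
        print(uncontrolled, controlled)

In the first scenario the coefficient of A drops from about 0.5 to near zero when C is added; in the second it stays near 1 either way.  That is exactly the contrast the previous two paragraphs describe.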

By providing evidence against confounding factors, regression provides evidence in favor of causation, but it is not clear whether it can distinguish between "A causes B" and "B causes A".  The received wisdom of statistics says no, of course, but at this point I hope you understand why I am not inclined to accept it.

In this previous article, I explore the possibility that running regressions in both directions might help.  At this point, I think there is an argument to be made, but I am not sure.  It might turn out to be hogwash.  But along the way, I had a chance to explore another bit of conventional wisdom...

Methods for causal inference, like matching estimators, have a special ability to infer causality:  In this previous article, I explored a propensity score matching estimator, which is one of the methods some people think have special ability to provide evidence for causation.  In response to my previous work, several people suggested that I try these methods instead of regression.

Causal inference, and the counterfactual framework it is based on, is interesting stuff, and I look forward to learning more about it.  And matching estimators may well squeeze stronger evidence from the same data, compared to regression.  But so far I am not convinced that they have any special power to provide evidence for causation.

Matching estimators and regression are based on many of the same assumptions and vulnerable to some of the same objections.  I believe (tentatively for now) that if either of them can provide evidence for causation, both can.

Quoting rules is not an argument

As these examples show, many of the rules of statistics are oversimplified, misleading, or wrong.  That's why, in many of my explorations, I do things experts say you are not supposed to do.  Sometimes I'm right and the rule is wrong, and I write about it here.  Sometimes I'm wrong and the rule is right; in that case I learn something and I try to explain it here.  In the worst case, I waste time rediscovering something everyone already "knew".

If you think I am doing something wrong, I'd be interested to hear why.  Since my goal is to test whether the rules are valid, repeating them is not likely to persuade me.  But if you explain why you think the rules are right, I am happy to listen.


Thursday, December 3, 2015

Internet use and religion, part six

In my work on Internet use and religion, one of the recurring questions is whether the analysis I am doing, primarily regression models using observational data, provides evidence that Internet use causes decreased religiosity, or only shows a statistical association between them.

I discuss this question in the next section, which is probably too long.  If you get bored, you can skip to the following section, which presents a different method for estimating effect size, a "propensity score matching estimator", and compares the results to the regression models.

What does "evidence" mean?

In the previous article I presented results from two regression models and made the following argument:

  1. Internet use predicts religiosity fairly strongly: the effect size is stronger than education, income, and use of other media (but not as strong as age).
  2. Controlling for the same variables, religiosity predicts Internet use only weakly: the effect is weaker than age, date of interview, income, education, and television (and about the same as radio and newspaper).
  3. This asymmetry suggests that Internet use causes a decrease in religiosity, and the reverse effect (religiosity discouraging Internet use) is weaker or zero.
  4. It is still possible that a third factor could cause both effects, but the control variables in the model, and the asymmetry of the effect, make it hard to come up with plausible ideas for what the third factor could be.

I am inclined to consider these results as evidence of causation (and not just a statistical association).

When I make arguments like this, I get pushback from statisticians who assert that the kind of observational data I am working with cannot provide any evidence for causation, ever.  To understand this position better, I posted this query on reddit.com/r/statistics.  As always, I appreciate the thoughtful responses, even if I don't agree.  The top-rated comments came from /u/Data_Driven_Dude, who states this position:
Causality is almost wholly a factor of methodology, not statistics. Which variables are manipulated, which are measured, when they're measured/manipulated, in what order, and over what period of time are all methodological considerations. Not to mention control/confounding variables. 
So the most elaborate statistics in the world can't offer evidence of causation if, for example, a study used cross-sectional survey design. [...] 
Long story short: causality is a helluva lot harder than most people believe it to be, and that difficulty isn't circumvented by mere regression.
I believe this is a consensus opinion among many statisticians and social scientists, but to be honest I find it puzzling.  As I argued in this previous article, correlation is in fact evidence of causation, because observing a correlation is more likely if there is causation than if there isn't.

The problem with correlation is not that it is powerless to demonstrate causation, but that a simple bivariate correlation between A and B can't distinguish between A causing B, B causing A, or a confounding variable, C, causing both A and B.

But regression models can.  In theory, if you control for C, you can measure the causal effect of A on B.  In practice, you can never know whether you have identified and effectively controlled for all confounding variables.  Nevertheless, by adding control variables to a regression model, you can find evidence for causation.  For example:

  1. If A (Internet use, in my example) actually causes B (decreased religiosity), but not the other way around, and we run regressions with B as a dependent variable, and A as an explanatory variable, we expect to find that A predicts B, of course.  But we also expect the observed effect to persist as we add control variables.  The magnitude of the effect might get smaller, but if we can control effectively for all confounding variables, it should converge on the true causal effect size.
  2. On the other hand, if we run the regression the other way, using B to predict A, we expect to find that B predicts A, but as we add control variables, the effect should disappear, and if we control for all confounding variables, it should converge to zero.

For example, in this previous article, I found that first babies are lighter than others, by about 3 ounces.  However, the mothers of first babies tend to be younger, and babies of younger mothers tend to be lighter.  When I control for mother's age, the apparent effect is smaller, less than an ounce, and no longer statistically significant.  I conclude that mother's age explains the apparent difference between first babies and others, and that the causal effect of being a first baby is small or zero.

I don't think that conclusion is controversial, but here's the catch: if you accept "the effect disappears when I add control variables" as evidence against causation, then you should also accept "the effect persists despite effective control for confounding variables" as evidence for causation.

Of course the key word is "effective".  If you think I have not actually controlled for an important confounding variable, you would be right to be skeptical of causation.  If you think the controls are weak, you might accept the results as weak evidence of causation.  But if you think the controls are effective, you should accept the results as strong evidence.

So I don't understand the claim that regression models cannot provide any evidence for causation, at all, ever, which I believe is the position my correspondents took, and which seems to be taken as a truism among at least some statisticians and social scientists.

I put this question to /u/Data_Driven_Dude, who wrote the following:
[...]just because you have statistical evidence for a hypothesized relationship doesn't mean you have methodological evidence for it. Methods and stats go hand in hand; the data you gather and analyze via statistics are reflections of the methods used to gather those data.
Say there is compelling evidence that A causes B. So I hypothesize that A causes B. I then conduct multiple, methodologically-rigorous studies over several years (probably over at least a decade). This effectively becomes my research program, hanging my hat on the idea that A causes B. I become an expert in A and B. After all that work, the studies support my hypothesis, and I then suggest that there is overwhelming evidence that A causes B.
 
Now take your point. There could be nascent (yet compelling) research from other scientists, as well logical "common sense," suggesting that A causes B. So I hypothesize that A causes B. I then administer a cross-sectional survey that includes A, B, and other variables that may play a role in that relationship. The data come back, and huzzah! Significant regression model after controlling for potentially spurious/confounding variables! Conclusion: A causes B. 
Nope. In your particular study, ruling out alternative hypotheses by controlling for other variables and finding that B doesn't predict A when those other variables are considered is not evidence of causality. Instead, what you found is support for the possibility that you're on the right track to finding causality. You found that A tentatively predicts B when controlling for XYZ, within L population based on M sample across N time. Just because you started with a causal hypothesis doesn't mean you conducted a study that can yield data to support that hypothesis. 
So when I say that non-experimental studies provide no evidence of causality, I mean that not enough methodological rigor has been used to suppose that your results are anything but a starting point. You're tiptoeing around causality, you're picking up traces of it, you see its shadow in the distance. But you're not seeing causality itself, you're seeing its influence on a relationship: a ripple far removed from the source.
I won't try to summarize or address this point-by-point, but a few observations:

  1. One point of disagreement seems to be the meaning of "evidence".  I admit that I am using it in a Bayesian sense, but honestly, that's because I don't understand any alternatives.  In particular, I don't understand the distinction between "statistical evidence" and "methodological evidence".
  2. Part of what my correspondent describes is a process of accumulating evidence, starting with initial findings that might not be compelling and ending (a decade later!) when the evidence is overwhelming.  I mostly agree with this, but I think the process starts when the first study provides some evidence and continues as each additional study provides more.  If the initial study provides no evidence at all, I don't know how this process gets off the ground.  But maybe I am still stuck on the meaning of "evidence".
  3. A key feature of this position is the need for methodological rigor, which sounds good, but I am not sure what it means.  Apparently regression models with observational data lack it.  I suspect that randomized controlled trials have it.  But I'm not sure what's in the middle.  Or, to be more honest, I know what is considered to be in the middle, but I'm not sure I agree.
To pursue the third point, I am exploring methods commonly used in the social sciences to test causality.  


SKIP TO HERE!

Matching estimators of causal effects

I'm reading Morgan and Winship's Counterfactuals and Causal Inference, which is generally good, although it has the academic virtue of presenting simple ideas in complicated ways.  So far I have implemented one of the methods in Counterfactuals, a propensity score matching estimator.  Matching estimators work like this:

  1. To estimate the effect of a particular treatment, D, on a particular outcome, Y, we divide an observed sample into a treatment group that received D and a control group that didn't.
  2. For each member of the treatment group, we identify a member of the control group that is as similar as possible (I'll explain how soon), and compute the difference in Y between the matched pair.
  3. Averaging the observed differences over the pairs yields an estimate of the mean causal effect of D on Y.

The hard part of this process is matching.  Ideally the matching process should take into account all factors that cause Y.  If the pairs are identical in all of these factors, and differ only in D, any average difference in Y must be caused by D.

Of course, in practice we can't identify all factors that cause Y.  And even if we could, we might not be able to observe them all.  And even if we could, we might not be able to find a perfect match for each member of the treatment group.

We can't solve the first two problems, but "propensity scores" help with the third.  The basic idea is:

  1. Identify factors that predict D.  In my example, D is Internet use.
  2. Build a model that uses those factors to predict the probability of D for each member of the sample; this probability is the propensity score.  In my example, I use logistic regression with age, income, education, and other factors to compute propensity scores.  
  3. Match each member of the treatment group with the member in the control group with the closest propensity score.  In my example, the members of each pair have the same predicted probability of using the Internet (according to the model in step 2), so the only relevant difference between them is that one did and one didn't.  Any difference in the outcome, religiosity, should reflect the causal effect of the treatment, Internet use.
I implemented this method (details below) and applied it to the data from the European Social Survey (ESS).

In each country I divide respondents into a treatment group with Internet use above the median and a control group below the median.  We lose information by quantizing Internet use in this way, but the distribution tends to be bimodal, with many people at the extremes and few in the middle, so treating Internet use as a binary variable is not completely terrible.

To compute propensity scores, I use logistic regression to predict Internet use based on year of birth (including a quadratic term), year of interview, education and income (as in-country ranks), and use of other media (television, radio, and newspaper).

As expected, the average propensity in the treatment group is higher than in the control group.  But some members of the control group are matched more often than others (and some not at all).  After matching, the two groups have the same average propensity.


Finally, I compute the pair-wise difference in religiosity and the average across pairs.

For each country, I repeat this process 101 times using weighted resampled data with randomly filled missing values.  That way I can compute confidence intervals that reflect variation due to sampling and missing data.  The following figure shows the estimated effect size for each country and a 95% confidence interval:

The estimated effect of Internet use on religiosity is negative in 30 out of 34 countries; in 18 of them it is statistically significant.  In 4 countries the estimate is positive, but none of them are statistically significant.

The median effect size is 0.28 points on a 10 point scale.  The distribution of effect size across countries is similar to the results from the regression model, which has median 0.33:

The confidence intervals for the matching estimator are bigger.  Some part of this difference is because of the information we lose by quantizing Internet use.  Some part is because we lose some samples during the matching process.  And some part is due to the non-parametric nature of matching estimators, which makes fewer assumptions about the structure of the effect.

How it works

The details of the method are in this IPython notebook, but I'll present the kernel of the algorithm here.  In the following, group is a Pandas DataFrame for one country, with one row for each respondent.

The first step is to quantize Internet use and define a binary variable, treatment:

    # imports used by this snippet and the ones below
    import numpy as np
    import statsmodels.formula.api as smf

    netuse = group.netuse_f
    thresh = netuse.median()
    if thresh < 1:
        thresh = 1
    group['treatment'] = (netuse >= thresh).astype(int)

The next step is to use logistic regression to compute a propensity score for each respondent:


    formula = ('treatment ~ inwyr07_f + '
               'yrbrn60_f + yrbrn60_f2 + '
               'edurank_f + hincrank_f + '
               'tvtot_f + rdtot_f + nwsptot_f')

    model = smf.logit(formula, data=group)
    results = model.fit(disp=False)
    group['propensity'] = results.predict(group)
    
Next we divide into treatment and control groups:

    treatment = group[group.treatment == 1]
    control = group[group.treatment == 0]
    
And sort the controls by propensity:

    series = control.propensity.sort_values()

Then do the matching by bisection search:

    # searchsorted returns, for each treated respondent, the position where
    # their propensity would be inserted in the sorted control propensities;
    # clamp the result to the range of valid positions
    indices = series.searchsorted(treatment.propensity)
    indices[indices < 0] = 0
    indices[indices >= len(control)] = len(control) - 1
    
And select the matches from the controls:

    control_indices = series.index[indices]
    matches = control.loc[control_indices]

Extract the distance in propensity between each pair, and the difference in religiosity:

    distances = (treatment.propensity.values - 
                 matches.propensity.values)
    differences = (treatment.rlgdgr_f.values - 
                   matches.rlgdgr_f.values)
    
Select only the pairs that are a close match, within a caliper of 0.001 in propensity:

    caliper = differences[abs(distances) < 0.001]

And compute the mean difference:

    delta = np.mean(caliper)

That's all there is to it.  There are better ways to do the matching, but I started with something simple and computationally efficient (it's n log n, where n is the size of the control or treatment group, whichever is larger).

Back to the philosophy

The agreement of the two methods provides some evidence of causation, because if the effect were spurious, I would expect different methodologies, which are more or less robust against the spurious effect, to yield different results.

But it is not very strong evidence, because the two methods are based on many of the same assumptions.  In particular, the matching estimator is only as good as the propensity model, and in this case the propensity model includes the same factors as the regression model.  If those factors effectively control for confounding variables, both methods should work.  If they don't, neither will.

The propensity model uses logistic regression, so it is based on the usual assumptions about linearity and the distribution of errors.  But the matching estimator is non-parametric, so it depends on fewer assumptions about the effect itself.  It seems to me that being non-parametric is a potential advantage of matching estimators, but it doesn't help with the fundamental problem, which is that we don't know if we have effectively controlled for all confounding variables.

So I am left wondering why a matching estimator should be considered suitable for causal inference if a regression model is not.  In practice one of them might do a better job than the other, but I don't see any difference, in principle, in their ability to provide evidence for causation: either both can, or neither.

Monday, November 30, 2015

Internet use and religion, part five

[If you are jumping into the middle of this series, you might want to start with this article, which explains the methodological approach I am taking.]

In the previous article, I show results from two regression models that predict religious affiliation and degree of religiosity.  I use the models to compare hypothetical respondents who are at their national means for all explanatory factors; then I vary one factor at a time, comparing someone at the 25th percentile with someone at the 75th percentile.  I compute the difference in the predicted probability of religious affiliation and the predicted level of religiosity:

1) In almost every country, a hypothetical respondent with high Internet use is less likely to report a religious affiliation.  The median effect size across countries is 3.5 percentage points.

2) In almost every country, a hypothetical respondent with high Internet use reports a lower degree of religiosity.  The median effect size is 0.36 points on a 10-point scale.

These results suggest that Internet use might cause religious disaffiliation and decreased religiosity, but they are ambiguous.  It is also possible that the direction of causation is the other way; that is, that religiosity causes a decrease in Internet use.  Or there might be other factors that cause both Internet use and religiosity.

I'll address the first possibility first.  If religiosity causes lower Internet use, we should be able to measure that effect by flipping the models, taking religious affiliation and religiosity as explanatory variables and trying to predict Internet use.
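Concretely, flipping the model just means swapping which variable appears on the left side of the regression formula.  Here is a sketch using the variable names from my notebooks (group is the DataFrame for one country; this is illustrative, not the exact notebook code):

    import statsmodels.formula.api as smf

    controls = ('inwyr07_f + yrbrn60_f + yrbrn60_f2 + '
                'edurank_f + hincrank_f + '
                'tvtot_f + rdtot_f + nwsptot_f')

    # forward model: religiosity as the dependent variable
    forward = smf.ols('rlgdgr_f ~ netuse_f + ' + controls,
                      data=group).fit()

    # flipped model: Internet use as the dependent variable
    flipped = smf.ols('netuse_f ~ rlgdgr_f + ' + controls,
                      data=group).fit()

    print(forward.params['netuse_f'], flipped.params['rlgdgr_f'])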

I did that experiment and found:

1) Religious affiliation (hasrelig) has no predictive value for Internet use in most countries, and only weak predictive value in others.

2) Degree of religiosity (rlgdgr) has some predictive value for Internet use in some countries, but the effect is weaker than other explanatory variables (like age and education), and weaker than the effect of Internet use on religiosity: the median across countries is 0.19 points on a 7-point scale.

Considering the two possibilities, that Internet use causes religious disaffiliation or the other way around, these results support the first possibility, although the second might make a smaller contribution.

Although it is still possible that a third factor causes increased Internet use and decreased religious affiliation, it would have to do so in a strangely asymmetric way to account for these results.  And since the model controls for age, income, education, and other media use, this hypothetical third factor would have to be uncorrelated with these controls (or only weakly correlated).

I can't think of any plausible candidates for this third factor.  So I tentatively conclude that Internet use causes decreased religiosity.  I present more detailed results below.

Summary of previous results

In the previous article, I computed an effect size for each factor and reported results for two models, one that predicts hasrelig (whether the respondent reports religious affiliations) and one that predicts rlgdgr (degree of religiosity on a 0-10 scale).  The following two graphs summarize the results.

Each figure shows the distribution of effect size across the 34 countries in the study.  The first figure shows the results for the first model as a difference in percentage points; each line shows the effect size for a different explanatory variable.

The factors with the largest effect sizes are year of birth (dark green line) and Internet use (purple).

For Internet use, a respondent who is average in every way, but falls at the 75th percentile of Internet use, is typically 2-7 percentage points less likely to be affiliated than a similar respondent at the 25th percentile of Internet use.  In a few countries, the effect is apparently the other way around, but in those cases the estimated effect size is not statistically significant.

Overall, people who use the Internet more are less likely to be affiliated, and the effect is stronger than the effect of education, income, or the consumption of other media.

Similarly, when we try to predict degree of religiosity, people who use the Internet more (again comparing the 75th and 25th percentiles) report lower religiosity, typically 0.2 to 0.7 points on a 10 point scale.  Again, the effect size for Internet use is bigger than for education, income, or other media.

Of course, what I am calling an "effect size" may not be an effect in the sense of cause and effect.  What I have shown so far is that Internet users tend to be less religious, even when we control for other factors.  It is possible, and I think plausible, that Internet use actually causes this effect, but there are two other possible explanations for the observed statistical association:

1) Religious affiliation and religiosity might cause decreased Internet use.
2) Some other set of factors might cause both increased Internet use and decreased religiosity.

Addressing the first alternative explanation, if people who are more religious tend to use the Internet less (other things being equal), we would expect that effect to appear in a model that includes religiosity as an explanatory variable and Internet use as a dependent variable.

But it turns out that if we run these models, we find that religiosity has little power to predict levels of Internet use when we control for other factors.  I present the results below; the details are in this IPython notebook.

Model 1

The first model tries to predict level of Internet use taking religious affiliation (hasrelig) as an explanatory variable, along with the same controls I used before: year of birth (linear and quadratic terms), year of interview, education, income, and consumption of other media.

The following figure shows the effect size of religious affiliation on Internet use.

In most countries it is essentially zero.  In a few countries, people who report a religious affiliation also report less Internet use, but always by less than 0.5 points on a 7-point scale.

The following figure shows the distribution of effect size for the other variables on the same scale.

If we are trying to predict Internet use for a given respondent, the most useful explanatory variables, in descending order of effect size, are year of birth, education, year of interview, income, and television viewing.  The effect sizes for religious affiliation, radio listening, and newspaper reading are substantially smaller.

The results of the second model are similar.

Model 2

The second model tries to predict level of Internet use taking degree of religiosity (rlgdgr) as an explanatory variable, along with the same controls I used before.

The following figure shows the estimated effect size in each country, showing the difference in Internet use of two hypothetical respondents who are at their national mean for all variables except degree of religiosity, where they are at the 25th and 75th percentiles.

In most countries, the respondent reporting a higher level of religiosity also reports a lower level of Internet use, in almost all cases less than 0.5 points on a 7-point scale.  Again, this effect is smaller than the apparent effect of the other explanatory variables.

Again, the variables that best predict Internet use are year of birth, education, year of interview, income, and television viewing.  The apparent effect of religiosity is somewhat less than television viewing, and more than radio listening and newspaper reading.

Next steps

As I present these results, I realize that I can make them easier to interpret by expressing the effect size in standard deviations, rather than raw differences.  Internet use is recorded on a 7-point scale, and religiosity on a 10-point scale, so it's not obvious how to compare them.

Also, variability of Internet use and religiosity is different across countries, so standardizing will help with comparisons between countries, too.

More results in the next installment.


Tuesday, November 24, 2015

Internet use and religion, part four

[If you are jumping into the middle of this series, you might want to start with this article, which explains the methodological approach I am taking.]

In the previous article, I presented preliminary results from a study of relationships between Internet use and religion.  Using data from the European Social Survey, I ran regressions to estimate the effect of media consumption (television, radio, newspapers, and Internet) on religious affiliation and degree of religiosity.

As control variables, I include year born, education and income (expressed as relative ranks within each country) and the year the data were gathered (between 2002 and 2010).

Some of the findings so far:

  1. In almost every country, younger people are less religious.
  2. In most countries, people with more education are less religious.
  3. In about half of the 34 countries, people with lower income are less religious.  In the rest, the effect (if any) is too small to be distinguished from noise with this sample size.
  4. In most countries, people who watch more television are less religious.
  5. In fewer than half of the countries, people who listen to more radio are less religious.
  6. The results for newspapers are similar: only a few countries show a negative effect, and in some countries the effect is positive.
  7. In almost every country, people who use the Internet are less religious.
  8. There is a weak relationship between the strength of the effect and the average degree of religiosity: the negative effect of Internet use on religion tends to be stronger in more religious countries.
In the previous article, I measured effect size using the parameters of the regression models: for logistic regression, the parameter is a log odds ratio; for linear regression, it is a linear weight.  These parameters are not useful for comparing the effects of different factors, because they are not on the same scale, and they are not the best choice for comparing effect size between countries, because they don't take into account variation in each factor in each country.

For example, in one country the parameter associated with Internet use might be small, but if there is large variation in Internet use within the country, the net effect size might be greater than in another country with a larger parameter, but little variation.

Effect size

So my next step is to define effect size in terms that are comparable between factors and between countries.  To explain the methodology, I'll use the logistic model, which predicts the probability of religious affiliation.  I start by fitting the model to the data, then use the model to predict the probability of affiliation for a hypothetical respondent whose values for all factors are the national mean.  Then I vary one factor at a time, generating predictions for hypothetical respondents whose value for one factor is at the 25th percentile (within country) and at the 75th percentile.  Finally, I compute the difference in predicted values in percentage points.

As an example, suppose a hypothetically average respondent has a 45% chance of reporting a religious affiliation, as predicted by the model.  And suppose the 25th and 75th percentiles of Internet use are 2 and 7, on a 7 point scale.  A person who is average in every way, but with Internet use only 2 might have a 47% chance of affiliation.  The same person with Internet use 7 might have a 42% chance.  In that case I would report that the effect size is a difference of 5 percentage points.
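Here is a sketch of that computation for a single factor (assuming results is the fitted logistic model and group is the DataFrame for one country, with the columns the model formula needs; the helper name is mine):

    def effect_size(results, group, factor='netuse_f'):
        """Difference in predicted probability, in percentage points,
        between respondents at the 25th and 75th percentiles of factor,
        holding all other variables at their national means."""
        # one hypothetical respondent at the national mean for everything
        mean_row = group.mean().to_frame().T

        low = mean_row.copy()
        high = mean_row.copy()
        low[factor] = group[factor].quantile(0.25)
        high[factor] = group[factor].quantile(0.75)

        p_low = results.predict(low)[0]
        p_high = results.predict(high)[0]
        return (p_high - p_low) * 100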

As in the previous article, I run this analysis on about 200 iterations of resampled data, then compute a median and 95% confidence interval for each value.


Quadratic age model

Before I get to the results, there is one other change from the previous installment:  I added a quadratic term for year born.  The reason is that in preliminary results, I noticed that Internet use had the strongest negative association with religiosity, followed by television, then radio and newspapers.  I wondered whether this pattern might be the result of correlation with age; that is, whether younger people are more likely to consume new media and be less religious.  I was already controlling for age using yrbrn60 (year born minus 1960), but I worried that if the relationship with age is nonlinear, I might not be controlling for it effectively.

So I added a quadratic term to the model.  Here are the estimated parameters for the linear term and quadratic term:

In many countries, both parameters are statistically significant, so I am inclined to keep them in the model.  The sign of the quadratic term is usually positive, so the curves are convex up, which suggests that the age effect might be slowing down.

Anyway, including the quadratic term has almost no effect on the other results: the relative strengths of the associations are the same.
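For reference, adding the quadratic term only requires squaring the centered year-born variable and putting the new column in the formula.  A minimal sketch (the column names follow the ones used elsewhere in this series; hasrelig stands for the affiliation variable):

    import statsmodels.formula.api as smf

    # quadratic term: square of year born, centered on 1960
    group['yrbrn60_f2'] = group.yrbrn60_f ** 2

    formula = ('hasrelig ~ inwyr07_f + '
               'yrbrn60_f + yrbrn60_f2 + '
               'edurank_f + hincrank_f + '
               'tvtot_f + rdtot_f + nwsptot_f + netuse_f')
    results = smf.logit(formula, data=group).fit(disp=False)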

Model 1 results

Again, the first model uses logistic regression with dependent variable hasrelig, which indicates whether the respondent reports a religious affiliation.

In the following figures, the x-axis is the percentage point difference in hasrelig between people at the 25th and 75th percentile for each explanatory variable.
In most countries, people with more education are less religious.
In most countries, the effect of income is small and not statistically significant.
The effect of television is negative in most countries.
The effect of radio is usually small.
The effect of newspapers is usually small.
In most countries, Internet use is associated with substantial decreases in religious affiliation.

Pulling together the results so far, the following figure shows the distribution (CDF) of effect size across countries:
Overall, the effect size for Internet use is the largest, followed by education and television.  The effect sizes for income, radio, and newspaper are all small, and centered around zero.

Model 2 results

The second model uses linear regression with dependent variable rlgdgr, which indicates degree of religiosity on a 0-10 scale.


In the following figures, the x-axis shows the difference in rlgdgr between people at the 25th and 75th percentile for each explanatory variable.

In most countries, people with more education are less religious.
The effect size for income is smaller.
People who watch more television are less religious.
The effect size for radio is small.
The effect size for newspapers is small.
In most countries, people who use the Internet more are less religious.

Comparing the effect sizes for the different explanatory variables, again, Internet use has the biggest effect, followed by education and television.  Effect sizes for income, radio, and newspaper are smaller and centered around zero.

That's all for now.  I have a few things to check out, and then I should probably wrap things up.

Thursday, November 19, 2015

Internet use and religion, part three

This article reports preliminary results from an exploration of the relationship between religion and Internet use in Europe, using data from the European Social Survey (ESS).

I describe the data processing pipeline and models in this previous article.  All the code for this article is in this IPython notebook.


Data inventory

The dependent variables I use in the models are

rlgblg: Do you consider yourself as belonging to any particular religion or denomination?

rlgdgr: Regardless of whether you belong to a particular religion, how religious would you say you are?  Scale from 0 = Not at all religious to 10 = Very religious.

The explanatory variables are

yrbrn: And in what year were you born?

hincrank: Household income, rank from 0-1 indicating where this respondent falls relative to respondents from the same country, same round of interviews.

edurank: Years of education, rank from 0-1 indicating where this respondent falls relative to respondents from the same country, same round of interviews.

tvtot: On an average weekday, how much time, in total, do you spend watching television?  Scale from 0 = No time at all to 7 = More than 3 hours.

rdtot: On an average weekday, how much time, in total, do you spend listening to the radio? Scale from 0 = No time at all to 7 = More than 3 hours.

nwsptot: On an average weekday, how much time, in total, do you spend reading the newspapers? Scale from 0 = No time at all to 7 = More than 3 hours.

netuse: Now, using this card, how often do you use the internet, the World Wide Web or e-mail - whether at home or at work - for your personal use?  Scale from 0 = No access at home or work, 1 = Never use, 6 = Several times a week, 7 = Every day.

Model 1: Affiliated or not?

In the first model, the dependent variable is rlgblg, which indicates whether the respondent is affiliated with a religion.

The following figures show estimated parameters from logistic regression, for each of the explanatory variables.  The parameters are log odds ratios: negative values indicate that the variable decreases the likelihood of affiliation; positive values indicate that it increases the likelihood.

The horizontal lines show the 95% confidence interval for the parameters, which includes the effects of random sampling and filling missing values.  Confidence intervals that cross the zero line indicate that the parameter is not statistically significant at the p<0.05 level.

In most countries, interview year has no apparent effect.  I will probably drop it from the next iteration of the model.
Year born has a consistent negative effect, indicating that younger people are less likely to be affiliated.  Possible exceptions are Israel, Turkey, and Cyprus.
In most countries, people with more education are less likely to be affiliated.  Possible exceptions: Latvia, Sweden, and the UK.
In a few countries, income might have an effect, positive or negative.  But in most countries it is not statistically significant.
It looks like television might have a positive or negative effect in several countries.
In most countries the effect of radio is not statistically significant.  Possible exceptions are Portugal, Greece, Bulgaria, the Netherlands, Estonia, the UK, Belgium, and Germany.
In most countries the effect of newspapers is not statistically significant.  Possible exceptions are Turkey, Greece, Italy, Spain, Croatia, Estonia, Portugal, and Norway.
In the majority of countries, Internet use (which includes email and web) is associated with religious disaffiliation.  The estimated parameter is positive in only 4 countries, and not statistically significant in any of them.  The effect of Internet use appears strongest in Poland, Portugal, Israel, and Austria.

The following scatterplot shows the estimated parameter for Internet use on the x-axis, and the fraction of people who report religious affiliation on the y-axis.  There is a weak negative correlation between them (rho = -0.38), indicating that the effect of Internet use is stronger in countries with higher rates of affiliation.

Model 2: Degree of religiosity

In the second model, the dependent variable is rlgdgr, a self-reported degree of religiosity on a 0-10 scale (where 0 = not at all religious and 10 = very religious).

The following figures show estimated parameters from linear regression, for each of the explanatory variables.  Negative values indicate that the variable decreases the predicted degree of religiosity; positive values indicate that it increases it.

Again, the horizontal lines show the 95% confidence interval for the parameters; intervals that cross the zero line are not statistically significant at the p<0.05 level.

As in Model 1, interview year is almost never statistically significant.
Younger people are less religious in every country except Israel.
In most countries, people with more education are less religious, with possible exceptions Estonia, the UK, and Latvia.
In about half of the countries, people with higher income are less religious.  One possible exception: Germany.
In most countries, people who watch more television are less religious.  Possible exceptions: Greece and Italy.
In several countries, people who listen to the radio are less religious.  Possible exceptions: Slovenia, Austria, Poland, Israel, Croatia, Lithuania.
In some countries, people who read newspapers more are less religious, but in some other countries they are more religious.
In almost every country, people who use the Internet more are less religious.  The estimated parameter is positive in only three countries, and none of them are statistically significant.  The effect of Internet use appears to be particularly strong in Israel and Luxembourg.

The following scatterplot shows the estimated parameter for Internet use on the x-axis and the national average degree of religiosity on the y-axis.  Using all data points, the coefficient of correlation is -0.16, but if we exclude the outliers, it is -0.36, indicating that the effect of Internet use is stronger in countries with higher degrees of religiosity.

Next steps

I am working on a second round of visualizations that show the size of the Internet effect in each country, expressed in terms of differences between people at the 25th, 50th, and 75th percentiles of Internet use.

I am also open to suggestions for further explorations.  And if anyone has insight into some of the countries that show up as exceptions to the common patterns, I would be interested to hear it.




Tuesday, November 17, 2015

Internet use and religion, part two

In the previous article, I posted a preliminary exploration of the relationship between Internet use and religious affiliation in Europe.  In this article I clean up some data issues and present results broken down by country.

Cleaning and resampling

Here are the steps of the data cleaning pipeline:

  1. I replace sentinel values with NaNs.
  2. I recode some of the explanatory variables, and shift "year born" and "interview year" so their mean is 0.
  3. Within each data collection round and each country, I resample the respondents using their post-stratification weights (pspwght).
  4. In Rounds 1 and 2, there are a few countries that were not asked about Internet use.  I remove these countries from those rounds.
  5. For the variables eduyrs and hinctnta, I replace each value with its rank (from 0-1) among respondents in the same round and country.
  6. I replace missing values with random samples from the same round and country.
  7. Finally, I merge rounds 1-5 into a single DataFrame and then group by country.
This IPython notebook has the details, and summaries of the variables after processing.
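To make steps 5 and 6 concrete, here is a minimal sketch of how they can be implemented (my own reconstruction, not code from the notebook; I assume the merged DataFrame df has columns cntry and essround identifying country and round):

    import numpy as np

    def fill_missing(series):
        """Replace NaNs with random draws from the observed values."""
        filled = series.copy()
        missing = filled.isnull()
        if missing.any():
            filled[missing] = np.random.choice(series.dropna(),
                                               missing.sum())
        return filled

    grouped = df.groupby(['cntry', 'essround'])

    # Step 5: replace eduyrs and hinctnta with within-group ranks (0-1).
    df['edurank'] = grouped['eduyrs'].rank(pct=True)
    df['hincrank'] = grouped['hinctnta'].rank(pct=True)

    # Step 6: fill missing values within each country and round.
    df['netuse_f'] = grouped['netuse'].transform(fill_missing)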

One other difference compared to the previous notebook/article: I've added the variable inwyr07, which is the year the respondent was interviewed (between 2002 and 2012), shifted by 2007 so the mean is near 0.

Aggregate results

Using variables with missing values filled (as indicated by the _f suffix), I get the following results from logistic regression on rlgblg_f (belonging to a religion), with sample size 233,856:

               coef     std err    z         P>|z|    [95.0% Conf. Int.]
Intercept      1.1096   0.017       66.381   0.000     1.077   1.142
inwyr07_f      0.0477   0.002       30.782   0.000     0.045   0.051
yrbrn60_f     -0.0090   0.000      -32.682   0.000    -0.010  -0.008
edurank_f      0.0132   0.017        0.775   0.438    -0.020   0.046
hincrank_f     0.0787   0.016        4.950   0.000     0.048   0.110
tvtot_f       -0.0176   0.002       -7.888   0.000    -0.022  -0.013
rdtot_f       -0.0130   0.002       -7.710   0.000    -0.016  -0.010
nwsptot_f     -0.0356   0.004       -9.986   0.000    -0.043  -0.029
netuse_f      -0.1082   0.002      -61.571   0.000    -0.112  -0.105
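A table like this one is the coefficient block of a StatsModels summary.  For readers following along, the call that produces it looks roughly like this sketch (df is the merged DataFrame from the pipeline above):

    import statsmodels.formula.api as smf

    formula = ('rlgblg_f ~ inwyr07_f + yrbrn60_f + '
               'edurank_f + hincrank_f + '
               'tvtot_f + rdtot_f + nwsptot_f + netuse_f')
    results = smf.logit(formula, data=df).fit()
    print(results.summary())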

Compared to the results from last time, there are a few changes:
  1. Interview year has a substantial effect, but probably should not be taken too seriously in this model, since the set of countries included in each round varies.  The apparent effect of time might reflect the changing mix of countries.  I expect this variable to be more useful after we group by country.
  2. The effect of "year born" is similar to what we saw before.  Younger people are less likely to be affiliated.
  3. The effect of education, now expressed in relative terms within each country, is no longer statistically significant.  The apparent effect we saw before might have been due to variation across countries.
  4. The effect of income, now expressed in relative terms within each country, is now positive, which is more consistent with results in other studies.  Again, the apparent negative effect in the previous analysis might have been due to variation across countries (see Simpson's paradox).
  5. The effect of the media variables is similar to what we saw before:  Internet use has the strongest effect, 2-3 times bigger than newspapers, which are 2-3 times bigger than television or radio.  And all are negative.
The inconsistent behavior of education and income as control variables is a minor concern, but I think the symptoms are most likely the result of combining countries, possibly made worse because I am not weighting countries by population, so smaller countries are overrepresented.

Here are the results from linear regression with rlgdgr_f (degree of religiosity) as the dependent variable:

               coef     std err    t         P>|t|    [95.0% Conf. Int.]
Intercept      6.0140   0.022      270.407   0.000     5.970   6.058
inwyr07_f      0.0253   0.002       12.089   0.000     0.021   0.029
yrbrn60_f     -0.0172   0.000      -46.121   0.000    -0.018  -0.016
edurank_f     -0.2429   0.023      -10.545   0.000    -0.288  -0.198
hincrank_f    -0.1541   0.022       -7.128   0.000    -0.196  -0.112
tvtot_f       -0.0734   0.003      -24.399   0.000    -0.079  -0.067
rdtot_f       -0.0199   0.002       -8.760   0.000    -0.024  -0.015
nwsptot_f     -0.0762   0.005      -15.673   0.000    -0.086  -0.067
netuse_f      -0.1374   0.002      -57.557   0.000    -0.142  -0.133

In this model, all parameters are statistically significant.  The effect of the media variables, including Internet use, is similar to what we saw before.

The effect of education and income is negative in this model, but I am not inclined to take it too seriously, again because we are combining countries in a way that doesn't mean much.

Breakdown by country

The following table shows results for logistic regression, with rlgblg_f as the dependent variable, broken down by country; the columns are country code, number of observations, and the estimated parameter associated with Internet use:

Country Num      Coef of     
code    obs.     netuse_f
------- ----     --------
AT 6918 -0.0795 **
BE 8939 -0.0299 **
BG 6064 0.0145
CH 9310 -0.0668 **
CY 3293 -0.229 **
CZ 8790 -0.0364 **
DE 11568 -0.0195 *
DK 7684 -0.0406 **
EE 6960 -0.0205
ES 9729 -0.0741 **
FI 7969 -0.0228
FR 5787 -0.0185
GB 11117 -0.0262 **
GR 9759 -0.0245
HR 3133 -0.0375
HU 7806 -0.0175
IE 10472 -0.0276 *
IL 7283 -0.0636 **
IS 579 0.0333
IT 1207 -0.107 **
LT 1677 -0.0576 *
LU 3187 -0.0789 **
LV 1980 -0.00623
NL 9741 -0.0589 **
NO 8643 -0.0304 **
PL 8917 -0.108 **
PT 10302 -0.103 **
RO 2146 0.00855
RU 7544 0.00437
SE 9201 -0.0374 **
SI 7126 -0.0336 **
SK 6944 -0.0635 **
TR 4272 -0.0857 *
UA 7809 -0.0422 **

** p < 0.01, * p < 0.05

In the majority of countries, there is a statistically significant relationship between Internet use and religious affiliation.  In all of those countries the relationship is negative, with the magnitude of most coefficients between 0.03 and 0.11 (with one exceptionally large value in Cyprus).

Degree of religiosity

And here are the results of linear regression, with rlgdgr_f as the dependent variable:

Country Num      Coef of     
code    obs.     netuse_f
------- ----     --------
AT 6918 -0.0151     **
BE 8939 -0.0072     **
BG 6064  0.0023    
CH 9310 -0.0132     **
CY 3293 -0.00221     **
CZ 8790 -0.005     **
DE 11568 -0.0045     *
DK 7684 -0.00909     **
EE 6960 -0.00363    
ES 9729 -0.0165     **
FI 7969 -0.00501    
FR 5787 -0.00429    
GB 11117 -0.0061     **
GR 9759 -0.00362     **
HR 3133 -0.00559    
HU 7806 -0.00478     *
IE 10472 -0.00412     *
IL 7283 -0.00419     **
IS 579  0.00752    
IT 1207 -0.0212     **
LT 1677 -0.00746    
LU 3187 -0.0152     **
LV 1980 -0.00147    
NL 9741 -0.014     **
NO 8643 -0.00721     **
PL 8917 -0.00919     **
PT 10302 -0.0149     **
RO 2146  0.000303    
RU 7544  0.00102    
SE 9201 -0.00815     **
SI 7126 -0.00835     **
SK 6944 -0.0119     **
TR 4272 -0.00331     **
UA 7809 -0.00947     **

In most countries there is a negative and statistically significant relationship between Internet use and degree of religiosity.

In this model the effect of education is consistent: in most countries it is negative and statistically significant.  In the two countries where it is positive, it is not statistically significant.

The effect of income is less consistent: in most countries it is not statistically significant; when it is, it is positive as often as negative.

But education and income are in the model primarily as control variables; they are not the focus of this study.  If they are actually associated with religious affiliation, these variables should be effective controls; if not, they contribute some noise, but otherwise do no harm.

Next steps

For now I am using StatsModels to estimate parameters and compute confidence intervals, but that's not quite right because I am using resampled data and filling missing values with random samples.  To account correctly for these sources of random error, I have to run the whole process repeatedly:
  1. Resample the data.
  2. Fill missing values.
  3. Estimate parameters.
Collecting the estimated parameters from multiple runs, I can estimate the sampling distribution of the parameters and compute confidence intervals.
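A sketch of that loop (resample_and_fill is a placeholder for steps 1 and 2, and formula is the model formula from above; this is illustrative, not the final implementation):

    import numpy as np
    import statsmodels.formula.api as smf

    estimates = []
    for i in range(101):
        # steps 1 and 2: resample by weight, then fill missing values
        sample = resample_and_fill(df)

        # step 3: estimate parameters
        results = smf.logit(formula, data=sample).fit(disp=False)
        estimates.append(results.params['netuse_f'])

    # median and 95% confidence interval of the sampling distribution
    low, median, high = np.percentile(estimates, [2.5, 50, 97.5])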

Once I have implemented that, I plan to translate the results into a form that is easier to interpret (rather than just estimated coefficients), and generate visualizations to make the results easier to explore.

I would also like to relate the effect of Internet use in each country with the average level of religiosity, to see whether, for example, the effect is bigger in more religious countries.

While I am working on that, I am open to suggestions for additional explorations people might be interested in.  You can explore the variables in the ESS using their "Cumulative Data Wizard"; let me know what you find!