Monday, February 18, 2013

Belly Button Biodiversity: Part Three

This is part three of a series of articles about a Bayesian solution to the Unseen Species problem, applied to data from the Belly Button Biodiversity project.

In Part One I presented the simplest version of the algorithm, which I think is easy to understand, but slow.  In Think Bayes I present some ways to optimize it.  In Part Two I apply the algorithm to real data and generate predictive distributions.  Now in Part Three, as promised, I publish the predictions the algorithm generates.  In Part Four I will compare the predictions to actual data.

Background: Belly Button Biodiversity 2.0 (BBB2) is a nation-wide citizen science project with the goal of identifying bacterial species that can be found in human navels (http://bbdata.yourwildlife.org).

Transparent science

In an effort to explore the limits of transparent science, I have started publishing my research in this blog as I go along.  This past summer I wrote a series of articles exploring the relationship between Internet use and religious disaffiliation.  This "publish as you go" model should help keep researchers honest.  Among other things, it might mitigate publication bias due to the "file drawer effect."  And if the data and code are published along with the results, that should help make experiments more reproducible.

Toward that end, I will now subject myself to public humiliation by generating a set of predictions using my almost-entirely-unvalidated solution to the Unseen Species problem.  In the next installment I will publish the correct answers and score my predictions.  Here are the details:

1) I am working with data from the Belly Button Biodiversity project; this data was used in a paper published in PLOS ONE and made available on the web pages of the journal and the researchers.  The data consists of rDNA "reads" from 60 subjects.  In order to facilitate comparisons between subjects, the researchers chose subjects with at least 400 reads, and for each subject they chose a random subset of 400 reads.  The data for the other reads was not published.

2) For each subject, I know the results of the 400 selected reads, and the total number of reads.  I will use my algorithm to generate a "prediction" for each subject, which is the number of additional species in the complete dataset.

3) Specifically, for each subject I will generate 9 posterior credible intervals (CIs) for the number of additional species: a 10% CI, a 20% CI, and so on up to a 90% CI.

4) To validate my predictions, I will count the number of CIs that contain the actual, correct value.  Ideally, 10% of the correct values should fall in the 10% CIs, 20% should fall in the 20% CIs, and so on.  Since the predictions and actual values are integers, a value that hits one end of a predicted CI counts as a half-hit.

 Predictions

And now, without further ado, here are my predictions.  The columns labelled 10, 20, etc. are 10% credible intervals, 20% CIs, and so on.


Code # reads # species 10 20 30 40 50 60 70 80 90
B1234 1392 48 (4, 4) (4, 5) (3, 5) (3, 5) (3, 6) (2, 6) (2, 7) (1, 7) (1, 9)
B1235 2452 69 (11, 12) (10, 12) (10, 13) (9, 13) (8, 14) (8, 15) (7, 16) (6, 17) (5, 19)
B1236 2964 45 (4, 5) (4, 5) (4, 6) (4, 6) (3, 7) (3, 7) (3, 8) (2, 9) (1, 10)
B1237 3090 62 (9, 10) (9, 11) (8, 11) (8, 11) (7, 12) (7, 12) (6, 13) (5, 14) (4, 16)
B1242 3056 61 (9, 9) (8, 10) (8, 10) (7, 11) (7, 11) (6, 12) (6, 14) (5, 15) (5, 16)
B1243 1518 71 (10, 11) (10, 12) (9, 12) (8, 13) (8, 13) (8, 14) (7, 15) (6, 16) (5, 17)
B1246 4230 91 (23, 24) (22, 25) (21, 26) (21, 27) (19, 28) (18, 29) (17, 30) (16, 33) (14, 35)
B1253 1928 86 (16, 17) (15, 18) (14, 18) (14, 20) (13, 20) (13, 21) (12, 23) (11, 24) (10, 26)
B1254 918 58 (5, 5) (4, 6) (4, 6) (3, 6) (3, 7) (3, 7) (2, 8) (2, 9) (1, 10)
B1258 1350 87 (15, 16) (14, 17) (14, 17) (13, 18) (12, 19) (11, 19) (11, 20) (10, 21) (8, 24)
B1259 1002 80 (10, 11) (10, 12) (10, 12) (9, 13) (9, 14) (8, 14) (7, 15) (6, 16) (6, 18)
B1260 1944 96 (22, 23) (21, 24) (20, 25) (19, 25) (19, 26) (18, 27) (17, 29) (15, 30) (14, 32)
B1264 1122 83 (12, 13) (12, 14) (11, 14) (10, 15) (10, 15) (9, 16) (8, 17) (7, 18) (6, 20)
B1265 2928 85 (18, 19) (17, 20) (16, 21) (16, 22) (15, 23) (14, 24) (13, 25) (12, 26) (11, 28)
B1273 2980 61 (9, 9) (8, 10) (8, 10) (7, 11) (7, 12) (6, 12) (6, 13) (5, 14) (4, 16)
B1275 1672 85 (16, 17) (15, 18) (15, 19) (14, 19) (13, 20) (13, 21) (12, 22) (11, 24) (9, 25)
B1278 1242 47 (4, 4) (3, 4) (3, 5) (3, 5) (2, 6) (2, 6) (2, 6) (2, 7) (1, 8)
B1280 1772 46 (4, 4) (4, 5) (3, 5) (3, 5) (3, 6) (2, 6) (2, 7) (2, 8) (1, 9)
B1282 1132 67 (8, 9) (7, 9) (7, 10) (6, 10) (6, 11) (6, 11) (5, 12) (5, 13) (3, 15)
B1283 1414 67 (8, 9) (8, 10) (7, 10) (7, 11) (7, 11) (6, 12) (5, 13) (4, 14) (3, 16)
B1284 1158 91 (15, 16) (14, 17) (14, 17) (13, 18) (13, 19) (12, 19) (12, 20) (10, 21) (9, 23)
B1285 2340 55 (7, 7) (6, 8) (6, 8) (5, 8) (5, 9) (4, 9) (4, 10) (3, 12) (2, 13)
B1286 1728 66 (9, 10) (9, 11) (8, 11) (8, 12) (8, 12) (7, 13) (6, 14) (6, 14) (4, 16)
B1288 1280 107 (23, 24) (22, 25) (21, 25) (21, 26) (20, 27) (19, 27) (18, 29) (17, 31) (15, 32)
B1289 2054 103 (26, 27) (25, 28) (24, 29) (23, 30) (23, 30) (22, 32) (21, 33) (20, 34) (17, 36)
B1291 1248 94 (17, 18) (16, 19) (16, 20) (15, 20) (15, 21) (13, 22) (13, 23) (12, 25) (10, 27)
B1292 1864 82 (15, 16) (14, 16) (13, 17) (13, 18) (13, 19) (12, 20) (11, 21) (10, 22) (9, 24)
B1293 1904 76 (13, 14) (12, 14) (12, 15) (11, 16) (11, 16) (10, 17) (9, 18) (8, 19) (7, 22)
B1294 1784 78 (14, 15) (13, 16) (12, 16) (12, 17) (11, 18) (11, 19) (10, 19) (9, 20) (8, 23)
B1295 1408 70 (10, 10) (9, 11) (9, 12) (8, 12) (8, 12) (7, 13) (7, 14) (6, 15) (4, 17)
B1296 2034 55 (7, 7) (6, 8) (6, 8) (6, 8) (5, 9) (4, 9) (4, 10) (4, 11) (3, 12)
B1298 1478 72 (10, 11) (9, 12) (9, 12) (9, 13) (8, 13) (8, 14) (7, 15) (6, 16) (5, 18)
B1308 1160 58 (6, 6) (5, 7) (5, 7) (5, 7) (4, 8) (4, 8) (3, 9) (3, 10) (2, 11)
B1310 1066 80 (11, 12) (11, 13) (10, 13) (9, 14) (9, 15) (8, 15) (7, 16) (7, 17) (5, 19)
B1374 2364 48 (5, 5) (4, 6) (4, 6) (4, 6) (3, 7) (3, 7) (3, 8) (2, 9) (2, 10)
B940 2874 93 (22, 24) (21, 25) (21, 25) (20, 26) (19, 27) (19, 28) (18, 30) (16, 32) (14, 33)
B941 2154 48 (5, 6) (5, 6) (4, 6) (4, 7) (4, 7) (3, 7) (3, 8) (2, 9) (2, 11)
B944 954 52 (4, 4) (4, 5) (3, 5) (3, 5) (3, 6) (2, 6) (2, 6) (2, 7) (1, 9)
B945 2390 67 (10, 11) (10, 12) (9, 12) (9, 13) (8, 13) (8, 14) (7, 15) (7, 16) (5, 17)
B946 5012 85 (20, 21) (19, 22) (19, 23) (18, 24) (18, 24) (17, 26) (16, 27) (15, 28) (12, 31)
B947 3356 62 (10, 11) (9, 11) (9, 12) (8, 12) (7, 13) (7, 14) (6, 14) (5, 15) (5, 17)
B948 2384 80 (16, 17) (15, 18) (14, 18) (14, 19) (13, 20) (12, 21) (11, 22) (10, 23) (9, 25)
B950 1560 63 (8, 9) (8, 10) (8, 10) (7, 10) (7, 11) (6, 11) (5, 12) (5, 13) (4, 15)
B952 1648 57 (7, 7) (6, 8) (6, 8) (6, 8) (5, 9) (5, 9) (4, 10) (3, 11) (3, 12)
B953 1452 32 (2, 2) (1, 2) (1, 2) (1, 3) (1, 3) (1, 3) (0, 3) (0, 4) (0, 5)
B954 1996 29 (2, 2) (1, 2) (1, 2) (1, 2) (1, 3) (1, 3) (0, 3) (0, 4) (0, 4)
B955 1474 65 (8, 9) (8, 9) (7, 9) (7, 10) (7, 10) (6, 11) (5, 12) (5, 13) (4, 14)
B956 1482 71 (10, 11) (10, 12) (10, 12) (9, 13) (8, 14) (8, 14) (7, 15) (6, 16) (5, 18)
B957 2604 36 (3, 3) (3, 3) (2, 4) (2, 4) (2, 5) (1, 5) (1, 6) (1, 6) (1, 7)
B958 2840 29 (2, 2) (2, 2) (1, 2) (1, 3) (1, 3) (1, 3) (1, 4) (0, 4) (0, 5)
B961 1214 36 (2, 3) (2, 3) (2, 3) (2, 4) (1, 4) (1, 4) (1, 5) (1, 5) (0, 6)
B962 1138 41 (3, 3) (2, 3) (2, 3) (2, 4) (2, 4) (1, 4) (1, 5) (1, 6) (0, 7)
B963 1600 71 (10, 12) (10, 12) (9, 12) (9, 13) (8, 14) (8, 14) (7, 15) (5, 16) (4, 19)
B966 1950 80 (15, 16) (14, 16) (14, 17) (13, 17) (12, 18) (11, 18) (11, 20) (10, 22) (9, 23)
B967 1108 47 (3, 4) (3, 4) (3, 4) (2, 5) (2, 5) (2, 5) (2, 6) (1, 7) (1, 7)
B968 2432 52 (6, 7) (6, 7) (6, 7) (5, 8) (5, 8) (4, 9) (3, 9) (3, 10) (2, 11)
B971 1462 49 (4, 5) (4, 5) (4, 5) (3, 6) (3, 6) (3, 7) (2, 7) (2, 8) (1, 9)
B972 1438 75 (11, 12) (11, 13) (10, 13) (10, 14) (9, 14) (8, 15) (8, 16) (7, 17) (6, 18)
B974 5072 54 (7, 8) (7, 9) (6, 9) (6, 9) (5, 10) (5, 10) (4, 11) (4, 12) (2, 14)
B975 1542 63 (7, 8) (7, 9) (6, 9) (6, 9) (6, 10) (5, 11) (5, 11) (4, 12) (3, 14)

Reading the last row, subject B975 yielded 1542 reads; in a random subset of 400 reads, there were 63 different species (or more precisely, OTUs).  My algorithm predicts that if we look at all 1542 reads, the number of additional species we'll find is between 3 and 14, with 90% confidence.

I have to say that this table fills me with dread.  The intervals seem quite small, which is to say that the algorithm is more confident than I am.  The 90% CIs seem especially narrow to me; it is hard for me to believe that 90% of them will contain the correct values.  Well, I guess that's why Karl Popper called them "bold hypotheses".  We'll find out soon whether they are bold, or just reckless.

I want to thank Rob Dunn at BBB2 for his help with this project.  The code and data I used to generate these results are available from this SVN repository.

EDIT 2-22-13: I ran the predictions again with more simulations.  The results are not substantially different.  I still haven't looked at the answers.

No comments:

Post a Comment