• Shortcuts : 'n' next unread feed - 'p' previous unread feed • Styles : 1 2

» Publishers, Monetize your RSS feeds with FeedShow:  More infos  (Show/Hide Ads)

Date: Thursday, 20 Mar 2014 03:00

Next week I’ll be in Melbourne for one of my favourite meetings, the annual Computational and Simulation Sciences and eResearch Conference.

The main reason for my visit is the Bioinformatics FOAM workshop. Day 1 (March 27) is not advertised since it is an internal CSIRO day, but I’ll be presenting a talk titled “SQL, noSQL or no database at all? Are databases still a core skill?“. Day 2 (March 28) is open to all and I’ll be talking about “Learning from complete strangers: social networking for bioinformaticians“.

I imagine these and other talks will appear on Slideshare soon, at both my account and that of the Australian Bioinformatics Network.

I’m also excited to see that Victoria Stodden is presenting a keynote at the main CSS meeting (PDF) on “Reproducibility in Computational Science: Opportunities and Challenges”.

Hope to see some of you there.

Filed under: bioinformatics, computing, meetings, travel Tagged: conference, foam, melbourne
Author: "nsaunders" Tags: "bioinformatics, computing, meetings, tra..."
Comments Send by mail Print  Save  Delicious 
Date: Sunday, 09 Mar 2014 22:58

A DOI, this morning

A DOI, this morning

When I arrive at work, the first task for the day is “check feeds”. If I’m lucky, in the “journal TOCs” category, there will be an abstract that looks interesting, like this one on the left (click for larger version).

Sometimes, the title is a direct link to the article at the journal website. Often though, the link is a Digital Object Identifier or DOI. Frequently, when the article is labelled as “advance access” or “early”, clicking on the DOI link leads to a page like the one below on the right.

DOI #fail

DOI #fail

In the grand scheme of things I suppose this rates as “minor annoyance”; it means that I have to visit the journal website and search for the article in question. The question is: why does this happen? I’m not familiar with the practical details of setting up a DOI, but I assume that the journal submits article URLs to the DOI system for processing. So who do I blame – journals, for making URLs public before the DOI is ready, or the DOI system, for not processing new URLs quickly enough?

There’s also the issue of whether terms like “advance access” have any meaning in the era of instant, online publishing but that’s for another day.

Filed under: uncategorized Tagged: doi, journals, publishing
Author: "nsaunders" Tags: "uncategorized, doi, journals, publishing"
Comments Send by mail Print  Save  Delicious 
Date: Thursday, 27 Feb 2014 23:42

One of my more popular posts is A brief introduction to “apply” in R. Come August, it will be four years old. Technology moves on, old blog posts do not.

So: thanks to BioStar user zx8754 for pointing me to this Stack Overflow post, in which someone complains that the code in the post does not work as described. The by example is now fixed.

Side note: I often find “contact the author” is the most direct approach to solving this kind of problem ;) always happy to be contacted.

Filed under: R, statistics, this blog Tagged: biostar, stack overflow, stackexchange
Author: "nsaunders" Tags: "R, statistics, this blog, biostar, stack..."
Comments Send by mail Print  Save  Delicious 
Date: Thursday, 06 Feb 2014 02:28

I’m pleased to announce an open-access publication with my name on it:

Mitchell, S.M., Ross, J.P., Drew, H.R., Ho, T., Brown, G.S., Saunders, N.F.W., Duesing, K.R., Buckley, M.J., Dunne, R., Beetson, I., Rand, K.N., McEvoy, A., Thomas, M.L., Baker, R.T., Wattchow, D.A., Young, G.P., Lockett, T.J., Pedersen, S.K., LaPointe L.C. and Molloy, P.L. (2014). A panel of genes methylated with high frequency in colorectal cancer. BMC Cancer 14:54.

If you’re thinking “but Neil never blogs about what he actually does at work anymore”, jump to this footnote [1].


Expression of MAMDC2 in cell lines HT29 and SW480 with (orange) or without (blue) 5-aza 2′deoxycytidine treatment

The title pretty much sums it up. Take a bunch of tissues and cell lines, subject them to a bunch of methods to detect methylation and select some promising methylation biomarker candidates based on consensus between methods.

My main contribution to the paper was analysis of the data from one of the experiments described in the abstract: “activation of gene expression in CRC cell lines following DNA demethylating treatment.” When the final version of the supplementary material appears, you should see plots similar to the one at right, showing increased expression of exon microarray probesets in some colorectal cancer cell lines when treated with a demethylating agent.

Biomarker discovery is our main goal, so to quote the conclusion verbatim:

This study has characterised a panel of 23 genes that show elevated DNA methylation in
>50% of CRC tissue relative to non-neoplastic tissue. Six of these genes (SOX21, SLC6A15,
NPY, GRASP, ST8SIA1 and ZSCAN18) show very low methylation in non-neoplastic
colorectal tissue and are candidate biomarkers for stool-based assays, while 11 genes
SOX21) have very low methylation in peripheral blood DNA and are suitable for further
evaluation as blood-based diagnostic markers.

[1] Despite the tagline for this blog, very few of my posts feature the details of my own day-to-day research. The reasons for this are many and sometimes complicated, but are often because (1) I work with confidential data that I don’t own and (2) I work for an organisation which has strict guidelines about discussing internal information on social media.

I get around this by writing about bioinformatics in general, rather than what I do in particular. I think this is more interesting anyway, so it suits me just fine.

Filed under: bioinformatics, publications, research diary Tagged: biomarker, colorectal cancer, csiro, methylation
Author: "nsaunders" Tags: "research diary, bioinformatics, publicat..."
Comments Send by mail Print  Save  Delicious 
Date: Wednesday, 05 Feb 2014 22:22

So, I read the title:

Mining locus tags in PubMed Central to improve microbial gene annotation

and skimmed the abstract:

The scientific literature contains millions of microbial gene identifiers within the full text and tables, but these annotations rarely get incorporated into public sequence databases.

and thought, well OK, but wouldn’t it be better to incorporate annotations in the first place – when submitting to the public databases – rather than by this indirect method?

The point, of course, is to incorporate new findings from the literature into existing records, rather than to use the tool as a primary method of annotation. I do believe that public databases could do more to enforce data quality standards at deposition time, but that’s an entirely separate issue.

Big thanks to Michael Hoffman for a spirited Twitter discussion that put me straight.

Filed under: bioinformatics, publications Tagged: discussion, mea culpa, twitter
Author: "nsaunders" Tags: "bioinformatics, publications, discussion..."
Comments Send by mail Print  Save  Delicious 
Date: Sunday, 02 Feb 2014 22:43

On a rare, brief holiday (here and here, if you’re interested; both highly-recommended), I make the mistake of checking my Twitter feed:

This points me to BoxPlotR. It draws box plots. Using Shiny Server. That’s the “innovation”, presumably.

With “quilt plots” and now this, I’m starting to think that I’ve been doing science wrong all these years. If I’d been told to submit the trivial computational work I do every single day to journals, I could have thousands of publications by now.

I’m still pretty relaxed post-holiday, so let’s just leave it there.

Filed under: publications, R, statistics Tagged: boxplot, methods, nature
Author: "nsaunders" Tags: "publications, R, statistics, boxplot, me..."
Comments Send by mail Print  Save  Delicious 
Date: Friday, 24 Jan 2014 03:47

I enjoyed this story from the OpenHelix blog today, describing a Microsoft Research project to mine DNA sequences from web pages and map them to UCSC genome builds.

Laura DeMare asks: what was the most-hit gene?

First, visit the UCSC Table Browser. For the human genome hg19 build, the relevant group is “Phenotype and Literature”, the track is “Web Sequences” and the table is pubsBingBlat. Check the “genome” button under regions, enter a filename (I chose bingblat.gz, with gzip compression) and then click “get output”.

(notes: there are other “Bing” tables – but pubsBingBlat contains gene symbols, so seems easiest to work with in the first instance. The following analysis could also be done using the UCSC MySQL server.)

Open the file in R. It contains 313 510 rows. Gene symbols are in the last column, “locus”, which contains either one symbol or two separated by a comma.

bingblat <- read.table("bingblat.gz", header = T, sep = "\t",
                       stringsAsFactors = F, comment.char = "", quote = "")
# [1] 313510     27

#  [1] "X.bin"       "chrom"       "chromStart"  "chromEnd"    "name"       
#  [6] "score"       "strand"      "thickStart"  "thickEnd"    "reserved"   
# [11] "blockCount"  "blockSizes"  "chromStarts" "tSeqTypes"   "seqIds"     
# [16] "seqRanges"   "publisher"   "pmid"        "doi"         "issn"       
# [21] "journal"     "title"       "firstAuthor" "year"        "impact"     
# [26] "classes"     "locus"

[1] "WASH2P,WASH7P" "WASH7P"        "OR4F5"         "OR4F5"        
[5] "OR4F5"         "OR4F5"

If we split locus on comma, then unlist, we’ll get a vector of gene symbols and we can count them up using table. The top 10:

genes <- unlist(strsplit(bingblat$locus, ","))
genes.df <- as.data.frame(table(genes))
head(genes.df[order(genes.df$Freq, decreasing=T),], 10)

#           genes  Freq
# 5308      GAPDH 11096
# 8886   MIR4436A  4341
# 6628      IGHG1  4245
# 186        ACTB  3280
# 242       ADAM6  2466
# 7172   KIAA0125  2135
# 7704  LINC00226  2082
# 4155     ELK2AP  1714
# 15728      TP53  1398
# 6043        HBB  1380

GAPDH, by a long way. You’d guess that this is related to the frequent use of GAPDH as a “housekeeping gene” in assays for differential expression.

Here’s the full list with counts, as a public (I hope) TSV file at Dropbox.

Filed under: bioinformatics, genomics, R, statistics Tagged: blat, microsoft research, ucsc
Author: "nsaunders" Tags: "bioinformatics, genomics, R, statistics,..."
Comments Send by mail Print  Save  Delicious 
Date: Wednesday, 15 Jan 2014 21:36

Stephen tweets:


A “quilt plot”

Quilt plots. Sounds interesting. The link points to a short article in PLoS ONE, containing a table and a figure. Here is Figure 1.

If you looked at that and thought “Hey, that’s a heat map!”, you are correct. That is a heat map. Let’s be quite clear about that. It’s a heat map.

So, how do the authors justify publishing a method for drawing heat maps and then calling them “quilt plots”?

Well, they at least admit that the quilt plot is a heat map. Or as they insist throughout, “heat maps“, in quotes, you know, “so-called heat maps”. Here’s a paragraph from the discussion, reproduced in full because it is so nonsensical and strange:

“Quilt plots” can be considered as a simple formulation of “heat maps”. They produce a similar graphical display to “heat maps” when the “clustering” and “dendrogram” options are turned off. In addition, “quilt plots” have several advantages over “heat maps”. Firstly, unlike “heat maps”, “quilt plots” come with easily understood R-functions (i.e. plot, legend and color). In addition, R is freely available software and supported by leading statistical experts around the world, and it is important to promote the use of this software among epidemiological researchers. In addition it is difficult to learn to use R compared to other statistical packages. For example, “heat maps” require the specification of 21 arguments including hierarchical clustering, weights for re-ordering the row and columns dendrogram, which are not always easily understood unless one has an extensive programming knowledge and skills. One of the aims of our paper is to present “quilt plots” as a useful tool with simply formulated R-functions that can be easily understood by researchers from different scientific backgrounds without high-level programming skills.

So what they’re saying is: quilt plots are heat maps. However, heat maps have complicated options. Turn those options off and you have a quilt plot. Quilt plots are written in R, which makes them easy. It’s good for people to learn R. Except that R is difficult. Especially when you use heat maps. Which are quilt plots. Which make R easier.


The last section of the discussion begins:

Although our method cannot be considered “new”, the novelty is to make these types of methodologies more accessible for researchers from different scientific backgrounds and without the need for strong computing skills.

I do not buy this argument at all. Providing a slightly simpler version of an R function is not going to “lower the bar” to using R for biologists lacking computational skills. I recommend a Software Carpentry bootcamp as a solution to that issue.

I certainly agree that the method cannot be considered new. Which raises the question: should it have been published?

We’re frequently told that the main criterion for publication in PLoS ONE is soundness of methodology. I guess this article is sound in that the code does what it says: generates a heat map. Sorry, quilt plot. However, that criterion is an over-simplification. The criteria for publication are listed here and include:

  • The study presents the results of primary scientific research
  • Results reported have not been published elsewhere
  • Experiments, statistics, and other analyses are performed to a high technical standard and are described in sufficient detail

I’d argue that this is not primary research and that the statistics are not of a high technical standard. I’d even make the case that the “results” can be summarised as “how to draw a heat map”, which has been published previously and is well-known. My main argument though, would be that this does not represent an advance in either methodology or knowledge and as such, does not warrant publication in a scholarly journal. Code like this should go straight to Github, where it will be found and used by someone – if of use.

I doubt that @rvimieiro is the only person to have this thought:

but that’s another blog post. For now, let’s just say that the reviewers and/or editors could have done better with this one.

Filed under: bioinformatics, publications, R, software, statistics Tagged: heat map, plos one, r-project, software
Author: "nsaunders" Tags: "bioinformatics, publications, R, softwar..."
Comments Send by mail Print  Save  Delicious 
Date: Sunday, 05 Jan 2014 22:37

May as well begin 2014 where we left off: complaining about the attitude of scientific publishers regarding reproducible computational research.

I had a “Twitter blurt”. That’s when you read, react and tweet. Happens to the best of us. With hindsight, it was perhaps a little harsh:

The link is to an editorial in Nature Genetics, “Credit for code.” It points out, quite rightly, that “review, replication, reuse and recognition are all incentives to provide code” in research publications. After that promising start though, things get a little strange.

The article is written in a rather awkward, unconvincing style which suggests the editor(s) are not familiar nor comfortable with the subject. Phrases like “instantiated in software written for computers and other laboratory machines” sound, well, just weird. As for “it is also useful to offer the code actually used nonexclusively to the journal in a supplementary text or archived file” – first, that barely makes sense and second, it’s the legalese of people more accustomed to coaxing authors into giving up copyright. It’s unlikely to sit well with many scientific programmers.

The article uses CRAN and Bioconductor as examples of good practice in scientific software development, but again the tone is a little odd.

The journal has sufficient experience with these resources to endorse their use by authors. We do not yet provide any endorsement for the suitability or usefulness of other solutions but will work with our authors and readers, as well as with other journals, to arrive at a set of principles and recommendations.

What are they trying to say? “Our authors seem to use R a lot, so we’re guessing it’s good and besides, we don’t know about anything else”? There’s a substantial and active online community which has already developed principles and recommendations for publishing computational research. I’d suggest the editors get started by visiting Software Carpentry, searching Titus’s blog and reading this Ten Simple Rules article.

The last paragraph is the reason for my “Twitter blurt”. It begins:

If these best practices are not possible, there are ways not to make the current situation worse.

I’d rather we – especially the journals – strive for best practice, rather than adopt an air of resignation. It gets worse, though:

If none of these solutions are feasible, please do declare when there is code involved in the work, even if it is proprietary or unavailable, and provide equations or algorithms that enable a reader to understand and replicate analytical decisions made with the research software.

Few things are more frustrating, or more likely to result in irreproducibility and error, than trying to reconstruct a computational analysis based on a prosaic description of an algorithm in a research article. Yet this is a very typical part of the working day in my field (bioinformatics) and I imagine, in many others.

I may have blurted, but 12 retweets and 10 favourites suggest I hit a chord with a few people. As I suggested in a reply, I’d rather see journals leading the way by mandating standards for publishing computational research, rather than making weak suggestions.

Filed under: computing Tagged: editorial, npg, publishing, reproducibility
Author: "nsaunders" Tags: "computing, editorial, npg, publishing, r..."
Comments Send by mail Print  Save  Delicious 
Date: Friday, 03 Jan 2014 02:49

In something of an end-of-year tradition, WordPress provides users with an effort-free blog post in the form of an annual report. Here is mine.

My ambitious plan at the start of 2013 was to aim for 4 posts a month. I managed 28 and I’m happy with that; about one every two weeks.

Looking forward to a new year of blogging. All the best to you and yours for 2014.

Filed under: personal, this blog
Author: "nsaunders" Tags: "personal, this blog"
Comments Send by mail Print  Save  Delicious 
Date: Wednesday, 18 Dec 2013 02:16

Laboratory work, of the “wet” kind, not working out for you? Or perhaps you just need new challenges. Think you have some aptitude with data analysis, computers, mathematics, statistics? Maybe a switch to computational biology is what you need.

That’s the topic of the Nature Careers feature “Computing: Out of the hood“. With thoughts and advice from (on Twitter) @caseybergman, @sarahmhird, @kcranstn, @PavelTomancak, @ctitusbrown and myself.

I enjoyed talking with Roberta and she did a good job of capturing our thoughts for the article. One of these days, I might even write here about my own journey in more detail.

Filed under: bioinformatics, career, computing Tagged: journals, nature
Author: "nsaunders" Tags: "bioinformatics, career, computing, journ..."
Comments Send by mail Print  Save  Delicious 
Date: Tuesday, 03 Dec 2013 04:04

Admitting to stupidity is part of the learning process. So in the interests of public education, here’s something stupid that I did today.

You’re working in the R console. Happy with your exploratory code, you decide to save it to a file.

savehistory(file = "myCode.R")

Then, you type something else, for example:

# more lines here

And then, decide that you should save again:

savehistory(file = "myCode.R")

You quit the console. Returning to it later, you recall that you saved your code and so can simply run source() to get back to the same point:


Unfortunately, you forget that the sourced file now contains the savehistory() command. Result: since your new history contains only the single line source() command, then that is what gets saved back to the file, replacing all of your lovely code.

Possible solutions include:

  • Remember to edit the saved file, removing or commenting out any savehistory() lines
  • Generate a file name for savehistory() based on a timestamp so as not to overwrite each time
  • Suggested by Scott: include a prompt in the code before savehistory()

Filed under: programming, R, research diary, statistics Tagged: stupidity
Author: "nsaunders" Tags: "programming, R, research diary, statisti..."
Comments Send by mail Print  Save  Delicious 
Date: Tuesday, 05 Nov 2013 21:40

This bioinformatician, at least. Hate is a strong word. Perhaps “dislike” is better.

Short answer: because you can’t get data out of them easily, if at all. Longer answer:

If I still had time for fun “side projects”[1], I’d be interested in the newly-sequenced genomes of Pandoraviruses. In particular, I’d be somewhat suspicious regarding the very high proportion of hypothetical and often short ORFs in their genomes. I learn from the publication that 210 putative proteins from P. salinus were validated using mass spectrometry. I wonder: which of those correspond to the short, hypothetical ORF products?

Note – I don’t mean to single out this article for particular criticism. It just provides a good, recent example of the issues that affect almost every journal article, in the eyes of people who care about data.

Problem 1: the supplementary data
For reasons which elude me, the favoured form of supplementary data is still a huge PDF file. PDFs were designed to be printed out and read by humans. Why anyone still believes that they’re suitable as containers for raw data is one of the great mysteries of science.

This article is no exception. That said, it does at least describe the computational methods used in reasonable detail.

I struggle with the entire notion of data as “supplemental”. Presumably, it still means “not considered sufficiently important to occupy page space.” To a bioinformatician of course, the data are far more important than the prosaic description in the journal article. As for page space – hello? The Web? Anyone?

Problem 2: the arbitrary nature of what data are included
Proteins identified using mass spectrometry are listed in table S2 of the supplementary data. Except where they are not. The authors have chosen to list only those proteins (42/210) which also have a significant BLAST match in the NR database. So the answer to my question: which of the validated proteins correspond to short ORFs annotated as “hypothetical” remains “I don’t know.”

I’d argue that if data are cited in the main article:

The ion-mass data were interpreted in reference to a database that includes the A. castellanii (27) and the P. salinus predicted protein sequences. A total of 266 proteins were identified on the basis of at least two different peptides. Fifty-six of them corresponded to A. castellanii proteins presumably associated with the P. salinus particles, and 210 corresponded to predicted P. salinus CDSs.

then that same data – in this case, the complete lists of identified proteins – needs to be made available.

Problem 3: naming things
Staying with table S2, we see one of the great problems in computation – naming things. Column 2 is headed “ORF #” and contains an integer: 400, 258, 356…

What are these numbers? Are they supposed to help us retrieve the ORF in question? We can grab the Genbank file of P. salinus ORFs from the NCBI and look at the annotations. How about RNA polymerase Rpb5, which table S2 tells us is “ORF # 650″?

     CDS             627452..628276
                     /product="RNA polymerase Rpb5"

A few other examples match up too: looks like “ORF #” has become “old_locus_tag” in the database deposition.

Great! Unless we download the ORFs in FASTA format, where old_locus_tag is not found in the headers:

>lcl|NC_022098.1_cdsid_YP_008437064.1 [gene=O182_gp0647] [protein=RNA polymerase Rpb5] [protein_id=YP_008437064.1] [location=627452..628276]

Or the proteins in FASTA format…where it seems that headers contain the old locus tag for proteins annotated as hypothetical, but no tag at all where proteins are named. And old locus tags apparently run in reverse order through the file:

>gi|531035435|ref|YP_008437065.1| hypothetical protein ps_651 [Pandoravirus salinus]

>gi|531035434|ref|YP_008437064.1| RNA polymerase Rpb5 [Pandoravirus salinus]

>gi|531035433|ref|YP_008437063.1| hypothetical protein ps_649 [Pandoravirus salinus]

The problem here is that database deposition and annotation are completely separate processes to article writing and submission. Were this not the case, confusing inconsistencies between what you read and what you see in the data could be avoided.

If you publish a genome, I should be able to derive all of the information that you derived. Ideally, I would be able to do so using raw data deposited alongside your publication, rather than digging around in databases and piecing together what you did after the fact.

And that is why I – and I’d say most bioinformaticians – hate the traditional journal article.

[1] I’m a parent of a 2 year-old

Filed under: bioinformatics, genomics Tagged: journals, pandoravirus, publishing
Author: "nsaunders" Tags: "bioinformatics, genomics, journals, pand..."
Comments Send by mail Print  Save  Delicious 
Date: Wednesday, 30 Oct 2013 22:49

Reading an interesting post at Genomes Unzipped, “Human genetics is microbial genomics“, which states:

Only 10% of cells on your “human” body are human anyway, the rest are microbial.

Have you read a sentence like that before? So have I. So has a reader who left a comment:

I was wondering if you have a source for “Only 10% of cells on your “human” body are human anyway, the rest are microbial”

It’s a good question. Everyone quotes this figure, almost no-one provides a reference. Let’s go in search of one.

Wikipedia. Don’t make it your primary source. However, it’s often a good place to start. From the human microbiome article:

Bacterial cells are much smaller than human cells, and there are at least ten times as many bacteria as human cells in the body (approximately 1014 versus 1013).[7] [8]

Reference [8]: Berg, RD (1996). The indigenous gastrointestinal microflora. Trends Microbiol. 4 (11): 430–435. If you have access to the full text, you’ll find a PDF which seems to be a poor-quality scan of the printed article. From page 432:

In summary, there are ten viable indigenous bacteria in the GI tract for every cell in the human body: 1013 total GI bacteria compared with 1012 total cells making up the human body.

Note the order of magnitude differences compared with the Wikipedia article. I was unable to find a reference in this article to any paper where these numbers were measured or estimated. And so…

…reference [7]: Savage, DC (1977). Microbial Ecology of the Gastrointestinal Tract. Ann. Rev. Microbiol. 31: 107-133.

Another PDF so old that we cannot copy/paste from it. The opening sentences:

The adult human organism is said to be composed of approximately 1013 eukaryotic animal cells (27). That statement is only an expression of a particular point of view. The various body surfaces and the gastrointestinal canals of humans may be colonized by as many
as 1014 indigenous prokaryotic and eukaryotic microbial cells (70).

Note the order of magnitude discrepancies compared with the previous article.

Reference (27) is: Dobzhansky, T. (1971). Genetics of the Evolutionary Process, Vol. 1. New York: Columbia Univ. We’ll leave that one there.

Reference (70) is: Luckey, TD (1972). Am. J. Clin. Nutr. 25: 1292-95. Time to search PubMed:

luckey td[au] 1972[dp]

It’s the first entry, titled “Introduction to intestinal microecology” and it’s freely-available.

Hey look, another ancient PDF – and the last page is 1294, not 1295. First page, second paragraph:

The composition of this system is surprising. Adult man carries 1012 microbes associated with his epidermis and 1014 microbes in his alimentary tract (Fig. 1). The latter number is based upon 1011 microbes/g contents of an alimentary tract with a capacity of approximately 1 liter. The 1013 cells (2) in his body are a distinct numerical minority of the total being that we call man. If we abandon anthropomorphism for the microbic view, we must admire the efficiency of these microbes in using man as a vehicle to further their own cause.

This would seem to be the “definitive” reference, for now. Reference (2) in that quote is Dobzhansky, again.

No mention of adult woman. It’s OK, that’s just how people spoke and wrote in the 1970s.

Filed under: publications Tagged: gut, human, microbiology, microbiome, microbiota
Author: "nsaunders" Tags: "publications, gut, human, microbiology, ..."
Comments Send by mail Print  Save  Delicious 
Date: Wednesday, 30 Oct 2013 03:41

You can guarantee that when scientists publish a study titled:

Determining the Presence of Periodontopathic Virulence Factors in Short-Term Postmortem Alzheimer’s Disease Brain Tissue

a newspaper will publish a story titled:

Poor dental health and gum disease may cause Alzheimer’s

Without access to the paper, it’s difficult to assess the evidence. I suggest you read Jonathan Eisen’s analysis of the abstract. Essentially, it makes two claims:

  • that cultured astrocytes (a type of brain cell) can adsorb and internalize lipopolysaccharide (LPS) from Porphyromonas gingivalis, a bacterium found in the mouth
  • that LPS was also detected in brain tissue from 4/10 Alzheimer’s disease (AD) cases, but not in tissue from 10 matched normal brains

Regardless of the biochemistry – which does not sound especially convincing to me[1] – how about the statistics?

LPS was detected in 0/10 normal brains, compared with 4/10 AD brains. The “tl;dr” version of this discussion – if you think that those look like rather small numbers, you’re correct.

We can set up a matrix in R to contain those values.

ad <- matrix(c(4, 6, 0, 10), nrow = 2)
colnames(ad) <- c("AD", "Norm")
rownames(ad) <- c("lps+", "lps-")

#      AD Norm
# lps+  4    0
# lps-  6   10

Are those proportions significantly different? Or to put it another way: “I just need to know if 3 (10) patients are enough.” Let’s talk about statistical power.

Without going too deeply into the mathematics, the power of a statistical test is a number between 0 and 1, which describes the probability of a type II (false negative) error. For example, when power = 0.8, the probability of a false negative (concluding no difference between groups when in fact, there is one) is 0.2.

We’re looking at proportions in two groups (a two-proportion test), where the power depends on several parameters:

  • sample size (n, per group)
  • the proportions in each group (p1 and p2)
  • probability of type I (false positive) error (sig.level)
  • whether the test is one- or two-sided (alternative)

R provides us with the function power.prop.test():


     power.prop.test(n = NULL, p1 = NULL, p2 = NULL, sig.level = 0.05,
                     power = NULL,
                     alternative = c("two.sided", "one.sided"),
                     strict = FALSE)

How it works: you set one of the parameters n, p1, p2, sig.level or power to NULL and it is calculated from the other parameters.

To get started – what’s the power of the study in the publication, using sig.level = 0.05?

ppt <- power.prop.test(n = 10, p1 = 0, p2 = 0.4)
# [1] 0.6250675

Effectively, what that means is that the probability of a false negative (concluding no difference in LPS detection between normal and AD brains when there is a difference) is about 0.375. That’s rather high.

Most researchers set power = 0.8 as an acceptable threshold. So – how many samples per group do we need to achieve power = 0.8 at sig.level = 0.05?

ppt <- power.prop.test(p1 = 0, p2 = 0.4, power = 0.8)
# [1] 14.45958

About 15 samples. Not many more than 10 – but more than 10, nevertheless. How about something more stringent: power = 0.9, sig.level = 0.01?

ppt <- power.prop.test(p1 = 0, p2 = 0.4, power = 0.9, sig.level = 0.01)
# [1] 27.16856

Note that with larger sample sizes (e.g. 100 per group), the proportion of normal brains containing LPS can be quite high compared with AD brains and still be significant (at p = 0.05):

ppt <- power.prop.test(p2 = 0.4, power = 0.8, n = 100)
# [1] 0.2180086

Finally, a plot showing the increase of power with group sample size at p = 0.05, p = 0.01 (click for larger version):

Power versus sample size

Power versus sample size at p = 0.01, p = 0.05

# quick, dirty, ugly, but it works
df1 <- data.frame(n = 10:30,
                  p = sapply(10:30, function(x) power.prop.test(p1 = 0, p2 = 0.4, n = x)$power),
                  s = 0.05)
df2 <- data.frame(n = 10:30,
                  p = sapply(10:30, function(x) power.prop.test(p1 = 0, p2 = 0.4, n = x, sig.level = 0.05)$power),
                  s = 0.05)
df3 <- rbind(df1, df2)
ggplot(df3) + geom_point(aes(n, p, color = factor(s))) + theme_bw()

So much for power. How about testing for a difference between the groups?

You might be tempted to reach for the two-proportion test, implemented in R as prop.test(). You should not – but here’s the result anyway:


	2-sample test for equality of proportions with continuity correction

data:  c(0, 4) out of c(10, 10)
X-squared = 2.8125, df = 1, p-value = 0.09353
alternative hypothesis: two.sided
95 percent confidence interval:
 -0.803636315  0.003636315
sample estimates:
prop 1 prop 2 
   0.0    0.4 

Warning message:
In prop.test(c(0, 4), c(10, 10)) :
  Chi-squared approximation may be incorrect

Note the warning message. That’s telling you that we don’t have enough samples for the calculation to be informative. The so-called Cochran conditions stipulate that no cell should contain a count of zero and more than 80% of cells should have counts of at least 5. Some power calculators, such as this one, will tell you when these assumptions have been violated.

The alternative is Fisher’s exact test:


	Fisher's Exact Test for Count Data

data:  ad
p-value = 0.08669
alternative hypothesis: true odds ratio is not equal to 1
95 percent confidence interval:
 0.7703894       Inf
sample estimates:
odds ratio 

In summary then: not only is the conclusion that “LPS from periodontal bacteria can access the AD brain during life” rather premature, but this study “lacks power”. We cannot say whether LPS detection in the AD brains is significantly different to that in normal brains.

[1] and I used to be a biochemist (D. Phil., 1997)

Filed under: publications, R, science news, statistics Tagged: alzheimers, bacteria, lps, power, significance
Author: "nsaunders" Tags: "publications, R, science news, statistic..."
Comments Send by mail Print  Save  Delicious 
Date: Tuesday, 17 Sep 2013 02:55

Web services are great. Pass them a URL. Structured data comes back. Parse it, analyse it, visualise it. Done.

Web scraping – interacting programmatically with a web page – is not so great. It requires more code and when the web page changes, the code breaks. However, in the absence of a web service, scraping is better than nothing. It can even be rather satisfying. Early in my bioinformatics career the realisation that code, rather than humans, can automate the process of submitting forms and reading the results was quite a revelation.

In this post: how to interact with a web page at the NCBI using the Mechanize library.

A Biostar question:

I would like to know if I can convert PMID to NIHMSID, if there is no PMCID associated with that PMID and retrieve both of them if PMCID exists. I want to do this programmatically may be using eutils if possible. I know about this link http://www.ncbi.nlm.nih.gov/pmc/pmctopmid/ that will do the work but its not programmatically. I also tried the services by PMC http://www.ncbi.nlm.nih.gov/pmc/tools/oai/#examples but not much of success.

Some definitions. PMID = PubMed ID, an identifier for articles in the PubMed database. PMCID, a similar identifier for articles deposited in PubMed Central. NIHMSID – articles in PubMed Central have one of these identifiers if submitted via the NIH manuscript submission system.

The NCBI provide this web page, but no web service, for converting PMID to PMCID/NIHMSID or PMCID to PMID.

Enter Mechanize. I know of libraries for Perl, Python and Ruby – there may be others. Here’s how to use the Ruby version to submit a form and retrieve the results.

First, load the library and fetch the web page:

require 'mechanize'

agent = Mechanize.new
page  = agent.get("http://www.ncbi.nlm.nih.gov/pmc/pmctopmid/")

If you view the page on the Web, you’ll see that it contains 2 forms: one for a general search at NCBI, the other for ID conversion. You can verify this by searching the Mechanize::Page object for forms and pretty-printing the second one:

puts page.search("form").count
# 2
pp page.search("form")[1]
# output not shown

Next, we want to set values in the form. Let’s get the second form and examine the fields:

form = page.forms[1]
pp form
 {name nil}
 {method "POST"}
 {action "/pmc/pmctopmid/"}
  [hidden:0xfa4150 type: hidden name: p$a value: ]
  [hidden:0xfa3fe8 type: hidden name: p$l value: PAFAppLayout]
  [hidden:0xfa3e80 type: hidden name: p$st value: pmc]
  [hidden:0xfa3d2c type: hidden name: SessionId value: F4FC1F49237BDA11_0186SID]
  [hidden:0xfa3bc4 type: hidden name: Snapshot value: /projects/PMC/PMCStatic@2.60]
  [textarea:0xd543dc type:  name: PAFAppLayout.AppController.Page.PMCToPmidC.MainPortlet.Ids value: ]}
  [radiobutton:0xfa4a10 type: radio name: PAFAppLayout.AppController.Page.PMCToPmidC.MainPortlet.from_db value: from_pmid]
  [radiobutton:0xfa4880 type: radio name: PAFAppLayout.AppController.Page.PMCToPmidC.MainPortlet.from_db value: from_pmcid]}
  [checkbox:0xfa459c type: checkbox name: PAFAppLayout.AppController.Page.PMCToPmidC.MainPortlet.ToFile value: false]}
  [submit:0xfa4704 type: submit name: Clipboard value: Get IDs from PubMed clipboard]
  [submit:0xfa4420 type: submit name: ConvertButton value: Convert]
  [submit:0xfa42cc type: submit name: ClearButton value: Clear]}>

Compare that with the web page or its source code. We want to do the following:

  • check the radio button “PMID to PMCID (or NIHMSID)”
  • enter PMIDs in the text area
  • check the box for CSV file download

There are various ways to do all of those things:

# select the radio button by value
form.radiobutton_with(:value => "from_pmid").check
# select the text area by name and set values
form["PAFAppLayout.AppController.Page.PMCToPmidC.MainPortlet.Ids"] = "21707345\n23482678"
# select the CSV checkbox by name
form.checkbox_with(:name => "PAFAppLayout.AppController.Page.PMCToPmidC.MainPortlet.ToFile").check

Here PMIDs are separated by newline but you could also use commas, spaces, semicolons or vertical bars.

Finally, submit the form. There are 3 buttons in the form; we want the second one which is named ConvertButton and has value Convert:

f = agent.submit(form, form.buttons[1])
f.save "ids.txt"

Result – a file named ids.txt, with these contents:


My tip: study the web page and its source code carefully, making note of the names and values for the elements of interest. Those elements will then be much easier to find and alter when you come to work with the Mechanize object representation.

Filed under: programming, ruby, web resources Tagged: how to, mechanize, ncbi, web scraping
Author: "nsaunders" Tags: "programming, ruby, web resources, how to..."
Comments Send by mail Print  Save  Delicious 
Date: Thursday, 22 Aug 2013 05:16

When dealing with data from high-throughput experimental platforms such as microarrays, it’s important to account for potential batch effects. A simple example: if you process all your normal tissue samples this week and your cancerous tissue samples next week, you’re in big trouble. Differences between cancer and normal are now confounded with processing time and you may as well start over with new microarrays.

Processing date is often a good surrogate for batch and it was once easy to extract dates from Affymetrix CEL files using Bioconductor. It seems that this is no longer the case.

Once upon a time (about 10 months ago), I could write code to process CEL files from the Affymetrix Human Exon 1.0 ST array that looked like this:

# assuming a covdesc file in same directory as CEL files
exp.cel         <- read.affy(path = "data/celfiles")
exp.cel@cdfName <- "exon.pmcdf"
exp.rma         <- rma(exp.cel)

Getting the scan date could not have been easier:

pd <- pData(protocolData(exp.rma))
#                               ScanDate
# HCT_CON_A.CEL     2009-12-11T02:57:33Z
# HCT_CON_B.CEL     2009-12-11T00:56:07Z
# HCT_Treated_A.CEL 2009-12-11T04:27:12Z
# HCT_Treated_B.CEL 2009-12-11T01:26:20Z
# HT29_CON_A.CEL    2009-12-11T03:27:22Z
# HT29_CON_B.CEL    2009-12-11T03:56:58Z

Alas, this code now fails at the first hurdle (reading CEL files):

Error in read.affybatch(filenames = l$filenames, phenoData = l$phenoData,  : 
The affy package is not designed for this array type.
Please use either the oligo or xps package.


Fair enough. I’ve used the oligo package before, so we try that instead:

# assuming CEL files in ./data/celfiles
exp.cel <- read.celfiles(list.celfiles("data/celfiles", full.names = T))
exp.rma <- rma(exp.cel)

Wait – what happened to the dates?

pd <- pData(protocolData(exp.rma))
#                                              exprs dates
# HCT_CON_A.CEL         data/celfiles//HCT_CON_A.CEL      
# HCT_CON_B.CEL         data/celfiles//HCT_CON_B.CEL      
# HCT_Treated_A.CEL data/celfiles//HCT_Treated_A.CEL      
# HCT_Treated_B.CEL data/celfiles//HCT_Treated_B.CEL      
# HT29_CON_A.CEL       data/celfiles//HT29_CON_A.CEL      
# HT29_CON_B.CEL       data/celfiles//HT29_CON_B.CEL

# oligo has a runDate() method
# no joy there either
#  [1]                
# Levels:

OK – let’s look at xps. A prerequisite for installation is something called ROOT; not the super user, but a data analysis framework. If you’re on an Ubuntu-like system, ignore all the confusing advice around the Web about setting environment variables and just:

sudo apt-get install root-system

After xps installation we read the manual:

In contrast to other packages, which rely on the Bioconductor method for creating cdf environments, we need to create ROOT scheme files directly from the Affymetrix source files, which need to be downloaded first from the Affymetrix web site. However, once created, it is in principle possible to distribute the ROOT scheme files, too.

I’m sorry. There are only so many hoops that I’m willing to jump through in order to simply read from a file and this is a step too far.

So my current fix – use apt-cel-convert, part of the Affymetrix Power Tools suite, to convert to text and then grep for dates in the format MM/DD/YY:

apt-cel-convert -f text -o . myfile.CEL
grep ^DatHeader myfile.CEL | grep -ioP "\b\d+\/\d+\/\d+\b"
# 01/15/13

I don’t know yet whether those regular expressions work for all cases.

Have a better suggestion? Let’s hear it.

Update – for completeness I’m using R 3.0.1, Bioconductor 2.12, oligo 1.24.2 on Ubuntu 10.04. I’ve also used a CEL file from a different platform, HG U133A, which gave this:

# [1] [0..46114]  10434-10425 #799 133A 8-222-02:CLS=4733 RWS=4733 XIN=3  YIN=3  VE=17        # 2.0 08/22/02 12:08:07    \024  \024 HG-U133A.1sq \024  \024  \024  \024  \024  \024  \024  # \024  \024 6
# Levels: [0..46114]  10434-10425 #799 133A 8-222-02:CLS=4733 RWS=4733 XIN=3  YIN=3  VE=17        # 2.0 08/22/02 12:08:07    \024  \024 HG-U133A.1sq \024  \024  \024  \024  \024  \024  \024  # \024  \024 6

i.e. the entire DatHeader line has been stored as a factor.

Filed under: bioinformatics, R, research diary, statistics Tagged: affymetrix, batch effect, microarray
Author: "nsaunders" Tags: "bioinformatics, R, research diary, stati..."
Comments Send by mail Print  Save  Delicious 
Date: Monday, 15 Jul 2013 22:20

Scientific writing – by which I mean journal articles – is a strange business, full of arcane rules and conventions with origins that no-one remembers but to which everyone adheres.

I’ve always been amused by one particular convention: the sentence adverb. Used with a comma to make a point at the start of a sentence, as in these examples:

Surprisingly, we find that the execution of karyokinesis and cytokinesis is timely…
Grossly, the tumor is well circumscribed with fibrous capsule…
Correspondingly, the short-term Smad7 gene expression is graded…

The example that always makes me smile is interestingly. “This is interesting. You may not have realised that. So I said interestingly, just to make it clear.”

With that in mind, let’s go looking for sentence adverbs in article abstracts.

1. Download PubMed Central
Code and data for this post can be found at Github.

We need abstracts. One source of these is the PubMed Central (PMC) archive at the NCBI FTP site. Create a directory on your system to hold the files (e.g. db/pmc), move to it and:

wget http://ftp.ncbi.nlm.nih.gov/pub/pmc/articles.A-B.tar.gz
wget http://ftp.ncbi.nlm.nih.gov/pub/pmc/articles.C-H.tar.gz
wget http://ftp.ncbi.nlm.nih.gov/pub/pmc/articles.I-N.tar.gz
wget http://ftp.ncbi.nlm.nih.gov/pub/pmc/articles.O-Z.tar.gz
find ./ -name "*.tar.gz" -exec tar zxvf {} \;

Caution: the compressed archives are 1.5 – 2.5 GB and will take some time to download. They uncompress to about 47 GB of storage (at the time of writing).

2. Extract the sentence adverbs
The PMC files are in XML format. I’m not especially expert in XML parsing but when required, I reach for Ruby’s nokogiri. Here’s the complete code:


require "nokogiri"

f   = File.open(ARGV[0])
doc = Nokogiri::XML(f)

ameta = doc.xpath("//article/front/article-meta")
pmc   = ameta.xpath("//article-id[@pub-id-type='pmc']").text.chomp
epub  = ameta.xpath("//pub-date[@pub-type='epub']/year").text.chomp
ppub  = ameta.xpath("//pub-date[@pub-type='ppub']/year").text.chomp
abs   = ameta.xpath("//abstract").text.chomp
titl  = ameta.xpath("//title-group/article-title").text.chomp
jour  = doc.xpath("//article/front/journal-meta/journal-id[@journal-id-type='nlm-ta']").text.chomp

abs.scan(/[A-Z][a-z]+ly,/m) {
  b = $~.begin(0)
  e = $~.end(0)
  a = $&.gsub(",", "")
  o = [pmc, jour.downcase, epub, ppub, a, b, e]
  puts o.join(",")

The code uses XPath expressions to pull out abstracts and some other metadata from the PMC NXML files. It then searches the abstract for words that end with “ly,”. For reasons to be explained later, journal abbreviations are converted to lower-case and the position of the match is also recorded.

Save it as pmcly.rb and run it on the directory of PMC files like so:

find db/pmc -name "*.nxml" -exec pmcly.rb {} > adverbs.csv \;

This takes some time – overnight on my home machine. The result (first 10 lines of 92 940):

2751461,aaps j,2008,NA,Alternatively,350,364
2751463,aaps j,2008,NA,Unfortunately,315,329
2844510,aaps j,2010,NA,Generally,125,135
2751449,aaps j,2008,NA,Ideally,581,589
2751387,aaps j,2008,NA,Finally,1376,1384
2976997,aaps j,2010,NA,Finally,851,859
2976990,aaps j,2010,NA,Significantly,460,474
2751391,aaps j,2008,NA,Clearly,712,720
2751391,aaps j,2008,NA,Importantly,1175,1187
3385822,aaps j,2012,NA,Additionally,798,811

There will of course be false positives – words ending with “ly,” that are not adverbs. Some of these include: the month of July, the country of Italy, surnames such as Whitely, medical conditions such as renomegaly and typographical errors such as “Findingsinitially“. These examples are uncommon and I just ignore them where they occur. There will also be sentence adverbs that do not include the comma but for now, I’m restricting the analysis to the “more dramatic” form, with commas.

3. Clean up the mess
Skip to the last sentence in this section if you want to avoid the tedious details of data cleaning.

You’ll note that in the Ruby code, extracted text was passed through chomp to remove end of line characters. This was not the case originally, since I was not expecting to find line breaks in the tag contents. However, it’s always good to check that the lines in your CSV file contain the expected number of fields (7):

which(count.fields("data/adverbs.csv", sep = ",") != 7)
 [1] 55294 55295 55309 55310 55311 55312 55313 55314 55332 55333 55410 55411
[13] 55463 55464 55523 55524 55525 55526 55665 55666 55707 55708 56183 56184
[25] 56200 56201 56263 56264 57923 57924 57925 57926 57927 57928 57929 57930
[37] 57931 57932 57933 57934 57935 57936 57937 57938 57939 57940 57941 57942
[49] 57943 57944 57945 57946 57947 57948 57949 57950

Oh dear. Closer inspection indicates that we are looking at 28 pairs of adjacent lines. A quick look at the first 2 such lines in the CSV file:

tail -n+55294 adverbs.csv | head -2
540057,Nucleic Acids Res,2004

Seems like our 7 fields are being broken by an unexpected new line. Surely, PubMed Central, you would not allow a line break in the pub-date[@pub-type='epub']/year field? Let’s look at PMC 540057:

<pub-date pub-type="epub">

You would. OK then.

Rather than run the code again, I wrote an ugly fix in R. It identifies the offending lines and pastes them back together. When that’s done, it figures out unique occurrences of sentence adverbs by assuming that additional records with the same PMC uid, adverb and start/end position in an abstract are duplicates. That last step is required because there are files with duplicate contents in the PMC dataset; something that I only discovered when summing adverbs per PMC uid and realising that some occur twice.

Ultimately we end up with a CSV file, adverbs.uniq.csv, containing 91 618 unique and cleaned records.

4. Analysis
R code for analysing the adverbs is in the file adverbs.R. It contains 4 functions: for counting adverbs, plotting the top 20 as a bar plot, visualizing the top 100 as a word cloud and examining adverb occurrence by journal.

4.1 Top 20 adverbs
Let’s start with a basic “top N” list. Plot on the left, word cloud on the right.


Top 20 sentence adverbs in PubMed Central abstracts


Top 100 sentence adverbs in PubMed Central abstracts

It seems that the most popular use of the sentence adverb is to draw a close to the proceedings, with finally. The next most common uses: to indicate further points of interest (additionally), label results as interesting (interestingly) or important (importantly) and show that the authors are up to date with their reading (recently).

Findings are frequently surprising, but more often unfortunate than remarkable.

Altogether, there are 710 unique words in the adverb column from 91 618 records. Note that this includes the false positives mentioned previously, including a number of typographical errors. For example, we find both phylogenitically and phylogentically, in addition to the correct term, phylogenetically (25 occurrences). Similarly, (see what I did there?) both intriguinly and intringuingly are present, as well as the correct intriguingly (648 occurrences).

4.2 Top 20 opening adverbs
If you’re anything like me, you have sat in front of a blank screen searching for that brilliant opening sentence for an article…and then typed:


What inspiration did the authors of PMC find?

Same code, but applied to the subset of adverbs where match start = 0, i.e. the opening sentence of the abstract.


Top 20 sentence adverbs that open abstracts in PubMed Central


Top 100 sentence adverbs that open abstracts in PubMed Central

Looks like I’m not the only one who can’t do any better than recently. Authors also open by referring to previous work or assessing the current state of play.

4.3 The bad and the ugly
It’s possible to create all manner of adverbs by sticking -ly on the ends of things. This is rarely a good idea.

You might imagine that very long adverbs stand a good chance of being ugly. Let’s find the longest:

longest <- subset(adverbs.uniq, nchar(adverbs.uniq$adv) == max(nchar(adverbs.uniq$adv)))
# [1] "Electronmicroscopically"

You’d be correct.

You might also imagine that rarely-used adverbs stand a good chance of being ugly:

rarest  <- subset(adverbs.freq, Freq == 1)

270 sentence adverbs appear only once in the dataset, so we’ll leave that as an “exercise for the reader”. Suffice to say that some of them include: endosonographically, ethnopharmacologically, ophthalmoscopically and tinctorially.

4.4 Adverbs by journal
It would be interesting to know whether adverbs are over-represented in particular journals. Simply counting the adverb by journal is no good, since some journals have many more articles in PMC than others. PLoS ONE, for example, accounts for almost 20% of all sentence adverb records:

head(jour[ order(jour$Freq, decreasing = T),])
#                    Var1  Freq
# 2036           plos one 18285
# 1359        j cell biol  3053
# 1428          j exp med  2892
# 1897  nucleic acids res  2531
# 2033         plos genet  2076
# 2037        plos pathog  2043

Incidentally, this calculation is why we converted journal abbreviations to lower-case. PLoS ONE is also listed as PLoS One in PMC, but we want to count those names as one journal, hence plos one.

So, we need some sort of adjustment for total articles. I’ve gone with “occurrences per 100 adverbs.” That is: for every 100 sentence adverbs found in abstracts from a journal, how many occurrences of a particular adverb do we see? Furthermore, I’ve applied this metric to only those journals that have 100 or more sentence adverbs in the PMC dataset.

Let’s start with the surprising index (column a in this and subsequent tables).

surprising  <- advIndex("Surprisingly", adverbs.jour, jour)
head(surprising[, 2:4], 10)
#                Var2 Freq         a
# 1440517   plos biol   94 11.032864
# 1443357  plos genet  191  9.200385
# 1521457     sci rep   26  8.965517
# 1312007      nature   24  8.695652
# 1302777  nat commun    9  8.411215
# 964817  j cell biol  250  8.188667
# 208667     bmc biol   12  7.500000
# 1446197 plos pathog  151  7.391092
# 1013807   j exp med  194  6.708160
# 940677  j biol chem   29  6.487696

The message seems clear: go with a Nature or specialist PLoS journal if your results are surprising.

How about interesting?

head(interesting[, 2:4], 10)
#                     Var2 Freq        a
# 225387       bmc immunol   32 20.91503
# 945327      j biomed sci   19 19.00000
# 1010647        j exp bot   40 18.26484
# 1278317 mol neurodegener   22 18.18182
# 1616277          virol j   82 18.14159
# 1262697       mol cancer   81 17.96009
# 806167    int j biol sci   23 17.69231
# 1435937    plant physiol   19 17.27273
# 233907     bmc microbiol   83 16.50099
# 1280447         mol pain   26 16.04938

No clear winner there. Anyone for remarkable results?

head(remarkable[, 2:4], 10)
#                     Var2 Freq        a
# 1311933           nature   24 8.695652
# 1440443        plos biol   36 4.225352
# 686423       genome biol   15 4.000000
# 687133  genome biol evol    5 3.937008
# 1284243    mol syst biol    9 3.422053
# 1502923    retrovirology   10 3.246753
# 1443283       plos genet   67 3.227360
# 1521383          sci rep    8 2.758621
# 740383     hum mol genet    3 2.702703
# 545133      embo mol med    3 2.631579

Save your most remarkable results for submission to Nature.

Finally, what if it’s all a bit unfortunate?

#                           Var2 Freq        a
# 1444113               plos med  106 8.870293
# 168953          bioinformatics   14 7.329843
# 1600313                 trials    7 6.930693
# 210133          bmc biotechnol   12 6.451613
# 1072783     j med case reports   12 6.030151
# 208003      bmc bioinformatics   74 5.501859
# 252023           bmc syst biol   17 5.396825
# 1444823     plos negl trop dis   34 5.082212
# 1610963 vasc health risk manag    5 4.901961
# 186703       biomed eng online    5 4.761905

I’m a little surprised that bioinformatics features so prominently. What could be so unfortunate? The results of applying your method to real data, as opposed to simulated or training data?

4.5 Most sentence adverbs in an abstract

Who got a bit carried away with the sentence adverbs? Easy to find out by counting up the PMC uids:

pmc <- as.data.frame(table(adverbs.uniq$pmc))
pmc <- pmc[ order(pmc$Freq, decreasing = T),]
#          Var1 Freq
# 11769 2214797    7
# 27434 2873921    7
# 39930 3130555    7
# 42598 3173291    7
# 49235 3280967    7
# 13      17814    6

There are five abstracts that contain 7 sentence adverbs. Going back to the PMC website, we see that these are:

  1. IFN-γ Mediates the Rejection of Haematopoietic Stem Cells in IFN-γR1-Deficient Hosts (the editors’ summary is to blame for that one)
  2. Leishmania donovani Isolates with Antimony-Resistant but Not -Sensitive Phenotype Inhibit Sodium Antimony Gluconate-Induced Dendritic Cell Activation
  3. Quantitative analysis of transient and sustained transforming growth factor-β signaling dynamics (oddly, far less occurrences in the current online version)
  4. Review of juxtaglomerular cell tumor with focus on pathobiological aspect
  5. Nondisjunction of a Single Chromosome Leads to Breakage and Activation of DNA Damage Checkpoint in G2

As usual, this analysis is intended to be a bit of fun. Next time you’re writing that article though, ask yourself: is that sentence enhanced by the sentence adverb? Or are you simply following convention?

Filed under: R, research diary, ruby, statistics Tagged: adverbs, pubmed central, text-mining
Author: "nsaunders" Tags: "R, research diary, ruby, statistics, adv..."
Comments Send by mail Print  Save  Delicious 
Date: Wednesday, 10 Jul 2013 22:00

Tweet length: 140 characters. Quote + URL that I wanted to tweet: 160 characters. Solution: brief blog post.

the probability that people who can help each other can be connected has risen to the point that for many types of problem that they actually are

Please read the rest of Cameron’s thoughts on motivations for openness in research: Open is a state of mind.

Filed under: open access, open science Tagged: cameron neylon, open science, science blogs
Author: "nsaunders" Tags: "open access, open science, cameron neylo..."
Comments Send by mail Print  Save  Delicious 
Date: Tuesday, 25 Jun 2013 05:51

Just how many (bad) -omics are there anyway? Let’s find out.

Update: code and data now at Github

1. Get the raw data

It would be nice if we could search PubMed for titles containing all -omics:


However, we cannot since leading wildcards don’t work in PubMed search. So let’s just grab all articles from 2013:


and save them in a format which includes titles. I went with “Send to…File”, “Format…CSV”, which returns 575 068 records in pubmed_result.csv, around 227 MB in size.

2. Extract the -omics
Titles are in column 1 and we only want the -omics, so:

cut -f1 -d "," pubmed_result.csv | grep -ioP "(\w+omics)" > omics.txt
wc -l omics.txt
# 1763 omics.txt

Note: grep changed so the following now works. Note: this approach will miss a very few cases where omics is preceded by a hyphen. That included the classic stain-omics.
It also ignores the standalone term “-omics”, which is used quite often

Of course, this results in some “false positives” (e.g. economics). Curation would be required to detect the “true -omics”.

3. Visualize
Time to break out the R. The top 20 -omics in 2013 and the less popular:

omics <- readLines("omics.txt")
omics <- tolower(omics)
omics.freq <- as.data.frame(table(omics))
omics.freq <- omics.freq[ order(omics.freq$Freq, decreasing = T),]
omics.freq$omics <- factor(omics.freq$omics, levels = omics.freq$omics)
ggplot(head(omics.freq, 20)) + geom_bar(aes(omics, Freq), stat = "identity", fill = "darkblue") +
       theme_bw() + coord_flip()
# and the less popular
subset(omics.freq, Freq == 1)
On the right, the top 20. Click for a larger version of the graphic. Top of the list so far for 2013 is proteomics, followed by genomics and metabolomics.

Listed below, those -omics found only once in titles from 2013. Some shockers, I think you’ll agree (paging Jonathan Eisen).

          aquaphotomics    1
       biointeractomics    1
             calciomics    1
            cholanomics    1
           cytogenomics    1
           cytokinomics    1
          econogenomics    1
            glcnacomics    1
 glycosaminoglycanomics    1
          interactomics    1
               ionomics    1
         macroeconomics    1
            materiomics    1
      metalloproteomics    1
     metaproteogenomics    1
           microbiomics    1
         microeconomics    1
          microgenomics    1
        microproteomics    1
               miromics    1
         mitoproteomics    1
             mobilomics    1
               modomics    1
             morphomics    1
              museomics    1
              neuromics    1
       neuropeptidomics    1
        nitroproteomics    1
      nutrimetabonomics    1
           oncogenomics    1
        orthoproteomics    1
            pangenomics    1
           petroleomics    1
   pharmacometabolomics    1
     pharmacoproteomics    1
   phylotranscriptomics    1
              phytomics    1
           postgenomics    1
              pyteomics    1
          radiogenomics    1
           rehabilomics    1
     retrophylogenomics    1
                 romics    1
            secretomics    1
              sensomics    1
         speleogenomics    1
           surfaceomics    1
              surfomics    1
     toxicometabolomics    1
            vaccinomics    1
              variomics    1
Top 20 -omics PubMed titles 2013

Top 20 -omics PubMed titles 2013

Never heard of romics? That’s OK. It’s a surname.

Filed under: bioinformatics, publications, R, statistics Tagged: omics, pubmed
Author: "nsaunders" Tags: "bioinformatics, publications, R, statist..."
Comments Send by mail Print  Save  Delicious 
Next page
» You can also retrieve older items : Read
» © All content and copyrights belong to their respective authors.«
» © FeedShow - Online RSS Feeds Reader