


Date: Wednesday, 24 Sep 2014 04:58

File this one under “has troubled me (and others) for some years now, let’s try to resolve it.”

Let’s use the excellent R/rentrez package to search PubMed for articles that were retracted in 2013.

library(rentrez)

es <- entrez_search("pubmed", "\"Retracted Publication\"[PTYP] 2013[PDAT]", usehistory = "y")
es$count
# [1] 117

117 articles. Now let’s fetch the records in XML format.

xml <- entrez_fetch("pubmed", WebEnv = es$WebEnv, query_key = es$QueryKey, 
                    rettype = "xml", retmax = es$count)

Next question: which XML element specifies the “Date of publication” (PDAT)?

To make a long story short: there are several nodes in PubMed XML that contain the word “Date”, but the one which looks most promising is named PubDate. Given that our search used the year (2013), you might think that years can be extracted using the XPath expression //PubDate/Year. You would be mostly, but not entirely right.

library(XML)  # provides xmlTreeParse() and the XPath functions used below
doc <- xmlTreeParse(xml, useInternalNodes = TRUE)
table(xpathSApply(doc, "//PubDate/Year", xmlValue))
# 2013 2014 
#  111    2 

Well, that’s confusing. Not only do we not get the expected total number of years (117), but two of them have the value 2014. Time to delve deeper into the nodes under PubDate.

children <- xpathSApply(doc, "//PubDate", xmlChildren)
table(names(unlist(children)))

#         Day MedlineDate       Month        Year 
#          25           4          87         113 

table(xpathSApply(doc, "//PubDate/MedlineDate", xmlValue))

# 2013 Jan-Mar 2013 May-Jun 2013 Nov-Dec 2013 Oct-Dec 
#            1            1            1            1 

Interesting. So in addition to //PubDate/Year, 4 records have a node named //PubDate/MedlineDate.
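If all we want is a year for every record, one workaround is to take Year where present and fall back to the first four characters of MedlineDate otherwise. A minimal sketch, assuming the doc object created above:

# minimal sketch (assumes doc from above): Year where present,
# otherwise the first 4 characters of MedlineDate
years <- xpathSApply(doc, "//PubDate", function(pd) {
    kids <- xmlChildren(pd)
    if ("Year" %in% names(kids)) {
        xmlValue(kids[["Year"]])
    } else {
        substr(xmlValue(kids[["MedlineDate"]]), 1, 4)
    }
})
table(years)
# this should now account for all 117 records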

It’s also possible to retrieve records in docsum format, which is also XML but with a different structure. Here, PubDate is an attribute of an Item node.

ds <- entrez_fetch("pubmed", WebEnv = es$WebEnv, query_key = es$QueryKey,
                   rettype = "docsum", retmax = es$count)
ds.doc <- xmlTreeParse(ds, useInternalNodes = TRUE)
table(xpathSApply(ds.doc, "//Item[@Name='PubDate']", xmlValue))

#         2013     2013 Apr   2013 Apr 1   2013 Apr 2     2013 Aug  2013 Aug 15  2013 Aug 29 
#           23            7            1            1            2            2            1 
#     2013 Dec   2013 Dec 1     2013 Feb  2013 Feb 26   2013 Feb 7     2013 Jan  2013 Jan 24 
#            3            1            6            1            1           10            2 
#   2013 Jan 3  2013 Jan 30   2013 Jan 7 2013 Jan-Mar     2013 Jul  2013 Jul 25     2013 Jun 
#            1            1            1            1            4            1            3 
#  2013 Jun 18   2013 Jun 5   2013 Jun 7     2013 Mar   2013 Mar 1  2013 Mar 12  2013 Mar 28 
#            1            1            1            5            1            1            1 
#   2013 Mar 9     2013 May   2013 May 1  2013 May 29   2013 May 6   2013 May 8   2013 May 9 
#            1            4            3            1            1            2            1 
#     2013 Nov 2013 Nov-Dec     2013 Oct 2013 Oct-Dec     2013 Sep  2013 Sep 30     2014 Feb 
#            8            1            2            1            5            1            1 
#     2014 Jan 
#            1 

A fair old mix of formats in there then, and still the issue of the 2014 years when we searched for PDAT = 2013. We can split on space to get years:

yr <- xpathSApply(ds.doc, "//Item[@Name='PubDate']", function(x) strsplit(xmlValue(x), " ")[[1]][1])
which(yr == "2014")
# [1] 16 26

And examine records 16 and 26:

xmlRoot(ds.doc)[[16]] # complete output not shown
# <DocSum>
#   <Id>24156249</Id>
#   <Item Name="PubDate" Type="Date">2014 Jan</Item>
#   <Item Name="EPubDate" Type="Date">2013 Oct 25</Item>

xmlRoot(ds.doc)[[26]] # complete output not shown
# <DocSum>
#   <Id>24001238</Id>
#   <Item Name="PubDate" Type="Date">2014 Feb</Item>
#   <Item Name="EPubDate" Type="Date">2013 Sep 4</Item>

Not every record has EPubDate. Is it simply the case that where it exists and is earlier than PubDate, then EPubDate == PDAT?

So we haven’t really resolved very much, have we?

  • we started with the Entrez search term PDAT (Date of publication)
  • both PubMed XML and DocSum contain something called PubDate
  • in the former case, most child node names = Year, but some = MedlineDate
  • we retrieve some records where PubDate year = 2014, even when searching for 2013[PDAT]

It appears that PDAT does not map consistently to any XML node in either XML or DocSum formats. It might be derived from (1) EPubDate, where that exists and is earlier than PubDate, or (2) PubDate, where EPubDate does not exist.
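A rough way to test that idea against the docsum records retrieved earlier: for each DocSum, take the earlier of the PubDate and EPubDate years, where EPubDate exists. This is only a hedged, year-level sketch, assuming the ds.doc object from above:

# rough check of the hypothesis (assumes ds.doc from above); years only
docsums <- getNodeSet(ds.doc, "//DocSum")
pdat <- sapply(docsums, function(d) {
    pub  <- xpathSApply(d, ".//Item[@Name='PubDate']", xmlValue)
    epub <- xpathSApply(d, ".//Item[@Name='EPubDate']", xmlValue)
    yrs  <- substr(c(pub, epub), 1, 4)
    yrs  <- yrs[nchar(yrs) == 4]
    min(yrs)
})
table(pdat)
# if the hypothesis holds, all 117 records should now be 2013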


Filed under: bioinformatics, R, statistics Tagged: entrez, eutils, ncbi, pubmed, xml
Author: "nsaunders" Tags: "bioinformatics, R, statistics, entrez, e..."
Comments Send by mail Print  Save  Delicious 
Date: Monday, 22 Sep 2014 05:04

Sometimes, several strands of thought come together in one place. For me right now, it’s the Wikipedia page “Ebola virus epidemic in West Africa”, which got me thinking about the perennial topic of “data wrangling”, how best to provide public data and why I can’t shake my irritation with the term “data science”. Not to mention Ebola, of course.

I imagine that a lot of people with an interest in biological data are following this story and thinking “how can I visualise the numbers for myself?” Maybe you’d like to reproduce the plots in the Timeline section of that Wikipedia entry. Surprise: the raw numbers are not that easy to obtain.

2014-09-26 note: when Wikipedia pages change, as this one has, code breaks, as this code has; updates maintained at Github

[Screenshot: Ebola cases and deaths by country and by date, from the Wikipedia page “Ebola virus epidemic in West Africa”]

The Wikipedia page includes a data table, which is a starting point. It’s not especially well-designed (click image at right to see the headers and a few rows) and the notes underneath suggest that a large amount of manual intervention was required to obtain the numbers.

The last column contains hyperlinked references. Now we see why so much manual work was required. The citations link out to two main types of information:

  1. Paragraphs of free text with the numbers buried somewhere within, like this example
  2. Infographic-style reports in PDF format, like this example

That’s more wrangling than I have time for just now; OK, so the Wikipedia table it is. Still a little more “wrangling” to get the data out of that HTML table.

edited 2014-09-24 based on comment from Rainer

library(XML)
library(ggplot2)
library(reshape2)

# get all tables on the page
ebola <- readHTMLTable("http://en.wikipedia.org/wiki/Ebola_virus_epidemic_in_West_Africa", 
                  stringsAsFactors = FALSE)
# thankfully our table has a name; it is table #5
# this is not something you can really automate
head(names(ebola))
# [1] "Ebola virus epidemic in West Africa"          
# [2] "Nigeria Ebola areas-2014"                     
# [3] "Treatment facilities in West Africa"          
# [4] "Democratic Republic of Congo-2014"            
# [5] "Ebola cases and deaths by country and by date"
# [6] "NULL"

ebola <- ebola$`Ebola cases and deaths by country and by date`

# again, manual examination reveals that we want rows 2-71 and columns 1-3
ebola.new <- ebola[2:nrow(ebola), 1:3]
colnames(ebola.new) <- c("date", "cases", "deaths")

# need to fix up a couple of cases that contain text other than the numbers
ebola.new$cases[nrow(ebola.new)-43]  <- "759"
ebola.new$deaths[nrow(ebola.new)-43] <- "467"

# get rid of the commas; convert to numeric
ebola.new$cases  <- gsub(",", "", ebola.new$cases)
ebola.new$cases  <- as.numeric(ebola.new$cases)
ebola.new$deaths <- gsub(",", "", ebola.new$deaths)
ebola.new$deaths <- as.numeric(ebola.new$deaths)

# the days in the dates are encoded 1-31
# are we there yet? quick and dirty attempt to reproduce Wikipedia plot
ebola.m <- melt(ebola.new)
ggplot(ebola.m, aes(as.Date(date, "%e %b %Y"), value)) + 
       geom_point(aes(color = variable)) + 
       coord_trans(y = "log10") + xlab("Date") + 
       labs(title = "Cumulative totals log scale") + 
       theme_bw() 
ebola

[Figure: first attempt to reproduce a plot from the Wikipedia page]

Result: on the right, click for full-size.
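One optional tidy-up, since the date column drives everything here: convert it to Date class once, up front, so that melt() keeps it as an id variable and later plots don’t have to re-parse it. A small sketch, assuming the ebola.new data frame from the code above:

# optional: parse the dates once, up front (assumes ebola.new from above)
ebola.new$date <- as.Date(ebola.new$date, format = "%e %b %Y")
ebola.m <- melt(ebola.new, id.vars = "date")
summary(ebola.m$date)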

We can complain: if only the WHO, CDC and other organisations provided data as a web service. Or even as files in CSV format. Anything but PDF. But right now at least, they do not. So hats off to the heroic efforts of the Wikipedian so-called “data janitors“. From that article:

“Data wrangling is a huge — and surprisingly so — part of the job,” said Monica Rogati, vice president for data science at Jawbone

Surprising? Not to scientists (who don’t qualify the profession with the redundant word “data”). “Key hurdle to insights”, says the article title? Not really – just part and parcel of the job. I’d even argue that effective wrangling is where most of the skills are required. So perhaps think twice before belittling people’s extensive skill sets with terms like “janitor”. You might need them to wrangle your data some day.


Filed under: R, statistics, web resources Tagged: data science, ebola, wikipedia
Date: Monday, 08 Sep 2014 21:34

I must have a minor reputation as a critic of Excel in bioinformatics, since strangers are now sending contributions to my work email address. Thanks, you know who you are!

[Screenshot from PLOS ONE: Online Survival Analysis Software to Assess the Prognostic Value of Biomarkers Using Transcriptomic Data in Non Small Cell Lung Cancer. Caption: when asked why I didn’t mask this email address, I replied “the authors didn’t”.]

This week: Online Survival Analysis Software to Assess the Prognostic Value of Biomarkers Using Transcriptomic Data in Non-Small-Cell Lung Cancer. Scroll on down to supporting Table S1 and right there on the page, staring you in the face is a rather unusual-looking microarray probeset ID.

I wonder if we should start collecting notable examples in one place?

To be fair, this is more human error than an issue with Excel per se, but I’m going to argue that using Excel promotes sloppy data management by making minds lazy :)


Filed under: uncategorized Tagged: reproducibility
Date: Tuesday, 26 Aug 2014 21:46

I’ve been complaining about this for years. They fixed it. The NCBI have reorganised their genomes FTP site and finally*, Archaea are not lumped in with Bacteria.

GenBank: ftp://ftp.ncbi.nlm.nih.gov/genomes/genbank/archaea/
RefSeq:  ftp://ftp.ncbi.nlm.nih.gov/genomes/refseq/archaea/

Archaea are still included in the ASSEMBLY_BACTERIA directory; hopefully that’s next on the list.

[*] to be fair, they’ve always recognised Archaea – just not in a form that makes downloads convenient


Filed under: bioinformatics, genomics, uncategorized Tagged: archaea, bacteria, ftp, genomes, ncbi
Date: Wednesday, 13 Aug 2014 22:34

Ever wondered whether the “>” symbol can, or does, appear in FASTA sequence headers at positions other than the start of the line?

I have a recent copy of the NCBI non-redundant nucleotide (nt) BLAST database on my server, so let’s take a look. The files are in a directory which is also named nt:

# %i = sequence ID; %t = sequence title
blastdbcmd -db /data/db/nt/nt -entry all -outfmt '%i %t' | grep ">" > ~/Documents/nt.txt

wc -l ~/Documents/nt.txt
# => 54451 /home/sau103/Documents/nt.txt

# and how many sequences total in nt?
blastdbcmd -list /data/db/nt/ -list_outfmt '%n' | head -1
# => 23745273

Short answer – yes, about 0.23% of nt sequence descriptions contain the “>” character. Inspection of the output shows that it’s used in a number of ways. A few examples:

# as "brackets" (very common)
emb|V01351.1| Sea urchin fragment, 3' to the actin gene in <SPAC01>
gb|GU086899.1| Cotesia sp. jft03 voucher BIOUG<CAN>:CPWH-0042 cytochrome oxidase subunit 1 (COI) gene, partial cds; mitochondrial

# to indicate mutated bases or amino acids
gb|M21581.1|SYNHUMUBA Synthetic human ubiquitin gene (Thr14->Cys), complete cds
dbj|AB047520.1| Homo sapiens gene for PER3, exon 1, 128+22(G->A) polymorphism

# in chemical nomenclature
gb|AF134414.1|AF134414 Homo sapiens B-specific alpha 1->3 galactosyltransferase (ABO) mRNA, ABO-*B101 allele, complete cds

# as "arrows"
gb|EU303182.1| Apoi virus note kitaoka-> canals ->NIMR nonstructural protein 5 (NS5) gene, partial cds
ref|XM_001734501.1| Entamoeba dispar SAW760 5'->3' exoribonuclease, putative EDI_265520 mRNA, complete cds

Something to keep in mind when writing code to process FASTA format.
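To make that concrete: when parsing FASTA, treat “>” as a record delimiter only at the start of a line, never within a description. A minimal R sketch – the file name example.fa is hypothetical:

# minimal sketch; example.fa is a hypothetical file name
fa <- readLines("example.fa")
headers <- grepl("^>", fa)
sum(headers)                        # number of records
sum(grepl(">", substring(fa, 2)))   # lines with ">" somewhere other than position 1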


Filed under: bioinformatics Tagged: fasta, formats, twitter
Date: Tuesday, 12 Aug 2014 22:47

[Figure: the 6-way Venn banana]

I thought nothing could top the classic “6-way Venn banana“, featured in The banana (Musa acuminata) genome and the evolution of monocotyledonous plants.

That is until I saw Figure 3 from Compact genome of the Antarctic midge is likely an adaptation to an extreme environment.

[Figure: the 5-way Venn roadkill]

What’s odd is that Figure 2 in the latter paper is a nice, clear R/ggplot2 creation, using facet_grid(), so someone knew what they were doing.

That aside, the Antarctic midge paper is an interesting read; go check it out.

This led to some amusing Twitter discussion which pointed me to A New Rose: The First Simple Symmetric 11-Venn Diagram*.


[*] +1 for referencing The Damned, if indeed that was the intention.


Filed under: bioinformatics, genomics, humour, R, statistics Tagged: graphics, venn, visualisation
Date: Tuesday, 05 Aug 2014 21:34

Let’s start by making one thing clear. Using coloured cells in Excel to encode different categories of data is wrong. Next time colleagues explain excitedly how “green equals normal and red = tumour”, you must explain that (1) they have sinned and (2) what they meant to do was add a column containing the words “normal” and “tumour”.

I almost hesitate to write this post but…we have to deal with the world as it is, not as we would like it to be. So in the interests of just getting the job done: here’s one way to deal with coloured cells in Excel, should someone send them your way.

I have created a simple Excel workbook. It contains one sheet, in which you will find one column headed “x” followed by the numbers 1-10. The cells A2:A11 are filled with colour and alternate: red-green-red-green-red-green-red-green-red-green.

We’ll read it into R using the xlsx package and extract the worksheet.

library(xlsx)
wb     <- loadWorkbook("test.xlsx")
sheet1 <- getSheets(wb)[[1]]

Now we can get the rows and the cells.

# get all rows
rows  <- getRows(sheet1)
cells <- getCells(rows)
# quick look at the values
sapply(cells, getCellValue)
#  1.1  2.1  3.1  4.1  5.1  6.1  7.1  8.1  9.1 10.1 11.1 
#  "x"  "1"  "2"  "3"  "4"  "5"  "6"  "7"  "8"  "9" "10" 

Colour information is contained in the cell styles.

styles <- sapply(cells, getCellStyle)

Somewhat confusingly, although the Excel GUI specifies “fill background”, the property is accessed using getFillForegroundXSSFColor().

Here’s a little function which, given a cell, returns the RGB code for the fill colour. The header cell has no fill colour, so getRgb() returns an error that we can catch.

cellColor <- function(style) {
    fg  <- style$getFillForegroundXSSFColor()
    rgb <- tryCatch(fg$getRgb(), error = function(e) NULL)
    rgb <- paste(rgb, collapse = "")
    return(rgb)
}

Applying that to our styles gives this:

sapply(styles, cellColor)
     1.1      2.1      3.1      4.1      5.1      6.1      7.1      8.1      9.1     10.1 
      "" "ff0000" "00ff00" "ff0000" "00ff00" "ff0000" "00ff00" "ff0000" "00ff00" "ff0000" 
    11.1 
"00ff00" 

Knowing that, for example, green = 00ff00 = normal and red = ff0000 = tumour, we can now generate category labels something like this:

pheno <- list(normal = "00ff00", tumour = "ff0000")
m     <- match(sapply(styles, cellColor), pheno)
labs  <- names(pheno)[m]
labs
 [1] NA       "tumour" "normal" "tumour" "normal" "tumour" "normal" "tumour" "normal"
[10] "tumour" "normal"

and get on with our lives. Or at least, our analyses.


Filed under: bioinformatics, programming, R, statistics Tagged: excel, spreadsheet
Date: Wednesday, 30 Jul 2014 02:40

“Take a look at the TP53 mutation database“, my colleague suggested. “OK then, I will”, I replied.

I present what follows as “a typical day in the life of a bioinformatician”.


I click through to the Download page. At first sight, it’s reasonably promising. Excel files – not perfect, but beats a PDF – and ooh, TXT files. OK, I grab the “US version” (which uses decimal points, not commas for decimal numbers) of the uncurated database:

wget http://p53.fr/TP53_database_download/TP53_tumor_database/TP53_US/UMDTP53_all_2012_R1_US.txt.zip
unzip UMDTP53_all_2012_R1_US.txt.zip
#  inflating: UMDTP53_all_2012_R1_US.txt  
#  inflating: ._UMDTP53_all_2012_R1_US.txt

Wait now. What’s that weird hidden file, starting with a dot?

less ._UMDTP53_all_2012_R1_US.txt 
# "._UMDTP53_all_2012_R1_US.txt" may be a binary file.  See it anyway? 
# no, I don't want to do that
strings ._UMDTP53_all_2012_R1_US.txt 
# TEXTXCEL

At this point, alarm bells are ringing. I’m thinking “Mac user”. Don’t get me wrong; I own a Macbook Air myself, I see smart people doing good work with Macs. But in my experience, biologist + Mac + text files = trouble.

I push on and check the number of records in the main text file.

wc -l UMDTP53_all_2012_R1_US.txt
# 0 UMDTP53_all_2012_R1_US.txt

[Screenshot of the file in less: “Mmm – oh, ^M”]

Zero lines. That can’t be right. We look at the file using less.

Aha, the old “line endings” issue. Easy enough to fix with the classic utility dos2unix – in this case, called as mac2unix:

mac2unix UMDTP53_all_2012_R1_US.txt > /tmp/UMDTP53_all_2012_R1_US.txt
# dos2unix: Skipping binary file UMDTP53_all_2012_R1_US.txt

Well that’s odd. Even odder is that this works.

cat UMDTP53_all_2012_R1_US.txt | mac2unix > /tmp/UMDTP53_all_2012_R1_US.txt
# on we go; overwriting the original since we can always get it from the zip
mv /tmp/UMDTP53_all_2012_R1_US.txt UMDTP53_all_2012_R1_US.txt
wc -l UMDTP53_all_2012_R1_US.txt
# 36248

The web page said there should be 36 249 records…but the file looks OK. Right, let’s get on and parse this file. It’s tab-delimited, I like Ruby, so I try to do the right thing and use a library – in this case, the Ruby CSV library. First, just a run-through to see that it all works as expected:

require 'csv'
f = "UMDTP53_all_2012_R1_US.txt"
CSV.foreach(f, :headers => true, :col_sep => "\t") do |row|
    puts row['Name']
end
# prints lots of names and then...
# ArgumentError: invalid byte sequence in UTF-8
# 	from /usr/lib/ruby/1.9.1/csv.rb:1855:in `sub!'
# 	from /usr/lib/ruby/1.9.1/csv.rb:1855:in `block in shift'
# 	from /usr/lib/ruby/1.9.1/csv.rb:1849:in `loop'
# 	from /usr/lib/ruby/1.9.1/csv.rb:1849:in `shift'
# 	from /usr/lib/ruby/1.9.1/csv.rb:1791:in `each'
# 	from /usr/lib/ruby/1.9.1/csv.rb:1208:in `block in foreach'
# 	from /usr/lib/ruby/1.9.1/csv.rb:1354:in `open'
# 	from /usr/lib/ruby/1.9.1/csv.rb:1207:in `foreach'

[Screenshot: Holy Molès]

Not all as expected, then. Colour me astonished. Frankly, this is why people resort to splitting on the delimiter instead of using libraries…anyway, search the web for that error and you’ll find a lot of confusing, conflicting and sometimes, just plain wrong explanations and advice. Let me save you the trouble. The Ruby CSV library expects UTF-8 encoding. This file is not encoded in UTF-8. So what is it? A surprisingly tricky question to answer.

First step: another visual examination using less, which shows some odd characters. The file contains a “Medline” column, so I search PubMed with the UID and see that author Mol<8F>s is supposed to be author Molès. I am not an expert in character encoding but when I post my frustration to Twitter, I find someone who is.

So tell CSV to transcode from macRoman to UTF-8:

CSV.foreach(f, :headers => true, :col_sep => "\t", :encoding => "macRoman:UTF-8") do |row|
    puts row['Name']
end

All good! I can parse the file and start using the fields to do something useful. Once again the interesting part (analysis) takes minutes; getting to the analysis takes hours or days.

It’s always tempting to say “well all of these problems could be avoided if people only did [insert better approach here]“. The thing is, they don’t. Dealing with it is just part of the job and having the skills to deal with it, I think, deserves more recognition than it gets.


Filed under: bioinformatics, computing, research diary Tagged: database, parsing, tp53
Date: Monday, 28 Jul 2014 22:41

I’ve had a half-formed, but not very interesting, blog post in my head for some months now. It’s about a conversation I had with a PhD student around 10 years ago, after she went to a bioinformatics talk titled “Excel is not a database”, and how she laughed because I’d been telling her that “for years already”. That’s basically the whole post, so as I say, not that interesting – except as an illustration that we’ve been talking about this stuff for a long time (and that little has changed).

[Image: HEp-2 or not HEp-2?]

Anyway, we have something better. I was exploring PubMed Commons, which is becoming a very good resource. The top-featured comment looks very interesting (see image, right).

Intrigued, I went to investigate the Database of Cross-contaminated or Misidentified Cell Lines, hovered over the download link and saw that it’s – wait for it – a PDF. I’ll say that again. The “database” is a PDF.

The sad thing is that this looks like very useful, interesting information which I’m sure would be used widely if presented in an appropriate (open) format and better-publicised. Please, biological science, stop embarrassing yourself. If you don’t know how to do data properly, talk to someone who does.


Filed under: humour, publications Tagged: cancer, cell lines, database, pdf
Date: Wednesday, 23 Jul 2014 05:18

I’m currently rather sleep-deprived and prone to doing stupid things. Like this, for example:

rsync -av ~/Dropbox /path/to/backup/directory/

where the directory /path/to/backup/directory already contains a much-older Dropbox directory. So when I set up a new machine, install Dropbox and copy the Dropbox directory back to its default location – hey! What happened to all my space? What are all these old files? Oh wait…I forgot to delete:

rsync -av --delete ~/Dropbox /path/to/backup/directory/

Now, files can be restored of course, but not when there are thousands of them and I don’t even know what’s old and new. What I want to do is restore the directories under ~/Dropbox to the state that they were in yesterday, before I stuffed up.

Luckily Chris Clark wrote dropbox-restore. It does exactly what it says on the tin. For example:

python restore.py /Camera\ Uploads 2014-07-22

Thanks Chris!


Filed under: computing, research diary, software Tagged: dropbox, github
Date: Monday, 21 Jul 2014 13:59

This post is an apology and an attempt to make amends for contributing to the decay of online bioinformatics resources. It’s also, I think, a nice example of why reproducible research can be difficult.

Come back in time with me 10 years, to 2004.

1. Introduction

In 2004, I was working on archaeal genomes. One day in the lab, we were wondering why there are no known archaeal pathogens of humans. As a first step, we wondered if there was a way to look for nucleotide sequences from human-associated Archaea.

You’re thinking: “why not just sequence the human microbiome?” Yes, that is the 2014 solution. This, however, was 2004 when next-generation sequencing and metagenomics were just emerging. So we had a different idea: screen the EST database for transcripts that are annotated as human but which more closely resemble archaeal sequences. Some of those would, hopefully, be derived from human-associated Archaea.

I came up with a strategy, described in the next section and generated some not-especially-promising results. Back in 2004, the ability to create a MySQL database with a PHP front-end was considered to be an impressive skill in bioinformatics, so we set up a web server and wrote a paper called An online database for the detection of novel archaeal sequences in human ESTs. Today, this is so trivial that I doubt it would even be sent out for review. I still laugh when I think of a comment from one of the reviewers, that the work was of major interest due to its potential use in fighting bio-terrorism.

[Screenshot: possibly even worse than a 404 not found]


And then I left the group. Click on the URL in the Availability section of that paper and after a long wait, you’ll see something like this, at right.

I apologise. In 2004, I gave little if any thought to the problem of maintaining online resources and my attitude, like many, was that the academic publication was the only goal, the endpoint. How very wrong I was.

2. The strategy

Let’s look in more detail at the process for screening the EST database.

You might ask: why not simply BLAST search ESTs (query) versus nt (subject) and parse the output for cases where the top hit was to an archaeal sequence? Two main reasons.

First – infrastructure. We did not have access to parallel BLAST on a large cluster, which would be required given the size of both query and subject.

Second – efficiency. The vast majority of EST sequences in the human subset of dbEST really are derived from human transcripts. It makes more sense to perform an initial screen for those sequences that might not be human, then follow up those results using BLAST.

So here’s the screening strategy that I devised.

  1. Grab all known coding sequences from archaeal genomes to use as a search database.
  2. Use BLAT to search human EST sequences against the archaeal sequences. Using BLAT speeds up the process: since BLAT runs in memory, the query files can be split to fit the available memory and the tab-delimited output can easily be recombined afterwards.
  3. Extract the much smaller subset of EST sequences that had a hit in step 2 and use those as queries in a BLAST search versus the large non-redundant nt nucleotide database.
  4. Parse the BLAST report and assign species to the top hit for each EST query.

3. The 2014 solution

I’ve long wanted not only to update the results and make them available, but also to automate the procedure and try to make it reproducible. My current solution to automation is a Ruby Rakefile. It works for me on my current system, meaning that I can at least run the pipeline at regular intervals with minimal effort or input on my part.

Whether anyone else could is another question. For reference: the server on which I work has 24 6-core CPUs, 96 GB RAM, plenty (TB) of storage and runs Ubuntu Linux. Other requirements should be apparent as we examine the contents of the Rakefile. It should be clear that this is a far-from-perfect work in progress.

My solution to availability is, naturally, a Github repository.

Feel free to skip the rest of this section if technical discussion of Rakefiles is not your thing.

Checking for prerequisites

The rake check task looks for required programs in $PATH, using the useful find_executable method and gives a hint about where to find them, if not installed.

desc "Check for prerequisites"
task :check do
	# check for various tools in PATH
	# we need seq, GNU parallel, wget, fastasplitn, blat, blastn, blastdbcmd
	cmds = {
		"seq" => "seq is part of coreutils http://www.gnu.org/software/coreutils/",
		"parallel" => "GNU Parallel is at http://www.gnu.org/software/parallel/",
		"wget" => "GNU Wget is at https://www.gnu.org/software/wget/",
		"fastasplitn" => "fastasplitn can be compiled from the src/ directory",
		"blat" => "BLAT is at http://users.soe.ucsc.edu/~kent/src/",
		"blastn" => "The latest BLAST+ is at ftp://ftp.ncbi.nih.gov/blast/executables/LATEST",
		"blastdbcmd" => "The latest BLAST+ is at ftp://ftp.ncbi.nih.gov/blast/executables/LATEST"
	}
	cmds.each_pair do |k, v|
		f = find_executable k
		if f.nil?
			puts "#{k} not found in path; #{v}"
		end
	end
end

3.1 Fetch archaeal sequences

The Rakefile uses two namespaces (not shown in the following code snippets): db for database-related tasks and search, for BLAT/BLAST search-related tasks. Further namespaces are used for the type of database: archaea, est or nt. So we fetch archaeal sequences using rake db:archaea:fetch. This code uses a file from the NCBI FTP server to determine the FTP URL for archaeal genome sequence data.

		desc "Fetch archaea genome ffn files"
		task :fetch do
			Dir.chdir(File.expand_path("../../../db/archaea", __FILE__))
			ftp = Net::FTP.new(ncbiftp)
			ftp.login
			ftp.chdir("genomes/GENOME_REPORTS")
			ftp.gettextfile("prokaryotes.txt", nil) do |line|
				line   = line.chomp
				fields = line.split("\t")
				if fields[4] =~ /[Aa]rchae/ && fields[23] != "-"
					puts fields[23]
					system("wget -N ftp://#{ncbiftp}/genomes/ASSEMBLY_BACTERIA/#{fields[23]}/*.ffn*")
				end
			end
		end

3.2 Process archaeal sequences

The fetch task returns two kinds of files: those of the form NC_*.ffn from annotated genomes and compressed NZ*.tgz files, which uncompress to *.ffn, from incompletely-annotated genomes. The task rake db:archaea:concat combines them all into one FASTA file.

		desc "Uncompress archaea *..scaffold.ffn.tgz files; combine with *.ffn files"
		task :concat do
			Dir.chdir(File.expand_path("../../../db/archaea", __FILE__))
			puts "Extracting tgz files to NZ.ffn..."
			system("find . -name '*.tgz' -exec tar zxfO {} > NZ.ffn \\;")
			puts "Combining NZ.ffn + NC*.ffn files..."
			system("cat NZ.ffn NC_0*.ffn > archaea.ffn")
			puts "Finished; files are in #{Dir.pwd}"
		end

3.3 Fetch nt BLAST database

Fetching the other databases is more straightforward. First, the nt BLAST database, using rake db:nt:fetch. You can see that this is currently not very reproducible, since it makes some assumptions about the number and names of files to download.

		desc "Fetch and uncompress nt BLAST db"
		task :fetch do
			# fetch code here
			ftp = Net::FTP.new(ncbiftp)
			ftp.login
			ftp.chdir("blast/db")
			nt = ftp.list("nt.*.tar.gz")
			nt = nt.map { |f| f.split(" ").last }
			Dir.chdir(File.expand_path("../../../db/nt", __FILE__))
			system("seq 0 9 | parallel -j10 wget -N ftp://ftp.ncbi.nih.gov/blast/db/nt.0{}.tar.gz")
			system("seq 10 #{nt.count - 1} | parallel -j#{nt.count - 10} wget -N ftp://ftp.ncbi.nih.gov/blast/db/nt.{}.tar.gz")
			# now need to uncompress
			system("find . -name 'nt.*.tar.gz' -exec tar zxvf {} \\;")
			puts "Finished; files are in #{Dir.pwd}"
		end

3.4 Fetch human EST BLAST database

In similar fashion to the previous task, rake db:est:fetch downloads and extracts the human EST BLAST database. Again there are assumptions about number and names of files.

		desc "Fetch and uncompress human EST BLAST db"
		task :fetch do
			ftp = Net::FTP.new(ncbiftp)
			ftp.login
			ftp.chdir("blast/db")
			est = ftp.list("est_human*.tar.gz")
			est = est.map { |f| f.split(" ").last }
			Dir.chdir(File.expand_path("../../../db/est", __FILE__))
			system("seq 0 #{est.count - 1} | parallel -j#{est.count} wget -N ftp://#{ncbiftp}/blast/db/est_human.0{}.tar.gz")
			# now need to uncompress
			system("find . -name 'est_human.*.tar.gz' -exec tar zxvf {} \\;")
			puts "Finished; files are in #{Dir.pwd}"
		end

3.5 Dump human ESTs to multiple FASTA files

We can dump the human EST BLAST database to FASTA, splitting the output into multiple files in order to speed up the BLAT search later on. There are lots of tools for splitting FASTA files; I found that the best for this pipeline is fastasplitn. The task is rake db:est:split.

Once again there is some regrettable hard-coding here for the number of output files (20). It would be better to write code which could determine an optimal number of output files.

		desc "Dump human EST BLAST db to fasta, splitting for BLAT"
		task :split do
			# code to dump fasta and split
			Dir.chdir(File.expand_path("../../../db/est", __FILE__))
			ENV['SPLITFRAGTEMPLATE'] = "est_human%3.3d"
			system("blastdbcmd -db est_human -entry all -line_length 60 | fastasplitn - 20")
			puts "Finished; files are in #{Dir.pwd}"
		end

3.6 BLAT search human ESTs versus archaeal sequences

We’re ready to BLAT, using rake search:blat:run. This generates 20 output PSL files.

		desc "Run BLAT of human EST fasta files (query) v. archaea.ffn (subject)"
		task :run do
			Dir.chdir(File.expand_path("../../../", __FILE__))
			system("parallel -j10 'var=$(printf '%03d' {}); blat db/archaea/archaea.ffn db/est/est_human$var -ooc=data/11.ooc data/psl/est_human$var.psl' ::: $(seq 1 10)")
			system("parallel -j10 'var=$(printf '%03d' {}); blat db/archaea/archaea.ffn db/est/est_human$var -ooc=data/11.ooc data/psl/est_human$var.psl' ::: $(seq 11 20)")
			puts "Finished; output PSL files are in #{Dir.pwd}/data/psl"
		end

3.7 Parse BLAT reports

Next, we want to retrieve unique EST query GIs from the BLAT reports. The task for that is rake search:blat:parse and the GIs are written to a file, one per line.

		desc "Parse PSL files, extract list of unique human EST GI, write to file"
		task :parse do
			Dir.chdir(File.expand_path("../../../data", __FILE__))
			system("cat psl/*.psl | cut -f10 | grep ^gi | cut -d '|' -f2 | sort -u > est_hits_gi.txt")
			puts "Finished; GIs in file #{Dir.pwd}/est_hits_gi.txt"			
		end

3.8 Dump EST hits to FASTA file

Now we can use the list of GIs from the previous step to dump those ESTs with similarity to archaeal sequences into a FASTA file, using rake db:est:hitdump.

		desc "Dump GIs in data/est_hits_gi.txt to fasta"
		task :hitdump do
			Dir.chdir(File.expand_path("../../..", __FILE__))
			system("blastdbcmd -db db/est/est_human -entry_batch data/est_hits_gi.txt -line_length 60 -out data/est_hits.fa")
			puts "Finished; sequences are in #{Dir.pwd}/data/est_hits.fa"			
		end

3.9 BLAST search EST hits versus nt database

We’re getting close. Archaeal-like ESTs versus nt database using rake search:blast:run. The output format is tab-delimited and includes taxon IDs, important for the next step. On my system BLAST crashes if more than 4 threads are specified, hence the hard-coded limit.

		desc "BLAST search human EST BLAT hits to archaea v. nt database"
		task :run do
			Dir.chdir(File.expand_path("../../../", __FILE__))
			format = "6 qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore staxids" 
			system("blastn -db db/nt/nt -query data/est_hits.fa -out data/est_v_nt.tsv -max_target_seqs 1 -outfmt '#{format}' -num_threads 4")
			puts "Finished; output in #{Dir.pwd}/data/est_v_nt.tsv"
		end

3.10 Add species information to BLAST report

The final task, rake search:blast:parse, uses the BioRuby implementation of NCBI EUtils/Efetch and some XML parsing to determine species and kingdom from the taxon IDs and add them to the BLAST output, in a new CSV file.

		desc "Parse EST v nt BLAST output; use taxid to retrieve species and kingdom"
		task :parse do
			Bio::NCBI.default_email = "me@me.com"
			taxids = Hash.new
			Dir.chdir(File.expand_path("../../../", __FILE__))
			File.open("data/est_v_nt.csv", "w") do |f|
				File.readlines("data/est_v_nt.tsv").each do |line|
					line = line.chomp
					fields = line.split("\t").map(&:strip)
					if taxids.has_key? fields[12]
						fields = fields + taxids[fields[12]]
					else
						xml   = Bio::NCBI::REST::EFetch.taxonomy(fields[12], "xml")
						doc   = Nokogiri::XML(xml)
						names = doc.xpath("//Taxon/ScientificName").children.map { |child| child.inner_text }
						taxids[fields[12]] = [names[0], names[2]]
						fields = fields + taxids[fields[12]]
					end
					f.write(fields.join(",") + "\n")
				end
			end	
			puts "Finished; output in #{Dir.pwd}/data/est_v_nt.csv"		
		end

4. The results

Let’s count up the kingdoms in the output file:

cut -f15 -d "," data/est_v_nt.csv | sort | uniq -c
     14 Archaea
   1565 artificial sequences
    262 Bacteria
     60 dsDNA viruses
  14128 Eukaryota

So most of the hits are to eukaryotic sequences; of those, 11 229 “really are” human. What about those 14 hits to Archaea?

grep Archaea data/est_v_nt.csv | cut -f1,2,14,15 -d ","
gi|12397306|gb|BF990981.1|,gi|1145360|gb|U39025.1|MAU39025,Methanococcus aeolicus,Archaea
gi|1441578|gb|N88376.1|,gi|393188278|gb|CP003685.1|,Pyrococcus furiosus COM1,Archaea
gi|1570770|dbj|C16063.1|,gi|393188278|gb|CP003685.1|,Pyrococcus furiosus COM1,Archaea
gi|1570774|dbj|C16067.1|,gi|393188278|gb|CP003685.1|,Pyrococcus furiosus COM1,Archaea
gi|30285459|gb|CB990939.1|,gi|288541968|gb|CP001719.1|,Methanobrevibacter ruminantium M1,Archaea
gi|30285481|gb|CB990961.1|,gi|288541968|gb|CP001719.1|,Methanobrevibacter ruminantium M1,Archaea
gi|30285514|gb|CB990994.1|,gi|288541968|gb|CP001719.1|,Methanobrevibacter ruminantium M1,Archaea
gi|30285523|gb|CB991003.1|,gi|288541968|gb|CP001719.1|,Methanobrevibacter ruminantium M1,Archaea
gi|30285523|gb|CB991003.1|,gi|288541968|gb|CP001719.1|,Methanobrevibacter ruminantium M1,Archaea
gi|30285523|gb|CB991003.1|,gi|288541968|gb|CP001719.1|,Methanobrevibacter ruminantium M1,Archaea
gi|30286509|gb|CB991989.1|,gi|288541968|gb|CP001719.1|,Methanobrevibacter ruminantium M1,Archaea
gi|30288113|gb|CB993593.1|,gi|288541968|gb|CP001719.1|,Methanobrevibacter ruminantium M1,Archaea
gi|30291474|gb|CB997050.1|,gi|288541968|gb|CP001719.1|,Methanobrevibacter ruminantium M1,Archaea
gi|8613918|gb|BE151197.1|,gi|393188278|gb|CP003685.1|,Pyrococcus furiosus COM1,Archaea

An improvement on our original publication, which yielded only one weak candidate. We’re still seeing hits to Pyrococcus furiosus, likely a result of contamination with Pfu polymerase DNA. The other hits are to methanogen genomes sequenced after 2004 although intriguingly, the tissue from which these ESTs were derived is pre-eclamptic placenta; see for example GI 30285459. Methanogens in placenta? Or on the fingers of less-than-hygienic researchers? Spurious hits due to low complexity sequence? “More research required”.

Conclusion

I’m now in a position to run this analysis at regular intervals – probably monthly – and push the results to Github. Watch this space for any interesting developments.

Making this kind of pipeline available and reproducible by others is less easy. If you have access to a machine with all the prerequisites, clone the repository, replace the symlinks to database directories with real directories, change to the directory code/ruby and run the rake tasks – well, you might be able to do the same thing. But you’d probably rather sequence the human microbiome.


Filed under: bioinformatics, research diary, ruby Tagged: archaea, database, est, github
Date: Tuesday, 01 Jul 2014 03:27

I’ve long admired the work of the Open Source Malaria Project. Unfortunately time and “day job” constraints prevent me from being as involved as I’d like.

So: I was happy to make a small contribution recently in response to this request for help.

Note – this all works fine under Linux; there seem to be some issues with Open Babel library files under OSX.

First step: make that data usable by rescuing it from the spreadsheet ;) We’ll clean up a column name too.

library(XLConnect)
mmv <- readWorksheetFromFile("TP compounds with solid amounts 14_3_14.xlsx", sheet = "Sheet1")
colnames(mmv)[5] <- "EC50"

head(mmv)

  COMPOUND_ID                                                      Smiles     MW QUANTITY.REMAINING..mg.
1   MMV668822 c1[n+](cc2n(c1OCCc1cc(c(cc1)F)F)c(nn2)c1ccc(cc1)OC(F)F)[O-] 434.35                     0.0
2   MMV668823      c1nc(c2n(c1OCCc1cc(c(cc1)F)F)c(nn2)c1ccc(cc1)OC(F)F)Cl 452.79                     0.0
3   MMV668824                        c1ncc2n(c1CCO)c(nn2)c1ccc(cc1)OC(F)F 306.27                    29.6
4   MMV668955                        C1NCc2n(C1CCO)c(nn2)c1ccc(cc1)OC(F)F 310.30                    18.5
5   MMV668956    C1(CN(C1)c1cc(c(cc1)F)F)Oc1cncc2n1c(nn2)c1ccc(cc1)OC(F)F 445.38                   124.2
6   MMV668957          c1ncc2n(c1N1CCC(C1)c1ccccc1)c(nn2)c1ccc(cc1)OC(F)F 407.42                    68.5
   EC50 New.quantity.remaining
1  4.01                      0
2  0.16                      0
3 10.00                     29
4  8.37                     18
5  0.43                    124
6  2.00                     62

What OSM would like: an output file in Chemical Markup Language, containing the Compound ID and properties (MW and EC50).

The ChemmineR package makes conversion of SMILES strings to other formats pretty straightforward; we start by converting to Structure Data Format (SDF):

library(ChemmineR)
library(ChemmineOB)

mmv.sdf   <- smiles2sdf(mmv$Smiles)

That will throw a warning, since all molecules in the SDF object have the same CID; currently, no CID (empty string). We add the CID using the compound ID, then use datablock() to add properties:

cid(mmv.sdf) <- mmv$COMPOUND_ID
datablock(mmv.sdf) <- data.frame(MW = mmv$MW, EC50 = mmv$EC50)

Now we can write out to an SDF file. We could also use a loop or an apply function to write individual files per molecule (sketched below).

write.SDF(mmv.sdf, "mmv-all.sdf", cid = TRUE)
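
The per-molecule alternative mentioned above might look something like this; a sketch only, not tested against the OSM data:

# sketch: write one SDF file per compound, named by its compound ID
for (i in seq_along(cid(mmv.sdf))) {
    write.SDF(mmv.sdf[i], paste0(cid(mmv.sdf)[i], ".sdf"), cid = TRUE)
}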

It would be nice to stay in the one R script for conversion to CML too but for now, I just run Open Babel from the command line. Note that the -xp flag is required to include the properties in CML:

babel -xp mmv-all.sdf mmv-all.cml

That’s it; here’s my OSMinformatics Github repository, here’s the output.


Filed under: open science, programming, R, statistics Tagged: cheminformatics, conversion, malaria, osm, smiles
Author: "nsaunders" Tags: "open science, programming, R, statistics..."
Comments Send by mail Print  Save  Delicious 
Date: Tuesday, 01 Jul 2014 03:27

I’ve long admired the work of the Open Source Malaria Project. Unfortunately time and “day job” constraints prevent me from being as involved as I’d like.

So: I was happy to make a small contribution recently in response to this request for help:

Note – this all works fine under Linux; there seem to be some issues with Open Babel library files under OSX.

First step: make that data usable by rescuing it from the spreadsheet ;) We’ll clean up a column name too.

library(XLConnect)
mmv <- readWorksheetFromFile("TP compounds with solid amounts 14_3_14.xlsx", sheet = "Sheet1")
colnames(mmv)[5] <- "EC50"

head(mmv)

  COMPOUND_ID                                                      Smiles     MW QUANTITY.REMAINING..mg.
1   MMV668822 c1[n+](cc2n(c1OCCc1cc(c(cc1)F)F)c(nn2)c1ccc(cc1)OC(F)F)[O-] 434.35                     0.0
2   MMV668823      c1nc(c2n(c1OCCc1cc(c(cc1)F)F)c(nn2)c1ccc(cc1)OC(F)F)Cl 452.79                     0.0
3   MMV668824                        c1ncc2n(c1CCO)c(nn2)c1ccc(cc1)OC(F)F 306.27                    29.6
4   MMV668955                        C1NCc2n(C1CCO)c(nn2)c1ccc(cc1)OC(F)F 310.30                    18.5
5   MMV668956    C1(CN(C1)c1cc(c(cc1)F)F)Oc1cncc2n1c(nn2)c1ccc(cc1)OC(F)F 445.38                   124.2
6   MMV668957          c1ncc2n(c1N1CCC(C1)c1ccccc1)c(nn2)c1ccc(cc1)OC(F)F 407.42                    68.5
   EC50 New.quantity.remaining
1  4.01                      0
2  0.16                      0
3 10.00                     29
4  8.37                     18
5  0.43                    124
6  2.00                     62

What OSM would like: an output file in Chemical Markup Language, containing the Compound ID and properties (MW and EC50).

The ChemmineR package makes conversion of SMILES strings to other formats pretty straightforward; we start by converting to Structure Data Format (SDF):

library(ChemmineR)
library(ChemmineOB)

mmv.sdf   <- smiles2sdf(mmv$Smiles)

That will throw a warning, since all molecules in the SDF object have the same CID; currently, no CID (empty string). We add the CID using the compound ID, then use datablock() to add properties:

cid(mmv.sdf) <- mmv$COMPOUND_ID
datablock(mmv.sdf) <- data.frame(MW = mmv$MW, EC50 = mmv$EC50)

Now we can write out to a SDF file. We could also use a loop or an apply function to write individual files per molecule.

write.SDF(mmv.sdf, "mmv-all.sdf", cid = TRUE)

It would be nice to stay in the one R script for conversion to CML too but for now, I just run Open Babel from the command line. Note that the -xp flag is required to include the properties in CML:

babel -xp mmv-all.sdf mmv-all.cml

That’s it; here’s my OSMinformatics Github repository, here’s the output.


Filed under: open science, programming, R, statistics Tagged: cheminformatics, conversion, malaria, osm, smiles
Author: "nsaunders" Tags: "open science, programming, R, statistics..."
Comments Send by mail Print  Save  Delicious 
Date: Monday, 23 Jun 2014 04:24

Over the years, I’ve written a lot of small “utility scripts”. You know the kind of thing. Little code snippets that facilitate research, rather than generate research results. For example: just what are the fields that you can use to qualify Entrez database searches?

Typically, they end up languishing in long-forgotten Dropbox directories. Sometimes, the output gets shared as a public link. No longer! As of today, “little code snippets that do (hopefully) useful things” have a new home at Github.

Also as of today: there’s not much there right now, just the aforementioned Entrez database code and output. I’m not out to change the world here, just to do a little better.


Filed under: bioinformatics, research diary, software Tagged: bioinformatics, code, github, repository
Date: Wednesday, 14 May 2014 01:59

There’s a lot of discussion around why code written by self-taught “scientist programmers” rarely follows what a trained computer scientist would consider “best practice”. Here’s a recent post on the topic.

One answer: we begin with exploratory data analysis and never get around to cleaning it up.

An example. For some reason, a researcher (let’s call him “Bob”) becomes interested in a particular dataset in the GEO database. So Bob opens the R console and uses the GEOquery package to grab the data:

Update: those of you commenting “should have used Python instead” have completely missed the point. Your comments are off-topic and will not be published. Doubly so when you get snarky about it.

library(GEOquery)
gse <- getGEO("GSE22255")

Bob is interested in the covariates and metadata associated with the experiment, which he can access using pData().

pd <- pData(gse$GSE22255_series_matrix.txt.gz)
names(pd)
#  [1] "title"                   "geo_accession"          
#  [3] "status"                  "submission_date"        
#  [5] "last_update_date"        "type"                   
#  [7] "channel_count"           "source_name_ch1"        
#  [9] "organism_ch1"            "characteristics_ch1"    
# [11] "characteristics_ch1.1"   "characteristics_ch1.2"  
# [13] "characteristics_ch1.3"   "characteristics_ch1.4"  
# [15] "characteristics_ch1.5"   "characteristics_ch1.6"  
# [17] "treatment_protocol_ch1"  "molecule_ch1"           
# [19] "extract_protocol_ch1"    "label_ch1"              
# [21] "label_protocol_ch1"      "taxid_ch1"              
# [23] "hyb_protocol"            "scan_protocol"          
# [25] "description"             "data_processing"        
# [27] "platform_id"             "contact_name"           
# [29] "contact_email"           "contact_phone"          
# [31] "contact_fax"             "contact_laboratory"     
# [33] "contact_department"      "contact_institute"      
# [35] "contact_address"         "contact_city"           
# [37] "contact_state"           "contact_zip/postal_code"
# [39] "contact_country"         "supplementary_file"     
# [41] "data_row_count"  

Bob discovers that pd$characteristics_ch1.2 is “age at examination”, but it’s stored as a factor. He’d like to use it as a numeric variable. So he sets about figuring out how to do the conversion.

# first you need factor to character
as.character(pd$characteristics_ch1.2[1:5])
# [1] "age-at-examination: 68" "age-at-examination: 68" "age-at-examination: 73"
# [4] "age-at-examination: 65" "age-at-examination: 73"

# now you want to get rid of everything but the age value
# so you wrap it in a gsub()
gsub("age-at-examination: ", "", as.character(pd$characteristics_ch1.2[1:5]))
# [1] "68" "68" "73" "65" "73"

# and the last step is character to numeric
as.numeric(gsub("age-at-examination: ", "", as.character(pd$characteristics_ch1.2[1:5])))
# [1] 68 68 73 65 73

Three levels of nested methods. Ugly. However, it works, so Bob moves to the next task (using those ages to do some science).

He thinks: “at some point, I should rewrite that as a function to convert ages.”
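
For illustration only, the cleanup Bob has in mind might be as simple as this (a sketch, not part of Bob’s actual code):

# the function Bob never gets around to writing
extractAge <- function(x, prefix = "age-at-examination: ") {
    as.numeric(gsub(prefix, "", as.character(x)))
}
extractAge(pd$characteristics_ch1.2[1:5])
# [1] 68 68 73 65 73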

But he never does. We stop when it works.


Filed under: programming, R, research diary, statistics
Date: Tuesday, 29 Apr 2014 21:29

Just a brief technical note.

I figured that for a given compound in PubChem, it would be interesting to know whether that compound had been used in a high-throughput experiment, which you might find in GEO. Very easy using the E-utilities, as implemented in the R package rentrez:

library(rentrez)
links <- entrez_link(dbfrom = "pccompound", db = "gds", id = "62857")
length(links$pccompound_gds)
# [1] 741

Browsing the rentrez documentation, I note that db can take the value “all”. Sounds useful!

links <- entrez_link(dbfrom = "pccompound", db = "all", id = "62857")
length(links$pccompound_gds)
# [1] 0

That’s odd. In fact, this query does not even link pccompound to gds:

length(names(links))
# [1] 39
which(names(links) == "pccompound_gds")
# integer(0)

It’s not a rentrez issue, since the same result occurs using the E-utilities URL.
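
For reference, the raw E-utilities queries look something like this (standard elink.fcgi parameters, shown only for illustration):

# the equivalent raw elink queries (illustration only)
base <- "http://eutils.ncbi.nlm.nih.gov/entrez/eutils/elink.fcgi"
url.gds <- paste0(base, "?dbfrom=pccompound&db=gds&id=62857")
url.all <- paste0(base, "?dbfrom=pccompound&db=all&id=62857")
# the db=all result lacks a pccompound_gds LinkSetDb, matching the rentrez result above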

The good people at ropensci have opened an issue, contacting NCBI for clarification. We’ll keep you posted.


Filed under: bioinformatics, programming, research diary Tagged: api, elink, entrez, ncbi, rentrez
Date: Thursday, 20 Mar 2014 03:00

Next week I’ll be in Melbourne for one of my favourite meetings, the annual Computational and Simulation Sciences and eResearch Conference.

The main reason for my visit is the Bioinformatics FOAM workshop. Day 1 (March 27) is not advertised since it is an internal CSIRO day, but I’ll be presenting a talk titled “SQL, noSQL or no database at all? Are databases still a core skill?“. Day 2 (March 28) is open to all and I’ll be talking about “Learning from complete strangers: social networking for bioinformaticians“.

I imagine these and other talks will appear on Slideshare soon, at both my account and that of the Australian Bioinformatics Network.

I’m also excited to see that Victoria Stodden is presenting a keynote at the main CSS meeting (PDF) on “Reproducibility in Computational Science: Opportunities and Challenges”.

Hope to see some of you there.


Filed under: bioinformatics, computing, meetings, travel Tagged: conference, foam, melbourne