Date: Wednesday, 23 Jul 2014 05:18

I’m currently rather sleep-deprived and prone to doing stupid things. Like this, for example:

rsync -av ~/Dropbox /path/to/backup/directory/

where the directory /path/to/backup/directory already contains a much older Dropbox directory. So when I set up a new machine, install Dropbox and copy the Dropbox directory back to its default location - hey! What happened to all my space? What are all these old files? Oh wait... I forgot to:

rsync -av --delete

Now, files can be restored of course, but not easily when there are thousands of them and I don't even know which are old and which are new. What I want to do is restore the directories under ~/Dropbox to the state that they were in yesterday, before I stuffed up.

Luckily Chris Clark wrote dropbox-restore. It does exactly what it says on the tin. For example:

python restore.py /Camera\ Uploads 2014-07-22
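If more than one directory needs restoring, a small loop over restore.py does the job. A sketch, assuming the directory names below (they are placeholders, not from the post):

for d in "Camera Uploads" "Documents" "Photos"; do
    # restore each top-level Dropbox directory to its state on 2014-07-22
    python restore.py "/$d" 2014-07-22
done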

Thanks Chris!


Filed under: computing, research diary, software Tagged: dropbox, github
Author: "nsaunders" Tags: "computing, research diary, software, dro..."
Comments Send by mail Print  Save  Delicious 
Date: Monday, 21 Jul 2014 13:59

This post is an apology and an attempt to make amends for contributing to the decay of online bioinformatics resources. It’s also, I think, a nice example of why reproducible research can be difficult.

Come back in time with me 10 years, to 2004.

1. Introduction

In 2004, I was working on archaeal genomes. One day in the lab, we were wondering why there are no known archaeal pathogens of humans. As a first step, we asked whether there was a way to look for nucleotide sequences from human-associated Archaea.

You’re thinking: “why not just sequence the human microbiome?” Yes, that is the 2014 solution. This, however, was 2004 when next-generation sequencing and metagenomics were just emerging. So we had a different idea: screen the EST database for transcripts that are annotated as human but which more closely resemble archaeal sequences. Some of those would, hopefully, be derived from human-associated Archaea.

I came up with a strategy, described in the next section, and generated some not-especially-promising results. Back in 2004, the ability to create a MySQL database with a PHP front-end was considered to be an impressive skill in bioinformatics, so we set up a web server and wrote a paper called “An online database for the detection of novel archaeal sequences in human ESTs”. Today, this is so trivial that I doubt it would even be sent out for review. I still laugh when I think of a comment from one of the reviewers, that the work was of major interest due to its potential use in fighting bio-terrorism.

[Screenshot: Possibly even worse than a 404 not found]


And then I left the group. Click on the URL in the Availability section of that paper and, after a long wait, you'll see something like the screenshot above.

I apologise. In 2004, I gave little if any thought to the problem of maintaining online resources and my attitude, like that of many others, was that the academic publication was the only goal, the endpoint. How very wrong I was.

2. The strategy

Let’s look in more detail at the process for screening the EST database.

You might ask: why not simply BLAST search ESTs (query) versus nt (subject) and parse the output for cases where the top hit was to an archaeal sequence? Two main reasons.

First – infrastructure. We did not have access to parallel BLAST on a large cluster, which would be required given the size of both query and subject.

Second – efficiency. The vast majority of EST sequences in the human subset of dbEST really are derived from human transcripts. It makes more sense to perform an initial screen for those sequences that might not be human, then follow up those results using BLAST.

So here’s the screening strategy that I devised.

  1. Grab all known coding sequences from archaeal genomes to use as a search database.
  2. Use BLAT to search human EST sequences against the archaeal sequences. BLAT speeds up the process: the query files can be split so that each job fits into available memory, BLAT run on each split, and the tab-delimited output easily recombined.
  3. Extract the much smaller subset of EST sequences that had a hit in step 2 and use those as queries in a BLAST search versus the large non-redundant nt nucleotide database.
  4. Parse the BLAST report and assign species to the top hit for each EST query.
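In shell terms, the whole screen boils down to something like the following. This is only a sketch with placeholder file names; the real, automated commands appear in the Rakefile tasks of section 3.

# step 1: concatenate archaeal coding sequences into one search database
cat NC_*.ffn > archaea.ffn
# step 2: BLAT a chunk of human ESTs against the archaeal sequences
blat archaea.ffn est_human001.fa est_human001.psl
# step 3: extract the ESTs with a BLAT hit, then BLAST them against nt
blastdbcmd -db est_human -entry_batch est_hits_gi.txt > est_hits.fa
blastn -db nt -query est_hits.fa -outfmt 6 > est_v_nt.tsv
# step 4: parse est_v_nt.tsv and assign a species to the top hit for each EST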

3. The 2014 solution

I’ve long wanted not only to update the results and make them available, but also to automate the procedure and try to make it reproducible. My current solution to automation is a Ruby Rakefile. It works for me on my current system, meaning that I can at least run the pipeline at regular intervals with minimal effort or input on my part.

Whether anyone else could is another question. For reference: the server on which I work has 24 6-core CPUs, 96 GB RAM, plenty (TB) of storage and runs Ubuntu Linux. Other requirements should be apparent as we examine the contents of the Rakefile. It should be clear that this is a far-from-perfect work in progress.

My solution to availability is, naturally, a Github repository.

Feel free to skip the rest of this section if technical discussion of Rakefiles is not your thing.

Checking for prerequisites

The rake check task looks for required programs in $PATH, using the useful find_executable method (from Ruby's mkmf library), and gives a hint about where to find them if they are not installed.

desc "Check for prerequisites"
task :check do
	# check for various tools in PATH
	# we need seq, GNU parallel, wget, fastasplitn, blat, blastn, blastdbcmd
	cmds = {
		"seq" => "seq is part of coreutils http://www.gnu.org/software/coreutils/",
		"parallel" => "GNU Parallel is at http://www.gnu.org/software/parallel/",
		"wget" => "GNU Wget is at https://www.gnu.org/software/wget/",
		"fastasplitn" => "fastasplitn can be compiled from the src/ directory",
		"blat" => "BLAT is at http://users.soe.ucsc.edu/~kent/src/",
		"blastn" => "The latest BLAST+ is at ftp://ftp.ncbi.nih.gov/blast/executables/LATEST",
		"blastdbcmd" => "The latest BLAST+ is at ftp://ftp.ncbi.nih.gov/blast/executables/LATEST"
	}
	cmds.each_pair do |k, v|
		f = find_executable k
		if f.nil?
			puts "#{k} not found in path; #{v}"
		end
	end
end

3.1 Fetch archaeal sequences

The Rakefile uses two namespaces (not shown in the following code snippets): db for database-related tasks and search, for BLAT/BLAST search-related tasks. Further namespaces are used for the type of database: archaea, est or nt. So we fetch archaeal sequences using rake db:archaea:fetch. This code uses a file from the NCBI FTP server to determine the FTP URL for archaeal genome sequence data.
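Before looking at the individual tasks, here is roughly how the namespaces nest. This is a sketch of the layout only; the require line and the ncbiftp host variable are assumed to be defined once at the top of the real Rakefile.

require 'net/ftp'

ncbiftp = "ftp.ncbi.nih.gov"  # NCBI FTP host used by the fetch tasks

namespace :db do
	namespace :archaea do
		desc "Fetch archaea genome ffn files"
		task :fetch do
			# body shown below (section 3.1)
		end
	end
end

namespace :search do
	namespace :blat do
		desc "Run BLAT of human EST fasta files (query) v. archaea.ffn (subject)"
		task :run do
			# body shown later (section 3.6)
		end
	end
end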

		desc "Fetch archaea genome ffn files"
		task :fetch do
			Dir.chdir(File.expand_path("../../../db/archaea", __FILE__))
			ftp = Net::FTP.new(ncbiftp)
			ftp.login
			ftp.chdir("genomes/GENOME_REPORTS")
			ftp.gettextfile("prokaryotes.txt", nil) do |line|
				line   = line.chomp
				fields = line.split("\t")
				if fields[4] =~ /[Aa]rchae/ && fields[23] != "-"
					puts fields[23]
					system("wget -N ftp://#{ncbiftp}/genomes/ASSEMBLY_BACTERIA/#{fields[23]}/*.ffn*")
				end
			end
		end

3.2 Process archaeal sequences

The fetch task downloads two kinds of files: NC_*.ffn files from annotated genomes, and compressed NZ*.tgz files (which uncompress to *.ffn) from incompletely annotated genomes. The task rake db:archaea:concat combines them all into one FASTA file.

		desc "Uncompress archaea *..scaffold.ffn.tgz files; combine with *.ffn files"
		task :concat do
			Dir.chdir(File.expand_path("../../../db/archaea", __FILE__))
			puts "Extracting tgz files to NZ.ffn..."
			system("find . -name '*.tgz' -exec tar zxfO {} > NZ.ffn \\;")
			puts "Combining NZ.ffn + NC*.ffn files..."
			system("cat NZ.ffn NC_0*.ffn > archaea.ffn")
			puts "Finished; files are in #{Dir.pwd}"
		end

3.3 Fetch nt BLAST database

Fetching the other databases is more straightforward. First, the nt BLAST database, using rake db:nt:fetch. You can see that this is currently not very reproducible, since it makes some assumptions about the number and names of files to download.

		desc "Fetch and uncompress nt BLAST db"
		task :fetch do
			# fetch code here
			ftp = Net::FTP.new(ncbiftp)
			ftp.login
			ftp.chdir("blast/db")
			nt = ftp.list("nt.*.tar.gz")
			nt = nt.map { |f| f.split(" ").last }
			Dir.chdir(File.expand_path("../../../db/nt", __FILE__))
			system("seq 0 9 | parallel -j10 wget -N ftp://ftp.ncbi.nih.gov/blast/db/nt.0{}.tar.gz")
			system("seq 10 #{nt.count - 1} | parallel -j#{nt.count - 10} wget -N ftp://ftp.ncbi.nih.gov/blast/db/nt.{}.tar.gz")
			# now need to uncompress
			system("find . -name 'nt.*.tar.gz' -exec tar zxvf {} \\;")
			puts "Finished; files are in #{Dir.pwd}"
		end
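A more reproducible version might drive the downloads directly from the FTP listing, rather than assuming how many nt.*.tar.gz files there are and how they are numbered. A rough sketch, not part of the Rakefile:

# write the listing to a file, then let parallel fetch exactly those files
File.write("nt_files.txt", nt.join("\n"))
system("cat nt_files.txt | parallel -j10 wget -N ftp://#{ncbiftp}/blast/db/{}")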

3.4 Fetch human EST BLAST database

In similar fashion to the previous task, rake db:est:fetch downloads and extracts the human EST BLAST database. Again there are assumptions about number and names of files.

		desc "Fetch and uncompress human EST BLAST db"
		task :fetch do
			ftp = Net::FTP.new(ncbiftp)
			ftp.login
			ftp.chdir("blast/db")
			est = ftp.list("est_human*.tar.gz")
			est = est.map { |f| f.split(" ").last }
			Dir.chdir(File.expand_path("../../../db/est", __FILE__))
			system("seq 0 #{est.count - 1} | parallel -j#{est.count} wget -N ftp://#{ncbiftp}/blast/db/est_human.0{}.tar.gz")
			# now need to uncompress
			system("find . -name 'est_human.*.tar.gz' -exec tar zxvf {} \\;")
			puts "Finished; files are in #{Dir.pwd}"
		end

3.5 Dump human ESTs to multiple FASTA files

We can dump the human EST BLAST database to FASTA, splitting the output into multiple files in order to speed up the BLAT search later on. There are lots of tools for splitting FASTA files; I found that the best for this pipeline is fastasplitn. The task is rake db:est:split.

Once again there is some regrettable hard-coding here for the number of output files (20). It would be better to write code which could determine an optimal number of output files.

		desc "Dump human EST BLAST db to fasta, splitting for BLAT"
		task :split do
			# code to dump fasta and split
			Dir.chdir(File.expand_path("../../../db/est", __FILE__))
			ENV['SPLITFRAGTEMPLATE'] = "est_human%3.3d"
			system("blastdbcmd -db est_human -entry all -line_length 60 | fastasplitn - 20")
			puts "Finished; files are in #{Dir.pwd}"
		end
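One way to avoid the hard-coded 20 would be to derive the number of output files from the size of the BLAST database on disk. A rough sketch; the one-gigabyte-per-chunk target and the *.nsq glob are assumptions, not part of the repository:

# sum the sizes of the est_human sequence files; aim for roughly 1 GB per chunk
db_bytes = Dir.glob("est_human*.nsq").map { |f| File.size(f) }.inject(0, :+)
n_chunks = [(db_bytes / 1_000_000_000.0).ceil, 1].max
system("blastdbcmd -db est_human -entry all -line_length 60 | fastasplitn - #{n_chunks}")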

3.6 BLAT search human ESTs versus archaeal sequences

We're ready to BLAT, using rake search:blat:run. This generates 20 output PSL files.

		desc "Run BLAT of human EST fasta files (query) v. archaea.ffn (subject)"
		task :run do
			Dir.chdir(File.expand_path("../../../", __FILE__))
			system("parallel -j10 'var=$(printf '%03d' {}); blat db/archaea/archaea.ffn db/est/est_human$var -ooc=data/11.ooc data/psl/est_human$var.psl' ::: $(seq 1 10)")
			system("parallel -j10 'var=$(printf '%03d' {}); blat db/archaea/archaea.ffn db/est/est_human$var -ooc=data/11.ooc data/psl/est_human$var.psl' ::: $(seq 11 20)")
			puts "Finished; output PSL files are in #{Dir.pwd}/data/psl"
		end

3.7 Parse BLAT reports

Next, we want to retrieve unique EST query GIs from the BLAT reports. The task for that is rake search:blat:parse and the GIs are written to a file, one per line.

		desc "Parse PSL files, extract list of unique human EST GI, write to file"
		task :parse do
			Dir.chdir(File.expand_path("../../../data", __FILE__))
			system("cat psl/*.psl | cut -f10 | grep ^gi | cut -d '|' -f2 | sort -u > est_hits_gi.txt")
			puts "Finished; GIs in file #{Dir.pwd}/est_hits_gi.txt"			
		end

3.8 Dump EST hits to FASTA file

Now we can use the list of GIs from the previous step to dump those ESTs with similarity to archaeal sequences into a FASTA file, using rake db:est:hitdump.

		desc "Dump GIs in data/est_hits_gi.txt to fasta"
		task :hitdump do
			Dir.chdir(File.expand_path("../../..", __FILE__))
			system("blastdbcmd -db db/est/est_human -entry_batch data/est_hits_gi.txt -line_length 60 -out data/est_hits.fa")
			puts "Finished; sequences are in #{Dir.pwd}/data/est_hits.fa"			
		end

3.9 BLAST search EST hits versus nt database

We're getting close. Now we BLAST the archaeal-like ESTs against the nt database, using rake search:blast:run. The output format is tab-delimited and includes taxon IDs, important for the next step. On my system BLAST crashes if more than 4 threads are specified, hence the hard-coded limit.

		desc "BLAST search human EST BLAT hits to archaea v. nt database"
		task :run do
			Dir.chdir(File.expand_path("../../../", __FILE__))
			format = "6 qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore staxids" 
			system("blastn -db db/nt/nt -query data/est_hits.fa -out data/est_v_nt.tsv -max_target_seqs 1 -outfmt '#{format}' -num_threads 4")
			puts "Finished; output in #{Dir.pwd}/data/est_v_nt.tsv"
		end

3.10 Add species information to BLAST report

The final task, rake search:blast:parse, uses the BioRuby implementation of NCBI EUtils/Efetch and some XML parsing to determine species and kingdom from the taxon IDs and add them to the BLAST output, in a new CSV file.

		desc "Parse EST v nt BLAST output; use taxid to retrieve species and kingdom"
		task :parse do
			Bio::NCBI.default_email = "me@me.com"
			taxids = Hash.new
			Dir.chdir(File.expand_path("../../../", __FILE__))
			File.open("data/est_v_nt.csv", "w") do |f|
				File.readlines("data/est_v_nt.tsv").each do |line|
					line = line.chomp
					fields = line.split("\t").map(&:strip)
					if taxids.has_key? fields[12]
						fields = fields + taxids[fields[12]]
					else
						xml   = Bio::NCBI::REST::EFetch.taxonomy(fields[12], "xml")
						doc   = Nokogiri::XML(xml)
						names = doc.xpath("//Taxon/ScientificName").children.map { |child| child.inner_text }
						taxids[fields[12]] = [names[0], names[2]]
						fields = fields + taxids[fields[12]]
					end
					f.write(fields.join(",") + "\n")
				end
			end	
			puts "Finished; output in #{Dir.pwd}/data/est_v_nt.csv"		
		end

4. The results

Let's count up the kingdoms in the output file:

cut -f15 -d "," data/est_v_nt.csv | sort | uniq -c
     14 Archaea
   1565 artificial sequences
    262 Bacteria
     60 dsDNA viruses
  14128 Eukaryota

So most of the hits are to eukaryotic sequences; of those, 11 229 "really are" human. What about those 14 hits to Archaea?

grep Archaea data/est_v_nt.csv | cut -f1,2,14,15 -d ","
gi|12397306|gb|BF990981.1|,gi|1145360|gb|U39025.1|MAU39025,Methanococcus aeolicus,Archaea
gi|1441578|gb|N88376.1|,gi|393188278|gb|CP003685.1|,Pyrococcus furiosus COM1,Archaea
gi|1570770|dbj|C16063.1|,gi|393188278|gb|CP003685.1|,Pyrococcus furiosus COM1,Archaea
gi|1570774|dbj|C16067.1|,gi|393188278|gb|CP003685.1|,Pyrococcus furiosus COM1,Archaea
gi|30285459|gb|CB990939.1|,gi|288541968|gb|CP001719.1|,Methanobrevibacter ruminantium M1,Archaea
gi|30285481|gb|CB990961.1|,gi|288541968|gb|CP001719.1|,Methanobrevibacter ruminantium M1,Archaea
gi|30285514|gb|CB990994.1|,gi|288541968|gb|CP001719.1|,Methanobrevibacter ruminantium M1,Archaea
gi|30285523|gb|CB991003.1|,gi|288541968|gb|CP001719.1|,Methanobrevibacter ruminantium M1,Archaea
gi|30285523|gb|CB991003.1|,gi|288541968|gb|CP001719.1|,Methanobrevibacter ruminantium M1,Archaea
gi|30285523|gb|CB991003.1|,gi|288541968|gb|CP001719.1|,Methanobrevibacter ruminantium M1,Archaea
gi|30286509|gb|CB991989.1|,gi|288541968|gb|CP001719.1|,Methanobrevibacter ruminantium M1,Archaea
gi|30288113|gb|CB993593.1|,gi|288541968|gb|CP001719.1|,Methanobrevibacter ruminantium M1,Archaea
gi|30291474|gb|CB997050.1|,gi|288541968|gb|CP001719.1|,Methanobrevibacter ruminantium M1,Archaea
gi|8613918|gb|BE151197.1|,gi|393188278|gb|CP003685.1|,Pyrococcus furiosus COM1,Archaea

This is an improvement on our original publication, which yielded only one weak candidate. We're still seeing hits to Pyrococcus furiosus, likely a result of contamination with Pfu polymerase DNA. The other hits are to methanogen genomes sequenced after 2004, although intriguingly the tissue from which these ESTs were derived is pre-eclamptic placenta; see, for example, GI 30285459. Methanogens in placenta? Or on the fingers of less-than-hygienic researchers? Spurious hits due to low complexity sequence? "More research required".

Conclusion

I'm now in a position to run this analysis at regular intervals - probably monthly - and push the results to Github. Watch this space for any interesting developments.

Making this kind of pipeline available and reproducible by others is less easy. If you have access to a machine with all the prerequisites, clone the repository, replace the symlinks to database directories with real directories, change to the directory code/ruby and run the rake tasks - well, you might be able to do the same thing. But you'd probably rather sequence the human microbiome.


Filed under: bioinformatics, research diary, ruby Tagged: archaea, database, est, github
Author: "nsaunders" Tags: "bioinformatics, research diary, ruby, ar..."
Comments Send by mail Print  Save  Delicious 
Date: Tuesday, 01 Jul 2014 03:27

I’ve long admired the work of the Open Source Malaria Project. Unfortunately time and “day job” constraints prevent me from being as involved as I’d like.

So: I was happy to make a small contribution recently in response to this request for help.

Note – this all works fine under Linux; there seem to be some issues with Open Babel library files under OSX.

First step: make that data usable by rescuing it from the spreadsheet ;) We’ll clean up a column name too.

library(XLConnect)
mmv <- readWorksheetFromFile("TP compounds with solid amounts 14_3_14.xlsx", sheet = "Sheet1")
colnames(mmv)[5] <- "EC50"

head(mmv)

  COMPOUND_ID                                                      Smiles     MW QUANTITY.REMAINING..mg.
1   MMV668822 c1[n+](cc2n(c1OCCc1cc(c(cc1)F)F)c(nn2)c1ccc(cc1)OC(F)F)[O-] 434.35                     0.0
2   MMV668823      c1nc(c2n(c1OCCc1cc(c(cc1)F)F)c(nn2)c1ccc(cc1)OC(F)F)Cl 452.79                     0.0
3   MMV668824                        c1ncc2n(c1CCO)c(nn2)c1ccc(cc1)OC(F)F 306.27                    29.6
4   MMV668955                        C1NCc2n(C1CCO)c(nn2)c1ccc(cc1)OC(F)F 310.30                    18.5
5   MMV668956    C1(CN(C1)c1cc(c(cc1)F)F)Oc1cncc2n1c(nn2)c1ccc(cc1)OC(F)F 445.38                   124.2
6   MMV668957          c1ncc2n(c1N1CCC(C1)c1ccccc1)c(nn2)c1ccc(cc1)OC(F)F 407.42                    68.5
   EC50 New.quantity.remaining
1  4.01                      0
2  0.16                      0
3 10.00                     29
4  8.37                     18
5  0.43                    124
6  2.00                     62

What OSM would like: an output file in Chemical Markup Language, containing the Compound ID and properties (MW and EC50).

The ChemmineR package makes conversion of SMILES strings to other formats pretty straightforward; we start by converting to Structure Data Format (SDF):

library(ChemmineR)
library(ChemmineOB)

mmv.sdf   <- smiles2sdf(mmv$Smiles)

That will throw a warning, since all molecules in the SDF object have the same CID; currently, no CID (empty string). We add the CID using the compound ID, then use datablock() to add properties:

cid(mmv.sdf) <- mmv$COMPOUND_ID
datablock(mmv.sdf) <- data.frame(MW = mmv$MW, EC50 = mmv$EC50)

Now we can write out to an SDF file. We could also use a loop or an apply function to write one file per molecule.

write.SDF(mmv.sdf, "mmv-all.sdf", cid = TRUE)
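For one file per molecule, a simple loop over the set works; a sketch, not from the original post:

# write one SDF file per compound, named by its compound ID
for (i in seq_along(cid(mmv.sdf))) {
    write.SDF(mmv.sdf[i], paste0(cid(mmv.sdf)[i], ".sdf"), cid = TRUE)
}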

It would be nice to stay in the one R script for conversion to CML too but for now, I just run Open Babel from the command line. Note that the -xp flag is required to include the properties in CML:

babel -xp mmv-all.sdf mmv-all.cml

That’s it; here’s my OSMinformatics Github repository, here’s the output.


Filed under: open science, programming, R, statistics Tagged: cheminformatics, conversion, malaria, osm, smiles
Author: "nsaunders" Tags: "open science, programming, R, statistics..."
Comments Send by mail Print  Save  Delicious 
Date: Tuesday, 01 Jul 2014 03:27

I’ve long admired the work of the Open Source Malaria Project. Unfortunately time and “day job” constraints prevent me from being as involved as I’d like.

So: I was happy to make a small contribution recently in response to this request for help:

Note – this all works fine under Linux; there seem to be some issues with Open Babel library files under OSX.

First step: make that data usable by rescuing it from the spreadsheet ;) We’ll clean up a column name too.

library(XLConnect)
mmv <- readWorksheetFromFile("TP compounds with solid amounts 14_3_14.xlsx", sheet = "Sheet1")
colnames(mmv)[5] <- "EC50"

head(mmv)

  COMPOUND_ID                                                      Smiles     MW QUANTITY.REMAINING..mg.
1   MMV668822 c1[n+](cc2n(c1OCCc1cc(c(cc1)F)F)c(nn2)c1ccc(cc1)OC(F)F)[O-] 434.35                     0.0
2   MMV668823      c1nc(c2n(c1OCCc1cc(c(cc1)F)F)c(nn2)c1ccc(cc1)OC(F)F)Cl 452.79                     0.0
3   MMV668824                        c1ncc2n(c1CCO)c(nn2)c1ccc(cc1)OC(F)F 306.27                    29.6
4   MMV668955                        C1NCc2n(C1CCO)c(nn2)c1ccc(cc1)OC(F)F 310.30                    18.5
5   MMV668956    C1(CN(C1)c1cc(c(cc1)F)F)Oc1cncc2n1c(nn2)c1ccc(cc1)OC(F)F 445.38                   124.2
6   MMV668957          c1ncc2n(c1N1CCC(C1)c1ccccc1)c(nn2)c1ccc(cc1)OC(F)F 407.42                    68.5
   EC50 New.quantity.remaining
1  4.01                      0
2  0.16                      0
3 10.00                     29
4  8.37                     18
5  0.43                    124
6  2.00                     62

What OSM would like: an output file in Chemical Markup Language, containing the Compound ID and properties (MW and EC50).

The ChemmineR package makes conversion of SMILES strings to other formats pretty straightforward; we start by converting to Structure Data Format (SDF):

library(ChemmineR)
library(ChemmineOB)

mmv.sdf   <- smiles2sdf(mmv$Smiles)

That will throw a warning, since all molecules in the SDF object have the same CID; currently, no CID (empty string). We add the CID using the compound ID, then use datablock() to add properties:

cid(mmv.sdf) <- mmv$COMPOUND_ID
datablock(mmv.sdf) <- data.frame(MW = mmv$MW, EC50 = mmv$EC50)

Now we can write out to a SDF file. We could also use a loop or an apply function to write individual files per molecule.

write.SDF(mmv.sdf, "mmv-all.sdf", cid = TRUE)

It would be nice to stay in the one R script for conversion to CML too but for now, I just run Open Babel from the command line. Note that the -xp flag is required to include the properties in CML:

babel -xp mmv-all.sdf mmv-all.cml

That's it; here's my OSMinformatics Github repository, here's the output.


Filed under: open science, programming, R, statistics Tagged: cheminformatics, conversion, malaria, osm, smiles
Author: "nsaunders" Tags: "open science, programming, R, statistics..."
Comments Send by mail Print  Save  Delicious 
Date: Monday, 23 Jun 2014 04:24

Over the years, I’ve written a lot of small “utility scripts”. You know the kind of thing. Little code snippets that facilitate research, rather than generate research results. For example: just what are the fields that you can use to qualify Entrez database searches?

Typically, they end up languishing in long-forgotten Dropbox directories. Sometimes, the output gets shared as a public link. No longer! As of today, “little code snippets that do (hopefully) useful things” have a new home at Github.

Also as of today: there’s not much there right now, just the aforementioned Entrez database code and output. I’m not out to change the world here, just to do a little better.


Filed under: bioinformatics, research diary, software Tagged: bioinformatics, code, github, repository
Author: "nsaunders" Tags: "bioinformatics, research diary, software..."
Comments Send by mail Print  Save  Delicious 
Date: Monday, 23 Jun 2014 04:24

Over the years, I’ve written a lot of small “utility scripts”. You know the kind of thing. Little code snippets that facilitate research, rather than generate research results. For example: just what are the fields that you can use to qualify Entrez database searches?

Typically, they end up languishing in long-forgotten Dropbox directories. Sometimes, the output gets shared as a public link. No longer! As of today, “little code snippets that do (hopefully) useful things” have a new home at Github.

Also as of today: there’s not much there right now, just the aforementioned Entrez database code and output. I’m not out to change the world here, just to do a little better.


Filed under: bioinformatics, research diary, software Tagged: bioinformatics, code, github, repository
Author: "nsaunders" Tags: "bioinformatics, research diary, software..."
Comments Send by mail Print  Save  Delicious 
Date: Wednesday, 14 May 2014 01:59

There’s a lot of discussion around why code written by self-taught “scientist programmers” rarely follows what a trained computer scientist would consider “best practice”. Here’s a recent post on the topic.

One answer: we begin with exploratory data analysis and never get around to cleaning it up.

An example. For some reason, a researcher (let’s call him “Bob”) becomes interested in a particular dataset in the GEO database. So Bob opens the R console and uses the GEOquery package to grab the data:

Update: those of you commenting “should have used Python instead” have completely missed the point. Your comments are off-topic and will not be published. Doubly-so when you get snarky about it.

library(GEOquery)
gse <- getGEO("GSE22255")

Bob is interested in the covariates and metadata associated with the experiment, which he can access using pData().

pd <- pData(gse$GSE22255_series_matrix.txt.gz)
names(pd)
#  [1] "title"                   "geo_accession"          
#  [3] "status"                  "submission_date"        
#  [5] "last_update_date"        "type"                   
#  [7] "channel_count"           "source_name_ch1"        
#  [9] "organism_ch1"            "characteristics_ch1"    
# [11] "characteristics_ch1.1"   "characteristics_ch1.2"  
# [13] "characteristics_ch1.3"   "characteristics_ch1.4"  
# [15] "characteristics_ch1.5"   "characteristics_ch1.6"  
# [17] "treatment_protocol_ch1"  "molecule_ch1"           
# [19] "extract_protocol_ch1"    "label_ch1"              
# [21] "label_protocol_ch1"      "taxid_ch1"              
# [23] "hyb_protocol"            "scan_protocol"          
# [25] "description"             "data_processing"        
# [27] "platform_id"             "contact_name"           
# [29] "contact_email"           "contact_phone"          
# [31] "contact_fax"             "contact_laboratory"     
# [33] "contact_department"      "contact_institute"      
# [35] "contact_address"         "contact_city"           
# [37] "contact_state"           "contact_zip/postal_code"
# [39] "contact_country"         "supplementary_file"     
# [41] "data_row_count"  

Bob discovers that pd$characteristics_ch1.2 is "age at examination", but it's stored as a factor. He'd like to use it as a numeric variable. So he sets about figuring out how to do the conversion.

# first you need factor to character
as.character(pd$characteristics_ch1.2[1:5])
# [1] "age-at-examination: 68" "age-at-examination: 68" "age-at-examination: 73"
# [4] "age-at-examination: 65" "age-at-examination: 73"

# now you want to get rid of everything but the age value
# so you wrap it in a gsub()
gsub("age-at-examination: ", "", as.character(pd$characteristics_ch1.2[1:5]))
# [1] "68" "68" "73" "65" "73"

# and the last step is character to numeric
as.numeric(gsub("age-at-examination: ", "", as.character(pd$characteristics_ch1.2[1:5])))
# [1] 68 68 73 65 73

Three levels of nested methods. Ugly. However, it works, so Bob moves to the next task (using those ages to do some science).

He thinks: "at some point, I should rewrite that as a function to convert ages."

But he never does. We stop when it works.
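For the record, the function that Bob never gets around to writing might look something like this; a sketch, not part of the original analysis, and the function name is mine:

# convert a factor of "label: value" characteristics to numeric values
characteristic_to_numeric <- function(x, label) {
    as.numeric(gsub(label, "", as.character(x)))
}

characteristic_to_numeric(pd$characteristics_ch1.2, "age-at-examination: ")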


Filed under: programming, R, research diary, statistics
Author: "nsaunders" Tags: "programming, R, research diary, statisti..."
Comments Send by mail Print  Save  Delicious 
Date: Wednesday, 14 May 2014 01:59

There’s a lot of discussion around why code written by self-taught “scientist programmers” rarely follows what a trained computer scientist would consider “best practice”. Here’s a recent post on the topic.

One answer: we begin with exploratory data analysis and never get around to cleaning it up.

An example. For some reason, a researcher (let’s call him “Bob”) becomes interested in a particular dataset in the GEO database. So Bob opens the R console and use the GEOquery package to grab the data:

Update: those of you commenting “should have used Python instead” have completely missed the point. Your comments are off-topic and will not be published. Doubly-so when you get snarky about it.

library(GEOquery)
gse <- getGEO("GSE22255")

Bob is interested in the covariates and metadata associated with the experiment, which he can access using pData().

pd <- pData(gse$GSE22255_series_matrix.txt.gz)
names(pd)
#  [1] "title"                   "geo_accession"          
#  [3] "status"                  "submission_date"        
#  [5] "last_update_date"        "type"                   
#  [7] "channel_count"           "source_name_ch1"        
#  [9] "organism_ch1"            "characteristics_ch1"    
# [11] "characteristics_ch1.1"   "characteristics_ch1.2"  
# [13] "characteristics_ch1.3"   "characteristics_ch1.4"  
# [15] "characteristics_ch1.5"   "characteristics_ch1.6"  
# [17] "treatment_protocol_ch1"  "molecule_ch1"           
# [19] "extract_protocol_ch1"    "label_ch1"              
# [21] "label_protocol_ch1"      "taxid_ch1"              
# [23] "hyb_protocol"            "scan_protocol"          
# [25] "description"             "data_processing"        
# [27] "platform_id"             "contact_name"           
# [29] "contact_email"           "contact_phone"          
# [31] "contact_fax"             "contact_laboratory"     
# [33] "contact_department"      "contact_institute"      
# [35] "contact_address"         "contact_city"           
# [37] "contact_state"           "contact_zip/postal_code"
# [39] "contact_country"         "supplementary_file"     
# [41] "data_row_count"  

Bob discovers that pd$characteristics_ch1.2 is “age at examination”, but it’s stored as a factor. He’d like to use it as a numeric variable. So he sets about figuring out how to do the conversion.

# first you need factor to character
as.character(pd$characteristics_ch1.2[1:5])
# [1] "age-at-examination: 68" "age-at-examination: 68" "age-at-examination: 73"
# [4] "age-at-examination: 65" "age-at-examination: 73"

# now you want to get rid of everything but the age value
# so you wrap it in a gsub()
gsub("age-at-examination: ", "", as.character(pd$characteristics_ch1.2[1:5]))
# [1] "68" "68" "73" "65" "73"

# and the last step is character to numeric
as.numeric(gsub("age-at-examination: ", "", as.character(pd$characteristics_ch1.2[1:5])))
# [1] 68 68 73 65 73

Three levels of nested methods. Ugly. However, it works, so Bob moves to the next task (using those ages to do some science).

He thinks: “at some point, I should rewrite that as a function to convert ages.”

But he never does. We stop when it works.


Filed under: programming, R, research diary, statistics
Author: "nsaunders" Tags: "programming, R, research diary, statisti..."
Comments Send by mail Print  Save  Delicious 
Date: Tuesday, 29 Apr 2014 21:29

Just a brief technical note.

I figured that for a given compound in PubChem, it would be interesting to know whether that compound had been used in a high-throughput experiment, which you might find in GEO. Very easy using the E-utilities, as implemented in the R package rentrez:

library(rentrez)
links <- entrez_link(dbfrom = "pccompound", db = "gds", id = "62857")
length(links$pccompound_gds)
# [1] 741

Browsing the rentrez documentation, I note that db can take the value "all". Sounds useful!

links <- entrez_link(dbfrom = "pccompound", db = "all", id = "62857")
length(links$pccompound_gds)
# [1] 0

That's odd. In fact, this query does not even link pccompound to gds:

length(names(links))
# [1] 39
which(names(links) == "pccompound_gds")
# integer(0)

It's not a rentrez issue, since the same result occurs using the E-utilities URL.

The good people at ropensci have opened an issue, contacting NCBI for clarification. We'll keep you posted.


Filed under: bioinformatics, programming, research diary Tagged: api, elink, entrez, ncbi, rentrez
Author: "nsaunders" Tags: "bioinformatics, programming, research di..."
Comments Send by mail Print  Save  Delicious 
Date: Tuesday, 29 Apr 2014 21:29

Just a brief technical note.

I figured that for a given compound in PubChem, it would be interesting to know whether that compound had been used in a high-throughput experiment, which you might find in GEO. Very easy using the E-utilities, as implemented in the R package rentrez:

library(rentrez)
links <- entrez_link(dbfrom = "pccompound", db = "gds", id = "62857")
length(links$pccompound_gds)
# [1] 741

Browsing the rentrez documentation, I note that db can take the value “all”. Sounds useful!

links <- entrez_link(dbfrom = "pccompound", db = "all", id = "62857")
length(links$pccompound_gds)
# [1] 0

That’s odd. In fact, this query does not even link pccompound to gds:

length(names(links))
# [1] 39
which(names(links) == "pccompound_gds")
# integer(0)

It’s not a rentrez issue, since the same result occurs using the E-utilities URL.

The good people at ropensci have opened an issue, contacting NCBI for clarification. We’ll keep you posted.


Filed under: bioinformatics, programming, research diary Tagged: api, elink, entrez, ncbi, rentrez
Author: "nsaunders" Tags: "bioinformatics, programming, research di..."
Comments Send by mail Print  Save  Delicious 
Date: Thursday, 20 Mar 2014 03:00

Next week I’ll be in Melbourne for one of my favourite meetings, the annual Computational and Simulation Sciences and eResearch Conference.

The main reason for my visit is the Bioinformatics FOAM workshop. Day 1 (March 27) is not advertised since it is an internal CSIRO day, but I’ll be presenting a talk titled “SQL, noSQL or no database at all? Are databases still a core skill?”. Day 2 (March 28) is open to all and I’ll be talking about “Learning from complete strangers: social networking for bioinformaticians”.

I imagine these and other talks will appear on Slideshare soon, at both my account and that of the Australian Bioinformatics Network.

I’m also excited to see that Victoria Stodden is presenting a keynote at the main CSS meeting (PDF) on “Reproducibility in Computational Science: Opportunities and Challenges”.

Hope to see some of you there.


Filed under: bioinformatics, computing, meetings, travel Tagged: conference, foam, melbourne
Author: "nsaunders" Tags: "bioinformatics, computing, meetings, tra..."
Comments Send by mail Print  Save  Delicious 
Date: Thursday, 20 Mar 2014 03:00

Next week I’ll be in Melbourne for one of my favourite meetings, the annual Computational and Simulation Sciences and eResearch Conference.

The main reason for my visit is the Bioinformatics FOAM workshop. Day 1 (March 27) is not advertised since it is an internal CSIRO day, but I’ll be presenting a talk titled “SQL, noSQL or no database at all? Are databases still a core skill?“. Day 2 (March 28) is open to all and I’ll be talking about “Learning from complete strangers: social networking for bioinformaticians“.

I imagine these and other talks will appear on Slideshare soon, at both my account and that of the Australian Bioinformatics Network.

I’m also excited to see that Victoria Stodden is presenting a keynote at the main CSS meeting (PDF) on “Reproducibility in Computational Science: Opportunities and Challenges”.

Hope to see some of you there.


Filed under: bioinformatics, computing, meetings, travel Tagged: conference, foam, melbourne
Author: "nsaunders" Tags: "bioinformatics, computing, meetings, tra..."
Comments Send by mail Print  Save  Delicious 
Date: Sunday, 09 Mar 2014 22:58

[Screenshot: A DOI, this morning]


When I arrive at work, the first task for the day is “check feeds”. If I’m lucky, in the “journal TOCs” category, there will be an abstract that looks interesting, like the one in the screenshot above.

Sometimes, the title is a direct link to the article at the journal website. Often though, the link is a Digital Object Identifier or DOI. Frequently, when the article is labelled as “advance access” or “early”, clicking on the DOI link leads to a page like the one shown below.

[Screenshot: DOI #fail]

In the grand scheme of things I suppose this rates as “minor annoyance”; it means that I have to visit the journal website and search for the article in question. The question is: why does this happen? I’m not familiar with the practical details of setting up a DOI, but I assume that the journal submits article URLs to the DOI system for processing. So who do I blame – journals, for making URLs public before the DOI is ready, or the DOI system, for not processing new URLs quickly enough?

There’s also the issue of whether terms like “advance access” have any meaning in the era of instant, online publishing but that’s for another day.


Filed under: uncategorized Tagged: doi, journals, publishing
Author: "nsaunders" Tags: "uncategorized, doi, journals, publishing"
Comments Send by mail Print  Save  Delicious 
Date: Sunday, 09 Mar 2014 22:58

A DOI, this morning

A DOI, this morning


When I arrive at work, the first task for the day is “check feeds”. If I’m lucky, in the “journal TOCs” category, there will be an abstract that looks interesting, like this one on the left (click for larger version).

Sometimes, the title is a direct link to the article at the journal website. Often though, the link is a Digital Object Identifier or DOI. Frequently, when the article is labelled as “advance access” or “early”, clicking on the DOI link leads to a page like the one below on the right.

DOI #fail

DOI #fail

In the grand scheme of things I suppose this rates as “minor annoyance”; it means that I have to visit the journal website and search for the article in question. The question is: why does this happen? I’m not familiar with the practical details of setting up a DOI, but I assume that the journal submits article URLs to the DOI system for processing. So who do I blame – journals, for making URLs public before the DOI is ready, or the DOI system, for not processing new URLs quickly enough?

There’s also the issue of whether terms like “advance access” have any meaning in the era of instant, online publishing but that’s for another day.


Filed under: uncategorized Tagged: doi, journals, publishing
Author: "nsaunders" Tags: "uncategorized, doi, journals, publishing"
Comments Send by mail Print  Save  Delicious 
Date: Thursday, 27 Feb 2014 23:42

One of my more popular posts is A brief introduction to “apply” in R. Come August, it will be four years old. Technology moves on, old blog posts do not.

So: thanks to BioStar user zx8754 for pointing me to this Stack Overflow post, in which someone complains that the code in the post does not work as described. The by() example is now fixed.

Side note: I often find “contact the author” is the most direct approach to solving this kind of problem ;) always happy to be contacted.


Filed under: R, statistics, this blog Tagged: biostar, stack overflow, stackexchange
Author: "nsaunders" Tags: "R, statistics, this blog, biostar, stack..."
Comments Send by mail Print  Save  Delicious 
Date: Thursday, 27 Feb 2014 23:42

One of my more popular posts is A brief introduction to “apply” in R. Come August, it will be four years old. Technology moves on, old blog posts do not.

So: thanks to BioStar user zx8754 for pointing me to this Stack Overflow post, in which someone complains that the code in the post does not work as described. The by example is now fixed.

Side note: I often find “contact the author” is the most direct approach to solving this kind of problem ;) always happy to be contacted.


Filed under: R, statistics, this blog Tagged: biostar, stack overflow, stackexchange
Author: "nsaunders" Tags: "R, statistics, this blog, biostar, stack..."
Comments Send by mail Print  Save  Delicious 
Date: Thursday, 06 Feb 2014 02:28

I’m pleased to announce an open-access publication with my name on it:

Mitchell, S.M., Ross, J.P., Drew, H.R., Ho, T., Brown, G.S., Saunders, N.F.W., Duesing, K.R., Buckley, M.J., Dunne, R., Beetson, I., Rand, K.N., McEvoy, A., Thomas, M.L., Baker, R.T., Wattchow, D.A., Young, G.P., Lockett, T.J., Pedersen, S.K., LaPointe L.C. and Molloy, P.L. (2014). A panel of genes methylated with high frequency in colorectal cancer. BMC Cancer 14:54.

If you’re thinking “but Neil never blogs about what he actually does at work anymore”, jump to this footnote [1].

[Figure: Expression of MAMDC2 in cell lines HT29 and SW480 with (orange) or without (blue) 5-aza-2′-deoxycytidine treatment]

The title pretty much sums it up. Take a bunch of tissues and cell lines, subject them to a bunch of methods to detect methylation and select some promising methylation biomarker candidates based on consensus between methods.

My main contribution to the paper was analysis of the data from one of the experiments described in the abstract: “activation of gene expression in CRC cell lines following DNA demethylating treatment.” When the final version of the supplementary material appears, you should see plots similar to the one above, showing increased expression of exon microarray probesets in some colorectal cancer cell lines when treated with a demethylating agent.

Biomarker discovery is our main goal, so to quote the conclusion verbatim:

This study has characterised a panel of 23 genes that show elevated DNA methylation in >50% of CRC tissue relative to non-neoplastic tissue. Six of these genes (SOX21, SLC6A15, NPY, GRASP, ST8SIA1 and ZSCAN18) show very low methylation in non-neoplastic colorectal tissue and are candidate biomarkers for stool-based assays, while 11 genes (BCAT1, COL4A2, DLX5, FGF5, FOXF1, FOXI2, GRASP, IKZF1, IRF4, SDC2 and SOX21) have very low methylation in peripheral blood DNA and are suitable for further evaluation as blood-based diagnostic markers.


[1] Despite the tagline for this blog, very few of my posts feature the details of my own day-to-day research. The reasons for this are many and sometimes complicated, but are often because (1) I work with confidential data that I don’t own and (2) I work for an organisation which has strict guidelines about discussing internal information on social media.

I get around this by writing about bioinformatics in general, rather than what I do in particular. I think this is more interesting anyway, so it suits me just fine.


Filed under: bioinformatics, publications, research diary Tagged: biomarker, colorectal cancer, csiro, methylation
Author: "nsaunders" Tags: "bioinformatics, publications, research d..."
Comments Send by mail Print  Save  Delicious 
Date: Thursday, 06 Feb 2014 02:28

I’m pleased to announce an open-access publication with my name on it:

Mitchell, S.M., Ross, J.P., Drew, H.R., Ho, T., Brown, G.S., Saunders, N.F.W., Duesing, K.R., Buckley, M.J., Dunne, R., Beetson, I., Rand, K.N., McEvoy, A., Thomas, M.L., Baker, R.T., Wattchow, D.A., Young, G.P., Lockett, T.J., Pedersen, S.K., LaPointe L.C. and Molloy, P.L. (2014). A panel of genes methylated with high frequency in colorectal cancer. BMC Cancer 14:54.

If you’re thinking “but Neil never blogs about what he actually does at work anymore”, jump to this footnote [1].

MAMDC2

Expression of MAMDC2 in cell lines HT29 and SW480 with (orange) or without (blue) 5-aza 2′deoxycytidine treatment

The title pretty much sums it up. Take a bunch of tissues and cell lines, subject them to a bunch of methods to detect methylation and select some promising methylation biomarker candidates based on consensus between methods.

My main contribution to the paper was analysis of the data from one of the experiments described in the abstract: “activation of gene expression in CRC cell lines following DNA demethylating treatment.” When the final version of the supplementary material appears, you should see plots similar to the one at right, showing increased expression of exon microarray probesets in some colorectal cancer cell lines when treated with a demethylating agent.

Biomarker discovery is our main goal, so to quote the conclusion verbatim:

This study has characterised a panel of 23 genes that show elevated DNA methylation in
>50% of CRC tissue relative to non-neoplastic tissue. Six of these genes (SOX21, SLC6A15,
NPY, GRASP, ST8SIA1 and ZSCAN18) show very low methylation in non-neoplastic
colorectal tissue and are candidate biomarkers for stool-based assays, while 11 genes
(BCAT1, COL4A2, DLX5, FGF5, FOXF1, FOXI2, GRASP, IKZF1, IRF4, SDC2 and
SOX21) have very low methylation in peripheral blood DNA and are suitable for further
evaluation as blood-based diagnostic markers.


[1] Despite the tagline for this blog, very few of my posts feature the details of my own day-to-day research. The reasons for this are many and sometimes complicated, but are often because (1) I work with confidential data that I don’t own and (2) I work for an organisation which has strict guidelines about discussing internal information on social media.

I get around this by writing about bioinformatics in general, rather than what I do in particular. I think this is more interesting anyway, so it suits me just fine.


Filed under: bioinformatics, publications, research diary Tagged: biomarker, colorectal cancer, csiro, methylation
Author: "nsaunders" Tags: "bioinformatics, publications, research d..."
Comments Send by mail Print  Save  Delicious 
Date: Wednesday, 05 Feb 2014 22:22

So, I read the title:

Mining locus tags in PubMed Central to improve microbial gene annotation

and skimmed the abstract:

The scientific literature contains millions of microbial gene identifiers within the full text and tables, but these annotations rarely get incorporated into public sequence databases.

and thought, well OK, but wouldn’t it be better to incorporate annotations in the first place – when submitting to the public databases – rather than by this indirect method?

The point, of course, is to incorporate new findings from the literature into existing records, rather than to use the tool as a primary method of annotation. I do believe that public databases could do more to enforce data quality standards at deposition time, but that’s an entirely separate issue.

Big thanks to Michael Hoffman for a spirited Twitter discussion that put me straight.


Filed under: bioinformatics, publications Tagged: discussion, mea culpa, twitter
Author: "nsaunders" Tags: "bioinformatics, publications, discussion..."
Comments Send by mail Print  Save  Delicious 
Date: Wednesday, 05 Feb 2014 22:22

So, I read the title:

Mining locus tags in PubMed Central to improve microbial gene annotation

and skimmed the abstract:

The scientific literature contains millions of microbial gene identifiers within the full text and tables, but these annotations rarely get incorporated into public sequence databases.

and thought, well OK, but wouldn’t it be better to incorporate annotations in the first place – when submitting to the public databases – rather than by this indirect method?

The point, of course, is to incorporate new findings from the literature into existing records, rather than to use the tool as a primary method of annotation. I do believe that public databases could do more to enforce data quality standards at deposition time, but that’s an entirely separate issue.

Big thanks to Michael Hoffman for a spirited Twitter discussion that put me straight.


Filed under: bioinformatics, publications Tagged: discussion, mea culpa, twitter
Author: "nsaunders" Tags: "bioinformatics, publications, discussion..."
Comments Send by mail Print  Save  Delicious 
Next page
» You can also retrieve older items : Read
» © All content and copyrights belong to their respective authors.«
» © FeedShow - Online RSS Feeds Reader