• Shortcuts : 'n' next unread feed - 'p' previous unread feed • Styles : 1 2

» Publishers, Monetize your RSS feeds with FeedShow:  More infos  (Show/Hide Ads)


Date: Monday, 17 Jun 2013 23:35

File under “interesting articles that I don’t have time to write about at length.”

  • Archaea and Fungi of the Human Gut Microbiome: Correlations with Diet and Bacterial Residents
  • Long ago, before metagenomics and NGS, I did a little work on detection of Archaea in human microbiomes. There’s a blog post in the pipeline about that but until then, enjoy this article in PLoS ONE.

  • Mutational heterogeneity in cancer and the search for new cancer-associated genes
  • This article is getting a lot of attention on Twitter this week. Brief summary: cancer cells are really messed up in all sorts of ways, most of which are not causal with respect to the cancer. Anyone who has ever looked at microarray data knows that it’s not uncommon for 50% or more of genes to show differential expression in a cancer/normal comparison, so this is hardly a new concept. I think we need to move away from ever-more detailed characterizations of the ways in which cancer cells are “messed up.” We know that they are and that doesn’t provide much insight, in my opinion.

  • The vast majority of statistical analysis is not performed by statisticians
  • Interesting post by Jeff Leek, summarized very well by its title. It points out that many more people are now interested in data analysis, many of them are not trained professionally as statisticians (I’m in this category myself) and we need to recognize and plan for that.

Bonus post doing the rounds of social media: Using Metadata to Find Paul Revere. Social network analysis, 18th-century style. Amusing, informative and topical.


Filed under: publications, statistics Tagged: archaea, cancer, microbiology, statistics
Author: "nsaunders" Tags: "publications, statistics, archaea, cance..."
Comments Send by mail Print  Save  Delicious 
Date: Tuesday, 04 Jun 2013 04:28

I subscribe to the Ensembl blog and found, in my feed reader this morning, a post which linked to the Variant Effect Predictor (VEP). The original blog post, strangely, has disappeared.

Not to worry: so, the VEP takes genotyping data in one of several formats, compares it with the Ensembl variation + core databases and returns a summary of how the variants affect transcripts and regulatory regions. My first thought – can I apply this to my own 23andme data?

1. Convert 23andme data to VCF
If you download your raw data from 23andme, it looks something like this (ignoring comment lines):

rs4477212	1	82154	AA
rs3094315	1	752566	AA
rs3131972	1	752721	GG
rs12562034	1	768448	GG
rs12124819	1	776546	AA

The rest of this post assumes that your raw data are saved in a file named mySNPs.txt; note that the original file name of your download will be something like genome_YOUR_NAME_Full_TIMESTAMP.txt.zip.

VEP will accept several file formats, including VCF (variant call format). At Github, I discovered a tool named 23andme2vcf to perform the conversion. My first attempt failed:

perl 23andme2vcf.pl mySNPs.txt mySNPs.vcf
# raw data file and reference file are out of sync at ./23andme2vcf.pl line 154,  line 4.

Digging around the Github issues page, I noted that this error has occurred before; the issue was closed when the reference file supplied with 23andme2vcf was updated.

A quick line count reveals the problem: more SNPs in my 23andme data (V3 platform) than the reference file.

gunzip -c 23andme_hg19ref_20121017.txt.gz | wc -l
# 960613
grep -cv "^#" mySNPs.txt
# 991786

Easiest fix: read both into R, discard those SNPs not found in the reference file and write back out to a new tab-delimited file. To avoid discarding SNPs, you’ll have to figure out how to create the reference file; perhaps the issue will be fixed in a future update.

ref <- read.table("23andme_hg19ref_20121017.txt.gz", header = F, stringsAsFactors = F)
snp <- read.table("mySNPs.txt", header = F, stringsAsFactors = F)
snp <- subset(snp, V1 %in% ref$V3)
write.table(snp, "mynewSNPs.txt", quote = F, col.names = F, row.names = F, sep = "\t")

Now the conversion works as expected:

perl 23andme2vcf.pl mynewSNPs.txt mynewSNPs.vcf

Note that the script does not yet handle insertions or deletions. However, these are not well-defined in the 23andme data file in any case.

2. Install VEP locally
This is all as described in the VEP script documentation. Click the “download latest version” link, navigate to wherever you downloaded it and then:

tar zxvf variant_effect_predictor.tar.gz
cd variant_effect_predictor
perl INSTALL.pl

Follow the prompts. You will want to install the core and VEP files for Homo sapiens, the latest release being 71. Note that VEP files are a large download, currently ~ 2.5 GB, which will take some time.

3. Run VEP

From the variant_effect_predictor directory, as simple as:

perl variant_effect_predictor.pl --cache -i ~/path/to/mynewSNPs.vcf -o ~/path/to/myVEP.txt
Variant consequences from VEP

Variant consequences from VEP

In addition to the delimited text output (mine is around 163 MB in size), the script generates a rather nice HTML summary with some statistics. It’s also easy to read the output from VEP back into R. See plot, right, for a summary of the VEP Consequence column.

library(ggplot2)
vep <- read.table("vep.txt", header = F, stringsAsFactors = F, sep = "\t")
colnames(vep) <- c("Uploaded_variation", "Location", "Allele",  "Gene", "Feature", "Feature_type",
                   "Consequence", "cDNA_position", "CDS_position", "Protein_position", "Amino_acids",
                   "Codons", "Existing_variation", "Extra")
cons <- as.data.frame(table(vep$Consequence))
cons <- cons[sort.list(cons$Freq, decreasing = T),]
cons$Var1 <- factor(cons$Var1, levels = cons$Var1)
ggplot(cons) + geom_bar(aes(Var1, log10(Freq)), stat = "identity") + coord_flip() + theme_bw()

It’s a lot of output to trawl through and summarize; I’ll post on the topic again if I discover anything interesting.


Filed under: bioinformatics, genomics, personal, R, statistics Tagged: 23andme, ensembl, prediction, variant
Author: "nsaunders" Tags: "bioinformatics, genomics, personal, R, s..."
Comments Send by mail Print  Save  Delicious 
Date: Tuesday, 28 May 2013 05:12

While we’re on the topic of mistaking Archaea for Bacteria, here’s an issue with the NCBI FTP site that has long annoyed me and one workaround. Warning: I threw this together minutes ago and it’s not fully tested.

Let’s cut to the chase. In the NCBI FTP site, archaeal genome data is stored along with bacterial genomes in a single directory named Bacteria.

Aside from the fact that this is taxonomically incorrect, it makes bulk retrieval of archaeal data rather difficult. For example, I know that Methanococcoides burtonii is an archaeon and if I want to download its protein-coding genes (files ending with .ffn), I can do:

wget ftp://ftp.ncbi.nlm.nih.gov/genomes/Bacteria/Methanococcoides_burtonii_DSM_6242_uid58023/*.ffn

What if I want all available ffn files for Archaea? Somehow, I need to know which of the organism names in the Bacteria directory correspond to Archaea. So here is one approach.

1. Download the summary file prokaryotes.txt

Leaving aside issues with the term prokaryote, the NCBI FTP site contains a useful tab-delimited file which summarises current bacterial and archaeal genomes.

wget ftp://ftp.ncbi.nlm.nih.gov/genomes/GENOME_REPORTS/prokaryotes.txt

2. Extract the Archaea

We could use shell tools (awk, grep, cut and so on) but I’m going to go with R. First, read in the file and examine the contents of the Group column:

p <- read.table("prokaryotes.txt", header = T, sep = "\t", quote = "", comment.char = "")
table(p$Group)

#                    Actinobacteria                         Aquificae 
#                              2040                                18 
#                   Armatimonadetes      Bacteroidetes/Chlorobi group 
#                                 1                               618 
#                       Caldiserica  Chlamydiae/Verrucomicrobia group 
#                                 1                               232 
#                       Chloroflexi                    Chrysiogenetes 
#                                29                                 2 
#                     Crenarchaeota                     Cyanobacteria 
#                                80                               226 
#                   Deferribacteres               Deinococcus-Thermus 
#                                 6                                54 
#                       Dictyoglomi                     Elusimicrobia 
#                                 2                                 2 
#             environmental samples                     Euryarchaeota 
#                                 1                               283 
# Fibrobacteres/Acidobacteria group                        Firmicutes 
#                                18                              5163 
#                      Fusobacteria                  Gemmatimonadetes 
#                                70                                 1 
#                      Korarchaeota                     Nanoarchaeota 
#                                 1                                 1 
#                       Nitrospinae                       Nitrospirae 
#                                 5                                10 
#                    Planctomycetes                    Proteobacteria 
#                                28                             10793 
#                      Spirochaetes                     Synergistetes 
#                               461                                18 
#                       Tenericutes                    Thaumarchaeota 
#                               191                                17 
#             Thermodesulfobacteria                       Thermotogae 
#                                 6                                42 
#              unclassified Archaea             unclassified Bacteria 
#                                 1                                19

Only the Archaea contain the string “archae”, so we can extract them using grep():

a <- p[grep("archae", p$Group),]

3. Construct FTP URLs

Now: each organism in the FTP site has its own directory of the form:

ftp.ncbi.nlm.nih.gov/genomes/Bacteria/ORGN_uidNNNN

where ORGN and NNNN correspond to the columns #Organism Name and BioProject ID, respectively, in the file prokaryotes.txt. This was not always the case and may not be so in the future but right now, it is. So in R, we can create a new column for the URL. We could stay in R, but let’s write out the FTP URLs to a file. Using write.table() for a single column is not really appropriate, but it works.

a$url <- paste("ftp://ftp.ncbi.nlm.nih.gov/genomes/Bacteria/", 
               gsub(" ", "_", a$X.Organism.Name), 
               "_uid", a$BioProject.ID, sep = "")
write.table(a$url, file = "archaea_urls.txt", col.names = F, row.names = F, quote = F)

That gives us (first 5 lines):

ftp://ftp.ncbi.nlm.nih.gov/genomes/Bacteria/Methanococcus_maripaludis_C5_uid15999
ftp://ftp.ncbi.nlm.nih.gov/genomes/Bacteria/Methanococcus_maripaludis_S2_uid58035
ftp://ftp.ncbi.nlm.nih.gov/genomes/Bacteria/Methanococcus_maripaludis_C5_uid58741
ftp://ftp.ncbi.nlm.nih.gov/genomes/Bacteria/Methanococcus_maripaludis_C7_uid58847
ftp://ftp.ncbi.nlm.nih.gov/genomes/Bacteria/Methanococcus_maripaludis_C6_uid58947
...

4. Download

Now, to get those ffn files it’s as simple as sending each line in the file of URLs to wget:

cat archaea_urls.txt | while read LINE; do wget "$LINE/*.ffn"; done

Some of those URLs will not exist. That’s OK, wget will just report “No such directory” and the shell script will continue on its merry way. I get 172 files, which seems “about right”.

A lot of day-to-day bioinformatics involves these types of “how do I even get the data” tasks. Life would be much easier and research quicker if the NCBI would just put the Archaea in their own directory.


Filed under: bioinformatics, genomics Tagged: database, ftp, ncbi, sequence
Author: "nsaunders" Tags: "bioinformatics, genomics, database, ftp,..."
Comments Send by mail Print  Save  Delicious 
Date: Monday, 27 May 2013 21:54

My journey from bench scientist to bioinformatician began with archaeal genomes. So I was somewhat startled to read The catalytic mechanism for aerobic formation of methane by bacteria, in which we learn about the “ocean-dwelling bacterium Nitrosopumilus maritimus“.

So was Jonathan Eisen of course and you should go and read why. Every top hit in a Web search for that organism tells us that Nitrosopumilus maritimus is an archaeon.

Looking forward to a rapid correction and apology from Nature.

Title edited from “phylogeny” to “taxonomy” at the insistence of @BioinfoTools ;)


Filed under: publications Tagged: archaea, biochemistry, enzymology, errors, nature, phylogeny, taxonomy
Author: "nsaunders" Tags: "publications, archaea, biochemistry, enz..."
Comments Send by mail Print  Save  Delicious 
Date: Monday, 22 Apr 2013 00:06

Recently, someone asked me if I could generate a list of genes associated with a particular pathway. Sure, I said and hacked together some rather nasty code in R which, given a KEGG pathway identifier, used a combination of the KEGG REST API, DBGET and biomaRt to return HGNC symbols.

Coincidentally, someone asked the same question at Biostar. Pierre recommended the TogoWS REST service, which provides an API to multiple biological data sources. An article describing TogoWS was published in 2010.

An excellent suggestion – and one which, I later discovered, I had bookmarked. Twice. As long ago as 2008. This “rediscovery of things I once knew” happens to me with increasing frequency now, which makes me wonder whether (1) we really are drowning in information, (2) my online curation tools/methods require improvement or (3) my mind is not what it was. Perhaps some combination of all three.

Anyway – using Ruby (1.8.7), a list of HGNC symbols given a KEGG pathway, e.g. MAPK signaling, is as simple as:

require 'rubygems'
require 'open-uri'
require 'json/pure'

j = JSON.parse(open("http://togows.dbcls.jp/entry/pathway/hsa04010/genes.json").read)
g = j.first.values.map {|v| /^(.*?);/.match(v)[1] }
# first 5 genes
g[0..4]
# ["MAP3K14", "FGF17", "FGF6", "DUSP9", "MAP3K6"]

This code parses the JSON returned from TogoWS into an array with one element; the element is a hash with key/value pairs of the form:

"9020"=>"MAP3K14; mitogen-activated protein kinase kinase kinase 14 [KO:K04466] [EC:2.7.11.25]"

Values for all keys that I’ve seen to date begin with the HGNC symbol followed by a semicolon, making extraction quite straightforward with a simple regular expression.


Filed under: bioinformatics, programming, ruby Tagged: biostar, how to, kegg, pathways, rest
Author: "nsaunders" Tags: "bioinformatics, programming, ruby, biost..."
Comments Send by mail Print  Save  Delicious 
Date: Wednesday, 03 Apr 2013 22:07

Today marks the release of R 3.0.0. There will be plenty of commentary and useful information at sites such as R-bloggers (for example, Tal’s post).

Version 3.0.0 is great news for bioinformaticians, due to the introduction of long vectors. What does that mean? Well, several months ago, I was using the simpleaffy package from Bioconductor to normalize Affymetrix exon microarrays. I began as usual by reading the CEL files:

f <- list.files(path = "data/affyexon", pattern = ".CEL.gz", full.names = T, recursive = T)
cel <- ReadAffy(filenames = f)

When this happened:

Error in read.affybatch(filenames = l$filenames, phenoData = l$phenoData,  : 
  allocMatrix: too many elements specified

I had a relatively-large number of samples (337), but figured a 64-bit machine with ~ 100 GB RAM should be able to cope. I was wrong: due to a hard-coded limit to vector length in R, my matrix had become too large regardless of available memory. See this post and this StackOverflow question for the computational details.

My solution at the time was to resort to Affymetrix Power Tools. Hopefully, the introduction of the LONG vector will make Bioconductor even more capable and useful.


Filed under: bioinformatics, programming, R, statistics Tagged: 3.0.0, affymetrix, bioconductor, microarray
Author: "nsaunders" Tags: "bioinformatics, programming, R, statisti..."
Comments Send by mail Print  Save  Delicious 
Date: Tuesday, 26 Mar 2013 22:12

Last week, I attended the annual Computational and Simulation Sciences and eResearch Conference, hosted by CSIRO in Melbourne. The meeting includes a workshop that we call Bioinformatics FOAM (Focus On Analytical Methods). This year it was run over 2.5 days (up from the previous 1.5 by popular request); one day for internal CSIRO stuff and the rest open to external participants.

I had the pleasure of giving a brief presentation on the use of Git in bioinformatics. Nothing startling; aimed squarely at bioinformaticians who may have heard of version control in general and Git in particular but who are yet to employ either. I’m excited because for once I am free to share, resulting in my first upload to Slideshare in almost 4.5 years. You can view it here, or at the Australian Bioinformatics Network Slideshare, or in the embed below.



Filed under: australia, bioinformatics, computing, meetings Tagged: csiro, eresearch, foam, git, ict, slideshare, version control
Author: "nsaunders" Tags: "australia, bioinformatics, computing, me..."
Comments Send by mail Print  Save  Delicious 
Date: Monday, 18 Mar 2013 10:53

Since 2005, I have started almost every working day by using one Web application – an application that occupies a permanent browser tab on my work and home desktop machines. That application is Google Reader.

If you’re reading this, you’re probably aware that Google Reader will cease to exist from July 1 2013. Others have ranted, railed against the corporate machine and expressed their sadness. I thought I’d try to explain why, for this working scientist at least, RSS and feed readers are incredibly useful tools which I think should be valued highly.

GReader

Some feeds, yesterday

RSS: a primer
When I first discovered the concept of RSS, it was one of those moments that made me think: “this is so brilliant, simple and obvious – why isn’t everyone using this?”

In fact even today, very few of my immediate peers know what RSS is or why it’s useful. This may be an issue specific to Australian science, which is not exactly renowned for being at the cutting edge of the web revolution. However, for anyone else struggling with the concept, let’s spell it out:

The point of RSS is that:

  • you can monitor multiple, diverse sources of information in one location (aggregation)
  • you don’t have to visit those sources until their content updates and your feed reader tells you when that happens

What are these multiple, diverse sources of information? For a scientist they could include:

  • Tables of contents from journals
  • Alerts and searches at key research-related websites e.g. NCBI PubMed
  • Science blogs
  • Saved job searches
  • Activity monitoring at personal websites

Brilliant, simple, obvious. I wonder how scientists keep up to date in their field without RSS.

The Rise of Google Reader

Soon after launch, Google Reader rose to become the predominant feed reader. Undoubtedly, this was due in part to the brand. However, GReader does boast several key features which I believe contributed to its adoption:

  • It’s part of the “Google suite”; one login, multiple applications; in other words it’s “just there”
  • It’s “in the cloud” and so available and synched on all your machines; no local setup
  • No need to read everything immediately; it’s a searchable archive (they did take their time implementing search though, didn’t they)
  • Sharing to multiple networks is relatively easy via the “send to” function (forget about the now-defunct sharing button)
  • Intelligent keyboard shortcuts and simple layout enable rapid click-through, allowing focus on interesting items and discarding of irrelevant ones

RSS Is Not Dead (even if you’d like it to be)

Permit me a brief rant?

I’m tired of twenty-something hipster data scientists telling me that RSS is dead and has been supplanted by Twitter, Google+ and so on. Or as someone put it at the Google Operating System blog: “not everyone likes seeing 1% of their news as it scrolls by”. It seems that there are those who would like RSS to be dead and who believe that if they repeat the phrase often enough, it will come true. They may be right.

I speculate that the popularity of RSS among (enlightened) scientists and librarians is an indication that it’s a tool for people who like to read things properly, slowly and in-depth.

The Future

Google can do what they like of course, none of us paid for the product and there are alternatives available. I’m currently trying Feedly, who assure us that their product will continue to work after July 1. My fear is that there are those who equate RSS with Google Reader and who see the demise of the latter as further evidence of the death of the former. And again, they may be right as a self-fulfilling prophecy takes hold.

However the Web evolves, I just hope there will always be tools and protocols to provide information for people with attention spans longer than a gnat and a requirement for serious research.


Filed under: google, web resources Tagged: google reader, rss
Author: "nsaunders" Tags: "google, web resources, google reader, rs..."
Comments Send by mail Print  Save  Delicious 
Date: Monday, 25 Feb 2013 23:40

I’m a big fan of ggplot2. Recently, I ran into a situation which called for a useful feature that I had not used previously: aes_string.

Imagine that you have data consisting of observations for several variables – let’s say A, B, C – where each observation is from one of two groups – call them X and Y:

df1 <- data.frame(A = rnorm(50), B = rnorm(50), 
                  C = rnorm(50), group = rep(LETTERS[24:25], 25))
head(df1)
#           A          B           C group
# 1 0.2748922 -0.4805635 -1.80242191     X
# 2 0.0060852 -1.2972077  0.64262069     Y
# 3 0.1994655 -0.4628783  0.07670911     X
# 4 0.5416900  0.3853958  0.50193895     Y
# 5 0.3118773  0.9488503 -0.55855749     X
# 6 2.0924626  0.3027878 -0.03000122     Y

If you were interested in the distribution of variable A by group, you might generate a boxplot like so:

png("A.png", width = 800, height = 600)
print(ggplot(df1) + geom_boxplot(aes(group, A, fill = group)) + theme_bw())
dev.off()
Boxplot of A by group

Boxplot of A by group

Here, the arguments to aes() are expressions (group, A) which ggplot interprets as column names from the data frame.

What if you wanted to generate plots for each of variable A, B and C using a loop? You might start like this:

for(i in names(df1)[1:3])
# oh wait, these are characters not expressions
# [1] "A" "B" "C"

You see the problem. How do we pass the column names which are characters, not expressions, to aes()?

The answer: use aes_string() instead.

Description:
     Aesthetic mappings describe how variables in the data are mapped
     to visual properties (aesthetics) of geoms.  Compared to aes this
     function operates on strings rather than expressions.

And so:

for(i in names(df1)[1:3]) {
  png(paste(i, "png", sep = "."), width = 800, height = 600)
  df2 <- df1[, c(i, "group")]
  print(ggplot(df2) + geom_boxplot(aes_string(x = "group", y = i, fill = "group")) + theme_bw())
  dev.off()
}

It’s a little ugly as it stands (better to write a function using one of the apply family). However, the key point is: you can pass data frame column names as expressions to aes() or as characters to aes_string().

With thanks to Hadley’s contribution to this mailing list thread.


Filed under: programming, R, research diary, statistics Tagged: aes, aes_string, ggplot2
Author: "nsaunders" Tags: "programming, R, research diary, statisti..."
Comments Send by mail Print  Save  Delicious 
Date: Wednesday, 13 Feb 2013 04:08

File under “I keep forgetting how to do this basic, frequently-required task, so I’m writing it down here.”

Let’s create a data frame which contains five variables, vars, named A – E, each of which appears twice, along with some measurements:

df.orig <- data.frame(vars = rep(LETTERS[1:5], 2), obs1 = c(1:10), obs2 = c(11:20))
df.orig
#    vars obs1 obs2
# 1     A    1   11
# 2     B    2   12
# 3     C    3   13
# 4     D    4   14
# 5     E    5   15
# 6     A    6   16
# 7     B    7   17
# 8     C    8   18
# 9     D    9   19
# 10    E   10   20

Now, let’s say we want only the rows that contain the maximum values of obs1 for A – E. In bioinformatics, for example, we might be interested in selecting the microarray probeset with the highest sample variance from multiple probesets per gene. The answer is obvious in this trivial example (6 – 10), but one procedure looks like this:

# use aggregate to create new data frame with the maxima
df.agg <- aggregate(obs1 ~ vars, df.orig, max)
# then simply merge with the original
df.max <- merge(df.agg, df.orig)
df.max
#   vars obs1 obs2
# 1    A    6   16
# 2    B    7   17
# 3    C    8   18
# 4    D    9   19
# 5    E   10   20

This also works using min() and, I guess, using any function that returns a single value per variable mapping to a value in the original data frame.

With thanks to this mailing list thread.


Filed under: programming, R, research diary, statistics
Author: "nsaunders" Tags: "programming, R, research diary, statisti..."
Comments Send by mail Print  Save  Delicious 
Date: Tuesday, 12 Feb 2013 04:11

One of my bioinformatics pet peeves involves statements like this one, from the CNAmet user guide:

Inputs to CNAmet are three m x n matrices, where m is the number of genes and n the number samples

What we’re looking at here is the hot, but poorly-defined topic of data integration, in which biological measurements from two or more different platforms are somehow combined in a way that provides more information than each platform separately. Read any paper on this topic, download the software and you’ll find example datasets containing two or more matched matrices, with rows where measurements have been summarized to a “gene”. What you won’t find, typically, is a detailed explanation of the summarization procedure that you could implement yourself.


To their credit the authors of CNAmet are quite clear that the procedure used to generate these matrices is not their problem:

Since the three microarray platforms contain non-overlapping probes, the m dimension of the input matrices must match. This is because the problem of mapping measurements (probe to probe mapping) between different array types is not dealt with by CNAmet.

Two problems.

First, let’s face it, the very concept of an object called a “gene” is flawed; what we have in reality are fuzzy locations of transcriptional activity.

Second, some measurements summarize more readily than others. Exon expression arrays, for example, are frequently summarized to “gene level” by taking the median measurement of probesets in a transcript cluster. For copy number arrays, we might typically segment the measurements over each chromosome, then assign a number to a “gene” by determining overlap between gene and segment. However, something like a methylation array is more difficult; probesets map to different transcript-associated features (islands, shores, shelves) – which do we use?

Our group recently looked at several publications which tried to integrate measurements of methylation and gene expression. We found at least half a dozen ways of generating the “genes x samples” matrices, from selecting one probe per gene using particular criteria (e.g. highest variance) to complex clustering procedures based on chromosome coordinates. In one horror show of a study, the authors decided that it was fine to combine methylation data from their study with completely-unrelated publicly-available expression data. Why the reviewers and editors agreed is anyone’s guess.

My second law of bioinformatics, then:

On no account must the data pre-processing steps required to summarize multi-platform measurements to gene level be revealed

Seriously, if you have a great idea about the best way to combine, for example, measurements from the Affymetrix Human Exon 1.0 ST and the Illumina Infinium HumanMethylation450 beadchip – go for it in the comments.


Filed under: bioinformatics, research diary
Author: "nsaunders" Tags: "bioinformatics, research diary"
Comments Send by mail Print  Save  Delicious 
Date: Wednesday, 06 Feb 2013 00:38

January/February are exciting months for open [data|research|science|access] proponents in our region – by which I mean Australia and New Zealand.

First, we’ve enjoyed a speaking tour by Sir Tim Berners-Lee, during which he discussed the benefits of open data several times. I was able to attend two events in Sydney in person and a third, linux.conf.au, by video stream. The events were the work of many people but in particular, Pia Waugh. Go follow her on Twitter, now.

Next – I wish I had been able to get to this one – the Open Research Conference on February 6-7, University of Auckland. I’m enjoying the high-quality live stream right now. Flying the flag for Sydney are Mat and Alex.

Not strictly under the “open” umbrella but worth a mention anyway: software carpentry is in town, February 7-8, just up the road from me at Macquarie University. Looking forward to hearing some reports from that.


Filed under: australia, australian news, open access, open science Tagged: australia, new zealand, open data, tim berners-lee
Author: "nsaunders" Tags: "australia, australian news, open access,..."
Comments Send by mail Print  Save  Delicious 
Date: Wednesday, 30 Jan 2013 21:28

Retraction Watch reports a study of microarray data sharing. The article, published in Clinical Chemistry, is itself behind a paywall despite trumpeting the virtues of open data. So straight to the Open Access Irony Award group at CiteULike it goes.

I was not surprised to learn that the rate of public deposition of data is low, nor that most deposited data ignores standards and much of it is low quality. What did catch my eye though, was a retraction notice for one of the articles from the study, in which the authors explain the reason for retraction.

Two phrases in particular stand out:

we discovered an error in the data fed into the software

This decision was based on the instructions from the software during the initial data feed process

The language used strongly suggests a process whereby data was blindly “fed” into software, with little or no understanding of either how the software worked or the statistical methods employed. To quote Bill in our Twitter discussion:

If you are in this situation, seek help. Talk to a friendly local statistician. Or if there isn’t one, do your research on the Web before publishing. At the very least, try to ensure that what you’re doing corresponds broadly with what most other people in the field would consider “best practice” – even if figuring out what that is, from the literature, is not always easy.


Filed under: bioinformatics, publications, statistics Tagged: microarray, reproducibility, retraction
Author: "nsaunders" Tags: "bioinformatics, publications, statistics..."
Comments Send by mail Print  Save  Delicious 
Date: Thursday, 10 Jan 2013 22:36

Floating by in the Twitter stream, this from @leonidkruglyak. It leads to a light-hearted opinion(ated) piece by Sydney Brenner in Current Biology, 1996.

In 1996, you may recall, the Web was just a few years old. Amusingly (sadly?), it seems that Brenner predicted many of the topics in science publishing that we’re still discussing in 2013. It’s just that he thought they would be implemented in no time at all.

For example, open refereeing:

It is incidents such as this that have led me to question whether the anonymity of referees needs to be guarded so closely

Self-publishing/archiving and post-publication peer review:

The electronic pre-print with open discussion (not refereeing) will soon become commonplace; in fact, labs could go into the publication business by themselves

Demise of the journal impact factor, publishing economics and altmetrics:

We will need something to substitute for the present ratings given to papers appearing in ‘superior, peer-reviewed publications’ (and commercial publishers will find ways of making people pay for this)

Perhaps we should have a readership index; it should not be beyond the wit of man to devise a way of recording whenever a paper is read, hard-copied or cited

As Ethan said:


Filed under: publications Tagged: altmetrics, history, publishing, sydney brenner, www
Author: "nsaunders" Tags: "publications, altmetrics, history, publi..."
Comments Send by mail Print  Save  Delicious 
Date: Tuesday, 08 Jan 2013 21:04

We can debate the economics, complexities, details, implementation… of open access publishing for as long as we like. However, the basic principle: that publicly-funded research should be publicly-accessible seems to me at least, very obviously correct and “the right thing to do”.

So this, from April 2012, was very depressing.

Open access not as simple as it sounds: outgoing ARC boss

For those outside Australia, the ARC is the Australian Research Council. Much debate ensued in which one contributor to the comment thread wrote:

…it is particularly galling that Sheil is projecting her own simplistic understanding of open access onto its advocates. Hopefully she will be replaced at the Australian Research Council by someone who understands and supports open access.

Voilà.

The ARC has introduced a new open access policy for ARC funded research which takes effect from 1 January 2013. According to this new policy the ARC requires that any publications arising from an ARC supported research project must be deposited into an open access institutional repository within a twelve (12) month period from the date of publication.

I did giggle at the assumption that the author’s version of their article is by default a Word document, but then I guess that’s true for > 90% of authors.

Outcomes like this come dangerously close to restoring hope.


Filed under: australia, open access Tagged: arc, open access, publishing
Author: "nsaunders" Tags: "australia, open access, arc, publishing"
Comments Send by mail Print  Save  Delicious 
Date: Friday, 04 Jan 2013 01:51

Retweeted via the always-reliable Kay: “Data Sharing and Management Snafu in 3 Short Acts.”

I want to say “may there be less of this in 2013″ but that wish is probably futile.


Filed under: humour, video Tagged: data management, youtube
Author: "nsaunders" Tags: "humour, video, data management, youtube"
Comments Send by mail Print  Save  Delicious 
Date: Tuesday, 01 Jan 2013 00:30

Thankfully, WordPress.com have established an end-of-year tradition whereby you get a blog post for no effort, in the form of an annual report. Here it is.

Frankly, I’m amazed that I managed even 22 posts – hey, that’s almost 1 every 2 weeks! As for 2013…no promises. However, I’m looking forward to sharing at least a few useful snippets with you. All the best to you and yours for the coming year.


Filed under: this blog
Author: "nsaunders" Tags: "this blog"
Comments Send by mail Print  Save  Delicious 
Date: Wednesday, 05 Dec 2012 23:54

Long time, no blogging. Breaking the silence with something a bit different than my usual content – a molecular biology question for you.

So a colleague posted this link to Yammer; a collection of images selected for the Voyager Golden Record. The record was designed as a “time capsule” illustrating aspects of life on Earth and was launched on the Voyager spacecraft in 1977.

image014

Chemical definitions

Three of the images are confusing me: chemical definitions (shown at right), DNA Structure and DNA Structure magnified, light hit.

You see why I’m confused, right? The symbol used for cytosine is “S”.

I’ve had two suggestions so far. First:

@neilfws I think to avoid confusion with ‘G’

Second – someone has suggested that they had already used “C” for carbon and wanted to avoid confusion.

Or – was “S” commonly used as the symbol for cytosine in 1977? I know it can mean “C or G”, but that does not make sense in this context. Or is it – surely not – an error?

My web searches on the topic are leading nowhere, so if you know the answer – let’s hear it!


Filed under: space science Tagged: biochemistry, cytosine, molecular biology, voyager
Author: "nsaunders" Tags: "space science, biochemistry, cytosine, m..."
Comments Send by mail Print  Save  Delicious 
Date: Monday, 22 Oct 2012 01:37

June 23, 2004. BMC Bioinformatics publishes “Mistaken Identifiers: Gene name errors can be introduced inadvertently when using Excel in bioinformatics”. We roll our eyes. Do people really do that? Is it really worthy of publication? However, we admit that if it happens then it’s good that people know about it.

October 17, 2012. A colleague on our internal Yammer network writes:

Sad but true. I keep finding newbie bioinformatics errors in the Cancer Genome Atlas project data. This time a text download of 450K methylation from the Cancer Genome Atlas project reveals that Excel has had its evil way with the data at some point. Gene names such as MAR1, DEC1, OCT4 and SEPT9 are now reformatted as dates.

For example:

barcode	                        probe name      beta value       gene symbol    chromosome    position
TCGA-06-0125-02A-11D-2004-05    cg13918206      0.92035091902012 1-Dec	        9             118159781

I click through the CGA data portal in search of more datasets and choose, more or less at random, another file containing data from the Illumina 450K platform. It’s called jhu-usc.edu__HumanMethylation450__TCGA-G4-6628-01A-11D-1837-05__methylation_analysis.txt. Let’s get that into R:

tcga <- read.table("jhu-usc.edu__HumanMethylation450__TCGA-G4-6628-01A-11D-1837-05__methylation_analysis.txt", sep = "\t", header = T)
dim(tcga)

# [1] 485577      6

head(tcga$gene.symbol)

# [1] 1-Dec 1-Dec 1-Dec 1-Dec 1-Dec 1-Dec
# 25107 Levels:  10-Mar 11-Mar 11-Sep 13-Sep 14-Sep 1-Dec 1-Mar 1-Sep ... ZZZ3

Seems we have a problem. Let’s count up gene names:

genes <- as.data.frame(table(tcga$gene.symbol))
head(genes, 20)

     Var1   Freq
# 1         119652
# 2  10-Mar      5
# 3  11-Mar     21
# 4  11-Sep     32
# 5  13-Sep     18
# 6  14-Sep      4
# 7   1-Dec      8
# 8   1-Mar     30
# 9   1-Sep     10
# 10  3-Mar     33
# 11  4-Mar     25
# 12  5-Mar      4
# 13  5-Sep     12
# 14  6-Mar     21
# 15  7-Mar     13
# 16  8-Sep      2
# 17  9-Mar      6
# 18  9-Sep     16
# 19   A1BG      6
# 20   A1CF      5

Yes, we have a problem.

“Newbie bioinformaticians” is one thing. Large institutes, awarded millions of dollars to contribute to “big science” projects is another.

Despair at the quality of public data, fears about reproducibility in science. Must be Monday.


Filed under: bioinformatics, research diary Tagged: cancer, errors, excel, reproducibility, spreadsheet, tcga
Author: "nsaunders" Tags: "bioinformatics, research diary, cancer, ..."
Comments Send by mail Print  Save  Delicious 
Date: Monday, 27 Aug 2012 23:56


Updates from RStudio support:
(1) “Thanks for reporting and I was able to reproduce this as well. I’ve filed a bug and we’ll take a look.”
(2) Taking a further look, this is actually a bug in the Markdown package and we’ve asked the maintainer (Jeffrey Horner) to look into it.

As juejung points out in a comment on my previous post, applying custom CSS to R Markdown by sourcing the custom rendering function breaks the rendering of inline equations.

I’ve opened an issue with RStudio support and will update here with their response. In the meantime, one solution to this problem is:

  1. Do not create the files custom.css or style.R, as described yesterday
  2. Instead, just put the custom CSS at the top of your R Markdown file using style tags, as shown below

<style type="text/css">
table {
   max-width: 95%;
   border: 1px solid #ccc;
}

th {
  background-color: #000000;
  color: #ffffff;
}

td {
  background-color: #dcdcdc;
}
</style>


Filed under: programming, R, research diary, statistics Tagged: markdown, rstudio
Author: "nsaunders" Tags: "programming, R, research diary, statisti..."
Comments Send by mail Print  Save  Delicious 
Next page
» You can also retrieve older items : Read
» © All content and copyrights belong to their respective authors.«
» © FeedShow - Online RSS Feeds Reader