» Publishers, Monetize your RSS feeds with FeedShow: More infos (Show/Hide Ads)
File under “interesting articles that I don’t have time to write about at length.”
- Archaea and Fungi of the Human Gut Microbiome: Correlations with Diet and Bacterial Residents
- Mutational heterogeneity in cancer and the search for new cancer-associated genes
- The vast majority of statistical analysis is not performed by statisticians
Long ago, before metagenomics and NGS, I did a little work on detection of Archaea in human microbiomes. There’s a blog post in the pipeline about that but until then, enjoy this article in PLoS ONE.
This article is getting a lot of attention on Twitter this week. Brief summary: cancer cells are really messed up in all sorts of ways, most of which are not causal with respect to the cancer. Anyone who has ever looked at microarray data knows that it’s not uncommon for 50% or more of genes to show differential expression in a cancer/normal comparison, so this is hardly a new concept. I think we need to move away from ever-more detailed characterizations of the ways in which cancer cells are “messed up.” We know that they are and that doesn’t provide much insight, in my opinion.
Interesting post by Jeff Leek, summarized very well by its title. It points out that many more people are now interested in data analysis, many of them are not trained professionally as statisticians (I’m in this category myself) and we need to recognize and plan for that.
Bonus post doing the rounds of social media: Using Metadata to Find Paul Revere. Social network analysis, 18th-century style. Amusing, informative and topical.
Filed under: publications, statistics Tagged: archaea, cancer, microbiology, statistics
I subscribe to the Ensembl blog and found, in my feed reader this morning, a post which linked to the Variant Effect Predictor (VEP). The original blog post, strangely, has disappeared.
Not to worry: so, the VEP takes genotyping data in one of several formats, compares it with the Ensembl variation + core databases and returns a summary of how the variants affect transcripts and regulatory regions. My first thought – can I apply this to my own 23andme data?
1. Convert 23andme data to VCF
If you download your raw data from 23andme, it looks something like this (ignoring comment lines):
rs4477212 1 82154 AA rs3094315 1 752566 AA rs3131972 1 752721 GG rs12562034 1 768448 GG rs12124819 1 776546 AA
The rest of this post assumes that your raw data are saved in a file named mySNPs.txt; note that the original file name of your download will be something like genome_YOUR_NAME_Full_TIMESTAMP.txt.zip.
VEP will accept several file formats, including VCF (variant call format). At Github, I discovered a tool named 23andme2vcf to perform the conversion. My first attempt failed:
perl 23andme2vcf.pl mySNPs.txt mySNPs.vcf # raw data file and reference file are out of sync at ./23andme2vcf.pl line 154, line 4.
Digging around the Github issues page, I noted that this error has occurred before; the issue was closed when the reference file supplied with 23andme2vcf was updated.
A quick line count reveals the problem: more SNPs in my 23andme data (V3 platform) than the reference file.
gunzip -c 23andme_hg19ref_20121017.txt.gz | wc -l # 960613 grep -cv "^#" mySNPs.txt # 991786
Easiest fix: read both into R, discard those SNPs not found in the reference file and write back out to a new tab-delimited file. To avoid discarding SNPs, you’ll have to figure out how to create the reference file; perhaps the issue will be fixed in a future update.
ref <- read.table("23andme_hg19ref_20121017.txt.gz", header = F, stringsAsFactors = F)
snp <- read.table("mySNPs.txt", header = F, stringsAsFactors = F)
snp <- subset(snp, V1 %in% ref$V3)
write.table(snp, "mynewSNPs.txt", quote = F, col.names = F, row.names = F, sep = "\t")
Now the conversion works as expected:
perl 23andme2vcf.pl mynewSNPs.txt mynewSNPs.vcf
Note that the script does not yet handle insertions or deletions. However, these are not well-defined in the 23andme data file in any case.
2. Install VEP locally
This is all as described in the VEP script documentation. Click the “download latest version” link, navigate to wherever you downloaded it and then:
tar zxvf variant_effect_predictor.tar.gz cd variant_effect_predictor perl INSTALL.pl
Follow the prompts. You will want to install the core and VEP files for Homo sapiens, the latest release being 71. Note that VEP files are a large download, currently ~ 2.5 GB, which will take some time.
3. Run VEP
From the variant_effect_predictor directory, as simple as:
perl variant_effect_predictor.pl --cache -i ~/path/to/mynewSNPs.vcf -o ~/path/to/myVEP.txt
In addition to the delimited text output (mine is around 163 MB in size), the script generates a rather nice HTML summary with some statistics. It’s also easy to read the output from VEP back into R. See plot, right, for a summary of the VEP Consequence column.
library(ggplot2)
vep <- read.table("vep.txt", header = F, stringsAsFactors = F, sep = "\t")
colnames(vep) <- c("Uploaded_variation", "Location", "Allele", "Gene", "Feature", "Feature_type",
"Consequence", "cDNA_position", "CDS_position", "Protein_position", "Amino_acids",
"Codons", "Existing_variation", "Extra")
cons <- as.data.frame(table(vep$Consequence))
cons <- cons[sort.list(cons$Freq, decreasing = T),]
cons$Var1 <- factor(cons$Var1, levels = cons$Var1)
ggplot(cons) + geom_bar(aes(Var1, log10(Freq)), stat = "identity") + coord_flip() + theme_bw()
It’s a lot of output to trawl through and summarize; I’ll post on the topic again if I discover anything interesting.
Filed under: bioinformatics, genomics, personal, R, statistics Tagged: 23andme, ensembl, prediction, variant
While we’re on the topic of mistaking Archaea for Bacteria, here’s an issue with the NCBI FTP site that has long annoyed me and one workaround. Warning: I threw this together minutes ago and it’s not fully tested.
Let’s cut to the chase. In the NCBI FTP site, archaeal genome data is stored along with bacterial genomes in a single directory named Bacteria.
Aside from the fact that this is taxonomically incorrect, it makes bulk retrieval of archaeal data rather difficult. For example, I know that Methanococcoides burtonii is an archaeon and if I want to download its protein-coding genes (files ending with .ffn), I can do:
wget ftp://ftp.ncbi.nlm.nih.gov/genomes/Bacteria/Methanococcoides_burtonii_DSM_6242_uid58023/*.ffn
What if I want all available ffn files for Archaea? Somehow, I need to know which of the organism names in the Bacteria directory correspond to Archaea. So here is one approach.
1. Download the summary file prokaryotes.txt
Leaving aside issues with the term prokaryote, the NCBI FTP site contains a useful tab-delimited file which summarises current bacterial and archaeal genomes.
wget ftp://ftp.ncbi.nlm.nih.gov/genomes/GENOME_REPORTS/prokaryotes.txt
2. Extract the Archaea
We could use shell tools (awk, grep, cut and so on) but I’m going to go with R. First, read in the file and examine the contents of the Group column:
p <- read.table("prokaryotes.txt", header = T, sep = "\t", quote = "", comment.char = "")
table(p$Group)
# Actinobacteria Aquificae
# 2040 18
# Armatimonadetes Bacteroidetes/Chlorobi group
# 1 618
# Caldiserica Chlamydiae/Verrucomicrobia group
# 1 232
# Chloroflexi Chrysiogenetes
# 29 2
# Crenarchaeota Cyanobacteria
# 80 226
# Deferribacteres Deinococcus-Thermus
# 6 54
# Dictyoglomi Elusimicrobia
# 2 2
# environmental samples Euryarchaeota
# 1 283
# Fibrobacteres/Acidobacteria group Firmicutes
# 18 5163
# Fusobacteria Gemmatimonadetes
# 70 1
# Korarchaeota Nanoarchaeota
# 1 1
# Nitrospinae Nitrospirae
# 5 10
# Planctomycetes Proteobacteria
# 28 10793
# Spirochaetes Synergistetes
# 461 18
# Tenericutes Thaumarchaeota
# 191 17
# Thermodesulfobacteria Thermotogae
# 6 42
# unclassified Archaea unclassified Bacteria
# 1 19
Only the Archaea contain the string “archae”, so we can extract them using grep():
a <- p[grep("archae", p$Group),]
3. Construct FTP URLs
Now: each organism in the FTP site has its own directory of the form:
ftp.ncbi.nlm.nih.gov/genomes/Bacteria/ORGN_uidNNNN
where ORGN and NNNN correspond to the columns #Organism Name and BioProject ID, respectively, in the file prokaryotes.txt. This was not always the case and may not be so in the future but right now, it is. So in R, we can create a new column for the URL. We could stay in R, but let’s write out the FTP URLs to a file. Using write.table() for a single column is not really appropriate, but it works.
a$url <- paste("ftp://ftp.ncbi.nlm.nih.gov/genomes/Bacteria/",
gsub(" ", "_", a$X.Organism.Name),
"_uid", a$BioProject.ID, sep = "")
write.table(a$url, file = "archaea_urls.txt", col.names = F, row.names = F, quote = F)
That gives us (first 5 lines):
ftp://ftp.ncbi.nlm.nih.gov/genomes/Bacteria/Methanococcus_maripaludis_C5_uid15999 ftp://ftp.ncbi.nlm.nih.gov/genomes/Bacteria/Methanococcus_maripaludis_S2_uid58035 ftp://ftp.ncbi.nlm.nih.gov/genomes/Bacteria/Methanococcus_maripaludis_C5_uid58741 ftp://ftp.ncbi.nlm.nih.gov/genomes/Bacteria/Methanococcus_maripaludis_C7_uid58847 ftp://ftp.ncbi.nlm.nih.gov/genomes/Bacteria/Methanococcus_maripaludis_C6_uid58947 ...
4. Download
Now, to get those ffn files it’s as simple as sending each line in the file of URLs to wget:
cat archaea_urls.txt | while read LINE; do wget "$LINE/*.ffn"; done
Some of those URLs will not exist. That’s OK, wget will just report “No such directory” and the shell script will continue on its merry way. I get 172 files, which seems “about right”.
A lot of day-to-day bioinformatics involves these types of “how do I even get the data” tasks. Life would be much easier and research quicker if the NCBI would just put the Archaea in their own directory.
Filed under: bioinformatics, genomics Tagged: database, ftp, ncbi, sequence
My journey from bench scientist to bioinformatician began with archaeal genomes. So I was somewhat startled to read The catalytic mechanism for aerobic formation of methane by bacteria, in which we learn about the “ocean-dwelling bacterium Nitrosopumilus maritimus“.
So was Jonathan Eisen of course and you should go and read why. Every top hit in a Web search for that organism tells us that Nitrosopumilus maritimus is an archaeon.
Looking forward to a rapid correction and apology from Nature.
Title edited from “phylogeny” to “taxonomy” at the insistence of @BioinfoTools ;)
Filed under: publications Tagged: archaea, biochemistry, enzymology, errors, nature, phylogeny, taxonomy
Recently, someone asked me if I could generate a list of genes associated with a particular pathway. Sure, I said and hacked together some rather nasty code in R which, given a KEGG pathway identifier, used a combination of the KEGG REST API, DBGET and biomaRt to return HGNC symbols.
Coincidentally, someone asked the same question at Biostar. Pierre recommended the TogoWS REST service, which provides an API to multiple biological data sources. An article describing TogoWS was published in 2010.
An excellent suggestion – and one which, I later discovered, I had bookmarked. Twice. As long ago as 2008. This “rediscovery of things I once knew” happens to me with increasing frequency now, which makes me wonder whether (1) we really are drowning in information, (2) my online curation tools/methods require improvement or (3) my mind is not what it was. Perhaps some combination of all three.
Anyway – using Ruby (1.8.7), a list of HGNC symbols given a KEGG pathway, e.g. MAPK signaling, is as simple as:
require 'rubygems'
require 'open-uri'
require 'json/pure'
j = JSON.parse(open("http://togows.dbcls.jp/entry/pathway/hsa04010/genes.json").read)
g = j.first.values.map {|v| /^(.*?);/.match(v)[1] }
# first 5 genes
g[0..4]
# ["MAP3K14", "FGF17", "FGF6", "DUSP9", "MAP3K6"]
This code parses the JSON returned from TogoWS into an array with one element; the element is a hash with key/value pairs of the form:
"9020"=>"MAP3K14; mitogen-activated protein kinase kinase kinase 14 [KO:K04466] [EC:2.7.11.25]"
Values for all keys that I’ve seen to date begin with the HGNC symbol followed by a semicolon, making extraction quite straightforward with a simple regular expression.
Filed under: bioinformatics, programming, ruby Tagged: biostar, how to, kegg, pathways, rest
Today marks the release of R 3.0.0. There will be plenty of commentary and useful information at sites such as R-bloggers (for example, Tal’s post).
Version 3.0.0 is great news for bioinformaticians, due to the introduction of long vectors. What does that mean? Well, several months ago, I was using the simpleaffy package from Bioconductor to normalize Affymetrix exon microarrays. I began as usual by reading the CEL files:
f <- list.files(path = "data/affyexon", pattern = ".CEL.gz", full.names = T, recursive = T) cel <- ReadAffy(filenames = f)
When this happened:
Error in read.affybatch(filenames = l$filenames, phenoData = l$phenoData, : allocMatrix: too many elements specified
I had a relatively-large number of samples (337), but figured a 64-bit machine with ~ 100 GB RAM should be able to cope. I was wrong: due to a hard-coded limit to vector length in R, my matrix had become too large regardless of available memory. See this post and this StackOverflow question for the computational details.
My solution at the time was to resort to Affymetrix Power Tools. Hopefully, the introduction of the LONG vector will make Bioconductor even more capable and useful.
Filed under: bioinformatics, programming, R, statistics Tagged: 3.0.0, affymetrix, bioconductor, microarray
Last week, I attended the annual Computational and Simulation Sciences and eResearch Conference, hosted by CSIRO in Melbourne. The meeting includes a workshop that we call Bioinformatics FOAM (Focus On Analytical Methods). This year it was run over 2.5 days (up from the previous 1.5 by popular request); one day for internal CSIRO stuff and the rest open to external participants.
I had the pleasure of giving a brief presentation on the use of Git in bioinformatics. Nothing startling; aimed squarely at bioinformaticians who may have heard of version control in general and Git in particular but who are yet to employ either. I’m excited because for once I am free to share, resulting in my first upload to Slideshare in almost 4.5 years. You can view it here, or at the Australian Bioinformatics Network Slideshare, or in the embed below.
Filed under: australia, bioinformatics, computing, meetings Tagged: csiro, eresearch, foam, git, ict, slideshare, version control
Since 2005, I have started almost every working day by using one Web application – an application that occupies a permanent browser tab on my work and home desktop machines. That application is Google Reader.
If you’re reading this, you’re probably aware that Google Reader will cease to exist from July 1 2013. Others have ranted, railed against the corporate machine and expressed their sadness. I thought I’d try to explain why, for this working scientist at least, RSS and feed readers are incredibly useful tools which I think should be valued highly.
RSS: a primer
When I first discovered the concept of RSS, it was one of those moments that made me think: “this is so brilliant, simple and obvious – why isn’t everyone using this?”
In fact even today, very few of my immediate peers know what RSS is or why it’s useful. This may be an issue specific to Australian science, which is not exactly renowned for being at the cutting edge of the web revolution. However, for anyone else struggling with the concept, let’s spell it out:
The point of RSS is that:
- you can monitor multiple, diverse sources of information in one location (aggregation)
- you don’t have to visit those sources until their content updates and your feed reader tells you when that happens
What are these multiple, diverse sources of information? For a scientist they could include:
- Tables of contents from journals
- Alerts and searches at key research-related websites e.g. NCBI PubMed
- Science blogs
- Saved job searches
- Activity monitoring at personal websites
Brilliant, simple, obvious. I wonder how scientists keep up to date in their field without RSS.
The Rise of Google Reader
Soon after launch, Google Reader rose to become the predominant feed reader. Undoubtedly, this was due in part to the brand. However, GReader does boast several key features which I believe contributed to its adoption:
- It’s part of the “Google suite”; one login, multiple applications; in other words it’s “just there”
- It’s “in the cloud” and so available and synched on all your machines; no local setup
- No need to read everything immediately; it’s a searchable archive (they did take their time implementing search though, didn’t they)
- Sharing to multiple networks is relatively easy via the “send to” function (forget about the now-defunct sharing button)
- Intelligent keyboard shortcuts and simple layout enable rapid click-through, allowing focus on interesting items and discarding of irrelevant ones
RSS Is Not Dead (even if you’d like it to be)
Permit me a brief rant?
I’m tired of twenty-something hipster data scientists telling me that RSS is dead and has been supplanted by Twitter, Google+ and so on. Or as someone put it at the Google Operating System blog: “not everyone likes seeing 1% of their news as it scrolls by”. It seems that there are those who would like RSS to be dead and who believe that if they repeat the phrase often enough, it will come true. They may be right.
I speculate that the popularity of RSS among (enlightened) scientists and librarians is an indication that it’s a tool for people who like to read things properly, slowly and in-depth.
The Future
Google can do what they like of course, none of us paid for the product and there are alternatives available. I’m currently trying Feedly, who assure us that their product will continue to work after July 1. My fear is that there are those who equate RSS with Google Reader and who see the demise of the latter as further evidence of the death of the former. And again, they may be right as a self-fulfilling prophecy takes hold.
However the Web evolves, I just hope there will always be tools and protocols to provide information for people with attention spans longer than a gnat and a requirement for serious research.
Filed under: google, web resources Tagged: google reader, rss
I’m a big fan of ggplot2. Recently, I ran into a situation which called for a useful feature that I had not used previously: aes_string.
Imagine that you have data consisting of observations for several variables – let’s say A, B, C – where each observation is from one of two groups – call them X and Y:
df1 <- data.frame(A = rnorm(50), B = rnorm(50),
C = rnorm(50), group = rep(LETTERS[24:25], 25))
head(df1)
# A B C group
# 1 0.2748922 -0.4805635 -1.80242191 X
# 2 0.0060852 -1.2972077 0.64262069 Y
# 3 0.1994655 -0.4628783 0.07670911 X
# 4 0.5416900 0.3853958 0.50193895 Y
# 5 0.3118773 0.9488503 -0.55855749 X
# 6 2.0924626 0.3027878 -0.03000122 Y
If you were interested in the distribution of variable A by group, you might generate a boxplot like so:
png("A.png", width = 800, height = 600)
print(ggplot(df1) + geom_boxplot(aes(group, A, fill = group)) + theme_bw())
dev.off()
|
Here, the arguments to aes() are expressions (group, A) which ggplot interprets as column names from the data frame.
What if you wanted to generate plots for each of variable A, B and C using a loop? You might start like this:
for(i in names(df1)[1:3]) # oh wait, these are characters not expressions # [1] "A" "B" "C"
You see the problem. How do we pass the column names which are characters, not expressions, to aes()?
The answer: use aes_string() instead.
Description:
Aesthetic mappings describe how variables in the data are mapped
to visual properties (aesthetics) of geoms. Compared to aes this
function operates on strings rather than expressions.
And so:
for(i in names(df1)[1:3]) {
png(paste(i, "png", sep = "."), width = 800, height = 600)
df2 <- df1[, c(i, "group")]
print(ggplot(df2) + geom_boxplot(aes_string(x = "group", y = i, fill = "group")) + theme_bw())
dev.off()
}
It’s a little ugly as it stands (better to write a function using one of the apply family). However, the key point is: you can pass data frame column names as expressions to aes() or as characters to aes_string().
With thanks to Hadley’s contribution to this mailing list thread.
Filed under: programming, R, research diary, statistics Tagged: aes, aes_string, ggplot2
File under “I keep forgetting how to do this basic, frequently-required task, so I’m writing it down here.”
Let’s create a data frame which contains five variables, vars, named A – E, each of which appears twice, along with some measurements:
df.orig <- data.frame(vars = rep(LETTERS[1:5], 2), obs1 = c(1:10), obs2 = c(11:20)) df.orig # vars obs1 obs2 # 1 A 1 11 # 2 B 2 12 # 3 C 3 13 # 4 D 4 14 # 5 E 5 15 # 6 A 6 16 # 7 B 7 17 # 8 C 8 18 # 9 D 9 19 # 10 E 10 20
Now, let’s say we want only the rows that contain the maximum values of obs1 for A – E. In bioinformatics, for example, we might be interested in selecting the microarray probeset with the highest sample variance from multiple probesets per gene. The answer is obvious in this trivial example (6 – 10), but one procedure looks like this:
# use aggregate to create new data frame with the maxima df.agg <- aggregate(obs1 ~ vars, df.orig, max) # then simply merge with the original df.max <- merge(df.agg, df.orig) df.max # vars obs1 obs2 # 1 A 6 16 # 2 B 7 17 # 3 C 8 18 # 4 D 9 19 # 5 E 10 20
This also works using min() and, I guess, using any function that returns a single value per variable mapping to a value in the original data frame.
With thanks to this mailing list thread.
Filed under: programming, R, research diary, statistics
One of my bioinformatics pet peeves involves statements like this one, from the CNAmet user guide:
Inputs to CNAmet are three m x n matrices, where m is the number of genes and n the number samples
What we’re looking at here is the hot, but poorly-defined topic of data integration, in which biological measurements from two or more different platforms are somehow combined in a way that provides more information than each platform separately. Read any paper on this topic, download the software and you’ll find example datasets containing two or more matched matrices, with rows where measurements have been summarized to a “gene”. What you won’t find, typically, is a detailed explanation of the summarization procedure that you could implement yourself.
To their credit the authors of CNAmet are quite clear that the procedure used to generate these matrices is not their problem:
Since the three microarray platforms contain non-overlapping probes, the m dimension of the input matrices must match. This is because the problem of mapping measurements (probe to probe mapping) between different array types is not dealt with by CNAmet.
Two problems.
First, let’s face it, the very concept of an object called a “gene” is flawed; what we have in reality are fuzzy locations of transcriptional activity.
Second, some measurements summarize more readily than others. Exon expression arrays, for example, are frequently summarized to “gene level” by taking the median measurement of probesets in a transcript cluster. For copy number arrays, we might typically segment the measurements over each chromosome, then assign a number to a “gene” by determining overlap between gene and segment. However, something like a methylation array is more difficult; probesets map to different transcript-associated features (islands, shores, shelves) – which do we use?
Our group recently looked at several publications which tried to integrate measurements of methylation and gene expression. We found at least half a dozen ways of generating the “genes x samples” matrices, from selecting one probe per gene using particular criteria (e.g. highest variance) to complex clustering procedures based on chromosome coordinates. In one horror show of a study, the authors decided that it was fine to combine methylation data from their study with completely-unrelated publicly-available expression data. Why the reviewers and editors agreed is anyone’s guess.
My second law of bioinformatics, then:
On no account must the data pre-processing steps required to summarize multi-platform measurements to gene level be revealed
Seriously, if you have a great idea about the best way to combine, for example, measurements from the Affymetrix Human Exon 1.0 ST and the Illumina Infinium HumanMethylation450 beadchip – go for it in the comments.
Filed under: bioinformatics, research diary
January/February are exciting months for open [data|research|science|access] proponents in our region – by which I mean Australia and New Zealand.
First, we’ve enjoyed a speaking tour by Sir Tim Berners-Lee, during which he discussed the benefits of open data several times. I was able to attend two events in Sydney in person and a third, linux.conf.au, by video stream. The events were the work of many people but in particular, Pia Waugh. Go follow her on Twitter, now.
Next – I wish I had been able to get to this one – the Open Research Conference on February 6-7, University of Auckland. I’m enjoying the high-quality live stream right now. Flying the flag for Sydney are Mat and Alex.
Not strictly under the “open” umbrella but worth a mention anyway: software carpentry is in town, February 7-8, just up the road from me at Macquarie University. Looking forward to hearing some reports from that.
Filed under: australia, australian news, open access, open science Tagged: australia, new zealand, open data, tim berners-lee
Retraction Watch reports a study of microarray data sharing. The article, published in Clinical Chemistry, is itself behind a paywall despite trumpeting the virtues of open data. So straight to the Open Access Irony Award group at CiteULike it goes.
I was not surprised to learn that the rate of public deposition of data is low, nor that most deposited data ignores standards and much of it is low quality. What did catch my eye though, was a retraction notice for one of the articles from the study, in which the authors explain the reason for retraction.
Two phrases in particular stand out:
we discovered an error in the data fed into the software
This decision was based on the instructions from the software during the initial data feed process
The language used strongly suggests a process whereby data was blindly “fed” into software, with little or no understanding of either how the software worked or the statistical methods employed. To quote Bill in our Twitter discussion:
@neilfws "decision was based on the instructions from the software" What the…? It's #overlyhonestmethods come to life!—
Bill Hooker (@sennoma) January 30, 2013
If you are in this situation, seek help. Talk to a friendly local statistician. Or if there isn’t one, do your research on the Web before publishing. At the very least, try to ensure that what you’re doing corresponds broadly with what most other people in the field would consider “best practice” – even if figuring out what that is, from the literature, is not always easy.
Filed under: bioinformatics, publications, statistics Tagged: microarray, reproducibility, retraction
Floating by in the Twitter stream, this from @leonidkruglyak. It leads to a light-hearted opinion(ated) piece by Sydney Brenner in Current Biology, 1996.
In 1996, you may recall, the Web was just a few years old. Amusingly (sadly?), it seems that Brenner predicted many of the topics in science publishing that we’re still discussing in 2013. It’s just that he thought they would be implemented in no time at all.
For example, open refereeing:
It is incidents such as this that have led me to question whether the anonymity of referees needs to be guarded so closely
Self-publishing/archiving and post-publication peer review:
The electronic pre-print with open discussion (not refereeing) will soon become commonplace; in fact, labs could go into the publication business by themselves
Demise of the journal impact factor, publishing economics and altmetrics:
We will need something to substitute for the present ratings given to papers appearing in ‘superior, peer-reviewed publications’ (and commercial publishers will find ways of making people pay for this)
Perhaps we should have a readership index; it should not be beyond the wit of man to devise a way of recording whenever a paper is read, hard-copied or cited
As Ethan said:
@neilfws what's taking so long?!
—
Ethan Perlstein (@eperlste) January 10, 2013
Filed under: publications Tagged: altmetrics, history, publishing, sydney brenner, www
We can debate the economics, complexities, details, implementation… of open access publishing for as long as we like. However, the basic principle: that publicly-funded research should be publicly-accessible seems to me at least, very obviously correct and “the right thing to do”.
So this, from April 2012, was very depressing.
Open access not as simple as it sounds: outgoing ARC boss
For those outside Australia, the ARC is the Australian Research Council. Much debate ensued in which one contributor to the comment thread wrote:
…it is particularly galling that Sheil is projecting her own simplistic understanding of open access onto its advocates. Hopefully she will be replaced at the Australian Research Council by someone who understands and supports open access.
The ARC has introduced a new open access policy for ARC funded research which takes effect from 1 January 2013. According to this new policy the ARC requires that any publications arising from an ARC supported research project must be deposited into an open access institutional repository within a twelve (12) month period from the date of publication.
I did giggle at the assumption that the author’s version of their article is by default a Word document, but then I guess that’s true for > 90% of authors.
Outcomes like this come dangerously close to restoring hope.
Filed under: australia, open access Tagged: arc, open access, publishing
Thankfully, WordPress.com have established an end-of-year tradition whereby you get a blog post for no effort, in the form of an annual report. Here it is.
Frankly, I’m amazed that I managed even 22 posts – hey, that’s almost 1 every 2 weeks! As for 2013…no promises. However, I’m looking forward to sharing at least a few useful snippets with you. All the best to you and yours for the coming year.
Filed under: this blog
Long time, no blogging. Breaking the silence with something a bit different than my usual content – a molecular biology question for you.
So a colleague posted this link to Yammer; a collection of images selected for the Voyager Golden Record. The record was designed as a “time capsule” illustrating aspects of life on Earth and was launched on the Voyager spacecraft in 1977.
You see why I’m confused, right? The symbol used for cytosine is “S”.
I’ve had two suggestions so far. First:
Second – someone has suggested that they had already used “C” for carbon and wanted to avoid confusion.
Or – was “S” commonly used as the symbol for cytosine in 1977? I know it can mean “C or G”, but that does not make sense in this context. Or is it – surely not – an error?
My web searches on the topic are leading nowhere, so if you know the answer – let’s hear it!
Filed under: space science Tagged: biochemistry, cytosine, molecular biology, voyager
June 23, 2004. BMC Bioinformatics publishes “Mistaken Identifiers: Gene name errors can be introduced inadvertently when using Excel in bioinformatics”. We roll our eyes. Do people really do that? Is it really worthy of publication? However, we admit that if it happens then it’s good that people know about it.
October 17, 2012. A colleague on our internal Yammer network writes:
Sad but true. I keep finding newbie bioinformatics errors in the Cancer Genome Atlas project data. This time a text download of 450K methylation from the Cancer Genome Atlas project reveals that Excel has had its evil way with the data at some point. Gene names such as MAR1, DEC1, OCT4 and SEPT9 are now reformatted as dates.
For example:
barcode probe name beta value gene symbol chromosome position TCGA-06-0125-02A-11D-2004-05 cg13918206 0.92035091902012 1-Dec 9 118159781
I click through the CGA data portal in search of more datasets and choose, more or less at random, another file containing data from the Illumina 450K platform. It’s called jhu-usc.edu__HumanMethylation450__TCGA-G4-6628-01A-11D-1837-05__methylation_analysis.txt. Let’s get that into R:
tcga <- read.table("jhu-usc.edu__HumanMethylation450__TCGA-G4-6628-01A-11D-1837-05__methylation_analysis.txt", sep = "\t", header = T)
dim(tcga)
# [1] 485577 6
head(tcga$gene.symbol)
# [1] 1-Dec 1-Dec 1-Dec 1-Dec 1-Dec 1-Dec
# 25107 Levels: 10-Mar 11-Mar 11-Sep 13-Sep 14-Sep 1-Dec 1-Mar 1-Sep ... ZZZ3
Seems we have a problem. Let’s count up gene names:
genes <- as.data.frame(table(tcga$gene.symbol))
head(genes, 20)
Var1 Freq
# 1 119652
# 2 10-Mar 5
# 3 11-Mar 21
# 4 11-Sep 32
# 5 13-Sep 18
# 6 14-Sep 4
# 7 1-Dec 8
# 8 1-Mar 30
# 9 1-Sep 10
# 10 3-Mar 33
# 11 4-Mar 25
# 12 5-Mar 4
# 13 5-Sep 12
# 14 6-Mar 21
# 15 7-Mar 13
# 16 8-Sep 2
# 17 9-Mar 6
# 18 9-Sep 16
# 19 A1BG 6
# 20 A1CF 5
Yes, we have a problem.
“Newbie bioinformaticians” is one thing. Large institutes, awarded millions of dollars to contribute to “big science” projects is another.
Despair at the quality of public data, fears about reproducibility in science. Must be Monday.
Filed under: bioinformatics, research diary Tagged: cancer, errors, excel, reproducibility, spreadsheet, tcga
Updates from RStudio support:
(1) “Thanks for reporting and I was able to reproduce this as well. I’ve filed a bug and we’ll take a look.”
(2) Taking a further look, this is actually a bug in the Markdown package and we’ve asked the maintainer (Jeffrey Horner) to look into it.
As juejung points out in a comment on my previous post, applying custom CSS to R Markdown by sourcing the custom rendering function breaks the rendering of inline equations.
I’ve opened an issue with RStudio support and will update here with their response. In the meantime, one solution to this problem is:
- Do not create the files custom.css or style.R, as described yesterday
- Instead, just put the custom CSS at the top of your R Markdown file using style tags, as shown below
<style type="text/css">
table {
max-width: 95%;
border: 1px solid #ccc;
}
th {
background-color: #000000;
color: #ffffff;
}
td {
background-color: #dcdcdc;
}
</style>
Filed under: programming, R, research diary, statistics Tagged: markdown, rstudio












