The following is the authentic code in my on-going project to find the key players in controlling left-right (L-R) patterning during heart development(Chen et al., 2010). Raw data is from wanglab’s recent work, including mRNA and lncRNA at different stages (Touma et al., 2016). In this tutorial, we would like to address the issue of how to analyse the public data in reproducible way and present and save the final result in Rmarkdown and share code with your colleagues.
We have met numerous problems in biomedicine field. However, the researcher’s work is not fully reproducible(Prinz et al., 2011; Baker, 2015). And data-sharing is also arguable in biomedicine researchers (International Consortium of Investigators for Fairness in Trial Data Sharing et al., 2016). And source code is kept confidentially in individual labs due to some competitive reason or even private daul. This unhealthy behavior impede the advance in biomedical research and consume huge amount of taxpayers’ money.
In the present working flow,
My biological question is, what are the novel determinants involved in LR patterning during heart development? To tackle this problem using the public data, I will perform post quality control (post-QC) procedure and then perform advanced methods to find the downstream targets, which display left-right specific expression pattern. This scenario is a traditional strategy to pinpoint the candidates involved in certain biological pathway. In general, we use the animal model, in our case, it is mouse, to infer the LR specific genes. If possible, we explore the same information using human samples or human population genetics data.
graph <-
create_graph() %>%
add_n_nodes(4) %>%
add_edges_w_string(
'1->2 2->3 3->4' , 'black') %>%
set_node_attrs('label',c( 'Ask a good biological question',
'Design a working pipeline to crack it',
'Explore and interpret the data',
'Draw publishable figures in the coda')) %>%
select_nodes_by_id(nodes = c(1:4)) %>%
set_node_attrs_ws('fontsize',10) %>%
set_node_attrs_ws('fontname', 'Microsoft YaHei') %>%
set_node_attrs_ws('fontcolor', 'black') %>%
set_node_attrs_ws('shape','square') %>%
set_graph_name('biological data analysis schema')
#> `select_nodes_by_id()` INFO: created a new selection of 4 nodes
render_graph(graph)
this is my own setting in our high performace cluster(HPC-fuwai). I have configured the path to the required reference genomes and corresponding annotation files. I used the miniconda to manage the software’s version and local preference control.
I get raw data from public database using aspera client (Linux) to get my data from NCBI as most RNA-seq data, particularly, lncRNA data are large in doing so to reach the deep sequence depth. But before running the demo code below, the user should install his/her own aspera program locally.
"
I used the system call in R to perform the aspera task. I first define the
parameters requried by aspera client. the 'key', the 'host',
and raw data path depostied in NCBI aspera server.
"
aspc.exe <- '/bioware/bin/.aspera/connect/bin/ascp'
key <- '/bioware/bin/.aspera/connect/etc/asperaweb_id_dsa.openssh'
host <- 'ftp-private.ncbi.nlm.nih.gov'
rawdata <- '/sra/sra-instant/reads/ByStudy/sra/SRP/SRP043/SRP043076'
download.data <- system( paste( aspc.exe,' -QT -L- -k 3 -i ',
key,' --host=', host,
' --user=anonftp',
' --mode=recv --retry-timeout=10 -d ',
rawdata, './') )
system( paste( 'find ', '.', ' -name ', '*.sra', ' -type f | xargs -n 1 -P 2 fastq-dump --split-3') )
raw.data.path <- '/home/zhenyisong/biodata/cardiodata/GSE85728'
GSE85728.files <- list.files( raw.data.path, pattern = 'SRR.*.fastq$')
read.1.files <- GSE85728.files[grep('_1',GSE85728.files)]
read.2.files <- GSE85728.files[grep('_2',GSE85728.files)]
"
I filtered out the sample names from raw sequencing data
"
GSE85728.output.filenames <- basename(read.1.files) %>%
sub(pattern = '_1.fastq', replacement = '') %>%
paste0(wangyb.output.dir,'/', . ,'.bam')
dir.create(GSE85728.output.dir, showWarnings = FALSE, recursive = TRUE)
setwd(rsubread.index.lib)
base.string <- 'mm10'
align( index = base.string,
readfile1 = read.1.files,
readfile2 = read.2.files,
input_format = 'FASTQ',
type = 'rna',
output_file = GSE85728.output.filenames,
output_format = 'BAM',
PE_orientation = 'fr',
nthreads = 30,
indels = 1,
maxMismatches = 3,
phredOffset = 33,
unique = T )
GSE85728.rsubread.gencode <- featureCounts( GSE85728.output.filenames,
useMetaFeatures = TRUE,
countMultiMappingReads = FALSE,
strandSpecific = 2,
isPairedEnd = TRUE,
requireBothEndsMapped = TRUE,
autosort = TRUE,
nthreads = 17,
annot.ext = lncRmRNA.GRCm38.GTF.file ,
isGTFAnnotationFile = TRUE,
GTF.featureType = 'exon',
GTF.attrType = 'gene_id',
allowMultiOverlap = TRUE)
gencode.GRCh38.mouse <- import(lncRmRNA.GRCm38.GTF.file)
In this tutorial, I use the preprocessed data (Rdata, saved objects, GSE85728.rsubread.gencode & gencode.GRCh38.mouse) generated from the above steps to show the inference procedure below.
Raw data already processed and we get the mapping count data and annotation files.
#---
#this will extract all gene ensembl id from rsubread feathrecount object
#---
GRCm38.genenames <- GSE85728.rsubread.gencode %$% annotation %$%
GeneID %>% sub('.\\d+$','',., perl = F) %>%
mapIds( org.Mm.eg.db, keys = ., column = 'SYMBOL',
keytype = 'ENSEMBL', multiVals = 'first') %>%
make.names( unique = TRUE)
"
I extract the gene alone from then annotation file.
Gencode is a public project to annotate all genes in
human and mouse
"
#> [1] "\nI extract the gene alone from then annotation file.\nGencode is a public project to annotate all genes in\nhuman and mouse\n"
GRCm38.extra.annot <- gencode.GRCm38.mouse %$%
{data.frame( ensembl.id = gene_id,
gene = type,
gene_type = gene_type)}%>%
filter(gene == 'gene') %>% unique()
"
this function aims to get normrlized and log transformed gene
expression value from rsubread mapping count result
@param genes: rsubread feathureCount object
@param annot: annotation list to caculate the RPKM, this param
forward the gene length
"
#> [1] "\nthis function aims to get normrlized and log transformed gene \nexpression value from rsubread mapping count result\n@param genes: rsubread feathureCount object\n@param annot: annotation list to caculate the RPKM, this param\n forward the gene length\n"
rnaseq.rlog.func <- function(genes,annot) {
genes %$% counts %>%
DGEList(genes = annot) %>%
calcNormFactors() %>% rpkm(log = T)
}
GSE85728.rpkm <- rnaseq.rlog.func( GSE85728.rsubread.gencode,
GSE85728.rsubread.gencode$annotation)
"
I change the original column names to the new names having clear
biological meaning.
"
#> [1] "\nI change the original column names to the new names having clear\nbiological meaning.\n"
sample.names <- c( 'LVP0_rep1','LVP0_rep2','LVP0_rep3','LVP3_rep1',
'LVP3_rep2','LVP3_rep3','LVP7_rep1','LVP7_rep2',
'LVP7_rep3','RVP0_rep1','RVP0_rep2','RVP0_rep3',
'RVP3_rep1','RVP3_rep2','RVP3_rep3','RVP7_rep1',
'RVP7_rep2')
colnames(GSE85728.rpkm) <- sample.names
GSE85728.extra.annot <- GSE85728.rsubread.gencode %$%
annotation %$%
GeneID %>%
data.frame(GeneID = .) %>%
inner_join( GRCm38.extra.annot,
by = c('GeneID' = 'ensembl.id')) %>%
mutate(GeneID = as.character(GeneID))
apply.mean.func <- function(df.matrix, col_nums) {
apply(df.matrix[,col_nums],1, mean)
}
comparision.list <- list( LVP0 = 1:3, LVP3 = 4:6,
LVP7 = 7:9, RVP0 = 10:12,
RVP3 = 13:15, RVP7 = 16:17 )
time.point.rpkm <- map(comparision.list, apply.mean.func, df.matrix = GSE85728.rpkm) %>%
as.data.frame() %>% mutate(GeneName = GRCm38.genenames) %>%
mutate(EnsemblID = GSE85728.rsubread.gencode$annotation$GeneID ) %>%
inner_join(GSE85728.extra.annot, by = c('EnsemblID' = 'GeneID')) %>%
as_tibble()
graph.text.size <- 8
left.right.markers <- list()
QC.genename <- 'Myl2'
#---
# I always, like in biological experiments, use the positive
# control to validate my results. In the public data, I prefer
# to choose the known pathway markers to confirm that the data
# is valid compared with the established observations
#---
data.QC.checker <- time.point.rpkm %>%
filter(GeneName == QC.genename) %>%
melt(.) %>%
mutate(stages = as.factor(str_sub(variable, 3,4))) %>%
mutate(stages = factor(stages, levels(stages)[1:3])) %>%
mutate(exprs = value) %>%
mutate(groups = as.factor(str_sub(variable, 1, 1))) %>%
ggplot( data = .,
aes( x = stages, y = exprs, group = groups)) +
geom_line( linetype = 'dashed',
aes(color = groups)) +
geom_point(aes(color = groups)) +
scale_x_discrete( limits = c('P0','P3','P7')) +
xlab('developmental stages') +
ylab('gene expression log(rpkm)') +
labs( title = paste('gene symbol', QC.genename, 'QC checker', sep = ' ')) +
theme( plot.title = element_text(hjust = 0.5)) +
theme( aspect.ratio = 1, text = element_text(size = graph.text.size),
plot.title = element_blank())
#> Using GeneName, EnsemblID, gene, gene_type as id variables
#(data.QC.checker)
# PCA analysis for GSE85728 (RNA-seq), data validation check
#---
tissue.groups <- sample.names %>%
sub('_.*$', '', perl = F, .) %>% as.factor() %>%
data.frame(tissueGroups = .)
GSE85728.vsd.data <- GSE85728.rsubread.gencode %$% counts %>%
DESeqDataSetFromMatrix( countData = .,
colData = tissue.groups,
design = ~ tissueGroups) %>%
DESeq() %>% getVarianceStabilizedData()
#> estimating size factors
#> estimating dispersions
#> gene-wise dispersion estimates
#> mean-dispersion relationship
#> final dispersion estimates
#> fitting model and testing
colnames(GSE85728.vsd.data) <- sample.names
#---
# generate the PCA result by calling the function prcomp
# data needs to be transposed if rows are genes and column
# are samples.
#---
GSE85728.vsd.PCA <- GSE85728.vsd.data %>% t() %>% prcomp()
#---
# quick and dirty map
# I am used to use the base graphics fucntion to
# fast view the result and then use the ggplot
# to make a scientific report for journla publicaiton
#
# deprecated
#---
"
plot(GSE85728.vsd.PCA$x, col = 'white', main = 'GSE85728 LncRNA project')
text( GSE85728.vsd.PCA$x[,1], GSE85728.vsd.PCA$x[,2],
labels = colnames(GSE85728.vsd.data), cex = 0.7)
"
#> [1] "\nplot(GSE85728.vsd.PCA$x, col = 'white', main = 'GSE85728 LncRNA project')\ntext( GSE85728.vsd.PCA$x[,1], GSE85728.vsd.PCA$x[,2], \n labels = colnames(GSE85728.vsd.data), cex = 0.7)\n"
#
# using the pipeline function in tidyverse
# this make life easier and simple.
#---
pca.no <- dim(GSE85728.vsd.PCA$x)[2]
scree.plot.ggplot <- {GSE85728.vsd.PCA$sdev^2} %>%
{./sum(.)} %>%
{data.frame(variance = ., pca = c(1:pca.no)) } %>%
ggplot(data = .) +
xlab('GSE85728 lincRNA Principle Components') +
ylab('Proportion of Variance Explained') +
scale_x_continuous( breaks = c(1:pca.no),
labels = as.character(c(1:pca.no),
limits = as.character(c(1:pca.no)))) +
geom_point(aes(x = pca, y = variance), size = 3) +
geom_line(aes(x = pca, y = variance), size = 0.8) +
scale_linetype_discrete() +
theme(legend.position = 'none') +
theme_classic() +
theme( aspect.ratio = 1, text = element_text(size = graph.text.size))
#(scree.plot.ggplot)
cumsum.plot.ggplot <- {GSE85728.vsd.PCA$sdev^2} %>%
{./sum(.)} %>%
{data.frame(variance = cumsum(.), pca = c(1:pca.no))}%>%
ggplot(data = .) +
xlab('GSE85728 lncR Principle Components') +
ylab('Culmulative Proportion of Variance Explained') +
scale_x_continuous( breaks = c(1:pca.no), labels = as.character(c(1:pca.no),
limits = as.character(c(1:pca.no)))) +
geom_point(aes(x = pca, y = variance), size = 3) +
geom_line(aes(x = pca, y = variance),size = 0.8) +
scale_linetype_discrete() +
theme(legend.position = 'none') +
theme_classic() +
theme( aspect.ratio = 1, text = element_text(size = graph.text.size))
#(cumsum.plot.ggplot)
GSE85728.PCA.ggplot <- as.data.frame(GSE85728.vsd.PCA$x) %>%
mutate(color.choice = { rownames(.) %>%
str_replace_all('_.*$', '') %>%
as.factor}) %>%
ggplot(data = . ) +
geom_point( aes(x = PC1, y = PC2, color = color.choice),
size = 3, show.legend = FALSE) +
scale_colour_manual( name = 'GSE85728 lncRNA sample stages',
values = c( 'magenta','orangered', 'violetred',
'steelblue','royalblue','navyblue'),
labels = sample.names) +
geom_text_repel( aes(x = PC1, y = PC2),
label = sample.names,
color = 'grey80', size = 2.5, fontface = 'bold') +
theme_classic() +
theme( aspect.ratio = 1, text = element_text(size = graph.text.size))
#(GSE85728.PCA.ggplot)
GSE85728.final.QC.figure <- ggdraw() +
draw_plot(data.QC.checker, x = 0, y = .5, width = .5, height = .5) +
draw_plot(scree.plot.ggplot, x = .5, y = .5, width = .5, height = .5) +
draw_plot(cumsum.plot.ggplot, x = 0, y = 0, width = .5, height = .5) +
draw_plot(GSE85728.PCA.ggplot, x = .5, y = 0, width = .5, height = .5) +
draw_plot_label( label = LETTERS[1:4],
x = c(0,.5,0,.5),
y = c(1,1,.5,.5),
size = 10)
(GSE85728.final.QC.figure)
session_info()
From my limited knowledge, most bioinformaticians in small research groups are self-learners with biological background. Usually,they lack of formal training in the methodology of software engineering and quantitative analysis skills. Their routine focus on calling the well-maintained packages and running the pipelines to analyse the small data. Like Roger Peng’s saying, they will experience the three stages in their coding style, which includes snippets, function and package.
Baker,M. (2015) Over half of psychology studies fail reproducibility test. Nature News.
Chen,C.M. et al. (2010) Transcriptional control of left-right patterning in cardiac development. Pediatr.Cardiol., 31, 371–377.
International Consortium of Investigators for Fairness in Trial Data Sharing et al. (2016) Toward fairness in data sharing. N. Engl. J. Med., 375, 405–407.
Prinz,F. et al. (2011) Believe it or not: How much can we rely on published data on potential drug targets? Nat. Rev. Drug Discov., 10, nrd3439–c1.
Touma,M. et al. (2016) Decoding the long noncoding RNA during cardiac maturation: A roadmap for functional discovery. Circ.Cardiovasc.Genet., 9, 395–407.