Int. J. of Computational Biology and Drug Design, 2014, Vol 7 Issue 2/3, pp 195 - 213
High-throughput bacterial RNA-Seq experiments can generate extremely high and imbalanced sequencing coverage. Over- or under-estimation of gene expression levels hinders accurate differential gene expression analysis. Here we evaluated strategies to identify expression differences of genes with high coverage in bacterial transcriptome data using either raw sequence reads or unique reads with duplicate fragments removed. In addition, we proposed a generalised linear model (GLM) based approach to identify imbalance in read coverage based on sequence composition. Our results show that analysis using raw reads identifies more differentially expressed genes, with more accurate fold changes, than analysis using unique reads. We also demonstrate the presence of sequence-composition-related biases that are independent of gene expression levels and experimental conditions. Finally, genes that still showed strong coverage imbalance after correction were tagged using a statistical approach.
bacterial transcriptome sequencing; RNA-Seq; gene differential expression; coverage imbalance; tri-nucleotides; GLM; generalised linear modelling; computational biology; RNA sequences; gene expression levels.
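The comparison between raw reads and unique reads described in the abstract can be illustrated with a minimal sketch. It is not the authors' pipeline. The function names, the toy gene `geneX`, and the rule that reads sharing gene, start position and strand count as duplicates are all assumptions made for illustration. The toy data only shows why collapsing duplicates may compress the apparent fold change of a highly covered gene: the unique-read count cannot exceed the number of distinct start positions.

```python
from collections import Counter


def count_reads(reads, dedupe=False):
    """Count aligned reads per gene.

    reads:  iterable of (gene, start, strand) tuples, one per aligned read.
    dedupe: if True, reads sharing gene, start and strand are counted once
            (an assumed, simplified stand-in for duplicate-fragment removal).
    """
    if dedupe:
        reads = set(reads)
    return Counter(gene for gene, _start, _strand in reads)


def fold_change(counts_a, counts_b, gene):
    """Ratio of read counts for one gene, condition B over condition A."""
    return counts_b[gene] / counts_a[gene]


# Hypothetical toy data: geneX offers 100 possible start positions.
# Condition A: 2 reads at each of 50 positions (100 reads in total).
# Condition B: 20 reads at each of 100 positions (2000 reads in total).
cond_a = [("geneX", pos, "+") for pos in range(50) for _ in range(2)]
cond_b = [("geneX", pos, "+") for pos in range(100) for _ in range(20)]

raw_fc = fold_change(count_reads(cond_a), count_reads(cond_b), "geneX")
unique_fc = fold_change(
    count_reads(cond_a, dedupe=True),
    count_reads(cond_b, dedupe=True),
    "geneX",
)

print(raw_fc, unique_fc)  # prints: 20.0 2.0
```

In this toy case the raw counts give a 20-fold difference, while the deduplicated counts saturate at the number of start positions and give only a 2-fold difference.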