Copy number variation detection in whole-genome sequencing data using the Bayesian information criterion

Xi, Ruibin ; Hadjipanayis, Angela G. ; Luquette, Lovelace J. ; [et al.]

Proceedings of the National Academy of Sciences ; 2011

In: Proceedings of the National Academy of Sciences, Vol. 108, No. 46 (2011-11-15)

Abstract: This work demonstrates that high-coverage, whole-genome sequencing can be used to identify numerous somatic CNVs in the tumor genome that are missed by microarray-based methods. As whole-genome sequencing studies become widespread, the BIC-seq method can be used to quickly and efficiently characterize CNVs to identify potential variants that might be related to disease phenotypes.

We compared these CNVs with those detected by two microarray platforms, Agilent 244K CGH microarrays and Affymetrix SNP 6.0 arrays. We found that for the large CNVs (> 15 kb), the copy ratio estimates given by BIC-seq and the two microarray platforms, especially BIC-seq and the Agilent CGH array, were close. However, many small CNVs were missed by the two microarray platforms. To test the accuracy of the copy ratio estimates given by BIC-seq for small CNVs (less than 15 kb), we selected 16 CNVs, ranging from 110 bp to 14 kb, for qPCR validation. The experimental validation confirmed that 14 out of 16 CNV calls are true CNVs, demonstrating the accuracy of BIC-seq. We observed that several of the validated small CNVs overlapped with cancer-related genes, including INPP1, ADAM12, MGMT, MLL3, PCBD2, and AMY2A, whose copy number alteration may influence cancer development and progression. Fig. P1 illustrates an example of the validated small CNVs.

Fig. P1. A qPCR-validated CNV (350 bp) that was missed by the array-based platforms. (A) The local profile given by BIC-seq (red line). The circles are the copy ratios calculated based on 10-bp bins. The regions marked by cyan and purple lines are the 95% credible intervals for the left and right breakpoints of the CNV. The CNV overlaps with the intronic region of the gene MLL3 and the position of the CNV in the gene is marked by the dark cyan bar. (B) The profiles given by the Affymetrix and Agilent platforms.

We applied BIC-seq to profiles from a cancer patient, sequenced as part of the Cancer Genome Atlas project. Tumor and matched control (blood) DNA were obtained from the same individual and sequenced. We obtained 833 million (10×) and 603 million (7×) uniquely aligned 35 bp reads for the tumor and its matched normal genome, respectively. We set the tuning parameter λ = 4. After filtering for segments with copy ratios, we obtained 291 candidate CNVs, which covered 89 Mb (3%) of the human genome. Among the 291 somatic CNVs, 192 (74%) showed at least a partial overlap with 1,926 genes and 170 (58%) with protein coding sequences of the genes. We further compared these CNVs with cancer genes listed in Futreal et al. (3) and found that 19 out of the 291 somatic CNVs showed overlap with 22 out of 288 cancer genes. These 22 genes include the well-known cancer-related genes EGFR and CDKN2A, whose amplification and deletion have been frequently observed in glioblastoma (4, 5).

BIC-seq has a single tuning parameter λ, which controls the smoothness of the CNV profile. The choice of λ is related to the sensitivity and specificity of BIC-seq. A larger λ gives CNV calls with fewer false discoveries, whereas a smaller λ is more sensitive. To obtain optimal results, CNVs should be first filtered by a threshold value of the copy ratio. For very low coverage data (< 1×), a small λ (e.g., 1 or 1.2) can achieve good sensitivity. To further reduce FDR, one can apply a p-value threshold to remove the less significant, small CNVs. For medium coverage (2–5×), a larger λ (e.g., 2) should work well with no additional p-value filtering. For high coverage (10–30×), λ = 4 will give very confident calls while still detecting many small CNVs (100–1,000 bp).

Currently available algorithms such as SegSeq (1) and CNVseq (2) usually assume a parametric model (e.g., a Poisson model) and perform statistical tests for CNV detection. However, many datasets we examined cannot be approximated well with a global parametric model.
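
The coverage-dependent guidance for λ given above can be written as a small lookup. This is only an illustrative sketch: the function name `suggest_lambda` is hypothetical and not part of any BIC-seq software, the returned values are the examples quoted in the abstract, and coverages between or beyond the quoted ranges return `None` because the abstract gives no recommendation for them.

```python
from typing import Optional


def suggest_lambda(coverage: float) -> Optional[float]:
    """Return an example lambda for a given sequencing coverage.

    Hypothetical helper: it only encodes the ranges quoted in the abstract
    (< 1x, 2-5x, 10-30x) and returns None everywhere else.
    """
    if coverage <= 0:
        raise ValueError("coverage must be positive")
    if coverage < 1:
        return 1.2  # very low coverage: the abstract suggests 1 or 1.2
    if 2 <= coverage <= 5:
        return 2.0  # medium coverage
    if 10 <= coverage <= 30:
        return 4.0  # high coverage
    return None  # no guidance is given for this coverage


if __name__ == "__main__":
    for cov in (0.5, 3, 7, 10):
        print(cov, suggest_lambda(cov))
```

The gaps are left unfilled on purpose; interpolating between the quoted ranges would be a guess rather than something the abstract supports.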
The BIC-seq algorithm proposed in this paper is based on a nonparametric model. It starts with the genome segmented into small bins of equal size and iteratively combines adjacent bins that are deemed to have the same copy number, based on a well-known statistic called the Bayesian information criterion (BIC). The iteration stops when the BIC cannot be reduced further. Because of its nonparametric nature, the proposed algorithm is more robust to outliers and to datasets that cannot be well approximated with a parametric model. It is also computationally fast and able to handle high-coverage genomes (> 10×) effectively. The statistical framework of BIC-seq can be easily extended to detect recurrent CNVs in multiple tumor samples. Furthermore, a method called Gibbs sampling is applied to assign error bars to breakpoints and a resampling strategy is used to estimate the false discovery rate (FDR) for the CNV calls given by BIC-seq. A comparison with some of the existing methods shows that BIC-seq has higher statistical power for detecting many types of CNVs at the same FDR level.

A common approach to estimate DNA copy number from whole-genome sequencing data is based on the density of the sequenced short reads, which are short sequences of random DNA (36–100 bp), along the genome. The read density in a region with a heterozygous deletion (a one-copy deletion in a diploid genome such as the human genome), for instance, should be about half of the read density in its neighboring regions. However, although the density of aligned reads (short reads that can be successfully mapped to the reference genome) generally corresponds to the DNA copy number, it is also affected by many sources of bias in sample preparation and sequencing, such as genome mappability, GC content, irregular fragmentation of the genome, and uneven PCR amplification. Thus, somatic CNVs in the tumor genome can be best identified by comparing the read distribution in the tumor genome with that of the matched normal genome. The genomic regions with disproportionate read counts indicate potential CNVs; for example, the prevalence of tumor reads over normal ones suggests an increased copy number in the tumor genome.

DNA copy number variations (CNVs), which are gains or deletions of genomic segments, play an important role in the pathogenesis and progression of cancer and confer susceptibility to a variety of human disorders. In the past decade, a technique called array comparative genomic hybridization (CGH), based on microarray technology, has allowed a genome-wide characterization of these CNVs in cancer genomes as well as in the genomes of the normal population. With the decreasing cost of DNA sequencing on the next-generation sequencing platforms, sequencing of the entire genome is becoming commonplace, enabling an order-of-magnitude improvement in sensitivity and resolution for characterizing CNVs. However, as the data on a cancer genome and its matched control, e.g., at 30× coverage (each nucleotide is sequenced 30 times on average), exceed half a terabyte in size, an efficient and robust computational method is needed. In this study, we developed an algorithm called BIC-seq (Bayesian information criterion sequencing) to detect CNVs from whole-genome sequencing data and applied it to a newly sequenced tumor genome, with a matched control, from a patient with a brain cancer called glioblastoma. Using BIC-seq, we identified hundreds of CNVs as small as 40 bp in the cancer genome sequenced at 10× coverage, although we could only detect large CNVs (> 15 kb) in the profiles of the same genome obtained by array CGH.
Fourteen of the 16 small variants tested (110 bp to 14 kb) were experimentally validated by quantitative PCR (qPCR), demonstrating high sensitivity and true positive rate of the algorithm.
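
The bin-merging idea described in the abstract (start from equal-size bins, merge adjacent bins while the BIC decreases, then read off tumor-to-normal copy ratios) can be sketched as follows. This is a minimal illustration, not the published implementation: the abstract does not give the likelihood or the penalty, so the sketch assumes each read in a segment is a tumor read with a segment-specific probability (a Bernoulli model) and a penalty of λ · log(total reads) per segment. All function names are hypothetical.

```python
import math


def _k_log_frac(k: int, total: int) -> float:
    """k * log(k / total), using the convention 0 * log(0) = 0."""
    return k * math.log(k / total) if k > 0 else 0.0


def segment_loglik(tumor: int, normal: int) -> float:
    """Maximized log-likelihood of one segment (assumed Bernoulli model)."""
    total = tumor + normal
    return _k_log_frac(tumor, total) + _k_log_frac(normal, total)


def merge_bins(bins, lam):
    """Greedily merge adjacent (tumor, normal) bins while the BIC decreases."""
    segments = [tuple(b) for b in bins]
    n_reads = sum(t + n for t, n in segments)
    if n_reads == 0:
        raise ValueError("bins contain no reads")
    penalty = lam * math.log(n_reads)  # assumed cost of one extra segment
    while len(segments) > 1:
        best_i, best_delta = None, 0.0
        for i in range(len(segments) - 1):
            (t1, n1), (t2, n2) = segments[i], segments[i + 1]
            merged = segment_loglik(t1 + t2, n1 + n2)
            split = segment_loglik(t1, n1) + segment_loglik(t2, n2)
            # Change in BIC if this pair is merged (negative = improvement).
            delta = -2.0 * (merged - split) - penalty
            if delta < best_delta:
                best_i, best_delta = i, delta
        if best_i is None:  # no merge lowers the BIC any further
            break
        (t1, n1), (t2, n2) = segments[best_i], segments[best_i + 1]
        segments[best_i:best_i + 2] = [(t1 + t2, n1 + n2)]
    return segments


def copy_ratios(segments):
    """Tumor/normal read ratio per segment, normalized by library size."""
    total_t = sum(t for t, _ in segments)
    total_n = sum(n for _, n in segments)
    if total_t == 0 or total_n == 0:
        raise ValueError("need reads from both tumor and normal samples")
    return [(t / total_t) / (n / total_n) if n > 0 else float("nan")
            for t, n in segments]


if __name__ == "__main__":
    # Toy data: four balanced bins, three bins with half the tumor reads
    # (mimicking a one-copy deletion), then four balanced bins.
    toy_bins = [(200, 200)] * 4 + [(100, 200)] * 3 + [(200, 200)] * 4
    toy_segments = merge_bins(toy_bins, lam=4.0)
    print(toy_segments)
    print(copy_ratios(toy_segments))
```

On the toy data the eleven bins should collapse into three segments, with the middle segment showing about half the copy ratio of its neighbors, which matches the heterozygous-deletion example in the text. The quadratic greedy loop is chosen for readability; a genome-scale run would need a more efficient merge schedule.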

Type of Medium: Online Resource

ISSN: 0027-8424, 1091-6490

DOI: 10.1073/pnas.1110574108

RVK: TA 1000

RVK: WA 15000

Language: English

Publisher: Proceedings of the National Academy of Sciences

Publication Date: 2011

ZDB-ID: 209104-5