BMC Genomics. 2019 Apr 4;20(Suppl 2):195.
doi: 10.1186/s12864-019-5491-x.

A distance-type measure approach to the analysis of copy number variation in DNA sequencing data

Bipasa Biswas et al. BMC Genomics.

Abstract

Background: Next-generation sequencing technology makes it possible to obtain a large number of short DNA sequence (DNA-seq) reads at a genome-wide level, and DNA-seq data have been collected increasingly in recent years. Count-type data analysis is a widely used approach for DNA-seq data. However, the related data pre-processing is based on the moving window method, in which a window size needs to be defined in order to obtain count-type data. Furthermore, useful information can be lost when the data are pre-processed into counts.

Results: In this study, we propose to analyze DNA-seq data based on the related distance-type measure. Distances are measured in base pairs (bps) between two adjacent alignments of short reads mapped to a reference genome. Our simulation study, based on experimental data, confirms the advantages of the distance-type measure approach in both detection power and detection accuracy. Furthermore, we propose artificial censoring for the distance data so that distances larger than a given value are considered potential outliers. Our purpose is to simplify the pre-processing of DNA-seq data. Statistically, we consider a mixture of right-censored geometric distributions to model the distance data. Additionally, to reduce the GC-content bias, we extend the mixture model to a mixture of generalized linear models (GLMs). The model can be estimated with the Newton-Raphson algorithm as well as the Expectation-Maximization (E-M) algorithm. We have conducted simulations to evaluate the performance of our approach. Based on the rank-based inverse normal transformation of the distance data, we can obtain the related z-values for a follow-up analysis. As an illustration, an application to the DNA-seq data from a pair of normal and tumor cell lines is presented, with a change-point analysis of z-values to detect DNA copy number alterations.

Conclusion: Our distance-type measure approach is novel. It does not require either a fixed or a sliding window procedure for generating count-type data. Its advantages have been demonstrated by our simulation studies and its practical usefulness has been illustrated by an experimental data application.

Keywords: Copy number variation; DNA; Distance-type measure; Genome-wide sequencing; Geometric distribution; Mixture model.
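
The mixture model named in the Results can be illustrated with a minimal sketch of an E-M fit for a mixture of right-censored geometric distributions. This is not the authors' implementation. It assumes distances take values 1, 2, ..., that a censored observation is recorded at the censoring value, and that there are no covariates, so the GC-content GLM extension and the Newton-Raphson step are not covered. The function names, starting values, and censoring value are hypothetical.

```python
import math
import random


def em_censored_geometric(dists, censored, pis, ps, n_iter=200):
    """E-M for a K-component mixture of right-censored geometric distributions.

    dists: recorded distances (censored ones are stored at the cap).
    censored: True where the true distance exceeded the cap.
    pis, ps: starting mixing proportions and success probabilities.
    Assumes support {1, 2, ...}: P(D = d) = p (1 - p)^(d - 1).
    """
    n, k_comp = len(dists), len(pis)
    pis, ps = list(pis), list(ps)
    for _ in range(n_iter):
        # E-step: posterior component weights, computed in log space.
        resp = []
        for d, c in zip(dists, censored):
            logw = []
            for pi_k, p_k in zip(pis, ps):
                if c:  # censored at d: survival P(D > d) = (1 - p)^d
                    ll = d * math.log1p(-p_k)
                else:  # exact observation
                    ll = math.log(p_k) + (d - 1) * math.log1p(-p_k)
                logw.append(math.log(pi_k) + ll)
            m = max(logw)
            w = [math.exp(v - m) for v in logw]
            s = sum(w)
            resp.append([v / s for v in w])
        # M-step: closed form when there are no covariates.
        for k in range(k_comp):
            wk = [r[k] for r in resp]
            total = sum(wk)
            events = sum(w for w, c in zip(wk, censored) if not c)
            exposure = sum(w * d for w, d in zip(wk, dists))
            pis[k] = max(total / n, 1e-12)  # keep log(pi_k) finite
            if exposure > 0:
                ps[k] = min(max(events / exposure, 1e-12), 1 - 1e-12)
    return pis, ps


def simulate_censored_geometric(n, pis, ps, cap, rng=random):
    """Draw n distances from the mixture and censor them at cap (illustrative)."""
    dists, censored = [], []
    for _ in range(n):
        p = rng.choices(ps, weights=pis)[0]
        u = 1.0 - rng.random()  # uniform on (0, 1]
        d = int(math.log(u) / math.log1p(-p)) + 1  # geometric on {1, 2, ...}
        is_cens = d > cap
        dists.append(cap if is_cens else d)
        censored.append(is_cens)
    return dists, censored
```

Under these assumptions the M-step for each component reduces to the weighted number of uncensored observations divided by the weighted sum of recorded distances, so no inner iteration is needed. A call such as `em_censored_geometric(d, c, [0.5, 0.5], [0.05, 0.5])` on simulated data returns the updated proportions and probabilities in the same component order as the starting values.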

Conflict of interest statement

Ethics approval and consent to participate

Not applicable.

Consent for publication

The authors consent to publication.

Competing interests

The authors declare that they have no competing interests.

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Figures

Fig. 1
Illustration of the distance measure: the number of base pairs (bps) between the first aligned bases of two adjacent alignments of short reads mapped to a reference sequence
Fig. 2
RMSE of the parameter estimates (π1 = 0.3, π2 = 0.5, π3 = 0.2 and p1 = 0.002, p2 = 0.02, p3 = 0.2)
Fig. 3
RMSE of the parameter estimates (π1 = 0.008, π2 = 0.754, π3 = 0.238 and p1 = 0.999, p2 = 0.011, p3 = 0.0006)
Fig. 4
Plots of log2 of the distance and INT of the cumulative distribution. (a) Plot of log2 of the distance between short reads for the tumor sample against the position on chromosome 9; (b) Plot of inverse normal transform of the cumulative distribution for the tumor sample versus the position on chromosome 9; (c) Plot of log2 of the distance between short reads for the normal sample against the position on chromosome 9; (d) Plot of inverse normal transform of the cumulative distribution for the normal sample versus the position on chromosome 9
Fig. 5
Plots of INT of the cumulative distribution against the position on chromosome 9, represented as bps
Fig. 6
INT of the cumulative distribution with the estimated mean represented by the red line. (a) Plot for the tumor sample where the red line is the estimated mean from the recursive combination algorithm; (b) plot for the normal sample where the red line is the estimated mean from the recursive combination algorithm
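
The quantities shown in Fig. 1 and Figs. 4 to 6 can be sketched in a few lines. The example below is a rough illustration under stated assumptions, not the authors' code. It takes alignment start positions on one chromosome, computes distances between adjacent starts, applies an artificial censoring value, and maps the distances to z-values with a rank-based inverse normal transform. The function names, the censoring value, and the rank offset of 0.5 are hypothetical choices. The transform here uses empirical ranks, whereas the figures refer to the INT of a cumulative distribution.

```python
from statistics import NormalDist


def adjacent_distances(starts, cap=None):
    """Distances in bp between the first aligned bases of adjacent alignments.

    starts: alignment start positions on one chromosome, in any order.
    cap: hypothetical censoring value; larger distances are recorded as cap
         and flagged as censored.
    Returns (distances, censored_flags).
    """
    ordered = sorted(starts)
    dists, flags = [], []
    for prev, cur in zip(ordered, ordered[1:]):
        d = cur - prev
        is_censored = cap is not None and d > cap
        dists.append(cap if is_censored else d)
        flags.append(is_censored)
    return dists, flags


def rank_inverse_normal(values, offset=0.5):
    """Rank-based inverse normal transform; tied values share an average rank."""
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        while j + 1 < n and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1  # average 1-based rank of the tie group
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    std_normal = NormalDist()
    # (rank - offset) / n lies strictly inside (0, 1) for offset = 0.5.
    return [std_normal.inv_cdf((r - offset) / n) for r in ranks]
```

For example, `adjacent_distances([900, 100, 131, 105, 130], cap=500)` gives the distances `[5, 25, 1, 500]`, with only the last one flagged as censored. Passing those distances to `rank_inverse_normal` yields z-values that could feed a change-point analysis, which is not sketched here.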
