In:
PLOS ONE, Public Library of Science (PLoS), Vol. 15, No. 12 (2020-12-1), p. e0237412
Abstract:
Regulatory regions, like promoters and enhancers, cover an estimated 5–15% of the human genome. Changes to these sequences are thought to underlie much of human phenotypic variation and a substantial proportion of genetic causes of disease. However, our understanding of their functional encoding in DNA is still very limited. Applying machine or deep learning methods can shed light on this encoding, and gapped k-mer support vector machines (gkm-SVMs) or convolutional neural networks (CNNs) are commonly trained on putative regulatory sequences. Here, we investigate the impact of negative sequence selection on model performance. By training gkm-SVM and CNN models on open chromatin data and corresponding negative training datasets, we compare both learners and two approaches for generating negative training data. Negative sets use either genomic background sequences or sequence shuffles of the positive sequences. Model performance was evaluated on three different tasks: predicting elements active in a cell type, predicting cell-type-specific elements, and predicting elements' relative activity as measured from independent experimental data. Our results indicate strong effects of the negative training data, with genomic backgrounds showing overall best results. Specifically, models trained on highly shuffled sequences perform worse on the complex tasks of tissue-specific activity and quantitative activity prediction, and seem to learn features of artificial sequences rather than regulatory activity. Further, we observe that insufficient matching of genomic background sequences results in model biases. While CNNs achieved and exceeded the performance of gkm-SVMs for larger training datasets, gkm-SVMs gave robust and best results for typical training dataset sizes without the need for hyperparameter optimization.
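Illustration:
The abstract mentions negative sets built from sequence shuffles of the positive sequences. The following is a minimal sketch of one way such negatives could be generated, not the authors' procedure: the function names, the `chunk` parameter (chunk length 1 gives a full nucleotide shuffle, larger values keep short stretches intact), and the toy sequences are all assumptions made for illustration.

```python
import random


def shuffle_sequence(seq, chunk=1, rng=None):
    """Shuffle `seq` by permuting non-overlapping chunks of length `chunk`.

    chunk=1 permutes single nucleotides (a "highly shuffled" sequence);
    larger chunks preserve more of the local sequence context.
    Hypothetical helper, not taken from the paper.
    """
    rng = rng or random.Random()
    pieces = [seq[i:i + chunk] for i in range(0, len(seq), chunk)]
    rng.shuffle(pieces)
    return "".join(pieces)


def make_shuffled_negatives(positives, chunk=1, seed=0):
    """Return one shuffled negative per positive sequence (same length and composition)."""
    rng = random.Random(seed)
    return [shuffle_sequence(s, chunk, rng) for s in positives]


# Toy positive sequences (invented for this example).
positives = ["ACGTGGCATTAC", "TTGACCGGTAAC"]
negatives = make_shuffled_negatives(positives, chunk=1)
```

Each negative keeps the length and nucleotide composition of its positive counterpart, which is the property a shuffle-based negative set relies on; a genomic background set would instead sample real sequences matched on properties such as length and GC content.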
Type of Medium:
Online Resource
ISSN:
1932-6203
DOI:
10.1371/journal.pone.0237412
DOI (components):
10.1371/journal.pone.0237412.g001 to .g004, .t001, .s001 to .s027, .r001 to .r006
Language:
English
Publisher:
Public Library of Science (PLoS)
Publication Date:
2020
ZDB-ID:
2267670-3