Significant non-existence of sequences in genomes and proteomes
- PMID: 33693858
- PMCID: PMC8034619
- DOI: 10.1093/nar/gkab139
Abstract
Minimal absent words (MAWs) are minimal-length oligomers absent from a genome or proteome. Although some artificially synthesized MAWs have deleterious effects, there is still a lack of a strategy for the classification of non-occurring sequences as potentially malicious or benign. In this work, by using Markovian models with multiple-testing correction, we reveal significant absent oligomers, which are statistically expected to exist. This suggests that their absence is due to negative selection. We survey genomes and proteomes covering the diversity of life and find thousands of significant absent sequences. Common significant MAWs are often mono- or dinucleotide tracts, or palindromic. Significant viral MAWs are often restriction sites and may indicate unknown restriction motifs. Surprisingly, significant mammal genome MAWs are often present, but rare, in other mammals, suggesting that they are suppressed but not completely forbidden. Significant human MAWs are frequently present in prokaryotes, suggesting immune function, but rarely present in human viruses, indicating viral mimicry of the host. More than one-fourth of human proteins are one substitution away from containing a significant MAW, with the majority of replacements being predicted harmful. We provide a web-based, interactive database of significant MAWs across genomes and proteomes.
© The Author(s) 2021. Published by Oxford University Press on behalf of Nucleic Acids Research.
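The idea in the abstract can be illustrated with a minimal sketch. This is not the authors' pipeline. It assumes a brute-force MAW search, a maximal-order Markov estimate of the expected count (prefix count times suffix count divided by the count of the shared middle), a Poisson approximation for the probability of seeing zero occurrences, and a Bonferroni correction over the tested MAWs. All function names and the `alpha` parameter are hypothetical.

```python
from itertools import product
from math import exp


def occurrences(seq, word):
    """Count overlapping occurrences of word in seq."""
    return sum(1 for i in range(len(seq) - len(word) + 1) if seq.startswith(word, i))


def minimal_absent_words(seq, max_len, alphabet="ACGT"):
    """Brute-force MAWs of length 2..max_len.

    A word is a MAW if it is absent from seq while its longest proper
    prefix and longest proper suffix are both present.
    """
    present = {
        k: {seq[i:i + k] for i in range(len(seq) - k + 1)}
        for k in range(1, max_len + 1)
    }
    maws = []
    for k in range(2, max_len + 1):
        for chars in product(alphabet, repeat=k):
            w = "".join(chars)
            if (w not in present[k]
                    and w[:-1] in present[k - 1]
                    and w[1:] in present[k - 1]):
                maws.append(w)
    return maws


def expected_count(seq, word):
    """Markov estimate (assumption): N(prefix) * N(suffix) / N(middle)."""
    middle = word[1:-1]
    # For a 2-mer the middle is empty; use the sequence length instead.
    n_middle = occurrences(seq, middle) if middle else len(seq)
    if n_middle == 0:
        return 0.0
    return occurrences(seq, word[:-1]) * occurrences(seq, word[1:]) / n_middle


def significant_maws(seq, max_len, alpha=0.05, alphabet="ACGT"):
    """Keep MAWs whose absence is unlikely under the model.

    Assumes a Poisson count, so P(zero occurrences) = exp(-expected),
    with a Bonferroni correction over all MAWs tested.
    """
    maws = minimal_absent_words(seq, max_len, alphabet)
    hits = []
    for w in maws:
        e = expected_count(seq, w)
        p = exp(-e)
        if p * len(maws) <= alpha:
            hits.append((w, e, p))
    return hits
```

In a toy sequence of 50 `A`s followed by 50 `C`s, the dinucleotide `CA` never occurs although both letters are abundant, so this sketch reports it as a significant MAW. In `ACGTACGT` every absent dinucleotide has a low expected count, so none passes the correction.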
Similar articles
- The bulk and the tail of minimal absent words in genome sequences. Phys Biol. 2016 Apr 4;13(2):026004. doi: 10.1088/1478-3975/13/2/026004. Phys Biol. 2016. PMID: 27043075
- CD-MAWS: An Alignment-Free Phylogeny Estimation Method Using Cosine Distance on Minimal Absent Word Sets. IEEE/ACM Trans Comput Biol Bioinform. 2023 Jan-Feb;20(1):196-205. doi: 10.1109/TCBB.2021.3136792. Epub 2023 Feb 3. IEEE/ACM Trans Comput Biol Bioinform. 2023. PMID: 34928803
- C-terminal motif prediction in eukaryotic proteomes using comparative genomics and statistical over-representation across protein families. BMC Genomics. 2007 Jun 26;8:191. doi: 10.1186/1471-2164-8-191. BMC Genomics. 2007. PMID: 17594486 Free PMC article.
- Viral proteomics: global evaluation of viruses and their interaction with the host. Expert Rev Proteomics. 2007 Dec;4(6):815-29. doi: 10.1586/14789450.4.6.815. Expert Rev Proteomics. 2007. PMID: 18067418 Review.
- Viral proteomics: the emerging cutting-edge of virus research. Sci China Life Sci. 2011 Jun;54(6):502-12. doi: 10.1007/s11427-011-4177-7. Epub 2011 Jun 26. Sci China Life Sci. 2011. PMID: 21706410 Free PMC article. Review.
Cited by
- Frequentmers - a novel way to look at metagenomic next generation sequencing data and an application in detecting liver cirrhosis. BMC Genomics. 2023 Dec 12;24(1):768. doi: 10.1186/s12864-023-09861-w. BMC Genomics. 2023. PMID: 38087204 Free PMC article.
- The fitness cost of spurious phosphorylation. bioRxiv [Preprint]. 2023 Oct 10:2023.10.08.561337. doi: 10.1101/2023.10.08.561337. bioRxiv. 2023. Update in: EMBO J. 2024 Oct;43(20):4720-4751. doi: 10.1038/s44318-024-00200-7. PMID: 37873463 Free PMC article. Updated. Preprint.
- The fitness cost of spurious phosphorylation. EMBO J. 2024 Oct;43(20):4720-4751. doi: 10.1038/s44318-024-00200-7. Epub 2024 Sep 10. EMBO J. 2024. PMID: 39256561 Free PMC article.
- Structural underpinnings of mutation rate variations in the human genome. Nucleic Acids Res. 2023 Aug 11;51(14):7184-7197. doi: 10.1093/nar/gkad551. Nucleic Acids Res. 2023. PMID: 37395403 Free PMC article.
- The determinants of the rarity of nucleic and peptide short sequences in nature. NAR Genom Bioinform. 2024 Apr 4;6(2):lqae029. doi: 10.1093/nargab/lqae029. eCollection 2024 Jun. NAR Genom Bioinform. 2024. PMID: 38584871 Free PMC article.