Nucleic Acids Res. 2021 Apr 6;49(6):3139-3155.
doi: 10.1093/nar/gkab139.

Significant non-existence of sequences in genomes and proteomes

Grigorios Koulouras et al. Nucleic Acids Res. 2021.

Abstract

Minimal absent words (MAWs) are minimal-length oligomers absent from a genome or proteome. Although some artificially synthesized MAWs have deleterious effects, there is still a lack of a strategy for the classification of non-occurring sequences as potentially malicious or benign. In this work, by using Markovian models with multiple-testing correction, we reveal significant absent oligomers, which are statistically expected to exist. This suggests that their absence is due to negative selection. We survey genomes and proteomes covering the diversity of life and find thousands of significant absent sequences. Common significant MAWs are often mono- or dinucleotide tracts, or palindromic. Significant viral MAWs are often restriction sites and may indicate unknown restriction motifs. Surprisingly, significant mammal genome MAWs are often present, but rare, in other mammals, suggesting that they are suppressed but not completely forbidden. Significant human MAWs are frequently present in prokaryotes, suggesting immune function, but rarely present in human viruses, indicating viral mimicry of the host. More than one-fourth of human proteins are one substitution away from containing a significant MAW, with the majority of replacements being predicted harmful. We provide a web-based, interactive database of significant MAWs across genomes and proteomes.
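The two ideas in the abstract, finding MAWs and asking whether their absence is statistically surprising, can be illustrated with a minimal sketch. Everything below is an assumption made for illustration, not the authors' implementation: the function names are hypothetical, the expected count uses the classic maximal-order Markov estimator N(prefix) x N(suffix) / N(core), the p-value is a Poisson probability of observing zero occurrences, and the multiple-testing step is a plain Bonferroni correction. The abstract does not state which model orders or corrections the paper actually uses.

```python
from itertools import product
from math import exp


def count_occurrences(seq, word):
    """Count overlapping occurrences of `word` in `seq`."""
    return sum(1 for i in range(len(seq) - len(word) + 1)
               if seq.startswith(word, i))


def minimal_absent_words(seq, alphabet, max_len):
    """Brute-force MAWs of length 2..max_len (single letters are ignored).

    A word is a MAW when it is absent from `seq` while its longest proper
    prefix and longest proper suffix are both present.
    """
    present = {seq[i:i + k]
               for k in range(1, max_len + 1)
               for i in range(len(seq) - k + 1)}
    maws = []
    for k in range(2, max_len + 1):
        for letters in product(alphabet, repeat=k):
            w = "".join(letters)
            if w not in present and w[:-1] in present and w[1:] in present:
                maws.append(w)
    return maws


def expected_count(seq, word):
    """Assumed estimator: N(prefix) * N(suffix) / N(core).

    For 2-letter words the core is empty, so the sequence length is used
    as the denominator (an independence model over single letters).
    """
    core = word[1:-1]
    denom = count_occurrences(seq, core) if core else len(seq)
    return (count_occurrences(seq, word[:-1])
            * count_occurrences(seq, word[1:]) / denom)


def significant_maws(seq, alphabet, max_len, alpha=0.05):
    """Keep MAWs whose Bonferroni-adjusted Poisson p-value is <= alpha."""
    maws = minimal_absent_words(seq, alphabet, max_len)
    results = []
    for w in maws:
        p = exp(-expected_count(seq, w))  # P(zero occurrences | expected)
        if p * len(maws) <= alpha:        # Bonferroni over all tested MAWs
            results.append((w, p))
    return results
```

The brute-force enumeration is exponential in the word length and is only meant for toy inputs; genome-scale work would need a suffix-based index instead.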

Figures

Figure 1.
Comparison of significant genomic absent sequences across mammals. Only MAWs that are shared in at least two species are shown. A red grid-cell indicates a significant MAW (evaluated by Nullomers Assessor) while blue colour denotes a non-significant or present motif.
Figure 2.
(A) Frequencies of 13 human genomic MAWs in Pan troglodytes, Mus musculus and Canis lupus familiaris. The green boxplots show the expected count of each human MAW in each non-human species, while the purple boxes correspond to the observed frequencies. Similarly, in (B) the chimpanzee MAWs have been searched against the genomes of Homo sapiens, M. musculus and C. lupus familiaris. In (C) and (D), the MAWs of M. musculus and C. lupus familiaris, respectively, have been searched against the other three species.
Figure 3.
(A) Venn diagram showing the number of shared MAWs derived from 900 genomes grouped by division. Created using InteractiVenn (http://www.interactivenn.net/index2.html). (B) Distribution of MAW length per division. Each bar represents the count of MAWs for a specific motif length.
Figure 4.
Absent, avoided and frequent pentamers of the human proteome are compared against eukaryotic (metazoa, plant, fungi) and non-eukaryotic (bacteria, archaea, viruses) sequences. The height of each bar indicates the number of occurrences of each motif in eukaryota and prokaryota. The entire Swiss-Prot component of the UniProt database was used as a reference dataset for the analysis.
Figure 5.
Prediction of mutational effects on protein function caused by amino acid replacements which generate (i) absent, (ii) avoided, (iii) frequent as well as (iv) disease implicated peptides in human proteins. Boxplots display the frequency of the predicted mutational effect (benign, possibly damaging, probably damaging) as predicted by the PolyPhen-2 algorithm. White circles depict the actual values of predictions while the outliers are shown as red circles. Each red box summarizes 13 datapoints (one per significant human MAW), each blue and green box summarizes 11 datapoints, and each purple box summarizes 10 datapoints.
Figure 6.
(A) Chord diagram presenting five significant relative absent words that are present in viral genomic sequences (virus families) but absent from the human genome. (B) Chord diagram of correlations between human-derived peptide MAWs and virus families. The highlighted ‘NGLGV’ MAW is present in sequences of Coronaviridae.
Figure 7.
WebLogos of 25-amino-acid sequence windows from (A) 156 aligned spike glycoproteins of HCoV-HKU1 and HCoV-OC43 species in the region where the relative absent word ‘NGLGV’ occurs, (B) 435 aligned sequences from the spike glycoprotein of SARS-CoV-2 from the same protein region, (C) 71 aligned sequence windows from spike glycoproteins of various bat species and (D) five aligned sequences from spike glycoproteins of Betacoronaviruses extracted from pangolins.
Figure 8.
Snapshot from the graphical user interface (GUI) of Nullomers Database. Two interactive panels interconnect sequential annotation with tertiary structures offering a visual environment to explore MAW-making mutations in proteins of interest.
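The notion of a MAW-making mutation, a single amino acid replacement that turns part of a protein into a significant MAW, can be sketched as follows. This is a hypothetical illustration rather than the Nullomers Database logic: the function name and the toy protein are invented for the example, and only the peptide ‘NGLGV’ is taken from the captions above. Positions are 0-based.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def maw_making_substitutions(protein, maws):
    """List (position, original, replacement, maw) for every single-residue
    substitution that makes one of the given MAWs appear in `protein`."""
    maw_set = set(maws)
    lengths = sorted({len(m) for m in maw_set})
    hits = []
    for pos, original in enumerate(protein):
        for aa in AMINO_ACIDS:
            if aa == original:
                continue
            mutated = protein[:pos] + aa + protein[pos + 1:]
            for k in lengths:
                # Only windows that overlap the mutated position can change.
                first = max(0, pos - k + 1)
                last = min(pos, len(mutated) - k)
                for start in range(first, last + 1):
                    window = mutated[start:start + k]
                    if window in maw_set:
                        hits.append((pos, original, aa, window))
    return hits


# Toy protein (hypothetical): replacing A with G at index 4 creates 'NGLGV'.
hits = maw_making_substitutions("MNGLAVK", ["NGLGV"])
```

Restricting the scan to windows that overlap the mutated residue keeps the work proportional to protein length times MAW length, since windows elsewhere in the sequence are unchanged by the substitution.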
