In a recent study published in Cell Reports, researchers performed a genomic analysis to investigate the origination of human microproteins of biological importance.
Studies have reported that small open reading frames (sORFs) encode functional microproteins essential for several biological processes. However, the origin and conservation of such microproteins have not been well characterized. Genomic analysis of microproteins could deepen understanding of the human genomic characteristics that are critical for their function.
About the study
In the present study, researchers investigated the origin of functional human microproteins. They focused on cases in which the proteins evolved from non-coding sequences and acquired biological importance.
The study comprised open reading frames (ORFs) that were found to be translated in a previous study (Chen et al.) and were reported in the human FANTOM-CAT transcriptome dataset by Hon et al. The analysis was restricted to ORFs situated on noncoding transcripts ('new'), located upstream of coding ORFs ('upstream'), located downstream of coding ORFs ('downstream'), or situated on transcripts devoid of coding ORFs but belonging to transcript families with one coding member ('new_iso'). The team matched ORFs from the two previous studies on the basis of similar chromosomal coordinates, 100% sequence identity, and comparable lengths.
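The matching step can be illustrated with a minimal Python sketch. It is not the authors' pipeline: the `Orf` record, the overlap test used for "coordinate similarity", and the `min_length_ratio` cutoff of 0.9 are all assumptions made for illustration, and 100% identity is approximated by requiring the shorter sequence to occur exactly within the longer one.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Orf:
    """Hypothetical record for one ORF (field names are illustrative)."""
    orf_id: str
    chrom: str
    strand: str
    start: int  # 0-based, inclusive
    end: int    # exclusive
    seq: str    # nucleotide sequence


def coordinates_overlap(a: Orf, b: Orf) -> bool:
    """Assumed notion of 'coordinate similarity': same chromosome and
    strand, with overlapping intervals."""
    return (a.chrom == b.chrom and a.strand == b.strand
            and a.start < b.end and b.start < a.end)


def is_match(a: Orf, b: Orf, min_length_ratio: float = 0.9) -> bool:
    """True if two ORFs plausibly describe the same locus.

    min_length_ratio is an assumed cutoff for 'comparable lengths'.
    """
    if not coordinates_overlap(a, b):
        return False
    shorter, longer = sorted((a.seq.upper(), b.seq.upper()), key=len)
    if not shorter or shorter not in longer:
        return False  # not 100% identical over the shared region
    return len(shorter) / len(longer) >= min_length_ratio


def match_orfs(set_a, set_b, min_length_ratio: float = 0.9):
    """Return (id_a, id_b) pairs for every matching ORF pair."""
    return [(a.orf_id, b.orf_id)
            for a in set_a for b in set_b
            if is_match(a, b, min_length_ratio)]
```

The all-against-all comparison is quadratic, which is acceptable for a few hundred ORFs; a larger dataset would call for indexing by chromosome first.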
In total, 715 ORFs, situated on 527 transcripts, were analyzed. Data on fitness effects, phenotypic scores, and significance-based classification of the ORFs, generated using induced pluripotent stem cells, were obtained from previous studies. The coding potential assessment tool (CPAT) was applied to the ORF sequences to determine coding probability scores. Ribonucleic acid sequencing (RNA-seq) data were mapped to the relevant genome assemblies. Orthologous transcription was inferred from reference transcriptomes and expression data analysis.
Further, orthologous genomic regions were identified, and the presence of ancestral ORFs was inferred, following which functional signatures were assessed. To estimate the origination timing of every ORF (i.e., the most ancient ancestor with an intact ORF), the team searched for orthologous chromosomal regions of the human ORFs in the genomic data of 99 vertebrate species. The team aligned the orthologous sequences of all ORFs and subjected them to phylogenetic codon substitution frequencies (PhyloCSF) analysis. Ancestral sequence reconstruction (ASR) analysis was performed to infer the presence or absence of ORFs at human ancestor nodes based on ORF lengths.
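As a rough illustration of length-based ORF inference at ancestral nodes, the Python sketch below scans each reconstructed sequence for its longest forward-strand ORF and reports the most ancient node that still passes a length cutoff. The function names, the forward-strand-only scan, and the `min_fraction` default of 0.7 are assumptions for illustration and are not taken from the study; gaps in presence along the lineage are ignored for simplicity.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def longest_orf_length(seq: str) -> int:
    """Length in nucleotides (stop codon included) of the longest
    ATG-to-stop ORF in the three forward reading frames."""
    seq = seq.upper()
    best = 0
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOP_CODONS:
                best = max(best, i + 3 - start)
                start = None
    return best


def orf_present(ancestral_seq: str, human_orf_length: int,
                min_fraction: float = 0.7) -> bool:
    """Call the ORF present if the ancestral ORF retains at least
    min_fraction (an assumed cutoff) of the human ORF length."""
    return longest_orf_length(ancestral_seq) >= min_fraction * human_orf_length


def oldest_node_with_orf(nodes, human_orf_length: int,
                         min_fraction: float = 0.7):
    """nodes: (name, reconstructed_sequence) pairs ordered from the most
    recent ancestor to the most ancient. Returns the name of the most
    ancient node with an ORF, or None if no node passes."""
    oldest = None
    for name, seq in nodes:
        if orf_present(seq, human_orf_length, min_fraction):
            oldest = name
    return oldest
```

A real analysis would take the reconstructed sequences from an ASR tool and would need to handle alignment gaps and both strands, which this sketch leaves out.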
The origination timing of a microprotein was defined as the first node at which both the ORF and its transcript were detectable (putative origin), independent of the origination mode. When ancestors lacking an intact ORF preceded ancestors possessing one, the origination mode was termed de novo. Data on the origination timings of ORFs and transcripts were combined to infer the origination timing of microproteins with de novo origin. To evaluate the effect of ORF lengths, strict (50%) and relaxed (80%) de novo attribution values were assessed. The team investigated the biological importance and functionality of the de novo-emerged microproteins. All known single-nucleotide polymorphisms (SNPs) annotated as pathogenic or likely pathogenic were surveyed.
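The de novo rule described above can be sketched in Python as follows. This is a simplified reading of the text, not the study's implementation: the function name, the three-state encoding of each ancestral node, and the "undetermined" label are assumptions, and the sketch does not model transcript detection or the strict and relaxed attribution values.

```python
from typing import Optional, Sequence, Tuple


def classify_origin(
    nodes: Sequence[Tuple[str, Optional[bool]]],
) -> Tuple[Optional[str], str]:
    """Classify the origin of one ORF along the human lineage.

    nodes are ordered from the most ancient ancestor to the most recent.
    Each state is assumed to be:
      True  -> intact ORF inferred at this node
      False -> orthologous region found, but no intact ORF
      None  -> no orthologous region found
    Returns (putative_origin_node, mode).
    """
    for index, (name, state) in enumerate(nodes):
        if state is True:
            older_states = [s for _, s in nodes[:index]]
            # De novo: at least one older ancestor had the region
            # but lacked an intact ORF.
            if any(s is False for s in older_states):
                return name, "de novo"
            return name, "undetermined"
    return None, "undetermined"
```

For example, a lineage encoded as `[("vertebrate", None), ("mammal", False), ("primate", True)]` would be classified as originating de novo at the primate node under these assumptions.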