Student Theses and Dissertations


Eric Bo Zheng

Date of Award


Document Type


Degree Name

Doctor of Philosophy (PhD)

RU Laboratory

Zhao Laboratory


Evolution is at its heart the study of variation: which variants survive and even thrive across generations, and why? Behind this lies the fundamental question of the origins of novelty, for there cannot be variation without a source of new variants. Understanding how novelty emerges is therefore a central problem in evolutionary biology. In this thesis, I examine two aspects of the origins of novelty: first, how new proteins originate and subsequently evolve; second, how chromatin accessibility maintains evolutionary lability while remaining conserved across broad sequence divergence.
The study of the origins of new genes has flourished in the genomic era, as more genomes have been sequenced. De novo gene birth, in which a previously non-genic genomic sequence becomes genic through evolution, is a particularly exciting mechanism for the origin of new genes. While many de novo genes have been proposed to be protein-coding, and several have been experimentally shown to yield protein products, their systematic study as proteins has been hampered by doubts about whether their transcripts are translated in the absence of experimentally observed protein products. Using a systematic, ORF-focused, mass-spectrometry-first computational approach, I identify protein evidence for almost 1000 unannotated open reading frames with evidence of translation (utORFs) in the model organism Drosophila melanogaster. Using an integrative comparative genomics approach, I then identify different properties and evolutionary patterns among these utORFs, including their implied gene ages. My results suggest that there is substantial unappreciated diversity in de novo protein evolution: many more de novo proteins may exist than previously appreciated; they may follow divergent evolutionary trajectories; and they may be gained and lost frequently.
Turning from genes to their regulation, I next examine the evolution of regulatory regions in the genome as part of a collaboration. The evolution of regulation plays a critical role in shaping the diversity of life, as the diversity of protein-coding sequences alone is insufficient to account for the observed diversity in phenotypes. Despite substantial genetic divergence, chromatin accessibility in the head and testis of Drosophila is generally conserved at the phenotypic level between species. Applying deep neural networks as a tool to investigate the sequence determinants that govern chromatin accessibility, we find that hybrid convolution-attention neural networks can predict ATAC-seq peaks using only local DNA sequences as input. This predictive capability is maintained across species, even very distantly related ones, suggesting that these models capture conserved sequence-dependent processes that determine chromatin accessibility. Examining regions with species-specific changes in chromatin accessibility, we find that their orthologous inaccessible regions in other species have unusual model outputs, suggesting that these regions may be ancestrally poised for evolution. Finally, using an array of in silico genetics experiments, we find that chromatin accessibility is simultaneously robust to random mutation and labile to extremely strong selection.
Together, these results illuminate aspects of the origins of evolutionary novelty and suggest new directions for further experimental and computational studies. All in all, there is likely no single characteristic model of evolution, but rather complex origins and diverse evolutionary mechanisms.


A Thesis Presented to the Faculty of The Rockefeller University in Partial Fulfillment of the Requirements for the degree of Doctor of Philosophy

Available for download on Monday, October 23, 2023

Included in

Life Sciences Commons