Yeast “adopt a proto-gene” project

What is the adopt a proto-gene project?

In a synergistic educational activity designed to promote literacy in evolutionary biology, a novel “adopt a proto-gene” initiative provides resources for students, and for educators working with undergraduates, to characterize individual proto-genes at their home institutions. The project provides:

  • modules for undergraduate students to explore proto-genes in the model eukaryote Saccharomyces cerevisiae (below on this page)
  • virtual workshops to assist faculty in using the modules with undergraduates at their home institutions
  • summer undergraduate research experiences at the University of Pittsburgh

What is a proto-gene?

It has become increasingly clear that eukaryotic genomes are pervasively transcribed and translated.
Thousands of small, evolutionarily novel polypeptides expand the coding potential of fungal, plant and
animal genomes beyond established protein-coding genes. Genomic scientists have proposed that pervasive
translation generates a reservoir of “proto-genes” that promote de novo gene birth by exposing genetic
variation to natural selection in the form of novel polypeptides. Some proto-genes are occasionally
retained by selection and become de novo genes, but most eventually return to a non-genic state. Aside
from their evolutionary potential, how do proto-genes impact cell biology? The physiological significance
of proto-genes has not yet been systematically explored in any species. As a result of this gap in
knowledge, current models of cellular systems are missing thousands of genetic elements that are
potentially critical for understanding genotype-phenotype relationships. This missing biology is likely to
explain key molecular differences between species, to unveil novel mechanisms of evolutionary
adaptation, and to shed light on the first steps of de novo gene emergence.

The discovery of proto-genes is described in this paper:

A second paper covers proto-gene RNA-seq analyses:

The modules from the 2023 edition of this workshop (below) explain how you and your students can access these datasets on SGD.

The “adopt a proto-gene” initiative is supported by an NSF-CAREER award to Anne-Ruxandra Carvunis, Associate Professor in the Department of Computational and Systems Biology at the University of Pittsburgh School of Medicine.

2024 workshop: Gene Expression
Lab Modules

Introduction Guide (google doc)

In this series of modules, we will explore the biology of proto-genes by looking at their expression and regulation.

Module 1: Differential Expression

Explore the regulation and potential function of proto-genes by looking at their expression in a treatment condition. In this module, you will start with an RNA sequencing dataset of interest, such as one covering salt stress, exposure to certain drugs, or different nutrient sources. Find proto-genes that are differentially expressed (either more or less) in these conditions compared to normal conditions. What can this tell us about what these proto-genes might do?
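The idea of "more or less expressed" can be illustrated with a minimal Python sketch. It is not the workflow from the guide: the ORF names, expression values, pseudocount, and 2-fold threshold are all assumptions made up for illustration, and a real analysis would use replicates and a statistical test rather than a simple fold-change cutoff.

```python
import math

# Hypothetical normalized expression values (for example, counts per million).
# The ORF names and numbers are invented for illustration only.
expression = {
    "protoA": {"control": 10.0, "treatment": 85.0},
    "protoB": {"control": 40.0, "treatment": 38.0},
    "protoC": {"control": 64.0, "treatment": 7.0},
}

PSEUDOCOUNT = 1.0  # assumed value; avoids division by zero for unexpressed ORFs
THRESHOLD = 1.0    # assumed cutoff; |log2 fold change| >= 1 is at least a 2-fold change


def log2_fold_change(control, treatment, pseudocount=PSEUDOCOUNT):
    """Return log2 of the treatment/control ratio after adding a pseudocount."""
    return math.log2((treatment + pseudocount) / (control + pseudocount))


def classify(expression, threshold=THRESHOLD):
    """Split ORFs into up-regulated, down-regulated, and unchanged lists."""
    up, down, unchanged = [], [], []
    for name, values in expression.items():
        lfc = log2_fold_change(values["control"], values["treatment"])
        if lfc >= threshold:
            up.append(name)
        elif lfc <= -threshold:
            down.append(name)
        else:
            unchanged.append(name)
    return up, down, unchanged


up, down, unchanged = classify(expression)
print("up:", up)
print("down:", down)
print("unchanged:", unchanged)
```

In this toy table, the first ORF rises sharply in the treatment, the third drops, and the second barely changes, so they fall into the up, down, and unchanged lists respectively.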

Differential Expression Guide (google doc)

Differential Expression Worksheet (google doc)

Note: If technical errors occur, the output files from this module (the list of DE proto-genes and the list of non-DE proto-genes) can be found in the following Google Drive folder:
Use the subfolder that corresponds to the GEO dataset you are using.

Module 2: Regulatory Motifs

Not all genes are “on” or active all the time, and many genes only need to be active under certain conditions. How does the cell activate these genes at the right time? One way is through various regulators such as transcription factors, which are proteins that bind near genes and activate or repress them. These transcription factors usually bind to specific DNA sequences. Can we use the DNA sequences near the differentially expressed proto-genes to find potential regulators?
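As a rough illustration of searching DNA near a gene for a binding sequence, the Python sketch below scans a sequence on both strands for a consensus motif written with IUPAC codes. The upstream sequences, ORF names, and the motif are assumptions chosen for illustration, and the function names are hypothetical. Counting matches like this is only a first look; it does not test whether a motif is enriched, which is what dedicated motif tools do.

```python
import re

# IUPAC nucleotide codes mapped to regular-expression character classes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]", "B": "[CGT]", "D": "[AGT]",
    "H": "[ACT]", "V": "[ACG]", "N": "[ACGT]",
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an uppercase DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def find_motif(seq, consensus):
    """Return sorted (start, strand) hits; start is 0-based on the given strand."""
    # A lookahead group lets overlapping matches be reported as well.
    pattern = re.compile("(?=(" + "".join(IUPAC[b] for b in consensus) + "))")
    seq = seq.upper()
    n = len(consensus)
    hits = [(m.start(), "+") for m in pattern.finditer(seq)]
    # Matches on the reverse strand are converted back to forward coordinates.
    rc = reverse_complement(seq)
    hits += [(len(seq) - m.start() - n, "-") for m in pattern.finditer(rc)]
    return sorted(hits)


# Hypothetical upstream sequences; names and sequences are invented.
upstream = {
    "protoA": "TTAGGGGCATCCCCTAA",
    "protoB": "ATATATATAT",
}
MOTIF = "AGGGG"  # example motif used here purely for illustration

for name, seq in upstream.items():
    print(name, find_motif(seq, MOTIF))
```

If the same motif shows up near many of the differentially expressed proto-genes but rarely near the others, the transcription factor that recognizes it becomes a candidate regulator worth following up.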

Regulatory Motifs Guide (google doc)

Regulatory Motifs Worksheet (google doc)

Module 3: Genome Browser

What else can we learn about these differentially expressed proto-genes? We can use a genome browser to integrate and visualize many different datasets to answer the following questions: Where are these differentially expressed proto-genes located in the genome? What do their RNA transcripts look like? Is any part of their sequence conserved? How much are they translated? Can we find experimental evidence that the predicted TF regulators (from Module 2) bind near these proto-genes?

Genome Browser Guide (google doc)

Genome Browser Worksheet (google doc)

Module 4: Coexpression

What other genes are expressed when our proto-genes of interest are expressed? We can look at the correlation, or similarity, in expression between our proto-genes and other genes. Strong correlation suggests that a proto-gene could be involved in the same processes as the genes it is coexpressed with. Are these coexpressed genes also differentially expressed in our RNA sequencing dataset (from Module 1)?
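One common way to measure similarity in expression is the Pearson correlation between two expression profiles. The Python sketch below is a minimal illustration under that assumption; the coexpression resource used in the module may rely on a different measure, and the gene names and expression values here are invented.

```python
import math


def pearson(x, y):
    """Pearson correlation between two equal-length expression profiles."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)


def rank_partners(query, profiles):
    """Return (gene, correlation) pairs sorted from most to least correlated."""
    scores = [(gene, pearson(query, values)) for gene, values in profiles.items()]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


# Hypothetical expression across five conditions; all values are invented.
proto_gene = [1.0, 2.0, 3.0, 4.0, 5.0]
other_genes = {
    "geneX": [2.0, 4.0, 6.0, 8.0, 10.0],  # rises together with the proto-gene
    "geneY": [5.0, 4.0, 3.0, 2.0, 1.0],   # falls as the proto-gene rises
    "geneZ": [3.0, 1.0, 4.0, 1.0, 5.0],   # only loosely related
}

for gene, r in rank_partners(proto_gene, other_genes):
    print(f"{gene}: r = {r:.2f}")
```

Genes at the top of the ranking are the strongest coexpression partners in this toy example; a strongly negative value, as for the second gene, points to opposite regulation rather than no relationship.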

Coexpression Module (google doc)

Coexpression Worksheet (google doc)

Module 5: Localization

The cell is compartmentalized and has different organelles that perform different functions. Many proteins that are involved in the same process will go to, or localize at, the same part of the cell. Where do the products of our proto-genes of interest localize? Do the products of the other differentially expressed genes localize there too?

Localization Module (google doc)

Localization Worksheet (google doc)

other links

2024 workshop welcome & introduction presentation (Y3_Intro)

student presentation template (google slides)

faculty presentation template (google slides)

2023 workshop: Gene Evolution
Lab Modules

These modules allow students to use cutting-edge bioinformatics algorithms to explore proto-genes via web-based tools without requiring any coding experience. Linked here are proto-genes we pre-curated to serve as good illustrations for each module.

Module 1. Genome Browser
This module provides an introduction to the SGD site and the JBrowse genome browser. In this module, participants will learn how to use a genome browser to view the position of genes relative to one another and how to integrate across different data types at the genome scale. A video walkthrough of this module can be viewed at

Genome Browser Guide (google doc) (PDF)

Genome Browser Worksheet (google doc) (PDF)

Module 2. Cellular Localization
This module provides instructions to predict protein localization from amino acid sequence and to identify sequence or structure motifs important for predicting localization. A video walkthrough of this module can be viewed at

Cellular Localization Guide (google doc) (PDF)

Cellular Localization Worksheet (google doc) (PDF)

Module 3. Structure Prediction
This module provides instructions to predict protein structure from amino acid sequence and to search for proteins with similar structures using cutting-edge machine learning algorithms such as ESMFold and Foldseek. Coming soon: a video walkthrough of this module.

Structure Prediction Guide (google doc) (PDF)

Structure Prediction Worksheet (google doc) (PDF)

Module 4. Coexpression
This module provides instructions on how to query a large coexpression network to identify genes and other proto-genes that have similar transcriptional patterns. Coming soon: a video walkthrough of this module.

Coexpression Guide (google doc) (PDF)

Coexpression Worksheet (google doc) (PDF)

Module 5. Ancestral Reconstruction
This module provides instructions for reconstructing the ancestral sequences that gave rise to a given extant sequence and for using alignment tools to compare sequence similarity across yeast species. Coming soon: a video walkthrough of this module.

Ancestral Reconstruction Guide (google doc) (PDF)

Ancestral Reconstruction Worksheet (google doc) (PDF)

other links

list of proto-genes (google sheet) (excel file)

2023 workshop welcome & introduction presentation (PDF)

student presentation template (google slides)

faculty presentation template (google slides)


How to Use a Genome Browser: JBROWSE: GUIDE

How to Use a Genome Browser: JBROWSE: WORKSHEET

NOTE: If the worksheet (docx) links don’t work, right-click the link, copy the link address, and paste it into a new tab to download the file.