Metazoans typically keep the same genetic information in all body cells throughout their life cycle. However, some organisms are exceptions to this rule because they undergo programmed DNA elimination from the somatic cell lineage, as is the case for the puzzling germline-restricted chromosome (GRC) of songbirds. Recent cytogenetic evidence suggests that the GRC is widely present in songbirds, i.e., in over half of all bird species. In addition, we demonstrated in our previous work that the zebra finch GRC is enriched in developmental genes. This finding points to a novel mechanism for germline-specific gene expression in multicellular organisms, which is of significant relevance to fields such as genomics, developmental biology, ecology, and even cancer research. However, it remains unclear how the GRC has diverged between species and which parts of its content are crucial for its function. We will therefore identify the genes that have been linked to the GRC for a longer time and that show lower coding nucleotide divergence, fewer rearrangements, less repeat accumulation, and more stable copy numbers across the songbird phylogeny.
To fill these gaps, we will perform a comparative phylogenomic analysis of 10 new estrildid finches and a broad taxonomic sample across deep lineages of Passeriformes, comprising 8 oscine and 2 suboscine species. We have already sequenced DNA from soma (~35x coverage each) and testis (~70x) of these species using 10x Chromium libraries, and the data are ready for analysis. We will start by generating draft germline genomes from the 10x data for these 20 additional species and will determine the gene content of the GRC using both coverage and SNP approaches. Specifically, we will map the 10x reads from both tissues against each assembly, compare genomic coverage between tissues, and identify genes with a high testis/soma coverage ratio, on which we will then perform tissue-specific SNP calling. Moreover, we are currently developing a novel in silico sequence capture approach to verify the scaffolds. It consists of identifying reads that carry GRC-specific SNPs in the 10x data and selecting the linked reads that share their barcode, i.e., reads from the same input molecule, so that reads from the GRC can be jointly assembled using Supernova. We are testing two implementations, one based on read mapping and the other on k-mer comparison. Finally, for the most ancient GRC-linked genes, we will build maximum-likelihood phylogenies of each gene and its A-chromosomal paralog to determine when it was incorporated into the GRC. This information is also necessary to search for signatures of selection.
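As an illustration only, the coverage screen and the barcode-based read selection described above could be sketched as follows in Python. This is a minimal sketch under stated assumptions, not the pipeline used in the project: the function names, the median normalization, the pseudocount, and the log2 threshold of 1.0 are all hypothetical choices, and the inputs are assumed to have been extracted beforehand (per-gene mean depth per tissue, and a read-to-barcode table).

```python
from math import log2
from statistics import median


def flag_grc_candidates(testis_cov, soma_cov, min_log2_ratio=1.0, pseudocount=0.5):
    """Return {gene: log2 ratio} for genes with elevated testis/soma coverage.

    testis_cov and soma_cov map gene name -> mean read depth over that gene.
    Depths are divided by each library's median so that the different
    sequencing depths of the two tissues (e.g. ~70x vs ~35x) cancel out.
    The threshold and pseudocount are illustrative assumptions.
    """
    genes = sorted(set(testis_cov) & set(soma_cov))
    testis_median = median(testis_cov[g] for g in genes)
    soma_median = median(soma_cov[g] for g in genes)
    if testis_median <= 0 or soma_median <= 0:
        raise ValueError("median depth must be positive in both libraries")

    candidates = {}
    for gene in genes:
        testis_norm = testis_cov[gene] / testis_median
        soma_norm = soma_cov[gene] / soma_median
        # The pseudocount keeps the ratio finite for genes with no soma reads.
        ratio = log2((testis_norm + pseudocount) / (soma_norm + pseudocount))
        if ratio >= min_log2_ratio:
            candidates[gene] = ratio
    return candidates


def select_linked_reads(read_barcodes, seed_reads):
    """Return the IDs of all reads that share a barcode with a seed read.

    read_barcodes maps read ID -> barcode; seed_reads is an iterable of read
    IDs that carry a GRC-specific SNP. Seed reads missing from the table
    are ignored.
    """
    seed_barcodes = {read_barcodes[r] for r in seed_reads if r in read_barcodes}
    return {r for r, bc in read_barcodes.items() if bc in seed_barcodes}


if __name__ == "__main__":
    # Toy data: geneC has extra copies in testis, the other genes do not.
    testis = {"geneA": 70.0, "geneB": 72.0, "geneC": 210.0, "geneD": 68.0}
    soma = {"geneA": 35.0, "geneB": 36.0, "geneC": 34.0, "geneD": 35.0}
    print(sorted(flag_grc_candidates(testis, soma)))

    barcodes = {"r1": "BC1", "r2": "BC1", "r3": "BC2", "r4": "BC3", "r5": "BC3"}
    print(sorted(select_linked_reads(barcodes, {"r1", "r5"})))
```

In this toy example only geneC passes the coverage screen, and the two seed reads pull in the other reads from their barcodes (r2 and r4) while r3 is left out. The selected read set would then be handed to the assembler; how barcodes are extracted from the 10x data is left outside the sketch.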
Altogether, this project will resolve the evolutionary trajectories of the GRC across songbird evolution and characterize its key genes. In addition, it will provide the first characterization of the long-term evolution of programmed DNA elimination in any organism, which will help to understand this phenomenon in other model systems. We request 10,000 hours/month of computation time because these analyses are very demanding: we plan to compare 20 pairs of high-coverage libraries at different levels and will also benchmark a new bioinformatic method.