September bioRxiv Bioinformatics Highlights

Original article by montreal, 生信人

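During the National Day Golden Week holiday, like many of our readers, our editors did not sit idle. Let's look at the interesting bioinformatics preprints posted on bioRxiv in September (with a bonus at the end).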

1. 【Genomics】How genetic diversity shapes meiotic recombination in mice

A high-resolution map of non-crossover events in mice reveals impacts of genetic diversity on meiotic recombination（CC-BY-NC-ND 4.0）

In mice and humans, meiotic recombination begins with programmed DNA double-strand breaks at PRDM9-bound sites. These mainly resolve as difficult-to-detect non-crossovers, rather than crossovers. Here, we intercrossed two mouse subspecies over five generations and deep-sequenced 119 offspring, whose high heterozygosity allowed detection of 2,500 crossover and 1,575 non-crossover events with unprecedented power and spatial resolution. These events were strongly depleted at 'asymmetric' sites where PRDM9 mainly binds one homologue, implying they instead repair from the sister chromatid. This proves that symmetric PRDM9 binding promotes inter-homologue interactions, illuminating the mechanism of PRDM9-related hybrid infertility. Non-crossovers were surprisingly short (mean 30-41 bp), and complex non-crossovers, seen commonly in humans, were extremely rare. Unexpectedly, GC-biased gene conversion disappeared at non-crossovers containing multiple mismatches. These results demonstrate that local genetic diversity can alter meiotic repair pathway decisions in mammals by changing PRDM9 binding symmetry and non-crossover resolution, which influence genome evolution, fertility, and speciation.

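2. 【Genomics】After the mouse, on to the guinea pig family: both are rodents, so why is the capybara so massive? Sequencing reveals genome-level evidence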

How to make a rodent giant: Genomic basis and tradeoffs of gigantism in the capybara, the world's largest rodent（CC-BY-NC-ND 4.0）

Gigantism is the result of one lineage within a clade evolving extremely large body size relative to its small-bodied ancestors, a phenomenon observed numerous times in animals. Theory predicts that the evolution of giants should be constrained by two tradeoffs. First, because body size is negatively correlated with population size, purifying selection is expected to be less efficient in species of large body size, leading to a genome-wide elevation of the ratio of non-synonymous to synonymous substitution rates (dN/dS) or mutation load. Second, gigantism is achieved through higher number of cells and higher rates of cell proliferation, thus increasing the likelihood of cancer. However, the incidence of cancer in gigantic animals is lower than the theoretical expectation, a phenomenon referred to as Peto's Paradox. To explore the genetic basis of gigantism in rodents and uncover genomic signatures of gigantism-related tradeoffs, we sequenced the genome of the capybara, the world's largest living rodent. We found that dN/dS is elevated genome wide in the capybara, relative to other rodents, implying a higher mutation load. Conversely, a genome-wide scan for adaptive protein evolution in the capybara highlighted several genes involved in growth regulation by the insulin/insulin-like growth factor signaling (IIS) pathway. Capybara-specific gene-family expansions included a putative novel anticancer adaptation that involves T cell-mediated tumor suppression, offering a potential resolution to Peto's Paradox in this lineage. Gene interaction network analyses also revealed that size regulators function simultaneously as growth factors and oncogenes, creating an evolutionary conflict. 
Based on our findings, we hypothesize that gigantism in the capybara likely involved three evolutionary steps: 1) Increase in body size by cell proliferation through the IIS pathway, 2) coupled evolution of growth-regulatory and cancer-suppression mechanisms, possibly driven by intragenomic conflict, and 3) establishment of the T cell-mediated tumor suppression pathway as an anticancer adaptation. Interestingly, increased mutation load appears to be an inevitable outcome of an increase in body size.

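Left: Figure 1 of the original paper, showing the phylogeny of rodents alongside genome size, repeat content, gene number, and body mass (left to right). Right: a size reference for the capybara.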

3. 【Evolution】Is the bias in yeast gene loss determined by temperature?

Temperature preference biases parental genome retention during hybrid evolution（CC-BY 4.0）

Interspecific hybridization can introduce genetic variation that aids in adaptation to new or changing environments. Here we investigate how the environment, and more specifically temperature, interacts with hybrid genomes to alter parental genome representation over time. We evolved Saccharomyces cerevisiae x Saccharomyces uvarum hybrids in nutrient-limited continuous culture at 15°C for 200 generations. In comparison to previous evolution experiments at 30°C, we identified a number of temperature specific responses, including the loss of the S. cerevisiae allele in favor of the cryotolerant S. uvarum allele for several portions of the hybrid genome. In particular, we discovered a genotype by environment interaction in the form of a reciprocal loss of heterozygosity event on chromosome XIII. Which species haplotype is lost or maintained is dependent on the parental species temperature preference and the temperature at which the hybrid was evolved. We show that a large contribution to this directionality is due to temperature sensitivity at a single locus, the high affinity phosphate transporter PHO84. This work helps shape our understanding of what forces impact genome evolution after hybridization, and how environmental conditions may favor or disfavor hybrids over time.

4. 【Evolution】Chung-I Wu's group at Sun Yat-sen University: the low overlap between McDonald–Kreitman test and PAML results, and its theoretical implications

Skepticism toward adaptive signals in DNA sequence comparisons - Is the neutral theory dead yet?（CC-BY-NC-ND 4.0）

Measuring positive selection on DNA sequences between species is key to testing the neutral theory of molecular evolution. Here, we compare the two most commonly used tests that rely on very different assumptions. The MK test compares divergence and polymorphism data, while the PAML test analyzes multi-species divergence. The two tests are now forced to detect positive selection on the same phylogenetic branch in Drosophila and Arabidopsis using large-scale genomic data. When applied to individual coding genes, both MK and PAML identify >100 adaptively evolving genes but the two sets hardly overlap. To rule out high false negatives, we merge 20 - 30 genes into "supergenes", 8% - 56% of which yield adaptive signals. Nevertheless, the joint calls still do not overlap. The two tests do show very modest concordance at lower stringencies. There may be several possibilities for the discordance between the two major tests. 1) Selective advantage is weak, falling in the "nearly neutral" range; 2) The adaptive landscape shifts constantly, akin to the Red Queen landscape; 3) Positive selection which accelerates the evolution is confounded by the relaxation of negative selection, resulting in less deceleration. Whether the neutral theory should be rejected depends on which of these factors prevails.
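As a rough illustration of the MK test mentioned in the abstract above, the sketch below computes the neutrality index, the derived alpha statistic, and a two-sided Fisher's exact p-value from a 2x2 table of counts. The function name `mk_test` and the example counts are illustrative assumptions, not taken from the preprint, and the sketch does not guard against zero counts.

```python
from math import comb


def mk_test(dn, ds, pn, ps):
    """Minimal McDonald-Kreitman test on a 2x2 table of counts.

    dn, ds: nonsynonymous / synonymous fixed differences between species
    pn, ps: nonsynonymous / synonymous polymorphisms within a species
    Returns (neutrality_index, alpha, two_sided_fisher_p).
    """
    # Neutrality index: NI < 1 indicates an excess of nonsynonymous divergence
    ni = (pn / ps) / (dn / ds)
    # alpha: estimated fraction of nonsynonymous substitutions driven by positive selection
    alpha = 1 - ni

    # Fisher's exact test on the table [[dn, ds], [pn, ps]]
    row1 = dn + ds            # total fixed differences
    col1 = dn + pn            # total nonsynonymous changes
    n = dn + ds + pn + ps     # grand total

    def prob(x):
        # Hypergeometric probability of x nonsynonymous changes among the fixed differences
        return comb(col1, x) * comb(n - col1, row1 - x) / comb(n, row1)

    p_obs = prob(dn)
    x_min = max(0, row1 - (n - col1))
    x_max = min(row1, col1)
    # Two-sided p-value: sum over all tables no more likely than the observed one
    p = sum(prob(x) for x in range(x_min, x_max + 1) if prob(x) <= p_obs * (1 + 1e-9))
    return ni, alpha, min(p, 1.0)


if __name__ == "__main__":
    # Illustrative counts only
    ni, alpha, p = mk_test(dn=7, ds=17, pn=2, ps=42)
    print(f"NI={ni:.3f} alpha={alpha:.3f} p={p:.4f}")
```

A gene is usually called adaptively evolving by this test when NI is below 1 and the p-value passes the chosen threshold; PAML instead fits codon models across a phylogeny, which is one reason the two gene sets need not coincide.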

5. 【Genome editing】A new tool to aid gRNA design for CRISPR

beditor: A computational workflow for designing libraries of guide RNAs for CRISPR base editing（CC-BY-NC-ND 4.0）

Recently engineered CRISPR base editors have opened unique avenues for scar-free genome-wide mutagenesis. Here, we describe a comprehensive computational workflow called beditor that can be broadly adapted for designing guide RNA libraries to be used for CRISPR base editing. The computational framework allows users to assess editing possibilities using a range of CRISPR base editors, PAM recognition sequences and the genome of any species. Additionally, potential editing efficiencies of the designed guides are evaluated in terms of a priori estimates, through a specifically designed beditor scoring system.
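To make the idea of matching PAM sequences and editable positions concrete, here is a minimal sketch that scans the forward strand of a sequence for candidate guides for a cytosine base editor. It is not beditor's API: the function name `find_cbe_guides`, the NGG PAM default, the 20-nt guide length, and the activity window at protospacer positions 4 to 8 are all assumptions chosen for illustration.

```python
import re


def find_cbe_guides(seq, pam="NGG", guide_len=20, window=(4, 8)):
    """Find forward-strand guides with a C inside the assumed editing window.

    Positions are 1-based within the protospacer, counted from the PAM-distal end.
    """
    seq = seq.upper()
    # Turn the PAM into a regex fragment: N matches any base
    pam_re = "".join("[ACGT]" if base == "N" else base for base in pam)
    # A lookahead lets overlapping candidate sites all be reported
    pattern = re.compile(rf"(?=([ACGT]{{{guide_len}}})({pam_re}))")

    lo, hi = window
    hits = []
    for match in pattern.finditer(seq):
        guide, pam_seq = match.group(1), match.group(2)
        # Collect window positions that carry a cytosine (the editable base)
        targets = [i for i in range(lo, hi + 1) if guide[i - 1] == "C"]
        if targets:
            hits.append({
                "start": match.start(),   # 0-based start of the protospacer
                "guide": guide,
                "pam": pam_seq,
                "editable_C": targets,
            })
    return hits


if __name__ == "__main__":
    demo = "AAACC" + "A" * 15 + "TGG"
    for hit in find_cbe_guides(demo):
        print(hit)
```

A real design workflow would also scan the reverse strand, handle other editors and PAMs, and score off-target risk; this sketch covers only the site-finding step.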

6. 【Transcriptomics】Transcriptomic and comparative genomic analysis of long non-coding RNAs in four nematode species

Transcriptomic analyses reveal groups of co-expressed, syntenic lncRNAs in four species of the genus Caenorhabditis（CC-BY-NC-ND 4.0）

Long non-coding RNAs (lncRNAs) are a heterogeneous class of genes that do not code for proteins. Since lncRNAs (or a fraction thereof) are expected to be functional, many efforts have been dedicated to catalog lncRNAs in numerous organisms, but our knowledge of lncRNAs in non-vertebrate species remains very limited. Here, we annotated lncRNAs using transcriptomic data from the same larval stage of four Caenorhabditis species. The number of annotated lncRNAs in self-fertile nematodes was lower than in out-crossing species. We used a combination of approaches to identify putatively homologous lncRNAs: synteny, sequence conservation, and structural conservation. We classified a total of 1,532 out of 7,635 genes from the four species into families of lncRNAs with conserved synteny and expression at the larval stage, suggesting that a large fraction of the predicted lncRNAs may be species specific. Although both sequence and local secondary structure seem to be poorly conserved, sequences within families frequently shared BLASTn hits and short sequence motifs, which were more likely to be unpaired in the predicted structures. We provide the first multi-species catalog of lncRNAs in nematodes and identify groups of lncRNAs with conserved synteny and expression, that share exposed motifs.

7. 【Genomics】Re-assembly and annotation of the transcriptomes of 678 microbial eukaryotes

Re-assembly, quality evaluation, and annotation of 678 microbial eukaryotic reference transcriptomes（CC-BY 4.0）

Background: De novo transcriptome assemblies are required prior to analyzing RNAseq data from a species without an existing reference genome or transcriptome. Despite the prevalence of transcriptomic studies, the effects of using different workflows, or "pipelines", on the resulting assemblies are poorly understood. Here, a pipeline was programmatically automated and used to assemble and annotate raw transcriptomic short read data collected by the Marine Microbial Eukaryotic Transcriptome Sequencing Project (MMETSP). The resulting transcriptome assemblies were evaluated and compared against assemblies that were previously generated with a different pipeline developed by the National Center for Genome Research (NCGR). Results: New transcriptome assemblies contained the majority of previous contigs as well as new content. On average, 7.8% of the annotated contigs in the new assemblies were novel gene names not found in the previous assemblies. Taxonomic trends were observed in the assembly metrics, with assemblies from the Dinoflagellata and Ciliophora phyla showing a higher percentage of open reading frames and number of contigs than transcriptomes from other phyla. Conclusions: Given current bioinformatics approaches, there is no single 'best' reference transcriptome for a particular set of raw data. As the optimum transcriptome is a moving target, improving (or not) with new tools and approaches, automated and programmable pipelines are invaluable for managing the computationally-intensive tasks required for re-processing large sets of samples with revised pipelines and ensuring a common evaluation workflow is applied to all samples. Thus, re-assembling existing data with new tools using automated and programmable pipelines may yield more accurate identification of taxon-specific trends across samples in addition to novel and useful products for the community.

8. 【single-cell】Stanford professors Chang and Khavari join forces on Perturb-ATAC: CRISPR plus ATAC-seq in single cells

Coupled single-cell CRISPR screening and epigenomic profiling reveals causal gene regulatory networks（CC-BY-ND 4.0）

Here we present Perturb-ATAC, a method which combines multiplexed CRISPR interference or knockout with genome-wide chromatin accessibility profiling in single cells, based on the simultaneous detection of CRISPR guide RNAs and open chromatin sites by assay of transposase-accessible chromatin with sequencing (ATAC-seq). We applied Perturb-ATAC to transcription factors (TFs), chromatin-modifying factors, and noncoding RNAs (ncRNAs) in ~4,300 single cells, encompassing more than 63 unique genotype-phenotype relationships. Perturb-ATAC in human B lymphocytes uncovered regulators of chromatin accessibility, TF occupancy, and nucleosome positioning, and identified a hierarchical organization of TFs that govern B cell state, variation, and disease-associated cis-regulatory elements. Perturb-ATAC in primary human epidermal cells revealed three sequential modules of cis-elements that specify keratinocyte fate, orchestrated by the TFs JUNB, KLF4, ZNF750, CEBPA, and EHF. Combinatorial deletion of all pairs of these TFs uncovered their epistatic relationships and highlighted genomic co-localization as a basis for synergistic interactions. Thus, Perturb-ATAC is a powerful and general strategy to dissect gene regulatory networks in development and disease.

9. 【Transcriptomics】rnaSPAdes: no ordinary spade

rnaSPAdes: a de novo transcriptome assembler and its application to RNA-Seq data（CC-BY-NC-ND 4.0）

Possibility to generate large RNA-seq datasets has led to development of various reference-based and de novo transcriptome assemblers with their own strengths and limitations. While reference-based tools are widely used in various transcriptomic studies, their application is limited to the model organisms with finished and annotated genomes. De novo transcriptome reconstruction from short reads remains an open challenging problem, which is complicated by the varying expression levels across different genes, alternative splicing and paralogous genes. In this paper we describe a novel transcriptome assembler called rnaSPAdes, which is developed on top of SPAdes genome assembler and explores surprising computational parallels between assembly of transcriptomes and single-cell genomes. We also present quality assessment reports for rnaSPAdes assemblies, compare it with modern transcriptome assembly tools using several evaluation approaches on various RNA-Seq datasets, and briefly highlight strong and weak points of different assemblers.

10. 【Genomics】Tired of transcriptome data? See how gene expression can be predicted directly from DNA sequence

Predicting mRNA abundance directly from genomic sequence using deep convolutional neural networks（CC-BY-NC 4.0）

Algorithms that accurately predict gene structure from primary sequence alone were transformative for annotating the human genome. Can we also predict the expression levels of genes based solely on genome sequence? Here we sought to apply deep convolutional neural networks towards this goal. Surprisingly, a model that includes only promoter sequences and features associated with mRNA stability explains 59% and 71% of variation in steady-state mRNA levels in human and mouse, respectively. This model, which we call Xpresso, more than doubles the accuracy of alternative sequence-based models, and isolates rules as predictive as models relying on ChIP-seq data. Xpresso recapitulates genome-wide patterns of transcriptional activity and predicts the influence of enhancers, heterochromatic domains, and microRNAs. Model interpretation reveals that promoter-proximal CpG dinucleotides strongly predict transcriptional activity. Looking forward, we propose the accurate prediction of cell type-specific gene expression based solely on primary sequence as a grand challenge for the field.
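For readers unfamiliar with how a DNA sequence is fed to a convolutional network, the sketch below shows the usual one-hot encoding step, plus a simple count of the CpG dinucleotides that the abstract singles out. This is not Xpresso's code: the function names `one_hot` and `cpg_count` and the A, C, G, T column order are assumptions for illustration.

```python
def one_hot(seq, alphabet="ACGT"):
    """Encode a DNA sequence as a list of rows, one row per base.

    Each row has one column per letter of `alphabet`; ambiguous bases
    such as N are left as all-zero rows.
    """
    index = {base: i for i, base in enumerate(alphabet)}
    rows = []
    for base in seq.upper():
        row = [0] * len(alphabet)
        if base in index:
            row[index[base]] = 1
        rows.append(row)
    return rows


def cpg_count(seq):
    """Count CpG dinucleotides (a C immediately followed by a G)."""
    s = seq.upper()
    return sum(1 for i in range(len(s) - 1) if s[i:i + 2] == "CG")


if __name__ == "__main__":
    promoter = "ACGCGTN"  # toy sequence, not a real promoter
    for base, row in zip(promoter, one_hot(promoter)):
        print(base, row)
    print("CpG count:", cpg_count(promoter))
```

A model of the kind described would take a matrix like this for each promoter as input and learn convolutional filters over it; the CpG count is just a hand-computed feature for comparison.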

An introduction to the PCI preprint community

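Finally, let us introduce a preprint community that has recently risen to prominence: Peer Community In, or PCI (https://peercommunityin.org). The community currently comprises three parts, PCI Evolution, PCI Ecology, and PCI Paleontology, covering evolutionary biology, ecology, and paleontology respectively, and it is still expanding.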

PCI aims to pick out high-quality articles from the ever-growing body of preprints and highlight them for researchers in the relevant fields: The goal of the PCIs is to highlight and recommend preprints as of particular interest to the community concerned. The preprints recommended by PCIs are complete articles of high value that do not necessarily need to be published in traditional journals.

In brief, the workflow is as follows:

i) Post the manuscript on a preprint server such as bioRxiv

ii) Submit it to PCI

iii) The manuscript enters PCI peer review

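PCI recommends manuscripts that pass review and publishes them on its website together with the review reports. Once a manuscript passes review, the manuscript and its reviews can serve as a basis for consideration by many journals; for example, a number of journals have already stated that they will consider manuscripts from the PCI preprint community and use the PCI reviews as a reference.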

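As a non-profit organization, PCI offers a new platform for high-quality preprints. If you are interested in submitting, or want to see which preprints have been recommended in your field, visit the PCI homepage: https://peercommunityin.org.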

Happy National Day!
