(C) PLOS One This story was originally published by PLOS One and is unaltered. . . . . . . . . . . LRT: Integrative analysis of scRNA-seq and scTCR-seq data to investigate clonal differentiation heterogeneity [1] ['Juan Xie', 'The Interdisciplinary Program In Biostatistics', 'The Ohio State University', 'Columbus', 'Ohio', 'United States Of America', 'Hyeongseon Jeon', 'Department Of Biomedical Informatics', 'Pelotonia Institute For Immuno-Oncology', 'The James Comprehensive Cancer Center'] Date: 2023-08 Single-cell RNA sequencing (scRNA-seq) data has been widely used for cell trajectory inference, with the assumption that cells with similar expression profiles share the same differentiation state. However, the inferred trajectory may not reveal clonal differentiation heterogeneity among T cell clones. Single-cell T cell receptor sequencing (scTCR-seq) data provides invaluable insights into the clonal relationship among cells, yet it lacks functional characteristics. Therefore, scRNA-seq and scTCR-seq data complement each other in improving trajectory inference, where a reliable computational tool is still missing. We developed LRT, a computational framework for the integrative analysis of scTCR-seq and scRNA-seq data to explore clonal differentiation trajectory heterogeneity. Specifically, LRT uses the transcriptomics information from scRNA-seq data to construct overall cell trajectories and then utilizes both the TCR sequence information and phenotype information to identify clonotype clusters with distinct differentiation biasedness. LRT provides a comprehensive analysis workflow, including preprocessing, cell trajectory inference, clonotype clustering, trajectory biasedness evaluation, and clonotype cluster characterization. We illustrated its utility using scRNA-seq and scTCR-seq data of CD8 + T cells and CD4 + T cells with acute lymphocytic choriomeningitis virus infection. These analyses identified several clonotype clusters with distinct skewed distribution along the differentiation path, which cannot be revealed solely based on scRNA-seq data. Clones from different clonotype clusters exhibited diverse expansion capability, V-J gene usage pattern and CDR3 motifs. The LRT framework was implemented as an R package ‘LRT’, and it is now publicly accessible at https://github.com/JuanXie19/LRT . In addition, it provides two Shiny apps ‘shinyClone’ and ‘shinyClust’ that allow users to interactively explore distributions of clonotypes, conduct repertoire analysis, implement clustering of clonotypes, trajectory biasedness evaluation, and clonotype cluster characterization. Understanding the dynamic changes behind biological processes is important for determining molecular mechanisms underlying normal tissue formulation, developmental disorders and pathologies. Usually, a biological process can be characterized by identifying a trajectory, a path that goes through the various cellular states associated with the process. Since cells in different states may express different sets of genes, researchers often infer cell trajectory via capturing transcriptomics changes. Dozens of methods have been developed for cell trajectory inference, and scRNA-seq data is predominantly utilized. However, cells from different clones may exist distinct differentiation patterns, while methods based only on scRNA-seq data cannot capture this heterogeneity. T cells play a key role in the immune system, and their high antigen recognition specificity is largely determined by their TCR sequences. Thanks to the advent of scTCR-seq technology, we can now identify the group of cells coming from the same clone. This paper describes our novel computational framework, namely LRT, and demonstrates that by complementing scRNA-seq data with the clonal information from scTCR-seq data using LRT, we are able to examine the clonal differentiation heterogeneity that cannot be revealed solely based on scRNA-seq data. Funding: This work was supported by grants from the National Human Genome Research Institute (R21 HG012482) to D.C., National Institute on Aging (U54 AG075931) to Q.M., National Institute of General Medical Sciences (R01 GM122078) to D.C. and (R01 GM131399) Q.M., National Institute on Drug Abuse (U01 DA045300) to D.C., the National Science Foundation (NSF1945971) to Q.M., and the Pelotonia Institute of Immuno-Oncology (PIIO) to D.C. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript. Data Availability: The LRT framework was implemented as an R package ‘LRT’, and it is publicly available at https://github.com/JuanXie19/LRT . The Shiny apps ‘shinyClone’ and ‘shinyClust’ are also provided as part of this R package. The data used to produce the results presented in this manuscript are available at GEO with accession number GSE188670 and GSE158896, respectively. In this paper, we propose LRT ( L ineage inference by integrative analysis of sc R NA-seq and sc T CR-seq data), a rigorous computational framework for the integrative analysis of scTCR-seq and scRNA-seq data to study clonal differentiation heterogeneity. LRT addresses the complexity and the high dimensionality of scTCR-seq data through a clonotype clustering algorithm and effective information sharing between scTCR-seq and scRNA-seq data. In addition, to support researchers in utilizing the proposed LRT framework for their studies, we developed an R package ‘LRT’, which is now publicly accessible at https://github.com/JuanXie19/LRT . We also developed two Shiny apps, namely ‘shinyClone’ and ‘shinyClust’, and they allow researchers to interactively implement the complete LRT analysis workflow, including exploratory analysis, cell trajectory inference, clonotype clustering, differentiation trajectory biasedness evaluation and clonotype cluster characterization. These Shiny apps are provided as part of the R package ‘LRT’. Integration of scTCR-seq data with scRNA-seq allows for the identification of T cell clones and characterize their transcriptional states at a single-cell level. This integrative analysis can reveal important insights into the clonal expansion, diversification, and differentiation of T cells in response to various stimuli, including infections, cancer, and autoimmunity. For instance, by using paired scRNA-seq and scTCR-seq data, Daniel et al. [ 12 ] discovered divergent clonal differentiation trajectories of T cell exhaustion in response to chronic viral infection, where different T cell clones follow distinct differentiation trajectories. Similarly, Khatun et al. [ 13 ] found that the during acute lymphocytic choriomeningitis virus (LCMV) infection, different naïve CD4 T cell clones showed different preference toward to either type 1 helper T cell (Th1) or T follicular helper cell (Tfh). However, despite such potential to disentangle the heterogeneity in T cell clonal differentiation trajectories, to the best of our knowledge, a computational approach to integrate scRNA-seq and scTCR-seq data for the investigation of T cell differentiation heterogeneity is missing in the literature. This is partially due to the statistical and computational challenges related to the high dimensionality of the clonotype space. Specifically, for a given human subject, 10 13 TCR sequences are typically observed while each clonotype may only span a very limited number of cells. Hence, for effective cell trajectory inference and integration of scTCR-seq and scRNA-seq data, we need a cleverly designed computational framework that can handle these challenges. The recent advent of single-cell T cell receptor (TCR) sequencing (scTCR-seq) technology has enabled the use of TCR sequences as unique ’barcodes’ to identify clonally related cells. This is because the cells with identical paired TCR sequences usually arise from the same T cell clone due to the high dimensionality of TCR sequence space [ 6 ]. What is more, these clonal cells are often considered to be derived from the same progenitor and developmentally related [ 7 ]. Examining the abundance/proportion of clonal cells and their changes greatly benefits our understanding of adaptive immune response in health and disease [ 8 , 9 ]; therefore, there has been a rapid accumulation of scTCR-seq data in recent years. Accordingly, several computational and statistical approaches have also been developed, where examples include Immunarch [ 10 ] and scRepertoire [ 11 ]. However, currently available approaches mostly focus on TCR repertoire analysis, such as clonality, repertoire overlap, repertoire diversity, and gene usage analyses. The utilization of scTCR-seq data for cell trajectory analysis is still limited. In addition, although TCR data provides invaluable insights into the breadth of the antigenic response and clonal relationships between cells, it does not offer information about the functional characteristics of the T cells. Cell trajectory inference aims to understand how cells differentiate, which is a long-standing problem in developmental biology. Understanding how progenitor cells transform into specified functional cells can provide valuable insights into the molecular mechanisms underlying normal tissue formulation, as well as developmental disorders and pathologies [ 1 ]. Single-cell RNA-seq (scRNA-seq) data has been widely used to investigate cell trajectories. Assuming that cells in different states express different sets of marker genes, we may order cells along a differentiation trajectory via capturing differentiated transcriptional activities [ 2 ]. Based on this rationale, various computational and statistical approaches have been proposed for this purpose, namely trajectory/pseudotime analysis, where well known examples include Monocle [ 2 ] and Slingshot [ 3 ] (please see Saelens et al. [ 4 ] for a comprehensive review). While these approaches have been shown to be useful, they still suffer from intrinsic limitations. Specifically, it assumes that cells with similar expression profiles share similar developmental stages and may come from the same lineage, which might not be always true [ 1 , 5 ]. Besides, in the absence of clonal information, the inferred trajectory may not reflect the true clonal relationship between cells [ 1 ]. Materials and methods The workflow of the ‘LRT’ framework is shown in Fig 1, which starts from the integration of scTCR-seq and scRNA-seq data (Fig 1A). Essentially, scTCR-seq and scRNA-seq data are paired via barcode matching, and TCR sequence information is added as metadata of the scRNA-seq data. Using this integrated dataset, LRT applies the Slingshot algorithm [3] to infer the overall cell trajectory (Fig 1B). Next, LRT identifies clusters of clonotypes using a Dirichlet multinomial mixture model (DMM) (Fig 1C). The biasedness along the overall trajectory for clones in each clonotype cluster is evaluated by using permutation tests. Besides, the identified clonotype clusters are characterized via repertoire analysis, top-ranked clonotypes identification, and V-J gene usage pattern analysis (Fig 1D). PPT PowerPoint slide PNG larger image TIFF original image Download: Fig 1. The framework of LRT. (A) The preprocessed scRNA-seq and scTCR-seq data are integrated based on cell barcodes matching. (B) Cell trajectories are first obtained using the Slingshot algorithm. (C) Clonotypes with similar cell cluster composition are grouped using the Dirichlet multinomial mixtures (DMM) model. (D) For each clonotype cluster, the distributional bias of clones along the trajectories is evaluated via permutation tests. The repertoire is characterized in terms of clonal expansion and diversity, the top-ranked clonotypes, and V and J gene usage patterns, among others. https://doi.org/10.1371/journal.pcbi.1011300.g001 Cell trajectory estimation To characterize the potential process of T cell functional changes, LRT applies the Slingshot algorithm [3] on the scRNA-seq data (see the Section ‘Software implementation’ for more details). Specifically, a weighted graph is constructed, with nodes being cell-cluster-specific centers and edge weights represent the Euclidean distances between the nodes on the low-dimensional space, such as Principal Component Analysis (PCA) or Uniform Manifold Approximation and Projection (UMAP) space [14]. Note that both cell clusters and low-dimensional space in this step are defined based on scRNA-seq data. A cluster-based MST [15] is then constructed, and a simultaneous principal curves method as described in Slingshot is applied to fit smooth branching curves [3]. The obtained smoothing curves are interpreted as trajectories and the arc length between the starting point of the curve and the projection of cell on the curves is treated as pseudotime. Clonotype clustering While a clonotype of our ultimate interest, scTCR-seq data often suffers from high dimensionality and sparsity issues, i.e., we often have a large number of clonotypes, each of which spans only a few cells. Such high dimensionality and sparsity remain challenges to reliably investigate the distributon of clones along the trajectories and their heterogeneity. In order to address these challenges, LRT conducts a clustering analysis to group similar clonotypes, where similarity is defined based on cell distribution across cell clusters/states. The rationale behind is that cells from different clones may have different preference towards a particular cell state, and thus exhibit diverse distribution across cell states. In LRT, the clonotype clustering is implemented via a Dirichlet multinomial mixtures (DMM) method proposed by [16], which can deal with the sparse and heterogenous nature of the frequency matrix for the clonotypes. In addition, because the DMM method is based on a Bayesian framework, it provides uncertainty quantification and posterior probabilities for cluster memberships of clonotypes, which allow us to prioritize clonotypes for downstream analyses and experimental validations. Let’s denote X as a matrix of frequencies, in wich elements x ij being the obseved frequency of cells that belong to cell state j for clonotype i, with i∈[1:C], j∈[1:S]. We further denote the rows of the matrix that give the cell distributions across cell states for each clonotype by x i . We assume that each clonotype is generated from a multinomial distribution with parameter vector p i . The elements of p i , p ij , are the probabilities that a cell from clonotype i belong to the cell state j. Under this setting, the likelihood of observing the cell distribution for a clonotype i is given as follows: where are the total number of cells in clonotype i. The total likelihood is the product of the clonotype likelihoods: As regard to the prior distribution for the multinomial parameter vectors p i , a natural choice would be the Dirichlet distribution, which is conjugate to the multinomial. Yet considering that different clonotypes might have significantly different cell distributions, a mixture of K Dirichlets is adopted instead of assuming a single Dirichlet prior to deal with cell state composition heterogeneity. Specifically, for each clonotype i, we assume that it belongs to one of K clusters. Hence, the prior is: where Dir(p i |α k ) represent the Dirichlet distribution with parameter α k , π k is the mixture weight, and Q = (α 1 ,…,α K ,π 1 ,…,π K ) is the hyperparameters. We further assume that α ik ~Γ(η,ν). We use a K-dimentional binary latent indicator vector z i to represent the membership of i-th clonotype. For the parameter interfence, a EM algorithm is employed, and the number of Dirichlet compoments, K, is determined through the Laplace approximation to the posterior probability. For more details about parameter estimation and model selection, please refer to [16]. After determining the number of mixture component, i.e, the number of clusters K, we infer which cluster an individual clonotype belong to by finding the value of k that maximizes the posterior probability of membership: Statistical inference for the trajectory bias In the integrative analysis of scRNA-seq and scTCR-seq, we are often interested in the distributional biasedness along the trajectory for clones. The DMM method described in the previous section allows us to obtain K clonotype clusters, each with distinct composition (i.e., cell distribution across cell states), which likely reflects such distributional bias. However, researchers can be benefited with a rigorous hypothesis testing that quantifies statistical significance of the distributional bias. For this purpose, we developed 3 different permutation tests to evaluate the distributional bias along the trajectory for clones. We note that we might want to evaluate such distributional bias at the clonotype level. However, as also discussed in the previous section, scTCR-seq data often suffers from high dimensionality and sparsity issues and these can make such permutation tests less stable and reliable. Hence, based on the similar rationale as in the previous section, we conducted these permutation tests at the clonotype cluster level. Hypothesis testing procedures have three components: definition of the null hypothesis (H 0 ), definition of the test statistics, and derivation of the null distribution to determine rejection of H 0 . First, here we consider three different types of distributional biases in the sense of (i) the biased use of different trajectories; (ii) the specific localization in a certain differentiation stage vs. the more dynamic changes across the differentiation stages; and (iii) the preference for earlier vs. later differentiation stages. We note that cells are often not uniformly distributed between different trajectories or along differentiation stages. Given this, instead of considering the uniform distribution as H 0 , we use the global patterns as the null hypothesis and evaluate whether a specific clonotype cluster shows significantly different patterns compared to this global distribution. Second, in the sense of test statistics, for the case of the hypothesis testing (i), we would like to evaluate whether clones were preferentially distributed towards a particular lineage or not. For this purpose, we use the test statistic , which is the log2-transformed ratio of the number of cells in lineage l to the number of cells in lineage m for clonotype cluster k. Note that R klm has nice interpretation: if R klm is significantly larger than 0, then it indicates the preference for lineage l, while if R klm is significantly smaller than 0, then it indicates the preference for lineage m. For the case of the hypothesis testing (ii), the specific localization in a certain differentiation stage vs. the more dynamic changes across the differentiation stages can be formulated as the degree how evenly cells are spaced along the trajectory path. We use the interquartile range (IQR) of the pseudotime of a clonotype cluster as the test statistic for this purpose based on the rationale that IQR will become smaller as we have more specific localization in a certain differentiation stage. This test is also applicable to the case when there are multiple lineages. For the case of the hypothesis testing (iii), if the cells are not evenly spaced along the trajectory path (i.e., if the hypothesis testing (ii) is rejected), then we want to check whether earlier stage along the trajectory or the latter stage is preferred. To answer this question, we use the median pseudotime as the test statistic because it can represent where the majority of cells belonging to a specific clonotype cluster are located over the pseudotime. Specifically, if the observed median pseudotime is significantly smaller/larger than the global median, we may say that cells prefer earlier/later stage. Finally, to derive the null distribution, we use permutation test approaches. Specifically, in each permutation set, we randomly shuffle the clonotype cluster membership for the cells while maintaining the size of each clonotype cluster. Then, for each permutation set, we calculate the test statistics described above and obtain the distribution of these null test statistics. Finally, in the case of the hypothesis testing (ii), we obtain the p-value as the proportion of permutation sets of which test statistics is smaller than the test statistic observed in our dataset (i.e., 1-sided hypothesis testing) because we are interested in the smaller IQR, which indicates the specific localization in a certain differentiation stage. On the other hand, in the case of the hypothesis testing (i) and (iii), we obtain the p-value as the proportion of permutation sets of which test statistics is either smaller or larger than the test statistic observed in our dataset (i.e., 2-sided hypothesis testing) because both directions are of interest. For example, the smaller and larger test statistics for the hypothesis testing (i) mean preference for lineage m and lineage l, respectively. Similarly, the smaller and larger test statistics for the hypothesis testing (iii) mean the preference for earlier vs. later differentiation stages, respectively. Clonotype cluster characterization To characterize the clonotype clusters, first, LRT also quantifies the magnitude of clonal expansion, migration potential and state transition potentials of clonotypes by utilizing the STARTRAC framework [17]. Both clonotype level and clonotype cluster level indices are calculated. For the details, please refer to Section A in S1 Text. Second, a range of repertoire analysis is conducted for each clonotype cluster. Specifically, LRT calculates several diversity indices including the Shannon index [18], which characterizes richness and evenness of the TCRs in a clonotype cluster. To understand the clonal expansion behaviors of the clones, LRT also analyzes the relative clonal space occupied by the clonotypes of a given frequency range as well as top-expanded clonotypes. Third, given that biased usage of specific genes may indicate the alterations in the repertoire, LRT calculates the V and J gene usages of the TCR β chain [11]. Finally, we do de novo motif analysis to identify the motifs for the amino acid sequences of the TCR β chain using the GLAM2 tool within the MEME Suite [19], based on the top-ranked clones with the default setting. Here clonotypes are ranked in each clonotype cluster based on their membership assignment posterior probability, , and clone size. Specifically, a clonotype is likely to be the higher ranking one if it has higher posterior probability and larger clone size. Software implementation The aforementioned LRT framework and the whole analysis workflow were implemented as an R package ‘LRT’ and it is now publicly accessible at https://github.com/JuanXie19/LRT. The LRT software takes both scRNA-seq and scTCR-seq data as input. It accepts scRNA-seq data in the form of a Seurat object [20]. Such integration with Seurat not only allows LRT to be easily integrated into an existing scRNA-seq analysis workflow, but also utilizes Seurat’s powerful features, e.g., correcting batch effects and variations due to technical factors (e.g., sequencing depth) [20,21]. The Seurat object needs to contain cell cluster labels and low-dimensional embeddings for each cell, such as PCA or UMAP. In the case of scTCR-seq data, it requires a data frame with at least two columns: one column with cell identifiers and another column with TCR sequence information, for which users can provide either amino acid or nucleotide sequences of TCR α, β, or paired α-β chain CDR3 region. Data preprocessing LRT integrates scTCR-seq data with scRNA-seq data by attaching the TCR sequence information to the metadata of the Seurat object. LRT also provides some basic quality control functionalities, e.g., checking whether the cell identifiers from the Seurat object match those in the TCR data frame. After the integration, users will get an S4 object containing the updated Seurat object, which is the input for clonotype exploratory analysis and trajectory inference. Shiny apps for interactive visualization and analysis: shinyClone and shinyClust We developed two Shiny apps, shinyClone, and shinyClust, for interactive and dynamic data exploration and integrative analysis of scRNA-seq and scTCR-seq data. [END] --- [1] Url: https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1011300 Published and (C) by PLOS One Content appears here under this condition or license: Creative Commons - Attribution BY 4.0. via Magical.Fish Gopher News Feeds: gopher://magical.fish/1/feeds/news/plosone/