Barbara Engelhardt Barbara Engelhardt

Poisson reduced rank regression (PRRR)

Poisson Reduced Rank Regression for single-cell sequencing associations

Tiana Fitzgerald, Andy Jones, Barbara Engelhardt

Our paper developing count-based model for association mapping published in #BMCBioinformatics! Great work by amazing Princeton undergrad Tiana Fitzgerald and PhD student Andy Jones.

A Poisson reduced-rank regression model for association mapping in sequencing data

We develop a Poisson reduced-rank regression model to identify low-dimensional associations in high-dimensional data.

We adapted association mapping methods developed for bulk sequencing to count-based single-cell technologies. The goal here is to identify associations between cell-specific covariates and each cell's expression levels.

Our model, Poisson reduced rank regression (PRRR), is a Poisson regression model with a low-rank coefficient matrix.

We find that PRRR is useful in several applications. For example, finding transcriptional hallmarks of cell types (here, in scRNA-seq data from human pancreas).

In characterizing pancreatic cell types, we find that PRRR's learned latent factors correspond to well-established biological processes in pancreas islet and non-islet cells.

We also applied PRRR to spatial transcriptomics data, where we find that the model can identify associations between spatial coordinates and gene expression.

As a final application, we used PRRR to perform low-rank eQTL mapping in bulk RNA-seq data from GTEx. Applied to liver tissue samples, we find that the model's latent factors pick up correlated eQTLs corresponding to interferon gamma and inflammatory responses.

Code for the model and experiments in the paper is here. Try it out and let us know what you think!

Read More
Barbara Engelhardt Barbara Engelhardt

single cell TBLDA

single cell telescoping bimodal latent Dirichlet allocation

BEEHIVE alumna Dr. Ari Gewirtz and @will_townes post preprint Expression QTLs in single-cell data that describes single-cell telescoping bimodal latent Dirichlet allocation (scTBLDA) applied to population-scale single-cell RNA-seq data to find expression QTLs.

These beautiful scRNA-seq and paired genotype data come from Jimmie Yee's Lab and include approximately 400,000 peripheral blood mononuclear cells (PBMCs) from 119 women with systemic lupus erythematosus (SLE).

We extend our method, telescoping bimodal latent Dirichlet allocation (TBLDA), that identifies covarying genotypes and gene expression values when the matching from samples to cells is not one-to-one in order to allow cell-type label agnostic discovery of eQTLs in noncomposite scRNA-seq data. In particular, we add GPU-compatibility, sparse priors, and amortization to enable fast inference on large-scale scRNA-seq data.

We find complex cell-type heterogeneity within specific immune cell-type labels.

We use linked genes and SNPs to identify 205 cis-eQTLS, 66 trans-eQTLs, and 53 cell type proportion QTLs. Our results demonstrate the ability of scTBLDA to identify genes involved in cell-type specific regulatory processes associated with SNPs in single-cell data.

We found broad replication of our findings in three scRNA-seq studies: [van der Wijst et al. 2018], [Vosa et al. 2021], and [Fairfax et al. 2012].

We found enrichment of our trans-eQTLs in enhancers using stratified LD score regression.

scTBLDA is an important step forward in methods for eQTL analysis in single-cell sequencing studies. Our approach shifts the paradigm of mapping QTLs from univariate analyses to an analysis that more accurately captures statistical complexities and biological models of regulation.

Code available for scTBLDA -- try it on your population-wide single-cell sequencing data! Feedback welcome!

Read More
Barbara Engelhardt Barbara Engelhardt

Spatial experimental design

Spatial experimental design

New preprint on optimizing the design of spatial genomics studies! This represents the hard work of Andy Jones, Diana Cai, and Didong Li. When gathering spatial sequencing data from a tissue, typically parallel axis-aligned slices are selected, but these slices may contain redundant information. Can we make spatial genomics experiments more cost-efficient by profiling tissue slices that are maximally informative?

We propose a Bayesian optimal experimental design method for two separate spatial genomics contexts: building atlases of tissues/organs and identifying the boundaries of regions of interest---such as tumors---in tissue samples.

For a given genomic assay, our method iteratively selects the tissue slice that maximizes the expected information gain, i.e., the slice that is most likely to provide the greatest amount of new information on top of the data already collected.

To demonstrate our approach in the atlas-building context, we used the Allen Brain Atlas’s spatial gene expression data from mouse brain. We found that our sequential approach selects slices with more information in fewer slices compared to traditional serial slicing strategies.

We also applied our model to an experimental setting in which a tumor is being localized. Applied to Visium data from a prostate cancer sample, we find that our approach is able to achieve near-optimal bounding with four slices, as opposed to serial approaches, which take eight.

Feedback welcome! Try out our approach – we hope it will design an efficient and informative experiment for you!

Read More