Integration of spatial and single-cell data across modalities with weakly linked features – Nature Biotechnology

[ad_1]

The MaxFuse pipeline

Input preparation

Consider a pair of datasets, $Y\in {{\mathbb{R}}}^{{N}_{y}\times {p}_{y}}$ and $Z\in {{\mathbb{R}}}^{{N}_{z}\times {p}_{z}}$, from two modalities (termed Y-modality and Z-modality for exposition convenience), with each row corresponding to a cell and each column a feature. In the ensuing discussion, we treat Y as the modality with a higher signal-to-noise ratio. For concreteness, one can think of Y as an snRNA-seq dataset and Z as a CODEX dataset. Suppose there are two known functions, ${f}_{y}:{{\mathbb{R}}}^{{p}_{y}}\to {{\mathbb{R}}}^{s}$ and ${f}_{z}:{{\mathbb{R}}}^{{p}_{z}}\to {{\mathbb{R}}}^{s}$, such that $f_y(\mathbf{y})$ predicts the values of $f_z(\mathbf{z})$ in a cell if the measured values under Y-modality are ${\mathbf{y}}$ in that cell and those under Z-modality are ${\mathbf{z}}$. For any matrix A with p_y columns, let f_y(A) denote the matrix with s columns and the same number of rows as A, obtained from applying f_y on each row of A and stacking the outputs as row vectors. For any matrix B with p_z columns, f_z(B) is analogously defined. We define ${Y}^{\circ }={f}_{y}(Y)\in {{\mathbb{R}}}^{{N}_{y}\times s}$ and ${Z}^{\circ }={f}_{z}(Z)\in {{\mathbb{R}}}^{{N}_{Z}\times s}$. In the snRNA-seq versus CODEX example, if one has a crude prediction for a subset S (with size $\left\vert S\right\vert =s$) of the proteins, then $f_z(\mathbf{z})={\mathbf{z}}_S$ returns the subvector indexed by S while ${f}_{y}({\bf{y}})={\hat{\mathbf{z}}}_{S}$ predicts the observed CODEX values for these proteins based on transcriptomic information of a cell. In summary, we start with a pair of original datasets (Y, Z) and a pair of datasets (${Y^{\circ}}$, ${Z^{\circ}}$), where the columns of the latter have one-to-one correspondence based on domain knowledge. The columns of ${Y^{\circ}}$ and ${Z^{\circ}}$ can be learned feature-wise prediction functions, as described above, or learned coembedding coordinates from some model trained on multi-omics data.

Meta-cell construction. To alleviate sparsity and to scale to large datasets, we start by constructing meta-cells. Let n_y be the desired number of meta-cells. We first construct a nearest-neighbor graph of the rows of Y, apply Leiden clustering with an appropriate resolution level to obtain n_y clusters and average over the rows within each cluster to obtain the features for each meta-cell. Consequently, we obtain ${Y}_{{\mathtt{m}}}\in {{\mathbb{R}}}^{{n}_{y}\times {p}_{y}}$. Using this clustering structure (induced by Y), we can average feature vectors in ${Y^{\circ}}$ to obtain ${Y}_{{\mathtt{m}}}^{\circ }\in {{\mathbb{R}}}^{{n}_{y}\times s}$. When desired, the same operation can be performed on the Z-modality to obtain ${Z}_{{\mathtt{m}}}\in {{\mathbb{R}}}^{{n}_{z}\times {p}_{z}}$ and ${Z}_{{\mathtt{m}}}^{\circ }\in {{\mathbb{R}}}^{{n}_{z}\times s}$. We recommend only constructing meta-cells for modalities that allow cell state differentiation at fine granularity. For example, if Y-modality contains snRNA-seq data and Z-modality contains CODEX data, then we would usually recommend to construct meta-cells only in Y-modality. The choices of meta-cell size for analyses reported in this work are given in Supplementary Table 2. In addition, in Extended Data Figs. 3 and 4 and Supplementary Figs. 1 and 2, we benchmarked robustness of results with respect to meta-cell size. Meta-cell sizes of 2–3 are optimal across the datasets we tested. After this curation step, we have two pairs of datasets, $({Y}_{{\mathtt{m}}}^{},{Z}_{{\mathtt{m}}}^{})$ and $({Y}_{{\mathtt{m}}}^{\circ },{Z}_{{\mathtt{m}}}^{\circ })$. The former pair can have completely distinct feature sets, while the latter pair must have matching feature sets with corresponding columns. In Fig. 1a, the former correspond to the pair of all-feature matrices, and the latter correspond to the pair of linked-feature matrices.

Fuzzy smoothing

Let ${G}_{Y}\in {\{0,1\}}^{{n}_{y}\times {n}_{y}}$ be a nearest-neighbor graph of ${Y}_{{\mathtt{m}}}^{}$ where each row i is connected to ${k}_{i}^{Y}$ rows that are closest in a chosen similarity measure, including itself. So row i of G_Y has ${k}_{i}^{Y}$ entries equal to one and others zeros. In addition, all its diagonal entries are equal to one. Let ${{{{\mathcal{A}}}}}_{Y}({Y}_{{\mathtt{m}}})={K}_{Y}^{-1}{G}_{Y}{Y}_{{\mathtt{m}}}$ and ${{{{\mathcal{A}}}}}_{Y}({Y}_{{\mathtt{m}}}^{\circ })={K}_{Y}^{-1}{G}_{Y}{Y}_{{\mathtt{m}}}^{\circ }$ be locally averaged versions of ${Y}_{{\mathtt{m}}}^{}$ and ${Y}_{{\mathtt{m}}}^{\circ }$ over G_Y, respectively, where ${K}_{Y}={{{\rm{diag}}}}({k}_{1}^{Y},\ldots ,{k}_{{n}_{y}}^{Y})$. For a nearest-neighbor graph G_Z, we define ${{{{\mathcal{A}}}}}_{Z}({Z}_{{\mathtt{m}}})$ and ${{{{\mathcal{A}}}}}_{Z}({Z}_{{\mathtt{m}}}^{\circ })$ in an analogous way. Finally, for any weight w ∈ [0, 1] and any matrices A and B with n_y and n_z rows, respectively, we define

$$\begin{array}{rcl}{{{{\mathcal{S}}}}}_{Y}(A;w)&=&wA+(1-w){{{{\mathcal{A}}}}}_{Y}(A),\\ {{{{\mathcal{S}}}}}_{Z}(B;w)&=&wB+(1-w){{{{\mathcal{A}}}}}_{Z}(B).\end{array}$$

(1)

In this way, we define ${\widetilde{Y}}_{{\mathtt{m}}}^{\circ }={{{{\mathcal{S}}}}}_{Y}({Y}_{{\mathtt{m}}}^{\circ };{w}_{0})$ and ${\widetilde{Z}}_{{\mathtt{m}}}^{\circ }={{{{\mathcal{S}}}}}_{Z}({Z}_{{\mathtt{m}}}^{\circ };{w}_{0})$ with w₀ ∈ [0, 1]. In Fig. 1a, these are matrices with smoothed Y-modality linked features and smoothed Z-modality linked features, respectively. See Supplementary Table 3 for a list of smoothing weights used in data analyses reported in this work.

Initial matching via linear assignment

As the columns in ${\widetilde{Y}}_{{\mathtt{m}}}^{\circ }$ and in ${\widetilde{Z}}_{{\mathtt{m}}}^{\circ }$ have correspondences, we can compute an n_y × n_z distance matrix ${D^{\circ}}$ where ${D}_{ij}^{\circ }$ measures the distance between the i-th row in ${\widetilde{Y}}_{{\mathtt{m}}}^{\circ }$ and the j-th row in ${\widetilde{Z}}_{{\mathtt{m}}}^{\circ }$ after projecting to respective leading singular subspaces. We obtain an initial matching ${\widehat{\Pi }}^{\circ }$ as the solution to the linear assignment problem^32,64:

$$\begin{array}{rcl}&\,{{\mbox{minimize}}}\,&\langle \Pi ,{D}^{\circ }\rangle \\ &\,{{\mbox{subject to}}}\,&\Pi \in {\{0,1\}}^{{n}_{y}\times {n}_{z}}\\ &&\mathop{\sum}\limits_{i}{\Pi }_{ij}\le 1,\forall j,\quad \mathop{\sum}\limits_{j}{\Pi }_{ij}\le 1,\forall i,\\ &&\mathop{\sum}\limits_{i,\,j}{\Pi }_{ij}={n}_{\min }.\end{array}$$

(2)

Here, ${n}_{\min }=\min \{{n}_{y},{n}_{z}\}$ and, for two matrices A and B of the same size, 〈A, B〉 = ∑_i,jA_ijB_ij denotes the trace inner product. The linear assignment problem in equation (2) can be efficiently solved by relaxing the first constraint to $\Pi \in {[0,1]}^{{n}_{y}\times {n}_{z}}$. The resulting linear program has the same solution as equation (2). The Python implementation we used is based on the shortest augmenting path approach summarized in ref. ⁶⁵. The estimator ${\widehat{\Pi }}^{\circ }$ provides a relatively crude matching using only the information provided by the previous knowledge encapsulated in f_y and f_z which link features in the two modalities. By definition, ${\widehat{\Pi }}^{\circ }$ gives ${n}_{\min }$ pairs of matched rows between the two modalities, which we call initial pivots.

Cross-modality joint embedding and iterative refinement

From matched pairs to joint embedding. An estimated matching $\widehat{\Pi }$ induces a cross-modality joint embedding of ${Y}_{{\mathtt{m}}}^{}$ and ${Z}_{{\mathtt{m}}}^{}$. Let ${Y}_{{\mathtt{m}}}^{\,{\mathtt{r}}}\in {{\mathbb{R}}}^{{n}_{y}\times {r}_{y}}$ and ${Z}_{{\mathtt{m}}}^{\,{\mathtt{r}}}\in {{\mathbb{R}}}^{{n}_{z}\times {r}_{z}}$ collect the leading principal components of all features (that is, ${Y}_{{\mathtt{m}}}^{}$ and ${Z}_{{\mathtt{m}}}^{}$) in the two modalities, respectively. Here, the numbers of principal components to retain, that is, r_y and r_z, are chosen based on data. For any matrix A, let [A]_i⋅ denote its i-th row. Suppose $\{({i}_{\ell },{i}_{\ell }^{{\prime} }):\ell =1,\ldots ,{n}_{\min }\}$ are the matched pairs specified by $\widehat{\Pi }$. We perform CCA on data pairs

$$\left\{\left({\left[{Y}_{{\mathtt{m}}}^{\,{\mathtt{r}}}\right]}_{{i}_{\ell }\cdot },{\left[{Z}_{{\mathtt{m}}}^{\,{\mathtt{r}}}\right]}_{{i}_{\ell }^{{\prime} }\cdot }\right):\ell =1,\ldots ,{n}_{\min }\right\}$$

to obtain the leading ${r}_{{\mathtt{cc}}}^{}$ loading vectors for either modality, collected as the columns of ${\widehat{C}}_{y}={\widehat{C}}_{y}(\widehat{\Pi })$ and ${\widehat{C}}_{z}={\widehat{C}}_{z}(\widehat{\Pi })$, respectively. The joint embedding induced by $\widehat{\Pi }$ is then ${Y}_{{\mathtt{m}}}^{{\mathtt{cc}}}={Y}_{{\mathtt{m}}}^{\,{\mathtt{r}}}{\widehat{C}}_{y}\in {{\mathbb{R}}}^{{n}_{y}\times {r}_{{\mathtt{cc}}}}$ and ${Z}_{{\mathtt{m}}}^{{\mathtt{cc}}}=$${Z}_{{\mathtt{m}}}^{{\mathtt{r}}}{\widehat{C}}_{z}\in {{\mathbb{R}}}^{{n}_{z}\times {r}_{{\mathtt{cc}}}}$, the predicted canonical correlation (CC) scores of ${Y}_{{\mathtt{m}}}^{{\mathtt{r}}}$ and ${Z}_{{\mathtt{m}}}^{{\mathtt{r}}}$, respectively.

Iterative refinement. Let ${\widehat{\Pi }}^{(0)}={\widehat{\Pi }}^{\circ }$ be the initial matching obtained from equation (2). We fix a weight w₁ ∈ [0, 1] and the embedding dimension ${r{}^{{\mathtt{cc}}}}^{}$, and we refine the estimated matching by iterating the following steps for t = 1, …, T:

(1)
Compute joint embedding $\{{Y}_{{\mathtt{m}}}^{\,{\mathtt{cc}},(t)},{Z}_{{\mathtt{m}}}^{\,{\mathtt{cc}},(t)}\}$ induced by ${\widehat{\Pi }}^{(t-1)}$;
(2)
Apply fuzzy smoothing on joint embedding: ${\widetilde{Y}}_{{\mathtt{m}}}^{\,{\mathtt{cc}},(t)}={{{{\mathcal{S}}}}}_{Y}({Y}_{{\mathtt{m}}}^{\,{\mathtt{cc}},(t)},{w}_{1})$, ${\widetilde{Z}}_{{\mathtt{m}}}^{\,{\mathtt{cc}},(t)}={{{{\mathcal{S}}}}}_{Z}({Z}_{{\mathtt{m}}}^{\,{\mathtt{cc}},(t)},{w}_{1})$;
(3)
Calculate a distance matrix ${D}^{(t)}\in {{\mathbb{R}}}^{{n}_{y}\times {n}_{z}}$ where ${D}_{ij}^{(t)}$ measures the distance between ${[{\widetilde{Y}}_{{\mathtt{m}}}^{{\mathtt{cc}},(t)}]}_{i\cdot }$ and ${[{\widetilde{Z}}_{{\mathtt{m}}}^{{\mathtt{cc}},(t)}]}_{j\cdot }$, and obtain a refined matching ${\widehat{\Pi }}^{(t)}$ by solving equation (2) in which ${D^{\circ}}$ is replaced with D^(t).

Figure 1b illustrates the foregoing refinement iteration.

Propagation of matching and postprocessing

For downstream analyses, one would often like to find for each cell in Y a match in Z, or vice versa, and sometimes both ways. In addition, one often wants joint embedding of cells across different modalities in a common space. We now describe how MaxFuse achieves these goals.

Filtering and final joint embedding. Upon obtaining the matched pairs $\{({i}_{\ell },{i}_{\ell }^{{\prime} }):\ell =1,\ldots ,{n}_{\min }\}$ in ${\widehat{\Pi }}^{(T)}$, we rank them in descending order of ${D}_{{i}_{\ell }{i}_{\ell }^{{\prime} }}^{(T)}$ and only retain the top 100 × (1 − α)% pairs, where α is a user-specified filtering proportion (with a default α = 0). The retained pairs are called refined pivots. Then, we fit a CCA using the refined pivots and the corresponding rows in ${Y}_{{\mathtt{m}}}^{}$ and ${Z}_{{\mathtt{m}}}^{}$ to get the associated CCA loading matrices ${\widehat{C}}_{y}^{{\mathtt{e}}}\in {{\mathbb{R}}}^{{p}_{y}\times {r}^{{\mathtt{e}}}}$ and ${\widehat{C}}_{z}^{{\mathtt{e}}}\in {{\mathbb{R}}}^{{p}_{z}\times {r}^{{\mathtt{e}}}}$. Here the positive integer ${r{}^{{\mathtt{e}}}}^{}$ is a user-specified dimension for final joint embedding. Finally, the joint embedding of the full datasets is given by ${Y}^{{\mathtt{e}}}=Y{\widehat{C}}_{y}^{{\mathtt{e}}}\in {{\mathbb{R}}}^{{N}_{y}\times {r}^{{\mathtt{e}}}}$ and ${Z}^{{\mathtt{e}}}=Z{\widehat{C}}_{z}^{{\mathtt{e}}}\in {{\mathbb{R}}}^{{N}_{z}\times {r}^{{\mathtt{e}}}}$, respectively. In Fig. 1c, they correspond to the Y-modality embedding and Z-modality embedding matrices.

Using pivots to propagate matching. For each row index i ∈ {1, …, n_y} in Y-modality that does not have a match in Z-modality, MaxFuse searches for the nearest neighbor of the i-th row in ${\widetilde{Y}}_{{\mathtt{m}}}={{{{\mathcal{S}}}}}_{Y}({Y}_{{\mathtt{m}}};{w}_{0})$ that belongs to some refined pivot. Suppose the nearest neighbor is the j_i-th row with a match ${j}_{i}^{{\prime} }$ in Z-modality, then we call $(i,{j}_{i}^{{\prime} })$ a matched pair obtained via propagation. We can optionally filter out any matched pair via propagation in which the nearest-neighbor distance between ${[{\widetilde{Y}}_{{\mathtt{m}}}]}_{i\cdot }$ and ${[{\widetilde{Y}}_{{\mathtt{m}}}]}_{{j}_{i}\cdot }$ is above a user-specified threshold. The retained matched pairs compose the Y-to-Z propagated matching. This procedure is then repeated with the roles of Y– and Z-modalities switched to obtain the Z-to-Y propagated matching. Pooling all matched pairs from refined pivots and propagated matching together, we obtain a matching between meta-cells in Y-modality and those in Z-modality. Such a meta-cell-level matching defines a single-cell-level matching between the original datasets Y and Z by declaring $(i,{i}^{{\prime} })$ a matched pair for $1\le i\le {N}_{y},1\le {i}^{{\prime} }\le {N}_{z}$ if the meta-cell that i belongs to is matched to the meta-cell that ${i}^{{\prime} }$ belongs to.

Scoring and directional pruning of matching. For each single-cell-level matched pair $(i,{i}^{{\prime} })$, we compute Pearson correlation between the i-th row of ${Y{}^{{\mathtt{e}}}}^{}$ and the ${i}^{{\prime} }$-th row of ${Z{}^{{\mathtt{e}}}}^{}$ (that is, corresponding rows in final joint embedding) as its matching score. We use these matching scores to prune single-cell-level matching, with the direction of pruning specified by the user. Suppose the user wants to find for each cell in Z a match in Y (for example, Z is a CODEX dataset and Y snRNA-seq). Then for each cell index $1\le {i}^{{\prime} }\le {N}_{z}$, we first list all refined pivots and propagated matching pairs that contain ${i}^{{\prime} }$. If the list is nonempty, we only retain the pair with the highest matching score. Otherwise, we declare no match for cell ${i}^{{\prime} }$ in Z-modality. If the direction is reversed, we apply the foregoing procedure with the roles of Y and Z switched. Furthermore, if no directional pruning is desired, we just keep all refined pivots and postscreening propagated matching pairs in the final single-cell matching. In Extended Data Figs. 3 and 4 and Supplementary Figs. 1 and 2, we benchmarked how evaluation metrics change with different choices of filtering proportions in propagation and in pruning. In Supplementary Table 4, we reported the filtering proportions used in the data analyses reported in this work. After filtering, propagation and potential pruning, the final list of matched pairs corresponds to the final matching in Fig. 1c.

Systematic benchmarking on ground-truth datasets

MaxFuse and other methods in comparison

MaxFuse was implemented in Python, and the four methods used for comparison, Seurat V3, Harmony, Liger and BindSC, were implemented in R. All benchmarking datasets were preprocessed in the same way for all methods, including filtering of low-quality cells, selection of highly variable genes and protein features to be used in integration, feature linkage scheme (for example, protein to their corresponding gene names) and normalization of raw observed values (except for Liger which required scaling without centering). We used the default tuning parameters in each method suggested by the respective tutorial, with the exception of BindSC, for which we used the separate set of parameters suggested for the integration of protein-related data by its method tutorial website. For MaxFuse, initial matching used features that are weakly linked (for example, protein CD4 and RNA CD4) and are smoothed by all-feature nearest-neighbor graphs. For refined matching, all features from both modalities were used (for example, all proteins and RNAs that are highly variable). For other methods in comparison, BindSC used both the weakly linked features and all features, whereas others only used the weakly linked features by design. The full details were recorded and can be reproduced, with code deposited to https://github.com/shuxiaoc/maxfuse/tree/main/Archive.

Evaluation metrics

(1)
Cell-type matching accuracy: To evaluate the matching performance for Seurat V3, Liger, Harmony and BindSC, we used the respective integration embedding vectors produced by each method. For these methods, for each cell in one modality, we regarded its nearest neighbor from the other modality under Pearson correlation distance in the embedding space as its match. For MaxFuse, we directly used matched pairs produced in the final result. For all methods, we use the same matching direction (for example, for each cell in CODEX data finding a matched cell in scRNA-seq data) for fair comparison. Accuracy of the matchings was measured by fraction of matched pairs with identical cell-type annotations. Details on cell-type annotation are given below in the description of each benchmarking dataset.
(2)
FOSCTTM: FOSCTTM was used to evaluate single-cell-level alignment accuracy on datasets with ground-truth single-cell-level pairing. The measure has been used previously in cross-modality alignment benchmarking tasks^19,36,37. For such data, N_y = N_z = N, and FOSCTTM is defined as:

$${{{\rm{FOSCTTM}}}}=\frac{1}{2N}\left(\mathop{\sum }\limits_{i=1}^{N}\frac{{n}_{y}^{(i)}}{N}+\mathop{\sum }\limits_{i=1}^{N}\frac{{n}_{z}^{(i)}}{N}\right),$$

where for each $i,{n}_{y}^{(i)}=\left\vert \{\,j\left.\right\vert d({\,y}_{i},{z}_{j}) < d({\,y}_{i},{z}_{i})\}\right\vert$ with d a distance metric in the joint embedding space and for l = 1, …, N, y_l and z_l are the embedded vectors of the l-th cell with its measurements in Y– and Z-modality, respectively. The counts ${n}_{z}^{(i)},i=1,\ldots ,N$, are defined analogously. A lower value of FOSCTTM indicates better integration performance.
(3)
FOSKNN: FOSKNN was used to evaluate single-cell-level alignment accuracy on datasets with ground-truth single-cell-level pairing. For such data, N_y = N_z = N. For any method in comparison, let {y_i: i = 1, …, N} be the coordinates of cells in the joint embedding space from their Y-modality information, and let {z_i: i = 1, …, N} be embedding coordinates from their Z-modality information. Then

$${{{\rm{FOSKNN}}}}=\frac{1}{2N}\left(\mathop{\sum }\limits_{i=1}^{N}{{{{\bf{1}}}}}_{{E}_{y,k}}^{(i)}+\mathop{\sum }\limits_{i=1}^{N}{{{{\bf{1}}}}}_{{E}_{z,k}}^{(i)}\right)$$

where for $i=1,\ldots ,N,{{{{\bf{1}}}}}_{{E}_{y,k}}^{(i)}$ is the indicator of whether the k closest embedded vectors from Z-modality to y_i includes z_i. The quantity ${{{{\bf{1}}}}}_{{E}_{z,k}}^{(i)}$ is defined analogously. A higher value of FOSKNN indicates better integration performance.
(4)
Silhouette F1 score: Silhouette F1 score has been used to simultaneously measure modality mixing and information preservation post integration process^21,35. In brief, the F1 score was calculated by 2 ⋅ slt_mix ⋅ slt_clust/(slt_mix + slt_clust), where slt_mix is defined as one minus normalized Silhouette width with the label being modality index (two modalities); slt_clust is defined by the normalized Silhouette width with the label being cell-type annotations (for example, ‘CD4 T’, ‘CD8 T’, ‘B’ and so on). All Silhouette widths were computed using the silhouette function from R package cluster.
(5)
ARI F1 score: ARI F1 score has been used to jointly measure modality mixing and information preservation post integration process^21,35. The score was calculated in a similar way to Silhouette F1 score, while the ARI was used instead of the Silhouette width. All ARI scores were computed using the function adjustedRandIndex in R package mclust.

CITE-seq PBMC dataset analysis

The CITE-seq data from human PBMCs with antibody panel of 228 markers were retrieved from Hao et al.³³ and cell-type annotations (level 1: 8 cell types; and level 2: 31 cell types) were directly retrieved from the original annotation in ref. ³³. For benchmarking purposes, five batches of cells, each with 10,000 cells, were randomly sampled from the original dataset and used for benchmarking. The first 15 components of the embedding vectors produced by all methods were used for benchmarking metric calculation. The UMAP visualization of the integration process was also calculated with the first 15 components of the embedding vectors. For visualization purposes, the 31 cell types of level 2 annotation were manually binned into 20 cell types in the UMAP cell-type coloring.

For analyses with fewer antibodies, we ranked the importance of each individual antibody in the panel in terms of phenotyping contribution. The importance score was calculated by training a random forest model (function randomForest in R package randomForest, with default parameters) using all antibodies to predict cell-type labels (annotation level 2), then a permutation feature importance test (function varImp with default parameters in R package caret) was performed on the trained model to acquire the importance scores. Then antibodies were ranked by the importance scores, and four panels were used for the antibody dropping test: (1) full 228-antibody panel; (2) top 100 most important antibodies; (3) top 50 most important antibodies; (4) top 30 most important antibodies.

CITE-seq bone marrow cell dataset analysis

The CITE-seq healthy human bone marrow cells (BMCs) data with an antibody panel of 25 markers were retrieved from the R package SeuratData ‘bmcite’; these data were also reported by Hao et al.³³. A total of 20,000 cells were randomly sampled from the original dataset and used for benchmarking. The first 15 components of the embedding vectors produced by all methods were used for benchmarking metric calculation. The UMAP visualization of the integration process was also calculated with the first 15 components of the embedding vectors. The original cell-type annotation (lv2) from the R package was binned into eight populations, ‘DC’, ‘progenitor’, ‘monocyte’, ‘NK’, ‘B’, ‘CD4 T’, ‘CD8 T’ and ‘Other T’, and used for benchmarking.

Abseq BMC dataset analysis

The Abseq healthy human BMC data with antibody panel of 97 markers and whole transcriptome sequencing were retrieved from Triana et al.³⁹. All cells in the dataset (~13,000), except cells belonging to cell types with insufficient numbers of cells (<50 cells, annotated as ‘Doublet and Triplets’, ‘Early GMP’, ‘Gamma delta T cells’, ‘Immature B cells’, ‘Metaphase MPPs’, ‘Neutrophils’ in ref. ³⁹), were included for integration. The remaining 14 cell types were used during benchmarking. The first 15 components of the embedding vectors produced by all methods were used for benchmarking metric calculation. The UMAP visualization of the integration process was also calculated with the first 15 components of the embedding vectors.

TEA-seq PBMC dataset analysis

The TEA-seq neutrophil-depleted human PBMC dataset was retrieved from Swanson et al.⁴¹ (GSM4949911). This dataset contains 46 antibodies and chromatin accessibility information. Cell-type annotation was performed using R package Seurat (v.4) WNN-multi-modal clustering pipeline: function FindMultiModalNeighbors was run on the antibody-derived tags (ADT) assay principal component analysis (PCA) output (first 25 components) and the ATAC assay latent semantic indexing (LSI) output (first 2–50 components, calculated by R package Archr⁴²). Subsequently, the function FindClusters was used to generate unsupervised clustering (with parameters algorithm = 3, resolution = 0.2), followed by manual annotation. A total of eight populations were identified (‘Naive CD4’, ‘Mem CD4’, ‘Monocyte’, ‘NK’, ‘Naive CD8’, ‘Mem CD8’, ‘Effector CD8’, ‘B’, ‘NK’), and the total number of cells was ~7,400. ADT expressions and gene activity scores (calculated by R package Archr⁴²) were used as input for MaxFuse and other methods. Additionally, during matching refinement, MaxFuse used LSI reductions of the ATAC peaks (first 2–50 components) as features for the ATAC modality. The first 15 components of the embedding vectors produced by all methods were used for benchmarking metric calculation. The UMAP visualization of the integration process was also calculated with the first 15 components of the embedding vectors.

ASAP-seq PBMC dataset analysis

The ASAP-seq healthy human PBMC data (CD28 and CD3 stim PBMC control group) with an antibody panel of 227 markers and chromatin accessibility information were retrieved from Mimitou et al.⁴⁰ (GSM4732109 and GSM4732110). Cell-type annotation was performed using R package Seurat (v.4) WNN-multi-modal clustering pipeline: the function FindMultiModalNeighbors was run on ADT PCA (first 18 components) and ATAC LSI (2–40 components, calculated by R package Archr). Subsequently, the function FindClusters was used to generate unsupervised clustering (with parameters algorithm = 3, resolution = 0.3), followed by manual annotation. A total of nine populations were identified (‘Naive CD4’, ‘Mem CD4’, ‘Monocyte’, ‘NK’, ‘Naive CD8’, ‘Mem CD8’, ‘B’, ‘Other T’, ‘dirt’), and ‘dirt’ was removed from subsequent analyses, resulting in about 4,400 cells used. ADT expressions and gene activity scores (calculated by R package Archr) were used as input for MaxFuse and other methods. Additionally, during matching refinement, MaxFuse used LSI reductions of the ATAC peaks (first 2–50 components) as features for the ATAC modality. The first 15 components of the embedding vectors produced by all methods were used for benchmarking metric calculation. The UMAP visualization of the integration process was also calculated with the first 15 components of the embedding vectors.

MaxFuse on spatial-omics matching

CODEX and scRNA-seq human tonsil dataset analysis

CODEX multiplex imaging data of human tonsil tissues with a panel of 46 antibodies were retrieved from Kennedy-Darling et al.⁴⁹. Images from tonsil-9338 (region X2-8, Y7-15) were used. Whole-cell segmentation was performed with a local implementation of Mesmer⁶⁶, with weights downloaded from: https://deepcell-data.s3-us-west-1.amazonaws.com/model-weights/Multiplex_Segmentation_20200908_2_head.h5. Inputs of segmentation were DAPI (nuclear) and CD45 (membrane). Signals from the images were capped at 99.7th percentile, with prediction parameter model_mpp = 0.8. Cells smaller than 30 pixels or larger than 800 pixels were excluded. Signals from individual cells were then extracted, and scaled to the [0, 1] interval, with percentile cutoffs at 0.5% (floor) and 99.5% (ceiling). Cell-type annotation was performed using R package Seurat clustering pipeline: the function FindNeighbors was run on CODEX protein PCA (first 15 components). Subsequently, the function FindClusters was used to generate unsupervised clustering (with parameter resolution = 1), followed by manual annotation. A total of ten populations were identified (‘B-CD22-CD40’, ‘B-Ki67’, ‘Plasma’, ‘CD4 T’, ‘CD8 T’, ‘DC’, ‘Fibro/Epi’, ‘Vessel’, ‘Other’ and ‘Dirt’), and six populations (~180,000 cells in total) were used in subsequent analyses (‘B-CD22-CD40’, ‘B-Ki67’, ‘Plasma’, ‘CD4 T’, ‘CD8 T’ and ‘DC’).

scRNA-seq data of dissociated human tonsil cells were retrieved from King et al.⁵⁰. The preprocessing and cell typing steps were done in the R package Seurat, following the description presented in ref. ⁵⁰. In brief, tonsil cells (‘t1’, ‘t2’ and ‘t3’) were merged, then filtered by the criteria nFeature_RNA > 200 & nFeature_RNA < 7500 & percent.mt < 20, and subsequently values were normalized by the function SCTransform. Harmony batch correction was performed for different tonsils for clustering only, with the function RunHarmony. Unsupervised clustering was performed by the function FindNeighbors with Harmony embedding (1–27 dimensions) and function FindClusters with resolution = 0.5. A total of eight populations were defined (‘B-CD22-CD40’, ‘B-Ki67’, ‘circulating B’, ‘Plasma’, ‘CD4 T’, ‘CD8 T’, ‘DC’, ‘Other’), and six populations (~13,000 cells in total) were used in subsequent analyses (‘B-CD22-CD40’, ‘B-Ki67’, ‘Plasma’, ‘CD4 T’, ‘CD8 T’ and ‘DC’).

Boundaries of GCs from the CODEX images were drawn manually, and dilation and erosion from the boundary was performed with the Python package skimage, with functions morphology.binary_dilation and morphology.disk. Ten layers inward and ten layers outward from the boundary (each layer = 30 pixels; resolution: 376 nm per pixel) were performed, respectively. Cells were assigned to each layer based on locations of centroids. The RNA expression levels from each layer, based on the averaged CODEX-matched scRNA-seq cells, were plotted with the R package ggplot2. The UMAP visualization of the integration process was calculated with the first 15 components of the embedding vectors.

HUBMAP atlas: tri-modal human intestine dataset analysis

CODEX multiplex imaging (48 markers), snRNA-seq and snATAC-seq data of healthy human intestine cells were acquired from Hickey et al.³¹. For CODEX, samples ‘B005_SB’ and ‘B006_CL’ were used, while for snRNA-seq and snATAC-seq, single-ome sequencing data of four donors (‘B001’, ‘B004’, ‘B005’, ‘B006’) from the study were used. Cells annotated as ‘B cells’, ‘T cells’, ‘Endothelial’, ‘Enteroendocrine’, ‘Goblet’, ‘Mono_Macrophages’, ‘Plasma’, ‘Smooth muscle’ and ‘Stroma’ were selected for the integration process. Cell counts for each modality used for MaxFuse were: CODEX ~100,000 (small bowel) and ~70,000 (colon); snRNA-seq ~32,000 (small bowel) and ~16,000 (colon); snATAC-seq ~28,000 (small bowel) and ~21,000 (colon). CODEX protein expressions, snRNA-seq RNA expressions, snATAC-seq gene activity scores and LSI scores (calculated with R package Archr) were used as MaxFuse input (RNA expressions, gene activity scores and LSI scores were batch-corrected by Harmony²⁰, based on patient ID). The matching and integration processes were done on colon and small bowel samples, respectively.

Pairwise MaxFuse alignments of cells between protein (CODEX) and RNA (snRNA-seq), and of cells between RNA (snRNA-seq) and ATAC (snATAC-seq), were performed. Refined pivots from the two bimodal alignments were chained together by using the pivot cells in the RNA modality as the intermediary, resulting in a list of tri-modal pivots linking all three modalities. Subsequently, we used these pivots to calculate a tri-omic embedding via gCCA^21,59. In particular, we used the gCCA formulation and algorithm described in ref. ²¹.

The UMAP visualization of the tri-modal integration was calculated with the first 15 components of the embedding vectors (gCCA scores in this case). Embeddings of CODEX cells were overlaid with their protein expressions, or their matched cells’ RNA expressions, or gene activity scores. Spatial locations of these expression values and scores were plotted based on CODEX cells’ x–y centroid locations. Additionally, we showed spatial locations of transcription factor motif enrichment scores (Z-score) of CODEX cells, based on their matched snRNA-seq cells, which were calculated by the R package chromVAR⁶⁰. All values were capped between 5% and 95% quantiles for visualization purposes during plotting.