Multi-omics data integration using ratio-based quantitative profiling with Quartet reference materials – Nature Biotechnology


Human participants

This study was approved by the institutional review board of the School of Life Sciences, Fudan University (BE2050). It was conducted under the principles of the Declaration of Helsinki. Four healthy volunteers from a family quartet, as part of the Taizhou Longitudinal Study in Taizhou, Jiangsu, China, were enrolled and their peripheral blood was collected to establish immortalized LCLs. All four donors signed informed consent forms.

Establishment of the Quartet LCLs

We adopted the widely used protocol in which EBV is used to establish immortalized LCLs88. Peripheral blood mononuclear cells were isolated using a lymphocyte separation solution (Ficoll). Naive B cells were sorted by EasySep Human Naive B Cell Enrichment Kit (STEMCELL, 19254) and infected with EBV by centrifugation at 400g for 1 h. After incubation, successfully infected and immortalized cells were propagated in culture medium.

Cell culture

The Quartet LCLs were cultured in RPMI 1640 supplemented with 2 mM l-glutamine, 10% heat-inactivated FBS and 1% penicillin–streptomycin at 37 °C with 5% CO2. The cells were passaged every 72 h at a 1:4 split ratio.

Preparation of the first batch of DNA reference materials

To obtain the first batch of DNA reference materials (lot no. 20160806), 2 × 109 cells were collected simultaneously for each cell line. Specifically, the cells grew in suspension and were centrifuged at 300g for 5 min to obtain cell pellets. The cell pellets were then washed twice with cold PBS.

The DNA reference materials were isolated using the Blood & Cell Culture DNA Maxi Kit (Qiagen) according to the manufacturer’s instructions, divided into 1,000 aliquots for each of the Quartet members and then labeled as Quartet_DNA_D5_20160806, Quartet_DNA_D6_20160806, Quartet_DNA_F7_20160806 or Quartet_DNA_M8_20160806. A single vial contains approximately 10 μg of genomic DNA (220 ng µl–1, 50 µl) in TE buffer (10 mM Tris pH 8.0, 1 mM EDTA, pH 8.0).

DNA integrity and long-term stability were evaluated with the Agilent 2200 TapeStation system (Agilent Technologies). Concentrations were determined by NanoDrop ND-2000 spectrophotometer (Thermo Fisher Scientific).

Preparation of multi-omics reference materials

To obtain the second batch of multi-omics reference materials (lot no. 20171028), 1 × 1010 cells were collected for each cell line.

Of these, 2 × 109 cells were used to prepare the second batch of DNA reference materials (lot no. 20171028) with the same method described above for the first batch of DNA. The second batch of DNA reference materials was stocked in 1,000 vials (220 ng µl–1, 50 µl) and labeled as Quartet_DNA_D5_20171028, Quartet_DNA_D6_20171028, Quartet_DNA_F7_20171028 and Quartet_DNA_M8_20171028. DNA QC and monitoring of stability were conducted using the same methods described above.

Additionally, 2 × 109 cells pretreated with TRIzol reagent were used to prepare RNA reference materials using the RNeasy Maxi kit (Qiagen) according to the manufacturer’s instructions. The extracted RNA was divided into 1,000 aliquots for the quartet members and labeled as Quartet_RNA_D5_20171028, Quartet_RNA_D6_20171028, Quartet_RNA_F7_20171028 and Quartet_RNA_M8_20171028. A single vial contains approximately 5 μg of RNA in water (520 ng µl–1, 10 µl). RNA integrity and long-term stability were assessed with a 2100 Bioanalyzer using RNA 6000 Nano chips (Agilent Technologies) and a Qsep 100 system (BiOptic). Concentrations were determined by NanoDrop ND-2000 spectrophotometer (Thermo Fisher Scientific).

Cell pellets of 2 × 109 cells were used to prepare protein reference materials. Two batches of peptides were prepared separately at Fudan University (on 6 November 2017) and Novogene (on 16 June 2020), China. In brief, cells were lysed in 8 M urea lysis buffer supplemented with protease inhibitors. The extracted proteins were then digested with trypsin overnight at 37 °C. The resulting peptides were divided into 1,000 aliquots and dried under vacuum for each batch of peptide reference materials. Four chemically synthesized peptides with 13C- and 15N-labeled valine at fixed weight ratios were spiked into the second batch of the reference protein materials (lot no. 20200616) as external controls. The spiked peptides are YILAGVENSK (1:1,000), ADVTPADFSEWSK (1:3,000), DGLDAASYYAPVR (1:9,000) and DSPSAPVNVTVR (1:27,000).

Cell pellets of 1 × 109 cells were used to prepare metabolite reference materials. In brief, cells were extracted using a solution with a 6:1 ratio of methanol to water. Eleven xenobiotics were spiked in at a known amount in each vial as external controls. These included indoleacetic acid (25 pmol), taurocholic acid (1 pmol), glycocholic acid (5 pmol), cholic acid (25 pmol), tauroursodeoxycholic acid (2.5 pmol), taurodeoxycholic acid (7.5 pmol), glycoursodeoxycholic acid (1 pmol), glycodeoxycholic acid (0.5 pmol), ursodeoxycholic acid (25 pmol), deoxycholic acid (50 pmol) and sulfadimethoxine (5 pmol). The cell extracts were divided into 1,000 vials and then dried under vacuum (Labconco) to obtain the cell extracts as metabolomics reference materials. Each vial contains dried cell extract from approximately 106 cells. Stability was monitored by P300 targeted metabolomics using a UPLC–MS/MS system at the Human Metabolomics Institute (Shenzhen, China).

Whole-genome short-read sequencing data

Data generation

To evaluate the intra-lab performance of whole-genome short-read sequencing, three replicates for each of the Quartet DNA samples were sequenced in a fixed order (D5_1, D6_1, F7_1, M8_1, D5_2, D6_2, F7_2, M8_2, D5_3, D6_3, F7_3 and M8_3). A total of 108 libraries from six labs with either a PCR or PCR-free protocol were used in this study. The libraries were sequenced on short-read platforms, including Illumina HiSeq XTen, Illumina NovaSeq, MGI MGISEQ-2000 and MGI DNBSEQ-T7. In paired-end mode, the sequencing depth was at least 30×. More information is detailed in the accompanying paper on DNA66.

Short-read sequencing read mapping and small variant calling

Read sequences were mapped to GRCh38 (https://gdc.cancer.gov/about-data/gdc-data-processing/gdc-reference-files). Sentieon v2018.08.01 (https://www.sentieon.com/) was used to convert raw fastq files to GVCF files. The workflow included read mapping with BWA-MEM, duplicate removal, indel realignment, base quality score recalibration and variant calling with Haplotyper in GVCF mode. Default settings were used for all steps.

Feature encoding for small variants

To perform vertical integration with other quantitative omics data, we used an encoding scheme for the genotypes of SNVs. For each genomic locus, we counted all alleles occurring across the 108 samples from nine batches and then encoded the genotypes. Genotypes consistent with the reference genome were encoded as 0, and all others were encoded as 1. We used chromosome 1 to represent the whole genome for this analysis.
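As a minimal illustration of this binary encoding (the study’s analyses were run in R; the VCF-style `'a/b'` genotype strings below are an assumed representation, not taken from the paper):

```python
import numpy as np

def encode_genotypes(genotype_matrix):
    """Binary-encode a loci x samples genotype matrix.

    Genotypes matching the reference genome ('0/0') are encoded as 0;
    any genotype carrying an alternate allele is encoded as 1.
    """
    encode = np.vectorize(lambda gt: 0 if gt == "0/0" else 1)
    return encode(genotype_matrix)

# Toy loci x samples matrix in VCF-style genotype notation
gts = np.array([["0/0", "0/1", "1/1"],
                ["0/0", "0/0", "0/2"]])
encoded = encode_genotypes(gts)
```

The resulting 0/1 matrix can then be stacked with the quantitative omics matrices for vertical integration.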

Whole-genome long-read sequencing data

Data generation

We evaluated the performance of SV detection using different data analysis pipelines, without considering the technical variation from library preparation. A total of 12 libraries from three long-read sequencing platforms were generated for the Quartet DNA reference materials (one replicate for each sample). The long-read sequencing platforms used were Oxford Nanopore PromethION (~100×), PacBio Sequel (~100×) and PacBio Sequel II (~30×).

Long-read sequencing read mapping and SV calling

Reads were mapped to GRCh38 (GCA_000001405.15) from the UCSC Genome Browser (http://hgdownload.soe.ucsc.edu/goldenPath/hg38/chromosomes/). Three mappers (NGMLR, minimap2 and pbmm2) and five callers (cuteSV, NanoSV, Sniffles, pbsv and SVIM) were used to call SVs.

DNA methylation data

Data generation

To evaluate the intra-lab performance of DNA methylation analysis, three replicates for each of the Quartet sample groups were assayed in a fixed order (D5_1, D6_1, F7_1, M8_1, D5_2, D6_2, F7_2, M8_2, D5_3, D6_3, F7_3 and M8_3). Methylation data from 72 DNA samples generated by two labs using Illumina Infinium MethylationEPIC v1.0 BeadChip (850k) were used in this study.

Preprocessing of methylation data

Raw idat files were processed using the R packages ChAMP (v2.20.1)89 and minfi (v1.36.0)90. The single-sample Noob (ssNoob) method91,92 was used to correct for background fluorescence and dye bias. Next, samples with a proportion of failed probes (probe detection P > 0.01) above 0.1 were discarded. Probes that failed in more than 10% of the remaining samples were removed. Probes with <3 beads in at least 5% of samples were also removed. All non-CpG probes, SNP-related probes, multi-hit probes, and probes located on chromosomes X and Y were filtered out. After preprocessing, the methylation dataset contained 735,296 probes. Finally, the corrected methylated and unmethylated signals were used to calculate M values and β values. In this process, the offset was set to 100 and the β threshold was set to 0.001.
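The conversion from corrected signal intensities to β and M values can be sketched as follows. This is an illustrative Python sketch of the standard minfi-style conventions; the exact way the offset (100) and β threshold (0.001) enter the computation is an assumption about the R pipeline, not a transcription of it:

```python
import numpy as np

def beta_and_m(meth, unmeth, offset=100, beta_threshold=0.001):
    """Compute beta and M values from methylated/unmethylated signals.

    beta = Meth / (Meth + Unmeth + offset); the offset stabilizes
    low-intensity probes. Before the logit transform, beta is clipped
    to [beta_threshold, 1 - beta_threshold] so M values stay finite.
    """
    meth = np.asarray(meth, dtype=float)
    unmeth = np.asarray(unmeth, dtype=float)
    beta = meth / (meth + unmeth + offset)
    clipped = np.clip(beta, beta_threshold, 1 - beta_threshold)
    m = np.log2(clipped / (1 - clipped))
    return beta, m

# One highly methylated and one unmethylated probe
beta, m = beta_and_m([9000, 100], [900, 9000])
```

β values are bounded in [0, 1] and easier to interpret; M values are approximately homoscedastic and better suited to statistical testing.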

Whole-transcriptome sequencing data

Data generation

To evaluate the intra-lab performance of whole-transcriptome sequencing, three replicates for each of the Quartet sample groups were sequenced in a fixed order (D5_1, D6_1, F7_1, M8_1, D5_2, D6_2, F7_2, M8_2, D5_3, D6_3, F7_3 and M8_3). A total of 252 libraries from eight labs with either a poly(A) selection or rRNA removal protocol were used in this study. On average, 100 million read pairs per replicate were sequenced on the Illumina NovaSeq or MGI DNBSEQ-T7 platform. More information is provided in the accompanying paper on RNA67.

Alignment and RNA quantification

HISAT2 v2.1 was used for read alignment to GRCh38 (version GRCh38_snp_tran; https://genome-idx.s3.amazonaws.com/hisat/grch38_snptran.tar.gz)93. SAMtools v1.3.1 was used to sort and convert SAM to BAM format94. StringTie v1.3.4 was used for gene quantification with Ensembl reference annotation (Homo_sapiens.GRCh38.93.gtf)95. Ballgown v2.14.1 and prepDE.py (https://ccb.jhu.edu/software/stringtie/dl/prepDE.py) were used to produce a gene expression matrix in FPKM for downstream analysis.

miRNA-seq data

Data generation

To evaluate the intra-lab performance of miRNA-seq, three replicates for each of the Quartet sample groups were sequenced in a fixed order (D5_1, D6_1, F7_1, M8_1, D5_2, D6_2, F7_2, M8_2, D5_3, D6_3, F7_3 and M8_3). A total of 72 libraries from four labs using five library kits (NEBNext, NEXTFLEX, TruSeq, Vazyme and QIAseq) were used in this study. The Illumina NovaSeq or HiSeq 2500 platform was used to generate the miRNA-seq data.

Alignment and miRNA quantification

An extracellular RNA processing toolkit (exceRpt)96 was used to preprocess miRNA-seq data. Raw reads were aligned to the hg38 genome and the transcriptome in exceRptDB. CPM quantification of miRNA was extracted for downstream analysis.

Mass spectrometry-based proteomics data

Data generation

With the first batch of peptide reference materials, 312 libraries based on the LC–MS system were generated under a DDA mode. Samples were analyzed in a random order for each dataset, which contained three technical replicates for each of the four biological samples (D5, D6, F7 and M8). Mass spectrometers from three platforms were used: (1) the Q Exactive hybrid quadrupole–Orbitrap series (Q Exactive, Q Exactive Plus, Q Exactive HF and Q Exactive HF-X), Orbitrap Fusion Tribrid series (Fusion and Fusion Lumos) and Orbitrap Exploris 480 (all from Thermo Fisher Scientific); (2) Triple-TOF 6600 (from Sciex); and (3) timsTOF Pro (from Bruker Daltonics).

The second batch of peptide reference materials were analyzed in a fixed order (D5_1, D6_1, F7_1, M8_1, D5_2, D6_2, F7_2, M8_2, D5_3, D6_3, F7_3 and M8_3) on Q Exactive, Q Exactive HF, Q Exactive HF-X and Orbitrap Fusion Lumos instruments, generating 36 libraries based on DDA mode and 36 libraries based on DIA mode. All parameters were set according to the requirements of the manufacturers. More information is provided in the accompanying paper on protein68.

Peptide identification and protein quantification

Raw MS files generated using the first batch of peptide reference materials were searched against the NCBI human RefSeq protein database (updated on 7 April 2013, 32,015 entries) using Firmiana 1.0 enabled with Mascot 2.3 (Matrix Science)97. Raw MS files generated with the second batch of peptide reference materials were searched against UniProt (http://www.uniprot.org; release-2021_04), using in-house pipelines from different labs (MaxQuant 1.5.3.17, Spectronaut 14.4, mProphet or Proteome Discoverer 2.2). Fixed modifications included carbamidomethylation (cysteine), and variable modifications included oxidation (methionine) and acetylation (protein N terminus). Proteins with at least one unique peptide with 1% FDR at the peptide level and a Mascot ion score greater than 20 were selected for further analysis. FOT values were used for downstream analysis. FOT was defined as a protein’s iBAQ divided by the total iBAQ for all the proteins identified in one sample. The FOT value was multiplied by 105 for ease of presentation.

Mass spectrometry-based metabolomics data

Data generation

To evaluate the intra-lab performance of MS-based metabolomics, three replicates for each of the Quartet sample groups were profiled in a fixed order (D5_1, D6_1, F7_1, M8_1, D5_2, D6_2, F7_2, M8_2, D5_3, D6_3, F7_3 and M8_3). Dried cell extracts were reconstituted in mobile phase in each lab, and a total of 264 libraries were generated from six labs. Nontargeted metabolomics datasets were generated using AB Sciex Triple TOF6600 and Thermo Scientific Q Exactive mass spectrometer systems in three different labs. Targeted metabolomics datasets were generated using Waters Xevo TQ-S, AB Sciex QTRAP 5500 and AB Sciex QTRAP 6500+ mass spectrometers in four labs. More information is provided in the accompanying paper on metabolites69.

Compound identification and metabolite quantification

Raw data were extracted and underwent peak identification and QC processing using the in-house methods in each lab. Compound identification was conducted using an in-house library on the basis of the retention time/index (RI), mass-to-charge (m/z) ratio and MS spectral data for each metabolite. Metabolite quantification was conducted using area under the curve or the concentration calculated by calibration curve using standards for each metabolite. All expression tables for metabolomics data were log2 transformed.

Microarray and RNA-seq datasets of reference RNA samples A–D from the MAQC and SEQC consortia

The microarray datasets55 and RNA-seq datasets54 of four well-characterized reference RNA samples, samples A (universal human reference RNA), B (human brain reference RNA), C (a mixture of samples A and B at a 3:1 ratio) and D (a mixture of samples A and B at a 1:3 ratio), were used to confirm the generalizability of the ratio-based profiling approach. These two datasets spanning 8 years were generated on different platforms across different labs.

Microarray data in the MAQC-I study were generated on three platforms: Affymetrix, Agilent Technologies (two-color) and Illumina. Each platform provider selected three test sites, and five replicates of each sample were processed at each site. For the RNA-seq data from the SEQC1 study, we selected sequencing data from three sites for the Illumina and Life Technologies platforms, with four replicates of each sample randomly selected from each site. A total of 179 replicates of microarray data and 96 replicates of RNA-seq data were included in this study.

Data preprocessing

In the analysis of this study, the methylation microarray data were converted to M values, miRNA data were normalized to log2(CPM), RNA data were normalized to log2(FPKM), proteomics data were normalized to log2(FOT) and metabolomics data were log2 transformed according to the quantitative intensity.

We used different strategies to handle missing values for different omics types when the analysis involved multiple batches of data. For methylation data, features with missing values were removed. For other omics data types, a feature was retained when it was detected in more than 90% of the samples. For miRNA-seq and RNA-seq data, a flooring value of 0.01 was added to each gene’s FPKM or CPM value before log2 transformation. For proteomics and metabolomics data, we used the estim_ncpPCA and imputePCA functions of the missMDA v1.18 package98 to fill in missing values.
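The detection-rate filter and flooring step can be sketched as below. This is a minimal Python illustration (the study’s pipelines were in R, and PCA-based imputation used the missMDA package, which is not reproduced here):

```python
import numpy as np
import pandas as pd

def preprocess_expression(df, detect_rate=0.9, floor=0.01):
    """Filter and log-transform an FPKM/CPM matrix (features x samples).

    Keeps features detected (non-missing) in more than `detect_rate`
    of samples, then adds a flooring value before log2 transformation
    so that zero values do not produce -inf.
    """
    detected = df.notna().mean(axis=1) > detect_rate
    kept = df.loc[detected]
    return np.log2(kept + floor)

fpkm = pd.DataFrame({"s1": [10.0, 0.0, np.nan],
                     "s2": [20.0, 0.0, np.nan],
                     "s3": [15.0, 5.0, 1.0]})
logged = preprocess_expression(fpkm)  # third feature dropped (detected in 1/3 samples)
```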

PCA

PCA was performed using the prcomp function in R statistical packages (v4.0.5). PCA was conducted on single or multiple batches of data from a single omics type. In assessing the quality of single-batch data, we used data from 12 samples (4 donors with 3 replicates per donor) after removing features containing null values and performing normalization. The results can be found in Fig. 2d and Supplementary Figs. 1 and 2.

For multi-batch data quality assessment, we first filtered out features that were null in more than 10% of the samples and then performed imputation as described in the ‘Data preprocessing’ section. PCA was calculated on the basis of the horizontally integrated data, and the corresponding results are presented in Fig. 3d–f and Extended Data Fig. 5.

PVCA

PVCA was used to measure the contribution of impact factors to the Quartet multi-omics profiles. PVCA uses two statistical methods, that is, PCA and variance component analysis. In addition to biological sample type (donor), technical factors taken into account included lab, platform and protocol. We performed PCA and then estimated the effects of each known factor using the lme4 v1.1-29 package99. The residual accounts for variability from other sources or factors not included.

Differential expression analysis

For each omics type, we identified DEFs in three types of comparisons, that is, D5 versus F7, D5 versus M8 and F7 versus M8, where D6 was used as the common reference (the denominator in the ratio calculation). Although the four samples came from a family of parents and monozygotic twins, the replicate measurements of the quantitative multi-omics characteristics were independent of one another. Regarding the assumption of identical distributions, the small sample size (n = 3) made it impossible to determine precisely whether the two sets of data under comparison came from the same distribution. In this case, direct use of a nonparametric test (for example, the Wilcoxon test) is not an appropriate choice100. Therefore, we used a method that has been validated in a series of studies by the MAQC consortium to detect DEFs47,54,55.

According to recommendations from the MAQC/SEQC consortia55,101, a nonstringent t-test P-value cutoff combined with fold-change ranking can be used to identify differentially expressed genes. In this study, we assumed unequal variance and used Welch’s modification for the degrees of freedom. A feature was identified as differentially expressed when it satisfied the criteria of P < 0.05 and log2(fold change) of ≥0.5 or ≤–0.5 for miRNA, RNA, protein and metabolite profiling or P < 0.05 and log2(fold change) of ≥2 or ≤–2 for methylation M values. The DEFs were further classified as up- or downregulated on the basis of the sign of the log2(fold change).
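These criteria can be sketched as follows. This is an illustrative Python version using SciPy’s Welch t-test (the original analyses were run in R); the cutoffs shown are those for miRNA, RNA, protein and metabolite profiling:

```python
import numpy as np
from scipy import stats

def call_defs(group_a, group_b, p_cutoff=0.05, lfc_cutoff=0.5):
    """Flag differentially expressed features between two sample groups.

    Inputs are log2-scale matrices (features x replicates). A Welch
    t-test (unequal variance) is combined with a log2 fold-change
    cutoff; direction is taken from the sign of the mean difference.
    """
    a = np.asarray(group_a, dtype=float)
    b = np.asarray(group_b, dtype=float)
    t, p = stats.ttest_ind(a, b, axis=1, equal_var=False)  # Welch's t-test
    lfc = a.mean(axis=1) - b.mean(axis=1)
    is_def = (p < p_cutoff) & (np.abs(lfc) >= lfc_cutoff)
    direction = np.where(lfc > 0, "up", "down")
    return is_def, lfc, direction

# Feature 0 clearly differs between groups; feature 1 does not
a = [[5.0, 5.1, 4.9], [1.0, 1.1, 0.9]]
b = [[1.0, 1.1, 0.9], [1.0, 0.9, 1.1]]
is_def, lfc, direction = call_defs(a, b)
```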

Workflow for construction of reference datasets

In the analysis of DEFs and cross-omics relationships, intra-batch QC was performed to minimize the influence of technical noise in the voting process. For each sample group, features that were not detected in more than one technical replicate or that had large variability (CV of >0.15 for methylation and >0.3 for other omics types) were excluded. Subsequently, we constructed reference datasets of DEFs and cross-omics feature relationships with the consensus voting approach described below.

Reference datasets for DEFs

Cross-batch QC was performed following the previous intra-batch QC. Features retained in more than a certain percentage of batches (70% for methylation, miRNA and RNA; 30% for protein and metabolites) were kept for the subsequent differential expression analysis.

For each omics type, we analyzed the DEFs between D5 and F7 (D5–F7), D5 and M8 (D5–M8) and F7 and M8 (F7–M8) within each batch using the method described in the ‘Differential expression analysis’ section.

After identifying DEFs from each batch, we kept the DEFs present in more than 70% of batches with consistent regulatory directionality (up or down). Finally, we calculated the mean log2(fold change) values for all the retained intra-batch DEFs as reference values.
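The consensus-voting step above can be sketched as below; this is an illustrative Python snippet (names such as `vote_reference_defs` are hypothetical, not from the study’s code):

```python
def vote_reference_defs(batch_defs, min_rate=0.7):
    """Consensus-vote per-batch DEF calls into a reference set.

    `batch_defs` maps batch name -> {feature: log2 fold change} for the
    DEFs called in that batch. A feature becomes a reference DEF if it
    is called in more than `min_rate` of batches with a consistent
    direction; its reference value is the mean log2 fold change.
    """
    n_batches = len(batch_defs)
    tallies = {}
    for calls in batch_defs.values():
        for feature, lfc in calls.items():
            tallies.setdefault(feature, []).append(lfc)
    reference = {}
    for feature, lfcs in tallies.items():
        same_direction = all(v > 0 for v in lfcs) or all(v < 0 for v in lfcs)
        if len(lfcs) / n_batches > min_rate and same_direction:
            reference[feature] = sum(lfcs) / len(lfcs)
    return reference

batches = {
    "b1": {"geneA": 1.0, "geneB": -0.8, "geneC": 0.6},
    "b2": {"geneA": 1.2, "geneB": 0.7},          # geneB flips direction
    "b3": {"geneA": 0.8, "geneB": -0.9},
}
ref = vote_reference_defs(batches)  # only geneA survives the vote
```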

Reference datasets for cross-omics feature relationships

The reference datasets contained cross-omics feature relationships between methylation and miRNA, methylation and RNA, RNA and miRNA, RNA and protein, and protein and metabolites. We first performed feature selection to better identify biologically meaningful correlations by annotating cross-omics features to the same genes.

For methylation probes, we converted the features from the level of probes to genes by taking the mean value in the promoter region (TSS200 or TSS1500) to characterize the methylation level. For the other omics types, we did not perform transformation of feature values but simply searched for associated genes. RNA profiles were associated with gene names through Ensembl ID. Target genes associated with specific miRNAs in all of miRDB102 (prediction score of ≥80), miRTarBase103 (support type of ‘Functional MTI’) and TargetScan104 were considered plausible. The proteomics profiles were characterized at the level of gene names. Metabolites were associated with genes in the same pathway on the basis of the HMDB database105.

Subsequently, we exhaustively enumerated all batch combinations of the above five cross-omics types and conducted cross-batch QC. Associated feature pairs retained in more than a certain number of batch combinations were used for subsequent correlation analysis. This threshold was determined by the product of the respective number of batches and retention rate (70% for methylation, miRNA and RNA; 30% for protein and metabolites) of the two omics types being compared.

Next, we calculated the Pearson correlation coefficient for each feature pair in each batch combination of the five cross-omics types. According to the results of Pearson correlation analysis, the cross-omics relationships were classified as positive (r ≥ 0.5, P < 0.05), negative (r ≤ –0.5, P < 0.05) or none (P ≥ 0.05).

Finally, we preserved the cross-omics relationships with a category that accounted for more than 70% as high-confidence relationships. The reference Pearson correlation coefficients were the mean value of the retained data.
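The per-pair correlation classification described above can be sketched as follows (an illustrative Python snippet; the study’s analyses were run in R):

```python
from scipy import stats

def classify_relationship(x, y, r_cutoff=0.5, p_cutoff=0.05):
    """Classify a cross-omics feature pair by Pearson correlation.

    Returns 'positive' (r >= 0.5, P < 0.05), 'negative' (r <= -0.5,
    P < 0.05) or 'none' otherwise.
    """
    r, p = stats.pearsonr(x, y)
    if p < p_cutoff:
        if r >= r_cutoff:
            return "positive"
        if r <= -r_cutoff:
            return "negative"
    return "none"

# A near-linear pair is classified as a positive relationship
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
label = classify_relationship(x, [2.1, 3.9, 6.2, 8.0, 9.9, 12.1])
```

Pairs classified the same way in more than 70% of batch combinations are then kept as high-confidence relationships.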

Performance metrics

ARI

ARI is a widely used QC metric for comparing clustering results against external criteria81. It measures the similarity between the true labels and the clustering labels, ignoring label permutations and normalizing for chance, which means that random assignments have an ARI score close to zero. ARI ranges from –1 to 1, with 1 corresponding to perfect clustering. ARI is calculated on the basis of RI as follows.

$${\textrm{ARI}}=\frac{{{\textrm{RI}}}-{{\textrm{expected}}}\left({{\textrm{RI}}}\right)}{\max \left({{\textrm{RI}}}\right)-{{\textrm{expected}}}\left({{\textrm{RI}}}\right)}$$
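As an illustration, ARI = (RI − expected(RI)) / (max(RI) − expected(RI)) can be computed directly from the pair-counting contingency table; the following is a minimal self-contained Python sketch (equivalent to, for example, scikit-learn’s `adjusted_rand_score`):

```python
from collections import Counter
from math import comb

def adjusted_rand_index(labels_true, labels_pred):
    """ARI from the pair-counting contingency table:
    (RI - expected(RI)) / (max(RI) - expected(RI))."""
    pairs = Counter(zip(labels_true, labels_pred))
    rows = Counter(labels_true)
    cols = Counter(labels_pred)
    n = len(labels_true)
    index = sum(comb(c, 2) for c in pairs.values())        # RI term
    sum_rows = sum(comb(c, 2) for c in rows.values())
    sum_cols = sum(comb(c, 2) for c in cols.values())
    expected = sum_rows * sum_cols / comb(n, 2)             # expected(RI)
    max_index = (sum_rows + sum_cols) / 2                   # max(RI)
    return (index - expected) / (max_index - expected)

# Four donors x three replicates; cluster numbering may be permuted
truth = ["D5"] * 3 + ["D6"] * 3 + ["F7"] * 3 + ["M8"] * 3
ari_perfect = adjusted_rand_index(truth, [1, 1, 1, 0, 0, 0, 3, 3, 3, 2, 2, 2])
ari_shuffled = adjusted_rand_index(truth, [0, 1, 2, 3, 0, 1, 2, 3, 0, 1, 2, 3])
```

A perfect (possibly relabeled) clustering scores 1, while a clustering unrelated to donor identity scores near or below zero.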

RMSE

RMSE, the standard deviation of the residuals (prediction errors), is a widely used statistic in bioinformatics and machine learning. In this study, we used RMSE to measure the consistency of DEFs detected from a dataset for a given pair of samples with the reference DEFs, or ‘RMSE of DEFs’. Reference DEFs were integrated by consensus voting of the intra-batch results, and the reference difference was defined as the mean log2(fold change) across high-confidence batches. RMSE is computed using the following equation:

$${{\textrm{RMSE}}}=\sqrt{\frac{\mathop{\sum }\limits_{i=1}^{N}{\left({x}_{i}-{\hat{x}}_{i}\right)}^{2}}{N}}$$

where N is the total number of features considered for evaluation, \({x}_{i}-{\hat{x}}_{i}\) is the error, \({\hat{x}}_{i}\) is the log2(fold change) after horizontal integration and xi is the log2(fold change) of the corresponding feature in the reference dataset.
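This equation translates directly to code; a minimal Python sketch over aligned vectors of reference and observed log2 fold changes:

```python
import numpy as np

def rmse_of_defs(reference_lfc, observed_lfc):
    """RMSE between reference and observed log2 fold changes.

    Both inputs are vectors aligned over the same reference DEFs:
    sqrt(mean((x_i - x_hat_i)^2)).
    """
    x = np.asarray(reference_lfc, dtype=float)
    x_hat = np.asarray(observed_lfc, dtype=float)
    return float(np.sqrt(np.mean((x - x_hat) ** 2)))

err = rmse_of_defs([1.0, -2.0, 0.5], [1.2, -1.8, 0.5])
```

Lower values indicate closer agreement of a dataset’s fold changes with the reference dataset.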

SNR

SNR is a parameter based on the Quartet study design for discriminating different types of reference samples. On the basis of PCA, SNR is defined as the ratio of the average distance among different samples (for example, D5_1 versus D6_1) to the average distance among technical replicates (for example, D5_1 versus D5_2). SNR is calculated as follows:

$${\rm{SNR}}=10\times {\log }_{10}\left(\frac{m\times \displaystyle\binom{n}{2}}{\displaystyle\binom{m}{2}\times n\times n}\times \frac{\mathop{\sum }\limits_{x=1}^{m}\mathop{\sum }\limits_{y=x+1}^{m}\mathop{\sum }\limits_{i=1}^{n}\mathop{\sum }\limits_{j=1}^{n}\mathop{\sum }\limits_{p=1}^{2}{W}_{p}{({\rm{PC}}_{p,i,x}-{\rm{PC}}_{p,\,j,y})}^{2}}{\mathop{\sum }\limits_{x=1}^{m}\mathop{\sum }\limits_{i=1}^{n}\mathop{\sum }\limits_{j=i+1}^{n}\mathop{\sum }\limits_{p=1}^{2}{W}_{p}{({\rm{PC}}_{p,i,x}-{\rm{PC}}_{p,\,j,x})}^{2}}\right)$$

where m is the number of sample groups and n is the number of replicates in each sample group. Wp represents the proportion of variance explained by the pth PC. PCp,i,x, PCp,j,x and PCp,j,y represent the pth component values of replicate i and replicate j in sample group x or sample group y.
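The definition can be sketched compactly by looping over sample pairs instead of the nested index sums; this illustrative Python snippet (function and variable names are hypothetical) assumes a balanced design with n replicates per group, as in the formula:

```python
import numpy as np

def snr_from_pcs(pcs, groups, weights):
    """SNR on the first two PCs, following the definition above.

    pcs:     (n_samples, 2) array of PC1/PC2 coordinates
    groups:  group label per sample (m groups, n replicates each)
    weights: explained-variance weights (W1, W2)
    """
    pcs = np.asarray(pcs, dtype=float)
    labels = np.asarray(groups)
    w = np.asarray(weights, dtype=float)
    between = within = 0.0
    n_between = n_within = 0
    for i in range(len(labels)):
        for j in range(i + 1, len(labels)):
            d = np.sum(w * (pcs[i] - pcs[j]) ** 2)  # weighted squared distance
            if labels[i] == labels[j]:
                within += d
                n_within += 1
            else:
                between += d
                n_between += 1
    # Ratio of average between-group to average within-group distance, in dB
    return 10 * np.log10((between / n_between) / (within / n_within))

# Two well-separated groups of two replicates give a high SNR
pcs = [[0.0, 0.0], [0.0, 1.0], [10.0, 0.0], [10.0, 1.0]]
snr = snr_from_pcs(pcs, ["A", "A", "B", "B"], weights=(0.9, 0.1))
```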

The SNR metric was also used for the assessment of clustering accuracy for vertical integration. In this paper, SNR was calculated on the basis of the sample-to-sample similarity matrix output by integration tools, as follows:

$${\rm{SNR}}=10\times {\log }_{10}\left(\frac{\displaystyle\binom{m}{2}\times n\times n}{m\times \displaystyle\binom{n}{2}}\times \frac{\mathop{\sum }\limits_{x=1}^{m}\mathop{\sum }\limits_{i=1}^{n}\mathop{\sum }\limits_{j=i+1}^{n}{({S}_{i,x}-{S}_{j,x})}^{2}}{\mathop{\sum }\limits_{x=1}^{m}\mathop{\sum }\limits_{y=x+1}^{m}\mathop{\sum }\limits_{i=1}^{n}\mathop{\sum }\limits_{j=1}^{n}{({S}_{i,x}-{S}_{j,y})}^{2}}\right)$$

where m is the number of sample groups and n is the number of replicates in each sample group. Si,x, Sj,x and Sj,y represent the similarity of replicate i and replicate j in sample group x or sample group y.

Balance of sample classes between batches

To evaluate the effect of the level of balance of the sample classes across batches on the horizontal and vertical integration tasks, we used the Jaccard index to represent the level of balance. The Jaccard index is a common statistic for gauging the similarity and diversity of two sets. For data analysis involving multiple batches, we recorded the batches in which the samples of D5, D6, F7 and M8 were located. We then calculated the Jaccard index between the batch sets of D5–D6, D5–F7, D5–M8, D6–F7, D6–M8 and F7–M8. Finally, the mean of these six Jaccard indexes represented the sample class–batch balance.
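This balance score can be sketched as follows (an illustrative Python snippet; `batch_balance` is a hypothetical name):

```python
from itertools import combinations

def batch_balance(sample_batches):
    """Mean pairwise Jaccard index of per-sample batch sets.

    `sample_batches` maps each sample (D5, D6, F7, M8) to the set of
    batches in which that sample appears. Jaccard(A, B) = |A & B| / |A | B|.
    """
    scores = []
    for a, b in combinations(sample_batches, 2):
        sa, sb = set(sample_batches[a]), set(sample_batches[b])
        union = sa | sb
        scores.append(len(sa & sb) / len(union) if union else 0.0)
    return sum(scores) / len(scores)

# Fully balanced: every sample appears in the same batches -> score 1
balanced = batch_balance({
    "D5": {"b1", "b2"}, "D6": {"b1", "b2"},
    "F7": {"b1", "b2"}, "M8": {"b1", "b2"},
})
# Fully confounded: each sample confined to its own batch -> score 0
confounded = batch_balance({
    "D5": {"b1"}, "D6": {"b2"}, "F7": {"b3"}, "M8": {"b4"},
})
```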

Datasets used in horizontal and vertical integration

Randomly sampled quantitative multi-omics datasets

For the multi-batch dataset for each quantitative omics type, we randomly selected an equal subset of all technical replicates from D5, D6, F7 and M8. This number of replicates was greater than three and less than the maximum number for the given omics type. The constituted datasets were used to perform calculation of SNR (Fig. 3f) and for the identification of DEFs (Extended Data Fig. 3a,b).

Quantitative multi-omics datasets of different quality (good versus bad)

For each type of quantitative omics profiling (methylation, miRNA, RNA, protein and metabolites), we defined the quality of a batch on the basis of the SNR values. Higher SNR values indicate better data quality, whereas lower SNR values indicate lower data quality. We ranked all batches for each omics type on the basis of their SNR values and classified the top three as good-quality batches (labeled G1, G2 and G3) and the bottom three as bad-quality batches (labeled B1, B2 and B3). Next, we exhausted all combinations of batches from each category, including three combinations of any two batches and one combination of three batches. These datasets were used for cross-omics feature relationship identification (Fig. 4e) and multi-omics data integration for sample classification (Fig. 5a).

Quantitative multi-omics datasets under different scenarios (balanced versus confounded)

First, we randomly selected six batches for each type of quantitative omics profiling (methylation, miRNA, RNA, protein and metabolites), resulting in 18 replicates per sample (D5, D6, F7 and M8). This was done because there were only six batches of data for methylation and miRNA. Second, we sorted the column names of the above data by batch, sample and replicate. Next, we constructed balanced and confounded datasets using different sampling methods. To generate a balanced dataset, we randomly selected four unique numbers from 1 to 18 and extracted the corresponding ranked data for samples D5, D6, F7 and M8. To generate a confounded dataset, we independently selected four unique numbers from 1 to 18 for each of the four samples and extracted the corresponding ranked data. A total of 300 sampled datasets were generated for the data integration analysis. These datasets were applied for cross-omics feature relationship identification (Fig. 4f) and multi-omics data integration for sample classification (Fig. 5b).

Genomics and quantitative multi-omics datasets with moderate quality

In the analysis in Fig. 6, the datasets of small variants were added to the integration task. Before vertical integration, we filtered out batches of very good (top 20%) or bad (bottom 20%) quality within the quantitative omics datasets on the basis of SNR values to reduce the impact of datasets with extreme quality values. A total of 60 batches of data were retained, including 9 batches of DNA (SNV/indel) data, 4 batches of DNA methylation profiles, 4 batches of miRNA profiles, 13 batches of RNA profiles, 18 batches of proteomics data and 12 batches of metabolomics data. In each vertical integration, we drew a random batch from the above dataset for each omics type, which means that the vertically integrated results were not affected by problems that exist in horizontal integration, for example, batch effects.

Horizontal (within-omics) integration methods

A total of six methods were used in this study to horizontally integrate multiple batches of data, including Absolute, Ratio, ComBat, Harmony, RUVg and z score.

Absolute

The Absolute method refers to direct integration of data after preprocessing such as normalization, log2 transformation, missing value filtering and imputation.

Ratio

Ratio-based scaling refers to converting the quantitative profiles to relative-scale profiles within each batch on a feature-by-feature basis. To obtain ratio-based data, the mean log2-transformed profile of the three replicates of the reference sample (that is, D6) within the same batch was subtracted from the log2-transformed profile of each feature in the study samples (that is, D5, F7 and M8).
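Because the profiles are on a log2 scale, this subtraction yields a log2 ratio relative to D6; a minimal Python sketch (column names are illustrative):

```python
import numpy as np
import pandas as pd

def ratio_scale(batch, reference_cols):
    """Convert one batch's log2 profiles to ratio scale, feature by feature.

    batch:          log2-transformed features x samples DataFrame
    reference_cols: columns holding the reference-sample (D6) replicates
    The per-feature mean over the D6 replicates is subtracted from every
    sample in the same batch, giving log2 ratios relative to D6.
    """
    reference_mean = batch[reference_cols].mean(axis=1)
    return batch.sub(reference_mean, axis=0)

log2_fpkm = pd.DataFrame(
    {"D5_1": [5.0, 2.0], "D6_1": [4.0, 3.0],
     "D6_2": [4.2, 2.8], "D6_3": [3.8, 3.2]},
    index=["geneA", "geneB"])
ratios = ratio_scale(log2_fpkm, ["D6_1", "D6_2", "D6_3"])
```

Applying this within each batch puts all batches on a common relative scale, which is the basis of the ratio-based integration evaluated in this study.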

ComBat

ComBat applies empirical Bayes shrinkage to adjust the mean and variance by pooling information across multiple genes to correct batch effects. It was implemented by using the ComBat function of the sva v3.38.0 package106 to adjust the known batches.

Harmony

Harmony uses an iterative clustering–correction procedure based on soft clustering to correct for sample differences. The algorithm was implemented by using the HarmonyMatrix function of the harmony v0.1.0 package75 with the parameter do_pca set to FALSE and other parameters as default.

RUVg

RUVg estimates factors of unwanted variation from a subset of the data. We used the least significant DEFs (25% of the total features) as ‘in silico empirical’ negative controls, that is, features considered not to be differentially expressed with respect to the covariates of interest. We then applied the RUVg function of the RUVSeq v1.24.0 package76 with default parameter settings.
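Selecting the empirical negative controls reduces to taking the fraction of features with the largest differential-expression P values. A Python sketch of that selection step (the function name `empirical_negative_controls` is our own; the actual analysis used RUVSeq in R):

```python
import numpy as np

def empirical_negative_controls(feature_ids, p_values, fraction=0.25):
    """Pick the least significant features (largest P values) as
    'in silico empirical' negative controls for RUVg."""
    order = np.argsort(p_values)[::-1]          # largest P values first
    n_ctrl = int(len(feature_ids) * fraction)
    return [feature_ids[i] for i in order[:n_ctrl]]

feats = ["g1", "g2", "g3", "g4"]
pvals = np.array([0.001, 0.9, 0.04, 0.7])
ctrls = empirical_negative_controls(feats, pvals)  # 25% of 4 -> one control
```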

z score

The z-score method scales each feature to zero mean and unit variance within each batch before merging the batches, thereby removing batch-specific location and scale differences.
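A minimal Python sketch of per-batch, per-feature z-scoring (the function name `zscore_per_batch` is our own; sample-size-1 batches and zero-variance features are not handled here):

```python
import numpy as np

def zscore_per_batch(matrix, batch_labels):
    """Scale each feature to zero mean and unit variance within each batch.

    matrix: (features x samples); batch_labels: one label per sample column.
    """
    out = matrix.astype(float).copy()
    for b in set(batch_labels):
        cols = [i for i, lab in enumerate(batch_labels) if lab == b]
        sub = out[:, cols]
        mu = sub.mean(axis=1, keepdims=True)
        sd = sub.std(axis=1, ddof=1, keepdims=True)
        out[:, cols] = (sub - mu) / sd
    return out

# One feature measured in two batches with very different scales
X = np.array([[1.0, 2.0, 3.0, 30.0, 40.0, 50.0]])
Z = zscore_per_batch(X, ["a", "a", "a", "b", "b", "b"])
```

After scaling, the two batches share the same location and scale, so the downstream merge no longer carries the batch-specific offsets.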

Vertical (cross-omics) integration methods

SNF5, iClusterBayes77, MOFA+78, MCIA79 and intNMF80 were used to integrate the multi-omics data. To reduce the impact of large differences in dimensionality across the multi-omics datasets on the final results, for each omics type we selected the top 1,000 most variable features on the basis of standard deviation. The data matrices were then centered and scaled, feature by feature, to a mean of 0 and a standard deviation of 1. In addition, because intNMF and MCIA are methods based on the principle of non-negative matrix factorization, features containing negative values were shifted by the absolute value of their minimum to ensure non-negativity.
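The three preprocessing steps above (variance-based feature selection, standardization and the optional non-negative shift) can be sketched in Python as follows; the function name `prepare_for_integration` is our own, and zero-variance features are assumed to fall outside the selected top set.

```python
import numpy as np

def prepare_for_integration(matrix, n_top=1000, make_nonnegative=False):
    """Select the most variable features, standardize them feature by
    feature and, if requested, shift negative features to be non-negative
    (as needed by NMF-style methods such as intNMF)."""
    sds = matrix.std(axis=1, ddof=1)
    top = np.argsort(sds)[::-1][:n_top]               # most variable features
    sub = matrix[top, :].astype(float)
    sub = (sub - sub.mean(axis=1, keepdims=True)) / sub.std(axis=1, ddof=1, keepdims=True)
    if make_nonnegative:
        mins = sub.min(axis=1, keepdims=True)
        sub = np.where(mins < 0, sub - mins, sub)     # shift by |feature minimum|
    return sub

X = np.array([[1.0, 1.0, 1.0, 1.0],     # constant feature, dropped by selection
              [1.0, 2.0, 3.0, 4.0],
              [0.0, 10.0, 20.0, 30.0]])
scaled = prepare_for_integration(X, n_top=2)
nonneg = prepare_for_integration(X, n_top=2, make_nonnegative=True)
```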

The sample labels obtained were used to calculate the ARI, and the sample similarity matrix was used to calculate the SNR. Because MOFA+ and MCIA do not provide clustering assignments, we applied PAM clustering to their outputs to obtain predicted sample labels. For methods other than SNF, we used the affinityMatrix function in the SNFtool v2.3.1 package to process the results from iClusterBayes, MOFA+, MCIA and intNMF and derive the sample similarity matrices. The following vertical integration algorithms were used.

SNF

Similarity network fusion (SNF) constructs a sample-similarity network for each available data type and then fuses these networks into one that represents the full spectrum of the underlying data. The SNFtool v2.3.1 package was used with the parameter K (number of neighbors) set to the rounded square root of the sample size, alpha (a hyperparameter) set to 0.5 and T (the number of iterations of the diffusion process) set to 10. The SNF-combined similarity matrix was used directly to calculate the SNR.
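To make the per-view-affinity idea concrete, here is a deliberately simplified Python sketch: it builds a Gaussian-kernel affinity matrix per omics view and then merely averages them. The full SNF algorithm instead iteratively diffuses each view's network through the others before combining; the function names and the kernel scaling below are our own simplifications, not the SNFtool implementation.

```python
import numpy as np

def affinity(X, sigma=0.5):
    """Gaussian-kernel affinity between samples (rows of X),
    with the squared distances normalized by their mean."""
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-d2 / (2 * sigma ** 2 * (d2.mean() + 1e-12)))

def naive_fusion(views):
    """Average the per-view affinity matrices. (Real SNF iteratively
    diffuses each view's network through the others instead.)"""
    mats = [affinity(v) for v in views]
    return sum(mats) / len(mats)

# Two toy views of the same three samples, with different feature spaces
view1 = np.array([[0.0, 0.0], [1.0, 1.0], [5.0, 5.0]])
view2 = np.array([[0.0], [2.0], [6.0]])
W = naive_fusion([view1, view2])  # symmetric 3 x 3 fused similarity matrix
```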

iClusterBayes

iClusterBayes fits a Bayesian latent variable model to multi-omics data to identify a comprehensive cluster assignment, as well as the latent features that contribute to clustering. This method was run with the iClusterBayes function of the iClusterPlus v1.26.0 package, with the parameter K (number of eigen features) set to the number of sample groups minus one and other parameters as default. The latent variables were used to calculate the sample similarity matrix.

MOFA+

Multi-omics factor analysis v2 (MOFA+) reconstructs a low-dimensional representation of data using variational inference in terms of latent factors that capture the global sources of variability. The MOFA2 v1.1.21 package was used with the default parameters. The latent factors from the model were used to obtain the sample labels and similarity matrix.

MCIA

Multiple co-inertia analysis (MCIA) is an extension of co-inertia analysis (CIA) to multi-omics datasets. It was implemented using the mcia function of the omicade4 v1.30.0 package. We obtained the sample labels and similarity matrix on the basis of the synthetic scores.

intNMF

intNMF is an extension of non-negative matrix factorization (NMF), which decomposes each omics dataset into the product of a factor matrix and an omics-specific weight matrix. This method was implemented using the intNMF v1.2.0 package with default parameters. The sample similarity matrix was obtained from the common basis matrix (W) across the multiple datasets.

Identification and annotation of the multi-omics features associated with DNMs and DEFs

On the basis of the genomic coordinates of 2,187 DNMs66, we performed gene annotation using gencode.v40.annotation.gtf. The resulting gene list was then used for feature selection in the other omics types. For methylation, we characterized the methylation level of these genes as the average M value in the promoter region (TSS200 or TSS1500). miRNAs targeting these genes were obtained from the miRDB102 (prediction score of ≥80), miRTarBase103 (support type ‘Functional MTI’) and TargetScan104 databases, and only miRNAs present in all three databases were retained as miRNAs associated with these DNMs. RNA profiles were matched to gene names through their Ensembl IDs. Proteins were screened directly because they had already been annotated with gene symbols. Finally, for metabolomics data, which of all the omics types are the furthest removed from the genome along the central dogma, we obtained metabolites associated with the genes mentioned above from the biological pathway information provided by the HMDB database105.
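The three-database miRNA filter is a plain set intersection. A Python sketch (the function name `consensus_mirnas` and the miRNA identifiers are illustrative, not taken from the study's results):

```python
def consensus_mirnas(mirdb, mirtarbase, targetscan):
    """Retain only miRNAs supported by all three target databases."""
    return set(mirdb) & set(mirtarbase) & set(targetscan)

shared = consensus_mirnas(
    {"miR-21", "miR-155", "miR-7"},   # miRDB, prediction score >= 80
    {"miR-21", "miR-155"},            # miRTarBase, 'Functional MTI' support
    {"miR-155", "miR-7"},             # TargetScan
)
```

Requiring support from all three databases trades sensitivity for precision, which suits the goal of building a reliable DNM-associated feature set.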

DEFs for D5 and D6 were identified in the same way as in the construction of reference datasets for DEFs, as detailed in the ‘Workflow for construction of reference datasets’ section.

For the features in each omics type that differed between D5 and D6 according to the two approaches described above, we took the intersection to obtain reliable multi-omics features affected by genomic-level differences between the identical twins (Fig. 6d). These features were further annotated with Ingenuity Pathway Analysis software to obtain candidate biological pathways associated with them. Ultimately, pathways associated with more than one feature and with Fisher’s exact test P < 0.05 were retained (Fig. 6e).

Similarity between D5 and D6

The SNF method was used to integrate data from different multi-omics combinations to explore the genomic inheritance patterns of the Quartet identical twins during data integration. During each integration, we randomly selected one batch of moderate quality (SNR in the 20th–80th percentile range) from each omics type and then calculated the inter-sample similarity matrix W using the SNFtool v2.3.1 package. For single-omics settings (that is, DNA or metabolites alone), we treated two random batches of moderate-quality data from the same omics type as different sources and likewise used SNF to obtain the W matrix. Because D5 and D6 each had three technical replicates, the W matrix contained nine D5–D6 similarity entries; we used their mean as the similarity between D5 and D6 for that integration. A total of 50 iterations were performed to ensure robustness.
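Extracting the D5–D6 similarity from the fused matrix W is a mean over the 3 × 3 replicate block. A Python sketch (the function name `twin_similarity` and the toy matrix are our own; the study computed W with SNFtool in R):

```python
import numpy as np

def twin_similarity(W, d5_idx, d6_idx):
    """Mean of the block of the fused similarity matrix W that pairs the
    D5 replicates with the D6 replicates (3 x 3 = nine entries)."""
    return W[np.ix_(d5_idx, d6_idx)].mean()

# Toy 6 x 6 W: columns/rows 0-2 are D5 replicates, 3-5 are D6 replicates
W = np.arange(36, dtype=float).reshape(6, 6)
sim = twin_similarity(W, [0, 1, 2], [3, 4, 5])
```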

Statistical analysis

All statistical analyses were performed using R statistical packages (version 4.0.5) (https://www.r-project.org). Pearson’s correlation coefficients were calculated using the Hmisc v4.6.0 package (https://CRAN.R-project.org/package=Hmisc). Differential expression analyses were implemented using the ChAMP v2.20.1 package for EPIC methylation data89 and using the rstatix v0.7.0 package for other omics data (https://github.com/kassambara/rstatix). PCA was conducted with unit-variance scaling using the prcomp function. PAM clustering was implemented using the cluster v2.1.3 package (https://CRAN.R-project.org/package=cluster). Data visualization was implemented using the R packages ggplot2 v3.3.6 (https://ggplot2.tidyverse.org/), ggsci v2.9 (https://github.com/nanxstats/ggsci), ggpubr v0.4.0 (https://github.com/kassambara/ggpubr/), ComplexHeatmap v2.6.2 (ref. 107) and networkD3 v0.4 (https://christophergandrud.github.io/networkD3/).

Materials availability

The Quartet multi-omics reference materials generated in this study can be accessed from the Quartet Data Portal (https://chinese-quartet.org/) under the Administrative Regulations of the People’s Republic of China on Human Genetic Resources.

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.
