Study of Single Cell Integration Using Machine Learning
1Anamika, 2Ajay K Kaushik
1Maharaja Agrasen Institute of Technology, 2Maharaja Agrasen Institute of Technology
Abstract: Single-cell genomics has revolutionized our understanding of biology by enabling the measurement of DNA, RNA, and proteins in individual cells. However, analyzing single-cell data is challenging because measurements are sparse and noisy, molecular sampling depths differ, and batch effects are present. Additionally, current pipelines for single-cell data analysis treat cells as static snapshots, disregarding the underlying dynamical biological processes. Modeling these temporal dynamics alongside changes in cell state remains a crucial open challenge in single-cell data science.
This paper presents a methodology for analyzing single-cell multiomics data collected from mobilized peripheral CD34+ hematopoietic stem and progenitor cells (HSPCs) isolated from four healthy human donors. The data comprises five time points over a ten-day period, during which cells were cultured with StemSpan SFEM media supplemented with CC100 and thrombopoietin (TPO) and incubated at 37 °C. Two single-cell assays were used to measure two modalities each: chromatin accessibility (DNA) and gene expression (RNA) for the Multiome kit, and gene expression (RNA) and surface protein levels for the CITEseq kit.
The task is to predict gene expression from chromatin accessibility for the Multiome samples, and protein levels from gene expression for the CITEseq samples. The cell types include Mast Cell Progenitor, Megakaryocyte Progenitor, Neutrophil Progenitor, Monocyte Progenitor, Erythrocyte Progenitor, Hematopoietic Stem Cell, and B-Cell Progenitor.
The methodology includes exploratory data analysis, data pre-processing and feature engineering, and a model architecture comprising LightGBM and a neural network. The data pre-processing includes normalization, transformation, standardization, and batch-effect correction. Feature engineering involves decomposition methods such as Principal Component Analysis, Incremental PCA, and Factor Analysis, as well as feature selection based on stable correlations within each group. Cell-type encoding is done using a one-hot encoding scheme.
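As a minimal illustration of the cell-type encoding step, the sketch below one-hot encodes the seven cell types named in this abstract using only the Python standard library. The function name `one_hot_cell_type` and the category order are assumptions made for illustration; they are not details of the pipeline described here.

```python
# Hypothetical category order; the abstract does not specify one.
CELL_TYPES = [
    "Mast Cell Progenitor",
    "Megakaryocyte Progenitor",
    "Neutrophil Progenitor",
    "Monocyte Progenitor",
    "Erythrocyte Progenitor",
    "Hematopoietic Stem Cell",
    "B-Cell Progenitor",
]


def one_hot_cell_type(label, categories=CELL_TYPES):
    """Return a 0/1 vector with a single 1 at the position of `label`."""
    if label not in categories:
        raise ValueError(f"unknown cell type: {label!r}")
    return [1 if label == c else 0 for c in categories]


# Encode one cell; the resulting vector has one entry per cell type.
encoded = one_hot_cell_type("Monocyte Progenitor")
```

Each encoded vector can then be concatenated with the decomposed features before they are passed to a model.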
Models are evaluated with group k-fold cross-validation (GroupKFold), and the approach achieves an accuracy of 87.63% on the test dataset.
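The idea behind group k-fold splitting can be sketched without any dependencies: all cells that share a group label are kept on the same side of every split, so validation scores are not inflated by group-specific effects. The sketch below assumes the grouping variable is the donor, which the abstract does not state, and `group_k_fold` is a hypothetical helper; in practice scikit-learn offers this strategy as `sklearn.model_selection.GroupKFold`.

```python
from collections import defaultdict


def group_k_fold(groups, n_splits):
    """Yield (train_idx, val_idx) pairs in which no group spans both sides."""
    members = defaultdict(list)
    for idx, g in enumerate(groups):
        members[g].append(idx)
    if n_splits < 2 or n_splits > len(members):
        raise ValueError("n_splits must be between 2 and the number of groups")

    # Assign the largest groups first, each to the currently smallest fold.
    folds = [[] for _ in range(n_splits)]
    for g in sorted(members, key=lambda g: (-len(members[g]), str(g))):
        smallest = min(range(n_splits), key=lambda i: len(folds[i]))
        folds[smallest].extend(members[g])

    for i in range(n_splits):
        val = sorted(folds[i])
        train = sorted(j for k in range(n_splits) if k != i for j in folds[k])
        yield train, val


# Toy example: six cells from three hypothetical donors (assumed grouping).
donors = ["d1", "d1", "d2", "d2", "d3", "d3"]
splits = list(group_k_fold(donors, n_splits=3))
```

With this toy input, each fold holds out exactly one donor, so every validation set contains only cells from a donor the model has not seen during training.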