Research
Community Detection
Community detection divides a network into communities such that nodes within the same group are densely connected while connections between different groups are sparse. Modularity is a key metric for the quality of such a division: it compares the fraction of edges that fall within communities with the fraction expected if edges were placed at random while preserving node degrees. High modularity indicates a strong community structure with well-defined groups. Recently, combining graph neural networks (GNNs) with modularity-based community detection has emerged as a powerful approach for analyzing complex networks.
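As a rough illustration of the modularity score, the sketch below computes Newman's modularity for a toy undirected graph. The function name `modularity`, the toy edge list, and the two candidate partitions are illustrative assumptions, not part of any specific method used in our work.

```python
from collections import defaultdict

def modularity(edges, community):
    """Newman modularity Q for an undirected, unweighted graph without self-loops.

    edges: list of (u, v) pairs, each undirected edge listed once.
    community: dict mapping node -> community label.
    """
    m = len(edges)
    degree = defaultdict(int)
    internal = defaultdict(int)  # number of edges with both ends in a community
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
        if community[u] == community[v]:
            internal[community[u]] += 1
    degree_sum = defaultdict(int)  # total degree of the nodes in each community
    for node, d in degree.items():
        degree_sum[community[node]] += d
    # Q = sum over communities of (internal edge fraction - expected fraction).
    return sum(internal[c] / m - (degree_sum[c] / (2 * m)) ** 2 for c in degree_sum)

# Toy graph (hypothetical): two triangles joined by a single bridge edge.
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
good_split = {0: "a", 1: "a", 2: "a", 3: "b", 4: "b", 5: "b"}
bad_split = {0: "a", 1: "b", 2: "a", 3: "b", 4: "a", 5: "b"}

q_good = modularity(edges, good_split)  # one community per triangle
q_bad = modularity(edges, bad_split)    # communities cut across the triangles
print(q_good, q_bad)
```

The partition that keeps each triangle together scores higher than the one that cuts across them, which is the behavior a modularity-based objective rewards.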
Generative Models
Generative models, which can generate new data instances, have recently been in the spotlight. They are powerful but hard to build. Discriminative models try to draw boundaries in the data space, whereas generative models try to model how the data are distributed throughout that space, so generative models must capture more. Families such as GANs, VAEs, flow-based models, and diffusion models have been developed. Their applications are varied and useful, including image generation. Our objective is to study a wide range of generative models and find interesting applications.
Link Prediction
Link prediction is a task in graph and network analysis that aims to predict missing or future connections between nodes. Given a partially observed network, the goal is to infer which links are most likely missing or likely to form, based on the observed connections and the structure of the network.
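To make the task concrete, here is a minimal sketch of a classical structural baseline: scoring each unobserved node pair by the Jaccard similarity of the two nodes' neighborhoods. The toy edge list and the function name `jaccard_scores` are assumptions made for illustration; this is a simple heuristic, not a GNN-based method.

```python
from itertools import combinations

def jaccard_scores(edges):
    """Score every non-adjacent node pair by neighborhood overlap."""
    neighbors = {}
    for u, v in edges:
        neighbors.setdefault(u, set()).add(v)
        neighbors.setdefault(v, set()).add(u)
    scores = {}
    for u, v in combinations(sorted(neighbors), 2):
        if v in neighbors[u]:
            continue  # already an observed link
        shared = neighbors[u] & neighbors[v]
        union = neighbors[u] | neighbors[v]
        scores[(u, v)] = len(shared) / len(union)
    return scores

# Hypothetical partially observed network.
observed = [("a", "b"), ("a", "c"), ("b", "c"), ("b", "d"), ("c", "d"), ("d", "e")]
scores = jaccard_scores(observed)

# Candidate links, most likely first.
ranked = sorted(scores, key=scores.get, reverse=True)
print(ranked[0], scores[ranked[0]])
```

Pairs that share many neighbors relative to their combined neighborhood rank highest, which is the kind of structural signal a learned model is also expected to pick up.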
We are conducting a study that uses graph neural networks (GNNs) to predict infection routes between individuals from COVID-19 data, framed as a link prediction problem. From the COVID-19 data, we construct a graph in which each individual is a node and each contact relationship between individuals is an edge. We then train a GNN to predict the likelihood of one individual infecting another, taking into account the structural patterns of the graph and the characteristics of each individual. We aim to contribute to strategies for tracking infection chains and taking preventive measures effectively. Our work focuses on combining graph theory with neural network algorithms to find effective solutions to realistic and important social problems.
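The general shape of such a pipeline can be sketched as a forward pass only: one graph-convolution layer produces node embeddings from the contact graph and node features, and a dot-product decoder turns pairs of embeddings into link probabilities. Everything here is an assumption for illustration: the contact list, the random features, the untrained random weights, and the choice of a GCN-style layer with a dot-product decoder. It is not the model used in the study.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical contact graph: 5 individuals, undirected contact edges.
contacts = [(0, 1), (1, 2), (2, 3), (3, 4)]
n = 5
A = np.zeros((n, n))
for i, j in contacts:
    A[i, j] = A[j, i] = 1.0

# Hypothetical per-individual features; random placeholders here.
X = rng.normal(size=(n, 3))

# Symmetrically normalized adjacency with self-loops: D^{-1/2} (A + I) D^{-1/2}.
A_hat = A + np.eye(n)
d_inv_sqrt = 1.0 / np.sqrt(A_hat.sum(axis=1))
A_norm = A_hat * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]

# One graph-convolution layer with untrained random weights.
W = rng.normal(size=(3, 4))
H = np.tanh(A_norm @ X @ W)  # node embeddings mixing structure and features

# Dot-product decoder: probability of a link between every pair of nodes.
logits = H @ H.T
probs = 1.0 / (1.0 + np.exp(-logits))
print(np.round(probs, 2))
```

In practice the weights would be trained against observed links, and because a dot-product decoder is symmetric, a directed notion such as "who infected whom" would need an asymmetric decoder instead.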
Inverse Problems
Inverse problems in score-based and diffusion models involve reconstructing original data from noisy, corrupted, or incomplete observations. Score-based models estimate the gradient of the log-probability (the score) of the data distribution and iteratively refine noisy data, pushing it back towards high-density regions of the true data distribution. Diffusion models, on the other hand, use a forward process that progressively adds noise to the data in discrete steps, followed by a reverse denoising process that gradually removes this noise to recover the original structure. Both approaches are highly effective at reconstructing detailed data structures and are particularly valuable for applications such as conditional generation, image restoration, inpainting, super-resolution, and medical imaging, where precise recovery from imperfect observations is essential.
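The score-guided refinement described above can be illustrated with a one-dimensional toy inverse problem, sketched here under strong assumptions: the prior over the clean signal is a known Gaussian (so its score is available in closed form, where a trained network would normally be used), the observation is the signal plus Gaussian noise, and sampling uses plain unadjusted Langevin dynamics. All names and numeric values are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

prior_mean, prior_std = 0.0, 1.0  # assumed Gaussian prior over the clean signal x
noise_std = 0.5                   # assumed measurement noise level
y = 1.5                           # a single noisy observation, y = x + noise

def prior_score(x):
    # Gradient of log N(x; prior_mean, prior_std^2); a learned score would go here.
    return -(x - prior_mean) / prior_std**2

def likelihood_score(x):
    # Gradient with respect to x of log N(y; x, noise_std^2).
    return (y - x) / noise_std**2

step = 0.01
x = rng.normal(0.0, 3.0, size=5000)  # start far from the posterior
for _ in range(500):
    grad = prior_score(x) + likelihood_score(x)  # score of the posterior p(x | y)
    x = x + 0.5 * step * grad + np.sqrt(step) * rng.normal(size=x.shape)

# Closed-form Gaussian posterior, used only to check the samples.
posterior_var = 1.0 / (1.0 / prior_std**2 + 1.0 / noise_std**2)
posterior_mean = posterior_var * (prior_mean / prior_std**2 + y / noise_std**2)
print(x.mean(), posterior_mean, x.std(), np.sqrt(posterior_var))
```

Adding the likelihood gradient to the prior score is what steers the samples towards signals consistent with the observation; the sample mean and spread should land close to the analytic posterior, up to a small bias from the finite step size.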
High-Throughput Data Analysis
Single-cell RNA sequencing is used to analyze the gene expression of individual cells, thereby adding to existing knowledge of biological phenomena, and the technology is widely used in biomedical studies. Recently, variational autoencoders have been adopted for the analysis of single-cell data owing to their high capacity to handle large-scale data. Many variants have been applied and have yielded superior results. However, because the model is nonlinear, it does not provide parameters that can be used to explain the underlying biological patterns. In this paper, we propose an interpretable nonnegative matrix factorization method that decomposes parameters into those shared across cells and those that are cell-specific. Nonlinear dimension reduction is achieved by applying a variational autoencoder to the cell-specific parameters. In addition, the model can estimate cell-type-specific gene expression. To improve estimation accuracy, we introduce log-regularization, which reflects a property of single-cell data. Overall, the approach displayed excellent performance in a simulation study and in real data analyses while maintaining good biological interpretability.
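For readers unfamiliar with nonnegative matrix factorization, the sketch below shows the plain version with standard multiplicative updates on synthetic data. It is only the basic building block: it does not include the variational autoencoder or the log-regularization of the proposed method. The matrix sizes, the function name `nmf`, and the reading of `W` as shared gene-level factors and `H` as cell-specific loadings are assumptions for illustration.

```python
import numpy as np

def nmf(V, rank, n_iter=500, eps=1e-9, seed=0):
    """Plain NMF, V ~ W @ H, minimizing squared error with multiplicative updates."""
    rng = np.random.default_rng(seed)
    n, m = V.shape
    W = rng.random((n, rank)) + eps
    H = rng.random((rank, m)) + eps
    for _ in range(n_iter):
        # Each update multiplies by a nonnegative ratio, so W and H stay nonnegative.
        H *= (W.T @ V) / (W.T @ W @ H + eps)
        W *= (V @ H.T) / (W @ H @ H.T + eps)
    return W, H

# Synthetic nonnegative "expression" matrix: 30 genes x 12 cells, true rank 2.
rng = np.random.default_rng(1)
true_W = rng.random((30, 2))
true_H = rng.random((2, 12))
V = true_W @ true_H

W, H = nmf(V, rank=2)  # W: factors shared across cells, H: one column per cell
rel_error = np.linalg.norm(V - W @ H) / np.linalg.norm(V)
print(rel_error)
```

Because both factors are constrained to be nonnegative, each column of `V` is an additive mix of the columns of `W`, which is the property that makes this family of models easier to interpret than a generic nonlinear encoder.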
Graphical Models
We characterize complex dependence structures among correlated biological variables using data integration. With the emergence of large collections of diverse biological datasets, a remaining challenge is how to integrate them to better reflect the biological process under study. To this end, Dr. Chun and her collaborators developed a conditional Gaussian graphical model (CGGM) in which extra information is incorporated as additional predictors through a flexible reproducing kernel Hilbert space estimator. This research established a data integration framework for systematically modeling complex biological networks of gene-gene and gene-genome regulations. The framework was then extended in several directions. For example, Dr. Chun added joint estimation of multiple networks and relaxed a distributional assumption to broaden its applicability.
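The basic idea behind a Gaussian graphical model can be sketched in a few lines: an edge between two variables is present exactly when the corresponding entry of the precision (inverse covariance) matrix is nonzero. The example below uses a hypothetical three-variable chain and is an ordinary Gaussian graphical model, not the conditional model with kernel-based predictors described above.

```python
import numpy as np

# Hypothetical precision (inverse covariance) matrix for a chain X0 - X1 - X2.
precision = np.array([[ 2.0, -1.0,  0.0],
                      [-1.0,  2.0, -1.0],
                      [ 0.0, -1.0,  2.0]])
covariance = np.linalg.inv(precision)

# Partial correlation of i and j given the rest: -P_ij / sqrt(P_ii * P_jj).
d = np.sqrt(np.diag(precision))
partial_corr = -precision / np.outer(d, d)
np.fill_diagonal(partial_corr, 1.0)

# Edges of the graph: nonzero off-diagonal precision entries.
edges = [(i, j) for i in range(3) for j in range(i + 1, 3)
         if abs(precision[i, j]) > 1e-12]

print(edges)
print(covariance[0, 2], partial_corr[0, 2])
```

Variables 0 and 2 are marginally correlated (nonzero covariance) yet conditionally independent given variable 1 (zero partial correlation), so the graph has no direct edge between them. Separating direct from indirect dependence in this way is what makes the precision matrix the natural target for network estimation.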