MVCBench: A Multimodal Benchmark for
Drug-induced Virtual Cell Phenotypes

Bo Li1,†, Qing Wang2,†, Shihang Wang3,†, Bob Zhang1,4,*, Yuzhong Peng5, Pinxian Zeng1, Chengliang Liu1,
Mengran Li6, Ziyang Tang7, Xiaojun Yao3, Chuxia Deng5, Qianqian Song2,8,*
1FST, University of Macau, 2University of Florida, 3Macao Polytechnic University, 4Institute of Collaborative Innovation, UM,
5FHS, University of Macau, 6HKISI-CAS, 7NVIDIA, 8Wake Forest School of Medicine

Abstract

Realizing a holistic virtual cell capable of accurately predicting how drug treatments reshape cellular phenotypes across transcriptional and morphological landscapes is a central goal in drug discovery. To map this gap in current representation capabilities, we introduce MVCBench, a systematic benchmarking framework evaluating 24 representation models (12 molecular and 12 gene representations).

Leveraging nearly 1.1 million paired profiles, we propose a progressive evaluation logic moving from independent component assessment to holistic multimodal modeling. Our analysis reveals a pronounced performance asymmetry: advanced molecular representations (e.g., UniMolV2) excel in morphology but show limited gains over simple fingerprints for gene expression. Furthermore, we demonstrate that multimodal joint optimization is essential for integrating orthogonal biological signals, establishing empirical design principles for next-generation Virtual Cell models.

The MVCBench Framework


Figure 1. Overview of MVCBench. (a-b) The benchmark covers 1.1 million paired profiles across 5 transcriptomic and 3 morphological datasets. (c) We evaluate 24 SOTA models (12 Drug Reps, 12 Gene Reps). (d) The framework employs a progressive benchmarking protocol: (A) Chemical Focus, (B) Biological Focus, and (C) Multimodal Virtual Cell.

  • Paired Profiles: 1.1 Million
  • Transcriptomic Compounds: 38,950
  • Morphology Compounds: 84,858
  • Models: 24

Part 1: Benchmarking Drug Representations

Insight: A pronounced performance asymmetry exists between gene expression and morphology prediction.

Gene Expression Prediction

For gene expression tasks, advanced molecular representations (e.g., UniMolV2, KPGT) show only marginal gains over simple ECFP4 fingerprints. The performance gap is minimal (< 1%), suggesting saturation in predicting transcriptomic responses solely from chemical structure.
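As an illustration of how such comparisons can be scored, predicted and measured post-treatment profiles can be compared with a per-gene Pearson correlation. This is a minimal sketch under assumed conventions (row = drug, column = gene; metric choices may differ from MVCBench's actual protocol):

```python
import numpy as np

def per_gene_pearson(pred: np.ndarray, true: np.ndarray) -> np.ndarray:
    """Pearson correlation between predicted and measured expression,
    computed independently for each gene (column).

    pred, true: (n_drugs, n_genes) arrays of post-treatment profiles.
    Returns an array of n_genes correlation coefficients.
    """
    pred_c = pred - pred.mean(axis=0)   # center each gene across drugs
    true_c = true - true.mean(axis=0)
    num = (pred_c * true_c).sum(axis=0)
    den = np.sqrt((pred_c ** 2).sum(axis=0) * (true_c ** 2).sum(axis=0))
    return num / den

# Toy example: 5 drugs x 3 genes, predictions = truth plus small noise
rng = np.random.default_rng(0)
true = rng.normal(size=(5, 3))
pred = true + 0.1 * rng.normal(size=(5, 3))
r = per_gene_pearson(pred, true)  # one coefficient per gene
```

Averaging such per-gene scores over a benchmark split gives a single number per representation, which is how the "< 1% gap" style of comparison between ECFP4 and deep representations can be quantified.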


Figure 2. Drug representations for gene expression prediction (LINCS2020 & Tahoe). Note the similar performance across methods.

Morphology Prediction

In contrast, for morphology (Cell Painting), deep learning-based methods (UniMolV2, Chemprop) significantly outperform the ECFP4 baseline. 3D geometric information proves crucial for predicting structural phenotypic changes.


Figure 3. Drug representations for cell morphology prediction. UniMolV2 and KPGT consistently lead across datasets (cpg0016, BBBC).

Part 2: Benchmarking Gene Representations

Perturbation-Specific Models Win: The model STATE maintains a consistent lead, with scGPT and scFoundation forming a competitive top tier. This underscores the value of aligning pretraining objectives (perturbation prediction) with downstream tasks.


Figure 4. Comparative benchmarking of 12 gene representation methods. (a) Performance on LINCS2020. (b) Evaluation on high signal-to-noise Highly Expressed Genes (HEG). (c-d) Generalization on Tahoe_mini dataset.

Part 3: Multimodal Virtual Cell Construction

Joint Optimization Gains

We systematically compared single-modality training with multimodal joint optimization (MVC). MVC consistently yields the highest performance, outperforming single-modality baselines by 2-6%.

UniMolV2 emerged as the most robust chemical backbone for multimodal tasks (Composite Performance Score, CPS = 1.0), followed by KPGT.
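The joint objective behind such multimodal training can be sketched as a fixed-ratio weighted sum of the two modality losses. The weight `lam` and the MSE losses below are illustrative assumptions, not the paper's exact formulation:

```python
import numpy as np

def mse(pred: np.ndarray, target: np.ndarray) -> float:
    return float(np.mean((pred - target) ** 2))

def mvc_joint_loss(gene_pred, gene_true, morph_pred, morph_true, lam=1.0):
    """Fixed-ratio multimodal objective: L = L_gene + lam * L_morph.

    lam is a single scalar chosen once (e.g. to calibrate the scales of
    the two losses), in contrast to adaptive schemes that re-weight the
    terms during training.
    """
    l_gene = mse(gene_pred, gene_true)
    l_morph = mse(morph_pred, morph_true)
    return l_gene + lam * l_morph, l_gene, l_morph

# Toy example with random targets and predictions
rng = np.random.default_rng(1)
g_pred, g_true = rng.normal(size=8), rng.normal(size=8)
m_pred, m_true = rng.normal(size=8), rng.normal(size=8)
total, l_g, l_m = mvc_joint_loss(g_pred, g_true, m_pred, m_true, lam=0.5)
```

Training both heads against this single combined loss is what distinguishes joint optimization from the single-modality baselines it outperforms.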


Figure 5. Benchmarking multimodal virtual cell construction. (c) Composite Performance Score ranking UniMolV2 as top. (d-e) Multimodal joint optimization (MVC) outperforms single-task (CP) and cross-modal auxiliary (MVC_CP) settings.



Figure 6. Design Principles. (a-b) Modality orthogonality: Gene and Morphology provide non-redundant information. (c) Loss Balancing: Fixed-ratio weighting outperforms adaptive schemes (MTA/UBA). (d) Fusion: Intermediate fusion is optimal for morphology; Late fusion for gene expression.

Empirical Design Principles

  • Modality Orthogonality: Transcriptomic and morphological predictive tasks are uncorrelated (PCC ≈ 0), confirming the necessity of joint modeling.
  • Loss Balancing: Simple Fixed-ratio weighting (explicit scale calibration) significantly outperforms complex adaptive schemes like Uncertainty weighting, preventing noise dominance.
  • Fusion Strategy:
    • Intermediate Fusion: Best for structural tasks (Morphology).
    • Late Fusion: Best for high-dimensional regulatory tasks (Gene Expression).
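The two fusion strategies above can be contrasted in a minimal sketch. The layer widths and the random-weight `linear` stand-in for learned layers are illustrative assumptions; only the placement of the fusion point reflects the principle:

```python
import numpy as np

rng = np.random.default_rng(2)

def linear(x: np.ndarray, out_dim: int) -> np.ndarray:
    """Stand-in for a learned linear layer (fixed random weights)."""
    w = rng.normal(size=(x.shape[-1], out_dim)) / np.sqrt(x.shape[-1])
    return x @ w

def intermediate_fusion(drug_emb, gene_emb, out_dim):
    """Fuse hidden features mid-network, then predict with a shared head."""
    h = np.concatenate([linear(drug_emb, 32), linear(gene_emb, 32)], axis=-1)
    return linear(h, out_dim)

def late_fusion(drug_emb, gene_emb, out_dim):
    """Run modality-specific heads to the end, then combine predictions."""
    return 0.5 * (linear(drug_emb, out_dim) + linear(gene_emb, out_dim))

drug = rng.normal(size=(4, 64))   # batch of drug representations
gene = rng.normal(size=(4, 128))  # batch of gene representations
y_mid = intermediate_fusion(drug, gene, out_dim=10)
y_late = late_fusion(drug, gene, out_dim=10)
```

Intermediate fusion lets cross-modal interactions shape a shared hidden space (useful for structural, morphology-style targets), while late fusion keeps each modality's pathway independent until the output (useful for high-dimensional regulatory targets).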

Citation

@article{li2026mvcbench,
  title={MVCBench: A Multimodal Benchmark for Drug-induced Virtual Cell Phenotypes},
  author={Li, Bo and Wang, Qing and Wang, Shihang and Zhang, Bob and Peng, Yuzhong and Zeng, Pinxian and Liu, Chengliang and Li, Mengran and Tang, Ziyang and Yao, Xiaojun and Deng, Chuxia and Song, Qianqian},
  journal={bioRxiv},
  year={2026}
}