MVCBench: A Multimodal Benchmark for
Drug-induced Virtual Cell Phenotypes

Bo Li1,†, Qing Wang2,†, Shihang Wang3,†, Bob Zhang1,4,*, Yuzhong Peng5, Pinxian Zeng1, Chengliang Liu1,
Mengran Li6, Ziyang Tang7, Xiaojun Yao3, Chuxia Deng5, Qianqian Song2,8,*
1FST, University of Macau, 2University of Florida, 3Macao Polytechnic University, 4Institute of Collaborative Innovation, UM,
5FHS, University of Macau, 6HKISI-CAS, 7NVIDIA, 8Wake Forest School of Medicine

Abstract

Realizing a holistic virtual cell capable of accurately predicting how drug treatments reshape cellular phenotypes across transcriptional and morphological landscapes is a central goal in drug discovery. To map this gap in current representation capabilities, we introduce MVCBench, a systematic benchmarking framework evaluating 24 representation models (12 molecular and 12 gene representations).

Leveraging nearly 1.1 million paired profiles, we propose a progressive evaluation logic moving from independent component assessment to holistic multimodal modeling. Our analysis reveals a pronounced performance asymmetry: advanced molecular representations (e.g., UniMolV2) excel in morphology but show limited gains over simple fingerprints for gene expression. Furthermore, we demonstrate that multimodal joint optimization is essential for integrating orthogonal biological signals, establishing empirical design principles for next-generation Virtual Cell models.

The MVCBench Framework


Figure 1. Overview of MVCBench. (a-b) The benchmark covers 1.1 million paired profiles across 5 transcriptomic and 3 morphological datasets. (c) We evaluate 24 SOTA models (12 Drug Reps, 12 Gene Reps). (d) The framework employs a progressive benchmarking protocol: (A) Chemical Focus, (B) Biological Focus, and (C) Multimodal Virtual Cell.

  • Paired Profiles: 1.1 Million
  • Transcriptomic Compounds: 38,950
  • Morphology Compounds: 84,858
  • Models: 24

Part 1: Benchmarking Drug Representations

Insight: A pronounced performance asymmetry exists between gene expression and morphology prediction.

Gene Expression Prediction

For gene expression tasks, advanced molecular representations (e.g., UniMolV2, KPGT) show only marginal gains over simple ECFP4 fingerprints. The performance gap is minimal (< 1%), suggesting saturation in predicting transcriptomic responses solely from chemical structure.
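As an illustration of how such comparisons can be scored, predicted and measured post-treatment profiles can be compared with a per-gene Pearson correlation. This is a minimal sketch under assumed conventions (row = drug, column = gene; metric choices may differ from MVCBench's actual protocol):

```python
import numpy as np

def per_gene_pearson(pred: np.ndarray, true: np.ndarray) -> np.ndarray:
    """Pearson correlation between predicted and measured expression,
    computed independently for each gene (column).

    pred, true: (n_drugs, n_genes) arrays of post-treatment profiles.
    Returns an array of n_genes correlation coefficients.
    """
    pred_c = pred - pred.mean(axis=0)   # center each gene across drugs
    true_c = true - true.mean(axis=0)
    num = (pred_c * true_c).sum(axis=0)
    den = np.sqrt((pred_c ** 2).sum(axis=0) * (true_c ** 2).sum(axis=0))
    return num / den

# Toy example: 5 drugs x 3 genes, predictions = truth plus small noise
rng = np.random.default_rng(0)
true = rng.normal(size=(5, 3))
pred = true + 0.1 * rng.normal(size=(5, 3))
r = per_gene_pearson(pred, true)  # one coefficient per gene
```

Averaging such per-gene scores over a benchmark split gives a single number per representation, which is how the "< 1% gap" style of comparison between ECFP4 and deep representations can be quantified.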


Figure 2. Drug representations for gene expression prediction (LINCS2020 & Tahoe). Note the similar performance across methods.

Morphology Prediction

In contrast, for morphology (Cell Painting), deep learning-based methods (UniMolV2, Chemprop) significantly outperform the ECFP4 baseline. 3D geometric information proves crucial for predicting structural phenotypic changes.


Figure 3. Drug representations for cell morphology prediction. UniMolV2 and KPGT consistently lead across datasets (cpg0016, BBBC).

Part 2: Benchmarking Gene Representations

Perturbation-Specific Models Win: The model STATE maintains a consistent lead, with scGPT and scFoundation forming a competitive top tier. This underscores the value of aligning pretraining objectives (perturbation prediction) with downstream tasks.


Figure 4. Comparative benchmarking of 12 gene representation methods. (a) Performance on LINCS2020. (b) Evaluation on high signal-to-noise Highly Expressed Genes (HEG). (c-d) Generalization on Tahoe_mini dataset.

Part 3: Multimodal Virtual Cell Construction

Joint Optimization Gains

We systematically compared single-modality training with multimodal joint optimization (MVC). MVC consistently yields the highest performance, outperforming single-modality baselines by 2-6%.

UniMolV2 emerged as the most robust chemical backbone for multimodal tasks (Composite Performance Score, CPS = 1.0), followed by KPGT.
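The joint objective behind such multimodal training can be sketched as a fixed-ratio weighted sum of the two modality losses. The weight `lam` and the MSE losses below are illustrative assumptions, not the paper's exact formulation:

```python
import numpy as np

def mse(pred: np.ndarray, target: np.ndarray) -> float:
    return float(np.mean((pred - target) ** 2))

def mvc_joint_loss(gene_pred, gene_true, morph_pred, morph_true, lam=1.0):
    """Fixed-ratio multimodal objective: L = L_gene + lam * L_morph.

    lam is a single scalar chosen once (e.g. to calibrate the scales of
    the two losses), in contrast to adaptive schemes that re-weight the
    terms during training.
    """
    l_gene = mse(gene_pred, gene_true)
    l_morph = mse(morph_pred, morph_true)
    return l_gene + lam * l_morph, l_gene, l_morph

# Toy example with random targets and predictions
rng = np.random.default_rng(1)
g_pred, g_true = rng.normal(size=8), rng.normal(size=8)
m_pred, m_true = rng.normal(size=8), rng.normal(size=8)
total, l_g, l_m = mvc_joint_loss(g_pred, g_true, m_pred, m_true, lam=0.5)
```

Training both heads against this single combined loss is what distinguishes joint optimization from the single-modality baselines it outperforms.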


Figure 5. Benchmarking multimodal virtual cell construction. (c) Composite Performance Score ranking UniMolV2 as top. (d-e) Multimodal joint optimization (MVC) outperforms single-task (CP) and cross-modal auxiliary (MVC_CP) settings.



Figure 6. Design Principles. (a-b) Modality orthogonality: Gene and Morphology provide non-redundant information. (c) Loss Balancing: Fixed-ratio weighting outperforms adaptive schemes (MTA/UBA). (d) Fusion: Intermediate fusion is optimal for morphology; Late fusion for gene expression.

Empirical Design Principles

  • Modality Orthogonality: Transcriptomic and morphological predictive tasks are uncorrelated (PCC ≈ 0), confirming the necessity of joint modeling.
  • Loss Balancing: Simple Fixed-ratio weighting (explicit scale calibration) significantly outperforms complex adaptive schemes like Uncertainty weighting, preventing noise dominance.
  • Fusion Strategy:
    • Intermediate Fusion: Best for structural tasks (Morphology).
    • Late Fusion: Best for high-dimensional regulatory tasks (Gene Expression).
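The two fusion strategies above can be contrasted in a minimal sketch. The layer widths and the random-weight `linear` stand-in for learned layers are illustrative assumptions; only the placement of the fusion point reflects the principle:

```python
import numpy as np

rng = np.random.default_rng(2)

def linear(x: np.ndarray, out_dim: int) -> np.ndarray:
    """Stand-in for a learned linear layer (fixed random weights)."""
    w = rng.normal(size=(x.shape[-1], out_dim)) / np.sqrt(x.shape[-1])
    return x @ w

def intermediate_fusion(drug_emb, gene_emb, out_dim):
    """Fuse hidden features mid-network, then predict with a shared head."""
    h = np.concatenate([linear(drug_emb, 32), linear(gene_emb, 32)], axis=-1)
    return linear(h, out_dim)

def late_fusion(drug_emb, gene_emb, out_dim):
    """Run modality-specific heads to the end, then combine predictions."""
    return 0.5 * (linear(drug_emb, out_dim) + linear(gene_emb, out_dim))

drug = rng.normal(size=(4, 64))   # batch of drug representations
gene = rng.normal(size=(4, 128))  # batch of gene representations
y_mid = intermediate_fusion(drug, gene, out_dim=10)
y_late = late_fusion(drug, gene, out_dim=10)
```

Intermediate fusion lets cross-modal interactions shape a shared hidden space (useful for structural, morphology-style targets), while late fusion keeps each modality's pathway independent until the output (useful for high-dimensional regulatory targets).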

Citation

@article{li2026mvcbench,
  title={MVCBench: A Multimodal Benchmark for Drug-induced Virtual Cell Phenotypes},
  author={Li, Bo and Wang, Qing and Wang, Shihang and Zhang, Bob and Peng, Yuzhong and Zeng, Pinxian and Liu, Chengliang and Li, Mengran and Tang, Ziyang and Yao, Xiaojun and Deng, Chuxia and Song, Qianqian},
  journal={bioRxiv},
  year={2026}
}