Rethinking Multimodal Molecular Learning

From Model-Centric to Data-Centric
Runqing Xu
MPhil Thesis Defense
HKUST(GZ) · Data Science and Analytics · May 9, 2026
Supervisor: Prof. Yongqi Zhang  |  Co-supervisor: Prof. Li Liu
Part 1 · Introduction

Why Multimodal Molecular Learning?

One molecule, multiple complementary representations

🧬
SMILES String
1D notation
🔗
2D Graph
Topology & bonds
🧊
3D Geometry
Spatial arrangement
📝
Text Description
Curated knowledge
🌐
Knowledge Graph
Relational context

These views are not redundant — they expose complementary evidence about molecular identity, reactivity, function, and biological context.

The challenge is not to collect more modalities, but to use heterogeneous evidence accurately and robustly.

Part 1 · Introduction

Three Challenges

Heterogeneity

Different modalities carry different types of mechanistic evidence. Molecular structure reflects pharmacokinetic mechanisms; text descriptions capture pharmacodynamic mechanisms. The relevance of each modality varies per instance.

Incompleteness

Molecular data is rarely balanced across modalities. Structural information is usually available, but for newly emerging drugs, textual descriptions or knowledge graph coverage may be sparse or entirely missing.

Weak Grounding

Many current systems reason through latent representations that are difficult to audit. Even when predictions are accurate, it is unclear whether reasoning is anchored to chemically meaningful evidence.

Part 1 · Introduction

The Dominant Response Has Been Model-Centric

Better Encoders

ChemBERTa, Uni-Mol

Better Alignment

MoleculeSTM, MoMu

Better Fusion

3D-MoLM, Mol-LLaMA

?

What’s next?

Is improving the model always the right place to intervene?
Part 1 · Introduction

Thesis Framework — Two Intervention Points

Model-Centric

Redesign how the model processes and fuses evidence

Specialized experts, adaptive routing, mechanism-aware fusion. The model architecture is the primary locus of improvement.

M²DDI
A Unified Framework for Dynamic Multimodal Fusion in Drug-Drug Interaction Event Prediction
Under review · KDD 2026 AI4Science
Data-Centric

Redesign the evidence interface: what the model sees and how it learns

Choice of input modality, explicit chemical abstractions, staged curriculum. The base model remains largely unchanged.

MolGlass
Don’t Just Encode But See: A Data-Centric Paradigm for Visual Molecular Understanding in LLMs
Under review · KDD 2026 AI4Science
Match the intervention to the bottleneck
Part 2 · M²DDI

DDI Event Prediction — Problem Setup

Drug A  +  Drug B  →  Which of 86 event types?
Structure → Pharmacokinetic evidence
Function / Text → Pharmacodynamic evidence
Knowledge Graph → Network-mediated evidence
| Method | Struct. | Func. | Rel. | Fusion |
|---|---|---|---|---|
| DeepDDI | ✓ | | | |
| TextDDI | | ✓ | | |
| EmerGNN | | | ✓ | |
| TIGER | ✓ | ✓ | ✓ | Static |
| MolecBioNet | ✓ | ✓ | ✓ | Static |
| M²DDI | ✓ | ✓ | ✓ | Dynamic |
Part 2 · M²DDI

M²DDI Architecture Overview

[Figure: M²DDI architecture]
1. Heterogeneous expert pool (4 structural + 4 functional + 1 relational experts)
2. Prior-enhanced dual-path gating
3. Top-5 softmax selection
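The top-5 selection step can be made concrete with a minimal sketch of top-k softmax gating over a 9-expert pool. This is an illustration of the general mechanism under stated assumptions (toy gate scores, pure-Python for clarity), not the thesis implementation:

```python
import math

def top_k_softmax_gate(scores, k=5):
    """Keep the k highest-scoring experts and renormalize their
    weights with a softmax; all other experts get weight 0."""
    # Rank expert indices by gate score, keep the k highest.
    top = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    # Softmax over the selected scores only (numerically stabilized).
    m = max(scores[i] for i in top)
    exps = {i: math.exp(scores[i] - m) for i in top}
    z = sum(exps.values())
    return [exps[i] / z if i in exps else 0.0 for i in range(len(scores))]

# 9 experts, e.g. 4 structural + 4 functional + 1 relational (illustrative scores)
gate_scores = [2.1, 0.3, 1.7, -0.5, 1.2, 0.9, -1.0, 0.4, 2.5]
weights = top_k_softmax_gate(gate_scores, k=5)
assert sum(w > 0 for w in weights) == 5   # exactly 5 experts active
assert abs(sum(weights) - 1.0) < 1e-9    # active weights sum to 1
```

Because only the selected scores enter the softmax, the inactive experts contribute nothing to the fused prediction, which is what lets the gate redirect weight when a modality is missing or untrustworthy.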
Part 2 · M²DDI

Evaluation Settings

Benchmarks

DrugBank

86 interaction types · Multi-class classification · Macro F1

TWOSIDES

200 side-effect labels · Multi-label prediction · PR-AUC

Generalization Settings

S0

Transductive

Both drugs seen

S1

Semi-inductive

One drug new

S2

Fully inductive

Both drugs new

EASY  ————————→  HARD
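The three settings can be constructed by holding out a subset of drugs and bucketing each pair by how many of its drugs are held out. A minimal sketch (function name, toy pairs, and holdout fraction are illustrative, not the benchmark protocol):

```python
import random

def split_ddi_pairs(pairs, holdout_frac=0.2, seed=0):
    """Assign each drug pair to S0/S1/S2 by holding out a random drug subset.

    S0 (transductive): both drugs were seen during training.
    S1 (semi-inductive): exactly one drug is held out (new).
    S2 (fully inductive): both drugs are held out (new).
    """
    drugs = sorted({d for pair in pairs for d in pair})
    rng = random.Random(seed)
    held_out = set(rng.sample(drugs, int(len(drugs) * holdout_frac)))
    splits = {0: [], 1: [], 2: []}
    for a, b in pairs:
        # Count how many drugs in the pair are unseen: 0 -> S0, 1 -> S1, 2 -> S2.
        splits[(a in held_out) + (b in held_out)].append((a, b))
    return splits[0], splits[1], splits[2], held_out

pairs = [("d1", "d2"), ("d1", "d3"), ("d2", "d4"), ("d3", "d5"), ("d4", "d5")]
s0, s1, s2, held_out = split_ddi_pairs(pairs, holdout_frac=0.4, seed=7)
assert len(s0) + len(s1) + len(s2) == len(pairs)
assert all(a in held_out and b in held_out for a, b in s2)
```

Holding out drugs rather than pairs is what makes S2 hard: no representation of either drug has ever been updated by a training interaction.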
Part 2 · M²DDI

Main Results

| Category | Method | DrugBank Macro F1 (S0 / S1 / S2) | TWOSIDES PR-AUC (S0 / S1 / S2) |
|---|---|---|---|
| Structure | MLP | 60.75 / 21.52 / 20.15 | 81.71 / 81.70 / 64.84 |
| Structure | DeepDDI | 73.55 / 34.36 / 13.73 | 89.69 / 83.06 / 68.51 |
| Function | TextDDI | 92.00 / 59.51 / 26.79 | 92.67 / 84.29 / 83.18 |
| Relation | Decagon | 56.91 / 24.66 / 6.12 | 90.78 / 79.48 / 57.61 |
| Relation | SumGNN | 87.30 / 35.28 / 17.85 | 93.20 / 80.66 / 60.57 |
| Relation | EmerGNN | 94.10 / 62.28 / 27.84 | 96.17 / 89.21 / 81.43 |
| Multimodal | TIGER | 93.53 / 57.52 / 19.78 | 95.72 / 86.70 / 69.95 |
| Multimodal | MolecBioNet | 93.92 / 63.35 / 29.67 | 96.06 / 85.17 / 77.29 |
| Ours | Static Fusion | 94.44 / 60.96 / 31.24 | 92.86 / 88.54 / 79.61 |
| Ours | Ensemble | 94.07 / 61.31 / 32.70 | 93.52 / 90.23 / 85.58 |
| Ours | M²DDI | 96.26 / 68.28 / 40.52 | 98.18 / 92.31 / 87.64 |
The advantage of dynamic routing grows sharply in inductive settings (S1, S2)
Part 2 · M²DDI

Robustness — The Routing Shift

F1 under progressive masking
(a) F1 under progressive modality masking
Expert weight evolution
(b) Expert weight evolution under masking
The gate learns to treat missingness as a routing problem — it redirects to the evidence that remains trustworthy.
Part 2 · M²DDI

Mechanistic Interpretability

PK/PD case study
(a) PK/PD-dominant routing alignment
Routing diversity
(b) Sample-wise routing diversity
M²DDI summary: When the challenge is structured evidence use across heterogeneous and incomplete sources, model-centric design — specialized experts with adaptive routing — is highly effective. But what if the bottleneck is somewhere else?
Part 3 · Transition

Molecular Understanding with LLMs — A New Task

Not just prediction, but description, explanation, and grounded reasoning about molecules

SMILES → text LLM: 1D string, spatial structure lost
Molecule → encoder → projector → LLM: powerful but latent, hard to audit
Is there another way?
Part 4 · MolGlass

From Insight to Approach

1 Insight

VLMs already have strong visual pattern recognition. 2D molecular depictions are how chemists communicate — spatial, multi-channel, and auditable.

2 Data-Centric Idea

Instead of building a new molecular encoder, use molecular images as the input to an existing VLM. Focus effort on making the visual evidence chemically informative. The model stays largely frozen.

3 But...

Standard 2D depictions lack explicit chemical abstractions. A chemist can read scaffolds and functional groups from a drawing, but a VLM cannot — at least not reliably. → MolGlass

Part 4 · MolGlass

MolGlass — Chemical-Aware Visual Augmentation

MolGlass paradigm comparison
Functional Groups
Colored highlights via deterministic matching
Scaffold Core
Gray shading + attachment-point dots
BRICS Cleavage Sites
Blue markers at natural edit points
2.5D Stereo Cues
Wedge/dash bonds for chirality
All augmentations are deterministic — no LLM in the loop. Per-molecule color assignment prevents dataset-level shortcuts.
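The per-molecule color assignment can stay fully deterministic by seeding a palette shuffle from a canonical molecular identifier. The sketch below (palette, function name, and group labels are illustrative) shows why this blocks dataset-level shortcuts: the mapping is stable for a given molecule but differs across molecules, so no global "color = functional group" rule exists to memorize:

```python
import hashlib
import random

PALETTE = ["red", "orange", "green", "blue", "purple", "teal"]  # illustrative

def assign_group_colors(canonical_smiles, functional_groups):
    """Deterministically map each matched functional group to a highlight color.

    The shuffle is seeded from the molecule's canonical SMILES, so the
    same molecule always gets the same colors, while different molecules
    generally get different group-to-color mappings.
    """
    seed = int(hashlib.sha256(canonical_smiles.encode()).hexdigest(), 16)
    palette = PALETTE.copy()
    random.Random(seed).shuffle(palette)
    return {g: palette[i % len(palette)] for i, g in enumerate(functional_groups)}

groups = ["hydroxyl", "amine"]
c1 = assign_group_colors("CCO", groups)
assert c1 == assign_group_colors("CCO", groups)  # stable across runs
assert len(set(c1.values())) == 2                # distinct colors within one molecule
```

In practice the matching itself (SMARTS patterns, scaffold extraction, BRICS bonds) would come from a cheminformatics toolkit; only the color-assignment idea is sketched here.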
Part 4 · MolGlass

Three-Stage Curriculum

👁

Stage I: Perceive

Learn to read MolGlass conventions: match functional groups, count cleavage sites, identify scaffold cues

🧠

Stage II: Reason

Connect visible evidence to chemical and biological implications via structured explanations

💬

Stage III: Converse

Multi-turn grounded dialogue for assistant-style consultation

Base model: Qwen3-VL-8B + LoRA
Vision encoder frozen throughout all stages
The curriculum is itself part of the data-centric intervention — supervision must co-evolve with the evidence interface.
Part 4 · MolGlass

Open-Ended Molecular Understanding

GPT-5–judged response quality on unseen PubChem molecules (normalized against GPT-5)

| Group | Model | Structural | Chemical | Biological |
|---|---|---|---|---|
| Proprietary | GPT-5 | 0.572 | 0.586 | 0.641 |
| Proprietary | GPT-5 + MolGlass image | 0.883 | 1.025 | 1.305 |
| Open-source molecular assistants | Qwen3-VL-8B (vanilla) | 0.577 | 0.568 | 0.635 |
| Open-source molecular assistants | Mol-Instructions | 0.261 | 0.288 | 0.393 |
| Open-source molecular assistants | 3D-MoLM | 0.572 | 0.728 | 0.919 |
| Open-source molecular assistants | LLaMo | 0.439 | 0.405 | 0.720 |
| Open-source molecular assistants | Mol-LLaMA | 0.883 | 1.025 | 1.305 |
| Open-source molecular assistants | MolGlass | 1.263 | 1.479 | 1.878 |
MolGlass surpasses specialized molecular assistants — without introducing a new molecular encoder
Part 4 · MolGlass

Zero-Shot Transfer

| Group | Model | PAMPA (Default / CoT) | BBBP (Default / CoT) | MoleculeQA (Default / CoT) |
|---|---|---|---|---|
| Proprietary | GPT-5 | 58.62 / 61.92 | 58.82 / 62.75 | 54.41 / 55.82 |
| Proprietary | GPT-5 + MolGlass | 64.34 / 69.65 | 65.20 / 69.61 | 56.68 / 60.36 |
| Open-source | Llama3.1-8B | 56.51 / 46.19 | 57.07 / 51.03 | 38.07 / 41.53 |
| Open-source | Mol-Instructions | 49.31 / 47.40 | 47.13 / 45.89 | 49.29 / 49.39 |
| Open-source | 3D-MoLM | 33.16 / 55.45 | 48.04 / 48.53 | 52.32 / 54.62 |
| Open-source | LLaMo | 42.36 / 45.14 | 44.55 / 46.80 | 50.74 / 51.82 |
| Open-source | Mol-LLaMA | 48.40 / 64.53 | 49.50 / 57.79 | 57.45 / 60.09 |
| Open-source | MolGlass | 66.03 / 72.92 | 65.43 / 67.72 | 61.94 / 63.49 |
Gains persist under both default and CoT prompting — genuinely better structural evidence identification
Part 4 · MolGlass

Visually Grounded Explanations

Attention heatmaps
Attention patterns under global, functional group, and modification queries
MolGlass summary: When the challenge is structured evidence use through a visual interface, data-centric design — augmenting the input and co-designing supervision — can match or surpass the impact of model-centric approaches.
Part 5 · Rethinking

Two Bottlenecks, Two Interventions

| Dimension | M²DDI (Model-Centric) | MolGlass (Data-Centric) |
|---|---|---|
| Locus of improvement | Model architecture: experts + routing | Input & supervision: visual interface + curriculum |
| What changes | Expert pool, gating mechanism, fusion strategy | Input modality, chemical annotations, training stages |
| What stays fixed | Input representations (pretrained encoders) | Base VLM (Qwen3-VL-8B, vision encoder frozen) |
| Interpretability | Mechanism-level (expert routing aligns with PK/PD) | Evidence-level (attention aligns with visible chemical cues) |
| System scope | Task-specific pharmacological prediction | General-purpose molecular understanding |
Not “model vs. data” — diagnose where the bottleneck is, then match the intervention to it
Part 5 · Rethinking

What We Learn

1

Diagnose before you design

The bottleneck may not be in the model — it may be in how evidence is presented to the model. M²DDI improved by redesigning internal fusion; MolGlass improved by redesigning the input interface.

2

The evidence interface is a first-class research problem

Data preprocessing is often treated as engineering rather than research. MolGlass shows that the design of the evidence interface can have impact equal to or greater than architectural innovation.

3

Supervision must co-evolve with the intervention

When you change the model (M²DDI), end-task supervision with domain priors may suffice. When you change the evidence interface (MolGlass), supervision must be redesigned to teach the model how to perceive and reason through the new interface.

Part 5 · Rethinking

Limitations

  • M²DDI depends on curated modality pipelines — the quality of external encoders and the knowledge graph directly affects performance.
  • MolGlass uses a pragmatic 2.5D visual representation. While this captures stereochemical cues effectively, it does not replace full 3D conformational reasoning.
  • The scope of this thesis focuses on two representative settings. Broader generalization requires validation across more tasks and scientific domains.
Conclusion

Conclusion & Future Directions

Multimodal molecular learning is a problem of structured evidence use.
The right intervention depends on where the bottleneck is.
M²DDI — Model-Centric
Most effective when the bottleneck is selective fusion, with clearest gains in inductive settings.
MolGlass — Data-Centric
Most effective when the bottleneck is the evidence interface, matching specialized model components.
Future Directions
  • Hybrid model-centric + data-centric systems
  • 3D geometry and conformational reasoning
  • Experimental context integration
  • Interactive scientific dialogue

Acknowledgements

Supervisor

Prof. Yongqi Zhang — for his patient guidance, constant encouragement, generous support, and the intellectual freedom he provided throughout my MPhil study.

Collaborators & Colleagues

Lab mates and collaborators — for the many helpful technical discussions that contributed to this thesis.

Family

For their understanding, patience, and unwavering support.

Thank You