Rethinking Multimodal Molecular Learning

From Model-Centric to Data-Centric
Runqing Xu
MPhil Thesis Defense
HKUST(GZ) · Data Science and Analytics · May 9, 2026
Supervisor: Prof. Yongqi Zhang  |  Co-supervisor: Prof. Li Liu
Part 1 · Background

What is Molecular Learning?

Molecular learning sits at the intersection of chemistry, biology, and machine learning.

The goal is to computationally understand molecular structure, predict molecular properties, and reason about molecular behavior.

Core Question

How can we build computational systems that understand molecules as well as expert chemists do?

Molecular Learning Concept
Part 1 · Background

Real-World Applications

Drug Discovery

Screening drug candidates, optimizing leads, and identifying potential interactions.

Toxicology & Safety

Predicting adverse effects before clinical trials — including drug-drug interactions.

Materials Science

Designing novel polymers, catalysts, and functional materials with targeted properties.

Precision Medicine

Tailoring drug combinations to individual patients based on molecular and genomic profiles.

A common prerequisite: the ability to understand and reason about molecules from heterogeneous evidence.

Part 1 · Background

From Unimodal to Multimodal Molecular Learning

The same molecule can be represented in multiple complementary ways. No single view captures everything.

These views are not redundant — they expose complementary evidence about identity, reactivity, function, and context.

SMILES String
1D notation
2D Graph
Topology & bonds
3D Geometry
Spatial arrangement
Text Description
Curated knowledge
Knowledge Graph
Relational context
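
Each of the three structural views is easy to materialize with standard tooling. A minimal RDKit sketch (metformin as an illustrative molecule; this is not code from the thesis) showing how one SMILES string yields the 1D, 2D, and 3D views:

```python
from rdkit import Chem
from rdkit.Chem import AllChem

smiles = 'CN(C)C(=N)NC(=N)N'    # metformin: the 1D SMILES view
mol = Chem.MolFromSmiles(smiles)

# 2D graph view: atoms as nodes, bonds as edges (topology & bonds)
edges = [(b.GetBeginAtomIdx(), b.GetEndAtomIdx()) for b in mol.GetBonds()]

# 3D geometry view: embed a conformer and read atomic coordinates
mol3d = Chem.AddHs(mol)
AllChem.EmbedMolecule(mol3d, randomSeed=7)      # deterministic seed
coords = mol3d.GetConformer().GetPositions()    # (num_atoms, 3) array
```

The text and KG views have no such one-liner: they come from curated sources, which is exactly why their coverage lags for newly emerging drugs.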
Part 1 · Background

DDI Prediction — A Key Multimodal Challenge

Drug A  +  Drug B  →  Which of 86 event types?
DDI multimodal example: Metformin and Cimetidine

Modality Heterogeneity

Structure, text, and KG differ in semantics and representational form. Each reflects a different pharmacological mechanism.

Dynamic Relevance

Some drug pairs are PK-driven (pharmacokinetics: structure matters), others are PD-driven (pharmacodynamics: function matters). Relevance varies per instance.

Data Incompleteness

For newly emerging drugs, structure is available, but text descriptions and KG coverage may be sparse or missing entirely.

These challenges motivate a multimodal approach with adaptive, mechanism-aware evidence fusion.
Part 2 · M²DDI

M²DDI — Architecture Overview

M²DDI Architecture
1. Heterogeneous Expert Pool
4 structural + 4 functional + 1 relational expert, each aligned with a distinct modality
2. Prior-Enhanced Dual-Path Gating
Feature-query path + ATC pharmacological prior → instance-specific routing
3. Adaptive Expert Selection
Activate the most relevant experts per drug pair → weighted aggregation
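
A minimal PyTorch sketch of the prior-enhanced, sparsely activated gating idea. The class name, the top-k size, and injecting the prior by simple addition are illustrative assumptions, not the actual M²DDI implementation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DualPathGate(nn.Module):
    """Sketch: feature-query logits shifted by a pharmacological prior
    (e.g., ATC-derived), then sparsified to the top-k experts."""
    def __init__(self, dim, num_experts=9, top_k=4):  # 9 experts as on the slide
        super().__init__()
        self.query = nn.Linear(dim, num_experts)      # feature-query path
        self.top_k = top_k                            # assumed selection size

    def forward(self, pair_feat, prior):
        # pair_feat: (B, dim) drug-pair features; prior: (B, num_experts)
        logits = self.query(pair_feat) + prior        # prior-enhanced logits
        val, idx = logits.topk(self.top_k, dim=-1)    # adaptive expert selection
        sparse = torch.full_like(logits, float('-inf')).scatter(-1, idx, val)
        return F.softmax(sparse, dim=-1)              # zero weight off the top-k

# Weighted aggregation over expert outputs of shape (B, num_experts, dim):
#   fused = (gate(pair_feat, prior).unsqueeze(-1) * expert_out).sum(dim=1)
```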
Part 2 · M²DDI

Evaluation Settings

S0/S1/S2 evaluation settings (S0: both drugs seen in training · S1: one drug unseen · S2: both drugs unseen)

DrugBank

86 interaction types · Multi-class classification · Macro F1

TWOSIDES

200 side-effect labels · Multi-label prediction · PR-AUC
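
The two benchmarks use different prediction regimes and metrics. A small sklearn sketch, assuming macro averaging in both cases (the exact averaging convention may differ from the thesis):

```python
import numpy as np
from sklearn.metrics import f1_score, average_precision_score

# DrugBank: multi-class over 86 event types, scored with macro F1
y_true = np.array([3, 85, 7, 3])     # gold event-type indices
y_pred = np.array([3, 85, 2, 3])     # predicted indices
macro_f1 = f1_score(y_true, y_pred, average='macro')

# TWOSIDES: multi-label prediction, scored with PR-AUC (average precision);
# toy 3-label example in place of the real 200 side-effect labels
labels = np.array([[1, 0, 1],
                   [0, 1, 1]])       # gold binary label matrix
scores = np.array([[0.9, 0.2, 0.8],
                   [0.1, 0.7, 0.6]]) # predicted probabilities
pr_auc = average_precision_score(labels, scores, average='macro')
```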

Part 2 · M²DDI

Main Results

Category     Method          DrugBank (Macro F1)       TWOSIDES (PR-AUC)
                             S0      S1      S2        S0      S1      S2
Structure    MLP             60.75   21.52   20.15     81.71   81.70   64.84
             DeepDDI         73.55   34.36   13.73     89.69   83.06   68.51
Function     TextDDI         92.00   59.51   26.79     92.67   84.29   83.18
Relation     Decagon         56.91   24.66    6.12     90.78   79.48   57.61
             EmerGNN         94.10   62.28   27.84     96.17   89.21   81.43
Multimodal   TIGER           93.53   57.52   19.78     95.72   86.70   69.95
             MolecBioNet     93.92   63.35   29.67     96.06   85.17   77.29
Ours         Static Fusion   94.44   60.96   31.24     92.86   88.54   79.61
             Ensemble        94.07   61.31   32.70     93.52   90.23   85.58
             M²DDI           96.26   68.28   40.52     98.18   92.31   87.64
The advantage of dynamic routing grows sharply in the inductive settings (S1, S2), where evidence is most asymmetric.
Part 2 · M²DDI

Robustness — The Routing Shift

F1 under progressive masking
(a) F1 under progressive modality masking
Expert weight evolution
(b) Expert weight evolution under masking
The gate learns to treat missingness as a routing problem — it redirects to the evidence that remains trustworthy, rather than forcing the classifier to absorb noisy or empty signals.
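
To make "missingness as a routing problem" concrete: when a modality is absent, its experts can be dropped from the softmax so that weight renormalizes over the evidence that remains. The hard mask below is only an illustration; in M²DDI the redirection shown in panel (b) is learned by the gate rather than imposed as a rule:

```python
import torch
import torch.nn.functional as F

def route_around_missing(gate_logits, expert_available):
    """gate_logits:      (B, num_experts) scores from the gate
       expert_available: (B, num_experts) 1 if the expert's input
                         modality is observed, 0 if masked out"""
    masked = gate_logits.masked_fill(expert_available == 0, float('-inf'))
    return F.softmax(masked, dim=-1)  # weight shifts to remaining experts
```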
Part 2 · M²DDI

Mechanistic Interpretability

PK/PD case study
(a) PK/PD-dominant routing alignment
Routing diversity
(b) Sample-wise routing diversity
Routing is both mechanism-aligned and instance-specific. Interpretability is strengthened when model structure reflects domain structure.
Part 3 · Transition

M²DDI Works — But What Does It Cost?

The Pattern

Each new modality or task requires new specialized encoders and new fusion components.

Progress comes at the cost of growing architectural complexity — a recurring theme across molecular AI.

The Question

Can we find an approach where the model stays largely fixed, and improvement comes from redesigning the evidence itself?

Encoder escalation
Part 3 · Transition

The Encoder Escalation in Molecular LLMs

Molecular understanding: asking LLMs to reason about molecules — the same model-centric pattern emerges

3D-MoLM
3D-MoLM Architecture

Dedicated 3D encoder + Q-Former projector to bridge molecular structure to an LLM

Mol-LLaMA
Mol-LLaMA Architecture

Separate 2D + 3D encoders, interaction module, then Q-Former — even more complex

Each new modality requires new model components — is there a fundamentally different approach?
Part 3 · Transition

Vision as a Unified Molecular Interface

Chemists Reason Visually

Chemists use visual inspection of 2D molecular depictions as their primary reasoning tool. They read scaffolds, functional groups, and stereochemistry from drawings.

VLMs Excel at Visual Patterns

Modern vision-language models have powerful visual pattern recognition. They excel at connecting visual evidence to language.

→ A Data-Centric Approach

Instead of building yet another encoder, use molecular images as input to an existing VLM. Focus effort on making the visual evidence chemically informative.

Vision as universal interface

The model stays largely frozen —
the improvement comes from what the model sees.

Part 4 · MolGlass

MolGlass — Chemical-Aware Visual Augmentation

MolGlass augmentation pipeline
Functional Groups
Colored highlights via deterministic matching
Scaffold Core
Gray shading + attachment-point dots
BRICS Cleavage Sites
Blue markers at natural edit points
2.5D Stereo Cues
Wedge/dash bonds for chirality
All augmentations are deterministic — no LLM in the loop. Per-molecule color assignment prevents dataset-level shortcuts.
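
All four augmentations map onto standard cheminformatics operations. A hedged RDKit sketch of the building blocks, using aspirin and a single carboxylic-acid SMARTS as stand-ins (MolGlass's actual pattern library, per-molecule color assignment, and renderer settings may differ):

```python
from rdkit import Chem
from rdkit.Chem import BRICS
from rdkit.Chem.Scaffolds import MurckoScaffold
from rdkit.Chem.Draw import rdMolDraw2D

mol = Chem.MolFromSmiles('CC(=O)Oc1ccccc1C(=O)O')  # aspirin as a toy input

# Functional groups: deterministic SMARTS matching (carboxylic acid here)
fg = Chem.MolFromSmarts('C(=O)[OX2H1]')
fg_atoms = sorted({i for m in mol.GetSubstructMatches(fg) for i in m})

# Scaffold core: Bemis-Murcko scaffold atoms, to be shaded gray
core = mol.GetSubstructMatch(MurckoScaffold.GetScaffoldForMol(mol))

# BRICS cleavage sites: bond indices at natural edit points
brics = [mol.GetBondBetweenAtoms(a, b).GetIdx()
         for (a, b), _ in BRICS.FindBRICSBonds(mol)]

colors = {i: (1.0, 0.75, 0.75) for i in fg_atoms}             # group color
colors.update({i: (0.85, 0.85, 0.85) for i in core if i not in colors})

drawer = rdMolDraw2D.MolDraw2DCairo(512, 512)
rdMolDraw2D.PrepareAndDrawMolecule(
    drawer, mol,
    highlightAtoms=list(colors), highlightAtomColors=colors,
    highlightBonds=brics)          # cleavage-site markers
drawer.FinishDrawing()
open('augmented.png', 'wb').write(drawer.GetDrawingText())
```

Wedge/dash stereo cues come for free here: PrepareAndDrawMolecule wedges bonds at chiral centers whenever the input carries stereochemistry.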
Part 4 · MolGlass

Three-Stage Curriculum

MolGlass framework
Stage I: Perceive
100K automatic tasks — match groups, count sites, identify scaffold
Train projector only
Stage II: Reason
14K constrained reasoning examples from GPT-5 — organize evidence into auditable traces
+ Unfreeze language tower
Stage III: Converse
9K GPT-5 multi-turn dialogues — grounded in visible evidence
Assistant-style consultation
Base model: Qwen3-VL-8B + LoRA
Vision encoder frozen throughout all stages
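
A sketch of that trainability pattern with Hugging Face transformers + peft. The model id, module names, and LoRA hyperparameters are illustrative assumptions; only the overall pattern (vision tower frozen, projector trained first, LoRA on the language side) follows the slide:

```python
import torch
from transformers import AutoModelForVision2Seq
from peft import LoraConfig, get_peft_model

# Base model id is an assumption; the slide names Qwen3-VL-8B
model = AutoModelForVision2Seq.from_pretrained(
    'Qwen/Qwen3-VL-8B-Instruct', torch_dtype=torch.bfloat16)

# Vision encoder frozen throughout all three stages
for name, p in model.named_parameters():
    if 'visual' in name:            # vision-tower module name is an assumption
        p.requires_grad = False

# Stage I trains only the projector; Stages II-III add LoRA adapters on the
# language tower (target names are assumptions; in practice, scope them so
# the frozen vision tower is excluded)
lora = LoraConfig(r=16, lora_alpha=32,
                  target_modules=['q_proj', 'k_proj', 'v_proj', 'o_proj'])
model = get_peft_model(model, lora)
```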
Part 4 · MolGlass

Open-Ended Molecular Understanding

Does Data-Centric Work?

Is the bottleneck evidence exposure, not model capacity?

vs. Same Base Model

Roughly 2-3× higher quality scores than the vanilla base model — the bottleneck was evidence, not capacity

Even GPT-5 Benefits

Augmented images boost GPT-5 — the evidence helps any model

vs. Model-Centric Methods

Beats 3D-MoLM & Mol-LLaMA — frozen encoder + better input

Open-ended explanation quality on unseen PubChem molecules (judged by GPT-5)

Model                     Structural   Chemical   Biological
Proprietary
GPT-5                     0.572        0.586      0.641
GPT-5 + MolGlass          0.883        1.025      1.305
Open-source molecular assistants
Qwen3-VL-8B (vanilla)     0.577        0.568      0.635
3D-MoLM                   0.572        0.728      0.919
Mol-LLaMA                 0.883        1.025      1.305
MolGlass                  1.263        1.479      1.878
Part 4 · MolGlass

Zero-Shot Transfer & Molecular Comprehension

Does It Generalize?

Can the visual grounding transfer to unseen tasks with no task-specific fine-tuning?

PAMPA

Membrane permeability prediction

BBBP

Blood-brain barrier penetration

MoleculeQA

Factual molecular comprehension

Zero-shot performance (accuracy %) — no task-specific fine-tuning. Def. = default prompting; CoT = chain-of-thought prompting.

Model               PAMPA            BBBP             MoleculeQA
                    Def.    CoT      Def.    CoT      Def.    CoT
Proprietary
GPT-5               58.62   61.92    58.82   62.75    54.41   55.82
GPT-5 + MolGlass    64.34   69.65    65.20   69.61    56.68   60.36
Open-source
3D-MoLM             33.16   55.45    48.04   48.53    52.32   54.62
Mol-LLaMA           48.40   64.53    49.50   57.79    57.45   60.09
MolGlass            66.03   72.92    65.43   67.72    61.94   63.49
Part 4 · MolGlass

Visually Grounded Explanations

Why This Matters

Performance gains alone do not tell us whether the model truly uses the augmented visual evidence.

Attention Follows the Query

Attention shifts to match each query type — global, functional group, or modification — consistent with evidence-grounded reasoning.

Evidence-Level Auditability

A single case study, but it illustrates a key advantage: both model and user can inspect the same visual evidence.

Attention heatmaps
Part 5 · Synthesis

Two Bottlenecks, Two Approaches

M²DDI (Model-Centric) vs. MolGlass (Data-Centric)
Paper
  M²DDI: A Unified Framework for Dynamic Multimodal Fusion in DDI Event Prediction
  MolGlass: Don’t Just Encode But See: A Data-Centric Paradigm for Visual Molecular Understanding in LLMs
Authors
  M²DDI: Runqing Xu, Siyi Liu, Haoyang Li, Hao Li, Yongqi Zhang
  MolGlass: Runqing Xu*, Xiaotang Wang*, Chunfeng Gao, Siyi Liu, Hange Zhou, Yongqi Zhang
Bottleneck
  M²DDI: How to fuse heterogeneous molecular evidence
  MolGlass: How molecular information is presented to the model
What changes
  M²DDI: Expert pool, gating mechanism, fusion strategy
  MolGlass: Input modality, chemical annotations, training stages
What stays fixed
  M²DDI: Input data (molecular features unchanged)
  MolGlass: Base VLM (Qwen3-VL-8B, vision encoder frozen)
Interpretability
  M²DDI: Mechanism-level (routing aligns with PK/PD)
  MolGlass: Evidence-level (attention aligns with visible cues)
Not “model vs. data” as a binary choice — diagnose where the bottleneck is, then design accordingly.

Both under review at ACM SIGKDD 2026 AI4Science Track

Part 5 · Synthesis

What We Learn

1

The evidence interface is a first-class design problem

How you present evidence to the model is a research decision with impact comparable to architectural choices. MolGlass outperforms dedicated molecular encoders by changing only what the model sees.

2

Supervision must co-evolve with evidence design

M²DDI changed the model — standard supervision sufficed. MolGlass changed the input — supervision had to be completely redesigned (Perceive → Reason → Converse). Evidence design and supervision design are coupled.

3

Where you place intelligence determines what transparency you get

M²DDI: intelligence in the model (expert routing) → mechanism-level interpretability. MolGlass: intelligence in the data (visual cues) → evidence-level auditability. Interpretability is a design choice, not a post-hoc analysis.

Part 5 · Synthesis

Limitations

  • M²DDI depends on curated modality pipelines — the quality of external encoders and the knowledge graph directly affects performance.
  • MolGlass uses a pragmatic 2.5D visual representation. While this captures stereochemical cues effectively, it does not replace full 3D conformational reasoning.
  • This thesis focuses on two representative settings; broader generalization requires validation across more tasks and scientific domains.
Conclusion

Conclusion

Rethinking Multimodal Molecular Learning: From Model-Centric to Data-Centric

The bottleneck is not always in the model —
and when it is in the evidence,
redesigning what the model sees
can be more effective than redesigning the model itself.

Conclusion

Future Directions

3D Perception

Current 2.5D encoding captures stereochemical cues but cannot replace full 3D conformational reasoning

→ Integrate explicit 3D molecular views or rotation-based video sequences into the visual interface

Dynamic Resolution for Larger Molecules

Testing focuses on drug-sized small molecules; proteins, polymers, and nucleic acids need different scales

→ Adaptive resolution and multi-scale visual strategies

 

Thank You
Questions & Discussion
Supervisor: Prof. Yongqi Zhang  ·  Co-supervisor: Prof. Li Liu