Rethinking Multimodal Molecular Learning

From Model-Centric to Data-Centric
Runqing Xu
MPhil Thesis Defense
HKUST(GZ) · Data Science and Analytics · May 9, 2026
Supervisor: Prof. Yongqi Zhang  |  Co-supervisor: Prof. Li Liu
Part 1 · Background

What is Molecular Learning?

Molecular learning sits at the intersection of chemistry, biology, and machine learning.

The goal is to computationally understand molecular structure, predict molecular properties, and reason about molecular behavior.

Core Question

How can we build computational systems that understand molecules as well as expert chemists do?

Molecular Learning Concept
Part 1 · Background

Real-World Applications

Drug Discovery

Screening drug candidates, optimizing leads, and identifying potential interactions.

Toxicology & Safety

Predicting adverse effects before clinical trials — including drug-drug interactions.

Materials Science

Designing novel polymers, catalysts, and functional materials with targeted properties.

Precision Medicine

Tailoring drug combinations to individual patients based on molecular and genomic profiles.

A common prerequisite: the ability to understand and reason about molecules from heterogeneous evidence.

Part 1 · Background

From Unimodal to Multimodal Molecular Learning

The same molecule can be represented in multiple complementary ways. No single view captures everything.

These views are not redundant — they expose complementary evidence about identity, reactivity, function, and context.

SMILES String
1D notation
2D Graph
Topology & bonds
3D Geometry
Spatial arrangement
Text Description
Curated knowledge
Knowledge Graph
Relational context
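
Each of the three structural views is easy to materialize with standard tooling. A minimal RDKit sketch (metformin as an illustrative molecule; this is not code from the thesis) showing how one SMILES string yields the 1D, 2D, and 3D views:

```python
from rdkit import Chem
from rdkit.Chem import AllChem

smiles = 'CN(C)C(=N)NC(=N)N'    # metformin: the 1D SMILES view
mol = Chem.MolFromSmiles(smiles)

# 2D graph view: atoms as nodes, bonds as edges (topology & bonds)
edges = [(b.GetBeginAtomIdx(), b.GetEndAtomIdx()) for b in mol.GetBonds()]

# 3D geometry view: embed a conformer and read atomic coordinates
mol3d = Chem.AddHs(mol)
AllChem.EmbedMolecule(mol3d, randomSeed=7)      # deterministic seed
coords = mol3d.GetConformer().GetPositions()    # (num_atoms, 3) array
```

The text and KG views have no such one-liner: they come from curated sources, which is exactly why their coverage lags for newly emerging drugs.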
Part 1 · Background

DDI Prediction — A Key Multimodal Challenge

Drug A  +  Drug B  →  Which of 86 event types?
DDI multimodal example: Metformin and Cimetidine

Modality Heterogeneity

Structure, text, and KG differ in semantics and representational form. Each reflects a different pharmacological mechanism.

Dynamic Relevance

Some drug pairs are PK-driven (pharmacokinetics: structure matters), others are PD-driven (pharmacodynamics: function matters). Relevance varies per instance.

Data Incompleteness

For newly emerging drugs, structure is available, but text descriptions and KG coverage may be sparse or missing entirely.

These challenges motivate a multimodal approach with adaptive, mechanism-aware evidence fusion.
Part 2 · M²DDI

M²DDI — Architecture Overview

M²DDI Architecture
1. Heterogeneous Expert Pool
4 structural + 4 functional + 1 relational expert, each aligned with a distinct modality
2. Prior-Enhanced Dual-Path Gating
Feature-query path + ATC pharmacological prior → instance-specific routing
3. Adaptive Expert Selection
Activate the most relevant experts per drug pair → weighted aggregation
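
A minimal PyTorch sketch of the prior-enhanced, sparsely activated gating idea. The class name, the top-k size, and injecting the prior by simple addition are illustrative assumptions, not the actual M²DDI implementation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DualPathGate(nn.Module):
    """Sketch: feature-query logits shifted by a pharmacological prior
    (e.g., ATC-derived), then sparsified to the top-k experts."""
    def __init__(self, dim, num_experts=9, top_k=4):  # 9 experts as on the slide
        super().__init__()
        self.query = nn.Linear(dim, num_experts)      # feature-query path
        self.top_k = top_k                            # assumed selection size

    def forward(self, pair_feat, prior):
        # pair_feat: (B, dim) drug-pair features; prior: (B, num_experts)
        logits = self.query(pair_feat) + prior        # prior-enhanced logits
        val, idx = logits.topk(self.top_k, dim=-1)    # adaptive expert selection
        sparse = torch.full_like(logits, float('-inf')).scatter(-1, idx, val)
        return F.softmax(sparse, dim=-1)              # zero weight off the top-k

# Weighted aggregation over expert outputs of shape (B, num_experts, dim):
#   fused = (gate(pair_feat, prior).unsqueeze(-1) * expert_out).sum(dim=1)
```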
Part 2 · M²DDI

Evaluation Settings

S0/S1/S2 evaluation settings (S0: both drugs seen in training · S1: one drug unseen · S2: both drugs unseen)

DrugBank

86 interaction types · Multi-class classification · Macro F1

TWOSIDES

200 side-effect labels · Multi-label prediction · PR-AUC
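
The two benchmarks use different prediction regimes and metrics. A small sklearn sketch, assuming macro averaging in both cases (the exact averaging convention may differ from the thesis):

```python
import numpy as np
from sklearn.metrics import f1_score, average_precision_score

# DrugBank: multi-class over 86 event types, scored with macro F1
y_true = np.array([3, 85, 7, 3])     # gold event-type indices
y_pred = np.array([3, 85, 2, 3])     # predicted indices
macro_f1 = f1_score(y_true, y_pred, average='macro')

# TWOSIDES: multi-label prediction, scored with PR-AUC (average precision);
# toy 3-label example in place of the real 200 side-effect labels
labels = np.array([[1, 0, 1],
                   [0, 1, 1]])       # gold binary label matrix
scores = np.array([[0.9, 0.2, 0.8],
                   [0.1, 0.7, 0.6]]) # predicted probabilities
pr_auc = average_precision_score(labels, scores, average='macro')
```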

Part 2 · M²DDI

Main Results

Category     Method          DrugBank (Macro F1)       TWOSIDES (PR-AUC)
                             S0      S1      S2        S0      S1      S2
Structure    MLP             60.75   21.52   20.15     81.71   81.70   64.84
             DeepDDI         73.55   34.36   13.73     89.69   83.06   68.51
Function     TextDDI         92.00   59.51   26.79     92.67   84.29   83.18
Relation     Decagon         56.91   24.66    6.12     90.78   79.48   57.61
             EmerGNN         94.10   62.28   27.84     96.17   89.21   81.43
Multimodal   TIGER           93.53   57.52   19.78     95.72   86.70   69.95
             MolecBioNet     93.92   63.35   29.67     96.06   85.17   77.29
Ours         Static Fusion   94.44   60.96   31.24     92.86   88.54   79.61
             Ensemble        94.07   61.31   32.70     93.52   90.23   85.58
             M²DDI           96.26   68.28   40.52     98.18   92.31   87.64
The advantage of dynamic routing grows sharply in the inductive settings (S1, S2), where evidence is most asymmetric.
Part 2 · M²DDI

Robustness — The Routing Shift

F1 under progressive masking
(a) F1 under progressive modality masking
Expert weight evolution
(b) Expert weight evolution under masking
The gate learns to treat missingness as a routing problem — it redirects to the evidence that remains trustworthy, rather than forcing the classifier to absorb noisy or empty signals.
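
To make "missingness as a routing problem" concrete: when a modality is absent, its experts can be dropped from the softmax so that weight renormalizes over the evidence that remains. The hard mask below is only an illustration; in M²DDI the redirection shown in panel (b) is learned by the gate rather than imposed as a rule:

```python
import torch
import torch.nn.functional as F

def route_around_missing(gate_logits, expert_available):
    """gate_logits:      (B, num_experts) scores from the gate
       expert_available: (B, num_experts) 1 if the expert's input
                         modality is observed, 0 if masked out"""
    masked = gate_logits.masked_fill(expert_available == 0, float('-inf'))
    return F.softmax(masked, dim=-1)  # weight shifts to remaining experts
```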
Part 2 · M²DDI

Mechanistic Interpretability

PK/PD case study
(a) PK/PD-dominant routing alignment
Routing diversity
(b) Sample-wise routing diversity
Routing is both mechanism-aligned and instance-specific. Interpretability is strengthened when model structure reflects domain structure.
Part 3 · Transition

M²DDI Works — But What Does It Cost?

The Pattern

Each new modality or task requires new specialized encoders and new fusion components.

Progress comes at the cost of growing architectural complexity — a recurring theme across molecular AI.

The Question

Can we find an approach where the model stays largely fixed, and improvement comes from redesigning the evidence itself?

Encoder escalation
Part 3 · Transition

The Encoder Escalation in Molecular LLMs

Molecular understanding: asking LLMs to reason about molecules — the same model-centric pattern emerges

3D-MoLM
3D-MoLM Architecture

Dedicated 3D encoder + Q-Former projector to bridge molecular structure to an LLM

Mol-LLaMA
Mol-LLaMA Architecture

Separate 2D + 3D encoders, interaction module, then Q-Former — even more complex

Each new modality requires new model components — is there a fundamentally different approach?
Part 3 · Transition

Vision as a Unified Molecular Interface

Chemists Reason Visually

Chemists use visual inspection of 2D molecular depictions as their primary reasoning tool. They read scaffolds, functional groups, and stereochemistry from drawings.

VLMs Excel at Visual Patterns

Modern vision-language models have powerful visual pattern recognition. They excel at connecting visual evidence to language.

→ A Data-Centric Approach

Instead of building yet another encoder, use molecular images as input to an existing VLM. Focus effort on making the visual evidence chemically informative.

Vision as universal interface

The model stays largely frozen —
the improvement comes from what the model sees.

Part 4 · MolGlass

MolGlass — Chemical-Aware Visual Augmentation

MolGlass augmentation pipeline
Functional Groups
Colored highlights via deterministic matching
Scaffold Core
Gray shading + attachment-point dots
BRICS Cleavage Sites
Blue markers at natural edit points
2.5D Stereo Cues
Wedge/dash bonds for chirality
All augmentations are deterministic — no LLM in the loop. Per-molecule color assignment prevents dataset-level shortcuts.
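
All four augmentations map onto standard cheminformatics operations. A hedged RDKit sketch of the building blocks, using aspirin and a single carboxylic-acid SMARTS as stand-ins (MolGlass's actual pattern library, per-molecule color assignment, and renderer settings may differ):

```python
from rdkit import Chem
from rdkit.Chem import BRICS
from rdkit.Chem.Scaffolds import MurckoScaffold
from rdkit.Chem.Draw import rdMolDraw2D

mol = Chem.MolFromSmiles('CC(=O)Oc1ccccc1C(=O)O')  # aspirin as a toy input

# Functional groups: deterministic SMARTS matching (carboxylic acid here)
fg = Chem.MolFromSmarts('C(=O)[OX2H1]')
fg_atoms = sorted({i for m in mol.GetSubstructMatches(fg) for i in m})

# Scaffold core: Bemis-Murcko scaffold atoms, to be shaded gray
core = mol.GetSubstructMatch(MurckoScaffold.GetScaffoldForMol(mol))

# BRICS cleavage sites: bond indices at natural edit points
brics = [mol.GetBondBetweenAtoms(a, b).GetIdx()
         for (a, b), _ in BRICS.FindBRICSBonds(mol)]

colors = {i: (1.0, 0.75, 0.75) for i in fg_atoms}             # group color
colors.update({i: (0.85, 0.85, 0.85) for i in core if i not in colors})

drawer = rdMolDraw2D.MolDraw2DCairo(512, 512)
rdMolDraw2D.PrepareAndDrawMolecule(
    drawer, mol,
    highlightAtoms=list(colors), highlightAtomColors=colors,
    highlightBonds=brics)          # cleavage-site markers
drawer.FinishDrawing()
open('augmented.png', 'wb').write(drawer.GetDrawingText())
```

Wedge/dash stereo cues come for free here: PrepareAndDrawMolecule wedges bonds at chiral centers whenever the input carries stereochemistry.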
Part 4 · MolGlass

Three-Stage Curriculum

MolGlass framework
Stage I: Perceive
100K automatic tasks — match groups, count sites, identify scaffold
Train projector only
Stage II: Reason
14K constrained reasoning examples from GPT-5 — organize evidence into auditable traces
+ Unfreeze language tower
Stage III: Converse
9K GPT-5 multi-turn dialogues — grounded in visible evidence
Assistant-style consultation
Base model: Qwen3-VL-8B + LoRA
Vision encoder frozen throughout all stages
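
A sketch of that trainability pattern with Hugging Face transformers + peft. The model id, module names, and LoRA hyperparameters are illustrative assumptions; only the overall pattern (vision tower frozen, projector trained first, LoRA on the language side) follows the slide:

```python
import torch
from transformers import AutoModelForVision2Seq
from peft import LoraConfig, get_peft_model

# Base model id is an assumption; the slide names Qwen3-VL-8B
model = AutoModelForVision2Seq.from_pretrained(
    'Qwen/Qwen3-VL-8B-Instruct', torch_dtype=torch.bfloat16)

# Vision encoder frozen throughout all three stages
for name, p in model.named_parameters():
    if 'visual' in name:            # vision-tower module name is an assumption
        p.requires_grad = False

# Stage I trains only the projector; Stages II-III add LoRA adapters on the
# language tower (target names are assumptions; in practice, scope them so
# the frozen vision tower is excluded)
lora = LoraConfig(r=16, lora_alpha=32,
                  target_modules=['q_proj', 'k_proj', 'v_proj', 'o_proj'])
model = get_peft_model(model, lora)
```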
Part 4 · MolGlass

Open-Ended Molecular Understanding

Does Data-Centric Work?

Is the bottleneck evidence exposure, not model capacity?

vs. Same Base Model

Roughly 2-3× higher quality scores than the vanilla base model — the bottleneck was evidence, not capacity

Even GPT-5 Benefits

Augmented images boost GPT-5 — the evidence helps any model

vs. Model-Centric Methods

Beats 3D-MoLM & Mol-LLaMA — frozen encoder + better input

Open-ended explanation quality on unseen PubChem molecules (judged by GPT-5)

Model                     Structural   Chemical   Biological
Proprietary
GPT-5                     0.572        0.586      0.641
GPT-5 + MolGlass          0.883        1.025      1.305
Open-source molecular assistants
Qwen3-VL-8B (vanilla)     0.577        0.568      0.635
3D-MoLM                   0.572        0.728      0.919
Mol-LLaMA                 0.883        1.025      1.305
MolGlass                  1.263        1.479      1.878
Part 4 · MolGlass

Zero-Shot Transfer & Molecular Comprehension

Does It Generalize?

Can the visual grounding transfer to unseen tasks with no task-specific fine-tuning?

PAMPA

Membrane permeability prediction

BBBP

Blood-brain barrier penetration

MoleculeQA

Factual molecular comprehension

Zero-shot performance (accuracy %) — no task-specific fine-tuning. Def. = default prompting; CoT = chain-of-thought prompting.

Model               PAMPA            BBBP             MoleculeQA
                    Def.    CoT      Def.    CoT      Def.    CoT
Proprietary
GPT-5               58.62   61.92    58.82   62.75    54.41   55.82
GPT-5 + MolGlass    64.34   69.65    65.20   69.61    56.68   60.36
Open-source
3D-MoLM             33.16   55.45    48.04   48.53    52.32   54.62
Mol-LLaMA           48.40   64.53    49.50   57.79    57.45   60.09
MolGlass            66.03   72.92    65.43   67.72    61.94   63.49
Part 4 · MolGlass

Visually Grounded Explanations

Why This Matters

Performance gains alone do not tell us whether the model truly uses the augmented visual evidence.

Attention Follows the Query

Attention shifts to match each query type — global, functional group, or modification — consistent with evidence-grounded reasoning.

Evidence-Level Auditability

A single case study, but it illustrates a key advantage: both model and user can inspect the same visual evidence.

Attention heatmaps
Part 5 · Synthesis

Two Bottlenecks, Two Approaches

M²DDI (Model-Centric) vs. MolGlass (Data-Centric)
Paper
  M²DDI: A Unified Framework for Dynamic Multimodal Fusion in DDI Event Prediction
  MolGlass: Don’t Just Encode But See: A Data-Centric Paradigm for Visual Molecular Understanding in LLMs
Authors
  M²DDI: Runqing Xu, Siyi Liu, Haoyang Li, Hao Li, Yongqi Zhang
  MolGlass: Runqing Xu*, Xiaotang Wang*, Chunfeng Gao, Siyi Liu, Hange Zhou, Yongqi Zhang
Bottleneck
  M²DDI: How to fuse heterogeneous molecular evidence
  MolGlass: How molecular information is presented to the model
What changes
  M²DDI: Expert pool, gating mechanism, fusion strategy
  MolGlass: Input modality, chemical annotations, training stages
What stays fixed
  M²DDI: Input data (molecular features unchanged)
  MolGlass: Base VLM (Qwen3-VL-8B, vision encoder frozen)
Interpretability
  M²DDI: Mechanism-level (routing aligns with PK/PD)
  MolGlass: Evidence-level (attention aligns with visible cues)
Not “model vs. data” as a binary choice — diagnose where the bottleneck is, then design accordingly.

Both under review at ACM SIGKDD 2026 AI4Science Track

Part 5 · Synthesis

What We Learn

1

The evidence interface is a first-class design problem

How you present evidence to the model is a research decision with impact comparable to architectural choices. MolGlass outperforms dedicated molecular encoders by changing only what the model sees.

2

Supervision must co-evolve with evidence design

M²DDI changed the model — standard supervision sufficed. MolGlass changed the input — supervision had to be completely redesigned (Perceive → Reason → Converse). Evidence design and supervision design are coupled.

3

Where you place intelligence determines what transparency you get

M²DDI: intelligence in the model (expert routing) → mechanism-level interpretability. MolGlass: intelligence in the data (visual cues) → evidence-level auditability. Interpretability is a design choice, not a post-hoc analysis.

Part 5 · Synthesis

Limitations

  • M²DDI depends on curated modality pipelines — the quality of external encoders and the knowledge graph directly affects performance.
  • MolGlass uses a pragmatic 2.5D visual representation. While this captures stereochemical cues effectively, it does not replace full 3D conformational reasoning.
  • This thesis focuses on two representative settings; broader generalization requires validation across more tasks and scientific domains.
Conclusion

Conclusion

Rethinking Multimodal Molecular Learning: From Model-Centric to Data-Centric

The bottleneck is not always in the model —
and when it is in the evidence,
redesigning what the model sees
can be more effective than redesigning the model itself.

Conclusion

Future Directions

3D Perception

Current 2.5D encoding captures stereochemical cues but cannot replace full 3D conformational reasoning

→ Integrate explicit 3D molecular views or rotation-based video sequences into the visual interface

Dynamic Resolution for Larger Molecules

Testing focuses on drug-sized small molecules; proteins, polymers, and nucleic acids need different scales

→ Adaptive resolution and multi-scale visual strategies

 

Thank You
Questions & Discussion
Supervisor: Prof. Yongqi Zhang  ·  Co-supervisor: Prof. Li Liu