

San Diego Oral Session

Oral 5D Application 3

Upper Level Ballroom 6CDEF

Moderators: Jicong Fan · Junyuan Hong

Fri 5 Dec 10 a.m. PST — 11 a.m. PST

Fri 5 Dec. 10:00 - 10:20 PST

WebGen-Bench: Evaluating LLMs on Generating Interactive and Functional Websites from Scratch

Zimu Lu · Yunqiao Yang · Houxing Ren · Haotian Hou · Han Xiao · Ke Wang · Weikang Shi · Aojun Zhou · Mingjie Zhan · Hongsheng Li

LLM-based agents have demonstrated great potential in generating and managing code within complex codebases. In this paper, we introduce WebGen-Bench, a novel benchmark designed to measure an LLM-based agent's ability to create multi-file website codebases from scratch. It contains diverse instructions for website generation, created through the combined efforts of human annotators and GPT-4o. These instructions span three major categories and thirteen minor categories, encompassing nearly all important types of web applications. To assess the quality of the generated websites, we generate test cases targeting each functionality described in the instructions. These test cases are then manually filtered, refined, and organized to ensure accuracy, resulting in a total of 647 test cases. Each test case specifies an operation to be performed on the website and the expected outcome of the operation. To automate testing and improve reproducibility, we employ a powerful web-navigation agent to execute test cases on the generated websites and determine whether the observed responses align with the expected results. We evaluate three high-performance code-agent frameworks (Bolt.diy, OpenHands, and Aider), using multiple proprietary and open-source LLMs as engines. The best-performing combination, Bolt.diy powered by DeepSeek-R1, achieves only 27.8% accuracy on the test cases, highlighting the challenging nature of our benchmark. Additionally, we construct WebGen-Instruct, a training set consisting of 6,667 website-generation instructions. Training Qwen2.5-Coder-32B-Instruct on Bolt.diy trajectories generated from a subset of the training set achieves an accuracy of 38.2%, surpassing the performance of the best proprietary model. We release our data-generation, training, and testing code, along with both the datasets and model weights, at https://github.com/mnluzimu/WebGen-Bench.
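The abstract describes each test case as an operation plus an expected outcome, executed by a web-navigation agent. The sketch below illustrates that pipeline only in outline; the `TestCase` fields, the `navigate` callable, and the substring-based judgment are illustrative assumptions, not the benchmark's actual schema or harness.

```python
# Illustrative sketch only: field names and the judging step are assumptions,
# not the actual WebGen-Bench format or evaluation code.
from dataclasses import dataclass

@dataclass
class TestCase:
    site_url: str          # locally served, generated website
    operation: str         # e.g. "click the 'Add to cart' button on the first product"
    expected_outcome: str  # e.g. "the cart badge shows 1 item"

def judge(test: TestCase, navigate) -> bool:
    """Run one test case with a web-navigation agent and compare outcomes.

    `navigate` is a hypothetical callable wrapping the browsing agent: it
    performs `operation` on `site_url` and returns a text description of
    what it observed on the page afterwards.
    """
    observed = navigate(test.site_url, test.operation)
    # In practice an LLM judge would decide whether the observation matches
    # the expectation; a naive substring check stands in for that here.
    return test.expected_outcome.lower() in observed.lower()

def accuracy(tests, navigate) -> float:
    """Fraction of test cases that pass, as reported in the benchmark."""
    passed = sum(judge(t, navigate) for t in tests)
    return passed / len(tests)
```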

Fri 5 Dec. 10:20 - 10:40 PST

QoQ-Med: Building Multimodal Clinical Foundation Models with Domain-Aware GRPO Training

David Dai · Peilin Chen · Chanakya Ekbote · Paul Liang

Clinical decision‑making routinely demands reasoning over heterogeneous data, yet existing multimodal language models (MLLMs) remain largely vision‑centric and fail to generalize across clinical specialties. To bridge this gap, we introduce QoQ-Med-7B/32B, the first open generalist clinical foundation model that jointly reasons across medical images, time‑series signals, and text reports. QoQ-Med is trained with Domain‑aware Relative Policy Optimization (DRPO), a novel reinforcement‑learning objective that hierarchically scales normalized rewards according to domain rarity and modality difficulty, mitigating performance imbalance caused by skewed clinical data distributions. Training on 2.61 million instruction-tuning pairs spanning 9 clinical domains, we show that DRPO boosts diagnostic performance by 43% in macro‑F1 on average across all visual domains compared to other critic-free training methods such as GRPO. Furthermore, because QoQ-Med is trained on intensive segmentation data, it can highlight salient regions related to the diagnosis, with an IoU 10x higher than open models while reaching the performance of OpenAI o4-mini. To foster reproducibility and downstream research, we release (i) the full model weights, (ii) the modular training pipeline, and (iii) all intermediate reasoning traces.
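The abstract characterizes DRPO only as hierarchically scaling normalized rewards by domain rarity and modality difficulty. The following is a minimal sketch of what such a scaling could look like on top of a GRPO-style group-normalized advantage; the specific weighting functions, exponents, and inputs are assumptions for illustration, not the paper's objective.

```python
import numpy as np

def drpo_advantages(rewards, domain_freq, modality_difficulty,
                    alpha=1.0, beta=1.0, eps=1e-8):
    """Sketch of a DRPO-style advantage, assuming (not from the paper) that a
    GRPO group-normalized advantage is rescaled by (i) an inverse-frequency
    domain weight and (ii) a modality-difficulty weight.

    rewards: rewards for one sampled group of responses to the same prompt
    domain_freq: fraction of training data from this clinical domain, in (0, 1]
    modality_difficulty: e.g. 1 - mean accuracy observed for this modality
    """
    rewards = np.asarray(rewards, dtype=float)
    # GRPO step: normalize rewards within the sampled group.
    adv = (rewards - rewards.mean()) / (rewards.std() + eps)
    # Hierarchical scaling: rarer domains and harder modalities get larger weight.
    domain_w = (1.0 / domain_freq) ** alpha
    modality_w = (1.0 + modality_difficulty) ** beta
    return domain_w * modality_w * adv
```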

Fri 5 Dec. 10:40 - 11:00 PST

NOVA: A Benchmark for Rare Anomaly Localization and Clinical Reasoning in Brain MRI

Cosmin Bercea · Jun Li · Philipp Raffler · Evamaria O. Riedel · Lena Schmitzer · Angela Kurz · Felix Bitzer · Paula Roßmüller · Julian Canisius · Mirjam Beyrle · Che Liu · Wenjia Bai · Bernhard Kainz · Julia Schnabel · Benedikt Wiestler

In many real-world applications, deployed models encounter inputs that differ from the data seen during training. Open-world recognition ensures that such systems remain robust as ever-emerging, previously _unknown_ categories appear and must be addressed without retraining. Foundation and vision-language models are pre-trained on large and diverse datasets with the expectation of broad generalization across domains, including medical imaging. However, benchmarking these models on test sets with only a few common outlier types silently collapses the evaluation back to a closed-set problem, masking failures on rare or truly novel conditions encountered in clinical use. We therefore present NOVA, a challenging, real-life _evaluation-only_ benchmark of ~900 brain MRI scans that span 281 rare pathologies and heterogeneous acquisition protocols. Each case includes rich clinical narratives and double-blinded expert bounding-box annotations. Together, these enable joint assessment of anomaly localisation, visual captioning, and diagnostic reasoning. Because NOVA is never used for training, it serves as an _extreme_ stress-test of out-of-distribution generalisation: models must bridge a distribution gap both in sample appearance and in semantic space. Baseline results with leading vision-language models (GPT-4o, Gemini 2.0 Flash, and Qwen2.5-VL-72B) reveal substantial performance drops, with approximately a 65% gap in localisation compared to natural-image benchmarks and 40% and 20% gaps in captioning and reasoning, respectively, compared to resident radiologists. Therefore, NOVA establishes a testbed for advancing models that can detect, localize, and reason about truly unknown anomalies.
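Since the abstract scores anomaly localisation against expert bounding-box annotations, the short sketch below shows the standard intersection-over-union computation such an evaluation would rely on; NOVA's exact protocol (e.g. how multiple lesions or unmatched predictions are handled) is not specified here and the function is a generic illustration.

```python
def box_iou(box_a, box_b):
    """Intersection-over-union of two axis-aligned boxes given as
    (x_min, y_min, x_max, y_max) in pixel coordinates.

    A generic localisation metric of the kind the abstract references;
    the benchmark's matching rules may differ.
    """
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # Intersection rectangle (zero area if the boxes do not overlap).
    ix1, iy1 = max(ax1, bx1), max(ay1, by1)
    ix2, iy2 = min(ax2, bx2), min(ay2, by2)
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (ax2 - ax1) * (ay2 - ay1)
    area_b = (bx2 - bx1) * (by2 - by1)
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0
```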