Publications
This page collects my recent papers and preprints. I am especially interested in rigorous evaluation,
medical benchmarks, and scientific-intelligence systems for research workflows.
Papers & Preprints
arXiv preprint, 2026
OpenNovelty studies whether language models can judge the novelty of open-ended research ideas rather than only solve tasks with fixed answers. It introduces an open-domain benchmark for comparing LLM novelty judgments across research-oriented scenarios, with the goal of better understanding how models can support scientific creativity and evaluation.
arXiv preprint, 2025
LLMEval-Fair proposes a dynamic evaluation framework that samples previously unseen test sets from a large question bank, combines contamination-resistant curation with anti-cheating design, and tracks nearly 50 frontier models longitudinally, yielding a more reliable picture of progress than static leaderboards.
Findings of EMNLP 2025
LLMEval-Med is a physician-validated clinical benchmark built from real-world electronic health records and expert-designed scenarios. It addresses the weaknesses of existing medical LLM evaluations by moving beyond exam-style questions toward realistic clinical reasoning, assessed with checklist-based expert review.