Publications
This page collects my recent papers and preprints. I am especially interested in rigorous evaluation,
medical benchmarks, and scientific-intelligence systems for research workflows.
Papers & Preprints
arXiv preprint, 2026
OpenNovelty studies whether language models can judge the novelty of open-ended research ideas rather than only solve tasks with fixed answers. It introduces an open-domain benchmark for comparing LLM novelty judgments across research-oriented scenarios, with the goal of better understanding how models can support scientific creativity and evaluation.
arXiv preprint, 2025
LLMEval-Fair proposes a dynamic evaluation framework that samples previously unseen test sets from a large question bank, combines contamination-resistant curation with anti-cheating design, and tracks nearly 50 frontier models longitudinally, yielding a more reliable picture of progress than static leaderboards.
Findings of EMNLP 2025
LLMEval-Med is a physician-validated clinical benchmark built from real-world electronic health records and expert-designed scenarios. It addresses the weaknesses of existing medical LLM evaluations by moving beyond exam-style questions toward realistic clinical reasoning, assessed with checklist-based expert review.