Huayu Sha

About

I am a Software Engineering student at Fudan University. My current research focuses on trustworthy evaluation of large language models, medical and real-world benchmarks, and scientific-intelligence systems for evaluating novelty and reasoning quality.

I am especially interested in evaluation pipelines that are more robust to contamination, closer to real use cases, and better aligned with the kinds of claims we actually make about modern models.

Selected Publications

2026 · arXiv preprint

OpenNovelty: An Open-domain Benchmark for Evaluating the Open-ended Novelty of Language Models

OpenNovelty studies whether language models can judge the novelty of open-ended ideas rather than only solve fixed-answer tasks. It introduces an open-domain benchmark for comparing LLM judgments of novelty ...

2025 · arXiv preprint

LLMEval-Fair: A Large-Scale Longitudinal Study on Robust and Fair Evaluation of Large Language Models

LLMEval-Fair proposes a dynamic evaluation framework that samples unseen test sets from a large question bank, combines contamination-resistant curation with anti-cheating design, and studies almost 50 front...

2025 · Findings of EMNLP 2025

LLMEval-Med: A Real-world Clinical Benchmark for Medical LLMs with Physician Validation

LLMEval-Med is a physician-validated clinical benchmark built from real-world electronic health records and expert-designed scenarios. It targets the weaknesses of existing medical LLM evaluations by moving ...

Current Focus

Robust evaluation
Contamination resistance
Medical NLP
Expert validation
Novelty assessment
Scientific review