SciAgentGym: Benchmarking Multi-Step Scientific Tool-use in LLM Agents
SciAgentGym benchmarks multi-step scientific tool use for LLM agents with 1,780 tools and long-horizon workflows. It reports systematic failures on extended ...
To be, or not to be, that is the question.
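I am Sha Huayu (沙华煜), an undergraduate student in Software Engineering at Fudan University. My research centers on large language model evaluation, medical benchmark construction, and intelligent systems for scientific research workflows.
I care about making research reproducible, deployable, and interpretable: from data and evaluation design to toolchain implementation and systematic validation.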
OpenNovelty builds an evidence-grounded agent pipeline for scholarly novelty assessment. Instead of giving opaque yes/no judgments, it retrieves related lite...
LLMEval-Fair proposes a dynamic evaluation framework that samples unseen test sets from a large question bank, combines contamination-resistant curation with...