T-Eval: Evaluating the Tool Utilization Capability of Large Language Models Step by Step

Chen, Zehui; Du, Weihua; Zhang, Wenwei; Liu, Kuikun; Liu, Jiangning; Zheng, Miao; Zhuo, Jingming; Zhang, Songyang; Lin, Dahua; Chen, Kai; Zhao, Feng

arXiv:2312.14033 (cs)

[Submitted on 21 Dec 2023 (v1), last revised 15 Jan 2024 (this version, v3)]

Abstract: Large language models (LLMs) have achieved remarkable performance on various NLP tasks and are augmented with tools for broader applications. Yet how to evaluate and analyze the tool-utilization capability of LLMs remains under-explored. In contrast to previous works that evaluate models holistically, we decompose tool utilization into multiple sub-processes: instruction following, planning, reasoning, retrieval, understanding, and review. Based on this decomposition, we introduce T-Eval to evaluate tool-utilization capability step by step. T-Eval disentangles the evaluation into several sub-domains along model capabilities, facilitating understanding of both the holistic and the isolated competencies of LLMs. We conduct extensive experiments on T-Eval and an in-depth analysis of various LLMs. T-Eval not only exhibits consistency with outcome-oriented evaluation but also provides a more fine-grained analysis of LLM capabilities, offering a new perspective on evaluating tool-utilization ability. The benchmark will be available at this http URL.
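
As a rough illustration of what a step-by-step evaluation could look like, the sketch below scores a model separately on each of the six sub-processes named in the abstract and then averages the per-dimension scores into one overall number. The record format, the exact-match scoring, and the unweighted mean are all assumptions made for illustration; the abstract does not specify T-Eval's actual metrics or data format.

```python
from collections import defaultdict
from statistics import mean

# The six sub-processes named in the abstract.
DIMENSIONS = (
    "instruction following",
    "planning",
    "reasoning",
    "retrieval",
    "understanding",
    "review",
)


def per_dimension_scores(records):
    """Average a 0/1 exact-match score within each dimension.

    `records` is a hypothetical list of dicts with the keys
    'dimension', 'prediction', and 'reference' (an assumed format,
    not the benchmark's real one).
    """
    buckets = defaultdict(list)
    for rec in records:
        if rec["dimension"] not in DIMENSIONS:
            raise ValueError(f"unknown dimension: {rec['dimension']}")
        # Assumption: exact string match stands in for a real metric.
        buckets[rec["dimension"]].append(
            1.0 if rec["prediction"] == rec["reference"] else 0.0
        )
    return {dim: mean(scores) for dim, scores in buckets.items()}


def overall_score(dim_scores):
    """Unweighted mean over the dimensions that were evaluated."""
    return mean(list(dim_scores.values()))


# Hypothetical toy records, invented for this example only.
records = [
    {"dimension": "planning", "prediction": "search, book", "reference": "search, book"},
    {"dimension": "planning", "prediction": "book", "reference": "search, book"},
    {"dimension": "retrieval", "prediction": "weather_api", "reference": "weather_api"},
]

scores = per_dimension_scores(records)
print(scores)                 # per-dimension breakdown
print(overall_score(scores))  # single holistic number
```

Keeping the per-dimension dictionary alongside the overall mean mirrors the abstract's point that a holistic score alone hides which individual capability is weak.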

Comments:	Project: this http URL
Subjects:	Computation and Language (cs.CL)
Cite as:	arXiv:2312.14033 [cs.CL]
	(or arXiv:2312.14033v3 [cs.CL] for this version)
	http://doi.org/10.48550/arXiv.2312.14033

Submission history

From: Zehui Chen
[v1] Thu, 21 Dec 2023 17:02:06 UTC (4,275 KB)
[v2] Thu, 4 Jan 2024 05:11:22 UTC (4,274 KB)
[v3] Mon, 15 Jan 2024 03:18:25 UTC (4,275 KB)
