A framework for few-shot evaluation of language models.

lm-evaluation-harness

lm-evaluation-harness is a framework for evaluating large language models (LLMs) across a wide range of tasks. It provides more than 60 academic benchmarks and supports multiple model frameworks, local models, cloud services such as OpenAI, hardware acceleration, and custom tasks.
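
To make "few-shot evaluation" concrete, here is a minimal Python sketch of the general idea: a handful of solved examples are prepended to each test question, and the model's answers are scored by exact match. It does not use the harness's own API. The names `build_prompt`, `evaluate`, and `toy_model` are hypothetical, and `toy_model` is a stand-in for a real LLM.

```python
from typing import Callable, List, Tuple

Example = Tuple[str, str]  # (question, answer)


def build_prompt(shots: List[Example], question: str) -> str:
    """Concatenate the solved examples, then append the unanswered question."""
    parts = [f"Q: {q}\nA: {a}" for q, a in shots]
    parts.append(f"Q: {question}\nA:")
    return "\n\n".join(parts)


def evaluate(model: Callable[[str], str],
             shots: List[Example],
             test_set: List[Example]) -> float:
    """Return the exact-match accuracy of `model` over `test_set`."""
    correct = 0
    for question, answer in test_set:
        prediction = model(build_prompt(shots, question)).strip()
        correct += prediction == answer
    return correct / len(test_set)


def toy_model(prompt: str) -> str:
    """Hypothetical stand-in for an LLM: adds the two integers in the last question."""
    last_question = prompt.rsplit("Q: ", 1)[1].split("\nA:")[0]
    left, right = last_question.rstrip("?").split(" + ")
    return str(int(left) + int(right))


# Two in-context examples (a 2-shot setting) and a tiny test set.
shots = [("1 + 1?", "2"), ("2 + 3?", "5")]
test_set = [("4 + 4?", "8"), ("10 + 5?", "15")]
accuracy = evaluate(toy_model, shots, test_set)
```

An evaluation framework generalizes this loop by swapping in real model backends, standard benchmark datasets, and task-specific metrics.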
