AI tests - post in progress
This post documents tests used to assess the performance of AI large language models (LLMs). As I find more tests and come to understand them, I will post them here.
AGIEval (AGI evaluation) is a test of whether AI models can pass exams designed for humans. It focuses on advanced reasoning abilities beyond ordinary natural language understanding, with inductive, deductive, and spatial reasoning questions; models must solve challenges such as logical puzzles using language alone. It appears in a paper on arXiv as follows:
[Submitted on 13 Apr 2023 (v1), last revised 18 Sep 2023 (this version, v2)]
AGIEval: A Human-Centric Benchmark for Evaluating Foundation Models
by Wanjun Zhong, Ruixiang Cui, Yiduo Guo, Yaobo Liang, Shuai Lu, Yanlin Wang, Amin Saied, Weizhu Chen, Nan Duan.
Traditional benchmarks, which rely on artificial datasets, may not accurately represent human-level capabilities. In this paper, we introduce AGIEval, a novel benchmark specifically designed to assess foundation models in the context of human-centric standardized exams, such as college entrance exams, law school admission tests, math competitions, and lawyer qualification tests...
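To make the setup concrete, here is a minimal sketch of how a multiple-choice exam benchmark of this kind is typically scored: the model is shown the question and the lettered options, it replies with a letter, and accuracy is the fraction of matches with the official answers. The sample question and the query_model() function below are hypothetical stand-ins, not part of the AGIEval release itself.

```python
# Minimal sketch of scoring a multiple-choice exam benchmark such as AGIEval.
# query_model() and the sample question are hypothetical placeholders.

def query_model(prompt: str) -> str:
    """Stand-in for a call to an LLM; replace with a real model or API call."""
    return "A"  # placeholder answer

def score_exam(questions: list[dict]) -> float:
    """Return the fraction of questions the model answers correctly."""
    correct = 0
    for q in questions:
        options = "\n".join(f"{label}. {text}" for label, text in q["choices"].items())
        prompt = f"{q['question']}\n{options}\nAnswer with a single letter."
        prediction = query_model(prompt).strip().upper()[:1]
        if prediction == q["answer"]:
            correct += 1
    return correct / len(questions)

# Example with a single made-up LSAT-style question.
sample = [{
    "question": "If all widgets are gadgets and no gadgets are gizmos, "
                "which statement must be true?",
    "choices": {"A": "No widgets are gizmos", "B": "All gizmos are widgets",
                "C": "Some widgets are gizmos", "D": "All gadgets are widgets"},
    "answer": "A",
}]
print(f"Accuracy: {score_exam(sample):.0%}")
```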
MMLU stands for Massive Multitask Language Understanding (from the paper "Measuring Massive Multitask Language Understanding"). It is a knowledge and reasoning benchmark for LLMs that measures multiple-choice accuracy across a wide range of subjects, not a speed test. It appears on arXiv as follows:
[Submitted on 7 Sep 2020 (v1), last revised 12 Jan 2021 (this version, v3)]
Measuring Massive Multitask Language Understanding
by Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, Jacob Steinhardt.
We propose a new test to measure a text model's multitask accuracy. The test covers 57 tasks including elementary mathematics, US history, computer science, law, and more. To attain high accuracy on this test, models must possess extensive world knowledge and problem solving ability...
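As a rough illustration of how MMLU accuracy is computed, here is a sketch that scores one subject. It assumes the data is hosted on the Hugging Face Hub as "cais/mmlu" with "question", "choices", and "answer" fields, and pick_answer() is a hypothetical stand-in for a call to an actual model.

```python
# Sketch of scoring one MMLU subject, assuming the benchmark is available on
# the Hugging Face Hub as "cais/mmlu" with fields "question", "choices"
# (four options) and "answer" (index of the correct option).
# pick_answer() is a hypothetical stand-in for an LLM call.

from datasets import load_dataset

def pick_answer(question: str, choices: list[str]) -> int:
    """Stand-in for an LLM call; here it naively guesses the first option."""
    return 0

def mmlu_subject_accuracy(subject: str = "high_school_mathematics") -> float:
    """Fraction of test questions answered correctly for one MMLU subject."""
    test_set = load_dataset("cais/mmlu", subject, split="test")
    correct = sum(
        pick_answer(row["question"], row["choices"]) == row["answer"]
        for row in test_set
    )
    return correct / len(test_set)

if __name__ == "__main__":
    print(f"Accuracy: {mmlu_subject_accuracy():.1%}")
```

Since every question has four options, random guessing scores about 25%, which is the usual baseline quoted for MMLU; overall MMLU scores average this per-subject accuracy across all 57 tasks.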