AI tests - post in progress

This post documents benchmarks that are used to assess the performance of large language models (LLMs). As I find more tests and come to understand them, I will add them here.


AGIEval (short for "AGI evaluation") is a test of whether AI systems can pass exams written for humans. It focuses on advanced reasoning abilities that go beyond everyday language use, with inductive, deductive, and spatial reasoning questions; models must solve challenges such as logical puzzles using language alone. It appears in a paper on ArXiv as follows:

[Submitted on 13 Apr 2023 (v1), last revised 18 Sep 2023 (this version, v2)]

AGIEval: A Human-Centric Benchmark for Evaluating Foundation Models

by Wanjun Zhong, Ruixiang Cui, Yiduo Guo, Yaobo Liang, Shuai Lu, Yanlin Wang, Amin Saied, Weizhu Chen, Nan Duan. 

Traditional benchmarks, which rely on artificial datasets, may not accurately represent human-level capabilities. In this paper, we introduce AGIEval, a novel benchmark specifically designed to assess foundation models in the context of human-centric standardized exams, such as college entrance exams, law school admission tests, math competitions, and lawyer qualification tests...
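
To make the setup concrete, here is a minimal sketch (my own illustration, not code from the AGIEval paper) of how a multiple-choice exam question can be turned into a prompt and scored by checking the model's reply against the gold answer letter. The ask_model function is a hypothetical stand-in for whichever LLM API you actually call, and the sample question is invented.

```python
# Minimal sketch of multiple-choice exam scoring, in the spirit of AGIEval.
# ask_model is a hypothetical placeholder for a real LLM API call.

def ask_model(prompt: str) -> str:
    """Stand-in for an LLM call; replace with a real client."""
    return "(A)"  # dummy reply so the script runs end to end

def build_prompt(question: str, options: list[str]) -> str:
    """Format a question and its options as a zero-shot prompt."""
    letters = "ABCDE"
    lines = [f"Question: {question}"]
    lines += [f"({letters[i]}) {opt}" for i, opt in enumerate(options)]
    lines.append("Answer with the letter of the correct option.")
    return "\n".join(lines)

def score(items: list[dict]) -> float:
    """Return the fraction of items where the gold letter appears in the reply."""
    correct = 0
    for item in items:
        reply = ask_model(build_prompt(item["question"], item["options"]))
        if item["answer"] in reply:
            correct += 1
    return correct / len(items)

if __name__ == "__main__":
    sample = [{
        "question": "If all bloops are razzies and all razzies are lazzies, "
                    "are all bloops definitely lazzies?",
        "options": ["Yes", "No", "Cannot be determined"],
        "answer": "(A)",
    }]
    print(f"Accuracy: {score(sample):.2%}")
```

In practice the scoring scripts are stricter about extracting the answer letter, but the idea is the same: the exam questions are given to the model as text and its answers are marked like a human candidate's.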


MMLU stands for Measuring Massive Multitask Language Understanding. It is a benchmark of an LLM's knowledge and problem-solving ability across many subjects (it measures accuracy, not speed). It appears on ArXiv as follows:

[Submitted on 7 Sep 2020 (v1), last revised 12 Jan 2021 (this version, v3)]

Measuring Massive Multitask Language Understanding

Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, Jacob Steinhardt

We propose a new test to measure a text model's multitask accuracy. The test covers 57 tasks including elementary mathematics, US history, computer science, law, and more. To attain high accuracy on this test, models must possess extensive world knowledge and problem solving ability...
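
As a companion to the sketch above, here is a small illustration (my own, not from the paper) of how an overall MMLU score is commonly reported: accuracy is computed for each of the 57 subjects and then averaged across subjects, so a small subject counts as much as a large one. The subject names and counts below are made up for the example.

```python
# Illustration of averaging per-subject accuracy into one MMLU-style score.
# The subjects and numbers below are invented for the example.

def macro_average(per_subject: dict[str, tuple[int, int]]) -> float:
    """per_subject maps subject name -> (correct, total).
    Returns the mean of the per-subject accuracies."""
    accuracies = [correct / total for correct, total in per_subject.values()]
    return sum(accuracies) / len(accuracies)

if __name__ == "__main__":
    results = {
        "elementary_mathematics": (30, 50),  # 60% accuracy
        "us_history": (40, 50),              # 80% accuracy
        "computer_security": (25, 50),       # 50% accuracy
    }
    print(f"Overall MMLU-style score: {macro_average(results):.1%}")
```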
