Benchmark Definition Math

AI’s math problem: FrontierMath benchmark shows how far technology still has to go

Artificial intelligence systems may be good at generating text, recognizing images, and even solving basic math problems—but when it comes to advanced mathematical reasoning, they are hitting a wall.

GitHub

Math-VR Benchmark & CodePlot-CoT: Mathematical Visual Reasoning by Thinking with Code-Driven Images

Recent advances in Vision Language Models (VLMs) have shown significant progress in mathematical reasoning, yet they still face a critical bottleneck with problems that require visual assistance, such ...

marktechpost

OMEGA: A Structured Math Benchmark to Probe the Reasoning Limits of LLMs

Large-scale language models with long CoT reasoning, such as DeepSeek-R1, have shown good results on Olympiad-level mathematics. However, models trained through Supervised Fine-Tuning or Reinforcement ...

the-decoder

OpenAI quietly funded independent math benchmark before setting record with o3

Epoch AI, the developer of a mathematics benchmark, did not initially disclose funding from OpenAI due to a non-disclosure agreement, and this only became known when OpenAI set a new record on the ...

GitHub

andrew/oss-community-benchmarks

AI coding assistants are benchmarked mostly on Python and JavaScript. If you maintain a framework outside that bubble, you've probably seen AI tools generate code that looks plausible but uses ...

marktechpost

Planetarium: A New Benchmark to Evaluate LLMs on Translating Natural Language Descriptions of Planning Problems into Planning Domain Definition Language PDDL

Large language models (LLMs) have gained significant attention in solving planning problems, but current methodologies must be revised. Direct plan generation using LLMs has shown limited success, ...
