11.11.25
•
Avi Patel

Introducing Kled HADES: A Benchmark for Real Human Data
6 MIN READ
RESEARCH


We’re launching Kled HADES: the Human-Aligned Data Evaluation Standard.
As Kled has matured, we’ve realized something the AI industry still hasn’t addressed. The world’s most advanced models are being evaluated on synthetic, filtered, or scraped data that doesn’t reflect real human behavior. They perform well on curated benchmarks but collapse when faced with the entropy of the real world.
Mercor’s APEX measures how well models perform economically valuable work across consulting, law, and medicine. Kled HADES, by contrast, measures how well those same models perform when exposed to real human data.
For example, a model might summarize a clean academic paper perfectly but fail when given a real student’s PhD thesis draft full of comments, formatting errors, handwriting, and half-finished equations. HADES tests models on that kind of reality: the raw, unfiltered data that defines how people actually create and communicate.
Those failures are not a labeling problem. They are a distribution problem. Models fail because they have never been exposed to this kind of authentic human data during training. The world they were taught to understand is synthetic. The world they are being deployed into is not.
That is where Kled comes in. The data we collect through our platform reflects the real world in all its complexity: videos, documents, conversations, behavioral data, and sensory recordings, all verified and sourced from real people. When AI labs integrate this data into training and evaluation, models become more robust, grounded, and aligned with reality.
To make HADES truly representative, we’ve partnered with experts across disciplines to construct domain-specific evaluation rubrics. With partners at Latham & Watkins, engineers from over 20 Fortune 500 companies, five Grammy-nominated artists, six Division I athletic coaches, and ten decorated medical professionals, we’re designing tests that expose where models fail to interpret the data real people produce.
Where Mercor uses experts to measure task performance and close the gap with more labor, Kled uses experts to measure data comprehension and close the gap with better data. HADES doesn’t evaluate what a model can output. It evaluates whether the model can even understand the inputs humans generate every day.
We’ve assembled a dedicated research team to design the evaluation sets, metrics, and scoring frameworks that will form the foundation of HADES and define what true alignment with human data looks like.
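To make that concrete, here is a minimal, hypothetical sketch of what a rubric-driven scoring pass over a single raw human artifact could look like. Every name, field, and weight below is illustrative only and reflects assumptions for the sake of the example; it is not the actual HADES evaluation framework or scoring code.

```python
from dataclasses import dataclass, field

@dataclass
class RubricCriterion:
    """One expert-authored criterion, e.g. 'preserves the author's half-finished equations'."""
    name: str
    weight: float  # relative importance assigned by the domain expert

@dataclass
class RubricItem:
    """A raw human artifact paired with the criteria a model's reading of it must satisfy."""
    artifact_path: str  # e.g. a scanned thesis draft or a redlined contract
    domain: str         # "legal", "medical", "music", ...
    criteria: list[RubricCriterion] = field(default_factory=list)

def score_item(item: RubricItem, judgments: dict[str, bool]) -> float:
    """Weighted share of criteria the model satisfied, as judged by a domain expert."""
    total = sum(c.weight for c in item.criteria)
    earned = sum(c.weight for c in item.criteria if judgments.get(c.name, False))
    return earned / total if total else 0.0

# Example: one legal artifact scored against two expert criteria.
item = RubricItem(
    artifact_path="artifacts/redlined_contract_004.pdf",
    domain="legal",
    criteria=[
        RubricCriterion("identifies handwritten margin edits", weight=2.0),
        RubricCriterion("flags clauses left half-finished", weight=1.0),
    ],
)
print(score_item(item, {"identifies handwritten margin edits": True}))  # ~0.67
```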
We will use HADES to evaluate and rank the top models from every major AI lab on how well they understand authentic human data. The insights we uncover will guide where the industry’s data needs to evolve and position Kled as the primary source for closing those gaps.
This benchmark is more than research. It is one of the most important steps in increasing the scale, defensibility, and value of Kled’s ecosystem. It also marks the second-to-last piece of the puzzle before our full company update.
We’re hiring. If you’re an AI researcher who wants to help define this new standard and work alongside top talent from Stanford SAIL, email: team@kled.ai.
The Leading Data Marketplace.
Support
A Nitrility Inc. Company
Kled AI © 2025
