<aside>

We bring humans back into the loop: instead of slapping a weather-forecast sticker on an AI, we expose the atmosphere in real time.

</aside>

Where this came from

Algorithm justice. That was the aspiration I carried out of my MADS course on the evaluation of machine learning systems, taught by Paul Resnick. He guided us through how every machine learning model can carry bias, and through the different techniques for making them less biased.

Of course it's important to improve the algorithm itself, but I wondered: in a world that encourages openly sharing model parameters, architectures, and datasets, why should improvement stop with the development team making the model intrinsically better? What if the users of the model had a say?

For instance, an intent classification model: some are more conservative, some are bolder, so why not let the users choose? Two raters can read the same model output and one calls it accurate, the other calls it dismissive, and they're both right. That disagreement should be signal, not noise.
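To make "let the users choose" concrete, here is a minimal sketch; the function, labels, and thresholds are hypothetical, not taken from any shipped system:

```python
# Hypothetical sketch: exposing an intent classifier's temperament as a
# user-facing knob instead of baking one choice into the product.
from dataclasses import dataclass

@dataclass
class IntentPrediction:
    intent: str
    confidence: float  # model probability for the top intent, 0..1

def resolve_intent(pred: IntentPrediction, mode: str = "conservative") -> str:
    """Accept the model's guess only when it clears the user's chosen bar."""
    threshold = 0.85 if mode == "conservative" else 0.55  # illustrative values
    if pred.confidence >= threshold:
        return pred.intent
    return "unsure"  # defer: ask a clarifying question or hand off to a human

pred = IntentPrediction("cancel_subscription", 0.70)
print(resolve_intent(pred, mode="conservative"))  # -> "unsure"
print(resolve_intent(pred, mode="bold"))          # -> "cancel_subscription"
```

The point is not the particular numbers; it is that the conservative-versus-bold trade-off becomes something users set, not something decided for them.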

First, we need a platform where different models can be tested and users can voice specific feedback (what worked, what didn't, and why) instead of just throwing a thumbs up or down. And the public should have the right to learn from multipolarity, from how differently people perceive the model outputs, instead of a True/False label and a "score" that tells us almost nothing.
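As a rough sketch of the records such a platform would need to keep (all field names here are my own assumptions, not a spec), the unit of feedback is a rater's reasoning, and what gets surfaced is the distribution of perceptions rather than a single number:

```python
# Illustrative data model: per-rater reasoning in, distribution of perceptions out.
from collections import Counter
from dataclasses import dataclass, field

@dataclass
class RaterFeedback:
    rater_id: str
    verdict: str        # e.g. "accurate", "dismissive", "helpful but verbose"
    what_worked: str
    what_did_not: str
    why: str

@dataclass
class OutputEvaluation:
    model_id: str
    output_id: str
    feedback: list[RaterFeedback] = field(default_factory=list)

    def perception_distribution(self) -> dict[str, float]:
        """How raters perceived this output, kept as a distribution, not a score."""
        if not self.feedback:
            return {}
        counts = Counter(f.verdict for f in self.feedback)
        total = sum(counts.values())
        return {verdict: n / total for verdict, n in counts.items()}
```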

Like any product person, I kept all these problem statements in my backlog, with the code treated as a white box. I knew it would take villages to actually make this happen, something like re-creating a HuggingFace + Kaggle hybrid, and it's not something one person can do. So I put the idea aside and continued my studies and my AI product consulting work.

What changed

I kept studying the transformer architecture, and a few months later GPT-3 arrived. Although it couldn't do much without proper prompt engineering, the AI community was excited about it. Not long after, the LLM era arrived.

These big, marketing-friendly terms ("benchmark", "outperform") did help promote large language models and raise a lot of awareness. But at some point, every company started claiming its own "benchmarks." Product managers are under pressure to produce a shocking benchmark score for launch, because these products are now worth billions.

There was once an "AI legal benchmark" that one company claimed to have reached - and I found out it was built mainly from court data from India and China, while their target market was the US. Benchmarks actually mean a lot to model developers. Without one, we don't know how to make the next version better - and we have to assume the benchmark is really measuring something tied to a valuable use case. As Chip Huyen pointed out in Designing Machine Learning Systems (the book I helped localize), these scores are not meant for the public.

From Hugging Face to yawning face

So the drama is getting boring (even though every product launch still uses these scores as a gimmick). Everyone finds it entertaining to watch models fail at this and excel at that. It's hard to believe we are treating the development of super-intelligence this way.

Three open questions keep unfolding for me, and they are the same three questions that pushed me back into building HumanJudge:

Subjectivity. For data without a single ground truth, we attempt rating and scoring, all trying to work out a quantified metric for training purposes. What if we used voting instead, and let different representations be embedded in the inference process? Pluralistic disagreement is real information about the world; it is what every working courtroom, peer-review process, and Wikipedia talk page has known forever. It does not need to be averaged into oblivion before it reaches the model.
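A toy example of the difference (labels and counts invented for illustration): a majority vote erases the minority reading, while a soft label keeps the disagreement visible to whatever consumes it downstream.

```python
# Five raters, one model output, no single ground truth.
from collections import Counter

votes = ["accurate", "accurate", "dismissive", "accurate", "dismissive"]

# The usual move: collapse to a majority label and lose the minority view.
majority_label, _ = Counter(votes).most_common(1)[0]

# The alternative: keep the whole distribution as a soft label.
soft_label = {label: n / len(votes) for label, n in Counter(votes).items()}

print(majority_label)   # accurate
print(soft_label)       # {'accurate': 0.6, 'dismissive': 0.4}
```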

Recency. In some domains, a benchmark has to be constantly updated to fully reflect the real world - for instance, marketing content. Models trained on yesterday's truth get judged on tomorrow's reality, and we treat that as if it were a permanent score. It isn't. A benchmark from 18 months ago about "is this a good chatbot reply" is closer to historical fiction than to a measurement.

Taking the human out of the loop. How well can LLM-as-judge actually help? What if this judge is itself biased? What work is needed to keep this judge "the perfect judge"? Can we even have that? And to clarify: I'm not criticizing LLM-as-judge or all the pre-production testing it supports. My point is only that it is not enough, because we use these systems every day, and we may be asking them questions that were never tested.
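One minimal way to keep such a judge honest (a sketch, with made-up labels) is to audit it against a rolling sample of human judgments and watch the agreement over time:

```python
# Audit an LLM judge against fresh human labels; if agreement drifts,
# the judge you validated is no longer the judge you are running.
from sklearn.metrics import cohen_kappa_score

human_labels = ["good", "bad", "good", "good", "bad", "good", "bad", "good"]
judge_labels = ["good", "good", "good", "bad", "bad", "good", "bad", "good"]

kappa = cohen_kappa_score(human_labels, judge_labels)
print(f"judge-vs-human agreement (Cohen's kappa): {kappa:.2f}")
```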

If we apply the car crash-test metaphor to LLMs: how many crashes are happening on the road today?

Selection bias was supposed to be the first lesson of the field, not the last. (SIADS 632, Causal Inference for Data Scientists: lack of comparability between groups is the original sin of empirical claims. Most production AI evaluation pipelines, including the celebrated public ones, recruit their evaluators in ways that would not survive a methods seminar.)

Good human alignment?