The No-Data Algorithm




Enabling trust in LLMs-as-judges WITHOUT labelled data

This is the TL;DR of my paper Labelling Data with Unknown References. If you are looking for the code, it is here.

We have a huge problem: evaluating highly-capable LLMs is expensive. Using an LLM as a labeller (i.e., LLM-as-judge) is becoming quite common, but the field is VERY divided on the effectiveness and feasibility of the approach. Add to that a lack of data and the pressure to scale, and you're in a pickle.

There are three ways you can prove that a labeller works: statistics (which won't work without data), faith (which is not science), or a formal proof. I introduce here an algorithm that (via a formal proof, obviously) allows you to decide if you can trust your LLM-as-judge in a cryptographically-secure manner.

Outline


  • It is possible to label data without having any labels, provided that the 'thing' saying 'hey I know the labels!' (an LLM) proves it can be trusted.
  • This means that an LLM-as-judge can be trusted to be a labeller iff it passes the checks from the No-Data algorithm. This is cryptographically secure.
  • In turn this suggests that we do not need labelled data to establish trust. As per this site (obviously AI-generated but a great summary), this has applications in areas where such data is scarce (healthcare, market research, low-resource languages, etc.), but where models (LLMs) trained on lots of internet data COULD perform well.

Picture this: you do not know how to solve a problem, but someone says they can, and they tell you 'just trust me, bro'. As a (self-respecting) scientist, that is not an argument we should accept--especially from a machine whose convergence guarantees are unproven, or at best ill-characterised. So how do you decide if you can trust it?

Usually, this is done via one of three arguments. The first is a statistical argument (i.e., a test set). But you do not have that, and hence you cannot possibly know the true answer--perhaps nobody can. The second would be via faith (the 'just trust me, bro' argument), which, as we established, is not science. The third would be a formal proof.

Enter the No-Data Algorithm. Granted, it is technically the 'no-labels algorithm', but since you do not have labelled data, it can go either way. The point is, I claim that this algorithm can enable you to establish trust in an evaluator (say, an LLM) even when you do not have labelled data.

Does it work?

Yes!

How?

Just trust me, bro.



Ok, not really. This post will introduce the algorithm and sketches of the proof. There are some experimental results, too, because peer reviewers usually won't be convinced by proofs alone.

The No-Data Algorithm works as follows: you have (unlabelled) data, and a rubric of what the labels should be. For example, it could be something like 'does the summary cover at least five key points of the original text?' or 'does this string contain an even number of zeros?'.
An Evaluator claims it knows the labels. A Verifier must check the Evaluator's labels, but without knowing the true labels itself. Via a protocol (kind of like a game) called the EV Protocol, the algorithm gets the Evaluator and the Verifier to talk to each other until the Verifier decides whether or not it is sufficiently convinced of the Evaluator's performance.
You then run the EV Protocol on your entire dataset and get a labelled dataset and a statement on whether these labels can be trusted, and to what extent. Here's a picture of the algorithm.

The No-Data Algorithm
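
To make the outer loop concrete, here is a minimal sketch in Python of how I picture it, assuming a toy interface; `ev_protocol`, `evaluator.label`, and the returned trust ratio are illustrative placeholders of mine, not the paper's reference implementation.

```python
def no_data_algorithm(dataset, rubric, evaluator, r=3):
    """Label every datapoint and track how often the Evaluator survives the EV Protocol."""
    labels, passes = [], 0
    for x in dataset:
        trusted = ev_protocol(x, rubric, evaluator, rounds=r)  # sketched further below
        labels.append(evaluator.label(x, rubric))              # the Evaluator's claimed label
        passes += int(trusted)
    # Output: the labelled dataset plus a statement of how far you can trust those labels.
    return labels, passes / len(dataset)
```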

As you probably guessed, the important bit is the EV Protocol. This is motivated by classic results in computational complexity theory, such as the Arthur-Merlin protocol from Babai, and it is a type of zero-knowledge proof. What that is (formally) is beyond the scope of this post, but the gist of it is that a zero-knowledge proof establishes trust via challenges. They are used in blockchain, authentication systems, and other critical areas where you should not ever assume that the interlocutor (in this case, the Evaluator) is telling the truth.

The EV Protocol also has challenges--two, to be exact--that are designed to provide sufficient and necessary information as to whether you can trust your labeller. I'll explain why they are sufficient and necessary in the next section. Here I'll introduce them, but first let me explain the rest of the protocol.
The protocol works as follows:

  1. Given a datapoint \(x\) and the rubric \(C\), the Evaluator comes up with a new datapoint \(\tilde{x}\) that, according to it, will lead to the same label.
  2. The Verifier selects, uniformly at random, one of two challenges:
    1. Does \(\tilde{x}\) have the same evaluation under \(C\) as \(x\)?, or
    2. Does \(\tilde{x}\) have the same structure as \(x\)?
  3. If the Evaluator passes the challenge, you go back to (1) and start all over. Repeat \(r\) times. (There is a short sketch of this loop below.)
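
Here is a minimal sketch of one run of the EV Protocol, under the same toy interface as before; `propose_equivalent`, `same_valuation`, and `same_structure` are placeholders of mine for the Evaluator's move and the two challenges, not functions from the paper.

```python
import random

def ev_protocol(x, rubric, evaluator, rounds=3):
    """Run the two-challenge loop r times; failing any challenge means no trust."""
    for _ in range(rounds):
        # (1) The Evaluator proposes a point it claims leads to the same label as x.
        x_tilde = evaluator.propose_equivalent(x, rubric)
        # (2) The Verifier picks one of the two challenges uniformly at random.
        if random.random() < 0.5:
            ok = same_valuation(rubric, x, x_tilde)   # Challenge 1: C(x) == C(x~)?
        else:
            ok = same_structure(x, x_tilde)           # Challenge 2: structurally the same?
        # (3) Failing a challenge is an immediate disqualification.
        if not ok:
            return False
    return True  # survived all r rounds
```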
Now, assume that the challenges make sense (and are consistent and complete and so on): it is clear then that failing a challenge is an immediate disqualification. BUT passing a challenge does not mean trust. After all, the Evaluator could have gotten lucky. And, since the challenges are public, the Evaluator could have fudged its answer to pass them.

We can show that if the Evaluator can only pass one of the challenges (that is, it is lying), then the probability that the Verifier will succeed at catching it is \(1 - 1/4^r\), or around 98% for \(r=3\).
In fact, we will also show that, if the challenges are consistent and complete (i.e., sufficient and necessary), being able to pass both challenges for any choice of \(\tilde{x}\) means that the Evaluator knows the labels. Conversely, not being able to pass the challenges consistently means it is lying.

The EV Protocol

Before we get to the proofs, let me just repeat what we do:

  1. We challenge the Evaluator on every point that must be labelled. If it passes the challenges, we accept its label. If it fails, we reject it.
  2. The challenges are designed to catch liars, so it is possible for us (with a bounded probability) to establish confidence in our Evaluator without the need for labels.

We'll do a very short sketch of the proofs--don't worry, this is more to understand why it works. If you are interested in the full proofs, check out the paper.

The plan: to prove correctness of the No-Data algorithm we need two things:

  1. Correctness of the EV protocol, and
  2. (as a corollary) correctness of the full algorithm.
Additionally, for the proof of the EV protocol, there are two things we need to show: that it works (consistency), and that it is sufficient and necessary (completeness) to establish trust.

We'll start with the EV protocol, namely with showing that it is sufficient and necessary. For this, note that the rubric \(C\) is a checklist for your labels, but not the final 'decider' of these labels. After all, the way you aggregate the answers to the checklist is what decides the label, not the other way around (for example, majority vote, a non-linear function, best-of-3, etc). Let's call that aggregator \(\sigma\). It is easy to argue that your final label \(y\) is obtained by first passing \(x\) through \(C\), and then passing that output into the aggregator: \(y = \sigma(C(x))\). Let's assume that \(\sigma\) is a deterministic function.
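
As a toy illustration (my own, not the paper's), here is what the rubric/aggregator split might look like for the 'even number of zeros' example: \(C\) is the public checklist, \(\sigma\) is the deterministic aggregation rule.

```python
def C(x: str) -> list[bool]:
    """The public rubric: a checklist of valuations over the datapoint."""
    return [
        x.count("0") % 2 == 0,   # 'does this string contain an even number of zeros?'
        len(x) >= 4,             # some other box on the checklist
    ]

def sigma(valuation: list[bool]) -> int:
    """One possible aggregator: majority vote over the ticked boxes."""
    return int(sum(valuation) > len(valuation) / 2)

y = sigma(C("0010"))   # y = sigma(C(x)): checklist first, then aggregation
```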

So, nobody knows \(\sigma\), but everyone knows \(C\). Then the checks will (should) focus on ensuring that \(C\) leads to the same valuation \(C(x)\) for any newly-generated \(\tilde{x}\). However, there are many ways one can get the same output valuation \(C(x)\) (for example, by having multiple points checking the same boxes), and that does not mean that \(x\) and \(\tilde{x}\) are equivalent! Why not? If \(C\) only has linear terms, like 'if this, the label should be 1, otherwise 0', matching valuations are all you need. But with non-linear terms, like 'if this or that, then 0, otherwise 1', the \(\tilde{x}\) that the Evaluator generates could get lucky and fulfil the 'this' or the 'that' in the non-linear term--there is no guarantee that we are looking at the same point as \(x\).
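
A tiny, hand-rolled example of this 'getting lucky' problem, with a rubric consisting of a single disjunctive check:

```python
def C(x: str) -> list[bool]:
    return [x.startswith("a") or x.endswith("z")]   # non-linear term: 'if this or that'

x, x_tilde = "abc", "xyz"
assert C(x) == C(x_tilde) == [True]   # same valuation under C...
# ...but x satisfies the 'this' branch and x_tilde the 'that' branch, so they are not
# structurally the same point. This is exactly what Challenge 2 is there to catch.
```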

So we need to check for the same valuation (\(C(x) \stackrel{?}{=} C(\tilde{x})\)), as covered by Challenge 1 from the previous section, but also whether the points are structurally the same (to account for luck), which is Challenge 2 above. Crucially, knowing the answer to Challenge 1 gives no information about Challenge 2, and vice versa. This in turn means that being able to solve Challenge 1 (resp. 2) does not necessarily mean solving the entire thing, but solving the entire thing does mean knowing the answers to Challenges 1 and 2. Hence both challenges are sufficient and necessary.

Now, about completeness: since every challenge is chosen uniformly at random, Challenge 1 and Challenge 2 both have the same probability of appearing in front of the Evaluator. A lying Evaluator may only be able to pass one of the challenges (as per above: passing both means knowing the actual label), and sure, it can prepare in advance, but it does not know in advance which one it will see. So the probability of getting lucky when lying is \(1/4\). It is also easy to show that this is the worst-case scenario. Repeat this \(r\) times, and you get a probability of failure (by the Verifier) of \(1/4^r\).
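
If you want to convince yourself of the arithmetic, here is a quick simulation, taking the paper's 1/4 per-round figure as given:

```python
import random

def p_caught(r: int, trials: int = 100_000) -> float:
    """Estimate the probability of catching a liar that survives each round w.p. 1/4."""
    caught = 0
    for _ in range(trials):
        if any(random.random() > 0.25 for _ in range(r)):   # caught in at least one round
            caught += 1
    return caught / trials

print(p_caught(3))   # ~0.984, matching 1 - 1/4**3 = 63/64
```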

From the above, it follows that the EV Protocol bounds, up to a tight probability, the confidence in a labeller per datapoint. The expensive bit is that it costs more to run (by a factor of \(r\)), but on the other hand it allows you to label data iff the Evaluator knows the labels (more on that later!).

The experiments are super short because I think the paper conveys the work better. The gist of it is that I created a synthetic dataset to illustrate the point, using both a decision tree and an LLM-as-a-judge (o3-mini). Each evaluated a problem it did know (in-phenomenon, or IP) as well as one it claimed to know but actually didn't (out-of-phenomenon, or OOP). I split the cases into 'Known' and 'Unknown', where 'Known' is, as per the title of the paper, not something we handle and is just there to illustrate what the 'true' accuracy would be if we knew the real label set.
The predictions of the paper held up quite well, with the LLM and the decision tree lying through their teeth (and being caught) when the setup required it (the OOP case), and successfully (impressively, I might add, for o3-mini) solving the IP case.

Experiment setup
Experiment results

So what have we learnt?

  1. It is indeed possible to establish trust in an LLM-as-a-judge, even if you do not know the labels! This does not mean that you can just use them: what the No-Data algorithm says is 'you likely can/cannot trust this LLM in this task'. These challenges are cryptographically secure.
  2. You can get a labelled dataset with slightly better performance (slightly!!) if you flip the labels when you cannot trust the evaluator (there is a tiny sketch of this below). It's not super duper useful, and frankly I would just stick to the No-Data algorithm's measure of success.
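
Here is the tiny sketch promised in point 2 (my own, assuming binary labels and a per-datapoint trust flag from the EV Protocol; the paper's exact procedure may differ):

```python
def post_process(labels, trusted):
    """Keep the Evaluator's label where trust was established; flip it where it was not."""
    return [y if ok else 1 - y for y, ok in zip(labels, trusted)]
```
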
It is worth noting that this is a very different way to think of machine learning. We used to require some sort of supervision to get the labels right--nowadays, with LLMs, there's no actual reason to do that. It is not me saying this: it is people using LLMs-as-judges. The truth is that the field is super divided on their use. There are accounts that they work, accounts that they do not work, and I argue that it is precisely because of this division that we should not care about getting a proper conclusive statement. Instead, let's do it on a case-by-case basis: show (prove to me) that you can trust your Evaluator on this task, and I will trust your results. Here's an algorithm to do that.

As an aside, what I believe is most interesting is the case where the Evaluator does not know the label: you can flip it as much as you'd like, and all you'll get is background noise on aggregate. So you aren't really violating any laws of nature: the background entropy remains unchanged! (A little physics joke for you.)
The middle ground--when it gets some right and some wrong--is also why the No-Data algorithm only gives you a number: the successes. Everything else (including the threshold at which you are willing to trust a noisy evaluator) is up to the scientist. This is more or less how we are handling natural-language problems (for which this algorithm also works).

Speaking of, one of the cool bits about the No-Data algorithm applied to judging LLMs-as-judges is that, yes, it does work, but it establishes trust in the prompt. So there is room for improvement: how can we establish trust in the model? It'll likely be more expensive and have to be a meta-meta evaluator (since the No-Data algorithm is technically a meta-evaluator), but it could create extremely reliable, cryptographically-secure online benchmarks of performance for LLM competitions that cannot be gamed.


Adrian de Wynter
July 2025