Comparing Sonnet 4.5 with Haiku 4.5 on the PolicyAI Benchmark

I wrote a library with a terrible name: policyai. The core idea is simple: composable instructions for use in AI agents. Given a set of natural-language instructions for how to process text into JSON, apply them and produce the JSON. It sounds like plain structured outputs, but consider the four conflicts or biases lurking in these three simple instructions.

  1. Alice is high priority.
  2. Bob is always low priority.
  3. Alice is low priority these days.

Like I said, there are four conflicts in these three instructions.

  1. Alice has two priorities specified, with an implied temporal link. A good LLM will understand this, but most don’t.
  2. Bob could send an email saying, “The building is on fire.” If you’re outputting “priority”, the LLM might just decide to mark it high priority with a justification that life is at stake.
  3. Low occurs twice, high occurs once; the LLM will bias toward “low” rather than considering the actual output it must produce.
  4. The LLM may not recognize that Bob and Alice are distinct individuals.

Now, some amount of prompt engineering can get around this. The quite lengthy prompt I was using to test the composable-instructions problem was basically one affirmatively phrased blocker of bad behavior after another.

Straw Men Solutions

There are quite a few things you can do to get reliable structured outputs from an LLM.

Add a “Justification” Field

Give the LLM domain-specific ways to express the solution in natural language. This is particularly powerful with non-thinking models. Emitting the justification in one field and then outputting the answer field afterward lets the answer’s attention focus on text that fully determines the solution it needs to produce.
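
As a rough illustration (a generic sketch, not PolicyAI’s API; the email-triage schema and field names are invented), putting the justification before the decision means the decision is decoded with the justification already in context:

    import json

    # Hypothetical schema for an email-triage task. The justification comes
    # first so the model commits to its reasoning before the decision.
    schema = {
        "type": "object",
        "properties": {
            "justification": {"type": "string"},
            "priority": {"type": "string", "enum": ["high", "low"]},
        },
        "required": ["justification", "priority"],
    }

    # A completion decoded against this schema emits the justification tokens
    # first, so the "priority" value is conditioned on them.
    completion = '{"justification": "Bob is always low priority.", "priority": "low"}'
    print(json.loads(completion)["priority"])  # -> low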

Consider Field Order

If the instructions reference fields in some order (or you can infer one), or you have something like a justification field, then choose the output order so that every fragment of JSON depends only on the JSON that comes before it.
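
A small sketch of the idea, with invented field names: if “priority” depends on who the sender is, emit the sender field first so the dependent field only ever looks backward.

    # Invented example: "priority" depends on "sender", so "sender" comes
    # first. Every key relies only on JSON that has already been produced.
    good_order = ["sender", "justification", "priority"]

    # The reverse order forces the model to commit to "priority" before it
    # has written down who the sender is.
    bad_order = ["priority", "justification", "sender"]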

PolicyAI’s Solutions

PolicyAI does some things that I’m sure others have discovered independently. I’m writing this post in part to meet such people.

Semantic Masking

The first thing that PolicyAI does is replace every field name (say, “foo”) with a UUID. The structure of a UUID is basically guaranteed to have its own “neurons” in the LLM, but a freshly generated UUID is also guaranteed not to be in the training data. Structured outputs guarantee that the LLM will decode the UUID key (or at least I haven’t observed such a failure), but the key is now masked.

There is no semantic meaning imbued in a UUID.
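
A minimal sketch of the masking step (my own illustration of the technique, not PolicyAI’s actual code): generate a UUID per field, hand the model only the UUID keys, and translate back after decoding.

    import uuid

    # Map each human-readable field name to a random UUID...
    fields = ["priority", "category", "unread"]
    mask = {name: str(uuid.uuid4()) for name in fields}   # "priority" -> "9f1c..."
    unmask = {v: k for k, v in mask.items()}

    # ...the model only ever sees the UUID keys. Suppose it returns:
    raw = {mask["priority"]: "low", mask["unread"]: True}

    # Restore the real field names afterward.
    restored = {unmask[k]: v for k, v in raw.items()}
    print(restored)  # {'priority': 'low', 'unread': True}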

Sparse Decomposition

I may be abusing the notion of sparsity here, but the second trick PolicyAI uses is to decompose binary or enum values into a set of booleans that collectively define the valid states plus a variety of invalid, conflicting states. This lets each instruction address the output separately. The downside is that this bloats the output tokens. The upside is that a post-processing layer (that’s all PolicyAI is) does the right thing and detects either a totally valid output or an invalid combination. It can then explicitly provide instructions for how the LLM should revise its output to be “correct”. Absent this mechanism, the LLM is forced to combine the values itself, and its reasoning is mostly opaque, or at least not machine-interpretable without another LLM.
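
As an illustration of the recombination step (my sketch, not PolicyAI’s code), suppose the priority decision was decomposed into one boolean per instruction:

    # Each instruction fills in its own boolean; the post-processing layer
    # either recombines them into a valid value or flags a conflict.
    def combine_priority(marks_high: bool, marks_low: bool) -> str:
        if marks_high and not marks_low:
            return "high"
        if marks_low and not marks_high:
            return "low"
        # Both (or neither) set: an invalid combination. The caller can turn
        # this into an explicit revision instruction for the LLM.
        raise ValueError("conflicting priority booleans; ask the model to revise")

    print(combine_priority(marks_high=False, marks_low=True))  # -> low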

Show me the Numbers

I built a benchmark on PolicyAI that uses LLM-as-judge to construct a series of positive or negative assertions about a tweet. From those, I can construct a test case whose expected output is known. The benchmark in this section is based upon these test cases. I have open-sourced the data-generation pipeline and am looking for credits to beef it up into a standard benchmark.

I built PolicyAI on Phi4, where I could get a 90%+ success rate, but it always scored below the baseline. When I ported it to Claude (something I should have done from the start), the baseline improved, and PolicyAI was almost identical to it.

I tweaked and prompt-engineered my way to improvements in both PolicyAI and the baseline. What resulted for Sonnet 4.5 is a world in which the baseline solves 297/300:

                     │ PolicyAI
                     │      Correct        Wrong
    ─────────────────┼───────────────────────────
    Baseline Correct │  297 (99.0%)     0 (0.0%)
               Wrong │     3 (1.0%)     0 (0.0%)

Notice that PolicyAI is getting 300/300 correct.

Compare that to Haiku 4.5, where PolicyAI solves 299/300 and the baseline gets only 134/300 correct.

                     │ PolicyAI
                     │      Correct        Wrong
    ─────────────────┼───────────────────────────
    Baseline Correct │  133 (44.3%)     1 (0.3%)
               Wrong │  166 (55.3%)     0 (0.0%)

What I conclude from this is that Haiku is probably unusable for composable instruction following without a library like PolicyAI. This probably also carries over to complex tool use with a rich (10+ item) schema.

Where Haiku lives up to its promise is runtime. The baseline went from 2708.05ms on average to 1812.63ms on average, and PolicyAI improved likewise: 6567.08ms to 3202.56ms. That is a speedup of 1.49x and 2.05x in overall evaluation time, respectively.

Over twice as fast and half as accurate!

Conclusion

This isn’t how I hoped to introduce PolicyAI, but I did want to highlight the Haiku results: anyone using Haiku for structured outputs should benchmark first before deploying it in place of Sonnet.