The Specification Spectrum: Narrowing the Guesswork

If natural language prompts are inherently under-specified, how do we constrain the "guess" of an AI agent? It requires moving beyond simple unit tests, which only check the exact conditions a human engineer thought to write down. Instead, we must look at a spectrum of verification rigor.

BL Dr. Ben Livshits #languages #correctness #verification June 18, 2026

01 Generation is Easy, Knowing What to Build is Hard

The paradigm of software engineering is undergoing a tectonic shift. In a recent podcast discussion, Michael Truell — CEO of Anysphere, the company behind Cursor — articulated a fundamental friction point in the move from autocomplete toward capturing developer intent: the problem of under-specification. When developers write natural language prompts, the underlying LLM—paired with a CLI or IDE harness or an agentic setup—is effectively forced to guess the developer's intent. Sometimes the guess is brilliant; other times, it completely misses the mark.

While many researchers and some engineers find this probabilistic guesswork deeply unsatisfying, it uncovers a profound truth: traditional programming has always been built on a foundation of under-specification. When you pull a third-party library via a package manager or integrate an external API, you rarely audit the thousands of lines of underlying code. Instead, you rely on a sparse README, a few function signatures, and a massive stack of unwritten assumptions. Moreover, the library you started with can be updated within hours or days, violating even the implicit assumptions of the developer.

In conventional programming, the human engineer is guessing how the library behaves; in the AI era, the machine is guessing what the human wants. It is the same fundamental problem, just inverted.

To move forward, we need to stop viewing programming as writing explicit syntax and start viewing it as a journey across the specification spectrum.

02 Beyond Unit Tests: Property-Based Testing

I speak to developers who fully embrace the mantra of TDD, saying as long as they can come up with the tests, Claude will provide the code. Maybe so.

Unit tests that current models generally produce are a fragile shield against AI-generated bugs. If both the developer and the LLM forget a critical edge case—such as an empty slice or an integer overflow—the unit test will pass flawlessly while the production code remains broken. For production deployment, the long tail of bugs matters.

Property-Based Testing, using frameworks like QuickCheck or Hypothesis, shifts the focus from inputs to invariant behaviors. Instead of asserting that a function handles a specific example, you write a property: "For any integer x and y, adding them must be commutative."

property_based.py

# Conventional example-driven verification
def test_sort():
    assert my_sort([1,3,2]) == [1,2,3]
    assert my_sort([1,0,-5]) == [-5,0,1]

# Shifting to property-based behavior via Hypothesis
from hypothesis import given, strategies as st

@given(st.lists(st.integers()))
def test_sort(lst):
    result = my_sort(lst)
    for i in range(len(result)-1):
        assert result[i] <= result[i+1]

The testing harness then subjects the AI-generated code to thousands of randomly generated, bizarre inputs to systematically expose the limits of the model's guesswork. These days you even have Anthropic encouraging property-based testing via agents (Maaz, 2026).

03 Types to the Rescue

To tighten constraints even further, we can embed our specifications directly into the type system. Standard type checkers ensure basic data hygiene, but advanced languages allow us to use refinement and dependent types to squeeze out ambiguity entirely.

Refinement Types allow us to attach arbitrary logical predicates directly to standard data types (e.g., defining a type PositiveInt where the value must strictly be greater than zero).

Dependent Types take this to the logical extreme, allowing types to depend directly on values. For instance, in a dependently-typed language, a matrix multiply function doesn't just check that it received two arrays; its compile-time type signature mathematically enforces that the number of columns in Matrix A perfectly matches the number of rows in Matrix B.

If an LLM attempts to generate an implementation that violates these dimensional invariants, the program will simply refuse to compile. The under-specification is completely neutralized by the type system before a single line of code ever runs.

04 The State of the Art in Neurosymbolic Generation

Enforcing these constraints at compile time is incredibly powerful, but it traditionally introduces massive friction for the model. Recent advances in neurosymbolic AI are attempting to solve this by creating explicit feedback loops between probabilistic LLMs and deterministic compilers. The current state of the art in this domain primarily leverages these techniques:

Guided Token Decoding: Instead of letting an LLM generate a full block of code and hoping it compiles, modern frameworks use constrained decoding to intercept token generation in real time (Ma, 2025). As the model samples the next token, an external constraint solver or syntactic type-checker dynamically filters the token vocabulary, forcing the model to generate structurally and logically valid terms by construction (Kogler et al., 2026). This is definitely viable as a way to deal with many syntactic issues, but we don’t quite know how well this is going to work at scale (Livshits, 2026).
Tactic-State Agent Loops: Environments like LeanDojo have turned formal proof assistants into interactive sandboxes (Yang et al., 2023, George et al., 2026). When generating highly complex, dependently typed structures, the LLM functions as an agent that proposes a small step, evaluates the immediate mathematical goal state returned by the compiler, and backtracks if the type-checker rejects the term.

05 The Mirage of Total Certainty: The Reality of Lean 4 at Scale

If dependent types and formal systems can completely eliminate under-specification, why don't we formally verify every piece of enterprise software? Platforms like Lean 4 have made massive strides in usability, yet full-blown formal verification remains less than uniformly applicable for large software systems (Yang et al., 2023).

AWS Cedar, Aeneas, Veil & Loom, the Cairo Zero-knowledge language have all benefitted from Lean-based verification to various degrees. These projects are in the tens of thousands of lines of code scale-wise. The biggest effort is the Lean 4 self-hosting compiler (150,000 to 200,000 LOC). Needless to say, these are significant manual efforts that benefit from LLM-based automation but require careful guidance.

First, there is a specification inversion. To mathematically prove that an AI-generated system is 100% correct, you must first write a flawless, comprehensive formal specification of what it is supposed to do. For a chaotic, real-world application, writing that formal specification is often significantly harder and longer than writing the application itself. We haven't eliminated under-specification; we have simply shifted the burden of absolute precision onto the mathematical specification. Furthermore, when writing code without the help of LLMs, the developer is developing a mental specification in parallel with developing code – the intellectual work required to write specs is likely significantly higher on a line-by-line basis than that of writing code.

Second, the boundary problem presents a significant challenge. Modern software development involves balancing a rigorous, mathematical core with a complex and often unverified exterior. While tools like Lean can produce perfectly verified cryptographic primitives or consensus engines, these components must ultimately interact with less predictable processes like database operations, JSON parsing, and caching. Given that both legacy and AI-generated systems rely heavily on open-source libraries, analyzing a codebase composed of multiple languages and diverse dependencies is inherently difficult.

06 The Modern Engineer's True Imperative

As Michael Truell has noted, the nature of software development is undergoing a fundamental shift. The future of engineering doesn't belong to those who view LLMs as magic black boxes, nor to those chasing the unattainable ideal of absolute mathematical verification for every trivial microservice. Instead, the modern engineer's role is to architect the boundaries. By synthesizing the creative velocity of LLMs with rigorous type systems, property-based testing, and targeted formal verification, we can effectively navigate the challenges of under-specification. This approach enables us to build robust, reliable systems even when relying on complex, external components. Yet, this jagged frontier remains most decidedly uneven; its efficacy depends heavily on the specific domain, the tech stack, and the implementation language.

References

George, R., Huang, S., Song, P., & Anandkumar, A. (2026). LeanProgress: Guiding search for neural theorem proving via proof progress prediction. arXiv:2502.17925.
Kogler, P., Wei, C., & Haselboeck, A. (2026). Towards neuro-symbolic constrained decoding for reliable code generation with LLMs. Companion Proceedings of SE 2026, Lecture Notes in Informatics (LNI).
Livshits, B. (2026). LLMs + security = trouble. arXiv:2602.08422.
Ma, F. (2025). Logically constrained decoding. Proceedings of the 2025 Conference on Human Language Technology (ACL Anthology).
Maaz, M., DeVoe, L., Hatfield-Dodds, Z., & Carlini, N. (2026). Finding bugs across the Python ecosystem with Claude and property-based testing. arXiv:2510.09907.
Truell, M. (2026). The next era of AI coding [Video]. YouTube. https://www.youtube.com/watch?v=8h9j2rskP14
Yang, K., Swope, A. M., Gu, X., Chalamala, R., Song, P., Yu, S., Godil, S., Prenger, R., & Anandkumar, A. (2023). LeanDojo: Theorem proving with retrieval-augmented language models. arXiv:2306.15626.