AI skeptics complain about code that is verbose, overwrought, unmaintainable, and misguided. In many specific instances, I’ve found that to be correct. The fix isn’t higher ownership, more diligent human review, or the perfect CLAUDE.md file. We need better control over the inputs if we want to tighten the band of acceptable outputs.

The core reason LLMs go off the rails is ambiguity and complexity. LLMs have better luck solving small, well-defined problems when they have adequate context. A lot of working with LLMs today is about scoping those problems down and supplying that missing context. These are not uniquely LLM problems. Humans also do much better with well-defined problems, clear outcomes, and good context, and I think that’s a much more useful frame.

Context#

In 2026, it seems more critical than ever to give a sense of where you are in your AI journey before you start talking about your plans. We’re an ~80-engineer company doing eCommerce. Our engineers fall along an approximately normal distribution of the Yegge adoption scale, centered on the 4-5 range (some LLM-in-IDE users, some folks operating a single Claude Code instance).

Additionally, we’ve got plans to rearchitect an area of code that is important and poorly understood by the people who notionally own it. This is driven, in part, by some long-standing documentation debt that we have accrued.

Where I think we’re headed#

As mentioned in this AI-generated vision of the future, I think the future of software development looks a lot like developers treating software the way they treat SQL. We largely leave the implementation details up to the machine (think: query planner) and provide it hints if it’s going off track. If something is particularly critical, we may deviate from this, but those cases will be quite rare and specialized.

This leaves engineers with two big jobs. First, they will control the inputs to the model (specifications, prompts, etc). Second, they will control/tune verification of the work through increasingly automated means (canary releases, test requirements, etc).

How I think we should get there#

I think that while we’re focused on getting LLMs to help with code generation, the results would be improved by using them upstream of the code and in the verification of the code.

First, I think the inputs to a model are critical. We’ve done a lot of work on code generation and getting the machine to produce code like a programmer might. I think we’ve done comparatively little iteration with our product counterparts on generating useful descriptions of the feature/component being improved. I would like to see pair/mob programming on specifications between product and engineering: product people in GitHub, commenting on specification changes. Practically, I think this looks like using something similar to Fission’s OpenSpec.

After the spec is written, I believe we should write the code using orchestrated LLMs, something like gastown or multiclaude. There is plenty of other writing on this, but I’ll say that these tools are already quite good at chewing through work, if it’s well specified.
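To make “orchestrated” concrete, here’s a minimal sketch of the shape of that loop, assuming a tasks.md made of markdown checkboxes and Claude Code’s headless print mode. The file path and prompt are my own placeholders, and real orchestrators like gastown or multiclaude do far more than this (parallelism, worktrees, retries):

```python
# Hypothetical orchestration sketch: feed each unchecked item in tasks.md to a
# headless agent run, one task at a time. Assumes Claude Code's non-interactive
# print mode (`claude -p`); path and prompt wording are placeholders.
import subprocess
from pathlib import Path

lines = Path("tasks.md").read_text().splitlines()  # wherever the change's tasks live
tasks = [
    line.strip().removeprefix("- [ ]").strip()
    for line in lines
    if line.strip().startswith("- [ ]")  # only unfinished checklist items
]

for task in tasks:
    prompt = (
        "Implement the following task from the approved spec. "
        "Stay within its scope and keep the diff small.\n\nTask: " + task
    )
    # One agent run per task; check=True stops the loop on failure so a human
    # (or a reviewer agent) can look at what went wrong before continuing.
    subprocess.run(["claude", "-p", prompt], check=True)
```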

Then I think we need a stronger approach to review and validation. I’m thinking about it from three perspectives: interplay dynamics, adherence to the intent, and the readiness of the output for users.

Interplay dynamics#

When we have a specification and we’re mutating/enhancing that specification over time, one critical piece of validation is to know if the spec is internally consistent.

When I say “spec”, I mean more than just a PRD. The structure that OpenSpec seems to use starts with a proposal.md, which is approximately a PRD: high level, what are we even trying to accomplish. Next, there’s an optional design.md, which contains an engineering design (SQL tables, box diagrams, etc.); it’s skipped for small stuff. These turn into a spec.md, which contains EARS-style requirements for specifically how the system should behave. Lastly, there’s a tasks.md, which lists the discrete work items needed to accomplish the goal within the context of a design.

├── proposal.md    # ≈ PRD
├── design.md      # engineering design (optional)
├── spec.md        # specification (given/when/then type syntax)
└── tasks.md       # discrete work items
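
To give a flavor of what lands in spec.md, here’s an invented example (not taken from OpenSpec’s docs) in that EARS / given-when-then shape:

WHEN a shopper applies an expired promo code at checkout,
THE checkout service SHALL reject the code with an “expired promotion” error,
AND the cart total SHALL remain unchanged.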

I think we need adversarial reviewers, divorced from the context that created the specification, to look over the spec as written (proposal, design, tasks), compare it against other related specs as written, and find inconsistencies between them.
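As a sketch of what that adversarial pass could look like, assuming the Anthropic Python SDK, a placeholder model name, and a made-up file layout:

```python
# Hypothetical consistency check: hand the spec bundle plus a related spec to a
# model that had no part in writing them, and ask only for conflicts.
# Assumes the Anthropic Python SDK; model name and paths are placeholders.
from pathlib import Path
import anthropic

change_dir = Path("openspec/changes/example-change")  # hypothetical layout
spec_bundle = "\n\n".join(
    f"--- {p.name} ---\n{p.read_text()}" for p in sorted(change_dir.glob("*.md"))
)
related = Path("openspec/specs/checkout.md").read_text()  # hypothetical related spec

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment
review = client.messages.create(
    model="claude-sonnet-4-5",  # placeholder; use whatever you actually run
    max_tokens=2000,
    messages=[{
        "role": "user",
        "content": (
            "You had no part in writing these documents. Find internal "
            "contradictions within the change (proposal, design, spec, tasks) "
            "and conflicts with the related spec. For each, quote the lines "
            "that disagree.\n\n"
            f"CHANGE:\n{spec_bundle}\n\nRELATED SPEC:\n{related}"
        ),
    }],
)
print(review.content[0].text)
```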

Adherence to the intent#

Once the tasks.md and associated documents are vetted, we have adequately small, high-context tasks that an LLM should be able to make fast work of. These use the normal tools that people doing code generation have been working the kinks out of for a while: CLAUDE.md files, tools, skills, etc.

Once the code is up for review, this gate kicks in. The big question we’re trying to answer is “Does the code that’s written make meaningful progress towards the goal stated in the spec?”. If the answer is yes, we pass. If not, we iterate.
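One way this gate could run in CI, reusing the same SDK pattern as the consistency check above; the PASS/FAIL convention, paths, and prompt are all assumptions about one possible wiring, not a finished tool:

```python
# Hypothetical "adherence to intent" gate: compare the branch diff against the
# proposal and fail the build if the model can't connect them.
import subprocess
import sys
from pathlib import Path
import anthropic

diff = subprocess.run(
    ["git", "diff", "origin/main...HEAD"],
    capture_output=True, text=True, check=True,
).stdout
proposal = Path("openspec/changes/example-change/proposal.md").read_text()  # hypothetical path

client = anthropic.Anthropic()
verdict = client.messages.create(
    model="claude-sonnet-4-5",  # placeholder
    max_tokens=1000,
    messages=[{
        "role": "user",
        "content": (
            "Does this diff make meaningful progress towards the goal stated "
            "in the proposal? Answer PASS or FAIL on the first line, then "
            "explain briefly.\n\n"
            f"PROPOSAL:\n{proposal}\n\nDIFF:\n{diff}"
        ),
    }],
).content[0].text

print(verdict)
sys.exit(0 if verdict.strip().upper().startswith("PASS") else 1)
```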

Production readiness#

The last gate to clear is one of production readiness. The question is “Is this change safe to release to our users?”. This can mean many things. Is it behind a disabled feature flag? Does this purely enhance the functionality in a backwards-compatible way? Does the code have adequate telemetry?
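To make the feature-flag case concrete, here’s a minimal sketch of the shape a readiness check is looking for: the new path ships dark behind a default-off flag, the old path is untouched, and the new path carries its own telemetry. The flag store and function names are stand-ins for whatever your platform actually provides.

```python
# Minimal example of "safe to release": deploying this code changes nothing for
# users until the flag is flipped. FLAGS is a stand-in for a real flag client.
FLAGS = {"new_checkout_flow": False}  # disabled by default at release time

def legacy_checkout(cart: list[float]) -> float:
    return sum(cart)  # unchanged path; behavior stays backwards-compatible

def new_checkout(cart: list[float]) -> float:
    total = sum(cart)
    print(f"telemetry: new_checkout_flow total={total}")  # telemetry before ramp-up
    return total

def checkout(cart: list[float]) -> float:
    if FLAGS.get("new_checkout_flow", False):
        return new_checkout(cart)
    return legacy_checkout(cart)
```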

The culture shift#

That’s it.

You may notice several things missing here. No “is this code maintainable?”. No “are there tests?”. Some of these are CI gates. Others are tweaks to the repo’s prompts that the LLMs are using to author the code. “This isn’t how I would do it” is no longer valid feedback to block a PR. Instead, you update the inputs. Tighten the spec. Alter the guidance. Add a linter.
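For example, a recurring “please don’t do it this way” comment can become a check the robots hit every time. Here’s a tiny hypothetical lint along those lines; the banned pattern and directory are purely illustrative:

```python
# Tiny hypothetical lint: encode a review nit ("don't call the ORM directly in
# request handlers", say) as a CI check instead of a PR comment.
import re
import sys
from pathlib import Path

BANNED = re.compile(r"session\.query\(")  # example nit, encoded as a rule

violations = [
    f"{path}:{lineno}: direct ORM query in a handler; use the repository layer"
    for path in Path("app/handlers").rglob("*.py")
    for lineno, line in enumerate(path.read_text().splitlines(), start=1)
    if BANNED.search(line)
]

print("\n".join(violations))
sys.exit(1 if violations else 0)
```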

This fundamentally changes the role of the reviewer from the one we know today. The reviewer becomes closer to a sensor on how the pipeline to production is running, rather than a gatekeeper. This is higher-leverage work. It’s not “Please don’t do it this way”, but rather “How can we adjust the system such that people/robots don’t do it this way?”.

The pathway to production is a system and we control the fitness function that it optimizes for. The role of this reviewer is to provide feedback that will become a new input. In this way, we move closer to “recompute the feature from the specification”, which makes more and more sense as the cost of reimplementation drops towards zero.

I still have open questions. If you have thoughts or answers, let me know. I’d love to talk about it.

  • Should there be more separation between “LLM the writer of PRDs” and “LLM the writer of eng design” and “LLM the writer of tickets”? OpenSpec seems to combine them, but I haven’t quite decided how linked those three are.
  • How should we be thinking about deploy/post-deploy validation in this new world?