
Why Your LLM Outputs Are Inconsistent (And How to Fix It)

With GPT-5 now available, the temperature parameter is disappearing from newer models. Discover why LLM outputs are inherently inconsistent and learn practical strategies—from hyper-explicit prompts to robust validation—to build reliable applications on unpredictable foundations.

LLM · AI · GPT-5 · Azure OpenAI · Prompt Engineering · Software Development

At last, GPT-5 is available in the Azure UK South region. I was looking forward to using this deployment to test several existing applications, assuming an 'upgrade' from GPT-4 was inevitable. However, whilst the output was significantly better for my applications, I was concerned to find that temperature is no longer an available setting in the newer model. My applications rely on highly deterministic outputs - so what does this mean?

This deterministic behaviour matters if you've built an app on an LLM and it works perfectly, until one day the output format changes. What was once a clean, flat data structure is suddenly wrapped in an array, and your code breaks.

This isn't a bug in the logic. It's a feature of the model itself: large language models are nondeterministic. Unlike traditional software, which produces the same output every time for the same input, LLMs make predictions based on probabilities. That inherent randomness is both a blessing and a curse.

  • For creative tasks like writing a poem or brainstorming ideas, randomness is a feature — it adds spark.
  • For developers relying on structured data, randomness is a headache — it can mean unpredictable behaviour and even system failures.

The "Temperature" Dial: A Fading Control

Traditionally, developers controlled LLM randomness with the temperature parameter:

  • Low temperature (e.g. 0.1): More deterministic, the model sticks to the most probable outputs.
  • High temperature (e.g. 1.0): More random, encouraging variety and creativity.

If you needed consistency, you simply set the temperature to zero. Problem solved.

But this lever is disappearing. In newer, more advanced models (such as some GPT-5 variants), the temperature is fixed — often at 1.0. Why? Because these models are designed to be reasoning-focused and rely on their own internal logic, not just statistical next-word choices. In short, the old trick of "just set temperature=0" no longer works.
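
In code, the change is easy to picture. Here's a minimal sketch using the openai Python SDK against an Azure OpenAI resource; the key, API version, endpoint, and deployment names are placeholders, and whether a given newer deployment accepts temperature at all is something to verify against your own resource.

from openai import AzureOpenAI

client = AzureOpenAI(
    api_key="YOUR_KEY",                                    # placeholder
    api_version="2024-06-01",                              # placeholder
    azure_endpoint="https://your-resource.openai.azure.com",
)

# The old trick: pin an older chat model to its most probable output.
deterministic = client.chat.completions.create(
    model="gpt-4o",                                        # your deployment name
    messages=[{"role": "user", "content": "List the key features."}],
    temperature=0,
)

# Newer reasoning-focused deployments may fix or reject the parameter,
# so the same call has to stand on its own without it.
best_effort = client.chat.completions.create(
    model="gpt-5",                                         # hypothetical deployment name
    messages=[{"role": "user", "content": "List the key features."}],
)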

The New Rules of the Game: Your Toolkit

When the traditional guardrails vanish, you need new ones. Here's how to build reliability back into your applications:

Be Hyper-Explicit in Your Prompts

Your prompt must now do the heavy lifting. The clearer you are, the less room the model has to drift.

❌ Instead of:

List the key features.

✅ Try:

You are an AI assistant tasked with identifying and listing key features.
Your response must be a single, flat JSON object with one key called "features".
The value should be an array of strings.
Do not include any extra text, explanations, or prose.
Output only the JSON object.
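
Wired into an application, that explicit prompt might look something like the sketch below (again using the openai Python SDK; the deployment name is a placeholder, and response_format is only honoured where the deployment supports JSON mode):

import json
from openai import AzureOpenAI

client = AzureOpenAI(
    api_key="YOUR_KEY",
    api_version="2024-06-01",
    azure_endpoint="https://your-resource.openai.azure.com",
)

SYSTEM_PROMPT = (
    "You are an AI assistant tasked with identifying and listing key features. "
    'Your response must be a single, flat JSON object with one key called "features". '
    "The value should be an array of strings. "
    "Do not include any extra text, explanations, or prose. "
    "Output only the JSON object."
)

response = client.chat.completions.create(
    model="gpt-5",  # placeholder deployment name
    messages=[
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": "Summarise the key features of the attached release notes."},
    ],
    response_format={"type": "json_object"},  # an extra constraint where JSON mode is available
)

features = json.loads(response.choices[0].message.content)["features"]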

Implement Robust Validation

You cannot trust the LLM to always return the right structure. Validation is your safety net.

  • Use libraries like Pydantic (Python), Ajv (JavaScript/TypeScript), or Cerberus to enforce schemas.

If validation fails, your app should:

  • Retry the prompt with stricter instructions.
  • Fall back to a safe default.
  • Log/flag the anomaly for human review.

Think of validation as your "circuit breaker" against inconsistent outputs.
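
As a sketch of that safety net in Python (Pydantic v2 here; the fallback value and logging are illustrative, and the retry-with-stricter-instructions step is left to your own client code):

import logging
from pydantic import BaseModel, ValidationError

class FeatureList(BaseModel):
    features: list[str]

def parse_features(raw: str) -> FeatureList:
    """Validate the model's reply against the schema, falling back to a safe default."""
    try:
        # Accepts only a flat object such as {"features": ["fast", "cheap"]}.
        return FeatureList.model_validate_json(raw)
    except ValidationError as exc:
        # Log/flag the anomaly for human review (and retry upstream if you can), then fall back.
        logging.warning("LLM output failed validation: %s", exc)
        return FeatureList(features=[])

# The failure mode from earlier: the same data wrapped in an array trips the circuit breaker.
print(parse_features('[{"features": ["fast", "cheap"]}]').features)  # []
print(parse_features('{"features": ["fast", "cheap"]}').features)    # ['fast', 'cheap']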

Test Like LLMs Will Break You (Because They Might)

LLM providers update their models frequently, and what worked yesterday might not work tomorrow.

  • Run multiple test passes to catch intermittent failures.
  • Add edge-case prompts to your test suite.
  • Integrate LLM testing into your CI/CD pipelines so regressions are caught early.

Testing isn't about catching if the model will change — it's about being ready for when it does.
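
A minimal pytest sketch of that idea follows; call_llm is a stand-in to replace with your real client call, and the edge-case prompts are just examples:

import json
import pytest

def call_llm(prompt: str) -> str:
    # Stand-in: replace with your real Azure OpenAI call.
    return '{"features": ["example"]}'

EDGE_CASE_PROMPTS = [
    "List the key features.",
    "List the key features of a product that has none.",
    "List the key features. Reply in French.",  # tempts the model to drift from the format
]

@pytest.mark.parametrize("prompt", EDGE_CASE_PROMPTS)
@pytest.mark.parametrize("attempt", range(3))  # multiple passes to catch intermittent failures
def test_llm_returns_flat_feature_object(prompt, attempt):
    payload = json.loads(call_llm(prompt))
    assert isinstance(payload, dict), "output must not be wrapped in an array"
    assert isinstance(payload.get("features"), list)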

Wrapping Up

You can't eliminate nondeterminism in LLMs. That ship has sailed. But you can tame it.

The old approach of tweaking temperature is fading away. The new approach is about explicit prompts, robust validation, and relentless testing. In other words: don't rely on the model to be consistent — build consistency into your system.

The future of reliable AI apps isn't about controlling the model. It's about designing resilient systems that thrive on top of unpredictable foundations.