Exploring the Randomness of LLMs
Large Language Models (LLMs) have become central to modern development—powering everything from RAG systems to AI agents that assist with our daily tasks.
Yet, most developers who rely on LLMs rarely pause to explore their behavior. Instead, they depend on benchmarks, anecdotal usage, or viral posts on X to form their mental models of what LLMs can (or cannot) do.
I believe this is risky—it creates a gap between people who use these models and those who deeply evaluate them. To help bridge that gap, I decided to run a simple but fun experiment:
Can LLMs Generate Random Numbers?
At first glance, this seems trivial. We know that LLMs cannot produce true randomness—the model itself has no internal random number generator; any variation in its answers comes from sampling over a learned token distribution, and that distribution can be heavily skewed. So the real questions are:
👉 Do LLMs show bias toward certain numbers when asked to generate a random one?
👉 Does this bias change across different models or families of models?
That’s what I set out to explore.
Experiment Setup
I designed a small test program to measure four things (a rough code sketch of these metrics follows the list):
- The range of values generated when asked for random numbers
- The bias compared to a uniform random distribution
- The consistency of results across trials
- How results vary across different models and families
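To make the first two metrics concrete, here is a minimal sketch, assuming `samples` holds the integers a model returned for a given range; the helper names and the use of total variation distance are my illustration, not the exact code from the harness.

from collections import Counter

def coverage(samples: list[int], min_val: int, max_val: int) -> float:
    # Fraction of the possible integer values that appeared at least once.
    possible = max_val - min_val + 1
    return len(set(samples)) / possible

def bias_vs_uniform(samples: list[int], min_val: int, max_val: int) -> float:
    # Total variation distance from the uniform distribution over the range:
    # 0.0 means perfectly uniform; values near 1.0 mean the mass is piled
    # onto a few favorite numbers.
    possible = max_val - min_val + 1
    counts = Counter(samples)
    uniform_p = 1.0 / possible
    return 0.5 * sum(
        abs(counts.get(v, 0) / len(samples) - uniform_p)
        for v in range(min_val, max_val + 1)
    )

Consistency is then just a matter of comparing these numbers across trials, ranges, and models.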
Prompts used:
min_val, max_val = 1, 100  # example range; each tested range is substituted in turn

prompts = {
    "direct": f"Generate a random number between {min_val} and {max_val}. Return only the number, no explanation.",
    "creative": f"Imagine you're a random number generator. Pick any number between {min_val} and {max_val}. Just return the number.",
    "precise": f"Please provide exactly one number that falls within the range [{min_val}, {max_val}]. Return only the numeric value.",
}
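For illustration, a single trial against the OpenAI API might look like the sketch below. The model name, the strict int() parsing, and the lack of retries are assumptions made for brevity, not the exact harness.

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def one_trial(prompt: str, model: str = "gpt-4.1-mini", temperature: float = 0.7) -> int:
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        temperature=temperature,
    )
    # Parse strictly so malformed replies surface as errors instead of bad data.
    return int(response.choices[0].message.content.strip())

value = one_trial(prompts["direct"])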
Experiment parameters:
- Number of trials = 5
- Temperature = 0.7
Number ranges tested (the collection loop over these ranges is sketched after the list):
- [-1, 1]
- [-10, 10]
- [-100, 0]
- [0, 1]
- [0, 100]
- [1, 10]
- [1, 100]
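Putting it together, the collection loop over ranges, prompt styles, and trials might look like this sketch; it reuses the illustrative one_trial helper above, and build_prompts simply parameterizes the three templates per range.

N_TRIALS = 5
TEMPERATURE = 0.7
RANGES = [(-1, 1), (-10, 10), (-100, 0), (0, 1), (0, 100), (1, 10), (1, 100)]

def build_prompts(min_val: int, max_val: int) -> dict[str, str]:
    # The same three templates shown earlier, parameterized per range.
    return {
        "direct": f"Generate a random number between {min_val} and {max_val}. "
                  "Return only the number, no explanation.",
        "creative": f"Imagine you're a random number generator. Pick any number "
                    f"between {min_val} and {max_val}. Just return the number.",
        "precise": f"Please provide exactly one number that falls within the range "
                   f"[{min_val}, {max_val}]. Return only the numeric value.",
    }

# Collect N_TRIALS samples for every (range, prompt style) pair.
results: dict[tuple[int, int, str], list[int]] = {}
for min_val, max_val in RANGES:
    for style, prompt in build_prompts(min_val, max_val).items():
        results[(min_val, max_val, style)] = [
            one_trial(prompt, temperature=TEMPERATURE) for _ in range(N_TRIALS)
        ]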
Models tested:
- GPT-4.1
- GPT-4.1 Mini
- Claude 3.5 Haiku
- Claude 4 Sonnet
Total Cost:
- OpenAI API Cost = $5
- Claude API Cost = $5
Results
[Per-model results charts: GPT-4.1 Mini, GPT-4.1, Claude 3.5 Haiku, and Claude 4 Sonnet]
Learnings
This is just the beginning—but even from these early experiments, some clear patterns emerge.
The most surprising finding was how narrow the coverage of the total range was: even in the best case, a model produced fewer than 90% of the possible values in its range.
Another interesting point: larger models appear more consistent across different input ranges, as the convergence graphs show. This suggests that bigger models not only give more stable answers but also hold their behavior steady across the three prompt variations.