In today’s fast-moving world of AI and large language models (LLMs), I’ve learned that one of the most valuable skills is not just understanding what these models can do but knowing how to guide them effectively. As I’ve spent time building applications, conducting research, and experimenting with different prompts, I’ve realized that real progress comes from learning how to control the generation process.
In this blog, I want to share seven generation control techniques that have made a real difference in how I work with AI and that every practitioner, researcher, or enthusiast can benefit from.
Temperature is perhaps the most fundamental parameter for controlling AI generation. It controls the randomness of the model’s output by scaling the probability distribution over possible tokens.
Behind the scenes, language models output logits, unnormalized log-probabilities, one for each possible next token. The temperature T rescales these logits before the softmax turns them into a probability distribution:
p_i = exp(z_i / T) / Σ_j exp(z_j / T)
Where z_i is the logit for token i, T is the temperature, and p_i is the resulting probability of sampling token i.
Think of temperature as a “confidence dial”: values below 1 sharpen the distribution so the most probable tokens dominate, T = 1 leaves it unchanged, and values above 1 flatten it so less likely tokens get a real chance of being picked.
Here’s what happens under the hood:
import numpy as np

def temperature_sample(logits, temperature=1.0):
    # Step 1: Scale logits by temperature
    scaled_logits = logits / temperature

    # Step 2: Apply softmax (with numerical stability)
    exp_logits = np.exp(scaled_logits - np.max(scaled_logits))
    probs = exp_logits / np.sum(exp_logits)

    # Step 3: Sample from the distribution
    next_token = np.random.choice(len(probs), p=probs)
    return next_token
The numerical stability trick (subtracting max before exp) prevents overflow when dealing with large logit values.
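To see the effect, here's a small illustrative run of the function above with made-up logits (the numbers are only for demonstration): at a low temperature the samples almost always land on the highest-scoring token, while a high temperature spreads them across the vocabulary.

logits = np.array([2.0, 1.0, 0.5, -1.0])  # toy logits for a 4-token vocabulary

# Low temperature: samples concentrate on token 0
low_t_samples = [temperature_sample(logits, temperature=0.2) for _ in range(1000)]

# High temperature: samples spread across all tokens
high_t_samples = [temperature_sample(logits, temperature=2.0) for _ in range(1000)]

print(np.bincount(low_t_samples, minlength=4) / 1000)   # roughly [0.99, 0.01, 0.00, 0.00]
print(np.bincount(high_t_samples, minlength=4) / 1000)  # roughly [0.43, 0.26, 0.21, 0.10]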
Low temperature is perfect for tasks requiring consistency and precision:
# Example with low temperature
import openai

response = openai.ChatCompletion.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "What is the capital of France?"}],
    temperature=0.1
)
# Output: "The capital of France is Paris."
The model becomes highly deterministic, consistently choosing the most probable tokens.
Use cases: factual question answering, code generation, data extraction, and summarization, anywhere you want the same input to produce (nearly) the same output.
High temperature unleashes creativity and diverse outputs:
# Example with high temperature
response = openai.ChatCompletion.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Describe a sunset"}],
    temperature=0.9
)
# Output might vary each time:
# "The crimson orb melted into the horizon..."
# "Golden light spilled across the darkening sky..."
# "Fire painted the clouds as day surrendered to night..."
Each run produces notably different outputs as the model explores less probable but potentially more interesting token choices.
Use cases: creative writing, brainstorming, marketing copy, and dialogue generation, anywhere variety matters more than repeatability.
While temperature scales the entire probability distribution, top-p and top-k are truncation methods that eliminate low-probability tokens before sampling. They provide different ways to control output quality and diversity.
Top-k sampling keeps only the k most probable tokens and redistributes their probability mass.
How it works:
import torch
import torch.nn.functional as F

def top_k_sampling(logits, k=50, temperature=1.0):
    """
    Top-k sampling implementation

    Args:
        logits: [vocab_size] tensor of unnormalized scores
        k: number of top tokens to keep
        temperature: temperature scaling factor

    Returns:
        sampled token index
    """
    # Step 1: Apply temperature
    logits = logits / temperature

    # Step 2: Get top-k logits and their indices
    top_k_logits, top_k_indices = torch.topk(logits, k)

    # Step 3: Apply softmax to top-k logits only
    top_k_probs = F.softmax(top_k_logits, dim=-1)

    # Step 4: Sample from top-k distribution
    sampled_index = torch.multinomial(top_k_probs, num_samples=1)

    # Step 5: Map back to original vocabulary index
    token = top_k_indices[sampled_index]
    return token
Let’s say we have a vocabulary of 8 tokens:
tokens = ['the', 'a', 'is', 'very', 'quite', 'extremely', 'somewhat', 'rather']
logits = [5.0, 4.5, 3.2, 2.8, 1.5, 0.8, 0.3, -0.5]
# After softmax (temperature = 1.0)
probs = [0.515, 0.312, 0.085, 0.057, 0.016, 0.008, 0.005, 0.002]
With top-k = 3:
# Step 1: Select top-3 tokens
top_k_tokens = ['the', 'a', 'is']
top_k_probs = [0.515, 0.312, 0.085]
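After truncation the kept probabilities no longer sum to 1, so they are renormalized before sampling. A quick sketch of that step with the numbers above (values rounded):

top_k_probs = np.array([0.515, 0.312, 0.085])
renormalized = top_k_probs / top_k_probs.sum()
print(renormalized)  # roughly [0.565, 0.342, 0.093]; sampling now happens over just these 3 tokens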
Top-p (also called nucleus sampling) keeps the smallest set of tokens whose cumulative probability ≥ p.
How it works:
def top_p_sampling(logits, p=0.9, temperature=1.0):
    """
    Top-p (nucleus) sampling implementation

    Args:
        logits: [vocab_size] tensor of unnormalized scores
        p: cumulative probability threshold (0 < p ≤ 1)
        temperature: temperature scaling factor

    Returns:
        sampled token index
    """
    # Step 1: Apply temperature and softmax
    logits = logits / temperature
    probs = F.softmax(logits, dim=-1)

    # Step 2: Sort probabilities in descending order
    sorted_probs, sorted_indices = torch.sort(probs, descending=True)

    # Step 3: Calculate cumulative probabilities
    cumsum_probs = torch.cumsum(sorted_probs, dim=-1)

    # Step 4: Find the nucleus (tokens to keep)
    # Remove tokens where cumsum > p (keep first token that exceeds p)
    sorted_indices_to_remove = cumsum_probs > p

    # Shift right to keep the first token that exceeds p
    sorted_indices_to_remove[1:] = sorted_indices_to_remove[:-1].clone()
    sorted_indices_to_remove[0] = False

    # Step 5: Set removed token probabilities to 0
    sorted_probs[sorted_indices_to_remove] = 0.0

    # Step 6: Renormalize
    sorted_probs = sorted_probs / sorted_probs.sum()

    # Step 7: Sample from the nucleus
    sampled_sorted_index = torch.multinomial(sorted_probs, num_samples=1)

    # Step 8: Map back to original vocabulary
    token = sorted_indices[sampled_sorted_index]
    return token
tokens = ['the', 'a', 'is', 'very', 'quite', 'extremely', 'somewhat', 'rather']
probs = [0.515, 0.312, 0.085, 0.057, 0.016, 0.008, 0.005, 0.002]

With top-p = 0.9:

# Step 1: Sort by probability (already sorted)
# Step 2: Calculate cumulative sum
cumulative = [0.515, 0.827, 0.912, 0.969, 0.985, 0.993, 0.998, 1.000]
# cumulative[2] = 0.912 > 0.9 ← Stop here!
# Nucleus = ['the', 'a', 'is']

With top-p = 0.75:

# cumulative[1] = 0.827 > 0.75 ← Stop here!
# Nucleus = ['the', 'a']  (a smaller threshold keeps a smaller nucleus)
Top-k = 4 (fixed cutoff):

███████████████ the       (40%) ← Keep
██████████      a         (25%) ← Keep
████            is        (10%) ← Keep
███             very       (8%) ← Keep
--              quite      (7%) ← Discard (not in top-4)
--              extremely  (5%) ← Discard
--              somewhat   (3%) ← Discard
--              rather     (2%) ← Discard

Top-p = 0.9 (adaptive cutoff):

███████████████ the       (40%) ← Keep
██████████      a         (25%) ← Keep
████            is        (10%) ← Keep
███             very       (8%) ← Keep
--              quite      (7%) ← Keep (cumulative reaches 90% here)
--              extremely  (5%) ← Discard (threshold already met)
--              somewhat   (3%) ← Discard
--              rather     (2%) ← Discard
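In practice you rarely implement these samplers yourself; most libraries expose them as generation parameters. As a rough sketch, here is how the same knobs appear in Hugging Face transformers (the model name and prompt are just placeholders):

from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")          # placeholder model
model = AutoModelForCausalLM.from_pretrained("gpt2")

inputs = tokenizer("The weather today is", return_tensors="pt")
outputs = model.generate(
    **inputs,
    do_sample=True,     # enable sampling instead of greedy decoding
    temperature=0.8,    # scale the logits
    top_k=50,           # keep only the 50 most probable tokens
    top_p=0.9,          # ...then keep the smallest set with cumulative prob ≥ 0.9
    max_new_tokens=30,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))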
Effective prompts are the foundation of controlled generation. The way you structure your prompts directly impacts the quality and relevance of outputs.
Bad: "Tell me about dogs"
Good: "Write a 200-word informative paragraph about dog training techniques for puppies, focusing on positive reinforcement methods."
Prompt: "You are an expert data scientist with 10 years of experience.
Explain gradient descent in simple terms for a beginner."
Prompt: "List the top 5 programming languages for beginners.
Format your response as:
1. [Language]: [Brief description]
2. [Language]: [Brief description]
..."
Prompt: "Write a product review for a smartphone. Requirements:
- Exactly 150 words
- Include both pros and cons
- Mention battery life, camera, and performance
- Use a neutral tone"
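These patterns are easy to wrap in a small helper so the constraints stay consistent across requests. A minimal sketch (the function and constraint wording are illustrative, not a standard API):

def build_review_prompt(product: str, word_count: int, aspects: list[str]) -> str:
    """Assemble a constraint-based prompt like the one above."""
    aspect_list = ", ".join(aspects)
    return (
        f"Write a product review for a {product}. Requirements:\n"
        f"- Exactly {word_count} words\n"
        f"- Include both pros and cons\n"
        f"- Mention {aspect_list}\n"
        f"- Use a neutral tone"
    )

prompt = build_review_prompt("smartphone", 150, ["battery life", "camera", "performance"])
print(prompt)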
Few-shot learning involves providing examples within your prompt to guide the model’s behavior. This technique is incredibly powerful for establishing patterns and desired output formats.
Prompt: "Classify the sentiment of these reviews:
Review: 'This product exceeded my expectations!'
Sentiment: Positive
Review: 'Terrible quality, waste of money.'
Sentiment: Negative
Review: 'It's okay, nothing special.'
Sentiment: Neutral
Review: 'I love this new feature update!'
Sentiment: ?"
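The same few-shot pattern can also be expressed as alternating user/assistant turns when calling a chat API, which often steers the model even more reliably than a single block of text. A rough sketch (message content taken from the example above; the system instruction is my own addition):

few_shot_examples = [
    ("This product exceeded my expectations!", "Positive"),
    ("Terrible quality, waste of money.", "Negative"),
    ("It's okay, nothing special.", "Neutral"),
]

messages = [{"role": "system", "content": "Classify the sentiment of each review as Positive, Negative, or Neutral."}]
for review, sentiment in few_shot_examples:
    messages.append({"role": "user", "content": f"Review: '{review}'"})
    messages.append({"role": "assistant", "content": f"Sentiment: {sentiment}"})

# The new review to classify goes in last
messages.append({"role": "user", "content": "Review: 'I love this new feature update!'"})
# messages can now be passed to the chat completion call shown earlier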
Prompt: "Convert natural language to Python functions:
Input: 'Create a function that adds two numbers'
Output:
def add_numbers(a, b):
    return a + b
Input: 'Create a function that finds the maximum in a list'
Output:
def find_maximum(numbers):
    return max(numbers)
Input: 'Create a function that reverses a string'
Output: ?"
Benefits of few-shot learning: it establishes the output pattern and format without any fine-tuning, reduces ambiguity about what you want, and usually makes responses more consistent across runs than zero-shot prompts.
In-context learning leverages the model’s ability to understand and apply new information provided within the conversation context, without updating the model’s parameters.
Prompt: "I'm working with a specific dataset format:
{
  'customer_id': 12345,
  'purchase_date': '2024-01-15',
  'items': ['laptop', 'mouse'],
  'total': 899.99
}
Based on this format, generate 3 sample customer records for an electronics store."
Conversation Context:
User: "I'm building a React application for a food delivery service."
AI: "Great! What specific functionality are you looking to implement?"
User: "I need help with the cart component."
AI: [Provides React-specific cart component code tailored to food delivery]
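With chat APIs, this kind of context is carried simply by resending the conversation history on each call: every prior turn is appended to the messages list, so the model's next reply stays anchored to the food-delivery React project. A minimal sketch reusing the openai client from earlier (the helper name is illustrative):

conversation = [
    {"role": "user", "content": "I'm building a React application for a food delivery service."},
    {"role": "assistant", "content": "Great! What specific functionality are you looking to implement?"},
]

def ask(user_message: str) -> str:
    """Append the user turn, call the model, and keep the reply in context."""
    conversation.append({"role": "user", "content": user_message})
    response = openai.ChatCompletion.create(model="gpt-4o", messages=conversation)
    reply = response.choices[0].message["content"]
    conversation.append({"role": "assistant", "content": reply})
    return reply

print(ask("I need help with the cart component."))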
Chain-of-Thought (CoT) prompting encourages the model to show its reasoning process, leading to more accurate and explainable outputs.
Prompt: "Solve this step by step:
A store has 24 apples. They sell 8 apples in the morning and 6 apples in the afternoon. How many apples are left?
Let me work through this step by step:
1) Starting apples: 24
2) Sold in morning: 8
3) Sold in afternoon: 6
4) Total sold: 8 + 6 = 14
5) Remaining: 24 - 14 = 10
Therefore, 10 apples are left."
Prompt: "A company's revenue increased by 20% in Q1 and decreased by 10% in Q2. If they started with $100,000, what's their revenue at the end of Q2? Let's think step by step."
Prompt: "Analyze whether this business model is sustainable:
Business: Subscription-based meal delivery service
- Monthly fee: $50
- Food cost per meal: $8
- Delivery cost per meal: $3
- 20 meals per month per subscriber
Let's break this down step by step:"
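The arithmetic the model should surface in its step-by-step answer is worth checking yourself; with these numbers, per-subscriber costs far exceed the monthly fee:

monthly_fee = 50
food_cost_per_meal = 8
delivery_cost_per_meal = 3
meals_per_month = 20

cost_per_subscriber = (food_cost_per_meal + delivery_cost_per_meal) * meals_per_month
margin = monthly_fee - cost_per_subscriber
print(cost_per_subscriber, margin)  # 220, -170  -> loses $170 per subscriber per month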
When to use chain-of-thought: multi-step math or logic problems, planning and analysis tasks, debugging and troubleshooting, and any situation where you need to audit how the model reached its answer.
Hallucinations, cases where AI models generate false or nonsensical information, are a significant challenge. Here are strategies to minimize them:
Prompt: "Based ONLY on the following text, answer the question:
Text: [Insert specific source material]
Question: [Your question]
If the answer cannot be found in the provided text, respond with 'Information not available in the source.'"
Prompt: "Answer the following question and indicate your confidence level (High/Medium/Low):
Question: What is the population of Tokyo in 2024?
Answer: [Response]
Confidence: [Level]
Reasoning: [Why this confidence level]"
Prompt: "Claim: 'Python was created in 1995 by Guido van Rossum'
Please verify this claim step by step:
1. Check the creation year
2. Verify the creator
3. Provide the correct information if any part is wrong
4. Rate the accuracy: Correct/Partially Correct/Incorrect"
Prompt: "Write a summary about renewable energy trends. For each major claim, indicate what type of source would be needed to verify it (e.g., 'government report', 'academic study', 'industry survey')."
(You can also use retrieval-augmented generation (RAG) for this 😃; a rough sketch follows.)
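As a minimal sketch of the RAG idea, assuming a hypothetical search_documents helper for retrieval (any vector store or keyword index would do), the retrieved text is simply pasted into a grounded prompt like the one above:

def search_documents(query: str, k: int = 3) -> list[str]:
    """Hypothetical retrieval step: return the k most relevant text snippets."""
    # In a real system this would query a vector store or search index;
    # here it just returns placeholders so the sketch runs end to end.
    return ["<retrieved snippet 1>", "<retrieved snippet 2>"]

def grounded_answer(question: str) -> str:
    context = "\n\n".join(search_documents(question))
    prompt = (
        "Based ONLY on the following text, answer the question:\n\n"
        f"Text: {context}\n\n"
        f"Question: {question}\n\n"
        "If the answer cannot be found in the provided text, "
        "respond with 'Information not available in the source.'"
    )
    response = openai.ChatCompletion.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
        temperature=0.1,   # keep the answer grounded and deterministic
    )
    return response.choices[0].message["content"]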
The real power comes from combining these techniques strategically:
Prompt: "You are a research assistant helping with academic writing.
Temperature: 0.3 (for accuracy)
Task: Summarize the key findings about machine learning bias from the following paper excerpt.
Follow this format:
1. Main Finding: [One sentence]
2. Supporting Evidence: [Key statistics or examples]
3. Implications: [What this means for practitioners]
4. Confidence: [High/Medium/Low based on source quality]
Paper Excerpt: [Insert text]
Think through this step by step, and only include information directly supported by the text."
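Wired into an API call, that combined prompt might look roughly like this (the system/user split and the excerpt variable are just one way to arrange it):

paper_excerpt = "..."  # insert the paper excerpt text here

response = openai.ChatCompletion.create(
    model="gpt-4o",
    temperature=0.3,  # low temperature for accuracy
    messages=[
        {"role": "system", "content": "You are a research assistant helping with academic writing."},
        {"role": "user", "content": (
            "Summarize the key findings about machine learning bias from the following paper excerpt.\n"
            "Follow this format:\n"
            "1. Main Finding: [One sentence]\n"
            "2. Supporting Evidence: [Key statistics or examples]\n"
            "3. Implications: [What this means for practitioners]\n"
            "4. Confidence: [High/Medium/Low based on source quality]\n\n"
            f"Paper Excerpt: {paper_excerpt}\n\n"
            "Think through this step by step, and only include information directly supported by the text."
        )},
    ],
)
print(response.choices[0].message["content"])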
Mastering generation control is essential for anyone working with AI models. By understanding and applying these seven techniques (temperature, top-k and top-p sampling, prompt engineering, few-shot learning, in-context learning, chain-of-thought prompting, and hallucination prevention) you can dramatically improve the quality, reliability, and usefulness of AI-generated content.
Thank you for reading! 🤗 I hope you found this article both informative and enjoyable. (Leave a comment if you've built any async agent applications lately; I'd love to hear about them 🙂)
For more information like this, follow me on LinkedIn.

