A troubling pattern is emerging in AI deployments across the industry.
Engineers who would never expose a database to the public internet are serving LLM inference endpoints with nothing but a static Bearer token protecting them. Security reviews focus on "does it hallucinate?" instead of "can it execute arbitrary commands?"
AI models are not opaque utilities. They are untrusted code execution engines. This distinction matters.
If you are deploying LLMs in production today, you are likely vulnerable to attacks that traditional web application firewalls cannot detect. Here is how to address these risks.
Traditional application security is deterministic. A SQL injection payload either works or it does not. AI attacks are probabilistic—they succeed intermittently, which makes them difficult to reproduce and test.
Your model represents significant investment in compute and data. Attackers do not need to breach your storage to steal it; they can query it repeatedly to train a surrogate model on your outputs.
The Fix: Entropy-Based Query Analysis
Rate limiting alone is insufficient. A sophisticated attacker will stay under your request limits. You need to detect systematic exploration of your model's capabilities.
Legitimate users ask specific, clustered questions. Attackers systematically probe the embedding space. We can detect this by measuring the spatial distribution of incoming queries.
from collections import deque import numpy as np from sklearn.decomposition import PCA class ExtractionDetector: def __init__(self, window_size=1000): # Keep a rolling buffer of user query embeddings self.query_buffer = deque(maxlen=window_size) self.entropy_threshold = 0.85 def check_query(self, user_id: str, query_embedding: np.ndarray) -> bool: self.query_buffer.append({'user': user_id, 'embedding': query_embedding}) # If a user's queries are uniformly distributed across the vector space, # this indicates automated probing rather than organic usage. user_queries = [q for q in self.query_buffer if q['user'] == user_id] if len(user_queries) < 50: return True embeddings = np.array([q['embedding'] for q in user_queries]) coverage = self._calculate_spatial_coverage(embeddings) if coverage > self.entropy_threshold: self._ban_user(user_id) return False return True def _calculate_spatial_coverage(self, embeddings: np.ndarray) -> float: # Use PCA to measure how much of the latent space the queries cover pca = PCA(n_components=min(10, embeddings.shape[1])) reduced = pca.fit_transform(embeddings) variances = np.var(reduced, axis=0) return float(np.std(variances) / (np.mean(variances) + 1e-10))
If you concatenate user input directly into a prompt template like f"Summarize this: {user_input}", you are vulnerable.
There is no such thing as secure system instructions. The model does not understand authority; it only predicts the next token.
The Fix: Input Isolation and Classification
A vision model can be manipulated by changing a few pixels. A text model can be manipulated with invisible unicode characters.
The Fix: Adversarial Training
If you are not running adversarial training, your model is vulnerable to input perturbation attacks.
# The Fast Gradient Sign Method (FGSM) implementation import torch import torch.nn.functional as F def adversarial_training_step(model, optimizer, x, y, epsilon=0.01): model.train() # 1. Create a copy of the input that tracks gradients x_adv = x.clone().detach().requires_grad_(True) output = model(x_adv) loss = F.cross_entropy(output, y) loss.backward() # 2. Add noise in the direction that maximizes loss perturbation = epsilon * x_adv.grad.sign() x_adv = torch.clamp(x + perturbation, 0, 1).detach() # 3. Train the model to resist this perturbation optimizer.zero_grad() loss_clean = F.cross_entropy(model(x), y) loss_adv = F.cross_entropy(model(x_adv), y) (loss_clean + loss_adv).backward() optimizer.step()
Validate your defenses before deploying to production.
CI/CD Integration: Configure your pipeline to fail if Garak detects a vulnerability.
The industry is moving from chatbots to agents models that can write and execute code. This significantly expands the attack surface.
Consider an agent with code execution permissions. An attacker sends an email containing:
The agent may execute this code and exfiltrate environment variables.
Defense in Depth for Agents:
DELETE, SEND_EMAIL, TRANSFER_FUNDS) must require human approval.AI security is an emerging discipline. The patterns described here represent foundational controls, not comprehensive solutions.
Treat your models as untrusted components. Validate their inputs, sanitize their outputs, and enforce the principle of least privilege. Do not grant models elevated permissions without strong isolation boundaries.
\

