Prompt Versioning in Production: How We Stopped Shipping Prompt Changes Blind
Early on at HyrecruitAI, we shipped a prompt change on a Friday afternoon. By Monday, our interview evaluation scores had drifted—some candidates who should have scored 7/10 were getting 4s, others were getting inflated 9s. We had no idea which prompt version caused it, no rollback mechanism, and no way to compare outputs across versions. We had treated prompts like config strings instead of first-class production artifacts.
That Friday incident cost us three days of debugging and a difficult conversation with an early enterprise customer. Here's how we fixed it.
The Problem
At HyrecruitAI, prompts are not static. We have prompts for:
- Interview question generation — based on job description, experience level, and role
- Answer evaluation — scoring candidate responses on rubrics like relevance, depth, communication
- Follow-up probing — generating contextual follow-up questions mid-interview
- Feedback synthesis — summarizing a full interview into a structured hiring recommendation
Each of these prompts evolved rapidly. In our first three months, the evaluation prompt alone went through 40+ iterations. The core problems were:
- No diff history. Prompts lived in environment variables or hardcoded strings inside service files. git blame could tell us who changed it, not why.
- No A/B testing. We couldn't run two prompt versions against the same input and compare outputs systematically.
- No rollback. When something broke, we'd scramble to remember what the prompt looked like two weeks ago.
- No observability. We couldn't correlate a specific evaluation result with the exact prompt that generated it.
The solution was to treat prompts like database migrations: versioned, auditable, and deployable independently of code.
The Solution: Prompt Registry with Versioned Slots
We built a lightweight Prompt Registry — a database-backed service that stores prompt versions, exposes an API for resolving the active version per slot, and logs every LLM call with a prompt version ID.
Here's the core schema using Drizzle ORM:
// packages/db/src/schema/prompt.ts
import { pgTable, text, timestamp, integer, boolean, uuid } from "drizzle-orm/pg-core";

export const promptSlots = pgTable("prompt_slots", {
  id: uuid("id").primaryKey().defaultRandom(),
  slug: text("slug").notNull().unique(), // e.g. "eval-answer", "question-gen"
  description: text("description"),
  createdAt: timestamp("created_at").defaultNow(),
});

export const promptVersions = pgTable("prompt_versions", {
  id: uuid("id").primaryKey().defaultRandom(),
  slotId: uuid("slot_id").references(() => promptSlots.id),
  version: integer("version").notNull(),
  body: text("body").notNull(), // The actual prompt text
  isActive: boolean("is_active").default(false),
  activatedAt: timestamp("activated_at"),
  activatedBy: text("activated_by"), // User ID or "system"
  changelog: text("changelog"), // Why this version was created
  createdAt: timestamp("created_at").defaultNow(),
});

export const promptCallLog = pgTable("prompt_call_log", {
  id: uuid("id").primaryKey().defaultRandom(),
  slotId: uuid("slot_id").references(() => promptSlots.id),
  versionId: uuid("version_id").references(() => promptVersions.id),
  inputHash: text("input_hash"), // SHA-256 of variables injected
  outputHash: text("output_hash"), // SHA-256 of LLM response
  latencyMs: integer("latency_ms"),
  model: text("model"),
  tokensUsed: integer("tokens_used"),
  createdAt: timestamp("created_at").defaultNow(),
});
The registry exposes a simple resolution function:
// packages/lib/src/prompt-registry.ts
import { db } from "@hyrecruitai/db";
import { promptSlots, promptVersions } from "@hyrecruitai/db/schema";
import { eq, and } from "drizzle-orm";

export type PromptVariables = Record<string, string | number>;

export async function resolvePrompt(
  slug: string,
  variables: PromptVariables
): Promise<{ body: string; versionId: string }> {
  const result = await db
    .select({
      body: promptVersions.body,
      versionId: promptVersions.id,
    })
    .from(promptVersions)
    .innerJoin(promptSlots, eq(promptSlots.id, promptVersions.slotId))
    .where(
      and(
        eq(promptSlots.slug, slug),
        eq(promptVersions.isActive, true)
      )
    )
    .limit(1)
    .then((rows) => rows[0]);

  if (!result) {
    throw new Error(`No active prompt found for slot: ${slug}`);
  }

  const rendered = interpolate(result.body, variables);
  return { body: rendered, versionId: result.versionId };
}

function interpolate(template: string, vars: PromptVariables): string {
  return template.replace(/\{\{(\w+)\}\}/g, (_, key) => {
    if (!(key in vars)) throw new Error(`Missing prompt variable: ${key}`);
    return String(vars[key]);
  });
}
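The interpolation contract is worth pinning down: {{var}} placeholders are substituted, and a missing variable throws instead of silently sending a literal {{answer}} to the model. A standalone sketch of that behavior (the template text here is illustrative):

```typescript
// Same contract as the registry's interpolate(): replace {{var}} placeholders,
// fail loudly on anything the caller forgot to supply.
type PromptVariables = Record<string, string | number>;

function interpolate(template: string, vars: PromptVariables): string {
  return template.replace(/\{\{(\w+)\}\}/g, (_, key: string) => {
    if (!(key in vars)) throw new Error(`Missing prompt variable: ${key}`);
    return String(vars[key]);
  });
}

const template = "Score this answer on {{rubric}} (1-{{max}}): {{answer}}";
const rendered = interpolate(template, { rubric: "depth", max: 10, answer: "..." });
console.log(rendered); // "Score this answer on depth (1-10): ..."
```

Failing fast here matters: a prompt that reaches the model with a raw `{{answer}}` placeholder produces plausible-looking garbage that is much harder to catch downstream.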
All LLM calls now go through a wrapper that logs the prompt version alongside latency and token usage:
// packages/lib/src/llm-client.ts
import { resolvePrompt } from "./prompt-registry";
import { logPromptCall } from "./prompt-logger";
import { createHash } from "crypto";
import Anthropic from "@anthropic-ai/sdk";

const client = new Anthropic();

export async function callLLM(
  slug: string,
  variables: Record<string, string | number>,
  options: { model?: string; maxTokens?: number } = {}
) {
  const { body, versionId } = await resolvePrompt(slug, variables);
  const model = options.model ?? "claude-sonnet-4-6";

  const start = Date.now();
  const response = await client.messages.create({
    model,
    max_tokens: options.maxTokens ?? 1024,
    messages: [{ role: "user", content: body }],
  });
  const latencyMs = Date.now() - start;

  const content = response.content[0].type === "text"
    ? response.content[0].text
    : "";

  await logPromptCall({
    slug,
    versionId,
    inputHash: sha256(JSON.stringify(variables)),
    outputHash: sha256(content),
    latencyMs,
    model,
    tokensUsed: response.usage.input_tokens + response.usage.output_tokens,
  });

  return { content, versionId };
}

const sha256 = (s: string) =>
  createHash("sha256").update(s).digest("hex");
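One subtlety in the wrapper above: JSON.stringify is sensitive to key order, so logically identical variable sets can produce different input hashes. If you want the input hash to deduplicate or correlate calls reliably, sort keys before hashing. A sketch of that fix (canonicalHash is a hypothetical helper, not part of the wrapper shown above):

```typescript
import { createHash } from "crypto";

// Sort keys before serializing so { a, b } and { b, a } hash identically.
function canonicalHash(vars: Record<string, string | number>): string {
  const sorted = Object.fromEntries(
    Object.entries(vars).sort(([a], [b]) => a.localeCompare(b))
  );
  return createHash("sha256").update(JSON.stringify(sorted)).digest("hex");
}

console.log(canonicalHash({ a: 1, b: 2 }) === canonicalHash({ b: 2, a: 1 })); // true
```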
The Iteration: What Failed First
Attempt 1: Plain environment variables. We started with EVAL_PROMPT_V1, EVAL_PROMPT_V2 env vars toggled via feature flags. This broke immediately — the prompt text was too long for env vars in some CI environments, and we had no audit trail of who activated what when.
Attempt 2: Versioned files in Git. We stored prompts in prompts/eval/v1.txt, prompts/eval/v2.txt and loaded the active version via a config key. This was better for diffing, but required a deploy to activate a new version. Prompts were tied to code deploys again.
Attempt 3: Database registry (current). Moving to a database table gave us:
- Activation without a code deploy (toggle via admin panel or CLI)
- Full audit trail with activatedBy and activatedAt
- Correlation between every LLM call and the exact prompt version that produced it
- Rollback in under 30 seconds (flip isActive via admin or migration)
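The activation rule itself is simple: exactly one active version per slot, and activating a version deactivates its sibling in the same step. In production this is a single UPDATE inside a transaction; here is an in-memory model of the invariant (types and IDs are illustrative):

```typescript
// In-memory model of prompt_versions activation: flipping one version's
// isActive implies flipping off every other version in the same slot.
interface VersionRow {
  id: string;
  slotId: string;
  isActive: boolean;
}

function activate(rows: VersionRow[], versionId: string): VersionRow[] {
  const target = rows.find((r) => r.id === versionId);
  if (!target) throw new Error(`Unknown prompt version: ${versionId}`);
  return rows.map((r) =>
    r.slotId === target.slotId
      ? { ...r, isActive: r.id === versionId } // the whole slot flips atomically
      : r
  );
}

const before: VersionRow[] = [
  { id: "v11", slotId: "eval-answer", isActive: true },
  { id: "v12", slotId: "eval-answer", isActive: false },
];
const after = activate(before, "v12");
console.log(after.map((r) => `${r.id}:${r.isActive}`).join(" ")); // "v11:false v12:true"
```

Rollback is the same operation run in reverse: activate the previous version and the invariant takes care of deactivating the bad one.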
We added one more layer: a shadow testing mode. Before activating a new prompt version, we can run it in shadow mode — it executes alongside the active version on real traffic, logs both outputs, but only returns the active version's response to the user. We compare outputs async to catch regressions before going live.
// Shadow mode comparison (runs async, doesn't block response)
export async function callLLMWithShadow(
  slug: string,
  variables: Record<string, string | number>,
  shadowVersionId?: string
) {
  const primary = callLLM(slug, variables);
  if (shadowVersionId) {
    // Fire shadow call without awaiting — once both settle, log the pair
    // for comparison. Note we compare resolved results, not the promise.
    Promise.all([callLLMWithVersion(slug, shadowVersionId, variables), primary])
      .then(([shadow, primaryResult]) =>
        compareShadowOutputs(primaryResult, shadow, slug)
      )
      .catch(() => {}); // Shadow failures never surface to user
  }
  return primary;
}
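compareShadowOutputs can start very cheap. A useful first-pass metric is simply the fraction of shadow calls whose output differs from the primary at all; anything above a threshold earns human review before activation. A sketch of that idea (types and names are illustrative, not our actual comparison code):

```typescript
// One logged primary/shadow pair for the same input.
interface ShadowPair {
  input: string;
  primaryOut: string;
  shadowOut: string;
}

// Fraction of calls where the candidate version produced different output.
function divergenceRate(pairs: ShadowPair[]): number {
  if (pairs.length === 0) return 0;
  const diverged = pairs.filter((p) => p.primaryOut !== p.shadowOut).length;
  return diverged / pairs.length;
}

const sample: ShadowPair[] = [
  { input: "q1", primaryOut: "7/10", shadowOut: "7/10" },
  { input: "q2", primaryOut: "6/10", shadowOut: "4/10" },
];
console.log(divergenceRate(sample)); // 0.5
```

For evaluation prompts, a richer comparison (score deltas rather than raw string equality) is what actually catches distribution shifts, but the exact-match rate is a sensible alarm to wire up first.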
Architecture / Flow Diagram
┌────────────────────────────────────────────┐
│ Interview Service │
│ │
│ callLLM("eval-answer", { answer, rubric }) │
│ │ │
└───────────────┼────────────────────────────┘
│
▼
┌────────────────────────────────────────────┐
│ Prompt Registry │
│ │
│ promptSlots ──► promptVersions │
│ slug="eval-answer" │
│ version=12 (isActive=true) ──► body │
│ │ │
└───────────────┼────────────────────────────┘
│ interpolate(body, vars)
▼
┌────────────────────────────────────────────┐
│ Anthropic API │
│ model: claude-sonnet-4-6 │
│ latency: 1.2s / tokens: 840 │
└───────────────┬────────────────────────────┘
│
┌───────┴────────┐
▼ ▼
response prompt_call_log
returned (versionId=v12,
to caller inputHash, outputHash,
latencyMs, tokensUsed)
Activation path:
Admin UI ──► PATCH /api/prompts/:versionId/activate
──► db: isActive=true on v12, false on v11
──► next callLLM picks up v12
Learnings & Outcomes
After 6 weeks running the prompt registry in production:
- Rollback time: 3 days → 30 seconds. The Friday incident would have been resolved by flipping
isActiveon the previous version instead of hunting through git history. - Prompt iteration speed: 2x faster. Engineers can activate new prompt versions from the admin panel without waiting for a deploy. PM and non-engineers can now safely participate in prompt tuning.
- Regression detection: caught 3 regressions before they hit users using shadow mode. All three were evaluation prompt changes that subtly shifted scoring distributions.
- Token costs: cut roughly 40% on our worst slot. The call log revealed that one prompt slot was using 2x the tokens of an equivalent rewrite. We optimized it and cut $380/month in API costs.
- Audit trail: Every evaluation result now links to the exact prompt version that produced it. Customer disputes are resolved in minutes instead of "we're not sure what prompt ran then."
Suggestions for Engineers Building Similar Systems
Keep the schema simple. A slots table and a versions table is enough to start. Don't over-engineer a full workflow engine on day one.
Log everything at the call site. Input hash, output hash, latency, tokens — logging these from day one made all our retrospective analysis possible. We wish we'd had this data from the very first prompt.
Decouple activation from deployment. The biggest win isn't versioning — it's the ability to activate a new prompt without touching code. This alone changes how fast you can iterate.
Use changelogs on versions. Require a changelog field before a version can be activated. Even one sentence ("switched to chain-of-thought framing for rubric scoring") is invaluable when debugging a regression three weeks later.
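A guard like this (illustrative, not our exact admin code) is enough to enforce the rule:

```typescript
// Refuse to activate a version whose changelog is missing or blank.
function assertActivatable(version: { changelog: string | null }): void {
  if (!version.changelog || version.changelog.trim() === "") {
    throw new Error("A changelog entry is required before activation");
  }
}

assertActivatable({ changelog: "switched to chain-of-thought framing" }); // ok
```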
Shadow testing beats staging environments. Real traffic has edge cases staging doesn't. Running a new prompt version on shadow traffic for 24 hours before activation is the best regression signal we've found.
Don't let prompts grow unbounded. Set a soft limit on version count per slot (we use 20). Archive older versions to a cold table instead of deleting — you may want to re-examine them.
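The archival policy can be a one-liner: per slot, keep the active version plus the N most recent inactive ones, and move the rest to the cold table. An in-memory sketch of the selection (names and the 20-version budget split are illustrative):

```typescript
interface VersionMeta {
  version: number;
  isActive: boolean;
}

// With a soft limit of 20 per slot: 1 active + the 19 newest inactive stay
// hot; everything older is returned for archival.
function versionsToArchive(versions: VersionMeta[], keepInactive = 19): VersionMeta[] {
  return versions
    .filter((v) => !v.isActive)
    .sort((a, b) => b.version - a.version) // newest first
    .slice(keepInactive); // everything past the newest N inactive
}

const versions = Array.from({ length: 25 }, (_, i) => ({
  version: i + 1,
  isActive: i + 1 === 25, // v25 is the active one
}));
console.log(versionsToArchive(versions).length); // 5 (versions 1-5)
```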
Treating prompts as a first-class production artifact — versioned, logged, rollback-able, and activatable independently of code deploys — is one of the highest-leverage changes we've made to our AI infrastructure. The cost was two days of implementation. The payoff has been continuous.