Tracing AI requests: how to quickly find a failure in production
Article -> plan in AI
Paste this article URL into any AI and get an implementation plan for your project.
Read this article: https://vibecode.morecil.ru/en/dannye-i-khranenie/trassirovka-ai-zaprosov-kak-bystro-naiti-sboi-v-prode/
Work in my current project context.
Create an implementation plan for this stack:
1) what to change
2) which files to edit
3) risks and typical mistakes
4) how to verify everything works
If there are options, provide "quick" and "production-ready" variants.
How to use
- Copy this prompt and send it to your AI chat.
- Attach your project or open the repository folder in the AI tool.
- Ask for file-level changes, risks, and a quick verification checklist.
Introduction
Simple words
When an AI feature breaks in production, you usually only see the result: the user got an error, the bot hung, the task never finished. It is not clear where exactly things broke: in the API, in the queue, in the model, in the database, or in your own code.
This article is for a beginner who has already shipped a first AI scenario and now wants to stop fixing failures blindly. By the end, you will have a working minimum template: what to log, how to add a trace_id, how to link the steps of a single request, and how to quickly find the root cause.
How to do it in practice
Just remember the main principle: one user request = one trace identifier (trace_id) across all steps. Without it, you cannot say for sure where a failure occurred.
Mini difficulty ladder:
- Base: write structured logs in JSON.
- Working minimum: add trace_id and pass it through all calls.
- Strengthening: connect OpenTelemetry and view traces in a UI.
What to do right now:
- Select one critical scenario (e.g., “create an assistant response”).
- Check whether a single trace_id is present at each step.
- If not, add it as a requirement to your next task.
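The principle can be seen in two log lines. Below is a minimal sketch: the field names (trace_id, span, status) and the helper makeLogEntry are illustrative choices of ours, not a fixed standard.

```javascript
import { randomUUID } from "node:crypto";

// Build one structured log entry; every step of one request reuses the same
// trace_id, which is what lets you filter the whole request later.
function makeLogEntry(traceId, span, extra = {}) {
  return {
    timestamp: new Date().toISOString(),
    trace_id: traceId,
    span,
    ...extra,
  };
}

// Two steps of the same request, linked by one trace_id:
const traceId = randomUUID();
console.log(JSON.stringify(makeLogEntry(traceId, "request.in", { status: "ok" })));
console.log(JSON.stringify(makeLogEntry(traceId, "llm.call", { status: "ok", duration_ms: 1200 })));
```

Grepping your logs for that one trace_id now returns the full story of the request.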
In short: the essence in 5 points
Simple words
In short: observability for an AI scenario is built not around “beautiful dashboards” but around answering the question “why did this request fail or become slow”.
How to do it in practice
- Logs without structure are almost useless during an incident.
- Without trace_id, you cannot link the steps of a single request.
- First, cover only 3 points: the input, the model call, and recording the result.
- Set up 3 alerts: error growth, latency growth, timeout growth.
- Regularly review 5-10 real traces after each release.
Mini difficulty ladder:
- Base: readable logs with uniform fields.
- Working minimum: tracing and basic alerts.
- Strengthening: sampling, dashboards per prompt version, cost control.
What to do right now:
- Add the log field template to the ticket (trace_id, span, status, duration_ms).
- Assign someone responsible for trace quality.
Dictionary of terms
Simple words
Below is a short dictionary to avoid confusion in terms.
How to do it in practice
- Observability: the ability to understand the state of the system from logs, metrics, and traces.
- Trace: the complete chain of steps from input to result.
- Span: one individual step inside a trace, such as an LLM call or a database write.
- Trace ID: a unique trace number that links all the steps of a single request.
- Structured logs: JSON logs with a consistent set of fields.
- Latency: how long a step or the entire request took.
- Sampling: keeping only a portion of traces to reduce load and storage costs.
- Alert: an automatic notification when a metric crosses a threshold.
- SLO (service level objective): a pre-agreed quality goal, e.g. “95% of requests under 3 seconds”.
What to do right now:
- Make sure the entire team understands the terms trace and span in the same way.
- Add this dictionary to the service README.
Base and context: why AI scripts are hard to debug
Simple words
In a regular web service, a request often takes 1-2 steps. In an AI scenario, there are more: fetching context, calling the model, calling a tool, post-processing, recording the result. A delay or error can occur at any of them.
The beginner's problem is that logs exist but are scattered. One service writes to a file, another to stdout, and a third is silent altogether. This turns even a simple mistake into detective work.
How to do it in practice
Divide the request path into blocks and assign each a span name:
- request.in – HTTP/webhook request input;
- context.load – reading data from the database/cache;
- llm.call – request to the model;
- tool.call – an external API or internal tool;
- response.save – recording the result;
- request.out – returning the response.
Mini difficulty ladder:
- Base: write logs for each block.
- Working minimum: each block has trace_id and duration_ms.
- Strengthening: add model_name, prompt_version, retry_count, token_usage.
What to do right now:
- Draw the current request path in 5-7 steps.
- For each step, note where context is lost.
- Identify the 3 steps to cover first.
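The block names above can live as one shared constant so every service names spans identically. A sketch; the constant and helper names are our own suggestion:

```javascript
// One ordered list of span names, reused by every service in the scenario.
// The name SPAN_NAMES is our own choice, not a standard.
const SPAN_NAMES = [
  "request.in",
  "context.load",
  "llm.call",
  "tool.call",
  "response.save",
  "request.out",
];

// Quick sanity check for a finished trace: which expected steps never logged?
function missingSpans(trace) {
  const seen = new Set(trace.map((entry) => entry.span));
  return SPAN_NAMES.filter((name) => !seen.has(name));
}
```

Running missingSpans over a real trace immediately shows where context is being lost.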
Practical part by steps
Step 1. Enter a single log format
Simple words
If every developer writes logs however they like, you won't be able to search and filter events quickly. A single format removes the chaos.
How to do it in practice
Minimum fields of JSON log:
timestamp, level, service, trace_id, span, message, status, duration_ms
Example: Node.js + pino
npm i pino pino-http
import pino from "pino";

const logger = pino({ level: process.env.LOG_LEVEL || "info" });

// Log one step with the shared base fields; the caller adds trace_id, span, etc.
export function logStep(data) {
  logger.info({
    timestamp: new Date().toISOString(),
    service: "ai-gateway",
    ...data,
  });
}
Mini difficulty ladder:
- Base: a unified JSON format.
- Working minimum: mandatory fields are validated in code.
- Strengthening: the log schema is checked in CI.
What to do right now:
- Define the required log fields in docs/logging.md.
- Check that trace_id is never empty.
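One way to reach the "mandatory fields are validated in code" rung is a small validator in front of the logger. A sketch, assuming the field list named earlier; adjust REQUIRED to your own standard:

```javascript
// Required fields from the team's log standard (illustrative list).
const REQUIRED = ["timestamp", "level", "service", "trace_id", "span", "message", "status"];

// Collect every required field that is missing or empty, so a bad entry
// can be rejected (or flagged) before it reaches the log sink.
function validateLogEntry(entry) {
  const problems = REQUIRED.filter(
    (field) => entry[field] === undefined || entry[field] === ""
  );
  return { ok: problems.length === 0, problems };
}
```

In development you can throw on `!ok`; in production it is usually safer to log the violation and continue.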
Step 2. Pass trace_id through the entire scenario
Simple words
trace_id should be created once at the input and passed on to each call. Then you can always reconstruct the full picture.
How to do it in practice
- At the HTTP entry point, take traceparent from the header or generate a new trace_id.
- Pass it on to functions, background tasks, and external APIs.
- Return trace_id in error responses so that support can quickly find the trace.
Example: Express middleware
import { randomUUID } from "node:crypto";
export function traceMiddleware(req, res, next) {
const incoming = req.headers["x-trace-id"];
const traceId = typeof incoming === "string" && incoming ? incoming : randomUUID();
req.traceId = traceId;
res.setHeader("x-trace-id", traceId);
next();
}
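The middleware above can be split so the same resolution rule is reusable outside HTTP, in queue consumers and workers. A sketch; the helper name resolveTraceId is ours:

```javascript
import { randomUUID } from "node:crypto";

// One rule everywhere (HTTP, queues, workers): keep an incoming trace_id
// if it is a non-empty string, otherwise generate a new one.
function resolveTraceId(incoming) {
  return typeof incoming === "string" && incoming !== "" ? incoming : randomUUID();
}
```

A queue consumer would call `resolveTraceId(message.trace_id)` before processing, so re-delivered tasks keep their original trace.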
Mini difficulty ladder:
- Base: trace_id in the HTTP layer.
- Working minimum: trace_id in queues and workers.
- Strengthening: support for W3C traceparent.
What to do right now:
- Check that trace_id reaches background tasks.
- Add x-trace-id to API responses.
Step 3. Add tracing for LLM calls and tools
Simple words
Usually, the biggest delays and errors are in the model call and in external tools. These steps deserve separate spans.
How to do it in practice
What to record in llm.call:
- provider (who is answering: OpenAI, Anthropic, etc.);
- model;
- prompt_version (the prompt template version);
- duration_ms;
- status (ok, timeout, error);
- token_usage (if available).
What to record in tool.call:
- tool_name;
- http_status;
- retry_count;
- duration_ms;
- a brief error cause without sensitive data.
Important: do not log secrets, tokens, or personal data. If necessary, mask them.
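The fields listed above can be produced by one small timing wrapper around any model or tool call. A sketch under our own naming; the real LLM or tool call is injected as `fn`:

```javascript
// Wrap an async call and produce span fields: status, duration_ms, plus
// whatever metadata (provider, model, prompt_version...) the caller passes.
// The name tracedCall and the TimeoutError convention are our own choices.
async function tracedCall(span, meta, fn) {
  const started = Date.now();
  try {
    const result = await fn();
    return { span, ...meta, status: "ok", duration_ms: Date.now() - started, result };
  } catch (err) {
    const status = err.name === "TimeoutError" ? "timeout" : "error";
    // Keep only a brief, non-sensitive cause of the error.
    return { span, ...meta, status, duration_ms: Date.now() - started, error: String(err.message).slice(0, 200) };
  }
}
```

The returned object can go straight into your structured logger, so llm.call and tool.call always carry the same fields.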
Mini difficulty ladder:
- Base: separate spans for llm.call and tool.call.
- Working minimum: record the duration and status of each span.
- Strengthening: add prompt and agent versions.
What to do right now:
- Add 2 spans: llm.call and tool.call.
- Make sure there are no tokens or passwords in the logs.
Step 4. Connect OpenTelemetry and view trails
Simple words
OpenTelemetry is an open standard and a set of libraries for collecting telemetry: logs, metrics, and traces. It exists so you don't have to collect everything manually or lock yourself into one vendor.
How to do it in practice
Basic launch in Node.js:
npm i @opentelemetry/sdk-node @opentelemetry/auto-instrumentations-node
import { NodeSDK } from "@opentelemetry/sdk-node";
import { getNodeAutoInstrumentations } from "@opentelemetry/auto-instrumentations-node";
const sdk = new NodeSDK({
instrumentations: [getNodeAutoInstrumentations()],
});
sdk.start();
Next, send the traces to a convenient backend (such as Tempo, Jaeger, or Sentry) and make sure the entire request chain is displayed.
Mini difficulty ladder:
- Base: auto-collection of basic spans.
- Working minimum: manual spans for critical steps.
- Strengthening: custom attributes (release version, prompt, client).
What to do right now:
- Run one trace backend locally.
- Make sure the entire request is visible as a single trace.
Step 5. Set up alerts that are really useful
Simple words
Without alerts, you learn about problems from users. But too many alerts are also bad: the team stops responding. You need a working minimum.
How to do it in practice
Starting set of alerts:
- Errors: status=error above the threshold over 5 minutes.
- Latency: p95 response time above the threshold.
- External provider timeouts: growth of timeout in llm.call.
Example starting thresholds:
- error rate > 3% over 5 minutes;
- p95 > 6 seconds over 10 minutes;
- LLM timeouts > 2% over 10 minutes.
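The thresholds above boil down to a simple window check. In practice the metrics backend (e.g. Prometheus or Sentry) evaluates this, not your application code, but the logic fits in a few lines. A sketch with our own names:

```javascript
// The "error rate > 3% over 5 minutes" rule: look only at entries inside
// the time window, then compare the error share against the threshold.
function shouldAlert(entries, { windowMs = 5 * 60 * 1000, maxErrorRate = 0.03, now = Date.now() } = {}) {
  const recent = entries.filter((e) => now - e.timestamp <= windowMs);
  if (recent.length === 0) return false; // no traffic, nothing to alert on
  const errors = recent.filter((e) => e.status === "error").length;
  return errors / recent.length > maxErrorRate;
}
```

The `now` parameter is injected so the rule is easy to test; the same shape works for the p95 and timeout rules with different predicates.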
Mini difficulty ladder:
- Base: 1 alert for errors.
- Working minimum: 3 alerts (errors, latency, timeouts).
- Strengthening: individual thresholds per model and request type.
What to do right now:
- Configure at least one alert for error growth.
- Check who receives the notification and how.
Real use cases
Simple words
Below are three typical cases where tracing saves hours of manual parsing.
How to do it in practice
- A “500” error affects only some users. The trace shows a single crm.lookup tool failing on a rare input field format.
- A sharp rise in latency. The spans show it is not the LLM, but a slow database write after the model responds.
- Intermittent duplicate messages. trace_id reveals re-delivery of the task from the queue and a missing idempotency-key check.
Mini difficulty ladder:
- Base: find the problem in one trace.
- Working minimum: group similar traces by error.
- Strengthening: build a dashboard by incident class.
What to do right now:
- Walk through the latest production failure via a single trace.
- Note which field was missing for quick analysis.
Tools and technologies
Simple words
You don't have to adopt everything at once. To start, one logger, one trace collector, and one viewing interface are enough.
How to do it in practice
A beginner's working kit:
- pino or winston for structured logs in Node.js;
- the OpenTelemetry SDK for collecting spans;
- Jaeger/Tempo/Sentry for viewing traces;
- PostgreSQL or ClickHouse for storing aggregated logs.
How to choose simply:
- If you need a quick local start: Jaeger.
- If you already have Grafana: Tempo.
- If you need a product with ready-made alerts and an issue workflow: Sentry.
Mini difficulty ladder:
- Base: one trace visualization tool.
- Working minimum: logs + traces + 3 alerts.
- Strengthening: a single dashboard for AI scenario quality.
What to do right now:
- Pick one trace backend and fix it as the team standard.
- Don't change the stack until you've worked through 2-3 real incidents.
Comparative table of approaches
Simple words
The table helps to choose the approach according to the maturity of the team, not the fashion.
How to do it in practice
| Approach | What you see | What you don't see | When it fits |
|---|---|---|---|
| Text logs only | Individual errors and messages | The full request path | Day one, very small project |
| Logs + trace_id | Linked steps of a single request | Detailed latency visualization | Working minimum for most teams |
| OpenTelemetry + trace backend | Full path, bottlenecks, problem spans | Business context beyond your custom fields | Production and regular releases |
| Full observability stack (logs + traces + metrics + alerts) | System state in real time | Nothing critical, if configured correctly | Team with constant load and an SLA |
What to do right now:
- Honestly note where you are in the table.
- Your next step is to reach the “Logs + trace_id” level.
Implementation checklist
Simple words
The checklist is needed so as not to miss basic things and not get stuck in theory.
How to do it in practice
- There is a single JSON log format.
- trace_id is created or received at the input.
- trace_id passes through the API, the queue, and the worker.
- llm.call records status and duration_ms.
- tool.call records http_status and retry_count.
- No secrets leak into the logs.
- Alerts exist for errors and latency.
- The team can find the cause of an incident from a single trace.
Mini difficulty ladder:
- Base: the first 4 points.
- Working minimum: the first 7 points.
- Strengthening: all points + a regular trace review after each release.
What to do right now:
- Go through the checklist on one service.
- Mark 2 gaps and set a deadline to fix them.
Typical errors and how to fix them
Simple words
Mistakes are repeated by almost all teams. The good news is that they can be closed with simple rules.
How to do it in practice
- Mistake: logging only "an error occurred" without context. Fix: add trace_id, span, duration_ms, status.
- Mistake: trace_id exists in the API but gets lost in the queue. Fix: send trace_id as a mandatory message field.
- Mistake: logging full prompts and personal data. Fix: mask sensitive fields and store only the needed metadata.
- Mistake: alerts are too noisy. Fix: start with 3 alerts and tune the thresholds to the actual load.
- Mistake: after a release, nobody looks at the traces. Fix: introduce a short 10-minute post-release review.
Mini difficulty ladder:
- Base: eliminate trace_id loss.
- Working minimum: reduce alert noise.
- Strengthening: regular reviews of trace quality.
What to do right now:
- Take the last crash and check which fields were missing.
- Add these fields to the log standard.
FAQ
Simple words
Brief answers to questions that beginners usually have.
How to do it in practice
**1. Do I need a full observability stack?**
No. Start with JSON logs and trace_id, then add tracing for critical steps.
**2. Which is more important: traces or metrics?**
To find the cause of a particular failure, a trace is more useful. Metrics matter for controlling overall stability. Do both at the start, but with minimal coverage.
**3. Can you live on logs alone?**
On a very small project, yes; but as load and the number of services grow, this quickly stops working.
**4. Do I need to log the full prompt?**
Usually not. It is better to store prompt_version, size, and key metadata. Keep the full text only under strict security rules.
**5. How do you know the implementation has succeeded?**
If the team finds the root cause of a typical incident in minutes rather than hours, you're on the right track.
What to do right now:
- Collect your 3 internal FAQs for the project.
- Add answers to internal documentation.
Outcome and next practical step
Simple words
Tracing AI requests is not about “beautiful analytics” but about recovery speed and calm operations. The most important result: you stop guessing and start seeing the exact cause of a failure.
How to do it in practice
Your next step for today:
- In one critical scenario, implement trace_id from input to output.
- Add llm.call and tool.call spans.
- Set up one alert for error growth.
If you do this, you already have a working minimum that delivers real benefit in production.