LLM Log Debugging Glossary

A reference guide to common terms used when exploring and debugging agent logs, coding-tool sessions, and chatbot histories with Hyperparam.

Agent span: A distinct unit of execution within a trace that represents a specific, autonomous action taken by an AI agent, such as a tool call, reasoning step or planning phase. It acts as a nested container within a larger distributed trace, recording input/output data, latency and metadata for debugging complex autonomous workflows.

Apache Iceberg: An open table format for huge analytic datasets on object storage. In HypStack it is the storage layer: HypAware writes traces as Iceberg tables in your own bucket, so they support schema evolution and snapshot-based time travel and stay readable by Spark, Trino, DuckDB, Snowflake, and Hyperparam alike.

Apache Parquet: An open columnar file format optimized for analytic reads. Trace and log data lands as Parquet (often under an Iceberg table), enabling column pruning and HTTP range requests so clients fetch only the bytes a query actually needs.

Chunking: A technique for splitting large logs into smaller, semantically meaningful units of text or data so that each unit is easier to read and cheaper for an AI model to process.
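A minimal sketch of one simple chunking strategy, assuming the log is already split into lines and that line boundaries are an acceptable stand-in for semantic boundaries. The function name chunk_lines and the character budget are illustrative assumptions, not part of Hyperparam.

```python
def chunk_lines(lines, max_chars):
    """Group consecutive log lines into chunks of at most max_chars characters.

    A single line longer than max_chars is kept whole as its own chunk.
    """
    chunks, current, size = [], [], 0
    for line in lines:
        # Start a new chunk when adding this line would exceed the budget.
        if current and size + len(line) > max_chars:
            chunks.append(current)
            current, size = [], 0
        current.append(line)
        size += len(line)
    if current:
        chunks.append(current)
    return chunks


example_chunks = chunk_lines(["aaaa", "bbbb", "cc", "dddddddddd"], max_chars=8)
```

Real pipelines often chunk on turn, span or session boundaries instead, which preserves more meaning than a fixed character budget.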

Coding agent: An AI tool that reads and edits code and runs commands on a developer's behalf, such as Claude Code, Codex, or Cursor. Coding agents emit rich, nested logs (prompts, tool calls, file edits, token usage) that are a primary source of AI data to observe and debug.

Context: The total input data (system instructions, conversation history, user prompts and retrieved documents) that a model considers during a run. The maximum amount it can hold is the "context window," measured in tokens. Context serves as short-term memory, enabling coherence and relevance in responses.

Context rot: The measurable decline in an LLM's ability to recall, maintain or act upon information as the context window fills up, even before the maximum token limit is reached. Also called "context loss" or "context degradation." It shows up as a gradual, silent degradation of the model's output, frequently with the model struggling to retrieve information from long prompts or forgetting earlier instructions.

Dataset curation: The process of using LLM interaction data (prompts, responses, user feedback, metadata) generated in production to identify, filter, clean and refine data for future training or fine-tuning. Techniques include annotation, labeling, filtering, segmentation and derived columns.

Hallucination detection: Identifying scenarios where the LLM produces plausible but factually incorrect information. Often involves evaluating model outputs against reliable sources, such as provided context in RAG systems, to ensure faithfulness, factual consistency and accuracy.

Hallucinations: Occur when the model generates confident but illogical, fabricated or false information that differs from the provided data or known facts. They take the form of incorrect answers, fabricated citations or auto-generated pseudo-facts.

HTTP range request: An HTTP feature that fetches a specific byte range of a file instead of the whole thing. It is the core primitive behind browser-native querying: combined with columnar formats and compact indexes, a client can search a multi-gigabyte Parquet on object storage while downloading only kilobytes or megabytes.
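As a rough illustration, the sketch below builds range requests with Python's standard library. The URL is a placeholder, the helper names are assumptions, and nothing is actually sent over the network. The suffix form is useful for Parquet because the file metadata lives in a footer at the end of the file.

```python
from urllib.request import Request


def range_request(url, start, end):
    """Build a GET request for bytes start..end (inclusive) of url."""
    if start < 0 or end < start:
        raise ValueError("need 0 <= start <= end")
    return Request(url, headers={"Range": f"bytes={start}-{end}"})


def suffix_request(url, length):
    """Build a GET request for the final `length` bytes of url."""
    if length <= 0:
        raise ValueError("length must be positive")
    return Request(url, headers={"Range": f"bytes=-{length}"})


# Placeholder URL; the requests are constructed but never sent.
first_kib = range_request("https://example.com/traces.parquet", 0, 1023)
footer = suffix_request("https://example.com/traces.parquet", 65536)
```

A server that honors the header answers with status 206 (Partial Content) and only the requested bytes; a server that ignores it returns the whole file with status 200, so a client should check the status code.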

Invalid tool argument: An error that occurs when the model tries to call a function or tool but generates arguments that don't match the expected structure, type or format defined in the tool's schema. The LLM essentially acts like a buggy API client, producing incorrect inputs that the backend system can't execute.
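The check that catches this error can be sketched as follows. The schema format, the SEARCH_SCHEMA tool and the function name are hypothetical, chosen only to illustrate the three usual failure shapes: malformed JSON, a missing argument and a wrong type. Production systems typically validate against a full JSON Schema instead.

```python
import json

# Hypothetical tool schema: argument name -> (expected type, required?)
SEARCH_SCHEMA = {"query": (str, True), "limit": (int, False)}


def validate_arguments(raw, schema):
    """Return a list of problems; an empty list means the arguments pass."""
    try:
        args = json.loads(raw)
    except json.JSONDecodeError as exc:
        return [f"malformed JSON: {exc.msg}"]
    if not isinstance(args, dict):
        return ["arguments must be a JSON object"]
    problems = []
    for name, (expected, required) in schema.items():
        if name not in args:
            if required:
                problems.append(f"missing required argument: {name}")
        elif type(args[name]) is not expected:
            # type() instead of isinstance() so true/false is not accepted as an int.
            got = type(args[name]).__name__
            problems.append(f"{name} should be {expected.__name__}, got {got}")
    for name in args:
        if name not in schema:
            problems.append(f"unexpected argument: {name}")
    return problems


ok = validate_arguments('{"query": "timeout", "limit": 5}', SEARCH_SCHEMA)
bad = validate_arguments('{"limit": "5", "top_k": 3}', SEARCH_SCHEMA)
```

Logging the returned problem list next to the raw arguments makes it easy to filter a trace for every call that failed validation.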

Labeling: The automated, scalable application of LLM-generated metadata, categories or tags across large, unstructured logs.

Latency (P90/P95/P99): The time a model takes to return a response, usually reported as percentiles. P90 is the latency that 90% of responses meet or beat; only the slowest 10% take longer. P95 excludes only the slowest 5%, so a high P95 usually points to a systemic problem affecting many requests, not just a few stragglers. P99 marks the threshold for the slowest 1% of responses, exposing the worst outliers.
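A small sketch of how these percentiles can be computed from logged latencies, using the nearest-rank method. This is one of several common percentile definitions; monitoring tools may interpolate and report slightly different values. The sample latencies are made up for illustration.

```python
import math


def percentile(values, p):
    """Nearest-rank percentile: the smallest value with at least p% of samples at or below it."""
    if not values:
        raise ValueError("values must not be empty")
    if not 0 < p <= 100:
        raise ValueError("p must be in (0, 100]")
    ordered = sorted(values)
    rank = math.ceil(p * len(ordered) / 100)  # 1-based rank
    return ordered[rank - 1]


# Made-up latencies in milliseconds; one request is a severe outlier.
latencies_ms = [120, 95, 110, 105, 2400, 130, 100, 115, 125, 90]
p50 = percentile(latencies_ms, 50)
p90 = percentile(latencies_ms, 90)
p99 = percentile(latencies_ms, 99)
```

With only ten samples, the single outlier already determines P95 and P99 while leaving the median and P90 untouched, which is why tail percentiles need a reasonably large sample to be meaningful.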

Large Language Model (LLM): A large deep learning model that excels at natural language processing tasks. LLMs are trained on massive datasets to understand, summarize, generate and predict human-like language. Popular examples include ChatGPT, Claude and Gemini.

LLM-as-a-judge: An automated evaluation method where an LLM acts as an evaluator to analyze, score and provide feedback on the outputs of another system. Within an LLM log, this appears as structured, AI-generated critiques including numerical scores and categorical feedback stored alongside each user prompt and AI response. Judged criteria typically include relevance, consistency and accuracy/factuality. Judge scores are often reported alongside reference-based overlap metrics such as ROUGE and BLEU, which are computed without an LLM.

LLM call span: A span that records a direct invocation of an LLM, capturing inputs, outputs, sampling parameters (such as temperature) and token usage.

LLM log: A record of all interactions, API calls and internal processes related to a large language model. These logs contain input prompts, generated output and metadata such as latency and user ID. Used for debugging, monitoring AI performance, tracking usage for billing and analyzing security events.

Log templatization/parsing: Using LLMs to convert raw, unstructured log data into structured events by identifying static templates and dynamic variables. Simplifies log analysis, enabling AI to detect patterns and anomalies efficiently.
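To make the idea of static templates and dynamic variables concrete, here is a deliberately simple sketch that uses regular expressions in place of an LLM. The mask patterns and placeholder tokens are assumptions for illustration; an LLM-based parser would infer them from the data.

```python
import re
from collections import Counter

# More specific patterns come first so UUIDs are not split into numbers.
MASKS = [
    (re.compile(r"\b[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}\b"), "<UUID>"),
    (re.compile(r"\b\d+(?:\.\d+)?\b"), "<NUM>"),
]


def to_template(line):
    """Replace dynamic values in a log line with placeholder tokens."""
    for pattern, token in MASKS:
        line = pattern.sub(token, line)
    return line


raw_lines = [
    "tool search failed after 3 retries in 120.5 ms",
    "tool search failed after 7 retries in 98 ms",
    "span 123e4567-e89b-12d3-a456-426614174000 took 12 ms",
]
# Count how many raw lines collapse onto each template.
template_counts = Counter(to_template(line) for line in raw_lines)
```

Once lines collapse onto templates, a sudden jump in the count for one template is an easy anomaly signal.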

MCP (Model Context Protocol): An open protocol for connecting AI models to external tools and data sources. MCP tool calls are a common origin of tool-call traces, capturing which tools an agent invoked, with what arguments, and what they returned.

Model version: A specific identifier (often logged as model, model_name or model_id) that tracks which exact version of an LLM was used for a particular inference call or request. Tracking model versions is critical for comparing performance over time and rolling back to older versions if a new model performs poorly.

Object storage: Cloud blob storage such as Amazon S3, Google Cloud Storage, or Azure Blob. Storing traces here costs object-storage rates (around $0.023/GB-month) with effectively infinite retention, versus the per-GB ingest fees and short retention of traditional logging vendors.
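The storage arithmetic is simple enough to sketch directly. The default rate below is the approximate figure quoted above; actual prices vary by provider, region and storage class, and request and egress charges are ignored here.

```python
def monthly_storage_cost(size_gb, rate_per_gb_month=0.023):
    """Cost in USD of storing size_gb for one month at a flat rate."""
    return size_gb * rate_per_gb_month


def retention_cost(size_gb, months, rate_per_gb_month=0.023):
    """Cost in USD of keeping a fixed size_gb for a number of months."""
    return monthly_storage_cost(size_gb, rate_per_gb_month) * months


# 500 GB of traces: roughly 11.50 USD per month at the assumed rate.
per_month = monthly_storage_cost(500)
per_year = retention_cost(500, 12)
```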

OpenTelemetry (OTel / OTLP): An open standard for emitting traces, metrics, and logs, with OTLP as its wire protocol. HypAware collects over OTLP, so anything that already speaks OpenTelemetry can send AI traces without a custom SDK.

Perplexity: An evaluation metric measuring how well a model predicts the tokens in a sequence, computed as the exponential of the average negative log-likelihood per token. A lower perplexity means the model assigned higher probability to the text it was evaluated on.
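As a worked sketch, perplexity can be computed from per-token log-probabilities, which many model APIs can return. The function below assumes natural-log probabilities; the input values are made up.

```python
import math


def perplexity(token_logprobs):
    """Exponential of the mean negative natural-log probability per token."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    mean_nll = -sum(token_logprobs) / len(token_logprobs)
    return math.exp(mean_nll)


# A model that gives every token probability 0.25 is as uncertain
# as a uniform choice among four options.
uniform_four = perplexity([math.log(0.25)] * 3)
```

A useful intuition: a perplexity of k means the model was, on average, as uncertain as if it were choosing uniformly among k tokens.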

Prompt versioning: The systematic tracking of prompt iterations (changes in text, few-shot examples and parameters) linked directly to the specific outputs, latency and costs they produced. Creates an audit trail that allows developers to debug, compare performance and roll back to previous versions.

Rabbit-holing: A failure mode where a model or agent latches onto a bad line of reasoning, tool path or task interpretation and keeps pursuing it past the point where it should have corrected course, asked for clarification or stopped.

Retrieval-augmented generation (RAG): A framework that improves LLM accuracy by fetching relevant external data to ground its answers. Instead of relying only on pre-trained knowledge, RAG retrieves current, domain-specific data, reducing hallucinations and allowing for accurate, cited and up-to-date responses.

Skill: A saved, reusable Hyperparam workflow, stored as markdown, that re-runs a sequence of analysis steps (filters, derived columns, SQL views) on new logs. Skills let a team capture an investigation once and apply it to next week's traces.

Span: A single unit of work within a trace, such as a direct call to the LLM, a database retrieval or a tool execution.

Summarization: The act of condensing chunks or entire documents into a shorter form that retains only the essential information the LLM needs to know.

Sycophancy: A model's tendency to tailor its responses to match a user's stated or implied beliefs, even when those beliefs are factually incorrect. Instead of prioritizing truth, the model acts as a "yes-man" to maximize perceived helpfulness or user satisfaction.

Throughput: The number of requests (or tokens) a system can process per unit of time, such as requests per second or tokens per second.

Token usage: The number of tokens processed in a request, often broken down into input, output, and cached tokens. Crucial for tracking cost and latency.
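A rough sketch of turning a token-usage record into a cost estimate. The field names and the per-million-token prices are hypothetical, and the sketch assumes cached tokens are reported as a subset of input tokens, which is not true of every provider.

```python
from dataclasses import dataclass


@dataclass
class TokenUsage:
    input_tokens: int
    output_tokens: int
    cached_tokens: int = 0  # assumed to be a subset of input_tokens


# Hypothetical prices in USD per million tokens.
PRICES = {"input": 3.00, "cached": 0.30, "output": 15.00}


def estimate_cost(usage, prices=PRICES):
    """Estimate the USD cost of one request from its token counts."""
    uncached = usage.input_tokens - usage.cached_tokens
    total = (
        uncached * prices["input"]
        + usage.cached_tokens * prices["cached"]
        + usage.output_tokens * prices["output"]
    )
    return total / 1_000_000


request_cost = estimate_cost(TokenUsage(input_tokens=10_000, output_tokens=2_000, cached_tokens=4_000))
```

Summing this estimate per session or per user is a common first step when a bill grows unexpectedly.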

Tool failure: Occurs when a model attempts to use an external capability (like an API, search engine or database) but the process breaks. Because LLMs often act as agents that plan and execute multi-step tasks, a single tool failure can cause a "reliability cliff" where the entire request collapses.

Tool-call traces: Detailed records that capture the entire lifecycle of an AI agent's interaction with external tools, APIs or data sources. Provide a step-by-step audit trail (including inputs, outputs, latency and errors) of what the model decided to do, which tools it called and the results it received.

Trace: A log of the entire request path as it moves through the application, encompassing all intermediate steps such as tool calls, data retrieval and LLM reasoning.
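One way to picture a trace is as a tree of spans. The sketch below is an illustrative data model, not the schema used by Hyperparam or OpenTelemetry; the field names and span kinds are assumptions.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Span:
    name: str
    kind: str  # e.g. "agent", "llm", "tool", "retrieval"
    duration_ms: float
    error: Optional[str] = None
    children: List["Span"] = field(default_factory=list)


def walk(span, depth=0):
    """Yield (depth, span) for a span and all its descendants, depth-first."""
    yield depth, span
    for child in span.children:
        yield from walk(child, depth + 1)


def first_error(root):
    """Return the first span (in depth-first order) that recorded an error, or None."""
    for _, span in walk(root):
        if span.error:
            return span
    return None


trace = Span("handle_request", "agent", 900.0, children=[
    Span("plan", "llm", 300.0),
    Span("search_docs", "tool", 500.0, error="timeout"),
    Span("answer", "llm", 100.0),
])
failed = first_error(trace)
```

Walking the tree depth-first mirrors how a reader inspects a trace: start at the request, then descend into the step where the error first appears.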

Trace inspection: The process of manually or automatically reviewing a trace to understand the "why" behind a model's output. While monitoring alerts you to a system failure, trace inspection helps determine exactly where the logic broke down.

Vector store: A database that holds vector representations (embeddings) of log data to support similarity search and retrieval-based analysis.

Verbosity: A performance metric that tracks how wordy a model is relative to the information it provides. Important for cost control, latency and debugging.

Waiting-state failure: A failure mode where a model or agent enters a stalled state and waits for user input when it should have continued the task, used an available tool or returned a more complete response.

Wasted call: A logged attempt by the LLM to use an external tool (e.g., API, database) that results in an error, incorrect function execution or invalid output. Also referred to as a "failed tool call." Common causes include missing parameters, malformed JSON, incorrect arguments, rate limits or hallucinations where the model invokes non-existent tools.
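A minimal sketch of measuring wasted calls in a log, assuming each tool call is a dictionary whose "error" field is empty or absent on success. That record shape and the sample records are hypothetical.

```python
from collections import Counter


def wasted_call_rate(tool_calls):
    """Fraction of tool calls that carry a non-empty "error" field."""
    if not tool_calls:
        return 0.0
    wasted = sum(1 for call in tool_calls if call.get("error"))
    return wasted / len(tool_calls)


def wasted_by_cause(tool_calls):
    """Count wasted calls per error message."""
    return Counter(call["error"] for call in tool_calls if call.get("error"))


# Made-up tool-call records for illustration.
calls = [
    {"tool": "search", "error": None},
    {"tool": "search", "error": "rate_limited"},
    {"tool": "read_file", "error": "malformed JSON"},
    {"tool": "search", "error": "rate_limited"},
]
rate = wasted_call_rate(calls)
causes = wasted_by_cause(calls)
```

Grouping by cause separates problems the model can fix (bad arguments) from problems it cannot (rate limits).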
