The RTVI (Real-Time Voice and Video Inference) standard defines a set of message types and structures sent between clients and servers. It is designed to facilitate real-time interactions between clients and AI applications that require voice, video, and text communication. It provides a consistent framework for building applications that can communicate with AI models and the backends running those models in real time.
This page documents version 1.3 of the RTVI standard. The UI Agent Protocol message types (ui-event, ui-snapshot, ui-cancel-task, ui-command, ui-task) were added in v1.3 as an additive extension to v1.2. Older clients that pre-date v1.3 ignore the new types; the major-version compatibility check the server runs on client-ready still passes for any 1.x peer.
Key Features
Connection Management
RTVI provides a flexible connection model that allows clients to connect to
AI services and coordinate state.
Transcriptions
The standard includes built-in support for real-time transcription of audio
streams.
Client-Server Messaging
The standard defines a messaging protocol for sending and receiving messages
between clients and servers, allowing for efficient communication of
requests and responses.
Advanced LLM Interactions
The standard supports advanced interactions with large language models (LLMs), including context management, function call handling, and search results.
Service-Specific Insights
RTVI supports events to provide insight into the input/output and state for
typical services that exist in speech-to-speech workflows.
Metrics and Monitoring
RTVI provides mechanisms for collecting metrics and monitoring the
performance of server-side services.
Terms
- Client: The front-end application or user interface that interacts with the RTVI server.
- Server: The back-end service that runs the AI framework and processes requests from the client.
- User: The end user interacting with the client application.
- Bot: The AI interacting with the user, technically an amalgamation of a large language model (LLM) and a text-to-speech (TTS) service.
RTVI Message Format
The messages defined as part of the RTVI protocol adhere to the following format:
- id: string. A unique identifier for the message, used to correlate requests and responses.
- label: string. A label that identifies this message as an RTVI message. This field is required and should always be set to 'rtvi-ai'.
- type: string. The type of message being sent. This field is required and should be set to one of the predefined RTVI message types listed below.
- data: unknown. The payload of the message, which can be any data structure relevant to the message type.
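As an illustration, the envelope can be sketched in Python. The make_message helper below is hypothetical (not part of any SDK); only the four field names come from the format above.

```python
import uuid

def make_message(msg_type: str, data=None) -> dict:
    """Build an RTVI message envelope (hypothetical helper, not part of any SDK)."""
    return {
        "id": str(uuid.uuid4()),  # unique id, used to correlate requests and responses
        "label": "rtvi-ai",       # required; always the literal 'rtvi-ai'
        "type": msg_type,         # one of the predefined RTVI message types
        "data": data,             # payload; structure depends on the message type
    }
```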
RTVI Message Types
Following the above format, this section describes the various message types defined by the RTVI standard. Each message type has a specific purpose and structure, allowing for clear communication between clients and servers.
Connection Management
client-ready (client → server)
Indicates that the client is ready to receive messages and interact with the server. Typically sent after the transport media channels have connected.
- type: 'client-ready'
- data:
  - version: string. The version of the RTVI standard being used. This is useful for ensuring compatibility between client and server implementations.
  - about: AboutClient Object. An object containing information about the client, such as its rtvi-version, client library, and any other relevant metadata.
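As a sketch, a client-ready message might look like the following on the wire. The about fields beyond the envelope are illustrative; the exact AboutClient schema depends on the client library.

```python
# Hypothetical client-ready message as it might appear on the wire.
client_ready = {
    "id": "msg-001",        # illustrative id
    "label": "rtvi-ai",
    "type": "client-ready",
    "data": {
        "version": "1.3",   # RTVI version the client speaks
        "about": {
            # AboutClient metadata; exact fields depend on the client library
            "library": "@pipecat-ai/client-js",  # illustrative value
        },
    },
}
```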
bot-ready (server → client)
Indicates that the bot is ready to receive messages and interact with the client. Typically sent after the transport media channels have connected.
- type: 'bot-ready'
- data:
  - version: string. The version of the RTVI standard being used. This is useful for ensuring compatibility between client and server implementations.
  - about: any. (Optional) An object containing information about the server or bot. Its structure and value are both undefined by default. This provides flexibility to include any relevant metadata your client may need to know about the server at connection time. Please be mindful of the data you include here and any security concerns that may arise from exposing sensitive information.
disconnect-bot (client → server)
Indicates that the client wishes to disconnect from the bot. Typically used when the client is shutting down or no longer needs to interact with the bot. Note: Disconnects should happen automatically when either the client or bot disconnects from the transport, so this message is intended for the case where a client may want to remain connected to the transport but no longer wishes to interact with the bot.
- type: 'disconnect-bot'
- data: undefined
error (server → client)
Indicates an error occurred during bot initialization or runtime.
- type: 'error'
- data:
  - message: string. Description of the error.
  - fatal: boolean. Indicates if the error is fatal to the session.
Speaking and Transcription
user-started-speaking (server → client)
Emitted when the user begins speaking.
- type: 'user-started-speaking'
- data: None
user-stopped-speaking (server → client)
Emitted when the user stops speaking.
- type: 'user-stopped-speaking'
- data: None
bot-started-speaking (server → client)
Emitted when the bot begins speaking.
- type: 'bot-started-speaking'
- data: None
bot-stopped-speaking (server → client)
Emitted when the bot stops speaking.
- type: 'bot-stopped-speaking'
- data: None
user-mute-started (server → client)
Introduced in RTVI version 1.2.0 (Pipecat 0.0.102, client-js 1.6.0).
- type: 'user-mute-started'
- data: None
user-mute-stopped (server → client)
Introduced in RTVI version 1.2.0 (Pipecat 0.0.102, client-js 1.6.0).
- type: 'user-mute-stopped'
- data: None
user-transcription (server → client)
Real-time transcription of user speech, including both partial and final results.
- type: 'user-transcription'
- data:
  - text: string. The transcribed text of the user.
  - final: boolean. Indicates if this is a final transcription or a partial result.
  - timestamp: string. The timestamp when the transcription was generated.
  - user_id: string. Identifier for the user who spoke.
bot-output (server → client)
The best-effort representation of the bot's output text, including both spoken and unspoken text. In addition to transcriptions of spoken text, this message type may also include text that the bot outputs but does not speak (e.g., text sent to the client for display purposes only). Along with the text, this event includes a spoken flag to indicate whether the text was spoken by the bot, and an aggregated_by field to indicate what the text represents (e.g. "sentence", "word", "code", "url").
- type: 'bot-output'
- data:
  - text: string. The output text from the bot.
  - spoken: boolean. Indicates if this text was spoken by the bot.
  - aggregated_by: string. Indicates how the text was aggregated (e.g., "sentence", "word", "code", "url"). "sentence" and "word" are reserved aggregation types defined by the RTVI standard. Other aggregation types may be defined by custom text aggregators used by the server.
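For example, a client could use the spoken flag to route output into a transcript view versus a display-only panel. A small sketch (split_bot_output is a hypothetical helper):

```python
def split_bot_output(messages):
    """Separate bot-output text into spoken and display-only lists."""
    spoken, display_only = [], []
    for m in messages:
        if m["type"] != "bot-output":
            continue
        (spoken if m["data"]["spoken"] else display_only).append(m["data"]["text"])
    return spoken, display_only
```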
bot-transcription (server → client)
Transcription of the bot's speech. Note: This protocol currently does not match the user transcription format, so it does not support real-time timestamping for bot transcriptions. Rather, the event is typically sent for each sentence of the bot's response. This difference is currently due to limitations in TTS services, most of which do not support (or do not support well) accurate timing information. If/when this changes, this protocol may be updated to include the necessary timing information. For now, if you want to attempt real-time transcription matched to your bot's speech, you can try using the bot-tts-text message type.
- type: 'bot-transcription'
- data:
  - text: string. The transcribed text from the bot, typically aggregated at a per-sentence level.
Client-Server Messaging
server-message (server → client)
An arbitrary message sent from the server to the client. This can be used for custom interactions or commands. This message may be coupled with the client-message message type to handle responses from the client.
- type: 'server-message'
- data: any. The data can be any JSON-serializable object, formatted according to your own specifications.
client-message (client → server)
An arbitrary message sent from the client to the server. This can be used for custom interactions or commands. This message may be coupled with the server-response message type to handle responses from the server.
- type: 'client-message'
- data:
  - t: string. Indicates the type of message.
  - d: unknown. (Optional) Any custom, corresponding data needed for the message.
server-response (server → client)
A message sent from the server to the client in response to a client-message. IMPORTANT: The id should match the id of the original client-message to correlate the response with the request.
- type: 'server-response'
- data:
  - t: string. Indicates the type of message.
  - d: unknown. (Optional) Any custom, corresponding data needed for the message.
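The id-based correlation between client-message and server-response can be sketched as follows (the pending map and both helpers are hypothetical, not SDK APIs):

```python
pending = {}  # in-flight request ids -> the 't' value we sent

def build_client_message(msg_id: str, t: str, d=None) -> dict:
    """Build a client-message and remember its id so a later
    server-response carrying the same id can be matched back to it."""
    pending[msg_id] = t
    return {"id": msg_id, "label": "rtvi-ai", "type": "client-message",
            "data": {"t": t, "d": d}}

def match_server_response(msg: dict):
    """Return the original request type for a server-response, or None."""
    return pending.pop(msg["id"], None)
```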
error-response (server → client)
Error response to a specific client message. IMPORTANT: The id should match the id of the original client-message to correlate the response with the request.
- type: 'error-response'
- data:
  - error: string
UI Agent Protocol
The UI Agent Protocol lets server-side AI agents observe and drive a client GUI app through structured RTVI messages. Five top-level message types form the protocol:

| Type | Direction | Purpose |
|---|---|---|
| ui-event | client → server | Named event with payload (e.g. button click, navigation intent) |
| ui-snapshot | client → server | Platform accessibility tree; grounds agent reasoning in current UI state |
| ui-cancel-task | client → server | Cancel an in-flight user-facing task group |
| ui-command | server → client | Named command with payload (toast, scroll, highlight, click, set input value, …) |
| ui-task | server → client | Task lifecycle envelope; one of group_started / task_update / task_completed / group_completed |

Server-side, each type maps to a *Data / *Message Pydantic model in pipecat.processors.frameworks.rtvi.models. Client-side, each type is a member of RTVIMessageType (e.g. RTVIMessageType.UI_EVENT). The full agent-side abstraction lives in pipecat-subagents; single-LLM Pipecat apps can target this wire format directly via the new pipeline frames (RTVIUICommandFrame, RTVIUITaskFrame).
For a higher-level walkthrough of using this protocol from the web client SDK with a server-side UI agent, see the UI Agent guide.
ui-event (client → server)
A named event from the client carrying an app-defined payload. Unlike client-message, no server response is expected; the server side fans out the event to handlers and (optionally) injects it into the LLM context as a <ui_event> developer message.
- type: 'ui-event'
- data:
  - event: string. App-defined event name (e.g. "nav_click", "play_track", "set_tab"). Apps may pick any name.
  - payload: unknown. (Optional) App-defined payload. Schemaless by design.
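A sketch of a ui-event on the wire; the event name and payload values here are illustrative, app-defined choices:

```python
# Hypothetical ui-event: the user clicked a navigation item.
ui_event = {
    "id": "ev-7",
    "label": "rtvi-ai",
    "type": "ui-event",
    "data": {
        "event": "nav_click",             # app-defined event name
        "payload": {"view": "settings"},  # schemaless, app-defined payload
    },
}
```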
ui-snapshot (client → server)
The current accessibility tree for the client UI, captured from the platform's accessibility or semantic UI model. Web clients can produce this from the DOM; native mobile clients can produce the same shape from iOS or Android accessibility metadata. The server stores the latest snapshot and renders it as <ui_state> into the LLM context so the agent reasons over the current UI state.
- type: 'ui-snapshot'
- data:
  - tree: A11ySnapshot. The serialized platform accessibility tree. The server-side model mirrors the shared snapshot shape and allows extra fields for platform-specific metadata and forward compatibility. The web SDK's managed stream (PipecatClient.startUISnapshotStream or React's useUISnapshot) produces this shape; A11ySnapshotStreamer is available as a low-level helper.
ui-cancel-task (client → server)
A request to cancel an in-flight user-facing task group. The server-side framework looks up the matching task group and cancels it (subject to whatever cancellable policy the group was registered with).
- type: 'ui-cancel-task'
- data:
  - task_id: string. The task group id the client wants cancelled.
  - reason: string. (Optional) Human-readable reason logged on the server.
ui-command (server → client)
A named UI command directing the client to take an action. Client apps can subscribe to RTVIEvent.UICommand, use the onUICommand constructor callback, or use the useUICommandHandler React hook to filter by data.command.
- type: 'ui-command'
- data:
  - command: string. App-defined command name. Standard names with default React handlers: toast, navigate, scroll_to, highlight, focus, click, set_input_value, select_text. Apps may publish their own command names freely.
  - payload: unknown. App-defined payload. Standard payload shapes for the standard commands are documented under Standard command payloads below.
ui-task (server → client)
A task lifecycle envelope for a user-facing task group dispatched by the server. The data field carries one of four lifecycle phases discriminated by kind. Client-side, useUITasks() from @pipecat-ai/client-react reduces these into a live in-flight panel with cancel support.
- type: 'ui-task'
- data (one of):
  - group_started: group dispatched, worker list known.
    - kind: 'group_started'
    - task_id: string
    - agents: string[]
    - label: string (optional)
    - cancellable: boolean
    - at: number (epoch ms)
  - task_update: per-worker progress.
    - kind: 'task_update'
    - task_id: string
    - agent_name: string
    - data: unknown (worker-defined)
    - at: number (epoch ms)
  - task_completed: per-worker terminal.
    - kind: 'task_completed'
    - task_id: string
    - agent_name: string
    - status: string ("completed", "cancelled", "failed", "error")
    - response: unknown (optional)
    - at: number (epoch ms)
  - group_completed: group terminal; every worker has responded or the group was cancelled.
    - kind: 'group_completed'
    - task_id: string
    - at: number (epoch ms)
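The four lifecycle phases can be folded into an in-flight map, roughly what useUITasks() does client-side. A simplified, hypothetical sketch (progress from task_update is omitted):

```python
in_flight = {}  # task_id -> {"agents": [...], "done": set of agent names}

def reduce_ui_task(data: dict) -> None:
    """Fold one ui-task lifecycle payload into the in-flight map."""
    kind, task_id = data["kind"], data["task_id"]
    if kind == "group_started":
        in_flight[task_id] = {"agents": data["agents"], "done": set()}
    elif kind == "task_completed":
        in_flight[task_id]["done"].add(data["agent_name"])
    elif kind == "group_completed":
        in_flight.pop(task_id, None)
    # 'task_update' payloads would drive per-worker progress UI here
```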
Standard command payloads
The ui-command message carries a free-form payload, but eight standard payload shapes ship with the protocol. Apps can use them as-is, override the client handler to customize behavior, or ignore them and define their own command names.
| Command | Payload shape |
|---|---|
toast | {title, subtitle?, description?, image_url?, duration_ms?} |
navigate | {view, params?} |
scroll_to | {ref?, target_id?, behavior?} |
highlight | {ref?, target_id?, duration_ms?} |
focus | {ref?, target_id?} |
click | {ref?, target_id?} |
set_input_value | {value, ref?, target_id?, replace?} |
select_text | {ref?, target_id?, start_offset?, end_offset?} |
ref is an opaque snapshot reference like "e42" assigned by the client and emitted in the latest ui-snapshot. Client handlers should resolve ref first, then fall back to target_id, an optional app-defined stable target identifier outside the snapshot system. Web clients commonly map target_id to a DOM id; native clients might map it to an accessibility identifier, view id, or app-specific component id. Pydantic models for these shapes ship in pipecat.processors.frameworks.rtvi.models (Toast, Navigate, ScrollTo, Highlight, Focus, Click, SetInputValue, SelectText); server-side helpers like UIAgent.send_command accept them directly and model_dump to the wire shape.
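The ref-first, target_id-fallback resolution rule above can be sketched like this (resolve_target and the two lookup maps are hypothetical; real clients would resolve against live UI nodes):

```python
def resolve_target(payload: dict, by_ref: dict, by_id: dict):
    """Resolve a ui-command target: prefer the snapshot ref, then
    fall back to the app-defined target_id (per the rules above)."""
    ref = payload.get("ref")
    if ref is not None and ref in by_ref:
        return by_ref[ref]
    return by_id.get(payload.get("target_id"))
```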
Advanced LLM Interactions
send-text (client → server)
A message sent from the client to the server to send text input to the LLM, appended to the user's context.
- type: 'send-text'
- data:
  - content: string. The text content to be appended to the user context.
  - options: object. (Optional)
    - run_immediately: boolean. (Optional) If true, the pipeline should be interrupted and the LLM should process the input immediately after appending it to the context. Defaults to true.
    - audio_response: boolean. (Optional) If false, the bot should bypass the TTS so that the bot does not respond to the text in audio. Defaults to true.
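A sketch of a send-text message that appends text, runs immediately, and suppresses the audio response (the content string and id are illustrative):

```python
# Hypothetical send-text message: run now, but skip the spoken reply.
send_text = {
    "id": "st-1",
    "label": "rtvi-ai",
    "type": "send-text",
    "data": {
        "content": "Summarize the conversation so far.",
        "options": {
            "run_immediately": True,   # interrupt and run now (the default)
            "audio_response": False,   # bypass TTS for this turn
        },
    },
}
```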
llm-function-call-started (server → client)
Introduced in RTVI version 1.2.0 (Pipecat 0.0.102, client-js 1.6.0). Emission of this message is controlled by the server's function_call_report_level configuration.
- type: 'llm-function-call-started'
- data:
  - function_name: string. (Optional) Name of the function being called. Only included if the server's report level is NAME or FULL.
llm-function-call-in-progress (server → client)
Introduced in RTVI version 1.2.0 (Pipecat 0.0.102, client-js 1.6.0). This is the event that triggers registered FunctionCallHandlers when a function_name is present.
- type: 'llm-function-call-in-progress'
- data:
  - function_name: string. (Optional) Name of the function being called. Only included if the server's report level is NAME or FULL.
  - tool_call_id: string. Unique identifier for this function call.
  - arguments: Record<string, unknown>. (Optional) Arguments passed to the function. Only included if the server's report level is FULL.
llm-function-call-stopped (server → client)
Introduced in RTVI version 1.2.0 (Pipecat 0.0.102, client-js 1.6.0).
- type: 'llm-function-call-stopped'
- data:
  - function_name: string. (Optional) Name of the function that was called. Only included if the server's report level is NAME or FULL.
  - tool_call_id: string. Identifier matching the original function call.
  - cancelled: boolean. Indicates whether the function call was cancelled before completing.
  - result: unknown. (Optional) The result of the function call, if available. Only included if the server's report level is FULL.
llm-function-call (server → client)
A function call request from the LLM, sent from the bot to the client. Note that in most cases, an LLM function call will be handled completely server-side. However, in the event that the call requires input from the client or the client needs to be aware of the function call, this message/response schema is required.
- type: 'llm-function-call'
- data:
  - function_name: string. Name of the function to be called.
  - tool_call_id: string. Unique identifier for this function call.
  - args: Record<string, unknown>. Arguments to be passed to the function.
llm-function-call-result (client → server)
The result of the function call requested by the LLM, returned from the client.
- type: 'llm-function-call-result'
- data:
  - function_name: string. Name of the called function.
  - tool_call_id: string. Identifier matching the original function call.
  - arguments: Record<string, unknown>. Arguments that were passed to the function.
  - result: Record<string, unknown> | string. The result returned by the function.
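The request/response pairing above can be sketched as a client-side dispatcher that runs a handler and builds the matching result message. This is a hypothetical helper, not an SDK API; the key detail is reusing the request's id and tool_call_id.

```python
def handle_llm_function_call(msg: dict, handlers: dict) -> dict:
    """Run a client-side handler for an llm-function-call and build the
    llm-function-call-result reply. The reply reuses the request's id and
    tool_call_id so the server can correlate it with the original call."""
    data = msg["data"]
    result = handlers[data["function_name"]](data["args"])
    return {
        "id": msg["id"],
        "label": "rtvi-ai",
        "type": "llm-function-call-result",
        "data": {
            "function_name": data["function_name"],
            "tool_call_id": data["tool_call_id"],
            "arguments": data["args"],
            "result": result,
        },
    }
```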
bot-llm-search-response (server → client)
Search results from the LLM's knowledge base. Currently, Google Gemini is the only LLM that supports built-in search. However, we expect other LLMs to follow suit, which is why this message type is defined as part of the RTVI standard. As more LLMs add support for this feature, the format of this message type may evolve to accommodate discrepancies.
- type: 'bot-llm-search-response'
- data:
  - search_result: string. (Optional) Raw search result text.
  - rendered_content: string. (Optional) Formatted version of the search results.
  - origins: Array<Origin Object>. Source information and confidence scores for search results.
Service-Specific Insights
bot-llm-started (server → client)
Indicates LLM processing has begun.
- type: 'bot-llm-started'
- data: None
bot-llm-stopped (server → client)
Indicates LLM processing has completed.
- type: 'bot-llm-stopped'
- data: None
user-llm-text (server → client)
Aggregated user input text that is sent to the LLM.
- type: 'user-llm-text'
- data:
  - text: string. The user's input text to be processed by the LLM.
bot-llm-text (server → client)
Individual tokens streamed from the LLM as they are generated.
- type: 'bot-llm-text'
- data:
  - text: string. The token text from the LLM.
bot-tts-started (server → client)
Indicates text-to-speech (TTS) processing has begun.
- type: 'bot-tts-started'
- data: None
bot-tts-stopped (server → client)
Indicates text-to-speech (TTS) processing has completed.
- type: 'bot-tts-stopped'
- data: None
bot-tts-text (server → client)
The per-token text output of the text-to-speech (TTS) service (what the TTS actually says).
- type: 'bot-tts-text'
- data:
  - text: string. The text representation of the generated bot speech.
Metrics and Monitoring
metrics (server → client)
Performance metrics for various processing stages and services. Each message will contain entries for one or more of the metric types: processing, ttfb, characters.
- type: 'metrics'
- data:
  - processing: Array. (Optional) Processing time metrics.
  - ttfb: Array. (Optional) Time to first byte metrics.
  - characters: Array. (Optional) Character processing metrics.
Each entry in these arrays has the following shape:
  - processor: string. The name of the processor or service that generated the metric.
  - value: number. The value of the metric, typically in milliseconds or character count.
  - model: string. (Optional) The model of the service that generated the metric, if applicable.
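Clients often flatten the three metric arrays into one list for charting or logging. A minimal sketch (flatten_metrics is a hypothetical helper; the sample processor names and values are illustrative):

```python
def flatten_metrics(msg: dict):
    """Flatten a metrics message into (kind, processor, value) tuples,
    following the entry shape documented above."""
    rows = []
    for kind in ("processing", "ttfb", "characters"):
        for entry in msg["data"].get(kind, []):
            rows.append((kind, entry["processor"], entry["value"]))
    return rows
```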