# Assist pipelines
The Assist pipeline integration runs the common steps of a voice assistant:
- Wake word detection
- Speech to text
- Intent recognition
- Text to speech
Pipelines are run via a WebSocket API:
```json
{
  "type": "assist_pipeline/run",
  "start_stage": "stt",
  "end_stage": "tts",
  "input": {
    "sample_rate": 16000
  }
}
```
The following input fields are available:
| Name | Type | Description |
|---|---|---|
| `start_stage` | enum | Required. The first stage to run. One of `wake_word`, `stt`, `intent`, `tts`. |
| `end_stage` | enum | Required. The last stage to run. One of `stt`, `intent`, `tts`. |
| `input` | dict | Depends on `start_stage`. Audio stages (`wake_word`, `stt`) take a `sample_rate` (int); `wake_word` also accepts a `timeout` and the audio enhancement settings described below. |
| `pipeline` | string | Optional. ID of the pipeline (use `assist_pipeline/pipeline/list` to get names). |
| `conversation_id` | string | Optional. Unique ID for the conversation. |
| `device_id` | string | Optional. Device ID from Home Assistant's device registry of the device that is starting the pipeline. |
| `timeout` | number | Optional. Number of seconds before the pipeline times out (default: 300). |
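As an illustration, a minimal Python client sketch using the third-party `websockets` package might look like the following. The host, path, and token are assumptions for a typical Home Assistant setup, and the authentication handshake shown is the standard one for Home Assistant's WebSocket API:

```python
import asyncio
import json

import websockets  # third-party: pip install websockets


async def run_pipeline() -> None:
    # Assumed host and token; replace with your own instance and
    # a long-lived access token.
    url = "ws://homeassistant.local:8123/api/websocket"
    token = "YOUR_LONG_LIVED_ACCESS_TOKEN"

    async with websockets.connect(url) as ws:
        await ws.recv()  # server sends auth_required first
        await ws.send(json.dumps({"type": "auth", "access_token": token}))
        await ws.recv()  # auth_ok on success

        # Run the pipeline from speech-to-text through text-to-speech.
        await ws.send(
            json.dumps(
                {
                    "id": 1,  # WebSocket commands carry an increasing id
                    "type": "assist_pipeline/run",
                    "start_stage": "stt",
                    "end_stage": "tts",
                    "input": {"sample_rate": 16000},
                }
            )
        )
        print(json.loads(await ws.recv()))  # command result


asyncio.run(run_pipeline())
```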
## Events
As the pipeline runs, it emits events back over the WebSocket connection. The following events can be emitted:
| Name | Description | Emitted | Attributes |
|---|---|---|---|
| `run-start` | Start of pipeline run | always | `pipeline` - ID of the pipeline<br />`language` - Language used for the pipeline<br />`runner_data` - Extra WebSocket data (includes the `stt_binary_handler_id` used when sending speech data)<br />`tts_output` - TTS output data |
| `run-end` | End of pipeline run | always | |
| `wake_word-start` | Start of wake word detection | audio only | `engine` - Wake word engine used<br />`metadata` - Incoming audio metadata<br />`timeout` - Seconds before wake word timeout |
| `wake_word-end` | End of wake word detection | audio only | `wake_word_output` - Detection result data |
| `stt-start` | Start of speech to text | audio only | `engine` - STT engine used<br />`metadata` - Incoming audio metadata |
| `stt-vad-start` | Start of voice command | audio only | `timestamp` - Time relative to start of audio stream (milliseconds) |
| `stt-vad-end` | End of voice command | audio only | `timestamp` - Time relative to start of audio stream (milliseconds) |
| `stt-end` | End of speech to text | audio only | `stt_output` - Object with `text`, the detected text |
| `intent-start` | Start of intent recognition | always | `engine` - Agent engine used<br />`language` - Processing language<br />`intent_input` - Input text to agent |
| `intent-progress` | Intermediate update of intent recognition | depending on conversation agent | `chat_log_delta` - Optional, delta object from the chat log<br />`tts_start_streaming` - Optional, true if TTS streaming has started |
| `intent-end` | End of intent recognition | always | `intent_output` - Conversation response |
| `tts-start` | Start of text to speech | audio only | `engine` - TTS engine used<br />`language` - Output language<br />`voice` - Output voice<br />`tts_input` - Text to speak |
| `tts-end` | End of text to speech | audio only | `token` - Token of the generated audio<br />`url` - URL to the generated audio<br />`mime_type` - MIME type of the generated audio |
| `error` | Error in pipeline | on error | `code` - Error code (see below)<br />`message` - Error message |
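A client typically consumes these events in a loop until `run-end` arrives. A minimal sketch, reusing the hypothetical `ws` connection from the earlier example and assuming events arrive wrapped in standard `{"type": "event", ...}` messages:

```python
async def handle_events(ws) -> None:
    # Consume pipeline events until the run finishes.
    while True:
        msg = json.loads(await ws.recv())
        if msg.get("type") != "event":
            continue  # skip command results and other message types
        event = msg["event"]
        data = event.get("data") or {}

        if event["type"] == "run-start":
            # Needed later to frame binary audio (see "Sending speech data").
            handler_id = data["runner_data"]["stt_binary_handler_id"]
            print("stt_binary_handler_id:", handler_id)
        elif event["type"] == "stt-end":
            print("recognized:", data["stt_output"]["text"])
        elif event["type"] == "tts-end":
            print("speech available at:", data["url"])
        elif event["type"] == "error":
            print("pipeline error:", data["code"], data["message"])
        elif event["type"] == "run-end":
            break
```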
## Error codes
The following codes are returned from the pipeline error event:
- `wake-engine-missing` - No wake word engine is installed
- `wake-provider-missing` - Configured wake word provider is not available
- `wake-stream-failed` - Unexpected error during wake word detection
- `wake-word-timeout` - Wake word was not detected within the timeout
- `stt-provider-missing` - Configured speech-to-text provider is not available
- `stt-provider-unsupported-metadata` - Speech-to-text provider does not support the audio format (sample rate, etc.)
- `stt-stream-failed` - Unexpected error during speech to text
- `stt-no-text-recognized` - Speech to text did not return a transcript
- `intent-not-supported` - Configured conversation agent is not available
- `intent-failed` - Unexpected error during intent recognition
- `tts-not-supported` - Configured text-to-speech provider is not available or options are not supported
- `tts-failed` - Unexpected error during text to speech
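Note that each code begins with the name of the stage that failed, so a client can attribute an error to a stage with a simple prefix check; a small illustrative helper:

```python
def failed_stage(code: str) -> str:
    """Map a pipeline error code to the stage it came from (illustrative)."""
    for prefix, stage in (
        ("wake-", "wake_word"),
        ("stt-", "stt"),
        ("intent-", "intent"),
        ("tts-", "tts"),
    ):
        if code.startswith(prefix):
            return stage
    return "unknown"


assert failed_stage("stt-stream-failed") == "stt"
```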
## Sending speech data
After starting a pipeline with `stt` as the first stage and receiving an `stt-start` event, speech data can be sent over the WebSocket connection as binary messages. Audio should be sent as soon as it is available, with each chunk prefixed by a single byte containing the `stt_binary_handler_id` (provided in the `run-start` event's `runner_data`).
For example, if `stt_binary_handler_id` is 1 and the audio chunk is `a1b2c3`, the message would be (in hex):

```text
stt_binary_handler_id
||
01a1b2c3
  ||||||
  audio
```
To indicate the end of sending speech data, send a binary message containing a single byte with the `stt_binary_handler_id`.
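A sketch of this framing in Python, assuming `ws` is the authenticated connection from earlier and `handler_id` was read from the `run-start` event's `runner_data`:

```python
def frame_audio(handler_id: int, chunk: bytes) -> bytes:
    # Each binary message starts with the single handler-id byte.
    return bytes([handler_id]) + chunk


async def stream_audio(ws, handler_id: int, chunks) -> None:
    for chunk in chunks:
        await ws.send(frame_audio(handler_id, chunk))
    # A lone handler-id byte signals the end of the audio stream.
    await ws.send(bytes([handler_id]))


# Matches the hex example above:
assert frame_audio(1, bytes.fromhex("a1b2c3")).hex() == "01a1b2c3"
```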
## Wake word detection
When start_stage is set to wake_word, the pipeline will not run until a wake word has been detected. Clients should avoid unnecessary audio streaming by using a local voice activity detector (VAD) to only start streaming when human speech is detected.
For `wake_word`, the `input` object should contain a `timeout` float value. This is the number of seconds of silence before the pipeline will time out during wake word detection (error code `wake-word-timeout`).
If enough speech is detected by Home Assistant's internal VAD, the timeout will be continually reset.
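For example, a run that waits for a wake word before proceeding to speech-to-text could be started with a message like this (the `id`, `sample_rate`, and `timeout` values are illustrative):

```python
# Illustrative: start at wake word detection and run through TTS.
# "timeout" here is the wake word silence timeout described above.
await ws.send(
    json.dumps(
        {
            "id": 2,
            "type": "assist_pipeline/run",
            "start_stage": "wake_word",
            "end_stage": "tts",
            "input": {"sample_rate": 16000, "timeout": 3},
        }
    )
)
```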
## Audio enhancements
The following settings are available as part of the `input` object when `start_stage` is set to `wake_word`:

- `noise_suppression_level` - Level of noise suppression (0 = disabled, 4 = max)
- `auto_gain_dbfs` - Automatic gain control (0 = disabled, 31 = max)
- `volume_multiplier` - Audio samples multiplied by a constant (1.0 = no change, 2.0 = twice as loud)
If your device's microphone is fairly quiet, the recommended settings are:

- `noise_suppression_level` - 2
- `auto_gain_dbfs` - 31
- `volume_multiplier` - 2.0

Increasing `noise_suppression_level` or `volume_multiplier` may cause audio distortion.
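Putting it together, a hypothetical `input` object for a quiet microphone, combining the wake word timeout and the recommended settings above:

```python
# Hypothetical wake_word "input" object for a quiet microphone.
input_settings = {
    "sample_rate": 16000,          # incoming audio sample rate (Hz)
    "timeout": 3,                  # wake word silence timeout (seconds)
    "noise_suppression_level": 2,  # moderate noise suppression
    "auto_gain_dbfs": 31,          # maximum automatic gain
    "volume_multiplier": 2.0,      # double the volume
}
```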