RecognitionConfig

Provides information to the recognizer that specifies how to process the request.

{
  "encoding": enum (AudioEncoding),
  "sampleRateHertz": integer,
  "audioChannelCount": integer,
  "enableSeparateRecognitionPerChannel": boolean,
  "languageCode": string,
  "maxAlternatives": integer,
  "speechContexts": [
    {
      object (SpeechContext)
    }
  ],
  "enableWordTimeOffsets": boolean,
  "diarizationConfig": {
    object (SpeakerDiarizationConfig)
  }
}
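
As an illustration of the shape above, the following sketch fills in a config for a mono 8000 Hz WAV recording in Hindi. The values are assumptions chosen for the example; only the field names come from the schema. The experimental speechContexts and diarizationConfig fields are left out.

```python
import json

# Illustrative values only; the field names follow the RecognitionConfig schema.
config = {
    "encoding": "LINEAR16",
    "sampleRateHertz": 8000,
    "languageCode": "hi-IN",
    "maxAlternatives": 1,
    "enableWordTimeOffsets": True,
}

# Serialize the config to JSON text, ready to embed in a request body.
payload = json.dumps(config, indent=2)
print(payload)
```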

Field	Description
encoding	enum (AudioEncoding) Encoding of audio data sent in all RecognitionAudio messages. This field is optional for FLAC and WAV audio files and required for all other audio formats. For details, see AudioEncoding.
sampleRateHertz	integer Sample rate in Hertz of the audio data sent in all RecognitionAudio messages. Only 8000 Hz is currently supported. If your audio uses a different sample rate, resample it to 8000 Hz before sending it.
audioChannelCount	integer The number of channels in the input audio data. Set this only for multi-channel recognition. Valid values for LINEAR16 and FLAC are 1-8. If 0 or omitted, defaults to one channel (mono). Note: Only the first channel is recognized by default. To perform independent recognition on each channel, set enableSeparateRecognitionPerChannel to true.
enableSeparateRecognitionPerChannel	boolean To have each channel recognized separately, set this to true explicitly and set audioChannelCount to a value greater than 1. Each recognition result then contains a channelTag field stating which channel it belongs to. If this is not true, only the first channel is recognized. The request is billed cumulatively for all channels recognized: audioChannelCount multiplied by the length of the audio.
languageCode	string Required. The language of the supplied audio as a BCP-47 language tag. Example: "en-IN". See Language Support for a list of the currently supported language codes.
maxAlternatives	integer Maximum number of recognition hypotheses to be returned. Specifically, the maximum number of SpeechRecognitionAlternative messages within each SpeechRecognitionResult. The server may return fewer than maxAlternatives. Valid values are 0-10. A value of 0 or 1 will return a maximum of one. If omitted, will return a maximum of one.
speechContexts[]	object (SpeechContext) Array of SpeechContext. This feature is experimental and may not work yet. We already support biasing models with customer-specific terminology, so this field may not be needed.
diarizationConfig	object (SpeakerDiarizationConfig) Config to enable speaker diarization and set additional parameters to make diarization better suited for your application. This feature is experimental and may not work yet.
enableWordTimeOffsets	boolean If true, the top result includes a list of words and the start and end time offsets (timestamps) for those words. If false, no word-level time offset information is returned. The default is false.
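
The constraints in the table can be checked on the client before a request is sent. The sketch below is one way to do that; `validate_config` and `billed_seconds` are hypothetical helper names, not part of the API. Two assumptions are made: the 1-8 channel range documented for LINEAR16 and FLAC is applied to every encoding, and a request without per-channel recognition is billed for the single recognized channel.

```python
def validate_config(config: dict) -> list:
    """Return a list of problems found in a RecognitionConfig-shaped dict."""
    problems = []
    if not config.get("languageCode"):
        problems.append("languageCode is required")
    rate = config.get("sampleRateHertz")
    if rate is not None and rate != 8000:
        problems.append("sampleRateHertz must be 8000")
    # Assumption: the documented 1-8 range (0 means mono) applies to all encodings.
    channels = config.get("audioChannelCount", 0)
    if not 0 <= channels <= 8:
        problems.append("audioChannelCount must be between 0 and 8")
    if config.get("enableSeparateRecognitionPerChannel") and channels < 2:
        problems.append("per-channel recognition needs audioChannelCount > 1")
    alternatives = config.get("maxAlternatives", 0)
    if not 0 <= alternatives <= 10:
        problems.append("maxAlternatives must be between 0 and 10")
    return problems


def billed_seconds(config: dict, audio_seconds: float) -> float:
    """Estimate billed audio time: channels recognized times audio length."""
    channels = config.get("audioChannelCount", 0) or 1  # 0 or omitted means mono
    if config.get("enableSeparateRecognitionPerChannel") and channels > 1:
        return channels * audio_seconds
    # Assumption: only the first channel is recognized, so one channel is billed.
    return audio_seconds


stereo = {
    "languageCode": "en-IN",
    "sampleRateHertz": 8000,
    "audioChannelCount": 2,
    "enableSeparateRecognitionPerChannel": True,
}
print(validate_config(stereo))
print(billed_seconds(stereo, 30.0))
```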

AudioEncoding

The encoding of the audio data sent in the request.

All encodings support only 1 channel (mono) audio, unless the audioChannelCount and enableSeparateRecognitionPerChannel fields are set.

Currently supported: WAV (LINEAR16) and MP3 (MP3).

For best results, the audio source should be captured and transmitted using a lossless encoding (LINEAR16). The accuracy of the speech recognition can be reduced if lossy codecs are used to capture or transmit audio, particularly if background noise is present. Lossy codecs include MULAW, AMR, AMR_WB, OGG_OPUS, SPEEX_WITH_HEADER_BYTE, and MP3.

Enums

Format	Description
LINEAR16	Uncompressed 16-bit signed little-endian samples (Linear PCM).
MP3	Compressed MP3-encoded stream.
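
Before sending a WAV file as LINEAR16, it can help to confirm that its header matches what the recognizer expects. The sketch below uses Python's standard `wave` module, which reads only PCM WAV files. `describe_wav` and `is_supported_linear16` are hypothetical helper names, and the demo audio is one second of silence generated in memory.

```python
import io
import wave


def describe_wav(source) -> dict:
    """Read the header of a PCM WAV file given a path or a file-like object."""
    with wave.open(source, "rb") as wav:
        return {
            "sampleRateHertz": wav.getframerate(),
            "audioChannelCount": wav.getnchannels(),
            "bitsPerSample": wav.getsampwidth() * 8,
        }


def is_supported_linear16(info: dict) -> bool:
    """True when the audio is 16-bit PCM at the supported 8000 Hz rate."""
    return info["sampleRateHertz"] == 8000 and info["bitsPerSample"] == 16


# Build one second of silent mono 16-bit audio at 8000 Hz for the demo.
buffer = io.BytesIO()
with wave.open(buffer, "wb") as wav:
    wav.setnchannels(1)
    wav.setsampwidth(2)  # 2 bytes per sample = 16 bits
    wav.setframerate(8000)
    wav.writeframes(b"\x00\x00" * 8000)
buffer.seek(0)

info = describe_wav(buffer)
print(info, is_supported_linear16(info))
```

The returned sampleRateHertz and audioChannelCount values can be copied straight into the RecognitionConfig.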

LanguageSupport

Vernacular ASR currently supports only Indian languages. Use the following language codes.

Language	Code
Hindi	hi-IN
English	en-IN
Kannada	kn-IN
Malayalam	ml-IN
Bengali	bn-IN
Marathi	mr-IN
Gujarati	gu-IN
Punjabi	pa-IN
Telugu	te-IN
Tamil	ta-IN
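
A client can keep this table as a lookup so that an unsupported language fails early instead of at request time. This is a minimal sketch; `SUPPORTED_LANGUAGES` and `language_code` are hypothetical names introduced for the example.

```python
# Language name to BCP-47 code, copied from the table above.
SUPPORTED_LANGUAGES = {
    "Hindi": "hi-IN",
    "English": "en-IN",
    "Kannada": "kn-IN",
    "Malayalam": "ml-IN",
    "Bengali": "bn-IN",
    "Marathi": "mr-IN",
    "Gujarati": "gu-IN",
    "Punjabi": "pa-IN",
    "Telugu": "te-IN",
    "Tamil": "ta-IN",
}


def language_code(name: str) -> str:
    """Return the languageCode for a language name, ignoring case and spacing."""
    try:
        return SUPPORTED_LANGUAGES[name.strip().title()]
    except KeyError:
        raise ValueError(f"unsupported language: {name!r}") from None


print(language_code("tamil"))
```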

SpeechContext

Provides hints to the speech recognizer to favor specific words and phrases in the results.

{
  "phrases": [
    string
  ]
}

Field	Description
phrases[]	string A list of strings containing words and phrases "hints" so that the speech recognition is more likely to recognize them. This can be used to improve the accuracy for specific words and phrases, for example, if specific commands are typically spoken by the user. This can also be used to add additional words to the vocabulary of the recognizer. See usage limits.
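
As a sketch of how phrase hints map onto this shape, the helper below wraps a list of phrases in a single SpeechContext and drops blanks and duplicates. `speech_contexts` is a hypothetical name, and the phrases are made up for the example. Since the feature is described as experimental, the recognizer may ignore the result.

```python
def speech_contexts(phrases) -> list:
    """Wrap phrase hints in the speechContexts[] shape, dropping blanks and duplicates."""
    cleaned = []
    for phrase in phrases:
        text = phrase.strip()
        if text and text not in cleaned:
            cleaned.append(text)
    return [{"phrases": cleaned}]


# Illustrative phrases only.
contexts = speech_contexts(["check balance", " check balance ", "", "block my card"])
print(contexts)
```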

SpeakerDiarizationConfig

Config to enable speaker diarization.

{
  "enableSpeakerDiarization": boolean,
  "minSpeakerCount": integer,
  "maxSpeakerCount": integer,
  "speakerTag": integer
}

Field	Description
enableSpeakerDiarization	boolean If true, enables speaker detection for each recognized word in the top alternative of the recognition result.
minSpeakerCount	integer Minimum number of speakers in the conversation. This range gives you more flexibility by allowing the system to automatically determine the correct number of speakers. If not set, the default value is 2.
maxSpeakerCount	integer Maximum number of speakers in the conversation. This range gives you more flexibility by allowing the system to automatically determine the correct number of speakers. If not set, the default value is 6.
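
The speaker range can be built and sanity-checked with a small helper. This is a sketch under stated assumptions: `diarization_config` is a hypothetical name, the defaults of 2 and 6 come from the table above, and the checks that the minimum is at least 1 and does not exceed the maximum are assumptions, not documented rules.

```python
def diarization_config(min_speakers: int = 2, max_speakers: int = 6) -> dict:
    """Build a SpeakerDiarizationConfig-shaped dict with a checked speaker range."""
    # Assumption: a conversation has at least one speaker.
    if min_speakers < 1:
        raise ValueError("minSpeakerCount must be at least 1")
    # Assumption: the range must not be inverted.
    if max_speakers < min_speakers:
        raise ValueError("maxSpeakerCount must be >= minSpeakerCount")
    return {
        "enableSpeakerDiarization": True,
        "minSpeakerCount": min_speakers,
        "maxSpeakerCount": max_speakers,
    }


# A two-party phone call: exactly two speakers expected.
print(diarization_config(min_speakers=2, max_speakers=2))
```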
