Speech APIs

Microsoft Azure Speech services are a suite of APIs for converting speech to text, synthesizing speech from text, translating spoken audio, and recognizing speakers.

Speech to Text

Transcribe spoken audio into text. The service supports multiple languages, real-time streaming, and batch processing.

Real-time Transcription: Process audio as it's captured, ideal for live captioning and voice commands.
Batch Transcription: Transcribe pre-recorded audio files efficiently.
Customization: Adapt models to specific vocabularies, accents, and acoustic environments.

Example Usage (REST API):


POST https://YOUR_SERVICE_REGION.stt.speech.microsoft.com/speech/recognition/conversation/cognitiveservices/v1?language=en-US
Ocp-Apim-Subscription-Key: YOUR_SPEECH_KEY
Content-Type: audio/wav; codecs=audio/pcm; samplerate=16000
Accept: application/json

BINARY_WAV_AUDIO_DATA
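A short-audio speech-to-text request to the Azure Speech REST endpoint can also be assembled programmatically. The following is a minimal Python sketch using only the standard library. The helper name build_recognition_request, the "westus" region, and the placeholder audio bytes are illustrative assumptions; the request is built but not sent.

```python
import urllib.parse
import urllib.request


def build_recognition_request(region: str, key: str, wav_bytes: bytes,
                              language: str = "en-US") -> urllib.request.Request:
    """Build (but do not send) a short-audio speech-to-text request.

    Hypothetical helper for illustration; it is not part of any Azure SDK.
    """
    query = urllib.parse.urlencode({"language": language, "format": "simple"})
    url = (f"https://{region}.stt.speech.microsoft.com"
           f"/speech/recognition/conversation/cognitiveservices/v1?{query}")
    headers = {
        "Ocp-Apim-Subscription-Key": key,
        # 16 kHz, 16-bit PCM audio in a WAV container
        "Content-Type": "audio/wav; codecs=audio/pcm; samplerate=16000",
        "Accept": "application/json",
    }
    return urllib.request.Request(url, data=wav_bytes, headers=headers, method="POST")


# Placeholder values: substitute a real region, key, and WAV file contents.
request = build_recognition_request("westus", "YOUR_SPEECH_KEY", b"PLACEHOLDER_WAV_BYTES")
print(request.get_method(), request.full_url)
```

Passing the request to urllib.request.urlopen would send it. With format=simple, the JSON response is expected to carry the transcript in its DisplayText field.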

Text to Speech

Convert written text into natural-sounding synthesized speech. Choose from a wide variety of voices and languages.

Neural Voices: Synthesize human-like speech using neural text-to-speech models.
SSML Support: Control pronunciation, pitch, rate, and more using Speech Synthesis Markup Language.
Custom Voice: Create your own unique brand voice.

Example Usage (SDK - C#):


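// Requires the Azure Speech SDK (NuGet package Microsoft.CognitiveServices.Speech).
// Run as top-level statements (C# 9 or later) or inside an async method.
using System;
using Microsoft.CognitiveServices.Speech;

var speechConfig = SpeechConfig.FromSubscription("YOUR_SPEECH_KEY", "YOUR_SERVICE_REGION");
speechConfig.SpeechSynthesisVoiceName = "en-US-AriaNeural";

// With no AudioConfig supplied, the synthesizer plays audio on the default speaker.
using (var synthesizer = new SpeechSynthesizer(speechConfig))
{
    var speechSynthesisResult = await synthesizer.SpeakTextAsync("Hello, world!");
    if (speechSynthesisResult.Reason == ResultReason.SynthesizingAudioCompleted)
    {
        // Audio data is available in speechSynthesisResult.AudioData
    }
    else if (speechSynthesisResult.Reason == ResultReason.Canceled)
    {
        // Report why synthesis failed (for example, an invalid key or region).
        var details = SpeechSynthesisCancellationDetails.FromResult(speechSynthesisResult);
        Console.WriteLine($"Synthesis canceled: {details.Reason} {details.ErrorDetails}");
    }
}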
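To illustrate the SSML support listed among the features, the sketch below builds a small SSML document in Python with the standard library. The helper name build_ssml and the default voice, rate, and pitch values are illustrative assumptions; the element names (speak, voice, prosody) come from the W3C SSML specification.

```python
import xml.etree.ElementTree as ET

SSML_NS = "http://www.w3.org/2001/10/synthesis"
XML_NS = "http://www.w3.org/XML/1998/namespace"


def build_ssml(text: str, voice: str = "en-US-AriaNeural",
               rate: str = "-10%", pitch: str = "+5%", lang: str = "en-US") -> str:
    """Return an SSML string that speaks `text` with adjusted rate and pitch.

    Hypothetical helper for illustration; the default values are assumptions.
    """
    # Serialize SSML elements without a prefix (as the default namespace).
    ET.register_namespace("", SSML_NS)
    speak = ET.Element(f"{{{SSML_NS}}}speak",
                       {"version": "1.0", f"{{{XML_NS}}}lang": lang})
    voice_el = ET.SubElement(speak, f"{{{SSML_NS}}}voice", {"name": voice})
    prosody = ET.SubElement(voice_el, f"{{{SSML_NS}}}prosody",
                            {"rate": rate, "pitch": pitch})
    prosody.text = text  # special characters such as & are escaped on output
    return ET.tostring(speak, encoding="unicode")


ssml = build_ssml("Hello, world!")
print(ssml)
```

The resulting string can be passed to the SDK's SSML entry point (SpeakSsmlAsync in C#) in place of the plain-text call.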

Speech Translation

Translate spoken audio from one language to another, either in real time or from recorded audio files.

Multilingual Support: Translate between numerous language pairs.
Low Latency: Ideal for live multilingual conversations.
Combined Capabilities: Integrates Speech to Text and Text Translation services.
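
As a rough illustration of translating a single utterance, here is a minimal sketch using the Speech SDK for Python (package azure-cognitiveservices-speech). The function name translate_once, its parameters, and the target languages are illustrative assumptions; running it requires the SDK, a valid key and region, and a WAV file.

```python
def translate_once(wav_path, key, region,
                   source_language="en-US", target_languages=("de", "fr")):
    """Recognize one utterance from a WAV file and return {language: text}.

    Hypothetical helper for illustration, not an official sample.
    """
    # Imported lazily so this module loads even where the SDK is not installed.
    import azure.cognitiveservices.speech as speechsdk

    config = speechsdk.translation.SpeechTranslationConfig(
        subscription=key, region=region)
    config.speech_recognition_language = source_language
    for language in target_languages:
        config.add_target_language(language)

    audio = speechsdk.audio.AudioConfig(filename=wav_path)
    recognizer = speechsdk.translation.TranslationRecognizer(
        translation_config=config, audio_config=audio)

    # recognize_once() returns after the first utterance has been processed.
    result = recognizer.recognize_once()
    if result.reason == speechsdk.ResultReason.TranslatedSpeech:
        return dict(result.translations)
    raise RuntimeError(f"Translation did not complete: {result.reason}")
```

A call such as translate_once("meeting.wav", "YOUR_SPEECH_KEY", "YOUR_SERVICE_REGION") would return one translated string per target language; the file name here is a placeholder.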

Speaker Recognition

Identify or verify speakers from audio samples. Useful for voice biometrics and personalized experiences.

Speaker Identification: Determine "who is speaking" from a group of known speakers.
Speaker Verification: Confirm whether a speaker is who they claim to be.
Voice Profiles: Create and manage voice profiles for enrolled speakers.
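
The difference between identification (one sample against many profiles) and verification (one sample against one claimed profile) can be sketched conceptually in Python. This is a toy illustration, not how the Azure service is implemented: the three-element vectors stand in for voice profiles, and the cosine-similarity scoring and the 0.8 threshold are assumptions chosen for the example.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


def verify(sample, profile, threshold=0.8):
    """Verification (1:1): does the sample match the single claimed profile?"""
    return cosine_similarity(sample, profile) >= threshold


def identify(sample, profiles):
    """Identification (1:N): which enrolled profile is closest to the sample?"""
    return max(profiles, key=lambda name: cosine_similarity(sample, profiles[name]))


# Hypothetical enrolled voice profiles and an incoming sample.
profiles = {"alice": [0.9, 0.1, 0.2], "bob": [0.1, 0.8, 0.5]}
sample = [0.85, 0.15, 0.25]

print(identify(sample, profiles))         # closest enrolled speaker
print(verify(sample, profiles["alice"]))  # claim: "I am alice"
print(verify(sample, profiles["bob"]))    # claim: "I am bob"
```

Identification always returns the best candidate among enrolled speakers, while verification answers a yes/no question about a single claimed identity.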

Explore the full documentation for detailed API specifications, SDKs, and quickstart guides.