Introduction to the Azure AI Speech SDK for JavaScript
The Azure AI Speech SDK for JavaScript empowers developers to easily integrate advanced speech capabilities into their web applications. Leverage the power of Azure AI Speech services for tasks such as speech-to-text, text-to-speech, speech translation, and speaker recognition directly from your browser.
This SDK provides a robust, performant, and developer-friendly API for building intelligent speech-enabled experiences, enabling natural language interactions and enhancing accessibility in your solutions.
Key Features
- Speech-to-Text (STT): Convert spoken audio into text in real-time or from audio files. Supports multiple languages, custom models, and advanced features like punctuation and intent recognition.
- Text-to-Speech (TTS): Synthesize natural-sounding speech from text. Choose from a wide range of voices, languages, and speaking styles to create engaging audio content.
- Speech Translation: Translate spoken language into another language in real-time, enabling cross-lingual communication.
- Speaker Recognition: Identify or verify speakers based on their voice characteristics.
- Real-time and Batch Processing: Support for both streaming audio and processing pre-recorded audio files.
- Customizable Models: Train custom speech models to improve accuracy for specific accents, jargon, or acoustic environments.
- Web-based: Designed for client-side execution in web browsers, simplifying deployment and integration.
Installation
You can install the Azure AI Speech SDK for JavaScript using npm or yarn. It's recommended to use a bundler like Webpack or Parcel for production environments.
npm install microsoft-cognitiveservices-speech-sdk
# or
yarn add microsoft-cognitiveservices-speech-sdk
Alternatively, you can include it directly from a CDN (for development or simpler scenarios):
<script src="https://cdn.jsdelivr.net/npm/microsoft-cognitiveservices-speech-sdk@latest/distrib/browser/microsoftCognitiveServicesSpeech.js"></script>
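When loaded from the CDN, the browser bundle exposes its API on a global object rather than as a module import; in Microsoft's quickstarts that global is named SpeechSDK. A minimal sketch under that assumption:

<script>
  // With the CDN build, the SDK is available as a global rather than a module.
  const speechConfig = SpeechSDK.SpeechConfig.fromSubscription("YOUR_SPEECH_KEY", "YOUR_SPEECH_REGION");
  speechConfig.speechRecognitionLanguage = "en-US";
</script>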
Getting Started with Speech-to-Text
Here's a basic example of how to perform real-time speech-to-text using the SDK:
Replace YOUR_SPEECH_KEY and YOUR_SPEECH_REGION with your actual Azure Speech service key and region.
import * as speechsdk from "microsoft-cognitiveservices-speech-sdk";

const speechConfig = speechsdk.SpeechConfig.fromSubscription("YOUR_SPEECH_KEY", "YOUR_SPEECH_REGION");
speechConfig.speechRecognitionLanguage = "en-US";

// Capture audio from the default microphone.
const audioConfig = speechsdk.AudioConfig.fromDefaultMicrophoneInput();
const recognizer = new speechsdk.SpeechRecognizer(speechConfig, audioConfig);

console.log("Say something...");

// The SDK delivers events through assignable callback properties.
recognizer.recognized = (s, e) => {
  if (e.result.reason === speechsdk.ResultReason.RecognizedSpeech) {
    console.log(`RECOGNIZED: Text=${e.result.text}`);
  } else if (e.result.reason === speechsdk.ResultReason.NoMatch) {
    console.log("NOMATCH: Speech could not be recognized.");
  }
};

recognizer.canceled = (s, e) => {
  console.log(`CANCELED: Reason=${e.reason}`);
  if (e.reason === speechsdk.CancellationReason.Error) {
    console.log(`CANCELED: ErrorCode=${e.errorCode}`);
    console.log(`CANCELED: ErrorDetails=${e.errorDetails}`);
    console.log("CANCELED: Did you set the speech resource key and region values?");
  }
  recognizer.stopContinuousRecognitionAsync();
};

recognizer.sessionStopped = (s, e) => {
  console.log("\nSession stopped event.");
  recognizer.stopContinuousRecognitionAsync();
};

// Start continuous recognition; the first callback fires once recognition has started.
recognizer.startContinuousRecognitionAsync(
  () => {
    // Recognition started successfully.
  },
  (err) => {
    console.log(err);
    recognizer.stopContinuousRecognitionAsync();
  }
);
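If you only need a single utterance rather than an ongoing session, recognizeOnceAsync is a simpler alternative to continuous recognition. A minimal sketch, reusing the same configuration pattern as above:

import * as speechsdk from "microsoft-cognitiveservices-speech-sdk";

const speechConfig = speechsdk.SpeechConfig.fromSubscription("YOUR_SPEECH_KEY", "YOUR_SPEECH_REGION");
speechConfig.speechRecognitionLanguage = "en-US";

const recognizer = new speechsdk.SpeechRecognizer(
  speechConfig,
  speechsdk.AudioConfig.fromDefaultMicrophoneInput()
);

// Listens until the first recognized utterance (or a timeout), then stops.
recognizer.recognizeOnceAsync(
  (result) => {
    if (result.reason === speechsdk.ResultReason.RecognizedSpeech) {
      console.log(`RECOGNIZED: Text=${result.text}`);
    }
    recognizer.close();
  },
  (err) => {
    console.log(err);
    recognizer.close();
  }
);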
Core Concepts
- Speech Config: Contains your subscription key, region, and language settings.
- Audio Config: Defines the audio input source (e.g., microphone, file).
- Recognizer: The main object for interacting with Speech services. Different recognizer types exist for various scenarios (SpeechRecognizer, TranslationRecognizer).
- Events: The SDK uses an event-driven model to notify you of results, errors, and session status. Key events include recognized, canceled, and sessionStopped.
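Because a raw subscription key should not ship to the browser in production, SpeechConfig.fromAuthorizationToken is the usual alternative for client-side apps. A minimal sketch, assuming a hypothetical backend endpoint /api/get-speech-token that exchanges your key for a short-lived token:

import * as speechsdk from "microsoft-cognitiveservices-speech-sdk";

// Hypothetical backend endpoint that returns { token, region } so the
// subscription key itself never reaches the browser.
async function createSpeechConfig() {
  const response = await fetch("/api/get-speech-token");
  const { token, region } = await response.json();

  // fromAuthorizationToken takes a token and region instead of a key.
  const speechConfig = speechsdk.SpeechConfig.fromAuthorizationToken(token, region);
  speechConfig.speechRecognitionLanguage = "en-US";
  return speechConfig;
}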
API Overview
The SDK exposes various classes and methods for comprehensive control over speech interactions:
- SpeechConfig: static factory methods such as fromSubscription() and fromAuthorizationToken() to configure authentication, plus properties such as speechRecognitionLanguage and speechSynthesisLanguage.
- AudioConfig: static methods such as fromDefaultMicrophoneInput() and fromWavFileInput() to specify audio sources.
- SpeechRecognizer: methods recognizeOnceAsync(), startContinuousRecognitionAsync(), and stopContinuousRecognitionAsync(); events recognized, canceled, sessionStarted, and sessionStopped.
- SpeechSynthesizer: methods speakTextAsync() and speakSsmlAsync() (see the sketch below).
- TranslationRecognizer: methods and events similar to SpeechRecognizer, but for translation scenarios.
Refer to the API Reference for a complete list of classes and methods.
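To show SpeechSynthesizer in action, here is a minimal text-to-speech sketch; the voice name is just an example choice, and any voice supported in your region can be substituted:

import * as speechsdk from "microsoft-cognitiveservices-speech-sdk";

const speechConfig = speechsdk.SpeechConfig.fromSubscription("YOUR_SPEECH_KEY", "YOUR_SPEECH_REGION");
// Example voice name; pick any voice available in your region.
speechConfig.speechSynthesisVoiceName = "en-US-JennyNeural";

// Play the synthesized audio through the default speaker.
const audioConfig = speechsdk.AudioConfig.fromDefaultSpeakerOutput();
const synthesizer = new speechsdk.SpeechSynthesizer(speechConfig, audioConfig);

synthesizer.speakTextAsync(
  "Hello from the Azure AI Speech SDK!",
  (result) => {
    if (result.reason === speechsdk.ResultReason.SynthesizingAudioCompleted) {
      console.log("Synthesis completed.");
    }
    synthesizer.close();
  },
  (err) => {
    console.log(err);
    synthesizer.close();
  }
);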
Supported Platforms
The Azure AI Speech SDK for JavaScript is designed to run in modern web browsers. It is compatible with major browsers including Chrome, Firefox, Safari, and Edge.
Troubleshooting
Common issues and solutions:
- Authentication Errors: Ensure your Speech service subscription key and region are correctly configured.
- Microphone Access Denied: Verify that your browser has permission to access the microphone for the website (see the sketch after this list).
- No Speech Recognized: Check microphone input levels and ensure the correct language is set in speechConfig.
- CORS Issues: If hosting your web app on a different domain than your Azure Speech resource, ensure CORS is configured correctly in Azure.
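For the microphone-access case, you can surface a clearer error to users by requesting permission up front with the standard getUserMedia browser API before constructing the recognizer. A minimal sketch:

// Ask for microphone access before starting recognition so permission
// problems surface as a clear, catchable error.
async function ensureMicrophoneAccess() {
  try {
    const stream = await navigator.mediaDevices.getUserMedia({ audio: true });
    // Stop the probe tracks immediately; the Speech SDK opens its own stream.
    stream.getTracks().forEach((track) => track.stop());
    return true;
  } catch (err) {
    console.log(`Microphone access denied or unavailable: ${err.name}`);
    return false;
  }
}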
For detailed troubleshooting, consult the tutorials and FAQ.