In an age of back-to-back video calls and online events, real-time transcription is becoming a must-have. Live captions and searchable conversation logs help make meetings more accessible, educational lectures easier to review, and sales calls easier to analyze later.
Having a running transcript means participants can read along or catch up if they miss something, and later search the text for key points. It also provides a record for compliance or training purposes. But delivering accurate, real-time transcripts in a multi-person setting is a tough technical challenge.

What Makes Live Transcription So Challenging?
A few key factors:
- Low latency: Captions need to appear almost instantly as someone speaks, or they lose their value. Any significant delay can confuse or distract users.
- Audio quality: Background noise, echoes, or low-quality mics can derail speech recognition. The system must handle imperfect audio in real time.
- Multiple speakers: In group calls, identifying who said each line (speaker diarization) is non-trivial. Overlapping speech can trip up a naive transcription system.
- Cost and scale: Transcribing one stream is easy; doing it for dozens of simultaneous speakers (and possibly many concurrent meetings) can rack up costs if using cloud AI services. Scaling that reliably is another hurdle.
- Integration complexity: Bringing transcription into an app means dealing with audio streams, APIs or AI models, and ensuring everything works across different user devices and network conditions.
The Browser-Side Approach: Web Speech API
The simplest approach a developer might try is a quick browser-side hack: using the Web Speech API built into browsers for speech recognition. Modern browsers (particularly Chrome) let you access a SpeechRecognition API that transcribes microphone audio to text on the client. This requires no server and no external API keys – just a few lines of JavaScript.
However, this simplicity comes with significant limitations. For one, the Web Speech API may impose time limits (for example, Chrome’s implementation often cuts off after ~60 seconds of speech on desktop). Quality and availability can vary by device and browser, and there’s no standard support for differentiating speakers or channels – it’s typically just transcribing one source at a time (usually the user’s own mic).
Moreover, because it runs in each user’s browser, the transcripts stay on the client unless you explicitly ship them to a server, making centralized logging or search more complex. In short, the browser API approach is fine for a quick demo, but it falls short for robust multi-person or multi-session transcription (no diarization, inconsistent limits, and hard to share transcripts across users).
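For a sense of what this looks like, here is a minimal client-side sketch using the standard Web Speech API (Chrome exposes it as the prefixed webkitSpeechRecognition; behavior and availability vary by browser, so treat this as illustrative rather than production code):

// Minimal browser-side sketch: transcribe the local microphone with the Web Speech API.
const SpeechRecognitionImpl =
  (window as any).SpeechRecognition ?? (window as any).webkitSpeechRecognition;

const recognition = new SpeechRecognitionImpl();
recognition.continuous = true;      // keep listening across utterances (until the browser cuts the session off)
recognition.interimResults = true;  // emit partial results while the user is still speaking
recognition.lang = 'en-US';

recognition.onresult = (event: any) => {
  for (let i = event.resultIndex; i < event.results.length; i++) {
    const { transcript } = event.results[i][0];
    const isFinal = event.results[i].isFinal;
    console.log(`${isFinal ? 'final' : 'interim'}: ${transcript}`);
  }
};

recognition.onerror = (event: any) => console.error('Speech recognition error:', event.error);
recognition.start();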

Progressing to Server-Side (SFU-Based) Transcription
To build a scalable real-time transcription service, moving the speech recognition to the server side is the logical next step. This typically means introducing a WebRTC media server — often a Selective Forwarding Unit (SFU) — into your architecture. An SFU receives audio/video streams from all participants and then selectively forwards those streams to others without mixing them. For example, in a 100-person conference, the SFU might route only the active speakers’ audio/video to each participant, optimizing bandwidth. The SFU’s role as a central router also makes it an ideal point to grab the audio for transcription purposes.
Server-side audio export works like this: each participant’s audio stream is sent to the SFU, and the SFU (or an associated server process) can tap into those streams in real time. Rather than having each client do recognition, the server can pipe the raw audio from each user into a streaming Automatic Speech Recognition (ASR) service. Cloud providers like Google (for example, the Vertex AI Speech-to-Text or new Gen AI Live APIs) and startups like Deepgram offer streaming speech-to-text APIs that accept raw audio and return text incrementally. By forwarding PCM audio to such an API, you can receive interim transcripts (updates as the person is speaking) and final transcripts (after a pause or sentence ends) with only a couple hundred milliseconds of latency.
Advantages of the SFU-Based Approach
- Since audio is captured per participant, we inherently know the speaker’s identity for each transcript. There’s no need for complex speaker diarization algorithms to label the text — you can tag it by the participant ID or name since each audio stream corresponds to a known peer.
- Offloading recognition to the server side reduces client workload — user devices just send their audio as usual, and all the heavy lifting (and any API calls to external AI services) happens on the backend. This means consistent behavior across devices (no reliance on browser quirks) and lower CPU/memory use on users’ machines.
- A central service makes it easier to aggregate and distribute the transcripts. For instance, the server can merge all speakers’ subtitles and broadcast the combined transcript to every participant in the meeting, or store it in a database — all in one place.
Implementation Overview
To sketch a solution, imagine an SFU that exposes a hook for incoming audio. Pseudocode in Node.js might look like:
sfu.on("audioTrack", (participantId, pcmChunk) => {
  // Send the PCM audio chunk to a streaming ASR service (e.g., via SDK or websocket)
  asrClient.sendAudio(participantId, pcmChunk);
});

// Listen for transcription results from the ASR service
asrClient.on("transcription", (participantId, result) => {
  const text = result.text;
  const isFinal = result.isFinal;
  console.log(`Speaker ${participantId} (${isFinal ? "final" : "interim"}): ${text}`);
  // Here you could broadcast the text to clients in real time, tagged with the speaker's name.
});
In practice, the exact API varies. If using Google’s streaming API, you might open a gRPC or WebSocket stream and continuously feed it audio bytes, receiving live StreamingRecognitionResult messages. With Deepgram’s API, you’d open a persistent connection that sends back JSON transcripts as it processes the stream.
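As a rough, non-Fishjam-specific illustration, a per-speaker streaming session with Google’s @google-cloud/speech Node client could be wired up roughly like this (the config values are assumptions to adapt to your setup and API version):

import speech from '@google-cloud/speech';

// Sketch of one streaming recognition session for one speaker.
const speechClient = new speech.SpeechClient();

const recognizeStream = speechClient
  .streamingRecognize({
    config: {
      encoding: 'LINEAR16',     // 16-bit PCM
      sampleRateHertz: 16000,
      languageCode: 'en-US',
    },
    interimResults: true,       // receive partial transcripts while the user is speaking
  })
  .on('error', console.error)
  .on('data', (response) => {
    const result = response.results[0];
    if (!result) return;
    const text = result.alternatives[0].transcript;
    console.log(`${result.isFinal ? 'final' : 'interim'}: ${text}`);
  });

// Feed raw PCM chunks (e.g., from the SFU's audio hook) into the stream:
// recognizeStream.write(pcmChunk);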
Audio Format Considerations
One important detail is the audio format. Most real-time ASR services expect raw PCM audio in a specific format (often 16-bit linear PCM at 16 kHz or 8 kHz for speech). Your SFU or media pipeline should output audio in a format the ASR likes. If it doesn’t by default, you may need to resample or convert the audio. (For example, Chrome typically captures audio at 48 kHz Opus — you’d convert that to 16 kHz PCM for many speech APIs.) Some SFU solutions now directly support exporting audio in 16 kHz PCM to make this easier.
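If your pipeline does hand you 48 kHz PCM, a crude conversion is to average and decimate down to 16 kHz; in production you would more likely use ffmpeg, sox, or a proper resampling library. A rough sketch for 16-bit mono PCM in a Node Buffer (the function name is just illustrative):

// Naive 48 kHz -> 16 kHz downsample for 16-bit mono PCM: average each group of 3 samples.
// (Averaging acts as a crude low-pass filter; a real resampler does this far better.)
function downsample48kTo16k(input: Buffer): Buffer {
  const inSamples = Math.floor(input.length / 2);   // 2 bytes per 16-bit sample
  const outSamples = Math.floor(inSamples / 3);
  const output = Buffer.alloc(outSamples * 2);
  for (let i = 0; i < outSamples; i++) {
    const a = input.readInt16LE(i * 6);
    const b = input.readInt16LE(i * 6 + 2);
    const c = input.readInt16LE(i * 6 + 4);
    output.writeInt16LE(Math.round((a + b + c) / 3), i * 2);
  }
  return output;
}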
Introducing Fishjam
Meet Fishjam — a WebRTC infrastructure and SDK provider designed to simplify real-time media features like this. Fishjam provides a cloud-based SFU (Selective Forwarding Unit) and easy-to-use client and server SDKs, so developers can focus on their app logic rather than managing WebRTC internals. It’s essentially an all-in-one platform for live video/audio features (think video conferencing, interactive streaming, etc.), created by the team at Software Mansion. What makes Fishjam especially relevant here is its new audio pipeline capabilities — essentially built-in support for real-time audio export and import.

Audio Export with Fishjam
Fishjam allows you to spin up a server-side agent that joins your call (virtually) and subscribes to any participant’s audio stream in raw PCM form. In fact, Fishjam’s APIs were designed with AI integration in mind: you can specify that a given peer’s audio should be forwarded to agents as 16-bit PCM at 16 kHz (or 24 kHz), which matches what most speech-to-text engines expect. Once that subscription is in place, your backend code receives a steady stream of PCM audio chunks for each subscribed user. From there, it’s straightforward to feed those chunks into an ASR service of your choice.
Example Implementation
For example, using Fishjam’s Node.js server SDK:
import { GoogleGenAI, LiveServerMessage, Modality, Session } from '@google/genai';
import {
  FishjamClient,
  FishjamConfig,
  FishjamWSNotifier,
  PeerId,
} from '@fishjam-cloud/js-server-sdk';

const fishjamConfig: FishjamConfig = {
  fishjamId: process.env.FISHJAM_ID,
  managementToken: process.env.FISHJAM_TOKEN,
};

// Server SDK client for managing rooms and agents, plus a Gemini client for speech-to-text
const fishjamClient = new FishjamClient(fishjamConfig);
const ai = new GoogleGenAI({ apiKey: process.env.GEMINI_API_KEY });

const notifier = new FishjamWSNotifier(
  fishjamConfig,
  (error) => console.error('Fishjam websocket error: %O', error),
  (code, reason) => console.log(`Fishjam websocket closed. code: ${code}, reason: ${reason}`)
);

// one ASR session per speaker
const peerSessions = new Map<PeerId, Session>();

notifier.on('peerConnected', async ({ peerId }) => {
  const session = await ai.live.connect({
    model: 'gemini-live-2.5-flash-preview',
    config: { responseModalities: [Modality.TEXT], inputAudioTranscription: {} },
    callbacks: {
      onmessage: (msg: LiveServerMessage) => {
        const text = msg.serverContent?.inputTranscription?.text;
        if (text) console.log(`Peer ${peerId} said: ${text}`);
      },
      onclose: () => peerSessions.delete(peerId),
      onerror: (e) => console.error(`Gemini error for ${peerId}:`, e),
    },
  });
  peerSessions.set(peerId, session);
});

// Create an agent (server-side participant) in the room to get PCM16 @ 16 kHz.
// roomId is the ID of the room you want to transcribe (created earlier via the server SDK).
const subscribe = { audioFormat: 'pcm16', audioSampleRate: 16000 };
const { agent } = await fishjamClient.createAgent(roomId, { subscribe });
console.log('Agent connected and listening for audio.');

// Stream each PCM chunk to the speaker's persistent ASR session
agent.on('trackData', ({ peerId, data }) => {
  const session = peerSessions.get(peerId);
  session?.sendRealtimeInput({
    audio: {
      data: Buffer.from(data).toString('base64'),
      mimeType: 'audio/pcm;rate=16000',
    },
  });
});
In the code above, we tell Fishjam that our agent wants PCM audio at 16 kHz. This agent now behaves like a ghost participant in the meeting who can hear everyone. Every time a participant speaks, the agent fires a trackData event with the latest audio bytes. We then forward those bytes to that speaker's ASR session (here a Gemini Live session, but any streaming speech-to-text API would work), and the resulting transcripts can be logged or broadcast to clients as they arrive. This is the essence of real-time transcription via Fishjam: the heavy lifting of capturing, decoding, and forwarding audio is handled by the infrastructure, and you glue in your preferred AI for the speech-to-text part.
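To get those transcripts in front of users, the backend also needs a channel back to the clients. One simple option (not specific to Fishjam) is a WebSocket fan-out with the ws package; a minimal sketch, with the port and helper name chosen arbitrarily:

import { WebSocketServer, WebSocket } from 'ws';

// Minimal caption fan-out: every connected client gets each transcript line.
// A real app would group sockets per room and authenticate them.
const wss = new WebSocketServer({ port: 8080 });

function broadcastCaption(peerId: string, text: string, isFinal: boolean) {
  const payload = JSON.stringify({ type: 'caption', peerId, text, isFinal });
  for (const client of wss.clients) {
    if (client.readyState === WebSocket.OPEN) client.send(payload);
  }
}

// e.g., call broadcastCaption(peerId, text, true) from the ASR onmessage callback above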
Beyond Transcription: Audio Injection
Fishjam doesn’t just export audio for transcription — it also lets you inject audio back into the room. Your backend agent can publish audio like a participant, which unlocks voice bots and spoken responses: generate TTS and play it live, whisper coaching tips, or speak real-time translations. The same agent plumbing can power other real-time audio jobs too (keyword spotting, recording, moderation), but the combo of Fishjam + ASR is especially compelling for adding live captions and conversational AI without building a media server.
Conclusion
In short, Fishjam turns a hacky browser demo into a scalable, low-latency pipeline: user speaks → audio hits the SFU → PCM to your ASR → text (and optionally TTS) back into the room. You skip WebRTC packet wrangling, Opus decoding, mixing, and SFU scaling—Fishjam handles the media; you focus on product logic. To dive deeper, visit the Fishjam site and the docs.
Ready to bring real-time transcription to your users in a scalable way? Fishjam might be the missing piece to make it happen, with a lot less code and headache on your end. Check out the transcription example for a deeper dive, or head to the Fishjam page to get started with a free trial. If you need help along the way, reach out at projects@swmansion.com.
Happy hacking, and happy transcribing!
Software Mansion: multimedia experts, AI explorers, React Native core contributors, community builders, and software development consultants.