Blog

A collection of technical guides, developer insights, and updates on Fishjam’s latest features and product developments.


Building Interactive Streaming Apps: WebRTC, WHIP & WHEP Explained

by Karol Konkol • Jul 8, 2025 • 5 min read

Live streaming has become extremely popular, especially for broadcasting live sports and streaming on platforms like YouTube and Twitch. Today, it’s easy to watch major events online, and online viewing is starting to take over from traditional TV. The vast majority of streaming platforms use a technology called HTTP Adaptive Streaming (HAS), which can deliver video to millions of viewers with only a few seconds of latency. But if you want a truly real-time, interactive experience, there are a few things to consider.

Going beyond traditional TV — WebRTC for new applications

If you’re trying to move past the limitations of traditional TV and build something better, with significantly less delay, or if you want to use live streaming in totally new ways, there are plenty of options to consider. Just think about creators interacting with their audience in real time, or people joining an online auction and actually participating as it happens. Protocols like HLS (HTTP Live Streaming) and DASH (Dynamic Adaptive Streaming over HTTP) are ideal for video-on-demand or mass broadcasts. However, when low-latency streaming is critical and you can’t afford a delay of even one second, HLS and DASH typically can’t keep up. Fortunately, there is the WebRTC standard, which is perfectly suited for applications that require ultra-low latency and near-perfect synchronization.

What is WebRTC and why does it matter?

WebRTC was designed for direct communication with very low latency — around 200 milliseconds — allowing natural interactions between participants. The standard is supported by browsers, which simplifies the integration of WebRTC-based solutions. Keep in mind, though, that achieving such low latency makes WebRTC a complex solution that incorporates many different protocols, such as ICE, TURN, STUN, DTLS, SDP, and others. This, in turn, often translates into troublesome implementation, debugging, and usage. Plus, WebRTC doesn’t actually specify how connections are established — so, theoretically, they could even involve carrier pigeons. As a result, each implementation of a WebRTC-based solution varies, which adds complexity and means each streaming server needs its own approach to setting up connections.

WHIP & WHEP protocols — simplifying WebRTC for broadcasting

In broadcasting scenarios, connection establishment has been standardized by the WHIP and WHEP protocols, which clearly define how the process should work. Both rely on HTTP and strictly specify the sequence of messages exchanged to establish a connection, making the process easier.

WHIP (WebRTC-HTTP Ingestion Protocol) focuses on stream ingestion, allowing broadcasters to send live media to WebRTC-compatible servers by leveraging HTTP(S) for signaling. This simplifies the setup for sending live streams over the internet. On the other end, WHEP (WebRTC-HTTP Egress Protocol) deals with the distribution aspect, enabling straightforward delivery of live streams from WebRTC servers to end users. This helps reduce latency and improve quality of service by keeping the streaming experience stable and consistent.

By leveraging WHIP and WHEP, you can use the same implementation, applications, or players to receive streams from different providers. This standardization eliminates the need for custom solutions for each provider, lowering development costs and technical barriers for both broadcasters and content creators. As a result, these protocols offer a streamlined framework for efficiently managing the entire live streaming lifecycle — from broadcast to viewer.
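To make the WHIP half of this concrete, here is a minimal publish sketch from the browser. It assumes your streaming server hands you a WHIP endpoint URL and a bearer token, and it leaves out error handling and reconnection:

```ts
// Minimal WHIP publish sketch; endpoint and token are assumed to come from your
// streaming server. WHIP reduces signaling to a single HTTP POST: send an SDP
// offer, receive an SDP answer.
export async function publishViaWhip(endpoint: string, token: string): Promise<RTCPeerConnection> {
  const media = await navigator.mediaDevices.getUserMedia({ video: true, audio: true });

  const pc = new RTCPeerConnection();
  media.getTracks().forEach((track) => pc.addTrack(track, media));

  const offer = await pc.createOffer();
  await pc.setLocalDescription(offer);

  // Real clients usually wait for ICE gathering to complete (or trickle candidates
  // via HTTP PATCH); for brevity we send the initial offer as-is.
  const response = await fetch(endpoint, {
    method: "POST",
    headers: { "Content-Type": "application/sdp", Authorization: `Bearer ${token}` },
    body: offer.sdp,
  });

  // The response body is the server's SDP answer; the Location header identifies
  // the session resource (DELETE it later to stop publishing).
  await pc.setRemoteDescription({ type: "answer", sdp: await response.text() });
  return pc;
}
```

WHEP playback mirrors this exchange from the viewer’s side: the player adds recvonly transceivers, POSTs its SDP offer to a WHEP endpoint, and applies the returned answer.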
Building a live streaming app: should you create your own infrastructure or use third-party providers?

When creating a streaming product, you have to decide whether to build your own infrastructure or rely on an external provider. Setting up and maintaining your own system can be complex and expensive, so many choose to go with third-party providers instead. However, if you decide to build your own infrastructure, the first challenge is dealing with the WebRTC standard, which can be quite complex when it comes to transmitting multimedia. You basically have two options here:

Implementing the standard yourself, which gives you full control over the library and its features. However, this scenario requires a dedicated and skilled team for implementation and maintenance.

Using an existing library and building a multimedia server based on it. Although this is a simpler scenario, it still requires a skilled team capable of effectively debugging WebRTC. And that’s often quite challenging.

Once your multimedia server is up and running, new challenges pop up — like keeping the quality high, scaling smoothly, and keeping latency low. This means managing geolocation, routing streams across clusters and instances, scaling your multimedia servers properly, and handling all the usual headaches of running a distributed infrastructure. Sounds complex? Well, that’s because it is. But external providers can take these burdens off your shoulders by delivering reliable, high-quality multimedia services and offering tools that make it easier to integrate WebRTC into your custom apps.

WebRTC for broadcasting made simple with Fishjam

One of the tools that makes live streaming and broadcasting easier is Fishjam. It’s a live-streaming and video conferencing API designed to seamlessly integrate WebRTC into your application, removing the complexity of custom WebRTC implementations and the need for deep multimedia expertise. With ready-to-use React and React Native SDKs, adding live streaming to your app is simple and user-friendly. Plus, thanks to integration with tools like Smelter, it offers much more than just basic streaming.

To sum up, Fishjam lets you skip the hard parts and start building fast. Don’t wait, check out the website, and be sure to give it a try! And if you need help along the way, feel free to reach out to us at projects@swmansion.com.

Want to learn more about Fishjam and its capabilities? Check out our video: https://medium.com/media/676f4cfd4d963050a698d65fa0a3c21a/href

To explore other applications of WebRTC, read our article on Real-Time Gesture Recognition!

We’re Software Mansion: multimedia experts, AI explorers, React Native core contributors, community builders, and software development consultants.

Building Interactive Streaming Apps: WebRTC, WHIP & WHEP Explained was originally published in Software Mansion on Medium, where people are continuing the conversation by highlighting and responding to this story.


Real-Time Gesture Recognition in Videoconferencing

by Tomasz Mazur • Jun 10, 2025 • 8 min read

Chances are, you’ve come across gesture recognition before. If you’ve ever given someone a thumbs up with Apple’s Reactions or played around with Snapchat or TikTok filters, your device was recognizing your gestures in real time. Gesture detection is quickly going mainstream, making remote conversations feel more natural and engaging. Let’s explore how to detect hand gestures in JavaScript running in the browser and build a simple videoconferencing app with special effects.

Apple’s Reactions in action

What is gesture recognition?

Gesture recognition is a technology that finds and identifies hand gestures in images and video. It’s closely related to pose detection, which identifies and tracks the positions of people in an image or video. What’s more, it’s commonly used to add effects to a video or image source based on the detected gestures.

AI-powered features in videoconferencing

Gesture recognition, or gesture control, falls into a broader category of videoconferencing features that use AI to enhance the user’s experience. Other common features include:

background blur
virtual backgrounds
automatic transcription

Challenges in real-time gesture recognition

Unfortunately, real-time gesture recognition is hard to implement (especially in the browser!), since it combines multiple traditionally difficult tasks: computer vision, low-latency videoconferencing, and real-time video compositing. Running AI models and low-latency streaming are resource-heavy tasks that rely on a range of advanced browser APIs like Web Workers, WebRTC, WebGL, and, more recently, WebGPU. Video compositing in the browser has been around for a while, but it has only recently gotten more attention thanks to the new WebCodecs and experimental Insertable Streams APIs.

Implementing gesture recognition in videoconferencing

Now, we’re going to build a simple React app that lets users join video calls and trigger visual effects whenever they make a ‘timeout’ gesture — because sometimes, a conversation just needs a reset.

Final product with gesture recognition

To build this app, we need to pick the right tools for the job. There are three key challenges to solve, which we’ll break down below.

Detecting gestures in real time

To detect gestures in real time, we’re going to use MediaPipe for its hand landmark detection model. They also have a ready-to-go gesture recognition model, but it doesn’t fit our specific use case. For other AI-powered features, you may want to look at TensorFlow.js and Transformers.js.

Video compositing in real time

To render video effects, we’re going to use Smelter for its speed and simple component-based API. It’s written in Rust and primarily designed for server-side compositing, but thanks to a new WASM build, it can now run entirely in the browser. At the moment, this build works only in Chrome and Safari, though we’re actively working on support for other popular browsers. If you need to target every platform, your safest bet is the built-in HTML Canvas API, but beware: there be dragons.

Real-time communication

This is the backbone of our videoconferencing application. The collection of protocols that allows us to achieve low-latency videoconferencing is called WebRTC. Implementing WebRTC manually requires a lot of development time and infrastructure, so we’re going to use Fishjam (a live streaming and video conferencing API), because its room manager allows us to prototype for free, without our own backend server.
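This post focuses on the video pipeline rather than the call setup, so here is only a hedged sketch of the room-joining glue. It assumes the React SDK exposes a connection hook mirroring the React Native SDK’s useConnection and joinRoom(url, peerToken); treat the exact names and signatures as assumptions and check the Fishjam docs:

```tsx
// Hedged sketch of joining a Fishjam room. The useConnection hook and the
// joinRoom(url, peerToken) signature mirror the React Native SDK; the React SDK's
// actual API may differ, so treat this as an assumption.
import { useEffect } from "react";
import { useConnection } from "@fishjam-cloud/react-client";

export function useJoinRoom(url: string, peerToken: string) {
  const { joinRoom } = useConnection();

  useEffect(() => {
    // During prototyping, url and peerToken come from Fishjam's room manager,
    // so no custom backend is needed.
    joinRoom(url, peerToken);
  }, [joinRoom, url, peerToken]);
}
```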
Obtaining the camera’s video stream

To start detecting gestures, we need to obtain the video stream from our device’s camera. Normally, we would call the built-in getUserMedia() to get a MediaStream, but we’re going to use Fishjam’s useCamera hook, which integrates the above API with the React lifecycle:

export default function App() {
  ...
  const { cameraStream } = useCamera();
  ...
}

Detecting gestures with MediaPipe

Gesture recognition is the most resource-intensive part of our app, so we’re going to have to be careful with how and where we run it. To prevent blocking the main thread when running the hand landmark detection model, we’re going to use a Web Worker to run the detections asynchronously.

Below, you’ll find the complete code for the Web Worker. It receives messages containing VideoFrame objects and replies with any hand landmarks it detects.

// worker.js
let landmarker;

const init = async () => {
  const { FilesetResolver, HandLandmarker } = await import(
    "@mediapipe/tasks-vision"
  );

  // load the correct WASM bundle
  const vision = await FilesetResolver.forVisionTasks(
    "https://cdn.jsdelivr.net/npm/@mediapipe/tasks-vision@latest/wasm",
  );

  landmarker = await HandLandmarker.createFromOptions(vision, {
    baseOptions: {
      modelAssetPath:
        "https://storage.googleapis.com/mediapipe-models/hand_landmarker/hand_landmarker/float16/latest/hand_landmarker.task",
    },
    runningMode: "VIDEO",
    numHands: 2,
  });
};

init();

self.onmessage = ({ data: { frame } }) => {
  const detections = landmarker?.detectForVideo(frame, frame.timestamp);
  frame.close();
  postMessage(detections?.landmarks ?? []);
};

Note that we set runningMode: "VIDEO" to benefit from hand tracking, which improves the accuracy of hand detections. We also need to set numHands: 2, as we need to detect two hands, while the default value is 1.

The worker isn’t very useful on its own; we also need to capture the camera’s MediaStream and send frames to it from the main thread.

// GestureDetector.ts
export type HandGesture = "NONE" | "TIMEOUT";

export class GestureDetector {
  private video: HTMLVideoElement;
  private prevTime: number = 0;
  private closing: boolean = false;
  private worker: Worker;

  constructor(
    stream: MediaStream,
    detectionCallback: (gesture: HandGesture) => void,
  ) {
    this.video = document.createElement("video");
    this.video.srcObject = stream;
    this.video.play();

    // start the Web Worker
    this.worker = new Worker(new URL("./worker.js", import.meta.url));

    // callback to run when the worker responds with landmarks
    this.worker.onmessage = ({ data }) => {
      detectionCallback(findGesture(data));
      this.video.requestVideoFrameCallback(() => this.detect());
    };

    // begin the gesture detection loop
    this.video.requestVideoFrameCallback(() => this.detect());
  }

  detect() {
    if (this.closing) return;

    const currentTime = this.video.currentTime;

    // check if the video has advanced forward
    if (this.prevTime >= currentTime) {
      this.video.requestVideoFrameCallback(() => this.detect());
      return;
    }
    this.prevTime = currentTime;

    const frame = new VideoFrame(this.video);
    this.worker.postMessage({ frame }, [frame]);
  }

  close() {
    this.closing = true;
    this.video.remove();
    this.worker.terminate();
  }
}

The above code does a few things:

It creates a <video> element that will handle playback of the MediaStream.
It starts the Web Worker that will run the MediaPipe model in the background.
It begins the gesture detection loop by repeatedly calling requestVideoFrameCallback(), which allows us to run code when the video frame changes.

Recognizing the “timeout” gesture from hand landmarks

In the GestureDetector implementation, we use a seemingly magical function called findGesture(), which takes hand landmarks and returns a gesture. But in reality, the function is quite simple; it just checks four things:

Are all fingers straight?
Are the fingers of each hand pointing in the same direction?
Are the hands positioned perpendicular to each other?
Is the tip of the middle finger on one hand placed in the palm of the other?

If the answer to all four questions is yes, then it’s a clear ‘timeout’ gesture!
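For illustration, here is a hedged sketch of what those four checks might look like on MediaPipe’s 21 hand landmarks (wrist at index 0, fingertips at 4, 8, 12, 16, 20). The file name, helper names, and thresholds are made up for this example; the demo’s real implementation differs in detail:

```ts
// findGesture.ts — illustrative sketch only; helper names and thresholds are
// hypothetical, and the demo's real implementation differs in detail.
import type { HandGesture } from "./GestureDetector";

type Point = { x: number; y: number; z: number };
type Hand = Point[]; // 21 MediaPipe landmarks: 0 = wrist, 4/8/12/16/20 = fingertips

const TIPS = [8, 12, 16, 20]; // index, middle, ring, pinky tips
const PIPS = [6, 10, 14, 18]; // middle joints of the same fingers
const MCPS = [5, 9, 13, 17];  // knuckles (finger bases)

const sub = (a: Point, b: Point): Point => ({ x: a.x - b.x, y: a.y - b.y, z: a.z - b.z });
const len = (v: Point) => Math.hypot(v.x, v.y, v.z);
const dot = (a: Point, b: Point) => a.x * b.x + a.y * b.y + a.z * b.z;
const norm = (v: Point): Point => {
  const l = len(v);
  return { x: v.x / l, y: v.y / l, z: v.z / l };
};

// 1. A finger counts as straight when its tip is farther from the wrist than its middle joint.
const fingersStraight = (hand: Hand) =>
  TIPS.every((tip, i) => len(sub(hand[tip], hand[0])) > len(sub(hand[PIPS[i]], hand[0])));

// 2. Fingers point the same way when the knuckle-to-tip directions are nearly parallel.
const fingerDirections = (hand: Hand) =>
  TIPS.map((tip, i) => norm(sub(hand[tip], hand[MCPS[i]])));
const fingersAligned = (hand: Hand) => {
  const dirs = fingerDirections(hand);
  return dirs.every((d) => dot(d, dirs[1]) > 0.9); // compare against the middle finger
};

// Rough palm centre: average of the wrist and the four knuckles.
const palmCentre = (hand: Hand): Point => {
  const ids = [0, ...MCPS];
  return {
    x: ids.reduce((s, i) => s + hand[i].x, 0) / ids.length,
    y: ids.reduce((s, i) => s + hand[i].y, 0) / ids.length,
    z: ids.reduce((s, i) => s + hand[i].z, 0) / ids.length,
  };
};

export function findGesture(hands: Hand[]): HandGesture {
  if (hands.length !== 2) return "NONE";
  const [a, b] = hands;

  if (!fingersStraight(a) || !fingersStraight(b)) return "NONE";
  if (!fingersAligned(a) || !fingersAligned(b)) return "NONE";

  // 3. Perpendicular hands: the two pointing directions have a dot product close to 0.
  const perpendicular = Math.abs(dot(fingerDirections(a)[1], fingerDirections(b)[1])) < 0.35;

  // 4. The tip of one middle finger (landmark 12) rests in the other hand's palm.
  const touching =
    len(sub(a[12], palmCentre(b))) < 0.1 || len(sub(b[12], palmCentre(a))) < 0.1;

  return perpendicular && touching ? "TIMEOUT" : "NONE";
}
```

The thresholds (0.9, 0.35, 0.1) are arbitrary here; in practice you would tune them against real detections.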
If you want to know the specifics of how these conditions are checked in practice, you can check out the demo’s source code.

Lastly, we’re going to integrate GestureDetector with the React lifecycle by creating a useGesture hook:

export const useGesture = (stream: MediaStream | null) => {
  const [gesture, setGesture] = useState<HandGesture>("NONE");

  useEffect(() => {
    if (!stream) return;

    const detector = new GestureDetector(stream, setGesture);

    return () => {
      detector.close();
      setGesture("NONE");
    };
  }, [stream]);

  return gesture;
};

Adding effects to the video stream

The last thing we need to do is trigger an effect when we detect the "TIMEOUT" gesture. As shown in the GIF below, we want some text to slide in, pause, and then slide out.

Final product with gesture recognition

To add effects to the camera stream, we need to register it as an input with Smelter:

const { cameraStream } = useCamera();
...
await smelter.registerInput("cameraStream", {
  type: "stream",
  stream: cameraStream.clone(),
});

Now that we have the input, we need to set up an output to tell Smelter what to do with it:

const { stream: output } = await smelter.registerOutput(
  "modifiedCamera",
  <VideoWithEffects stream={cameraStream} inputId="cameraStream" />,
  {
    type: "stream",
    video: { resolution: { width: 1280, height: 720 } },
  },
);

// we can now use output to interact with the modified stream
// e.g. we can tell Fishjam to send the modified stream to others
import { useCustomSource } from "@fishjam-cloud/react-client";
...
const { setStream } = useCustomSource("custom-camera");
...
setStream(output);

Note that smelter.registerOutput() takes three arguments:

The ID of the output
The layout of the output
The options of the output, most notably the type (this can be “stream”, “whip” or “canvas”) and the resolution.

The layout is the most interesting part, since it can be a React component, which means the output video layout can be reactive. We’re going to make use of this in our <VideoWithEffects> component:

// VideoWithEffects.tsx
export type VideoWithEffectsProps = {
  stream: MediaStream;
  inputId: string;
};

const DURATION = 5000;

export default function VideoWithEffects({
  stream,
  inputId,
}: VideoWithEffectsProps) {
  const gesture = useGesture(stream);
  const [animating, setAnimating] = useState(false);

  useEffect(() => {
    if (gesture === "TIMEOUT" && !animating) {
      // start the animation
      setAnimating(true);
      // reset the flag when the animation is done
      setTimeout(() => setAnimating(false), DURATION + 500);
    }
  }, [gesture, animating]);

  return (
    <View>
      <Rescaler>
        <InputStream inputId={inputId} />
      </Rescaler>
      {animating && (
        <Animation duration={DURATION} />
      )}
    </View>
  );
}

The components <View> and <Rescaler> are baked into the Smelter TypeScript SDK, which has a lot of utilities for creating layouts. The layout is reactive thanks to the useGesture() hook, which allows us to render an <Animation> whenever a gesture is recognized.
A simple example implementation of <Animation> is shown below:

// Animation.tsx
export type AnimationProps = {
  duration: number;
};

type AnimationState = "before" | "pause" | "after";

const START_DELAY = 100;
const WIDTH = 1280;

export default function Animation({ duration }: AnimationProps) {
  const [animationState, setAnimationState] = useState<AnimationState>("before");
  const durationMs = (duration - START_DELAY) / 3;

  // slide in from the right and out to the left
  const right = useMemo(() => {
    switch (animationState) {
      case "before":
        return WIDTH;
      case "pause":
        return 0;
      default:
        return -2 * WIDTH;
    }
  }, [animationState]);

  useEffect(() => {
    setTimeout(() => {
      setAnimationState("pause");
      setTimeout(() => setAnimationState("after"), 2 * durationMs);
    }, START_DELAY);
  }, [durationMs]);

  return (
    <View style={{ top: 0, left: 0 }}>
      <Rescaler
        style={{ bottom: 0, right }}
        transition={{ durationMs, easingFunction: "bounce" }}
      >
        <Image source="/assets/timeout-text.gif" />
      </Rescaler>
    </View>
  );
}

Try it out yourself!

We’ve covered the core components needed to implement gesture recognition in TypeScript, running right in the browser. If you want to see the full example in action, make sure to check out our hosted demo, or its source code on GitHub. If you’re working on AI-based features with real-time video and you need further help, reach out to us on Discord.

Closing remarks

Real-time gesture recognition isn’t without its challenges, but thanks to tools like MediaPipe, Fishjam, and Smelter, it’s getting a whole lot easier, especially on the web. And with powerful solutions like the Insertable Streams API becoming more widely available, the future of browser-based video effects looks really promising.

We’re Software Mansion: multimedia experts, AI explorers, React Native core contributors, community builders, and software development consultants.

Real-Time Gesture Recognition in Videoconferencing was originally published in Software Mansion on Medium, where people are continuing the conversation by highlighting and responding to this story.


WebRTC: P2P, SFU, MCU and All You Need to Know About Them

by Adrian Czerwiec • Jun 3, 2025 • 4 min read

Want to add live video to your app? WebRTC can do the job, but the setup can make a big difference. Here’s a quick guide to P2P, SFU, and MCU to help you pick the right setup for your project.

What is WebRTC?

WebRTC (Web Real-Time Communication) is a browser-native standard that enables low-latency video, voice, and data sharing without any external dependencies. Many applications use WebRTC for real-time communication, such as video conferencing, online gaming, and file sharing. To enable secure and real-time communication, WebRTC combines several technologies to fulfill the following purposes:

Signaling — uses SDP (Session Description Protocol) for offer/answer negotiation.
Connecting — utilizes STUN/TURN protocols to navigate NATs and firewalls.
Securing — SRTP (Secure Real-time Transport Protocol) is used to encrypt media streams.
Communicating — uses RTP (Real-time Transport Protocol) and SCTP for delivering media and data packets.

With all of the above concerns addressed, two (or more) endpoints can begin communicating. However, the right architecture depends entirely on your use case — and that’s where things get interesting.

Peer-to-Peer (P2P) — the simplest setup

The easiest approach we can take is to use WebRTC in a peer-to-peer (P2P) architecture. The setup is simple and doesn’t require any additional media servers, as the peers communicate directly with each other. This comes with tradeoffs, as each peer must be able to connect directly to every other peer. It’s best suited for one-on-one calls or file sharing. The total number of connections grows quadratically with the number of peers (a full mesh of n peers needs n(n−1)/2 connections, so 10 peers already means 45), and the bandwidth gets used up quickly.

MCU (Multipoint Control Unit) — centralized media processing

What if you want to stream the same content to multiple people? That’s where an additional server can come in handy! A Multipoint Control Unit (MCU) is a server that receives media streams from all participants, processes them (e.g., by mixing video/audio into a single composite stream), and then sends one unified stream back to each user. This optimizes bandwidth usage, as each participant only needs to send and receive one stream. Additionally, the server can apply layouts, overlays, or other effects to the stream. The downside is that heavy server-side processing can be costly and add latency.

SFU (Selective Forwarding Unit) — scalable and flexible

What is an SFU in WebRTC? The Selective Forwarding Unit (SFU) is another type of server that can be used to stream between peers. In this scenario, all peers stream directly to the SFU, which selectively forwards individual streams to other peers based on dynamic rules. This provides more flexibility, because we can save bandwidth by deciding which streams to forward at any given time. For example, you can have a conference with a hundred people, where each participant only receives the camera streams of the people who are currently speaking. This Selective Forwarding Unit approach is best suited for video conferencing, virtual events, or group collaboration tools.

WebRTC SFU vs MCU

Both SFU and MCU use a central server, but they handle streams differently. An MCU mixes all incoming streams into one and sends the same output to every participant. It reduces client-side work and bandwidth but adds latency and server cost. On the other hand, an SFU forwards streams selectively — without mixing — so each participant only receives what’s needed. It’s more efficient and scales better. So, WebRTC SFU vs MCU is a tradeoff between simplicity (MCU) and flexibility (SFU).
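To ground the building blocks above, here is a minimal sketch of the P2P offer/answer flow with the browser’s RTCPeerConnection API, shown from the caller’s side (the callee mirrors it with createAnswer()). The signaling object is deliberately left abstract, since WebRTC doesn’t prescribe how offers, answers, and ICE candidates travel between peers:

```ts
// Minimal P2P offer/answer sketch (caller side). The signaling object is a
// placeholder for your own transport (e.g. a WebSocket); WebRTC doesn't define one.
type SignalMessage = { sdp?: RTCSessionDescriptionInit; candidate?: RTCIceCandidateInit };

declare const signaling: {
  send(msg: SignalMessage): void;
  onMessage(handler: (msg: SignalMessage) => void): void;
};

export async function startCall(remoteVideo: HTMLVideoElement) {
  // A STUN server lets each peer discover its public address for a direct connection.
  const pc = new RTCPeerConnection({
    iceServers: [{ urls: "stun:stun.l.google.com:19302" }],
  });

  // Send our camera/microphone and render whatever the remote peer sends back.
  const media = await navigator.mediaDevices.getUserMedia({ video: true, audio: true });
  media.getTracks().forEach((track) => pc.addTrack(track, media));
  pc.ontrack = ({ streams: [stream] }) => {
    remoteVideo.srcObject = stream;
  };

  // Trickle ICE: forward connection candidates to the other peer as they appear.
  pc.onicecandidate = ({ candidate }) => {
    if (candidate) signaling.send({ candidate: candidate.toJSON() });
  };

  // SDP offer/answer negotiation over the signaling channel.
  await pc.setLocalDescription(await pc.createOffer());
  signaling.send({ sdp: pc.localDescription!.toJSON() });

  signaling.onMessage(async (msg) => {
    if (msg.sdp) await pc.setRemoteDescription(msg.sdp);
    if (msg.candidate) await pc.addIceCandidate(msg.candidate);
  });
}
```

In production you would also configure TURN as a relay fallback for peers that cannot connect directly, which is part of what the server-based architectures described above help you manage.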
Is it hard to implement WebRTC?

In short: yes — WebRTC is a complex technology, and implementing it by yourself can feel like going down a rabbit hole. While it hides a lot of complexity under the hood, building production-ready video features involves tough challenges:

Network instability
NAT traversal failures
Dynamic bandwidth adaptation
Voice activity detection
Simulcast support and track prioritization

If your goal is to build a great product, not to become a video infrastructure engineer, there are better options than reinventing the wheel.

Meet Fishjam — a live streaming and video conferencing API

If you’re looking for a quick and reliable solution, Fishjam provides a developer-friendly API for live video and conferencing. Fishjam is an out-of-the-box, low-latency live streaming and video conferencing API that comes with React Native and React SDKs, allowing you to set up videoconferencing or live streaming in your app in minutes. With Fishjam, this snippet is all you need to join a call and share your camera:

import { useCamera, useConnection } from "@fishjam-cloud/react-native-sdk";
...
const { prepareCamera } = useCamera();
const { joinRoom } = useConnection();

useEffect(() => {
  (async () => {
    await prepareCamera({ cameraEnabled: true });
    await joinRoom(url, peerToken);
  })();
}, [prepareCamera, joinRoom]);

WebRTC gives you a lot of power, but it’s not easy to get right. If you don’t want to deal with media servers, setup, and edge cases, Fishjam lets you skip the hard parts and start building fast. And if you need help along the way, feel free to reach out to us at projects@swmansion.com.

We’re Software Mansion: multimedia experts, AI explorers, React Native core contributors, community builders, and software development consultants.

WebRTC: P2P, SFU, MCU and All You Need to Know About Them was originally published in Software Mansion on Medium, where people are continuing the conversation by highlighting and responding to this story.