
WebRTC Architecture Explained for Beginners: SDP, ICE, STUN/TURN, and Signaling

The seven-step flow behind every video call, written so it actually makes sense — with a Firebase signaling example, the production issues that quietly break apps, and what changes when you scale past five participants.

By Kishan Vaghani · Published May 22, 2026 · 15-minute read

Real-time communication has reshaped what a browser can do. Zoom, Google Meet, WhatsApp web calls, Discord, Microsoft Teams, in-browser customer support, telemedicine, classroom apps: most of them run on WebRTC, at least in the browser. The name stands for Web Real-Time Communication, and it lets browsers and mobile apps move audio, video, and data directly between users with no plugin, no install, and (in the ideal case) no server in the media path.

WebRTC has a reputation for being intimidating, and that reputation is half deserved. It mixes networking, NAT traversal, codecs, asynchronous signaling, and a browser API surface that wasn't designed to be self-explanatory. But once you can name the moving parts — SDP, ICE, STUN, TURN, signaling — the whole stack stops feeling like a black box. That's what this post is for.

We'll walk through what WebRTC is, the three browser APIs it ships, the seven-step flow that every video call follows, what SDP and ICE actually contain, the difference between STUN and TURN (and why production apps always need TURN), how signaling works with a real Firebase example, the seven production issues you will hit, and what changes when you scale past five participants and need an SFU. Every code example here is something you can paste into a browser console or a small project and run.

1 What WebRTC Actually Is

Before WebRTC existed, doing browser-based voice or video meant a Flash plugin, a proprietary client install, or a native app wrapped around a stream. WebRTC replaced all of that with a standard browser API for:

  • Voice and video calls
  • Screen sharing
  • Real-time text and data transfer
  • File transfer between peers

The single most important idea behind WebRTC is peer-to-peer. Whenever the network allows it, two browsers talk directly to each other, with no media server sitting in the middle. That cuts three things at once: latency, server bandwidth cost, and a single point of failure. It's why a direct one-to-one call can feel noticeably faster than a server-routed one, and why a small startup can ship a video call feature without paying for a CDN-scale backend.

2 The Three WebRTC APIs

Everything WebRTC does sits on top of three browser APIs. Learn them and you've learned 80% of WebRTC.

getUserMedia() — capture audio, video, screen

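const stream = await navigator.mediaDevices.getUserMedia({
  audio: true,
  video: { width: 1280, height: 720 },
});

// Assumes the page contains a <video autoplay playsinline> element.
const video = document.querySelector("video");
video.srcObject = stream;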

Returns a MediaStream from the user's mic and camera. The browser shows a permission prompt the first time. For screen sharing, use getDisplayMedia() instead.

RTCPeerConnection — the connection itself

const pc = new RTCPeerConnection({
  iceServers: [
    { urls: "stun:stun.l.google.com:19302" },
  ],
});

stream.getTracks().forEach((track) =>
  pc.addTrack(track, stream)
);

The engine. It handles ICE negotiation, encryption, codec selection, congestion control, and the actual media transport. Almost every line of WebRTC code you write eventually calls a method on a peer connection.

RTCDataChannel — arbitrary data over the same pipe

const channel = pc.createDataChannel("chat");

channel.onopen   = () => channel.send("hello");
channel.onmessage = (e) => console.log(e.data);

A bidirectional message channel over the existing peer connection. Useful for chat, multiplayer game state, file transfer, or any low-latency data — without standing up a separate WebSocket once the connection is established.
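
File transfer is the data-channel use case that needs the most care, because large messages are not reliably portable across browsers. A minimal sketch of one common approach is to slice the file into small pieces and announce the count up front. The 16 KiB chunk size and the "file-start" framing message are assumptions made for this example, not part of the WebRTC API.

```javascript
// Assumption: 16 KiB is a conservative message size for data channels.
const CHUNK_SIZE = 16 * 1024;

// Split an ArrayBuffer into ordered slices of at most chunkSize bytes.
function chunkBuffer(buffer, chunkSize = CHUNK_SIZE) {
  const chunks = [];
  for (let offset = 0; offset < buffer.byteLength; offset += chunkSize) {
    // slice() clamps the end index, so the last chunk may be shorter.
    chunks.push(buffer.slice(offset, offset + chunkSize));
  }
  return chunks;
}

// Send the bytes over an open RTCDataChannel. The JSON header is a
// hypothetical framing message that tells the receiver how many chunks follow.
function sendFile(channel, buffer) {
  const chunks = chunkBuffer(buffer);
  channel.send(JSON.stringify({ type: "file-start", chunks: chunks.length }));
  chunks.forEach((chunk) => channel.send(chunk));
  return chunks.length;
}
```

In a browser you would call something like sendFile(channel, await file.arrayBuffer()). For large files a real implementation should also watch channel.bufferedAmount and pause when it grows, which this sketch leaves out.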

3 The Seven-Step Call Flow

Every video call — from a two-person tutoring session to a hundred-person classroom — follows the same seven steps. Memorise this and you can debug almost any WebRTC bug.

1. User opens the page                  → app loads
2. Browser requests mic/camera          → getUserMedia()
3. Each side creates a peer connection  → new RTCPeerConnection()
4. Signaling channel opens              → WebSocket/Firebase/etc.
5. SDP offer/answer exchanged           → "here's how I talk"
6. ICE candidates exchanged             → "here's where to reach me"
7. Media starts flowing                 → audio + video, end to end

Steps 1–3 happen entirely on each client. Steps 4–6 happen through your signaling server (we'll cover this). Step 7 — once it succeeds — bypasses your servers and runs directly between the two browsers, possibly via a TURN relay if the network won't allow direct P2P.

A WebRTC call “hanging” is almost always stuck at step 5 or step 6. Knowing which one matters more than knowing any specific API.
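
To tell a step-5 stall from a step-6 stall, you can read a few properties straight off the peer connection. The helper below is a rough sketch: diagnoseStall is a hypothetical name, and the mapping to step numbers follows the list above rather than anything in the WebRTC spec.

```javascript
// Report which step of the call flow a peer connection appears to be stuck on.
// `pc` is an RTCPeerConnection (or any object with the same properties).
function diagnoseStall(pc) {
  // Step 5 is complete only when both descriptions have been applied.
  if (!pc.localDescription || !pc.remoteDescription) {
    return "step 5: SDP offer/answer has not completed";
  }
  const ice = pc.iceConnectionState;
  if (ice === "connected" || ice === "completed") {
    return "step 7: ICE is connected, media should be flowing";
  }
  // Anything else ("new", "checking", "disconnected", "failed") is an ICE problem.
  return `step 6: ICE is "${ice}"`;
}
```

Running console.log(diagnoseStall(pc)) from the browser console while a call is stuck is usually enough to decide whether to inspect your signaling messages or your STUN/TURN setup.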

4 SDP — The Negotiation Document

SDP stands for Session Description Protocol. It is a plain-text document that two peers exchange to agree on terms: which codecs each side supports, the certificate fingerprint used to secure the media (via DTLS), which media streams will be sent in which direction, and so on.

The mental model is two people figuring out a shared language before they start talking. Peer A says “I speak Opus audio and VP8/H.264 video, here's my certificate fingerprint.” Peer B replies “same here on Opus, I'll go with VP8 too, and here's mine.” That negotiation is the SDP offer/answer model.

A minimal SDP body looks like this:

v=0
o=- 46117355 2 IN IP4 127.0.0.1
s=-
t=0 0
m=audio 9 RTP/SAVPF 111
a=rtpmap:111 opus/48000/2
m=video 9 RTP/SAVPF 96
a=rtpmap:96 VP8/90000

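The key lines: m=audio and m=video declare an audio stream and a video stream, both over Secure RTP (RTP/SAVPF), and each a=rtpmap line maps a payload type number to a codec, here Opus for audio and VP8 for video. The rest is timestamps and identifiers.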
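
Because SDP is just lines of text, you can pull the codec list out with a few string operations. The parser below is a small sketch that only understands m= and a=rtpmap lines; listCodecs and sampleSdp are names invented for this example, and real SDP contains many more attributes than it handles.

```javascript
// Extract the media sections and their codecs from an SDP string.
function listCodecs(sdp) {
  const sections = [];
  let current = null;
  for (const line of sdp.split(/\r?\n/)) {
    if (line.startsWith("m=")) {
      // m=<kind> <port> <protocol> <payload types...>
      const [kind] = line.slice(2).split(" ");
      current = { kind, codecs: [] };
      sections.push(current);
    } else if (current && line.startsWith("a=rtpmap:")) {
      // a=rtpmap:<payload type> <codec>/<clock rate>[/<channels>]
      const [payloadType, description] = line.slice("a=rtpmap:".length).split(" ");
      const [name, clockRate] = description.split("/");
      current.codecs.push({
        payloadType: Number(payloadType),
        name,
        clockRate: Number(clockRate),
      });
    }
  }
  return sections;
}

// The minimal SDP body from this section, joined the way it travels on the wire.
const sampleSdp = [
  "v=0",
  "o=- 46117355 2 IN IP4 127.0.0.1",
  "s=-",
  "t=0 0",
  "m=audio 9 RTP/SAVPF 111",
  "a=rtpmap:111 opus/48000/2",
  "m=video 9 RTP/SAVPF 96",
  "a=rtpmap:96 VP8/90000",
].join("\r\n");

console.log(JSON.stringify(listCodecs(sampleSdp), null, 2));
```

In a live call you could pass pc.localDescription.sdp or pc.remoteDescription.sdp instead of the sample to see what each side actually offered.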

In code, the offer/answer dance looks like this:

// Caller
const offer = await pc.createOffer();
await pc.setLocalDescription(offer);
signalingChannel.send({ type: "offer", sdp: offer.sdp });

// Receiver
signalingChannel.on("offer", async ({ sdp }) => {
  await pc.setRemoteDescription({ type: "offer", sdp });
  const answer = await pc.createAnswer();
  await pc.setLocalDescription(answer);
  signalingChannel.send({ type: "answer", sdp: answer.sdp });
});

// Caller, when answer arrives
signalingChannel.on("answer", async ({ sdp }) => {
  await pc.setRemoteDescription({ type: "answer", sdp });
});

When both peers have a local and a remote description set, they've agreed on terms. SDP is done. ICE, which started gathering candidates as soon as setLocalDescription ran, takes over.

5 ICE Candidates and NAT Traversal

SDP says what we'll talk about. ICE says where to find each peer. ICE is Interactive Connectivity Establishment, and it exists because of one inconvenient fact: most devices on the internet sit behind a NAT.

Your laptop's local IP looks like 192.168.1.5. The internet has no idea how to route to that. Your home router translates between your local network and a public IP (49.36.x.x), a process called NAT (Network Address Translation), and the reason WebRTC needs ICE.

ICE gathers candidates, three flavours of them:

  • Host candidate: your raw local address (192.168.1.5:54321). Works if both peers are on the same LAN. Free, fast, rarely useful in practice.
  • Server-reflexive candidate: your public IP as seen from a STUN server (49.36.x.x:32118). This is the one that works for most internet calls.
  • Relay candidate: a TURN server's address that will forward your media on. Used as a fallback when nothing else works.
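
Each candidate travels as a single line of text, and the word after "typ" tells you which flavour it is. The sketch below parses that line by hand so you can see the structure; parseCandidate is a hypothetical helper, the example address comes from a range reserved for documentation, and the "prflx" entry covers a fourth type (peer-reflexive) that shows up during connectivity checks rather than during gathering.

```javascript
const CANDIDATE_KINDS = {
  host: "host (local address)",
  srflx: "server-reflexive (public address learned via STUN)",
  relay: "relay (address on a TURN server)",
  prflx: "peer-reflexive (discovered during connectivity checks)",
};

// Parse "candidate:<foundation> <component> <protocol> <priority> <address> <port> typ <type> ..."
function parseCandidate(line) {
  const fields = line.trim().replace(/^a=/, "").replace(/^candidate:/, "").split(/\s+/);
  const [foundation, , protocol, priority, address, port] = fields;
  const typIndex = fields.indexOf("typ");
  const type = typIndex === -1 ? undefined : fields[typIndex + 1];
  return {
    foundation,
    protocol: protocol.toLowerCase(),
    priority: Number(priority),
    address,
    port: Number(port),
    type,
    description: CANDIDATE_KINDS[type] ?? "unknown",
  };
}

const exampleCandidate =
  "candidate:842163049 1 udp 1677729535 203.0.113.7 32118 typ srflx raddr 192.168.1.5 rport 54321";
console.log(parseCandidate(exampleCandidate).description);
```

Logging the parsed type for every candidate you gather is a quick sanity check: if you never see a relay candidate, your TURN server is probably not being reached.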

Both peers gather all the candidates they can find, send them to each other through signaling, and run connectivity checks in parallel:

pc.onicecandidate = (event) => {
  if (event.candidate) {
    signalingChannel.send({
      type:      "candidate",
      candidate: event.candidate.toJSON(),
    });
  }
};

signalingChannel.on("candidate", async ({ candidate }) => {
  await pc.addIceCandidate(candidate);
});
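
One ordering problem is worth guarding against: a remote candidate can arrive over signaling before the offer or answer it belongs to has been applied, and addIceCandidate() rejects when no remote description is set. A small buffer fixes that. CandidateQueue is a hypothetical helper written for this post, not a browser API.

```javascript
// Holds remote ICE candidates until the remote description has been applied.
class CandidateQueue {
  constructor(pc) {
    this.pc = pc;
    this.pending = [];
  }

  // Use this in place of calling pc.addIceCandidate() directly.
  async add(candidate) {
    if (this.pc.remoteDescription) {
      await this.pc.addIceCandidate(candidate);
    } else {
      this.pending.push(candidate);
    }
  }

  // Call right after `await pc.setRemoteDescription(...)` resolves.
  async flush() {
    const queued = this.pending;
    this.pending = [];
    for (const candidate of queued) {
      await this.pc.addIceCandidate(candidate);
    }
  }
}
```

With this in place the "candidate" handler becomes queue.add(candidate), and the offer and answer handlers call queue.flush() once setRemoteDescription has finished.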

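ICE picks the best working path automatically and can fall back to another validated path if the current one stops working. A full network change (mobile WiFi → 4G, for example) gives the device a new IP address, and recovering from that takes an ICE restart, covered under production issue 2 below. This is the part of WebRTC that earns its complexity: it's also what lets calls survive network changes that would kill a naive socket connection.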

6 STUN vs TURN

These two acronyms confuse more beginners than anything else in WebRTC. The distinction is actually clean:

STUN (Session Traversal Utilities for NAT) is a tiny server that answers one question: “what do my public IP and port look like from the outside?” The client sends a small packet, the STUN server replies with what it saw. That's it. STUN doesn't carry any media. Google runs free public STUN servers; most apps just use those.

TURN (Traversal Using Relays around NAT) is a full media relay. When two peers can't reach each other directly — corporate firewall blocks UDP, symmetric NAT on a carrier-grade mobile network, etc. — all the audio and video packets flow through the TURN server instead.

// What it looks like in code — the iceServers config:
const pc = new RTCPeerConnection({
  iceServers: [
    { urls: "stun:stun.l.google.com:19302" },
    {
      urls: "turn:turn.example.com:3478",
      username:   "expiring-username",
      credential: "expiring-password",
    },
  ],
});

Property         STUN                         TURN
Purpose          Discover public IP/port      Relay all media
Bandwidth cost   Negligible                   Full call bandwidth
Latency added    ~0                           +20–80 ms typically
When used        Every call                   ~10–20% of calls (fallback)
Self-host?       Use Google's free servers    Yes: coturn, Twilio NTS, Cloudflare

The rule for production: always configure TURN. Plenty of users (anyone on hotel WiFi, a corporate VPN, or a restrictive mobile carrier) will not connect without it. A WebRTC app without TURN works perfectly in your testing and silently fails for a meaningful share of real users, roughly the 10–20% of calls that need the relay.
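
The awkward part of TURN is that you rarely exercise it on your own network. One way to check it, sketched below, is to force the browser to use relay candidates only through the standard iceTransportPolicy option. The buildRtcConfig helper and its forceRelay flag are names made up for this example, and the TURN URL and credentials are the placeholders from the config above.

```javascript
// Build an RTCPeerConnection config, optionally restricted to TURN relays.
function buildRtcConfig(turn, { forceRelay = false } = {}) {
  return {
    iceServers: [
      { urls: "stun:stun.l.google.com:19302" },
      { urls: turn.urls, username: turn.username, credential: turn.credential },
    ],
    // "all" is the default; "relay" discards host and server-reflexive candidates.
    iceTransportPolicy: forceRelay ? "relay" : "all",
  };
}

const relayOnlyConfig = buildRtcConfig(
  {
    urls: "turn:turn.example.com:3478",
    username: "expiring-username",
    credential: "expiring-password",
  },
  { forceRelay: true }
);
// In the browser: const pc = new RTCPeerConnection(relayOnlyConfig);
```

If a call connects with forceRelay set to true, your TURN server works. If it hangs, you have found the failure your restricted-network users would otherwise hit first.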

7 Signaling Servers

Here is the question that trips up every WebRTC newcomer: if this is all peer-to-peer, why do I need a server at all? The answer is that the peers need to exchange SDP and ICE candidates before their direct connection exists. They have no way to reach each other yet — they need a third party to relay those first few messages.

That third party is the signaling server. The WebRTC spec explicitly does not define how signaling works. It's your choice. Common options:

  • WebSockets — the default; bidirectional, low latency, trivial to host.
  • Socket.IO — Node.js wrapper around WebSockets, adds rooms, reconnection, fallbacks.
  • Firebase Realtime Database / Firestore — zero-backend prototyping, what most beginners reach for first.
  • MQTT — when you're bridging IoT devices.
  • Plain HTTP polling — high-latency, but occasionally the only option behind locked-down enterprise networks.

Critically: signaling carries only metadata — the SDP and ICE messages. The audio and video never touch the signaling server. Once peers have each other's SDP and a working ICE candidate, the signaling channel becomes idle (still useful for chat messages, room state, mute events).
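
Whatever transport you pick, the server-side logic is tiny: remember who is in a room and pass each message to everyone else. Here is a transport-agnostic sketch of that core. SignalingRoom and its method names are assumptions for illustration; in a real server the deliver callback would wrap something like a WebSocket's send().

```javascript
// The relay at the heart of a signaling server, with no network code attached.
class SignalingRoom {
  constructor() {
    this.peers = new Map(); // peerId -> function that delivers a message
  }

  join(peerId, deliver) {
    this.peers.set(peerId, deliver);
  }

  leave(peerId) {
    this.peers.delete(peerId);
  }

  // Forward a message (offer, answer, or candidate) to every other peer.
  relay(fromPeerId, message) {
    let delivered = 0;
    for (const [peerId, deliver] of this.peers) {
      if (peerId !== fromPeerId) {
        deliver({ from: fromPeerId, ...message });
        delivered += 1;
      }
    }
    return delivered;
  }
}
```

Notice that the room never inspects the SDP or the candidates. It treats them as opaque payloads, which is why almost any messaging system can play this role.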

8 Signaling with Firebase (worked example)

Firebase is the popular “no-backend” choice because the Realtime Database (or Firestore) already gives you instant cross-client sync, authentication, and hosting. The shape of the architecture is exactly the same as WebSockets, just with the database playing the relay role.

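// Both peers connect to the same Firestore "room" document.
// Namespaced (v8 / compat) Firestore API. The references below are
// assumptions for this example: `db` is an initialised Firestore instance,
// `roomId` is shared out of band, and the collection names are illustrative.
// The callee swaps the two candidate collections.
const roomRef             = db.collection("rooms").doc(roomId);
const candidatesRef       = roomRef.collection("callerCandidates");
const remoteCandidatesRef = roomRef.collection("calleeCandidates");

const pc = new RTCPeerConnection({
  iceServers: [
    { urls: "stun:stun.l.google.com:19302" },
  ],
});
// Add local tracks with pc.addTrack() (see §2) before creating the offer.

// BOTH SIDES: stream ICE candidates into a subcollection.
// Attach this before setLocalDescription(), which starts candidate gathering.
pc.onicecandidate = (event) => {
  if (event.candidate) {
    candidatesRef.add(event.candidate.toJSON());
  }
};

// CALLER: create the offer, write it to the room
const offer = await pc.createOffer();
await pc.setLocalDescription(offer);
await roomRef.set({
  offer: { type: offer.type, sdp: offer.sdp },
});

// CALLER: listen for the answer the callee writes back
roomRef.onSnapshot((snap) => {
  const data = snap.data();
  if (!pc.currentRemoteDescription && data?.answer) {
    pc.setRemoteDescription(new RTCSessionDescription(data.answer));
  }
});

// BOTH SIDES: listen for the other peer's candidates
remoteCandidatesRef.onSnapshot((snap) => {
  snap.docChanges().forEach((change) => {
    if (change.type === "added") {
      pc.addIceCandidate(new RTCIceCandidate(change.doc.data()));
    }
  });
});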

That's the caller's side of the signaling layer in a few dozen lines; the callee mirrors it by reading the offer, calling setRemoteDescription and createAnswer, and writing the answer back to the room. The tradeoffs: Firebase is dead simple but its per-document update model becomes expensive at scale, and you have less control over message ordering than a raw WebSocket. For a prototype, a two-person call, or a low-traffic SaaS it's perfect. For a video platform doing thousands of concurrent rooms, build a real signaling server.
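
The snippet above shows the caller. The callee's half could look something like the sketch below, written as a function so the peer connection and room reference are passed in explicitly. answerCall is a hypothetical name, and roomRef is assumed to be a namespaced-API (v8 / compat) Firestore document reference whose document already holds the caller's offer.

```javascript
// CALLEE: read the offer from the room, answer it, and write the answer back.
async function answerCall(pc, roomRef) {
  const snapshot = await roomRef.get();
  const offer = snapshot.data()?.offer;
  if (!offer) {
    throw new Error("Room has no offer yet");
  }

  await pc.setRemoteDescription(offer);
  const answer = await pc.createAnswer();
  await pc.setLocalDescription(answer);

  // update() merges into the existing document, so the offer is preserved.
  await roomRef.update({
    answer: { type: answer.type, sdp: answer.sdp },
  });
  return answer;
}
```

As on the caller's side, the callee should add its local tracks and attach its onicecandidate handler before calling this, so that no candidates are lost while the answer is being written.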

9 The Seven Production Issues

1. No TURN server

The single most common bug. Demo works perfectly between two laptops on the same WiFi. First real user on a corporate network gets “connecting…” forever. Always ship with a TURN server. Twilio Network Traversal Service, Cloudflare Calls, or self-hosted coturn are the usual choices.

2. Mobile WiFi → cellular handoff drops the call

When the device's IP changes, ICE needs to re-run. Browsers don't do this automatically — you have to:

pc.oniceconnectionstatechange = () => {
  if (pc.iceConnectionState === "disconnected") {
    pc.restartIce();   // gather fresh candidates, renegotiate
  }
};
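
One refinement worth considering: "disconnected" can be a brief blip that heals on its own, while "failed" will not. The sketch below waits a short grace period before restarting on "disconnected" and restarts immediately on "failed". watchIceHealth, the 3-second default, and the injectable timer functions are all assumptions for this example. Also keep in mind that restartIce() only requests a restart: the browser then fires negotiationneeded, and the fresh offer still has to travel through your signaling channel.

```javascript
// Restart ICE on failure, but give short disconnects a chance to recover.
function watchIceHealth(
  pc,
  { graceMs = 3000, setTimer = setTimeout, clearTimer = clearTimeout } = {}
) {
  let timer = null;

  const cancel = () => {
    if (timer !== null) {
      clearTimer(timer);
      timer = null;
    }
  };

  pc.oniceconnectionstatechange = () => {
    const state = pc.iceConnectionState;
    if (state === "failed") {
      cancel();
      pc.restartIce();
    } else if (state === "disconnected") {
      if (timer === null) {
        timer = setTimer(() => {
          timer = null;
          // Only restart if the connection did not heal in the meantime.
          if (pc.iceConnectionState === "disconnected") pc.restartIce();
        }, graceMs);
      }
    } else {
      cancel(); // back to "connected" or "completed", or the call was closed
    }
  };

  return cancel; // call this when tearing the call down
}
```

The timer functions are parameters only so the logic can be exercised without waiting; in an app you would call watchIceHealth(pc) and keep the defaults.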

3. Echo and feedback

Default constraints don't enable echo cancellation on every platform. Be explicit:

const stream = await navigator.mediaDevices.getUserMedia({
  audio: {
    echoCancellation: true,
    noiseSuppression: true,
    autoGainControl:  true,
  },
});

4. Safari and iOS edge cases

Safari supports WebRTC but with quirks: limited simulcast, stricter autoplay rules, slower codec negotiation. Test there separately from Chrome.

5. Unlimited TURN bandwidth bills

Every TURN-relayed call burns real bandwidth, and there are bots that'll happily abuse open TURN credentials for tunneling. Always use short-lived TURN credentials signed by your backend; never hardcode a long-lived password into the client.
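
As a rough sketch of what "signed by your backend" can mean: coturn's shared-secret mode (the use-auth-secret option) accepts a username of the form expiry-timestamp:user-id and a password that is the base64 HMAC-SHA1 of that username under the shared secret. The Node.js function below follows that scheme. makeTurnCredentials and its parameters are names chosen for this example, and other TURN providers issue temporary credentials through their own APIs instead.

```javascript
// Backend (Node.js): mint TURN credentials that expire after ttlSeconds.
// Assumes the TURN server is configured with the same shared secret.
async function makeTurnCredentials(sharedSecret, userId, ttlSeconds = 3600, nowMs = Date.now()) {
  const { createHmac } = await import("node:crypto");

  // The username embeds the expiry time as a Unix timestamp in seconds.
  const expiresAt = Math.floor(nowMs / 1000) + ttlSeconds;
  const username = `${expiresAt}:${userId}`;

  // The password is derived from the username, so the server can verify it
  // without storing anything per user.
  const credential = createHmac("sha1", sharedSecret)
    .update(username)
    .digest("base64");

  return { username, credential, ttl: ttlSeconds };
}
```

Your API would return this object to an authenticated client, which drops username and credential into its iceServers entry. The shared secret itself never leaves the backend.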

6. Stuck in “connecting” with no error

ICE failures don't throw; you have to watch iceConnectionState and connectionState and surface them in the UI. Otherwise users sit on a spinner with no recourse.
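
A minimal way to surface those states is to map each connectionState value to something a user can act on. The wording of the messages and the describeConnection and attachStatus helpers below are illustrative assumptions; the state names are the ones RTCPeerConnection reports.

```javascript
// Translate RTCPeerConnection.connectionState into a user-facing status.
function describeConnection(connectionState) {
  switch (connectionState) {
    case "new":
    case "connecting":
      return { message: "Connecting…", showRetry: false };
    case "connected":
      return { message: "Connected", showRetry: false };
    case "disconnected":
      return { message: "Connection unstable, trying to recover…", showRetry: false };
    case "failed":
      return { message: "Couldn't connect. Check your network and try again.", showRetry: true };
    case "closed":
      return { message: "Call ended", showRetry: false };
    default:
      return { message: `Unknown state: ${connectionState}`, showRetry: false };
  }
}

// Re-render the status whenever the connection state changes.
function attachStatus(pc, render) {
  const update = () => render(describeConnection(pc.connectionState));
  pc.onconnectionstatechange = update;
  update(); // show the initial state right away
}
```

Here render is whatever updates your UI, for example a function that sets a banner's text and toggles a retry button.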

7. No observability

Production WebRTC apps need to collect pc.getStats() on a timer — packet loss, RTT, jitter, frame rate, codec in use. Without it, every “the call was choppy yesterday” ticket is unanswerable.
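
Here is one possible shape for that collection step. The function walks a stats report, which behaves like a Map, and pulls out a handful of numbers. The field names (packetsLost, packetsReceived, jitter, framesPerSecond, currentRoundTripTime) come from the WebRTC statistics API, but browsers do not all populate every field, so treat summarizeStats as a hedged starting point rather than a complete collector.

```javascript
// Reduce an RTCStatsReport (or any Map of stats objects) to a few headline numbers.
function summarizeStats(report) {
  const summary = {
    packetsLost: 0,
    packetsReceived: 0,
    jitterSeconds: null,
    framesPerSecond: null,
    roundTripSeconds: null,
  };

  report.forEach((stat) => {
    if (stat.type === "inbound-rtp") {
      summary.packetsLost += stat.packetsLost ?? 0;
      summary.packetsReceived += stat.packetsReceived ?? 0;
      if (stat.kind === "audio" && stat.jitter !== undefined) {
        summary.jitterSeconds = stat.jitter;
      }
      if (stat.kind === "video" && stat.framesPerSecond !== undefined) {
        summary.framesPerSecond = stat.framesPerSecond;
      }
    } else if (stat.type === "candidate-pair" && stat.nominated && stat.state === "succeeded") {
      // The nominated pair is the path media is actually using.
      if (stat.currentRoundTripTime !== undefined) {
        summary.roundTripSeconds = stat.currentRoundTripTime;
      }
    }
  });

  const total = summary.packetsLost + summary.packetsReceived;
  summary.lossRatio = total > 0 ? summary.packetsLost / total : 0;
  return summary;
}
```

In the browser this would run on a timer, along the lines of setInterval(async () => send(summarizeStats(await pc.getStats())), 5000), where send is your own upload function.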

10 Scaling Past Two People — Mesh, SFU, MCU

Pure peer-to-peer works great for one-on-one calls. It falls apart fast as you add participants. Each peer has to send its video to every other peer — N×(N−1) streams total.

For a 10-person meeting:

10 × 9 = 90 separate streams across the mesh
Each user uploads 9 copies of their HD video.
A typical home upload (10 Mbps) can't carry that.
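
The arithmetic is easy to check for any room size. In the sketch below, the 2.5 Mbps figure for one HD stream is an assumption chosen for illustration, since real bitrates vary with codec, resolution, and content.

```javascript
// Stream counts and per-peer upload for a full mesh of `participants` peers.
function meshLoad(participants, perStreamMbps) {
  const uploadsPerPeer = participants - 1; // one copy for every other peer
  return {
    totalStreams: participants * uploadsPerPeer, // N × (N − 1)
    uploadsPerPeer,
    uploadMbpsPerPeer: uploadsPerPeer * perStreamMbps,
  };
}

const ASSUMED_HD_MBPS = 2.5; // hypothetical bitrate for one HD video stream
const tenPeople = meshLoad(10, ASSUMED_HD_MBPS);
console.log(tenPeople);
// → { totalStreams: 90, uploadsPerPeer: 9, uploadMbpsPerPeer: 22.5 }
```

Under that assumption each peer needs 22.5 Mbps of upload, more than double the 10 Mbps home connection in the example. The same function gives 7.5 Mbps for four people, which shows why small meshes still work.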

Above roughly five participants you need a server architecture. Two main shapes:

SFU (Selective Forwarding Unit). Each client uploads its stream once. The SFU forwards copies to everyone else without decoding the video. Upload bandwidth stays constant per user; the SFU does the fan-out. Open source choices: mediasoup, LiveKit, Janus, Jitsi Videobridge, Ion-SFU. This is the shape most modern video platforms use.

MCU — Multipoint Control Unit. The server decodes all incoming streams, mixes them into a single composite, and sends that one stream back to every client. CPU-heavy on the server, easy on the client. Used mostly for legacy SIP interop or for low-power endpoints that can't decode multiple streams.

Modern SFUs add two more tricks for adaptive quality: simulcast (the client uploads three resolutions; the SFU forwards whichever fits each viewer's bandwidth) and SVC (a single encoded stream with multiple temporal/spatial layers that the SFU can truncate per receiver). Either lets a 50-person meeting work on consumer hardware.
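
The simulcast decision an SFU makes per viewer boils down to "forward the best layer this viewer can afford". A toy version of that choice is sketched below. The three layers, their bitrates, and the pickLayer function are illustrative assumptions; real SFUs also weigh bandwidth estimates over time, viewport size, and who is speaking.

```javascript
// Hypothetical simulcast layers a client might upload (quarter, half, full).
const SIMULCAST_LAYERS = [
  { rid: "q", width: 320, kbps: 150 },
  { rid: "h", width: 640, kbps: 500 },
  { rid: "f", width: 1280, kbps: 2500 },
];

// Choose the highest-bitrate layer that fits the viewer's available bandwidth.
// If nothing fits, fall back to the smallest layer rather than sending nothing.
function pickLayer(layers, availableKbps) {
  const sorted = [...layers].sort((a, b) => a.kbps - b.kbps);
  let chosen = sorted[0];
  for (const layer of sorted) {
    if (layer.kbps <= availableKbps) chosen = layer;
  }
  return chosen;
}

console.log(pickLayer(SIMULCAST_LAYERS, 600).rid); // → "h"
```

On the sending side, the browser is told to produce such layers through the sendEncodings option of pc.addTransceiver(); the selection logic above lives in the SFU, not in the client.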

The Production Checklist

Five things that, if they're all true on launch day, give your WebRTC app a fighting chance:

  1. TURN configured with short-lived credentials. Not optional. Even “internal tool” deployments hit it.
  2. Real-time stats from pc.getStats() sent to your backend. Packet loss, RTT, jitter, codec in use.
  3. ICE restart on disconnect. Mobile users will switch networks mid-call.
  4. Test on real bad networks. Throttle to 3G, add 200ms latency, drop 2% of packets. Most bugs surface here.
  5. Plan for an SFU if you ever need more than five people in a room. LiveKit or mediasoup let you migrate without rewriting the client.

Frequently Asked Questions

What is the difference between STUN and TURN servers?
STUN tells a peer what its public IP and port look like from the internet — that's it. TURN is a relay: when two peers can't reach each other directly, TURN forwards all audio and video packets between them. STUN is cheap and used by almost every connection; TURN is expensive in bandwidth and used as a fallback for the ~10–20% of connections that can't go peer-to-peer.
Why do I need a signaling server if WebRTC is peer-to-peer?
WebRTC moves media peer-to-peer, but peers still need to discover each other and agree on terms before media flow can start. They have to exchange SDP and ICE candidates somehow, and they need to do it before the direct connection exists. A signaling server — usually a small WebSocket or Firebase channel — carries those handshake messages. Once peers are connected, signaling drops out of the media path.
How long does ICE negotiation take?
On a good network, a few hundred milliseconds. Peers gather host candidates instantly, hit a STUN server in ~50 ms, and complete connectivity checks within a second. On bad networks ICE can take 3–10 seconds as the ICE agent works through direct and STUN-derived candidates before settling on TURN. If calls “hang” before media starts, ICE is almost always the culprit.
Can WebRTC scale to a 50-person meeting?
Not with pure peer-to-peer — a 50-person mesh would require each browser to upload its video 49 times, which is impossible on a consumer connection. Above ~5 participants you need an SFU that receives each user's stream once and forwards copies. Open-source SFUs like mediasoup, LiveKit, and Janus handle thousands of participants per server.
Does WebRTC work the same on mobile?
Mostly — the same APIs exist in Chrome, Safari, and Firefox on iOS/Android. The differences that bite in production: mobile carriers often use symmetric NAT (TURN becomes mandatory), users switch between WiFi and 4G mid-call (you must call pc.restartIce() to recover), and battery/CPU is tighter so simulcast layers matter more. Test on real mobile networks, not emulators.

About the author — Kishan Vaghani

Kishan is the founder of ShareCode and writes about the engineering and infrastructure decisions behind real-time collaborative apps. ShareCode itself uses Firebase and CRDT-based sync rather than WebRTC for its editor, but the same NAT-traversal and signaling questions show up in any peer-to-peer system.

Final Thoughts

WebRTC mixes more disciplines than almost any other browser API — networking, codecs, distributed systems, real-time media. That's the source of its reputation for being hard. But the core flow is small and repeatable: capture media, exchange signaling messages, negotiate SDP, gather ICE candidates, stream media directly.

Once you can hold those five stages in your head, the rest is just optimisation — better codecs, simulcast for adaptive quality, an SFU when you outgrow peer-to-peer, observability when you need to debug a production call. Every major video app — Zoom, Meet, Teams, Discord — is fundamentally a well-tuned implementation of the same five stages.

If you're building something adjacent to this — real-time collaboration, multiplayer editing, anything that needs two clients to share state — our writeup of how real-time sync works under the hood covers the other half of the problem: how CRDTs and operational transforms let two clients converge on identical state without a server arbitrator in the middle.

Try it now

Prototype your signaling layer in a code space

Paste the Firebase signaling snippet from §8 into a ShareCode editor, share the URL with a teammate, and step through the offer/answer flow together — the fastest way to actually understand SDP exchange.
