The problem I kept running into
I record most of my team calls on Zoom. The result is an MP4 file, usually around 180–220 MB for a one-hour call. The video itself is useless: a static grid of small faces that nobody wants to watch. What I actually want is a 40–50 MB MP3 I can drop into Whisper or Descript for transcription.
The naïve approach is to upload the MP4 to a cloud converter, wait for the upload, wait for processing, download the result. That worked, but I started wondering exactly what happened to those 200 MB recordings during the wait. Some of those calls had salary discussions and product roadmaps in them. I stopped uploading after I noticed one converter's URL was still live and had no expiry notice.
The better approach: do it all in the browser, where the file never leaves your device.
How a video-to-audio extractor actually works in the browser
A video file is a container. MP4 (the MPEG-4 container format), WebM (Google's open-source container), and MOV (QuickTime) are all wrappers that typically hold two separate streams:
- A video stream — encoded as H.264, H.265 (HEVC), VP8, VP9, or AV1 depending on how the file was created.
- An audio stream — encoded as AAC (most common for MP4), Opus (WebM), or PCM (uncompressed, rare in video files).
Extracting audio means: read the container, identify the audio track, discard the video track, re-encode the audio into a standalone format (MP3 or WAV), and write the output file.
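In the browser, this is done via the Web Audio API and the browser's built-in media decoder. The MediaRecorder API handles the final re-encoding step; its native output formats vary by browser (typically WebM/Opus or MP4/AAC), so MP3 output usually involves an additional JavaScript or WASM encoder. Chrome 88+, Firefox 85+, and Safari 14.1+ all support this pipeline natively.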
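To make the "write the output file" step concrete, here is a minimal sketch of the WAV side of that pipeline. It is not the converter's actual code. It assumes you already have decoded samples as one `Float32Array` per channel, which in a browser you could get from `AudioContext.decodeAudioData()` followed by `AudioBuffer.getChannelData()`. The function name `encodeWav` is my own.

```typescript
// Encode decoded audio (one Float32Array per channel, samples in -1..1)
// as a 16-bit PCM WAV file. Hypothetical helper, not the tool's real code.
function encodeWav(channels: Float32Array[], sampleRate: number): ArrayBuffer {
  const numChannels = channels.length;
  if (numChannels === 0) throw new Error("need at least one channel");
  const numFrames = channels[0].length;
  const bytesPerSample = 2; // 16-bit PCM
  const blockAlign = numChannels * bytesPerSample;
  const dataSize = numFrames * blockAlign;
  const buffer = new ArrayBuffer(44 + dataSize); // 44-byte header + samples
  const view = new DataView(buffer);

  const writeAscii = (offset: number, text: string): void => {
    for (let i = 0; i < text.length; i++) {
      view.setUint8(offset + i, text.charCodeAt(i));
    }
  };

  // RIFF/WAVE header, all multi-byte fields little-endian.
  writeAscii(0, "RIFF");
  view.setUint32(4, 36 + dataSize, true); // file size minus first 8 bytes
  writeAscii(8, "WAVE");
  writeAscii(12, "fmt ");
  view.setUint32(16, 16, true); // size of the fmt chunk
  view.setUint16(20, 1, true); // format tag 1 = uncompressed PCM
  view.setUint16(22, numChannels, true);
  view.setUint32(24, sampleRate, true);
  view.setUint32(28, sampleRate * blockAlign, true); // bytes per second
  view.setUint16(32, blockAlign, true);
  view.setUint16(34, 16, true); // bits per sample
  writeAscii(36, "data");
  view.setUint32(40, dataSize, true);

  // Interleave channels frame by frame and convert float to signed 16-bit.
  let offset = 44;
  for (let frame = 0; frame < numFrames; frame++) {
    for (let ch = 0; ch < numChannels; ch++) {
      const s = Math.max(-1, Math.min(1, channels[ch][frame]));
      view.setInt16(offset, Math.round(s < 0 ? s * 0x8000 : s * 0x7fff), true);
      offset += 2;
    }
  }
  return buffer;
}
```

In a page, the returned buffer could be wrapped in `new Blob([buffer], { type: "audio/wav" })` and offered as a download. MP3 output needs a real encoder and is out of scope for this sketch.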
What to expect: real numbers from my test files
I ran six recordings through the browser-based video-to-audio converter to see what the results looked like. Here's the data for four of them:
| Source file | Duration | MP4 size | MP3 output | WAV output |
|---|---|---|---|---|
| Team standup | 22 min | 84 MB | 12.3 MB | 118 MB |
| Client demo | 47 min | 196 MB | 26.4 MB | 252 MB |
| Lecture recording | 63 min | 241 MB | 35.1 MB | 338 MB |
| WebM screen recording | 18 min | 31 MB | 10.1 MB | 96 MB |
Key takeaway: MP3 runs about 85–90% smaller than the source MP4 (the WebM screen recording, already small at 31 MB, shrinks by about two-thirds). WAV is uncompressed and ends up larger than the source video because everything in the source, audio included, was compressed, while the WAV stores the audio raw. Only choose WAV if you need to do further editing in a DAW and want to avoid generational quality loss.
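The WAV column is easy to sanity-check, because uncompressed audio size depends only on duration and format. The sketch below assumes 16-bit mono at 44.1 kHz. That is my assumption, not a setting stated anywhere above, but it lands within about 2% of every WAV figure in the table. The function name is hypothetical.

```typescript
// Estimate the size of an uncompressed PCM WAV file in bytes.
// Defaults (44.1 kHz, mono, 16-bit) are assumptions chosen for illustration.
function estimateWavBytes(
  durationSec: number,
  sampleRate: number = 44100,
  channels: number = 1,
  bitsPerSample: number = 16,
): number {
  const headerBytes = 44; // canonical RIFF/WAVE header
  const bytesPerSecond = sampleRate * channels * (bitsPerSample / 8);
  return headerBytes + Math.round(durationSec * bytesPerSecond);
}

// The 22-minute standup: roughly 116 MB, against 118 MB in the table.
const standupMb = estimateWavBytes(22 * 60) / 1_000_000;
console.log(standupMb.toFixed(1) + " MB");
```

Doubling the channel count to stereo doubles the result, which is worth knowing before you pick WAV for a two-hour recording.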
MP3 vs WAV: the actual decision criteria
Every explainer I've read says "MP3 for sharing, WAV for editing" — which is technically correct but too simple to be actionable. Here's how I actually decide:
Choose MP3 when: The file is going to a transcription service (Whisper, Descript, Otter.ai). These tools accept MP3 and file size directly affects upload speed and API cost. A 47-minute call at 26 MB is a lot easier to work with than 252 MB. 128 kbps is fine for spoken word; use 192 kbps if the recording has significant background music.
Choose WAV when: You're doing post-production in a proper DAW (Adobe Audition, Logic, Reaper). The noise-reduction and EQ passes that make a podcast sound professional compound quality loss on a lossy source. Start lossless, apply your edits, then export the final version as MP3. Starting from MP3 and going through two more lossy re-encodes will audibly degrade the output.
The VFR problem with smartphone videos
This one took me a while to notice. Smartphone cameras often record in Variable Frame Rate (VFR): the frame rate adapts to motion and lighting. This is fine for watching the video, but it creates a subtle problem if you plan to re-sync the extracted audio back to a different video track.
The audio stream is linear time. The video stream in a VFR file has varying timestamps. When you extract the audio and later try to sync it to a constant-frame-rate (CFR) track, they drift. The drift is usually imperceptible in the first minute but can be a half-second off by the end of a 20-minute clip.
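A rough way to see where the half-second comes from: if the VFR clip's frames are later laid out at a fixed nominal rate, the video's length changes while the audio's does not. The sketch below is a simplified model under that assumption (real tools use per-frame timestamps), and the function name and sample numbers are illustrative rather than measured.

```typescript
// Simplified drift model: a VFR clip that averaged `averageFps` over
// `durationSec` is replayed at a constant `nominalFps`. The audio still lasts
// `durationSec`; the video now lasts frameCount / nominalFps.
function driftSeconds(
  durationSec: number,
  nominalFps: number,
  averageFps: number,
): number {
  const frameCount = durationSec * averageFps;
  const cfrDurationSec = frameCount / nominalFps;
  return durationSec - cfrDurationSec; // positive: video finishes before audio
}

// A 20-minute clip that averaged 29.9875 fps but is treated as 30 fps
// ends up about half a second out of sync.
console.log(driftSeconds(20 * 60, 30, 29.9875).toFixed(2));
```

The point of the model is that a frame-rate error of a few hundredths of a percent is enough to produce drift you can hear by the end of a long clip.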
Fix: if you plan to re-sync, convert the source video to CFR first using HandBrake (free, open-source) before extracting the audio. HandBrake's “Constant Framerate” setting with your target frame rate (usually 30 fps) handles this in one pass; “Peak Framerate” only caps the rate and leaves the output variable.
If you're just sending the audio to a transcription service and never re-syncing, you can ignore this entirely.
Browser limitations: what the client-side approach can't do
I believe in being honest about limitations. Here's where the browser-based approach falls short compared to FFmpeg or a cloud service:
- No bitrate control. The browser's MediaRecorder picks a bitrate automatically. For MP3, Chrome typically produces 128 kbps stereo. You can't set 320 kbps in the browser without a WASM-compiled encoder. If bitrate matters (it usually doesn't for speech), FFmpeg is the right tool.
- No channel mixing. If your source has a 5.1 or 7.1 audio track (common for professionally produced video), the browser will downmix to stereo automatically. Most Zoom recordings are stereo or mono already, so this is rarely an issue.
- Processing speed caps out at your device's CPU. A 2-hour 4K video with a huge audio track can take a noticeable amount of time in the browser. Cloud processing would be faster here, but at the cost of uploading 1+ GB files.
- Safari has limited WebM support. Safari can decode H.264 MP4 and MOV reliably, but WebM (VP8/VP9) support was patchy until Safari 16. If you're using Safari on macOS Monterey or older, stick to MP4 and MOV inputs.
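Since support differs this much between browsers, it is safer to feature-detect than to check version numbers. Here is a small sketch of that idea; the helper name and the candidate list are my own. The support check is passed in as a function, so the same helper works with `MediaRecorder.isTypeSupported` for output formats or with `HTMLMediaElement.canPlayType` for input formats.

```typescript
// Return the first MIME type the current browser reports as supported,
// or undefined if none of the candidates work. Hypothetical helper.
function pickSupportedMimeType(
  candidates: readonly string[],
  isSupported: (mimeType: string) => boolean,
): string | undefined {
  for (const mimeType of candidates) {
    if (isSupported(mimeType)) return mimeType;
  }
  return undefined;
}

// Browser usage (not executed here):
//   const outputType = pickSupportedMimeType(
//     ["audio/webm;codecs=opus", "audio/mp4"],
//     (t) => MediaRecorder.isTypeSupported(t),
//   );
//   const probe = document.createElement("video");
//   const canReadWebm = probe.canPlayType('video/webm; codecs="vp9"') !== "";
```

If the helper returns `undefined`, the page can fall back to a WASM encoder or tell the user to try a different input format instead of failing halfway through a conversion.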
Step by step: the actual process
- Open the Video to Audio Converter. No account needed.
- Drag your MP4, WebM, or MOV file into the upload zone. The file loads into the browser's memory — nothing is sent to a server. You can verify this by opening your browser's Network tab (F12 → Network) and confirming there are no outgoing requests to external hosts after the page has loaded.
- Choose MP3 or WAV based on the criteria above.
- Click Convert. Processing time scales roughly linearly with file size. A 200 MB MP4 typically takes 15–30 seconds on a mid-range laptop.
- Click Download. The browser writes the file to your Downloads folder directly.
What I actually use this for
My regular workflow: Zoom recording exported as MP4 → extract as MP3 in the browser → upload to Whisper (or Otter.ai for live transcription) → paste transcript into Claude for meeting notes. The whole pipeline from raw recording to structured notes is about 8–10 minutes, most of which is the transcription waiting time.
I also use it to pull audio from training videos before going on a long flight. The audio-only file is 10× smaller, which matters when I'm pre-caching content on a device with limited storage.
Related tools you might need next
- Audio format converter — convert the resulting MP3 to WAV, OGG, FLAC, or M4A if your downstream tool needs a specific format.
- AI Audio Enhancer — uses AI (not just DSP) to denoise and improve clarity. Useful if the Zoom recording has significant background noise or echo.
- Free Video Editor — trim the video to the section you need before extracting, if you only want a specific clip.
Written by Achraf A., founder of TheFreeAITools — built in Morocco. Last tested on Chrome 124, Firefox 125, and Safari 17.4 on macOS Sonoma.