Whisper: Audio Transcription and Subtitle Extraction
Overview
This guide covers how to use FFmpeg in combination with OpenAI's Whisper to extract transcriptions and generate subtitles from audio and video files. FFmpeg handles the media processing, while Whisper provides state-of-the-art speech recognition.
Prerequisites
Required Software
- FFmpeg: Media processing framework
- Whisper: OpenAI's automatic speech recognition system
- Python 3.7+: Required for Whisper
Installation
Install FFmpeg
# Windows (using Chocolatey)
choco install ffmpeg

# macOS (using Homebrew)
brew install ffmpeg

# Ubuntu/Debian
sudo apt update
sudo apt install ffmpeg

# Verify installation
ffmpeg -version
Install Whisper
# Install via pip
pip install openai-whisper

# Or install the latest development version
pip install git+https://github.com/openai/whisper.git

# Verify installation
whisper --help
Basic Workflow
Method 1: Direct Audio/Video to Text
For files that Whisper can process directly:
# Basic transcription
whisper audio_file.mp3

# Specify output format
whisper video_file.mp4 --output_format txt

# Choose model size (tiny, base, small, medium, large)
whisper audio_file.wav --model medium

# Specify language (optional, auto-detected by default)
whisper audio_file.mp3 --language Thai
Method 2: FFmpeg + Whisper Pipeline
For better control or problematic file formats:
Step 1: Extract Audio with FFmpeg
# Extract audio as WAV (recommended for Whisper)
ffmpeg -i input_video.mp4 -vn -acodec pcm_s16le -ar 16000 -ac 1 audio_output.wav

# Extract audio as MP3
ffmpeg -i input_video.mkv -vn -acodec mp3 -ab 128k audio_output.mp3

# Extract specific time range
ffmpeg -i input.mp4 -ss 00:01:30 -t 00:02:00 -vn -acodec pcm_s16le audio_segment.wav
Step 2: Transcribe with Whisper
# Basic transcription
whisper audio_output.wav

# Generate all supported formats at once (txt, srt, vtt, tsv, json)
whisper audio_output.wav --output_format all

# Use specific model and language
whisper audio_output.wav --model large --language en --output_format srt
Output Formats
Available Output Formats
- txt: Plain text transcription
- srt: SubRip subtitle format
- vtt: WebVTT subtitle format
- json: Detailed JSON with timestamps and confidence scores
- tsv: Tab-separated values with timestamps
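For reference, an SRT file produced by Whisper looks like the following; the cue text and timestamps here are purely illustrative:

1
00:00:00,000 --> 00:00:04,200
Welcome, and thanks for joining today's session.

2
00:00:04,200 --> 00:00:08,500
Let's start with how the audio was prepared with FFmpeg.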
Examples
# Generate SRT subtitles
whisper video.mp4 --output_format srt

# Generate every supported format at once
whisper audio.wav --output_format all

# Custom output directory
whisper audio.mp3 --output_dir ./transcriptions --output_format srt
Advanced FFmpeg Preprocessing
Audio Quality Optimization
# Normalize audio levels
ffmpeg -i input.mp4 -af "dynaudnorm" -acodec pcm_s16le normalized_audio.wav

# Remove background noise (basic)
ffmpeg -i input.mp4 -af "highpass=f=200,lowpass=f=3000" filtered_audio.wav

# Boost volume
ffmpeg -i input.mp4 -af "volume=2.0" louder_audio.wav
Handling Multiple Audio Tracks
# List audio streams
ffmpeg -i input.mkv

# Extract a specific audio track (0:a:1 is the second audio stream, zero-indexed)
ffmpeg -i input.mkv -map 0:a:1 -vn audio_track2.wav

# Mix two audio tracks from the same file
ffmpeg -i input.mkv -filter_complex "[0:a:0][0:a:1]amix=inputs=2[aout]" -map "[aout]" mixed_audio.wav
Batch Processing
# Process multiple files (bash)
for file in *.mp4; do
    ffmpeg -i "$file" -vn -acodec pcm_s16le "${file%.*}.wav"
    whisper "${file%.*}.wav" --output_format srt
done

# Windows batch processing (at the cmd prompt; double the % signs inside a .bat file)
for %f in (*.mp4) do (
    ffmpeg -i "%f" -vn -acodec pcm_s16le "%~nf.wav"
    whisper "%~nf.wav" --output_format srt
)
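If GNU parallel is installed, the bash loop above can also run transcriptions concurrently. This is only a sketch: each Whisper process is memory-hungry, so the job count (-j) should be tuned to your hardware.

# Transcribe all WAV files, two jobs at a time (requires GNU parallel)
parallel -j 2 whisper {} --output_format srt --model base ::: *.wav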
Whisper Model Selection
Available Models
Model | Parameters | English-only | Multilingual | Required VRAM | Relative Speed |
---|---|---|---|---|---|
tiny | 39 M | ✓ | ✓ | ~1 GB | ~32x |
base | 74 M | ✓ | ✓ | ~1 GB | ~16x |
small | 244 M | ✓ | ✓ | ~2 GB | ~6x |
medium | 769 M | ✓ | ✓ | ~5 GB | ~2x |
large | 1550 M | ✗ | ✓ | ~10 GB | 1x |
Model Selection Guidelines
# Fast transcription (lower accuracy)
whisper audio.wav --model tiny

# Balanced speed/accuracy
whisper audio.wav --model base

# High accuracy (slower)
whisper audio.wav --model large

# English-only models (slightly better for English)
whisper audio.wav --model base.en
Common Use Cases
1. YouTube Video Transcription
# Download with yt-dlp, extract audio, transcribe
yt-dlp -x --audio-format mp3 "https://youtube.com/watch?v=VIDEO_ID"
whisper "VIDEO_TITLE.mp3" --output_format srt

# Or download to a fixed filename so the transcription step is predictable
yt-dlp -x --audio-format mp3 -o "video_audio.%(ext)s" "VIDEO_URL"
whisper video_audio.mp3 --output_format srt
2. Podcast Episode Processing
# Extract and normalize audio
ffmpeg -i podcast_episode.mp3 -af "dynaudnorm" -ar 16000 normalized_podcast.wav

# Transcribe with timestamps (all formats, including srt and json)
whisper normalized_podcast.wav --output_format all --model medium
3. Meeting Recording Transcription
# Extract audio from meeting recording
ffmpeg -i meeting_recording.mp4 -vn -ar 16000 -ac 1 meeting_audio.wav

# Transcribe with word-level timestamps
whisper meeting_audio.wav --output_format json --model base --word_timestamps True
4. Foreign Language Content
# Auto-detect language
whisper foreign_audio.mp3 --model medium --output_format srt

# Specify language for better accuracy
whisper thai_audio.mp3 --language Thai --model medium --output_format srt

# Translate to English while transcribing
whisper foreign_audio.mp3 --task translate --output_format srt
Optimization Tips
Performance Optimization
# Use GPU acceleration (if available)
whisper audio.wav --device cuda

# Specify number of CPU threads
whisper audio.wav --threads 4

# Use faster models for real-time processing
whisper audio.wav --model tiny.en --fp16 False
Quality Optimization
# Improve accuracy with audio preprocessing
ffmpeg -i noisy_audio.mp3 -af "highpass=f=80,lowpass=f=8000,dynaudnorm" clean_audio.wav
whisper clean_audio.wav --model large --temperature 0

# Use temperature for consistency
whisper audio.wav --temperature 0 --best_of 5
Troubleshooting
Common Issues
File Format Problems
# Convert unsupported formats
ffmpeg -i input.webm -acodec pcm_s16le -ar 16000 output.wav
ffmpeg -i input.flac -acodec mp3 output.mp3
Memory Issues
# Use smaller model
whisper large_file.wav --model tiny

# Process in segments
ffmpeg -i large_file.mp4 -f segment -segment_time 300 -vn segment_%03d.wav
Poor Transcription Quality
# Try different models
whisper audio.wav --model medium  # vs base or large

# Specify language explicitly
whisper audio.wav --language en

# Adjust audio quality
ffmpeg -i input.mp3 -af "volume=2.0,highpass=f=100" enhanced.wav
Advanced Features
Custom Vocabulary and Prompts
# Use initial prompt for context
whisper audio.wav --initial_prompt "This is a technical discussion about machine learning"

# For names and technical terms
whisper audio.wav --initial_prompt "Speakers: Peerasan Buranasanti. Topic: Media Streaming"
Fine-tuning Output
# Suppress non-speech tokens
whisper audio.wav --suppress_tokens "50257"

# No speech threshold (adjust for silence detection)
whisper audio.wav --no_speech_threshold 0.6

# Compression ratio threshold
whisper audio.wav --compression_ratio_threshold 2.4
Integration with Video Editing
# Generate subtitle file for video editing
whisper video.mp4 --output_format srt --model medium

# Burn subtitles directly into video (Whisper names the SRT after the input, here video.srt)
whisper video.mp4 --output_format srt
ffmpeg -i video.mp4 -vf "subtitles=video.srt" video_with_subs.mp4
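If you prefer selectable subtitles over burned-in ones, the same SRT can instead be muxed into the MP4 as a soft subtitle track; the filenames below assume the example above.

# Add the SRT as a soft subtitle track without re-encoding video or audio
ffmpeg -i video.mp4 -i video.srt -c copy -c:s mov_text video_softsubs.mp4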
Complete Example Workflows
Workflow 1: Conference Talk Processing
#!/bin/bash
# conference_transcription.sh

INPUT_FILE="$1"
OUTPUT_NAME="${INPUT_FILE%.*}"

echo "Processing: $INPUT_FILE"

# Step 1: Extract and optimize audio
ffmpeg -i "$INPUT_FILE" \
    -vn \
    -af "dynaudnorm,highpass=f=80,lowpass=f=8000" \
    -ar 16000 \
    -ac 1 \
    "${OUTPUT_NAME}_processed.wav"

# Step 2: Transcribe with Whisper (output files are named after the .wav input)
whisper "${OUTPUT_NAME}_processed.wav" \
    --model medium \
    --output_format all \
    --language Thai \
    --initial_prompt "This is a technical conference presentation"

# Step 3: Clean up
rm "${OUTPUT_NAME}_processed.wav"

echo "Transcription complete: ${OUTPUT_NAME}_processed.txt, ${OUTPUT_NAME}_processed.srt"
Workflow 2: Multilingual Content
#!/bin/bash
# multilingual_transcription.sh

INPUT_FILE="$1"

# Extract audio
ffmpeg -i "$INPUT_FILE" -vn -ar 16000 temp_audio.wav

# Detect language and transcribe
whisper temp_audio.wav --model medium --output_format json

# Also create English translation
whisper temp_audio.wav --task translate --output_format srt --model medium

# Clean up
rm temp_audio.wav
Best Practices
Audio Quality Guidelines
- Sample Rate: 16kHz is optimal for Whisper
- Format: WAV or high-quality MP3
- Mono vs Stereo: Mono is sufficient for speech
- Bit Depth: 16-bit is adequate
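To confirm an extracted file already matches these guidelines, ffprobe can report the relevant stream properties; input.wav here is a placeholder name.

# Show codec, sample rate, channel count, and bit depth of the first audio stream
ffprobe -v error -select_streams a:0 \
  -show_entries stream=codec_name,sample_rate,channels,bits_per_sample \
  -of default=noprint_wrappers=1 input.wav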
Model Selection Strategy
- tiny/base: Quick drafts, real-time processing
- small/medium: Balanced accuracy/speed for most use cases
- large: Maximum accuracy for important content
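One way to apply this strategy automatically is to let a small script pick the model from the recording length. The one-hour threshold below is an illustrative assumption, not a rule.

#!/bin/bash
# pick_model.sh - choose a Whisper model based on duration (heuristic sketch)
duration=$(ffprobe -v error -show_entries format=duration -of default=noprint_wrappers=1:nokey=1 "$1")
if (( $(echo "$duration > 3600" | bc -l) )); then
    model="base"     # long recording: favor speed
else
    model="medium"   # shorter recording: favor accuracy
fi
whisper "$1" --model "$model" --output_format srt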
File Organization
project/
├── original_files/
├── processed_audio/
├── transcriptions/
│   ├── txt/
│   ├── srt/
│   └── json/
└── scripts/
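A short sketch for creating this layout and routing Whisper output into it with --output_dir; the episode filename is hypothetical.

# Create the directory structure
mkdir -p project/{original_files,processed_audio,transcriptions/{txt,srt,json},scripts}

# Write SRT output straight into the matching folder
whisper project/processed_audio/episode01.wav --output_format srt --output_dir project/transcriptions/srt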
Useful FFmpeg Commands for Whisper Prep
Audio Extraction and Conversion
# Extract best quality audio
ffmpeg -i input.mkv -q:a 0 -map a output.mp3

# Convert to Whisper-optimal format
ffmpeg -i input.mp4 -vn -ar 16000 -ac 1 -c:a pcm_s16le output.wav

# Extract audio from specific timeframe
ffmpeg -i input.mp4 -ss 00:10:00 -to 00:20:00 -vn output.wav
Audio Enhancement
# Normalize and filter
ffmpeg -i input.mp3 -af "dynaudnorm=f=75:g=25,highpass=f=80,lowpass=f=8000" clean.wav

# Reduce noise (basic)
ffmpeg -i noisy.wav -af "afftdn" denoised.wav
Performance Monitoring
Check Processing Time
# Time the transcription
time whisper audio.wav --model base

# Monitor GPU usage (if using CUDA)
nvidia-smi
Batch Processing with Progress
#!/bin/bash
total_files=$(ls *.mp4 | wc -l)
current=0

for file in *.mp4; do
    ((current++))
    echo "Processing file $current of $total_files: $file"

    ffmpeg -i "$file" -vn -ar 16000 "${file%.*}.wav"
    whisper "${file%.*}.wav" --output_format srt --model base
    rm "${file%.*}.wav"

    echo "Completed: ${file%.*}.srt"
done
Integration Tips
With Video Editors
- Export SRT files for Premiere Pro, DaVinci Resolve, etc.
- Use VTT format for web players
- JSON format provides detailed timing for custom applications
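Since Whisper's JSON output is a dictionary with a segments array (each segment carrying start, end, and text), a tool like jq can turn it into a simple timing table for custom pipelines; audio.json stands in for whatever file Whisper wrote for your input.

# Extract start time, end time, and text for each segment as TSV (requires jq)
jq -r '.segments[] | "\(.start)\t\(.end)\t\(.text)"' audio.json > segments.tsv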
With Automation Scripts
# Python automation example
import os
import subprocess

def process_video(input_path, model="base"):
    # Extract mono 16 kHz audio next to the input file (overwrite if present)
    audio_path = f"{os.path.splitext(input_path)[0]}.wav"
    subprocess.run([
        "ffmpeg", "-y", "-i", input_path, "-vn",
        "-ar", "16000", "-ac", "1", audio_path
    ], check=True)

    # Transcribe; Whisper writes its output files next to the audio file
    subprocess.run([
        "whisper", audio_path, "--model", model,
        "--output_format", "all"
    ], check=True)

    # Clean up the intermediate audio
    os.remove(audio_path)
Troubleshooting Reference
Common Issues and Solutions
Issue | Solution |
---|---|
"No module named whisper" | pip install openai-whisper |
FFmpeg not found | Add FFmpeg to system PATH |
Poor transcription quality | Try larger model, improve audio quality |
Out of memory | Use smaller model or process shorter segments |
Wrong language detected | Specify --language parameter |
No timestamps in output | Use SRT, VTT, or JSON format |
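For the installation-related rows above, a quick environment check helps narrow things down; these commands only inspect the system and change nothing.

# Verify both tools are reachable from the current shell
command -v ffmpeg && ffmpeg -version | head -n 1
pip show openai-whisper | head -n 2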
Audio Quality Issues
# Check audio properties
ffprobe -v quiet -show_format -show_streams input.mp4

# Test with sample
ffmpeg -i input.mp4 -t 30 -vn sample.wav
whisper sample.wav --model base
Quick Reference Commands
Essential Commands
# Basic transcription
whisper file.mp4

# High-quality subtitles
whisper file.mp4 --model large --output_format srt

# Extract audio + transcribe
ffmpeg -i video.mp4 -vn audio.wav && whisper audio.wav --output_format srt

# Batch process current directory
for f in *.mp4; do whisper "$f" --output_format srt; done

# Foreign language with translation
whisper foreign.mp3 --task translate --output_format srt
Useful FFmpeg Audio Processing
# Optimize for speech recognition
ffmpeg -i input.mp4 -af "dynaudnorm,highpass=f=80,lowpass=f=8000" -ar 16000 -ac 1 optimized.wav

# Split long audio into segments
ffmpeg -i long_audio.mp3 -f segment -segment_time 600 -c copy segment_%03d.mp3

# Merge multiple audio files
ffmpeg -f concat -safe 0 -i filelist.txt -c copy merged.mp3
Note: Processing time varies significantly based on audio length, model size, and hardware. The large
model provides the best accuracy but requires substantial computational resources.