¹College of Computer Science, Inner Mongolia University, China
²Lenovo, China
Accepted at ICASSP 2026
Abstract
Audio-Visual Target Speaker Extraction (AVTSE) aims to isolate a target speaker's voice from multi-speaker mixtures by leveraging visual cues. However, the practical deployment of existing AVTSE methods is often hindered by poor generalization, high computational complexity, and non-causal designs. To address these issues, we propose 2S-AVTSE, a novel two-stage system built on an audio-visual decoupling strategy. This approach uniquely eliminates the need for synchronized audio-visual training data, enhancing its applicability in real-world scenarios. The first stage uses a compact visual network to perform voice activity detection (VAD) from visual cues alone. The resulting VAD output then guides a second-stage audio network to extract the target speech. With a computational load of only 1.89 GMACs/s, our system exhibits superior generalization and robustness in realistic and cross-domain scenarios compared to end-to-end baselines. This design presents a practical and effective solution for real-world applications.
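For intuition, here is a minimal sketch of the two-stage inference flow described above. The module names, tensor shapes, and nearest-neighbor upsampling are illustrative assumptions, not the paper's exact architecture:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TwoStageAVTSE(nn.Module):
    """Minimal sketch of the decoupled two-stage pipeline.
    `visual_vad_net` and `audio_extractor` are hypothetical stand-ins
    for the paper's stage-one and stage-two networks."""

    def __init__(self, visual_vad_net: nn.Module, audio_extractor: nn.Module):
        super().__init__()
        self.visual_vad_net = visual_vad_net    # stage 1: trained on video alone
        self.audio_extractor = audio_extractor  # stage 2: trained on audio alone

    def forward(self, lip_frames: torch.Tensor, mixture: torch.Tensor) -> torch.Tensor:
        # Stage 1: frame-wise voice-activity probabilities from visual cues only.
        vad = torch.sigmoid(self.visual_vad_net(lip_frames))        # (B, T_video)
        # Upsample the video-rate VAD track (e.g. 25 fps) to the audio length.
        vad = F.interpolate(vad.unsqueeze(1), size=mixture.shape[-1],
                            mode="nearest").squeeze(1)              # (B, T_audio)
        # Stage 2: a causal audio network extracts the target speaker,
        # conditioned on the VAD track instead of raw visual features.
        return self.audio_extractor(mixture, vad)
```

Because the two stages communicate only through the VAD track, each can be trained on its own modality, which is what removes the need for synchronized audio-visual data.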
Real Recordings 🎯 Where 2S-AVTSE Truly Excels
💡 Key Insight: While 2S-AVTSE performs competitively on simulated data, it demonstrates exceptional generalization on real-world recordings where other methods struggle, making it a practical, deployable solution for real applications.
Unprocessed: original mixture
2S-AVTSE: causal, 1.89 GMACs/s
CTCNet: non-causal, 92.56 GMACs/s
CTCNet-mini: non-causal, 2.26 GMACs/s
Simulation Demos 📊 Competitive Performance on Standard Benchmarks
Mix: unprocessed mixture
CTCNet: non-causal, 92.56 GMACs/s
CTCNet-mini: non-causal, 2.26 GMACs/s
2S-AVTSE: causal, 1.89 GMACs/s
Clean: ground truth
Note: These simulation demos come from the artificial LRS2-2Mix benchmark, which features 100% overlapping speech. Since 2S-AVTSE identifies the target speaker via an initial activation cue, we prepended a 2-second non-overlapping voice segment to each sample for this evaluation. Our design deliberately trades maximum performance on this specific benchmark for superior robustness in more realistic, sparsely overlapped conversations; its performance here may therefore be surpassed by end-to-end models specifically optimized for this dataset's artificial conditions.
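As an illustration of that evaluation protocol, the snippet below prepends a 2-second target-only segment to one fully overlapped sample. The file names, the 16 kHz sample rate, and using the clean reference as the cue source are assumptions for this sketch, not the paper's exact procedure:

```python
import numpy as np
import soundfile as sf

SR = 16000  # assumed sample rate

# Hypothetical file names for one LRS2-2Mix sample.
target, _ = sf.read("target_clean.wav")  # ground-truth target speech
mixture, _ = sf.read("mixture.wav")      # fully overlapped two-speaker mix

# Prepend 2 s of target-only speech so the visual VAD stage can lock onto
# the target speaker before the overlapped region begins.
cue = target[: 2 * SR]
sf.write("eval_mixture.wav", np.concatenate([cue, mixture]), SR)
sf.write("eval_target.wav", np.concatenate([cue, target]), SR)
```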
Core Advantages
Overcomes Data Scarcity. Our decoupled training strategy removes the need for synchronized audio-visual data, allowing the audio stage to be trained on vast, readily available audio-only corpora (see the sketch after this list).
Ready for Real-Time Use. With an ultra-lightweight design (1.36M params, 1.89 GMACs/s) and a fully causal architecture, 2S-AVTSE runs in real time on standard CPUs.
Built for Reality, Not Just Benchmarks. While competitive on standard benchmarks, our model truly shines in realistic conversational scenarios and real-world recordings where other models fail to generalize.
Deploy Anywhere. We've confirmed its robust performance on both ARM and x86 platforms, making 2S-AVTSE a practical solution ready for deployment on laptops, desktops, and more.
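As a rough illustration of the decoupled training idea above, the sketch below derives frame-level VAD labels from audio alone with a toy energy detector, so the extraction stage can be trained without any video. The frame size, threshold, and the `extractor`/`si_snr_loss` names are assumptions, not the paper's actual labeling procedure or objective:

```python
import numpy as np

def energy_vad(wav: np.ndarray, sr: int = 16000, frame_ms: int = 40,
               threshold_db: float = -40.0) -> np.ndarray:
    """Toy energy-based VAD: returns a 0/1 activity label per frame.
    Used only to show how stage-two labels can come from audio alone."""
    hop = sr * frame_ms // 1000
    n_frames = len(wav) // hop
    frames = wav[: n_frames * hop].reshape(n_frames, hop)
    power_db = 10.0 * np.log10(np.mean(frames ** 2, axis=1) + 1e-10)
    return (power_db > threshold_db).astype(np.float32)

# Building one training example from audio-only data (no video needed):
#   mixture   = target + interferer          # two clean utterances, summed
#   vad_label = energy_vad(target)           # surrogate for the visual VAD
#   loss      = si_snr_loss(extractor(mixture, vad_label), target)
# `extractor` and `si_snr_loss` are hypothetical stand-ins for the
# stage-two network and its training objective.
```

At inference time the visual VAD network simply replaces the energy detector, which is why the audio stage never needs to see synchronized video during training.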