Two-stage Audio-Visual Target Speaker Extraction System for Real-Time Processing On Edge Device

Zixuan Li1, Xueliang Zhang1*, Lei Miao2, Zhipeng Yan2, Ying Sun2, Chong Zhu2
1College of Computer Science, Inner Mongolia University, China
2Lenovo, China
Accepted by ICASSP 2026

Abstract

Audio-Visual Target Speaker Extraction (AVTSE) aims to isolate a target speaker's voice from multi-speaker mixtures by leveraging visual cues. However, the practical deployment of existing AVTSE methods is often hindered by poor generalization, high computational complexity, and non-causal designs. To address these issues, we propose 2S-AVTSE, a novel two-stage system built on an audio-visual decoupling strategy. This approach eliminates the need for synchronized audio-visual training data, enhancing its applicability in real-world scenarios. The first stage uses a compact visual network to perform voice activity detection (VAD) from visual cues alone. The resulting VAD then guides a second-stage audio network to extract the target speech. With a computational load of only 1.89 GMACs/s, our system exhibits superior generalization and robustness in realistic and cross-domain scenarios compared to end-to-end baselines. This design offers a practical and effective solution for real-world applications.
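As a minimal sketch of the decoupled two-stage inference flow described above, assuming PyTorch; the module names `visual_vad` and `audio_extractor` are illustrative stand-ins for the stage-1 visual network and the VAD-conditioned stage-2 audio network, not the released implementation:

```python
import torch


class TwoStageAVTSE(torch.nn.Module):
    """Illustrative wrapper: stage-1 visual VAD conditions a stage-2 audio extractor."""

    def __init__(self, visual_vad: torch.nn.Module, audio_extractor: torch.nn.Module):
        super().__init__()
        self.visual_vad = visual_vad          # stage 1: lip frames -> frame-level VAD logits
        self.audio_extractor = audio_extractor  # stage 2: (mixture, VAD) -> target speech

    def forward(self, mixture: torch.Tensor, lip_frames: torch.Tensor) -> torch.Tensor:
        # Stage 1: predict per-frame voice activity of the target speaker
        # from visual cues only; no audio enters this stage.
        vad = torch.sigmoid(self.visual_vad(lip_frames))  # shape: (batch, frames)

        # Stage 2: the VAD track guides a causal audio network that
        # extracts the target speaker from the mixture.
        return self.audio_extractor(mixture, vad)
```

Because the two stages only communicate through the VAD track, each can be trained on its own data: the visual stage on video with activity labels, the audio stage on audio-only mixtures with simulated VAD cues.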

Real Recordings 🎯 Where 2S-AVTSE Truly Excels

💡 Key Insight: While 2S-AVTSE performs competitively on simulation data, it demonstrates exceptional generalization on real-world recordings where other methods struggle. This makes it a practical, deployable solution for actual applications.

The audio demos compare the following systems:
  • Unprocessed: original mixture
  • 2S-AVTSE: causal, 1.89 GMACs/s
  • CTCNet: non-causal, 92.56 GMACs/s
  • CTCNet-mini: non-causal, 2.26 GMACs/s

Simulation Demos 📊 Competitive Performance on Standard Benchmarks

The audio demos compare the following conditions:
  • Mix: unprocessed mixture
  • CTCNet: non-causal, 92.56 GMACs/s
  • CTCNet-mini: non-causal, 2.26 GMACs/s
  • 2S-AVTSE: causal, 1.89 GMACs/s
  • Clean: ground truth
Note: These simulation demos are from the artificial LRS2-2Mix benchmark, which features 100% overlapping speech. Since 2S-AVTSE identifies the speaker via an initial activation cue, we prepended a 2-second non-overlapping voice segment to each sample for this evaluation. Our design deliberately trades maximum performance on this specific benchmark for superior robustness in more realistic, sparsely overlapped conversations. Therefore, its performance here may be surpassed by end-to-end models that are specifically optimized for this dataset's artificial conditions.
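A rough illustration of that evaluation tweak (prepending a 2-second target-only segment so the first stage has an activation cue); the file names, the `soundfile` dependency, and the helper itself are assumptions for clarity, not the authors' script:

```python
import numpy as np
import soundfile as sf


def prepend_activation_cue(mix_path: str, target_only_path: str,
                           out_path: str, cue_seconds: float = 2.0) -> None:
    """Prepend a short target-only segment to a fully overlapped mixture."""
    mix, sr = sf.read(mix_path)
    cue, sr_cue = sf.read(target_only_path)
    assert sr == sr_cue, "sample rates must match"

    cue = cue[: int(cue_seconds * sr)]            # keep the first 2 s of target-only speech
    sf.write(out_path, np.concatenate([cue, mix]), sr)
```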

Core Advantages

  • Overcomes Data Scarcity. Our novel decoupled training strategy works without synchronized AV data, allowing it to be trained on vast, readily available audio-only datasets.
  • Ready for Real-Time Use. With an ultra-lightweight design (1.36M parameters, 1.89 GMACs/s) and a fully causal architecture, 2S-AVTSE runs in real time on standard CPUs (a rough real-time-factor check is sketched after this list).
  • Built for Reality, Not Just Benchmarks. While competitive on standard benchmarks, our model truly shines in realistic conversational scenarios and real-world recordings where other models fail to generalize.
  • Deploy Anywhere. We've confirmed its robust performance on both ARM and x86 platforms, making 2S-AVTSE a practical solution ready for deployment on laptops, desktops, and more.
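A simple way to sanity-check the real-time claim on your own hardware is to measure the real-time factor of frame-by-frame inference. The sketch below assumes a causal model that consumes fixed-size hops of 16 kHz audio; the hop length and the `model` callable are placeholders, not the released interface:

```python
import time
import torch


@torch.no_grad()
def realtime_factor(model, seconds: float = 10.0, sr: int = 16000, hop: int = 256) -> float:
    """Return processing time divided by audio duration (< 1.0 means faster than real time)."""
    mixture = torch.randn(1, int(seconds * sr))

    start = time.perf_counter()
    for i in range(0, mixture.shape[1] - hop + 1, hop):
        _ = model(mixture[:, i:i + hop])  # causal, chunk-by-chunk inference
    elapsed = time.perf_counter() - start

    return elapsed / seconds
```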