Indonesian Speech Anti-Spoofing System: Data Creation and Convolutional Neural Network Models

Sarah Azka Arief
STE-ITB

Candy Olivia Mawalim
Japan Advanced Institute of Science and Technology

Dessi Puji Lestari
STE-ITB

Introduction
Advances in neural networks have improved Automatic Speaker Verification (ASV) systems, yet these systems remain vulnerable to spoofing attacks such as replay attacks and speech synthesis. Existing research, including the ASVspoof challenges, has focused on certain languages, leaving a gap for underrepresented languages such as Bahasa Indonesia. This study addresses that gap by creating a specialized Indonesian dataset, developing a convolutional neural network-based system for detecting spoofed Indonesian speech, and evaluating LCNN and ResNet models for effectiveness and generalizability.

Speech Spoof Detection
Automatic Speaker Verification (ASV) systems verify a claimed speaker identity and are typically paired with countermeasure systems that detect spoofed speech, protecting against speech synthesis, voice conversion, impersonation, and replay attacks. These attacks are grouped into physical access (PA) and logical access (LA) scenarios, both of which require effective detection. Countermeasures initially relied on classical machine learning, but deep learning has shifted the focus to neural network-based approaches, particularly convolutional neural networks (CNNs) such as ResNet and the light convolutional neural network (LCNN). These models learn discriminative representations from acoustic front-end features such as linear frequency cepstral coefficients (LFCC), a combination whose accuracy and robustness have been demonstrated in the ASVspoof challenges.
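To make the LFCC front end concrete, the following is a minimal NumPy sketch of the usual recipe: frame the waveform, take the power spectrum, apply triangular filters spaced linearly in frequency (rather than on the mel scale, as in MFCC), take the logarithm, and decorrelate with a DCT-II. The function name `lfcc` and every default value (25 ms frames and 10 ms hop at 16 kHz, 512-point FFT, 20 filters, 20 coefficients) are assumptions chosen for illustration, not the settings used in this study.

```python
import numpy as np


def lfcc(signal, frame_len=400, hop=160, n_fft=512, n_filters=20, n_ceps=20):
    """Return LFCCs with shape (n_frames, n_ceps).

    All defaults are illustrative assumptions, not the study's configuration.
    The signal must contain at least `frame_len` samples.
    """
    signal = np.asarray(signal, dtype=np.float64)

    # 1. Slice the waveform into overlapping frames and apply a Hamming window.
    n_frames = 1 + (len(signal) - frame_len) // hop
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n_frames)[:, None]
    frames = signal[idx] * np.hamming(frame_len)

    # 2. Power spectrum of each frame, shape (n_frames, n_fft // 2 + 1).
    power = np.abs(np.fft.rfft(frames, n=n_fft, axis=1)) ** 2

    # 3. Triangular filters whose edges are spaced linearly across the FFT bins.
    n_bins = n_fft // 2 + 1
    edges = np.linspace(0, n_bins - 1, n_filters + 2)
    bins = np.arange(n_bins)[None, :]
    left, centre, right = edges[:-2, None], edges[1:-1, None], edges[2:, None]
    rising = (bins - left) / (centre - left)
    falling = (right - bins) / (right - centre)
    fbank = np.clip(np.minimum(rising, falling), 0.0, None)

    # 4. Log filterbank energies (small constant avoids log(0)).
    log_energy = np.log(power @ fbank.T + 1e-10)

    # 5. Unnormalized DCT-II over the filter axis gives the cepstral coefficients.
    n = np.arange(n_filters)
    k = np.arange(n_ceps)[:, None]
    dct = np.cos(np.pi * k * (2 * n + 1) / (2 * n_filters))
    return log_energy @ dct.T
```

The resulting two-dimensional array (time by coefficient) is the kind of input a CNN such as LCNN or ResNet can treat like an image. Delta and delta-delta coefficients are often appended in practice; they are omitted here to keep the sketch short.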

Building The Dataset
A Bahasa Indonesia dataset for spoofed speech detection was created using audio from Common Voice and Prosa.ai. It includes bona fide and spoofed speech for logical access (LA) and physical access (PA) scenarios. LA spoofs were generated using the MMS model for text-to-speech and FreeVC for voice conversion, covering various Indonesian accents. PA spoofs were created through replay attack simulations, recording audio with different microphones. The dataset combines varied acoustic conditions with high-quality studio recordings to support comprehensive spoof detection.

Experimental Results and Analysis
This study uses convolutional neural network (CNN) models, specifically ResNet and LCNN, for Indonesian spoof speech detection, with linear frequency cepstral coefficients (LFCC) as input features. The dataset from Prosa.ai is carefully partitioned to ensure balanced gender representation and equal distribution of bona fide and spoofed audio, enhancing model accuracy in both physical access (PA) and logical access (LA) scenarios.
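One way to obtain a partition that is balanced by gender and by bona fide/spoofed label is a stratified split, sketched below with the standard library only. The record fields `gender` and `label`, the function name `stratified_split`, and the 80/10/10 train/dev/eval fractions are hypothetical; the source does not state the split ratios or the metadata layout.

```python
import random
from collections import defaultdict


def stratified_split(records, fractions=(0.8, 0.1, 0.1), seed=0):
    """Split records into train/dev/eval so that every (gender, label)
    group contributes the same share to each subset.

    `records` is a list of dicts with hypothetical keys "gender" and "label".
    The fractions are an assumption, not the ratios used in the study.
    """
    rng = random.Random(seed)

    # Bucket utterances by their (gender, label) combination.
    groups = defaultdict(list)
    for rec in records:
        groups[(rec["gender"], rec["label"])].append(rec)

    splits = {"train": [], "dev": [], "eval": []}
    for key in sorted(groups):
        items = groups[key]
        rng.shuffle(items)  # seeded shuffle keeps the split reproducible
        n_train = round(fractions[0] * len(items))
        n_dev = round(fractions[1] * len(items))
        splits["train"] += items[:n_train]
        splits["dev"] += items[n_train:n_train + n_dev]
        splits["eval"] += items[n_train + n_dev:]
    return splits
```

This sketch balances only gender and label. A real anti-spoofing protocol would normally also keep speakers disjoint across subsets, which would require grouping by a speaker identifier before splitting.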

Physical Access (PA)

Same-source performance
Both LCNN and ResNet models performed well when trained and tested on the same dataset.

Cross-Source Performance
Poor generalization was observed, with minDCF scores of 1 and EERs between 81% and 100%.
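For readers unfamiliar with the metric, the equal error rate (EER) is the operating point at which the share of spoofed utterances accepted equals the share of bona fide utterances rejected. A random scorer sits near 50%, so an EER far above 50% indicates that the scores are systematically inverted on the unseen source rather than merely uninformative. The sketch below assumes that higher scores mean "more likely bona fide"; the function name `compute_eer` and the toy scores are illustrative and are not taken from the study.

```python
def compute_eer(bonafide_scores, spoof_scores):
    """Approximate the equal error rate by sweeping every observed score
    as a decision threshold (accept as bona fide when score >= threshold).

    Assumes higher scores mean "more likely bona fide".
    """
    thresholds = sorted(set(bonafide_scores) | set(spoof_scores))
    thresholds.append(thresholds[-1] + 1.0)  # a threshold that rejects everything

    best_gap, eer = float("inf"), None
    for t in thresholds:
        # False acceptance: spoofed audio scored at or above the threshold.
        far = sum(s >= t for s in spoof_scores) / len(spoof_scores)
        # False rejection: bona fide audio scored below the threshold.
        frr = sum(s < t for s in bonafide_scores) / len(bonafide_scores)
        gap = abs(far - frr)
        if gap < best_gap:
            best_gap, eer = gap, (far + frr) / 2
    return eer


# Well-separated scores: every bona fide utterance outscores every spoof.
print(compute_eer([0.9, 0.8, 0.7, 0.6], [0.1, 0.2, 0.3, 0.4]))  # 0.0

# Inverted scores: every spoof outscores every bona fide utterance.
print(compute_eer([0.1, 0.2, 0.3, 0.4], [0.6, 0.7, 0.8, 0.9]))  # 1.0
```

The second call mirrors the cross-source situation: a model that has learned source-specific cues can end up ranking replayed audio above genuine audio from another source, which pushes the EER toward 100%.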

Factors for Poor Performance

Differences in recording quality between the Prosa.ai (studio) and Common Voice (volunteer) datasets.
Variations in recording settings, such as gain and stereo-to-mono conversion.
Smaller dataset size in the PA scenario, limiting model exposure to diverse conditions.

Logical Access (LA)

Same-source performance
Excellent performance in 4-fold cross-validation and same-source tests, with minDCF and EER scores near zero.

Cross-Source Performance

Prosa.ai Trained Models: Good generalization when tested on Common Voice, maintaining low minDCF and EER scores.
Common Voice Trained Models: Inferior performance with minDCF scores of 0.5 to 1 and EERs of 30-100%.

Factors for Poor Performance

The consistent quality of the Prosa.ai dataset enabled clearer distinctions between bona fide and spoofed speech.
The variable quality of the Common Voice data made it challenging for models to generalize effectively.

Conclusion
This study developed an Indonesian spoofed speech dataset from Common Voice and Prosa.ai to support the creation of anti-spoofing systems for both logical access (LA) and physical access (PA) scenarios. The dataset included diverse accents and recording conditions, with LA spoofs generated using the MMS model and FreeVC system, and PA spoofs created through replay attack simulations. Evaluations of LCNN and ResNet models on this dataset showed strong performance, particularly with 4-fold cross-validation, but revealed significant generalization issues when tested across different data sources. The consistent quality of the Prosa.ai dataset aided generalization, emphasizing the importance of high-quality data. Future work should focus on enhancing dataset diversity and exploring hybrid approaches to improve the robustness of Indonesian speech anti-spoofing systems.