{"id":23113,"date":"2024-12-15T01:36:42","date_gmt":"2024-12-14T18:36:42","guid":{"rendered":"https:\/\/stei.itb.ac.id\/?page_id=23113"},"modified":"2024-12-16T08:10:34","modified_gmt":"2024-12-16T01:10:34","slug":"indonesian-speech-anti-spoofing-system-data-creation-and-convolutional-neural-network-models","status":"publish","type":"page","link":"https:\/\/stei.itb.ac.id\/en\/prima\/indonesian-speech-anti-spoofing-system-data-creation-and-convolutional-neural-network-models\/","title":{"rendered":"Indonesian Speech Anti-Spoofing System: Data Creation and Convolutional Neural Network Models"},"content":{"rendered":"<div class=\"wpb-content-wrapper\"><div class=\"fullwidth\" ><div class=\"vc_row wpb_row vc_row-fluid kepala vc_custom_1734236300248 vc_row-has-fill\"><div class=\"wpb_column vc_column_container vc_col-sm-12\"><div class=\"vc_column-inner\"><div class=\"wpb_wrapper\"><div class=\"container\" ><div class=\"vc_row wpb_row vc_inner vc_row-fluid\"><div class=\"wpb_column vc_column_container vc_col-sm-12\"><div class=\"vc_column-inner\"><div class=\"wpb_wrapper\"><div class=\"vc_btn3-container vc_btn3-inline vc_do_btn\" ><button class=\"vc_general vc_btn3 vc_btn3-size-lg vc_btn3-shape-rounded vc_btn3-style-modern vc_btn3-icon-left vc_btn3-color-white\" onclick=\"history.back()\"><i class=\"vc_btn3-icon fas fa-home\"><\/i> Back to Home<\/button><\/div><\/div><\/div><\/div><\/div><\/div><\/div><\/div><\/div><\/div><\/div><div class=\"fullwidth\" ><div class=\"vc_row wpb_row vc_row-fluid vc_custom_1734170418007\"><div class=\"wpb_column vc_column_container vc_col-sm-3\"><div class=\"vc_column-inner\"><div class=\"wpb_wrapper\">\n\t<div class=\"wpb_text_column wpb_content_element\" >\n\t\t<div class=\"wpb_wrapper\">\n\t\t\t<p><strong>Sarah Azka Arief<\/strong><br \/>\nSTE-ITB<\/p>\n\n\t\t<\/div>\n\t<\/div>\n<\/div><\/div><\/div><div class=\"wpb_column vc_column_container vc_col-sm-3\"><div class=\"vc_column-inner\"><div class=\"wpb_wrapper\">\n\t<div 
class=\"wpb_text_column wpb_content_element\" >\n\t\t<div class=\"wpb_wrapper\">\n\t\t\t<p><b>Candy Olivia Mawalim<\/b><br \/>\nJapan Advanced Institute of Science and Technology<\/p>\n\n\t\t<\/div>\n\t<\/div>\n<\/div><\/div><\/div><div class=\"wpb_column vc_column_container vc_col-sm-3\"><div class=\"vc_column-inner\"><div class=\"wpb_wrapper\">\n\t<div class=\"wpb_text_column wpb_content_element\" >\n\t\t<div class=\"wpb_wrapper\">\n\t\t\t<p><strong>Dessi Puji Lestari<\/strong><br \/>\nSTE-ITB<\/p>\n\n\t\t<\/div>\n\t<\/div>\n<\/div><\/div><\/div><div class=\"wpb_column vc_column_container vc_col-sm-3\"><div class=\"vc_column-inner\"><div class=\"wpb_wrapper\"><\/div><\/div><\/div><\/div><\/div><div class=\"fullwidth\" ><div class=\"vc_row wpb_row vc_row-fluid vc_custom_1734168902943\"><div class=\"wpb_column vc_column_container vc_col-sm-12\"><div class=\"vc_column-inner\"><div class=\"wpb_wrapper\">\n\t<div class=\"wpb_text_column wpb_content_element\" >\n\t\t<div class=\"wpb_wrapper\">\n\t\t\t<p><strong>Introduction<\/strong><br \/>\nTechnological advancements in neural networks have improved Automatic Speaker Verification (ASV) systems, which nevertheless remain vulnerable to spoofing attacks such as replay attacks and speech synthesis. Existing research, including the ASVspoof challenges, has focused on a small set of languages, leaving a gap for underrepresented languages like Bahasa Indonesia. This study addresses this gap by developing a convolutional neural network-based system for detecting spoofed speech in Indonesian, creating a specialized dataset, and evaluating models like LCNN and ResNet for their effectiveness and generalizability.<\/p>\n<p><strong>Speech Spoof Detection<\/strong><br \/>\nAutomatic Speaker Verification (ASV) systems are crucial for security and are often integrated with countermeasure systems that detect spoofed speech and protect against spoofing attacks, including speech synthesis, voice conversion, impersonation, and replay attacks. 
These attacks are categorized into physical access (PA) and logical access (LA) scenarios, both requiring effective detection solutions. Initially, classical machine learning was used for countermeasures, but the advent of deep learning has shifted the focus to neural network-based approaches, particularly convolutional neural networks (CNNs) such as ResNet and light convolutional neural networks (LCNN). These models are favored for their ability to learn discriminative representations automatically from input features such as linear frequency cepstral coefficients (LFCC), which are crucial for the accuracy and robustness of spoofed speech detection systems, as demonstrated in ASVspoof challenges.<\/p>\n<p><strong>Building the Dataset<\/strong><br \/>\nA Bahasa Indonesia dataset for spoofed speech detection was created using audio from Common Voice and Prosa.ai. It includes bona fide and spoofed speech for logical access (LA) and physical access (PA) scenarios. LA spoofs were generated using the MMS model for text-to-speech and FreeVC for voice conversion, covering various Indonesian accents. PA spoofs were created through replay attack simulations, recording audio with different microphones. 
The dataset combines varied acoustic conditions and high-quality studio recordings for comprehensive spoof detection.<\/p>\n\n\t\t<\/div>\n\t<\/div>\n<\/div><\/div><\/div><\/div><\/div><div class=\"fullwidth\" ><div class=\"vc_row wpb_row vc_row-fluid vc_custom_1734169162466 vc_row-o-content-bottom vc_row-flex\"><div class=\"wpb_column vc_column_container vc_col-sm-4\"><div class=\"vc_column-inner\"><div class=\"wpb_wrapper\"><\/div><\/div><\/div><div class=\"wpb_column vc_column_container vc_col-sm-4\"><div class=\"vc_column-inner\"><div class=\"wpb_wrapper\">\n\t<div  class=\"wpb_single_image wpb_content_element vc_align_center wpb_content_element\">\n\t\t\n\t\t<figure class=\"wpb_wrapper vc_figure\">\n\t\t\t<div class=\"vc_single_image-wrapper   vc_box_border_grey\"><img loading=\"lazy\" decoding=\"async\" width=\"1392\" height=\"1696\" src=\"https:\/\/stei.itb.ac.id\/wp-content\/uploads\/24-1.jpg\" class=\"vc_single_image-img attachment-full\" alt=\"\" title=\"Build The Dataset\" \/><\/div>\n\t\t<\/figure>\n\t<\/div>\n<\/div><\/div><\/div><div class=\"wpb_column vc_column_container vc_col-sm-4\"><div class=\"vc_column-inner\"><div class=\"wpb_wrapper\"><\/div><\/div><\/div><\/div><\/div><div class=\"fullwidth\" ><div class=\"vc_row wpb_row vc_row-fluid vc_custom_1734168902943\"><div class=\"wpb_column vc_column_container vc_col-sm-12\"><div class=\"vc_column-inner\"><div class=\"wpb_wrapper\">\n\t<div class=\"wpb_text_column wpb_content_element\" >\n\t\t<div class=\"wpb_wrapper\">\n\t\t\t<p><strong>Experimental Results and Analysis<\/strong><br \/>\nThis study uses convolutional neural network (CNN) models, specifically ResNet and LCNN, for Indonesian spoof speech detection, with linear frequency cepstral coefficients (LFCC) as input features. 
The dataset from Prosa.ai is carefully partitioned to ensure balanced gender representation and equal distribution of bona fide and spoofed audio, enhancing model accuracy in both physical access (PA) and logical access (LA) scenarios.<\/p>\n\n\t\t<\/div>\n\t<\/div>\n<\/div><\/div><\/div><\/div><\/div><div class=\"fullwidth\" ><div class=\"vc_row wpb_row vc_row-fluid vc_custom_1734169162466 vc_row-o-content-bottom vc_row-flex\"><div class=\"wpb_column vc_column_container vc_col-sm-6\"><div class=\"vc_column-inner\"><div class=\"wpb_wrapper\"><h3 style=\"color: #111111;text-align: center;font-family:Abril Fatface;font-weight:400;font-style:normal\" class=\"vc_custom_heading vc_do_custom_heading\" >Physical Access (PA)<\/h3>\n\t<div  class=\"wpb_single_image wpb_content_element vc_align_center wpb_content_element vc_custom_1734201677076\">\n\t\t\n\t\t<figure class=\"wpb_wrapper vc_figure\">\n\t\t\t<div class=\"vc_single_image-wrapper   vc_box_border_grey\"><img loading=\"lazy\" decoding=\"async\" width=\"849\" height=\"643\" src=\"https:\/\/stei.itb.ac.id\/wp-content\/uploads\/Physical-Access-PA.jpg\" class=\"vc_single_image-img attachment-full\" alt=\"\" title=\"Physical Access (PA)\" \/><\/div>\n\t\t<\/figure>\n\t<\/div>\n\n\t<div class=\"wpb_text_column wpb_content_element\" >\n\t\t<div class=\"wpb_wrapper\">\n\t\t\t<p><strong>Same-Source Performance<\/strong><br \/>\nBoth LCNN and ResNet models performed well when trained and tested on the same dataset.<\/p>\n<p><strong>Cross-Source Performance<\/strong><br \/>\nPoor generalization was observed, with minDCF scores of 1 and EERs between 81% and 100%.<\/p>\n<p><strong>Factors for Poor Performance<\/strong><\/p>\n<ul>\n<li>Differences in recording quality between the Prosa.ai (studio) and Common Voice (volunteer) datasets.<\/li>\n<li>Variations in recording settings, such as gain and stereo-to-mono conversion.<\/li>\n<li>Smaller dataset size in the PA scenario, limiting model exposure to diverse 
conditions.<\/li>\n<\/ul>\n\n\t\t<\/div>\n\t<\/div>\n\n\t<div  class=\"wpb_single_image wpb_content_element vc_align_center wpb_content_element\">\n\t\t\n\t\t<figure class=\"wpb_wrapper vc_figure\">\n\t\t\t<div class=\"vc_single_image-wrapper   vc_box_border_grey\"><img loading=\"lazy\" decoding=\"async\" width=\"962\" height=\"690\" src=\"https:\/\/stei.itb.ac.id\/wp-content\/uploads\/Data-Physical-Access-PA1.jpg\" class=\"vc_single_image-img attachment-full\" alt=\"\" title=\"Data Physical Access (PA)1\" \/><\/div>\n\t\t<\/figure>\n\t<\/div>\n\n\t<div  class=\"wpb_single_image wpb_content_element vc_align_center wpb_content_element\">\n\t\t\n\t\t<figure class=\"wpb_wrapper vc_figure\">\n\t\t\t<div class=\"vc_single_image-wrapper   vc_box_border_grey\"><img loading=\"lazy\" decoding=\"async\" width=\"962\" height=\"457\" src=\"https:\/\/stei.itb.ac.id\/wp-content\/uploads\/Data-Physical-Access-PA2.jpg\" class=\"vc_single_image-img attachment-full\" alt=\"\" title=\"Data Physical Access (PA)2\" \/><\/div>\n\t\t<\/figure>\n\t<\/div>\n<\/div><\/div><\/div><div class=\"wpb_column vc_column_container vc_col-sm-6\"><div class=\"vc_column-inner\"><div class=\"wpb_wrapper\"><h3 style=\"color: #111111;text-align: center;font-family:Abril Fatface;font-weight:400;font-style:normal\" class=\"vc_custom_heading vc_do_custom_heading\" >Logical Access (LA)<\/h3>\n\t<div  class=\"wpb_single_image wpb_content_element vc_align_center wpb_content_element\">\n\t\t\n\t\t<figure class=\"wpb_wrapper vc_figure\">\n\t\t\t<div class=\"vc_single_image-wrapper   vc_box_border_grey\"><img loading=\"lazy\" decoding=\"async\" width=\"415\" height=\"323\" src=\"https:\/\/stei.itb.ac.id\/wp-content\/uploads\/Qogical-Access-QA-1.jpg\" class=\"vc_single_image-img attachment-full\" alt=\"\" title=\"Logical Access (LA)\" \/><\/div>\n\t\t<\/figure>\n\t<\/div>\n\n\t<div class=\"wpb_text_column wpb_content_element\" >\n\t\t<div class=\"wpb_wrapper\">\n\t\t\t<p><strong>Same-Source Performance<\/strong><br 
\/>\nExcellent performance in 4-fold cross-validation and same-source tests, with minDCF and EER scores near zero.<\/p>\n<p><strong>Cross-Source Performance<\/strong><\/p>\n<ul>\n<li>Prosa.ai Trained Models: Good generalization when tested on Common Voice, maintaining low minDCF and EER scores.<\/li>\n<li>Common Voice Trained Models: Inferior performance with minDCF scores of 0.5 to 1 and EERs of 30-100%.<\/li>\n<\/ul>\n<p><strong>Factors for Poor Performance<\/strong><\/p>\n<ul>\n<li>The consistent quality of the Prosa.ai dataset enabled clearer distinctions between bona fide and spoofed speech.<\/li>\n<li>Variable quality of Common Voice data made it challenging for models to generalize effectively.<\/li>\n<\/ul>\n\n\t\t<\/div>\n\t<\/div>\n\n\t<div  class=\"wpb_single_image wpb_content_element vc_align_center wpb_content_element\">\n\t\t\n\t\t<figure class=\"wpb_wrapper vc_figure\">\n\t\t\t<div class=\"vc_single_image-wrapper   vc_box_border_grey\"><img loading=\"lazy\" decoding=\"async\" width=\"969\" height=\"686\" src=\"https:\/\/stei.itb.ac.id\/wp-content\/uploads\/Qogical-Access-QA1.jpg\" class=\"vc_single_image-img attachment-full\" alt=\"\" title=\"Logical Access (LA)1\" \/><\/div>\n\t\t<\/figure>\n\t<\/div>\n\n\t<div  class=\"wpb_single_image wpb_content_element vc_align_center wpb_content_element\">\n\t\t\n\t\t<figure class=\"wpb_wrapper vc_figure\">\n\t\t\t<div class=\"vc_single_image-wrapper   vc_box_border_grey\"><img loading=\"lazy\" decoding=\"async\" width=\"974\" height=\"461\" src=\"https:\/\/stei.itb.ac.id\/wp-content\/uploads\/Qogical-Access-QA2.jpg\" class=\"vc_single_image-img attachment-full\" alt=\"\" title=\"Logical Access (LA)2\" \/><\/div>\n\t\t<\/figure>\n\t<\/div>\n<\/div><\/div><\/div><\/div><\/div><div class=\"fullwidth\" ><div class=\"vc_row wpb_row vc_row-fluid vc_custom_1734168902943\"><div class=\"wpb_column vc_column_container vc_col-sm-12\"><div class=\"vc_column-inner\"><div class=\"wpb_wrapper\">\n\t<div 
class=\"wpb_text_column wpb_content_element\" >\n\t\t<div class=\"wpb_wrapper\">\n\t\t\t<p><strong>Conclusion<\/strong><br \/>\nThis study developed an Indonesian spoofed speech dataset from Common Voice and Prosa.ai to support the creation of anti-spoofing systems for both logical access (LA) and physical access (PA) scenarios. The dataset included diverse accents and recording conditions, with LA spoofs generated using the MMS model and FreeVC system, and PA spoofs created through replay attack simulations. Evaluations of LCNN and ResNet models on this dataset showed strong performance, particularly with 4-fold cross-validation, but revealed significant generalization issues when tested across different data sources. The consistent quality of the Prosa.ai dataset aided in better generalization, emphasizing the importance of high-quality data. Future work should focus on enhancing dataset diversity and exploring hybrid approaches to improve the robustness of Indonesian speech anti-spoofing systems.<\/p>\n\n\t\t<\/div>\n\t<\/div>\n<\/div><\/div><\/div><\/div><\/div>\n<\/div>","protected":false},"excerpt":{"rendered":"Back to Home Sarah Azka Arief STE-ITB Candy Olivia Mawalim Japan Advanced Institute of Science and Technology Dessi Puji Lestari STE-ITB Introduction Technological advancements in neural networks have improved Automatic 
[...]","protected":false},"author":1,"featured_media":0,"parent":22933,"menu_order":0,"comment_status":"closed","ping_status":"closed","template":"","meta":{"footnotes":""},"class_list":["post-23113","page","type-page","status-publish","hentry"],"_links":{"self":[{"href":"https:\/\/stei.itb.ac.id\/en\/wp-json\/wp\/v2\/pages\/23113","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/stei.itb.ac.id\/en\/wp-json\/wp\/v2\/pages"}],"about":[{"href":"https:\/\/stei.itb.ac.id\/en\/wp-json\/wp\/v2\/types\/page"}],"author":[{"embeddable":true,"href":"https:\/\/stei.itb.ac.id\/en\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/stei.itb.ac.id\/en\/wp-json\/wp\/v2\/comments?post=23113"}],"version-history":[{"count":5,"href":"https:\/\/stei.itb.ac.id\/en\/wp-json\/wp\/v2\/pages\/23113\/revisions"}],"predecessor-version":[{"id":23565,"href":"https:\/\/stei.itb.ac.id\/en\/wp-json\/wp\/v2\/pages\/23113\/revisions\/23565"}],"up":[{"embeddable":true,"href":"https:\/\/stei.itb.ac.id\/en\/wp-json\/wp\/v2\/pages\/22933"}],"wp:attachment":[{"href":"https:\/\/stei.itb.ac.id\/en\/wp-json\/wp\/v2\/media?parent=23113"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}