Detecting Vulnerability in Open Source Projects using Deep Learning

Yudistira Asnar
STEI-ITB

Bayu Samudra
Magister Informatika STEI-ITB

Fawwaz A. Wiradhika
Magister Informatika STEI-ITB

Alifia Rahmah
Magister Informatika STEI-ITB

Abstract
The increasing reliance on open-source software has amplified the need for effective vulnerability detection mechanisms. This research focuses on investigating existing literature and methodologies in the domain of vulnerability detection in open-source projects.
By analyzing state-of-the-art approaches, including pre-processing techniques, encoder architectures, neural networks, and other machine learning techniques, we aim to identify their strengths and limitations. Our objective is to advance the current state of the art by selectively integrating and improving upon these methodologies. The proposed research will explore novel combinations of deep learning models and feature engineering techniques to achieve higher performance in vulnerability detection. This poster outlines our research objectives, methodology, and anticipated contributions to the field of software security. Through this exploration, we hope to provide a practical framework, built from the best-suited techniques, that developers and researchers alike can adopt to bolster the security of open-source software projects.

Keywords: Vulnerability Detection, Deep Learning, Pre-processing

Introduction
Open Source Software (OSS) has recently grown in popularity among developers. However, OSS also carries high security risks: because its source code is open, potential vulnerabilities in the code can be found and exploited by attackers. Therefore, it is crucial to develop effective vulnerability detection mechanisms. One approach to detecting vulnerabilities in OSS is static code analysis, a technique that analyzes the source code without executing it. Much research has explored static analysis, and the most promising results so far come from vulnerability detection using deep learning. Three such studies are (1), (2), and (3).

VulnDetect Framework
In this research, we present the VulnDetect Framework, which is composed of five stages: Data Collection & Labeling, Data Preprocessing, Feature Extraction, Model Training, and Model Evaluation.

Figure 1. VulnDetect Framework Pipeline

In the Data Collection & Labeling stage, we gather datasets from Open Source Software (OSS) projects and label them according to their vulnerabilities, adapting the approach of (1).
During the Data Preprocessing stage, the dataset is tokenized and normalized.
Furthermore, the dataset's features are augmented with various aspects found in programming languages (PLs). This stage also creates graph-based features such as Abstract Syntax Trees (ASTs), Program Dependency Graphs (PDGs), and other features relevant to the modeling stage. These features enrich the information extracted from the source code and preserve its syntactic and semantic meaning, which can help the model learn better.
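The tokenization, normalization, and AST steps described above can be illustrated with a minimal sketch for Python source code. It uses only the standard `tokenize` and `ast` modules. The placeholder tokens `<STR>` and `<NUM>`, the function names, and the sample snippet are assumptions made for illustration and are not taken from the framework itself.

```python
import ast
import io
import tokenize


def normalize_tokens(source: str) -> list[str]:
    """Tokenize Python source and replace literals with placeholder tokens."""
    tokens = []
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        if tok.type == tokenize.STRING:
            tokens.append("<STR>")  # hypothetical placeholder for string literals
        elif tok.type == tokenize.NUMBER:
            tokens.append("<NUM>")  # hypothetical placeholder for numeric literals
        elif tok.type in (tokenize.NAME, tokenize.OP):
            tokens.append(tok.string)
        # Comments, newlines, indentation, and end markers are dropped.
    return tokens


def ast_edges(source: str) -> list[tuple[str, str]]:
    """Return parent -> child edges of the AST, labeled by node type."""
    edges = []
    for parent in ast.walk(ast.parse(source)):
        for child in ast.iter_child_nodes(parent):
            edges.append((type(parent).__name__, type(child).__name__))
    return edges


# Illustrative snippet: string concatenation used to build a SQL query.
snippet = 'query = "SELECT * FROM users WHERE id = " + user_id  # unsafe\n'
print(normalize_tokens(snippet))
print(ast_edges(snippet))
```

The token list would feed a sequence-based model, while the edge list is a starting point for graph-based features. A PDG would additionally require control-flow and data-flow analysis, which this sketch does not attempt.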
In the Feature Extraction stage, the tokenized data is converted into a representation that the model can interpret. For this research, we study CodeBERT, Code2Vec, and Code2Seq as feature extractors, preferring them over NLP-based extractors such as Word2Vec or GloVe because of the unique characteristics of PLs. PLs differ markedly from natural languages (NLs) in their rigid syntax and grammar and their deterministic semantics.
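To make the difference from word-level NLP extractors concrete, the following is a simplified sketch of the kind of input representation that Code2Vec builds on: path-contexts, where each context pairs two terminals with the AST path connecting them. This is an illustration only and not the Code2Vec implementation. The function names and the `max_length` parameter are hypothetical, only identifiers and constants are treated as terminals, and the learned embedding and attention steps are omitted.

```python
import ast
from itertools import combinations


def collect_leaves(tree):
    """Return (terminal text, root-to-leaf node list) for identifiers and constants."""
    leaves = []

    def visit(node, ancestors):
        path = ancestors + [node]
        if isinstance(node, ast.Name):
            leaves.append((node.id, path))
        elif isinstance(node, ast.Constant):
            leaves.append((repr(node.value), path))
        else:
            for child in ast.iter_child_nodes(node):
                visit(child, path)

    visit(tree, [])
    return leaves


def path_contexts(source, max_length=8):
    """Return (left terminal, AST path, right terminal) triples for a snippet."""
    contexts = []
    leaves = collect_leaves(ast.parse(source))
    for (left, left_path), (right, right_path) in combinations(leaves, 2):
        # Count the ancestors shared by both leaves (compared by identity).
        shared = 0
        while left_path[shared] is right_path[shared]:
            shared += 1
        # Walk up from the left leaf to the lowest common ancestor, then down.
        nodes = left_path[shared:][::-1] + [left_path[shared - 1]] + right_path[shared:]
        names = tuple(type(node).__name__ for node in nodes)
        if len(names) <= max_length:
            contexts.append((left, names, right))
    return contexts


for context in path_contexts("total = price * qty"):
    print(context)
```

Unlike a flat token window, each path records how two terminals are related syntactically, which is the structural information that word-level extractors such as Word2Vec do not see.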
The Model Training stage involves selecting the model and tuning its hyperparameters, followed by training the model. In this research, we use sequence-based and graph-based models, and two types of classification: binary classification (vulnerable/not vulnerable) and multi-class classification (not vulnerable/vulnerability-1/vulnerability-2/…/vulnerability-n).
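The two classification settings differ only in how the labels are encoded. The sketch below shows one possible encoding; the label names (`not_vulnerable`, `sql_injection`, `xss`) and the function names are hypothetical examples, not the label set used in this research.

```python
NOT_VULNERABLE = "not_vulnerable"  # hypothetical name for the negative class


def to_binary_labels(raw_labels):
    """Map every vulnerability type to 1 and non-vulnerable samples to 0."""
    return [0 if label == NOT_VULNERABLE else 1 for label in raw_labels]


def to_multiclass_labels(raw_labels):
    """Map class 0 to 'not vulnerable' and each vulnerability type to its own index."""
    classes = [NOT_VULNERABLE] + sorted(set(raw_labels) - {NOT_VULNERABLE})
    index = {name: i for i, name in enumerate(classes)}
    return [index[label] for label in raw_labels], classes


raw = ["not_vulnerable", "sql_injection", "xss", "not_vulnerable", "sql_injection"]
print(to_binary_labels(raw))
print(to_multiclass_labels(raw))
```

The binary encoding collapses all vulnerability types into a single positive class, so the same labeled dataset can serve both settings.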
The final stage in our pipeline is Model Evaluation, where the model is assessed using the chosen metrics.
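The text does not name the metrics, so the following sketch assumes accuracy, precision, recall, and F1, which are commonly reported for binary vulnerability detection. The function name and the sample labels are illustrative only and are not experimental results.

```python
def binary_metrics(y_true, y_pred):
    """Compute accuracy, precision, recall, and F1 for labels in {0, 1}."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)

    # Guard against division by zero when a class is never predicted or present.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    accuracy = (tp + tn) / len(pairs) if pairs else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Made-up labels for illustration (1 = vulnerable, 0 = not vulnerable).
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]
print(binary_metrics(y_true, y_pred))
```

Precision and recall are usually more informative than accuracy here, because vulnerable samples tend to be a small minority of real-world code.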

The model evaluation results reported by (1), (2), and (3), each of which uses different techniques, are as follows:

These results do not establish which method is best, as the studies take different approaches to building their models and do not use the same dataset. Additionally, project types, programming languages, and other factors contribute to the performance of those models.
Li, G. and Yang, Y. (2024) compare various deep learning models, which are broadly divided into two types: sequence-based and graph-based. Sequence-based models can find patterns in token sequences but overlook structural information, context, and interactions in the code. Graph-based models alleviate this problem by representing complex information, such as the logical structure and relationships in the code, in the form of a graph. However, when applied to large-scale programs, the graph becomes very complex, which makes feature extraction difficult and limits the scalability of graph-based models. Additionally, noisy edges can obscure critical patterns, and capturing long-range dependencies remains a challenge for graph-based models.

Conclusion
In summary, the increasing complexity of open-source software highlights the urgent need for effective vulnerability detection mechanisms. Our research aims to enhance existing methodologies by conducting a thorough review of current literature and identifying gaps in traditional approaches.
By integrating advanced machine learning techniques, such as pre-processing techniques, encoder architectures, and neural networks, we expect to improve the accuracy and efficiency of vulnerability detection. Our goal is to develop a hybrid model that combines the strengths of various techniques, providing a more robust solution to software vulnerabilities.
This research has the potential to significantly advance the field of software security, enabling developers and organizations to better protect their systems. Future work will focus on empirical validation of our proposed methods and collaboration with industry partners to ensure practical relevance.

Main References

1) VUDENC: Vulnerability Detection with Deep Learning on a Natural Codebase for Python (Wartschinski et al., 2020)
2) Vulnerability Detection with Fine-Grained Interpretations (Li et al., 2021)
3) Machine Learning Techniques for Python Source Code Vulnerability Detection (Farasat and Posegga, 2024)