Driven by the exponential growth of software and the advent of pull-based development systems such as Git, a large amount of open-source software has emerged on social coding platforms. GitHub, the largest such platform, not only attracts developers and researchers who contribute legitimate software and research-related source code, but has also become a popular venue for a growing number of cybercriminals to mount continuous cyberattacks. Accordingly, tools have recently been developed to learn representations of GitHub repositories for related applications (e.g., malicious repository detection). However, most of them focus solely on code content while ignoring the rich relational data among repositories, and they typically require substantial resources to obtain enough labeled data for model training while overlooking the readily available unlabeled data. The authors propose Rep2Vec, a novel model that integrates code content, structural relations, and unlabeled data to learn repository representations. First, to comprehensively model the repository data, they build a repository heterogeneous graph (Rep-HG), which is encoded by a graph neural network. Then, to fully exploit the unlabeled data in Rep-HG, they introduce adversarial attacks that generate more challenging contrastive pairs, and a contrastive learning module trains the encoder in the node view and the meta-path view simultaneously. To reduce the encoder's burden under these attacks, they further design a dual-stream contrastive learning module that combines contrastive learning on the adversarial graph and the original graph. Finally, the pre-trained encoder is fine-tuned on the downstream task and further enhanced by a knowledge distillation module. Extensive experiments on a dataset collected from GitHub demonstrate the effectiveness of Rep2Vec in comparison with state-of-the-art methods on multiple repository tasks.
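The abstract only names the Rep-HG and its GNN encoder without implementation detail. As a rough illustration, the sketch below shows one way a repository heterogeneous graph could be represented and encoded with a relation-aware message-passing layer in PyTorch; the node types (repo, user), relation names, and the R-GCN-style mean aggregation are all assumptions for illustration, not the authors' implementation.

```python
import torch
import torch.nn as nn

# Hypothetical Rep-HG: typed node features plus typed edge lists (illustrative schema).
node_feats = {
    "repo": torch.randn(5, 16),  # e.g., code-content embeddings per repository
    "user": torch.randn(4, 16),  # e.g., profile features per contributor
}
# Each relation is (src_type, name, dst_type) with an edge index of shape [2, E].
edges = {
    ("user", "contributes_to", "repo"): torch.tensor([[0, 1, 2, 3], [0, 0, 1, 4]]),
    ("repo", "forked_from", "repo"): torch.tensor([[1, 2], [0, 0]]),
}

class RelationalLayer(nn.Module):
    """One relation-aware message-passing layer (R-GCN-style mean aggregation)."""
    def __init__(self, dim: int, relations):
        super().__init__()
        self.weights = nn.ModuleDict(
            {self._key(r): nn.Linear(dim, dim, bias=False) for r in relations}
        )
        self.self_loop = nn.Linear(dim, dim, bias=False)

    @staticmethod
    def _key(rel):
        return "__".join(rel)  # ModuleDict keys must be strings

    def forward(self, feats, edges):
        out = {t: self.self_loop(x) for t, x in feats.items()}
        for rel, eidx in edges.items():
            src_t, _, dst_t = rel
            src, dst = eidx
            msg = self.weights[self._key(rel)](feats[src_t][src])
            # Mean-aggregate incoming messages onto destination nodes.
            agg = torch.zeros_like(out[dst_t])
            cnt = torch.zeros(out[dst_t].size(0), 1)
            agg.index_add_(0, dst, msg)
            cnt.index_add_(0, dst, torch.ones(dst.size(0), 1))
            out[dst_t] = out[dst_t] + agg / cnt.clamp(min=1)
        return {t: torch.relu(x) for t, x in out.items()}

layer = RelationalLayer(16, edges.keys())
h = layer(node_feats, edges)  # per-type node embeddings
print(h["repo"].shape)        # torch.Size([5, 16])
```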
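The abstract does not specify the contrastive objective. A standard choice for aligning two views of the same node (here, a node-view embedding against a meta-path-view embedding) is InfoNCE, sketched below; the temperature value and the symmetric formulation are assumptions.

```python
import torch
import torch.nn.functional as F

def info_nce(z1: torch.Tensor, z2: torch.Tensor, tau: float = 0.2) -> torch.Tensor:
    """Standard InfoNCE: node i in view 1 should match node i in view 2.

    z1, z2: [N, d] embeddings of the same N nodes under two views
    (e.g., a node-view encoding and a meta-path-view encoding).
    """
    z1 = F.normalize(z1, dim=1)
    z2 = F.normalize(z2, dim=1)
    logits = z1 @ z2.t() / tau         # [N, N] cosine-similarity matrix
    labels = torch.arange(z1.size(0))  # positives sit on the diagonal
    # Symmetrize over both directions of the comparison.
    return 0.5 * (F.cross_entropy(logits, labels) +
                  F.cross_entropy(logits.t(), labels))
```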
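Likewise, the "adversarial attacks" and the dual-stream module are only named in the abstract. One plausible instantiation, sketched below, perturbs node features in the direction that increases the contrastive loss (FGSM-style) to create harder pairs, then sums the contrastive losses on the original and adversarial streams. The toy two-view encoder, the step size eps, and the weight lam are all assumptions; info_nce is the function from the sketch above.

```python
import torch
import torch.nn as nn

class TwoViewEncoder(nn.Module):
    """Toy stand-in for the Rep-HG encoder that emits a node-view and a
    meta-path-view embedding per node (purely illustrative)."""
    def __init__(self, dim: int = 16):
        super().__init__()
        self.node_head = nn.Linear(dim, dim)
        self.meta_head = nn.Linear(dim, dim)

    def forward(self, x):
        return self.node_head(x), self.meta_head(x)

def adversarial_features(encoder, x, eps: float = 0.01):
    """FGSM-style feature perturbation: one assumed way to realize the
    'adversarial attack' that yields more challenging contrastive pairs."""
    x = x.clone().requires_grad_(True)
    z_node, z_meta = encoder(x)
    loss = info_nce(z_node, z_meta)          # info_nce from the sketch above
    grad = torch.autograd.grad(loss, x)[0]
    return (x + eps * grad.sign()).detach()  # ascend the contrastive loss

def dual_stream_loss(encoder, x, lam: float = 0.5):
    """Dual-stream objective: contrastive loss on the original features plus
    a weighted loss on the adversarial ones (the weight lam is an assumption)."""
    x_adv = adversarial_features(encoder, x)
    return info_nce(*encoder(x)) + lam * info_nce(*encoder(x_adv))

encoder = TwoViewEncoder()
loss = dual_stream_loss(encoder, torch.randn(8, 16))
loss.backward()  # a pre-training step would follow with an optimizer update
```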
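Finally, the abstract mentions a knowledge distillation module during fine-tuning without detail. A common realization is Hinton-style distillation, shown below: cross-entropy on the hard labels combined with KL divergence to a teacher's temperature-softened predictions. The temperature T, mixing weight alpha, and the binary benign-vs-malicious head are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def distill_loss(student_logits, teacher_logits, labels,
                 T: float = 2.0, alpha: float = 0.5):
    """Hinton-style knowledge distillation: cross-entropy on hard labels
    plus KL divergence to the teacher's softened predictions."""
    ce = F.cross_entropy(student_logits, labels)
    kd = F.kl_div(
        F.log_softmax(student_logits / T, dim=1),
        F.softmax(teacher_logits / T, dim=1),
        reduction="batchmean",
    ) * (T * T)  # rescale so the KD gradient matches the CE term's scale
    return alpha * ce + (1 - alpha) * kd

# Example: a classification head on top of the pre-trained encoder's output.
student_logits = torch.randn(8, 2, requires_grad=True)  # e.g., benign vs. malicious
teacher_logits = torch.randn(8, 2)
labels = torch.randint(0, 2, (8,))
print(distill_loss(student_logits, teacher_logits, labels))
```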