This paper reports experiments that demonstrate the importance of feature selection and of generalization to deepfake methods that deviate from the training distribution. It presents the CtrSVDD dataset, which was curated for controlled singing voice deepfake detection with enhanced controllability, diversity, and data openness.
This paper discusses the impact of recent advances in singing voice synthesis and conversion and the resulting need for singing voice deepfake detection (SVDD) models. It introduces the CtrSVDD dataset, a large-scale, diverse collection of bonafide and deepfake singing vocals. The deepfake vocals are synthesized with cutting-edge methods from publicly accessible singing voice datasets. The collection comprises 47.64 hours of bonafide and 260.34 hours of deepfake singing vocals, spanning 14 deepfake methods and 164 singer identities. The CtrSVDD benchmark dataset was curated for controlled SVDD with enhanced controllability, diversity, and data openness, in the hope that it will accelerate SVDD research. The paper describes the CtrSVDD dataset design, the baseline systems, and the experiments and results obtained with them. The CtrSVDD dataset, baseline system implementations, and trained model weights are publicly accessible.