Towards Reliable Audio Deepfake Attribution and Model Recognition: A Multi-Level Autoencoder-Based Framework







Andrea Di Pierno, Luca Guarnera, Dario Allegra, Sebastiano Battiato
Department of Mathematics and Computer Science, University of Catania, Italy
andrea.dipierno@phd.unict.it, luca.guarnera@unict.it, dario.allegra@unict.it, sebastiano.battiato@unict.it

ACM International Conference on Multimedia (MM '25) - DFF '25 Workshop








[RELATED WORKS]





Overview of the LAVA framework. At the bottom, the base model is a deep convolutional autoencoder
trained to reconstruct fake audio inputs A by minimizing the discrepancy between the original waveform A
and its reconstruction Ar, using a smoothed L1 loss. Once trained, the decoder is discarded and the
encoder E is reused as a frozen backbone for all subsequent classification tasks. At the top, an input
audio sample Ai is processed by the encoder to obtain a latent representation z, which is then passed
through an attention module. In Level 1 (ADA), the resulting attended representation z'ADA is used to
classify the sample into one of three dataset categories: ASVspoof2021 (ASV), FakeOrReal (FoR), or
CodecFake (Codec). If the sample is attributed to CodecFake and the classifier confidence exceeds a
predefined rejection threshold, the sample is forwarded to Level 2 (ADMR). Here, the same encoder and attention module
are reused to produce a second attended representation z'ADMR, which is then classified into one of six codec-specific
classes (F01–F06). To build a robust attribution model, a threshold strategy is applied at each level of
decision-making: whenever the confidence associated with a prediction drops below a predefined threshold (different for each level),
the corresponding sample is discarded and marked as "unknown".
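The two-level decision procedure described in the caption can be sketched as follows. This is a minimal illustration, not the released implementation: the function and class names, and the threshold values `tau_ada` and `tau_admr`, are hypothetical (the paper uses a per-level threshold but does not state its value here).

```python
import numpy as np

ADA_CLASSES = ["ASV", "FoR", "Codec"]                      # Level 1: dataset categories
ADMR_CLASSES = ["F01", "F02", "F03", "F04", "F05", "F06"]  # Level 2: codec-specific models

def softmax(logits):
    """Numerically stable softmax over a 1-D logit vector."""
    e = np.exp(logits - np.max(logits))
    return e / e.sum()

def hierarchical_attribution(ada_logits, admr_logits, tau_ada=0.9, tau_admr=0.9):
    """Cascade ADA -> ADMR with per-level rejection (illustrative thresholds)."""
    p_ada = softmax(np.asarray(ada_logits, dtype=float))
    i = int(np.argmax(p_ada))
    if p_ada[i] < tau_ada:
        return "unknown"                  # Level 1 rejection: low ADA confidence
    label = ADA_CLASSES[i]
    if label != "Codec":
        return label                      # ASV / FoR samples stop at Level 1
    p_admr = softmax(np.asarray(admr_logits, dtype=float))
    j = int(np.argmax(p_admr))
    if p_admr[j] < tau_admr:
        return "unknown"                  # Level 2 rejection: low ADMR confidence
    return "Codec/" + ADMR_CLASSES[j]
```

Only samples attributed to CodecFake with sufficient confidence reach the ADMR classifier; everything below either threshold is marked "unknown", which is what gives the framework its open-set behavior.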



ABSTRACT


The proliferation of audio deepfakes poses a growing threat to trust in digital communications. While detection methods have advanced, attributing audio deepfakes to their source models remains an underexplored yet crucial challenge. In this paper we introduce LAVA (Layered Architecture for Voice Attribution), a hierarchical framework for audio deepfake detection and model recognition that leverages attention-enhanced latent representations extracted by a convolutional autoencoder trained solely on fake audio. Two specialized classifiers operate on these features: Audio Deepfake Attribution (ADA), which identifies the generation technology, and Audio Deepfake Model Recognition (ADMR), which recognizes the specific generative model instance. To improve robustness under open-set conditions, we incorporate confidence-based rejection thresholds. Experiments on ASVspoof2021, FakeOrReal, and CodecFake show strong performance: the ADA classifier achieves F1-scores over 95% across all datasets, and the ADMR module reaches 96.31% macro F1 across six classes. Additional tests on unseen attacks from ASVspoof2019 LA and error propagation analysis confirm LAVA’s robustness and reliability. The framework advances the field by introducing a supervised approach to deepfake attribution and model recognition under open-set conditions, validated on public benchmarks. Models and code are publicly available at https://github.com/adipiz99/LAVA-framework.
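The smoothed L1 reconstruction objective used to pretrain the autoencoder can be written down compactly. Below is a minimal NumPy sketch following PyTorch's `SmoothL1Loss` convention with transition point `beta` (the value `beta = 1.0` is an assumption, not stated in the abstract): quadratic for small residuals, linear for large ones.

```python
import numpy as np

def smooth_l1(a, a_rec, beta=1.0):
    """Smoothed L1 loss between waveform A and its reconstruction A_r.

    Per-sample: 0.5 * d^2 / beta  if |d| < beta, else |d| - 0.5 * beta,
    averaged over all samples (d = a - a_rec).
    """
    d = np.abs(np.asarray(a, dtype=float) - np.asarray(a_rec, dtype=float))
    per_sample = np.where(d < beta, 0.5 * d ** 2 / beta, d - 0.5 * beta)
    return float(per_sample.mean())
```

The quadratic region keeps gradients small near a good reconstruction, while the linear region limits the influence of outlier samples, a common reason to prefer smoothed L1 over plain L2 for raw-waveform reconstruction.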






Download Paper | GitHub

Cite:
@inproceedings{di2025towards,
   title={Towards Reliable Audio Deepfake Attribution and Model Recognition: A Multi-Level Autoencoder-Based Framework},
   author={Di Pierno, Andrea and Guarnera, Luca and Allegra, Dario and Battiato, Sebastiano},
   booktitle={Proceedings of the 1st on Deepfake Forensics Workshop: Detection, Attribution, Recognition, and Adversarial Challenges in the Era of AI-Generated Media},
   pages={101--109},
   year={2025}
}




