
Over the last five years, facial deepfake detection has become one of the most dynamic and challenging areas in multimedia forensics. The scientific community has progressively advanced from early detection attempts to sophisticated models capable of capturing subtle artifacts and inconsistencies in manipulated content. Despite this progress, deepfake forensics remains an arms race: detectors must generalize across unseen generators, remain effective under compression, and resist adversarial manipulations such as morphing and social-media beauty filters. Recent surveys confirm both the maturity of the field and the urgency of addressing robustness, attribution, and lifelong authentication of media. Within this evolving landscape, the sAIfer Lab’s Biometric Unit (University of Cagliari) has contributed a coherent body of research, consolidated within the FF4ALL project. Our studies have introduced approaches for artifact decomposition, high-frequency enhancement for compressed content, and tensor-based modeling for scaled and compressed images. We further analyzed score-level fusion rules and generalized detection based on inconsistencies between the inner and outer face regions. More recently, we proposed quality-based artifact modeling in videos, evaluated the robustness of forensic tools under morphing and compression, and studied the impact of beauty filters on detection systems. This keynote will reflect on five years of research effort, combining a broad overview of the state of the art with the experience of our laboratory, and discuss how multidisciplinary collaboration spanning biometrics, AI, and forensics can guide the next generation of trustworthy and explainable solutions for deepfake media authentication.
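As an illustration of the score-level fusion mentioned above, the following minimal sketch combines the probability-like outputs of several detectors with fixed rules (mean, max, weighted sum). All function and variable names are illustrative assumptions, not the published implementation.

```python
# Minimal sketch of score-level fusion for deepfake detectors.
# Names and values are illustrative, not taken from the authors' code.
import numpy as np

def fuse_scores(scores: np.ndarray, rule: str = "mean", weights=None) -> np.ndarray:
    """Combine per-detector scores into a single score per sample.

    `scores` has shape (n_samples, n_detectors); each entry is assumed to be
    a probability-like value in [0, 1], where higher means "more likely fake".
    """
    if rule == "mean":
        return scores.mean(axis=1)
    if rule == "max":
        return scores.max(axis=1)
    if rule == "weighted":
        w = np.asarray(weights, dtype=float)
        return scores @ (w / w.sum())  # convex combination of detector scores
    raise ValueError(f"unknown fusion rule: {rule}")

# Example: three detectors scoring four face images.
scores = np.array([
    [0.91, 0.72, 0.85],
    [0.10, 0.25, 0.05],
    [0.60, 0.40, 0.75],
    [0.33, 0.20, 0.15],
])
fused = fuse_scores(scores, rule="weighted", weights=[0.5, 0.2, 0.3])
decisions = (fused >= 0.5).astype(int)  # 1 = flagged as fake at a fixed threshold
print(fused)
print(decisions)
```

Fixed rules such as these require no extra training; learned fusion (e.g., fitting the weights on a validation set) is a natural extension of the same score-level interface.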

The phenomenal growth of generative AI methods has made high-quality multimodal synthetic content generation possible. While synthetic data generation has many valuable applications, it has also resulted in deepfakes that are increasingly used to spread misinformation and disinformation. To address this challenge, we need deepfake detectors that can be deployed at scale in real-world settings and across different use cases. Such detectors should be trained on large and diverse datasets, generalize well to data generated by unseen methods, and provide explainable results. Furthermore, they must be effective across different languages and cultural contexts to ensure broad inclusivity. The 1 Million Deepfakes Detection Challenge is designed to provide a large-scale benchmark for detecting and localizing deepfakes, with a dataset that currently includes more than two million samples. Results from the participating methods highlight the limitations of existing approaches, particularly in accurately localizing manipulated segments. In many societies, people mix languages and dialects during conversation, making deepfakes harder for observers to detect. Most deepfake datasets, however, are monolingual, contain clean audio, and focus on Western languages. We therefore need multilingual, code-switching, and dialect-diverse audio and video datasets with realistic artifacts. The ArEnAV dataset is one such resource: it consists of Arabic–English code-switching content across multiple dialects, totaling over 765 hours and 387,072 clips generated using four TTS and two lip-sync models. Furthermore, most existing datasets focus on simple face swaps or object changes and miss more complex edits that alter the meaning of an image. To address this, we propose MultiFakeVerse, a large dataset of deepfakes generated through language-based reasoning: a vision-language model first identifies the main person in an image and decides on a realistic manipulation, and the edited image is then generated with a diffusion model. When tested on MultiFakeVerse, detection methods that perform well on traditional deepfakes perform much worse on these higher-level, meaning-altering manipulations.
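To make the described generation pipeline concrete, here is a minimal control-flow sketch. The `vlm` and `diffusion_editor` objects, their methods, and the prompts are hypothetical placeholders; this is not the MultiFakeVerse implementation, only the reason-then-edit structure it describes.

```python
# Illustrative control-flow sketch of a VLM-driven, meaning-altering edit
# pipeline. `vlm` and `diffusion_editor` are hypothetical placeholders
# (duck-typed objects); this is not the MultiFakeVerse code.
from dataclasses import dataclass

@dataclass
class EditPlan:
    target: str       # e.g., "the speaker at the podium"
    instruction: str  # e.g., "replace the folder in their hand with cash"

def plan_edit(vlm, image) -> EditPlan:
    """Use a vision-language model to pick the main subject and propose
    one realistic manipulation that changes the meaning of the scene."""
    target = vlm.query(image, "Describe the most prominent person in this image.")
    instruction = vlm.query(
        image,
        f"Propose one realistic edit involving {target} that alters "
        "the perceived meaning of the scene.",
    )
    return EditPlan(target=target, instruction=instruction)

def generate_fake(vlm, diffusion_editor, image):
    """Full pipeline: reason about the edit with the VLM, then render it
    with an instruction-guided diffusion editor."""
    plan = plan_edit(vlm, image)
    edited = diffusion_editor.edit(image, prompt=plan.instruction)
    return edited, plan  # the plan doubles as ground-truth metadata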
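Regarding the localization weakness noted in the challenge results above, temporal localization is commonly scored by intersection-over-union between predicted and ground-truth manipulated segments. A minimal sketch follows; the interval format and the 0.5 decision threshold are assumptions for illustration, not the challenge's official protocol.

```python
# Minimal sketch: temporal IoU between a predicted and a ground-truth
# manipulated segment. Segments are (start_s, end_s) tuples in seconds.
def temporal_iou(pred, gt):
    """Intersection-over-union of two time intervals."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0

# A detector that flags 3.0-6.0 s when the true manipulated span is 4.0-8.0 s
# overlaps only partially: IoU = 2 / 5 = 0.4, a miss under a 0.5 threshold.
print(temporal_iou((3.0, 6.0), (4.0, 8.0)))  # 0.4
```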