Adaptive Audio Deepfake Detection

Empowering forensic investigations with a 98-point feature extraction pipeline and ensemble machine learning to distinguish genuine human speech from AI-generated voice clones.

The Threat of Voice Cloning

As generative AI rapidly advances, malicious actors can now clone a voice with alarming accuracy from only a few seconds of source audio.

  • Social Engineering: Voice-phishing (vishing) attacks targeting financial institutions and vulnerable individuals.
  • Misinformation: Generating fake statements from public figures to manipulate public opinion.
  • Authentication Bypass: Defeating voice-biometric security systems.

The TruTone Solution

Project Overview

TruTone is an adaptive audio deepfake detection system that uses machine learning to distinguish between genuine human speech and artificially synthesized voice recordings. The system addresses the growing challenge posed by AI-generated audio, which has become increasingly realistic and difficult to detect using conventional methods.

The system is implemented in Python with a forensic-grade feature extraction pipeline that computes a 98-point numerical vector from each audio file. This vector captures Mel-Frequency Cepstral Coefficients (MFCCs), their first- and second-order derivatives, spectral centroid, spectral roll-off, spectral bandwidth, zero-crossing rate, RMS energy, chroma features, and spectral flatness.

A Weighted Soft-Voting Ensemble combining a Random Forest classifier and a Logistic Regression model is trained on a labeled dataset of 841 audio samples. The ensemble achieves an accuracy of 96.9%, a ROC-AUC score of 0.982, and a false positive rate of 2.1%. Deployed with a high-performance REST API, TruTone allows users to upload audio files and receive a forensic verdict in under 300 milliseconds.

Project Objectives

Designing an accurate and accessible defense mechanism.

98-Point Extraction

To develop a forensic feature extraction pipeline that captures a comprehensive set of spectral, temporal, and harmonic properties from audio recordings.

Ensemble Classifier

To train and evaluate a Weighted Soft-Voting Ensemble combining Random Forest and Logistic Regression models on a balanced dataset of genuine and synthetic speech.

Multi-Tiered Verdicts

To implement a multi-tiered verdict scheme that distinguishes between definitive and borderline detection results, reducing the risk of false accusations.
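One way such a tiered scheme might look; the thresholds and verdict labels below are illustrative assumptions rather than TruTone's published cut-offs.

```python
def verdict(p_synthetic: float) -> str:
    """Map the ensemble's P(synthetic) to a tiered forensic verdict.
    Thresholds are illustrative assumptions."""
    if p_synthetic >= 0.90:
        return "Synthetic (high confidence)"
    if p_synthetic >= 0.60:
        return "Likely synthetic"
    if p_synthetic > 0.40:
        return "Inconclusive - human review advised"
    if p_synthetic > 0.10:
        return "Likely genuine"
    return "Genuine (high confidence)"
```

Reserving an explicit "inconclusive" band around the decision boundary is what reduces the risk of a borderline score being presented as a definitive accusation.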

Web & API Deployment

To deploy the detection engine as a web application with a REST API, enabling both interactive and programmatic access for seamless integration.
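A minimal sketch of such an endpoint with Flask. The route path, form-field name, and response schema are assumptions, and the classifier is stubbed out where the real feature extraction and ensemble would run.

```python
from flask import Flask, jsonify, request

app = Flask(__name__)

def score_clip(wav_bytes: bytes) -> float:
    """Placeholder for the real pipeline: extract the feature vector
    and return the ensemble's P(synthetic). Stubbed here."""
    return 0.5

@app.route("/api/analyze", methods=["POST"])  # path is an assumption
def analyze():
    if "audio" not in request.files:
        return jsonify(error="no audio file provided"), 400
    wav_bytes = request.files["audio"].read()  # kept in memory, never written to disk
    return jsonify(p_synthetic=score_clip(wav_bytes))
```

In development the route can be exercised with `app.test_client()`; in production the engine would sit behind a WSGI server for programmatic access alongside the interactive web UI.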

Usability Evaluation

To evaluate the system's usability with real users and incorporate their feedback into the interface design; the final interface achieved a System Usability Scale (SUS) score of 81.2/100.

System Architecture

A robust, scalable pipeline from audio ingestion to forensic verdict.


High-Level Flow

The architecture is designed for speed and reliability. Uploaded audio files bypass permanent storage, moving directly into the memory-resident extraction pipeline.

Sub-300ms Latency

Optimized for near-real-time forensic analysis.
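The storage-bypassing step above can be sketched with the standard-library `wave` module and an in-memory buffer; the real pipeline decodes with Librosa, but the same file-like-object pattern applies.

```python
import io
import math
import struct
import wave

# Build a tiny WAV file entirely in memory (stands in for an uploaded clip).
buf = io.BytesIO()
with wave.open(buf, "wb") as w:
    w.setnchannels(1)
    w.setsampwidth(2)      # 16-bit PCM
    w.setframerate(16000)
    samples = (int(32767 * 0.3 * math.sin(2 * math.pi * 440 * t / 16000))
               for t in range(16000))
    w.writeframes(b"".join(struct.pack("<h", s) for s in samples))

# Decode straight from the in-memory buffer -- no temp file ever touches disk.
buf.seek(0)
with wave.open(buf, "rb") as w:
    sr = w.getframerate()
    n_frames = w.getnframes()
```

Skipping the filesystem removes both an I/O round-trip and a retention liability, which is what keeps end-to-end latency under the 300 ms budget.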


Detailed Component Architecture

The system separates concerns across Presentation, Application, Service, and Data layers.

  • Flask App Layer: Routing and secure file handling.
  • Feature Extraction: Librosa-powered vectorization.
  • Inference Engine: scikit-learn voting ensemble.

Live Demonstration

Watch TruTone analyze and classify audio deepfakes in real-time.

Get In Touch

Have questions about the TruTone architecture or interested in research collaboration? Reach out to the development team.

Email Us

For all inquiries, please reach out via email:

info@true-tone.org