DO NOT RECORD ME

Sub-Audible Signal Standard for Audio Capture and Transcription Opt-out

Authors

George Danforth
Jonathan Mortensen
https://confident.security

Abstract

This document proposes a standardized method for individuals to signal their desire to opt out of audio recording and transcription through a sub-audible digital signal embedded in audio streams. The DO NOT RECORD ME (DNRM) signal standard aims to provide users with a technical mechanism to express their privacy preferences regarding automated speech transcription and audio capture systems.

1. Introduction

1.1 Problem Statement

As AI-powered audio transcription technology becomes more powerful and more cost-effective to run, tools built on it are proliferating across a wide range of use cases. While many such use cases are seemingly benign (e.g. transcribing work meetings), others are explicitly adversarial (e.g. cheating on job interviews). Whenever an individual has not explicitly consented to being the subject of an AI-powered transcription, there is the potential for an abuse of that individual's privacy.

There is currently no technical mechanism that allows individuals to opt out of AI-powered transcription. Whether an individual's privacy is respected is left entirely to the discretion of whoever operates the transcription software.

1.2 Proposed Solution Overview

We propose DO NOT RECORD ME (DNRM), a standardized infrasonic (sub-audible) digital signal that individuals can inject into their audio streams to indicate that they opt out of audio capture and AI audio transcription. The DNRM signal functions as an audio-domain analog to the HTTP Do Not Track (DNT) header, establishing a technical mechanism for individuals to assert their privacy preferences in digital spaces where audio capture and transcription may occur. When a compliant transcription system detects the DNRM signal, it excludes the originating speaker's audio stream from transcription or capture.

2. Requirements

2.1 Functional Requirements

Must interfere with the clean audio signal as little as possible
Must be detectable using standard audio processing techniques
Must be difficult to accidentally trigger
Should be resistant to common audio processing (compression, filtering)

2.2 Non-Functional Requirements

Signal generation and detection must be achievable with standard consumer hardware (smartphones, laptops)

3. Technical Specification

3.1 Signal Characteristics

Frequency Selection

The DNRM signal operates in the infrasonic range to ensure compatibility with most widely used digital voice and video communication systems. This design choice addresses the constraint that most digital audio streams used for voice transmission use a 16 kHz sample rate. A 16 kHz stream can represent frequencies only up to 8 kHz (the Nyquist limit), so ultrasonic frequencies (above roughly 20 kHz) are severely undersampled and cannot be detected. Infrasonic frequencies, by contrast, are heavily oversampled in these systems, ensuring reliable detection.
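
The sampling argument above can be checked numerically. The following is a minimal sketch, assuming NumPy is available; the helper names `nyquist_hz` and `samples_per_cycle` and the 21 kHz test tone are illustrative choices, not part of the proposal. It shows that an ultrasonic tone sampled at 16 kHz produces the same samples as an in-band tone, while a 2 Hz component receives thousands of samples per cycle.

```python
import numpy as np

SAMPLE_RATE = 16_000  # Hz, the voice sample rate assumed in the text


def nyquist_hz(sample_rate: int) -> float:
    """Highest frequency a stream at this sample rate can represent."""
    return sample_rate / 2


def samples_per_cycle(freq_hz: float, sample_rate: int = SAMPLE_RATE) -> float:
    """Number of samples taken during one cycle of a tone."""
    return sample_rate / freq_hz


t = np.arange(SAMPLE_RATE) / SAMPLE_RATE  # one second of sample times

# Illustrative tone: 21 kHz is ultrasonic and above the 8 kHz Nyquist limit.
ultrasonic = np.cos(2 * np.pi * 21_000 * t)
# 21 kHz folds down to 21 kHz - 16 kHz = 5 kHz when sampled at 16 kHz.
alias = np.cos(2 * np.pi * 5_000 * t)

# The two sampled sequences match to within floating-point error.
indistinguishable = bool(np.allclose(ultrasonic, alias))

print(nyquist_hz(SAMPLE_RATE), indistinguishable, samples_per_cycle(2))
```

In practice an anti-aliasing filter usually removes such a tone before sampling, so it is lost rather than folded. Either way, a detector on the 16 kHz stream cannot identify it as ultrasonic.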

Signal Composition

The DNRM signal consists of three summed sine waves at frequencies of 2 Hz, 3 Hz, and 5 Hz. This combination provides the following advantages:

All component frequencies remain well within the infrasonic range
The composite waveform creates a distinctive signature that is unlikely to occur naturally in speech or environmental audio
Each component frequency can be trivially extracted using Fast Fourier Transform (FFT) analysis, enabling robust detection even in the presence of noise
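
As an illustration of the composition described above, the sketch below builds the three-component waveform and recovers its frequencies with an FFT. It assumes NumPy, unit amplitude per component, and zero starting phase for each sine; the proposal does not specify phases, and the function name `dnrm_composite` is hypothetical.

```python
import numpy as np

SAMPLE_RATE = 16_000          # Hz, the voice sample rate assumed in the text
DNRM_FREQS_HZ = (2, 3, 5)     # component frequencies from the text


def dnrm_composite(duration_s: float, sample_rate: int = SAMPLE_RATE) -> np.ndarray:
    """Unscaled sum of the three sines (zero phase is an assumption)."""
    t = np.arange(int(duration_s * sample_rate)) / sample_rate
    return sum(np.sin(2 * np.pi * f * t) for f in DNRM_FREQS_HZ)


# One second of signal gives an FFT bin spacing of exactly 1 Hz.
signal = dnrm_composite(1.0)
spectrum = np.abs(np.fft.rfft(signal))
bin_hz = SAMPLE_RATE / len(signal)

# The three largest bins should be the three component frequencies.
peak_bins = np.argsort(spectrum)[-3:]
peak_freqs = sorted(float(b * bin_hz) for b in peak_bins)
print(peak_freqs)
```

Because 2, 3, and 5 Hz share a greatest common divisor of 1 Hz, the composite waveform repeats exactly once per second, which is why a one-second window places every component on its own FFT bin.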

Amplitude Specification

The signal employs an amplitude control constant of 0.0025 (approximately -52 dB, taking 1.0 as full scale), maintaining the entire signal below -50 dB. Most consumer speakers and headphones already filter out frequencies outside the audible range before reproduction. The low amplitude is an additional safeguard against unintended audibility when an audio stream is connected to high-end equipment with extended low-frequency response.
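
The relationship between the amplitude constant and the -50 dB figure can be worked through as follows. This is a sketch under a stated assumption: the text does not say whether the constant scales each component or the composite, so here the sum is divided by the number of components before scaling, which keeps the peak at or below 0.0025. The helper `to_dbfs` is a hypothetical name, and 1.0 is taken as full scale.

```python
import math

import numpy as np

AMPLITUDE = 0.0025            # amplitude control constant from the text
SAMPLE_RATE = 16_000          # Hz
DNRM_FREQS_HZ = (2, 3, 5)


def to_dbfs(amplitude: float) -> float:
    """Convert a linear amplitude (1.0 = full scale) to decibels."""
    return 20.0 * math.log10(amplitude)


# One full period (1 s) of the composite, zero phase assumed.
t = np.arange(SAMPLE_RATE) / SAMPLE_RATE
unit_sum = sum(np.sin(2 * np.pi * f * t) for f in DNRM_FREQS_HZ)

# Assumption: normalize by the component count so the peak cannot exceed
# AMPLITUDE, then apply the constant.
scaled = AMPLITUDE * unit_sum / len(DNRM_FREQS_HZ)

constant_db = to_dbfs(AMPLITUDE)                     # about -52.04 dB
peak_db = to_dbfs(float(np.max(np.abs(scaled))))     # at or below constant_db
print(round(constant_db, 2), peak_db < -50.0)
```

If the constant were instead applied to each component separately, the summed peak could approach three times 0.0025, roughly 9.5 dB higher, so the normalization choice matters when checking a generator against the -50 dB bound.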

3.2 Detection Algorithm

Detection of the DNRM signal follows a straightforward signal processing approach:

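Audio Buffering: Incoming audio streams are buffered into discrete chunks for processing. Assuming a 16 kHz sample rate, the recommended minimum chunk size is 8,000 samples (0.5 seconds).
Frequency Analysis: Apply a windowed Fast Fourier Transform (FFT) to each audio chunk. Assuming a 16 kHz sample rate, the recommended window size is 8,000 samples. At this size the FFT bin spacing is 2 Hz (16,000 / 8,000), so the 3 Hz and 5 Hz components fall between bins; placing all three components on separate bins requires a window of at least 16,000 samples (1 Hz spacing).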
Component Detection: Examine the FFT output for the presence of spectral peaks at 2 Hz, 3 Hz, and 5 Hz. All three frequency components must be present for a positive detection within a single window.
Amplitude Threshold: Each detected frequency component must exceed -80 dB to be considered valid. This threshold distinguishes intentional DNRM signals from noise floor artifacts while accounting for signal attenuation through transmission and processing.
Persistence Validation: To confirm the presence of a DNRM signal and eliminate false positives from transient noise, the signal must be detected in at least 90% of windows within a rolling time period (e.g., 90% of windows in any 2-second interval). This persistence threshold ensures that brief spectral coincidences do not trigger false opt-outs while maintaining responsive detection of genuine DNRM signals. When the persistence threshold is met, the detection algorithm should flag the audio stream as containing a DNRM signal and initiate appropriate handling per the system's compliance implementation.
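
The detection steps above can be sketched as follows. This is a minimal illustration, not a reference implementation, and it makes several assumptions that are not in the proposal: a 16,000-sample window (1 Hz bin spacing, above the stated 8,000-sample minimum), a 0.2-second hop between windows, a rolling period of 10 consecutive windows, and no taper, since the components fall on exact bins at this window length. The function names are hypothetical.

```python
import numpy as np

SAMPLE_RATE = 16_000          # Hz, as assumed in the text
DNRM_FREQS_HZ = (2, 3, 5)     # component frequencies from the text
THRESHOLD_DB = -80.0          # per-component threshold from the text
PERSISTENCE = 0.9             # required fraction of positive windows
WINDOW = 16_000               # assumption: 1 s window, 1 Hz bin spacing
HOP = 3_200                   # assumption: 0.2 s between window starts
PERIOD_WINDOWS = 10           # assumption: windows per rolling period


def window_has_dnrm(window: np.ndarray) -> bool:
    """True if all three components exceed the threshold in one window."""
    n = len(window)
    # Remove the mean so a DC offset does not dominate the lowest bins.
    spectrum = np.fft.rfft(window - window.mean())
    # Scale so a sine of amplitude 1.0 centered on a bin reads 1.0 (0 dB).
    amplitude = 2.0 * np.abs(spectrum) / n
    for freq in DNRM_FREQS_HZ:
        k = round(freq * n / SAMPLE_RATE)  # bin index for this frequency
        level_db = 20.0 * np.log10(max(amplitude[k], 1e-12))
        if level_db <= THRESHOLD_DB:
            return False
    return True


def stream_has_dnrm(samples: np.ndarray) -> bool:
    """True if some rolling period has enough positive windows."""
    starts = range(0, len(samples) - WINDOW + 1, HOP)
    hits = np.array([window_has_dnrm(samples[s:s + WINDOW]) for s in starts])
    if len(hits) < PERIOD_WINDOWS:
        return False  # not enough audio to fill one rolling period
    # Count positive windows in every run of PERIOD_WINDOWS windows.
    kernel = np.ones(PERIOD_WINDOWS, dtype=int)
    counts = np.convolve(hits.astype(int), kernel, mode="valid")
    return bool(np.any(counts >= PERSISTENCE * PERIOD_WINDOWS))


# Usage: five seconds of a 440 Hz tone with and without the DNRM signal.
t = np.arange(5 * SAMPLE_RATE) / SAMPLE_RATE
tone = 0.1 * np.sin(2 * np.pi * 440 * t)
dnrm = (0.0025 / 3) * sum(np.sin(2 * np.pi * f * t) for f in DNRM_FREQS_HZ)
print(stream_has_dnrm(tone + dnrm), stream_has_dnrm(tone))
```

Real audio with strong low-frequency content or broadband noise may need a tapered, longer window and a threshold tuned against the observed noise floor; the constants here are placeholders for experimentation.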

4. Open Questions

How should detection implementations handle signal conflicts, i.e., when processed audio contains mixed DNRM and non-DNRM speakers? Should the DNRM signal be considered "viral" in this case and apply to all audio streams in a co-mingled digital space?
What incentives would encourage voluntary adoption of the DNRM standard by transcription software vendors or open source implementations?
What is the appropriate layer of the software stack at which DNRM signal detection should be implemented?