AI and Human Interaction Reaches a New Level: ChatGPT's Advanced Voice Mode
In the summer of 2024, OpenAI began rolling out the highly anticipated Advanced Voice Mode for ChatGPT. Leveraging the multimodal capabilities of the GPT-4o model, this feature opened a new dimension in communication with artificial intelligence. Initially available to a select group of paid (Plus) subscribers, it offered hyper-realistic, real-time voice interactions, significantly reducing the latency of earlier voice features and enabling more natural conversations.

The Advanced Voice Mode fundamentally changed the interaction between users and ChatGPT. While earlier voice functions used separate models for speech-to-text and text-to-speech conversions, the GPT-4o model can natively handle audio inputs and outputs. This multimodal approach allows for near-instantaneous responses and a smoother, more fluid conversation flow.
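The consumer feature's internals are not public, but OpenAI's developer API exposes both architectures, so the contrast can be sketched. The snippet below, using the OpenAI Python SDK, shows the older path chaining three models (speech-to-text, a text LLM, text-to-speech) against a single call to an audio-capable GPT-4o variant. The model names (whisper-1, gpt-4, tts-1, gpt-4o-audio-preview) and file names are illustrative assumptions, not a description of how the ChatGPT app itself is built:

```python
# pip install openai  -- an illustrative sketch, not OpenAI's product code
import base64
from openai import OpenAI

client = OpenAI()

# Pipeline approach: three separate models chained together.
# Each hop adds latency, and tone of voice is lost in the text bottleneck.
with open("question.wav", "rb") as f:
    transcript = client.audio.transcriptions.create(model="whisper-1", file=f)
reply = client.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "user", "content": transcript.text}],
)
speech = client.audio.speech.create(
    model="tts-1", voice="alloy", input=reply.choices[0].message.content
)
speech.stream_to_file("reply_pipeline.mp3")

# Native multimodal approach: one model consumes and produces audio directly.
with open("question.wav", "rb") as f:
    question_b64 = base64.b64encode(f.read()).decode()
completion = client.chat.completions.create(
    model="gpt-4o-audio-preview",  # public audio-capable GPT-4o variant
    modalities=["text", "audio"],
    audio={"voice": "alloy", "format": "wav"},
    messages=[{
        "role": "user",
        "content": [{
            "type": "input_audio",
            "input_audio": {"data": question_b64, "format": "wav"},
        }],
    }],
)
with open("reply_native.wav", "wb") as f:
    f.write(base64.b64decode(completion.choices[0].message.audio.data))
```

Because the second path never flattens the exchange to text, cues like intonation survive end to end, which is what makes the kind of emotion handling described below possible.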
Advanced Voice Mode Capabilities
At launch, Advanced Voice Mode offered testers several groundbreaking features:
- Real-time interaction: Minimal latency between question and answer, enabling natural dialogue.
- Interruptibility: Users could interrupt ChatGPT mid-sentence, just like in a human conversation (a sketch of how this kind of barge-in handling works appears after this list).
- Emotion detection and expression: The system could recognize emotions in the user's tone of voice (e.g., sadness, excitement) and respond with similarly nuanced, emotional tones.
- Preset voices: To prevent misuse (e.g., voice cloning), OpenAI initially limited the response voices to four options (Juniper, Breeze, Cove, Ember) created with professional voice actors. These replaced the controversial "Sky" voice featured in an earlier demo.
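The first two items turn on streamed audio and fast turn detection. The Advanced Voice Mode product code is unpublished, but OpenAI's separate Realtime API (released later in 2024) exposes the same two ingredients, so a hedged sketch can show the mechanism. The model and event names follow the public Realtime API documentation; audio capture and playback are elided:

```python
# pip install websocket-client  -- a minimal sketch over OpenAI's Realtime API,
# not the Advanced Voice Mode's actual implementation.
import json
import os
import websocket

URL = "wss://api.openai.com/v1/realtime?model=gpt-4o-realtime-preview"
HEADERS = [
    f"Authorization: Bearer {os.environ['OPENAI_API_KEY']}",
    "OpenAI-Beta: realtime=v1",
]

def on_open(ws):
    # Server-side voice activity detection: the server watches the incoming
    # microphone stream and cancels an in-progress response when the user
    # starts talking -- this is what makes the model interruptible.
    ws.send(json.dumps({
        "type": "session.update",
        "session": {
            "voice": "alloy",
            "turn_detection": {"type": "server_vad"},
        },
    }))

def on_message(ws, raw):
    event = json.loads(raw)
    if event["type"] == "response.audio.delta":
        pass  # base64 audio chunk: feed it to the speaker as it arrives
    elif event["type"] == "input_audio_buffer.speech_started":
        pass  # user barged in: stop local playback immediately

ws = websocket.WebSocketApp(URL, header=HEADERS, on_open=on_open, on_message=on_message)
ws.run_forever()
```

The design point is that interruption is detected server-side from the live microphone stream rather than by waiting for a completed utterance; the client's only job is to stop playback the moment the speech-started event arrives.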
Gradual Rollout and Safety Measures
From the beginning, OpenAI emphasized a cautious, gradual rollout and the importance of safety. The alpha phase in July 2024 started with a small user group, with plans to make the feature available to all Plus subscribers by the fall of 2024. Before the wider release, OpenAI worked with more than 100 external testers across 45 languages to identify and mitigate potential risks.
Robust safety measures were implemented, including filters to prevent the generation of violent, hateful, or copyrighted content in audio format. Dedicated systems were built to ensure the model speaks only in the authorized preset voices, preventing impersonation of known individuals or of the user's own voice.
Background: The "Sky" Voice Case
The development of the Advanced Voice Mode was overshadowed by the controversy surrounding the "Sky" voice, demonstrated in May 2024. Many believed the voice bore a striking resemblance to actress Scarlett Johansson, who had previously declined an offer from OpenAI to voice the system. Johansson publicly expressed her shock and disapproval. Although OpenAI denied intentionally mimicking the actress (and later investigations revealed the voice actor for Sky was hired months before Johansson was approached), the controversy led to the removal of the "Sky" voice before wider testing began.
At the time of the July 2024 launch, OpenAI indicated plans to enhance the voice mode with capabilities such as real-time video analysis and screen sharing, and said it would release a detailed safety report in August.
Update (April 14, 2025)
Since the original article's publication in July 2024, ChatGPT's Advanced Voice Mode has undergone significant development and become more widely available:
- Full Rollout for Paid Users: As planned, OpenAI extended Advanced Voice Mode access to all ChatGPT Plus, Team, Pro, Enterprise, and Edu users in the fall of 2024. It became the default voice mode for paid tiers on mobile, desktop, and web interfaces.
- Availability for Free Users: Starting February 2025, free ChatGPT users can also experience Advanced Voice Mode, albeit with daily time limits. For them, the feature is powered by the GPT-4o mini model.
- New Features:
  - Video and Screen Sharing: The previously announced real-time video analysis and screen sharing capabilities became available for paid users in the mobile apps (iOS and Android) starting December 2024.
  - Memory and Custom Instructions: These features have been integrated into the voice mode, allowing ChatGPT to remember past conversations and adhere to user-defined preferences.
  - More Voices & Improved Pronunciation: The number of available voices increased to nine (e.g., Arbor, Maple, Sol), with seasonal options also appearing. OpenAI continues to refine the naturalness of the voices and the handling of different accents.
  - Fewer Interruptions: A March 2025 update improved the system's ability to avoid interrupting the user during thinking pauses, making dialogue even smoother.
- Safety Report and Concerns: OpenAI published the GPT-4o System Card in August 2024, detailing extensive testing and built-in safety measures. It confirmed the use of preset voices and content filtering but also highlighted risks like anthropomorphism (attributing human qualities to AI), potential emotional attachment, and rare instances of unintentional voice mimicry requiring further refinement.
- Usage Limits: Usage of Advanced Voice Mode is subject to daily limits that vary depending on the user tier (Free, Plus, Pro, etc.).
Overall, ChatGPT's Advanced Voice Mode has been successfully rolled out and continues to evolve, bringing interactions with AI closer to natural human conversation, while OpenAI strives to manage the associated safety and ethical challenges.