Below, we compile a selection of key tools and services for synthetic voices and synthetic sound effects, covering generation, design, cloning, and detection (updated: December 2024).
Voice Synthesis Tools (e.g. Text-to-Speech, Voice Cloning, Voice Design)
ElevenLabs: One of the leading platforms in voice synthesis and text-to-speech technology, using advanced AI to create lifelike, customizable voices with remarkable precision and versatility. It has an intuitive interface, featuring:
- Extensive Voice Library: Over 40 voice models are available, varying in gender, age, and nationality, each with unique style and tone tags such as articulate, friendly, raspy, emotional, conversational, confident, seductive, expressive, anxious, childish, excited, and more. Generation parameters such as tempo, temperature, and similarity can also be adjusted.
- Voice Design Services: Create custom voices from prompts. This currently works best with human-oriented vocal descriptions or specific character archetypes (e.g., a deep, raspy, professional British male or a massive, evil ogre).
- Voice Cloning: Offers rapid voice cloning in as little as 2 minutes for basic use cases or 30 minutes for more professional results.
- Also: voiceover production, dubbing, voice isolation, browser-based text-to-speech tools, and a deepfake detection tool capable of detecting synthetic voices generated by ElevenLabs. An API is also available.
Hume AI: One of the leading platforms for expressive voice synthesis, specifically designed to understand and generate nuanced emotional expressions through voice, facial expressions, and text. It has an intuitive interface, featuring:
- Several Emotion-Driven Voice Agents: These aim to capture complex states like empathy, excitement, sadness, or calm. You can choose a vocal preset as well as the LLM used for text generation.
- Voice Design: you can create & save a new voice from an existing one, specifying different vocal attributes on a scale of -100 to 100 (
- Emotional Analysis: Real-time emotion recognition from voice and facial cues (via camera and mic). It can detect vocal expressions (28 kinds, including Adoration, Amusement, Anger, Awe, Confusion, Contempt, Contentment, Desire, Disappointment, Disgust, Distress, Ecstasy, Embarrassment, Excitement, Fear, Horror, Interest, Joy, Pain, Realization, Relief, Sadness, Surprise, Sympathy, Tiredness, etc.) and classify vocal calls (inferring probabilities for 67 descriptors, like ‘laugh’, ‘sigh’, ‘shriek’, ‘oh’, ‘ahh’, ‘mhm’, etc.). Some custom models further specialize in detecting specific traits from vocal recordings, such as confidence, Parkinson's, toxicity of speech, depressed mood, attentiveness, father authenticity, alertness, etc.
- Research: Hume AI’s research develops AI models to map and understand human emotions across diverse expressive behaviors, analyzing facial expressions, speech prosody, and vocal bursts, with the aim of building models that align with human well-being. They emphasize a continuous, multidimensional approach rather than traditional discrete categories.
- An API is also available.
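Since the expression and vocal-call outputs described above are per-descriptor probabilities, a common first step is simply to rank them. Below is a minimal Python sketch, assuming the scores arrive as a plain mapping from descriptor name to probability. The function name `top_expressions`, the `min_score` threshold, and the sample values are hypothetical illustrations and are not taken from Hume AI's API.

```python
from typing import Dict, List, Tuple


def top_expressions(
    scores: Dict[str, float], k: int = 3, min_score: float = 0.0
) -> List[Tuple[str, float]]:
    """Return the k highest-scoring descriptors at or above min_score."""
    kept = [(name, p) for name, p in scores.items() if p >= min_score]
    # Sort by descending score; break ties alphabetically for stable output.
    kept.sort(key=lambda item: (-item[1], item[0]))
    return kept[:k]


# Hypothetical scores for one audio segment (illustrative values only,
# not real output from any service).
sample = {"Joy": 0.62, "Excitement": 0.48, "Surprise": 0.15, "Sadness": 0.03}
print(top_expressions(sample, k=2))
```

Keeping the ranking separate from any network call makes it easy to reuse the same helper for both the expression categories and the vocal-call descriptors.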
Resemble.ai: Another leading voice synthesis platform offering advanced tools for creating realistic, customizable voices. It provides text-to-speech with an extensive voice library (and parameters like tempo and temperature), speech-to-speech, voice cloning from 10 seconds of audio, recording editing (e.g. removing fillers), multilingual support across 149 languages, and deepfake detection. An API is included.
OpenAI GPT Audio Models: [Advanced Voice Mode is only available to Plus, Team, Enterprise, and Edu users in the iOS / Android mobile apps.]
OpenAI recently launched an Advanced Voice assistant mode with a natural-sounding voice, which lets users chat and which reacts and adapts in tone.
For now, it offers limited options for vocal expression compared to many other platforms.
Sound Generation Tools (e.g. Text-to-SFX, Text-to-Sound)
ElevenLabs allows users to generate sound effects (SFX) from a text description (with a prompt of up to 448 characters). It can produce audio output ranging from 0.5 to 22 seconds. The quality and accuracy of results can vary depending on the prompt.
Genny Lovo AI allows users to generate sound effects (SFX) from a text description (with a prompt of up to 150 characters) and produces audio of up to 10 seconds. It exposes a temperature-like (‘creative’) parameter. Results are typically very (intensely) cinematic.
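The prompt and duration limits quoted above are easy to check before submitting a request. The following Python sketch is a local pre-flight check only and makes no calls to either service. The names `SfxLimits`, `LIMITS`, and `check_request` are hypothetical, and the numbers are the ones quoted in this section, so they may be out of date.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class SfxLimits:
    max_prompt_chars: int
    max_seconds: float
    min_seconds: Optional[float] = None  # None: no minimum stated in the text


# Limits as quoted in this section (December 2024); treat them as assumptions.
LIMITS = {
    "elevenlabs": SfxLimits(max_prompt_chars=448, max_seconds=22.0, min_seconds=0.5),
    "lovo": SfxLimits(max_prompt_chars=150, max_seconds=10.0),
}


def check_request(tool: str, prompt: str, seconds: float) -> List[str]:
    """Return a list of problems; an empty list means the request fits."""
    limits = LIMITS[tool]
    problems = []
    if len(prompt) > limits.max_prompt_chars:
        problems.append(
            f"prompt is {len(prompt)} chars, limit is {limits.max_prompt_chars}"
        )
    if seconds > limits.max_seconds:
        problems.append(f"duration {seconds}s exceeds {limits.max_seconds}s")
    if limits.min_seconds is not None and seconds < limits.min_seconds:
        problems.append(f"duration {seconds}s is below {limits.min_seconds}s")
    return problems


print(check_request("elevenlabs", "rain on a tin roof", 5.0))
print(check_request("lovo", "x" * 151, 12.0))
```

Returning a list of problems rather than raising on the first one lets a caller report every violated limit at once.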
Stability Audio Models: Stability AI offers several high-quality audio models.
Stable Audio Open is an open-source model optimized for generating short audio samples, sound effects, and production elements from text prompts. It is freely available here and involves some (minimal) coding.
Stable Audio 2.0 offers both text-to-audio and audio-to-audio generation, allowing users to generate tracks of up to 3 minutes from a text description or to upload and transform samples using prompts. More details here and try here. Coming soon to their API.
Speech-To-Text Tools
Coming soon.