Skip to content
GCC AI Research

Search

Results for "voice AI"

The future of audio AI: adoption use cases powering the Middle East

MBZUAI ·

ElevenLabs, a voice AI research and product company, presented at MBZUAI's Incubation and Entrepreneurship Center (IEC) on the adoption of audio AI in the Middle East. Hussein Makki, general manager for the Middle East at ElevenLabs, highlighted the potential of voice-native AI across sectors like telecommunications, banking, and education. ElevenLabs focuses on making content accessible and engaging across languages and voices through its text-to-speech models. Why it matters: This signals growing interest and investment in voice AI applications within the region, potentially transforming customer service and content accessibility in Arabic.

Your voice can jailbreak a speech model – here’s how to stop it, without retraining

MBZUAI ·

A new paper from MBZUAI demonstrates that state-of-the-art speech models can be easily jailbroken using audio perturbations to generate harmful content, achieving success rates of 76-93% on models like Qwen2-Audio and LLaMA-Omni. The researchers adapted projected gradient descent (PGD) to the audio domain to optimize waveforms that push the model towards harmful responses. They propose a defense mechanism based on post-hoc activation patching that hardens models at inference time without retraining. Why it matters: This research highlights a critical vulnerability in speech-based LLMs and offers a practical solution, contributing to the development of more secure and trustworthy AI systems in the region and globally.

Text-to-speech system brings real-time speech to LLMs

MBZUAI ·

MBZUAI researchers developed LLMVoX, a system enabling LLMs to produce real-time speech, including Arabic. LLMVoX addresses limitations of existing end-to-end and cascaded pipeline approaches, which suffer from either degraded reasoning or latency. LLMVoX was developed as part of Project OMER, which was recently awarded Regional Research Grant from Meta. Why it matters: This enhances the potential of LLMs to function as more natural, multimodal virtual assistants, especially for Arabic-speaking users in the Middle East.

LLMVoX: Autoregressive Streaming Text-to-Speech Model for Any LLM

arXiv ·

MBZUAI researchers introduce LLMVoX, a 30M-parameter, LLM-agnostic, autoregressive streaming text-to-speech (TTS) system that generates high-quality speech with low latency. The system preserves the capabilities of the base LLM and achieves a lower Word Error Rate compared to speech-enabled LLMs. LLMVoX supports seamless, infinite-length dialogues and generalizes to new languages with dataset adaptation, including Arabic.

Past, Present and Future of Speech Technologies

MBZUAI ·

Pedro J. Moreno, former head of ASR R&D at Google, presented a talk at MBZUAI on the past, present, and future of speech technologies. The talk covered the evolution of speech tech, his career contributions including work on Google Voice search, and the impact of LLMs on speech science. He also discussed the interplay between foundational and applied research and preparing the next generation of scientists. Why it matters: The talk provides insights into the trajectory of speech technologies from a leading researcher, highlighting future directions and the ethical considerations surrounding AI's impact on society.

Research talk on Privacy and Security Issues in Speech

MBZUAI ·

A research talk was given on privacy and security issues in speech processing, highlighting the unique privacy challenges due to the biometric information embedded in speech. The talk covered the legal landscape, proposed solutions like cryptographic and hashing-based methods, and adversarial processing techniques. Dr. Bhiksha Raj from Carnegie Mellon University, an expert in speech and audio processing, delivered the talk. Why it matters: As speech-based interfaces become more prevalent in the Middle East, understanding and addressing the associated privacy risks is crucial for ethical AI development and deployment.

Making human-machine conversation more lifelike than ever at GITEX

MBZUAI ·

MBZUAI researchers demonstrated a low-latency, multilingual multimodal AI system at GITEX that integrates speech, text, and visual capabilities for more lifelike human-machine conversation. The demo, led by Dr. Hisham Cholakkal, includes a mobile app where users can point their camera at an object and ask questions, receiving spoken answers in multiple languages. They are also integrating the model into a robot dog that can respond to voice commands. Why it matters: This work addresses key challenges in deploying LLMs to real-world applications in the Middle East, such as multilingual support and real-time responsiveness.