OmniAudio 2.6B

On-device audio-text model

OmniAudio 2.6B is an audio-language model that processes both text and audio inputs quickly and efficiently. By integrating three components, it enables secure, responsive audio-text processing directly on edge devices, which suits it to voice QA, conversation, creative content generation, recording summarization, and voice tone modification. Because it unifies ASR and LLM capabilities in a single architecture rather than chaining separate models, it keeps latency and resource overhead low, delivering 5.5x to 10.3x faster performance on consumer hardware than traditional approaches. With OmniAudio 2.6B you can process offline voice queries, hold conversations, generate creative content, summarize recordings, and modify voice tones, all without relying on internet connectivity. Its efficient design makes it a fit for applications ranging from voice assistants to creative writing tools.

Published by NexaAIDev under the Apache-2.0 license.

Model Overview

OmniAudio-2.6B is a 2.6B-parameter multimodal model that handles both text and audio inputs and is built for on-device deployment.

Capabilities

So, what can OmniAudio-2.6B do?

Primary Tasks

  • Voice QA without Internet: Process offline voice queries and provide practical guidance even without network connectivity.
  • Voice-in Conversation: Engage in supportive talk and active listening, making it a good companion for personal conversations.
  • Creative Content Generation: Transform voice prompts into creative pieces, such as haikus or stories.
  • Recording Summary: Summarize lengthy recordings into concise, actionable summaries.
  • Voice Tone Modification: Adjust the tone of casual voice memos to make them sound more professional.

Strengths

What sets OmniAudio-2.6B apart from other models?

  • Fast and Efficient: Processes audio-text inputs quickly and efficiently, making it perfect for on-device deployment.
  • Unified Architecture: Combines ASR and LLM capabilities in a single architecture, reducing latency and resource overhead.
  • Secure and Responsive: Enables secure and responsive audio-text processing directly on edge devices.

Performance

OmniAudio-2.6B delivers on speed, accuracy, and efficiency across its tasks. Let’s dive into the details.

Speed

How fast can an AI model process audio inputs? OmniAudio-2.6B averages a decoding speed of 35.23 tokens/second in the FP16 GGUF version and 66 tokens/second in the Q4_K_M quantized GGUF version, a 5.5x to 10.3x speedup over traditional approaches on consumer hardware.
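
To make those figures concrete, here is a back-of-the-envelope latency calculation; the 200-token response length is an arbitrary assumption for illustration, not a benchmark setting.

    # Rough decode time for a 200-token reply at the speeds quoted above.
    RESPONSE_TOKENS = 200  # arbitrary example length

    for variant, tokens_per_second in [("FP16 GGUF", 35.23), ("Q4_K_M GGUF", 66.0)]:
        seconds = RESPONSE_TOKENS / tokens_per_second
        print(f"{variant}: ~{seconds:.1f}s to decode {RESPONSE_TOKENS} tokens")
    # FP16 GGUF: ~5.7s; Q4_K_M GGUF: ~3.0s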

Accuracy

But speed isn’t everything. OmniAudio-2.6B also holds up on accuracy in tasks that mix text and audio: because ASR and language modeling share one architecture, audio doesn’t have to pass through a separate transcription stage before the model can reason about it. That makes it a good fit for applications like voice QA, voice-in conversation, and creative content generation.

Efficiency

What about efficiency? OmniAudio-2.6B is designed to be efficient, requiring only 1.30GB RAM and 1.60GB storage space for the q4_K_M version. This makes it an excellent choice for on-device deployment, where resources are limited.
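
One practical way to use those numbers is a pre-flight check before pulling the model. A minimal sketch, assuming Python with the third-party psutil package; the thresholds mirror the q4_K_M figures above:

    import shutil
    import psutil  # third-party: pip install psutil

    REQUIRED_RAM_GB = 1.30   # q4_K_M RAM requirement from above
    REQUIRED_DISK_GB = 1.60  # q4_K_M storage requirement from above

    ram_free_gb = psutil.virtual_memory().available / 1024**3
    disk_free_gb = shutil.disk_usage("/").free / 1024**3

    print(f"Free RAM:  {ram_free_gb:.2f} GB (need ~{REQUIRED_RAM_GB} GB)")
    print(f"Free disk: {disk_free_gb:.2f} GB (need ~{REQUIRED_DISK_GB} GB)")

    if ram_free_gb < REQUIRED_RAM_GB or disk_free_gb < REQUIRED_DISK_GB:
        raise SystemExit("Not enough headroom for the q4_K_M build.")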

Limitations

OmniAudio-2.6B is a powerful tool for on-device audio-language processing, but it’s not perfect. Let’s take a closer look at some of its limitations.

Limited Context Understanding

While OmniAudio-2.6B can process both text and audio inputs, it may struggle to fully understand the context of a conversation. For example, if you ask it to summarize a meeting note, it might miss important details or not fully capture the nuances of the discussion.

Dependence on Quality of Input

The quality of the input audio or text can significantly impact OmniAudio-2.6B’s performance. If the input is noisy, unclear, or contains errors, the model may struggle to produce accurate or coherent responses.

Limited Domain Knowledge

OmniAudio-2.6B is trained on a specific dataset and may not have the same level of domain knowledge as other models that are specialized in a particular area. For instance, if you ask it to provide medical advice, it may not be able to offer the same level of expertise as a model specifically trained in the medical domain.

Resource Intensive

While OmniAudio-2.6B is efficient by model standards, it still needs real headroom: the q4_K_M version requires 1.30GB of RAM and 1.60GB of storage, which can be a challenge on tightly constrained devices.

Limited Creativity

While OmniAudio-2.6B can generate creative content, it may not be able to match the level of creativity and originality of human writers or artists. For example, if you ask it to write a poem, it may produce something that sounds good but lacks the emotional depth and complexity of a human-written poem.

Not a Replacement for Human Judgment

OmniAudio-2.6B is a tool, not a replacement for human judgment. While it can provide helpful responses, it’s essential to review and verify the accuracy of its output, especially in critical applications.

Examples

  • Prompt: "I am having a rough day at work."
    Response: "I'm so sorry to hear that. Would you like to talk about what's going on and maybe I can help you find a way to turn your day around?"
  • Prompt: "Write a haiku about autumn leaves."
    Response: "Golden hues descend, / Leaves rustling, nature's symphony, / Autumn's gentle kiss"
  • Prompt: "Can you summarize this meeting note? The meeting was about the new marketing strategy, which includes increasing social media presence and collaborating with influencers."
    Response: "The new marketing strategy focuses on boosting social media presence and partnering with influencers to reach a wider audience."

Format

OmniAudio-2.6B is a multimodal model that processes both text and audio inputs, making it a powerful tool for various applications. But what does that mean for you, the user? The sections below cover the architecture, supported data formats, and setup requirements.

Architecture

OmniAudio-2.6B integrates three components:

  1. Gemma-2-2b, a 2B-parameter language model that handles reasoning and text generation
  2. Whisper turbo, an audio encoder that turns speech into embeddings
  3. A custom projector module that maps those audio embeddings into the language model's input space

These components work together to enable secure, responsive audio-text processing directly on edge devices; the sketch below shows the conceptual flow.
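
The published description names the components but not the projector's internals, so everything in this snippet, including the dimensions, the plain linear projector, and the placeholder tensors, is an illustrative assumption rather than the actual implementation:

    import torch
    import torch.nn as nn

    # Conceptual flow only: audio encoder output -> projector -> decoder input.
    AUDIO_DIM = 1280  # assumed Whisper-style encoder width
    TEXT_DIM = 2304   # assumed Gemma-2-2b hidden size

    projector = nn.Linear(AUDIO_DIM, TEXT_DIM)  # stand-in for the custom projector

    audio_embeddings = torch.randn(1, 1500, AUDIO_DIM)  # placeholder Whisper turbo output
    audio_tokens = projector(audio_embeddings)          # mapped into the text embedding space

    text_embeddings = torch.randn(1, 32, TEXT_DIM)      # placeholder embedded text prompt
    decoder_input = torch.cat([audio_tokens, text_embeddings], dim=1)
    print(decoder_input.shape)  # torch.Size([1, 1532, 2304]), fed to the Gemma-2-2b decoder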

Data Formats

OmniAudio-2.6B supports both text and audio inputs. For text inputs, you can use tokenized text sequences. For audio inputs, you can use audio files in various formats.
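
Since Whisper-family encoders expect 16 kHz mono input, clips recorded at other rates need resampling first. A minimal sketch, assuming librosa is installed and memo.wav is your recording:

    import librosa  # third-party: pip install librosa

    # Load and resample to 16 kHz mono, the rate Whisper-family encoders expect.
    audio, sr = librosa.load("memo.wav", sr=16000, mono=True)
    print(f"{len(audio) / sr:.1f}s of audio at {sr} Hz, ready for the encoder")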

Special Requirements

To use OmniAudio-2.6B, you’ll need to install the Nexa-SDK, a local on-device inference framework. You’ll also need to ensure that your device has enough RAM and storage space to run the model.
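
For reference, the Nexa-SDK has typically been installable with a single pip command; the package name below reflects Nexa’s published instructions, so check the Nexa-SDK repository if it has changed:

pip install nexaai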

Handling Inputs and Outputs

To get started with OmniAudio-2.6B, you’ll need to run the following code in your terminal:

nexa run omniaudio -st

This will launch the model and allow you to start processing text and audio inputs.

For example, you can use OmniAudio-2.6B to summarize a meeting note. Simply ask “Can you summarize this meeting note?” and the model will convert the lengthy recording into a concise, actionable summary.

Or, you can use OmniAudio-2.6B to transform a casual voice memo into a professional communication. Just ask “Can you make this voice memo more professional?” and the model will adjust the tone while preserving the core message.
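
If you would rather launch that same session from a script than a terminal, a thin wrapper works. This is just the document’s own command invoked via Python’s subprocess module, assuming the Nexa CLI is on your PATH:

    import subprocess

    # Launch the interactive OmniAudio session; equivalent to typing
    # the command above in a terminal (requires Nexa-SDK installed).
    subprocess.run(["nexa", "run", "omniaudio", "-st"], check=True)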

What’s Next for OmniAudio?

OmniAudio is in active development, and the team is working to advance its capabilities. Some of the upcoming features include:

  • Building direct audio generation for two-way voice communication
  • Implementing function calling support via Octopus_v2 integration

In the long term, the goal is to establish OmniAudio as a comprehensive solution for edge-based audio-language processing.

Join the Community

Want to learn more about OmniAudio and stay up-to-date on the latest developments? Join the community on Discord or follow along on Twitter.
