Apollo 7B T32

Multimodal video understanding

Meet Apollo 7B T32, an AI model that pushes the boundaries of video understanding. What makes it remarkable? It can handle hour-long videos while balancing speed and accuracy. Apollo models are designed to excel at tasks like long-form video comprehension, temporal reasoning, and complex video question-answering, making them well suited to multi-turn conversations grounded in video content. At 7B parameters, Apollo 7B T32 outperforms most 7B competitors and even rivals 30B-scale models. Its design decisions make it efficient, fast, and accurate, a standout in the world of Large Multimodal Models.

Apollo LMMs · Apache-2.0

Model Overview

Apollo is a family of Large Multimodal Models (LMMs) built for video understanding. The models are designed to handle hour-long videos while balancing speed and accuracy.

Capabilities

So, what can Apollo do?

  • Long-form video comprehension: It can understand videos that are hours long, not just short clips.
  • Temporal reasoning: It can make sense of what’s happening in a video over time.
  • Complex video question-answering: It can answer questions that require reasoning over a whole video, such as how a scene unfolds or why an event happens, not just naming what is in a single frame.
  • Multi-turn conversations: It can have a conversation with you about a video, responding to your questions and statements.

Performance

But how does Apollo perform in terms of speed, accuracy, and efficiency?

  • Speed: Apollo is built to process hour-long videos efficiently, thanks to strategic design decisions that deliberately trade off frame coverage, per-frame token count, and accuracy.
  • Accuracy: Apollo excels in tasks such as long-form video comprehension, temporal reasoning, complex video question-answering, and multi-turn conversations grounded in video content. It even rivals 30B-scale models in terms of accuracy, which is impressive considering its smaller size.
  • Efficiency: Apollo encodes each frame into just 32 tokens (the "T32" in the model name), a compact per-frame budget that still captures complex video content; the quick calculation below shows what that budget means for long videos.
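
To put that per-frame budget in perspective, here is a back-of-the-envelope calculation. The one-frame-per-second sampling rate is an illustrative assumption, not a documented setting; Apollo's actual clip sampling is configurable:

    # Rough visual-token budget for an hour-long video at 32 tokens per frame.
    # The 1 fps sampling rate is an assumption for illustration only.
    tokens_per_frame = 32
    sampled_fps = 1                   # hypothetical sampling rate
    duration_s = 60 * 60              # one hour of video
    total_tokens = tokens_per_frame * sampled_fps * duration_s
    print(total_tokens)               # 115200 visual tokens before any clip subsampling

Keeping the per-frame budget small and fixed is what lets the model trade frame coverage against context length on long videos.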

Comparison to Other Models

But how does Apollo compare to other AI models? Other models may excel at specific tasks, but Apollo's combination of speed, accuracy, and efficiency makes it a top contender in the field of video understanding.

Example Use Case

Let’s take a look at an example use case for Apollo. Imagine you have a video of a cat playing with a ball, and you want to generate a description of the video. Apollo can process the video and generate a detailed description, including the actions of the cat and the ball.

Examples

Q: What is the sequence of events in the video where a person is cooking?
A: The person first cracks an egg into a bowl, then adds flour and sugar. Next, they mix the ingredients together and pour the batter into a pan. After that, they put the pan on the stove and cook the mixture until it's golden brown.

Q: What is the question being asked in the video where a person is asking about the weather?
A: The person is asking "Will it rain tomorrow?"

Q: Can you describe the scene in the video where a person is playing soccer?
A: The person is playing soccer on a green field with other players. They are wearing a blue jersey and running towards the goal.

Limitations

While Apollo has shown impressive results in video understanding, it’s essential to acknowledge its limitations.

  • Handling Complexity: Apollo handles hour-long videos well, but its performance on extremely complex or abstract video content is less well established.
  • Limited Contextual Understanding: Although Apollo can engage in multi-turn conversations grounded in video content, its ability to understand the nuances of human context is still limited.
  • Dependence on Input Quality: The quality of the input video significantly affects Apollo's performance; low-resolution or heavily compressed footage can degrade results.

Format

Apollo is a Large Multimodal Model (LMM) designed for video understanding tasks. It supports various tasks, including long-form video comprehension, temporal reasoning, complex video question-answering, and multi-turn conversations grounded in video content.

Architecture

Apollo models build on a transformer architecture: a vision encoder converts sampled video frames into a fixed number of tokens per frame, and those tokens are fed to a language-model backbone (the Getting Started example below uses the qwen_2 conversation template). The fixed per-frame token budget is what lets the models handle hour-long videos while balancing speed and accuracy.

Supported Data Formats

Apollo models support video input, specifically mp4 files. They can handle videos of varying lengths, including hour-long videos.

Input Requirements

To use Apollo, you’ll need to provide a video file and a question or prompt related to the video.

Output Requirements

The output of the Apollo model will be a text response to the input question or prompt.
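
For instance, with the cat video from the earlier use case, a hypothetical exchange might look like:

    Input:  cat_video.mp4 + "What is the cat doing?"
    Output: "The cat is chasing a ball across the floor and batting it with its paw."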

Getting Started

Want to try Apollo out? Here's a quick overview of how to use it (a complete code sketch follows the list):

  • Install the Apollo code from its repository using pip install -e . and pip install flash-attn --no-build-isolation
  • Load the model and tokenizer using model = AutoModelForCausalLM.from_pretrained(model_path,...)
  • Prepare your video data using mm_processor = ApolloMMLoader(...)
  • Ask a question about the video using conv = conv_templates["qwen_2"].copy()
  • Get the model’s response using output_ids = model.generate(...)
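
Putting those steps together, here is a minimal end-to-end sketch. The helpers (ApolloMMLoader, conv_templates, tokenizer_mm_token, KeywordsStoppingCriteria) ship with the Apollo repository installed via pip install -e .; the exact argument names and the Hugging Face repo id are taken from the Apollo examples and may differ between releases, so treat this as an outline rather than a drop-in script:

    import torch
    from huggingface_hub import snapshot_download
    from transformers import AutoModelForCausalLM
    # Helpers from the Apollo repo (installed with `pip install -e .`)
    from apollo.mm_utils import ApolloMMLoader, KeywordsStoppingCriteria, tokenizer_mm_token
    from apollo.conversation import conv_templates, SeparatorStyle

    device = "cuda" if torch.cuda.is_available() else "cpu"

    # Download the checkpoint and load it with the repo's custom modeling code
    model_path = snapshot_download("Apollo-LMMs/Apollo-7B-t32", repo_type="model")
    model = AutoModelForCausalLM.from_pretrained(
        model_path, trust_remote_code=True, low_cpu_mem_usage=True
    ).to(device=device, dtype=torch.bfloat16)
    tokenizer = model.tokenizer

    # Sample clips from the video and encode each frame into its token budget
    config = model.config
    mm_processor = ApolloMMLoader(
        model.vision_tower.vision_processor,
        config.clip_duration,
        4,  # frames per clip
        clip_sampling_ratio=0.65,
        model_max_length=config.model_max_length,
        device=device,
        num_repeat_token=config.mm_connector_cfg["num_output_tokens"],
    )
    mm_data, replace_string = mm_processor.load_video("path/to/video.mp4")

    # Build the prompt with the qwen_2 conversation template
    question = "Describe this video in detail."
    conv = conv_templates["qwen_2"].copy()
    conv.append_message(conv.roles[0], replace_string + "\n\n" + question)
    conv.append_message(conv.roles[1], None)
    input_ids = tokenizer_mm_token(
        conv.get_prompt(), tokenizer, return_tensors="pt"
    ).unsqueeze(0).to(device)

    # Stop generation at the template's separator string
    stop_str = conv.sep if conv.sep_style != SeparatorStyle.TWO else conv.sep2
    stopping_criteria = KeywordsStoppingCriteria([stop_str], tokenizer, input_ids)

    with torch.inference_mode():
        output_ids = model.generate(
            input_ids,
            vision_input=[mm_data],
            data_types=["video"],
            do_sample=True,
            temperature=0.4,
            top_p=0.7,
            max_new_tokens=256,
            use_cache=True,
            stopping_criteria=[stopping_criteria],
        )

    print(tokenizer.batch_decode(output_ids, skip_special_tokens=True)[0].strip())
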
Dataloop's AI Development Platform
Build end-to-end workflows

Dataloop is a complete AI development stack that makes data, pipeline elements, models, and human feedback work together easily.

  • Use one centralized tool for every step of the AI development process.
  • Import data from external blob storage, internal file system storage or public datasets.
  • Connect to external applications using a REST API & a Python SDK (see the sketch below).
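
As an example, here is a minimal sketch of connecting through the Python SDK (dtlpy); the project and dataset names are placeholders:

    import dtlpy as dl

    # Authenticate against the Dataloop platform (opens a browser login flow)
    dl.login()

    # Fetch an existing project and dataset by name (placeholder names)
    project = dl.projects.get(project_name="video-understanding")
    dataset = project.datasets.get(dataset_name="videos")

    # Upload a local video so it can flow through a pipeline
    item = dataset.items.upload(local_path="path/to/video.mp4")
    print(item.id)
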
Save, share, reuse

Every single pipeline can be cloned, edited and reused by other data professionals in the organization. Never build the same thing twice.

  • Use existing, pre-created pipelines for RAG, RLHF, RLAIF, Active Learning & more.
  • Deploy multi-modal pipelines with one click across multiple cloud resources.
  • Use versions for your pipelines to make sure the deployed pipeline is the stable one.
Easily manage pipelines

Spend less time dealing with the logistics of owning multiple data pipelines, and get back to building great AI applications.

  • Easy visualization of the data flow through the pipeline.
  • Identify & troubleshoot issues with clear, node-based error messages.
  • Use scalable AI infrastructure that can grow to support massive amounts of data.