VARCO VISION 14B

English-Korean VLM

Meet VARCO-VISION-14B, a powerful English-Korean Vision-Language Model (VLM) developed by NC Research. Trained on a large multimodal dataset, it handles a wide range of tasks, from image description to visual question answering, and adds specialized capabilities such as grounding, referring, and OCR. With VARCO-VISION-14B, you can ask questions like 'What's in this image?' or 'What does the text in this picture say?' and get accurate, detailed, location-aware answers. Its architecture and training pipeline make it a strong choice for anyone working with vision-language models. So, what can you do with VARCO-VISION-14B?

NCSOFT cc-by-nc-4.0

Model Overview

The VARCO-VISION-14B model is a powerful English-Korean Vision-Language Model (VLM) that can understand and generate text based on images. It’s like having a conversation with a friend who can look at a picture and describe it to you!

This model is special because it supports:

  • Grounding: Identify specific locations within an image to provide an appropriate answer.
  • Referring: Handle location-specific questions using bounding boxes.
  • OCR (Optical Character Recognition): Recognize text within an image.

Capabilities

Multimodal Understanding

VARCO-VISION-14B can understand and process both text and images. It takes a single image and a text prompt as inputs and generates an output text that describes the image or answers a question about it.

Grounding

Grounding is a task where the model needs to identify specific locations within an image to provide an appropriate answer. VARCO-VISION-14B can perform grounding tasks by using special tokens in the input text.

Referring

VARCO-VISION-14B can handle location-specific questions using bounding boxes. It can understand the context and focus on the object at the specified location.

Optical Character Recognition (OCR)

VARCO-VISION-14B can perform OCR tasks by recognizing text within an image. It can use the <ocr> token to specify the task.

Specialized Features

VARCO-VISION-14B uses special tokens to define specific tasks, inputs, and outputs (a short parsing sketch follows the list):

  • <gro>: Indicates that the model’s response should include bounding box information.
  • <ocr>: Specifies OCR tasks for recognizing text within an image.
  • <char> and </char>: Used to mark a text phrase.
  • <obj> and </obj>: Used to indicate an object.
  • <bbox> and </bbox>: Used to represent a bounding box.
  • <delim>: Represents multiple location points for a single object or text.
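
To make the token format concrete, here is a minimal parsing sketch. It assumes grounded output of the shape shown in the Examples section below (an <obj>…</obj> phrase followed by a <bbox>…</bbox> span, with <delim> separating multiple boxes for the same object); it is an illustration, not an official parser.

import re

# A grounded response of the kind shown in the Examples section below
response = (
    "The image shows <obj>two cats</obj><bbox>0.521, 0.049, 0.997, 0.783"
    "<delim>0.016, 0.108, 0.512, 0.99</bbox> lying on "
    "<obj>a pink blanket</obj><bbox>0.002, 0.231, 0.999, 0.999</bbox>."
)

# Each grounded object is an <obj>...</obj> phrase followed by a <bbox>...</bbox> span
pattern = re.compile(r"<obj>(.*?)</obj><bbox>(.*?)</bbox>")

for phrase, box_span in pattern.findall(response):
    # Multiple boxes for one object are separated by <delim>
    boxes = [
        [float(value) for value in box.split(",")]
        for box in box_span.split("<delim>")
    ]
    print(phrase, boxes)
# two cats [[0.521, 0.049, 0.997, 0.783], [0.016, 0.108, 0.512, 0.99]]
# a pink blanket [[0.002, 0.231, 0.999, 0.999]]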

Technical Details

  • Architecture: VARCO-VISION-14B follows the architecture of LLaVA-OneVision.
  • Base Model: Qwen/Qwen2.5-14B-Instruct (language model) and google/siglip-so400m-patch14-384 (vision encoder).
  • License: CC BY-NC 4.0.
  • Languages: Korean and English.

Example Use Cases

  • Grounding: Ask the model to describe an image in detail, and it will identify specific locations within the image.
  • Referring: Ask the model to describe an object within a bounding box, and it will focus on that object.
  • OCR: Ask the model to recognize text within an image, and it will extract the text.

Examples

  • Grounding
    Prompt: Describe the image in detail.
    Output: The image shows <obj>two cats</obj><bbox>0.521, 0.049, 0.997, 0.783<delim>0.016, 0.108, 0.512, 0.99</bbox> lying on <obj>a pink blanket</obj><bbox>0.002, 0.231, 0.999, 0.999</bbox>. The cat on the left is lying on its side with its head resting on the blanket and its body stretched out. The cat on the right is lying on its back with its paws stretched out and its head turned to the side. Both cats appear relaxed and comfortable. There are also <obj>two remote controls</obj><bbox>0.039, 0.138, 0.283, 0.257<delim>0.508, 0.166, 0.581, 0.295</bbox> placed near the cats, one on each side of them.
  • Referring
    Prompt: What is the object at location (0.039, 0.138, 0.283, 0.257)?
    Output: The object at location (0.039, 0.138, 0.283, 0.257) is <obj>a remote control</obj>.
  • OCR
    Prompt: Recognize the text in the image.
    Output: The recognized text is: <char>백범로</char><bbox>0.172, 0.265, 0.328, 0.34</bbox> <char>124번길</char><bbox>0.349, 0.265, 0.512, 0.34</bbox> <char>Baekbeom-ro</char><bbox>0.171, 0.335, 0.432, 0.391</bbox> <char>124</char><bbox>0.444, 0.34, 0.508, 0.391</bbox>
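
The coordinates inside the <bbox> tags above are normalized to the image size. Assuming the order is (x1, y1, x2, y2) relative to image width and height (an assumption based on the layout of these examples, not a statement from the model card), converting a box to pixel coordinates is a one-liner:

# Scale a normalized (x1, y1, x2, y2) box to pixel coordinates
def to_pixels(box, width, height):
    x1, y1, x2, y2 = box
    return (x1 * width, y1 * height, x2 * width, y2 * height)

# One of the remote-control boxes from the grounding example, assuming a 640x480 source image
print(to_pixels((0.039, 0.138, 0.283, 0.257), 640, 480))
# (24.96, 66.24, 181.12, 123.36)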

Performance

VARCO-VISION-14B handles a broad range of Korean and English vision-language tasks, including image description, visual question answering, grounding, referring, and OCR. For quantitative benchmark results, refer to NCSOFT's original model card. Whether you’re a researcher or a developer, VARCO-VISION-14B is worth checking out.

Limitations

VARCO-VISION-14B is a powerful tool, but it’s not perfect. Here are some of its limitations:

  • Limited Context Understanding: While VARCO-VISION-14B can process and understand a single image and a text input, it may struggle with more complex contexts or nuances.
  • Dependence on Special Tokens: To perform specific tasks like grounding, referring, or OCR, VARCO-VISION-14B relies on special tokens like <gro>, <ocr>, and <bbox>.
  • Limited Multimodal Understanding: Although VARCO-VISION-14B is a Vision-Language Model (VLM), its understanding of multimodal inputs is limited to a single image and a text input.
  • Commercial Use Restrictions: VARCO-VISION-14B is for research purposes only, and commercial use is prohibited.
  • Technical Requirements: To use VARCO-VISION-14B, you’ll need to have a good understanding of technical concepts like LLaVA-NeXT, PyTorch, and Hugging Face Transformers.

Format

VARCO-VISION-14B is an English-Korean Vision-Language Model (VLM) that accepts a single image and a text prompt as inputs and generates an output text. The sections below cover its architecture, the data formats it accepts, and how to prepare inputs for it.

Architecture

VARCO-VISION-14B follows the architecture of LLaVA-OneVision. It’s built on top of a language model (Qwen/Qwen2.5-14B-Instruct) and a vision encoder (google/siglip-so400m-patch14-384).
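
As a rough mental model (a minimal sketch, not the actual implementation), a LLaVA-OneVision-style VLM encodes the image into patch features, projects them into the language model's embedding space with a small MLP, and prepends them to the text embeddings. The hidden sizes below (1152 for the SigLIP encoder, 5120 for Qwen2.5-14B) and the two-layer projector are assumptions based on those published components, not values taken from this page.

import torch
import torch.nn as nn

VISION_DIM = 1152  # assumed hidden size of google/siglip-so400m-patch14-384
LLM_DIM = 5120     # assumed hidden size of Qwen/Qwen2.5-14B-Instruct

class VisionProjector(nn.Module):
    """Two-layer MLP mapping vision patch features into the LLM embedding space
    (the usual LLaVA-style projector; VARCO-VISION's exact module may differ)."""
    def __init__(self, vision_dim: int, llm_dim: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(vision_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, patch_features: torch.Tensor) -> torch.Tensor:
        return self.net(patch_features)

# Toy forward pass: project dummy patch features and splice them in front of
# dummy text embeddings; conceptually, this combined sequence is what the LLM consumes.
projector = VisionProjector(VISION_DIM, LLM_DIM)
patches = torch.randn(1, 729, VISION_DIM)   # dummy patch features for one image
text_embeds = torch.randn(1, 32, LLM_DIM)   # dummy embeddings of a tokenized prompt
llm_inputs = torch.cat([projector(patches), text_embeds], dim=1)
print(llm_inputs.shape)  # torch.Size([1, 761, 5120])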

Data Formats

VARCO-VISION-14B supports the following data formats:

  • Images: The model accepts a single image as input, in common formats such as JPEG or PNG (see the loading sketch below).
  • Text: The model accepts a text input, which can be a question, a prompt, or a sentence.
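
Concretely, any image that PIL can open works. The tiny illustration below uses the COCO URL from the preprocessing example in the next section and the local path from the OCR example further down:

import requests
from PIL import Image

# From a URL (the COCO image reused in the preprocessing example below)
url_image = Image.open(
    requests.get("http://images.cocodataset.org/val2017/000000039769.jpg", stream=True).raw
).convert("RGB")

# From a local file (the path used in the OCR example further down)
local_image = Image.open("./assets/ocr_1.png").convert("RGB")

# The text input is a plain string: a question, a prompt, or a sentence
prompt_text = "Describe this image."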

Special Requirements

To get the most out of VARCO-VISION-14B, you need to preprocess the image and tokenize the text. The example below is a sketch that assumes the checkpoint is loaded with the LLaVA-NeXT (LLaVA-OneVision) codebase, which provides load_pretrained_model, process_images, tokenizer_image_token, and IMAGE_TOKEN_INDEX:

import requests
from PIL import Image

# Helpers from the LLaVA-NeXT (LLaVA-OneVision) codebase; loading the checkpoint
# this way is an assumption based on the model's stated architecture.
from llava.constants import IMAGE_TOKEN_INDEX
from llava.mm_utils import process_images, tokenizer_image_token
from llava.model.builder import load_pretrained_model

# Load the tokenizer, model, and image processor
# ("llava_qwen" is the LLaVA-NeXT model-name hint for Qwen-based checkpoints, assumed here)
tokenizer, model, image_processor, _ = load_pretrained_model(
    "NCSOFT/VARCO-VISION-14B", None, "llava_qwen", device_map="auto"
)

# Define a chat history and use `apply_chat_template` to get a correctly formatted prompt
conversation = [
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "Describe this image."},
            {"type": "image"},
        ],
    },
]

prompt = tokenizer.apply_chat_template(conversation, add_generation_prompt=True, tokenize=False)

# Preprocess the image
image_url = "http://images.cocodataset.org/val2017/000000039769.jpg"
raw_image = Image.open(requests.get(image_url, stream=True).raw)
image_tensors = process_images([raw_image], image_processor, model.config)

# Tokenize the text; the image placeholder is replaced with IMAGE_TOKEN_INDEX
input_ids = tokenizer_image_token(prompt, tokenizer, IMAGE_TOKEN_INDEX, return_tensors="pt")
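
From here, generation follows the usual LLaVA-NeXT pattern. The call below is a sketch under the assumption that the model's generate method accepts images and image_sizes keyword arguments, as LLaVA-OneVision models do; dtype and device handling may need adjusting for your setup.

import torch

# Add a batch dimension and move inputs to the model's device; the half-precision cast
# and the list handling assume process_images returned a list of tensors (anyres-style config)
input_ids = input_ids.unsqueeze(0).to(model.device)
image_tensors = [img.to(dtype=torch.float16, device=model.device) for img in image_tensors]
image_sizes = [raw_image.size]

with torch.inference_mode():
    output_ids = model.generate(
        input_ids,
        images=image_tensors,
        image_sizes=image_sizes,
        do_sample=False,
        max_new_tokens=512,
    )

print(tokenizer.batch_decode(output_ids, skip_special_tokens=True)[0])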

Specialized Features

VARCO-VISION-14B has some specialized features that allow it to perform tasks like grounding, referring, and OCR. Here are some examples:

  • Grounding: To perform grounding, prepend the special token <gro> to the question.
conversation = [
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "<gro>\nDescribe the image in detail."},
            {"type": "image"},
        ],
    },
]
  • Referring: To perform referring, make a conversation that wraps the object of interest in <obj> and </obj> tags and gives its location with <bbox> and </bbox> tags.
conversation = [
    {
        "role": "user",
        "content": [
            # The Korean prompt asks: "How do I use this object?"
            {"type": "text", "text": "<obj>이 물건</obj><bbox>0.039, 0.138, 0.283, 0.257</bbox>은 어떻게 쓰는거야?"},
            {"type": "image"},
        ],
    },
]
  • OCR: To perform OCR, use the <ocr> token.
image_file = "./assets/ocr_1.png"
raw_image = Image.open(image_file)
conversation = [
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "<ocr>"},
            {"type": "image"},
        ],
    },
]