VARCO VISION 14B HF

Vision-Language Model

Meet VARCO VISION 14B HF, an English-Korean Vision-Language Model (VLM) that pairs image understanding with text generation. Beyond describing images, it can answer questions about specific regions of a picture, using bounding boxes to pinpoint exact areas of interest: ask about an object and the model returns a description along with its location and surrounding context. It also supports grounding and optical character recognition (OCR) through dedicated special tokens. With its LLaVA-OneVision-based architecture and these specialized features, the model is a strong option for anyone working with images and text in English or Korean. So, what can you do with VARCO VISION 14B HF?


Model Overview

The VARCO-VISION-14B model is an English-Korean Vision-Language Model (VLM) that can understand and describe images. It was developed by the Multimodal Generation Team at NC Research.

Capabilities

This model can perform various tasks, including:

  • Image description: Takes an image and a text prompt as input and generates text describing the image.
  • Grounding: Identifies the locations within an image that support its answer and marks them with bounding boxes.
  • Referring: Answers questions about a specific region of an image, indicated by a bounding box.
  • Optical Character Recognition (OCR): Recognizes text within an image.

The model uses special tokens to define specific tasks, inputs, and outputs. For example, <gro> is used for grounding, <ocr> is used for OCR, and <obj> and <bbox> are used to indicate objects and bounding boxes.
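
The Examples section below shows these tokens in real prompts and responses. As a rough sketch, task-specific prompts might look like the following; the exact wording and token placement here are illustrative assumptions, not official prompt templates:

grounding_prompt = "<gro>\nDescribe the image in detail."  # grounded description request
ocr_prompt = "<ocr>\n"                                     # read the text in the image
referring_prompt = (
    "What is <obj>this object</obj>"
    "<bbox>0.039, 0.138, 0.283, 0.257</bbox> used for?"    # normalized x1, y1, x2, y2
)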

The model supports both Korean and English languages.

Technical Details

The model follows the architecture of LLaVA-OneVision and uses Qwen/Qwen2.5-14B-Instruct as its language model and google/siglip-so400m-patch14-384 as its vision encoder. It is licensed under CC BY-NC 4.0.
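
As a quick sanity check, the checkpoint's configuration can be inspected to see which backbones it declares; the expected values in the comments below are assumptions based on the components listed above, not verified output:

from transformers import AutoConfig

# Inspect the declared backbone components of the checkpoint
config = AutoConfig.from_pretrained("NCSOFT/VARCO-VISION-14B-HF")
print(type(config).__name__)            # e.g. LlavaOnevisionConfig
print(config.text_config.model_type)    # expected: "qwen2" (Qwen2.5-14B-Instruct)
print(config.vision_config.model_type)  # expected: "siglip_vision_model"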

How to use it?

To use the model, import torch, requests, PIL, and the relevant transformers classes, then load the checkpoint with LlavaOnevisionForConditionalGeneration.from_pretrained and its processor with AutoProcessor.from_pretrained.

Prepare a conversation history and call apply_chat_template to get a correctly formatted prompt, run the processor on the image and prompt, and generate the output text with model.generate; the full example is shown under Format below.

Performance

The model's reported accuracy is comparable to that of proprietary models, making it a reliable choice for a range of vision-language tasks, and it processes inputs efficiently enough for interactive use.

It handles a wide range of tasks, including grounding, referring, and OCR, within a single interface; its ability to take an image and text together and generate free-form output text makes it a versatile model for both location-aware and text-recognition work.

Limitations

The model has some limitations, including:

  • Limited input formats: It accepts only a single image and a text prompt as input.
  • Dependence on special tokens: The model relies on special tokens to define specific tasks, inputs, and outputs.
  • Limited context understanding: The model might not always interpret the context correctly, especially for complex images.
  • OCR limitations: OCR accuracy is not foolproof and depends on the quality of the input image.
  • Non-commercial license: The model is released under CC BY-NC 4.0 and cannot be used for commercial applications.

Examples

  • Grounding
    Prompt: Describe this image.
    Image: http://images.cocodataset.org/val2017/000000039769.jpg
    Response: The image shows <obj>two cats</obj><bbox>0.014, 0.106, 0.51, 0.996<delim>0.51, 0.054, 0.996, 0.787</bbox> lying on <obj>a pink blanket</obj><bbox>0.003, 0.231, 0.999, 0.999</bbox>. The cat on the left is lying on its side with its head resting on the blanket, while the cat on the right is lying on its stomach with its head also resting on the blanket. Both cats appear to be relaxed and comfortable. There are <obj>two remote controls</obj><bbox>0.037, 0.141, 0.283, 0.253<delim>0.506, 0.171, 0.581, 0.295</bbox> placed near the cats, one on the left side and one on the right side of the image.
  • Referring
    Prompt: What is <obj>이 물건</obj><bbox>0.039, 0.138, 0.283, 0.257</bbox> used for? (이 물건: "this object")
    Response: This object is a remote control, mainly used to control electronic devices such as TVs or other appliances remotely. The remote control has various buttons, each performing functions like channel changing, volume adjustment, and power on/off. Users can hold the remote control and press the buttons to send signals to the device for the desired operation. Remote controls are commonly used in homes or offices for convenient control of electronic devices.
  • Optical Character Recognition (OCR)
    Prompt: Perform OCR on this image.
    Image: ./assets/ocr_1.png
    Response: 백범로 124번길 Baekbeom-ro 124 Mansu Jugong Apt 42 Shieung 시청 인천대공원 모래내시장역 IncheonGrand Park
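
The bounding boxes in these responses are given as x1, y1, x2, y2 values between 0 and 1, with <delim> separating multiple boxes for the same object. As a minimal sketch, assuming the values are normalized to the image width and height, a hypothetical helper like the one below (not part of the model's tooling) could convert them to pixel coordinates:

import re

# Hypothetical helper: parse <obj>/<bbox> spans from a grounded response and convert the
# coordinates to pixels. Assumes each box is x1, y1, x2, y2 normalized to [0, 1], with
# <delim> separating multiple boxes for the same object.
def parse_grounded_objects(response_text, image_width, image_height):
    pattern = re.compile(r"<obj>(.*?)</obj><bbox>(.*?)</bbox>", re.DOTALL)
    objects = []
    for name, bbox_span in pattern.findall(response_text):
        boxes = []
        for box in bbox_span.split("<delim>"):
            x1, y1, x2, y2 = (float(v) for v in box.split(","))
            boxes.append((x1 * image_width, y1 * image_height,
                          x2 * image_width, y2 * image_height))
        objects.append((name.strip(), boxes))
    return objects

# Example with the grounded response above, assuming a 640x480 image:
# parse_grounded_objects(response_text, 640, 480)
# -> [("two cats", [(8.96, 50.88, 326.4, 478.08), ...]), ...]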

Example Use Cases

  • Describe an image in detail, including specific locations and objects.
  • Ask a question about a specific object in an image, using its location to provide context.
  • Recognize text within an image, such as street signs or product labels.

Format

The model accepts an image and a text string as input and generates output text. The image can be in any format the PIL library can read.

The model has several specialized features that allow it to perform specific tasks, including grounding, referring, and OCR. To perform these tasks, you need to use special tokens and follow specific formats.

Here is an example of how to use the model:

import torch
import requests
from PIL import Image
from transformers import LlavaOnevisionForConditionalGeneration, AutoProcessor

model_name = "NCSOFT/VARCO-VISION-14B-HF"
model = LlavaOnevisionForConditionalGeneration.from_pretrained(model_name, torch_dtype="float16", device_map="auto", attn_implementation="flash_attention_2")
processor = AutoProcessor.from_pretrained(model_name)

# Define a chat history and use `apply_chat_template` to get correctly formatted prompt
conversation = [
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "Describe this image."},
            {"type": "image"},
        ],
    },
]

prompt = processor.apply_chat_template(conversation, add_generation_prompt=True)

image_file = "http://images.cocodataset.org/val2017/000000039769.jpg"
raw_image = Image.open(requests.get(image_file, stream=True).raw)

inputs = processor(images=raw_image, text=prompt, return_tensors="pt").to(model.device, torch.float16)

# Generate deterministically and keep only the newly generated tokens when decoding
output = model.generate(**inputs, max_new_tokens=1024, do_sample=False)
output = processor.decode(output[0][inputs.input_ids.shape[1]:])

# Strip the trailing end-of-turn token, if present, and surrounding whitespace
if output.endswith(EOS_TOKEN):
    output = output[: -len(EOS_TOKEN)]

output = output.strip()
print(output)
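
To run one of the specialized tasks instead of a plain description, the text turn in the conversation carries the corresponding special token. The snippet below is a hedged sketch that reuses the pipeline above; the exact prompt wording is an assumption, not an official template:

# Sketch: swap the text turn for a grounding request
grounding_conversation = [
    {
        "role": "user",
        "content": [
            # Assumed prompt wording; the key point is the leading <gro> task token
            {"type": "text", "text": "<gro>\nDescribe the image in detail."},
            {"type": "image"},
        ],
    },
]
grounding_prompt = processor.apply_chat_template(grounding_conversation, add_generation_prompt=True)
# The remaining steps (processor call, generate, decode, EOS stripping) are unchanged.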