VARCO-VISION-14B-HF
Meet VARCO-VISION-14B-HF, a powerful English-Korean Vision-Language Model (VLM) from NC Research. Trained on a large multimodal dataset, it handles a wide range of tasks, from image description to text generation. What really sets it apart is its ability to understand and answer questions about specific locations within an image, using bounding boxes to pinpoint exact areas of interest. Want to know more about an object in a picture? VARCO-VISION-14B-HF can provide detailed information, including its location and context. With its LLaVA-OneVision-based architecture and specialized grounding, referring, and OCR features, this model is a strong option for anyone working with images and text. So, what can you do with VARCO-VISION-14B-HF?
Model Overview
The VARCO-VISION-14B model is a powerful English-Korean Vision-Language Model (VLM) that can understand and describe images. It was developed by the Multimodal Generation Team at NC Research.
Capabilities
This model can perform various tasks, including:
- Image description: takes an image and a text prompt as input and generates text describing the image.
- Grounding: identifies the specific locations within an image needed to provide an appropriate answer.
- Referring: handles location-specific questions expressed with bounding boxes.
- Optical Character Recognition (OCR): recognizes text within an image.
The model uses special tokens to define specific tasks, inputs, and outputs. For example, `<gro>` is used for grounding, `<ocr>` for OCR, and `<obj>` and `<bbox>` indicate objects and bounding boxes.
The model supports both Korean and English languages.
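For example, a grounding request and a grounded answer might look like the following. This is an illustrative sketch based on the token conventions above; the exact prompt wording and the bounding-box coordinate convention (assumed here to be normalized x1, y1, x2, y2 values) should be checked against the official model card.
# Hypothetical grounding prompt (user text passed to the model along with an image):
prompt_text = "<gro>\nDescribe the image in detail."
# Hypothetical grounded answer the model might return:
# "<obj>a cat</obj><bbox>0.10, 0.20, 0.55, 0.90</bbox> lying on <obj>a blanket</obj><bbox>0.05, 0.15, 0.98, 0.95</bbox>"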
Technical Details
The model follows the architecture of LLaVA-OneVision and uses Qwen/Qwen2.5-14B-Instruct as its language model and google/siglip-so400m-patch14-384 as its vision encoder. It is licensed under CC BY-NC 4.0.
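If you want to confirm these components programmatically, you can inspect the checkpoint's configuration. This is a minimal sketch using the standard transformers AutoConfig API; the "expected" values in the comments are assumptions based on the architecture description above, and the actual values come from the hosted checkpoint.
from transformers import AutoConfig

config = AutoConfig.from_pretrained("NCSOFT/VARCO-VISION-14B-HF")
# LLaVA-OneVision style configs expose separate language-model and vision-encoder sub-configs.
print(config.model_type)                # expected: "llava_onevision"
print(config.text_config.model_type)    # expected: "qwen2" (Qwen2.5-14B-Instruct backbone)
print(config.vision_config.model_type)  # expected: "siglip_vision_model" (SigLIP so400m, patch14-384)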
How to use it?
To use the model, you need the `torch`, `requests`, and `PIL` libraries. Load the model with the `LlavaOnevisionForConditionalGeneration.from_pretrained` function, prepare a conversation history, and call `apply_chat_template` to get a correctly formatted prompt. Then generate the output text with the `model.generate` function.
Performance
The model performs well across its supported tasks, including grounding, referring, and OCR, in addition to general image description. It processes combined image-and-text inputs efficiently, and according to its developers its accuracy is comparable to that of proprietary models, making it a reliable choice for a wide range of vision-language work.
Limitations
The model has some limitations, including:
- Limited input formats: It accepts only a single image and a text prompt as input.
- Dependence on special tokens: The model relies on special tokens to understand specific tasks, inputs, and outputs.
- Limited context understanding: The model might not always understand the context correctly, especially for complex images.
- OCR limitations: The model’s OCR capabilities are not foolproof and might depend on the quality of the input image.
- Research-only license: The model is licensed for research purposes only and can’t be used for commercial applications.
Example Use Cases
- Describe an image in detail, including specific locations and objects.
- Ask a question about a specific object in an image, using its location to provide context, as sketched below.
- Recognize text within an image, such as street signs or product labels.
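As a concrete sketch of the second use case, a referring question can embed the object and its bounding box directly in the user text. The object label and coordinates below are made up for illustration, and the exact coordinate convention should be confirmed against the official model card.
# Hypothetical referring prompt: the region of interest is marked with <obj> and <bbox> tokens.
question = "<obj>this object</obj><bbox>0.04, 0.00, 0.95, 0.85</bbox> What is it used for?"
# Pass `question` as the text part of the conversation, together with the image.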
Format
The model accepts an image and a text prompt as input and generates output text. The image can be in any format readable by the PIL library, and the text should be a string.
The model has several specialized features that allow it to perform specific tasks, including grounding, referring, and OCR. To perform these tasks, you need to use special tokens and follow specific formats.
Here is an example of how to use the model:
import torch
import requests
from PIL import Image
from transformers import LlavaOnevisionForConditionalGeneration, AutoProcessor
model_name = "NCSOFT/VARCO-VISION-14B-HF"
model = LlavaOnevisionForConditionalGeneration.from_pretrained(
    model_name,
    torch_dtype=torch.float16,
    device_map="auto",
    attn_implementation="flash_attention_2",  # requires the flash-attn package; omit to use the default attention
)
processor = AutoProcessor.from_pretrained(model_name)
# Define a chat history and use `apply_chat_template` to get correctly formatted prompt
conversation = [
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "Describe this image."},
            {"type": "image"},
        ],
    },
]
prompt = processor.apply_chat_template(conversation, add_generation_prompt=True)
image_file = "http://images.cocodataset.org/val2017/000000039769.jpg"
raw_image = Image.open(requests.get(image_file, stream=True).raw)
# Move the inputs to the same device as the model and match its dtype
inputs = processor(images=raw_image, text=prompt, return_tensors="pt").to(model.device, torch.float16)
output = model.generate(**inputs, max_new_tokens=1024, do_sample=False)
# Decode only the newly generated tokens, skipping the prompt portion
output = processor.decode(output[0][inputs.input_ids.shape[1]:])
# Strip the trailing end-of-turn token (typically "<|im_end|>" for the Qwen2.5-based language model)
EOS_TOKEN = processor.tokenizer.eos_token
if output.endswith(EOS_TOKEN):
    output = output[: -len(EOS_TOKEN)]
output = output.strip()
print(output)
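To run OCR instead of free-form description, only the conversation needs to change; the rest of the pipeline above stays the same. The prompt below is an assumption based on the `<ocr>` task token described earlier, so verify the expected prompt and output format against the official model card.
# Hypothetical OCR prompt: the <ocr> task token alone asks the model to transcribe visible text.
ocr_conversation = [
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "<ocr>"},
            {"type": "image"},
        ],
    },
]
ocr_prompt = processor.apply_chat_template(ocr_conversation, add_generation_prompt=True)
# Reuse the same processor/generate/decode calls as above with `ocr_prompt` in place of `prompt`.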