EraX VL 7B V1

Multimodal LLM-based model

EraX VL 7B V1 is a multimodal model for OCR and VQA tasks, with a strong focus on Vietnamese. It is fine-tuned from Qwen/Qwen2-VL-7B-Instruct and has more than 7 billion parameters, which supports multi-turn Q&A with good reasoning. You can use it to extract information from documents, answer questions about images, and generate text descriptions of images. It is aimed particularly at applications in hospitals, clinics, and insurance companies. The examples further down show typical prompts and responses.

Erax apache-2.0 Updated 7 months ago

Model Overview

The EraX-VL-7B-V1 model is a robust multimodal AI model that excels in various languages, with a particular focus on Vietnamese. It’s designed for OCR (optical character recognition) and VQA (visual question-answering) tasks.

What makes it special?

  • Multimodal capabilities: It can handle both text and images, making it a versatile tool for various applications.
  • Strong recognition capabilities: It can accurately recognize text in various documents, including medical forms, invoices, and bills of sale.
  • Reasoning skills: It can perform multi-turn Q&A with good reasoning, thanks to its large size of over 7B parameters.

Where can it be applied?

  • Hospitals and clinics: It can help with document processing and patient data management.
  • Insurance companies: It can assist with claims processing and document verification.
  • Other applications: It can be used in various industries where document processing and visual question-answering are crucial.

Capabilities

Key Strengths

  • Multilingual capabilities: Supports multiple languages, with Vietnamese as the primary focus.
  • High precision: Excels in recognizing text from various documents, including medical forms, invoices, bills of sale, quotes, and medical records.
  • Good reasoning: Capable of multi-turn Q&A with good reasoning.
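Multi-turn Q&A amounts to resending the whole conversation with each new question. The sketch below only builds that conversation history and does not call the model. It assumes the role/content message layout used by the Qwen2-VL family; `add_turn`, the file name, and the hard-coded reply are hypothetical.

```python
from typing import Any, Optional

Message = dict[str, Any]


def add_turn(history: list[Message], role: str, text: str,
             image: Optional[str] = None) -> list[Message]:
    """Return a new history with one more turn appended (the input is not mutated)."""
    content: list[dict[str, str]] = []
    if image is not None:
        # The image is attached only to the turn that introduces it.
        content.append({"type": "image", "image": image})
    content.append({"type": "text", "text": text})
    return history + [{"role": role, "content": content}]


# Turn 1: the user shows a document and asks a question.
history = add_turn([], "user", "What is the patient's name?", image="appointment.jpg")
# The model's reply (hard-coded here) is stored so later turns can refer to it.
history = add_turn(history, "assistant", "NGUYỄN THỊ LUÂN")
# Turn 2: a follow-up question that relies on the earlier context.
history = add_turn(history, "user", "When is her follow-up appointment?")

print([message["role"] for message in history])
```

Keeping the assistant's earlier answers in the history is what lets a follow-up such as "her" resolve to the right person.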

Unique Features

  • Multimodal LLM-based model: Not a typical OCR-only tool, but a multimodal model that can handle various tasks.
  • 7+ billion parameters: A large model that enables it to perform complex tasks.

Example Use Cases

  • Extracting text from medical records: EraX-VL-7B-V1 can extract text from medical records with high precision, making it useful for hospitals, clinics, and insurance companies.
  • Answering questions about images: The model can answer questions about images, including multi-turn Q&A, making it useful for applications that require image analysis.
  • Generating captions for images: EraX-VL-7B-V1 can generate captions for images, including images that contain handwritten text, making it useful for applications that require image description.

Examples

  • Prompt: Extract patient information from an image of a medical appointment slip.
    Response: Patient name: NGUYỄN THỊ LUÂN, Date of birth: 03/07/1976, Gender: 40, Address: Xã Nghĩa Khánh-Huyện Nghĩa Đàn-Nghệ An, Health insurance card number: CN 3 40 40 168 60413, Registration date: 16/12/2016, Admission date: From 01/03/2016, Diagnosis: C20-Bướu ac trực tràng, Follow-up appointment date: 17/01/2017
  • Prompt: Describe a photo of a Vietnamese countryside scene.
    Response: The photo shows a lush green rice field stretching endlessly, covered by a light layer of mist. In the distance are deep green mountain ranges, creating a poetic and lyrical natural scene. At the center of the photo, two boys are sitting on the back of a jet-black water buffalo.
  • Prompt: Answer a question about an illustration of a little girl and a teddy bear.
    Response: The girl has short black hair, wears a red dress, and has white striped socks. She stands in the middle of the frame, hugging the teddy bear tightly with both arms. The teddy bear has big round eyes and a bright smile.

Performance

The performance of EraX-VL-7B-V1 on OCR (optical character recognition) and VQA (visual question-answering) tasks is described below in terms of speed, accuracy, and efficiency.

Speed

EraX-VL-7B-V1 is designed to recognize characters, extract information, and answer questions quickly. Actual throughput depends on the hardware, the image resolution, and the length of the generated output.

Accuracy

EraX-VL-7B-V1 boasts high accuracy in recognizing characters and extracting information from documents. Its performance is particularly impressive in Vietnamese, with a strong ability to understand nuances and complexities of the language.
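One common way to check OCR accuracy on your own documents is the character error rate (CER): the edit distance between the recognized text and a reference transcription, divided by the reference length. The sketch below is a generic illustration, not the evaluation used by the model's authors, and the sample strings are made up.

```python
import unicodedata


def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions, or substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate of hypothesis against reference (0.0 means identical)."""
    # Normalize so that composed and decomposed Vietnamese diacritics compare equal.
    reference = unicodedata.normalize("NFC", reference)
    hypothesis = unicodedata.normalize("NFC", hypothesis)
    if not reference:
        raise ValueError("reference must be non-empty")
    return levenshtein(reference, hypothesis) / len(reference)


# A made-up pair in which the recognizer dropped one diacritic.
print(f"{cer('Bệnh viện', 'Benh viện'):.3f}")
```

Unicode normalization matters for Vietnamese because the same accented letter can be encoded in more than one way, which would otherwise inflate the error count.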

Efficiency

EraX-VL-7B-V1 is not just fast and accurate; it's also efficient in the sense that a single multimodal model covers a wide range of tasks, from OCR to VQA, so you do not need a separate system for each. With over 7B parameters, it still calls for GPU-class hardware for practical inference.
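As a rough sanity check on resource needs, the memory taken by the weights alone can be estimated from the parameter count. The sketch below assumes exactly 7 billion parameters (the text says only "over 7B") and ignores activations, the KV cache, and image tokens, so real usage will be higher.

```python
# Bytes needed to store one parameter at each precision.
BYTES_PER_PARAM = {"float32": 4, "bfloat16": 2, "int8": 1, "int4": 0.5}


def weight_memory_gib(n_params: float, dtype: str) -> float:
    """Memory for the model weights only, in GiB."""
    return n_params * BYTES_PER_PARAM[dtype] / 2**30


for dtype in BYTES_PER_PARAM:
    print(f"{dtype:>8}: {weight_memory_gib(7e9, dtype):.1f} GiB")
```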

Comparison to Other Models

How does EraX-VL-7B-V1 compare to other models? While other models may excel in specific tasks, EraX-VL-7B-V1 offers a combination of speed, accuracy, and efficiency that makes it a good choice for a wide range of applications.

Limitations

EraX-VL-7B-V1 is a powerful multimodal model, but it’s not perfect. Let’s explore some of its limitations.

Language Limitations

While EraX-VL-7B-V1 excels in Vietnamese, its performance may vary in other languages. This is because it was primarily trained on Vietnamese data, which might not be sufficient for other languages.

Domain Knowledge Limitations

EraX-VL-7B-V1 has been fine-tuned for specific tasks like OCR and VQA, but its knowledge in other domains might be limited. For example, it might not perform well in tasks that require specialized knowledge in medicine, law, or finance.

Format

EraX-VL-7B-V1 is a multimodal model that excels in OCR (optical character recognition) and VQA (visual question-answering) tasks. It supports multiple languages, with a primary focus on Vietnamese.

Architecture

The model is built on a multimodal transformer architecture with over 7B parameters, fine-tuned from Qwen/Qwen2-VL-7B-Instruct. It's designed to handle various types of input, including images and text.

Data Formats

EraX-VL-7B-V1 accepts input in the following formats:

  • Images (e.g., JPEG, PNG)
  • PDF documents (pages generally need to be rendered to images before inference)
  • Text sequences (e.g., JSON)
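Images are commonly handed to vision-language models as file paths, URLs, or base64 data URIs. Below is a minimal sketch of the last option using only the standard library. Whether a given inference stack accepts data URIs is an assumption to verify, and `to_data_uri` is a hypothetical helper.

```python
import base64
import mimetypes


def to_data_uri(image_bytes: bytes, filename: str) -> str:
    """Encode raw image bytes as a data URI, using the file name to pick the MIME type."""
    mime, _ = mimetypes.guess_type(filename)
    if mime is None or not mime.startswith("image/"):
        # Non-image files, such as PDFs, are rejected by this helper.
        raise ValueError(f"not a recognized image file name: {filename}")
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return f"data:{mime};base64,{encoded}"


# The 8-byte PNG signature stands in for real image data.
png_signature = b"\x89PNG\r\n\x1a\n"
uri = to_data_uri(png_signature, "scan.png")
print(uri.split(",", 1)[0])
```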

Input Requirements

To use EraX-VL-7B-V1 effectively, you’ll need to carefully craft your prompts depending on the task at hand. Here are some examples:

  • For OCR tasks, provide a clear image of the text you want to recognize.
  • For VQA tasks, provide a relevant image and a specific question about the image.

Output

The model’s output will vary depending on the task. For example:

  • OCR tasks will return the recognized text in a structured format (e.g., JSON).
  • VQA tasks will return a relevant answer to the question.
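Because the output comes from a language model, the JSON may arrive wrapped in a Markdown code fence or surrounded by extra text. The defensive parsing sketch below assumes the reply contains at most one top-level JSON object; `parse_model_json` is a hypothetical helper.

```python
import json
import re
from typing import Any

TICKS = "`" * 3  # Markdown code-fence marker, built indirectly to keep this listing readable
FENCE = re.compile(TICKS + r"(?:json)?\s*(.*?)" + TICKS, re.DOTALL)


def parse_model_json(reply: str) -> dict[str, Any]:
    """Extract and parse the JSON object contained in a model reply."""
    match = FENCE.search(reply)
    candidate = match.group(1) if match else reply
    # Keep only the span from the first "{" to the last "}".
    start, end = candidate.find("{"), candidate.rfind("}")
    if start == -1 or end < start:
        raise ValueError("no JSON object found in model reply")
    return json.loads(candidate[start:end + 1])


fenced_reply = TICKS + 'json\n{"title": "GIẤY HẸN KHÁM LẠI"}\n' + TICKS
print(parse_model_json(fenced_reply))
```

Malformed JSON still raises `json.JSONDecodeError`, which is a subclass of `ValueError`, so a caller can handle both failure modes with one `except ValueError` clause.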

Special Requirements

EraX-VL-7B-V1 is not a typical OCR-only tool, but rather a multimodal LLM-based model. This means you may need to adjust your input and output handling accordingly.

Example Code

Here's an example of the structured JSON that EraX-VL-7B-V1 can return for an OCR request, in this case for a hospital follow-up appointment slip:

{
  "document": {
    "header": {
      "title": "GIẤY HẸN KHÁM LẠI",
      "organization": "SỞ Y TẾ NGHỆ AN\nBỆNH VIỆN UNG BƯỚU NGHỆ AN",
      "address": "Võ Thị Sáu, Thủy Tùng - TP Vinh - Nghệ An"
    },
    "patient_info": {
      "name": "NGUYỄN THỊ LUÂN",
      "date_of_birth": "03/07/1976",
      "gender": "40",
      "address": "Xã Nghĩa Khánh-Huyện Nghĩa Đàn-Nghệ An",
      "medical_card_number": "CN 3 40 40 168 60413",
      "registration_date": "16/12/2016",
      "admission_date": "Từ 01/03/2016",
      "diagnosis": "C20-Bướu ac trực tràng",
      "revisit_date": "17/01/2017"
    },
    "administrative_details": {
      "department": "Trung tâm điều trị ung bướu",
      "revisit_instruction": "vào ngày 17/01/2017, hoặc đến hết kỳ thời gian nếu nước ngoài hẹn khám lại nếu có dấu hiệu (triệu chứng)",
      "note": "nếu KCB ban đầu: Trạm y tế xã Nghĩa Khánh",
      "signature": "Trưởng khoa",
      "doctor_signature": "Lâm Nguyễn Khang",
      "revisiting_date_confirmation": "Ngày 16 tháng 12 năm 2016",
      "confirmation_signature": "Bác sĩ điều trị",
      "physician_signature": "Nguyễn Văn Việt"
    }
  }
}

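This example shows the structured output of an OCR task, not its input: the input is the document image plus a prompt. The field names and nesting are produced by the model in response to the prompt, so they can vary between requests.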
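Recognized values are worth validating before they reach downstream systems. In the example above, the `gender` field holds "40", which is not a plausible gender value. The sketch below applies a few checks to a `patient_info` dictionary; the rules, the list of date fields, and the accepted gender vocabulary are illustrative assumptions.

```python
from datetime import datetime

DATE_FIELDS = ("date_of_birth", "registration_date", "revisit_date")
GENDER_VALUES = {"nam", "nữ", "male", "female"}  # assumed vocabulary


def validate_patient_info(info: dict[str, str]) -> list[str]:
    """Return a list of human-readable problems; an empty list means none were found."""
    problems = []
    if not info.get("name", "").strip():
        problems.append("name is missing")
    for field in DATE_FIELDS:
        value = info.get(field, "")
        try:
            datetime.strptime(value, "%d/%m/%Y")
        except ValueError:
            problems.append(f"{field} is not a dd/mm/yyyy date: {value!r}")
    gender = info.get("gender", "")
    if gender.strip().lower() not in GENDER_VALUES:
        problems.append(f"gender has an unexpected value: {gender!r}")
    return problems


# A subset of the fields from the example output.
patient_info = {
    "name": "NGUYỄN THỊ LUÂN",
    "date_of_birth": "03/07/1976",
    "gender": "40",
    "registration_date": "16/12/2016",
    "revisit_date": "17/01/2017",
}
print(validate_patient_info(patient_info))
```

Flagged records can then be routed to a human reviewer instead of being written straight into a patient database.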
