Jais Adapted 70b

Bilingual Arabic-English model

Jais Adapted 70b is a bilingual large language model designed to perform well in both Arabic and English. It is part of the Jais family, a series of bilingual large language models. What sets Jais Adapted 70b apart is how it was built: it starts from Llama-2 and is adapted with Arabic data to improve its performance in that language. The model both understands and generates text, which makes it suited to tasks such as chat applications, sentiment analysis, and summarization of bilingual documents. It is aimed at researchers, businesses, and anyone working with Arabic language data, in settings that range from research to commercial use.

Inceptionai apache-2.0 Updated 7 months ago

Model Overview

The Jais family is a series of bilingual English-Arabic large language models (LLMs) optimized to excel in Arabic while retaining strong English capabilities. Developed by Inception and Cerebras Systems, the family includes 20 models across 8 sizes, ranging from 590M to 70B parameters, trained on up to 1.6T tokens of Arabic, English, and code data.

Capabilities

These models are designed to handle a wide range of tasks, including:

  • Text Generation: The models can generate human-like text in both Arabic and English.
  • Conversational Dialogue: The models are fine-tuned for dialog using a curated mix of Arabic and English instruction data, making them suitable for chat applications.
  • Language Understanding: The models can understand and process text in both Arabic and English, making them useful for tasks such as sentiment analysis and summarization.

The models are available in various sizes, ranging from 590M to 70B parameters, making them suitable for a wide range of use cases, from research to commercial applications.

Strengths

The Jais Family Model has several strengths, including:

  • Bilingual Capabilities: The models are trained on a large dataset of Arabic and English text, making them proficient in both languages.
  • Improved Arabic Support: The models are specifically designed to handle Arabic text, making them a valuable resource for Arabic language research and applications.
  • Strong English Capabilities: The models also have strong English capabilities, making them suitable for a wide range of applications that require both Arabic and English support.

Unique Features

The Jais Family Model has several unique features, including:

  • SwiGLU Non-Linear Activation Function: The Jais models trained from scratch use SwiGLU, a gated activation function, in their feed-forward layers.
  • ALiBi Position Encoding: The same models use ALiBi, a position encoding scheme that penalizes attention scores in proportion to token distance. This lets them extrapolate to longer sequence lengths and improves context handling and precision.
  • Tokenizer Expansion: The adapted models, which are built on Llama-2, extend the base vocabulary with new Arabic tokens, improving fertility (fewer tokens per word) and compute efficiency.
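
The two architectural terms above can be illustrated with a short, dependency-free sketch. This is not the Jais implementation: the function names are hypothetical, the SwiGLU function shows only the element-wise gating step (a real layer applies learned linear projections before and after it), and the ALiBi slope formula is the one commonly used for a power-of-two number of heads.

```python
import math


def sigmoid(x: float) -> float:
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def swiglu_gate(gate: list[float], value: list[float]) -> list[float]:
    """Element-wise SwiGLU gating: swish(gate) * value.

    In a feed-forward layer, `gate` and `value` would be two separate
    linear projections of the same hidden state (assumed, not shown here).
    """
    return [g * sigmoid(g) * v for g, v in zip(gate, value)]


def alibi_slopes(num_heads: int) -> list[float]:
    """Per-head slopes 2^(-8*h/num_heads) for h = 1..num_heads.

    Assumes num_heads is a power of two.
    """
    return [2.0 ** (-8.0 * h / num_heads) for h in range(1, num_heads + 1)]


def alibi_bias(seq_len: int, slope: float) -> list[list[float]]:
    """Causal ALiBi bias that is added to the raw attention scores.

    Entry [q][k] is -slope * (q - k) for keys at or before the query,
    and -inf for future keys (causal mask).
    """
    return [
        [-slope * (q - k) if k <= q else float("-inf") for k in range(seq_len)]
        for q in range(seq_len)
    ]


if __name__ == "__main__":
    print(alibi_slopes(8)[:2])
    for row in alibi_bias(4, slope=0.5):
        print(row)
```

Because the bias depends only on the distance between query and key, the same rule applies unchanged to sequences longer than those seen in training, which is the intuition behind the extrapolation claim.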

Comparison to Other Models

The Jais models were compared with other models, including Llama-2, using GPT-4 as a judge. In that evaluation, the Jais models performed significantly better than previous models at generating both Arabic and English text.

Performance

The Jais Family Model showcases remarkable performance with high accuracy in various tasks, especially excelling in Arabic language processing.

  • Speed: The model’s speed is notable, with the ability to process large amounts of data efficiently.
  • Accuracy: The model’s accuracy is impressive, with high scores in various evaluation benchmarks.
  • Efficiency: The expanded Arabic vocabulary improves fertility (fewer tokens per word), which reduces the compute needed to process Arabic text.

Examples

Prompt: What is the capital of the United Arab Emirates?
Response: The capital of the United Arab Emirates is Abu Dhabi.

Prompt: Can you write a short story in Arabic about a family reunion?
Response: لم تكن ليلة عيد الفطر السنوية لتجمع عائلة أحمد إلا مناسبة خاصة. كان الجميع متحمسين للقاء بعضهم البعض بعد عام طويل من الفراق. عندما دخلوا إلى المنزل، وجدوا أن الطاولة قد أعدت بعناية مع مجموعة متنوعة من الأطباق اللذيذة. بدأوا يتحدثون عن ذكريات الماضي، ويضحكون معًا، ويتبادلون القصص. كانت ليلة رائعة، مليئة بالحب والفرح.

Prompt: Can you summarize a news article about the latest developments in renewable energy in English?
Response: According to recent reports, there has been significant progress in the field of renewable energy, with solar and wind power becoming increasingly cost-competitive with fossil fuels. This shift is expected to continue, with many countries investing heavily in renewable energy infrastructure and technology.

Example Use Cases

The Jais Family Model can be used for a wide range of applications, including:

  • Chat Assistants: The models can be used to develop chat assistants for Arabic-speaking users.
  • Sentiment Analysis: The models can be used to gain insights into local markets and customer trends by analyzing sentiment in Arabic text.
  • Summarization: The models can be used to summarize bilingual Arabic-English documents.
  • Research: The models can be used for research in Arabic Natural Language Processing, including mechanistic interpretability analyses and quantitative studies of Arabic cultural and linguistic phenomena.
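
For a use case such as sentiment analysis, one common approach with a text-completion model is a few-shot prompt. The sketch below only builds the prompt string; the helper name, the prompt layout, the labels, and the Arabic example sentences are all illustrative assumptions, not a format documented for Jais.

```python
def build_few_shot_prompt(task: str, examples: list[tuple[str, str]], query: str) -> str:
    """Assemble a completion-style prompt from labeled examples.

    The prompt ends with an open "Sentiment:" field so that the model's
    continuation is the predicted label for `query`.
    """
    lines = [task, ""]
    for text, label in examples:
        lines.append(f"Text: {text}")
        lines.append(f"Sentiment: {label}")
        lines.append("")
    lines.append(f"Text: {query}")
    lines.append("Sentiment:")
    return "\n".join(lines)


# Hypothetical labeled examples (illustrative only).
examples = [
    ("الخدمة ممتازة", "positive"),  # "The service is excellent"
    ("المنتج سيئ", "negative"),     # "The product is bad"
]

prompt = build_few_shot_prompt(
    "Classify the sentiment of each review as positive or negative.",
    examples,
    "التوصيل كان سريعًا",  # "The delivery was fast"
)
print(prompt)
```

The resulting string could then be passed to a generation function, with the first generated word read back as the label.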

Limitations

While the Jais Family Model is highly versatile, it sometimes generates outputs that lack coherence or factual accuracy, particularly in more complex or nuanced scenarios.

  • Data Quality and Availability: Output quality is bounded by the quality and availability of the training data.
  • Language Limitations: Although the models are designed to be bilingual, they may not perform equally well in both Arabic and English.
  • Model Size and Complexity: Larger models may not always be better, and may require more computational resources and be more difficult to fine-tune.
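
To put the resource point in perspective, here is a back-of-the-envelope estimate of the memory needed just to hold the weights. It assumes 16-bit weights (2 bytes per parameter), uses the nominal parameter counts from the model names, and ignores activations, the KV cache, and framework overhead, so real requirements will be higher. The function name is hypothetical.

```python
def weight_memory_gb(num_params: int, bytes_per_param: int = 2) -> float:
    """Approximate memory for model weights in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bytes_per_param / 1e9


# Nominal sizes of the smallest and largest models in the family.
sizes = {"590M": 590_000_000, "70B": 70_000_000_000}

for name, num_params in sizes.items():
    print(f"{name}: ~{weight_memory_gb(num_params):.2f} GB of weights at 16-bit precision")
```

At roughly 140 GB of weights alone, the 70B model generally has to be sharded across several accelerators or quantized, whereas the smallest model fits comfortably on a single consumer GPU.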

Format

The Jais Family Model is a comprehensive series of bilingual English-Arabic large language models (LLMs) that use a transformer-based, decoder-only architecture. These models are optimized to excel in Arabic while having strong English capabilities.

  • Model Architecture: The models are auto-regressive language models that use a transformer-based, decoder-only architecture.
  • Data Formats: The models support text-only data and generate text outputs.
  • Input Requirements: Text sequences, which must be tokenized before being passed to the model.
  • Output Requirements: The models return text sequences.
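
"Auto-regressive" means the model produces one token at a time, each conditioned on everything generated so far. The loop below sketches that idea with greedy decoding over a toy scoring function. The toy model, its four-token vocabulary, and all names are assumptions made for illustration; a real model would return learned scores over its full vocabulary.

```python
from typing import Callable


def greedy_decode(
    next_token_scores: Callable[[list[int]], list[float]],
    prompt_ids: list[int],
    eos_id: int,
    max_new_tokens: int,
) -> list[int]:
    """Repeatedly append the highest-scoring next token until EOS or the limit."""
    ids = list(prompt_ids)
    for _ in range(max_new_tokens):
        scores = next_token_scores(ids)  # one score per vocabulary entry
        next_id = max(range(len(scores)), key=scores.__getitem__)
        ids.append(next_id)
        if next_id == eos_id:
            break
    return ids


def toy_scores(ids: list[int]) -> list[float]:
    """Toy stand-in for a language model: always prefers (last token + 1) mod 4."""
    vocab_size = 4
    preferred = (ids[-1] + 1) % vocab_size
    return [1.0 if token == preferred else 0.0 for token in range(vocab_size)]


print(greedy_decode(toy_scores, prompt_ids=[0], eos_id=3, max_new_tokens=10))
```

In practice the integer IDs come from the tokenizer on the way in and are decoded back to text on the way out, which is what the input and output requirements above describe.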

Example Code

# -*- coding: utf-8 -*-
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

model_path = "inceptionai/jais-adapted-70b"
device = "cuda" if torch.cuda.is_available() else "cpu"

tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(model_path, device_map="auto", trust_remote_code=True)

if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

def get_response(text, tokenizer=tokenizer, model=model):
    # Tokenize the prompt and move the tensors to the selected device.
    tokenized = tokenizer(text, return_tensors="pt")
    input_ids, attention_mask = tokenized['input_ids'].to(device), tokenized['attention_mask'].to(device)
    input_len = input_ids.shape[-1]

    # Sample a continuation using nucleus (top-p) sampling at a low temperature.
    generate_ids = model.generate(
        input_ids,
        attention_mask=attention_mask,
        top_p=0.9,
        temperature=0.3,
        max_length=2048,            # upper bound on prompt + generated tokens
        min_length=input_len + 4,   # require at least 4 new tokens
        repetition_penalty=1.2,
        do_sample=True,
        pad_token_id=tokenizer.pad_token_id
    )

    # Decode the single returned sequence; the text includes the prompt.
    response = tokenizer.batch_decode(
        generate_ids,
        skip_special_tokens=True,
        clean_up_tokenization_spaces=True
    )[0]

    return response

text = "عاصمة دولة الإمارات العربية المتحدة ه"
print(get_response(text))

text = "The capital of UAE is"
print(get_response(text))
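
The example above passes temperature=0.3 and top_p=0.9 to model.generate. The following standalone sketch shows conceptually what those two settings do to a next-token distribution. It is a simplified illustration with a hypothetical function name, not the code path used inside the transformers library.

```python
import math


def top_p_distribution(logits: list[float], temperature: float, top_p: float) -> dict[int, float]:
    """Return the renormalized probabilities of the tokens kept by top-p sampling."""
    # Temperature scaling: values below 1 sharpen the distribution.
    scaled = [logit / temperature for logit in logits]

    # Softmax, shifted by the maximum for numerical stability.
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]

    # Keep the most likely tokens until their cumulative probability reaches top_p.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break

    # Renormalize over the kept tokens; sampling would draw from this result.
    norm = sum(probs[i] for i in kept)
    return {i: probs[i] / norm for i in kept}


logits = [2.0, 1.0, 0.0]
print(top_p_distribution(logits, temperature=1.0, top_p=0.9))
print(top_p_distribution(logits, temperature=0.3, top_p=0.9))
```

With these toy logits, lowering the temperature from 1.0 to 0.3 concentrates enough probability on the top token that it alone satisfies the top-p cutoff, which is why low-temperature settings tend to give more deterministic completions.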