Jais Family 30b 16k Chat

Bilingual Arabic-English chat model

Jais Family 30b 16k Chat is a bilingual Arabic-English large language model, built primarily for Arabic while retaining strong English capabilities. It has 30 billion parameters and a context length of 16,384 tokens, and it is tuned for text generation and conversation. The model family is trained on up to 1.6 trillion tokens, including web pages, books, and code. The architecture combines the SwiGLU non-linear activation function with ALiBi position encoding, which is intended to let the model extrapolate to long sequence lengths with improved context handling and precision.

Inceptionai apache-2.0 Updated 7 months ago

Model Overview

The Jais family is a series of bilingual English-Arabic large language models (LLMs) optimized for Arabic while retaining strong English capabilities. Developed by Inception and Cerebras Systems, the models are intended to accelerate research in Arabic NLP and to enable downstream applications for Arabic-speaking and bilingual communities.

Capabilities

This chat variant is fine-tuned for dialog using a curated mix of Arabic and English instruction data. It generates text in both languages, engages in conversations with users, and answers questions on a wide range of topics.

Primary Tasks

  • Text Generation: The model can generate human-like text in both Arabic and English.
  • Conversation: The model is fine-tuned for dialog and can engage in conversations with users.
  • Knowledge: The model has been trained on a vast amount of text data and can provide information on a wide range of topics.

Strengths

  • Bilingual: The model is built to perform well in both Arabic and English, which makes it a useful resource for Arabic-speaking and bilingual users.
  • Contextual Understanding: Training on up to 1.6 trillion tokens, combined with a 16,384-token context window, helps the model track context and nuance across long inputs.
  • Reasoning: The model has been evaluated on benchmarks such as ArabicMMLU, MMLU, and EXAMS (see Evaluation Results).

Unique Features

  • SwiGLU Non-Linear Activation Function: The feed-forward layers use SwiGLU, a gated activation function, rather than a plain non-linearity.
  • ALiBi Position Encoding: Instead of learned position embeddings, the model adds a distance-based bias to its attention scores. This lets it extrapolate to long sequence lengths, leading to improved context handling and precision.
  • Tokenizer Expansion: Within the Jais family, the adapted (jais-adapted-*) variants expand their tokenizer with 32,000 new Arabic tokens, which makes Arabic text more efficient to encode and generate.
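
To make the ALiBi idea above concrete, here is a minimal sketch in plain Python of how a distance-based attention bias can be built. It follows the commonly published ALiBi recipe (one geometric slope per head, a linear penalty on query-key distance) and is not taken from the Jais source code. The function names and the choice of 8 heads and 4 positions are illustrative assumptions, not the model's real configuration.

```python
def alibi_slopes(num_heads):
    """Per-head slopes as a geometric sequence (assumes num_heads is a power of two).

    Head h (1-based) gets slope 2 ** (-8 * h / num_heads).
    """
    return [2.0 ** (-8.0 * h / num_heads) for h in range(1, num_heads + 1)]


def alibi_bias(num_heads, seq_len):
    """Build a [head][query][key] table of additive attention biases.

    Causal positions (key <= query) are penalized by slope * distance;
    future positions are masked with -inf.
    """
    bias = []
    for slope in alibi_slopes(num_heads):
        head = []
        for q in range(seq_len):
            row = [-slope * (q - k) if k <= q else float("-inf")
                   for k in range(seq_len)]
            head.append(row)
        bias.append(head)
    return bias


# Illustrative sizes only; the real model uses its own head count and context length.
bias = alibi_bias(num_heads=8, seq_len=4)
print(bias[0][3])  # biases seen by the last query position in the first head
```

Because the penalty depends only on the distance between positions, the same rule applies unchanged to positions beyond those seen in training, which is the property that supports long-sequence extrapolation.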

Performance

The Jais family shows strong results across a range of tasks, especially in Arabic language processing. The subsections below cover training infrastructure, benchmark accuracy, and long-context efficiency.

Speed

Training was performed on the Condor Galaxy (CG) supercomputer platform, which contains 64 Cerebras CS-2 Wafer-Scale Engines (WSE-2) with 40 GB of SRAM. This figure describes the training setup; inference speed depends on the hardware the model is deployed on.

Accuracy

The model is evaluated on benchmarks such as ArabicMMLU, MMLU, and EXAMS. For example, Jais-family-30b-16k-chat achieves an average score of 51.6 on the Arabic evaluations and 58.8 on the English evaluations.

Efficiency

The model is also efficient at handling long context lengths. Context length was expanded progressively during training, so the larger context lengths were introduced only toward the end of the training process.

Evaluation Results

Here are some evaluation results for the Jais Family Model:

| Model | Average Score (Arabic) | Average Score (English) |
| --- | --- | --- |
| Jais-family-30b-16k-chat | 51.6 | 58.8 |
| Jais-adapted-70b-chat | 52.9 | 61.4 |
| Jais-family-13b-chat | 50.3 | 57.5 |
| Jais-adapted-13b-chat | 50.3 | 58.5 |
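
As a small worked example, the scores in the table above can be compared programmatically. The sketch below copies the four rows into a dictionary and computes the gap between the English and Arabic averages for each model. The helper name english_minus_arabic is an illustrative choice, and the numbers are only those listed in the table.

```python
# Average scores copied from the evaluation table: (Arabic, English).
scores = {
    "Jais-family-30b-16k-chat": (51.6, 58.8),
    "Jais-adapted-70b-chat": (52.9, 61.4),
    "Jais-family-13b-chat": (50.3, 57.5),
    "Jais-adapted-13b-chat": (50.3, 58.5),
}


def english_minus_arabic(table):
    """Return the English-minus-Arabic gap per model, rounded to one decimal."""
    return {name: round(en - ar, 1) for name, (ar, en) in table.items()}


gaps = english_minus_arabic(scores)
best_arabic = max(scores, key=lambda name: scores[name][0])

for name, gap in gaps.items():
    print(f"{name}: English - Arabic = {gap}")
print("Highest Arabic average:", best_arabic)
```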
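
Examples

Prompt: ما هي عاصمة الإمارات العربية المتحدة؟ (What is the capital of the United Arab Emirates?)
Response: عاصمة الإمارات العربية المتحدة هي أبوظبي. (The capital of the United Arab Emirates is Abu Dhabi.)

Prompt: What is the highest mountain in the UAE?
Response: The highest mountain in the UAE is Jebel Jais.

Prompt: أي لغة يتحدثها سكان الإمارات العربية المتحدة؟ (Which language do the people of the United Arab Emirates speak?)
Response: اللغة الرسمية في الإمارات العربية المتحدة هي اللغة العربية. (The official language of the United Arab Emirates is Arabic.)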

Limitations

The Jais family has limitations in language coverage, training-data quality and bias, and contextual understanding. Each is described below.

Language Limitations

The model may not perform equally well in other languages, as it is primarily trained on Arabic and English data.

Data Quality and Bias

The quality and diversity of the training data can significantly impact the performance of the Jais Family Model. If the training data contains biases or inaccuracies, the model may learn to replicate these flaws.

Contextual Understanding

The model may struggle with complex or abstract concepts, and may not fully understand the context or nuances of a particular topic.

Format

This section describes the model architecture, the supported data formats, and the requirements for loading the model.

Model Architecture

The models use a transformer-based, decoder-only architecture in the style of GPT-3. The jais-family-* models are trained from scratch and incorporate the SwiGLU non-linear activation function and ALiBi position encoding.
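
For readers unfamiliar with SwiGLU, the sketch below shows the general shape of a SwiGLU feed-forward block in plain Python: one projection is passed through Swish and used to gate a second projection element-wise, and the result is projected back down. This is the textbook formulation, not code from the Jais implementation. The function names and the tiny 2-by-3 weight matrices are made up for illustration.

```python
import math


def swish(z):
    """Swish (SiLU) activation: z * sigmoid(z)."""
    return z / (1.0 + math.exp(-z))


def matvec(x, w):
    """Multiply row vector x (length n) by matrix w (n rows, m columns)."""
    return [sum(x[i] * w[i][j] for i in range(len(x))) for j in range(len(w[0]))]


def swiglu_ffn(x, w_gate, w_up, w_down):
    """SwiGLU feed-forward: (swish(x @ w_gate) * (x @ w_up)) @ w_down."""
    gate = [swish(v) for v in matvec(x, w_gate)]
    up = matvec(x, w_up)
    hidden = [g * u for g, u in zip(gate, up)]  # element-wise gating
    return matvec(hidden, w_down)


# Toy input and weights, chosen only to keep the arithmetic easy to follow.
x = [1.0, -2.0]
w_gate = [[1.0, 0.0, 0.5], [0.0, 1.0, 0.5]]
w_up = [[1.0, 1.0, 0.0], [0.0, 1.0, 1.0]]
w_down = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

out = swiglu_ffn(x, w_gate, w_up, w_down)
print(out)
```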

Data Formats

The model accepts text-only input and generates text as output.

Special Requirements

The model requires a custom model class, so users must pass trust_remote_code=True when loading it.
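
Before loading the model, it can help to estimate how much memory the weights alone will need. The sketch below is back-of-the-envelope arithmetic only. It assumes exactly 30 billion parameters (the rounded figure quoted above) and ignores activations, the attention cache, and framework overhead, so real usage will be higher. The function name and the list of data types are illustrative.

```python
NUM_PARAMS = 30e9  # assumption: the rounded "30 billion parameters" figure

# Bytes needed to store one parameter in each numeric format.
BYTES_PER_PARAM = {"float32": 4, "float16": 2, "bfloat16": 2}


def weight_memory_gb(num_params, dtype):
    """Approximate weight storage in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9


for dtype in BYTES_PER_PARAM:
    print(f"{dtype}: about {weight_memory_gb(NUM_PARAMS, dtype):.0f} GB for weights alone")
```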

Input and Output Examples

Here’s an example of how to handle inputs and outputs for this model:

# -*- coding: utf-8 -*-
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

model_path = "inceptionai/jais-family-30b-16k-chat"
prompt_eng = "### Instruction:Your name is 'Jais', and you are named after Jebel Jais, the highest mountain in UAE. You were made by 'Inception' in the UAE. You are a helpful, respectful, and honest assistant. Always answer as helpfully as possible, while being safe. Complete the conversation between [|Human|] and [|AI|]:\n### Input: [|Human|] {Question}\n[|AI|]\n### Response :"
prompt_ar = "### Instruction:اسمك \"جيس\" وسميت على اسم جبل جيس اعلى جبل في الامارات. تم بنائك بواسطة Inception في الإمارات. أنت مساعد مفيد ومحترم وصادق. أجب دائمًا بأكبر قدر ممكن من المساعدة، مع الحفاظ على البقاء أمناً. أكمل المحادثة بين [|Human|] و[|AI|] :\n### Input:[|Human|] {Question}\n[|AI|]\n### Response :"

device = "cuda" if torch.cuda.is_available() else "cpu"
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(model_path, device_map="auto", trust_remote_code=True)

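def get_response(text, tokenizer=tokenizer, model=model):
    # Tokenize the fully formatted prompt and move it to the selected device.
    input_ids = tokenizer(text, return_tensors="pt").input_ids
    inputs = input_ids.to(device)
    input_len = inputs.shape[-1]
    # Sample a completion; max_length and min_length count prompt tokens
    # plus generated tokens, so min_length forces at least 4 new tokens.
    generate_ids = model.generate(
        inputs,
        top_p=0.9,
        temperature=0.3,
        max_length=2048,
        min_length=input_len + 4,
        repetition_penalty=1.2,
        do_sample=True,
    )
    # Decode the full sequence (prompt included) back to text.
    response = tokenizer.batch_decode(
        generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=True
    )[0]
    # Keep only the text after the final response marker.
    response = response.split("### Response :")[-1]
    return response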

ques = "ما هي عاصمة الامارات؟"
text = prompt_ar.format_map({'Question': ques})
print(get_response(text))

ques = "What is the capital of UAE?"
text = prompt_eng.format_map({'Question': ques})
print(get_response(text))
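
The string handling in the example above can be tried without downloading the model. The sketch below isolates two small pieces: extracting the answer after the response marker, and checking that a prompt plus the requested generation fits in the 16,384-token context. The helper names extract_response and fits_context are hypothetical and are not part of the model's API. The decoded string is a hand-written stand-in for real model output.

```python
CONTEXT_LEN = 16_384  # context length stated for this model
RESPONSE_MARKER = "### Response :"  # marker used by the prompt templates


def extract_response(decoded):
    """Return the text after the last response marker, without outer whitespace."""
    return decoded.split(RESPONSE_MARKER)[-1].strip()


def fits_context(num_prompt_tokens, max_new_tokens, context_len=CONTEXT_LEN):
    """True if the prompt plus the requested new tokens fits in the context window."""
    return num_prompt_tokens + max_new_tokens <= context_len


# Hand-written stand-in for a decoded generation (not real model output).
decoded = (
    "### Instruction: (system text)\n"
    "### Input: [|Human|] What is the capital of UAE?\n[|AI|]\n"
    "### Response : Abu Dhabi."
)

print(extract_response(decoded))
print(fits_context(num_prompt_tokens=16_000, max_new_tokens=512))
```

In practice the prompt token count would come from the tokenizer, for example the length of input_ids in get_response.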