Jais Family 30b 8k Chat

Bilingual chat model

Jais Family 30b 8k Chat is a bilingual Arabic-English chat model, part of a series of bilingual large language models built to close the gap in Arabic NLP. It has 30 billion parameters and a context length of 8,192 tokens, and handles tasks such as text generation and conversation. The architecture is a transformer-based, decoder-only design with SwiGLU non-linear activation and ALiBi position encoding; ALiBi lets the model extrapolate to long sequence lengths, which improves context handling and precision. The model is fine-tuned for dialog on a curated mix of Arabic and English instruction data, making it suited to both research in Arabic NLP and downstream applications for the Arabic-speaking community.

Inceptionai apache-2.0 Updated 7 months ago

Model Overview

The Jais Family is a series of bilingual English-Arabic large language models (LLMs) developed by Inception and Cerebras Systems. The models are optimized to excel in Arabic while retaining strong English capabilities.

Capabilities

The Jais Family models are designed to handle a wide range of tasks, including:

  • Text generation: The model can generate high-quality text in both Arabic and English.
  • Conversational dialogue: The model is fine-tuned for dialog using a curated mix of Arabic and English instruction data, making it suitable for chatbot applications.
  • Reasoning and knowledge: The model demonstrates strong reasoning and knowledge capabilities, as shown in its evaluation results on various benchmarks.

What sets Jais Family Model apart?

  • Bilingual capabilities: The model is optimized to excel in Arabic while having strong English capabilities, making it a valuable resource for Arabic-speaking and bilingual communities.
  • Large-scale training data: The model is trained on up to 1.6 trillion tokens of diverse English, Arabic, and code data, which enables it to learn complex patterns and relationships in language.
  • Advanced architecture: The model uses a transformer-based, decoder-only architecture in the style of GPT-3, with two enhancements: the SwiGLU non-linear activation function and ALiBi position encoding.
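
To make the ALiBi idea concrete, here is a minimal sketch in plain Python of how linear attention biases are commonly computed: each head gets a fixed slope, and the penalty added to an attention score grows linearly with the distance between query and key. The slope formula is the geometric sequence used in the ALiBi paper for head counts that are powers of two. The function names are illustrative and are not taken from the Jais code.

```python
def alibi_slopes(num_heads):
    """Per-head slopes: a geometric sequence starting at 2^(-8/num_heads).

    Assumption: num_heads is a power of two (the simple case).
    """
    start = 2.0 ** (-8.0 / num_heads)
    return [start ** (i + 1) for i in range(num_heads)]


def alibi_bias(slope, seq_len):
    """Causal bias for one head: -slope * (i - j) for keys j <= query i.

    Future positions (j > i) are masked out with -inf.
    """
    return [
        [-slope * (i - j) if j <= i else float("-inf") for j in range(seq_len)]
        for i in range(seq_len)
    ]


slopes = alibi_slopes(8)          # 1/2, 1/4, ..., 1/256
bias = alibi_bias(slopes[0], 4)   # 4x4 bias matrix for the first head
```

Because the bias depends only on relative distance and not on a learned embedding per position, the same rule applies to positions beyond those seen in training, which is the usual explanation for ALiBi's length extrapolation.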

Evaluation results

The Jais Family Model has been evaluated on various benchmarks, including:

  • Arabic benchmarks: The model achieves strong results on Arabic benchmarks, including ArabicMMLU, MMLU, EXAMS, LitQA, and more.
  • English benchmarks: The model also performs well on English benchmarks, including MMLU, RACE, Hellaswag, PIQA, and more.

Examples

Prompt: ما هي عاصمة الامارات؟
Response: عاصمة الإمارات العربية المتحدة هي أبوظبي

Prompt: What is the capital of UAE?
Response: The capital of the United Arab Emirates is Abu Dhabi

Prompt: Write a short story about a character who learns a new language.
Response: Amira had always been fascinated by the Arabic language. She decided to take a class and soon found herself immersed in the sounds and rhythms of the language. As she progressed, she began to appreciate the beauty of Arabic literature and the rich culture it represented. With each new word and phrase she learned, Amira felt a sense of accomplishment and connection to the language.

Performance

The Jais Family Models perform well across a range of Arabic and English tasks. The sections below cover speed, accuracy, and efficiency.

Speed

The Jais Family Models are designed to be fast and efficient. The 16k variant can process up to 16,384 tokens in a single context, while this 8k model handles up to 8,192, so both can capture long-range dependencies in long inputs.
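
A context window is a shared budget: prompt tokens and generated tokens both count against it. The sketch below shows one way to check that budget before calling the model. It assumes the 8,192-token window stated for this 8k model; the helper names are hypothetical and not part of any library.

```python
CONTEXT_LEN_8K = 8192  # context length stated for jais-family-30b-8k-chat


def remaining_tokens(prompt_tokens, context_len=CONTEXT_LEN_8K):
    """Tokens left for generation once the prompt occupies the window."""
    if prompt_tokens < 0:
        raise ValueError("prompt_tokens must be non-negative")
    return max(context_len - prompt_tokens, 0)


def fits_context(prompt_tokens, new_tokens, context_len=CONTEXT_LEN_8K):
    """True if the prompt plus the requested generation fits in the window."""
    return prompt_tokens + new_tokens <= context_len


print(remaining_tokens(6000))      # room left after a 6,000-token prompt
print(fits_context(6000, 2500))    # would 2,500 new tokens still fit?
```

In practice, `prompt_tokens` would come from the length of the tokenizer's `input_ids` for the formatted prompt.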

Accuracy

The Jais Family Models have shown impressive accuracy in various benchmarks, including Arabic and English evaluations. They have achieved high scores in tasks such as knowledge, reasoning, and misinformation/bias detection.

Model                 Average Score
jais-family-30b-16k   51.6
jais-family-30b-8k    51.4
jais-family-13b       50.3
jais-family-6p7b      48.7
jais-family-2p7b      45.6
jais-family-1p3b      42.7
jais-family-590m      37.8

Efficiency

The Jais Family Models are designed to be efficient in terms of computational resources. The family spans sizes from 590M to 30B parameters, so smaller variants can be fine-tuned or deployed where compute is limited, and the models can be adapted to new tasks and domains.
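
As a rough guide to what "computational resources" means at this scale, the sketch below estimates the memory needed just to hold the model weights: parameter count times bytes per parameter. It uses the nominal 30 billion parameters from the model description and is a back-of-the-envelope estimate only; activations, the KV cache, and framework overhead are not included, and the function name is hypothetical.

```python
# Bytes used to store one parameter in common numeric formats.
BYTES_PER_PARAM = {"float32": 4, "bfloat16": 2, "float16": 2, "int8": 1}


def weight_memory_gb(num_params, dtype="bfloat16"):
    """Approximate weight storage in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9


for dtype in ("float32", "bfloat16", "int8"):
    print(dtype, weight_memory_gb(30_000_000_000, dtype), "GB")
```

The same arithmetic applied to the smaller family members shows why they are the practical choice on a single consumer GPU.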

Limitations

The Jais Family Model is a powerful tool for generating human-like text in both Arabic and English, but it’s not perfect. Here are some of its limitations:

  • Data Bias: The training data may contain biases, and the model can reproduce them in responses that are biased or discriminatory.
  • Limited Context Understanding: The model’s ability to understand context is limited to the input prompt and the training data it was exposed to. It may not always understand the nuances of human communication, such as sarcasm, idioms, or figurative language.

Format

The Jais Family models are bilingual English-Arabic large language models (LLMs) that use a transformer-based, decoder-only architecture in the style of GPT-3.

Model Architecture

  • Jais models (jais-family-*) are trained from scratch, incorporating the SwiGLU non-linear activation function and ALiBi position encoding.
  • Jais adapted models (jais-adapted-*) are built on top of Llama-2, which employs RoPE position embedding and Grouped Query Attention.
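
For readers unfamiliar with SwiGLU, the following is a minimal plain-Python sketch of a SwiGLU feed-forward block: one projection is passed through the SiLU (Swish) function and used to gate a second projection element-wise, and the result is projected back down. Biases are omitted for brevity, the weight names are illustrative, and this is not the Jais implementation.

```python
import math


def silu(x):
    """SiLU / Swish activation: x * sigmoid(x)."""
    return x / (1.0 + math.exp(-x))


def matvec(matrix, vec):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, vec)) for row in matrix]


def swiglu_ffn(x, w_gate, w_up, w_down):
    """SwiGLU feed-forward: w_down @ (silu(w_gate @ x) * (w_up @ x))."""
    gate = [silu(g) for g in matvec(w_gate, x)]
    up = matvec(w_up, x)
    hidden = [g * u for g, u in zip(gate, up)]
    return matvec(w_down, hidden)


# Toy 2-dimensional example with identity projections.
identity = [[1.0, 0.0], [0.0, 1.0]]
out = swiglu_ffn([1.0, 2.0], identity, identity, [[1.0, 1.0]])
```

Compared with a plain two-layer feed-forward block, the gated form uses three weight matrices instead of two, which is why SwiGLU models usually shrink the hidden width to keep the parameter count comparable.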

Input and Output

  • Input: Text only data
  • Output: Model generates text

Data Formats

  • Supported languages: Arabic (MSA) and English
  • Data sources: Web, code, books, scientific papers, and synthetic data

Special Requirements

  • Custom model class: Loading the model requires a custom model class, so pass trust_remote_code=True when loading it.
  • Tokenizer expansion: For the adapted models, Arabic data is added to the Llama-2 tokenizer, which improves fertility (fewer tokens per word) and compute efficiency.
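
Fertility is commonly measured as the average number of tokens a tokenizer produces per word; a lower value means the same text costs fewer tokens and therefore less compute. The sketch below shows that calculation under simple assumptions: words are split on whitespace, and `toy_tokenize` is a hypothetical stand-in for a real tokenizer, used only so the example runs on its own.

```python
def fertility(texts, tokenize):
    """Average tokens per whitespace-separated word over a list of texts."""
    total_tokens = sum(len(tokenize(text)) for text in texts)
    total_words = sum(len(text.split()) for text in texts)
    if total_words == 0:
        raise ValueError("texts contain no words")
    return total_tokens / total_words


def toy_tokenize(text):
    """Hypothetical tokenizer: cuts every word into 2-character pieces."""
    return [word[i:i + 2] for word in text.split() for i in range(0, len(word), 2)]


print(fertility(["abcd ef"], toy_tokenize))
```

To compare real tokenizers, pass a callable that returns each tokenizer's token list for a string and run it over the same Arabic sample.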

Example Code

# -*- coding: utf-8 -*-
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

model_path = "inceptionai/jais-family-30b-8k-chat"
prompt_eng = "### Instruction:Your name is 'Jais', and you are named after Jebel Jais, the highest mountain in UAE. You were made by 'Inception' in the UAE. You are a helpful, respectful, and honest assistant. Always answer as helpfully as possible, while being safe. Complete the conversation between [|Human|] and [|AI|]:\n### Input: [|Human|] {Question}\n[|AI|]\n### Response :"
prompt_ar = "### Instruction:اسمك \"جيس\" وسميت على اسم جبل جيس اعلى جبل في الامارات. تم بنائك بواسطة Inception في الإمارات. أنت مساعد مفيد ومحترم وصادق. أجب دائمًا بأكبر قدر ممكن من المساعدة، مع الحفاظ على البقاء أمناً. أكمل المحادثة بين [|Human|] و[|AI|] :\n### Input:[|Human|] {Question}\n[|AI|]\n### Response :"

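# Use a GPU when one is available; device_map="auto" lets Accelerate place the weights.
device = "cuda" if torch.cuda.is_available() else "cpu"
tokenizer = AutoTokenizer.from_pretrained(model_path)
# trust_remote_code=True is required because Jais ships a custom model class.
model = AutoModelForCausalLM.from_pretrained(model_path, device_map="auto", trust_remote_code=True)

def get_response(text, tokenizer=tokenizer, model=model):
    """Generate a reply for a fully formatted prompt and return only the response text."""
    input_ids = tokenizer(text, return_tensors="pt").input_ids
    inputs = input_ids.to(device)
    input_len = inputs.shape[-1]
    generate_ids = model.generate(
        inputs,
        top_p=0.9,
        temperature=0.3,
        max_length=2048,  # total length: prompt tokens plus generated tokens
        min_length=input_len + 4,  # force at least a few new tokens
        repetition_penalty=1.2,
        do_sample=True,
    )
    response = tokenizer.batch_decode(
        generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=True
    )[0]
    # The decoded text includes the prompt; keep only what follows the response marker.
    response = response.split("### Response :")[-1]
    return response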

ques = "ما هي عاصمة الامارات؟"
text = prompt_ar.format_map({'Question': ques})
print(get_response(text))

ques = "What is the capital of UAE?"
text = prompt_eng.format_map({'Question': ques})
print(get_response(text))