Jais Family 30b 16k Chat
Jais Family 30b 16k Chat is a bilingual English-Arabic large language model designed to excel in Arabic while retaining strong English capabilities. It has 30 billion parameters and a context length of 16,384 tokens, and it is optimized for tasks such as text generation and conversation. The model is trained on a dataset of up to 1.6 trillion tokens, including web pages, books, and code. Its architecture combines the SwiGLU non-linear activation function with ALiBi position encoding, which lets it extrapolate to long sequence lengths with improved context handling and precision. It is suited to both Arabic and English tasks.
Model Overview
The Jais family is a series of bilingual English-Arabic large language models (LLMs) that excel in Arabic while having strong English capabilities. Developed by Inception and Cerebras Systems, the family is designed to accelerate research in Arabic NLP and enable numerous downstream applications for the Arabic-speaking and bilingual community.
Capabilities
The Jais Family Model is optimized to generate text and is fine-tuned for dialog using a curated mix of Arabic and English instruction data. This model can generate human-like text in both Arabic and English, engage in conversations with users, and provide information on a wide range of topics.
Primary Tasks
- Text Generation: The model can generate human-like text in both Arabic and English.
- Conversation: The model is fine-tuned for dialog and can engage in conversations with users.
- Knowledge: The model has been trained on a vast amount of text data and can provide information on a wide range of topics.
Strengths
- Bilingual: The model is one of the first to excel in both Arabic and English, making it a valuable resource for the Arabic-speaking community.
- Contextual Understanding: The model has been trained on a large dataset of text and can understand context and nuances of language.
- Reasoning: The model has been evaluated on various benchmarks and has shown strong reasoning capabilities.
Unique Features
- SwiGLU Non-Linear Activation Function: The model uses the SwiGLU gated activation in its feed-forward layers.
- ALiBi Position Encoding: The model uses ALiBi (Attention with Linear Biases), which penalizes attention scores in proportion to the distance between tokens. Together with SwiGLU, this allows the model to extrapolate at long sequence lengths, leading to improved context handling and precision.
- Tokenizer Expansion: In the Jais-adapted variants, which build on Llama-2, the tokenizer has been expanded with 32,000 new Arabic tokens, making it more effective at generating text in Arabic.
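To make the ALiBi idea concrete, here is a minimal pure-Python sketch of the per-head attention bias. It follows the general ALiBi recipe (geometric slopes, linear distance penalty) and is not taken from the Jais code; the function names `alibi_slopes` and `alibi_bias` are hypothetical, and the slope formula assumes the number of heads is a power of two.

```python
import math

def alibi_slopes(num_heads):
    """Per-head slopes: a geometric sequence starting at 2^(-8/num_heads).

    Assumes num_heads is a power of two (illustrative simplification).
    """
    start = 2.0 ** (-8.0 / num_heads)
    return [start ** (h + 1) for h in range(num_heads)]

def alibi_bias(slope, seq_len):
    """Causal bias for one head, added to attention scores before softmax.

    Entry [i][j] is -slope * (i - j) for keys at or before the query,
    and -inf for future positions (the causal mask).
    """
    return [
        [-slope * (i - j) if j <= i else -math.inf for j in range(seq_len)]
        for i in range(seq_len)
    ]

slopes = alibi_slopes(8)          # one slope per attention head
bias = alibi_bias(slopes[0], 4)   # 4x4 bias matrix for the first head
```

Because the penalty depends only on the distance between query and key, the same rule applies unchanged to sequences longer than those seen in training, which is the property behind the long-sequence extrapolation described above.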
Performance
The Jais family performs strongly across a range of tasks, especially Arabic language processing. The notes below cover training infrastructure, benchmark accuracy, and long-context handling.
Speed
Training was performed on the Condor Galaxy (CG) supercomputer platform, which contains 64 Cerebras CS-2 Wafer-Scale Engines (WSE-2) with 40 GB of SRAM. This describes the training infrastructure; inference speed depends on the hardware used to serve the model.
Accuracy
The model scores well on benchmarks such as ArabicMMLU, MMLU, and EXAMS, particularly on Arabic language tasks. For example, Jais-family-30b-16k-chat achieves an average score of 51.6 on the Arabic evaluations and 58.8 on the English evaluations.
Efficiency
The model is also notable for its handling of long context lengths. The context length was expanded progressively during training, so the largest context lengths were introduced only towards the end of the training process.
Evaluation Results
Here are some evaluation results for the Jais Family Model:
Model | Average Score (Arabic) | Average Score (English) |
---|---|---|
Jais-family-30b-16k-chat | 51.6 | 58.8 |
Jais-adapted-70b-chat | 52.9 | 61.4 |
Jais-family-13b-chat | 50.3 | 57.5 |
Jais-adapted-13b-chat | 50.3 | 58.5 |
Limitations
While the Jais Family Model is a powerful tool, it has some limitations. The model may not perform equally well in other languages, and its performance may be impacted by the quality and diversity of the training data.
Language Limitations
The model may not perform equally well in other languages, as it is primarily trained on Arabic and English data.
Data Quality and Bias
The quality and diversity of the training data can significantly impact the performance of the Jais Family Model. If the training data contains biases or inaccuracies, the model may learn to replicate these flaws.
Contextual Understanding
The model may struggle with complex or abstract concepts, and may not fully understand the context or nuances of a particular topic.
Format
The Jais Family Model is a comprehensive series of bilingual English-Arabic large language models (LLMs). These models are optimized to excel in Arabic while having strong English capabilities.
Model Architecture
The model uses a transformer-based, decoder-only architecture in the style of GPT-3. The models are trained from scratch and incorporate the SwiGLU non-linear activation function and ALiBi position encoding.
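As an illustration of what a SwiGLU feed-forward block computes, the sketch below implements the standard formulation in pure Python: a Swish-activated gate projection multiplied elementwise with a second projection, followed by a down projection. This is a generic sketch, not the Jais implementation; the names `swish`, `matvec`, and `swiglu_ffn` are hypothetical, and the bias-free projections are an assumption made for brevity.

```python
import math

def swish(x):
    """Swish (SiLU) activation: x * sigmoid(x)."""
    return x / (1.0 + math.exp(-x))

def matvec(matrix, vec):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, vec)) for row in matrix]

def swiglu_ffn(x, w_gate, w_up, w_down):
    """SwiGLU feed-forward: w_down @ (swish(w_gate @ x) * (w_up @ x)).

    Bias terms are omitted (assumption for this sketch).
    """
    gate = [swish(g) for g in matvec(w_gate, x)]
    up = matvec(w_up, x)
    hidden = [g * u for g, u in zip(gate, up)]  # elementwise gating
    return matvec(w_down, hidden)

# Tiny example: 2-dimensional input, 1 hidden unit, 2-dimensional output.
out = swiglu_ffn([1.0, 2.0], [[0.0, 0.0]], [[1.0, 1.0]], [[1.0], [1.0]])
```

Compared with a plain activation, the gate lets the layer scale each hidden unit by a learned, input-dependent factor.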
Data Formats
The model accepts text-only input and generates text as output.
Special Requirements
The model requires a custom model class, so users must pass trust_remote_code=True when loading the model.
Input and Output Examples
Here’s an example of how to handle inputs and outputs for this model:
# -*- coding: utf-8 -*-
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM
model_path = "inceptionai/jais-family-30b-16k-chat"
prompt_eng = "### Instruction:Your name is 'Jais', and you are named after Jebel Jais, the highest mountain in UAE. You were made by 'Inception' in the UAE. You are a helpful, respectful, and honest assistant. Always answer as helpfully as possible, while being safe. Complete the conversation between [|Human|] and [|AI|]:\n### Input: [|Human|] {Question}\n[|AI|]\n### Response :"
prompt_ar = "### Instruction:اسمك \"جيس\" وسميت على اسم جبل جيس اعلى جبل في الامارات. تم بنائك بواسطة Inception في الإمارات. أنت مساعد مفيد ومحترم وصادق. أجب دائمًا بأكبر قدر ممكن من المساعدة، مع الحفاظ على البقاء أمناً. أكمل المحادثة بين [|Human|] و[|AI|] :\n### Input:[|Human|] {Question}\n[|AI|]\n### Response :"
device = "cuda" if torch.cuda.is_available() else "cpu"
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(model_path, device_map="auto", trust_remote_code=True)
def get_response(text, tokenizer=tokenizer, model=model):
    # Tokenize the prompt and move it to the selected device.
    input_ids = tokenizer(text, return_tensors="pt").input_ids
    inputs = input_ids.to(device)
    input_len = inputs.shape[-1]
    # Sample a completion; max_length counts prompt plus generated tokens.
    generate_ids = model.generate(
        inputs,
        top_p=0.9,
        temperature=0.3,
        max_length=2048,
        min_length=input_len + 4,
        repetition_penalty=1.2,
        do_sample=True,
    )
    response = tokenizer.batch_decode(
        generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=True
    )[0]
    # Keep only the text after the final response marker.
    response = response.split("### Response :")[-1]
    return response
ques = "ما هي عاصمة الامارات؟"
text = prompt_ar.format_map({'Question': ques})
print(get_response(text))
ques = "What is the capital of UAE?"
text = prompt_eng.format_map({'Question': ques})
print(get_response(text))
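The example above caps generation with max_length=2048, which in Transformers counts the prompt and the generated tokens together, so a long prompt leaves little room for the answer. The sketch below shows one way to compute a budget that could be passed as `max_new_tokens` instead, along with a small helper for extracting the answer. The helper names, the default of 512 requested tokens, and the whitespace stripping are all assumptions for illustration; only the 16,384-token context length and the "### Response :" marker come from this page.

```python
RESPONSE_MARKER = "### Response :"  # marker used in the prompt templates above
CONTEXT_LEN = 16384                 # context length stated for this model

def extract_response(decoded):
    """Return the text after the last response marker, with whitespace stripped."""
    return decoded.split(RESPONSE_MARKER)[-1].strip()

def new_token_budget(prompt_tokens, requested=512, context_len=CONTEXT_LEN):
    """Largest generation length that keeps prompt + output inside the context.

    `requested` is a hypothetical default, not a recommended setting.
    """
    remaining = context_len - prompt_tokens
    if remaining <= 0:
        raise ValueError("prompt already fills the context window")
    return min(requested, remaining)

answer = extract_response("### Input: [|Human|] Hi\n[|AI|]\n### Response : Hello! ")
budget = new_token_budget(prompt_tokens=16000)
```

In `get_response`, the prompt length is already available as `input_len`, so the budget could be computed from it before calling `model.generate`.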