Jais Family 30b 8k Chat
Jais Family 30b 8k Chat is a bilingual Arabic-English chat model with 30 billion parameters and a context length of 8,192 tokens. It belongs to the Jais family, a series of bilingual large language models built to close the gap in Arabic NLP. The architecture is a transformer-based, decoder-only design with the SwiGLU non-linear activation and ALiBi position encoding. ALiBi lets the model extrapolate to long sequence lengths, which improves context handling and precision. The model is fine-tuned for dialog on a curated mix of Arabic and English instruction data, which makes it suitable for text generation, conversation, Arabic NLP research, and downstream applications for the Arabic-speaking community.
Model Overview
The Jais family is a series of bilingual English-Arabic large language models (LLMs) developed by Inception and Cerebras Systems. The models are optimized to excel in Arabic while retaining strong English capabilities.
Capabilities
The models handle a wide range of tasks in both Arabic and English, including:
- Text generation: The model can generate high-quality text in both Arabic and English.
- Conversational dialogue: The model is fine-tuned for dialog using a curated mix of Arabic and English instruction data, making it suitable for chatbot applications.
- Reasoning and knowledge: The model demonstrates strong reasoning and knowledge capabilities, as shown in its evaluation results on various benchmarks.
What sets the Jais family apart?
- Bilingual capabilities: The model is optimized to excel in Arabic while having strong English capabilities, making it a valuable resource for Arabic-speaking and bilingual communities.
- Large-scale training data: The model is trained on up to 1.6 trillion tokens of diverse English, Arabic, and code data, which enables it to learn complex patterns and relationships in language.
- Advanced architecture: The model uses a transformer-based, decoder-only architecture (GPT-3) and incorporates architectural enhancements such as SwiGLU non-linear activation function and ALiBi position encoding.
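As a rough illustration of the ALiBi position encoding mentioned above, the sketch below computes per-head slopes and the causal bias added to attention scores, following the original ALiBi formulation for power-of-two head counts. The function names `alibi_slopes` and `alibi_bias` are hypothetical, and whether Jais uses exactly these slope values is an assumption, not something the model card states.

```python
import math
from typing import List


def alibi_slopes(num_heads: int) -> List[float]:
    """Per-head slopes 2^(-8*i/num_heads) for i = 1..num_heads.

    Assumption: num_heads is a power of two, as in the basic ALiBi recipe.
    """
    if num_heads <= 0 or num_heads & (num_heads - 1):
        raise ValueError("this sketch only handles power-of-two head counts")
    base = 2.0 ** (-8.0 / num_heads)
    return [base ** i for i in range(1, num_heads + 1)]


def alibi_bias(slope: float, seq_len: int) -> List[List[float]]:
    """Causal bias for one head.

    Entry [i][j] is -slope * (i - j) when key j is at or before query i,
    and -inf when j is in the future (masked out).
    """
    return [
        [-slope * (i - j) if j <= i else -math.inf for j in range(seq_len)]
        for i in range(seq_len)
    ]


if __name__ == "__main__":
    slopes = alibi_slopes(8)
    for row in alibi_bias(slopes[0], 4):
        print(row)
```

Because the penalty depends only on the distance between query and key, no learned position embedding is tied to a fixed maximum length, which is the property that supports extrapolation to longer sequences.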
Evaluation results
The Jais Family Model has been evaluated on various benchmarks, including:
- Arabic benchmarks: The model achieves strong results on Arabic benchmarks, including ArabicMMLU, MMLU, EXAMS, LitQA, and more.
- English benchmarks: The model also performs well on English benchmarks, including MMLU, RACE, Hellaswag, PIQA, and more.
Performance
The Jais family models are evaluated in both Arabic and English. This section covers their speed, accuracy, and efficiency.
Speed
The Jais family models are designed to be fast and efficient. Context length varies across the family: this model accepts up to 8,192 tokens in a single context, while the 16k variant (jais-family-30b-16k) accepts up to 16,384. The longer window helps with long-range dependencies and complex tasks.
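A small sketch of how the context window bounds generation, assuming the 8,192-token window stated for this model in the introduction. The helper `max_new_tokens` and the `reserve` parameter are hypothetical names for illustration. Note that in the Transformers `generate` call, `max_length` counts prompt tokens plus generated tokens, so the same subtraction applies there.

```python
CONTEXT_LENGTH = 8192  # context window stated for the 8k chat model


def max_new_tokens(
    prompt_tokens: int,
    context_length: int = CONTEXT_LENGTH,
    reserve: int = 0,
) -> int:
    """Return how many tokens can still be generated after the prompt.

    `reserve` (hypothetical) holds back a few tokens as a safety margin.
    The result is clamped at zero when the prompt already fills the window.
    """
    if prompt_tokens < 0 or reserve < 0:
        raise ValueError("token counts must be non-negative")
    return max(0, context_length - prompt_tokens - reserve)


if __name__ == "__main__":
    print(max_new_tokens(1000))
    print(max_new_tokens(9000))
```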
Accuracy
The Jais Family Models have shown impressive accuracy in various benchmarks, including Arabic and English evaluations. They have achieved high scores in tasks such as knowledge, reasoning, and misinformation/bias detection.
Model | Average Score |
---|---|
jais-family-30b-16k | 51.6 |
jais-family-30b-8k | 51.4 |
jais-family-13b | 50.3 |
jais-family-6p7b | 48.7 |
jais-family-2p7b | 45.6 |
jais-family-1p3b | 42.7 |
jais-family-590m | 37.8 |
Efficiency
The Jais family models are designed to be efficient in terms of computational resources. The family spans sizes from 590M to 30B parameters, so a smaller variant can be chosen when compute is limited, and the models can be fine-tuned to adapt to new tasks and domains.
Limitations
The Jais Family Model is a powerful tool for generating human-like text in both Arabic and English, but it’s not perfect. Here are some of its limitations:
- Data Bias: The model is trained on a dataset that may reflect biases present in the data. This can result in the model generating responses that are biased or discriminatory.
- Limited Context Understanding: The model’s ability to understand context is limited to the input prompt and the training data it was exposed to. It may not always understand the nuances of human communication, such as sarcasm, idioms, or figurative language.
Format
The Jais Family Model is a series of bilingual English-Arabic large language models (LLMs) that use a transformer-based, decoder-only architecture (GPT-3). These models are optimized to excel in Arabic while having strong English capabilities.
Model Architecture
- Jais models (jais-family-*) are trained from scratch, incorporating the SwiGLU non-linear activation function and ALiBi position encoding.
- Jais adapted models (jais-adapted-*) are built on top of Llama-2, which employs RoPE position embedding and Grouped Query Attention.
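To make the SwiGLU activation named above concrete, here is a minimal pure-Python sketch of a SwiGLU feed-forward block: a SiLU-gated projection multiplied elementwise by a second projection, then projected back down. The weight names (`w_gate`, `w_up`, `w_down`) and the helper functions are hypothetical, and the sketch omits biases and batching. It illustrates the general technique, not the Jais implementation.

```python
import math
from typing import List

Matrix = List[List[float]]


def silu(x: float) -> float:
    """SiLU (Swish) activation: x * sigmoid(x)."""
    return x / (1.0 + math.exp(-x))


def matvec(w: Matrix, x: List[float]) -> List[float]:
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(wi * xi for wi, xi in zip(row, x)) for row in w]


def swiglu_ffn(
    x: List[float], w_gate: Matrix, w_up: Matrix, w_down: Matrix
) -> List[float]:
    """SwiGLU feed-forward: w_down @ (silu(w_gate @ x) * (w_up @ x))."""
    gate = [silu(g) for g in matvec(w_gate, x)]
    up = matvec(w_up, x)
    hidden = [g * u for g, u in zip(gate, up)]
    return matvec(w_down, hidden)


if __name__ == "__main__":
    print(swiglu_ffn([1.0], [[1.0]], [[2.0]], [[3.0]]))
```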
Input and Output
- Input: Text-only data
- Output: Generated text
Data Formats
- Supported languages: Arabic (MSA) and English
- Data sources: Web, code, books, scientific papers, and synthetic data
Special Requirements
- Custom model class: Required to use the model, with trust_remote_code=True while loading the model.
- Tokenizer expansion: Arabic data is added to the Llama-2 tokenizer, improving fertility and compute efficiency.
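Fertility here refers to the average number of tokens a tokenizer produces per word, so lower fertility means fewer tokens for the same text. The sketch below shows one way to measure it. The `fertility` helper and the two toy tokenizers are hypothetical stand-ins, and whitespace splitting is assumed as the word definition.

```python
from typing import Callable, Iterable, List


def fertility(texts: Iterable[str], tokenize: Callable[[str], List[str]]) -> float:
    """Average number of tokens per whitespace-separated word."""
    total_tokens = 0
    total_words = 0
    for text in texts:
        total_tokens += len(tokenize(text))
        total_words += len(text.split())
    if total_words == 0:
        raise ValueError("no words to measure")
    return total_tokens / total_words


def char_tokenizer(text: str) -> List[str]:
    """Toy tokenizer: one token per non-space character (high fertility)."""
    return [c for c in text if not c.isspace()]


def word_tokenizer(text: str) -> List[str]:
    """Toy tokenizer: one token per word (fertility of exactly 1)."""
    return text.split()


if __name__ == "__main__":
    sample = ["ما هي عاصمة الامارات؟"]
    print(fertility(sample, char_tokenizer))
    print(fertility(sample, word_tokenizer))
```

In practice, `tokenize` could wrap a real tokenizer's tokenization method to compare vocabularies on the same Arabic text.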
Example Code
# -*- coding: utf-8 -*-
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM
model_path = "inceptionai/jais-family-30b-8k-chat"
prompt_eng = "### Instruction:Your name is 'Jais', and you are named after Jebel Jais, the highest mountain in UAE. You were made by 'Inception' in the UAE. You are a helpful, respectful, and honest assistant. Always answer as helpfully as possible, while being safe. Complete the conversation between [|Human|] and [|AI|]:\n### Input: [|Human|] {Question}\n[|AI|]\n### Response :"
prompt_ar = "### Instruction:اسمك \"جيس\" وسميت على اسم جبل جيس اعلى جبل في الامارات. تم بنائك بواسطة Inception في الإمارات. أنت مساعد مفيد ومحترم وصادق. أجب دائمًا بأكبر قدر ممكن من المساعدة، مع الحفاظ على البقاء أمناً. أكمل المحادثة بين [|Human|] و[|AI|] :\n### Input:[|Human|] {Question}\n[|AI|]\n### Response :"
device = "cuda" if torch.cuda.is_available() else "cpu"
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(model_path, device_map="auto", trust_remote_code=True)
def get_response(text, tokenizer=tokenizer, model=model):
    # Tokenize the prompt and move it to the selected device
    input_ids = tokenizer(text, return_tensors="pt").input_ids
    inputs = input_ids.to(device)
    input_len = inputs.shape[-1]
    # Sample a completion; max_length counts prompt tokens plus generated tokens
    generate_ids = model.generate(
        inputs,
        top_p=0.9,
        temperature=0.3,
        max_length=2048,
        min_length=input_len + 4,
        repetition_penalty=1.2,
        do_sample=True,
    )
    response = tokenizer.batch_decode(
        generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=True
    )[0]
    # Keep only the text after the final response marker
    response = response.split("### Response :")[-1]
    return response
ques = "ما هي عاصمة الامارات؟"
text = prompt_ar.format_map({'Question': ques})
print(get_response(text))
ques = "What is the capital of UAE?"
text = prompt_eng.format_map({'Question': ques})
print(get_response(text))
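The prompt-formatting and response-parsing steps in the example can be checked without loading the 30B weights. The sketch below isolates them as two small functions. `TEMPLATE` is a shortened, hypothetical stand-in with the same shape as `prompt_eng`, and `build_prompt` and `extract_response` are illustrative names. Unlike the example above, `extract_response` also strips surrounding whitespace.

```python
RESPONSE_MARKER = "### Response :"

# Shortened, hypothetical template with the same structure as prompt_eng
TEMPLATE = (
    "### Instruction:You are a helpful assistant. "
    "Complete the conversation between [|Human|] and [|AI|]:\n"
    "### Input: [|Human|] {Question}\n[|AI|]\n"
    + RESPONSE_MARKER
)


def build_prompt(question: str, template: str = TEMPLATE) -> str:
    """Fill the {Question} slot of the chat template."""
    return template.format_map({"Question": question.strip()})


def extract_response(generated: str, marker: str = RESPONSE_MARKER) -> str:
    """Return the text after the last response marker, without padding."""
    return generated.split(marker)[-1].strip()


if __name__ == "__main__":
    prompt = build_prompt("What is the capital of UAE?")
    # Simulate decoded model output: the prompt followed by a completion
    print(extract_response(prompt + " Abu Dhabi."))
```

Splitting on the last marker matters because the decoded output repeats the full prompt before the model's answer.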