Jais Adapted 70b
Meet Jais Adapted 70b, a bilingual large language model designed to excel in both Arabic and English. It is part of the Jais family, a comprehensive series of bilingual English-Arabic models. What sets Jais Adapted 70b apart is its lineage: it is built on top of Llama-2 and further trained on Arabic data to strengthen its Arabic capabilities. Beyond understanding language, it generates fluent, human-like text, making it well suited to chat applications, sentiment analysis, and summarization of bilingual documents. With strong performance in both Arabic and English, Jais Adapted 70b is a solid choice for researchers, businesses, and anyone working with Arabic language data, across applications ranging from research to commercial use.
Model Overview
The Jais family is a comprehensive series of bilingual English-Arabic large language models (LLMs) that excel in Arabic while retaining strong English capabilities. Developed by Inception and Cerebras Systems, the family includes 20 models across 8 sizes, ranging from 590M to 70B parameters, trained on up to 1.6T tokens of Arabic, English, and code data.
Capabilities
These models are designed to handle a wide range of tasks, including:
- Text Generation: The models can generate human-like text in both Arabic and English.
- Conversational Dialogue: The models are fine-tuned for dialogue using a curated mix of Arabic and English instruction data, making them suitable for chat applications.
- Language Understanding: The models can understand and process text in both Arabic and English, making them useful for tasks such as sentiment analysis and summarization.
The models are available in sizes ranging from 590M to 70B parameters, making them suitable for use cases from research to commercial applications.
Strengths
The Jais Family Model has several strengths, including:
- Bilingual Capabilities: The models are trained on a large dataset of Arabic and English text, making them proficient in both languages.
- Improved Arabic Support: The models are specifically designed to handle Arabic text, making them a valuable resource for Arabic language research and applications.
- Strong English Capabilities: The models also have strong English capabilities, making them suitable for a wide range of applications that require both Arabic and English support.
Unique Features
The Jais Family Model has several unique features, including:
- SwiGLU Activation Function: The models use the SwiGLU non-linear activation function in their feed-forward layers, which improves model quality over standard feed-forward activations.
- ALiBi Position Encoding: The models use ALiBi (Attention with Linear Biases), which biases attention scores by token distance instead of adding learned position embeddings, allowing the models to extrapolate to sequence lengths longer than those seen in training and improving long-range context handling.
- Tokenizer Expansion: For the adapted models, new Arabic tokens are added to the Llama-2 vocabulary, reducing tokenizer fertility (the average number of tokens needed per Arabic word) and thereby improving compute efficiency.
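As a concrete illustration of how ALiBi biases attention, here is a minimal sketch of the slope schedule and the additive bias matrix. The function names and the use of NumPy are illustrative assumptions, not Jais code:

```python
import numpy as np

def alibi_slopes(n_heads: int) -> list[float]:
    # Head-specific slopes: for n_heads a power of two, head h (1-indexed)
    # gets slope 2^(-8h / n_heads), a geometric sequence from 2^-1 down to 2^-8.
    return [2.0 ** (-8.0 * h / n_heads) for h in range(1, n_heads + 1)]

def alibi_bias(seq_len: int, slope: float) -> np.ndarray:
    # Additive attention bias: 0 on the diagonal, -slope * (i - j) when query i
    # attends to an earlier key j. Future positions are left at 0 here because
    # the causal mask removes them anyway.
    pos = np.arange(seq_len)
    distance = pos[None, :] - pos[:, None]   # j - i: negative for past tokens
    return slope * np.minimum(distance, 0.0)

slopes = alibi_slopes(8)
print(slopes[0], slopes[-1])      # 0.5 0.00390625
print(alibi_bias(4, slopes[0]))
```

Because the penalty grows linearly with distance rather than being learned per position, the same scheme applies unchanged at sequence lengths longer than those seen in training, which is what enables extrapolation.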
Comparison to Other Models
The Jais family is evaluated against other models, including Llama-2, using the GPT-4-as-a-judge method, in which GPT-4 compares paired model outputs and picks the better one. The results show that the Jais family performs significantly better than previous models in both Arabic and English generation.
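GPT-4-as-a-judge evaluation reduces to collecting pairwise preference votes and computing win rates. The sketch below uses made-up votes purely to show the bookkeeping; it is not data from the Jais evaluation:

```python
from collections import Counter

# Hypothetical judge verdicts over a prompt set (illustrative only):
# each entry records which side the judge preferred, or a tie.
votes = ["jais", "jais", "llama2", "tie", "jais", "llama2", "jais", "tie"]

def win_rates(votes: list[str]) -> dict[str, float]:
    # Fraction of prompts on which the judge preferred each side (or tied).
    counts = Counter(votes)
    return {side: counts[side] / len(votes) for side in ("jais", "llama2", "tie")}

print(win_rates(votes))  # {'jais': 0.5, 'llama2': 0.25, 'tie': 0.25}
```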
Performance
The Jais family performs strongly across standard evaluation benchmarks, with particularly strong results on Arabic language tasks.
- Speed: The range of model sizes lets users trade accuracy for lower latency; smaller family members process text with far less compute.
- Accuracy: The models score highly on Arabic benchmarks while remaining competitive in English.
- Efficiency: The expanded Arabic vocabulary reduces the number of tokens needed per Arabic word, lowering the compute required per request.
Example Use Cases
The Jais Family Model can be used for a wide range of applications, including:
- Chat Assistants: The models can be used to develop chat assistants for Arabic-speaking users.
- Sentiment Analysis: The models can be used to gain insights into local markets and customer trends by analyzing sentiment in Arabic text.
- Summarization: The models can be used to summarize bilingual Arabic-English documents.
- Research: The models can be used for research in Arabic Natural Language Processing, including mechanistic interpretability analyses and quantitative studies of Arabic cultural and linguistic phenomena.
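To make the sentiment-analysis use case concrete, here is a minimal prompt-and-parse sketch. The `build_prompt` and `parse_label` helpers are hypothetical, not part of any Jais API, and the actual generation call is only indicated in a comment:

```python
# Sketch: prompt-based Arabic sentiment classification with a generative model.
def build_prompt(review: str) -> str:
    # Arabic instruction: "Classify the following review as positive or negative."
    return (
        "صنّف المراجعة التالية على أنها إيجابية أو سلبية.\n"
        f"المراجعة: {review}\n"
        "التصنيف:"
    )

def parse_label(response: str) -> str:
    # Map the model's free-text answer to a coarse label.
    # "إيجابية" = positive, "سلبية" = negative.
    if "إيجابية" in response:
        return "positive"
    if "سلبية" in response:
        return "negative"
    return "unknown"

# With a real model you would call something like:
#   label = parse_label(get_response(build_prompt(review)))
print(parse_label("التصنيف: إيجابية"))  # positive
```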
Limitations
While the Jais Family Model is highly versatile, it sometimes generates outputs that lack coherence or factual accuracy, particularly in more complex or nuanced scenarios.
- Data Quality and Availability: High-quality Arabic training data is scarcer than English data, which can limit performance on some tasks.
- Language Limitations: Although the models are bilingual, performance may not be equal across Arabic and English for every task.
- Model Size and Complexity: Larger models are not always better; they require more computational resources and are harder to fine-tune.
Format
The Jais family models are auto-regressive language models built on a transformer-based, decoder-only architecture, optimized to excel in Arabic while retaining strong English capabilities.
- Model Architecture: Transformer-based, decoder-only, auto-regressive.
- Data Formats: The models accept text-only input and generate text output.
- Input Requirements: Text sequences, pre-processed by tokenization.
- Output Requirements: Text sequences.
Example Code
# -*- coding: utf-8 -*-
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

model_path = "inceptionai/jais-adapted-70b"
device = "cuda" if torch.cuda.is_available() else "cpu"

tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(model_path, device_map="auto", trust_remote_code=True)

# Llama-2-based tokenizers ship without a pad token; reuse EOS for padding.
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

def get_response(text, tokenizer=tokenizer, model=model):
    tokenized = tokenizer(text, return_tensors="pt")
    input_ids = tokenized["input_ids"].to(device)
    attention_mask = tokenized["attention_mask"].to(device)
    input_len = input_ids.shape[-1]
    generate_ids = model.generate(
        input_ids,
        attention_mask=attention_mask,
        top_p=0.9,
        temperature=0.3,
        max_length=2048,
        min_length=input_len + 4,
        repetition_penalty=1.2,
        do_sample=True,
        pad_token_id=tokenizer.pad_token_id,
    )
    response = tokenizer.batch_decode(
        generate_ids,
        skip_special_tokens=True,
        clean_up_tokenization_spaces=True,
    )[0]
    return response

# Arabic prompt: "The capital of the United Arab Emirates is"
text = "عاصمة دولة الإمارات العربية المتحدة ه"
print(get_response(text))

text = "The capital of UAE is"
print(get_response(text))