NorMistral-11b Warm
NorMistral-11b Warm is a large Norwegian language model trained on 250 billion subword tokens spanning Scandinavian, Sámi, and English text as well as programming code. What makes it stand out is that it is trained for both causal and bidirectional language modeling, which makes it a versatile tool for research and development. Built on the Mistral architecture with 11.4 billion parameters, it offers fast inference thanks to a tokenizer tailored to its target languages, and it can be used for tasks like translation, language modeling, and more. Whether you're working on a project that requires Norwegian language understanding or just want to explore what the model can do, NorMistral-11b Warm is worth checking out.
Model Overview
The NorMistral-11b-warm model is a large Norwegian language model that has been trained on a massive amount of text data: 250 billion subword tokens. This model is part of the NORA.LLM family, developed by the Language Technology Group at the University of Oslo.
Capabilities
Trained on a mix of Scandinavian, Sámi, English, and code data, the model can generate both text and code, which makes it a great tool for understanding and producing text in these languages.
Primary Tasks
- Text Generation: The model can generate high-quality text in Norwegian, including Bokmål and Nynorsk.
- Language Translation: It can translate text from English to Norwegian and vice versa.
- Code Processing: The model is trained on programming code and can process it efficiently.
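As a quick illustration of the text-generation task above, here is a minimal sketch using the Hugging Face transformers pipeline. It assumes the norallm/normistral-11b repository ID used in the example further down this page, and the prompt and sampling settings are only illustrative.

# Minimal open-ended generation sketch (illustrative settings only)
import torch
from transformers import pipeline

generator = pipeline(
    "text-generation",
    model="norallm/normistral-11b",
    torch_dtype=torch.bfloat16,  # half-precision weights to reduce memory use
    device_map="auto",           # place the model on available GPUs automatically
)

output = generator(
    "Oslo er hovedstaden i Norge, og",  # an open-ended Norwegian prompt
    max_new_tokens=50,
    do_sample=True,
    top_p=0.9,
)
print(output[0]["generated_text"])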
Strengths
- Large Training Corpus: The model is trained on a massive corpus of 250 billion tokens, making it well-equipped to handle a wide range of tasks.
- Balanced Training Data: The corpus is carefully balanced to handle the resource disparity between languages.
- Fast Inference: The model uses a new tokenizer that offers substantially faster inference than the original Mistral-Nemo-Base-2407 model.
Unique Features
- Hybrid Masked-Causal Training: The model is trained using a combination of masked and causal objectives, making it suitable for both generative and bidirectional tasks.
- Three-Stage Continual Pretraining: The model undergoes a three-stage pretraining process, including tokenizer optimization, embedding weight realignment, and full model training.
- Memory-Efficient Loading: The model can be loaded in 8-bit or 4-bit quantization, making it suitable for systems with limited VRAM.
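For the memory-efficient loading mentioned above, here is a minimal sketch using 4-bit quantization through bitsandbytes. It assumes the bitsandbytes and accelerate packages are installed and reuses the repository ID from the example further down this page; treat the exact settings as a starting point rather than a recommendation.

# Minimal sketch of loading the model with 4-bit quantization via bitsandbytes
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

quant_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # or load_in_8bit=True for 8-bit weights
    bnb_4bit_compute_dtype=torch.bfloat16,  # compute dtype used during matrix multiplications
)

tokenizer = AutoTokenizer.from_pretrained("norallm/normistral-11b")
model = AutoModelForCausalLM.from_pretrained(
    "norallm/normistral-11b",
    quantization_config=quant_config,
    device_map="auto",  # spread layers across the available devices
)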
Performance
But how does it perform? Let’s dive into its speed, accuracy, and efficiency.
Speed
This model is fast. Its new tokenizer, trained specifically on the target languages, splits Norwegian text into fewer tokens than the original Mistral-Nemo-Base-2407 tokenizer, and fewer tokens per sentence means fewer generation steps. In practice, that translates into substantially faster inference when you generate text, translate, or complete other tasks.
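If you want to see the tokenizer effect for yourself, a rough sketch is to count how many tokens each tokenizer produces for the same Norwegian sentence. The example sentence below is made up, and the Mistral-Nemo repository may require you to accept its terms on the Hugging Face Hub first.

# Rough sketch: compare token counts for the same Norwegian sentence
from transformers import AutoTokenizer

text = "Språkmodellen er trent på skandinaviske språk, samisk, engelsk og programkode."

new_tok = AutoTokenizer.from_pretrained("norallm/normistral-11b")
old_tok = AutoTokenizer.from_pretrained("mistralai/Mistral-Nemo-Base-2407")

print("New tokenizer:     ", len(new_tok(text).input_ids), "tokens")
print("Original tokenizer:", len(old_tok(text).input_ids), "tokens")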
Accuracy
But speed is only half the story. This model also boasts high accuracy in various tasks. Its hybrid masked-causal training approach allows it to excel in both causal generative modeling and bidirectional encoding. This means it can handle complex tasks with ease, from translation to text classification.
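The exact options for switching the model into its fully bidirectional mode aren't covered on this page, but as a generic sketch you can already pull sentence features out of the hidden states and feed them to a lightweight classifier. The mean-pooling choice below is an assumption for illustration, not the model's documented recipe.

# Generic feature-extraction sketch: mean-pool the last hidden states.
# This runs the model with its standard causal attention; the bidirectional
# mode may require model-specific options not shown here.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("norallm/normistral-11b")
model = AutoModelForCausalLM.from_pretrained(
    "norallm/normistral-11b", torch_dtype=torch.bfloat16, device_map="auto"
)

@torch.no_grad()
def embed(text):
    inputs = tokenizer(text, return_tensors="pt").to(model.device)
    outputs = model(**inputs, output_hidden_states=True)
    # Average the final layer's hidden states over the sequence dimension
    return outputs.hidden_states[-1].mean(dim=1).squeeze(0)

features = embed("Dette er en setning på norsk bokmål.")
print(features.shape)  # one vector with the model's hidden dimension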
Efficiency
So, how efficient is this model? With 11.4 billion parameters it's a large model, but its architecture is built for efficient inference: pre-normalization with RMSNorm, the SwiGLU activation function, rotary positional embeddings, and grouped-query attention (8 key & value heads shared across 32 query heads), which keeps the attention cache small. This means you can run it on a variety of hardware configurations without breaking the bank.
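These components are standard in Mistral- and Llama-style models. As a rough illustration of what they compute (not the model's actual implementation), RMSNorm and SwiGLU look roughly like this in PyTorch:

# Illustrative PyTorch definitions of RMSNorm and SwiGLU (not the model's own code)
import torch
import torch.nn as nn

class RMSNorm(nn.Module):
    def __init__(self, dim, eps=1e-6):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(dim))
        self.eps = eps

    def forward(self, x):
        # Normalize by the root mean square instead of mean and variance
        rms = torch.rsqrt(x.pow(2).mean(dim=-1, keepdim=True) + self.eps)
        return self.weight * x * rms

class SwiGLU(nn.Module):
    def __init__(self, dim, hidden_dim):
        super().__init__()
        self.gate = nn.Linear(dim, hidden_dim, bias=False)
        self.up = nn.Linear(dim, hidden_dim, bias=False)
        self.down = nn.Linear(hidden_dim, dim, bias=False)

    def forward(self, x):
        # Gated feed-forward: SiLU(gate(x)) * up(x), projected back down
        return self.down(nn.functional.silu(self.gate(x)) * self.up(x))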
Example Use Cases
So, what can you do with this model? Here are a few examples:
- Translation: Use it to translate between English and Norwegian with high accuracy and speed.
- Text Classification: Leverage its bidirectional encoding capabilities to classify Norwegian text with ease.
- Text Generation: Generate Norwegian text that’s coherent and natural-sounding.
Limitations
This model is not perfect. Let’s take a closer look at its limitations.
Data Quality and Bias
The model is pretrained on a large corpus of text data, but this data may contain biases and inaccuracies. For example, the data may reflect societal biases or stereotypes, which could be perpetuated by the model. Additionally, the data may not be representative of all Norwegian dialects or languages, which could lead to poor performance in certain contexts.
Lack of Fine-Tuning
This model is not fine-tuned to follow instructions, which means it may not always respond as expected. This can lead to harmful or inappropriate completions, especially if the user prompts are not carefully crafted.
Technical Limitations
The model has a large number of parameters (11.4 billion) and requires significant computational resources to run. This can make it difficult to deploy in certain environments or applications.
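As a back-of-the-envelope estimate (weights only; activations and the attention cache come on top), the parameter count translates roughly into the following memory requirements:

# Rough weight-memory estimate for 11.4 billion parameters
params = 11.4e9

for label, bytes_per_param in [("bf16/fp16", 2), ("8-bit", 1), ("4-bit", 0.5)]:
    gib = params * bytes_per_param / 1024**3
    print(f"{label}: ~{gib:.1f} GiB")

# Approximately 21.2 GiB in bf16/fp16, 10.6 GiB in 8-bit, and 5.3 GiB in 4-bit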
Evaluation Challenges
Evaluating the performance of this model can be challenging due to the lack of standardized benchmarks for Norwegian language models. This makes it difficult to compare the model’s performance to other models or to establish clear performance metrics.
Licensing and Data Ownership
The model is released under the Apache 2.0 license, but the data used to train the model is not owned by the developers. This can create uncertainty around data ownership and usage rights.
Format
This model uses a Mistral architecture, which is an improved version of the Llama design. It’s designed to handle Norwegian, Sámi, English, and code data.
Architecture
The model consists of:
- 40 transformer layers
- Hidden dimension: 5,120
- Intermediate dimension: 14,336
- 32 query heads and 8 key & value heads (head dimension 128)
- Vocabulary size: 51,200 tokens
- Total parameters: 11.4 billion
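If you want to double-check these numbers, a small sketch is to read them straight from the published configuration (again assuming the repository ID used in the example below):

# Small sketch: inspect the architecture details from the model configuration
from transformers import AutoConfig

config = AutoConfig.from_pretrained("norallm/normistral-11b")
print(config.num_hidden_layers)    # transformer layers
print(config.hidden_size)          # hidden dimension
print(config.intermediate_size)    # intermediate (feed-forward) dimension
print(config.num_attention_heads)  # query heads
print(config.num_key_value_heads)  # key & value heads
print(config.vocab_size)           # vocabulary size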
Data Formats
The model accepts input in the form of tokenized text sequences. It’s been trained on a mix of Scandinavian, Sámi, English, and code data.
Input and Output
To use the model, you’ll need to:
- Tokenize your input text using the AutoTokenizer from the transformers library.
- Pass the tokenized input to the AutoModelForCausalLM model.
- Generate output using the generate method.
Here’s an example:
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch
# Import the tokenizer and model
tokenizer = AutoTokenizer.from_pretrained("norallm/normistral-11b")
model = AutoModelForCausalLM.from_pretrained("norallm/normistral-11b").cuda().eval()
# Define a zero-shot translation prompt template
prompt = """Engelsk: {0}
Bokmål:"""
# Generation function
@torch.no_grad()
def generate(text):
    text = prompt.format(text)
    input_ids = tokenizer(text, return_tensors='pt').input_ids.cuda()
    prediction = model.generate(
        input_ids,
        max_new_tokens=64,
        do_sample=False,
        eos_token_id=tokenizer('\n').input_ids
    )
    return tokenizer.decode(prediction[0, input_ids.size(1):]).strip()
# Example usage
generate("I'm excited to try this new Norwegian language model!")
# > Expected output: 'Jeg er spent på å prøve denne nye norske språkmodellen!'