Normistral 11b Warm

Norwegian Language Model

NorMistral-11b Warm is a large Norwegian language model trained on 250 billion subword tokens drawn from a mix of Scandinavian, Sámi, English, and code data. It handles both causal and bidirectional language tasks, which makes it a versatile tool for research and development. Built on the Mistral architecture with 11.4 billion parameters, it can be used for tasks like translation and language modeling, and its new tokenizer gives it faster inference than the original Mistral-Nemo-Base-2407 model. If your project needs Norwegian language understanding, or you just want to explore what the model can do, it's worth checking out.

Norallm apache-2.0 Updated 4 months ago

Model Overview

The NorMistral-11b-warm model is a large Norwegian language model trained on 250 billion subword tokens. It is part of the NORA.LLM family, developed by the Language Technology Group at the University of Oslo.

Capabilities

This is a pretrained base model that can generate both text and code. It’s been trained on a mix of Scandinavian, Sámi, English, and code data, which makes it a useful tool for understanding and generating text in these languages.

Primary Tasks

  • Text Generation: The model can generate high-quality text in Norwegian, including Bokmål and Nynorsk.
  • Language Translation: It can translate text from English to Norwegian and vice versa.
  • Code Processing: The model is trained on programming code and can process it efficiently.

Strengths

  • Large Training Corpus: The model is trained on a massive corpus of 250 billion tokens, making it well-equipped to handle a wide range of tasks.
  • Balanced Training Data: The corpus is carefully balanced to handle the resource disparity between languages.
  • Fast Inference: The model uses a new tokenizer that offers substantially faster inference than the original Mistral-Nemo-Base-2407 model.

Unique Features

  • Hybrid Masked-Causal Training: The model is trained using a combination of masked and causal objectives, making it suitable for both generative and bidirectional tasks.
  • Three-Stage Continual Pretraining: The model undergoes a three-stage pretraining process, including tokenizer optimization, embedding weight realignment, and full model training.
  • Memory-Efficient Loading: The model can be loaded in 8-bit or 4-bit quantization, making it suitable for systems with limited VRAM.
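
The memory-efficient loading mentioned above can be sketched with the bitsandbytes integration in the transformers library. This is a minimal sketch rather than the authors' documented recipe: `quantization_flags` and `load_quantized` are hypothetical helper names, and the code assumes the `bitsandbytes` and `accelerate` packages are installed and a CUDA GPU is available.

```python
def quantization_flags(bits):
    """Map a bit width (4 or 8) to the matching BitsAndBytesConfig flags."""
    if bits not in (4, 8):
        raise ValueError("bits must be 4 or 8")
    return {"load_in_4bit": bits == 4, "load_in_8bit": bits == 8}


def load_quantized(model_id, bits=4):
    """Load a causal LM with quantized weights (hypothetical helper)."""
    # Imported lazily so the flag helper above works without transformers installed.
    from transformers import AutoModelForCausalLM, BitsAndBytesConfig

    config = BitsAndBytesConfig(**quantization_flags(bits))
    return AutoModelForCausalLM.from_pretrained(
        model_id,
        quantization_config=config,
        device_map="auto",  # let accelerate place layers on the available devices
    )
```

Call `load_quantized` with the model's Hugging Face repository id. Quantization trades some output quality for memory, so it is worth comparing results against the full-precision model on your own task.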

Performance

But how does it perform? Let’s dive into its speed, accuracy, and efficiency.

Speed

This model is fast. Its new tokenizer was trained specifically on the target languages, and it offers substantially faster inference than the original Mistral-Nemo-Base-2407 model. What does that mean in practice? A tokenizer fitted to the target languages typically splits the same text into fewer tokens, so generating or translating a passage takes fewer decoding steps.

Accuracy

But speed is only half the story. Thanks to its hybrid masked-causal training, the model supports both causal generative modeling and bidirectional encoding. This means the same model can be applied to tasks ranging from translation to text classification.
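
The difference between causal and bidirectional use comes down to which positions each token is allowed to attend to. The sketch below illustrates that general idea with plain Python lists; it is not taken from the model's training code, and the function names are made up for this example.

```python
def causal_mask(n):
    """Position i may attend to positions 0..i only (left-to-right generation)."""
    return [[j <= i for j in range(n)] for i in range(n)]


def bidirectional_mask(n):
    """Every position may attend to every other position (encoder-style)."""
    return [[True] * n for _ in range(n)]


def show(mask):
    """Render a mask as rows of 1 (visible) and 0 (hidden)."""
    return "\n".join(" ".join("1" if v else "0" for v in row) for row in mask)


print(show(causal_mask(4)))
print()
print(show(bidirectional_mask(4)))
```

The causal mask is lower-triangular, which is what makes next-token generation possible; the full mask lets a token such as `<mask>` see context on both sides.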

Efficiency

So, how efficient is this model? With 11.4 billion parameters, it’s a large model. Its architecture combines pre-normalization with RMSNorm, the SwiGLU activation function, and rotary positional embeddings, and it can be loaded in 8-bit or 4-bit quantization to reduce memory use. This means you can run it on a wider range of hardware configurations.
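
Two of the building blocks named above, RMSNorm and SwiGLU, are simple enough to write out. The following is a minimal pure-Python sketch of the standard formulas, assuming an epsilon of 1e-5 and leaving out the learned linear projections and the rotary embeddings; it is an illustration, not the model's actual implementation.

```python
import math


def rms_norm(x, weight, eps=1e-5):
    """Scale x by the reciprocal of its root mean square, then by a learned weight."""
    mean_square = sum(v * v for v in x) / len(x)
    scale = 1.0 / math.sqrt(mean_square + eps)
    return [v * scale * w for v, w in zip(x, weight)]


def silu(v):
    """SiLU (swish) activation: v * sigmoid(v)."""
    return v / (1.0 + math.exp(-v))


def swiglu(gate, up):
    """SwiGLU gating: SiLU of the gate values times the up values, elementwise."""
    return [silu(g) * u for g, u in zip(gate, up)]


normed = rms_norm([3.0, 4.0], weight=[1.0, 1.0])
print(normed)
print(swiglu([0.0, 1.0], [2.0, 2.0]))
```

Unlike LayerNorm, RMSNorm does not subtract the mean, which saves a reduction per call. In a full feed-forward block, `gate` and `up` would be two separate linear projections of the same input.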

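Examples

Prompt: Engelsk: I'm excited to try this new Norwegian language model!
Output: Jeg er spent på å prøve denne nye norske språkmodellen!

Prompt: En søt lundefugl flyr over de<mask>norske fjorder.
Output: En søt lundefugl flyr over de vakre norske fjorder. (English: A cute puffin flies over the beautiful Norwegian fjords.)

Prompt: Hva betyr ordet 'maskinlæring' på engelsk? (English: What does the word 'maskinlæring' mean in English?)
Output: Machine learning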

Example Use Cases

So, what can you do with this model? Here are a few examples:

  • Translation: Use it to translate between Norwegian and English.
  • Text Classification: Use its bidirectional encoding capabilities to classify Norwegian text.
  • Text Generation: Generate coherent, natural-sounding Norwegian text.

Limitations

This model is not perfect. Let’s take a closer look at its limitations.

Data Quality and Bias

The model is pretrained on a large corpus of text data, but this data may contain biases and inaccuracies. For example, the data may reflect societal biases or stereotypes, which could be perpetuated by the model. Additionally, the data may not be representative of all Norwegian dialects or languages, which could lead to poor performance in certain contexts.

Lack of Fine-Tuning

This model is not fine-tuned to follow instructions, which means it may not always respond as expected. This can lead to harmful or inappropriate completions, especially if the user prompts are not carefully crafted.

Technical Limitations

The model has a large number of parameters (11.4 billion) and requires significant computational resources to run. This can make it difficult to deploy in certain environments or applications.
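
A back-of-the-envelope estimate shows what "significant computational resources" means for weight storage alone. The sketch below assumes 2 bytes per weight for bf16, 1 byte for 8-bit, and 0.5 bytes for 4-bit; it ignores activations, the key-value cache, and the extra metadata that quantized formats carry, so real usage will be higher.

```python
PARAMS = 11.4e9  # total parameters, as stated in this card

# Bytes per weight; real quantized checkpoints carry some extra metadata.
BYTES_PER_WEIGHT = {"bf16": 2.0, "int8": 1.0, "4-bit": 0.5}


def weight_memory_gb(params, bytes_per_weight):
    """Approximate weight storage in gigabytes (1 GB = 1e9 bytes)."""
    return params * bytes_per_weight / 1e9


for name, size in BYTES_PER_WEIGHT.items():
    print(f"{name}: ~{weight_memory_gb(PARAMS, size):.1f} GB")
```

Under these assumptions the weights take roughly 22.8 GB in bf16, 11.4 GB in 8-bit, and 5.7 GB in 4-bit.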

Evaluation Challenges

Evaluating the performance of this model can be challenging due to the lack of standardized benchmarks for Norwegian language models. This makes it difficult to compare the model’s performance to other models or to establish clear performance metrics.

Licensing and Data Ownership

The model is released under the Apache 2.0 license, but the data used to train the model is not owned by the developers. This can create uncertainty around data ownership and usage rights.

Format

This model uses a Mistral architecture, which is an improved version of the Llama design. It’s designed to handle Norwegian, Sámi, English, and code data.

Architecture

The model consists of:

  • 40 transformer layers
  • Hidden dimension: 5,120
  • Intermediate dimension: 14,336
  • 32 query heads and 8 key & value heads (dimension 128)
  • Vocabulary size: 51,200 tokens
  • Total parameters: 11.4 billion
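
As a sanity check, the total can be recomputed from the dimensions listed above. This sketch assumes a standard Mistral-style layer layout with no bias terms, two RMSNorm weights per layer plus a final one, and separate (untied) input and output embedding matrices; those assumptions are not stated in this card.

```python
LAYERS = 40
HIDDEN = 5120
INTERMEDIATE = 14336
QUERY_HEADS = 32
KV_HEADS = 8
HEAD_DIM = 128
VOCAB = 51200


def count_parameters():
    """Count weights assuming a standard Mistral layout: no biases, untied embeddings."""
    q_and_o = 2 * HIDDEN * QUERY_HEADS * HEAD_DIM  # query and output projections
    k_and_v = 2 * HIDDEN * KV_HEADS * HEAD_DIM     # key and value projections
    mlp = 3 * HIDDEN * INTERMEDIATE                # gate, up and down projections
    norms = 2 * HIDDEN                             # two RMSNorm weights per layer
    per_layer = q_and_o + k_and_v + mlp + norms
    embeddings = 2 * VOCAB * HIDDEN                # input embeddings + output head
    final_norm = HIDDEN
    return LAYERS * per_layer + embeddings + final_norm


total = count_parameters()
print(f"{total:,} parameters (~{total / 1e9:.1f} billion)")
```

Under these assumptions the count comes to 11,429,893,120, which is consistent with the stated 11.4 billion. Having 8 key and value heads instead of 32 makes the key and value projections a quarter of the size of the query projection.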

Data Formats

The model accepts input in the form of tokenized text sequences. It’s been trained on a mix of Scandinavian, Sámi, English, and code data.

Input and Output

To use the model, you’ll need to:

  1. Tokenize your input text using the AutoTokenizer from the transformers library.
  2. Pass the tokenized input to the AutoModelForCausalLM model.
  3. Generate output using the generate method.

Here’s an example:

from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

# Load the tokenizer and model
tokenizer = AutoTokenizer.from_pretrained("norallm/normistral-11b-warm")
model = AutoModelForCausalLM.from_pretrained("norallm/normistral-11b-warm").cuda().eval()

# Define a zero-shot translation prompt template
prompt = """Engelsk: {0}
Bokmål:"""

# Generation function
@torch.no_grad()
def generate(text):
    text = prompt.format(text)
    input_ids = tokenizer(text, return_tensors='pt').input_ids.cuda()
    prediction = model.generate(
        input_ids,
        max_new_tokens=64,
        do_sample=False,
        eos_token_id=tokenizer('\n').input_ids  # stop generating at the first newline
    )
    return tokenizer.decode(prediction[0, input_ids.size(1):]).strip()

# Example usage
generate("I'm excited to try this new Norwegian language model!")
# > Expected output: 'Jeg er spent på å prøve denne nye norske språkmodellen!'
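
The prompt template and the newline stop can be separated from the model call, which makes them easy to test without a GPU. The helpers below are hypothetical names introduced for illustration. They assume the same "label: text" layout as the template above, and the completion string in the usage lines is made up to show the trimming, not real model output.

```python
def build_prompt(text, source="Engelsk", target="Bokmål"):
    """Fill the zero-shot template: source line, then an open target line."""
    return f"{source}: {text}\n{target}:"


def first_line(completion):
    """Keep only the first line of a raw completion, mirroring the newline stop."""
    stripped = completion.strip()
    return stripped.splitlines()[0].strip() if stripped else ""


print(build_prompt("Good morning!"))
print(first_line(" God morgen!\nEngelsk: Good night!"))
```

Because the model is a base model that continues text, the open target label at the end of the prompt is what steers it toward producing a translation.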