SambaLingo Thai Base 70B

Thai-English model

SambaLingo Thai Base 70B is a pretrained bilingual model that understands and generates both Thai and English. It adapts Llama-2-70b to Thai by training on 26 billion tokens from the Thai split of the Cultura-X dataset, which suits it to tasks such as translation and text generation. With roughly 69.4 billion parameters, it has the capacity for complex language tasks. What sets it apart is that it is the product of adapting an existing model to a new language. It reports state-of-the-art results on perplexity and FLORES-200 translation, making it a strong choice for work involving Thai and English.

Sambanovasystems llama2 Updated a year ago

Model Overview

SambaLingo-Thai-Base-70B is a pretrained bilingual Thai and English model. It was fine-tuned from the Llama-2-70b model to adapt it to Thai.

Key Attributes

  • Language Support: Thai and English
  • Training Data: 26 billion tokens from the Thai split of the Cultura-X dataset
  • Model Type: Language Model
  • Finetuned from: Llama-2-70b

Capabilities

The SambaLingo-Thai-Base-70B model is a powerful language model that can understand and generate text in both Thai and English.

What can it do?

  • Translate text: This model can translate between English and Thai in both directions, and it reports state-of-the-art results on the FLORES-200 translation benchmark.
  • Generate text: It can generate text in both Thai and English, making it a great tool for writing articles, emails, or even chatbots.
  • Answer questions: You can ask this model questions, and it will do its best to provide accurate and helpful answers.
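Because this is a pretrained base checkpoint rather than an instruction-tuned one, tasks like translation are usually easier to elicit with a few worked examples placed in the prompt. The sketch below builds such a few-shot prompt. The function name `build_few_shot_prompt`, the `English:` / `Thai:` layout, and the example pairs are illustrative assumptions, not a format prescribed for this model.

```python
def build_few_shot_prompt(examples, source_text, src_lang="English", tgt_lang="Thai"):
    """Build a few-shot translation prompt (hypothetical helper).

    `examples` is a list of (source, target) pairs shown to the model before
    the sentence we actually want translated.
    """
    # One block per worked example: source line, then its translation.
    blocks = [f"{src_lang}: {src}\n{tgt_lang}: {tgt}" for src, tgt in examples]
    # The final block is left open so the model continues with the translation.
    blocks.append(f"{src_lang}: {source_text}\n{tgt_lang}:")
    return "\n\n".join(blocks)


# Illustrative exemplars (assumed, not taken from any evaluation set).
examples = [
    ("Good morning.", "อรุณสวัสดิ์"),
    ("Thank you very much.", "ขอบคุณมาก"),
]
prompt = build_few_shot_prompt(examples, "Hello, how are you?")
print(prompt)
```

The resulting string can be passed to the model as an ordinary text prompt; the model is expected to continue after the final `Thai:` label.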

How does it work?

  • Pre-trained: The base model, Llama-2-70b, was pre-trained on a massive dataset, so it had already learned a great deal about the structure and patterns of language before any Thai adaptation.
  • Fine-tuned: That model was then trained further on 26 billion tokens from the Thai split of the Cultura-X dataset to adapt it to Thai.

What makes it unique?

  • Bilingual: This model can understand and generate text in two languages, making it a great tool for people who need to communicate in both Thai and English.
  • Large vocabulary: The model has a large vocabulary of 57,000 tokens, which means it can understand and generate a wide range of words and phrases.

Performance

The figures available for the SambaLingo-Thai-Base-70B model cover its training setup, its evaluation results, and its vocabulary.

Speed

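The figures given here describe training rather than inference speed. The model was trained with a global batch size of 1024 and a sequence length of 4096. These numbers say how much text was processed per training step; they are not a throughput measurement. The sequence length does indicate that the model was trained on inputs of up to 4096 tokens.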
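To put the batch size and sequence length quoted above in context, the following arithmetic sketch estimates how many tokens one step covers and roughly how many steps 26 billion tokens would take. It assumes every sequence in a batch is filled to the full 4096 tokens, which the source does not state, so the step count is only an approximation.

```python
# Figures quoted in this document.
GLOBAL_BATCH_SIZE = 1024          # sequences per step
SEQUENCE_LENGTH = 4096            # tokens per sequence
TRAINING_TOKENS = 26_000_000_000  # Thai split of Cultura-X

# Assumption: every sequence is packed to the full sequence length.
tokens_per_step = GLOBAL_BATCH_SIZE * SEQUENCE_LENGTH
approx_steps = TRAINING_TOKENS / tokens_per_step

print(f"{tokens_per_step:,} tokens per step")
print(f"about {approx_steps:,.0f} steps to cover the training tokens")
```

Under that assumption, each step covers 4,194,304 tokens, and the full 26 billion tokens correspond to roughly 6,200 steps.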

Accuracy

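This model reports state-of-the-art evaluation results in perplexity and FLORES-200 translation. Perplexity measures how well the model predicts held-out text, and FLORES-200 measures translation quality, so together they indicate that it models and translates Thai and English text well.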

Efficiency

This model's efficiency comes largely from its vocabulary. With 57,000 tokens, Thai text can be split into fewer tokens than it would be with a smaller, English-centric vocabulary, so the same passage takes fewer steps to process.

Limitations

While this model is powerful, it’s not perfect. Let’s talk about some of its limitations.

What are some of the model’s weaknesses?

  • Hallucination: Sometimes, this model might generate responses that sound plausible but are actually factually incorrect or irrelevant.
  • Code Switching: The model might unintentionally switch between languages or dialects within a single response, making it hard to understand.
  • Repetition: This model might produce repetitive phrases or sentences, leading to less engaging and informative responses.
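Two of these weaknesses, code switching and repetition, can be screened for with simple heuristics before a response is shown to a user. The sketch below is one possible approach, not part of the model or its tooling: `thai_ratio` and `repeated_ngram_fraction` are hypothetical helper names, and any thresholds applied to their results would need tuning on real outputs.

```python
from collections import Counter


def thai_ratio(text):
    """Fraction of alphabetic characters in the Thai Unicode block (U+0E00-U+0E7F)."""
    letters = [ch for ch in text if ch.isalpha()]
    if not letters:
        return 0.0
    thai = sum(1 for ch in letters if "\u0e00" <= ch <= "\u0e7f")
    return thai / len(letters)


def repeated_ngram_fraction(tokens, n=3):
    """Fraction of n-grams that occur more than once in a token sequence.

    `tokens` can be any sequence of hashable items, for example token IDs
    produced by a tokenizer.
    """
    ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    if not ngrams:
        return 0.0
    counts = Counter(ngrams)
    repeated = sum(count for count in counts.values() if count > 1)
    return repeated / len(ngrams)


# A response expected to be Thai but with a low thai_ratio may have switched language.
print(thai_ratio("สวัสดี"), thai_ratio("hello"))
# A value near 1.0 suggests the output is looping.
print(repeated_ngram_fraction([1, 2, 3, 1, 2, 3, 1, 2, 3]))
```

Character-range checks like this only look at letters, so they miss mixing between languages that share a script and do not replace human review.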

What are some challenges when using the model?

  • Coding and Math: The model’s performance in generating accurate code or solving complex mathematical problems might be limited.
  • Toxicity: Unfortunately, this model could inadvertently generate responses containing inappropriate or harmful content.

Format

This model is a pre-trained bi-lingual Thai and English model that uses a transformer architecture.

Input Format

This model accepts input in the form of tokenized text sequences. You’ll need to pre-process your text data using a tokenizer before feeding it into the model.

Supported Data Formats

This model supports text data in the following formats:

  • Tokenized text sequences
  • Raw text data (which must first be converted to token IDs by the model's tokenizer)

Examples

Prompt: Translate the sentence 'Hello, how are you?' into Thai.
Response: สวัสดี คุณสบายดีไหม

Prompt: What is the meaning of 'sawatdee' in Thai?
Response: Sawatdee is a Thai greeting that roughly translates to 'hello' or 'goodbye' in English.

Prompt: Can you describe a famous Thai festival?
Response: One of the most famous Thai festivals is the Loy Krathong festival, where people create and float decorated baskets on rivers and streams to symbolize the release of negative thoughts and emotions.

Getting Started

Want to try out the SambaLingo-Thai-Base-70B model? You can load it using the Hugging Face transformers library with just a few lines of code:

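from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the tokenizer that matches the model's 57,000-token vocabulary.
tokenizer = AutoTokenizer.from_pretrained("sambanovasystems/SambaLingo-Thai-Base-70B")
# device_map="auto" spreads the weights across available devices (requires the accelerate package);
# torch_dtype="auto" keeps the data type stored in the checkpoint.
model = AutoModelForCausalLM.from_pretrained("sambanovasystems/SambaLingo-Thai-Base-70B", device_map="auto", torch_dtype="auto")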
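
Once the model and tokenizer are loaded, generating text follows the usual transformers pattern of tokenizing, calling `generate`, and decoding. The helper below is a minimal sketch of that pattern: the name `generate_text` is hypothetical, and the sampling values (`temperature=0.8`, `top_p=0.9`) and the default `max_new_tokens` are illustrative assumptions rather than settings confirmed by this page.

```python
def generate_text(model, tokenizer, prompt, max_new_tokens=64):
    """Generate a continuation of `prompt` with an already-loaded causal LM."""
    # Tokenize the raw string and move the tensors to the model's device.
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    # Sampling values are illustrative assumptions, not tuned settings.
    output_ids = model.generate(
        **inputs,
        max_new_tokens=max_new_tokens,
        do_sample=True,
        temperature=0.8,
        top_p=0.9,
    )
    # generate() returns the prompt plus the continuation; decode the first sequence.
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)


# Usage, assuming `model` and `tokenizer` were loaded as shown above:
# print(generate_text(model, tokenizer, "English: Hello, how are you?\nThai:"))
```

Because the returned string includes the prompt, callers that only want the continuation can slice off the first `len(prompt)` characters, bearing in mind that decoding may not reproduce the prompt byte for byte.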