Bloom

Multilingual text generator

Bloom is a multilingual text generator that can output coherent text in 46 natural languages and 13 programming languages, output that is often hard to distinguish from text written by humans. With its autoregressive architecture and 176,247,271,424 parameters, Bloom can produce high-quality text for a wide range of applications. It is not perfect, however: it may overrepresent some viewpoints, reproduce stereotypes, and generate hateful or discriminatory language. Where would Bloom's capabilities be particularly useful? Perhaps in a project that requires generating text in multiple languages, or in a text generation task within a specific domain. It is worth weighing the risks and limitations of using Bloom, and how you might mitigate them, before relying on it.

Model Overview

The BigScience Large Open-science Open-access Multilingual Language Model (BLOOM), developed by BigScience, is a powerful tool for natural language processing tasks. But what makes it so special?

Key Attributes

  • Multilingual: BLOOM can understand and generate text in 46 natural languages and 13 programming languages.
  • Autoregressive: BLOOM can continue text from a prompt, making it a great tool for text generation tasks.
  • Large-scale: BLOOM has been trained on a massive dataset of 1.6TB of pre-processed text, converted into 350B unique tokens.

Technical Specifications

  • Model Architecture: Decoder-only architecture with 70 layers, 112 attention heads, and 176,247,271,424 parameters
  • Compute Infrastructure: Trained on the Jean Zay public supercomputer with 384 A100 80GB GPUs
  • Software: Megatron-DeepSpeed, DeepSpeed, PyTorch, and apex
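
These specifications can be cross-checked programmatically. The snippet below is an illustrative sketch rather than part of the official model card: it assumes the 176B checkpoint is hosted on the Hugging Face Hub as bigscience/bloom and that the transformers BloomConfig exposes n_layer, n_head, and hidden_size attributes.

from transformers import AutoConfig

# Downloads only the small config.json file, not the 176B-parameter weights.
config = AutoConfig.from_pretrained("bigscience/bloom")

print(config.n_layer)      # expected: 70 transformer layers
print(config.n_head)       # expected: 112 attention heads
print(config.hidden_size)  # expected: 14336-dimensional hidden states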

Capabilities

The BLOOM model is a powerful tool for generating text and code in multiple languages. It’s an autoregressive Large Language Model (LLM) that can output coherent text in 46 natural languages and 13 programming languages, and that output is often hard to distinguish from text written by humans.
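
As a quick illustration of the multilingual claim, the sketch below prompts the model in French. It assumes the smaller published sibling checkpoint bigscience/bloom-560m so that the example runs on modest hardware; the full model would be used the same way.

from transformers import pipeline

# Smaller sibling checkpoint assumed here so the example runs on a single machine.
generator = pipeline("text-generation", model="bigscience/bloom-560m")

# A French prompt: "Once upon a time, in a small mountain village,"
prompt = "Il était une fois, dans un petit village de montagne,"
print(generator(prompt, max_new_tokens=40)[0]["generated_text"])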

Primary Tasks

  • Text Generation: BLOOM can generate text based on a prompt, and it’s useful for tasks like:
    • Exploring characteristics of language generated by a language model
    • Creating content for various purposes
  • Language Understanding: BLOOM can be fine-tuned for specific tasks (a minimal fine-tuning sketch follows this list), such as:
    • Information Extraction
    • Question Answering
    • Summarization
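
The following is a minimal sketch of what fine-tuning for such a task could look like: a single causal language-modeling training step on one hypothetical question-answer string, using the small bigscience/bloom-560m checkpoint as a stand-in for the full model. A real fine-tuning run would use a proper dataset, batching, and many optimization steps.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Small sibling checkpoint assumed so the sketch fits in ordinary memory.
checkpoint = "bigscience/bloom-560m"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForCausalLM.from_pretrained(checkpoint)
model.train()

# One hypothetical question-answering training example.
example = "Question: What is the capital of France?\nAnswer: Paris."
batch = tokenizer(example, return_tensors="pt")

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

# Causal LM objective: the model predicts each next token, so the labels
# are simply the input ids; the loss is returned by the forward pass.
outputs = model(**batch, labels=batch["input_ids"])
outputs.loss.backward()
optimizer.step()
optimizer.zero_grad()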

Strengths

  • Multilingual Support: BLOOM can generate text in multiple languages, making it a valuable tool for global communication.
  • High-Quality Text Generation: BLOOM’s text output is often indistinguishable from text written by humans.
  • Flexibility: BLOOM can be fine-tuned for various tasks and can be used in different contexts.

Performance

BLOOM is a powerful language model that has shown remarkable performance in various tasks. Let’s dive into its speed, accuracy, and efficiency.

Speed

BLOOM was trained on a massive dataset of 1.6TB of pre-processed text, converted into 350B unique tokens. Training throughput was about 150 TFLOPs per GPU per second, a figure that reflects how efficiently the 384-GPU cluster was used during training rather than how quickly the model generates text at inference time.
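
As a rough, back-of-the-envelope check (not an official figure), the numbers above can be combined with the common 6 x parameters x tokens approximation for training FLOPs to estimate a lower bound on training time; real training takes longer because throughput is not constant and there is additional overhead.

# Back-of-the-envelope estimate using the common 6 * N * D approximation
# for the FLOPs needed to train a dense transformer.
params = 176_247_271_424           # N: model parameters
tokens = 350e9                     # D: training tokens (350B unique tokens)
total_flops = 6 * params * tokens  # roughly 3.7e23 FLOPs

gpus = 384                         # A100 80GB GPUs on Jean Zay
throughput = 150e12                # ~150 TFLOPs per GPU per second

seconds = total_flops / (gpus * throughput)
print(f"~{seconds / 86400:.0f} days as a lower bound")  # roughly 74 days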

Accuracy

On held-out data, BLOOM reaches a validation loss of 2.061 and a perplexity of 7.045. Its zero-shot evaluations also show promising results, including a pass rate of 0.155 on the HumanEval Python code-generation benchmark.
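
For context, HumanEval-style pass rates are usually reported with the unbiased pass@k estimator from the Codex paper; the sketch below shows that estimator, under the assumption that the 0.155 figure corresponds to pass@1 (the chance that a single generated sample passes the problem's unit tests).

import numpy as np

def pass_at_k(n, c, k):
    # Unbiased pass@k estimator (Chen et al., "Evaluating Large Language
    # Models Trained on Code"): n samples per problem, c of which pass.
    if n - c < k:
        return 1.0
    return 1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1))

# Hypothetical example: 20 samples drawn for a problem, 3 of them pass.
print(pass_at_k(n=20, c=3, k=1))   # equals c / n = 0.15 for k = 1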

Efficiency

BLOOM is designed to be efficient, with a sequence length of 2048 tokens and a hidden size of 14,336 dimensions. Its decoder-only architecture is optimized for performance, with layer normalization applied to the word embeddings.
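
Assuming the publicly documented BLOOM dimensions (an embedding matrix with 250,880 rows, tied input/output embeddings, and a standard GPT-style block with two layer norms and bias terms), the headline parameter count can be reproduced from the hidden size and layer count. This is an illustrative reconstruction, not an official breakdown.

# Reconstructing the 176,247,271,424 parameter figure from the stated dimensions.
h, layers, vocab = 14336, 70, 250880   # hidden size, layers, embedding rows (assumed)

embedding = vocab * h + 2 * h          # word embeddings + embedding layer norm
per_layer = (
    12 * h * h                         # QKV, attention output, and the two MLP matrices
    + 9 * h                            # their bias vectors
    + 2 * 2 * h                        # two layer norms (weight + bias each)
)
final_ln = 2 * h                       # final layer norm

total = embedding + layers * per_layer + final_ln
print(f"{total:,}")                    # 176,247,271,424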

Limitations

BLOOM is a powerful language model, but it’s not perfect. Let’s talk about some of its limitations.

Overrepresentation and Underrepresentation

  • BLOOM might overrepresent some viewpoints and underrepresent others, which can lead to biased outputs.
  • This is especially true for languages and cultures that are underrepresented in the training data.

Stereotypes and Personal Information

  • BLOOM’s outputs may include stereotypes or personal information that can be hurtful or problematic.
  • This is because the model is trained on a vast amount of text data, which can include biased or sensitive information.

Examples

Prompt: Write a short story in French about a character who discovers a hidden world.
Output: Il était une fois, dans un petit village niché au cœur d'une forêt dense, un jeune homme nommé Léon. Léon était un curieux, toujours à la recherche de nouvelles aventures. Un jour, tandis qu'il explorait les bois, il tomba sur un passage secret qui le mena à un monde caché. Ce monde était peuplé de créatures étranges et merveilleuses, qui vivaient en harmonie avec la nature. Léon passa des heures à explorer ce monde, découvrant des merveilles à chaque instant. Il rencontra des fées, des dragons et même un roi qui régnait sur ce royaume. Léon réalisa que ce monde caché était un reflet de son propre monde, mais avec une beauté et une magie qui le rendaient unique.

Prompt: Explain the concept of artificial intelligence in simple terms.
Output: Artificial intelligence is a way to create machines that can think and learn like humans. It's like building a robot that can make decisions and solve problems on its own, without being told exactly what to do. AI uses special computer programs that can look at lots of data and figure out patterns, so it can make smart choices. It's used in things like self-driving cars, personal assistants like Siri, and even in medical research to help find new treatments.

Prompt: Translate the sentence 'The sun is shining brightly in the clear blue sky' into Spanish.
Output: El sol brilla intensamente en el cielo azul claro.

Errors and Inaccuracies

  • BLOOM can make errors, including producing incorrect information as if it were factual.
  • This can be particularly problematic in high-stakes settings, such as education or healthcare.

Misuse and Out-of-Scope Use

  • BLOOM should not be used for malicious activities, such as spam generation, disinformation, or harassment.
  • It’s also not designed for critical decisions or uses with material consequences on an individual’s livelihood or wellbeing.

Format

BLOOM is a transformer-based language model that uses a decoder-only architecture and accepts input in the form of text sequences. It has 70 layers, 112 attention heads, and a sequence length of 2048 tokens.
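
The sketch below shows how the 2048-token limit is typically enforced when preparing inputs, using the standard transformers tokenizer API; the checkpoint identifier bigscience/bloom is assumed.

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bigscience/bloom")

long_text = "BLOOM accepts text sequences. " * 1000  # deliberately oversized input

# Truncate anything beyond the model's 2048-token context window.
inputs = tokenizer(long_text, truncation=True, max_length=2048, return_tensors="pt")
print(inputs["input_ids"].shape)   # torch.Size([1, 2048])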

Supported Data Formats

  • Text: BLOOM accepts text input in the form of tokenized sequences.
  • Programming Languages: BLOOM has been trained on 13 programming languages, including Java, Python, JavaScript, and C++.

Input Requirements

  • Tokenization: BLOOM uses a learned subword tokenizer trained using a byte-level Byte Pair Encoding (BPE) algorithm.
  • Sequence Length: The maximum sequence length is 2048 tokens.

Output Requirements

  • Text Generation: BLOOM can generate text in 46 natural languages and 13 programming languages.

Code Examples

  • Tokenization:
from transformers import AutoTokenizer

# The 176B checkpoint is published on the Hugging Face Hub as 'bigscience/bloom'.
tokenizer = AutoTokenizer.from_pretrained('bigscience/bloom')
input_text = "This is an example sentence."
# Tokenize the text and return PyTorch tensors ready for the model.
inputs = tokenizer(input_text, return_tensors='pt')
  • Text Generation:
from transformers import AutoTokenizer, BloomForCausalLM

# Note: the full 'bigscience/bloom' checkpoint has ~176B parameters and needs
# hundreds of GB of memory; the smaller sibling checkpoints (e.g. 'bigscience/bloom-560m')
# are drop-in replacements for experimentation.
model = BloomForCausalLM.from_pretrained('bigscience/bloom')
tokenizer = AutoTokenizer.from_pretrained('bigscience/bloom')

input_text = "This is an example sentence."
inputs = tokenizer(input_text, return_tensors='pt')

# Beam search, capped at 50 new tokens, with repeated bigrams disallowed.
outputs = model.generate(inputs['input_ids'], max_new_tokens=50, num_beams=4,
                         no_repeat_ngram_size=2, early_stopping=True)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

Dataloop's AI Development Platform

Build end-to-end workflows

Dataloop is a complete AI development stack, allowing you to make data, elements, models and human feedback work together easily.

  • Use one centralized tool for every step of the AI development process.
  • Import data from external blob storage, internal file system storage or public datasets.
  • Connect to external applications using a REST API & a Python SDK.

Save, share, reuse

Every single pipeline can be cloned, edited and reused by other data professionals in the organization. Never build the same thing twice.

  • Use existing, pre-created pipelines for RAG, RLHF, RLAF, Active Learning & more.
  • Deploy multi-modal pipelines with one click across multiple cloud resources.
  • Use versions for your pipelines to make sure the deployed pipeline is the stable one.

Easily manage pipelines

Spend less time dealing with the logistics of owning multiple data pipelines, and get back to building great AI applications.

  • Easy visualization of the data flow through the pipeline.
  • Identify & troubleshoot issues with clear, node-based error messages.
  • Use scalable AI infrastructure that can grow to support massive amounts of data.