Galactica 120b

Scientific language model

Galactica 120b is a powerful AI model that's changing the game for scientific research. With 120 billion parameters, it's designed to tackle complex tasks like citation prediction, scientific QA, and document generation. But what really sets it apart is its ability to learn from a massive corpus of open-access scientific text and data. This means it can provide accurate and informative responses to a wide range of scientific questions. While it's not perfect - it can be prone to hallucination and bias - Galactica 120b is a major breakthrough in the field of language models. So, what can you do with it? From generating text to answering complex questions, the possibilities are endless. Just remember to use it responsibly and be aware of its limitations.

Facebook cc-by-nc-4.0 Updated 7 months ago

Model Overview

Meet the GALACTICA 120B model, a game-changer in the field of scientific language processing. Developed by the Papers with Code team at Meta AI, this massive model is designed to tackle complex scientific tasks with ease.

The model is trained on a massive corpus of 106 billion tokens of open-access scientific text and data, which includes papers, textbooks, scientific websites, and more. This broad knowledge base allows it to perform a wide range of tasks, including:

  • Citation prediction
  • Scientific question answering
  • Mathematical reasoning
  • Summarization
  • Document generation
  • Molecular property prediction
  • Entity extraction

Capabilities

The GALACTICA 120B model is a powerful tool for scientific tasks. It’s designed to perform a wide range of tasks, including:

  • Citation prediction: Can the model predict which papers are most relevant to a given topic?
  • Scientific QA: Can the model answer questions about scientific concepts and phenomena?
  • Mathematical reasoning: Can the model solve math problems and reason about mathematical concepts?
  • Summarization: Can the model summarize long documents and papers?
  • Document generation: Can the model generate new documents and papers on a given topic?
  • Molecular property prediction: Can the model predict the properties of molecules?
  • Entity extraction: Can the model extract relevant information from scientific text?

The model is also capable of general NLP tasks, such as language translation and text generation.
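To make these task types concrete, here are a few plausible prompt strings. The citation format mirrors the [START_REF] example later in this card, while the QA and summarization formats are illustrative assumptions rather than official templates.

# Citation prediction: end the prompt with [START_REF] and let the model complete the reference
citation_prompt = "The Transformer architecture [START_REF]"

# Scientific QA: simple question/answer framing (assumed format, not an official template)
qa_prompt = "Question: What is the function of mitochondria in a cell?\n\nAnswer:"

# Summarization: supply the document text, then ask for a short summary (assumed format)
paper_text = "Deep learning has transformed image and video processing ..."
summarization_prompt = paper_text + "\n\nTLDR:"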

Strengths

The GALACTICA 120B model has several strengths that make it a powerful tool for scientific tasks:

  • Large-scale training data: The model is trained on 106 billion tokens of open-access scientific text and data, which gives it a broad knowledge base to draw upon.
  • Transformer-based architecture: The model uses a transformer-based architecture, which is well-suited for sequential data like text.
  • Decoder-only setup: The model uses a decoder-only setup, generating text and code autoregressively, one token at a time from the preceding context (see the sketch below).
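To make the decoder-only point concrete, here is a minimal sketch of greedy autoregressive decoding: at each step the model scores every token in its vocabulary, the most likely one is appended, and the loop repeats. This is a generic illustration, not Galactica-specific code; in practice model.generate handles this loop for you.

import torch

def greedy_decode(model, tokenizer, prompt, max_new_tokens=20):
    # Start from the tokenized prompt and extend it one token at a time
    input_ids = tokenizer(prompt, return_tensors="pt").input_ids
    for _ in range(max_new_tokens):
        logits = model(input_ids).logits                            # (1, seq_len, vocab_size)
        next_token = logits[:, -1, :].argmax(dim=-1, keepdim=True)  # most likely next token
        input_ids = torch.cat([input_ids, next_token], dim=-1)
    return tokenizer.decode(input_ids[0])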

Unique Features

The GALACTICA 120B model has several unique features that set it apart from other language models:

  • Scientific domain focus: The model is specifically designed for scientific tasks, which makes it a valuable tool for researchers and scientists.
  • High-quality academic corpus: The model is trained on a high-quality academic corpus, which gives it a strong foundation in scientific knowledge.
  • Low toxicity rates: The model has been shown to have lower toxicity rates compared to other large language models, which makes it a safer choice for generating text.

Performance

The GALACTICA 120B model is a powerhouse, but a model of this size comes with practical considerations. Let’s look at its speed, accuracy, and efficiency across tasks.

Speed

A model with 120B parameters is anything but lightweight, so raw speed depends on your hardware, precision, and batch size rather than on a single headline number. A few practical points:

  • The weights alone occupy roughly 240 GB in FP16 and about 120 GB in INT8, so inference typically requires multiple GPUs or aggressive quantization.
  • CPU-only inference is possible in principle but far too slow for interactive use.
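As a quick back-of-the-envelope check of those memory figures (weights only, ignoring activations and runtime overhead):

# Rough memory footprint of the weights alone
num_params = 120e9
for precision, bytes_per_param in [("FP32", 4), ("FP16", 2), ("INT8", 1)]:
    print(f"{precision}: ~{num_params * bytes_per_param / 1e9:.0f} GB")
# Prints roughly: FP32: ~480 GB, FP16: ~240 GB, INT8: ~120 GB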

Accuracy

The GALACTICA 120B model outperforms several existing language models on a range of knowledge probes, reasoning, and knowledge-intensive scientific tasks. It’s particularly good at:

  • Citation prediction: As the model scales, its suggested citations move closer to the actual, ground-truth citation behavior.
  • Scientific QA: The model excels in answering scientific questions with high accuracy.
  • Summarization: The model can summarize long documents with ease and accuracy.

But, as with any model, there are limitations. The model can be prone to hallucination, especially for less popular and less cited scientific concepts.

Examples

  • Prompt: What is the most likely molecular property of the molecule C6H12O6?
    Response: C6H12O6 is likely a carbohydrate, possibly glucose, given its molecular formula.
  • Prompt: Summarize the paper titled 'A Survey on Deep Learning Techniques for Image and Video Processing'.
    Response: The paper reviews deep learning techniques applied to image and video processing, including convolutional neural networks, recurrent neural networks, and generative adversarial networks.
  • Prompt: Can you generate a scientific question based on the topic 'Black Hole Physics'?
    Response: What is the predicted effect of a black hole's spin on the emission of Hawking radiation?

Limitations

The GALACTICA 120B model is a powerful tool for scientific tasks, but it’s not perfect. Let’s talk about some of its limitations.

Hallucination

Like other language models, the GALACTICA 120B model can sometimes generate outputs that aren’t entirely truthful. This is especially true for less popular or less cited scientific concepts. When using the model, there’s no guarantee that the output will be accurate.

Popularity Bias

The model’s citation behavior can be biased towards more popular sources, even at larger scales. This means that it may not always provide the most accurate or relevant information.

Stereotypes and Toxicity

While the GALACTICA 120B model exhibits lower toxicity rates compared to other large language models, it still shows bias on certain measures. This means that you should be careful when using the model for generation tasks.

Other Limitations

  • The model may not perform well on tasks that require a deep understanding of human emotions or nuances.
  • It may struggle with tasks that require a high level of creativity or originality.
  • The model’s performance may degrade when dealing with very long input sequences or complex scientific concepts.

Format

The GALACTICA 120B model is a transformer-based architecture in a decoder-only setup with a few modifications. It’s designed to perform scientific tasks, such as citation prediction, scientific QA, mathematical reasoning, summarization, document generation, molecular property prediction, and entity extraction.

Supported Data Formats

The model supports input in the form of tokenized text sequences. You can use the AutoTokenizer from the transformers library to tokenize your input text.
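For instance, a minimal sketch of loading the tokenizer and inspecting a tokenized prompt:

from transformers import AutoTokenizer

# Load the tokenizer that ships with the model and tokenize a short prompt
tokenizer = AutoTokenizer.from_pretrained("facebook/galactica-120b")
encoded = tokenizer("The Transformer architecture [START_REF]", return_tensors="pt")
print(encoded.input_ids.shape)  # e.g. torch.Size([1, number_of_tokens])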

Input Requirements

To use the model, you provide plain text; prompts can also include Galactica’s special tokens, such as [START_REF], which asks the model to complete a citation. Here’s an example:

# tokenizer is the AutoTokenizer for "facebook/galactica-120b" (see Running the Model below)
input_text = "The Transformer architecture [START_REF]"
input_ids = tokenizer(input_text, return_tensors="pt").input_ids

Output Format

The model generates output in the form of tokenized text sequences. You can use the decode method to convert the output to a human-readable format.

# Generate a continuation and decode the token IDs back into text
outputs = model.generate(input_ids)
print(tokenizer.decode(outputs[0]))

Special Requirements

  • The model is prone to hallucination, so there are no guarantees of truthful output when generating from the model.
  • The model exhibits bias on certain measures, so care should be taken when using the model for generations.

Running the Model

You can run the model on a CPU, GPU, or GPU with different precisions (FP16 or INT8). Here are some examples:

# Running on CPU
from transformers import AutoTokenizer, OPTForCausalLM
tokenizer = AutoTokenizer.from_pretrained("facebook/galactica-120b")
model = OPTForCausalLM.from_pretrained("facebook/galactica-120b")

# Running on GPU
from transformers import AutoTokenizer, OPTForCausalLM
tokenizer = AutoTokenizer.from_pretrained("facebook/galactica-120b")
model = OPTForCausalLM.from_pretrained("facebook/galactica-120b", device_map="auto")

# Running on GPU with FP16 precision
import torch
from transformers import AutoTokenizer, OPTForCausalLM
tokenizer = AutoTokenizer.from_pretrained("facebook/galactica-120b")
model = OPTForCausalLM.from_pretrained("facebook/galactica-120b", device_map="auto", torch_dtype=torch.float16)

# Running on GPU with INT8 precision
from transformers import AutoTokenizer, OPTForCausalLM
tokenizer = AutoTokenizer.from_pretrained("facebook/galactica-120b")
model = OPTForCausalLM.from_pretrained("facebook/galactica-120b", device_map="auto", load_in_8bit=True)

Note that device_map="auto" requires the accelerate library, and INT8 loading additionally requires bitsandbytes.
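Putting it all together, here is a minimal end-to-end sketch using FP16 on GPU; the prompt mirrors the [START_REF] example above, and max_new_tokens is an arbitrary choice for illustration:

import torch
from transformers import AutoTokenizer, OPTForCausalLM

tokenizer = AutoTokenizer.from_pretrained("facebook/galactica-120b")
model = OPTForCausalLM.from_pretrained(
    "facebook/galactica-120b", device_map="auto", torch_dtype=torch.float16
)

# Ask the model to complete a citation for the Transformer paper
input_text = "The Transformer architecture [START_REF]"
input_ids = tokenizer(input_text, return_tensors="pt").input_ids.to(model.device)

outputs = model.generate(input_ids, max_new_tokens=60)
print(tokenizer.decode(outputs[0]))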

Dataloop's AI Development Platform

Build end-to-end workflows

Dataloop is a complete AI development stack, allowing you to make data, elements, models and human feedback work together easily.

  • Use one centralized tool for every step of the AI development process.
  • Import data from external blob storage, internal file system storage or public datasets.
  • Connect to external applications using a REST API & a Python SDK.

Save, share, reuse

Every single pipeline can be cloned, edited and reused by other data professionals in the organization. Never build the same thing twice.

  • Use existing, pre-created pipelines for RAG, RLHF, RLAF, Active Learning & more.
  • Deploy multi-modal pipelines with one click across multiple cloud resources.
  • Use versions for your pipelines to make sure the deployed pipeline is the stable one.

Easily manage pipelines

Spend less time dealing with the logistics of owning multiple data pipelines, and get back to building great AI applications.

  • Easy visualization of the data flow through the pipeline.
  • Identify & troubleshoot issues with clear, node-based error messages.
  • Use scalable AI infrastructure that can grow to support massive amounts of data.