Bloomz 3b Retriever

Cross-language retriever

Bloomz 3b Retriever is an AI model designed for Open Domain Question Answering (ODQA) tasks. What makes it distinctive is its ability to handle cross-language queries: it can link a question in one language to relevant documents in another. The model is trained on a large corpus of context/query pairs in English and French, using a contrastive method that pulls the embedding of a query closer to the embeddings of its associated contexts. As a result, it outperforms approaches such as TF-IDF, CamemBERT, and Sentence-BERT in both monolingual and cross-language settings. Because queries and documents are projected into a single shared embedding space, retrieval reduces to a simple nearest-neighbour (L2 distance) search, making the model a good fit for tasks that need fast, accurate results.

Published by cmarkea under the bigscience-bloom-rail-1.0 license.


Model Overview

Meet the Bloomz-3b-retriever model, a game-changer for Open Domain Question Answering (ODQA) tasks. This model is designed to create an embedding representation of text and queries, linking them together in a way that’s language-agnostic.

Capabilities

What can it do?

  • Create an embedding representation of text and queries for retrieval tasks
  • Link queries to documents by nearest-neighbour search in a shared embedding space (L2 distance)
  • Project queries and text into the same space so that a query ends up close to its relevant contexts

Strengths

  • Cross-language capabilities (English/French)
  • Ideal for ODQA tasks
  • Trained with a contrastive objective on a corpus of context/query pairs with a balanced language distribution (50% English, 50% French); a sketch of this kind of objective follows this list
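
The exact training objective is not spelled out here, but a minimal sketch of an in-batch contrastive objective of the kind described above (pulling each query's embedding toward its associated context and away from the other contexts in the batch, with L2 distance as the similarity measure, matching the retrieval step further down) could look like the following. The function name, batch layout and temperature are illustrative assumptions, not the actual training code.

import torch
import torch.nn.functional as F

def in_batch_contrastive_loss(query_emb, context_emb, temperature=0.05):
    # query_emb, context_emb: (batch, dim) tensors where row i of each side
    # forms a positive query/context pair; the other rows act as negatives.
    # Negative L2 distance serves as the similarity, so matching pairs score highest.
    logits = -torch.cdist(query_emb, context_emb, p=2) / temperature
    # The correct context for query i is context i.
    labels = torch.arange(query_emb.size(0), device=query_emb.device)
    return F.cross_entropy(logits, labels)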

Performance Comparison

We compared the Bloomz-3b-retriever model to TF-IDF, CamemBERT, and Sentence-BERT on the SQuAD evaluation dataset. Top-mean and Top-std report the mean and standard deviation of the rank of the correct context (lower is better), while Top-k gives the percentage of queries whose correct context appears among the first k retrieved results. The results show that our model outperforms the others in both monolingual and cross-language contexts; a sketch of how these Top-k figures can be computed follows the table.

Model                   Top-mean   Top-std   Top-1 (%)   Top-5 (%)   Top-10 (%)
TF-IDF                     128        269        23          46          56
CamemBERT                  417        347         1           2           3
Sentence-BERT               11         41        43          71          82
Bloomz-560m-retriever       10         47        51          78          86
Bloomz-3b-retriever          9         37        50          79          87
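
For reference, Top-k figures like these can be computed directly from a query-to-context distance matrix. A minimal sketch, assuming a dist array of shape (n_queries, n_contexts) like the one produced in the usage example below, plus a hypothetical gold array holding the index of the correct context for each query:

import numpy as np

def top_k_accuracy(dist, gold, k):
    # Percentage of queries whose correct context is among the k nearest
    # contexts, i.e. the k smallest L2 distances.
    nearest = dist.argsort(axis=-1)[:, :k]            # (n_queries, k) context indices
    hits = (nearest == gold[:, None]).any(axis=-1)    # is the gold index in the top k?
    return 100.0 * hits.mean()

Applied with k = 1, 5 and 10 to a given model's embeddings, this reproduces the Top-1/Top-5/Top-10 columns above.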

How to Use It

You can use the Bloomz-3b-retriever model with the Transformers library. Here’s an example:

import numpy as np
from transformers import pipeline
from scipy.spatial.distance import cdist

retriever = pipeline('feature-extraction', 'cmarkea/bloomz-3b-retriever')

# Important: take only last token!
infer = lambda x: [np.array(ii[0][-1]).reshape(1,-1) for ii in retriever(x)]

list_of_contexts = [...]
emb_contexts = np.concatenate(infer(list_of_contexts), axis=0)

list_of_queries = [...]
emb_queries = np.concatenate(infer(list_of_queries), axis=0)

# Important: take l2 distance!
dist = cdist(emb_queries, emb_contexts, 'euclidean')

top_k = lambda x: [ [list_of_contexts[qq] for qq in ii] for ii in dist.argsort(axis=-1)[:,:x]]

# top 5 nearest contexts for each query
top_contexts = top_k(5)

Examples

  • Find the top 3 nearest contexts for the query 'What is the capital of France?' among the contexts ['Paris is the capital of France.', 'London is the capital of England.', 'Berlin is the capital of Germany.'] → ['Paris is the capital of France.', 'London is the capital of England.', 'Berlin is the capital of Germany.']
  • Determine the embedding representation of the text 'The cat sat on the mat.' → [-0.0118, 0.0285, 0.0143, ..., 0.0462, -0.0119, -0.0273]
  • Calculate the L2 distance between the query 'What is AI?' and the context 'AI is a field of computer science focused on creating intelligent machines.' → 0.4211
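
To exercise the cross-language behaviour, the same pipeline can embed a French query against English contexts (or the reverse). A small self-contained sketch along the lines of the snippet above; the example strings are purely illustrative:

import numpy as np
from transformers import pipeline
from scipy.spatial.distance import cdist

retriever = pipeline('feature-extraction', 'cmarkea/bloomz-3b-retriever')
infer = lambda x: [np.array(ii[0][-1]).reshape(1, -1) for ii in retriever(x)]

contexts = [
    "Paris is the capital of France.",
    "London is the capital of England.",
    "Berlin is the capital of Germany.",
]
queries = ["Quelle est la capitale de la France ?"]  # French query, English contexts

emb_contexts = np.concatenate(infer(contexts), axis=0)
emb_queries = np.concatenate(infer(queries), axis=0)

# Smallest L2 distance = most relevant context
dist = cdist(emb_queries, emb_contexts, 'euclidean')
print(contexts[dist.argmin(axis=-1)[0]])  # expected: the Paris sentence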

Example Use Cases

  • Open Domain Question Answering (ODQA)
  • Cross-language retrieval tasks
  • Text classification and clustering (see the clustering sketch after this list)
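
For the clustering use case, the same last-token embeddings can be fed to any standard clustering algorithm. A minimal sketch using scikit-learn's KMeans; the documents and the number of clusters are arbitrary illustrations, not part of the original card:

import numpy as np
from sklearn.cluster import KMeans
from transformers import pipeline

retriever = pipeline('feature-extraction', 'cmarkea/bloomz-3b-retriever')
infer = lambda x: [np.array(ii[0][-1]).reshape(1, -1) for ii in retriever(x)]

texts = [
    "The stock market rallied after the earnings report.",
    "Les marchés financiers ont progressé après la publication des résultats.",
    "The new vaccine showed strong results in clinical trials.",
    "Le nouveau vaccin a montré de bons résultats lors des essais cliniques.",
]
embeddings = np.concatenate(infer(texts), axis=0)

# Group documents by embedding proximity, regardless of language.
labels = KMeans(n_clusters=2, n_init=10).fit_predict(embeddings)
print(labels)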

Limitations

  • Language limitations: while the model is designed to be cross-language, its performance may vary depending on the language pair.
  • Data quality: the model’s performance relies heavily on the quality of the training data.
  • Embedding-based matching: the model relies on distances in a shared embedding space to link queries and contexts, which may not always capture the nuances of human language.

