Bloomz 3b Retriever
Bloomz-3b-retriever is an embedding model designed for Open Domain Question Answering (ODQA). What makes it stand out is its cross-language ability: a question asked in one language can be linked to relevant documents written in another. The model is trained on a large corpus of context/query pairs in English and French, using a contrastive objective that pulls the embeddings of queries and their associated contexts closer together. As a result, it outperforms baselines such as TF-IDF, CamemBERT, and Sentence-BERT in both monolingual and cross-language settings. Because queries and contexts are projected into a shared embedding space with a simple algebraic structure (matching pairs end up close under L2 distance), retrieval reduces to a fast nearest-neighbour search.
Model Overview
Meet Bloomz-3b-retriever, an embedding model built for Open Domain Question Answering (ODQA). It creates embedding representations of texts and queries and links them together in a way that is language-agnostic.
Capabilities
What can it do?
- Create embedding representations of texts and queries for retrieval tasks
- Link queries to relevant documents via distances in a shared embedding space
- Project queries and contexts so that matching pairs end up close together
Strengths
- Cross-language capabilities (English/French)
- Ideal for ODQA tasks
- Trained on a corpus of context/query pairs with a balanced language distribution (50% English, 50% French)
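The overview above only states that training is contrastive; the exact objective is not given in this card. A common way to implement such an objective is an in-batch InfoNCE-style loss that pulls each query toward its paired context and pushes it away from the other contexts in the batch. The sketch below is a hypothetical PyTorch formulation of that idea, not the actual training code for Bloomz-3b-retriever.

```python
import torch
import torch.nn.functional as F

def info_nce_loss(query_emb, context_emb, temperature=0.05):
    """In-batch contrastive (InfoNCE-style) loss.

    query_emb, context_emb: tensors of shape (batch, dim), where row i of
    query_emb is paired with row i of context_emb. Illustrative only: the
    loss actually used to train Bloomz-3b-retriever is not described here.
    """
    # Similarity between every query and every context in the batch.
    sim = query_emb @ context_emb.T / temperature          # (batch, batch)
    # The matching context for query i sits on the diagonal.
    labels = torch.arange(sim.size(0), device=sim.device)
    # Cross-entropy pulls each query toward its own context and away
    # from the other contexts in the batch.
    return F.cross_entropy(sim, labels)
```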
Performance Comparison
We compared Bloomz-3b-retriever to TF-IDF, CamemBERT, Sentence-BERT, and the smaller Bloomz-560m-retriever on the SQuAD evaluation dataset. Top-mean and Top-std report the mean and standard deviation of the rank of the correct context (lower is better), while Top-k reports the percentage of queries whose correct context appears among the k nearest retrieved contexts (higher is better). The Bloomz retrievers clearly outperform the other models in both monolingual and cross-language contexts.
| Model | Top-mean | Top-std | Top-1 (%) | Top-5 (%) | Top-10 (%) |
|---|---|---|---|---|---|
| TF-IDF | 128 | 269 | 23 | 46 | 56 |
| CamemBERT | 417 | 347 | 1 | 2 | 3 |
| Sentence-BERT | 11 | 41 | 43 | 71 | 82 |
| Bloomz-560m-retriever | 10 | 47 | 51 | 78 | 86 |
| Bloomz-3b-retriever | 9 | 37 | 50 | 79 | 87 |
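If you want to run this kind of evaluation yourself, the sketch below shows one possible way to compute the rank and Top-k statistics from query/context embeddings with NumPy and SciPy. The evaluation script behind the table above is not included in this card, so the function and its details are illustrative; it assumes query i is paired with context i.

```python
import numpy as np
from scipy.spatial.distance import cdist

def retrieval_stats(emb_queries, emb_contexts, ks=(1, 5, 10)):
    """Illustrative evaluation: query i is assumed to match context i.

    Returns the mean/std of the rank of the correct context and the
    Top-k hit rates, mirroring the columns of the table above.
    """
    # Pairwise L2 distances, as in the usage example below.
    dist = cdist(emb_queries, emb_contexts, 'euclidean')
    # Rank (0-based) of the correct context for every query.
    order = dist.argsort(axis=-1)
    ranks = np.argmax(order == np.arange(len(emb_queries))[:, None], axis=-1)
    stats = {'top-mean': ranks.mean(), 'top-std': ranks.std()}
    for k in ks:
        stats[f'top-{k} (%)'] = 100.0 * np.mean(ranks < k)
    return stats
```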
How to Use It
You can use the Bloomz-3b-retriever model with the Transformers library. Here’s an example:
```python
import numpy as np
from transformers import pipeline
from scipy.spatial.distance import cdist

retriever = pipeline('feature-extraction', 'cmarkea/bloomz-3b-retriever')

# Important: take only the last token's embedding!
infer = lambda x: [np.array(ii[0][-1]).reshape(1, -1) for ii in retriever(x)]

list_of_contexts = [...]
emb_contexts = np.concatenate(infer(list_of_contexts), axis=0)

list_of_queries = [...]
emb_queries = np.concatenate(infer(list_of_queries), axis=0)

# Important: use the L2 (Euclidean) distance!
dist = cdist(emb_queries, emb_contexts, 'euclidean')

# For each query, return the x nearest contexts.
top_k = lambda x: [[list_of_contexts[qq] for qq in ii]
                   for ii in dist.argsort(axis=-1)[:, :x]]

# Top 5 nearest contexts for each query.
top_contexts = top_k(5)
```
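To make the cross-language behaviour concrete, here is a small illustrative continuation of the snippet above, with English contexts and a French query. The texts are made up for this example; they are not from the evaluation data.

```python
# Illustrative data only: English contexts, one French query.
list_of_contexts = [
    "The Eiffel Tower was completed in 1889 for the Exposition Universelle.",
    "Python is a programming language created by Guido van Rossum.",
]
list_of_queries = ["En quelle année la tour Eiffel a-t-elle été achevée ?"]

emb_contexts = np.concatenate(infer(list_of_contexts), axis=0)
emb_queries = np.concatenate(infer(list_of_queries), axis=0)
dist = cdist(emb_queries, emb_contexts, 'euclidean')

# The Eiffel Tower context should rank first despite the language mismatch.
print(top_k(1)[0][0])
```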
Example Use Cases
- Open Domain Question Answering (ODQA)
- Cross-language retrieval tasks
- Text classification and clustering (a minimal clustering sketch follows this list)
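Since clustering appears in the list above, the sketch below shows one possible way to feed the embeddings produced by `infer` (from the usage example) into scikit-learn's KMeans. The number of clusters and the variable names are assumptions made for illustration.

```python
from sklearn.cluster import KMeans

documents = [...]  # your texts here

# Reuse `infer` from the usage example above to embed the documents.
embeddings = np.concatenate(infer(documents), axis=0)

# Illustrative choice: 8 clusters; tune n_clusters for your corpus.
kmeans = KMeans(n_clusters=8, n_init=10, random_state=0)
labels = kmeans.fit_predict(embeddings)
```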
Limitations
- Language limitations: while the model is designed to be cross-language, its performance may vary depending on the language pair.
- Data quality: the model’s performance relies heavily on the quality of the training data.
- Embedding-based matching: the model relies on distances in an embedding space to bring queries and contexts together, which may not always capture the nuances of human language.