BioNER

Biomedical NER

BioNER is an AI model designed to recognize named entities in biomedical texts. What makes it unique is its ability to perform zero-shot inference, meaning it can identify entity classes it was never explicitly trained on. It can also be fine-tuned with just a few examples, making it highly adaptable. Trained on 26 biomedical named entity classes, the model can handle tasks like identifying specific diseases, chemicals, and genes. Its efficiency and speed make it a valuable tool for researchers and professionals in the biomedical field.

MilosKosRad · MIT license · Updated 2 years ago

Model Overview

Meet the Zero and Few Shot NER for Biomedical Texts model! This model is a game-changer for biomedical text analysis. Developed through a research collaboration, it’s designed to identify named entities (NEs) in biomedical texts with ease.

Capabilities

The Zero and Few Shot NER for Biomedical Texts model is a powerful tool for biomedical named entity recognition (NER). It can perform two main tasks:

  • Zero-shot inference: it can recognize entity classes in biomedical texts even when it was never explicitly trained on those classes, with no additional fine-tuning.
  • Few-shot learning: it can be fine-tuned with just a few labelled examples to recognize new entity classes.

What can it recognize?

The model is trained on 26 biomedical named entity (NE) classes, including:

  • Diseases
  • Chemicals
  • Genes
  • Proteins
  • Cell types
  • Organisms
  • And more!

You can use these classes as labels to search for entities in biomedical texts.

How does it work?

The model takes two strings as input:

  1. The entity label you’re searching for (e.g., “Disease”)
  2. The biomedical text you want to search in (e.g., “No recent antibiotics or other nephrotoxins, and no symptoms of UTI with benign UA.”)

The model outputs a list of ones (tokens that belong to a found entity) and zeros (all other tokens), one per token of the input text.
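Concretely, the contract can be sketched in plain Python. Note this is a simplified illustration: whitespace tokens stand in for the BERT subword tokens the real model uses, and the 0/1 mask here is hypothetical, not actual model output:

```python
# Simplified sketch of the model's input/output contract.
# Real inference uses BERT subword tokens; whitespace tokens are
# used here only to illustrate the 0/1 mask.

label = "Disease"
text = "The patient was diagnosed with Diabetes and Hypertension."
tokens = text.rstrip(".").split()

# A hypothetical prediction: 1 marks tokens belonging to the
# searched entity class, 0 marks everything else.
mask = [0, 0, 0, 0, 0, 1, 0, 1]

entities = [tok for tok, m in zip(tokens, mask) if m == 1]
print(entities)  # ['Diabetes', 'Hypertension']
```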

Examples
  • Find the 'Drug' entities in the text 'The patient was prescribed with Aspirin and Warfarin.' → ['Aspirin', 'Warfarin']
  • Identify the 'Disease' entities in the text 'The patient was diagnosed with Diabetes and Hypertension.' → ['Diabetes', 'Hypertension']
  • Extract the 'Chemical' entities from the text 'The reaction involves the use of Sodium Hydroxide and Hydrochloric Acid.' → ['Sodium Hydroxide', 'Hydrochloric Acid']

Fine-tuning with few-shot learning

You can fine-tune the model with new entities using just a few examples. This is useful when you need to recognize entities that are not in the original training data.

To fine-tune the model, you’ll need to:

  1. Create a dataset with BERT tokens and labels (0s and 1s)
  2. Use the Trainer class to fine-tune the model
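Step 1 can be sketched as follows. The helper below (a hypothetical name, not part of the model's API) builds 0/1 label sequences by matching entity strings against whitespace tokens; a real pipeline would instead align labels to BERT subword tokens via the tokenizer's word_ids() before passing the (input_ids, labels) pairs to transformers.Trainer:

```python
# Hypothetical helper for step 1: build 0/1 token labels for a new
# entity class from a few annotated examples. Whitespace tokens
# stand in for BERT subword tokens to keep the sketch runnable.

def make_labels(text, entities):
    tokens = text.rstrip(".").split()
    labels = [1 if tok.strip(",.") in entities else 0 for tok in tokens]
    return tokens, labels

# A few-shot example for a new class, e.g. 'Symptom'
tokens, labels = make_labels(
    "The patient reported fever and persistent cough.",
    {"fever", "cough"},
)
print(labels)  # [0, 0, 0, 1, 0, 0, 1]
```

Each resulting (tokens, labels) pair becomes one training example; a handful of these is enough for few-shot fine-tuning with the Trainer class.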

Performance

The Zero and Few Shot NER for Biomedical Texts model shows remarkable performance in named entity recognition (NER) tasks, especially in the biomedical domain. Let’s dive into its speed, accuracy, and efficiency.

Speed

The model is capable of processing text inputs quickly, making it suitable for applications where time is of the essence. For instance, it can handle a large volume of biomedical texts with ease, making it an excellent choice for researchers and scientists.

Accuracy

The model’s accuracy is impressive, with the ability to recognize named entities with high precision. It has been trained on 26 biomedical named entity classes and can perform zero-shot inference, meaning it can recognize entity classes it has never seen labelled examples of. This is particularly useful in scenarios where new entity types are introduced and the model needs to adapt quickly.

Efficiency

The model’s efficiency is evident in its ability to fine-tune with few examples of new classes. This means that it can learn from a small amount of data and adapt to new entities, making it a valuable asset in applications where data is scarce.

Limitations

The Zero and Few Shot NER for Biomedical Texts model is a powerful tool for biomedical Named Entity Recognition (NER), but it’s not perfect. Let’s talk about some of its limitations.

Limited Training Data

The model was trained on a specific set of biomedical datasets, which means it might not perform well on data from other domains or industries. For example, if you try to use it to recognize entities in a text about finance or sports, it might not work as well as it would on a biomedical text.

Class Limitations

The model was trained on 26 specific biomedical Named Entity classes. While it can be fine-tuned for new classes with few examples, it might not work well for classes that are very different from the ones it was trained on. For instance, if you try to use it to recognize entities in a text about a new disease that wasn’t included in the training data, it might not perform well.

Format

The Zero and Few Shot NER for Biomedical Texts model uses a transformer architecture. It’s designed to handle biomedical texts and can perform zero-shot inference, as well as few-shot learning with just a few examples.

Input Format

The model takes two strings as input:

  • String1: the Named Entity (NE) label being searched for
  • String2: the short text where you want to search for the NE (represented by String1)

For example:

string1 = 'Drug'
string2 = 'No recent antibiotics or other nephrotoxins, and no symptoms of UTI with benign UA.'

Output Format

The model outputs a list of ones (corresponding to the found Named Entities) and zeros (corresponding to non-NE tokens), one per token of String2.
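To recover entity strings from that mask, you can group consecutive 1-labelled tokens. A minimal sketch (in practice the 0/1 predictions come from an argmax over the model's per-token logits, and BERT subword tokens must be merged back into words first):

```python
def decode_entities(tokens, mask):
    """Group consecutive 1-labelled tokens into entity strings."""
    entities, current = [], []
    for tok, m in zip(tokens, mask):
        if m == 1:
            current.append(tok)
        elif current:
            entities.append(" ".join(current))
            current = []
    if current:
        entities.append(" ".join(current))
    return entities

tokens = ["The", "reaction", "uses", "Sodium", "Hydroxide",
          "and", "Hydrochloric", "Acid"]
mask   = [0, 0, 0, 1, 1, 0, 1, 1]
print(decode_entities(tokens, mask))  # ['Sodium Hydroxide', 'Hydrochloric Acid']
```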

Handling Inputs and Outputs

To use the model, you’ll need to tokenize the input strings using the AutoTokenizer from the transformers library. Here’s an example:

import torch
from transformers import AutoTokenizer, BertForTokenClassification

modelname = 'MilosKosRad/BioNER'
tokenizer = AutoTokenizer.from_pretrained(modelname)

string1 = 'Drug'
string2 = 'No recent antibiotics or other nephrotoxins, and no symptoms of UTI with benign UA.'

# Encode the entity label and the text as a single sentence pair
encodings = tokenizer(string1, string2, is_split_into_words=False,
                      padding=True, truncation=True, add_special_tokens=True,
                      return_offsets_mapping=False, max_length=512,
                      return_tensors='pt')

model0 = BertForTokenClassification.from_pretrained(modelname, num_labels=2)
with torch.no_grad():
    output = model0(**encodings)

# One 0/1 prediction per input token (1 = part of a found entity)
predictions = output.logits.argmax(dim=-1)
print(predictions)