Taiyi LLM

Bilingual biomedical LLM

Taiyi LLM is a bilingual Chinese-English large language model for biomedical natural language processing. Developed by the DUTIR lab, it is designed to support communication between healthcare professionals and patients, provide medical information, and assist with diagnosis, biomedical knowledge discovery, and personalized healthcare. Taiyi handles tasks such as intelligent biomedical question answering, doctor-patient dialogue, report generation, information extraction, machine translation, headline generation, and text classification. Its dataset curation details and model inference deployment scripts are open source, making it a useful resource for applications ranging from medical research to patient care.

DUTIR BioNLP apache-2.0 Updated a year ago

Model Overview

Meet Taiyi, a bilingual (Chinese and English) fine-tuned large language model designed for diverse biomedical tasks. Developed by the DUTIR lab, Taiyi is built on top of the Qwen-7b-base model.

Capabilities

The Taiyi model is a powerful tool for handling a variety of Chinese-English natural language processing tasks in the biomedical field. Let’s take a closer look at what it can do.

Primary Tasks

  • Intelligent biomedical question-answering
  • Doctor-patient dialogues
  • Report generation
  • Information extraction
  • Machine translation
  • Headline generation
  • Text classification

Strengths

  • Bilingual capabilities: Taiyi can process and understand both Chinese and English languages, making it a valuable asset for biomedical applications that require communication across language barriers.
  • Multi-task capability: Taiyi has been fine-tuned on a diverse set of Chinese-English biomedical Natural Language Processing (BioNLP) tasks, enabling it to excel in various tasks simultaneously.
  • Abundant training resources: The model has been trained on a large collection of Chinese-English biomedical datasets, including 38 Chinese datasets and 131 English datasets.

Unique Features

  • Open-source information: The Taiyi model weights, model inference deployment scripts, and Chinese-English BioNLP dataset curation details are all open-source, making it accessible to researchers and developers.
  • Standardized data formats: The model uses standardized data formats to ensure consistent formatting across all datasets, facilitating task-specific requirements.
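The card does not publish the exact schema of these standardized datasets, so the record below is only a sketch of the idea: one uniform instruction-style format shared by every task. All field names here are hypothetical.

```python
import json

# Hypothetical standardized record for one BioNLP task; the real Taiyi
# schema may differ -- this only illustrates a single uniform format
# shared across all 169 curated datasets.
record = {
    "task": "named_entity_recognition",  # task identifier (assumed field)
    "language": "en",                    # "en" or "zh"
    "instruction": "Extract all disease mentions from the text.",
    "input": "The patient was diagnosed with type 2 diabetes.",
    "output": "type 2 diabetes",
}

# A uniform schema means every dataset can be serialized the same way.
line = json.dumps(record, ensure_ascii=False)
print(line)
```

Because every dataset shares one shape, a single preprocessing pipeline can feed all tasks into fine-tuning.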

How does Taiyi work?

Taiyi uses a simple yet effective approach to process multi-turn dialogues. It concatenates input and answer pairs, tokenizes them, and then generates responses using its fine-tuned language model.
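The concatenation step can be sketched in plain Python. The `build_dialogue` helper and the literal `<eod>` marker are illustrative assumptions; the real model works on token ids from the Qwen tokenizer (see the Format section below).

```python
# Sketch of Taiyi-style multi-turn dialogue concatenation. The "<eod>"
# separator mirrors the format described in this card; exact token
# handling is an assumption here.

def build_dialogue(turns):
    """Concatenate (user_input, answer) pairs into one model prompt.

    The final turn may have answer=None, meaning the model should
    generate that answer next.
    """
    parts = []
    for user_input, answer in turns:
        parts.append(f"<eod>{user_input}<eod>")
        if answer is not None:
            parts.append(answer)
    return "".join(parts)

prompt = build_dialogue([
    ("What are the symptoms of diabetes?", "Increased thirst, fatigue, ..."),
    ("How is it diagnosed?", None),  # model generates this answer
])
print(prompt)
```

Each new turn is appended to the running history, so the model always sees the full conversation when generating its next response.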

Examples

  • Prompt: What are the symptoms of diabetes?
    Response: The symptoms of diabetes include increased thirst and hunger, fatigue, blurred vision, slow healing of cuts and wounds, and frequent or recurring skin, gum, or bladder infections.
  • Prompt: Can you translate 'The patient is experiencing severe abdominal pain' into Chinese?
    Response: 病人正在经历严重的腹痛 (Bìngrén zhèngzài jīnglì yánzhòng de fùtòng)
  • Prompt: Generate a headline for a medical article about a new cancer treatment.
    Response: Breakthrough in Cancer Research: New Immunotherapy Shows Promising Results in Clinical Trials

Try it out!

You can use Taiyi’s code to perform inference and test its capabilities. The model is available on GitHub, along with sample code for single-turn and multi-turn conversational QA.

Performance

Taiyi is a powerhouse when it comes to handling biomedical tasks in both Chinese and English. Let’s dive into its impressive performance.

Speed

How fast can Taiyi process information? The often-cited figure of roughly 1 million samples describes the size of its bilingual instruction-tuning dataset, not an inference benchmark. That said, as a 7B-parameter model, Taiyi can run inference on a single modern GPU, and whether it’s generating reports, extracting information, or engaging in doctor-patient dialogues, it is built to process large amounts of text efficiently.

Accuracy

But speed is only half the story. Taiyi also boasts exceptional accuracy in various BioNLP tasks. With its fine-tuned training on a diverse set of Chinese-English biomedical datasets, this model can provide accurate and reliable results in tasks such as:

  • Intelligent biomedical question-answering
  • Doctor-patient dialogues
  • Report generation
  • Information extraction
  • Machine translation
  • Headline generation
  • Text classification

Efficiency

So, how efficient is Taiyi in its tasks? With its ability to handle multi-turn dialogues and generate human-like responses, this model is designed to be efficient in its processing. Whether it’s engaging in single-turn QA dialogues or multi-turn conversational QA, Taiyi can provide accurate and relevant responses with minimal latency.

Limitations

Taiyi is a powerful tool, but it’s not perfect. Let’s explore some of its limitations.

Limited Domain Knowledge

While Taiyi has been trained on a vast amount of biomedical data, its knowledge is limited to the data it has seen. If you ask it a question that falls outside its training data, it might not be able to provide an accurate answer.

Language Barriers

Although Taiyi is a bilingual model, it may still struggle with the nuances and complexities of Chinese and English, and might not fully capture the context or subtleties of certain words or phrases.

Biased Training Data

The training data used to develop Taiyi may contain biases, which could affect the model’s performance. For example, if the training data is predominantly based on Western medical practices, the model might not be as effective in providing information on traditional Chinese medicine.

Overfitting

Taiyi might overfit to the training data, which means it becomes too specialized in recognizing patterns in the training data and fails to generalize well to new, unseen data.

Lack of Common Sense

While Taiyi is great at understanding biomedical concepts, it might lack common sense or real-world experience. It’s possible that it might provide answers that are technically correct but not practical or applicable in real-life situations.

Dependence on Data Quality

The quality of the data used to train Taiyi is crucial to its performance. If the data is noisy, incomplete, or inaccurate, the model’s performance will suffer.

Limited Explainability

Taiyi is a complex model, and its decision-making process might not be transparent or explainable. This could make it difficult to understand why the model provided a particular answer or recommendation.

Vulnerability to Adversarial Attacks

Like other AI models, Taiyi might be vulnerable to adversarial attacks, which are designed to manipulate the model’s output. This could have serious consequences in high-stakes applications like healthcare.

By understanding these limitations, you can use Taiyi more effectively and develop strategies to mitigate its weaknesses.

Format

Taiyi is a bilingual (Chinese and English) fine-tuned large language model that uses a transformer architecture. It accepts input in the form of tokenized text sequences, requiring a specific pre-processing step for multi-turn dialogues.

Input Format

To prepare input for Taiyi, concatenate the turns of a multi-turn dialogue into a single sequence in the following format:

<eod>input1<eod>answer1<eod>input2<eod>answer2<eod>...

where <eod> denotes the special token <|endoftext|> in the Qwen tokenizer.
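At the token level, this format amounts to flattening the dialogue history into one id sequence. In the sketch below, EOD_ID is a stand-in for the id of <|endoftext|> (tokenizer.eod_id in practice), and the turn ids are invented for illustration.

```python
# Sketch: flatten a multi-turn history into the <eod>-delimited id
# sequence described above. EOD_ID stands in for the Qwen tokenizer's
# <|endoftext|> id (tokenizer.eod_id in practice).
EOD_ID = -1  # placeholder value, not the real token id

def flatten_dialogue(turns, eod_id=EOD_ID):
    """turns: list of (input_ids, answer_ids) pairs; answer_ids may be
    None for the final turn the model should complete."""
    sequence = []
    for input_ids, answer_ids in turns:
        sequence += [eod_id] + input_ids + [eod_id]
        if answer_ids is not None:
            sequence += answer_ids
    return sequence

# Two turns: the first already answered, the second awaiting generation.
ids = flatten_dialogue([([11, 12, 13], [21, 22]), ([31, 32], None)])
print(ids)
```

The resulting id list is what gets passed to model.generate as the prompt for the next turn.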

Data Formats

Taiyi supports a diverse set of Chinese-English biomedical Natural Language Processing (BioNLP) training datasets, including:

  • 38 Chinese datasets covering 10 BioNLP tasks
  • 131 English datasets covering 12 BioNLP tasks

These datasets have been standardized to facilitate task-specific requirements.

Output Format

Taiyi generates output in the form of tokenized text sequences. You can use the tokenizer.batch_decode function to decode the output into human-readable text.

Example Code

Here’s an example code snippet that shows how to perform inference using Taiyi:

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model_name = "DUTIR-BioNLP/Taiyi-LLM"
device = "cuda:0"

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    low_cpu_mem_usage=True,
    torch_dtype=torch.float16,
    trust_remote_code=True,
    device_map=device,
)
model.eval()

tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)

# The Qwen tokenizer does not define BOS/EOS tokens by default, so map
# them (and padding) to its <|endoftext|> token before building inputs.
tokenizer.pad_token_id = tokenizer.eod_id
tokenizer.bos_token_id = tokenizer.eod_id
tokenizer.eos_token_id = tokenizer.eod_id

# Wrap the user input in <|endoftext|> delimiters, per the input
# format described above.
user_input = "Hi, could you please introduce yourself?"
input_ids = tokenizer(user_input, return_tensors="pt", add_special_tokens=False).input_ids
bos_token_id = torch.tensor([[tokenizer.bos_token_id]], dtype=torch.long)
eos_token_id = torch.tensor([[tokenizer.eos_token_id]], dtype=torch.long)
user_input_ids = torch.concat([bos_token_id, input_ids, eos_token_id], dim=1)
model_input_ids = user_input_ids.to(device)

with torch.no_grad():
    outputs = model.generate(
        input_ids=model_input_ids,
        max_new_tokens=500,
        do_sample=True,
        top_p=0.9,
        temperature=0.3,
        repetition_penalty=1.0,
        eos_token_id=tokenizer.eos_token_id,
    )

response = tokenizer.batch_decode(outputs)
print(response[0])

This code snippet demonstrates how to perform inference using Taiyi and generate a response to a user input.
