Taiyi LLM
Taiyi LLM is a bilingual (Chinese-English) large language model fine-tuned for biomedical natural language processing. Developed by the DUTIR lab, it is designed to support communication between healthcare professionals and patients, provide medical information, and assist in diagnosis, biomedical knowledge discovery, and personalized healthcare. Taiyi handles tasks including intelligent biomedical question answering, doctor-patient dialogue, report generation, information extraction, machine translation, headline generation, and text classification. Because its model weights, dataset curation details, and inference deployment scripts are open source, Taiyi is a practical resource for work ranging from medical research to patient care.
Model Overview
Meet Taiyi, a bilingual (Chinese and English) fine-tuned large language model designed for diverse biomedical tasks. Developed by the DUTIR lab, Taiyi is built on top of the Qwen-7b-base model.
Capabilities
The Taiyi model is a powerful tool for handling a variety of Chinese-English natural language processing tasks in the biomedical field. Let’s take a closer look at what it can do.
Primary Tasks
- Intelligent biomedical question-answering
- Doctor-patient dialogues
- Report generation
- Information extraction
- Machine translation
- Headline generation
- Text classification
Strengths
- Bilingual capabilities: Taiyi can process and understand both Chinese and English languages, making it a valuable asset for biomedical applications that require communication across language barriers.
- Multi-task capability: Taiyi has been fine-tuned on a diverse set of Chinese-English biomedical Natural Language Processing (BioNLP) tasks, enabling it to excel in various tasks simultaneously.
- Abundant training resources: The model has been trained on a large collection of Chinese-English biomedical datasets, including 38 Chinese datasets and 131 English datasets.
Unique Features
- Open-source information: The Taiyi model weights, model inference deployment scripts, and Chinese-English BioNLP dataset curation details are all open-source, making it accessible to researchers and developers.
- Standardized data formats: All training datasets are converted into standardized data formats, ensuring consistent formatting across datasets while still accommodating task-specific requirements.
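To make the idea of a standardized format concrete, here is a minimal sketch of what an instruction-style training record might look like. The field names (`task`, `instruction`, and so on) are illustrative assumptions, not the actual schema used by the Taiyi datasets:

```python
# Illustrative sketch of a standardized instruction-style training record.
# Field names are assumptions for illustration, not Taiyi's actual schema.
sample = {
    "task": "named_entity_recognition",  # one of the BioNLP task types
    "language": "en",                    # "zh" or "en"
    "instruction": "Extract all disease mentions from the text.",
    "input": "The patient was diagnosed with type 2 diabetes mellitus.",
    "output": "type 2 diabetes mellitus",
}

def to_prompt(record: dict) -> str:
    """Flatten a standardized record into a single instruction prompt."""
    return f"{record['instruction']}\n{record['input']}"

print(to_prompt(sample))
```

A uniform record shape like this lets many heterogeneous datasets be mixed into one instruction-tuning corpus.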
How does Taiyi work?
Taiyi uses a simple yet effective approach to process multi-turn dialogues. It concatenates input and answer pairs, tokenizes them, and then generates responses using its fine-tuned language model.
Try it out!
You can use Taiyi’s code to perform inference and test its capabilities. The model is available on GitHub, along with sample code for single-turn and multi-turn conversational QA.
Want to give it a try?
Performance
Taiyi is a powerhouse when it comes to handling biomedical tasks in both Chinese and English. Let’s dive into its impressive performance.
Speed
How fast can Taiyi process information? Its bilingual instruction dataset contains roughly 1 million samples, a figure that reflects the breadth of its training rather than raw inference speed. At inference time, Taiyi is a 7B-parameter model that can be loaded in half precision on a single modern GPU, so whether it’s generating reports, extracting information, or engaging in doctor-patient dialogues, it can process sizable workloads efficiently.
Accuracy
But speed is only half the story. Taiyi also boasts exceptional accuracy in various BioNLP tasks. With its fine-tuned training on a diverse set of Chinese-English biomedical datasets, this model can provide accurate and reliable results in tasks such as:
- Intelligent biomedical question-answering
- Doctor-patient dialogues
- Report generation
- Information extraction
- Machine translation
- Headline generation
- Text classification
Efficiency
So, how efficient is Taiyi in its tasks? Because multi-turn dialogues are concatenated into a single input sequence, the model handles both single-turn QA and multi-turn conversational QA within one generation pipeline, producing relevant responses without a separate dialogue-management component.
Limitations
Taiyi is a powerful tool, but it’s not perfect. Let’s explore some of its limitations.
Limited Domain Knowledge
While Taiyi has been trained on a vast amount of biomedical data, its knowledge is limited to the data it has seen. If you ask it a question that falls outside its training data, it might not be able to provide an accurate answer.
Language Barriers
Although Taiyi is a bilingual model, it may struggle with nuances and complexities of the Chinese and English languages. It’s possible that it might not fully understand the context or subtleties of certain words or phrases.
Biased Training Data
The training data used to develop Taiyi may contain biases, which could affect the model’s performance. For example, if the training data is predominantly based on Western medical practices, the model might not be as effective in providing information on traditional Chinese medicine.
Overfitting
Taiyi might overfit to the training data, which means it becomes too specialized in recognizing patterns in the training data and fails to generalize well to new, unseen data.
Lack of Common Sense
While Taiyi is great at understanding biomedical concepts, it might lack common sense or real-world experience. It’s possible that it might provide answers that are technically correct but not practical or applicable in real-life situations.
Dependence on Data Quality
The quality of the data used to train Taiyi is crucial to its performance. If the data is noisy, incomplete, or inaccurate, the model’s performance will suffer.
Limited Explainability
Taiyi is a complex model, and its decision-making process might not be transparent or explainable. This could make it difficult to understand why the model provided a particular answer or recommendation.
Vulnerability to Adversarial Attacks
Like other AI models, Taiyi might be vulnerable to adversarial attacks, which are designed to manipulate the model’s output. This could have serious consequences in high-stakes applications like healthcare.
By understanding these limitations, you can use Taiyi more effectively and develop strategies to mitigate its weaknesses.
Format
Taiyi is a bilingual (Chinese and English) fine-tuned large language model that uses a transformer architecture. It accepts input in the form of tokenized text sequences, requiring a specific pre-processing step for multi-turn dialogues.
Input Format
To prepare input for Taiyi, you need to concatenate multi-turn dialogues into a specific format. Here’s an example:
\<eod>input1\<eod>answer1\<eod>input2\<eod>answer2\<eod>...

where \<eod> denotes the special token <|endoftext|> in the Qwen tokenizer.
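The concatenation above can be sketched in a few lines of Python, using the Qwen special token <|endoftext|> directly as the separator string (in practice you would operate on the tokenizer’s special-token IDs rather than raw text):

```python
EOD = "<|endoftext|>"  # Qwen's end-of-text special token, playing the role of <eod>

def build_dialogue(turns):
    """Concatenate alternating (input, answer) pairs into one sequence:
    <eod>input1<eod>answer1<eod>input2<eod>answer2<eod>..."""
    parts = []
    for user_input, answer in turns:
        parts.append(EOD + user_input + EOD + answer)
    return "".join(parts) + EOD

history = [("What is hypertension?",
            "Hypertension is persistently elevated blood pressure.")]
print(build_dialogue(history))
```

Each turn is delimited by the same separator, so the model sees the full conversation as one flat token sequence.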
Data Formats
Taiyi supports a diverse set of Chinese-English biomedical Natural Language Processing (BioNLP) training datasets, including:
- 38 Chinese datasets covering 10 BioNLP tasks
- 131 English datasets covering 12 BioNLP tasks
These datasets have been standardized to facilitate task-specific requirements.
Output Format
Taiyi generates output in the form of tokenized text sequences. You can use the tokenizer.batch_decode function to decode the output into human-readable text.
Example Code
Here’s an example code snippet that shows how to perform inference using Taiyi:
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model_name = "DUTIR-BioNLP/Taiyi-LLM"
device = "cuda:0"

# Load the model in half precision on a single GPU
model = AutoModelForCausalLM.from_pretrained(model_name, low_cpu_mem_usage=True, torch_dtype=torch.float16, trust_remote_code=True, device_map=device)
model.eval()
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)

# Wrap the user input with the tokenizer's BOS/EOS special tokens
user_input = "Hi, could you please introduce yourself?"
input_ids = tokenizer(user_input, return_tensors="pt", add_special_tokens=False).input_ids
bos_token_id = torch.tensor([[tokenizer.bos_token_id]], dtype=torch.long)
eos_token_id = torch.tensor([[tokenizer.eos_token_id]], dtype=torch.long)
model_input_ids = torch.concat([bos_token_id, input_ids, eos_token_id], dim=1).to(device)

# Generate and decode the response
with torch.no_grad():
    outputs = model.generate(input_ids=model_input_ids, max_new_tokens=500, do_sample=True, top_p=0.9, temperature=0.3, repetition_penalty=1.0, eos_token_id=tokenizer.eos_token_id)
response = tokenizer.batch_decode(outputs)
print(response[0])
```
This code snippet demonstrates how to perform inference using Taiyi and generate a response to a user input.
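Extending the single-turn example to multi-turn conversation means carrying the token IDs of earlier turns forward and appending each new BOS-wrapped user turn. The helper below is an illustrative sketch of that bookkeeping (the official GitHub repository provides the reference multi-turn script):

```python
import torch

def append_turn(history_ids, new_input_ids, bos_id, eos_id):
    """Append a new user turn, wrapped in BOS/EOS tokens, to the running
    dialogue token IDs. history_ids may be None for the first turn."""
    bos = torch.tensor([[bos_id]], dtype=torch.long)
    eos = torch.tensor([[eos_id]], dtype=torch.long)
    turn = torch.cat([bos, new_input_ids, eos], dim=1)
    if history_ids is None:
        return turn
    return torch.cat([history_ids, turn], dim=1)
```

In a chat loop, you would pass the `outputs` tensor returned by `model.generate` (which already contains the prior turns) as `history_ids`, so each generation sees the full conversation.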