WikiNEuRal Multilingual NER

Multilingual NER Model

The WikiNEuRal Multilingual NER model is a game-changer for Named Entity Recognition (NER) tasks in multiple languages. Trained on a massive dataset derived from Wikipedia, it recognizes entities in 9 languages: German, English, Spanish, French, Italian, Dutch, Polish, Portuguese, and Russian. What makes it truly remarkable is that its training data was built by combining neural and knowledge-based approaches, which helps it achieve state-of-the-art results. With its multilingual capabilities and high-quality training data, this model is well suited to applications that require accurate entity recognition across languages. However, keep in mind that it might not generalize well to textual genres that differ from Wikipedia, such as news articles. To get the most out of this model, consider training on a combination of datasets for more robust results.

Developed by Babelscape · License: cc-by-nc-sa-4.0

Model Overview

The WikiNEuRal Multilingual NER model is a powerful tool for recognizing named entities in text, developed by Babelscape. It works across nine languages: German, English, Spanish, French, Italian, Dutch, Polish, Portuguese, and Russian.

What can it do?

This model can identify and classify named entities in text, such as:

  • Names of people (e.g. “Wolfgang”)
  • Places (e.g. “Berlin”)
  • Organizations (e.g. “Google”)

How was it trained?

The model is a multilingual BERT (mBERT) fine-tuned for 3 epochs on WikiNEuRal, a large dataset of Wikipedia texts created by combining neural and knowledge-based approaches.
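
For readers who want to reproduce a similar setup, here is a minimal fine-tuning sketch. It assumes the WikiNEuRal data is available on the Hugging Face Hub as Babelscape/wikineural with word-level tokens and ner_tags columns; the split names (such as train_en), the label order, and the hyperparameters below are illustrative assumptions rather than the exact recipe behind the released checkpoint:

from datasets import load_dataset
from transformers import (AutoTokenizer, AutoModelForTokenClassification,
                          DataCollatorForTokenClassification, Trainer, TrainingArguments)

# Label set shown for illustration; the actual order in the dataset may differ
labels = ["O", "B-PER", "I-PER", "B-ORG", "I-ORG", "B-LOC", "I-LOC", "B-MISC", "I-MISC"]
id2label = dict(enumerate(labels))
label2id = {l: i for i, l in id2label.items()}

tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
model = AutoModelForTokenClassification.from_pretrained(
    "bert-base-multilingual-cased", num_labels=len(labels),
    id2label=id2label, label2id=label2id)

dataset = load_dataset("Babelscape/wikineural")  # split names such as "train_en" are assumptions

def tokenize_and_align(batch):
    # Tokenize pre-split words; copy each word's tag to its first sub-token and
    # mark the rest (and special tokens) with -100 so the loss ignores them
    enc = tokenizer(batch["tokens"], is_split_into_words=True, truncation=True)
    enc["labels"] = []
    for i, tags in enumerate(batch["ner_tags"]):
        prev, aligned = None, []
        for wid in enc.word_ids(batch_index=i):
            aligned.append(-100 if wid is None or wid == prev else tags[wid])
            prev = wid
        enc["labels"].append(aligned)
    return enc

tokenized = dataset.map(tokenize_and_align, batched=True,
                        remove_columns=dataset["train_en"].column_names)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="wikineural-ner", num_train_epochs=3,
                           learning_rate=2e-5, per_device_train_batch_size=32),
    train_dataset=tokenized["train_en"],  # one language shown; the released model covers all nine
    data_collator=DataCollatorForTokenClassification(tokenizer),
)
trainer.train()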

Capabilities

Meet the WikiNEuRal model, a powerful tool for Named Entity Recognition (NER) in multiple languages. But what can it do?

Primary Tasks

The WikiNEuRal model is designed to identify and classify named entities in text, such as:

  • People (e.g. Wolfgang)
  • Places (e.g. Berlin)
  • Organizations (e.g. Google)

Strengths

This model has been fine-tuned on a large multilingual dataset, making it capable of recognizing entities in 9 languages:

  • German (de)
  • English (en)
  • Spanish (es)
  • French (fr)
  • Italian (it)
  • Dutch (nl)
  • Polish (pl)
  • Portuguese (pt)
  • Russian (ru)

Unique Features

The WikiNEuRal model combines the strengths of both neural and knowledge-based approaches to create high-quality training data. This means it can:

  • Learn from large amounts of text data
  • Leverage knowledge from Wikipedia to improve accuracy

Performance

WikiNEuRal Multilingual NER is a powerful model that shines in Named Entity Recognition (NER) tasks, especially when it comes to handling multiple languages. Let’s dive into its performance and see what makes it stand out.

Speed

How fast can WikiNEuRal Multilingual NER process text? Rather than requiring a separate model per language, it handles text in all 9 supported languages (de, en, es, fr, it, nl, pl, pt, ru) with a single model, which keeps things simple when you need to process large amounts of multilingual text.
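
As a quick illustration (the sentences below are made-up examples, not from the original model card), the same pipeline can be pointed at text in several of the supported languages without any per-language configuration:

from transformers import pipeline

# One pipeline serves all nine supported languages
nlp = pipeline("ner", model="Babelscape/wikineural-multilingual-ner", grouped_entities=True)

sentences = [
    "Angela Merkel wurde in Hamburg geboren.",          # German
    "Luis trabaja en Telefónica en Madrid.",            # Spanish
    "Marie habite à Paris et travaille chez Renault.",  # French
]
for s in sentences:
    print(s, nlp(s))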

Accuracy

But how accurate is it? WikiNEuRal Multilingual NER has been fine-tuned on a large dataset and has achieved impressive results. It has improved the state-of-the-art in multilingual NER by up to 6 span-based F1-score points. This means it’s highly effective in identifying and categorizing named entities in text.
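
If you want to measure span-based F1 on your own labelled data, the seqeval library computes it from BIO-tagged sequences; the tags below are a tiny toy example, not the paper's evaluation data:

from seqeval.metrics import f1_score, classification_report

# Gold and predicted BIO tags for two toy sentences
y_true = [["B-PER", "O", "O", "B-LOC"], ["O", "B-ORG", "I-ORG", "O"]]
y_pred = [["B-PER", "O", "O", "B-LOC"], ["O", "B-ORG", "O", "O"]]

print(f1_score(y_true, y_pred))              # span-based (entity-level) F1
print(classification_report(y_true, y_pred))  # per-class precision/recall/F1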

Efficiency

What about efficiency? WikiNEuRal Multilingual NER plugs directly into the Transformers ner pipeline, which makes it easy to integrate into your existing workflows and applications.

Example Use Case

Here’s an example of how you can use WikiNEuRal Multilingual NER:

from transformers import AutoTokenizer, AutoModelForTokenClassification
from transformers import pipeline

# Load the tokenizer and the fine-tuned token-classification model
tokenizer = AutoTokenizer.from_pretrained("Babelscape/wikineural-multilingual-ner")
model = AutoModelForTokenClassification.from_pretrained("Babelscape/wikineural-multilingual-ner")

# grouped_entities=True merges sub-word pieces into full entity spans
# (newer transformers versions express this as aggregation_strategy="simple")
nlp = pipeline("ner", model=model, tokenizer=tokenizer, grouped_entities=True)

example = "My name is Wolfgang and I live in Berlin"
ner_results = nlp(example)

print(ner_results)

This code uses the WikiNEuRal Multilingual NER model to perform NER on a sample text and prints the results.

Examples
Input: My name is Wolfgang and I live in Berlin
Output: {'entity_group': 'PER', 'score': 0.999, 'word': 'Wolfgang', 'start': 11, 'end': 19}, {'entity_group': 'LOC', 'score': 0.999, 'word': 'Berlin', 'start': 34, 'end': 40}

Input: I am from Paris and I work in Rome
Output: {'entity_group': 'LOC', 'score': 0.999, 'word': 'Paris', 'start': 10, 'end': 15}, {'entity_group': 'LOC', 'score': 0.999, 'word': 'Rome', 'start': 30, 'end': 34}

Input: The company is based in New York and has offices in London
Output: {'entity_group': 'LOC', 'score': 0.999, 'word': 'New York', 'start': 24, 'end': 32}, {'entity_group': 'LOC', 'score': 0.999, 'word': 'London', 'start': 52, 'end': 58}

Limitations and Bias

While the WikiNEuRal Multilingual NER model is powerful, it’s not perfect. It may not generalize well to all types of text, especially those that are very different from Wikipedia articles. To improve its performance, you can try training it on a combination of datasets.

Limited Generalizability

This model is trained on WikiNEuRal, a dataset derived from Wikipedia. While this is a great resource, it might not work as well for other types of text, like news articles. In fact, models trained only on news articles have been shown to perform poorly on encyclopedic articles. This is because the language and style used in different genres can be quite different.

Lack of Robustness

To build more robust systems, it’s a good idea to train a model on a combination of datasets, like WikiNEuRal and CoNLL. This can help the model learn to recognize entities in different contexts and improve its overall performance.
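
A minimal sketch of that idea, assuming both corpora are available on the Hugging Face Hub (Babelscape/wikineural and conll2003) with word-level tokens and ner_tags columns; in practice you must check that the integer label ids use the same order in both datasets (and remap them if not) before mixing:

from datasets import load_dataset, concatenate_datasets

wikineural = load_dataset("Babelscape/wikineural", split="train_en")  # split name is an assumption
conll = load_dataset("conll2003", split="train")

# Keep only the columns the two corpora share
keep = ("tokens", "ner_tags")
wikineural = wikineural.remove_columns([c for c in wikineural.column_names if c not in keep])
conll = conll.remove_columns([c for c in conll.column_names if c not in keep])

# concatenate_datasets needs identical features, so align CoNLL's schema with WikiNEuRal's
# (this assumes both use the same label order; remap the tags first if they differ)
conll = conll.cast(wikineural.features)

combined = concatenate_datasets([wikineural, conll]).shuffle(seed=42)
print(combined)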

Bias and Representation

As with any AI model, there’s a risk of bias and unequal representation in the data. Wikipedia, the source of the WikiNEuRal dataset, may have its own biases and gaps in representation. This could affect the model’s performance and fairness.

Format

The WikiNEuRal model is a multilingual BERT (mBERT) transformer fine-tuned for multilingual Named Entity Recognition (NER). It accepts tokenized text sequences as input, so raw text must first be pre-processed with the model's tokenizer.

Supported Data Formats

This model supports 9 languages:

  • German (de)
  • English (en)
  • Spanish (es)
  • French (fr)
  • Italian (it)
  • Dutch (nl)
  • Polish (pl)
  • Portuguese (pt)
  • Russian (ru)

Input Requirements

To use this model, you need to:

  1. Tokenize your text with the AutoTokenizer from the transformers library.
  2. Pass the tokenized inputs to the model (the ner pipeline shown earlier handles both steps for you).

Example:

from transformers import AutoTokenizer

# The tokenizer turns raw text into the sub-word token IDs the model expects
tokenizer = AutoTokenizer.from_pretrained("Babelscape/wikineural-multilingual-ner")
example = "My name is Wolfgang and I live in Berlin"
inputs = tokenizer(example, return_tensors="pt")  # PyTorch tensors, ready for the model

Output Format

The model outputs a list of named entities, each with a label and a score.

Example:

from transformers import pipeline

# Reuses the model and tokenizer loaded in the earlier snippets
nlp = pipeline("ner", model=model, tokenizer=tokenizer, grouped_entities=True)
ner_results = nlp(example)
print(ner_results)

This will output a list of named entities, such as:

[
  {"entity_group": "PER", "score": 0.99, "word": "Wolfgang", "start": 11, "end": 19},
  {"entity_group": "LOC", "score": 0.99, "word": "Berlin", "start": 34, "end": 40}
]

Note that the output format may vary depending on the specific use case and requirements.
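
If you skip the pipeline and call the model directly, you get one label per sub-word token instead of grouped spans. Here is a sketch of that lower-level path, reusing the model, tokenizer, and inputs created in the snippets above:

import torch

# Forward pass on the tokenized inputs built in the Input Requirements example
with torch.no_grad():
    logits = model(**inputs).logits

# Most likely label id per sub-word token, mapped back to a tag name
predicted_ids = logits.argmax(dim=-1)[0].tolist()
tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0].tolist())
for token, label_id in zip(tokens, predicted_ids):
    print(token, model.config.id2label[label_id])

Each tag follows the BIO scheme (for example B-PER, I-PER), which the pipeline's grouping step collapses into the entity_group spans shown earlier.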
