Reader-LM 1.5b

HTML to Markdown

Have you ever struggled with converting HTML content to Markdown? The Reader-LM 1.5b model is here to help. This AI model is specifically designed to simplify content conversion by learning from a vast collection of HTML and Markdown content. What sets it apart is its ability to handle context lengths of up to 256,000 tokens, making it well suited to long HTML documents. With its BF16 tensor type and 3.1 GB RAM requirement, the model strikes a balance between speed and memory usage. Whether you're working on content conversion tasks or exploring new ways to automate workflows, Reader-LM 1.5b is a powerful tool to consider.

Jina AI · cc-by-nc-4.0 · Updated 7 months ago

Model Overview

Meet the Reader-LM model, trained by Jina AI. This model is part of a series that can convert HTML content into Markdown content, making it super useful for content conversion tasks.

Capabilities

The Reader-LM model is designed to convert HTML content to Markdown content. But what does that mean for you?

Imagine you have a website or a blog with lots of HTML content, and you want to easily share it on platforms that support Markdown, like GitHub or Bitbucket. That’s where Reader-LM comes in.

Primary Tasks

  • Convert HTML content to Markdown content
  • Process and understand HTML structures and tags
  • Generate Markdown text that’s easy to read and use

Strengths

  • Trained on a large collection of HTML and Markdown content
  • Can handle complex HTML structures and tags
  • Fast and efficient conversion process

Unique Features

  • No need for prefix instructions - just input the raw HTML
  • Supports conversion of entire websites or specific web pages
  • Can be used with Google Colab’s free T4 GPU tier for easy experimentation

How it Works

  1. You input the raw HTML content into the model
  2. The model processes the HTML and generates Markdown text
  3. You can use the output Markdown text on platforms that support it
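The three steps above can be sketched in plain Python. This is illustrative only: it shows the input format (raw HTML as a single chat message, no instruction prefix) and what the Markdown output would look like, without running the model itself.

```python
# Step 1: the input is just raw HTML; no "convert this" instruction is needed.
html = "<html><body><h1>Hello, world!</h1></body></html>"

# Step 2: the model consumes the HTML wrapped as a single user message.
# (The actual generation call appears in the full example later on this page.)
messages = [{"role": "user", "content": html}]

# Step 3: the generated Markdown can be pasted anywhere Markdown is supported.
expected_markdown = "# Hello, world!"  # illustrative: the Markdown form of the <h1> above

print(messages[0]["content"])
print(expected_markdown)
```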

Example Use Case

Let’s say you want to convert the HackerNews website to Markdown. You can use the Reader-LM model to do this, and even explore the output on Google Colab.

Model Variants

We offer two model variants:

Model Name        Context Length   Download
reader-lm-0.5b    256K             🤗 Hugging Face
reader-lm-1.5b    256K             🤗 Hugging Face

Both models are trained on the same dataset, but the 1.5b model has more parameters and may perform better on certain tasks.

Performance

Reader-LM is a powerful AI model that excels in converting HTML content to Markdown content with remarkable speed and accuracy. But what makes it so efficient?

Speed

Imagine having to convert a large website from HTML to Markdown manually. It would take hours, right? Reader-LM can do it in a fraction of the time, and its 256K-token context length lets it handle large HTML files with ease.

Accuracy

But speed is not the only thing that matters. Reader-LM is also incredibly accurate. It’s trained on a curated collection of HTML and Markdown content, which enables it to learn the nuances of both formats. This means that the output is not only fast but also precise.

Efficiency

So, how does Reader-LM compare to other models? Well, let’s take a look at some numbers:

Model              Parameters
Reader-LM (0.5b)   0.5B
Reader-LM (1.5b)   1.5B
Other models       7B

As you can see, Reader-LM is far more parameter-efficient than other models, achieving comparable conversion quality with a fraction of the parameters.

Examples
  • Input: <html><body><h1>Hello, world!</h1></body></html> converts to the heading # Hello, world!
  • Input: <html><body><h2>This is a heading</h2><p>This is a paragraph of text.</p></body></html> converts to ## This is a heading followed by the paragraph text
  • Input: <html><body><ul><li>Item 1</li><li>Item 2</li></ul></body></html> converts to the bulleted list - Item 1 and - Item 2
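To make the mapping in these examples concrete, here is a toy converter built on Python's standard-library html.parser. It handles only headings and list items, and it is emphatically not the model itself; it merely illustrates the kind of structural HTML-to-Markdown mapping Reader-LM learns.

```python
from html.parser import HTMLParser

class ToyMarkdown(HTMLParser):
    """Toy HTML-to-Markdown mapper (illustration only, not Reader-LM).

    Handles only h1, h2, and li; everything else passes through as plain text.
    """
    def __init__(self):
        super().__init__()
        self.out = []
        self.prefix = ""

    def handle_starttag(self, tag, attrs):
        # Choose the Markdown prefix for the upcoming text node.
        self.prefix = {"h1": "# ", "h2": "## ", "li": "- "}.get(tag, "")

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.out.append(self.prefix + text)
            self.prefix = ""

    def markdown(self):
        return "\n".join(self.out)

p = ToyMarkdown()
p.feed("<html><body><h1>Hello, world!</h1></body></html>")
print(p.markdown())  # -> # Hello, world!
```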

Getting Started

Want to try out the Reader-LM model? The easiest way is to run the Colab notebook, which demonstrates how to use the model to convert the HackerNews website into Markdown. You can also load the model locally by installing transformers and following the example code.
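For local use, a minimal setup looks like the following (assuming a recent Python environment; torch is the usual backend for transformers models):

```shell
pip install transformers torch
```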

Limitations

Reader-LM is a powerful tool for converting HTML content to Markdown content, but it’s not perfect. Let’s take a closer look at some of its limitations.

Limited Context Length

Reader-LM has a maximum context length of 256K tokens. That is generous, but HTML documents that exceed it must be truncated or split before conversion, and the model cannot capture context beyond that window. This could lead to incomplete or inaccurate conversions for very large pages.
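One practical mitigation is to pre-check input size before sending a document to the model. The sketch below uses a rough characters-per-token heuristic (the 4-characters-per-token ratio is an assumption, not a property of the model); for an exact count you would tokenize the HTML with the model's own tokenizer.

```python
MAX_CONTEXT_TOKENS = 256_000  # Reader-LM's advertised context length

def likely_fits(html: str, chars_per_token: float = 4.0) -> bool:
    """Rough pre-check: estimate the token count from the character count.

    The chars_per_token ratio is a heuristic assumption; use the model's
    tokenizer for an exact count before trusting the result.
    """
    estimated_tokens = len(html) / chars_per_token
    return estimated_tokens <= MAX_CONTEXT_TOKENS

print(likely_fits("<html><body><h1>Hi</h1></body></html>"))  # True: tiny page
print(likely_fits("<p>x</p>" * 200_000))                     # False: far too long
```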

Dependence on Training Data

Reader-LM was trained on a curated collection of HTML content and its corresponding Markdown content. This means that the model’s performance is heavily dependent on the quality and diversity of the training data. If the training data is biased or limited, the model’s output may reflect these biases.

No Prefix Instruction Required

While it’s convenient that Reader-LM doesn’t require a prefix instruction, this also means that the model relies on the raw HTML input to generate the Markdown output. If the HTML input is poorly formatted or contains errors, the model’s output may suffer as a result.

Potential for Errors

As with any AI model, Reader-LM is not immune to errors. The model may struggle with complex HTML structures, or it may introduce errors during the conversion process. It’s essential to review the output carefully to ensure accuracy.

Comparison to Other Models

How does Reader-LM compare to other HTML-to-Markdown tools? While Reader-LM has its strengths, other models may offer better performance or more advanced features. It's crucial to evaluate the specific needs of your project and choose the best model for the task.

Room for Improvement

Reader-LM is a series of models, with different versions offering varying levels of performance. The 0.5b and 1.5b models have different strengths and weaknesses. As the model continues to evolve, we can expect to see improvements in its performance and capabilities.

Format

Reader-LM is a type of AI model that converts HTML content into Markdown content. It’s like a translator, but instead of languages, it translates web pages into a format that’s easier to read and write.

Architecture

The model uses a transformer architecture, which is a type of neural network that’s great at handling sequential data like text. It’s trained on a large dataset of HTML and Markdown content, which allows it to learn the patterns and relationships between the two formats.

Data Formats

Reader-LM supports input in the form of raw HTML content, which means you don’t need to add any special instructions or prefixes to the input. The model will take care of converting it into Markdown format.

Input and Output

To use Reader-LM, you’ll need to provide the HTML content as input, and the model will generate the corresponding Markdown content as output.

Here’s an example of how to use the model in Python:

from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the model and tokenizer
checkpoint = "jinaai/reader-lm-1.5b"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForCausalLM.from_pretrained(checkpoint)

# Example HTML content (raw HTML, no instruction prefix required)
html_content = "<html><body><h1>Hello, world!</h1></body></html>"

# Prepare the input as a single chat message containing the HTML
messages = [{"role": "user", "content": html_content}]
input_text = tokenizer.apply_chat_template(messages, tokenize=False)

# Generate the output with greedy decoding and a mild repetition penalty
inputs = tokenizer.encode(input_text, return_tensors="pt")
outputs = model.generate(inputs, max_new_tokens=1024, do_sample=False, repetition_penalty=1.08)

# Print the decoded Markdown
print(tokenizer.decode(outputs[0]))

This code will take the HTML content as input and generate the corresponding Markdown content as output.

Special Requirements

To use Reader-LM, you’ll need to have the transformers library installed, and you’ll need to specify the device (GPU or CPU) that you want to use for inference. You can do this by setting the device variable to "cuda" for GPU usage or "cpu" for CPU usage.
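A defensive way to pick that device, assuming PyTorch as the backend (the try/except fallback to CPU is a convenience for environments where torch is not installed):

```python
# Pick the inference device: "cuda" if a GPU is visible, otherwise "cpu".
try:
    import torch
    device = "cuda" if torch.cuda.is_available() else "cpu"
except ImportError:
    device = "cpu"  # torch not installed; CPU-only fallback

print(device)
# Then move the model and inputs to the chosen device before generating:
#   model.to(device)
#   inputs = inputs.to(device)
```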

Dataloop's AI Development Platform
Build end-to-end workflows

Dataloop is a complete AI development stack, allowing you to make data, elements, models and human feedback work together easily.

  • Use one centralized tool for every step of the AI development process.
  • Import data from external blob storage, internal file system storage or public datasets.
  • Connect to external applications using a REST API & a Python SDK.
Save, share, reuse

Every single pipeline can be cloned, edited and reused by other data professionals in the organization. Never build the same thing twice.

  • Use existing, pre-created pipelines for RAG, RLHF, RLAF, Active Learning & more.
  • Deploy multi-modal pipelines with one click across multiple cloud resources.
  • Use versions for your pipelines to make sure the deployed pipeline is the stable one.
Easily manage pipelines

Spend less time dealing with the logistics of owning multiple data pipelines, and get back to building great AI applications.

  • Easy visualization of the data flow through the pipeline.
  • Identify & troubleshoot issues with clear, node-based error messages.
  • Use scalable AI infrastructure that can grow to support massive amounts of data.