Fine-Tune | Inference of Idefics3–8B on custom data for OCR

Bhavya Joshi
9 min read · Dec 19, 2024


After successfully fine-tuning and quantizing the Qwen2-VL mLLM, I tried fine-tuning Idefics3–8B to compare the results on my OCR use case. This blog covers how to create a custom dataset for training, how to fine-tune the model with LoRA/QLoRA, and how to run inference on our images.

Hit the clap button and follow me if I just saved you 10 hours of grid search!!!

Why is Idefics3–8B a good model for OCR use-cases?

Before we go through the fine-tuning and inferencing of Idefics3–8B, let me explain why I chose this model for this use case.

Idefics3–8B-Llama3, an open multimodal model by HuggingFace, processes image and text inputs to produce text outputs. It excels in answering image-related questions, describing visuals, creating image-based stories, and functioning as a pure language model, with enhanced OCR, document understanding, and visual reasoning capabilities over its predecessors.

Key Features and Technical Details:

1. Enhanced Visual Token Encoding:

Idefics3–8B uses 169 visual tokens for encoding an image of size 364x364 pixels, improving image representation compared to Idefics2 (64 tokens). This increase significantly benefits OCR tasks, which require precise text localization and understanding.

2. Image-Splitting Strategy:

The model divides larger images into smaller tiles of 364x364 pixels, processes each independently, and integrates them into a single sequence while maintaining spatial context. This method enables effective handling of high-resolution documents and complex layouts.

3. Improved Pre-Training Data:

Idefics3 integrates the Docmatix dataset, which contains 2.4M images and 9.5M QA pairs from 1.3M PDF documents. This large-scale, high-quality dataset improves document understanding and OCR performance.

4. Specialized Fine-Tuning:

Supervised fine-tuning stages incorporate datasets tailored to document understanding, text transcription, and OCR tasks, ensuring robust model performance on these use cases.

5. Architecture Updates:

Replaces the perceiver resampler with a more efficient pixel shuffle strategy, increasing the number of visual tokens while retaining OCR-relevant details.

6. Multimodal Input Support:

Idefics3 can handle interleaved text and images, allowing it to excel in complex visual-textual reasoning and real-world OCR applications.

7. Alignment of Training and Inference:

The model’s training incorporates OCR-specific challenges, such as recognizing text in natural images and document layouts, and ensures alignment with evaluation tasks like document understanding and text extraction.

8. Efficiency and Scalability:

Despite enhancements, the model maintains scalability and efficient training processes, crucial for real-world applications.

Idefics3–8B is particularly suited for OCR due to its ability to process large, complex images effectively, handle diverse textual layouts, and excel in document understanding tasks. These improvements are grounded in its architectural innovations, specialized datasets, and training strategies.

The benchmark scores are available on the model card:

https://huggingface.co/HuggingFaceM4/Idefics3-8B-Llama3
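If you want to see the visual-token budget and the 364x364 tiling in action, here is a minimal sketch (my own illustration, not from the model card) that loads the processor, feeds it a stand-in image, and counts how many <image> placeholder tokens end up in the prompt. The image size and prompt text are placeholders; the count depends on the processor's image-splitting settings.

from transformers import AutoProcessor
from PIL import Image

processor = AutoProcessor.from_pretrained("HuggingFaceM4/Idefics3-8B-Llama3")

# Blank stand-in for a document photo; any PIL image works here.
img = Image.new("RGB", (1456, 1092), "white")

messages = [{"role": "user", "content": [{"type": "image"}, {"type": "text", "text": "Read the text in the image."}]}]
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(text=prompt, images=[img], return_tensors="pt")

# Count the <image> placeholder tokens the processor inserted for this image.
image_token_id = processor.tokenizer.convert_tokens_to_ids("<image>")
print((inputs["input_ids"] == image_token_id).sum().item())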

Custom Dataset creation for fine-tuning Idefics3–8B

Enough about the technical details of the model; let's jump into creating our own custom dataset for Idefics3–8B fine-tuning. I am using the same dataset that I used to fine-tune the Qwen2-VL model. It consists of around 3000 images. My main objective is to extract the Model, Vehicle Sr. No., and Engine No. from the VIN plate images and the Chassis No. from the chassis images.

Here are sample images and their OCR label format.

Vinplate Image:

"vinplate.jpg":
{
"Vehicle Sr No": "MA1TA2YS2R2A13882",
"Engine No": "YSR4A38798",
"Model": "SCORPIO CLASSIC S5 MT 7S"
}

Chassis Image:

"chassis.jpg": 
{
"Vehicle Sr No": "MA1TA2YS2R2A17264",
"Engine No": null,
"Model": null
}

We have to create an image-question-answer pair for each sample, stored as a list of dictionaries, something like the example below (a small generation sketch follows it). If you want to learn more about this custom dataset preparation, refer to this blog.

[
{
"image": "path/to/image_folder/chassis_2024-02-23_12-09-51.jpg",
"question": "What Vehicle Sr No, Engine No, Model and image_label can be identified in the image?",
"answer": "{\n \"Vehicle Sr No\": null,\n \"Engine No\": null,\n \"Model\": null,\n \"image_label\": \"other\"\n}"
},
{
"image": "path/to/image_folder/chassis_2024-04-02_15-13-31.jpg",
"question": "Can you pull the Vehicle Sr No, Engine No, Model and image_label from the image?",
"answer": "{\n \"Vehicle Sr No\": null,\n \"Engine No\": null,\n \"Model\": null,\n \"image_label\": \"other\"\n}"
},
{
"image": "path/to/image_folder/chassis_2024-04-02_15-17-41.jpg",
"question": "Identify the Vehicle Sr No, Engine No, Model and image_label from this image.",
"answer": "{\n \"Vehicle Sr No\": null,\n \"Engine No\": null,\n \"Model\": null,\n \"image_label\": \"other\"\n}"
},
.
.
.
]

Let's name this file idefics3-dataset.json.
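For reference, here is a minimal sketch of how such a file could be generated from raw labels like the ones shown above. The labels.json filename, the image folder, and the question templates are assumptions for illustration, and the raw labels are assumed to already include the image_label field.

import json
import random

# Hypothetical input: labels.json maps an image filename to its answer fields,
# including the image_label ('chassis', 'vinplate', or 'other').
QUESTION_TEMPLATES = [
    "What Vehicle Sr No, Engine No, Model and image_label can be identified in the image?",
    "Can you pull the Vehicle Sr No, Engine No, Model and image_label from the image?",
    "Identify the Vehicle Sr No, Engine No, Model and image_label from this image.",
]

with open("labels.json") as f:
    raw_labels = json.load(f)

samples = []
for filename, answer in raw_labels.items():
    samples.append({
        "image": f"path/to/image_folder/{filename}",
        "question": random.choice(QUESTION_TEMPLATES),
        # Store the answer as a JSON string, matching the format above
        "answer": json.dumps(answer, indent=1),
    })

with open("idefics3-dataset.json", "w") as f:
    json.dump(samples, f, indent=1)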

LoRA/QLoRA fine-tuning of Idefics3–8B on custom dataset

I have explained the benefits of LoRA / QLoRA finetuning in this blog.

Let's dive into fine-tuning the Idefics3–8B model.

My working environment:

OS             Linux
Python         3.11
CUDA (nvcc)    12.1
accelerate     1.2.1
bitsandbytes   0.45.0
datasets       3.2.0
torch          2.4.0+cu121
peft           0.14.0
flash-attn     2.7.2

Here is the fine-tuning script, although I recommend running it in a Jupyter Notebook.

## Loading required libraries
import json
from tqdm import tqdm
import torch
from peft import LoraConfig, prepare_model_for_kbit_training, get_peft_model
from transformers import AutoProcessor, AutoModelForVision2Seq, BitsAndBytesConfig
from datasets import load_dataset
from datasets import Dataset, DatasetDict, Image
import random



## Defining the fine-tuning type
## Choose whether to use LoRA or QLoRA fine-tuning
DEVICE = "cuda:0"
USE_LORA = False
USE_QLORA = True
model_path = "HuggingFaceM4/Idefics3-8B-Llama3"


## Loading processor
processor = AutoProcessor.from_pretrained(
    model_path,
    do_image_splitting=False,
    local_files_only=True
)

## Loading Model
if USE_QLORA or USE_LORA:
    lora_config = LoraConfig(
        r=8,
        lora_alpha=8,
        lora_dropout=0.1,
        target_modules=['down_proj','o_proj','k_proj','q_proj','gate_proj','up_proj','v_proj'],
        use_dora=False if USE_QLORA else True,
        init_lora_weights="gaussian"
    )
    lora_config.inference_mode = False
    if USE_QLORA:
        bnb_config = BitsAndBytesConfig(
            load_in_4bit=True,
            bnb_4bit_use_double_quant=True,
            bnb_4bit_quant_type="nf4",
            bnb_4bit_compute_dtype=torch.bfloat16
        )

    model = AutoModelForVision2Seq.from_pretrained(
        model_path,
        quantization_config=bnb_config if USE_QLORA else None,
        _attn_implementation="flash_attention_2",
        device_map="auto"
    )
    model.add_adapter(lora_config)
    model.enable_adapters()
    model = prepare_model_for_kbit_training(model)
    model = get_peft_model(model, lora_config)
    print(model.get_nb_trainable_parameters())
else:
    model = AutoModelForVision2Seq.from_pretrained(
        model_path,
        torch_dtype=torch.bfloat16,
        _attn_implementation="flash_attention_2",
    ).to('cuda')

    # if you'd like to only fine-tune LLM
    for param in model.model.vision_model.parameters():
        param.requires_grad = False



## Loading Dataset
dataset_dict_path = 'path/to/idefics3-dataset.json'

with open(dataset_dict_path, 'r') as file:
    dataset_dict = json.load(file)


ds = load_dataset("json", data_files=dataset_dict_path)

# Splitting dataset into train and test
split_ds = ds['train'].train_test_split(test_size=0.1)
split_ds = split_ds.cast_column("image", Image())



## Creating Data Collator Class
class MyDataCollator:
    def __init__(self, processor):
        self.processor = processor
        # Extract the image token ID from the processor's tokenizer
        self.image_token_id = processor.tokenizer.additional_special_tokens_ids[
            processor.tokenizer.additional_special_tokens.index("<image>")
        ]

    def __call__(self, examples):
        texts = []
        images = []

        for example in examples:
            image = example["image"]
            question = example["question"]
            answer = example["answer"]

            # Prepare messages for user and assistant
            messages = [
                {
                    "role": "user",
                    "content": [
                        {"type": "text", "text": "You are given a image, your task is to extract out the desired text from the image as per the examples and classify the label of the image as 'chassis', 'vinplate', or 'other' based on the whole image."},
                        {"type": "image"},
                        {"type": "text", "text": question}
                    ]
                },
                {
                    "role": "assistant",
                    "content": [
                        {"type": "text", "text": answer}
                    ]
                }
            ]

            # Apply the chat template to format the messages
            text = self.processor.apply_chat_template(messages, add_generation_prompt=False)
            texts.append(text.strip())

            # Append image directly to the images list (no need to wrap in a list)
            images.append(image)

        # Process the texts and images into a batch
        batch = self.processor(text=texts, images=images, return_tensors="pt", padding=True)

        # Clone the input_ids from the batch and set padding/image tokens to -100
        labels = batch["input_ids"].clone()
        labels[labels == self.processor.tokenizer.pad_token_id] = -100
        labels[labels == self.image_token_id] = -100

        # Add the labels to the batch
        batch["labels"] = labels

        return batch

data_collator = MyDataCollator(processor)
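# Optional sanity check (my addition, not part of the original script): build one
# batch from the first two training samples and confirm the shapes and label masking.
sample_batch = data_collator([split_ds["train"][0], split_ds["train"][1]])
print(sample_batch["input_ids"].shape, sample_batch["pixel_values"].shape)
print((sample_batch["labels"] == -100).float().mean())  # fraction of masked (pad/image) tokens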



## Defining Training Arguments
from transformers import TrainingArguments, Trainer

training_args = TrainingArguments(
    num_train_epochs=25,
    per_device_train_batch_size=6,
    gradient_accumulation_steps=8,
    warmup_steps=50,
    learning_rate=1e-4,
    weight_decay=0.01,
    logging_steps=250,
    save_strategy="steps",
    save_steps=250*5,
    save_total_limit=1,
    optim="adamw_hf",  # for an 8-bit optimizer, pick paged_adamw_8bit
    # evaluation_strategy="epoch",
    bf16=True,
    output_dir="path/to/idefics3_8b_qlora",
    remove_unused_columns=False
)


## Load trainer and train the model
trainer = Trainer(
    model=model,
    args=training_args,
    data_collator=data_collator,
    train_dataset=split_ds['train'],
    eval_dataset=split_ds['test'],
)

trainer.train()
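Once training finishes, it is also worth saving the final adapter explicitly (a one-liner; since the Trainer wraps a PEFT model, this should write just the adapter files into the given directory):

trainer.save_model("path/to/idefics3_8b_qlora")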

Congratulations, you have successfully fine-tuned Idefics3–8B.

If you check the model save directory, you will find the adapter weights and configuration files.

I copied some of the JSON files from the original model directory, such as chat_template.json, tokenizer.json, etc., into the adapter directory; these are needed when loading the model for inference.
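Alternatively, instead of copying the files by hand, you can save the processor into the adapter directory right after training. A minimal sketch, assuming the processor and output_dir used above:

# Writes preprocessor_config.json, tokenizer.json, chat_template.json, etc.
# next to the adapter weights so the folder can be loaded directly for inference.
processor.save_pretrained("path/to/idefics3_8b_qlora")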

Inference of custom Idefics3–8B QLoRA model

Finally, we have reached the stage where we can test our custom Idefics3–8B model.

Here is the script:

### Load required libraries
import requests
import torch
from PIL import Image
from io import BytesIO
from peft import LoraConfig, prepare_model_for_kbit_training, get_peft_model
from transformers import AutoProcessor, AutoModelForVision2Seq, BitsAndBytesConfig
from transformers.image_utils import load_image
import json
import re


device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

model_path = 'path/to/idefics3_8b_qlora'

## For QLoRA Inferencing
USE_QLORA = True

## Load Processor
processor = AutoProcessor.from_pretrained(
    model_path,
    do_image_splitting=False,
    local_files_only=True
)

## Load LoRA Configuration
lora_config = LoraConfig(
    r=8,
    lora_alpha=8,
    lora_dropout=0,
    target_modules=['down_proj','o_proj','k_proj','q_proj','gate_proj','up_proj','v_proj'],
    use_dora=False if USE_QLORA else True,
    init_lora_weights="gaussian"
)

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16
)

## Load custom Idefics3-8b-QLoRA model
model = AutoModelForVision2Seq.from_pretrained(
    model_path,
    quantization_config=bnb_config,
    _attn_implementation="flash_attention_2",
    device_map="auto"
)
# device_map="auto" already places the quantized model on the GPU;
# calling .to() on a bitsandbytes-quantized model is not supported.


## Load Image
image = load_image("/home/bhavya/Desktop/bhavya/llm/dataset/test/vin-500/test/Vin_2024-01-09_10-09-50.jpg")
question = "Identify the Vehicle Sr No, Engine No, Model and image_label from this image."
messages = [
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "You are given a image, your task is to extract out the desired text from the image as per the examples and classify the label of the image as 'chassis', 'vinplate', or 'other' based on the whole image."},
            {"type": "image"},
            {"type": "text", "text": question}
        ]
    }
]


## Helper function to parse the model output
def parse_response(text):
    match = re.search(r"Assistant:\s*(\{.*\})", text, re.DOTALL)
    if match:
        json_str = match.group(1)
        try:
            # Parse the JSON string into a Python dictionary
            return json.loads(json_str)
        except json.JSONDecodeError:
            raise ValueError("Failed to parse the extracted JSON.")
    else:
        raise ValueError("No valid Assistant JSON response found in the input text.")



prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(text=prompt, images=[image], return_tensors="pt")
inputs = {k: v.to(device) for k, v in inputs.items()}


# Generate Output
generated_ids = model.generate(**inputs, max_new_tokens=500)
generated_texts = processor.batch_decode(generated_ids, skip_special_tokens=True)

final_response = parse_response(generated_texts[0])

## print(final_response)
## {'Vehicle Sr No': 'MA1TA4WR2R2A13785', 'Engine No': 'WRR4A11146', 'Model': 'THAR S11 mHawk-140 4WD 7S', 'image_label': 'vinplate'}
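If you want to run the model over a whole folder of test images rather than a single file, a simple loop along these lines works (the folder path is a placeholder and error handling is minimal):

import glob

results = {}
for path in glob.glob("path/to/test_images/*.jpg"):
    image = load_image(path)
    prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
    inputs = processor(text=prompt, images=[image], return_tensors="pt")
    inputs = {k: v.to(device) for k, v in inputs.items()}
    generated_ids = model.generate(**inputs, max_new_tokens=500)
    text = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]
    try:
        results[path] = parse_response(text)
    except ValueError:
        results[path] = None  # keep going if the model returns malformed JSON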

Congratulations, you have successfully fine-tuned the Idefics3–8B model and run inference on your custom dataset.

Merging and Quantizing the QLoRA weights with original weights

from transformers import AutoModelForVision2Seq, AutoProcessor, BitsAndBytesConfig
from peft import PeftModel
import torch
from peft import LoraConfig, prepare_model_for_kbit_training, get_peft_model

# Paths to the base model and QLoRA adapter
BASE_MODEL_PATH = "path/to/original_weights/idefics3-8b"
ADAPTER_PATH = "path/to/adapter_weights/idefics3_8b_qlora"
OUTPUT_PATH = "path/to/quantized_weights/idefics3-8b-4bit-gptq"

USE_QLORA = True

lora_config = LoraConfig(
    r=8,
    lora_alpha=8,
    lora_dropout=0,
    target_modules=['down_proj','o_proj','k_proj','q_proj','gate_proj','up_proj','v_proj'],
    use_dora=False if USE_QLORA else True,
    init_lora_weights="gaussian"
)

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16
)

# Load the base model with the specified configuration
print("Loading base model with flash attention...")
base_model = AutoModelForVision2Seq.from_pretrained(
    BASE_MODEL_PATH,
    quantization_config=bnb_config,
    _attn_implementation="flash_attention_2",
    device_map="auto"  # Automatically map layers to available devices
)

# Load the QLoRA adapter
print("Loading adapter weights...")
adapter_model = PeftModel.from_pretrained(base_model, ADAPTER_PATH)

# Merge the adapter weights into the base model
print("Merging adapter weights into the base model...")
merged_model = adapter_model.merge_and_unload()

# Save the merged model before reloading it with quantization
print("Saving merged model...")
merged_model.save_pretrained("merged_model_path1")

# Reload the merged model with quantization
print("Loading merged model with quantization...")
quantized_model = AutoModelForVision2Seq.from_pretrained(
    "merged_model_path1",
    quantization_config=bnb_config,
    _attn_implementation="flash_attention_2",
    device_map="auto"
)

# Save the final quantized model
print("Saving merged and quantized model...")
quantized_model.save_pretrained(OUTPUT_PATH)
print("Model saved to:", OUTPUT_PATH)
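Since the next section loads the processor from the same directory, it also helps to save the processor (tokenizer, chat template, image-processor config) into OUTPUT_PATH so the quantized folder is self-contained. A minimal sketch, reusing the base model's processor:

processor = AutoProcessor.from_pretrained(BASE_MODEL_PATH, do_image_splitting=False)
processor.save_pretrained(OUTPUT_PATH)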

Loading the merged quantized weights for inferencing

## Loading the quantized model
model_path = OUTPUT_PATH  # the merged and quantized weights saved above

processor = AutoProcessor.from_pretrained(
    model_path,
    do_image_splitting=False,
    local_files_only=True
)

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16
)

model = AutoModelForVision2Seq.from_pretrained(
    model_path,
    quantization_config=bnb_config,
    _attn_implementation="flash_attention_2",
    device_map="auto"
)
# device_map="auto" already places the quantized model on the GPU;
# calling .to() on a bitsandbytes-quantized model is not supported.

image = load_image("path/to/image/Vin_2024-01-09_10-09-50.jpg")
question = "Identify the Vehicle Sr No, Engine No, Model and image_label from this image."
messages = [
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "You are given a image, your task is to extract out the desired text from the image as per the examples and classify the label of the image as 'chassis', 'vinplate', or 'other' based on the whole image."},
            {"type": "image"},
            {"type": "text", "text": question}
        ]
    }
]

# Helper function to parse the output
def parse_response(text):
    match = re.search(r"Assistant:\s*(\{.*\})", text, re.DOTALL)
    if match:
        json_str = match.group(1)
        try:
            # Parse the JSON string into a Python dictionary
            return json.loads(json_str)
        except json.JSONDecodeError:
            raise ValueError("Failed to parse the extracted JSON.")
    else:
        raise ValueError("No valid Assistant JSON response found in the input text.")


prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(text=prompt, images=[image], return_tensors="pt")
inputs = {k: v.to(device) for k, v in inputs.items()}


# Generate
generated_ids = model.generate(**inputs, max_new_tokens=500)
generated_texts = processor.batch_decode(generated_ids, skip_special_tokens=True)

final_response = parse_response(generated_texts[0])

'''
print(final_response)
print(type(final_response))
{'Vehicle Sr No': 'MA1UJ4YK2R2A21300', 'Engine No': 'YKR4A37696', 'Model': 'THAR LX D MT 4WD 4S HT', 'image_label': 'vehicle-details'}
<class 'dict'>
'''

Note: My merged and quantized weights take around 6.7 GB of VRAM during inference, compared to roughly 26 GB for the original Idefics3–8B model.
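If you want to verify the memory numbers on your own hardware, a quick check with PyTorch's memory stats (assuming a single-GPU setup; the exact figure will vary with image size and max_new_tokens) looks like this:

import torch

# Reset the peak-memory counter, run one generation, then read the peak usage.
torch.cuda.reset_peak_memory_stats()
_ = model.generate(**inputs, max_new_tokens=500)
print(f"Peak VRAM: {torch.cuda.max_memory_allocated() / 1024**3:.1f} GiB")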

Coming Next:

  • Fine-tuning and quantization of PaliGemma 2 (3B, 10B), Phi3 Vision 4b, Llama 3.2 Vision, GOT-OCR 2.0
  • Comprehensive comparison of OCR results of Qwen2-VL-7b, Idefics3–8b, PaliGemma 2, Phi3 Vision 4B, Llama 3.2 Vision, and GOT-OCR 2.0 on my custom dataset

Happy to connect with you on LinkedIn: https://www.linkedin.com/in/bhavyajoshi809

!! Kudos

#OCR

#mLLM

#Idefics3

#Huggingface

#ML #DL #AI #GenAI
