Fine-Tuning | Quantize | Infer — Qwen2-VL mLLM on Custom Data for OCR: Part 2

Bhavya Joshi
7 min read · Nov 2, 2024


LoRA Fine-Tuning Qwen2-VL

This is the second part of my three-part Qwen2-VL fine-tuning and quantization series. If you want to learn how to create a training dataset from custom data, go through the previous blog, where I explain it in detail.

Here comes the interesting part: I chose to fine-tune Qwen2-VL-2B using LoRA, but you can also choose the 7B model depending on your GPU availability. I tried fine-tuning the 7B model with LoRA on an RTX 4090 (24 GB); loading the model and the image batches for fine-tuning took around 20 GB.

Blog Series:

  1. Custom Dataset Preparation for multimodal LLM fine-tuning (Qwen2-VL)
  2. LoRA Fine-Tuning Qwen2-VL (This Blog)
  3. Quantization and Inferencing of custom Qwen2-VL-2B mLLM (GPTQ and AWQ)

Yeah, but why LoRA? Why not fully fine-tune the model?

Well, you can, but do you have enough GPU memory for that?

(More on this in the next part.)

First, let's understand what exactly LoRA is.

Low-Rank Adaptation (LoRA) is a technique used to fine-tune large language models (LLMs) efficiently by only adjusting a small, low-rank subset of the model’s weights, rather than updating all parameters. Instead of modifying the entire set of model weights, LoRA introduces a small set of low-rank matrices that capture task-specific information, which is then combined with the existing model weights during inference.
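To make the idea concrete, here is a minimal, self-contained PyTorch sketch of a LoRA-wrapped linear layer. It is only an illustration: LLaMA-Factory delegates the real work to the peft library, and the hidden size of 1536 is just a stand-in.

import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen base linear layer plus a trainable low-rank update: y = x @ (W + B @ A)^T."""

    def __init__(self, base: nn.Linear, rank: int = 8, alpha: int = 16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False          # original weights stay frozen

        # Low-rank factors: A (rank x in_features), B (out_features x rank)
        self.lora_A = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.lora_B = nn.Parameter(torch.zeros(base.out_features, rank))
        self.scaling = alpha / rank          # standard LoRA scaling

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Base output plus the scaled low-rank correction
        return self.base(x) + (x @ self.lora_A.T @ self.lora_B.T) * self.scaling

layer = LoRALinear(nn.Linear(1536, 1536), rank=8)
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
print(f"Trainable LoRA params: {trainable}")   # 2 * 8 * 1536 = 24,576 vs ~2.36M in the frozen layer

Only the two small factors are trained; the full base layer stays frozen, which is where the memory savings in the list below come from.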

Benefits of LoRA over Full Fine-Tuning:

  1. Lower GPU Memory Requirement: By updating only a small subset of parameters, LoRA drastically reduces the VRAM needed for fine-tuning, making it feasible to use large models on limited GPU resources.
  2. Efficiency: LoRA reduces the computational cost and time required for fine-tuning since fewer parameters are trained, making it faster to adapt models to new tasks.
  3. Parameter Efficiency: LoRA allows for fine-tuning without modifying the entire model, so it maintains the original model’s knowledge while adding task-specific learning, which can be easily stored and reused.
  4. Modularity: Since LoRA matrices are small, multiple task-specific LoRA adapters can be loaded and swapped easily without retraining, adding flexibility to the model.

Training LoRA Qwen2-VL

With that out of the way, let's dive into fine-tuning Qwen2-VL using LoRA.

I have used Linux for fine-tuning, quantization, and inferencing.

First, we have to clone and install the LLaMA-Factory repository.

git clone https://github.com/hiyouga/LLaMA-Factory.git
cd LLaMA-Factory
pip install -e ".[torch,metrics]"

Requirements are mentioned in the repository, but here is my working environment:

OS              Linux
Python          3.10.12
CUDA (nvcc)     12.2
accelerate      1.0.1
bitsandbytes    0.43.1
datasets        2.20.0
llamafactory    0.9.1.dev0
torch           2.4.0+cu121
peft            0.12.0
trl             0.9.6
flash-attn      2.6.3
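Before kicking off training, I like to run a quick sanity check (my own snippet, not part of LLaMA-Factory) to confirm that PyTorch sees the GPU and the optional packages are importable:

import torch

print("torch:", torch.__version__)                 # expect 2.4.0+cu121 in my setup
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("GPU:", torch.cuda.get_device_name(0))

# Optional extras used during fine-tuning; missing ones will raise ImportError
for pkg in ("peft", "trl", "bitsandbytes", "flash_attn", "llamafactory"):
    try:
        __import__(pkg)
        print(f"{pkg}: OK")
    except ImportError as e:
        print(f"{pkg}: NOT INSTALLED ({e})")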

In the previous blog, we prepared a dataset named final-llm-input.json.

First, open data/dataset_info.json, find the mllm_demo entry, and point its file_name at your dataset, leaving any other fields in the entry unchanged:

  "mllm_demo": {
"file_name": "path/to/final-llm-input.json",
"formatting": "sharegpt",
"columns": {
"messages": "messages",
"images": "images"
},
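For reference, a single record of final-llm-input.json in the layout these columns expect looks roughly like the sketch below. The image path, prompt, and response are placeholders from my OCR use case; the previous blog covers the real dataset format in detail.

import json

# One illustrative record; the image path and texts are placeholders, not a required schema.
sample_entry = {
    "messages": [
        {
            "role": "user",
            "content": "<image>Please extract the Vehicle Sr No, Engine No, and Model from this image.",
        },
        {
            "role": "assistant",
            "content": json.dumps({"Vehicle Sr No": "MA1XXXXXXXX", "Engine No": "ENGXXXXXX", "Model": "XYZ"}),
        },
    ],
    "images": ["path/to/vinplate.jpg"],
}

print(json.dumps([sample_entry], indent=2))  # final-llm-input.json is a list of such records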

Then open examples/train_lora/qwen2vl_lora_sft.yaml and modify it as shown below:

### model
model_name_or_path: Qwen/Qwen2-VL-2B-Instruct ## Modify according to the model you want to finetune

### method
stage: sft
do_train: true
finetuning_type: lora
lora_target: all

### dataset
dataset: mllm_demo ## must match the dataset key defined in dataset_info.json
template: qwen2_vl
cutoff_len: 1024
max_samples: 1000
overwrite_cache: true
preprocessing_num_workers: 1

### output
output_dir: saves/qwen2_vl-2b/lora/sft ## modify save path here
logging_steps: 10 ## modify logging steps according to training dataset
save_steps: 500 ## modify weight save steps according to training dataset
plot_loss: true
overwrite_output_dir: true

### train
per_device_train_batch_size: 12 ## modify batch size according to GPU
gradient_accumulation_steps: 8
learning_rate: 1.0e-4 ## modify learning rate
num_train_epochs: 100 ## modify number of epochs
lr_scheduler_type: cosine
warmup_ratio: 0.1
bf16: true
ddp_timeout: 180000000

### eval
val_size: 0.1
per_device_eval_batch_size: 1
eval_strategy: steps
eval_steps: 500

You can change your training hyperparameters and model save path in this file.
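As a rough guide to how these numbers interact, here is how I estimate the effective batch size and optimizer steps per epoch from the values above (single GPU assumed; plug in your own dataset size):

import math

num_samples = 1000          # capped by max_samples in the config; use your actual dataset size
val_size = 0.1
per_device_train_batch_size = 12
gradient_accumulation_steps = 8
num_gpus = 1                # single RTX 4090 in my case

train_samples = int(num_samples * (1 - val_size))
effective_batch = per_device_train_batch_size * gradient_accumulation_steps * num_gpus
steps_per_epoch = math.ceil(train_samples / effective_batch)

print(f"Effective batch size : {effective_batch}")    # 96
print(f"Optimizer steps/epoch: {steps_per_epoch}")    # ~10 with 900 training samples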

Finally, run the command:

llamafactory-cli train examples/train_lora/qwen2vl_lora_sft.yaml

Now wait for the training to be completed.
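While it runs, you can keep an eye on the loss. In my runs LLaMA-Factory wrote a trainer_log.jsonl into the output directory; if your version does the same, a tiny script like this (the log layout is an assumption on my part, so adjust the keys if yours differ) prints the logged losses:

import json
from pathlib import Path

# Assumption: LLaMA-Factory writes trainer_log.jsonl (one JSON object per line) into output_dir.
log_path = Path("saves/qwen2_vl-2b/lora/sft/trainer_log.jsonl")

for line in log_path.read_text().splitlines():
    record = json.loads(line)
    if "loss" in record:  # skip progress-only entries
        print(f"step {record.get('current_steps')}: loss {record['loss']}")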

Note: If you get the following error (which I did):

IOError: image file is truncated

Follow the traceback to the file where PIL is imported (most likely src/llamafactory/data/mm_plugin.py) and add the following lines next to that import:

from PIL import ImageFile
ImageFile.LOAD_TRUNCATED_IMAGES = True
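Alternatively (or in addition), you can scan the training images up front and weed out corrupt files before they ever reach the data loader. A minimal sketch, assuming final-llm-input.json is a list of records with an "images" field as configured above:

import json
from PIL import Image

with open("path/to/final-llm-input.json") as f:
    records = json.load(f)

bad = []
for rec in records:
    for img_path in rec.get("images", []):
        try:
            with Image.open(img_path) as im:
                im.load()                 # forces a full decode; truncated files raise here
        except (OSError, FileNotFoundError) as e:
            bad.append((img_path, str(e)))

print(f"{len(bad)} problematic image(s)")
for path, err in bad:
    print(path, "->", err)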

Merging LoRA adapter weights on top of original weights

I am assuming you have saved the adapter weights at:

saves/qwen2_vl-2b/lora/sft

Once the training is completed, we need to merge the LoRA weights on top of the original weights to get the fine-tuned weights.

For that, open examples/merge_lora/qwen2vl_lora_sft.yaml and modify the following:

adapter_name_or_path: saves/qwen2_vl-2b/lora/sft   # path to your adapter weights

Then run the following command:

llamafactory-cli export examples/merge_lora/qwen2vl_lora_sft.yaml

I am assuming you set the export path (the export_dir field in the same yaml) to saves/qwen2_vl-2b-merged.

Congratulations!!! You have successfully fine-tuned Qwen2-VL on your custom dataset.

Once the merging is completed, the final fine-tuned weight folder should contain the following files.

(Yes, I know the screenshot is from Windows; never mind that.)

Here’s a brief explanation of each file we get after fine-tuning Qwen2-VL with LoRA and merging the adapter.

  1. added_tokens.json: Contains information about any custom tokens that were added to the tokenizer beyond the standard vocabulary. These might be domain-specific terms, rare words, or special symbols that improve the model’s performance in specific tasks.
  2. chat_template.json: Stores the chat template, i.e., the prompt format the processor uses to wrap user and assistant turns when building conversational inputs for inference.
  3. config.json: Holds the model architecture configuration, including settings such as the number of layers, hidden dimensions, attention heads, and other hyperparameters. It’s essential for loading the model structure properly.
  4. generation_config.json: Contains settings specific to text generation, such as temperature, top-k, top-p, and other parameters that influence how the model generates text. Useful for controlling the creativity or randomness of outputs.
  5. merges.txt: This file is part of the tokenizer data. It contains the merge rules for Byte Pair Encoding (BPE), which helps the tokenizer break down words into subword tokens.
  6. model.safetensors.index.json: The index file for the SafeTensors format, which helps in managing and loading the sharded SafeTensor files (model-00001-of-00003.safetensors, etc.). It contains metadata about tensor shapes and locations in each shard.
  7. model-00001-of-00003.safetensors, model-00002-of-00003.safetensors, model-00003-of-00003.safetensors: These are sharded files containing the actual model weights. SafeTensors splits large models into multiple files to make loading and handling easier.
  8. preprocessor_config.json: Holds the image processor settings (e.g., resizing, normalization, and pixel limits) applied to input images before they are passed to the vision encoder. This ensures images are preprocessed consistently at inference time.
  9. special_tokens_map.json: Maps special tokens (in Qwen2-VL's case, tokens such as <|im_start|>, <|im_end|>, and the end-of-text/padding token, rather than BERT-style [CLS]/[SEP]) to the values used by the tokenizer. Special tokens mark conversation turns, padding, and other markers the model relies on.
  10. tokenizer.json: Stores the vocabulary and tokenization rules. This file maps words or subwords to unique IDs used by the model for processing input text.
  11. tokenizer_config.json: Contains additional configuration parameters for the tokenizer, such as type of tokenization, maximum length, and any special pre-tokenization rules.
  12. vocab.json: This is a vocabulary file that lists all the tokens and their corresponding IDs. It’s used by the tokenizer to convert input text into token IDs that the model can understand.

In my case, the final fine-tuned weight folder was 4.12 GB.
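If you want to confirm the size of your own export, a quick one-off check (assuming the merge output path used above) does the trick:

from pathlib import Path

merged_dir = Path("saves/qwen2_vl-2b-merged")
total_bytes = sum(p.stat().st_size for p in merged_dir.rglob("*") if p.is_file())
print(f"Merged checkpoint size: {total_bytes / 1024**3:.2f} GB")  # ~4.12 GB in my case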

Inferencing the fine-tuned Qwen2-VL-LoRA

I ran inference on Linux, but you can try it on Windows as well.

Use the following Python code:

import time
import json
import re

import torch
from transformers import AutoProcessor, AutoModelForVision2Seq
from transformers.image_utils import load_image


def extract_json_from_string(input_string):
    # Use regex to extract the JSON part from the model output
    json_match = re.search(r'({.*})', input_string, re.DOTALL)
    if json_match:
        json_str = json_match.group(1)  # extract the JSON-like part
        try:
            return json.loads(json_str)  # parse the extracted string as JSON
        except json.JSONDecodeError as e:
            print(f"Error decoding JSON: {e}")
            return None
    print("No JSON found in the string.")
    return None


DEVICE = "cuda:0" if torch.cuda.is_available() else "cpu"

imagepath = "path/to/vinplate.jpg"
image = load_image(imagepath)

model_path = "saves/qwen2_vl-2b-merged"
processor = AutoProcessor.from_pretrained(model_path)
model = AutoModelForVision2Seq.from_pretrained(
    model_path, torch_dtype=torch.float16, device_map=DEVICE
)  # device_map already places the model on DEVICE

# Create inputs
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {
                "type": "text",
                "text": '''
Please extract the Vehicle Sr No, Engine No, and Model from this image.
Response only json format nothing else.
Analyze the font and double check for similar letters such as "V":"U", "8":"S":"0", "R":"P".
''',
            },
        ],
    }
]

t1 = time.time()
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(text=prompt, images=[image], return_tensors="pt")
inputs = {k: v.to(DEVICE) for k, v in inputs.items()}

generated_ids = model.generate(**inputs, max_new_tokens=500)
generated_texts = processor.batch_decode(generated_ids, skip_special_tokens=True)
t2 = time.time()

response_json = extract_json_from_string(generated_texts[0])
print(response_json)
print('Time Taken')
print(t2 - t1)
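One optional tweak to the script above: batch_decode returns the prompt together with the answer, and the regex simply fishes the JSON out of it. If you prefer to decode only the newly generated tokens, you can slice them off using the prompt length (this continues from the variables defined above):

# Decode only the tokens generated after the prompt
input_len = inputs["input_ids"].shape[1]
answer_only = processor.batch_decode(generated_ids[:, input_len:], skip_special_tokens=True)[0]
print(answer_only)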

Note: This will require a considerable amount of GPU memory (more about this in the next part), as the model has not yet been quantized. We will quantize the model in the next part of this blog series.

Part 3 — AWQ and GPTQ quantization of custom Qwen2-VL:

https://medium.com/@bhavya.joshi809/fine-tuning-qwen2-vl-mllm-on-custom-data-for-ocr-part-3-quantization-of-custom-qwen2-vl-2b-mllm-2c94577f83a5

Happy to connect with you on LinkedIn: https://www.linkedin.com/in/bhavyajoshi809
