{"id":232,"date":"2024-03-30T06:45:31","date_gmt":"2024-03-30T06:45:31","guid":{"rendered":"https:\/\/www.aerolift.ai\/blog\/?p=232"},"modified":"2024-03-30T06:45:31","modified_gmt":"2024-03-30T06:45:31","slug":"deep-dive-into-fine-tuning-the-tiny-llama-model-on-a-custom-dataset","status":"publish","type":"post","link":"https:\/\/www.aerolift.ai\/blog\/deep-dive-into-fine-tuning-the-tiny-llama-model-on-a-custom-dataset\/","title":{"rendered":"Deep Dive into Fine-Tuning the Tiny-Llama Model on a Custom Dataset"},"content":{"rendered":"\n<p>In the rapidly evolving landscape of machine learning, the ability to fine-tune models on custom datasets is a game-changer. It allows for the creation of models that are not only powerful but also tailored to specific domains, enhancing their performance and relevance. This article delves into the intricacies of fine-tuning the Tiny-Llama model on a custom dataset, exploring the entire process from data preparation to model training and highlighting advanced training techniques like PEFT and QLORA.<\/p>\n\n\n\n<p><\/p>\n\n\n\n<h5 class=\"wp-block-heading\"><span class=\"uppercase\"><strong>The Power of Fine-Tuning<\/strong><\/span><\/h5>\n\n\n\n<p>Fine-tuning is a process where a pre-trained model is further trained (i.e., &#8220;fine-tuned&#8221;) on a smaller, domain-specific dataset. This approach leverages the knowledge gained during the initial training phase and adapts it to the specific needs of the new task. The Tiny-Llama model, a compact and efficient variant of the GPT family, is particularly well-suited for fine-tuning due to its balance between performance and resource efficiency.<\/p>\n\n\n\n<p><\/p>\n\n\n\n<h5 class=\"wp-block-heading\"><span class=\"uppercase\"><strong>Data Prepa<\/strong><\/span><span class=\"uppercase\"><strong>r<\/strong><\/span><span class=\"uppercase\"><strong>a<\/strong><\/span><span class=\"uppercase\"><strong>t<\/strong><\/span><span class=\"uppercase\"><strong>ion: The Foundation of Success<\/strong><\/span><\/h5>\n\n\n\n<p>The success of fine-tuning hinges on the quality and relevance of the training data. For this tutorial, we&#8217;ll explore how to prepare a dataset from PDF documents, converting it into a structured format that can be efficiently used for training.<\/p>\n\n\n\n<p><\/p>\n\n\n\n<h6 class=\"wp-block-heading\"><span class=\"uppercase\"><strong>Extracting Text from PDFs<\/strong><\/span><\/h6>\n\n\n\n<p>The first step involves extracting meaningful text from PDF documents. This is crucial for generating questions and answers that can be used to fine-tune the model. We use the\u00a0<code>PyPDF2<\/code>\u00a0library, which provides a straightforward way to read PDF files and extract text.<\/p>\n\n\n\n<p><\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>import PyPDF2\r\n\r\ndef extract_text_from_pdf(file_path):\r\n    pdf_file_obj = open(file_path, 'rb')\r\n    pdf_reader = PyPDF2.PdfReader(pdf_file_obj)\r\n    text = ''\r\n    for page_num in range(len(pdf_reader.pages)):\r\n        page_obj = pdf_reader.pages&#091;page_num]\r\n        text += page_obj.extract_text()\r\n    pdf_file_obj.close()\r\n    return text\r\n<\/code><\/pre>\n\n\n\n<p><\/p>\n\n\n\n<h6 class=\"wp-block-heading\">G<span class=\"uppercase\"><strong>enerating Questions and Answers<\/strong><\/span><\/h6>\n\n\n\n<p>Once we have the text, we need to generate questions and answers. 
\n\n\n\n<h6 class=\"wp-block-heading\"><span class=\"uppercase\"><strong>Generating Questions and Answers<\/strong><\/span><\/h6>\n\n\n\n<p>Once we have the text, we need to generate questions and answers. This is achieved by sending the text to an API (in this case, the GPT-4 model from OpenAI) and parsing the response to extract the question-and-answer pairs.<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>import openai\r\nimport json\r\nfrom typing import List\r\nfrom tqdm import tqdm\r\n\r\ndef generate_questions_answers(text_chunk):\r\n    messages = &#091;\r\n        {'role': 'system', 'content': 'You are an API that converts bodies of text into question-and-answer pairs in JSON format. '\r\n                                      'Each JSON object contains a single question with a single answer. '\r\n                                      'Only respond with the JSON and no additional text.'},\r\n        {'role': 'user', 'content': 'Text: ' + text_chunk}\r\n    ]\r\n\r\n    response = openai.chat.completions.create(\r\n        model=\"gpt-4\",\r\n        messages=messages,\r\n        max_tokens=2048,\r\n        n=1,\r\n        stop=None,\r\n        temperature=0.7,\r\n    )\r\n\r\n    response_text = response.choices&#091;0].message.content.strip()\r\n    try:\r\n        json_data = json.loads(response_text)\r\n        # Normalize to a list so downstream code can extend() safely\r\n        return json_data if isinstance(json_data, list) else &#091;json_data]\r\n    except json.JSONDecodeError:\r\n        print(\"Error: Response is not valid JSON.\")\r\n        return &#091;]\r\n<\/code><\/pre>\n\n\n\n<h6 class=\"wp-block-heading\"><span class=\"uppercase\"><strong>Processing Text and Writing to JSON<\/strong><\/span><\/h6>\n\n\n\n<p>The final step in data preparation involves processing the text in chunks, generating questions and answers for each chunk, and writing the results to a JSON file. This structured format is essential for the subsequent training steps.<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>def process_text(text: str, chunk_size: int = 4000) -&gt; List&#091;dict]:\r\n    # Split the text into fixed-size chunks and collect question-and-answer pairs for each one\r\n    text_chunks = &#091;text&#091;i:i+chunk_size] for i in range(0, len(text), chunk_size)]\r\n    all_responses = &#091;]\r\n    for chunk in tqdm(text_chunks, desc=\"Processing chunks\", unit=\"chunk\"):\r\n        responses = generate_questions_answers(chunk)\r\n        all_responses.extend(responses)\r\n    return all_responses\r\n\r\ntext = extract_text_from_pdf('Provide Path to your file')\r\nresponses = {\"responses\": process_text(text)}\r\n\r\nwith open('responses.json', 'w') as f:\r\n    json.dump(responses, f, indent=2)\r\n<\/code><\/pre>
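\n\n\n\n<p>For this pipeline to work end to end, each generated entry must carry a <code>question<\/code> and an <code>answer<\/code> key, since the CSV conversion step below filters on exactly those keys. Purely for illustration (the wording of the pair is invented), a single entry in <code>responses.json<\/code> is shaped like this:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code># Hypothetical example of one generated entry (illustrative values only)\r\nexample_entry = {\r\n    \"question\": \"What is fine-tuning?\",\r\n    \"answer\": \"Further training of a pre-trained model on a smaller, domain-specific dataset.\"\r\n}\r\n<\/code><\/pre>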
\n\n\n\n<h6 class=\"wp-block-heading\"><span class=\"uppercase\"><strong>Converting JSON to CSV for Training<\/strong><\/span><\/h6>\n\n\n\n<p>With the data prepared, we convert it into a CSV format that is suitable for training. This involves extracting the questions and answers from the JSON and writing them to a CSV file with columns for the prompt, question, and answer.<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>import json\r\nimport csv\r\n\r\ninstruction = \"You will answer questions about Machine Learning\"\r\n\r\nwith open('responses.json', 'r', encoding='utf-8') as f:\r\n    responses = json.load(f)\r\n\r\nwith open('responses.csv', 'w', newline='', encoding='utf-8') as csvfile:\r\n    fieldnames = &#091;'prompt', 'question', 'answer']\r\n    writer = csv.DictWriter(csvfile, fieldnames=fieldnames)\r\n    writer.writeheader()\r\n    for response in responses&#091;'responses']:\r\n        # Skip malformed entries that are missing either key\r\n        if 'question' in response and 'answer' in response:\r\n            writer.writerow({'prompt': instruction, 'question': response&#091;'question'], 'answer': response&#091;'answer']})\r\n<\/code><\/pre>\n\n\n\n<h6 class=\"wp-block-heading\"><span class=\"uppercase\"><strong>Preparing the Data for Training<\/strong><\/span><\/h6>\n\n\n\n<p>Before training, we need to format the data correctly. This involves reading the CSV file, combining the prompt, question, and answer into a single text column, and then saving the formatted data to a new CSV file.<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>import pandas as pd\r\n\r\ndf = pd.read_csv(\"responses.csv\")\r\ndf = df.fillna(\"\")  # fillna returns a new DataFrame, so reassign it\r\n\r\ntext_col = &#091;]\r\n\r\nfor _, row in df.iterrows():\r\n    prompt = \"Below is an instruction which describes a task paired with a question that provides further context. Write a response that completes the request.\\n\\n\"\r\n    instruction = str(row&#091;\"prompt\"])\r\n    input_query = str(row&#091;\"question\"])\r\n    response = str(row&#091;\"answer\"])\r\n\r\n    # Omit the Input block when there is no question for this row\r\n    if len(input_query.strip()) == 0:\r\n        text = prompt + \"### Instruction:\\n \" + instruction + \"\\n### Response:\\n \" + response\r\n    else:\r\n        text = prompt + \"### Instruction:\\n \" + instruction + \"\\n### Input:\\n \" + input_query + \"\\n### Response:\\n \" + response\r\n\r\n    text_col.append(text)\r\n\r\ndf.loc&#091;:, \"train\"] = text_col\r\ndf.to_csv(\"train.csv\", index=False)\r\n<\/code><\/pre>
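\n\n\n\n<p>To make the template concrete, here is roughly what a single formatted row in the <code>train<\/code> column looks like (the question and answer text are invented for illustration):<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>print(df&#091;\"train\"].iloc&#091;0])\r\n# Below is an instruction which describes a task paired with a question that provides further context. Write a response that completes the request.\r\n#\r\n# ### Instruction:\r\n#  You will answer questions about Machine Learning\r\n# ### Input:\r\n#  What is fine-tuning?\r\n# ### Response:\r\n#  Further training of a pre-trained model on a smaller, domain-specific dataset.\r\n<\/code><\/pre>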
\n\n\n\n<h6 class=\"wp-block-heading\"><span class=\"uppercase\"><strong>Training with AutoTrain Advanced<\/strong><\/span><\/h6>\n\n\n\n<p>For training, we use AutoTrain Advanced, a powerful tool from Hugging Face that simplifies the training process. It supports advanced training techniques like PEFT (Parameter-Efficient Fine-Tuning) and QLoRA (Quantized Low-Rank Adaptation), which can significantly reduce training time and memory usage.<\/p>\n\n\n\n<p><span class=\"uppercase\"><strong>Tiny-Llama<\/strong><\/span><\/p>\n\n\n\n<figure class=\"wp-block-image size-large\"><img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"858\" src=\"https:\/\/www.aerolift.ai\/blog\/wp-content\/uploads\/2024\/03\/TinyLlama_logo-1024x858.png\" alt=\"\" class=\"wp-image-233\" srcset=\"https:\/\/www.aerolift.ai\/blog\/wp-content\/uploads\/2024\/03\/TinyLlama_logo-1024x858.png 1024w, https:\/\/www.aerolift.ai\/blog\/wp-content\/uploads\/2024\/03\/TinyLlama_logo-300x251.png 300w, https:\/\/www.aerolift.ai\/blog\/wp-content\/uploads\/2024\/03\/TinyLlama_logo-768x643.png 768w, https:\/\/www.aerolift.ai\/blog\/wp-content\/uploads\/2024\/03\/TinyLlama_logo-1536x1287.png 1536w, https:\/\/www.aerolift.ai\/blog\/wp-content\/uploads\/2024\/03\/TinyLlama_logo.png 1540w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\" \/><\/figure>\n\n\n\n<p>The Tiny-Llama model is a compact and efficient model built on the Llama architecture, designed for fine-tuning on custom datasets. The <code>Maykeye\/TinyLLama-v0<\/code> checkpoint used here is a proof of concept for recreating the TinyStories-1M model using the Llama architecture. Hosted on Hugging Face, it showcases the potential of fine-tuning and domain-specific modeling in the field of machine learning.<\/p>\n\n\n\n<p>With the formatted <code>train.csv<\/code> in place, fine-tuning comes down to a single AutoTrain command:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>! autotrain llm --train --project-name josh-ops --model Maykeye\/TinyLLama-v0 --data-path \/kaggle\/input\/llamav2 --use-peft --quantization int4 --lr 1e-4 --train-batch-size 15 --epochs 60 --trainer sft\r\n<\/code><\/pre>\n\n\n\n<h6 class=\"wp-block-heading\"><span class=\"uppercase\"><strong>Understanding PEFT and QLoRA<\/strong><\/span><\/h6>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>PEFT (Parameter-Efficient Fine-Tuning)<\/strong>: This technique optimizes the training process by updating only a small fraction of the model&#8217;s parameters during each training step while keeping the rest frozen. It&#8217;s particularly useful for fine-tuning large models on custom datasets.<\/li>\n\n<li><strong>QLoRA (Quantized Low-Rank Adaptation)<\/strong>: QLoRA is a method for reducing the memory footprint of models during training. It loads the model&#8217;s weights in a quantized, low-precision format and trains low-rank adapter matrices on top of the frozen base weights. This can significantly reduce the amount of memory required for training, making it possible to train larger models on hardware with limited resources (a minimal code sketch follows this list).<\/li>\n<\/ul>
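\n\n\n\n<p>AutoTrain applies these techniques internally, but to make them concrete, here is a minimal sketch of what PEFT and QLoRA amount to using the Hugging Face <code>peft<\/code>, <code>transformers<\/code>, and <code>bitsandbytes<\/code> libraries. Note that the hyperparameter values are illustrative and this is not the exact configuration AutoTrain uses:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>import torch\r\nfrom transformers import AutoModelForCausalLM, BitsAndBytesConfig\r\nfrom peft import LoraConfig, get_peft_model\r\n\r\n# QLoRA part: load the frozen base model with 4-bit quantized weights\r\nbnb_config = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_compute_dtype=torch.float16)\r\nmodel = AutoModelForCausalLM.from_pretrained(\"Maykeye\/TinyLLama-v0\", quantization_config=bnb_config)\r\n\r\n# PEFT part: attach small trainable low-rank (LoRA) adapters on top\r\nlora_config = LoraConfig(r=16, lora_alpha=32, lora_dropout=0.05, task_type=\"CAUSAL_LM\")\r\nmodel = get_peft_model(model, lora_config)\r\n\r\n# Only the adapter weights are trainable; the quantized base stays frozen\r\nmodel.print_trainable_parameters()\r\n<\/code><\/pre>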
\n\n\n\n<h6 class=\"wp-block-heading\"><span class=\"uppercase\"><strong>Conclusion<\/strong><\/span><\/h6>\n\n\n\n<p>Fine-tuning the Tiny-Llama model on a custom dataset is a powerful way to achieve domain-specific performance. By leveraging advanced training techniques like PEFT and QLoRA, and using tools like AutoTrain Advanced, we can efficiently train models on custom datasets. This process not only enhances the model&#8217;s performance on specific tasks but also opens up new possibilities for applying machine learning in various domains. As the field of machine learning continues to evolve, the ability to fine-tune models on custom datasets will remain a critical skill, enabling us to create models that are not only powerful but also tailored to our specific needs.<\/p>\n\n\n\n<p><strong>TAKE THE FIRST STEP<\/strong><\/p>\n\n\n\n<p>Ready to experience the power of Aerolift.AI firsthand? Visit the&nbsp;<a rel=\"noreferrer noopener\" href=\"https:\/\/www.aerolift.ai\/\" target=\"_blank\">website<\/a>&nbsp;to explore its features, compare licensing options, and start your free trial. Discover how Aerolift.AI can revolutionize the way you interact with documents and unlock hidden value within your data.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>In the rapidly evolving landscape of machine learning, the ability to fine-tune models on custom datasets is a game-changer. It allows for the creation of models that are not only powerful but also tailored to specific domains, enhancing their performance and relevance. This article delves into the intricacies of fine-tuning the Tiny-Llama model on a [&hellip;]<\/p>\n","protected":false},"author":2,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"_coblocks_attr":"","_coblocks_dimensions":"","_coblocks_responsive_height":"","_coblocks_accordion_ie_support":"","hide_page_title":"","footnotes":""},"categories":[],"tags":[7,9,101,100,10,98],"class_list":["post-232","post","type-post","status-publish","format-standard","hentry","tag-aerolift-ai","tag-generative-ai","tag-large-language-models","tag-llama-2","tag-natural-language-processing","tag-transformer-architecture"],"_links":{"self":[{"href":"https:\/\/www.aerolift.ai\/blog\/wp-json\/wp\/v2\/posts\/232","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.aerolift.ai\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.aerolift.ai\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.aerolift.ai\/blog\/wp-json\/wp\/v2\/users\/2"}],"replies":[{"embeddable":true,"href":"https:\/\/www.aerolift.ai\/blog\/wp-json\/wp\/v2\/comments?post=232"}],"version-history":[{"count":2,"href":"https:\/\/www.aerolift.ai\/blog\/wp-json\/wp\/v2\/posts\/232\/revisions"}],"predecessor-version":[{"id":235,"href":"https:\/\/www.aerolift.ai\/blog\/wp-json\/wp\/v2\/posts\/232\/revisions\/235"}],"wp:attachment":[{"href":"https:\/\/www.aerolift.ai\/blog\/wp-json\/wp\/v2\/media?parent=232"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.aerolift.ai\/blog\/wp-json\/wp\/v2\/categories?post=232"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.aerolift.ai\/blog\/wp-json\/wp\/v2\/tags?post=232"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}