For LLM Fine-tuning

You can prepare a dataset for fine-tuning a Large Language Model (LLM) using any of the following supported methods:

  1. Augment existing Datasets
  2. Synthesize Instruction Datasets
  3. Upload your Custom Datasets
  4. Use a HuggingFace Dataset

Below are detailed guides for each supported method:

1. Augment your existing Datasets

The Data Augmentation Service enables users to expand their datasets by generating additional data rows based on existing data or creating a new preference dataset. This service is particularly useful for model fine-tuning, leading to better performance by introducing more varied data.

You can specify details about the data to be augmented, including source type and data split, and choose between two tasks: generating evolved instructions or generating preference datasets.


Due to OpenAI token-generation rate limits, the service is currently available only for models with a limit above 400,000 TPM (tokens per minute), which are accessible through Tier 2 and above accounts.

The following steps are involved in creating an augmented dataset:

  1. First, pick the dataset you want to augment, then create an OpenAI API key and a MonsterAPI access token:
hf_dataset_id = '<hf_dataset>'
dataset_split_to_use = 'test'
OPENAI_KEY = '<openai_api_key>'
MONSTERAPI_KEY = '<monsterapi_access_token>'
  2. Load the dataset and visualize it.
import datasets
import pandas as pd

dataset = datasets.load_dataset(hf_dataset_id)
df = pd.DataFrame(dataset[dataset_split_to_use])

By visualizing the dataset, we can see which column contains the prompt.

  3. Post a request to the API to augment the dataset:
import requests

body = {
    "data_config": {
        "data_path": hf_dataset_id,
        "data_subset": None,
        "prompt_column_name": "prompt",
        "data_source_type": "hub_link",
        "split": dataset_split_to_use
    },
    "task": "evol_instruct",
    "generate_model1_name": "gpt-3.5-turbo",
    "generate_model2_name": "gpt-3.5-turbo",
    "judge_model_name": "gpt-3.5-turbo",
    "num_evolutions": 4,
    "openai_api_key": OPENAI_KEY
}

headers = {'Authorization': f'Bearer {MONSTERAPI_KEY}'}
response = requests.post('', json=body, headers=headers)

if response.status_code != 200:
    raise ValueError('Failed to send request', response.json())

process_id = response.json()['process_id']

  4. Check the status of the job by running the following cell:
import time

# Check the status of the process
def check_status(process_id):
    response = requests.get(f"{process_id}", headers=headers)
    return response.json()['status']

# Wait for the status to show 'COMPLETED'
status = check_status(process_id)
while status != 'COMPLETED':
    if status == 'FAILED':
        print('Process failed, try again!')
        raise ValueError('Process failed')
    print(f"Process status: {status}")
    time.sleep(10)  # pause between polls instead of querying in a tight loop
    status = check_status(process_id)

# Get the results
response = requests.get(f"{process_id}", headers=headers)
results = response.json()['result']
output_path = results['output'][0]
print('Output path:', output_path)

  5. Once the process is completed, download the augmented dataset as follows:
output_ds = datasets.load_dataset('csv', data_files=output_path)
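
The status-polling step above can also be wrapped in a small reusable helper that gives up after a fixed time. The following is a minimal sketch, not part of the MonsterAPI client: the function name wait_for_completion, the poll_interval and timeout defaults, and the injectable sleep argument are all assumptions made for illustration. Only the 'COMPLETED' and 'FAILED' status values come from the steps above.

```python
import time


def wait_for_completion(get_status, poll_interval=5.0, timeout=600.0, sleep=time.sleep):
    """Poll get_status() until it returns 'COMPLETED'.

    get_status is any zero-argument callable that returns the job status
    string, for example: lambda: check_status(process_id).
    poll_interval and timeout are in seconds (illustrative defaults).
    """
    waited = 0.0
    while True:
        status = get_status()
        if status == 'COMPLETED':
            return status
        if status == 'FAILED':
            raise ValueError('Process failed')
        if waited >= timeout:
            raise TimeoutError(f'Job still {status!r} after {waited:.0f} seconds')
        sleep(poll_interval)
        waited += poll_interval
```

With the check_status function defined earlier, a call might look like wait_for_completion(lambda: check_status(process_id)). Passing the status lookup as a callable keeps the helper independent of the HTTP details.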


2. Synthesize instruction datasets

Supervised multitask learning, particularly in the post-training phase, has shown promising results in improving model generalization. The Instruction Pre-Training framework builds on this idea by integrating instruction-response pairs into language model pre-training.

With the Synthesizer API, users can effortlessly generate their own instruction-response datasets. This service leverages the instruction synthesizer model to create datasets suitable for both instruction pre-training and instruction fine-tuning, simplifying the process significantly. This no-code solution spares users from the laborious tasks of data scraping and running a language model locally. Additionally, it offers substantial cost savings, generating labeled datasets at a fraction of the cost compared to models like GPT and Claude.

To generate an instruction-response dataset, all you need is a corpus in Hugging Face dataset format. Make sure your corpus is split into manageable chunks stored in a column named 'text'.
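
As an illustration of the chunking requirement, here is a minimal sketch that splits raw text into rows with a 'text' key. The function name chunk_text and the 2000-character limit are assumptions chosen for the example; the source does not state what chunk size counts as manageable.

```python
def chunk_text(text, max_chars=2000):
    """Split text into rows of at most max_chars characters each.

    Paragraphs (separated by blank lines) are packed together while they
    fit; a single paragraph longer than max_chars is hard-split.
    """
    chunks, current = [], ''
    for paragraph in text.split('\n\n'):
        paragraph = paragraph.strip()
        if not paragraph:
            continue
        # Hard-split paragraphs that exceed the limit on their own.
        while len(paragraph) > max_chars:
            if current:
                chunks.append(current)
                current = ''
            chunks.append(paragraph[:max_chars])
            paragraph = paragraph[max_chars:]
        candidate = f'{current}\n\n{paragraph}' if current else paragraph
        if len(candidate) <= max_chars:
            current = candidate
        else:
            chunks.append(current)
            current = paragraph
    if current:
        chunks.append(current)
    # One row per chunk, using the column name the synthesizer expects.
    return [{'text': chunk} for chunk in chunks]
```

The resulting list of rows can then be converted to a Hugging Face dataset, for instance with datasets.Dataset.from_list(rows), and pushed to the Hub.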

You can call the instruction synthesizer with the following steps:

  1. Set url to the API's base URL as follows:
url = ""
  2. Once this is set up, you can post a request as follows:
import requests

payload = {
    "model_name": "instruction-pretrain/instruction-synthesizer",
    "temperature": 0,
    "max_tokens": 400,
    "batch_size": 2,
    "seed": 42,
    "input_dataset_name": "<INPUT DATASET PATH>",  # for example: RaagulQB/quantum-field-theory
    "output_dataset_name": "<OUTPUT DATASET PATH>",  # for example: RaagulQB/quantum-field-theory-instruct
    "hf_token": "<HF TOKEN>"
}
headers = {
    "accept": "application/json",
    "content-type": "application/json",
    "authorization": "Bearer <MONSTER API TOKEN>"
}

response = requests.post(url, json=payload, headers=headers)

  • input_dataset_name: name of the Hugging Face dataset to use as input
  • output_dataset_name: name of the dataset to be uploaded to Hugging Face
  • hf_token: a valid Hugging Face token with write permission
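
Because the payload contains several placeholders, it can help to assemble it in a function that refuses to send unfilled values. The sketch below is an assumption-laden illustration: build_synthesizer_payload is a hypothetical helper, and the placeholder check (rejecting empty values or values containing angle brackets) is a convention of this example, not an API rule. The field names and default values are taken from the request shown above.

```python
def build_synthesizer_payload(input_dataset_name, output_dataset_name, hf_token,
                              max_tokens=400, batch_size=2, seed=42):
    """Build the synthesizer request payload, rejecting unfilled placeholders."""
    required = {
        'input_dataset_name': input_dataset_name,
        'output_dataset_name': output_dataset_name,
        'hf_token': hf_token,
    }
    for name, value in required.items():
        # Values like "<HF TOKEN>" mean the placeholder was never replaced.
        if not value or '<' in value or '>' in value:
            raise ValueError(f'{name} is empty or still a placeholder: {value!r}')
    return {
        'model_name': 'instruction-pretrain/instruction-synthesizer',
        'temperature': 0,
        'max_tokens': max_tokens,
        'batch_size': batch_size,
        'seed': seed,
        **required,
    }
```

The returned dictionary can be passed as json=payload in the request from step 2.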

If you don't have a suitable dataset, you can refer to the following Colab notebook, which explains how to create a dataset from a PDF file: Open In Colab

3. Custom Dataset: Using your own dataset

  1. Select Task Type: Click on the dropdown menu and choose the type of task you are training for, such as text classification, summarization, or question-answering.

  2. Choose Dataset Source: Click on the Choose Dataset dropdown menu. You have two options here: a custom dataset, covered in this section, or a Hugging Face dataset, covered in the next section.

  3. Select Your Dataset: Choose the Custom Dataset you have uploaded in the Datasets section. Make sure your dataset is in one of the supported formats: JSON, JSONL, CSV, or Parquet.

  4. Configure Hyperparameters: Set the appropriate hyperparameters based on your dataset's structure, and then proceed to the next step.

Click 'Next' to finalize your fine-tuning job request. Our FineTuner will then handle the remaining steps with precision and efficiency.
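
Before uploading a custom dataset, a quick local check of the file type can catch unsupported formats early. This is a minimal sketch under the assumption that each supported format maps to its usual file extension; the helper name is_supported_dataset_file is hypothetical, and the platform may validate files differently.

```python
from pathlib import Path

# Extensions assumed to correspond to the supported formats:
# JSON, JSONL, CSV, and Parquet.
SUPPORTED_EXTENSIONS = {'.json', '.jsonl', '.csv', '.parquet'}


def is_supported_dataset_file(filename):
    """Return True if the file extension matches a supported dataset format."""
    return Path(filename).suffix.lower() in SUPPORTED_EXTENSIONS
```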

4. Using Hugging Face Datasets:

  1. Select Task Type: Follow the same step as above to select the task type.

  2. Choose Dataset Source: Set the dataset source to "Hugging Face Datasets."

  3. Select the Dataset: Choose a dataset from the Hugging Face list or provide the path to a specific Hugging Face dataset by selecting "Other" and entering the dataset's path.

  4. Choose the Subset: If the dataset has multiple subsets, select the one you want to use. If no subsets are available, the dataset will default to the full version.

Prompt Configuration

Upon selecting a dataset, you'll find a section labeled 'Prompt Configuration.' This section should be adjusted according to the specifics of your selected dataset.

  • For Custom Datasets: Replace placeholders in the prompt configuration section with the actual column names in your dataset.
  • For HuggingFace Dataset: Replace the placeholders inside the square brackets with the actual column names from your dataset that you wish to use for fine-tuning.


If you select a pre-curated Hugging Face dataset, no prompt configuration is required: it is pre-filled, and you can simply click "Next" to proceed.

For example, if your dataset has columns like prompt, response, and source, you will replace:

  • {replace with instruction column name} with prompt and
  • {replace with response column name} with response.
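
The replacement described above amounts to a simple string substitution. The following sketch shows the idea; the template text and the helper name fill_prompt_template are hypothetical, and the real prompt configuration shown in the UI may be worded differently. Only the two placeholder strings come from the example above.

```python
def fill_prompt_template(template, column_map):
    """Replace '{replace with <role> column name}' placeholders with column names.

    column_map maps a role (such as 'instruction') to a dataset column name.
    """
    filled = template
    for role, column in column_map.items():
        placeholder = f'{{replace with {role} column name}}'
        if placeholder not in filled:
            raise KeyError(f'Placeholder not found in template: {placeholder}')
        filled = filled.replace(placeholder, column)
    # Any placeholder left over means a column name was not supplied.
    if '{replace with' in filled:
        raise ValueError('Template still contains unfilled placeholders')
    return filled


# Hypothetical template, for illustration only.
template = (
    '### Instruction: {replace with instruction column name}\n'
    '### Response: {replace with response column name}'
)
result = fill_prompt_template(template, {'instruction': 'prompt', 'response': 'response'})
```

The source column in the example dataset is simply left unused, since no placeholder refers to it.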

After making these changes, your updated data preparation window looks like this:

And we are done! Click 'Next' to finalize your fine-tuning job request. Our FineTuner will then handle the remaining steps with precision and efficiency.

What’s Next

See steps to launch an LLM fine-tuning job