Dataset Preparation

Dataset preparation is a crucial step in ensuring the successful fine-tuning of models using MonsterAPI. Properly prepared datasets enable accurate model training, leading to better performance and tailored outputs.

MonsterAPI provides several ways through which you can generate or use existing datasets for fine-tuning AI models.

Here's how to use your datasets or generate datasets for LLM Fine-tuning on MonsterAPI:

1. Generate Datasets

  • Dataset Augmentation API: If you have a dataset with a limited number of samples, MonsterAPI's Data Augmentation API can expand it by generating additional data rows from the existing data, or create a new dataset for preference optimization. This is particularly useful for fine-tuning, where more varied data leads to better model performance.

  • Instruction Dataset Synthesis API: You can generate your own instruction-response datasets with minimal effort. This service uses an instruction synthesizer model to create datasets suitable for both instruction pre-training and instruction fine-tuning. As a no-code solution, it spares you the laborious tasks of scraping data and running a language model locally, and it generates labeled datasets at a fraction of the cost of models like GPT and Claude.
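
As an illustrative sketch only, here is how a request body for a data augmentation job might be assembled before sending it to the API. The field names and structure below are assumptions for illustration, not the documented MonsterAPI schema; consult the API reference for the actual request format.

```python
import json

def build_augmentation_payload(rows, target_count):
    """Assemble a hypothetical request body for a data-augmentation job.

    The keys below ("data_source", "num_generations", "task") are
    illustrative placeholders, not MonsterAPI's actual field names.
    """
    return {
        "data_source": rows,               # existing samples to expand
        "num_generations": target_count,   # desired number of new rows
        "task": "augmentation",
    }

seed_rows = [
    {"instruction": "Summarize the text.", "output": "A short summary."},
]
payload = build_augmentation_payload(seed_rows, target_count=100)
body = json.dumps(payload)  # ready to POST to the augmentation endpoint
```

The point of building the payload as a plain dictionary first is that you can validate and version it locally before spending API credits on a generation job.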

2. Using Hugging Face Datasets

MonsterAPI also supports public and private Hugging Face datasets. When creating an LLM or Whisper fine-tuning job, you may specify the path of a dataset, such as tatsu-lab/alpaca, and the fine-tuning service will automatically validate that the dataset has the right splits and is formatted properly. For private datasets, you need to provide a Hugging Face key with read permission.
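
A minimal sketch of the kind of split check the fine-tuning service performs on a Hugging Face dataset path. The required split names here are an assumption for illustration; the service's actual validation rules may differ.

```python
# Assumed requirement for illustration: a "train" split must be present.
REQUIRED_SPLITS = {"train"}

def has_required_splits(split_names):
    """Return True if every required split is present in the dataset."""
    return REQUIRED_SPLITS.issubset(set(split_names))

# tatsu-lab/alpaca, for example, ships a single "train" split
ok = has_required_splits(["train"])           # True
missing = has_required_splits(["validation"])  # False
```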


3. Using a Custom Dataset

You can upload a custom dataset for fine-tuning a model on MonsterAPI. The dataset needs to be in one of the following supported formats:

  • JSON
  • CSV
  • Parquet
  • Zip (collection of images, for image generation models only)

You may upload and manage your datasets in your account's custom datasets portal.
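
Before uploading through the portal, it helps to write your custom dataset in one of the supported formats and confirm it parses cleanly. The sketch below builds a small JSONL file; the column names ("instruction"/"output") are a common convention, not a MonsterAPI requirement, so match whatever schema your fine-tuning job expects.

```python
import json
import os
import tempfile

records = [
    {"instruction": "Translate 'hello' to French.", "output": "bonjour"},
    {"instruction": "What is 2 + 2?", "output": "4"},
]

# Write one JSON object per line (JSONL)
path = os.path.join(tempfile.gettempdir(), "my_dataset.jsonl")
with open(path, "w", encoding="utf-8") as f:
    for rec in records:
        f.write(json.dumps(rec, ensure_ascii=False) + "\n")

# Verify every line parses back cleanly before upload
with open(path, encoding="utf-8") as f:
    parsed = [json.loads(line) for line in f]
```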


Dataset Types and Formats

MonsterAPI supports the following format types for datasets based on the model type:

  1. Text Generation Models (LLMs):

    • Format: Supported formats include JSON, JSONL, CSV, and Parquet.
    • Purpose: Provide diverse and high-quality text data to refine language models effectively.

  2. Image Generation Models (SDXL):

    • Format: Upload datasets as ZIP folders containing a collection of images paired with descriptive text.
    • Purpose: Facilitate image generation by associating images with descriptive text, guiding the model to produce relevant outputs.

  3. Speech Processing Models (Whisper):

    • Format: Use dataset paths from HuggingFace directly in our platform.
    • Purpose: Ensure your audio data includes clear and accurate transcriptions to enhance model performance in transcription and translation tasks.
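
Since the LLM fine-tuning service accepts several text formats, you can convert between them locally before upload. The sketch below converts a CSV dataset to JSONL in memory; in practice you would read from and write to files, and the column names shown are illustrative.

```python
import csv
import io
import json

# Illustrative CSV with hypothetical column names
csv_text = "prompt,completion\nName a color.,blue\nName a number.,seven\n"

# DictReader maps each row to {column: value}, which serializes
# directly to one JSON object per JSONL line
reader = csv.DictReader(io.StringIO(csv_text))
jsonl_text = "\n".join(json.dumps(row) for row in reader)
```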

Validating and Testing

  1. Ensure the dataset format and contents are structured properly.
  2. Test with a sample to confirm compatibility.
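
The two steps above can be sketched as a small local validator: check that each record has the expected non-empty fields, then spot-check the flagged rows. The field names are assumptions for illustration; use whatever schema your fine-tuning job expects.

```python
# Assumed schema for illustration
REQUIRED_FIELDS = ("instruction", "output")

def find_invalid(records):
    """Return indices of records missing a required field or holding an empty value."""
    bad = []
    for i, rec in enumerate(records):
        if any(not str(rec.get(field, "")).strip() for field in REQUIRED_FIELDS):
            bad.append(i)
    return bad

sample = [
    {"instruction": "Define API.", "output": "Application Programming Interface."},
    {"instruction": "Define GPU.", "output": ""},  # empty value -> flagged
]
invalid = find_invalid(sample)  # [1]
```

Running a check like this on a sample before uploading catches schema problems early, instead of discovering them after a fine-tuning job has been submitted.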

By following these guidelines, you can efficiently prepare and upload your datasets, setting the stage for successful model fine-tuning with MonsterAPI. If you have any questions or need further assistance, our support team is here to help.


What’s Next

Prepare Dataset