LLM Evaluations

Assess model performance on standard benchmark tasks using the Eleuther Evaluation Harness and the lm_eval engine.

LLM evaluations use the Eleuther Evaluation Harness with the lm_eval engine to assess your fine-tuned models on various benchmarks. Supported evaluations include:

  • MMLU
  • GSM8K
  • Hellaswag
  • ARC
  • TruthfulQA
  • Winogrande

❗️

Models larger than 8B parameters and context lengths greater than 8k are not currently supported for LLM Evals. Support will be added shortly.

Below is example code for each step.

Launching an LLM Evaluation Job:

import requests

url = "https://api.monsterapi.ai/v1/deploy/evaluation/llm/lm_eval"

# Evaluation job configuration: the base model to evaluate and a
# comma-separated list of benchmark tasks to run.
payload = {
    "deployment_name": "YourDeploymentName",
    "basemodel_path": "mistralai/Mistral-7B-v0.1",
    "eval_engine": "lm_eval",
    "task": "gsm8k,hellaswag"
}

# Authenticate with your MonsterAPI key as a Bearer token.
headers = {
    "accept": "application/json",
    "content-type": "application/json",
    "authorization": "Bearer YOUR_API_KEY"
}

response = requests.post(url, json=payload, headers=headers)

print(response.text)

Replace "YOUR_API_KEY" with your actual API key, and adjust deployment_name and basemodel_path as needed.

This API returns an eval job ID such as xxxx-xxxx-xxxxxxx.
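As a minimal sketch, you can parse the JSON response from the launch call above to capture the job ID for use in the status call. The exact response schema and field name below are assumptions; inspect the raw response to confirm them.

import requests

# url, payload, and headers are the same as in the launch example above
response = requests.post(url, json=payload, headers=headers)
response.raise_for_status()  # fail fast on HTTP errors

result = response.json()
# NOTE: the key holding the job ID is an assumption; print response.text
# to confirm the exact field name returned by the API.
deployment_id = result.get("deployment_id") or result.get("id")
print(f"Evaluation job launched with ID: {deployment_id}")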

You can query the status of the job with:

Job Status API:

import requests

# Use the eval job ID returned by the launch call above
deployment_id = "xxxx-xxxx-xxxxxxx"

url = f"https://api.monsterapi.ai/v1/deploy/status/{deployment_id}"

headers = {"accept": "application/json"}

response = requests.get(url, headers=headers)

print(response.text)

Once the evaluation job has completed, the status response will include the eval scores for your model.
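Below is a minimal polling sketch that waits for the job to finish before reading the scores. It assumes the status response is JSON with a "status" field and the terminal values shown; these field names and values are assumptions, so check the actual response body for the exact schema.

import time
import requests

deployment_id = "xxxx-xxxx-xxxxxxx"  # eval job ID from the launch call
url = f"https://api.monsterapi.ai/v1/deploy/status/{deployment_id}"
headers = {"accept": "application/json"}

# Poll the status endpoint until the job reaches a terminal state.
# NOTE: the "status" field and its values below are assumptions.
while True:
    response = requests.get(url, headers=headers)
    response.raise_for_status()
    status = response.json()

    if status.get("status") in ("completed", "failed", "terminated"):
        break
    time.sleep(30)  # wait before checking again

print(status)  # the completed response should contain the eval scores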

For further details, check out our LM Evaluation API documentation.

MonsterAPI currently supports evaluation for all the tasks listed here.


Need Benchmarking Support or the Addition of a Specific Benchmark?

For assistance with specific benchmarking needs or to request the addition of new benchmarks, please reach out to our support team. We are here to help customize evaluations and accommodate your requirements.