LLM Evaluations
LLM evaluations use the EleutherAI LM Evaluation Harness (the lm_eval engine) to assess your fine-tuned models on various benchmarks. Supported evaluations include:
- MMLU
- GSM8K
- HellaSwag
- ARC
- TruthfulQA
- WinoGrande
Models larger than 8B parameters and context lengths above 8k are not currently supported for LLM Evals. Support will be added shortly.
The following example launches an LLM evaluation job:
import requests

url = "https://api.monsterapi.ai/v1/deploy/evaluation/llm/lm_eval"

# Request body: the model to evaluate and a comma-separated list of tasks
payload = {
    "deployment_name": "YourDeploymentName",
    "basemodel_path": "mistralai/Mistral-7B-v0.1",
    "eval_engine": "lm_eval",
    "task": "gsm8k,hellaswag"
}
headers = {
    "accept": "application/json",
    "content-type": "application/json",
    "authorization": "Bearer YOUR_API_KEY"
}

response = requests.post(url, json=payload, headers=headers)
print(response.text)
Replace "YOUR_API_KEY" with your actual API key, and adjust deployment_name and basemodel_path as needed.
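Because the task field in the request above is a single comma-separated string, it can be convenient to build the payload from a Python list. The following is a minimal sketch that assumes only the field names shown in that request; build_eval_payload is a hypothetical helper written for illustration, not part of any MonsterAPI client.

```python
def build_eval_payload(deployment_name, basemodel_path, tasks, eval_engine="lm_eval"):
    """Build the JSON body for an evaluation request (hypothetical helper)."""
    # Strip whitespace and drop empty entries before joining.
    cleaned = [t.strip() for t in tasks if t.strip()]
    if not cleaned:
        raise ValueError("at least one task is required")
    return {
        "deployment_name": deployment_name,
        "basemodel_path": basemodel_path,
        "eval_engine": eval_engine,
        # The request example passes tasks as one comma-separated string.
        "task": ",".join(cleaned),
    }


payload = build_eval_payload(
    "YourDeploymentName",
    "mistralai/Mistral-7B-v0.1",
    ["gsm8k", "hellaswag"],
)
print(payload["task"])  # prints: gsm8k,hellaswag
```

The resulting dictionary can be passed as the json argument of requests.post, exactly like the hand-written payload.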
The API returns an evaluation job ID, such as xxxx-xxxx-xxxxxxx.
You can query the status of the job with the Job Status API:
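import requests

# Evaluation job ID returned by the launch request (placeholder value)
deployment_id = "xxxx-xxxx-xxxxxxx"
url = f"https://api.monsterapi.ai/v1/deploy/status/{deployment_id}"
headers = {"accept": "application/json"}
response = requests.get(url, headers=headers)
print(response.text)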
Once the evaluation job has completed, you will get a list of eval scores for your provided model.
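Since the job runs asynchronously, a small polling loop can wait for it to finish. The sketch below is illustrative only: the "status" key and the state names ("completed", "failed") are assumptions about the response format, not confirmed API fields, and wait_for_eval is a hypothetical helper. The status call is passed in as a function so the example can run here with simulated responses.

```python
import time


def wait_for_eval(fetch_status, interval_s=30.0, max_polls=120,
                  done_states=("completed", "failed")):
    """Poll fetch_status() until it reports a terminal state.

    fetch_status is any callable that returns the parsed status as a dict.
    ASSUMPTION: the dict has a "status" key; the real field name and the
    state names may differ, so check an actual response first.
    """
    for _ in range(max_polls):
        result = fetch_status()
        state = str(result.get("status", "")).lower()
        if state in done_states:
            return result
        time.sleep(interval_s)
    raise TimeoutError("evaluation job did not finish within the polling limit")


# Simulated status responses so the sketch runs without network access.
fake_responses = iter([
    {"status": "pending"},
    {"status": "running"},
    {"status": "completed"},
])
final = wait_for_eval(lambda: next(fake_responses), interval_s=0)
print(final["status"])  # prints: completed
```

Against the real endpoint, fetch_status could be a function that wraps the requests.get call from the Job Status example and returns response.json().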
For further details, check out our LM Evaluation API documentation.
MonsterAPI currently supports evaluation for all the tasks listed here.
Need Benchmarking Support or the Addition of a Specific Benchmark?
For assistance with specific benchmarking needs or to request the addition of new benchmarks, please reach out to our support team. We are here to help customize evaluations and accommodate your requirements.