MonsterAPI Deploy
Serve LLMs as a REST API inference endpoint with one click using Monster Deploy
Introduction
Introducing MonsterAPI's Deploy Service, your gateway to deploying large language models (LLMs) and Docker containers on MonsterAPI's scalable and compliant GPU cloud. Built on the open-source vLLM project, the Deploy service is designed to maximize throughput when serving a wide range of AI models, giving you fast deployment through a throughput- and cost-optimized pipeline.
Quick Start Resources:
📝 Demo Notebooks:
Discover Colab notebooks with Monster Deploy integration on our Projects page 😊🚀!
Features Overview
- Diverse Deployment Options:
  - Deploy open-source LLMs as REST API endpoints.
  - Deploy Docker containers with your choice of Docker images.
  - Deploy finetuned LLMs by simply specifying LoRA adapters.
- Increased Throughput: Built on the vLLM project for higher throughput while serving requests.
- Custom Resource Allocation: Define your custom GPU and RAM configurations.
- Multi-GPU Support: Allocate up to 4 GPUs to handle large AI models.
Supported deployment methods:
MonsterAPI Deploy currently supports deploying LLMs as REST API endpoints and any custom Docker image as a hosted container on our low-cost, scalable, and secure GPU infrastructure.
The API provides two groups of services:
Deployment Services
- /deploy/llm: Deploy an LLM as a REST API service, with or without a LoRA adapter.
- /deploy/custom_image: Deploy a Docker container with any Docker image from your Docker registry.
- /deploy/sdxl-dreambooth: Deploy an SDXL DreamBooth Gradio app with a finetuned model.
Finetuning Services
- /finetune/llm: Finetune an LLM using LoRA/QLoRA.
- /finetune/speech2text/whisper: Finetune a Whisper speech-to-text model.
- /finetune/text2image/sdxl-dreambooth: Finetune Stable Diffusion models with DreamBooth.
This opens up an array of possibilities such as:
- Quickly get an API endpoint that can start serving text generation requests using models like Llama2 7B, CodeLlama 34B, Falcon 40B for your AI projects.
- Deploy Docker-based applications, such as Automatic1111 for a Stable Diffusion UI, by just specifying a Docker image.
- Finetune models on MonsterAPI's no-code LLM finetuner, then deploy them with their LoRA adapters to quickly get an API endpoint that serves requests from a domain-specific LLM finetuned on your datasets.
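As a rough illustration of the /deploy/llm route described above, the sketch below builds (but does not send) a POST request for a base model with an optional LoRA adapter. The base URL, the bearer-token header, and every JSON field name (`basemodel_path`, `loramodel_path`, `per_gpu_vram`, `gpu_count`) are assumptions made for this example and are not confirmed by this page; check the API Reference for the actual schema.

```python
import json
import urllib.request

# Placeholders: substitute the real host and token from your account.
BASE_URL = "https://api.example.invalid/v1"  # hypothetical base URL
API_KEY = "YOUR_API_KEY"                      # hypothetical credential


def build_llm_deploy_request(base_model, lora_adapter=None, gpu_count=1, vram_gb=24):
    """Build (without sending) a POST request for the /deploy/llm route.

    Every JSON field name here is an illustrative assumption.
    """
    payload = {
        "basemodel_path": base_model,  # assumed field: Hugging Face model path
        "per_gpu_vram": vram_gb,       # assumed field: vRAM per GPU in GB
        "gpu_count": gpu_count,        # assumed field: number of GPUs
    }
    if lora_adapter is not None:
        payload["loramodel_path"] = lora_adapter  # assumed field: LoRA adapter path

    return urllib.request.Request(
        url=f"{BASE_URL}/deploy/llm",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {API_KEY}",  # assumed auth scheme
            "Content-Type": "application/json",
        },
        method="POST",
    )


req = build_llm_deploy_request("mistralai/Mistral-7B-v0.1")
print(req.get_method(), req.full_url)
print(req.data.decode("utf-8"))
# To actually submit the request: urllib.request.urlopen(req)
```

Keeping request construction separate from sending makes it easy to inspect the payload before any GPU resources are requested.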
Resource Configurations
You can choose from a range of GPU vRAM configurations such as:
GPU vRAM (GB) | 8 | 16 | 24 | 48 | 80 |
---|---|---|---|---|---|
Our computing network maintains enough GPU availability to meet the specific demands of your AI projects, giving you the processing capability to tackle even complex tasks.
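To get a feel for which configuration a model might need, the sketch below estimates a serving footprint and picks the smallest total vRAM that fits, using the tiers listed above and the 4-GPU limit from the feature list. The sizing rule (parameters × bytes per parameter, plus a 20% margin for KV cache and activations) is a common rule of thumb and an assumption of this example, not a figure published by the platform. Restricting GPU counts to 1, 2, and 4 is likewise an assumption, made because tensor-parallel sharding usually needs a count that divides the model's attention heads evenly.

```python
VRAM_TIERS_GB = (8, 16, 24, 48, 80)  # per-GPU tiers listed in this section
GPU_COUNTS = (1, 2, 4)               # assumption: power-of-two counts, max 4 GPUs


def estimate_vram_gb(params_billion, bytes_per_param=2.0, overhead=1.2):
    """Rough footprint in GB: weights plus an assumed 20% margin.

    bytes_per_param is 2.0 for fp16/bf16 weights; the overhead factor is
    a rule of thumb, not a platform guarantee.
    """
    return params_billion * bytes_per_param * overhead


def pick_config(params_billion, bytes_per_param=2.0):
    """Return (gpu_count, vram_gb_per_gpu) with the smallest total vRAM
    that fits the estimate, or None if no listed configuration is large enough.
    Ties on total vRAM are broken in favour of fewer GPUs.
    """
    needed = estimate_vram_gb(params_billion, bytes_per_param)
    candidates = [
        (gpus * tier, gpus, tier)
        for gpus in GPU_COUNTS
        for tier in VRAM_TIERS_GB
        if gpus * tier >= needed
    ]
    if not candidates:
        return None
    _, gpus, tier = min(candidates)
    return gpus, tier


for size in (7, 70):
    print(f"{size}B parameters in fp16 ->", pick_config(size))
```

Treat the result as a starting point only: long context windows and large batch sizes increase KV-cache usage well beyond this estimate.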
Model Compatibility Criteria
This section outlines the foundational requirements and benchmarks that models need to meet to be successfully integrated into the Deploy Service platform. Ensuring compatibility guarantees seamless integration and optimal performance when deploying your models.
- Base Model Path: Start with a path to a Hugging Face model. Make sure the model exists on the Hugging Face platform and that you are authorized to access it.
- Vast Model Support: Leveraging vLLM technology, Deploy Service accepts any model supported by vLLM.
Curated List of Compatible Base Models
The following table lists the model architectures currently supported by vLLM, along with examples of popular models that utilize these architectures.
Architecture | Models | Example HuggingFace Models |
---|---|---|
AquilaForCausalLM | Aquila | BAAI/Aquila-7B, BAAI/AquilaChat-7B, etc. |
ArcticForCausalLM | Arctic | Snowflake/snowflake-arctic-base, Snowflake/snowflake-arctic-instruct, etc. |
BaiChuanForCausalLM | Baichuan & Baichuan2 | baichuan-inc/Baichuan2-13B-Chat, baichuan-inc/Baichuan-7B, etc. |
BloomForCausalLM | BLOOM, BLOOMZ, BLOOMChat | bigscience/bloom, bigscience/bloomz, etc. |
ChatGLMModel | ChatGLM | THUDM/chatglm2-6b, THUDM/chatglm3-6b, etc. |
CohereForCausalLM | Command-R | CohereForAI/c4ai-command-r-v01, etc. |
DbrxForCausalLM | DBRX | databricks/dbrx-base, databricks/dbrx-instruct, etc. |
DeciLMForCausalLM | DeciLM | Deci/DeciLM-7B, Deci/DeciLM-7B-instruct, etc. |
FalconForCausalLM | Falcon | tiiuae/falcon-7b, tiiuae/falcon-40b, tiiuae/falcon-rw-7b, etc. |
GemmaForCausalLM | Gemma | google/gemma-2b, google/gemma-7b, etc. |
GPT2LMHeadModel | GPT-2 | gpt2, gpt2-xl, etc. |
GPTBigCodeForCausalLM | StarCoder, SantaCoder, WizardCoder | bigcode/starcoder, bigcode/gpt_bigcode-santacoder, WizardLM/WizardCoder-15B-V1.0, etc. |
GPTJForCausalLM | GPT-J | EleutherAI/gpt-j-6b, nomic-ai/gpt4all-j, etc. |
GPTNeoXForCausalLM | GPT-NeoX, Pythia, OpenAssistant, Dolly V2, StableLM | EleutherAI/gpt-neox-20b, EleutherAI/pythia-12b, etc. |
InternLMForCausalLM | InternLM | internlm/internlm-7b, internlm/internlm-chat-7b, etc. |
InternLM2ForCausalLM | InternLM2 | internlm/internlm2-7b, internlm/internlm2-chat-7b, etc. |
JAISLMHeadModel | Jais | core42/jais-13b, core42/jais-13b-chat, core42/jais-30b-v3, etc. |
LlamaForCausalLM | LLaMA, Llama 2, Meta Llama 3, Meta Llama 3.1, Vicuna, Alpaca, Yi | meta-llama/Meta-Llama-3-8B-Instruct, meta-llama/Meta-Llama-3-70B-Instruct, etc. |
MiniCPMForCausalLM | MiniCPM | openbmb/MiniCPM-2B-sft-bf16, openbmb/MiniCPM-2B-dpo-bf16, etc. |
MistralForCausalLM | Mistral, Mistral-Instruct | mistralai/Mistral-7B-v0.1, mistralai/Mistral-7B-Instruct-v0.1, etc. |
MixtralForCausalLM | Mixtral-8x7B, Mixtral-8x7B-Instruct | mistralai/Mixtral-8x7B-v0.1, mistralai/Mixtral-8x7B-Instruct-v0.1, etc. |
MPTForCausalLM | MPT, MPT-Instruct, MPT-Chat, MPT-StoryWriter | mosaicml/mpt-7b, mosaicml/mpt-7b-storywriter, mosaicml/mpt-30b, etc. |
OLMoForCausalLM | OLMo | allenai/OLMo-1B-hf, allenai/OLMo-7B-hf, etc. |
OPTForCausalLM | OPT, OPT-IML | facebook/opt-66b, facebook/opt-iml-max-30b, etc. |
OrionForCausalLM | Orion | OrionStarAI/Orion-14B-Base, OrionStarAI/Orion-14B-Chat, etc. |
PhiForCausalLM | Phi | microsoft/phi-1_5, microsoft/phi-2, etc. |
Phi3ForCausalLM | Phi-3 | microsoft/Phi-3-mini-4k-instruct, microsoft/Phi-3-mini-128k-instruct, etc. |
Phi3SmallForCausalLM | Phi-3-Small | microsoft/Phi-3-small-8k-instruct, microsoft/Phi-3-small-128k-instruct, etc. |
Phi3VForCausalLM | Phi-3-Vision | microsoft/Phi-3-vision-128k-instruct, etc. |
QwenLMHeadModel | Qwen | Qwen/Qwen-7B, Qwen/Qwen-7B-Chat, etc. |
Qwen2ForCausalLM | Qwen2 | Qwen/Qwen2-beta-7B, Qwen/Qwen2-beta-7B-Chat, etc. |
Qwen2MoeForCausalLM | Qwen2MoE | Qwen/Qwen1.5-MoE-A2.7B, Qwen/Qwen1.5-MoE-A2.7B-Chat, etc. |
StableLmForCausalLM | StableLM | stabilityai/stablelm-3b-4e1t, stabilityai/stablelm-base-alpha-7b-v2, etc. |
Starcoder2ForCausalLM | Starcoder2 | bigcode/starcoder2-3b, bigcode/starcoder2-7b, bigcode/starcoder2-15b, etc. |
XverseForCausalLM | Xverse | xverse/XVERSE-7B-Chat, xverse/XVERSE-13B-Chat, xverse/XVERSE-65B-Chat, etc. |
Note: Our base model list is always expanding. Stay tuned for more integrations!
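One way to pre-check a model against the table above is to read the `architectures` field of its Hugging Face `config.json` and compare it with the listed architecture names. The sketch below does this on an inline example config, so it needs no network access. The set contains only a handful of entries copied from the table, and the helper name is hypothetical, not part of any MonsterAPI or vLLM client.

```python
import json

# A few architecture names copied from the table above (not the full list).
SUPPORTED_ARCHITECTURES = {
    "LlamaForCausalLM",
    "MistralForCausalLM",
    "MixtralForCausalLM",
    "FalconForCausalLM",
    "GemmaForCausalLM",
    "Qwen2ForCausalLM",
    "Phi3ForCausalLM",
}


def matching_architectures(config_json_text):
    """Return the architectures declared in a config.json that appear in
    SUPPORTED_ARCHITECTURES. An empty list means no match was found."""
    config = json.loads(config_json_text)
    declared = config.get("architectures") or []
    return [name for name in declared if name in SUPPORTED_ARCHITECTURES]


# Inline stand-in for the contents of a model's config.json.
example_config = '{"architectures": ["MistralForCausalLM"], "model_type": "mistral"}'
print(matching_architectures(example_config))
```

An empty result does not prove a model is unsupported, since the set above is deliberately partial and the supported list keeps growing.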
Beta Phase & Feedback
The Quick Deploy Service is still in its beta stage. We are keen on refining and enhancing the platform based on your feedback.
Get Beta Access: Sign up here for beta access and get free credits to try out the Deploy service.
Guides & Tutorials
- API Reference: Visit here
- API CURL Usage: MonsterAPI Deploy CURL tutorial
- Python Client Usage CookBooks: MonsterAPI Deploy Service Python Tutorial
- Demo Notebooks: Quick Start Notebooks
Connect & Explore
Your feedback and insights are a cornerstone of our development.
- 📧 Email: [email protected]
- 🤖 Discord: Join our community on discord