Monster Deploy LLM Generate

POST https://llm.monsterapi.ai/v1/private-deployment/generate/

Supported models:

google/gemma-2-9b-it
meta-llama/Meta-Llama-3.1-8B-Instruct
monsterapi/Llama3.3_70b

The request should be a JSON object with the following fields:

model: The model to run the input on. It must be one of the supported models listed above.
prompt: Formatted prompt to feed to the model as input. Note: this value is expected to be a prompt that has already been formatted for the model.

messages:
Optional[List[dict]]. OpenAI-formatted messages, for example:

messages = [
{"role": "user", "content": "What is your favourite condiment?"},
{"role": "assistant", "content": "Well, I'm quite partial to a good squeeze of fresh lemon juice. It adds just the right amount of zesty flavour to whatever I'm cooking up in the kitchen!"},
{"role": "user", "content": "Do you have mayonnaise recipes?"}]

When this input format is used, the model's prompt template is applied automatically.
Note: this is not supported for microsoft/phi-2.
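The two input styles above can be sketched as a small payload builder. This is a minimal illustration, not the service's own client code: the helper name build_payload is hypothetical, and the check that exactly one of prompt or messages is supplied is an assumption, since the text does not say whether both may be sent together.

```python
def build_payload(model, messages=None, prompt=None, max_tokens=256):
    """Return a request body that uses either `messages` or `prompt`."""
    # Assumption: a request carries exactly one of the two input styles.
    if (messages is None) == (prompt is None):
        raise ValueError("pass exactly one of `messages` or `prompt`")
    payload = {"model": model, "max_tokens": max_tokens}
    if messages is not None:
        # OpenAI-formatted chat messages; the prompt template is auto applied.
        payload["messages"] = messages
    else:
        # Caller is responsible for formatting the prompt for the model.
        payload["prompt"] = prompt
    return payload


chat_payload = build_payload(
    "google/gemma-2-9b-it",
    messages=[{"role": "user", "content": "Do you have mayonnaise recipes?"}],
)
```

With messages the caller passes plain role/content pairs, while with prompt the caller has to supply text that already follows the model's expected format.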

max_tokens: An integer representing the maximum number of tokens to generate in the output.
n: Number of outputs to generate / number of beams to use. [Optional]
best_of: Controls the number of candidate generations to produce from which the best is selected. Only relevant when n is greater than 1. If None, the model will decide the best number internally. [Optional]
presence_penalty: A float that penalizes new tokens based on their existing presence in the text. This encourages the model to explore new topics and ideas. [Default: 0.0]
frequency_penalty: A float that decreases the likelihood of repetition of previously used words. The higher the penalty, the less likely the model is to repeat the same line of text. [Default: 0.0]
repetition_penalty: A float that controls the penalty for token repetitions in the output. Values > 1 will penalize and decrease the likelihood of repetition. [Default: 1.0]
temperature: A float that controls randomness in the generation. Lower temperatures make the model more deterministic and higher temperatures encourage creativity and diversity. [Default: 1.0]
top_p: A float in the range [0,1] controlling nucleus sampling, which truncates the token distribution to the most probable tokens whose cumulative probability reaches top_p. Lower values restrict generation to the most probable tokens; 1.0 considers all tokens. [Default: 1.0]
top_k: An integer controlling the number of highest probability vocabulary tokens to keep for top-k filtering. If set to -1, no filtering is applied. [Default: -1]
min_p: A float in the range [0,1] setting the minimum probability a token must have, relative to the probability of the most likely token, to be considered for generation. Set to 0.0 to disable. [Default: 0.0]
use_beam_search: Boolean indicating whether to use beam search for generation. Beam search might provide better quality outputs at the expense of speed. [Default: False]
length_penalty: A float that penalizes or rewards longer sequences. Values < 1 will favor shorter sequences, and values > 1 will favor longer ones. [Default: 1.0]
early_stopping: Boolean indicating whether to stop generation early if the end token is predicted. This can make the generation faster and prevent overly long outputs. [Default: False]
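To make the three truncation parameters concrete, here is a toy sketch of what top_k, top_p and min_p each remove from a small probability table. It assumes the parameters follow their common definitions in open-source inference engines (in particular, min_p as a fraction of the most likely token's probability). The function names and the four-token distribution are hypothetical, and each filter is applied on its own rather than in the order or with the renormalization a real server would use.

```python
def top_k_filter(probs, k):
    """Keep the k most probable tokens; k == -1 disables the filter."""
    if k == -1:
        return dict(probs)
    ranked = sorted(probs, key=probs.get, reverse=True)
    return {token: probs[token] for token in ranked[:k]}


def top_p_filter(probs, p):
    """Keep the smallest set of top tokens whose cumulative probability reaches p."""
    kept, cumulative = {}, 0.0
    for token in sorted(probs, key=probs.get, reverse=True):
        kept[token] = probs[token]
        cumulative += probs[token]
        if cumulative >= p:
            break
    return kept


def min_p_filter(probs, min_p):
    """Drop tokens whose probability is below min_p times the top probability."""
    threshold = min_p * max(probs.values())
    return {token: q for token, q in probs.items() if q >= threshold}


# Hypothetical next-token distribution used only for illustration.
probs = {"a": 0.5, "b": 0.3, "c": 0.15, "d": 0.05}
kept_by_top_k = top_k_filter(probs, 2)
kept_by_top_p = top_p_filter(probs, 0.7)
kept_by_min_p = min_p_filter(probs, 0.2)
```

In this toy table, top_k=2 and top_p=0.7 both keep only "a" and "b", while min_p=0.2 sets a cutoff of 0.1 and so also keeps "c".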

stream: Boolean. Set to True to stream tokens from the model. The byte-encoded stream can be parsed using delimiter=b"\0" as the separator.

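# Usage:
import json

def stream_chunks(response):
    # response: a requests.Response from a request made with stream=True
    # Records in the byte stream are separated by a NUL byte (b"\0").
    for chunk in response.iter_lines(chunk_size=8192,
                                     decode_unicode=False,
                                     delimiter=b"\0"):
        if chunk:
            data = json.loads(chunk.decode("utf-8"))
            print(data)
            print(100 * '#')
            yield data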
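The framing used by the stream can also be illustrated without a network connection. The sketch below splits in-memory byte chunks on the NUL separator and buffers any record that arrives split across two chunks. The function name parse_nul_delimited and the "text" field in the sample records are hypothetical; the actual response schema is not described here.

```python
import json


def parse_nul_delimited(chunks):
    """Yield JSON objects from byte chunks whose records end with a NUL byte."""
    buffer = b""
    for chunk in chunks:
        buffer += chunk
        # Everything before the last separator is complete; the tail may be partial.
        *complete, buffer = buffer.split(b"\0")
        for record in complete:
            if record:
                yield json.loads(record.decode("utf-8"))
    if buffer:
        # Trailing record that was not followed by a separator.
        yield json.loads(buffer.decode("utf-8"))


# Two hypothetical network chunks; the first record is split across them.
sample_chunks = [b'{"text": "Hel', b'lo"}\0{"text": " world"}\0']
records = list(parse_nul_delimited(sample_chunks))
```

Buffering the incomplete tail matters because a chunk boundary can fall in the middle of a JSON record, and decoding a partial record would fail.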

mock_response: Boolean. If True, a mock response is generated. False is not supported right now.

Ensure that your input adheres to these parameters for optimal generation results. The model will process the input and generate text based on the configuration and content provided in 'input_variables'.
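Putting the fields together, a request to the endpoint might be assembled as follows. This is a sketch using only the Python standard library. The helper name build_request is hypothetical, the bearer-token header is an assumption (authentication is not described in this section), and the parameter values are arbitrary examples rather than recommendations.

```python
import json
import urllib.request

GENERATE_URL = "https://llm.monsterapi.ai/v1/private-deployment/generate/"


def build_request(payload, api_key=None):
    """Build (but do not send) a POST request carrying the JSON payload."""
    headers = {"Content-Type": "application/json"}
    if api_key is not None:
        # Assumption: bearer-token auth; this section does not specify it.
        headers["Authorization"] = f"Bearer {api_key}"
    body = json.dumps(payload).encode("utf-8")
    return urllib.request.Request(
        GENERATE_URL, data=body, headers=headers, method="POST"
    )


payload = {
    "model": "meta-llama/Meta-Llama-3.1-8B-Instruct",
    "messages": [{"role": "user", "content": "Do you have mayonnaise recipes?"}],
    "max_tokens": 256,
    "temperature": 0.7,
    "top_p": 0.9,
    "stream": False,
}
request = build_request(payload)

# Sending is left commented out so the sketch runs without network access:
# with urllib.request.urlopen(request) as response:
#     print(json.loads(response.read()))
```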