Supported Models are:
- google/gemma-2-9b-it
- mistralai/Mistral-7B-Instruct-v0.2
- microsoft/Phi-3-mini-4k-instruct
- meta-llama/Meta-Llama-3.1-8B-Instruct
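For orientation, here is a minimal sketch of a prompt-style request in Python. The endpoint URL, the authorization header, and the assumption that a non-streaming call returns a plain JSON body are placeholders, not part of this spec; the individual fields are documented in the list below.

```python
import requests

# Hypothetical values: substitute your own deployment's scoring URL and key.
ENDPOINT_URL = "https://<your-endpoint>/score"
API_KEY = "<your-api-key>"

payload = {
    # Any of the supported models listed above.
    "model": "mistralai/Mistral-7B-Instruct-v0.2",
    # The prompt must already be formatted for the chosen model
    # (here, Mistral's [INST] ... [/INST] instruction format).
    "prompt": "[INST] Do you have mayonnaise recipes? [/INST]",
    "max_tokens": 128,
    "temperature": 0.7,
}

response = requests.post(
    ENDPOINT_URL,
    json=payload,
    headers={"Authorization": f"Bearer {API_KEY}"},
)
response.raise_for_status()
print(response.json())
```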
The request should be a JSON object with the following fields:
- model: The model to run the input on. Must be one of the supported models listed above.
- prompt: A formatted prompt to be fed as input to the model. Note: the value passed here is expected to be an already formatted prompt.
- messages: Optional[List[dict]]. OpenAI-formatted messages, for example:
  messages = [
      {"role": "user", "content": "What is your favourite condiment?"},
      {"role": "assistant", "content": "Well, I'm quite partial to a good squeeze of fresh lemon juice. It adds just the right amount of zesty flavour to whatever I'm cooking up in the kitchen!"},
      {"role": "user", "content": "Do you have mayonnaise recipes?"},
  ]
  When this input format is used, the model's prompt template is applied automatically. Note: this is not supported for microsoft/phi-2.
- max_tokens: An integer representing the maximum number of tokens to generate in the output.
- n: The number of outputs to generate / number of beams to use. [Optional]
- best_of: Controls the number of candidate generations to produce, from which the best is selected. Only relevant when n is greater than 1. If None, the model decides the best number internally. [Optional]
- presence_penalty: A float that penalizes new tokens based on their existing presence in the text so far. This encourages the model to explore new topics and ideas. [Default: 0.0]
- frequency_penalty: A float that decreases the likelihood of repeating previously used words. The higher the penalty, the less likely the model is to repeat the same line of text. [Default: 0.0]
- repetition_penalty: A float that controls the penalty for token repetitions in the output. Values > 1 penalize repetition and decrease its likelihood. [Default: 1.0]
- temperature: A float that controls randomness in the generation. Lower temperatures make the model more deterministic; higher temperatures encourage creativity and diversity. [Default: 1.0]
- top_p: A float in the range [0, 1] controlling nucleus sampling, which keeps only the smallest set of most probable tokens whose cumulative probability reaches p. This ensures that only the most probable tokens are considered for generation while preserving diversity. [Default: 1.0]
- top_k: An integer controlling the number of highest-probability vocabulary tokens to keep for top-k filtering. If set to -1, no filtering is applied. [Default: -1]
- min_p: A float in [0, 1] setting the minimum probability a token must have, relative to the most likely token, to be considered for generation; 0.0 disables this filter. [Default: 0.0]
- use_beam_search: Boolean indicating whether to use beam search for generation. Beam search may provide better-quality outputs at the expense of speed. [Default: False]
- length_penalty: A float that penalizes or rewards longer sequences. Values < 1 favor shorter sequences; values > 1 favor longer ones. [Default: 1.0]
- early_stopping: Boolean indicating whether to stop generation early once the end token is predicted. This can make generation faster and prevent overly long outputs. [Default: False]
- stream: Boolean. Set to True to stream tokens from the model. The byte-encoded stream can be parsed with delimiter=b"\0" as the separator.
  # Usage: for chunk in response.iter_lines(chunk_size=8192, decode_unicode=False, delimiter=b"\0")
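Expanding the usage line above into a runnable sketch: the example below assumes a hypothetical endpoint URL and bearer-token header, and it simply prints each decoded chunk; the exact chunk payload format depends on the deployment.

```python
import requests

# Hypothetical values: substitute your own deployment's scoring URL and key.
ENDPOINT_URL = "https://<your-endpoint>/score"
API_KEY = "<your-api-key>"

payload = {
    "model": "meta-llama/Meta-Llama-3.1-8B-Instruct",
    # OpenAI-formatted messages: the prompt template is applied automatically.
    "messages": [
        {"role": "user", "content": "What is your favourite condiment?"},
    ],
    "max_tokens": 256,
    "temperature": 0.7,
    "top_p": 0.9,
    "stream": True,
}

# stream=True on the HTTP client keeps the connection open so tokens
# can be consumed as the server produces them.
response = requests.post(
    ENDPOINT_URL,
    json=payload,
    headers={"Authorization": f"Bearer {API_KEY}"},
    stream=True,
)
response.raise_for_status()

# Chunks are separated by b"\0", as described for the stream field above.
for chunk in response.iter_lines(chunk_size=8192, decode_unicode=False, delimiter=b"\0"):
    if chunk:
        # The payload format of each chunk depends on the deployment;
        # here we just print the raw decoded bytes.
        print(chunk.decode("utf-8", errors="ignore"))
```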