Supported models:

  • TinyLlama/TinyLlama-1.1B-Chat-v1.0

  • mistralai/Mistral-7B-Instruct-v0.2

  • microsoft/Phi-3-mini-4k-instruct

  • meta-llama/Meta-Llama-3-8B-Instruct

The request should be a JSON object with the following fields:

  • model: The model to run the input on. It must be one of the supported models listed above.

  • prompt: The prompt to feed to the model. Note: this value is expected to be already formatted with the model's prompt template.

  • messages: Optional[List[dict]]. A list of messages in OpenAI chat format, for example:

    messages = [
        {"role": "user", "content": "What is your favourite condiment?"},
        {"role": "assistant", "content": "Well, I'm quite partial to a good squeeze of fresh lemon juice. It adds just the right amount of zesty flavour to whatever I'm cooking up in the kitchen!"},
        {"role": "user", "content": "Do you have mayonnaise recipes?"},
    ]

    When this input format is used, the model's prompt template is applied automatically.
    Note: this is not supported for microsoft/phi-2.

  • max_tokens: An integer representing the maximum number of tokens to generate in the output.

  • n: An integer giving the number of outputs to generate (also used as the number of beams). [Optional]

  • best_of: An integer controlling how many candidate generations are produced, from which the best are selected. Only relevant when n is greater than 1. If None, the model decides the number internally. [Optional, Default: None]

  • presence_penalty: A float that penalizes new tokens based on their existing presence in the text. This encourages the model to explore new topics and ideas. [Default: 0.0]

  • frequency_penalty: A float that decreases the likelihood of repetition of previously used words. The higher the penalty, the less likely the model is to repeat the same line of text. [Default: 0.0]

  • repetition_penalty: A float that controls the penalty for token repetitions in the output. Values > 1 will penalize and decrease the likelihood of repetition. [Default: 1.0]

  • temperature: A float that controls randomness in the generation. Lower temperatures make the model more deterministic and higher temperatures encourage creativity and diversity. [Default: 1.0]

  • top_p: A float in the range [0,1] controlling nucleus sampling: only the smallest set of most probable tokens whose cumulative probability reaches top_p is considered for generation. A value of 1.0 considers all tokens. [Default: 1.0]

  • top_k: An integer controlling the number of highest probability vocabulary tokens to keep for top-k filtering. If set to -1, no filtering is applied. [Default: -1]

  • min_p: A float giving the minimum probability for a token to be considered, relative to the probability of the most likely token. A value of 0.0 disables this filter. [Default: 0.0]

  • use_beam_search: Boolean indicating whether to use beam search for generation. Beam search might provide better quality outputs at the expense of speed. [Default: False]

  • length_penalty: A float that penalizes or rewards longer sequences. Values < 1 will favor shorter sequences, and values > 1 will favor longer ones. [Default: 1.0]

  • early_stopping: Boolean indicating whether to stop generation early if the end token is predicted. This can make the generation faster and prevent overly long outputs. [Default: False]

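  • stream: Boolean. Set to True to stream tokens from the model. The byte-encoded stream can be parsed using delimiter=b"\0" as the separator. [Default: False]

    # Usage: `response` is assumed to be a requests.Response opened with stream=True.
    import json

    def iter_stream(response):
        # Records are separated by a null byte, so split on b"\0".
        for chunk in response.iter_lines(chunk_size=8192,
                                         decode_unicode=False,
                                         delimiter=b"\0"):
            if chunk:
                data = json.loads(chunk.decode("utf-8"))
                print(data)
                yield data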
    
  • mock_response: Boolean. If True, a mock response is generated. False is not currently supported.
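
As an illustration of the fields above, a request body could be assembled as shown below. This is a minimal sketch: the helper name build_request and the rule that at least one of prompt or messages must be supplied are assumptions made for the example, not behavior confirmed by this reference. The endpoint URL and authentication details are not given here, so the sketch only builds and prints the JSON body.

```python
import json

# Model identifiers copied from the supported-models list above.
SUPPORTED_MODELS = {
    "TinyLlama/TinyLlama-1.1B-Chat-v1.0",
    "mistralai/Mistral-7B-Instruct-v0.2",
    "microsoft/Phi-3-mini-4k-instruct",
    "meta-llama/Meta-Llama-3-8B-Instruct",
}


def build_request(model, max_tokens, prompt=None, messages=None, **sampling):
    """Return a dict suitable for JSON encoding (hypothetical helper)."""
    if model not in SUPPORTED_MODELS:
        raise ValueError(f"unsupported model: {model}")
    # Assumption: the caller supplies a formatted prompt or a message list.
    if prompt is None and messages is None:
        raise ValueError("provide either 'prompt' or 'messages'")
    body = {"model": model, "max_tokens": max_tokens}
    if prompt is not None:
        body["prompt"] = prompt
    if messages is not None:
        body["messages"] = messages
    # Optional sampling fields (temperature, top_p, ...) pass through unchanged.
    body.update(sampling)
    return body


payload = build_request(
    "mistralai/Mistral-7B-Instruct-v0.2",
    max_tokens=128,
    messages=[{"role": "user", "content": "Do you have mayonnaise recipes?"}],
    temperature=0.7,
    top_p=0.9,
)
print(json.dumps(payload, indent=2))
```

Fields that are left out of the call are simply omitted from the body, so the documented defaults would apply on the server side.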

Ensure that your input adheres to these parameters for optimal generation results. The model will process the input and generate text based on the configuration and content provided in 'input_variables'.
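
The streaming format described for the stream field (JSON objects separated by a null byte) can be tried out without a live endpoint. The sketch below buffers incoming byte chunks, splits them on b"\0", and decodes each record. The function name parse_stream and the "text" key in the sample data are made up for illustration; the actual response schema is not specified in this reference.

```python
import json


def parse_stream(chunks, delimiter=b"\0"):
    """Yield JSON objects from an iterable of byte chunks.

    Objects may be split across chunk boundaries, so bytes are buffered
    until a delimiter is seen.
    """
    buffer = b""
    for chunk in chunks:
        buffer += chunk
        while delimiter in buffer:
            record, buffer = buffer.split(delimiter, 1)
            if record:
                yield json.loads(record.decode("utf-8"))
    # Handle a final record that was not followed by a delimiter.
    if buffer.strip():
        yield json.loads(buffer.decode("utf-8"))


# Simulated network chunks; the "text" key is a hypothetical field name.
fake_chunks = [b'{"text": "Hel', b'lo"}\0{"text": " world"}\0']
events = list(parse_stream(fake_chunks))
print(events)
```

Buffering matters because a network read can end in the middle of a record; splitting each chunk in isolation would hand incomplete JSON to the parser.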


The request should be a JSON object with the following fields:

| Parameter | Description | Default Value |
|---|---|---|
| model | The model to run the input on. It must be one of the supported models. | - |
| prompt | The prompt to feed to the model. Note: this value is expected to be already formatted with the model's prompt template. | - |
| messages | Optional[List[dict]]. OpenAI-formatted messages. Example: messages = [{"role": "user", "content": "What is your favourite condiment?"}, {"role": "assistant", "content": "Well, I'm quite partial to a good squeeze of fresh lemon juice. It adds just the right amount of zesty flavour to whatever I'm cooking up in the kitchen!"}, {"role": "user", "content": "Do you have mayonnaise recipes?"}]. When this input format is used, the model's prompt template is applied automatically. Note: this is not supported for microsoft/phi-2. | - |
| max_tokens | An integer representing the maximum number of tokens to generate in the output. | - |
| n | Number of outputs to generate (also used as the number of beams). Optional. | - |
| best_of | Controls the number of candidate generations to produce, from which the best is selected. Optional. | None |
| presence_penalty | A float that penalizes new tokens based on their existing presence in the text. Encourages exploration of new topics and ideas. | 0.0 |
| frequency_penalty | A float that decreases the likelihood of repeating previously used words. The higher the penalty, the less likely repetition is. | 0.0 |
| repetition_penalty | A float that controls the penalty for token repetitions in the output. Values > 1 penalize repetition and make it less likely. | 1.0 |
| temperature | A float that controls randomness in the generation. Lower values are more deterministic; higher values encourage diversity. | 1.0 |
| top_p | A float in the range [0,1] controlling nucleus sampling: only the smallest set of most probable tokens whose cumulative probability reaches top_p is considered. | 1.0 |
| top_k | An integer controlling the number of highest-probability vocabulary tokens to keep for top-k filtering. If set to -1, no filtering is applied. | -1 |
| min_p | A float giving the minimum probability for a token to be considered, relative to the probability of the most likely token. | 0.0 |
| use_beam_search | Boolean indicating whether to use beam search for generation, which might provide better quality outputs at the expense of speed. | False |
| length_penalty | A float that penalizes or rewards longer sequences. Values < 1 favor shorter sequences, and values > 1 favor longer ones. | 1.0 |
| early_stopping | Boolean indicating whether to stop generation early if the end token is predicted. Makes generation faster and prevents overly long outputs. | False |
| stream | Boolean indicating whether to stream the response. | False |
| mock_response | Boolean indicating whether a mock response is generated. Currently, only True is supported. | - |
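
To make the top_k, top_p, and min_p parameters concrete, the sketch below applies each filter to a toy probability distribution. It follows the commonly used definitions of these filters: top-k keeps the k most likely tokens, nucleus sampling keeps the smallest set whose cumulative probability reaches top_p, and min-p drops tokens whose probability is below min_p times the top probability. How the service applies them internally is not documented here and may differ. The function name filter_tokens, the token strings, and the probabilities are invented for the example.

```python
def filter_tokens(probs, top_k=-1, top_p=1.0, min_p=0.0):
    """Return renormalized probabilities after top-k, top-p and min-p filtering."""
    # Sort tokens from most to least probable.
    ranked = sorted(probs.items(), key=lambda item: item[1], reverse=True)

    # top-k: keep only the k most probable tokens (-1 disables the filter).
    if top_k != -1:
        ranked = ranked[:top_k]

    # top-p: keep the smallest prefix whose cumulative probability >= top_p.
    if top_p < 1.0:
        kept, cumulative = [], 0.0
        for token, p in ranked:
            kept.append((token, p))
            cumulative += p
            if cumulative >= top_p:
                break
        ranked = kept

    # min-p: drop tokens below min_p times the highest probability.
    if min_p > 0.0:
        threshold = min_p * ranked[0][1]
        ranked = [(t, p) for t, p in ranked if p >= threshold]

    total = sum(p for _, p in ranked)
    return {t: p / total for t, p in ranked}


toy = {"the": 0.5, "a": 0.25, "an": 0.125, "this": 0.125}
print(filter_tokens(toy))             # defaults: nothing is filtered
print(filter_tokens(toy, top_k=2))    # keeps "the" and "a"
print(filter_tokens(toy, top_p=0.7))  # keeps "the" and "a" (0.5 + 0.25 >= 0.7)
print(filter_tokens(toy, min_p=0.3))  # threshold 0.15 keeps "the" and "a"
```

With the default values (top_k=-1, top_p=1.0, min_p=0.0) every token survives, which matches the defaults listed for these fields.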

