
Enter Model Details

  • Model name: Provide a name for your model.
  • Source: Specify the source from which the model will be fetched. You can choose from:
    • HuggingFace Model Hub – Provide the repository name (e.g., creator/model-slug).
    • AWS S3 – Enter the S3 bucket path.
    • GCP GCS – Enter the Google Cloud Storage bucket path.
    • Public URL – Provide a publicly accessible model download link.
  • Path: Enter the path to the model file or directory (e.g., openai/whisper-large-v3-turbo).
Cloud credentials (Required for AWS S3 or GCP GCS)
Provide your cloud credentials (Secret) to enable secure access to private storage buckets.
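If you want to sanity-check the credentials before storing them as a Secret, a minimal sketch like the one below can confirm that the key pair (for S3) or service-account file (for GCS) can actually reach the private bucket. The bucket names, key values, and file paths shown are placeholders, not values the platform expects.

```python
# Hypothetical pre-flight check: verify that the credentials you are about to
# store as a Secret can reach the private bucket. All names below are placeholders.
import boto3
from google.cloud import storage

def check_s3(bucket: str, access_key: str, secret_key: str) -> None:
    s3 = boto3.client(
        "s3",
        aws_access_key_id=access_key,
        aws_secret_access_key=secret_key,
    )
    s3.head_bucket(Bucket=bucket)  # raises ClientError if unreachable or forbidden
    print(f"S3 bucket '{bucket}' is reachable with these credentials")

def check_gcs(bucket: str, service_account_json: str) -> None:
    client = storage.Client.from_service_account_json(service_account_json)
    client.get_bucket(bucket)  # raises NotFound/Forbidden on failure
    print(f"GCS bucket '{bucket}' is reachable with these credentials")

if __name__ == "__main__":
    check_s3("my-model-bucket", "AKIA...", "...")          # placeholder credentials
    check_gcs("my-model-bucket", "service-account.json")   # placeholder key file
```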
To find a model path on the HuggingFace Model Hub:
  • Visit huggingface.co.
  • Use the search bar to find the desired model (e.g., “whisper-large”).
  • Click on the model you want from the search results (e.g., openai/whisper-large-v3-turbo).
  • Copy the model path displayed at the top of the page (e.g., openai/whisper-large-v3-turbo) for use in the Path field.
The model path on HuggingFace follows the format: creator/model-slug.
Note: Only instruct-style models are supported in the model compilation step for LLMs. These are typically chat-optimized models and are often identified by the suffix -Instruct in their names (e.g., meta-llama/Llama-3.2-3B-Instruct). Base models such as meta-llama/Llama-3.2-3B (without the -Instruct suffix) are not supported.
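To double-check a repository name before pasting it into the Path field, a small sketch with the huggingface_hub library (installed separately) can confirm that the repo exists and, for LLMs, that the name carries the -Instruct suffix. The repo ids and the helper function below are illustrative only.

```python
# Quick sanity check of a HuggingFace repo id before using it as the model Path.
from huggingface_hub import HfApi

def check_repo(repo_id: str, llm: bool = False) -> None:
    info = HfApi().model_info(repo_id)  # raises if the repo id is wrong or inaccessible
    print(f"Found {info.id}")
    # The compilation step for LLMs expects instruct/chat-tuned checkpoints.
    if llm and "instruct" not in repo_id.lower():
        print(f"Warning: {repo_id} looks like a base model; it may be rejected.")

check_repo("openai/whisper-large-v3-turbo")               # non-LLM example
check_repo("meta-llama/Llama-3.2-3B-Instruct", llm=True)  # instruct-tuned LLM
```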

Optimizing Infrastructure

  • Configure the infrastructure used to optimize the model’s performance, including the compute resources and optimization techniques to apply.
Choosing Your Infrastructure
You can choose where to compile and optimize your model based on your setup and preferences:
  • BYOC (Bring Your Own Cloud): Select this option if you want to use your own infrastructure and resources for compilation. Select the Cloud Account that was previously added under the Cloud Accounts section within Integrations.
  • Simplismart Cloud: Select this option to use Simplismart’s managed infrastructure for faster and more efficient optimization.

Configuration

  • Select the desired quantization format based on your performance and resource requirements: FP16, FP8, or AWQ (a rough memory comparison follows this list).
    • FP16: Offers higher precision and accuracy, but requires more GPU memory and compute power.
    • FP8: Provides faster inference and reduced memory usage, with a minor trade-off in numerical precision compared to FP16.
    • AWQ (Activation-aware Weight Quantization): Reduces model size and memory usage with minimal impact on accuracy, making it suitable for resource-constrained environments.
  • The optimization, model, and pipeline configurations are auto-filled based on the details provided earlier. You may modify them if required to suit your deployment needs.
  • Finalize the model’s configuration by setting any additional parameters or preferences required for deployment.
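As a rough guide to the trade-off between the formats, weight memory scales with bytes per parameter: about 2 bytes for FP16, 1 byte for FP8, and roughly 0.5 bytes for 4-bit AWQ (plus some quantization overhead). The sketch below is back-of-the-envelope arithmetic only; actual GPU memory usage is higher because of activations, KV cache, and runtime buffers, and the 3B parameter count is just an example.

```python
# Back-of-the-envelope weight-memory estimate per quantization format.
# Real memory usage is higher (activations, KV cache, runtime buffers).
BYTES_PER_PARAM = {"FP16": 2.0, "FP8": 1.0, "AWQ (4-bit)": 0.5}

def weight_memory_gb(num_params: float, fmt: str) -> float:
    return num_params * BYTES_PER_PARAM[fmt] / 1e9

params = 3e9  # e.g., a 3B-parameter model such as Llama-3.2-3B-Instruct
for fmt in BYTES_PER_PARAM:
    print(f"{fmt:12s} ~{weight_memory_gb(params, fmt):.1f} GB of weights")
# Prints roughly: FP16 ~6.0 GB, FP8 ~3.0 GB, AWQ (4-bit) ~1.5 GB
```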