Enter Model Details
- Model name: Provide a name for your model.
- Source: Specify the source from which the model will be fetched. You can choose from:
  - HuggingFace Model Hub – Provide the repository name (e.g., creator/model-slug).
  - AWS S3 – Enter the S3 bucket path.
  - GCP GCS – Enter the Google Cloud Storage bucket path.
  - Public URL – Provide a publicly accessible model download link.
- Path: Enter the path to the model file or directory (e.g., openai/whisper-large-v3-turbo); see the example path formats after this list.
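For reference, these are typical path formats for each source type. The bucket names and URL below are hypothetical placeholders; only the HuggingFace repository is the example used in this guide, and the exact format your storage provider expects may vary slightly.

```python
# Illustrative path formats only – the bucket names and URL are placeholders.
example_model_paths = {
    "HuggingFace Model Hub": "openai/whisper-large-v3-turbo",                  # creator/model-slug
    "AWS S3": "s3://my-models-bucket/whisper-large-v3-turbo/",                 # hypothetical bucket
    "GCP GCS": "gs://my-models-bucket/whisper-large-v3-turbo/",                # hypothetical bucket
    "Public URL": "https://example.com/models/whisper-large-v3-turbo.tar.gz",  # hypothetical URL
}
```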
Cloud credentials (Required for AWS S3 or GCP GCS)
Provide your cloud credentials (Secret) to enable secure access to private storage buckets.
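Before saving the secret, you can optionally sanity-check locally that the credentials can reach the bucket. Below is a minimal sketch using boto3 for AWS S3; the key values and bucket name are placeholders, and a similar check can be done with the google-cloud-storage client for GCS.

```python
# Optional local sanity check (not part of the platform UI): confirm the AWS
# credentials you are about to store as a secret can actually see the bucket.
import boto3

session = boto3.Session(
    aws_access_key_id="AKIA...",       # placeholder – your access key ID
    aws_secret_access_key="...",       # placeholder – your secret access key
)
s3 = session.client("s3")
s3.head_bucket(Bucket="my-models-bucket")  # hypothetical bucket; raises if not accessible
print("Credentials can access the bucket.")
```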

Getting your model path from HuggingFace
- Visit huggingface.co.
- Use the search bar to find the desired model (e.g., “whisper-large”).
- Click on the model you want from the search results (e.g., openai/whisper-large-v3-turbo).
- Copy the model path displayed at the top of the page (e.g., openai/whisper-large-v3-turbo) for use in the Path field.
The model path on HuggingFace follows the format: creator/model-slug.
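If you want to confirm the path programmatically before entering it, the huggingface_hub library can resolve the repository ID. A small sketch, assuming huggingface_hub is installed via pip:

```python
# Optional: verify that the repository ID resolves before using it as the Path.
from huggingface_hub import model_info

info = model_info("openai/whisper-large-v3-turbo")  # the example model from this guide
print(info.id)  # prints the creator/model-slug path to copy into the Path field
```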
Optimizing Infrastructure
- Configure the infrastructure to optimize the model’s performance by selecting the appropriate compute resources and optimization techniques.
Choosing Your Infrastructure
You can choose where to compile and optimize your model based on your setup and preferences:
- BYOC (Bring Your Own Cloud): Select this option if you want to use your own infrastructure and resources for compilation. Choose a cloud account that was previously added under the Cloud Accounts section within Integrations.
- Simplismart Cloud: Select this option to use Simplismart’s managed infrastructure for faster and more efficient optimization.

Configuration
- Select the desired quantization format (FP16, FP8, or AWQ) based on your performance and resource requirements; a rough memory estimate follows this list.
  - FP16: Offers higher precision and accuracy, but requires more GPU memory and compute power.
  - FP8: Provides faster inference and reduced memory usage, with a minor trade-off in numerical precision compared to FP16.
  - AWQ (Activation-aware Weight Quantization): Reduces model size and memory usage with minimal impact on accuracy, making it suitable for resource-constrained environments.
- The optimization, model, and pipeline configurations are auto-filled based on the details provided earlier. You may modify them if required to suit your deployment needs.
- Finalize the model’s configuration by setting any additional parameters or preferences required for deployment.
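To make the quantization trade-off concrete, here is a rough back-of-the-envelope weight-memory estimate per format. This is a sketch only: AWQ typically stores weights in 4 bits, and actual GPU memory usage also includes activations, KV cache, and runtime overhead.

```python
# Rough weight-memory estimate per quantization format; illustrative only.
def approx_weight_memory_gb(num_params_billions: float, bits_per_weight: float) -> float:
    # billions of parameters * bytes per weight = gigabytes of weights
    return num_params_billions * bits_per_weight / 8

for fmt, bits in [("FP16", 16), ("FP8", 8), ("AWQ (4-bit)", 4)]:
    print(f"{fmt}: ~{approx_weight_memory_gb(7, bits):.1f} GB of weights for a 7B-parameter model")
```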
