FluxPipeline supports multiple image generation pipelines, with and without ControlNets, including text-to-image (txt2img), image-to-image (img2img), and inpainting.

Important Note

Ensure that a volume mount is added to the deployment: all generated images are written to the /data/outputs directory inside the container.
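For a Docker-based deployment, the mount might look like the following. This is a minimal sketch: the host path ./outputs and the image name fluxpipeline are placeholders, not part of the actual deployment.

```shell
# Mount a host directory over /data/outputs so generated images persist
# outside the container. "fluxpipeline" stands in for your deployment image.
docker run -v "$(pwd)/outputs:/data/outputs" fluxpipeline
```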


Model Optimization Configuration

Optimization Settings

To enable optimization, add the following under the optimisations config:

"optimisations": {
    "attention_caching": {
      "type": "auto",
      "enabled": true,
      "extra_params": {
        "threshold": 0.1
      }
    }
}
  • Higher threshold values result in greater speed gains but may degrade image generation accuracy.
  • We recommend a threshold of 0.1, which can provide up to a 40% speed improvement during inference while maintaining reasonable quality.

Pipeline Settings

Configure the pipeline using a config like the one below:

  • Multiple ControlNet models can be added under the controlnets section.
  • Each ControlNet model requires a name and a source, plus authentication details if needed.
{
  "type": "flux",
  "loras": [],
  "lora_repo": {
    "path": "",
    "type": "",
    "secret": {
      "type": ""
    },
    "ownership": ""
  },
  "pipelines": [
    "txt2img"
  ],
  "controlnets": [
    {
      "name": "canny",
      "source": {
        "path": "InstantX/FLUX.1-dev-Controlnet-Canny",
        "type": "hf",
        "secret": {
          "type": "hf",
          "token": ""
        }
      }
    },
    {
      "name": "depth",
      "source": {
        "path": "InstantX/FLUX.1-dev-Controlnet-Depth",
        "type": "hf",
        "secret": {
          "type": "hf",
          "token": ""
        }
      }
    }
  ],
  "model_choice": {
    "flux_type": "flux"
  },
  "custom_pipeline_config": [],
  "custom_pipeline_resources": ""
}

Here are some key pointers for understanding and structuring ControlNet requests:

Understanding ControlNet Parameters

ControlNet Name Convention:

  • The parameters follow a structured pattern, <controlnet-name>_<parameter-name>:
<controlnet-name>_control_image
<controlnet-name>_weightage
  • Example for Canny:
"canny_control_image" → The input image processed with the Canny edge detection model.
"canny_weightage" → Defines the influence of the Canny edge map on the final image generation.
  • Example for Depth:
"depth_control_image" → The input image processed with the Depth estimation model.
"depth_weightage" → Determines how strongly the depth control image impacts the generation.

Extensibility for Multiple ControlNets:

  • This pattern allows easy extension to additional ControlNet models in a structured way.
  • If you add a new ControlNet (e.g., OpenPose), you’d include:
"openpose_control_image": "URL_to_openpose_image",
"openpose_weightage": 0.5

How Weightage Works:

  • Each weightage parameter (canny_weightage, depth_weightage, etc.) determines the degree of influence that specific ControlNet has on the final image.
  • Higher values make the model adhere more strictly to the control image, potentially sacrificing flexibility.
  • Lower values allow more artistic freedom but reduce adherence to structured inputs.

Combining Multiple ControlNets:

  • You can combine multiple ControlNets in a single request to layer different structural constraints.
  • In this example:
    • Canny edge detection helps maintain sharp edges in the image.
    • Depth estimation preserves 3D structural information.
    • By adjusting the weightages, you can balance between these two influences.

Generalized Pattern for Other ControlNets:

"<controlnet-name>_control_image": "<URL_to_controlnet_input>",
"<controlnet-name>_weightage": <float_value>
  • Example with Pose and Normal Map:

    "pose_control_image": "URL_to_pose_estimation_image",
    "pose_weightage": 0.5,
    "normal_control_image": "URL_to_normal_map_image",
    "normal_weightage": 0.3
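The naming pattern above can be captured in a small helper that builds the ControlNet portion of a request payload. A minimal sketch; the function name and the example URLs are illustrative, not part of the API:

```python
def controlnet_params(controls):
    """Build request parameters for one or more ControlNets.

    `controls` maps each ControlNet name (as configured under the
    `controlnets` section) to an (image_url, weightage) pair, and the
    result follows the <controlnet-name>_control_image /
    <controlnet-name>_weightage convention.
    """
    params = {}
    for name, (image_url, weightage) in controls.items():
        params[f"{name}_control_image"] = image_url
        params[f"{name}_weightage"] = weightage
    return params

# Layer Canny and Depth constraints in a single request payload.
payload = controlnet_params({
    "canny": ("https://example.com/canny.jpg", 0.4),
    "depth": ("https://example.com/depth.jpg", 0.4),
})
```

The resulting dict can be merged into any *_controlnet request alongside the prompt and sampler settings.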
    

Supported Pipelines

  1. txt2img - Generates an image from text input.
  2. txt2img_controlnet - Generates an image from text input with controlnet support.
  3. img2img - Generates an image based on an input image and a given prompt.
  4. img2img_controlnet - Generates an image based on an input image and a given prompt with controlnet support.
  5. inpaint - Modifies specific regions of an image based on a mask and a given prompt.
  6. inpaint_controlnet - Modifies specific regions of an image based on a mask and a given prompt with controlnet support.
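The extra inputs each pipeline expects can be summarised as a lookup table, useful for validating a request before sending it. A minimal sketch; the field names come from the example requests, while the helper itself is illustrative:

```python
# Pipeline-specific inputs required on top of the common prompt/size/step
# parameters. ControlNet image/weightage keys are added separately, one
# pair per configured ControlNet (e.g. canny_control_image).
REQUIRED_EXTRA_INPUTS = {
    "txt2img": [],
    "txt2img_controlnet": [],
    "img2img": ["image"],
    "img2img_controlnet": ["image"],
    "inpaint": ["image", "mask_image"],
    "inpaint_controlnet": ["image", "mask_image"],
}

def missing_inputs(request):
    """Return any pipeline-specific fields missing from a request dict."""
    required = REQUIRED_EXTRA_INPUTS[request["model_type"]]
    return [field for field in required if field not in request]
```

For example, an inpaint request without a mask would report `["mask_image"]` as missing.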

Example Requests

txt2img

{
    "prompt": "A girl in city, 25 years old, cool, futuristic <lora:multimodalart/plstps-local-feature:0.3> <lora:XLabs-AI/flux-RealismLora:0.3>",
    "negative_prompt": "canvas frame, (high contrast:1.2), (over saturated:1.2), (glossy:1.1), cartoon, 3d, ((disfigured)), ((bad art)), ((b&w)), blurry, ((bad anatomy)), (((bad proportions))), ((extra limbs)), cloned face, (((disfigured))), extra limbs, (bad anatomy), gross proportions, (malformed limbs), ((missing arms)), ((missing legs)), (((extra arms))), (((extra legs))), mutated hands, (fused fingers), (too many fingers), (((long neck))), Photoshop, video game, ugly, tiling, poorly drawn hands, 3d render",
    "height": 1024,
    "width": 1024,
    "num_images_per_prompt": 4,
    "num_inference_steps": 20,
    "seed": 2064977189,
    "guidance_scale": 4.5,
    "strength": 0.8,
    "scheduler": "EULER-A",
    "model_type": "txt2img"
}

txt2img_controlnet

{
    "prompt": "A girl in city, 25 years old, cool, futuristic <lora:multimodalart/plstps-local-feature:0.3> <lora:XLabs-AI/flux-RealismLora:0.3>",
    "negative_prompt": "canvas frame, (high contrast:1.2), (over saturated:1.2), (glossy:1.1), cartoon, 3d, ((disfigured)), ((bad art)), ((b&w)), blurry, ((bad anatomy)), (((bad proportions))), ((extra limbs)), cloned face, (((disfigured))), extra limbs, (bad anatomy), gross proportions, (malformed limbs), ((missing arms)), ((missing legs)), (((extra arms))), (((extra legs))), mutated hands, (fused fingers), (too many fingers), (((long neck))), Photoshop, video game, ugly, tiling, poorly drawn hands, 3d render",
    "height": 1024,
    "width": 1024,
    "num_images_per_prompt": 4,
    "num_inference_steps": 20,
    "seed": 2064977189,
    "guidance_scale": 4.5,
    "canny_control_image": "https://huggingface.co/InstantX/FLUX.1-dev-Controlnet-Canny/resolve/main/canny.jpg",
    "canny_weightage": 0.4,
    "depth_control_image": "https://huggingface.co/InstantX/FLUX.1-dev-Controlnet-Depth/resolve/main/depth.jpg",
    "depth_weightage": 0.4,
    "strength": 0.8,
    "scheduler": "EULER-A",
    "model_type": "txt2img_controlnet"
}

img2img

{
    "prompt": "cat wizard, gandalf, lord of the rings, detailed, fantasy, cute, adorable, Pixar, Disney, 8k <lora:multimodalart/plstps-local-feature:0.3> <lora:XLabs-AI/flux-RealismLora:0.3>",
    "negative_prompt": "canvas frame, (high contrast:1.2), (over saturated:1.2), (glossy:1.1), cartoon, 3d, ((disfigured)), ((bad art)), ((b&w)), blurry, ((bad anatomy)), (((bad proportions))), ((extra limbs)), cloned face, (((disfigured))), extra limbs, (bad anatomy), gross proportions, (malformed limbs), ((missing arms)), ((missing legs)), (((extra arms))), (((extra legs))), mutated hands, (fused fingers), (too many fingers), (((long neck))), Photoshop, video game, ugly, tiling, poorly drawn hands, 3d render",
    "height": 1024,
    "width": 1024,
    "num_images_per_prompt": 4,
    "num_inference_steps": 20,
    "image": "https://raw.githubusercontent.com/CompVis/stable-diffusion/main/assets/stable-samples/img2img/sketch-mountains-input.jpg",
    "seed": 89395930,
    "guidance_scale": 7.0,
    "strength": 0.5,
    "scheduler": "EULER-A",
    "model_type": "img2img"
}

img2img_controlnet

{
    "prompt": "cat wizard, gandalf, lord of the rings, detailed, fantasy, cute, adorable, Pixar, Disney, 8k <lora:multimodalart/plstps-local-feature:0.3> <lora:XLabs-AI/flux-RealismLora:0.3>",
    "negative_prompt": "canvas frame, (high contrast:1.2), (over saturated:1.2), (glossy:1.1), cartoon, 3d, ((disfigured)), ((bad art)), ((b&w)), blurry, ((bad anatomy)), (((bad proportions))), ((extra limbs)), cloned face, (((disfigured))), extra limbs, (bad anatomy), gross proportions, (malformed limbs), ((missing arms)), ((missing legs)), (((extra arms))), (((extra legs))), mutated hands, (fused fingers), (too many fingers), (((long neck))), Photoshop, video game, ugly, tiling, poorly drawn hands, 3d render",
    "height": 1024,
    "width": 1024,
    "num_images_per_prompt": 4,
    "num_inference_steps": 20,
    "image": "https://raw.githubusercontent.com/CompVis/stable-diffusion/main/assets/stable-samples/img2img/sketch-mountains-input.jpg",
    "canny_control_image": "https://huggingface.co/InstantX/FLUX.1-dev-Controlnet-Canny/resolve/main/canny.jpg",
    "canny_weightage": 0.4,
    "depth_control_image": "https://huggingface.co/InstantX/FLUX.1-dev-Controlnet-Depth/resolve/main/depth.jpg",
    "depth_weightage": 0.4,
    "seed": 89395930,
    "guidance_scale": 7.0,
    "strength": 0.5,
    "scheduler": "EULER-A",
    "model_type": "img2img_controlnet"
}

inpaint

{
    "prompt": "Face of a yellow cat, high resolution, sitting on a park bench <lora:multimodalart/plstps-local-feature:0.3> <lora:XLabs-AI/flux-RealismLora:0.3>",
    "height": 1024,
    "width": 1024,
    "num_images_per_prompt": 4,
    "num_inference_steps": 20,
    "image": "https://raw.githubusercontent.com/CompVis/latent-diffusion/main/data/inpainting_examples/overture-creations-5sI6fQgYIuo.png",
    "mask_image": "https://raw.githubusercontent.com/CompVis/latent-diffusion/main/data/inpainting_examples/overture-creations-5sI6fQgYIuo_mask.png",
    "seed": 89395930,
    "guidance_scale": 7.0,
    "strength": 0.5,
    "scheduler": "EULER-A",
    "clip_skip": 0,
    "use_foocus": true,
    "model_type": "inpaint"
}

inpaint_controlnet

{
    "prompt": "Face of a yellow cat, high resolution, sitting on a park bench <lora:multimodalart/plstps-local-feature:0.3> <lora:XLabs-AI/flux-RealismLora:0.3>",
    "height": 1024,
    "width": 1024,
    "num_images_per_prompt": 4,
    "num_inference_steps": 20,
    "image": "https://raw.githubusercontent.com/CompVis/latent-diffusion/main/data/inpainting_examples/overture-creations-5sI6fQgYIuo.png",
    "mask_image": "https://raw.githubusercontent.com/CompVis/latent-diffusion/main/data/inpainting_examples/overture-creations-5sI6fQgYIuo_mask.png",
    "canny_control_image": "https://huggingface.co/InstantX/FLUX.1-dev-Controlnet-Canny/resolve/main/canny.jpg",
    "canny_weightage": 0.4,
    "depth_control_image": "https://huggingface.co/InstantX/FLUX.1-dev-Controlnet-Depth/resolve/main/depth.jpg",
    "depth_weightage": 0.4,
    "seed": 89395930,
    "guidance_scale": 7.0,
    "strength": 0.5,
    "scheduler": "EULER-A",
    "clip_skip": 0,
    "use_foocus": true,
    "model_type": "inpaint_controlnet"
}

Example Response

{
    "response_id": "afbc439946a44d98bb8062c8b36ec16d",
    "inference_time_taken": 6.336474418640137,
    "lora_time": 1.8092551231384277,
    "total_time_taken": 7.176232099533081,
    "request_id": "6b060ab415d84117b7b6403d622414f5",
    "error": null
}
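Because generated images are written to /data/outputs rather than returned inline, a client typically just checks the error field and records the identifiers and timings. A minimal sketch; the helper name is illustrative:

```python
def summarize_response(resp):
    """Raise on a failed generation; otherwise return (response_id, total_time)."""
    if resp.get("error") is not None:
        raise RuntimeError(f"generation failed: {resp['error']}")
    return resp["response_id"], resp["total_time_taken"]

# Using the example response shown above.
rid, total = summarize_response({
    "response_id": "afbc439946a44d98bb8062c8b36ec16d",
    "inference_time_taken": 6.336474418640137,
    "lora_time": 1.8092551231384277,
    "total_time_taken": 7.176232099533081,
    "request_id": "6b060ab415d84117b7b6403d622414f5",
    "error": None,
})
```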

Key Notes

  • Ensure volume mounting in deployment for image storage.
  • ControlNet models are not loaded by default; they must be configured under the controlnets section.
  • Supports multiple pipelines for text-to-image, image-to-image, and inpainting.