diff --git a/docs/source/en/_toctree.yml b/docs/source/en/_toctree.yml
index 9131cb84e82a..4bef07ba54e2 100644
--- a/docs/source/en/_toctree.yml
+++ b/docs/source/en/_toctree.yml
@@ -17,8 +17,6 @@
title: AutoPipeline
- local: tutorials/basic_training
title: Train a diffusion model
- - local: tutorials/using_peft_for_inference
- title: Load LoRAs for inference
- local: tutorials/fast_diffusion
title: Accelerate inference of text-to-image diffusion models
title: Tutorials
@@ -31,11 +29,24 @@
title: Load schedulers and models
- local: using-diffusers/other-formats
title: Model files and layouts
- - local: using-diffusers/loading_adapters
- title: Load adapters
- local: using-diffusers/push_to_hub
title: Push files to the Hub
title: Load pipelines and adapters
+- sections:
+ - local: tutorials/using_peft_for_inference
+ title: LoRA
+ - local: using-diffusers/ip_adapter
+ title: IP-Adapter
+ - local: using-diffusers/controlnet
+ title: ControlNet
+ - local: using-diffusers/t2i_adapter
+ title: T2I-Adapter
+ - local: using-diffusers/dreambooth
+ title: DreamBooth
+ - local: using-diffusers/textual_inversion_inference
+ title: Textual inversion
+ title: Adapters
+ isExpanded: false
- sections:
- local: using-diffusers/unconditional_image_generation
title: Unconditional image generation
@@ -57,8 +68,6 @@
title: Create a server
- local: training/distributed_inference
title: Distributed inference
- - local: using-diffusers/merge_loras
- title: Merge LoRAs
- local: using-diffusers/scheduler_features
title: Scheduler features
- local: using-diffusers/callback
@@ -95,20 +104,12 @@
title: SDXL Turbo
- local: using-diffusers/kandinsky
title: Kandinsky
- - local: using-diffusers/ip_adapter
- title: IP-Adapter
- local: using-diffusers/omnigen
title: OmniGen
- local: using-diffusers/pag
title: PAG
- - local: using-diffusers/controlnet
- title: ControlNet
- - local: using-diffusers/t2i_adapter
- title: T2I-Adapter
- local: using-diffusers/inference_with_lcm
title: Latent Consistency Model
- - local: using-diffusers/textual_inversion_inference
- title: Textual inversion
- local: using-diffusers/shap-e
title: Shap-E
- local: using-diffusers/diffedit
diff --git a/docs/source/en/tutorials/using_peft_for_inference.md b/docs/source/en/tutorials/using_peft_for_inference.md
index 33414a331ea7..f17113ecb830 100644
--- a/docs/source/en/tutorials/using_peft_for_inference.md
+++ b/docs/source/en/tutorials/using_peft_for_inference.md
@@ -10,218 +10,625 @@ an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express o
specific language governing permissions and limitations under the License.
-->
-[[open-in-colab]]
+# LoRA
-# Load LoRAs for inference
+[LoRA (Low-Rank Adaptation)](https://huggingface.co/papers/2106.09685) is a method for quickly training a model for a new task. It works by freezing the original model weights and adding a small number of *new* trainable parameters. This means it is significantly faster and cheaper to adapt an existing model to new tasks, such as generating images in a new style.
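+
+As a rough mental model, the LoRA update can be sketched as a frozen weight matrix plus a scaled low-rank product of two small trainable matrices. The shapes, rank, and scaling below are illustrative and not tied to any specific Diffusers module.
+
+```py
+import torch
+
+d_in, d_out, rank, alpha = 768, 768, 16, 16
+W = torch.randn(d_out, d_in)        # frozen pretrained weight, never updated
+A = torch.randn(rank, d_in) * 0.01  # trainable low-rank factor
+B = torch.zeros(d_out, rank)        # trainable, zero-initialized so the update starts at zero
+
+x = torch.randn(1, d_in)
+y = x @ (W + (alpha / rank) * (B @ A)).T  # base output plus the low-rank adaptation
+```
+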
-There are many adapter types (with [LoRAs](https://huggingface.co/docs/peft/conceptual_guides/adapter#low-rank-adaptation-lora) being the most popular) trained in different styles to achieve different effects. You can even combine multiple adapters to create new and unique images.
+LoRA checkpoints are typically only a couple hundred MBs in size, so they're very lightweight and easy to store. Load this smaller set of weights into an existing base model with [`~loaders.StableDiffusionLoraLoaderMixin.load_lora_weights`] and specify the file name.
-In this tutorial, you'll learn how to easily load and manage adapters for inference with the 🤗 [PEFT](https://huggingface.co/docs/peft/index) integration in 🤗 Diffusers. You'll use LoRA as the main adapter technique, so you'll see the terms LoRA and adapter used interchangeably.
+
+
-Let's first install all the required libraries.
+```py
+import torch
+from diffusers import AutoPipelineForText2Image
+
+pipeline = AutoPipelineForText2Image.from_pretrained(
+ "stabilityai/stable-diffusion-xl-base-1.0",
+ torch_dtype=torch.float16
+).to("cuda")
+pipeline.load_lora_weights(
+ "ostris/super-cereal-sdxl-lora",
+ weight_name="cereal_box_sdxl_v1.safetensors",
+ adapter_name="cereal"
+)
+pipeline("bears, pizza bites").images[0]
+```
-```bash
-!pip install -q transformers accelerate peft diffusers
+
+
+
+```py
+import torch
+from diffusers import LTXConditionPipeline
+from diffusers.utils import export_to_video, load_image
+
+pipeline = LTXConditionPipeline.from_pretrained(
+ "Lightricks/LTX-Video-0.9.5", torch_dtype=torch.bfloat16
+)
+
+pipeline.load_lora_weights(
+ "Lightricks/LTX-Video-Cakeify-LoRA",
+ weight_name="ltxv_095_cakeify_lora.safetensors",
+ adapter_name="cakeify"
+)
+pipeline.set_adapters("cakeify")
+
+# use "CAKEIFY" to trigger the LoRA
+prompt = "CAKEIFY a person using a knife to cut a cake shaped like a Pikachu plushie"
+image = load_image("https://huggingface.co/Lightricks/LTX-Video-Cakeify-LoRA/resolve/main/assets/images/pikachu.png")
+
+video = pipeline(
+ prompt=prompt,
+ image=image,
+ width=576,
+ height=576,
+ num_frames=161,
+ decode_timestep=0.03,
+ decode_noise_scale=0.025,
+ num_inference_steps=50,
+).frames[0]
+export_to_video(video, "output.mp4", fps=26)
```
-Now, load a pipeline with a [Stable Diffusion XL (SDXL)](../api/pipelines/stable_diffusion/stable_diffusion_xl) checkpoint:
+
+
-```python
-from diffusers import DiffusionPipeline
+The [`~loaders.StableDiffusionLoraLoaderMixin.load_lora_weights`] method is the preferred way to load LoRA weights into the UNet and text encoder because it can handle cases where:
+
+- the LoRA weights don't have separate UNet and text encoder identifiers
+- the LoRA weights have separate UNet and text encoder identifiers
+
+The [`~loaders.PeftAdapterMixin.load_lora_adapter`] method is used to directly load a LoRA adapter at the *model-level*, as long as the model is a Diffusers model that is a subclass of [`PeftAdapterMixin`]. It builds and prepares the necessary model configuration for the adapter. This method also loads the LoRA adapter into the UNet.
+
+For example, if you're only loading a LoRA into the UNet, [`~loaders.PeftAdapterMixin.load_lora_adapter`] ignores the text encoder keys. Use the `prefix` parameter to filter and load the appropriate state dict (`"unet"` in this case).
+
+```py
import torch
+from diffusers import AutoPipelineForText2Image
+
+pipeline = AutoPipelineForText2Image.from_pretrained(
+ "stabilityai/stable-diffusion-xl-base-1.0",
+ torch_dtype=torch.float16
+).to("cuda")
+pipeline.unet.load_lora_adapter(
+ "jbilcke-hf/sdxl-cinematic-1",
+ weight_name="pytorch_lora_weights.safetensors",
+ adapter_name="cinematic"
+ prefix="unet"
+)
+# use cnmt in the prompt to trigger the LoRA
+pipeline("A cute cnmt eating a slice of pizza, stunning color scheme, masterpiece, illustration").images[0]
+```
+
+## torch.compile
-pipe_id = "stabilityai/stable-diffusion-xl-base-1.0"
-pipe = DiffusionPipeline.from_pretrained(pipe_id, torch_dtype=torch.float16).to("cuda")
+[torch.compile](../optimization/torch2.0#torchcompile) speeds up inference by compiling the PyTorch model to use optimized kernels. Before compiling, the LoRA weights need to be fused into the base model and unloaded first.
+
+```py
+import torch
+from diffusers import DiffusionPipeline
+
+# load base model and LoRA
+pipeline = DiffusionPipeline.from_pretrained(
+ "stabilityai/stable-diffusion-xl-base-1.0",
+ torch_dtype=torch.float16
+).to("cuda")
+pipeline.load_lora_weights(
+ "ostris/ikea-instructions-lora-sdxl",
+ weight_name="ikea_instructions_xl_v1_5.safetensors",
+ adapter_name="ikea"
+)
+
+# activate LoRA and set adapter weight
+pipeline.set_adapters("ikea", adapter_weights=0.7)
+
+# fuse LoRAs and unload weights
+pipeline.fuse_lora(adapter_names=["ikea"], lora_scale=1.0)
+pipeline.unload_lora_weights()
+```
+
+Typically, the UNet is compiled because it's the most compute-intensive component of the pipeline.
+
+```py
+pipeline.unet.to(memory_format=torch.channels_last)
+pipeline.unet = torch.compile(pipeline.unet, mode="reduce-overhead", fullgraph=True)
+
+pipeline("A bowl of ramen shaped like a cute kawaii bear").images[0]
```
-Next, load a [CiroN2022/toy-face](https://huggingface.co/CiroN2022/toy-face) adapter with the [`~diffusers.loaders.StableDiffusionXLLoraLoaderMixin.load_lora_weights`] method. With the 🤗 PEFT integration, you can assign a specific `adapter_name` to the checkpoint, which lets you easily switch between different LoRA checkpoints. Let's call this adapter `"toy"`.
+Refer to the [hotswapping](#hotswapping) section to learn how to avoid recompilation when working with compiled models and multiple LoRAs.
-```python
-pipe.load_lora_weights("CiroN2022/toy-face", weight_name="toy_face_sdxl.safetensors", adapter_name="toy")
+## Weight scale
+
+The `scale` parameter is used to control how much of a LoRA to apply. A value of `0` is equivalent to only using the base model weights and a value of `1` is equivalent to fully using the LoRA.
+
+
+
+
+For simple use cases, you can pass `cross_attention_kwargs={"scale": 1.0}` to the pipeline.
+
+```py
+import torch
+from diffusers import AutoPipelineForText2Image
+
+pipeline = AutoPipelineForText2Image.from_pretrained(
+ "stabilityai/stable-diffusion-xl-base-1.0",
+ torch_dtype=torch.float16
+).to("cuda")
+pipeline.load_lora_weights(
+ "ostris/super-cereal-sdxl-lora",
+ weight_name="cereal_box_sdxl_v1.safetensors",
+ adapter_name="cereal"
+)
+pipeline("bears, pizza bites", cross_attention_kwargs={"scale": 1.0}).images[0]
```
-Make sure to include the token `toy_face` in the prompt and then you can perform inference:
+
+
-```python
-prompt = "toy_face of a hacker with a hoodie"
+> [!WARNING]
+> The [`~loaders.PeftAdapterMixin.set_adapters`] method only scales attention weights. If a LoRA has ResNets or down and upsamplers, these components keep a scale value of `1.0`.
-lora_scale = 0.9
-image = pipe(
- prompt, num_inference_steps=30, cross_attention_kwargs={"scale": lora_scale}, generator=torch.manual_seed(0)
-).images[0]
-image
+For finer control over each individual component of the UNet or text encoder, pass a dictionary instead. In the example below, the `"down"` block in the UNet is scaled by 0.9, and in the `"up"` block you can further specify the scales of the transformers in `"block_0"` and `"block_1"`. If a block like `"mid"` isn't specified, the default value `1.0` is used.
+
+```py
+import torch
+from diffusers import AutoPipelineForText2Image
+
+pipeline = AutoPipelineForText2Image.from_pretrained(
+ "stabilityai/stable-diffusion-xl-base-1.0",
+ torch_dtype=torch.float16
+).to("cuda")
+pipeline.load_lora_weights(
+ "ostris/super-cereal-sdxl-lora",
+ weight_name="cereal_box_sdxl_v1.safetensors",
+ adapter_name="cereal"
+)
+scales = {
+ "text_encoder": 0.5,
+ "text_encoder_2": 0.5,
+ "unet": {
+ "down": 0.9,
+ "up": {
+ "block_0": 0.6,
+ "block_1": [0.4, 0.8, 1.0],
+ }
+ }
+}
+pipeline.set_adapters("cereal", scales)
+pipeline("bears, pizza bites").images[0]
```
-
+
+
-With the `adapter_name` parameter, it is really easy to use another adapter for inference! Load the [nerijs/pixel-art-xl](https://huggingface.co/nerijs/pixel-art-xl) adapter that has been fine-tuned to generate pixel art images and call it `"pixel"`.
+## Hotswapping
-The pipeline automatically sets the first loaded adapter (`"toy"`) as the active adapter, but you can activate the `"pixel"` adapter with the [`~loaders.peft.PeftAdapterMixin.set_adapters`] method:
+Hotswapping LoRAs is an efficient way to work with multiple LoRAs while avoiding accumulating memory from multiple calls to [`~loaders.StableDiffusionLoraLoaderMixin.load_lora_weights`] and in some cases, recompilation, if a model is compiled. This workflow requires a loaded LoRA because the new LoRA weights are swapped in place for the existing loaded LoRA.
-```python
-pipe.load_lora_weights("nerijs/pixel-art-xl", weight_name="pixel-art-xl.safetensors", adapter_name="pixel")
-pipe.set_adapters("pixel")
+```py
+import torch
+from diffusers import DiffusionPipeline
+
+# load base model and LoRAs
+pipeline = DiffusionPipeline.from_pretrained(
+ "stabilityai/stable-diffusion-xl-base-1.0",
+ torch_dtype=torch.float16
+).to("cuda")
+pipeline.load_lora_weights(
+ "ostris/ikea-instructions-lora-sdxl",
+ weight_name="ikea_instructions_xl_v1_5.safetensors",
+ adapter_name="ikea"
+)
```
-Make sure you include the token `pixel art` in your prompt to generate a pixel art image:
+> [!WARNING]
+> Hotswapping is unsupported for LoRAs that target the text encoder.
+
+Set `hotswap=True` in [`~loaders.StableDiffusionLoraLoaderMixin.load_lora_weights`] to swap the second LoRA. Use the `adapter_name` parameter to indicate which LoRA to swap (`default_0` is the default name).
-```python
-prompt = "a hacker with a hoodie, pixel art"
-image = pipe(
- prompt, num_inference_steps=30, cross_attention_kwargs={"scale": lora_scale}, generator=torch.manual_seed(0)
-).images[0]
-image
+```py
+pipeline.load_lora_weights(
+ "lordjia/by-feng-zikai",
+ hotswap=True,
+ adapter_name="ikea"
+)
```
-
+### Compiled models
-
+For compiled models, use [`~loaders.lora_base.LoraBaseMixin.enable_lora_hotswap`] to avoid recompilation when hotswapping LoRAs. This method should be called *before* loading the first LoRA and `torch.compile` should be called *after* loading the first LoRA.
-By default, if the most up-to-date versions of PEFT and Transformers are detected, `low_cpu_mem_usage` is set to `True` to speed up the loading time of LoRA checkpoints.
+> [!TIP]
+> The [`~loaders.lora_base.LoraBaseMixin.enable_lora_hotswap`] method isn't always necessary if the second LoRA targets the identical LoRA ranks and scales as the first LoRA.
-
+Within [`~loaders.lora_base.LoraBaseMixin.enable_lora_hotswap`], the `target_rank` parameter sets the rank used for all LoRA adapters. Set it to the highest rank among the LoRAs you plan to load so that LoRAs with lower ranks can also be hotswapped. The default value is 128.
-## Merge adapters
+```py
+import torch
+from diffusers import DiffusionPipeline
-You can also merge different adapter checkpoints for inference to blend their styles together.
+# load base model and LoRAs
+pipeline = DiffusionPipeline.from_pretrained(
+ "stabilityai/stable-diffusion-xl-base-1.0",
+ torch_dtype=torch.float16
+).to("cuda")
+# 1. enable_lora_hotswap with the highest rank among the LoRAs to be loaded
+max_rank = 128  # illustrative value, use the largest rank of the LoRAs you plan to hotswap
+pipeline.enable_lora_hotswap(target_rank=max_rank)
+pipeline.load_lora_weights(
+ "ostris/ikea-instructions-lora-sdxl",
+ weight_name="ikea_instructions_xl_v1_5.safetensors",
+ adapter_name="ikea"
+)
+# 2. torch.compile
+pipeline.unet = torch.compile(pipeline.unet, mode="reduce-overhead", fullgraph=True)
+
+# 3. hotswap
+pipeline.load_lora_weights(
+ "lordjia/by-feng-zikai",
+ hotswap=True,
+ adapter_name="ikea"
+)
+```
-Once again, use the [`~loaders.peft.PeftAdapterMixin.set_adapters`] method to activate the `pixel` and `toy` adapters and specify the weights for how they should be merged.
+> [!TIP]
+> Move your code inside the `with torch._dynamo.config.patch(error_on_recompile=True)` context manager to detect if a model was recompiled. If a model is recompiled despite following all the steps above, please open an [issue](https://github.com/huggingface/diffusers/issues) with a reproducible example.
+
+There are still scenarios where recompilation is unavoidable, such as when the hotswapped LoRA targets more layers than the initial adapter. Try to load the LoRA that targets the most layers *first*. For more details about this limitation, refer to the PEFT [hotswapping](https://huggingface.co/docs/peft/main/en/package_reference/hotswap#peft.utils.hotswap.hotswap_adapter) docs.
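+
+As a minimal sketch of the recompilation guard mentioned in the tip above, the hotswap call can be wrapped in the `torch._dynamo.config.patch` context manager. It reuses the compiled `pipeline` and adapter name from the previous example.
+
+```py
+import torch
+
+# raise an error instead of silently recompiling if the hotswap triggers recompilation
+with torch._dynamo.config.patch(error_on_recompile=True):
+    pipeline.load_lora_weights(
+        "lordjia/by-feng-zikai",
+        hotswap=True,
+        adapter_name="ikea"
+    )
+    pipeline("A bowl of ramen shaped like a cute kawaii bear, by Feng Zikai").images[0]
+```
+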
+
+## Merge
+
+The weights from each LoRA can be merged together to produce a blend of multiple existing styles. There are several methods for merging LoRAs, each of which differs in *how* the weights are merged (which may affect generation quality).
+
+### set_adapters
+
+The [`~loaders.PeftAdapterMixin.set_adapters`] method merges LoRAs by concatenating their weighted matrices. Pass the LoRA names to [`~loaders.PeftAdapterMixin.set_adapters`] and use the `adapter_weights` parameter to control the scaling of each LoRA. For example, if `adapter_weights=[0.5, 0.5]`, the output is an average of both LoRAs.
+
+> [!TIP]
+> The `"scale"` parameter determines how much of the merged LoRA to apply. See the [Weight scale](#weight-scale) section for more details.
-```python
-pipe.set_adapters(["pixel", "toy"], adapter_weights=[0.5, 1.0])
+```py
+import torch
+from diffusers import DiffusionPipeline
+
+pipeline = DiffusionPipeline.from_pretrained(
+ "stabilityai/stable-diffusion-xl-base-1.0",
+ torch_dtype=torch.float16
+).to("cuda")
+pipeline.load_lora_weights(
+ "ostris/ikea-instructions-lora-sdxl",
+ weight_name="ikea_instructions_xl_v1_5.safetensors",
+ adapter_name="ikea"
+)
+pipeline.load_lora_weights(
+ "lordjia/by-feng-zikai",
+ weight_name="fengzikai_v1.0_XL.safetensors",
+ adapter_name="feng"
+)
+pipeline.set_adapters(["ikea", "feng"], adapter_weights=[0.7, 0.8])
+# use by Feng Zikai to activate the lordjia/by-feng-zikai LoRA
+pipeline("A bowl of ramen shaped like a cute kawaii bear, by Feng Zikai", cross_attention_kwargs={"scale": 1.0}).images[0]
```
-
+
+
+
+
+### add_weighted_adapter
+
+> [!TIP]
+> This is an experimental method and you can refer to PEFTs [Model merging](https://huggingface.co/docs/peft/developer_guides/model_merging) for more details. Take a look at this [issue](https://github.com/huggingface/diffusers/issues/6892) if you're interested in the motivation and design behind this integration.
+
+The [`~peft.LoraModel.add_weighted_adapter`] method enables more efficient merging methods like [TIES](https://huggingface.co/papers/2306.01708) or [DARE](https://huggingface.co/papers/2311.03099). These merging methods remove redundant and potentially interfering parameters from merged models. Keep in mind the LoRAs need to have identical ranks to be merged.
-LoRA checkpoints in the diffusion community are almost always obtained with [DreamBooth](https://huggingface.co/docs/diffusers/main/en/training/dreambooth). DreamBooth training often relies on "trigger" words in the input text prompts in order for the generation results to look as expected. When you combine multiple LoRA checkpoints, it's important to ensure the trigger words for the corresponding LoRA checkpoints are present in the input text prompts.
+Make sure the latest stable versions of Diffusers and PEFT are installed.
-
+```bash
+pip install -U -q diffusers peft
+```
-Remember to use the trigger words for [CiroN2022/toy-face](https://hf.co/CiroN2022/toy-face) and [nerijs/pixel-art-xl](https://hf.co/nerijs/pixel-art-xl) (these are found in their repositories) in the prompt to generate an image.
+Load a UNet that corresponds to the UNet used by the LoRA.
-```python
-prompt = "toy_face of a hacker with a hoodie, pixel art"
-image = pipe(
- prompt, num_inference_steps=30, cross_attention_kwargs={"scale": 1.0}, generator=torch.manual_seed(0)
-).images[0]
-image
+```py
+import copy
+import torch
+from diffusers import AutoModel, DiffusionPipeline
+from peft import get_peft_model, LoraConfig, PeftModel
+
+unet = AutoModel.from_pretrained(
+ "stabilityai/stable-diffusion-xl-base-1.0",
+ torch_dtype=torch.float16,
+ use_safetensors=True,
+ variant="fp16",
+ subfolder="unet",
+).to("cuda")
```
-
+Load a pipeline, pass the UNet to it, and load a LoRA.
-Impressive! As you can see, the model generated an image that mixed the characteristics of both adapters.
+```py
+pipeline = DiffusionPipeline.from_pretrained(
+ "stabilityai/stable-diffusion-xl-base-1.0",
+ variant="fp16",
+ torch_dtype=torch.float16,
+ unet=unet
+).to("cuda")
+pipeline.load_lora_weights(
+ "ostris/ikea-instructions-lora-sdxl",
+ weight_name="ikea_instructions_xl_v1_5.safetensors",
+ adapter_name="ikea"
+)
+```
+
+Create a [`~peft.PeftModel`] from the LoRA checkpoint by combining the first UNet you loaded and the LoRA UNet from the pipeline.
+
+```py
+sdxl_unet = copy.deepcopy(unet)
+ikea_peft_model = get_peft_model(
+ sdxl_unet,
+ pipeline.unet.peft_config["ikea"],
+ adapter_name="ikea"
+)
+
+original_state_dict = {f"base_model.model.{k}": v for k, v in pipeline.unet.state_dict().items()}
+ikea_peft_model.load_state_dict(original_state_dict, strict=True)
+```
> [!TIP]
-> Through its PEFT integration, Diffusers also offers more efficient merging methods which you can learn about in the [Merge LoRAs](../using-diffusers/merge_loras) guide!
+> You can save and reuse the `ikea_peft_model` by pushing it to the Hub as shown below.
+> ```py
+> ikea_peft_model.push_to_hub("ikea_peft_model", token=TOKEN)
+> ```
+
+Repeat this process and create a [`~peft.PeftModel`] for the second LoRA.
-To return to only using one adapter, use the [`~loaders.peft.PeftAdapterMixin.set_adapters`] method to activate the `"toy"` adapter:
+```py
+pipeline.delete_adapters("ikea")
+sdxl_unet.delete_adapters("ikea")
+
+pipeline.load_lora_weights(
+ "lordjia/by-feng-zikai",
+ weight_name="fengzikai_v1.0_XL.safetensors",
+ adapter_name="feng"
+)
+pipeline.set_adapters(adapter_names="feng")
+
+feng_peft_model = get_peft_model(
+ sdxl_unet,
+ pipeline.unet.peft_config["feng"],
+ adapter_name="feng"
+)
+
+original_state_dict = {f"base_model.model.{k}": v for k, v in pipeline.unet.state_dict().items()}
+feng_peft_model.load_state_dict(original_state_dict, strict=True)
+```
-```python
-pipe.set_adapters("toy")
+Load a base UNet model and load the adapters.
-prompt = "toy_face of a hacker with a hoodie"
-lora_scale = 0.9
-image = pipe(
- prompt, num_inference_steps=30, cross_attention_kwargs={"scale": lora_scale}, generator=torch.manual_seed(0)
-).images[0]
-image
+```py
+base_unet = AutoModel.from_pretrained(
+ "stabilityai/stable-diffusion-xl-base-1.0",
+ torch_dtype=torch.float16,
+ use_safetensors=True,
+ variant="fp16",
+ subfolder="unet",
+).to("cuda")
+
+model = PeftModel.from_pretrained(
+ base_unet,
+ "stevhliu/ikea_peft_model",
+ use_safetensors=True,
+ subfolder="ikea",
+ adapter_name="ikea"
+)
+model.load_adapter(
+ "stevhliu/feng_peft_model",
+ use_safetensors=True,
+ subfolder="feng",
+ adapter_name="feng"
+)
```
-Or to disable all adapters entirely, use the [`~loaders.peft.PeftAdapterMixin.disable_lora`] method to return the base model.
+Merge the LoRAs with [`~peft.LoraModel.add_weighted_adapter`] and specify how you want to merge them with `combination_type`. The example below uses the `"dare_linear"` method (refer to this [blog post](https://huggingface.co/blog/peft_merging) to learn more about these merging methods), which randomly prunes some weights and then performs a weighted sum of the tensors based on the weight set for each LoRA in `weights`.
-```python
-pipe.disable_lora()
+Activate the merged LoRAs with [`~loaders.PeftAdapterMixin.set_adapters`].
-prompt = "toy_face of a hacker with a hoodie"
-image = pipe(prompt, num_inference_steps=30, generator=torch.manual_seed(0)).images[0]
-image
+```py
+model.add_weighted_adapter(
+ adapters=["ikea", "feng"],
+ combination_type="dare_linear",
+ weights=[1.0, 1.0],
+ adapter_name="ikea-feng"
+)
+model.set_adapters("ikea-feng")
+
+pipeline = DiffusionPipeline.from_pretrained(
+ "stabilityai/stable-diffusion-xl-base-1.0",
+ unet=model,
+ variant="fp16",
+ torch_dtype=torch.float16,
+).to("cuda")
+pipeline("A bowl of ramen shaped like a cute kawaii bear, by Feng Zikai").images[0]
```
-
+
+
+
-### Customize adapters strength
+### fuse_lora
-For even more customization, you can control how strongly the adapter affects each part of the pipeline. For this, pass a dictionary with the control strengths (called "scales") to [`~loaders.peft.PeftAdapterMixin.set_adapters`].
+The [`~loaders.lora_base.LoraBaseMixin.fuse_lora`] method fuses the LoRA weights directly with the original UNet and text encoder weights of the underlying model. This reduces the overhead of loading the underlying model for each LoRA because it only loads the model once, which lowers memory usage and increases inference speed.
-For example, here's how you can turn on the adapter for the `down` parts, but turn it off for the `mid` and `up` parts:
-```python
-pipe.enable_lora() # enable lora again, after we disabled it above
-prompt = "toy_face of a hacker with a hoodie, pixel art"
-adapter_weight_scales = { "unet": { "down": 1, "mid": 0, "up": 0} }
-pipe.set_adapters("pixel", adapter_weight_scales)
-image = pipe(prompt, num_inference_steps=30, generator=torch.manual_seed(0)).images[0]
-image
+```py
+import torch
+from diffusers import DiffusionPipeline
+
+pipeline = DiffusionPipeline.from_pretrained(
+ "stabilityai/stable-diffusion-xl-base-1.0",
+ torch_dtype=torch.float16
+).to("cuda")
+pipeline.load_lora_weights(
+ "ostris/ikea-instructions-lora-sdxl",
+ weight_name="ikea_instructions_xl_v1_5.safetensors",
+ adapter_name="ikea"
+)
+pipeline.load_lora_weights(
+ "lordjia/by-feng-zikai",
+ weight_name="fengzikai_v1.0_XL.safetensors",
+ adapter_name="feng"
+)
+pipeline.set_adapters(["ikea", "feng"], adapter_weights=[0.7, 0.8])
```
-
+Call [`~loaders.lora_base.LoraBaseMixin.fuse_lora`] to fuse them. The `lora_scale` parameter controls how much to scale the output by with the LoRA weights. It is important to make this adjustment now because passing `scale` to `cross_attention_kwargs` won't work in the pipeline.
-Let's see how turning off the `down` part and turning on the `mid` and `up` part respectively changes the image.
-```python
-adapter_weight_scales = { "unet": { "down": 0, "mid": 1, "up": 0} }
-pipe.set_adapters("pixel", adapter_weight_scales)
-image = pipe(prompt, num_inference_steps=30, generator=torch.manual_seed(0)).images[0]
-image
+```py
+pipeline.fuse_lora(adapter_names=["ikea", "feng"], lora_scale=1.0)
```
-
+Unload the LoRA weights since they're already fused with the underlying model. Save the fused pipeline with either [`~DiffusionPipeline.save_pretrained`] to save it locally or [`~PushToHubMixin.push_to_hub`] to save it to the Hub.
+
+
+
-```python
-adapter_weight_scales = { "unet": { "down": 0, "mid": 0, "up": 1} }
-pipe.set_adapters("pixel", adapter_weight_scales)
-image = pipe(prompt, num_inference_steps=30, generator=torch.manual_seed(0)).images[0]
-image
+```py
+pipeline.unload_lora_weights()
+pipeline.save_pretrained("path/to/fused-pipeline")
```
-
+
+
-Looks cool!
+```py
+pipeline.unload_lora_weights()
+pipeline.push_to_hub("fused-ikea-feng")
+```
-This is a really powerful feature. You can use it to control the adapter strengths down to per-transformer level. And you can even use it for multiple adapters.
-```python
-adapter_weight_scales_toy = 0.5
-adapter_weight_scales_pixel = {
- "unet": {
- "down": 0.9, # all transformers in the down-part will use scale 0.9
- # "mid" # because, in this example, "mid" is not given, all transformers in the mid part will use the default scale 1.0
- "up": {
- "block_0": 0.6, # all 3 transformers in the 0th block in the up-part will use scale 0.6
- "block_1": [0.4, 0.8, 1.0], # the 3 transformers in the 1st block in the up-part will use scales 0.4, 0.8 and 1.0 respectively
- }
- }
-}
-pipe.set_adapters(["toy", "pixel"], [adapter_weight_scales_toy, adapter_weight_scales_pixel])
-image = pipe(prompt, num_inference_steps=30, generator=torch.manual_seed(0)).images[0]
-image
+
+
+
+The fused pipeline can now be quickly loaded for inference without requiring each LoRA to be separately loaded.
+
+```py
+pipeline = DiffusionPipeline.from_pretrained(
+ "username/fused-ikea-feng", torch_dtype=torch.float16,
+).to("cuda")
+pipeline("A bowl of ramen shaped like a cute kawaii bear, by Feng Zikai").images[0]
+```
+
+Use [`~loaders.LoraLoaderMixin.unfuse_lora`] to restore the underlying model's weights, for example, if you want to use a different `lora_scale` value. You can only unfuse if there is a single LoRA fused. For example, it won't work with the pipeline from above because there are multiple fused LoRAs. In these cases, you'll need to reload the entire model.
+
+```py
+pipeline.unfuse_lora()
```
-
+
+
+
+
+## Manage
+
+Diffusers provides several methods to help you manage working with LoRAs. These methods can be especially useful if you're working with multiple LoRAs.
-## Manage adapters
+### set_adapters
-You have attached multiple adapters in this tutorial, and if you're feeling a bit lost on what adapters have been attached to the pipeline's components, use the [`~diffusers.loaders.StableDiffusionLoraLoaderMixin.get_active_adapters`] method to check the list of active adapters:
+The [`~loaders.PeftAdapterMixin.set_adapters`] method also activates which LoRA to use when multiple LoRAs are loaded. This allows you to switch between different LoRAs by specifying their name.
```py
-active_adapters = pipe.get_active_adapters()
-active_adapters
-["toy", "pixel"]
+import torch
+from diffusers import DiffusionPipeline
+
+pipeline = DiffusionPipeline.from_pretrained(
+ "stabilityai/stable-diffusion-xl-base-1.0",
+ torch_dtype=torch.float16
+).to("cuda")
+pipeline.load_lora_weights(
+ "ostris/ikea-instructions-lora-sdxl",
+ weight_name="ikea_instructions_xl_v1_5.safetensors",
+ adapter_name="ikea"
+)
+pipeline.load_lora_weights(
+ "lordjia/by-feng-zikai",
+ weight_name="fengzikai_v1.0_XL.safetensors",
+ adapter_name="feng"
+)
+# activates the feng LoRA instead of the ikea LoRA
+pipeline.set_adapters("feng")
```
-You can also get the active adapters of each pipeline component with [`~diffusers.loaders.StableDiffusionLoraLoaderMixin.get_list_adapters`]:
+### save_lora_adapter
+
+Save an adapter with [`~loaders.PeftAdapterMixin.save_lora_adapter`].
```py
-list_adapters_component_wise = pipe.get_list_adapters()
-list_adapters_component_wise
-{"text_encoder": ["toy", "pixel"], "unet": ["toy", "pixel"], "text_encoder_2": ["toy", "pixel"]}
+import torch
+from diffusers import AutoPipelineForText2Image
+
+pipeline = AutoPipelineForText2Image.from_pretrained(
+ "stabilityai/stable-diffusion-xl-base-1.0",
+ torch_dtype=torch.float16
+).to("cuda")
+pipeline.unet.load_lora_adapter(
+ "jbilcke-hf/sdxl-cinematic-1",
+ weight_name="pytorch_lora_weights.safetensors",
+ adapter_name="cinematic"
+ prefix="unet"
+)
+pipeline.unet.save_lora_adapter("path/to/save", adapter_name="cinematic")
+```
+
+### unload_lora_weights
+
+The [`~loaders.lora_base.LoraBaseMixin.unload_lora_weights`] method unloads any LoRA weights in the pipeline to restore the underlying model weights.
+
+```py
+pipeline.unload_lora_weights()
```
-The [`~loaders.peft.PeftAdapterMixin.delete_adapters`] function completely removes an adapter and their LoRA layers from a model.
+### disable_lora
+
+The [`~loaders.PeftAdapterMixin.disable_lora`] method disables all LoRAs (but they're still kept on the pipeline) and restores the pipeline to the underlying model weights.
```py
-pipe.delete_adapters("toy")
-pipe.get_active_adapters()
-["pixel"]
+pipeline.disable_lora()
```
-## PeftInputAutocastDisableHook
+### get_active_adapters
+
+The [`~loaders.lora_base.LoraBaseMixin.get_active_adapters`] method returns a list of active LoRAs attached to a pipeline.
+
+```py
+pipeline.get_active_adapters()
+["cereal", "ikea"]
+```
+
+### get_list_adapters
+
+The [`~loaders.lora_base.LoraBaseMixin.get_list_adapters`] method returns the LoRAs attached to each component in the pipeline.
+
+```py
+pipeline.get_list_adapters()
+{"unet": ["cereal", "ikea"], "text_encoder_2": ["cereal"]}
+```
+
+### delete_adapters
+
+The [`~loaders.PeftAdapterMixin.delete_adapters`] method completely removes a LoRA and its layers from a model.
+
+```py
+pipeline.delete_adapters("ikea")
+```
+
+## Resources
+
+Browse the [LoRA Studio](https://lorastudio.co/models) for different LoRAs to use, or upload your favorite LoRAs from Civitai to the Hub with the Space below.
+
+
-[[autodoc]] hooks.layerwise_casting.PeftInputAutocastDisableHook
+You can find additional LoRAs in the [FLUX LoRA the Explorer](https://huggingface.co/spaces/multimodalart/flux-lora-the-explorer) and [LoRA the Explorer](https://huggingface.co/spaces/multimodalart/LoraTheExplorer) Spaces.
\ No newline at end of file
diff --git a/docs/source/en/using-diffusers/controlnet.md b/docs/source/en/using-diffusers/controlnet.md
index dd569b53601e..72843a6ff93a 100644
--- a/docs/source/en/using-diffusers/controlnet.md
+++ b/docs/source/en/using-diffusers/controlnet.md
@@ -12,46 +12,28 @@ specific language governing permissions and limitations under the License.
# ControlNet
-ControlNet is a type of model for controlling image diffusion models by conditioning the model with an additional input image. There are many types of conditioning inputs (canny edge, user sketching, human pose, depth, and more) you can use to control a diffusion model. This is hugely useful because it affords you greater control over image generation, making it easier to generate specific images without experimenting with different text prompts or denoising values as much.
+[ControlNet](https://huggingface.co/papers/2302.05543) is an adapter that enables controllable generation such as generating an image of a cat in a *specific pose* or following the lines in a sketch of a *specific* cat. It works by adding a smaller network of "zero convolution" layers and progressively training these to avoid disrupting the original model. The original model parameters are frozen to avoid retraining it.
-
+A ControlNet is conditioned on extra visual information or "structural controls" (canny edge, depth maps, human pose, etc.) that can be combined with text prompts to generate images that are guided by the visual input.
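+
+A "zero convolution" is a convolution whose weights and bias are initialized to zero, so the ControlNet branch contributes nothing at the start of training and only gradually learns to inject the control signal. A minimal sketch, with an illustrative channel count:
+
+```py
+import torch.nn as nn
+
+def zero_conv(channels: int) -> nn.Conv2d:
+    # 1x1 convolution whose initial output is zero
+    conv = nn.Conv2d(channels, channels, kernel_size=1)
+    nn.init.zeros_(conv.weight)
+    nn.init.zeros_(conv.bias)
+    return conv
+```
+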
-Check out Section 3.5 of the [ControlNet](https://huggingface.co/papers/2302.05543) paper v1 for a list of ControlNet implementations on various conditioning inputs. You can find the official Stable Diffusion ControlNet conditioned models on [lllyasviel](https://huggingface.co/lllyasviel)'s Hub profile, and more [community-trained](https://huggingface.co/models?other=stable-diffusion&other=controlnet) ones on the Hub.
+> [!TIP]
+> ControlNets are available to many models such as [Flux](../api/pipelines/controlnet_flux), [Hunyuan-DiT](../api/pipelines/controlnet_hunyuandit), [Stable Diffusion 3](../api/pipelines/controlnet_sd3), and more. The examples in this guide use Flux and Stable Diffusion XL.
-For Stable Diffusion XL (SDXL) ControlNet models, you can find them on the 🤗 [Diffusers](https://huggingface.co/diffusers) Hub organization, or you can browse [community-trained](https://huggingface.co/models?other=stable-diffusion-xl&other=controlnet) ones on the Hub.
+Load a ControlNet conditioned on a specific control, such as canny edge, and pass it to the pipeline in [`~DiffusionPipeline.from_pretrained`].
-
+
+
-A ControlNet model has two sets of weights (or blocks) connected by a zero-convolution layer:
-
-- a *locked copy* keeps everything a large pretrained diffusion model has learned
-- a *trainable copy* is trained on the additional conditioning input
-
-Since the locked copy preserves the pretrained model, training and implementing a ControlNet on a new conditioning input is as fast as finetuning any other model because you aren't training the model from scratch.
-
-This guide will show you how to use ControlNet for text-to-image, image-to-image, inpainting, and more! There are many types of ControlNet conditioning inputs to choose from, but in this guide we'll only focus on several of them. Feel free to experiment with other conditioning inputs!
-
-Before you begin, make sure you have the following libraries installed:
+Generate a canny image with [opencv-python](https://github.com/opencv/opencv-python).
```py
-# uncomment to install the necessary libraries in Colab
-#!pip install -q diffusers transformers accelerate opencv-python
-```
-
-## Text-to-image
-
-For text-to-image, you normally pass a text prompt to the model. But with ControlNet, you can specify an additional conditioning input. Let's condition the model with a canny image, a white outline of an image on a black background. This way, the ControlNet can use the canny image as a control to guide the model to generate an image with the same outline.
-
-Load an image and use the [opencv-python](https://github.com/opencv/opencv-python) library to extract the canny image:
-
-```py
-from diffusers.utils import load_image, make_image_grid
-from PIL import Image
import cv2
import numpy as np
+from PIL import Image
+from diffusers.utils import load_image
original_image = load_image(
- "https://hf.co/datasets/huggingface/documentation-images/resolve/main/diffusers/input_image_vermeer.png"
+ "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/non-enhanced-prompt.png"
)
image = np.array(original_image)
@@ -65,523 +47,300 @@ image = np.concatenate([image, image, image], axis=2)
canny_image = Image.fromarray(image)
```
-
-
-
- original image
-
-
-
- canny image
-
-
-
-Next, load a ControlNet model conditioned on canny edge detection and pass it to the [`StableDiffusionControlNetPipeline`]. Use the faster [`UniPCMultistepScheduler`] and enable model offloading to speed up inference and reduce memory usage.
+Pass the canny image to the pipeline. Use the `controlnet_conditioning_scale` parameter to determine how much weight to assign to the control.
```py
-from diffusers import StableDiffusionControlNetPipeline, ControlNetModel, UniPCMultistepScheduler
import torch
+from diffusers.utils import load_image
+from diffusers import FluxControlNetPipeline, FluxControlNetModel
-controlnet = ControlNetModel.from_pretrained("lllyasviel/sd-controlnet-canny", torch_dtype=torch.float16, use_safetensors=True)
-pipe = StableDiffusionControlNetPipeline.from_pretrained(
- "stable-diffusion-v1-5/stable-diffusion-v1-5", controlnet=controlnet, torch_dtype=torch.float16, use_safetensors=True
+controlnet = FluxControlNetModel.from_pretrained(
+ "InstantX/FLUX.1-dev-Controlnet-Canny", torch_dtype=torch.bfloat16
)
-
-pipe.scheduler = UniPCMultistepScheduler.from_config(pipe.scheduler.config)
-pipe.enable_model_cpu_offload()
-```
-
-Now pass your prompt and canny image to the pipeline:
-
-```py
-output = pipe(
- "the mona lisa", image=canny_image
+pipeline = FluxControlNetPipeline.from_pretrained(
+ "black-forest-labs/FLUX.1-dev", controlnet=controlnet, torch_dtype=torch.bfloat16
+).to("cuda")
+
+prompt = """
+A photorealistic overhead image of a cat reclining sideways in a flamingo pool floatie holding a margarita.
+The cat is floating leisurely in the pool and completely relaxed and happy.
+"""
+
+pipeline(
+ prompt,
+ control_image=canny_image,
+ controlnet_conditioning_scale=0.5,
+ num_inference_steps=50,
+ guidance_scale=3.5,
).images[0]
-make_image_grid([original_image, canny_image, output], rows=1, cols=3)
```
-
-## Image-to-image
-
-For image-to-image, you'd typically pass an initial image and a prompt to the pipeline to generate a new image. With ControlNet, you can pass an additional conditioning input to guide the model. Let's condition the model with a depth map, an image which contains spatial information. This way, the ControlNet can use the depth map as a control to guide the model to generate an image that preserves spatial information.
-You'll use the [`StableDiffusionControlNetImg2ImgPipeline`] for this task, which is different from the [`StableDiffusionControlNetPipeline`] because it allows you to pass an initial image as the starting point for the image generation process.
+
+
-Load an image and use the `depth-estimation` [`~transformers.Pipeline`] from 🤗 Transformers to extract the depth map of an image:
+Generate a depth map with a depth estimation model from Transformers.
```py
import torch
import numpy as np
+from PIL import Image
+from transformers import DPTImageProcessor, DPTForDepthEstimation
+from diffusers import ControlNetModel, StableDiffusionXLControlNetImg2ImgPipeline, AutoencoderKL
+from diffusers.utils import load_image
+
+
+depth_estimator = DPTForDepthEstimation.from_pretrained("Intel/dpt-hybrid-midas").to("cuda")
+feature_extractor = DPTImageProcessor.from_pretrained("Intel/dpt-hybrid-midas")
+
+def get_depth_map(image):
+ image = feature_extractor(images=image, return_tensors="pt").pixel_values.to("cuda")
+ with torch.no_grad(), torch.autocast("cuda"):
+ depth_map = depth_estimator(image).predicted_depth
+
+ depth_map = torch.nn.functional.interpolate(
+ depth_map.unsqueeze(1),
+ size=(1024, 1024),
+ mode="bicubic",
+ align_corners=False,
+ )
+ depth_min = torch.amin(depth_map, dim=[1, 2, 3], keepdim=True)
+ depth_max = torch.amax(depth_map, dim=[1, 2, 3], keepdim=True)
+ depth_map = (depth_map - depth_min) / (depth_max - depth_min)
+ image = torch.cat([depth_map] * 3, dim=1)
+ image = image.permute(0, 2, 3, 1).cpu().numpy()[0]
+ image = Image.fromarray((image * 255.0).clip(0, 255).astype(np.uint8))
+ return image
-from transformers import pipeline
-from diffusers.utils import load_image, make_image_grid
-
-image = load_image(
- "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/controlnet-img2img.jpg"
-)
-
-def get_depth_map(image, depth_estimator):
- image = depth_estimator(image)["depth"]
- image = np.array(image)
- image = image[:, :, None]
- image = np.concatenate([image, image, image], axis=2)
- detected_map = torch.from_numpy(image).float() / 255.0
- depth_map = detected_map.permute(2, 0, 1)
- return depth_map
-
-depth_estimator = pipeline("depth-estimation")
-depth_map = get_depth_map(image, depth_estimator).unsqueeze(0).half().to("cuda")
+image = load_image(
+    "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/non-enhanced-prompt.png"
+)
+depth_image = get_depth_map(image)
```
-Next, load a ControlNet model conditioned on depth maps and pass it to the [`StableDiffusionControlNetImg2ImgPipeline`]. Use the faster [`UniPCMultistepScheduler`] and enable model offloading to speed up inference and reduce memory usage.
+Pass the depth map to the pipeline. Use the `controlnet_conditioning_scale` parameter to determine how much weight to assign to the control.
```py
-from diffusers import StableDiffusionControlNetImg2ImgPipeline, ControlNetModel, UniPCMultistepScheduler
-import torch
-
-controlnet = ControlNetModel.from_pretrained("lllyasviel/control_v11f1p_sd15_depth", torch_dtype=torch.float16, use_safetensors=True)
-pipe = StableDiffusionControlNetImg2ImgPipeline.from_pretrained(
- "stable-diffusion-v1-5/stable-diffusion-v1-5", controlnet=controlnet, torch_dtype=torch.float16, use_safetensors=True
+controlnet = ControlNetModel.from_pretrained(
+ "diffusers/controlnet-depth-sdxl-1.0-small",
+ torch_dtype=torch.float16,
)
+vae = AutoencoderKL.from_pretrained("madebyollin/sdxl-vae-fp16-fix", torch_dtype=torch.float16)
+pipeline = StableDiffusionXLControlNetImg2ImgPipeline.from_pretrained(
+ "stabilityai/stable-diffusion-xl-base-1.0",
+ controlnet=controlnet,
+ vae=vae,
+ torch_dtype=torch.float16,
+).to("cuda")
-pipe.scheduler = UniPCMultistepScheduler.from_config(pipe.scheduler.config)
-pipe.enable_model_cpu_offload()
-```
-
-Now pass your prompt, initial image, and depth map to the pipeline:
-
-```py
-output = pipe(
- "lego batman and robin", image=image, control_image=depth_map,
+prompt = """
+A photorealistic overhead image of a cat reclining sideways in a flamingo pool floatie holding a margarita.
+The cat is floating leisurely in the pool and completely relaxed and happy.
+"""
+image = load_image(
+ "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/non-enhanced-prompt.png"
+).resize((1024, 1024))
+controlnet_conditioning_scale = 0.5
+pipeline(
+ prompt,
+ image=image,
+ control_image=depth_image,
+ controlnet_conditioning_scale=controlnet_conditioning_scale,
+ strength=0.99,
+ num_inference_steps=100,
).images[0]
-make_image_grid([image, output], rows=1, cols=2)
```
-
-## Inpainting
-
-For inpainting, you need an initial image, a mask image, and a prompt describing what to replace the mask with. ControlNet models allow you to add another control image to condition a model with. Let’s condition the model with an inpainting mask. This way, the ControlNet can use the inpainting mask as a control to guide the model to generate an image within the mask area.
+
+
-Load an initial image and a mask image:
+Load an initial image and a mask image, then generate a canny control image from the initial image to guide what the model generates in the masked area.
```py
-from diffusers.utils import load_image, make_image_grid
+import cv2
+import torch
+import numpy as np
+from PIL import Image
+from diffusers.utils import load_image
+from diffusers import StableDiffusionXLControlNetInpaintPipeline, ControlNetModel
init_image = load_image(
- "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/controlnet-inpaint.jpg"
+ "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/non-enhanced-prompt.png"
)
-init_image = init_image.resize((512, 512))
-
+init_image = init_image.resize((1024, 1024))
mask_image = load_image(
- "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/controlnet-inpaint-mask.jpg"
+ "/content/cat_mask.png"
)
-mask_image = mask_image.resize((512, 512))
-make_image_grid([init_image, mask_image], rows=1, cols=2)
-```
-
-Create a function to prepare the control image from the initial and mask images. This'll create a tensor to mark the pixels in `init_image` as masked if the corresponding pixel in `mask_image` is over a certain threshold.
-
-```py
-import numpy as np
-import torch
-
-def make_inpaint_condition(image, image_mask):
- image = np.array(image.convert("RGB")).astype(np.float32) / 255.0
- image_mask = np.array(image_mask.convert("L")).astype(np.float32) / 255.0
+mask_image = mask_image.resize((1024, 1024))
- assert image.shape[0:1] == image_mask.shape[0:1]
- image[image_mask > 0.5] = -1.0 # set as masked pixel
- image = np.expand_dims(image, 0).transpose(0, 3, 1, 2)
- image = torch.from_numpy(image)
+def make_canny_condition(image):
+ image = np.array(image)
+ image = cv2.Canny(image, 100, 200)
+ image = image[:, :, None]
+ image = np.concatenate([image, image, image], axis=2)
+ image = Image.fromarray(image)
return image
-control_image = make_inpaint_condition(init_image, mask_image)
+control_image = make_canny_condition(init_image)
```
-
-
-
- original image
-
-
-
- mask image
-
-
-
-Load a ControlNet model conditioned on inpainting and pass it to the [`StableDiffusionControlNetInpaintPipeline`]. Use the faster [`UniPCMultistepScheduler`] and enable model offloading to speed up inference and reduce memory usage.
+Pass the mask and control image to the pipeline. Use the `controlnet_conditioning_scale` parameter to determine how much weight to assign to the control.
```py
-from diffusers import StableDiffusionControlNetInpaintPipeline, ControlNetModel, UniPCMultistepScheduler
-
-controlnet = ControlNetModel.from_pretrained("lllyasviel/control_v11p_sd15_inpaint", torch_dtype=torch.float16, use_safetensors=True)
-pipe = StableDiffusionControlNetInpaintPipeline.from_pretrained(
- "stable-diffusion-v1-5/stable-diffusion-v1-5", controlnet=controlnet, torch_dtype=torch.float16, use_safetensors=True
+controlnet = ControlNetModel.from_pretrained(
+ "diffusers/controlnet-canny-sdxl-1.0", torch_dtype=torch.float16
)
-
-pipe.scheduler = UniPCMultistepScheduler.from_config(pipe.scheduler.config)
-pipe.enable_model_cpu_offload()
-```
-
-Now pass your prompt, initial image, mask image, and control image to the pipeline:
-
-```py
-output = pipe(
- "corgi face with large ears, detailed, pixar, animated, disney",
- num_inference_steps=20,
- eta=1.0,
+pipeline = StableDiffusionXLControlNetInpaintPipeline.from_pretrained(
+ "stabilityai/stable-diffusion-xl-base-1.0", controlnet=controlnet, torch_dtype=torch.float16
+).to("cuda")
+pipeline(
+ "a cute and fluffy bunny rabbit",
+ num_inference_steps=100,
+ strength=0.99,
+ controlnet_conditioning_scale=0.5,
image=init_image,
mask_image=mask_image,
control_image=control_image,
).images[0]
-make_image_grid([init_image, mask_image, output], rows=1, cols=3)
```
-
-## Guess mode
-
-[Guess mode](https://github.com/lllyasviel/ControlNet/discussions/188) does not require supplying a prompt to a ControlNet at all! This forces the ControlNet encoder to do its best to "guess" the contents of the input control map (depth map, pose estimation, canny edge, etc.).
+
+
-Guess mode adjusts the scale of the output residuals from a ControlNet by a fixed ratio depending on the block depth. The shallowest `DownBlock` corresponds to 0.1, and as the blocks get deeper, the scale increases exponentially such that the scale of the `MidBlock` output becomes 1.0.
+## Multi-ControlNet
-
+You can compose multiple ControlNet conditionings, such as a canny image and a depth map, to create a *MultiControlNet*. For the best results, you should mask conditionings so they don't overlap and experiment with different `controlnet_conditioning_scale` parameters to adjust how much weight is assigned to each control input.
-Guess mode does not have any impact on prompt conditioning and you can still provide a prompt if you want.
+The example below composes a canny image and depth map.
-
-
-Set `guess_mode=True` in the pipeline, and it is [recommended](https://github.com/lllyasviel/ControlNet#guess-mode--non-prompt-mode) to set the `guidance_scale` value between 3.0 and 5.0.
+Pass the ControlNets as a list to the pipeline and resize the images to the expected input size.
```py
-from diffusers import StableDiffusionControlNetPipeline, ControlNetModel
-from diffusers.utils import load_image, make_image_grid
-import numpy as np
import torch
-from PIL import Image
-import cv2
-
-controlnet = ControlNetModel.from_pretrained("lllyasviel/sd-controlnet-canny", use_safetensors=True)
-pipe = StableDiffusionControlNetPipeline.from_pretrained("stable-diffusion-v1-5/stable-diffusion-v1-5", controlnet=controlnet, use_safetensors=True).to("cuda")
-
-original_image = load_image("https://huggingface.co/takuma104/controlnet_dev/resolve/main/bird_512x512.png")
-
-image = np.array(original_image)
-
-low_threshold = 100
-high_threshold = 200
-
-image = cv2.Canny(image, low_threshold, high_threshold)
-image = image[:, :, None]
-image = np.concatenate([image, image, image], axis=2)
-canny_image = Image.fromarray(image)
-
-image = pipe("", image=canny_image, guess_mode=True, guidance_scale=3.0).images[0]
-make_image_grid([original_image, canny_image, image], rows=1, cols=3)
-```
-
-
-
-
- regular mode with prompt
-
-
-
- guess mode without prompt
-
-
-
-## ControlNet with Stable Diffusion XL
-
-There aren't too many ControlNet models compatible with Stable Diffusion XL (SDXL) at the moment, but we've trained two full-sized ControlNet models for SDXL conditioned on canny edge detection and depth maps. We're also experimenting with creating smaller versions of these SDXL-compatible ControlNet models so it is easier to run on resource-constrained hardware. You can find these checkpoints on the [🤗 Diffusers Hub organization](https://huggingface.co/diffusers)!
-
-Let's use a SDXL ControlNet conditioned on canny images to generate an image. Start by loading an image and prepare the canny image:
-
-```py
from diffusers import StableDiffusionXLControlNetPipeline, ControlNetModel, AutoencoderKL
-from diffusers.utils import load_image, make_image_grid
-from PIL import Image
-import cv2
-import numpy as np
-import torch
-
-original_image = load_image(
- "https://huggingface.co/datasets/hf-internal-testing/diffusers-images/resolve/main/sd_controlnet/hf-logo.png"
-)
-
-image = np.array(original_image)
-
-low_threshold = 100
-high_threshold = 200
-
-image = cv2.Canny(image, low_threshold, high_threshold)
-image = image[:, :, None]
-image = np.concatenate([image, image, image], axis=2)
-canny_image = Image.fromarray(image)
-make_image_grid([original_image, canny_image], rows=1, cols=2)
-```
-
-
-
- original image
-
-
-
- canny image
-
-
-
-Load a SDXL ControlNet model conditioned on canny edge detection and pass it to the [`StableDiffusionXLControlNetPipeline`]. You can also enable model offloading to reduce memory usage.
-
-```py
-controlnet = ControlNetModel.from_pretrained(
- "diffusers/controlnet-canny-sdxl-1.0",
- torch_dtype=torch.float16,
- use_safetensors=True
-)
-vae = AutoencoderKL.from_pretrained("madebyollin/sdxl-vae-fp16-fix", torch_dtype=torch.float16, use_safetensors=True)
-pipe = StableDiffusionXLControlNetPipeline.from_pretrained(
- "stabilityai/stable-diffusion-xl-base-1.0",
- controlnet=controlnet,
- vae=vae,
- torch_dtype=torch.float16,
- use_safetensors=True
-)
-pipe.enable_model_cpu_offload()
-```
-
-Now pass your prompt (and optionally a negative prompt if you're using one) and canny image to the pipeline:
-
-
+controlnets = [
+ ControlNetModel.from_pretrained(
+ "diffusers/controlnet-depth-sdxl-1.0-small", torch_dtype=torch.float16
+ ),
+ ControlNetModel.from_pretrained(
+ "diffusers/controlnet-canny-sdxl-1.0", torch_dtype=torch.float16,
+ ),
+]
-The [`controlnet_conditioning_scale`](https://huggingface.co/docs/diffusers/main/en/api/pipelines/controlnet#diffusers.StableDiffusionControlNetPipeline.__call__.controlnet_conditioning_scale) parameter determines how much weight to assign to the conditioning inputs. A value of 0.5 is recommended for good generalization, but feel free to experiment with this number!
+vae = AutoencoderKL.from_pretrained("madebyollin/sdxl-vae-fp16-fix", torch_dtype=torch.float16)
+pipeline = StableDiffusionXLControlNetPipeline.from_pretrained(
+ "stabilityai/stable-diffusion-xl-base-1.0", controlnet=controlnets, vae=vae, torch_dtype=torch.float16
+).to("cuda")
-
+prompt = """
+a relaxed rabbit sitting on a striped towel next to a pool with a tropical drink nearby,
+bright sunny day, vacation scene, 35mm photograph, film, professional, 4k, highly detailed
+"""
+negative_prompt = "lowres, bad anatomy, worst quality, low quality, deformed, ugly"
-```py
-prompt = "aerial view, a futuristic research complex in a bright foggy jungle, hard lighting"
-negative_prompt = 'low quality, bad quality, sketches'
+# the image order must match the order of the ControlNets defined above
+images = [depth_image.resize((1024, 1024)), canny_image.resize((1024, 1024))]
-image = pipe(
+pipeline(
prompt,
negative_prompt=negative_prompt,
- image=canny_image,
- controlnet_conditioning_scale=0.5,
+ image=images,
+ num_inference_steps=100,
+ controlnet_conditioning_scale=[0.5, 0.5],
).images[0]
-make_image_grid([original_image, canny_image, image], rows=1, cols=3)
```
-
-You can use [`StableDiffusionXLControlNetPipeline`] in guess mode as well by setting the parameter to `True`:
+## guess_mode
+
+[Guess mode](https://github.com/lllyasviel/ControlNet/discussions/188) generates an image from **only** the control input (canny edge, depth map, pose, etc.) and without guidance from a prompt. It adjusts the scale of the ControlNet's output residuals by a fixed ratio depending on block depth: the shallowest `DownBlock` is scaled by `0.1`, and the scale increases with depth until the `MidBlock` output is fully scaled by `1.0`.
```py
-from diffusers import StableDiffusionXLControlNetPipeline, ControlNetModel, AutoencoderKL
-from diffusers.utils import load_image, make_image_grid
-import numpy as np
import torch
-import cv2
-from PIL import Image
-
-prompt = "aerial view, a futuristic research complex in a bright foggy jungle, hard lighting"
-negative_prompt = "low quality, bad quality, sketches"
-
-original_image = load_image(
- "https://hf.co/datasets/hf-internal-testing/diffusers-images/resolve/main/sd_controlnet/hf-logo.png"
-)
+from diffusers.utils import load_image
+from diffusers import StableDiffusionXLControlNetPipeline, ControlNetModel
controlnet = ControlNetModel.from_pretrained(
- "diffusers/controlnet-canny-sdxl-1.0", torch_dtype=torch.float16, use_safetensors=True
+ "diffusers/controlnet-canny-sdxl-1.0", torch_dtype=torch.float16
)
-vae = AutoencoderKL.from_pretrained("madebyollin/sdxl-vae-fp16-fix", torch_dtype=torch.float16, use_safetensors=True)
-pipe = StableDiffusionXLControlNetPipeline.from_pretrained(
- "stabilityai/stable-diffusion-xl-base-1.0", controlnet=controlnet, vae=vae, torch_dtype=torch.float16, use_safetensors=True
-)
-pipe.enable_model_cpu_offload()
-
-image = np.array(original_image)
-image = cv2.Canny(image, 100, 200)
-image = image[:, :, None]
-image = np.concatenate([image, image, image], axis=2)
-canny_image = Image.fromarray(image)
-
-image = pipe(
- prompt, negative_prompt=negative_prompt, controlnet_conditioning_scale=0.5, image=canny_image, guess_mode=True,
+pipeline = StableDiffusionXLControlNetPipeline.from_pretrained(
+ "stabilityai/stable-diffusion-xl-base-1.0",
+ controlnet=controlnet,
+ torch_dtype=torch.float16
+).to("cuda")
+
+canny_image = load_image("https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/canny-cat.png")
+pipeline(
+ "",
+ image=canny_image,
+ guess_mode=True
).images[0]
-make_image_grid([original_image, canny_image, image], rows=1, cols=3)
```
-
-
-You can use a refiner model with `StableDiffusionXLControlNetPipeline` to improve image quality, just like you can with a regular `StableDiffusionXLPipeline`.
-See the [Refine image quality](./sdxl#refine-image-quality) section to learn how to use the refiner model.
-Make sure to use `StableDiffusionXLControlNetPipeline` and pass `image` and `controlnet_conditioning_scale`.
-
-```py
-base = StableDiffusionXLControlNetPipeline(...)
-image = base(
- prompt=prompt,
- controlnet_conditioning_scale=0.5,
- image=canny_image,
- num_inference_steps=40,
- denoising_end=0.8,
- output_type="latent",
-).images
-# rest exactly as with StableDiffusionXLPipeline
-```
-
-
-
-## MultiControlNet
-
-
-
-Replace the SDXL model with a model like [stable-diffusion-v1-5/stable-diffusion-v1-5](https://huggingface.co/stable-diffusion-v1-5/stable-diffusion-v1-5) to use multiple conditioning inputs with Stable Diffusion models.
-
-
-
-You can compose multiple ControlNet conditionings from different image inputs to create a *MultiControlNet*. To get better results, it is often helpful to:
-
-1. mask conditionings such that they don't overlap (for example, mask the area of a canny image where the pose conditioning is located)
-2. experiment with the [`controlnet_conditioning_scale`](https://huggingface.co/docs/diffusers/main/en/api/pipelines/controlnet#diffusers.StableDiffusionControlNetPipeline.__call__.controlnet_conditioning_scale) parameter to determine how much weight to assign to each conditioning input
-
-In this example, you'll combine a canny image and a human pose estimation image to generate a new image.
-
-Prepare the canny image conditioning:
-
-```py
-from diffusers.utils import load_image, make_image_grid
-from PIL import Image
-import numpy as np
-import cv2
-
-original_image = load_image(
- "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/landscape.png"
-)
-image = np.array(original_image)
-
-low_threshold = 100
-high_threshold = 200
-
-image = cv2.Canny(image, low_threshold, high_threshold)
-
-# zero out middle columns of image where pose will be overlaid
-zero_start = image.shape[1] // 4
-zero_end = zero_start + image.shape[1] // 2
-image[:, zero_start:zero_end] = 0
-
-image = image[:, :, None]
-image = np.concatenate([image, image, image], axis=2)
-canny_image = Image.fromarray(image)
-make_image_grid([original_image, canny_image], rows=1, cols=2)
-```
-
-
-
-
- original image
-
-
-
- canny image
-
-
-
-For human pose estimation, install [controlnet_aux](https://github.com/patrickvonplaten/controlnet_aux):
-
-```py
-# uncomment to install the necessary library in Colab
-#!pip install -q controlnet-aux
-```
-
-Prepare the human pose estimation conditioning:
-
-```py
-from controlnet_aux import OpenposeDetector
-
-openpose = OpenposeDetector.from_pretrained("lllyasviel/ControlNet")
-original_image = load_image(
- "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/person.png"
-)
-openpose_image = openpose(original_image)
-make_image_grid([original_image, openpose_image], rows=1, cols=2)
-```
-
-
-
-
- original image
-
-
-
- human pose image
-
-
-
-Load a list of ControlNet models that correspond to each conditioning, and pass them to the [`StableDiffusionXLControlNetPipeline`]. Use the faster [`UniPCMultistepScheduler`] and enable model offloading to reduce memory usage.
-
-```py
-from diffusers import StableDiffusionXLControlNetPipeline, ControlNetModel, AutoencoderKL, UniPCMultistepScheduler
-import torch
-
-controlnets = [
- ControlNetModel.from_pretrained(
- "thibaud/controlnet-openpose-sdxl-1.0", torch_dtype=torch.float16
- ),
- ControlNetModel.from_pretrained(
- "diffusers/controlnet-canny-sdxl-1.0", torch_dtype=torch.float16, use_safetensors=True
- ),
-]
-
-vae = AutoencoderKL.from_pretrained("madebyollin/sdxl-vae-fp16-fix", torch_dtype=torch.float16, use_safetensors=True)
-pipe = StableDiffusionXLControlNetPipeline.from_pretrained(
- "stabilityai/stable-diffusion-xl-base-1.0", controlnet=controlnets, vae=vae, torch_dtype=torch.float16, use_safetensors=True
-)
-pipe.scheduler = UniPCMultistepScheduler.from_config(pipe.scheduler.config)
-pipe.enable_model_cpu_offload()
-```
-
-Now you can pass your prompt (an optional negative prompt if you're using one), canny image, and pose image to the pipeline:
-
-```py
-prompt = "a giant standing in a fantasy landscape, best quality"
-negative_prompt = "monochrome, lowres, bad anatomy, worst quality, low quality"
-
-generator = torch.manual_seed(1)
-
-images = [openpose_image.resize((1024, 1024)), canny_image.resize((1024, 1024))]
-
-images = pipe(
- prompt,
- image=images,
- num_inference_steps=25,
- generator=generator,
- negative_prompt=negative_prompt,
- num_images_per_prompt=3,
- controlnet_conditioning_scale=[1.0, 0.8],
-).images
-make_image_grid([original_image, canny_image, openpose_image,
- images[0].resize((512, 512)), images[1].resize((512, 512)), images[2].resize((512, 512))], rows=2, cols=3)
-```
-
-
-
-
+
+
+
+ canny image
+
+
+
+ generated image
+
+
\ No newline at end of file
diff --git a/docs/source/en/using-diffusers/dreambooth.md b/docs/source/en/using-diffusers/dreambooth.md
new file mode 100644
index 000000000000..6c37124cb7ff
--- /dev/null
+++ b/docs/source/en/using-diffusers/dreambooth.md
@@ -0,0 +1,35 @@
+
+
+# DreamBooth
+
+[DreamBooth](https://huggingface.co/papers/2208.12242) is a method for generating personalized images of a specific instance. It works by fine-tuning the model on 3-5 images of the subject (for example, a cat) that is associated with a unique identifier (`sks cat`). This allows you to use `sks cat` in your prompt to trigger the model to generate images of your cat in different settings, lighting, poses, and styles.
+
+DreamBooth checkpoints are typically a few GBs in size because they contain the full model weights.
+
+Load the DreamBooth checkpoint with [`~DiffusionPipeline.from_pretrained`] and include the unique identifier in the prompt to activate its generation.
+
+```py
+import torch
+from diffusers import AutoPipelineForText2Image
+
+pipeline = AutoPipelineForText2Image.from_pretrained(
+ "sd-dreambooth-library/herge-style",
+ torch_dtype=torch.float16
+).to("cuda")
+prompt = "A cute sks herge_style brown bear eating a slice of pizza, stunning color scheme, masterpiece, illustration"
+pipeline(prompt).images[0]
+```
+
+
+
+
\ No newline at end of file
diff --git a/docs/source/en/using-diffusers/ip_adapter.md b/docs/source/en/using-diffusers/ip_adapter.md
index 5f483fbbdfee..4dad3fc749bc 100644
--- a/docs/source/en/using-diffusers/ip_adapter.md
+++ b/docs/source/en/using-diffusers/ip_adapter.md
@@ -12,172 +12,149 @@ specific language governing permissions and limitations under the License.
# IP-Adapter
-[IP-Adapter](https://hf.co/papers/2308.06721) is an image prompt adapter that can be plugged into diffusion models to enable image prompting without any changes to the underlying model. Furthermore, this adapter can be reused with other models finetuned from the same base model and it can be combined with other adapters like [ControlNet](../using-diffusers/controlnet). The key idea behind IP-Adapter is the *decoupled cross-attention* mechanism which adds a separate cross-attention layer just for image features instead of using the same cross-attention layer for both text and image features. This allows the model to learn more image-specific features.
+[IP-Adapter](https://huggingface.co/papers/2308.06721) is a lightweight adapter that adds image-based guidance to text-to-image diffusion models. An image encoder extracts image features which are passed to newly added cross-attention layers in the UNet, and only these new layers are fine-tuned. The original UNet and the existing cross-attention layers for text features are frozen. Decoupling the cross-attention for image and text features enables more fine-grained and controllable generation.
-> [!TIP]
-> Learn how to load an IP-Adapter in the [Load adapters](../using-diffusers/loading_adapters#ip-adapter) guide, and make sure you check out the [IP-Adapter Plus](../using-diffusers/loading_adapters#ip-adapter-plus) section which requires manually loading the image encoder.
-
-This guide will walk you through using IP-Adapter for various tasks and use cases.
-
-## General tasks
-
-Let's take a look at how to use IP-Adapter's image prompting capabilities with the [`StableDiffusionXLPipeline`] for tasks like text-to-image, image-to-image, and inpainting. We also encourage you to try out other pipelines such as Stable Diffusion, LCM-LoRA, ControlNet, T2I-Adapter, or AnimateDiff!
-
-In all the following examples, you'll see the [`~loaders.IPAdapterMixin.set_ip_adapter_scale`] method. This method controls the amount of text or image conditioning to apply to the model. A value of `1.0` means the model is only conditioned on the image prompt. Lowering this value encourages the model to produce more diverse images, but they may not be as aligned with the image prompt. Typically, a value of `0.5` achieves a good balance between the two prompt types and produces good results.
+IP-Adapter files are typically ~100MBs because they only contain the image embeddings. This means you need to load a model first, and then load the IP-Adapter with [`~loaders.IPAdapterMixin.load_ip_adapter`].
> [!TIP]
-> In the examples below, try adding `low_cpu_mem_usage=True` to the [`~loaders.IPAdapterMixin.load_ip_adapter`] method to speed up the loading time.
-
-
-
+> IP-Adapters are available for many models, such as [Flux](../api/pipelines/flux#ip-adapter) and [Stable Diffusion 3](../api/pipelines/stable_diffusion/stable_diffusion_3). The examples in this guide use Stable Diffusion and Stable Diffusion XL.
-Crafting the precise text prompt to generate the image you want can be difficult because it may not always capture what you'd like to express. Adding an image alongside the text prompt helps the model better understand what it should generate and can lead to more accurate results.
-
-Load a Stable Diffusion XL (SDXL) model and insert an IP-Adapter into the model with the [`~loaders.IPAdapterMixin.load_ip_adapter`] method. Use the `subfolder` parameter to load the SDXL model weights.
+Use the [`~loaders.IPAdapterMixin.set_ip_adapter_scale`] method to scale the influence of the IP-Adapter during generation. A value of `1.0` means the model is only conditioned on the image prompt, and `0.5` typically produces a good balance between the text and image prompts.
```py
+import torch
from diffusers import AutoPipelineForText2Image
from diffusers.utils import load_image
-import torch
-pipeline = AutoPipelineForText2Image.from_pretrained("stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16).to("cuda")
-pipeline.load_ip_adapter("h94/IP-Adapter", subfolder="sdxl_models", weight_name="ip-adapter_sdxl.bin")
-pipeline.set_ip_adapter_scale(0.6)
+pipeline = AutoPipelineForText2Image.from_pretrained(
+ "stabilityai/stable-diffusion-xl-base-1.0",
+ torch_dtype=torch.float16
+).to("cuda")
+pipeline.load_ip_adapter(
+ "h94/IP-Adapter",
+ subfolder="sdxl_models",
+ weight_name="ip-adapter_sdxl.bin"
+)
+pipeline.set_ip_adapter_scale(0.8)
```
-Create a text prompt and load an image prompt before passing them to the pipeline to generate an image.
+Pass an image to `ip_adapter_image` along with a text prompt to generate an image.
```py
image = load_image("https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/ip_adapter_diner.png")
-generator = torch.Generator(device="cpu").manual_seed(0)
-images = pipeline(
+pipeline(
prompt="a polar bear sitting in a chair drinking a milkshake",
ip_adapter_image=image,
negative_prompt="deformed, ugly, wrong proportion, low res, bad anatomy, worst quality, low quality",
- num_inference_steps=100,
- generator=generator,
-).images
-images[0]
+).images[0]
```
-
-
-
- IP-Adapter image
-
-
-
- generated image
-
+
+
+
+ IP-Adapter image
+
+
+
+ generated image
+
-
-
-
-IP-Adapter can also help with image-to-image by guiding the model to generate an image that resembles the original image and the image prompt.
+Take a look at the examples below to learn how to use IP-Adapter for other tasks.
-Load a Stable Diffusion XL (SDXL) model and insert an IP-Adapter into the model with the [`~loaders.IPAdapterMixin.load_ip_adapter`] method. Use the `subfolder` parameter to load the SDXL model weights.
+
+
```py
+import torch
from diffusers import AutoPipelineForImage2Image
from diffusers.utils import load_image
-import torch
-pipeline = AutoPipelineForImage2Image.from_pretrained("stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16).to("cuda")
-pipeline.load_ip_adapter("h94/IP-Adapter", subfolder="sdxl_models", weight_name="ip-adapter_sdxl.bin")
-pipeline.set_ip_adapter_scale(0.6)
-```
-
-Pass the original image and the IP-Adapter image prompt to the pipeline to generate an image. Providing a text prompt to the pipeline is optional, but in this example, a text prompt is used to increase image quality.
+pipeline = AutoPipelineForImage2Image.from_pretrained(
+ "stabilityai/stable-diffusion-xl-base-1.0",
+ torch_dtype=torch.float16
+).to("cuda")
+pipeline.load_ip_adapter(
+ "h94/IP-Adapter",
+ subfolder="sdxl_models",
+ weight_name="ip-adapter_sdxl.bin"
+)
+pipeline.set_ip_adapter_scale(0.8)
-```py
image = load_image("https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/ip_adapter_bear_1.png")
-ip_image = load_image("https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/ip_adapter_bear_2.png")
-
-generator = torch.Generator(device="cpu").manual_seed(4)
-images = pipeline(
+ip_image = load_image("https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/ip_adapter_gummy.png")
+pipeline(
prompt="best quality, high quality",
image=image,
ip_adapter_image=ip_image,
- generator=generator,
- strength=0.6,
-).images
-images[0]
+ strength=0.5,
+).images[0]
```
-
-
-
-IP-Adapter is also useful for inpainting because the image prompt allows you to be much more specific about what you'd like to generate.
-
-Load a Stable Diffusion XL (SDXL) model and insert an IP-Adapter into the model with the [`~loaders.IPAdapterMixin.load_ip_adapter`] method. Use the `subfolder` parameter to load the SDXL model weights.
+
```py
-from diffusers import AutoPipelineForInpainting
-from diffusers.utils import load_image
import torch
+from diffusers import AutoPipelineForImage2Image
+from diffusers.utils import load_image
-pipeline = AutoPipelineForInpainting.from_pretrained("diffusers/stable-diffusion-xl-1.0-inpainting-0.1", torch_dtype=torch.float16).to("cuda")
-pipeline.load_ip_adapter("h94/IP-Adapter", subfolder="sdxl_models", weight_name="ip-adapter_sdxl.bin")
+pipeline = AutoPipelineForImage2Image.from_pretrained(
+ "stabilityai/stable-diffusion-xl-base-1.0",
+ torch_dtype=torch.float16
+).to("cuda")
+pipeline.load_ip_adapter(
+ "h94/IP-Adapter",
+ subfolder="sdxl_models",
+ weight_name="ip-adapter_sdxl.bin"
+)
pipeline.set_ip_adapter_scale(0.6)
-```
-
-Pass a prompt, the original image, mask image, and the IP-Adapter image prompt to the pipeline to generate an image.
-```py
mask_image = load_image("https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/ip_adapter_mask.png")
image = load_image("https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/ip_adapter_bear_1.png")
ip_image = load_image("https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/ip_adapter_gummy.png")
-
-generator = torch.Generator(device="cpu").manual_seed(4)
-images = pipeline(
+pipeline(
prompt="a cute gummy bear waving",
image=image,
mask_image=mask_image,
ip_adapter_image=ip_image,
- generator=generator,
- num_inference_steps=100,
-).images
-images[0]
+).images[0]
```
-
-
-
-IP-Adapter can also help you generate videos that are more aligned with your text prompt. For example, let's load [AnimateDiff](../api/pipelines/animatediff) with its motion adapter and insert an IP-Adapter into the model with the [`~loaders.IPAdapterMixin.load_ip_adapter`] method.
+
-> [!WARNING]
-> If you're planning on offloading the model to the CPU, make sure you run it after you've loaded the IP-Adapter. When you call [`~DiffusionPipeline.enable_model_cpu_offload`] before loading the IP-Adapter, it offloads the image encoder module to the CPU and it'll return an error when you try to run the pipeline.
+The [`~DiffusionPipeline.enable_model_cpu_offload`] method is useful for reducing memory usage, and it should be called **after** the IP-Adapter is loaded. Otherwise, the IP-Adapter's image encoder is also offloaded to the CPU and the pipeline returns an error.
```py
import torch
@@ -185,8 +162,15 @@ from diffusers import AnimateDiffPipeline, DDIMScheduler, MotionAdapter
from diffusers.utils import export_to_gif
from diffusers.utils import load_image
-adapter = MotionAdapter.from_pretrained("guoyww/animatediff-motion-adapter-v1-5-2", torch_dtype=torch.float16)
-pipeline = AnimateDiffPipeline.from_pretrained("emilianJR/epiCRealism", motion_adapter=adapter, torch_dtype=torch.float16)
+adapter = MotionAdapter.from_pretrained(
+ "guoyww/animatediff-motion-adapter-v1-5-2",
+ torch_dtype=torch.float16
+)
+pipeline = AnimateDiffPipeline.from_pretrained(
+ "emilianJR/epiCRealism",
+ motion_adapter=adapter,
+ torch_dtype=torch.float16
+)
scheduler = DDIMScheduler.from_pretrained(
"emilianJR/epiCRealism",
subfolder="scheduler",
@@ -197,60 +181,123 @@ scheduler = DDIMScheduler.from_pretrained(
)
pipeline.scheduler = scheduler
pipeline.enable_vae_slicing()
-
pipeline.load_ip_adapter("h94/IP-Adapter", subfolder="models", weight_name="ip-adapter_sd15.bin")
pipeline.enable_model_cpu_offload()
-```
-Pass a prompt and an image prompt to the pipeline to generate a short video.
-
-```py
ip_adapter_image = load_image("https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/ip_adapter_inpaint.png")
-
-output = pipeline(
+pipeline(
prompt="A cute gummy bear waving",
negative_prompt="bad quality, worse quality, low resolution",
ip_adapter_image=ip_adapter_image,
num_frames=16,
guidance_scale=7.5,
num_inference_steps=50,
- generator=torch.Generator(device="cpu").manual_seed(0),
-)
-frames = output.frames[0]
-export_to_gif(frames, "gummy_bear.gif")
+).frames[0]
```
-
-
-
- IP-Adapter image
-
-
-
- generated video
-
+
+
+
+ IP-Adapter image
+
+
+
+ generated video
+
-## Configure parameters
+## Model variants
-There are a couple of IP-Adapter parameters that are useful to know about and can help you with your image generation tasks. These parameters can make your workflow more efficient or give you more control over image generation.
+There are two variants of IP-Adapter, Plus and FaceID. The Plus variant uses patch embeddings and the ViT-H image encoder. The FaceID variant uses face embeddings generated by InsightFace.
-### Image embeddings
+
+
-IP-Adapter enabled pipelines provide the `ip_adapter_image_embeds` parameter to accept precomputed image embeddings. This is particularly useful in scenarios where you need to run the IP-Adapter pipeline multiple times because you have more than one image. For example, [multi IP-Adapter](#multi-ip-adapter) is a specific use case where you provide multiple styling images to generate a specific image in a specific style. Loading and encoding multiple images each time you use the pipeline would be inefficient. Instead, you can precompute and save the image embeddings to disk (which can save a lot of space if you're using high-quality images) and load them when you need them.
+```py
+import torch
+from transformers import CLIPVisionModelWithProjection
+from diffusers import AutoPipelineForText2Image
-> [!TIP]
-> This parameter also gives you the flexibility to load embeddings from other sources. For example, ComfyUI image embeddings for IP-Adapters are compatible with Diffusers and should work ouf-of-the-box!
+image_encoder = CLIPVisionModelWithProjection.from_pretrained(
+ "h94/IP-Adapter",
+ subfolder="models/image_encoder",
+ torch_dtype=torch.float16
+)
-Call the [`~StableDiffusionPipeline.prepare_ip_adapter_image_embeds`] method to encode and generate the image embeddings. Then you can save them to disk with `torch.save`.
+pipeline = AutoPipelineForText2Image.from_pretrained(
+ "stabilityai/stable-diffusion-xl-base-1.0",
+ image_encoder=image_encoder,
+ torch_dtype=torch.float16
+).to("cuda")
-> [!TIP]
-> If you're using IP-Adapter with `ip_adapter_image_embedding` instead of `ip_adapter_image`', you can set `load_ip_adapter(image_encoder_folder=None,...)` because you don't need to load an encoder to generate the image embeddings.
+pipeline.load_ip_adapter(
+ "h94/IP-Adapter",
+ subfolder="sdxl_models",
+ weight_name="ip-adapter-plus_sdxl_vit-h.safetensors"
+)
+```
+
+
+
+
+```py
+import torch
+from diffusers import AutoPipelineForText2Image
+
+pipeline = AutoPipelineForText2Image.from_pretrained(
+ "stabilityai/stable-diffusion-xl-base-1.0",
+ torch_dtype=torch.float16
+).to("cuda")
+
+pipeline.load_ip_adapter(
+ "h94/IP-Adapter-FaceID",
+ subfolder=None,
+ weight_name="ip-adapter-faceid_sdxl.bin",
+ image_encoder_folder=None
+)
+```
+
+To use an IP-Adapter FaceID Plus model, also load the CLIP image encoder with [`~transformers.CLIPVisionModelWithProjection`].
+
+```py
+import torch
+from transformers import CLIPVisionModelWithProjection
+from diffusers import AutoPipelineForText2Image
+
+image_encoder = CLIPVisionModelWithProjection.from_pretrained(
+ "laion/CLIP-ViT-H-14-laion2B-s32B-b79K",
+ torch_dtype=torch.float16,
+)
+
+pipeline = AutoPipelineForText2Image.from_pretrained(
+ "stable-diffusion-v1-5/stable-diffusion-v1-5",
+ image_encoder=image_encoder,
+ torch_dtype=torch.float16
+).to("cuda")
+
+pipeline.load_ip_adapter(
+ "h94/IP-Adapter-FaceID",
+ subfolder=None,
+ weight_name="ip-adapter-faceid-plus_sd15.bin"
+)
+```
+
+
+
+
+## Image embeddings
+
+The `prepare_ip_adapter_image_embeds` method generates image embeddings you can reuse if you're running the pipeline multiple times with more than one image. Loading and encoding multiple images each time you use the pipeline is inefficient. It is more efficient to precompute the image embeddings, save them to disk, and load them when you need them.
```py
+import torch
+from diffusers import AutoPipelineForText2Image
+from diffusers.utils import load_image
+
+pipeline = AutoPipelineForText2Image.from_pretrained(
+    "stabilityai/stable-diffusion-xl-base-1.0",
+    torch_dtype=torch.float16
+).to("cuda")
+# the IP-Adapter must be loaded before its image embeddings can be computed
+pipeline.load_ip_adapter("h94/IP-Adapter", subfolder="sdxl_models", weight_name="ip-adapter_sdxl.bin")
+image = load_image("https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/ip_adapter_diner.png")
+
image_embeds = pipeline.prepare_ip_adapter_image_embeds(
ip_adapter_image=image,
ip_adapter_image_embeds=None,
@@ -262,117 +309,123 @@ image_embeds = pipeline.prepare_ip_adapter_image_embeds(
torch.save(image_embeds, "image_embeds.ipadpt")
```
-Now load the image embeddings by passing them to the `ip_adapter_image_embeds` parameter.
+Reload the image embeddings by passing them to the `ip_adapter_image_embeds` parameter. Set `image_encoder_folder` to `None` because the image encoder is no longer needed once the embeddings are precomputed.
+
+> [!TIP]
+> You can also load image embeddings from other sources such as ComfyUI.
```py
+pipeline.load_ip_adapter(
+ "h94/IP-Adapter",
+ subfolder="sdxl_models",
+ image_encoder_folder=None,
+ weight_name="ip-adapter_sdxl.bin"
+)
+pipeline.set_ip_adapter_scale(0.8)
image_embeds = torch.load("image_embeds.ipadpt")
-images = pipeline(
+pipeline(
prompt="a polar bear sitting in a chair drinking a milkshake",
ip_adapter_image_embeds=image_embeds,
negative_prompt="deformed, ugly, wrong proportion, low res, bad anatomy, worst quality, low quality",
num_inference_steps=100,
generator=generator,
-).images
+).images[0]
```
-### IP-Adapter masking
+## Masking
-Binary masks specify which portion of the output image should be assigned to an IP-Adapter. This is useful for composing more than one IP-Adapter image. For each input IP-Adapter image, you must provide a binary mask.
+Binary masking enables assigning an IP-Adapter image to a specific area of the output image, making it useful for composing multiple IP-Adapter images. Each IP-Adapter image requires a binary mask.
-To start, preprocess the input IP-Adapter images with the [`~image_processor.IPAdapterMaskProcessor.preprocess()`] to generate their masks. For optimal results, provide the output height and width to [`~image_processor.IPAdapterMaskProcessor.preprocess()`]. This ensures masks with different aspect ratios are appropriately stretched. If the input masks already match the aspect ratio of the generated image, you don't have to set the `height` and `width`.
+Load the [`~image_processor.IPAdapterMaskProcessor`] to preprocess the image masks. For the best results, provide the output `height` and `width` to ensure masks with different aspect ratios are appropriately sized. If the input masks already match the aspect ratio of the generated image, you don't need to set the `height` and `width`.
```py
+import torch
+from diffusers import AutoPipelineForText2Image
from diffusers.image_processor import IPAdapterMaskProcessor
+from diffusers.utils import load_image
+
+pipeline = AutoPipelineForText2Image.from_pretrained(
+ "stabilityai/stable-diffusion-xl-base-1.0",
+ torch_dtype=torch.float16
+).to("cuda")
mask1 = load_image("https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/ip_mask_mask1.png")
mask2 = load_image("https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/ip_mask_mask2.png")
-output_height = 1024
-output_width = 1024
-
processor = IPAdapterMaskProcessor()
-masks = processor.preprocess([mask1, mask2], height=output_height, width=output_width)
+masks = processor.preprocess([mask1, mask2], height=1024, width=1024)
```
-
-
-
- mask one
-
-
-
- mask two
-
+
+
+
+ mask 1
+
+
+
+ mask 2
+
-When there is more than one input IP-Adapter image, load them as a list and provide the IP-Adapter scale list. Each of the input IP-Adapter images here corresponds to one of the masks generated above.
+Provide both the IP-Adapter images and their scales as a list. Pass the preprocessed masks to `cross_attention_kwargs` in the pipeline.
```py
-pipeline.load_ip_adapter("h94/IP-Adapter", subfolder="sdxl_models", weight_name=["ip-adapter-plus-face_sdxl_vit-h.safetensors"])
-pipeline.set_ip_adapter_scale([[0.7, 0.7]]) # one scale for each image-mask pair
-
face_image1 = load_image("https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/ip_mask_girl1.png")
face_image2 = load_image("https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/ip_mask_girl2.png")
-ip_images = [[face_image1, face_image2]]
+pipeline.load_ip_adapter(
+ "h94/IP-Adapter",
+ subfolder="sdxl_models",
+ weight_name=["ip-adapter-plus-face_sdxl_vit-h.safetensors"]
+)
+pipeline.set_ip_adapter_scale([[0.7, 0.7]])
+ip_images = [[face_image1, face_image2]]
masks = [masks.reshape(1, masks.shape[0], masks.shape[2], masks.shape[3])]
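+
+# a sketch of the generation call; the prompt here is illustrative
+# the preprocessed masks are passed to the pipeline through cross_attention_kwargs
+pipeline(
+    prompt="2 girls",
+    ip_adapter_image=ip_images,
+    negative_prompt="monochrome, lowres, bad anatomy, worst quality, low quality",
+    cross_attention_kwargs={"ip_adapter_masks": masks},
+).images[0]
+```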
-```
-
+
+
+ generated with mask
+
+
+
+ generated without mask
+
-## Specific use cases
-
-IP-Adapter's image prompting and compatibility with other adapters and models makes it a versatile tool for a variety of use cases. This section covers some of the more popular applications of IP-Adapter, and we can't wait to see what you come up with!
+## Applications
-### Face model
+The section below covers some popular applications of IP-Adapter.
-Generating accurate faces is challenging because they are complex and nuanced. Diffusers supports two IP-Adapter checkpoints specifically trained to generate faces from the [h94/IP-Adapter](https://huggingface.co/h94/IP-Adapter) repository:
+### Face models
-* [ip-adapter-full-face_sd15.safetensors](https://huggingface.co/h94/IP-Adapter/blob/main/models/ip-adapter-full-face_sd15.safetensors) is conditioned with images of cropped faces and removed backgrounds
-* [ip-adapter-plus-face_sd15.safetensors](https://huggingface.co/h94/IP-Adapter/blob/main/models/ip-adapter-plus-face_sd15.safetensors) uses patch embeddings and is conditioned with images of cropped faces
+Generating accurate faces and preserving their details can be challenging. To help, there are checkpoints specifically conditioned on images of cropped faces. You can find the face models in the [h94/IP-Adapter](https://huggingface.co/h94/IP-Adapter) or [h94/IP-Adapter-FaceID](https://huggingface.co/h94/IP-Adapter-FaceID) repositories. The FaceID checkpoints use FaceID embeddings from [InsightFace](https://github.com/deepinsight/insightface) instead of CLIP image embeddings.
-Additionally, Diffusers supports all IP-Adapter checkpoints trained with face embeddings extracted by `insightface` face models. Supported models are from the [h94/IP-Adapter-FaceID](https://huggingface.co/h94/IP-Adapter-FaceID) repository.
+We recommend using the [`DDIMScheduler`] or [`EulerDiscreteScheduler`] for face models.
-For face models, use the [h94/IP-Adapter](https://huggingface.co/h94/IP-Adapter) checkpoint. It is also recommended to use [`DDIMScheduler`] or [`EulerDiscreteScheduler`] for face models.
+
+
```py
import torch
@@ -380,41 +433,45 @@ from diffusers import StableDiffusionPipeline, DDIMScheduler
from diffusers.utils import load_image
pipeline = StableDiffusionPipeline.from_pretrained(
- "stable-diffusion-v1-5/stable-diffusion-v1-5",
- torch_dtype=torch.float16,
+ "stable-diffusion-v1-5/stable-diffusion-v1-5",
+ torch_dtype=torch.float16,
).to("cuda")
pipeline.scheduler = DDIMScheduler.from_config(pipeline.scheduler.config)
-pipeline.load_ip_adapter("h94/IP-Adapter", subfolder="models", weight_name="ip-adapter-full-face_sd15.bin")
+pipeline.load_ip_adapter(
+ "h94/IP-Adapter",
+ subfolder="models",
+ weight_name="ip-adapter-full-face_sd15.bin"
+)
pipeline.set_ip_adapter_scale(0.5)
-
image = load_image("https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/ip_adapter_einstein_base.png")
-generator = torch.Generator(device="cpu").manual_seed(26)
-image = pipeline(
+pipeline(
prompt="A photo of Einstein as a chef, wearing an apron, cooking in a French restaurant",
ip_adapter_image=image,
negative_prompt="lowres, bad anatomy, worst quality, low quality",
num_inference_steps=100,
- generator=generator,
).images[0]
-image
```
-
-
-
- IP-Adapter image
-
-
-
- generated image
-
+
+
+
+ IP-Adapter image
+
+
+
+ generated image
+
-To use IP-Adapter FaceID models, first extract face embeddings with `insightface`. Then pass the list of tensors to the pipeline as `ip_adapter_image_embeds`.
+
+
+
+For FaceID models, extract the face embeddings and pass them as a list of tensors to `ip_adapter_image_embeds`.
```py
+# pip install insightface
import torch
from diffusers import StableDiffusionPipeline, DDIMScheduler
from diffusers.utils import load_image
@@ -425,7 +482,12 @@ pipeline = StableDiffusionPipeline.from_pretrained(
torch_dtype=torch.float16,
).to("cuda")
pipeline.scheduler = DDIMScheduler.from_config(pipeline.scheduler.config)
-pipeline.load_ip_adapter("h94/IP-Adapter-FaceID", subfolder=None, weight_name="ip-adapter-faceid_sd15.bin", image_encoder_folder=None)
+pipeline.load_ip_adapter(
+ "h94/IP-Adapter-FaceID",
+ subfolder=None,
+ weight_name="ip-adapter-faceid_sd15.bin",
+ image_encoder_folder=None
+)
pipeline.set_ip_adapter_scale(0.6)
image = load_image("https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/ip_mask_girl1.png")
@@ -441,50 +503,32 @@ ref_images_embeds = torch.stack(ref_images_embeds, dim=0).unsqueeze(0)
neg_ref_images_embeds = torch.zeros_like(ref_images_embeds)
id_embeds = torch.cat([neg_ref_images_embeds, ref_images_embeds]).to(dtype=torch.float16, device="cuda")
-generator = torch.Generator(device="cpu").manual_seed(42)
-
-images = pipeline(
+pipeline(
prompt="A photo of a girl",
ip_adapter_image_embeds=[id_embeds],
negative_prompt="monochrome, lowres, bad anatomy, worst quality, low quality",
- num_inference_steps=20, num_images_per_prompt=1,
- generator=generator
-).images
+).images[0]
```
-Both IP-Adapter FaceID Plus and Plus v2 models require CLIP image embeddings. You can prepare face embeddings as shown previously, then you can extract and pass CLIP embeddings to the hidden image projection layers.
+The IP-Adapter FaceID Plus and Plus v2 models require CLIP image embeddings. Prepare the face embeddings and then extract and pass the CLIP embeddings to the hidden image projection layers.
```py
-from insightface.utils import face_align
-
-ref_images_embeds = []
-ip_adapter_images = []
-app = FaceAnalysis(name="buffalo_l", providers=['CUDAExecutionProvider', 'CPUExecutionProvider'])
-app.prepare(ctx_id=0, det_size=(640, 640))
-image = cv2.cvtColor(np.asarray(image), cv2.COLOR_BGR2RGB)
-faces = app.get(image)
-ip_adapter_images.append(face_align.norm_crop(image, landmark=faces[0].kps, image_size=224))
-image = torch.from_numpy(faces[0].normed_embedding)
-ref_images_embeds.append(image.unsqueeze(0))
-ref_images_embeds = torch.stack(ref_images_embeds, dim=0).unsqueeze(0)
-neg_ref_images_embeds = torch.zeros_like(ref_images_embeds)
-id_embeds = torch.cat([neg_ref_images_embeds, ref_images_embeds]).to(dtype=torch.float16, device="cuda")
-
clip_embeds = pipeline.prepare_ip_adapter_image_embeds(
[ip_adapter_images], None, torch.device("cuda"), num_images, True)[0]
pipeline.unet.encoder_hid_proj.image_projection_layers[0].clip_embeds = clip_embeds.to(dtype=torch.float16)
-pipeline.unet.encoder_hid_proj.image_projection_layers[0].shortcut = False # True if Plus v2
+# set to True if using IP-Adapter FaceID Plus v2
+pipeline.unet.encoder_hid_proj.image_projection_layers[0].shortcut = False
```
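+
+With the CLIP embeddings and the `shortcut` flag set, generation works the same way as the FaceID example above. This is a minimal sketch; the prompt is illustrative.
+
+```py
+pipeline(
+    prompt="A photo of a girl",
+    ip_adapter_image_embeds=[id_embeds],
+    negative_prompt="monochrome, lowres, bad anatomy, worst quality, low quality",
+).images[0]
+```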
-### Multi IP-Adapter
+
+
-More than one IP-Adapter can be used at the same time to generate specific images in more diverse styles. For example, you can use IP-Adapter-Face to generate consistent faces and characters, and IP-Adapter Plus to generate those faces in a specific style.
+### Multiple IP-Adapters
-> [!TIP]
-> Read the [IP-Adapter Plus](../using-diffusers/loading_adapters#ip-adapter-plus) section to learn why you need to manually load the image encoder.
+Combine multiple IP-Adapters to generate images in more diverse styles. For example, you can use IP-Adapter Face to generate consistent faces and characters and IP-Adapter Plus to generate those faces in specific styles.
-Load the image encoder with [`~transformers.CLIPVisionModelWithProjection`].
+Load an image encoder with [`~transformers.CLIPVisionModelWithProjection`].
```py
import torch
@@ -499,10 +543,10 @@ image_encoder = CLIPVisionModelWithProjection.from_pretrained(
)
```
-Next, you'll load a base model, scheduler, and the IP-Adapters. The IP-Adapters to use are passed as a list to the `weight_name` parameter:
+Load a base model, scheduler and the following IP-Adapters.
-* [ip-adapter-plus_sdxl_vit-h](https://huggingface.co/h94/IP-Adapter#ip-adapter-for-sdxl-10) uses patch embeddings and a ViT-H image encoder
-* [ip-adapter-plus-face_sdxl_vit-h](https://huggingface.co/h94/IP-Adapter#ip-adapter-for-sdxl-10) has the same architecture but it is conditioned with images of cropped faces
+- [ip-adapter-plus_sdxl_vit-h](https://huggingface.co/h94/IP-Adapter#ip-adapter-for-sdxl-10) uses patch embeddings and a ViT-H image encoder
+- [ip-adapter-plus-face_sdxl_vit-h](https://huggingface.co/h94/IP-Adapter#ip-adapter-for-sdxl-10) uses patch embeddings and a ViT-H image encoder but it is conditioned on images of cropped faces
```py
pipeline = AutoPipelineForText2Image.from_pretrained(
@@ -517,10 +561,11 @@ pipeline.load_ip_adapter(
weight_name=["ip-adapter-plus_sdxl_vit-h.safetensors", "ip-adapter-plus-face_sdxl_vit-h.safetensors"]
)
pipeline.set_ip_adapter_scale([0.7, 0.3])
+# enable_model_cpu_offload to reduce memory usage
pipeline.enable_model_cpu_offload()
```
-Load an image prompt and a folder containing images of a certain style you want to use.
+Load an image and a folder containing images of a certain style to apply.
```py
face_image = load_image("https://huggingface.co/datasets/YiYiXu/testing-images/resolve/main/women_input.png")
@@ -528,150 +573,160 @@ style_folder = "https://huggingface.co/datasets/YiYiXu/testing-images/resolve/ma
style_images = [load_image(f"{style_folder}/img{i}.png") for i in range(10)]
```
-
-
-
- IP-Adapter image of face
-
-
-
- IP-Adapter style images
-
+
+
+
+ face image
+
+
+
+ style images
+
-Pass the image prompt and style images as a list to the `ip_adapter_image` parameter, and run the pipeline!
+Pass style and face images as a list to `ip_adapter_image`.
```py
generator = torch.Generator(device="cpu").manual_seed(0)
-image = pipeline(
+pipeline(
prompt="wonderwoman",
ip_adapter_image=[style_images, face_image],
negative_prompt="monochrome, lowres, bad anatomy, worst quality, low quality",
- num_inference_steps=50, num_images_per_prompt=1,
- generator=generator,
).images[0]
-image
```
-
-Â Â
+
+
+
+ generated image
+
### Instant generation
-[Latent Consistency Models (LCM)](../using-diffusers/inference_with_lcm_lora) are diffusion models that can generate images in as little as 4 steps compared to other diffusion models like SDXL that typically require way more steps. This is why image generation with an LCM feels "instantaneous". IP-Adapters can be plugged into an LCM-LoRA model to instantly generate images with an image prompt.
+[Latent Consistency Models (LCM)](../api/pipelines/latent_consistency_models) can generate images in 4 steps or less, unlike other diffusion models which require many more steps, making generation feel "instantaneous". IP-Adapters are compatible with LCM models to instantly generate images from an image prompt.
-The IP-Adapter weights need to be loaded first, then you can use [`~StableDiffusionPipeline.load_lora_weights`] to load the LoRA style and weight you want to apply to your image.
+Load the IP-Adapter weights and load the LoRA weights with [`~loaders.StableDiffusionLoraLoaderMixin.load_lora_weights`].
```py
-from diffusers import DiffusionPipeline, LCMScheduler
import torch
+from diffusers import DiffusionPipeline, LCMScheduler
from diffusers.utils import load_image
-model_id = "sd-dreambooth-library/herge-style"
-lcm_lora_id = "latent-consistency/lcm-lora-sdv1-5"
-
-pipeline = DiffusionPipeline.from_pretrained(model_id, torch_dtype=torch.float16)
+pipeline = DiffusionPipeline.from_pretrained(
+ "sd-dreambooth-library/herge-style",
+ torch_dtype=torch.float16
+)
-pipeline.load_ip_adapter("h94/IP-Adapter", subfolder="models", weight_name="ip-adapter_sd15.bin")
-pipeline.load_lora_weights(lcm_lora_id)
+pipeline.load_ip_adapter(
+ "h94/IP-Adapter",
+ subfolder="models",
+ weight_name="ip-adapter_sd15.bin"
+)
+pipeline.load_lora_weights("latent-consistency/lcm-lora-sdv1-5")
pipeline.scheduler = LCMScheduler.from_config(pipeline.scheduler.config)
+# enable_model_cpu_offload to reduce memory usage
pipeline.enable_model_cpu_offload()
```
-Try using with a lower IP-Adapter scale to condition image generation more on the [herge_style](https://huggingface.co/sd-dreambooth-library/herge-style) checkpoint, and remember to use the special token `herge_style` in your prompt to trigger and apply the style.
+Try using a lower IP-Adapter scale to condition generation more on the style you want to apply, and remember to include the special token in your prompt to trigger it.
```py
pipeline.set_ip_adapter_scale(0.4)
prompt = "herge_style woman in armor, best quality, high quality"
-generator = torch.Generator(device="cpu").manual_seed(0)
ip_adapter_image = load_image("https://user-images.githubusercontent.com/24734142/266492875-2d50d223-8475-44f0-a7c6-08b51cb53572.png")
-image = pipeline(
+pipeline(
prompt=prompt,
ip_adapter_image=ip_adapter_image,
num_inference_steps=4,
guidance_scale=1,
).images[0]
-image
```
-
-Â Â
+
+
+
+ generated image
+
### Structural control
-To control image generation to an even greater degree, you can combine IP-Adapter with a model like [ControlNet](../using-diffusers/controlnet). A ControlNet is also an adapter that can be inserted into a diffusion model to allow for conditioning on an additional control image. The control image can be depth maps, edge maps, pose estimations, and more.
+For structural control, combine IP-Adapter with [ControlNet](../api/pipelines/controlnet) conditioned on depth maps, edge maps, pose estimations, and more.
-Load a [`ControlNetModel`] checkpoint conditioned on depth maps, insert it into a diffusion model, and load the IP-Adapter.
+The example below loads a [`ControlNetModel`] checkpoint conditioned on depth maps and combines it with an IP-Adapter.
```py
-from diffusers import StableDiffusionControlNetPipeline, ControlNetModel
import torch
from diffusers.utils import load_image
+from diffusers import StableDiffusionControlNetPipeline, ControlNetModel
-controlnet_model_path = "lllyasviel/control_v11f1p_sd15_depth"
-controlnet = ControlNetModel.from_pretrained(controlnet_model_path, torch_dtype=torch.float16)
+controlnet = ControlNetModel.from_pretrained(
+ "lllyasviel/control_v11f1p_sd15_depth",
+ torch_dtype=torch.float16
+)
pipeline = StableDiffusionControlNetPipeline.from_pretrained(
- "stable-diffusion-v1-5/stable-diffusion-v1-5", controlnet=controlnet, torch_dtype=torch.float16)
-pipeline.to("cuda")
-pipeline.load_ip_adapter("h94/IP-Adapter", subfolder="models", weight_name="ip-adapter_sd15.bin")
-```
-
-Now load the IP-Adapter image and depth map.
-
-```py
-ip_adapter_image = load_image("https://huggingface.co/datasets/YiYiXu/testing-images/resolve/main/statue.png")
-depth_map = load_image("https://huggingface.co/datasets/YiYiXu/testing-images/resolve/main/depth.png")
+ "stable-diffusion-v1-5/stable-diffusion-v1-5",
+ controlnet=controlnet,
+ torch_dtype=torch.float16
+).to("cuda")
+pipeline.load_ip_adapter(
+ "h94/IP-Adapter",
+ subfolder="models",
+ weight_name="ip-adapter_sd15.bin"
+)
```
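+
+Load the IP-Adapter image and the depth map.
+
+```py
+ip_adapter_image = load_image("https://huggingface.co/datasets/YiYiXu/testing-images/resolve/main/statue.png")
+depth_map = load_image("https://huggingface.co/datasets/YiYiXu/testing-images/resolve/main/depth.png")
+```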
-
-
-
- IP-Adapter image
-
-
-
- depth map
-
-
-
-Pass the depth map and IP-Adapter image to the pipeline to generate an image.
+Pass the depth map and IP-Adapter image to the pipeline.
```py
-generator = torch.Generator(device="cpu").manual_seed(33)
-image = pipeline(
- prompt="best quality, high quality",
- image=depth_map,
- ip_adapter_image=ip_adapter_image,
- negative_prompt="monochrome, lowres, bad anatomy, worst quality, low quality",
- num_inference_steps=50,
- generator=generator,
+pipeline(
+ prompt="best quality, high quality",
+ image=depth_map,
+ ip_adapter_image=ip_adapter_image,
+ negative_prompt="monochrome, lowres, bad anatomy, worst quality, low quality",
).images[0]
-image
```
-
-### Style & layout control
+### Style and layout control
-[InstantStyle](https://arxiv.org/abs/2404.02733) is a plug-and-play method on top of IP-Adapter, which disentangles style and layout from image prompt to control image generation. This way, you can generate images following only the style or layout from image prompt, with significantly improved diversity. This is achieved by only activating IP-Adapters to specific parts of the model.
+For style and layout control, combine IP-Adapter with [InstantStyle](https://huggingface.co/papers/2404.02733). InstantStyle separates *style* (color, texture, overall feel) and *content* from each other. It only applies the style in style-specific blocks of the model to prevent it from distorting other areas of an image. This generates images with stronger and more consistent styles and better control over the layout.
-By default IP-Adapters are inserted to all layers of the model. Use the [`~loaders.IPAdapterMixin.set_ip_adapter_scale`] method with a dictionary to assign scales to IP-Adapter at different layers.
+The IP-Adapter is only activated for specific parts of the model. Use the [`~loaders.IPAdapterMixin.set_ip_adapter_scale`] method to scale the influence of the IP-Adapter in different layers. The example below activates the IP-Adapter in the second layer of the model's down `block_2` and up `block_0`. Down `block_2` is where the IP-Adapter injects layout information, and up `block_0` is where style is injected.
```py
+import torch
from diffusers import AutoPipelineForText2Image
from diffusers.utils import load_image
-import torch
-pipeline = AutoPipelineForText2Image.from_pretrained("stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16).to("cuda")
-pipeline.load_ip_adapter("h94/IP-Adapter", subfolder="sdxl_models", weight_name="ip-adapter_sdxl.bin")
+pipeline = AutoPipelineForText2Image.from_pretrained(
+ "stabilityai/stable-diffusion-xl-base-1.0",
+ torch_dtype=torch.float16
+).to("cuda")
+pipeline.load_ip_adapter(
+ "h94/IP-Adapter",
+ subfolder="sdxl_models",
+ weight_name="ip-adapter_sdxl.bin"
+)
scale = {
"down": {"block_2": [0.0, 1.0]},
@@ -680,37 +735,34 @@ scale = {
pipeline.set_ip_adapter_scale(scale)
```
-This will activate IP-Adapter at the second layer in the model's down-part block 2 and up-part block 0. The former is the layer where IP-Adapter injects layout information and the latter injects style. Inserting IP-Adapter to these two layers you can generate images following both the style and layout from image prompt, but with contents more aligned to text prompt.
+Load the style image and generate an image.
```py
style_image = load_image("https://huggingface.co/datasets/huggingface/documentation-images/resolve/0052a70beed5bf71b92610a43a52df6d286cd5f3/diffusers/rabbit.jpg")
-generator = torch.Generator(device="cpu").manual_seed(26)
-image = pipeline(
+pipeline(
prompt="a cat, masterpiece, best quality, high quality",
ip_adapter_image=style_image,
negative_prompt="text, watermark, lowres, low quality, worst quality, deformed, glitch, low contrast, noisy, saturation, blurry",
guidance_scale=5,
- num_inference_steps=30,
- generator=generator,
).images[0]
-image
```
-
-
-
- IP-Adapter image
-
-
-
- generated image
-
+
+
+
+ style image
+
+
+
+ generated image
+
-In contrast, inserting IP-Adapter to all layers will often generate images that overly focus on image prompt and diminish diversity.
+You can also insert the IP-Adapter in all the model layers, but this tends to generate images that focus more on the image prompt and reduces the diversity of the generated images. To condition generation only on the style, activate the IP-Adapter in up `block_0` alone, the style layer, as shown below.
-Activate IP-Adapter only in the style layer and then call the pipeline again.
+> [!TIP]
+> You don't need to specify all the layers in the `scale` dictionary. Layers not included are set to 0, which means the IP-Adapter is disabled.
```py
scale = {
@@ -718,27 +770,21 @@ scale = {
}
pipeline.set_ip_adapter_scale(scale)
-generator = torch.Generator(device="cpu").manual_seed(26)
-image = pipeline(
+pipeline(
prompt="a cat, masterpiece, best quality, high quality",
ip_adapter_image=style_image,
negative_prompt="text, watermark, lowres, low quality, worst quality, deformed, glitch, low contrast, noisy, saturation, blurry",
guidance_scale=5,
- num_inference_steps=30,
- generator=generator,
).images[0]
-image
```
-
-
-
- IP-Adapter only in style layer
-
-
-
- IP-Adapter in all layers
-
-
-
-Note that you don't have to specify all layers in the dictionary. Those not included in the dictionary will be set to scale 0 which means disable IP-Adapter by default.
+
-
-## Textual inversion
-
-[Textual inversion](https://textual-inversion.github.io/) is very similar to DreamBooth and it can also personalize a diffusion model to generate certain concepts (styles, objects) from just a few images. This method works by training and finding new embeddings that represent the images you provide with a special word in the prompt. As a result, the diffusion model weights stay the same and the training process produces a relatively tiny (a few KBs) file.
-
-Because textual inversion creates embeddings, it cannot be used on its own like DreamBooth and requires another model.
-
-```py
-from diffusers import AutoPipelineForText2Image
-import torch
-
-pipeline = AutoPipelineForText2Image.from_pretrained("stable-diffusion-v1-5/stable-diffusion-v1-5", torch_dtype=torch.float16).to("cuda")
-```
-
-Now you can load the textual inversion embeddings with the [`~loaders.TextualInversionLoaderMixin.load_textual_inversion`] method and generate some images. Let's load the [sd-concepts-library/gta5-artwork](https://huggingface.co/sd-concepts-library/gta5-artwork) embeddings and you'll need to include the special word `` in your prompt to trigger it:
-
-```py
-pipeline.load_textual_inversion("sd-concepts-library/gta5-artwork")
-prompt = "A cute brown bear eating a slice of pizza, stunning color scheme, masterpiece, illustration, style"
-image = pipeline(prompt).images[0]
-image
-```
-
-
-
-
-
-Textual inversion can also be trained on undesirable things to create *negative embeddings* to discourage a model from generating images with those undesirable things like blurry images or extra fingers on a hand. This can be an easy way to quickly improve your prompt. You'll also load the embeddings with [`~loaders.TextualInversionLoaderMixin.load_textual_inversion`], but this time, you'll need two more parameters:
-
-- `weight_name`: specifies the weight file to load if the file was saved in the 🤗 Diffusers format with a specific name or if the file is stored in the A1111 format
-- `token`: specifies the special word to use in the prompt to trigger the embeddings
-
-Let's load the [sayakpaul/EasyNegative-test](https://huggingface.co/sayakpaul/EasyNegative-test) embeddings:
-
-```py
-pipeline.load_textual_inversion(
- "sayakpaul/EasyNegative-test", weight_name="EasyNegative.safetensors", token="EasyNegative"
-)
-```
-
-Now you can use the `token` to generate an image with the negative embeddings:
-
-```py
-prompt = "A cute brown bear eating a slice of pizza, stunning color scheme, masterpiece, illustration, EasyNegative"
-negative_prompt = "EasyNegative"
-
-image = pipeline(prompt, negative_prompt=negative_prompt, num_inference_steps=50).images[0]
-image
-```
-
-
-
-
-
-## LoRA
-
-[Low-Rank Adaptation (LoRA)](https://huggingface.co/papers/2106.09685) is a popular training technique because it is fast and generates smaller file sizes (a couple hundred MBs). Like the other methods in this guide, LoRA can train a model to learn new styles from just a few images. It works by inserting new weights into the diffusion model and then only the new weights are trained instead of the entire model. This makes LoRAs faster to train and easier to store.
-
-
-
-LoRA is a very general training technique that can be used with other training methods. For example, it is common to train a model with DreamBooth and LoRA. It is also increasingly common to load and merge multiple LoRAs to create new and unique images. You can learn more about it in the in-depth [Merge LoRAs](merge_loras) guide since merging is outside the scope of this loading guide.
-
-
-
-LoRAs also need to be used with another model:
-
-```py
-from diffusers import AutoPipelineForText2Image
-import torch
-
-pipeline = AutoPipelineForText2Image.from_pretrained("stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16).to("cuda")
-```
-
-Then use the [`~loaders.StableDiffusionLoraLoaderMixin.load_lora_weights`] method to load the [ostris/super-cereal-sdxl-lora](https://huggingface.co/ostris/super-cereal-sdxl-lora) weights and specify the weights filename from the repository:
-
-```py
-pipeline.load_lora_weights("ostris/super-cereal-sdxl-lora", weight_name="cereal_box_sdxl_v1.safetensors")
-prompt = "bears, pizza bites"
-image = pipeline(prompt).images[0]
-image
-```
-
-
-
-
-
-The [`~loaders.StableDiffusionLoraLoaderMixin.load_lora_weights`] method loads LoRA weights into both the UNet and text encoder. It is the preferred way for loading LoRAs because it can handle cases where:
-
-- the LoRA weights don't have separate identifiers for the UNet and text encoder
-- the LoRA weights have separate identifiers for the UNet and text encoder
-
-To directly load (and save) a LoRA adapter at the *model-level*, use [`~loaders.PeftAdapterMixin.load_lora_adapter`], which builds and prepares the necessary model configuration for the adapter. Like [`~loaders.StableDiffusionLoraLoaderMixin.load_lora_weights`], [`~loaders.PeftAdapterMixin.load_lora_adapter`] can load LoRAs for both the UNet and text encoder. For example, if you're loading a LoRA for the UNet, [`~loaders.PeftAdapterMixin.load_lora_adapter`] ignores the keys for the text encoder.
-
-Use the `weight_name` parameter to specify the specific weight file and the `prefix` parameter to filter for the appropriate state dicts (`"unet"` in this case) to load.
-
-```py
-from diffusers import AutoPipelineForText2Image
-import torch
-
-pipeline = AutoPipelineForText2Image.from_pretrained("stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16).to("cuda")
-pipeline.unet.load_lora_adapter("jbilcke-hf/sdxl-cinematic-1", weight_name="pytorch_lora_weights.safetensors", prefix="unet")
-
-# use cnmt in the prompt to trigger the LoRA
-prompt = "A cute cnmt eating a slice of pizza, stunning color scheme, masterpiece, illustration"
-image = pipeline(prompt).images[0]
-image
-```
-
-
-
-
-
-Save an adapter with [`~loaders.PeftAdapterMixin.save_lora_adapter`].
-
-To unload the LoRA weights, use the [`~loaders.StableDiffusionLoraLoaderMixin.unload_lora_weights`] method to discard the LoRA weights and restore the model to its original weights:
-
-```py
-pipeline.unload_lora_weights()
-```
-
-### Adjust LoRA weight scale
-
-For both [`~loaders.StableDiffusionLoraLoaderMixin.load_lora_weights`] and [`~loaders.UNet2DConditionLoadersMixin.load_attn_procs`], you can pass the `cross_attention_kwargs={"scale": 0.5}` parameter to adjust how much of the LoRA weights to use. A value of `0` is the same as only using the base model weights, and a value of `1` is equivalent to using the fully finetuned LoRA.
-
-For more granular control on the amount of LoRA weights used per layer, you can use [`~loaders.StableDiffusionLoraLoaderMixin.set_adapters`] and pass a dictionary specifying by how much to scale the weights in each layer by.
-```python
-pipe = ... # create pipeline
-pipe.load_lora_weights(..., adapter_name="my_adapter")
-scales = {
- "text_encoder": 0.5,
- "text_encoder_2": 0.5, # only usable if pipe has a 2nd text encoder
- "unet": {
- "down": 0.9, # all transformers in the down-part will use scale 0.9
- # "mid" # in this example "mid" is not given, therefore all transformers in the mid part will use the default scale 1.0
- "up": {
- "block_0": 0.6, # all 3 transformers in the 0th block in the up-part will use scale 0.6
- "block_1": [0.4, 0.8, 1.0], # the 3 transformers in the 1st block in the up-part will use scales 0.4, 0.8 and 1.0 respectively
- }
- }
-}
-pipe.set_adapters("my_adapter", scales)
-```
-
-This also works with multiple adapters - see [this guide](https://huggingface.co/docs/diffusers/tutorials/using_peft_for_inference#customize-adapters-strength) for how to do it.
-
-
-
-Currently, [`~loaders.StableDiffusionLoraLoaderMixin.set_adapters`] only supports scaling attention weights. If a LoRA has other parts (e.g., resnets or down-/upsamplers), they will keep a scale of 1.0.
-
-
-
-### Hotswapping LoRA adapters
-
-A common use case when serving multiple adapters is to load one adapter first, generate images, load another adapter, generate more images, load another adapter, etc. This workflow normally requires calling [`~loaders.StableDiffusionLoraLoaderMixin.load_lora_weights`], [`~loaders.StableDiffusionLoraLoaderMixin.set_adapters`], and possibly [`~loaders.peft.PeftAdapterMixin.delete_adapters`] to save memory. Moreover, if the model is compiled using `torch.compile`, performing these steps requires recompilation, which takes time.
-
-To better support this common workflow, you can "hotswap" a LoRA adapter, to avoid accumulating memory and in some cases, recompilation. It requires an adapter to already be loaded, and the new adapter weights are swapped in-place for the existing adapter.
-
-Pass `hotswap=True` when loading a LoRA adapter to enable this feature. It is important to indicate the name of the existing adapter, (`default_0` is the default adapter name), to be swapped. If you loaded the first adapter with a different name, use that name instead.
-
-```python
-pipe = ...
-# load adapter 1 as normal
-pipeline.load_lora_weights(file_name_adapter_1)
-# generate some images with adapter 1
-...
-# now hot swap the 2nd adapter
-pipeline.load_lora_weights(file_name_adapter_2, hotswap=True, adapter_name="default_0")
-# generate images with adapter 2
-```
-
-
-
-
-Hotswapping is not currently supported for LoRA adapters that target the text encoder.
-
-
-
-For compiled models, it is often necessary to call [`~loaders.lora_base.LoraBaseMixin.enable_lora_hotswap`] to avoid recompilation (though not always, for example, if the second adapter targets identical LoRA ranks and scales). Call [`~loaders.lora_base.LoraBaseMixin.enable_lora_hotswap`] _before_ loading the first adapter, and call `torch.compile` _after_ loading the first adapter.
-
-```python
-import torch
-
-pipeline = ...  # create the pipeline
-# call this extra method before loading the first adapter
-pipeline.enable_lora_hotswap(target_rank=max_rank)
-# load adapter 1
-pipeline.load_lora_weights(file_name_adapter_1)
-# compile the UNet of the pipeline
-pipeline.unet = torch.compile(pipeline.unet, ...)
-# generate some images with adapter 1
-...
-# hotswap adapter 2
-pipeline.load_lora_weights(file_name_adapter_2, hotswap=True, adapter_name="default_0")
-# generate images with adapter 2
-```
-
-The `target_rank=max_rank` argument is important for setting the maximum rank among all LoRA adapters that will be loaded. If you have one adapter with rank 8 and another with rank 16, pass `target_rank=16`. You should use a higher value if in doubt. By default, this value is 128.
-
-However, there can be situations where recompilation is unavoidable. For example, if the hotswapped adapter targets more layers than the initial adapter, then recompilation is triggered. Try to load the adapter that targets the most layers first. Refer to the PEFT docs on [hotswapping](https://huggingface.co/docs/peft/main/en/package_reference/hotswap#peft.utils.hotswap.hotswap_adapter) for more details about the limitations of this feature.
-
-
-
-Move your code inside the `with torch._dynamo.config.patch(error_on_recompile=True)` context manager to detect if a model was recompiled. If you detect recompilation despite following all the steps above, please open an issue with [Diffusers](https://github.com/huggingface/diffusers/issues) with a reproducible example.
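-
-As a minimal sketch, reusing `file_name_adapter_2` from the example above (the prompt is only a placeholder):
-
-```py
-import torch
-
-# raise an error instead of silently recompiling when the second adapter is hotswapped in
-with torch._dynamo.config.patch(error_on_recompile=True):
-    pipeline.load_lora_weights(file_name_adapter_2, hotswap=True, adapter_name="default_0")
-    image = pipeline("a prompt").images[0]
-```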
-
-
-
-### Kohya and TheLastBen
-
-Other popular LoRA trainers from the community include those by [Kohya](https://github.com/kohya-ss/sd-scripts/) and [TheLastBen](https://github.com/TheLastBen/fast-stable-diffusion). These trainers create different LoRA checkpoints than those trained by 🤗 Diffusers, but they can still be loaded in the same way.
-
-
-
-
-To load a Kohya LoRA, let's download the [Blueprintify SD XL 1.0](https://civitai.com/models/150986/blueprintify-sd-xl-10) checkpoint from [Civitai](https://civitai.com/) as an example:
-
-```sh
-wget https://civitai.com/api/download/models/168776 -O blueprintify-sd-xl-10.safetensors
-```
-
-Load the LoRA checkpoint with the [`~loaders.StableDiffusionLoraLoaderMixin.load_lora_weights`] method, and specify the filename in the `weight_name` parameter:
-
-```py
-from diffusers import AutoPipelineForText2Image
-import torch
-
-pipeline = AutoPipelineForText2Image.from_pretrained("stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16).to("cuda")
-pipeline.load_lora_weights("path/to/weights", weight_name="blueprintify-sd-xl-10.safetensors")
-```
-
-Generate an image:
-
-```py
-# use bl3uprint in the prompt to trigger the LoRA
-prompt = "bl3uprint, a highly detailed blueprint of the eiffel tower, explaining how to build all parts, many txt, blueprint grid backdrop"
-image = pipeline(prompt).images[0]
-image
-```
-
-
-
-Some limitations of using Kohya LoRAs with 🤗 Diffusers include:
-
-- Images may not look like those generated by UIs - like ComfyUI - for multiple reasons, which are explained [here](https://github.com/huggingface/diffusers/pull/4287/#issuecomment-1655110736).
-- [LyCORIS checkpoints](https://github.com/KohakuBlueleaf/LyCORIS) aren't fully supported. The [`~loaders.StableDiffusionLoraLoaderMixin.load_lora_weights`] method loads LyCORIS checkpoints with LoRA and LoCon modules, but Hada and LoKR are not supported.
-
-Loading a checkpoint from TheLastBen is very similar. For example, to load the [TheLastBen/William_Eggleston_Style_SDXL](https://huggingface.co/TheLastBen/William_Eggleston_Style_SDXL) checkpoint:
-
-```py
-from diffusers import AutoPipelineForText2Image
-import torch
-
-pipeline = AutoPipelineForText2Image.from_pretrained("stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16).to("cuda")
-pipeline.load_lora_weights("TheLastBen/William_Eggleston_Style_SDXL", weight_name="wegg.safetensors")
-
-# use by william eggleston in the prompt to trigger the LoRA
-prompt = "a house by william eggleston, sunrays, beautiful, sunlight, sunrays, beautiful"
-image = pipeline(prompt=prompt).images[0]
-image
-```
-
-
-
-
-## IP-Adapter
-
-[IP-Adapter](https://ip-adapter.github.io/) is a lightweight adapter that enables image prompting for any diffusion model. This adapter works by decoupling the cross-attention layers of the image and text features. All the other model components are frozen and only the embedded image features in the UNet are trained. As a result, IP-Adapter files are typically only ~100MBs.
-
-You can learn more about how to use IP-Adapter for different tasks and specific use cases in the [IP-Adapter](../using-diffusers/ip_adapter) guide.
-
-> [!TIP]
-> Diffusers currently only supports IP-Adapter for some of the most popular pipelines. Feel free to open a feature request if you have a cool use case and want to integrate IP-Adapter with an unsupported pipeline!
-> Official IP-Adapter checkpoints are available from [h94/IP-Adapter](https://huggingface.co/h94/IP-Adapter).
-
-To start, load a Stable Diffusion checkpoint.
-
-```py
-from diffusers import AutoPipelineForText2Image
-import torch
-from diffusers.utils import load_image
-
-pipeline = AutoPipelineForText2Image.from_pretrained("stable-diffusion-v1-5/stable-diffusion-v1-5", torch_dtype=torch.float16).to("cuda")
-```
-
-Then load the IP-Adapter weights and add it to the pipeline with the [`~loaders.IPAdapterMixin.load_ip_adapter`] method.
-
-```py
-pipeline.load_ip_adapter("h94/IP-Adapter", subfolder="models", weight_name="ip-adapter_sd15.bin")
-```
-
-Once loaded, you can use the pipeline with an image and text prompt to guide the image generation process.
-
-```py
-image = load_image("https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/load_neg_embed.png")
-generator = torch.Generator(device="cpu").manual_seed(33)
-images = pipeline(
-    prompt="best quality, high quality, wearing sunglasses",
-    ip_adapter_image=image,
-    negative_prompt="monochrome, lowres, bad anatomy, worst quality, low quality",
-    num_inference_steps=50,
-    generator=generator,
-).images[0]
-images
-```
-
-### IP-Adapter Plus
-
-IP-Adapter relies on an image encoder to generate image features. If the IP-Adapter repository contains an `image_encoder` subfolder, the image encoder is automatically loaded and registered to the pipeline. Otherwise, you'll need to explicitly load the image encoder with a [`~transformers.CLIPVisionModelWithProjection`] model and pass it to the pipeline.
-
-This is the case for *IP-Adapter Plus* checkpoints which use the ViT-H image encoder.
-
-```py
-from transformers import CLIPVisionModelWithProjection
-
-image_encoder = CLIPVisionModelWithProjection.from_pretrained(
- "h94/IP-Adapter",
- subfolder="models/image_encoder",
- torch_dtype=torch.float16
-)
-
-pipeline = AutoPipelineForText2Image.from_pretrained(
- "stabilityai/stable-diffusion-xl-base-1.0",
- image_encoder=image_encoder,
- torch_dtype=torch.float16
-).to("cuda")
-
-pipeline.load_ip_adapter("h94/IP-Adapter", subfolder="sdxl_models", weight_name="ip-adapter-plus_sdxl_vit-h.safetensors")
-```
-
-### IP-Adapter Face ID models
-
-The IP-Adapter FaceID models are experimental IP-Adapters that use image embeddings generated by `insightface` instead of CLIP image embeddings. Some of these models also use LoRA to improve ID consistency.
-You need to install `insightface` and all its requirements to use these models.
-
-
-As InsightFace pretrained models are available for non-commercial research purposes, IP-Adapter-FaceID models are released exclusively for research purposes and are not intended for commercial use.
-
-
-```py
-pipeline = AutoPipelineForText2Image.from_pretrained(
- "stabilityai/stable-diffusion-xl-base-1.0",
- torch_dtype=torch.float16
-).to("cuda")
-
-pipeline.load_ip_adapter("h94/IP-Adapter-FaceID", subfolder=None, weight_name="ip-adapter-faceid_sdxl.bin", image_encoder_folder=None)
-```
-
-If you want to use one of the two IP-Adapter FaceID Plus models, you must also load the CLIP image encoder, as these models use both `insightface` and CLIP image embeddings to achieve better photorealism.
-
-```py
-from transformers import CLIPVisionModelWithProjection
-
-image_encoder = CLIPVisionModelWithProjection.from_pretrained(
- "laion/CLIP-ViT-H-14-laion2B-s32B-b79K",
- torch_dtype=torch.float16,
-)
-
-pipeline = AutoPipelineForText2Image.from_pretrained(
- "stable-diffusion-v1-5/stable-diffusion-v1-5",
- image_encoder=image_encoder,
- torch_dtype=torch.float16
-).to("cuda")
-
-pipeline.load_ip_adapter("h94/IP-Adapter-FaceID", subfolder=None, weight_name="ip-adapter-faceid-plus_sd15.bin")
-```
diff --git a/docs/source/en/using-diffusers/merge_loras.md b/docs/source/en/using-diffusers/merge_loras.md
deleted file mode 100644
index e3ade4b01cf0..000000000000
--- a/docs/source/en/using-diffusers/merge_loras.md
+++ /dev/null
@@ -1,266 +0,0 @@
-
-
-# Merge LoRAs
-
-It can be fun and creative to use multiple [LoRAs](https://huggingface.co/docs/peft/conceptual_guides/adapter#low-rank-adaptation-lora) together to generate something entirely new and unique. This works by merging multiple LoRA weights together to produce images that are a blend of different styles. Diffusers provides a few methods to merge LoRAs depending on *how* you want to merge their weights, which can affect image quality.
-
-This guide will show you how to merge LoRAs using the [`~loaders.PeftAdapterMixin.set_adapters`] and [add_weighted_adapter](https://huggingface.co/docs/peft/package_reference/lora#peft.LoraModel.add_weighted_adapter) methods. To improve inference speed and reduce the memory usage of merged LoRAs, you'll also see how to use the [`~loaders.StableDiffusionLoraLoaderMixin.fuse_lora`] method to fuse the LoRA weights with the original weights of the underlying model.
-
-For this guide, load a Stable Diffusion XL (SDXL) checkpoint and the [ostris/ikea-instructions-lora-sdxl](https://huggingface.co/ostris/ikea-instructions-lora-sdxl) and [lordjia/by-feng-zikai](https://huggingface.co/lordjia/by-feng-zikai) LoRAs with the [`~loaders.StableDiffusionLoraLoaderMixin.load_lora_weights`] method. You'll need to assign each LoRA an `adapter_name` to combine them later.
-
-```py
-from diffusers import DiffusionPipeline
-import torch
-
-pipeline = DiffusionPipeline.from_pretrained("stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16).to("cuda")
-pipeline.load_lora_weights("ostris/ikea-instructions-lora-sdxl", weight_name="ikea_instructions_xl_v1_5.safetensors", adapter_name="ikea")
-pipeline.load_lora_weights("lordjia/by-feng-zikai", weight_name="fengzikai_v1.0_XL.safetensors", adapter_name="feng")
-```
-
-## set_adapters
-
-The [`~loaders.PeftAdapterMixin.set_adapters`] method merges LoRA adapters by concatenating their weighted matrices. Use the adapter name to specify which LoRAs to merge, and the `adapter_weights` parameter to control the scaling for each LoRA. For example, if `adapter_weights=[0.5, 0.5]`, then the merged LoRA output is an average of both LoRAs. Try adjusting the adapter weights to see how it affects the generated image!
-
-```py
-pipeline.set_adapters(["ikea", "feng"], adapter_weights=[0.7, 0.8])
-
-generator = torch.manual_seed(0)
-prompt = "A bowl of ramen shaped like a cute kawaii bear, by Feng Zikai"
-image = pipeline(prompt, generator=generator, cross_attention_kwargs={"scale": 1.0}).images[0]
-image
-```
-
-
-
-
-
-## add_weighted_adapter
-
-> [!WARNING]
-> This is an experimental method that adds PEFT's [add_weighted_adapter](https://huggingface.co/docs/peft/package_reference/lora#peft.LoraModel.add_weighted_adapter) method to Diffusers to enable more efficient merging methods. Check out this [issue](https://github.com/huggingface/diffusers/issues/6892) if you're interested in learning more about the motivation and design behind this integration.
-
-The [add_weighted_adapter](https://huggingface.co/docs/peft/package_reference/lora#peft.LoraModel.add_weighted_adapter) method provides access to more efficient merging methods such as [TIES and DARE](https://huggingface.co/docs/peft/developer_guides/model_merging). To use these merging methods, make sure you have the latest stable version of Diffusers and PEFT installed.
-
-```bash
-pip install -U diffusers peft
-```
-
-There are three steps to merge LoRAs with the [add_weighted_adapter](https://huggingface.co/docs/peft/package_reference/lora#peft.LoraModel.add_weighted_adapter) method:
-
-1. Create a [PeftModel](https://huggingface.co/docs/peft/package_reference/peft_model#peft.PeftModel) from the underlying model and LoRA checkpoint.
-2. Load a base UNet model and the LoRA adapters.
-3. Merge the adapters using the [add_weighted_adapter](https://huggingface.co/docs/peft/package_reference/lora#peft.LoraModel.add_weighted_adapter) method and the merging method of your choice.
-
-Let's dive deeper into what these steps entail.
-
-1. Load a UNet that corresponds to the UNet in the LoRA checkpoint. In this case, both LoRAs use the SDXL UNet as their base model.
-
-```python
-from diffusers import AutoModel
-import torch
-
-unet = AutoModel.from_pretrained(
- "stabilityai/stable-diffusion-xl-base-1.0",
- torch_dtype=torch.float16,
- use_safetensors=True,
- variant="fp16",
- subfolder="unet",
-).to("cuda")
-```
-
-Load the SDXL pipeline and the LoRA checkpoints, starting with the [ostris/ikea-instructions-lora-sdxl](https://huggingface.co/ostris/ikea-instructions-lora-sdxl) LoRA.
-
-```python
-from diffusers import DiffusionPipeline
-
-pipeline = DiffusionPipeline.from_pretrained(
- "stabilityai/stable-diffusion-xl-base-1.0",
- variant="fp16",
- torch_dtype=torch.float16,
- unet=unet
-).to("cuda")
-pipeline.load_lora_weights("ostris/ikea-instructions-lora-sdxl", weight_name="ikea_instructions_xl_v1_5.safetensors", adapter_name="ikea")
-```
-
-Now you'll create a [PeftModel](https://huggingface.co/docs/peft/package_reference/peft_model#peft.PeftModel) from the loaded LoRA checkpoint by combining the SDXL UNet and the LoRA UNet from the pipeline.
-
-```python
-from peft import get_peft_model, LoraConfig
-import copy
-
-sdxl_unet = copy.deepcopy(unet)
-ikea_peft_model = get_peft_model(
- sdxl_unet,
- pipeline.unet.peft_config["ikea"],
- adapter_name="ikea"
-)
-
-original_state_dict = {f"base_model.model.{k}": v for k, v in pipeline.unet.state_dict().items()}
-ikea_peft_model.load_state_dict(original_state_dict, strict=True)
-```
-
-> [!TIP]
-> You can optionally push the ikea_peft_model to the Hub by calling `ikea_peft_model.push_to_hub("ikea_peft_model", token=TOKEN)`.
-
-Repeat this process to create a [PeftModel](https://huggingface.co/docs/peft/package_reference/peft_model#peft.PeftModel) from the [lordjia/by-feng-zikai](https://huggingface.co/lordjia/by-feng-zikai) LoRA.
-
-```python
-pipeline.delete_adapters("ikea")
-sdxl_unet.delete_adapters("ikea")
-
-pipeline.load_lora_weights("lordjia/by-feng-zikai", weight_name="fengzikai_v1.0_XL.safetensors", adapter_name="feng")
-pipeline.set_adapters(adapter_names="feng")
-
-feng_peft_model = get_peft_model(
- sdxl_unet,
- pipeline.unet.peft_config["feng"],
- adapter_name="feng"
-)
-
-original_state_dict = {f"base_model.model.{k}": v for k, v in pipeline.unet.state_dict().items()}
-feng_peft_model.load_state_dict(original_state_dict, strict=True)
-```
-
-2. Load a base UNet model and then load the adapters onto it.
-
-```python
-from diffusers import AutoModel
-from peft import PeftModel
-
-base_unet = AutoModel.from_pretrained(
- "stabilityai/stable-diffusion-xl-base-1.0",
- torch_dtype=torch.float16,
- use_safetensors=True,
- variant="fp16",
- subfolder="unet",
-).to("cuda")
-
-model = PeftModel.from_pretrained(base_unet, "stevhliu/ikea_peft_model", use_safetensors=True, subfolder="ikea", adapter_name="ikea")
-model.load_adapter("stevhliu/feng_peft_model", use_safetensors=True, subfolder="feng", adapter_name="feng")
-```
-
-3. Merge the adapters using the [add_weighted_adapter](https://huggingface.co/docs/peft/package_reference/lora#peft.LoraModel.add_weighted_adapter) method and the merging method of your choice (learn more about other merging methods in this [blog post](https://huggingface.co/blog/peft_merging)). For this example, let's use the `"dare_linear"` method to merge the LoRAs.
-
-> [!WARNING]
-> Keep in mind the LoRAs need to have the same rank to be merged!
-
-```python
-model.add_weighted_adapter(
- adapters=["ikea", "feng"],
- weights=[1.0, 1.0],
- combination_type="dare_linear",
- adapter_name="ikea-feng"
-)
-model.set_adapters("ikea-feng")
-```
-
-Now you can generate an image with the merged LoRA.
-
-```python
-model = model.to(dtype=torch.float16, device="cuda")
-
-pipeline = DiffusionPipeline.from_pretrained(
- "stabilityai/stable-diffusion-xl-base-1.0", unet=model, variant="fp16", torch_dtype=torch.float16,
-).to("cuda")
-
-image = pipeline("A bowl of ramen shaped like a cute kawaii bear, by Feng Zikai", generator=torch.manual_seed(0)).images[0]
-image
-```
-
-
-
-
-
-## fuse_lora
-
-Both the [`~loaders.PeftAdapterMixin.set_adapters`] and [add_weighted_adapter](https://huggingface.co/docs/peft/package_reference/lora#peft.LoraModel.add_weighted_adapter) methods require loading the base model and the LoRA adapters separately, which incurs some overhead. The [`~loaders.lora_base.LoraBaseMixin.fuse_lora`] method instead fuses the LoRA weights directly into the weights of the underlying model (both the UNet and the text encoder), so the model is only loaded once. This can speed up inference and lower memory usage.
-
-For example, if you have a base model and adapters loaded and set as active with the following adapter weights:
-
-```py
-from diffusers import DiffusionPipeline
-import torch
-
-pipeline = DiffusionPipeline.from_pretrained("stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16).to("cuda")
-pipeline.load_lora_weights("ostris/ikea-instructions-lora-sdxl", weight_name="ikea_instructions_xl_v1_5.safetensors", adapter_name="ikea")
-pipeline.load_lora_weights("lordjia/by-feng-zikai", weight_name="fengzikai_v1.0_XL.safetensors", adapter_name="feng")
-
-pipeline.set_adapters(["ikea", "feng"], adapter_weights=[0.7, 0.8])
-```
-
-Fuse these LoRAs into the UNet with the [`~loaders.lora_base.LoraBaseMixin.fuse_lora`] method. The `lora_scale` parameter controls how much the LoRA weights scale the output. Make the `lora_scale` adjustment here because passing `scale` to `cross_attention_kwargs` in the pipeline won't work once the weights are fused.
-
-```py
-pipeline.fuse_lora(adapter_names=["ikea", "feng"], lora_scale=1.0)
-```
-
-Then you should use [`~loaders.StableDiffusionLoraLoaderMixin.unload_lora_weights`] to unload the LoRA weights since they've already been fused with the underlying base model. Finally, call [`~DiffusionPipeline.save_pretrained`] to save the fused pipeline locally or you could call [`~DiffusionPipeline.push_to_hub`] to push the fused pipeline to the Hub.
-
-```py
-pipeline.unload_lora_weights()
-# save locally
-pipeline.save_pretrained("path/to/fused-pipeline")
-# save to the Hub
-pipeline.push_to_hub("fused-ikea-feng")
-```
-
-Now you can quickly load the fused pipeline and use it for inference without needing to separately load the LoRA adapters.
-
-```py
-pipeline = DiffusionPipeline.from_pretrained(
- "username/fused-ikea-feng", torch_dtype=torch.float16,
-).to("cuda")
-
-image = pipeline("A bowl of ramen shaped like a cute kawaii bear, by Feng Zikai", generator=torch.manual_seed(0)).images[0]
-image
-```
-
-You can call [`~loaders.lora_base.LoraBaseMixin.unfuse_lora`] to restore the original model's weights (for example, if you want to use a different `lora_scale` value). This only works if you've fused a single LoRA adapter into the original model. If you've fused multiple LoRAs, you'll need to reload the model.
-
-```py
-pipeline.unfuse_lora()
-```
-
-### torch.compile
-
-[torch.compile](../optimization/torch2.0#torchcompile) can speed up your pipeline even more, but the LoRA weights must be fused first and then unloaded. Typically, the UNet is compiled because it is such a computationally intensive component of the pipeline.
-
-```py
-from diffusers import DiffusionPipeline
-import torch
-
-# load base model and LoRAs
-pipeline = DiffusionPipeline.from_pretrained("stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16).to("cuda")
-pipeline.load_lora_weights("ostris/ikea-instructions-lora-sdxl", weight_name="ikea_instructions_xl_v1_5.safetensors", adapter_name="ikea")
-pipeline.load_lora_weights("lordjia/by-feng-zikai", weight_name="fengzikai_v1.0_XL.safetensors", adapter_name="feng")
-
-# activate both LoRAs and set adapter weights
-pipeline.set_adapters(["ikea", "feng"], adapter_weights=[0.7, 0.8])
-
-# fuse LoRAs and unload weights
-pipeline.fuse_lora(adapter_names=["ikea", "feng"], lora_scale=1.0)
-pipeline.unload_lora_weights()
-
-# torch.compile
-pipeline.unet.to(memory_format=torch.channels_last)
-pipeline.unet = torch.compile(pipeline.unet, mode="reduce-overhead", fullgraph=True)
-
-image = pipeline("A bowl of ramen shaped like a cute kawaii bear, by Feng Zikai", generator=torch.manual_seed(0)).images[0]
-```
-
-Learn more about torch.compile in the [Accelerate inference of text-to-image diffusion models](../tutorials/fast_diffusion#torchcompile) guide.
-
-## Next steps
-
-For more conceptual details about how each merging method works, take a look at the [🤗 PEFT welcomes new merging methods](https://huggingface.co/blog/peft_merging#concatenation-cat) blog post!
diff --git a/docs/source/en/using-diffusers/t2i_adapter.md b/docs/source/en/using-diffusers/t2i_adapter.md
index 52552d848fe1..113d85724932 100644
--- a/docs/source/en/using-diffusers/t2i_adapter.md
+++ b/docs/source/en/using-diffusers/t2i_adapter.md
@@ -12,41 +12,21 @@ specific language governing permissions and limitations under the License.
# T2I-Adapter
-[T2I-Adapter](https://hf.co/papers/2302.08453) is a lightweight adapter for controlling and providing more accurate
-structure guidance for text-to-image models. It works by learning an alignment between the internal knowledge of the
-text-to-image model and an external control signal, such as edge detection or depth estimation.
+[T2I-Adapter](https://huggingface.co/papers/2302.08453) is an adapter that enables controllable generation like [ControlNet](./controlnet). A T2I-Adapter works by learning a *mapping* between a control signal (for example, a depth map) and a pretrained model's internal knowledge. The adapter is plugged into the base model to provide extra guidance based on the control signal during generation.
-The T2I-Adapter design is simple, the condition is passed to four feature extraction blocks and three downsample
-blocks. This makes it fast and easy to train different adapters for different conditions which can be plugged into the
-text-to-image model. T2I-Adapter is similar to [ControlNet](controlnet) except it is smaller (~77M parameters) and
-faster because it only runs once during the diffusion process. The downside is that performance may be slightly worse
-than ControlNet.
-
-This guide will show you how to use T2I-Adapter with different Stable Diffusion models and how you can compose multiple
-T2I-Adapters to impose more than one condition.
-
-> [!TIP]
-> There are several T2I-Adapters available for different conditions, such as color palette, depth, sketch, pose, and
-> segmentation. Check out the [TencentARC](https://hf.co/TencentARC) repository to try them out!
-
-Before you begin, make sure you have the following libraries installed.
+Load a T2I-Adapter conditioned on a specific control, such as canny edge, and pass it to the pipeline in [`~DiffusionPipeline.from_pretrained`].
```py
-# uncomment to install the necessary libraries in Colab
-#!pip install -q diffusers accelerate controlnet-aux==0.0.7
-```
-
-## Text-to-image
-
-Text-to-image models rely on a prompt to generate an image, but sometimes, text alone may not be enough to provide more
-accurate structural guidance. T2I-Adapter allows you to provide an additional control image to guide the generation
-process. For example, you can provide a canny image (a white outline of an image on a black background) to guide the
-model to generate an image with a similar structure.
+import torch
+from diffusers import T2IAdapter, StableDiffusionXLAdapterPipeline, AutoencoderKL
-
-
+t2i_adapter = T2IAdapter.from_pretrained(
+ "TencentARC/t2i-adapter-canny-sdxl-1.0",
+ torch_dtype=torch.float16,
+)
+```
-Create a canny image with the [opencv-library](https://github.com/opencv/opencv-python).
+Generate a canny image with [opencv-python](https://github.com/opencv/opencv-python).
```py
import cv2
@@ -54,166 +34,124 @@ import numpy as np
from PIL import Image
from diffusers.utils import load_image
-image = load_image("https://huggingface.co/datasets/hf-internal-testing/diffusers-images/resolve/main/sd_controlnet/hf-logo.png")
-image = np.array(image)
+original_image = load_image(
+ "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/non-enhanced-prompt.png"
+)
+
+image = np.array(original_image)
low_threshold = 100
high_threshold = 200
image = cv2.Canny(image, low_threshold, high_threshold)
-image = Image.fromarray(image)
-```
-
-Now load a T2I-Adapter conditioned on [canny images](https://hf.co/TencentARC/t2iadapter_canny_sd15v2) and pass it to
-the [`StableDiffusionAdapterPipeline`].
-
-```py
-import torch
-from diffusers import StableDiffusionAdapterPipeline, T2IAdapter
-
-adapter = T2IAdapter.from_pretrained("TencentARC/t2iadapter_canny_sd15v2", torch_dtype=torch.float16)
-pipeline = StableDiffusionAdapterPipeline.from_pretrained(
- "stable-diffusion-v1-5/stable-diffusion-v1-5",
- adapter=adapter,
- torch_dtype=torch.float16,
-)
-pipeline.to("cuda")
-```
-
-Finally, pass your prompt and control image to the pipeline.
-
-```py
-generator = torch.Generator("cuda").manual_seed(0)
-
-image = pipeline(
- prompt="cinematic photo of a plush and soft midcentury style rug on a wooden floor, 35mm photograph, film, professional, 4k, highly detailed",
- image=image,
- generator=generator,
-).images[0]
-image
-```
-
-
-
-
-
-
-
-
-Create a canny image with the [controlnet-aux](https://github.com/huggingface/controlnet_aux) library.
-
-```py
-from controlnet_aux.canny import CannyDetector
-from diffusers.utils import load_image
-
-canny_detector = CannyDetector()
-
-image = load_image("https://huggingface.co/datasets/hf-internal-testing/diffusers-images/resolve/main/sd_controlnet/hf-logo.png")
-image = canny_detector(image, detect_resolution=384, image_resolution=1024)
+image = image[:, :, None]
+image = np.concatenate([image, image, image], axis=2)
+canny_image = Image.fromarray(image)
```
-Now load a T2I-Adapter conditioned on [canny images](https://hf.co/TencentARC/t2i-adapter-canny-sdxl-1.0) and pass it
-to the [`StableDiffusionXLAdapterPipeline`].
+Pass the canny image to the pipeline to generate an image.
```py
-import torch
-from diffusers import StableDiffusionXLAdapterPipeline, T2IAdapter, EulerAncestralDiscreteScheduler, AutoencoderKL
-
-scheduler = EulerAncestralDiscreteScheduler.from_pretrained("stabilityai/stable-diffusion-xl-base-1.0", subfolder="scheduler")
vae = AutoencoderKL.from_pretrained("madebyollin/sdxl-vae-fp16-fix", torch_dtype=torch.float16)
-adapter = T2IAdapter.from_pretrained("TencentARC/t2i-adapter-canny-sdxl-1.0", torch_dtype=torch.float16)
pipeline = StableDiffusionXLAdapterPipeline.from_pretrained(
"stabilityai/stable-diffusion-xl-base-1.0",
- adapter=adapter,
+ adapter=t2i_adapter,
vae=vae,
- scheduler=scheduler,
torch_dtype=torch.float16,
- variant="fp16",
-)
-pipeline.to("cuda")
-```
-
-Finally, pass your prompt and control image to the pipeline.
+).to("cuda")
-```py
-generator = torch.Generator("cuda").manual_seed(0)
+prompt = """
+A photorealistic overhead image of a cat reclining sideways in a flamingo pool floatie holding a margarita.
+The cat is floating leisurely in the pool and completely relaxed and happy.
+"""
-image = pipeline(
- prompt="cinematic photo of a plush and soft midcentury style rug on a wooden floor, 35mm photograph, film, professional, 4k, highly detailed",
- image=image,
- generator=generator,
+pipeline(
+ prompt,
+ image=canny_image,
+ num_inference_steps=100,
+ guidance_scale=10,
).images[0]
-image
```
-
-
-
-
## MultiAdapter
-T2I-Adapters are also composable, allowing you to use more than one adapter to impose multiple control conditions on an
-image. For example, you can use a pose map to provide structural control and a depth map for depth control. This is
-enabled by the [`MultiAdapter`] class.
+You can compose multiple controls, such as a canny image and a depth map, with the [`MultiAdapter`] class.
-Let's condition a text-to-image model with a pose and depth adapter. Create and place your depth and pose image and in a list.
+The example below composes a canny image and a depth map.
+
+Load the control images and T2I-Adapters as a list.
```py
+import torch
from diffusers.utils import load_image
+from diffusers import StableDiffusionXLAdapterPipeline, AutoencoderKL, MultiAdapter, T2IAdapter
-pose_image = load_image(
- "https://huggingface.co/datasets/diffusers/docs-images/resolve/main/t2i-adapter/keypose_sample_input.png"
+canny_image = load_image(
+ "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/canny-cat.png"
)
depth_image = load_image(
- "https://huggingface.co/datasets/diffusers/docs-images/resolve/main/t2i-adapter/depth_sample_input.png"
+ "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/sdxl_depth_image.png"
)
-cond = [pose_image, depth_image]
-prompt = ["Santa Claus walking into an office room with a beautiful city view"]
-```
-
-
-
-
- depth image
-
-
-
- pose image
-
-
-
-Load the corresponding pose and depth adapters as a list in the [`MultiAdapter`] class.
-
-```py
-import torch
-from diffusers import StableDiffusionAdapterPipeline, MultiAdapter, T2IAdapter
+controls = [canny_image, depth_image]
+prompt = ["""
+a relaxed rabbit sitting on a striped towel next to a pool with a tropical drink nearby,
+bright sunny day, vacation scene, 35mm photograph, film, professional, 4k, highly detailed
+"""]
adapters = MultiAdapter(
[
- T2IAdapter.from_pretrained("TencentARC/t2iadapter_keypose_sd14v1"),
- T2IAdapter.from_pretrained("TencentARC/t2iadapter_depth_sd14v1"),
+ T2IAdapter.from_pretrained("TencentARC/t2i-adapter-canny-sdxl-1.0", torch_dtype=torch.float16),
+ T2IAdapter.from_pretrained("TencentARC/t2i-adapter-depth-midas-sdxl-1.0", torch_dtype=torch.float16),
]
)
-adapters = adapters.to(torch.float16)
```
-Finally, load a [`StableDiffusionAdapterPipeline`] with the adapters, and pass your prompt and conditioned images to
-it. Use the [`adapter_conditioning_scale`] to adjust the weight of each adapter on the image.
+Pass the adapters, prompt, and control images to [`StableDiffusionXLAdapterPipeline`]. Use the `adapter_conditioning_scale` parameter to determine how much weight to assign to each control.
```py
-pipeline = StableDiffusionAdapterPipeline.from_pretrained(
- "CompVis/stable-diffusion-v1-4",
+vae = AutoencoderKL.from_pretrained("madebyollin/sdxl-vae-fp16-fix", torch_dtype=torch.float16)
+pipeline = StableDiffusionXLAdapterPipeline.from_pretrained(
+ "stabilityai/stable-diffusion-xl-base-1.0",
torch_dtype=torch.float16,
+ vae=vae,
adapter=adapters,
).to("cuda")
-image = pipeline(prompt, cond, adapter_conditioning_scale=[0.7, 0.7]).images[0]
-image
+pipeline(
+ prompt,
+ image=controls,
+ height=1024,
+ width=1024,
+ adapter_conditioning_scale=[0.7, 0.7]
+).images[0]
```
-
\ No newline at end of file
diff --git a/docs/source/en/using-diffusers/textual_inversion_inference.md b/docs/source/en/using-diffusers/textual_inversion_inference.md
index 6315caef10b6..9923bc22fd69 100644
--- a/docs/source/en/using-diffusers/textual_inversion_inference.md
+++ b/docs/source/en/using-diffusers/textual_inversion_inference.md
@@ -10,109 +10,56 @@ an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express o
specific language governing permissions and limitations under the License.
-->
-# Textual inversion
+# Textual Inversion
-[[open-in-colab]]
+[Textual Inversion](https://huggingface.co/papers/2208.01618) is a method for generating personalized images of a concept. It works by fine-tuning a model's word embeddings on 3-5 images of the concept (for example, pixel art) and associating the concept with a unique placeholder token. Including that token in your prompt triggers the model to generate images of the concept.
-The [`StableDiffusionPipeline`] supports textual inversion, a technique that enables a model like Stable Diffusion to learn a new concept from just a few sample images. This gives you more control over the generated images and allows you to tailor the model towards specific concepts. You can get started quickly with a collection of community created concepts in the [Stable Diffusion Conceptualizer](https://huggingface.co/spaces/sd-concepts-library/stable-diffusion-conceptualizer).
-
-This guide will show you how to run inference with textual inversion using a pre-learned concept from the Stable Diffusion Conceptualizer. If you're interested in teaching a model new concepts with textual inversion, take a look at the [Textual Inversion](../training/text_inversion) training guide.
-
-Import the necessary libraries:
+Textual Inversion weights are very lightweight and typically only a few KBs because they're only word embeddings. However, this also means the word embeddings need to be loaded after loading a model with [`~DiffusionPipeline.from_pretrained`].
```py
import torch
-from diffusers import StableDiffusionPipeline
-from diffusers.utils import make_image_grid
-```
-
-## Stable Diffusion 1 and 2
-
-Pick a Stable Diffusion checkpoint and a pre-learned concept from the [Stable Diffusion Conceptualizer](https://huggingface.co/spaces/sd-concepts-library/stable-diffusion-conceptualizer):
-
-```py
-pretrained_model_name_or_path = "stable-diffusion-v1-5/stable-diffusion-v1-5"
-repo_id_embeds = "sd-concepts-library/cat-toy"
-```
-
-Now you can load a pipeline, and pass the pre-learned concept to it:
+from diffusers import AutoPipelineForText2Image
-```py
-pipeline = StableDiffusionPipeline.from_pretrained(
- pretrained_model_name_or_path, torch_dtype=torch.float16, use_safetensors=True
+pipeline = AutoPipelineForText2Image.from_pretrained(
+ "stable-diffusion-v1-5/stable-diffusion-v1-5",
+ torch_dtype=torch.float16
).to("cuda")
-
-pipeline.load_textual_inversion(repo_id_embeds)
```
-Create a prompt with the pre-learned concept by using the special placeholder token `<cat-toy>`, and choose the number of samples and rows of images you'd like to generate:
+Load the word embeddings with [`~loaders.TextualInversionLoaderMixin.load_textual_inversion`] and include the unique token in the prompt to activate its generation.
```py
-prompt = "a grafitti in a favela wall with a <cat-toy> on it"
-
-num_samples_per_row = 2
-num_rows = 2
-```
-
-Then run the pipeline (feel free to adjust the parameters like `num_inference_steps` and `guidance_scale` to see how they affect image quality), save the generated images and visualize them with the helper function you created at the beginning:
-
-```py
-all_images = []
-for _ in range(num_rows):
- images = pipeline(prompt, num_images_per_prompt=num_samples_per_row, num_inference_steps=50, guidance_scale=7.5).images
- all_images.extend(images)
-
-grid = make_image_grid(all_images, num_rows, num_samples_per_row)
-grid
+pipeline.load_textual_inversion("sd-concepts-library/gta5-artwork")
+prompt = "A cute brown bear eating a slice of pizza, stunning color scheme, masterpiece, illustration, <gta5-artwork> style"
+pipeline(prompt).images[0]
```
-
+
-## Stable Diffusion XL
+Textual Inversion can also be trained to learn *negative embeddings* to steer generation away from unwanted characteristics such as "blurry" or "ugly". It is useful for improving image quality.
-Stable Diffusion XL (SDXL) can also use textual inversion vectors for inference. In contrast to Stable Diffusion 1 and 2, SDXL has two text encoders so you'll need two textual inversion embeddings - one for each text encoder model.
-
-Let's download the SDXL textual inversion embeddings and have a closer look at it's structure:
+EasyNegative is a widely used negative embedding that contains multiple learned negative concepts. Load it by specifying the file name and the token associated with the embedding, then pass the token to `negative_prompt` in your pipeline to activate it.
```py
-from huggingface_hub import hf_hub_download
-from safetensors.torch import load_file
-
-file = hf_hub_download("dn118/unaestheticXL", filename="unaestheticXLv31.safetensors")
-state_dict = load_file(file)
-state_dict
-```
-
-```
-{'clip_g': tensor([[ 0.0077, -0.0112, 0.0065, ..., 0.0195, 0.0159, 0.0275],
- ...,
- [-0.0170, 0.0213, 0.0143, ..., -0.0302, -0.0240, -0.0362]],
- 'clip_l': tensor([[ 0.0023, 0.0192, 0.0213, ..., -0.0385, 0.0048, -0.0011],
- ...,
- [ 0.0475, -0.0508, -0.0145, ..., 0.0070, -0.0089, -0.0163]],
-```
-
-There are two tensors, `"clip_g"` and `"clip_l"`.
-`"clip_g"` corresponds to the bigger text encoder in SDXL and refers to
-`pipe.text_encoder_2` and `"clip_l"` refers to `pipe.text_encoder`.
-
-Now you can load each tensor separately by passing them along with the correct text encoder and tokenizer
-to [`~loaders.TextualInversionLoaderMixin.load_textual_inversion`]:
-
-```py
-from diffusers import AutoPipelineForText2Image
import torch
+from diffusers import AutoPipelineForText2Image
-pipe = AutoPipelineForText2Image.from_pretrained("stabilityai/stable-diffusion-xl-base-1.0", variant="fp16", torch_dtype=torch.float16)
-pipe.to("cuda")
-
-pipe.load_textual_inversion(state_dict["clip_g"], token="unaestheticXLv31", text_encoder=pipe.text_encoder_2, tokenizer=pipe.tokenizer_2)
-pipe.load_textual_inversion(state_dict["clip_l"], token="unaestheticXLv31", text_encoder=pipe.text_encoder, tokenizer=pipe.tokenizer)
-
-# the embedding should be used as a negative embedding, so we pass it as a negative prompt
-generator = torch.Generator().manual_seed(33)
-image = pipe("a woman standing in front of a mountain", negative_prompt="unaestheticXLv31", generator=generator).images[0]
-image
+pipeline = AutoPipelineForText2Image.from_pretrained(
+ "stable-diffusion-v1-5/stable-diffusion-v1-5",
+ torch_dtype=torch.float16
+).to("cuda")
+pipeline.load_textual_inversion(
+ "EvilEngine/easynegative",
+ weight_name="easynegative.safetensors",
+ token="easynegative"
+)
+prompt = "A cute brown bear eating a slice of pizza, stunning color scheme, masterpiece, illustration"
+negative_prompt = "easynegative"
+pipeline(prompt, negative_prompt=negative_prompt).images[0]
```
+
+