Beginner's Guide - Generate Videos With SwarmUI #716
So, you want to generate AI videos with SwarmUI? Don't worry, it's easy!
(Forenote: this guide was written in April 2025. Things are likely to change in the future, and this guide will eventually be outdated.)
Part One: Pick A Video Model
Video models supported in SwarmUI are documented in the Video Model Support document. That page is kept up to date, with a list of all supported model classes, details about the unique usage needs of each, a chart to guide you on picking the right model, and a general recommendation for what most users should pick.
When you're first starting out with video gen, keep it simple: Use the base models for the given model class, don't play with parameters too much, and use easy / friendly test content to generate. If something goes wrong, you might have to ask for help, and you don't want to show someone your weirdest gens or unreadably long prompt/parameter piles.
Once you've got the basics down, move on to generating what you're actually hoping for. Search Civitai or other model sites for finetuned model variants or LoRAs that fit what you're after, and feed the model with prompts/parameters that you actually want.
At time of writing, the leading video model class is Wan 2.1. In this case, the docs give a pretty long list of install options. Because I'm running an RTX 4090, I can fit the large variant (14B), and I'll have the best performance with fp8 models. I want both text2video and image2video models, and for i2v I prefer the faster option instead of the higher-res option. So I'm grabbing Wan 2.1 Text2Video 14B fp8_scaled and Wan 2.1 Image2Video 14B 480p fp8_scaled. Your own choices might be different, and of course if you're reading this in the future there might be different options available.
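(If you want a rough sense of whether a given variant will fit your card, you can ballpark the weights' memory footprint from parameter count and precision. This is back-of-envelope only - it ignores the text encoder, VAE, activations, and any offloading Swarm does - but it explains why fp8 of the 14B model is comfortable on a 24GB card:)

```python
# Rough VRAM footprint of model weights alone: params * bytes-per-param.
# Ignores text encoder, VAE, activations, and any offloading Swarm does.
def weight_gib(param_count: float, bytes_per_param: float) -> float:
    return param_count * bytes_per_param / (1024 ** 3)

for name, bytes_pp in [("fp16/bf16", 2), ("fp8", 1)]:
    print(f"Wan 14B @ {name}: ~{weight_gib(14e9, bytes_pp):.1f} GiB")
    print(f"Wan 1.3B @ {name}: ~{weight_gib(1.3e9, bytes_pp):.1f} GiB")
```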
Download the model and save it in the relevant folder (usually diffusion_models). I prefer to organize my models into subfolders, so I'm saving these Wan models into SwarmUI/Models/diffusion_models/Wan/. Refresh your models list in Swarm and make sure the model shows up. Feel free to click the "=" menu on the model and then "Edit Metadata" to add some extra info or an icon to the model.
In my actual personal setup, my Wan folder is full of a bunch of different Wan variant models, and I have tacked on lazy icons to recognize a few of them more easily.
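(If you download a lot of models, a tiny script can keep that folder organized for you. This is just a hypothetical helper - the download location and filename pattern are examples, so adjust them to match your own setup:)

```python
from pathlib import Path
import shutil

# Example paths only - adjust to your own SwarmUI install and download location.
downloads = Path.home() / "Downloads"
wan_folder = Path("SwarmUI/Models/diffusion_models/Wan")
wan_folder.mkdir(parents=True, exist_ok=True)

# Move any downloaded Wan safetensors files into the Wan subfolder.
for model_file in downloads.glob("wan2.1*.safetensors"):
    shutil.move(str(model_file), wan_folder / model_file.name)
    print(f"Moved {model_file.name} -> {wan_folder}")
```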
Part Two: Basic Text2Video
Setup T2V
Text-To-Video is the most basic form of AI video generation. You type a prompt, you get a video, done deal. To be honest with you, it's not a great method, for reasons we'll get into later... but it's usually fast and easy, and every model class supports it, so let's start there.
In your models list, click your Text2Video model to select it.
Make sure your other parameters are default - if you're not sure, click "Quick Tools" at the top right, then "Reset Params to Default"

In your parameter list (the left sidebar), configure parameters according to the video model support doc and your choices.
In my case, with Wan Text2Video 14B, I've made the adjustments the doc calls for. One worth highlighting: I set the "Text2Video Format" to gif-hd, which is the best format GitHub natively embeds. I usually prefer webp, but a lot of sites don't support that.
Understanding Params
Note: When in doubt, there's docs! Are you curious, for example, about what the options for that "Text2Video Format" param actually are? Just click that "?" button.
SwarmUI is covered in docs, both in the docs folder (where the video model support doc is) and in-line in the UI. You should never feel completely lost while working in SwarmUI - there's always a way to figure things out. Worst case scenario, if nothing in the UI or the docs clarifies it, come ask on the Discord.
Generate
Now, the most important parameter: prompt! I want something dramatic, but cute, which represents how cool it is that SwarmUI is generating videos for me... so how about
real video of a cat walking through a dimly lit rainbow forest, beneath a neon sign that reads "Swarm UI", shot on Sony a6100
Different models have different prompting needs. Wan is a model that likes simple, clear English or Chinese language sentences. A minimal bit of "tagging" can help guide style, but don't overdo it - in this case I'll just add "shot on Sony a6100" to encourage it to look like a real camera video instead of a cartoon aesthetic.
Then... hit that big "Generate" button! Wan-14B is pretty slow; this took me about 3 and a half minutes to generate:

That... is decent, but not quite what I was hoping for. It's got all the pieces, but not really focused on the cat walking around like I wanted.
If speed is an issue, other models are faster - Wan 1.3B, for example, or LTX-V which is quite fast - but check the video model support doc for up-to-date recommendations.
If you don't like the outcome, try changing the basic parameters - frame count, prompt, resolution, etc. - and try again. Or, generate again without changing any param (with Seed set to -1, i.e. randomize) to see if you'll get lucky on the next try.
I recommend always doing a variety of generations with any new model while you're starting, just to get familiar with how the model responds to inputs.
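If you'd rather script that kind of exploration than click Generate over and over, SwarmUI also has an HTTP API (documented in the docs folder alongside the video model support doc). The sketch below leans on my reading of those docs - the GetNewSession / GenerateText2Image routes, the default port 7801, and the parameter keys are assumptions that may differ in your version, so check the API doc before relying on it:

```python
import requests

# Assumptions: a local SwarmUI at this address, and the GetNewSession /
# GenerateText2Image API routes described in SwarmUI's API docs.
# Parameter keys may differ in your version - verify against your install.
BASE = "http://localhost:7801"

session = requests.post(f"{BASE}/API/GetNewSession", json={}).json()
session_id = session["session_id"]

prompt = ('real video of a cat walking through a dimly lit rainbow forest, '
          'beneath a neon sign that reads "Swarm UI", shot on Sony a6100')

# Queue a handful of seed variations to get a feel for the model's range.
for seed in [1, 2, 3, 4]:
    result = requests.post(f"{BASE}/API/GenerateText2Image", json={
        "session_id": session_id,
        "images": 1,
        "prompt": prompt,
        "seed": seed,
    }).json()
    print(seed, result.get("images"))
```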
I fiddled with the params and played luck-o-the-seed a bit, and ended up with the video I used as the header of this guide using the same prompt and a different resolution.
Here I have a generation going where I can already see the composition isn't how I want it to be:

So I'm going to go ahead and click the "Interrupt" button to tell it to stop:

That will end the generation early (it may take a few seconds to process the interrupt) and allow you to immediately queue up a new attempt.
Watch The Gen
Most video models in SwarmUI natively support live previews, so while you're waiting for it to generate, you can watch a preview of the video that's coming.
Part Three: Text To Image To Video
Now let's talk about the approach I think is better for AI video generation: generate an image you really like, and then use an image-to-video model to make it move.
I prefer this because image models often run in seconds, so you can experiment a lot with images, whereas text2video often takes a while to generate - and you don't want to wait 3 minutes just to find out the result was bad. There's also tons of LoRAs and other customizations out there for image models, whereas video models often have fewer available.
Swarm makes text-to-image-to-video super easy to do, so let's go for it!
Set up your image generation
First, get image generation going. Basic image gen setup is covered in the Basic Usage Doc. In my case, I'm going to use Flux Dev with CFG=1 (required by Flux Dev), largely default parameters, and the same prompt as the above generation.

First try looks awesome.
Enable image to video
Now, let's enable the Image To Video parameter group and select the video model we're using (in my case Wan 14B 480p fp8).

Most parameters here you can leave default/unset; they will automatically default correctly. The big one you'll want to play with is of course frame count. That "Video Resolution" parameter is magic: it will by default automatically resize the Flux image (1024x1024) to the resolution set in the video model's metadata (in this case, 640x640), accounting for whatever aspect ratio you used too. Convenient!
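If you're curious what that resize actually works out to numerically, here's one plausible way to compute it: keep the source aspect ratio, target roughly the model's pixel area, and snap to friendly multiples. To be clear, this is my own sketch of the idea, not SwarmUI's actual code, so exact numbers may differ slightly:

```python
def video_resolution(src_w: int, src_h: int, model_res: int = 640, snap: int = 16):
    """Keep the source aspect ratio while targeting ~model_res^2 pixels."""
    aspect = src_w / src_h
    target_area = model_res * model_res
    h = (target_area / aspect) ** 0.5
    w = h * aspect
    # Snap to a multiple the model is comfortable with.
    return int(round(w / snap) * snap), int(round(h / snap) * snap)

print(video_resolution(1024, 1024))  # -> (640, 640), matching the example above
print(video_resolution(1216, 832))   # a 3:2-ish image lands around (768, 528)
```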
I'll once again use gif-hd so I can post to GitHub here.
Note I left "Video CFG" unchecked: Wan default CFG of 6 is perfectly fine, and Swarm will automatically apply the class-appropriate-default CFG to a video when that's unchecked. This is different from base model generation, where you're expected to set CFG yourself normally.
Generate the video
Now hit "Generate" again - you'll have an image generate, and then it will generate a video in which the first frame is the image you just made, and the rest of the video is hopefully in moving in a neat way.

Don't like the image you got, and don't want to wait for the video? Just hit that Interrupt button.
The video will be cancelled and you can try again.
In my case, the image and video it made are pretty neat, I think:

Alternatives
Like the image you got, but don't like the video you got? The other option available is direct image to video, covered below. You can simply generate images in advance, then separately go and generate videos of them. This lets you play with video params and roll seeds more.
Another concern that can arise here is you might simply run out of system RAM - two entire diffusion models loaded can eat up some space! In that case, you'll want to first generate images, then stop and switch to image-to-video generation.
Part Four: Direct Image To Video
Have an image of your content already, or generated one in advance with a text-to-image model? There's an app for that - er, there's an easy way to do that, too!
Set it up
First, drag your image onto the "Init Image" parameter, and set "Init Image Creativity" to 0 (!Important! Make sure creativity is set to 0! Forgetting this is a common mistake!)
In my case, I'm grabbing the flux gen I made earlier:

You'll also want to copy the image aspect ratio using the "Res" button next to Init Image

Double check your "Resolution" parameter is set how you expect it to be.
NOTE: Swarm's main Generate tab interface is an image generation system, and image2video is a special case normally reserved for text2image2video setups, so what we're doing here is a little trick where we set up text2image2video, but skip the text2image stage. That's why we're using "Init Image" with "Creativity=0", and why we need to be careful with model selection.
In the "Models" menu at the bottom, you can select any model you want, it doesn't particularly matter, because the text2image stage is being skipped - however it's common to select to the image-to-video model here just to avoid memory/load issues. Note that you cannot use dedicated image2video models as a real base model, we're only allowed to select it here because we're explicitly skipping that stage.
Now, the real setup: enable the "Image To Video" parameter group, and set things up how you want. Select the video model we're using (in my case Wan 14B 480p fp8). Most parameters here you can leave default/unset; they will automatically default correctly. The big one you'll want to play with is of course frame count.
For right now, I want to generate videos very quickly, so I'm going to set Frames down to 33, and I'm going to do a little trick: First, I set the Resolution to a custom 512x512:


Then, I'm going to set "Video Resolution" to "Image", meaning it copies my standard resolution parameter without any resize magic.
Without this, the default "Image Aspect, Model Res" would resize the image to the video model's default (640x640), but I want to go lower than that just to get some more speed.
And, of course, format gif-hd because I need to post my outputs here on GitHub. You'll probably use webp.
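Side note on picking frame counts: clip length is just frames divided by the model's output frame rate. Wan 2.1 outputs at 16 fps as far as I know (check the video model support doc for your model's rate), so a quick helper makes it obvious what you're asking for:

```python
def clip_seconds(frames: int, fps: float = 16.0) -> float:
    """Length of the resulting clip; fps here assumes Wan 2.1's 16 fps output."""
    return frames / fps

print(clip_seconds(33))  # ~2.1 seconds - my quick test setting above
print(clip_seconds(81))  # ~5.1 seconds - a more typical full-length Wan gen
```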
Here's my final params:

And 90 seconds later, I got a quick gen output:

... That's a bit wonky, not quite the type of rainbows I was hoping for. I'm not quite prompting this right!
The nice thing about Wan's I2V models is that prompting is actually very easy: we don't need to tell it what's in the image, it already knows! We only need to prompt the motion. Because I prompted for rainbows above, it added rainbow motion. I don't want that; I just want the cat to walk forward. Let's make it way simpler:

the cat walks forward through the forest
Wowza! Way better!
Part Five: Going Beyond
There's so much more you can do with video generation now that you've got the basics.
How about trying some other model classes? There's new ones all the time.
How about some high res / high length / high detail gens? Can you make something beautiful?
There's tons of performance/microquality/etc. hacks out there - TorchCompile, TeaCache, etc. - details are beyond the scope of this guide, but look around at what parameters are available in the "Advanced" section and what Extensions are available in the server tab for some options. Also don't be afraid to look at online discussions on Discord, GitHub, Reddit, etc. to see what the hot new techniques are.
Once you've got a good approach locked in, my favorite part: bulk automation! Set up a Text-To-Image-To-Video pipeline you like, get some prompt formats and wildcards that create great results, set "Images" to 100, hit "Generate", and go to bed. When you wake up in the morning, scroll through all the cool videos you generated overnight and hit the Star button on your favorites to save them to a special folder of your image history.
Want to bulk automate image-to-video? Fill up a folder on your PC with images, set the filenames to appropriate prompts for the images, then in SwarmUI use Tools -> Image Edit Batcher -> give it your input folder, pick an output folder, check "Use As Init" and "Append Filename to Prompt", then hit "Run Batch" (replaces the Generate button).
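If your source images don't already have prompt-worthy filenames, a small prep script can copy them into a batch folder renamed from captions you've written, so "Append Filename to Prompt" has something useful to work with. This is purely a hypothetical helper - the captions.txt format here is my own invention:

```python
from pathlib import Path
import shutil

# Hypothetical setup: captions.txt has lines like "cat01.png | the cat walks forward"
src = Path("my_images")
dst = Path("batch_input")
dst.mkdir(exist_ok=True)

for line in (src / "captions.txt").read_text(encoding="utf-8").splitlines():
    if "|" not in line:
        continue
    filename, caption = (part.strip() for part in line.split("|", 1))
    image = src / filename
    if image.exists():
        # Keep the caption filesystem-safe; it becomes part of the prompt later.
        safe = "".join(c for c in caption if c.isalnum() or c in " ,-_")
        shutil.copy(image, dst / f"{safe}{image.suffix}")
```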