Stable Cascade Released Within 24 Hours! A New, Better, and Faster Diffusion Model!

Future Thinker @Benji
14 Feb 2024 · 16:23

Summary

TLDR: This video introduces Stable Cascade, the latest AI diffusion model released by Stability AI. Built on the Würstchen architecture, it can train a diffusion model faster on much smaller latents while still generating standard high-resolution images. Stable Cascade uses a three-stage image generation process that improves both processing speed and quality, and it supports LoRA, ControlNet, and IP-Adapter. The video also compares the model's performance against existing models, showing its advantages in prompt alignment and aesthetic quality. Finally, the author tests Stable Cascade on the Hugging Face demo page, demonstrating its ability to generate complex scenes and fine detail, and argues the new model is a major contribution to AI image generation.

Takeaways

  • 🚀 Stable Cascade is the latest AI diffusion model released by Stability AI, built on the Würstchen architecture.
  • 🌟 The model trains on a much smaller latent size (24x24) than the 128x128 used by traditional Stable Diffusion, a difference the release quotes as roughly a 42x compression factor, which speeds up image generation (see the arithmetic sketch after this list).
  • 🔍 Stable Cascade accepts more natural-language input and handles complex text prompts better than traditional Stable Diffusion (v1.5).
  • 🎨 It produces high-quality output through a three-stage generation process: a latent generator, a latent decoder, and a refinement stage.
  • 🏆 In the published evaluation, Stable Cascade performs strongly on prompt alignment and aesthetic quality, beating several other diffusion models on the market.
  • 🛠️ The model also introduces new control mechanisms such as ControlNet and IP-Adapter, plus LoRA support, improving customization of generated images.
  • 📊 Stable Cascade exposes new generation parameters, such as the prior guidance scale, decoder guidance scale, and inference steps, giving users more tuning options.
  • 🌐 The model is not yet integrated into tools like Automatic1111 or ComfyUI, but its GitHub page and Hugging Face demo page are open for public testing.
  • 💡 In the demo tests, Stable Cascade handled multi-element text prompts and generated images with dynamic elements and detailed backgrounds.
  • 🔗 The model is currently available for research purposes only; no commercial license has been granted yet.
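
To ground the compression numbers in the second bullet, here is a minimal arithmetic sketch; the figures (a 1024x1024 output image, a 24x24 Stable Cascade latent, a 128x128 Stable Diffusion latent) are the ones quoted in the video and in Stability AI's announcement, not independent measurements:

```python
# Spatial compression of a 1024x1024 image into each model's latent space,
# using the figures quoted in the video / announcement.
image_side = 1024
cascade_latent_side = 24   # Stable Cascade (Würstchen) Stage C latent
sd_latent_side = 128       # Stable Diffusion latent at the same resolution

# Per-side compression: 1024 / 24 ≈ 42.7, which is where the "factor of 42"
# in the announcement comes from; Stable Diffusion's VAE compresses 8x per side.
print(image_side / cascade_latent_side)   # ~42.7
print(image_side / sd_latent_side)        # 8.0

# In latent area, the diffusion model has ~28x fewer positions to denoise.
print(sd_latent_side**2 / cascade_latent_side**2)  # ~28.4
```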

Q & A

  • What is Stable Cascade?

    -Stable Cascade is the latest AI diffusion model released by Stability AI, built on the Würstchen architecture and designed to generate images faster and more efficiently.

  • How is Stable Cascade different from traditional Stable Diffusion models?

    -Stable Cascade trains its encoder on a much smaller latent size (24x24 pixels) instead of the traditional 128x128, a difference the video quotes as roughly 42x, which makes image generation much faster.

  • What new features does Stable Cascade support?

    -It supports LoRA, ControlNet, IP-Adapter, and LCM, along with more natural-language prompting, and offers advanced controls such as face identity, image enhancement (super resolution), and object-specific training.

  • What are the stages of Stable Cascade's image generation process?

    -Generation runs in three stages: first a latent generator (Stage C) turns the text prompt into a rough latent sketch of the image, then a latent decoder (Stage B) assembles those latents into objects, and a final stage (Stage A) refines everything into the finished image. (A runnable sketch of this pipeline follows this Q&A list.)

  • How does Stable Cascade perform?

    -On prompt alignment it surpasses the other models tested, such as SDXL and Playground v2; on aesthetic quality it beats most of them but scores slightly below Playground v2.

  • How can you access and test Stable Cascade?

    -Stable Cascade has a demo page on Hugging Face where users can test the model; it is not yet officially supported in systems like Automatic1111 or ComfyUI.

  • How do Stable Cascade's text prompts differ from earlier models?

    -Compared with traditional Stable Diffusion 1.5, Stable Cascade encourages text prompts written as natural-language sentences rather than simple comma-separated keywords.

  • Can Stable Cascade be used commercially?

    -For now, Stable Cascade is released for research purposes only; no commercial license has been offered yet.

  • What new tuning options does Stable Cascade's image generation offer?

    -Besides the usual width, height, and image-count settings, Stable Cascade introduces new options such as the prior guidance scale, the decoder guidance scale, and separate inference-step counts for each stage.

  • How does Stable Cascade handle complex image generation requests?

    -Through its three-stage generation process and advanced controls, Stable Cascade can handle text prompts containing multiple elements and reflect each of them accurately in the image.
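
The video only exercises the hosted demo, but to make the three-stage flow concrete in code, here is a minimal sketch using the two-pipeline diffusers integration that Stability AI published on the model card (Stage C is the "prior" pipeline; Stages B and A live inside the "decoder" pipeline). The model IDs, argument names, and default values below are taken from that public model card, which may postdate the video, so treat them as assumptions rather than the video's own instructions:

```python
import torch
from diffusers import StableCascadePriorPipeline, StableCascadeDecoderPipeline

device = "cuda"
prompt = "an old man walking with his grandson, holding hands, at sunset"

# Stage C: the latent generator ("prior") turns the text prompt into a
# compact image embedding (the 24x24-scale latent discussed above).
prior = StableCascadePriorPipeline.from_pretrained(
    "stabilityai/stable-cascade-prior", torch_dtype=torch.bfloat16
).to(device)
prior_output = prior(
    prompt=prompt,
    height=1024,
    width=1024,
    guidance_scale=4.0,       # the demo's "prior guidance scale"
    num_inference_steps=20,   # the demo's "prior inference steps"
)

# Stages B + A: the decoder expands the embedding back into pixels and
# refines it into the final 1024x1024 image.
decoder = StableCascadeDecoderPipeline.from_pretrained(
    "stabilityai/stable-cascade", torch_dtype=torch.float16
).to(device)
image = decoder(
    image_embeddings=prior_output.image_embeddings.to(torch.float16),
    prompt=prompt,
    guidance_scale=0.0,       # the demo's "decoder guidance scale"
    num_inference_steps=10,   # the demo's decoder-stage "inference steps"
    output_type="pil",
).images[0]
image.save("stable_cascade_test.png")
```

Note how the four new demo controls (prior guidance scale, prior inference steps, decoder guidance scale, inference steps) map one-to-one onto the two pipeline calls, which is why the author expects Automatic1111 and ComfyUI to need new nodes or input fields to support them.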

Outlines

00:00

🚀 Stable Cascade: A Look at the Latest AI Diffusion Model

This segment introduces Stable Cascade, a new AI diffusion model just released by Stability AI. Stable Cascade builds on the previously covered Würstchen architecture, which aims to speed up diffusion model training while using much smaller latent pixels to generate high-quality images. Compared with traditional Stable Diffusion models, its training data requirement is quoted as roughly 42 times smaller, with a significant speedup in image generation. Stable Cascade also supports control techniques such as LoRA, ControlNet, and IP-Adapter, plus LCM, suggesting it may soon be compatible with existing web UI systems such as Automatic1111 or ComfyUI. The author also walks through the model's three generation stages, text-prompt-driven latent generation, latent decoding, and image refinement, arguing that they give it an edge over existing approaches.

05:00

🌐 Trying Stable Cascade: Demo and Evaluation

In this segment the author turns to Stable Cascade's demo and evaluation, focusing on the demo page hosted on Hugging Face. The author explains how to reach and use the demo, including the model card and GitHub page, which guide users on writing text prompts for the new model. The segment also covers Stable Cascade's face-identity control, image-detail enhancement (super resolution), and image recognition, highlighting its advantages over models such as SDXL and Stable Diffusion 1.5. The published evaluation shows Stable Cascade beating other diffusion models on prompt alignment and aesthetic quality, although it scores slightly below Playground v2 on aesthetic quality.

10:01

🔍 Hands-on Tests: Stable Cascade in Practice

The author runs several concrete text prompts through Stable Cascade to analyze its real-world performance. The tests show how the model handles complex scenes, character actions, and lighting effects, especially when generating images with multiple elements. Comparing against older models, the author notes that Stable Cascade stands out in clarity, detail handling, and prompt alignment. Some limitations remain, such as under-detailed character features (the eyes needed a follow-up prompt), but overall the model outperforms its predecessors.

15:02

🎉 Conclusion and Outlook

Wrapping up, the author speaks highly of Stable Cascade, especially its potential to improve AI animation quality and image generation speed. Compared with older models, Stable Cascade shows clear gains in image content richness and action capture. The author looks forward to the model's future use in commercial and research settings and encourages viewers to try the new technology. Finally, the author hopes to cover the Stable Video Diffusion update in a future video and thanks viewers for their support.


Keywords

💡Stable Cascade

Stable Cascade is the latest AI diffusion model released by Stability AI. It is built on the Würstchen architecture, a design aimed at improving both the speed and quality of image generation. The video notes that this architecture takes an efficient approach to processing and generating images: it lets the model train on much smaller latent pixels while keeping the ability to produce high-resolution images, making the generation process faster and more efficient.

💡Diffusion model

A diffusion model is a deep-learning technique for generating images: it works by progressively adding noise to images and then learning to remove that noise step by step. The script mentions several diffusion models, including Stable Video Diffusion 1.1 and Stable Cascade. They all generate high-quality images from text prompts, but Stable Cascade uses a new architecture and techniques to make the generation process more efficient.
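
To illustrate the add-noise/denoise idea in the entry above, here is a generic DDPM-style sketch; it is a textbook illustration of forward diffusion, not Stable Cascade's actual training code, and the function and variable names are hypothetical:

```python
import torch

def noise_image(x0: torch.Tensor, alpha_bar_t: float) -> torch.Tensor:
    """Forward diffusion: blend a clean image x0 toward pure Gaussian noise.

    alpha_bar_t is the cumulative noise-schedule term at timestep t;
    near t=0 the output is almost x0, near t=T it is almost pure noise.
    """
    noise = torch.randn_like(x0)
    return (alpha_bar_t ** 0.5) * x0 + ((1.0 - alpha_bar_t) ** 0.5) * noise

# Generation runs the reverse direction: a trained network repeatedly
# predicts the noise in the current sample and removes a little of it,
# stepping from t=T (pure noise) back to t=0 (a clean image).
```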

💡Würstchen architecture

Würstchen (rendered as "Ver Chen" in the video's audio) is the architecture underlying the Stable Cascade model. It lets the model train on a much smaller latent space (24x24 pixels) than the 128x128 latents of traditional Stable Diffusion, a difference the video quotes as roughly 42x. This design makes image generation far more efficient while preserving the ability to produce high-resolution output.

💡LCM

LCM is mentioned as one of the features Stable Cascade supports. The video does not spell it out, but in the Stable Diffusion ecosystem LCM usually refers to Latent Consistency Models, a technique for generating images in far fewer sampling steps. Support for it would give the model more flexibility and let it run efficiently even in resource-constrained environments.

💡Image generation

Image generation is the process of using an AI model to create images automatically from text descriptions. In the video, Stable Cascade generates images from text prompts, showing its strength at handling multi-element prompts and producing high-quality output. The technique opens new possibilities for creative industries, content creation, and other applications, making it much easier to produce images of a given subject or scene.

💡Natural language processing

In this context, natural language processing (NLP) refers to the model's ability to understand input written in ordinary human language. Stable Cascade parses natural-language text prompts and generates images from them, indicating that it can understand complex textual descriptions and turn them into concrete visual content.

💡Performance evaluation

Performance evaluation means measuring how well an AI model performs on a given task. The video compares Stable Cascade with other models on prompt alignment and aesthetic quality. The comparison shows Stable Cascade's advantage in image generation, particularly its handling of multi-element prompts and its ability to keep images aesthetically pleasing.

💡GitHub page

The GitHub page is mentioned as one of the resources for obtaining Stable Cascade and its code. It gives users a way to download, explore, and use the model: through the GitHub page, users can read the code, understand how the model works, and modify it or integrate it into their own projects.

💡ControlNet

ControlNet refers to techniques for precisely controlling the attributes and features of generated images. For example, it can steer specific objects in the output, such as facial identity, or aspects of the scene (the video mentions a Canny edge variant). These controls give users finer-grained customization and optimization of generated images.

💡Image recognition

Image recognition is an AI model's ability to identify and understand image content. The video claims Stable Cascade does well here, meaning it can accurately identify and interpret the objects and scenes involved. That matters for generating images that match the text description: it helps ensure the output is not just visually appealing but closely aligned with the prompt.


Transcripts

[00:00] Let's talk about Stable Cascade, the new AI diffusion model that just released. AI has been moving really fast in development, and new AI models release every day. Look at Hugging Face, they have everything listed here. You see the MetaVoice one; I'm going to talk about that on the large language model channel soon. Then scrolling down here, I see Stability AI has Stable Video Diffusion 1.1. I was going to cover that, but when I scrolled up a little bit they showed Stability AI's Stable Cascade, and I saw this was posted not even a day ago, just 16 hours ago. I checked it out and said, okay, forget about the Stable Video Diffusion 1.1 update, let's do this one.

[00:56] Because when I saw this one, they said this model is built on the Würstchen architecture, and I covered that diffusion model previously on my YouTube channel, right here. This is the "sausage": when I searched the name, it turned out to be German for a sausage, and that's how I came up with the thumbnail of that video. So I found this very interesting: a newer model from Stability AI, a new diffusion model built using Würstchen.

[01:32] We've talked about this architecture before. It can train diffusion models at faster speed with smaller-pixel images, and you can still produce SDXL-standard-size images. Like this one: we have a 1024x1024 image, but the encoding in this architecture uses 24x24 pixels instead, which is 42 times smaller training data compared with traditional Stable Diffusion 1.5's 128x128 pixels. And it's even faster than SDXL, because, well, why not, right? You build a newer model, of course it's going to perform better than the older AI models.

[02:19] And one thing that's really good about this model: Stable Cascade, created by Stability AI, also supports LoRA, ControlNet, IP-Adapter, and LCM. Oh my goodness, this is insane. If there are new updates for Automatic1111 or ComfyUI, or any web UI system that supports Stable Diffusion, I believe later on we'll have an update that can run Stable Cascade image generation on those systems.

[02:50] And one piece of very good news is that they have a new demo page where we can test this model. Right now they have not officially released any support in Automatic1111 or ComfyUI; they just published this today, within 24 hours. So I'm going to say, okay, forget about the Stable Video Diffusion update, and let's do this one first.

[03:16] So for this model, let's go through the overview and some technical background. First of all, you see that Stable Cascade separates the image generation process into three stages. Stage C is the latent generator: it takes your input text, meaning your text prompt, and generates the rough idea of the image. Then they pass it to Stage B, the latent decoder, which lets the AI take those little dot pixels and assemble them back into whole objects. And then, in Stage A, those objects get refined and tuned, and you get the full image as your result.

[04:07] Within this design, I would say they get better performance than Stable Diffusion, since they use smaller-size pixels for their encoder training, and the data being processed is something like 42 times smaller compared with traditional Stable Diffusion. That's a real advantage for faster processing: whether you have a lower-end GPU graphics card or a high-end one, both get to generate images faster.

[04:36] And one really good thing I saw here: they have evaluations of prompt alignment and aesthetic quality, compared against Playground v2, SDXL Turbo, SDXL, and Würstchen v2. In prompt alignment, Stable Cascade surpasses those older models currently on the market. In aesthetic quality, Playground v2 scores a little higher than Stable Cascade, but Stable Cascade is way better than the other three diffusion models tested. This is the benchmarking result from their testing phase.

[05:16] So let's go to their demo page on Hugging Face. Right now we have this page; I'll share the links to this Hugging Face demo page and the model card. They also have a GitHub page covering the same things we saw in the Hugging Face model card, the same information, so you can check that out as well. And they have more details about the text prompts you're going to input: it's not like those Stable Diffusion 1.5 style text prompts; this is more of a natural-language manner of input prompts for creating a new image with this new model.

[06:00] They also have ControlNet here. As you can see, you can control the face identity, and if they have face identity, that means they've already handled face-swap features within the model, I believe. Then they have the Canny ControlNet, which works just like the other Stable Diffusion ControlNets we're used to. Then super resolution, meaning they have something like upscaling to add more detail and refinement to all the small parts of your AI image. And you can easily train your LoRAs on any object: for example, they have this dog, they trained on its image, and they can reproduce the dog wearing a space suit. And image recognition, well, I'd say it's better than Stable Diffusion 1.5 or SDXL, because they have more image training for their models; on that basis it has already surpassed the image recognition of the older Stable Diffusion models.

[07:08] Okay, so let's go to the demo page right here and try this out. I already tried one run with a very simple prompt: "the playground, an old man walking with his grandson, holding hands, and sunset time". Now this is not like the old-days traditional text prompts in Stable Diffusion 1.5; as you can see, we're using a more natural-language sentence, almost a full sentence, to create an image like this. And the image here is, well, pretty nice. Let's open a new tab and check out the full size. As you can see, everything in the prompt got generated into the image already: there's the grandson, and the old man holding hands with him in a playground, and then the sunset time. Basically every element of my prompt appears in this image, and it handles multiple elements of a text prompt really well, whereas in Stable Diffusion 1.5 or even SDXL, sometimes you can't get multiple elements handled. That's the prompt alignment they measured: SDXL and the older SD 1.5 didn't do that quite so well, but here it's done really well.

[08:30] Then you can see they have the advanced options. Of course you can put in negative prompts, and you can set seed numbers; that's very typical for all AI models, especially image generation models. You can set your width and height here, which by default is the same size as SDXL for this model, and the number of images. And then this part is kind of new for us as Stable Diffusion users: the prior guidance scale, the prior inference steps, the decoder guidance scale, and lastly the inference steps. The steps we can classify as the sampling steps, like whether you want to set 25 steps or 30 steps, et cetera. But the other two, the decoder scale and the decoder steps, are something we don't have in Stable Diffusion currently. So I guess if they implement this model in ComfyUI or Automatic1111, they'll have to create new nodes, or a new input area, for us to set these two parameters there as well. I'm waiting to see an update that makes this model compatible with Automatic1111 or ComfyUI, but right now we're able to test Stable Cascade on this Hugging Face demo page.

[10:00] And the GitHub page lets you download the code at the top here. It's the same demo page, but you can run it locally. I guess the point for us isn't to download and run this demo page locally from the GitHub project; instead, just enjoy and try the hosted demo page for now, and let's wait for updates on support in other web UIs like Automatic1111 or ComfyUI; then we can fully enjoy using this model in those systems.

[10:40] Right, so let's try another example here using their default text. I'd say this one is pretty cool, like a city of Los Angeles. Let's try it. Okay, so here we have the result, and it's kind of a funny thing: you're putting in something that isn't realistic, but it comes out in a realistic style of a Los Angeles street. You see all these details of the street, and the concrete up at the top here, all those marks, they've done this in great detail, and it looks pretty good.

[11:13] Now let's try some prompts that aren't defaults. You know, in previous videos I tested the Würstchen diffusion model, and I tested John Wick in cyberpunk, so let's try that in Stable Cascade. Let's say "John Wick close-up shot"... okay, actually, let's not do the traditional text prompt. Let's do something like "John Wick in disco clubbing place, he holds pistol ready to shoot, the place with cyberpunk neon light". Let's try this one, a more natural-language prompt, not like those one-keyword, comma, another-keyword, comma Stable Diffusion 1.5 text prompt styles. Hopefully it will generate something for me.

[12:38] And there you go, much clearer. Let's see the full view of this. Well, the eyes aren't that clear at this moment, but we can see, okay, there's the assassin's ring, shown in great detail, and then the watch. John Wick is wearing the watch on the other arm, and actually the watch face should be turned to the inside of the wrist, I'd say. But it does do something realistic; basically everything here follows my prompt really well: in a disco clubbing place, holding a pistol ready to shoot, so John Wick's action is ready to shoot the pistol, and you see the cyberpunk neon light all over. So I'd call this a pass, but for the eyes we might need a refiner if we want to enhance this image. Let's try this prompt again with more content, and I'd say let's fix the eyes. Okay: "John Wick picture with clear face and eyes", just that. Just add one more piece of content here and hope it helps our character's face come out with better quality.

[14:01] Now, one thing I have to mention about this AI model is that it's not for commercial purposes yet. Maybe one day you can purchase a license for this AI model for commercial use, but right now we're just doing this for research purposes.

[14:20] Okay, so another one here. Yeah, we have a better face, clearer, and a similar style. Well, the pistol is kind of awkward in this direction. If you've handled firearms before, you'll know the wrist and the angle the pistol is pointing at are kind of awkward; it should be pointing more toward the center line of the character instead of outward from the character. But oh well, I'd still give it a pass.

[14:50] Right, compare this style with the previous one. The purely trained Würstchen diffusion model, the older diffusion model with the sausage name, always gave me a close-up shot of a character. But with Stable Cascade, they've given me more elements and actions of the character; there's more content in the generated image with this sort of prompt. So I would say yes, their quality has surpassed the Würstchen v2 one already, a lot, I should say a lot, and of course they're surpassing SDXL a lot as well. I can see that if we're able to use this model in the future, we can make AI animations using it instead of SD 1.5 or SDXL, and of course we can get way better quality than what we have today in AI animation.

[15:43] So, I hope you guys enjoyed this video and this quick test. I just did a very fast video about this newer model; I really wanted to do it today to share it with you guys. And, yeah, maybe the Stable Video Diffusion 1.1 update I'll do next time in another video. I hope you guys get inspired by this new model, Stable Cascade, and try it out. This is very exciting news for me, and I hope it is for you too. I'll see you guys in the next video, and have a nice day. Bye.