Stable Cascade Released Within 24 Hours! A New, Better, and Faster Diffusion Model!
Summary
TLDR This video introduces Stable Cascade, the latest AI diffusion model released by Stability AI. Built on the Würstchen architecture, it trains diffusion models faster on much smaller latents while still generating standard high-resolution images. Stable Cascade uses a three-stage image generation process that improves both processing speed and image quality, and it supports LoRA, ControlNet, and IP-Adapter. The video compares the model's performance against existing models, showing its advantages in prompt alignment and aesthetic quality. Finally, the author tests Stable Cascade on its Hugging Face demo page, showing its ability to generate complex scenes and fine detail, and suggests the new model is a significant contribution to AI image generation.
Takeaways
- 🚀 Stable Cascade is Stability AI's newly released AI diffusion model, built on the Würstchen architecture.
- 🌟 The model trains its encoder on much smaller latents (24x24), roughly 42 times more compressed than the 128x128 latents of traditional Stable Diffusion, which speeds up image generation.
- 🔍 Stable Cascade accepts prompts written in more natural language and handles complex text prompts better than traditional Stable Diffusion 1.5.
- 🎨 It produces high-quality output through a three-stage generation process: a latent generator, a latent decoder, and a final refinement stage.
- 🏆 In the published evaluations it performs strongly on prompt alignment and aesthetic quality, beating several other diffusion models on the market.
- 🛠️ The model also introduces new control mechanisms such as ControlNet and IP-Adapter, plus LoRA support, for more customizable image generation.
- 📊 Generation exposes new parameters such as the prior guidance scale, decoder guidance scale, and inference steps, giving users more tuning options.
- 🌐 The model is not yet integrated with tools such as Automatic1111 or ComfyUI, but its GitHub page and Hugging Face demo page are already open for public testing.
- 💡 In the demo tests, Stable Cascade handled multi-element text prompts and generated images with dynamic elements and detailed backgrounds.
- 🔗 The model is currently available for research purposes only; no commercial license is offered yet.
Q & A
What is Stable Cascade?
-Stable Cascade is Stability AI's newly released AI diffusion model, built on the Würstchen architecture, designed to generate images faster and more efficiently.
How does Stable Cascade differ from traditional Stable Diffusion models?
-Stable Cascade trains its encoder on a much smaller 24x24 latent instead of the traditional 128x128, roughly 42 times less data to process, which makes image generation faster.
What new features does Stable Cascade support?
-It supports LoRA, ControlNet, IP-Adapter, and LCM, accepts prompts written in more natural language, and offers advanced controls such as face identity, image enhancement, and object (subject) training.
What are the stages of Stable Cascade's image generation process?
-Generation happens in three stages: a latent generator first turns the text prompt into a rough idea of the image, a latent decoder then assembles those latents into whole objects, and a third stage refines the result into the final image.
How does Stable Cascade perform?
-On prompt alignment and aesthetic quality it surpasses several other models on the market, such as SDXL, although on aesthetic quality it scores slightly below Playground v2.
How can you access and test Stable Cascade?
-Stable Cascade has a demo page on Hugging Face where users can test the model; it is not yet officially supported in systems such as Automatic1111 or ComfyUI.
How do Stable Cascade's text prompts differ from earlier models?
-Compared with traditional Stable Diffusion 1.5, Stable Cascade encourages prompts written closer to natural language rather than simple comma-separated keywords.
Can Stable Cascade be used commercially?
-Currently Stable Cascade is available for research purposes only; no commercial license is offered yet.
What new tuning options are available when generating images with Stable Cascade?
-Besides the usual width, height, and number-of-images settings, Stable Cascade adds new options such as the prior guidance scale, decoder guidance scale, and inference steps.
How does Stable Cascade handle complex image generation requests?
-Through its three-stage generation process and advanced controls, Stable Cascade can handle complex text prompts containing multiple elements and reflect them accurately in the image.
Outlines
🚀 Stable Cascade: Exploring the Latest AI Diffusion Model
This section discusses Stable Cascade, a new AI diffusion model recently released by Stability AI. Stable Cascade builds on the previously covered Würstchen architecture and aims to speed up diffusion model training while working on much smaller latents to generate high-quality images. Compared with traditional Stable Diffusion, its training data requirement is roughly 42 times smaller and image generation is noticeably faster. Stable Cascade also supports control techniques such as LoRA, ControlNet, IP-Adapter, and LCM, which suggests it may soon become compatible with existing web UI systems such as Automatic1111 or ComfyUI. The author also highlights the model's three generation stages (text-prompted latent generation, latent decoding, and image refinement) as evidence of its advantages over existing techniques.
🌐 Experiencing Stable Cascade: Demo and Evaluation
In this section the author turns to demonstrating and evaluating Stable Cascade, mainly through the demo page hosted on Hugging Face. The author explains how to access and use the demo, along with the model card and GitHub page, which give guidance on how to write text prompts for the new model. The author also discusses Stable Cascade's capabilities in face identity, image detail enhancement, and image recognition, highlighting its advantages over models such as SDXL and Stable Diffusion 1.5. The published evaluation shows Stable Cascade outperforming other diffusion models on the market in prompt alignment and aesthetic quality, although it scores slightly below Playground v2 on aesthetics.
🔍 Hands-on Testing: Stable Cascade in Practice
The author tests Stable Cascade with several concrete text prompts to give a closer look at the model's performance. The tests show the model's ability to handle complex scenes, character actions, and lighting effects, especially when generating images containing multiple elements. Comparing against older versions, the author notes that Stable Cascade performs well on image clarity, detail handling, and prompt alignment. The author also points out some limitations, such as weak handling of certain character details, but overall finds the model better than previous versions.
🎉 Conclusion and Outlook
In the closing section the author gives the Stable Cascade model a very positive assessment, particularly for its potential to improve AI animation quality and image generation speed. Compared with the older model, Stable Cascade shows clear improvements in image richness and in capturing actions. The author looks forward to the model's future use in commercial and research settings and encourages viewers to try the new technology. Finally, the author hopes to cover the Stable Video Diffusion update in a future video and thanks viewers for their support.
Keywords
💡Stable Cascade
💡Diffusion model
💡Würstchen architecture
💡LCM
💡Image generation
💡Natural language processing
💡Performance evaluation
💡GitHub page
💡ControlNet
💡Image recognition
Transcripts
Let's talk about Stable Cascade, the new AI diffusion model that was just released. AI development has been moving really fast, with new models released every day. Look at Hugging Face: everything is listed there. You can see MetaVoice, which I'm going to cover soon on the large language model channel, and scrolling down I see that Stability AI has Stable Video Diffusion 1.1. I was going to cover that, but when I scrolled up a little they showed Stability AI's Stable Cascade, posted not even a day ago, about 16 hours ago. Once I checked it out I said, okay, forget about the Stable Video Diffusion 1.1 update, let's do this one.

When I saw it, they said this model is built on the Würstchen architecture, a diffusion model I have covered previously on my YouTube channel. The name, when I searched it, is actually German for a little sausage, which is how I came up with the thumbnail for that video.
Then I saw something very interesting: Stability AI is building its newer diffusion model on Würstchen, the architecture we have talked about before. It can train diffusion models at a faster speed using much smaller latent images while still producing SDXL-standard-size output, like this 1024x1024 image. Instead of the 128x128 latents of traditional Stable Diffusion 1.5, its encoder works at 24x24, which they describe as roughly a 42x compression, so there is far less data to process. It is even faster than SDXL, which makes sense: if you build a newer model, of course it is going to perform better than the older AI models.
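As an aside, here is my own reading of where that 42x figure most likely comes from, assuming it describes how strongly a 1024x1024 image is compressed per side before diffusion happens; the framing below is my assumption for illustration, not a quote from the release notes:

$$
\underbrace{\frac{1024}{24}}_{\text{Stable Cascade latent}} \approx 42.7
\qquad\qquad
\underbrace{\frac{1024}{128}}_{\text{Stable Diffusion latent}} = 8
$$

Counted in spatial positions the diffusion network actually has to denoise, that is $24 \times 24 = 576$ versus $128 \times 128 = 16{,}384$, roughly 28 times fewer, which would explain the faster training and inference described here.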
One really good thing about Stable Cascade is that Stability AI is also supporting LoRA, ControlNet, IP-Adapter, and LCM for it, which is insane. If Automatic1111, ComfyUI, or any other web UI system that supports Stable Diffusion gets an update, I believe we will later be able to run image generation with Stable Cascade on those systems too. Another piece of good news is that there is a new demo page where we will be testing out this model. Right now there is no official support in Automatic1111 or ComfyUI; the model was only released within the last 24 hours. So I'm going to say, okay, forget about the Stable Video Diffusion update and let's do this one first. Let's go through the overview and some technical background of this model.
First of all, Stable Cascade splits image generation into three stages. Stage C is the latent generator: it takes your input text, your text prompt, and generates a rough idea of the image. That is passed to Stage B, the latent decoder, which lets the AI assemble those little latent pixels back into whole objects. Finally, those objects are refined in Stage A, and you get the full image as your result.
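To make that three-stage hand-off concrete, here is a minimal sketch of how the same flow looks in Hugging Face diffusers, which splits Stable Cascade into a prior (Stage C) pipeline and a decoder (Stage B/A) pipeline. Treat the model IDs, class names, and parameter values below as my assumptions based on the Hugging Face release, not the demo's exact configuration, and note that you need a recent diffusers version that ships the Stable Cascade pipelines.

```python
# Minimal sketch of the Stage C -> Stage B/A hand-off via Hugging Face diffusers.
# Model IDs and parameter values are illustrative assumptions, not the demo's exact settings.
import torch
from diffusers import StableCascadePriorPipeline, StableCascadeDecoderPipeline

device = "cuda" if torch.cuda.is_available() else "cpu"

# Stage C ("latent generator"): turns the text prompt into a small image embedding.
prior = StableCascadePriorPipeline.from_pretrained(
    "stabilityai/stable-cascade-prior", torch_dtype=torch.bfloat16
).to(device)

# Stage B/A ("latent decoder" + refinement): expands that embedding into full-resolution pixels.
decoder = StableCascadeDecoderPipeline.from_pretrained(
    "stabilityai/stable-cascade", torch_dtype=torch.bfloat16
).to(device)

prompt = "an old man walking with his grandson, holding hands, in a playground at sunset"
negative_prompt = ""

prior_output = prior(
    prompt=prompt,
    negative_prompt=negative_prompt,
    height=1024,
    width=1024,
    guidance_scale=4.0,       # the demo's "prior guidance scale"
    num_inference_steps=20,   # the demo's "prior inference steps"
)

image = decoder(
    image_embeddings=prior_output.image_embeddings,
    prompt=prompt,
    negative_prompt=negative_prompt,
    guidance_scale=0.0,       # the demo's "decoder guidance scale"
    num_inference_steps=10,   # the demo's final "inference steps"
).images[0]

image.save("stable_cascade_test.png")
```

The important point is the hand-off: the prior never produces pixels, only a compact image embedding, and the decoder is the only part that ever touches full-resolution pixels.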
With this design it gets, I would say, better performance than Stable Diffusion, since the encoder is trained on much smaller latents and the data it has to process is dramatically smaller, that 42x compression compared with traditional Stable Diffusion. That is a real advantage for faster processing: whether you have a lower-end GPU or a high-end graphics card, both benefit and generate images faster.
One really good thing I saw here is the evaluations: they measured prompt alignment and aesthetic quality, comparing against Playground v2, SDXL Turbo, SDXL, and Würstchen v2. On prompt alignment it surpasses those older models currently on the market, and on aesthetic quality Playground v2 scores a little higher than Stable Cascade, but Stable Cascade is still well ahead of the other three diffusion models in this benchmark from their testing phase.

So let's go to their demo page on Hugging Face. I will share the links to the Hugging Face demo page and the model card, and they also have a GitHub page covering the same information as the Hugging Face model card, so you can check that out as well. They also give more detail about the text prompts you should input: it's not the Stable Diffusion 1.5 style of prompt, but more of a natural-language manner of prompting for creating images with this new model. They also show ControlNet support. As you can see, you can control the face identity, and if they have face identity working, I believe face-swap features are effectively handled within the model. They have a Canny ControlNet, which works just like the Stable Diffusion ControlNets we're used to, and super resolution, which is basically upscaling to add more detail and refinement to every small part of your AI image. You can also easily train LoRAs on any object; for example, they take a photo of a dog, train on that image, and can reproduce the dog wearing a space suit. As for image understanding, I can say it's better than Stable Diffusion 1.5 or SDXL, because they trained these models on more images, so it already surpasses the image recognition of the older Stable Diffusion models. Okay, so let's go to the demo page right here and try it out.
I already tried one generation with a very simple prompt: in the playground, an old man walking with his grandson, holding hands, at sunset time. This is not like the old days of traditional Stable Diffusion 1.5 prompting; as you can see, we are using a more natural-language sentence, almost a full sentence, to create the image. The result looks pretty nice. Let's open it in a new tab and check the full size. As you can see, everything in the prompt made it into the image: there's the grandson, the old man holding his hand, the playground, and the sunset. Basically every element of my prompt already appears in this image, and it handles multiple elements of a text prompt really well, whereas in Stable Diffusion 1.5 or even SDXL you sometimes can't get multiple elements handled. That's the prompt alignment they measured, something SDXL and the older SD 1.5 didn't do that well, but here it's done really well. You can also see the advanced options: of course you can add negative prompts and set seed numbers, which is typical for image generation models, and you can set the width and height, which by default is the same size as SDXL for this model.
Then there's the number of images, and then something new for us as Stable Diffusion users: the prior guidance scale, the prior inference steps, the decoder guidance scale, and lastly the inference steps. These are things we don't have in Stable Diffusion. The steps we can think of as sampling steps, like whether you set 25 or 30 steps and so on, but the decoder scale and decoder steps are something Stable Diffusion doesn't currently have. So I guess if this model gets implemented in ComfyUI or Automatic1111, they will have to create new nodes, or a new input area in Automatic1111, for us to set these extra parameters.
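If the diffusers sketch shown earlier is a fair stand-in for what the demo runs, those controls would map onto two separate sampling loops rather than one: the prior guidance scale and prior inference steps steer the Stage C text-conditioned pass that produces the image embedding, while the decoder guidance scale and the final inference steps control the Stage B/A pass that turns the embedding into pixels. That split is presumably why Automatic1111 or ComfyUI would need new nodes or input fields instead of reusing the single CFG-and-steps pair they expose today.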
So I'm waiting to see if there's an update that makes this model compatible with Automatic1111 or ComfyUI, but at this moment we can test Stable Cascade on this Hugging Face demo page. The GitHub page also lets you download the code at the top, which is the same demo but runnable locally. I guess the point right now isn't to download and run the demo locally from the GitHub project; just enjoy and try the hosted demo for now, and let's wait for updates on support in other web UIs like Automatic1111 or ComfyUI so we can fully enjoy this model in those systems. So let's try another example using one of their default prompts. This one is pretty cool, a Los Angeles city scene; let's try it.
Okay, here we have the result, and it's kind of a funny thing: you're putting in something that isn't realistic, but it comes out in a realistic Los Angeles street style. You can see all the detail of the street and the concrete at the top; look at all those markings, they did very detailed work on this, and it looks pretty good. Now let's try some prompts that aren't from the defaults. In previous videos I tested the Würstchen diffusion model with John Wick, so let's try that with Stable Cascade: John Wick in cyberpunk. Let's say "John Wick close-up shot"... actually, let's not do the traditional style of text prompt. Let's do something like "John Wick in a disco clubbing place, he holds a pistol ready to shoot, the place lit with cyberpunk neon light." Let's try this as a more natural-language prompt, not the one-keyword-comma-another-keyword style of Stable Diffusion 1.5 prompting. Hopefully it will generate something for me.
And there you go, it's coming through more and more clearly; hopefully there's something there. Let's see the full view of this. Well, the eyes are not exactly clear at the moment, but you can see there's something like the assassin's ring, shown in good detail, and the watch, although John Wick is wearing the watch on the other side, and really the watch face should be on the inside of the wrist, I'd say. Still, it's doing something realistic, and everything here follows my prompt really well: in a disco clubbing place, holding a pistol ready to shoot, so John Wick's action is ready to fire the pistol, and you can see the cyberpunk neon light all over the scene. So I'd call this a pass, but for the eyes we might need a refiner if we want to enhance this image. Let's try this prompt again with more content and fix the eyes: add "John Wick picture with clear face and eyes." Just add that one extra piece of content and hope it helps give our character's face better quality.
One thing I have to mention about this AI model is that it is not for commercial purposes yet. Maybe one day you will be able to purchase a commercial license, but right now it is for research purposes only. Okay, so here's another one; yes, we have a better face, clearer, in a similar style I would say. The pistol is kind of awkward in this direction, though. If you have handled firearms before, you'll know that the wrist and the angle the pistol is pointing at are awkward; it should point more toward the center of the character rather than outward. But oh well, I would still give it a pass. Compared with the previous one, the purely Würstchen-trained diffusion model, the older model with the sausage name, which always gave me a close-up shot of the character, Stable Cascade gives me more elements and actions of the character. There is more content in the generated image from this sort of prompt. So I would say yes, its quality has surpassed Würstchen version 2 by a lot already, and of course it surpasses SDXL by a lot as well. I can see that if we are able to use this model in the future, we could make AI animations with it instead of SD 1.5 or SDXL, and of course get much better quality than what we have today in AI animation. I hope you guys enjoyed this quick test; I just made a very fast video about this newer model because I really wanted to share it with you today. Maybe I'll do the Stable Video Diffusion 1.1 update next time in another video, but I hope you get inspired by this new model, Stable Cascade, and try it out. This is very exciting news for me and I hope it is for you too. I'll see you in the next video. Have a nice day.
Bye.