Episode 1- Efficient LLM training with Unsloth.ai Co-Founder
Summary
TL;DR: In this episode of 'Unsupervised Learning,' Renee interviews Daniel, co-founder of Unsloth, an AI training system for efficient fine-tuning. They discuss how Unsloth achieves up to 30 times faster fine-tuning for large language models while reducing memory usage by 50%. Daniel shares his background at Nvidia, where he optimized GPU algorithms, and how that experience shaped Unsloth's development. The conversation covers the challenges and potential of language models, the value of treating them as multiple cooperating agents rather than a single system, and a future vision of personal AI chatbots. Daniel also discusses the open-source community's role in Unsloth's growth and how listeners can support the project through collaborations and donations.
Takeaways
- 🚀 Unsloth is an AI training system that accelerates fine-tuning of large language models by up to 30 times.
- 🌟 Unsloth's open-source package has 3,000 GitHub stars, makes fine-tuning twice as fast, and reduces memory usage by 50%.
- 💡 Daniel's experience at Nvidia involved making GPU algorithms such as t-SNE run 2,000 times faster, which influenced the development of Unsloth.
- 🏆 Unsloth was developed in response to the NeurIPS LLM Efficiency Challenge, taking the approach that training faster is a route to higher accuracy.
- 🔍 The system rewrites the backpropagation algorithm in explicit maths and optimizes it as GPU code, leading to efficiency gains.
- 🌐 Unsloth supports training in many languages, not just English, and simplifies converting models to formats such as GGUF.
- 🤖 The team behind Unsloth uses AI tools like ChatGPT for engineering tasks and to get unstuck on coding challenges.
- 📊 Language models have limitations in maths due to tokenization issues, sparse specialized training data, and a design not focused on mathematical tasks.
- 📚 RAG (Retrieval-Augmented Generation) injects knowledge into language models by letting them search large databases for accurate answers.
- 🌐 Daniel's preferred sources for staying updated on AI include Twitter, Reddit, and YouTube channels like Yan's.
- 💸 Unsloth is currently bootstrapped; the team welcomes community support through collaborations and donations via their GitHub page.
Q & A
What is Unsloth and how did it come to be?
-Unsloth is an open-source package developed to make fine-tuning of language models dramatically faster, up to 30 times in its full version and twice as fast in the open-source release. It was created by Daniel, the co-founder, with the goal of reducing the time and memory required for fine-tuning, making it more accessible and efficient.
How does Unsloth reduce memory usage during fine-tuning?
-Unsloth reduces memory usage by 50% by writing optimized kernels in the Triton language and rewriting the entire backpropagation algorithm in explicit maths, which allows for more efficient memory management and faster processing.
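As a toy illustration of what "rewriting backprop in maths" can buy (a simplified example constructed here, not Unsloth's actual derivation): consider a LoRA-style layer $Y = X(W + AB)$ with $X \in \mathbb{R}^{n\times d}$, $A \in \mathbb{R}^{d\times r}$, $B \in \mathbb{R}^{r\times d}$, and rank $r \ll d$. The gradient with respect to $A$ is

```latex
\frac{\partial L}{\partial A}
  = X^{\top}\!\left(\frac{\partial L}{\partial Y}\, B^{\top}\right)
```

Evaluating the bracketed product first creates only an $n \times r$ intermediate and costs $O(ndr)$, whereas the other bracketing, $\left(X^{\top}\frac{\partial L}{\partial Y}\right) B^{\top}$, materializes a $d \times d$ matrix and costs $O(nd^{2})$. Choosing the cheaper bracketing in every such term is one way a hand-derived backward pass can beat a generic autodiff graph on both speed and memory.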
What was Daniel's role at Nvidia and how did it influence Unsloth?
-Daniel worked at Nvidia making GPU algorithms faster, for example speeding up t-SNE and randomized SVD. That experience writing CUDA kernels and optimizing algorithms carried over to Unsloth, where he rewrote the kernels in the Triton language and performed extensive code optimizations.
What is the significance of the Triton language in Unsloth?
-OpenAI's Triton language is used in Unsloth for its performance and efficiency benefits. It allows the creation of optimized GPU kernels that improve the speed and memory usage of the fine-tuning process.
How does Unsloth address the language limitations of models like Mistral and Llama?
-Unsloth allows users to fine-tune open-source models like Mistral and Llama in any language, not just English. This overcomes these models' English-centric training and enables fine-tuning in languages like Portuguese or Mandarin.
What are some of the challenges with language models in performing mathematical operations?
-Language models often struggle with mathematical operations like multiplication and addition due to tokenization issues and the specialized nature of math problems. They may not have seen complex formulas in their training set, which limits their ability to perform certain mathematical tasks.
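The tokenization point can be made concrete with a toy example (the two tokenizers below are invented for illustration and are much simpler than the real GPT-4 or Llama vocabularies): a vocabulary that assigns one ID per multi-digit chunk fragments the same quantity differently depending on its length, while digit-level splitting keeps place-value structure uniform.

```python
# Toy comparison of two number-tokenization schemes (illustrative only;
# real GPT-4/Llama vocabularies are far larger and more complex).

def chunk_tokenize(number: str) -> list[str]:
    """Greedily grab up-to-3-digit chunks, mimicking vocabularies that
    store single-token IDs for 0-9, 10-99, and 100-999."""
    tokens, i = [], 0
    while i < len(number):
        tokens.append(number[i:i + 3])
        i += 3
    return tokens

def digit_tokenize(number: str) -> list[str]:
    """Split every number into individual digits (the Llama-style fix)."""
    return list(number)

# The same digit lands in different tokens depending on number length,
# so the model never sees a consistent place-value structure:
print(chunk_tokenize("1234"))   # ['123', '4']
print(chunk_tokenize("234"))    # ['234']
# Digit-level splitting is uniform: '4' is always its own token.
print(digit_tokenize("1234"))   # ['1', '2', '3', '4']
```

This is one reason a model can add small numbers it has memorized as whole tokens yet fail on longer ones where the chunking changes.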
How does Daniel envision the future of Unsloth?
-Daniel's vision includes a fine-tuning bot on every computer, even those with weak GPUs. This bot would read personal data locally without uploading it, fine-tune daily, and learn about the user to become a personal chatbot, making AI more accessible and personalized.
What is RAG and how does it enhance language models?
-RAG (Retrieval-Augmented Generation) is a knowledge injection system that allows language models to search large databases, like Wikipedia or the internet, to find correct answers. This enhances the model's ability to provide accurate and up-to-date information.
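A minimal sketch of the retrieve-then-answer pattern described above (word-overlap scoring stands in for a real embedding search, and the documents are invented for illustration):

```python
# Minimal RAG sketch: retrieve the best-matching document, then inject it
# into the prompt as context. Word overlap stands in for a real vector
# search; production systems use embeddings and an index.

DOCS = [
    "The Eiffel Tower is in Paris and was completed in 1889.",
    "Unsloth is an open-source package that speeds up LLM fine-tuning.",
    "Python is a programming language created by Guido van Rossum.",
]

def retrieve(query: str, docs: list[str]) -> str:
    """Return the document sharing the most words with the query."""
    q = set(query.lower().split())
    return max(docs, key=lambda d: len(q & set(d.lower().split())))

def build_prompt(query: str, docs: list[str]) -> str:
    """Inject the retrieved knowledge into the prompt for the LLM."""
    context = retrieve(query, docs)
    return f"Context: {context}\nQuestion: {query}\nAnswer:"

prompt = build_prompt("Where is the Eiffel Tower?", DOCS)
print(prompt)
# The actual LLM call would go here, e.g. a local model served by Jan.
```

Swapping `DOCS` for Wikipedia, a web index, or a court-judgment database gives the variants Daniel describes later in the episode.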
How does Daniel stay updated with the latest developments in AI?
-Daniel uses Twitter for new releases, Reddit for the latest information, and YouTube for educational content. He particularly recommends Yan's YouTube videos for staying up-to-date with AI developments.
How can people support Unsloth and contribute to its development?
-Support can come as contributions to the open-source package, feature requests and pull requests, or financial donations through Ko-fi (Daniel himself isn't sure whether it's pronounced 'ko-fi' or 'coffee'), which helps fund further development and new features.
Outlines
🚀 Introduction to Unsloth and AI Training
Renee interviews Daniel, co-founder of Unsloth, an AI training system that accelerates fine-tuning of large language models by up to 30 times. They discuss the open-source package that makes fine-tuning twice as fast and reduces memory usage by 50%. Daniel shares his background at Nvidia and the inspiration behind Unsloth, which emerged from the NeurIPS LLM Efficiency Challenge on training large language models efficiently. The conversation delves into the technical aspects of optimizing GPU algorithms and the use of the Triton language for kernel development.
🤖 Utilizing ChatGPT and Open Source
Daniel and Renee discuss the practical applications of Unsloth, including the ability to fine-tune models in different languages and the challenges language models face in mathematics. They touch on using ChatGPT for engineering tasks and the limitations of language models in handling mathematical problems due to tokenization and training data constraints. Daniel also shares his thoughts on whether language models can achieve AGI and the importance of treating them as a system of multiple agents rather than a single entity.
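The "system of multiple agents" idea can be sketched as a generate/verify loop. In the toy below both agents are stub functions invented for illustration; in a real system each would be a separately prompted LLM call:

```python
# Layered-LLM sketch: one "agent" generates, a second verifies, and we
# retry until the verifier accepts. Both agents are stubs here; a real
# system would make a model API call at each step.

def generator_llm(question: str, attempt: int) -> str:
    """Stub generator: pretends to improve its answer on retries."""
    return "4" if attempt > 0 else "5"   # first draft is wrong on purpose

def verifier_llm(question: str, answer: str) -> bool:
    """Stub verifier: checks the candidate answer independently."""
    return question == "What is 2 + 2?" and answer == "4"

def answer_with_verification(question: str, max_attempts: int = 3) -> str:
    """Layer the two agents: generate, verify, retry."""
    for attempt in range(max_attempts):
        candidate = generator_llm(question, attempt)
        if verifier_llm(question, candidate):
            return candidate
    raise RuntimeError("no verified answer found")

print(answer_with_verification("What is 2 + 2?"))  # 4 (second attempt passes)
```

Further layers, such as an agent that executes or implements the verified answer, follow the same pattern.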
🌐 Democratizing AI and the Future of Unsloth
The discussion shifts to the broader goals of Unsloth, which include making AI accessible to everyone and reducing energy consumption. Daniel talks about the partnership with Hugging Face and the aim of enabling fine-tuning on even small graphics cards, down to 8GB of VRAM. He envisions a future where everyone has a personal AI chatbot on their computer, capable of learning about the user and providing personalized assistance. The conversation also covers misconceptions about working with large language models and the importance of using them in tandem for better results.
🔍 Keeping Up with AI and Supporting Unsloth
Daniel shares his sources for staying updated in the AI field, including Twitter, Reddit, and YouTube, and emphasizes the value of local communities for the latest information. He also discusses the currently bootstrapped state of Unsloth and how people can support the work through donations and collaboration on the open-source GitHub repository. The episode concludes with a mention of the sponsor, Jan, which lets users run any LLM locally on their devices.
Mindmap
Keywords
💡Unsupervised Learning
💡AI Training System
💡Fine-tuning
💡Large Language Models
💡GPU
💡CUDA
💡Open Source
💡Triton Language
💡Memory Usage
💡LLM Efficiency Challenge
💡Knowledge Injection
Highlights
Daniel, co-founder of Unsloth, discusses their AI training system, which accelerates fine-tuning of large language models by up to 30 times.
Unsloth has an open-source package that makes fine-tuning two times faster and reduces memory usage by 50%.
Daniel's previous work at Nvidia focused on making algorithms run faster on GPUs, including speeding up t-SNE and randomized SVD.
Unsloth was developed during the NeurIPS LLM Efficiency Challenge, aiming to train language models faster to achieve high accuracy.
Unsloth rewrote kernels in the Triton language and performed extensive optimizations to improve efficiency.
Unsloth is agnostic to the type of GPU, supporting both Nvidia and AMD, and can fine-tune models faster than traditional methods.
The open-source package includes around 50 tricks to enhance speed and efficiency.
Unsloth allows fine-tuning of language models in many languages, not just English, addressing a limitation of previous models.
The community is using Unsloth for model conversion (e.g. to GGUF), leveraging the fine-tuning notebook even for non-technical users.
Daniel's brother handles the non-technical aspects of Unsloth, including the frontend and website, while Daniel focuses on algorithms.
The team uses ChatGPT for engineering help and to overcome roadblocks in development.
Language models struggle with maths due to tokenization issues and sparse specialized training data.
Unsloth aims to democratize AI by making it accessible to everyone, regardless of the size of their GPUs.
Daniel explains the misconception of treating language models as a single system, advocating for a layered approach with multiple models.
Unsloth has partnered with Hugging Face to make large language models more accessible.
The future vision for Unsloth includes a personal fine-tuning bot on every computer, providing a personal chatbot experience.
RAG (Retrieval-Augmented Generation) is a method for injecting knowledge into language models, allowing them to search large databases for answers.
Daniel would like to interview someone from OpenAI who believes that language models are sufficient for AGI (Artificial General Intelligence).
Daniel uses Twitter, Reddit, and YouTube to stay up-to-date with the latest AI developments.
Unsloth is currently bootstrapped and open to community support through collaborations and donations.
Transcripts
[Music]
hi Renee here at Unsupervised Learning
your easy listening podcast for bleeding
edge open source tech today we're
speaking to Daniel co-founder of
Unsloth the AI training system boasting 30
times faster fine-tuning for large
language models we speak on working with
family his beginnings at Nvidia what the
hell is RAG anyway and his favorite
places to keep up to date join me for
this catch-up on unsupervised learning
are you able to give a quick rundown of
Unsloth how it came to be so we make fine-tuning
of language models 30 times
faster we have an open source package
with 3,000 GitHub stars it makes fine-tuning
two times faster and it reduces
memory usage by 50% so the goal of Unsloth
was before it could be very slow to fine-tune
it could use lots of memory and now
with our free Colab notebooks you
can just click you know run all you
can put your data set whatever you like
and then it just fine-tunes and finally
at the end you can export it to
GGUF you can export it to vLLM
that's good to know so you previously worked
with Nvidia is that where the uh writing
CUDA Triton kernels kind of idea came
from
yeah so I used to work at Nvidia so
my goal was to make algorithms on the
GPU faster so I made t-SNE 2,000 times
faster I made randomized SVD faster
as well that was my old role and so
I think the goal of Unsloth
was I think it came about last
October there was a competition called
the LLM Efficiency Challenge by NeurIPS um
the goal was you have one day on one GPU
to train one LLM to you know attain
the highest accuracy so we took a
different approach and we thought like
you know to get the highest accuracy you
can also train faster to reach that
accuracy faster and so we
diverged and we focused on you know
training faster and so I think yeah
we did take our Nvidia experience of
writing CUDA kernels to Unsloth and so we
rewrote you know all of the kernels in
Triton language in OpenAI's Triton
language we also did maths
manipulations and we did lots of
optimizations in the code yeah so it's
like something I was wondering if it was
a case of those things that you don't
think will be relevant or you don't
realize will be relevant then becoming
relevant later on yeah that yeah I think
that's a fair point I think like OpenAI's
Triton language wasn't that popular but
then I think now it's getting very
very popular because like you know you
want performance you want efficiency you
want memory usage so Triton kernels
are a part of our system but I think
that's only half of the system I think
the rest the other half is like all math
equations and like you know how do we
write the math equations more
efficiently yeah and you're agnostic so
you run is it you run Nvidia AMD GPUs
and what's the other Intel for the non-technical
people like me and some other
people listening the kind of value prop
I guess is you can fine-tune it way
faster than traditional methods so is it
in the method that is
different so what we do is we take
the entire back propagation
algorithm in the fine-tuning process
right like the gradient descent process
and we rewrite it in hard maths
right so we write all the math
equations down and then we optimize
every single line with the correct
like the more optimal format for
the math equations for example you can
bracket it correctly and then that can
increase speed and we also
reduce memory usage by writing
everything in like CUDA programming so we
rewrite everything in low-level GPU
code um and this can reduce memory usage
and make it faster so there's like 50
something tricks in the open source
package and that's how we make it faster
aside from knowing you from the team at
Jan I also saw you've got heaps of
people talking about you online and I
think there was a there was a thread on
like black hat world or something like
that and it was like people were so
amazed like oh it's open source now but
they were like explain it to me in
layman terms because they they were
excited about it but they didn't know
what they could do with it so what are
you seeing like I'm in your Discord but
what are you seeing the community sort
of doing with
Unsloth yeah so I think a few days ago
people like so for example the open
source models like Mistral and Llama
they're only English-based so they
somehow trained it on their own
like home country language like Vietnamese or
like you know Portuguese or Spanish
you can train you can use our fine-tuning
notebook to train on any language
that you like um so I think that was the
fundamental problem of Mistral and
Llama it's just English-based you can
also train like you know Mandarin and
other languages very simply I think that
was the most interesting thing that
I found recently I think some other
people they didn't even know how to
convert a model to GGUF or
llama.cpp formats vLLM for
example to use on Jan um and so we
also solve that conversion step as well
and so currently people are
just using our notebook just for
conversion so they don't even do the
fine-tuning step they just do the
conversion step so that's a very
interesting use case yeah yeah and so
your brother is he like the
non-technical one of you two would you
say like how deep is he in the technical
side of what you're doing yeah so
he does all the front end he does all
the website he does everything else
that I don't want to do so I just do the
algorithms and so he does everything
else he's also a very good
ChatGPT user so we
actually use ChatGPT and Bing Chat for
some engineering help so
he's like that side of the equation
and I just do the algorithms yeah
obviously the whole kind of idea of this
podcast is for people like you like
really clever people to explain things
to me like I'm a child so on like
autonomous agents and stuff like that
like I don't want to dispute the
definition of agent but like do you have
because you're just a tiny team right
but you clearly have a lot that you're
doing a lot that you're shipping do you
have like you make use of that in an
autonomous way or is it more like you're
kind of consulting something like ChatGPT
to then help you with
code yeah that's a very good question I
think so I think if we wanted to make
applications and different
things and we get stuck
somewhere I think ChatGPT is a very good
unstuck like you know it can make you
unstuck on something that you get stuck
on so that's very useful I think we
haven't actually used it for a
whole application design like you
know everything from scratch I did use
it to try to do matrix
differentials and try to do
differentiation via ChatGPT it was quite
bad so I did not use it I unfortunately had
to fall back to writing it on
like you know on paper and doing the
differentials by hand um so I did try to
use ChatGPT for the maths part it's
just not that good so yeah so we
generally use you know ChatGPT GPT-4
for just an unstuck mechanism it can
improve productivity a lot like how do
you represent math
in a word format and then is that why it
has issues like why you can't trust it
for maths so I think in terms of maths I
think there were like three problems with
language models the first one is
the tokenization problem so I think GPT-4
so from what I understand GPT-4 tokenizes
numbers by like you know zero one two three
four five six seven eight nine they
tokenize individual numbers but they
also do double digits so like you
know 10 to 99 they also tokenize each
one as like an ID and then 100 to
1,000 they take individual IDs so I
know like you know language models are
very bad at multiplication they're
quite bad at addition it's because of
the tokenization issue so I know
Llama fixed it as well by
tokenizing each so for example if you
have a number 100 right it would
tokenize each like they will split the
numbers into individual digits right so
one zero zero so that's one of the
issues the other issue is I guess the
training data is just too little like you
know in maths each
problem is very specialized if you
want to take the differential of x
squared that's easy that's 2x right
so like but then if the formula gets
very complicated it might have not
seen it in the training set and so I
think you know once they have a larger
training set this could you know by the
phenomenon of grokking maybe they can do
some maths so there was I think
that people talk about Q* um so
supposedly Q* can solve you know
high school maths level problems so
this was an OpenAI thing so those are the
two main problems and the third one I
guess is just I don't think
language models are designed for maths so
you need to maybe have a
separate component to verify the
maths like a proof machine or
Python to write the maths um so those
are the three main problems so thank you
for the context by the way so when you
spoke about the current kind of what
Unsloth is being used for your recent
kind of partnership like is that part of
something bigger or is it part of
an open source plan so we did do a
partnership with Hugging Face on a blog
so we did a blog with them we're also in
the TRL docs and so we're trying to
make like you know LLMs accessible
to everyone and so our goal is to make
you know even the very small graphics
cards work I think the biggest
difference is like you have an
eight gigabyte like you know VRAM on
your GPU people before couldn't even
train on a very you know on an 8GB
graphics card but now we made it
you know fit just you know just there it
just fits in 8GB and so like we
essentially our goal is to make
AI like you know easily available to
everyone reduce energy usage because
with training two times faster we also
want to reduce energy usage and we just
want to make it more productive for
people so that's yeah the goal is to
democratize AI for everyone but only
in the fine-tuning space for now
[Music]
yeah so what's something that people get
wrong about working with LLMs yeah
that's a great question I think the
biggest issue would be like people treat
LLMs as one system so like when you
call an LLM you might get an answer
but I think that's the wrong way to go
about it you're not supposed to treat LLMs
as like this one agent like you know
you're supposed to use LLMs as if
you have like 10 of them right so like
you build layers on top of each layer so
one LLM you can tell it okay you are a
you are a verifier right you verify
someone's answers and then
someone's answers is another LLM which
generates answers right so one LLM
does the generation of your answer one
of them does the verification one
of them does the implementation right and
execution of your answer right so like
you need to I think the biggest
misconception is like you know a
language model is this very powerful
system that can answer every single
question but I think you need to
prompt it correctly and then like you
know use many of them in tandem together
and this can be more powerful than just
using one system I think the
biggest issue is I feel like
people are using it as one system and then
they like you know constantly repeat
answers and it sometimes doesn't give
the correct answer and it sometimes
forgets what you said before so I
think the way to solve this problem is
to use many of them together CPU GPU
we're gonna go right back to the start
Daniel because CPU is
what CPU is like the processor on the
computer so like so the GPU is like a
graphics card for like gaming and like
you know uh running language models and
you know your computer display you can
have just a CPU but then if you want to
play games it'll be so slow so you need
to add this extra hardware called the
GPU you shove it in your computer and
that is the GPU it's like an extra it
makes like you know graphics faster and
that's H100s is that oh that's very
expensive so H100s are like extremely
pricey yeah so Nvidia's like I think you
probably have one now in a computer
like an RTX 30 or RTX 20 series you probably
do have a GPU in your computer yeah
I can run some things on Jan
I've got like an it's a Mac it's an M
something M1 yeah see what we're
working with here uh this is the yeah
the idea is if AI is going to be
accessible and democratized it means
that you know potato people people
running potatoes and people who have
potato brains need to understand it so
that's not a roast that's just a reality
of the fact that I think that there's so
much technology being
developed that is so advanced but we
have no way of communicating it and it
sits in this very kind of
closed off space like you know
it's yeah all the advancements
are very hard to
access and stuff yeah it's
unintentional but I think there's
so much that happens in little tiny
subreddits that the world needs to kind
of know about and like I don't go in
there because I'm scared but I see it
and I'm like oh that would be
interesting if I knew what that was
what's a GPU but uh yeah so no
like yeah look in the olden days you
know we generally ran CPUs there's
like two right there's a CPU and there's a
GPU um technically a GPU is not even
necessary it's optional you
just if you want it to be faster you
know more responsive you can buy a
GPU you can still do everything on a CPU
it's just very slow
yeah so fine-tuning is the current focus
and the future is fluid I think the
goal is like our vision is in the
future everyone will have a
graphics card on their computer it might
be very weak like you know but I
think everyone will have a GPU on the
computer and if this is the case we want
to provide a fine-tuning bot on
your PC it reads all of your personal
data it does not upload any data this is
your personal chatbot that you have on
your local machine and it does fine-tuning
every single day and so this
model will learn about you and you can
ask questions however you like
so this becomes like a personal ChatGPT a
personal ChatGPT for you only and so that's
kind of our ultimate mission that's yeah
but that's a future plan yeah yeah
like a like a non-crappy
Clippy but yeah exactly right yeah it's
yeah it's a personal ChatGPT for everyone
yeah yeah and so I guess a lot of the
conversation in commercial use cases is
always that they love the idea of local
because it's local that's how I came to
kind of learn about what RAG is and that
was like a whole thing for me so I was
like oh so to my understanding this
is like semi off topic but you're a
person who knows in front of me so I'm
gonna ask you that RAG is a way of
verifying the information that you get
back from a large language model is that
right so yes so RAG I guess is like
knowledge injection so like pretend you
ask a question what's today's date right so a
language model won't know that right it doesn't
understand anything about like current
events so what you can do is you can
take all of Wikipedia right shove all of
Wikipedia in as like a library and allow
the language model to search in the
library for the answer right so like if
I ask the question again what is today's
date Wikipedia will say you know
somewhere on the homepage today's date
is you know whatever the date is right
and so the language model you can use RAG
which is retrieval-augmented generation to
search a big library for the correct
answer so that's RAG nice so it's
different is it different from like ChatGPT
or Bing accessing the internet that's
just different right oh that's that's
also RAG so like so assume so you know
like when I said Wikipedia is
like a library assume Google the
internet is the library right so like
just replace Wikipedia with
Google like Google search replace
Wikipedia with like you know with
something else like you know if you want
to do movies if you have a
complicated question about a movie you
just replace it with the IMDb
like you know database if you have a
law question replace it with like
Supreme Court like you know judgments or
something so RAG is just a knowledge
injection system you just need to have a
big database yeah nice I've got two more
questions the first one is who would you
interview in the AI space if you could
interview someone in the AI space
I think I would like to talk to or
interview maybe Ilya from OpenAI I think
I've watched his videos and understood
his views I think his very
interesting take is language models are
enough for AGI um because like most
people think that if you predict the
next word right language
models are just predicting the next word
right it's nothing special it's not like
intelligence or something right but then
his take is if you want to predict the
next word you need to
understand everything about the world
right like if you want to predict
something if you want to predict the
next word you need to understand
everything the context of everything
about it you know why is this the
word after these few words you need to
understand everything in the world and
so his take is very fascinating I
think yeah if I had a
chance I would talk to him
yeah did you go to the Sam
Altman chat in I think it was Sydney
like last year no I did not it's funny
it was in Sydney but I didn't go yeah
I think right yeah where do you keep
up with everything because I know that
when you work in AI you actually don't
sleep it's not allowed it's in the rules
so I said like is it an RSS feed like
do you have a favorite YouTube or is it just
Discords
Reddit yeah I think Twitter's
generally useful for new releases
of stuff Twitter's yeah Twitter's pretty
good Reddit yeah LocalLLaMA is very useful I
think LocalLLaMA has all the
latest information very very useful I
also like YouTube videos are generally
good I think they're a bit delayed
though I think YouTube like if you want
to stay at the very edge it will be
like one week delayed I like Yan's
YouTube videos they're very useful like
his videos are very helpful I watch them
all the time so Yan is my
recommendation yeah that's if you've got
like if you want to stay up to date with
everything so Twitter Reddit and I guess
YouTube I guess this sounds like
everyone like it sounds like general
advice um yeah yeah and
you're currently
bootstrapped where can
people support you and how can people
support you I know that you have a is it
pronounced Ko-fi or coffee I
actually don't know Ko-fi I call it
Ko-fi or is it coffee I don't know I
always thought it was coffee probably
I don't know I'm gonna message them and
ask but you have one and that is at
unsloth actually I don't know what's the
Ko-fi
link I think we just typed it it's on our
GitHub so we have an open
source package you know we're open to
any collaborators from the open
source community if you want to add
new feature requests or pull requests
more than happy if you want to support
our work we have a Ko-fi link um so
you can donate some money to
us and we can implement more
features quicker like I think we do have
some Colab expenses but there's
not that much for testing
stuff out so I think that's the main
expense we have but yeah more than happy
to accept community help if that's
what you like yeah nice thank you very
much thanks for joining me on
unsupervised learning we were speaking
to Daniel from
Unsloth this episode was sponsored by Jan
run any LLM locally on your device on
Mac Windows and Linux see the open source
GitHub repo at janhq
[Music]