YOLO World Training Workflow with LVIS Dataset and Guide Walkthrough | Episode 46
Summary
TLDR This video tutorial guides viewers through training a custom YOLO World model on the large-scale, fine-grained LVIS dataset. It covers setting up the training pipeline with the Ultralytics platform, selecting the YOLO World v2 model, and choosing between training from scratch or fine-tuning with custom data. The video also highlights the extensive LVIS dataset, which contains over 1,200 object categories, and demonstrates how to train the model locally with the help of provided code snippets.
Takeaways
- 📚 The video tutorial focuses on training a custom YOLO World model for object detection using a large-scale dataset called LVIS.
- 🔍 LVIS is a large-scale, fine-grained vocabulary dataset with over 1,200 object categories, which is more extensive than the standard COCO dataset.
- 💻 The video demonstrates how to set up the training pipeline using the Ultralytics framework, which simplifies the process without needing to write extensive code.
- 🚀 The tutorial covers both training a YOLO World model from scratch and fine-tuning it with a custom dataset for specific tasks.
- 🌟 YOLO World models come in different sizes: small, medium, large, and extra large, each suited to different computational budgets and accuracy needs.
- 📈 The video provides a step-by-step guide on using the dataset's YAML file to specify the dataset path and the training and validation splits.
- 💾 It mentions the importance of having a powerful GPU for training large datasets like LVIS, as it can take several hours or even days.
- 📊 The tutorial shows how to monitor the training process by tracking metrics such as loss and mean average precision (mAP) over epochs.
- 🔧 The video suggests that for practical purposes, one might prefer to fine-tune a pre-trained model rather than training from scratch due to the significant time investment.
- 🔗 The script provides insights into using the trained model for predictions and mentions that the Ultralytics framework provides tools for further analysis like confusion matrices.
Q & A
What is the purpose of the video?
-The purpose of the video is to demonstrate how to train a custom YOLO World model, including using a large-scale dataset called LVIS and setting up the training pipeline.
What dataset is being used to train the YOLO World model?
-The dataset being used is LVIS, a large-scale, fine-grained vocabulary dataset with over 160,000 images and 1,200 object categories, released by Facebook AI Research.
What are the main differences between the LVIS dataset and the COCO dataset?
-The main differences are that the LVIS dataset contains over 1,200 object categories, while the COCO dataset only has 80. LVIS is more comprehensive and provides a larger and more diverse set of objects for training models.
What are the supported tasks for the YOLO World model?
-The YOLO World model supports inference, validation, training, and export tasks. However, export is only available with the YOLO World v2 models.
How can you train a YOLO World model using your own custom dataset?
-You can train a YOLO World model using your own custom dataset by creating a dataset in the required format and specifying it in the model training command, using the LVIS dataset structure as a reference.
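The dataset YAML mentioned here follows the standard Ultralytics layout (a dataset root path, train/val image folders, and a class-name map), the same shape as lvis.yaml. A minimal sketch of generating such a file for a small custom set — the folder layout and class names below are hypothetical examples, not from the video:

```python
# Sketch: build a minimal Ultralytics-style dataset YAML for a custom set.
# The root path and class names here are made-up placeholders.

def build_dataset_yaml(root: str, classes: list[str]) -> str:
    """Return YAML text in the same shape as lvis.yaml: dataset root,
    train/val image folders, and an index -> name class map."""
    lines = [
        f"path: {root}",        # dataset root directory
        "train: images/train",  # training images, relative to root
        "val: images/val",      # validation images, relative to root
        "names:",
    ]
    lines += [f"  {i}: {name}" for i, name in enumerate(classes)]
    return "\n".join(lines) + "\n"

yaml_text = build_dataset_yaml("datasets/my-custom-set", ["cat", "dog"])
print(yaml_text)
```

Saved as, say, my-data.yaml, such a file can then be passed to the same training command the video uses for lvis.yaml.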
Why is it suggested to use a local environment for training instead of Google Colab?
-It is suggested to use a local environment because the dataset is very large, and extracting and training it in Google Colab would take a long time. Training on a local GPU is more efficient for large-scale datasets like LVIS.
What hardware specifications are required for training the YOLO World model locally?
-Training the YOLO World model locally requires a powerful GPU, such as an RTX 4090, as it involves processing over 100,000 images, which takes significant computational resources.
What are the key metrics used to evaluate the YOLO World model during training?
-The key metrics tracked during training are the box loss, class loss, DFL loss, and the mean Average Precision (mAP) at IoU threshold 0.5 and averaged over thresholds 0.5:0.95.
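The mAP@0.5:0.95 figure mentioned here is the AP averaged over ten IoU thresholds from 0.50 to 0.95 in steps of 0.05 (COCO-style). A toy illustration of that averaging, with invented per-threshold AP values:

```python
# Toy sketch: mAP@0.5:0.95 is the mean of AP over IoU thresholds
# 0.50, 0.55, ..., 0.95. The AP values below are invented for illustration.

def map_50_95(ap_per_threshold: dict[float, float]) -> float:
    """Average AP over the ten standard COCO-style IoU thresholds."""
    thresholds = [round(0.5 + 0.05 * i, 2) for i in range(10)]
    return sum(ap_per_threshold[t] for t in thresholds) / len(thresholds)

# AP typically drops as the IoU threshold tightens:
aps = {round(0.5 + 0.05 * i, 2): 0.40 - 0.03 * i for i in range(10)}
print(round(map_50_95(aps), 4))
```

Because the strictest thresholds drag the average down, mAP@0.5:0.95 is always at or below mAP@0.5, which is why both are reported separately.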
How long does it typically take to train the YOLO World model on the LVIS dataset?
-Training the YOLO World model on the LVIS dataset for 30 epochs may take several hours to days, depending on the hardware used. In the video, 10 epochs took around 3 hours using an RTX 4090 GPU.
What are the advantages of using open vocabulary models like YOLO World?
-Open vocabulary models like YOLO World can detect an arbitrary number of object classes beyond those available in datasets like COCO. This flexibility makes them suitable for a wider range of applications without requiring specific training for each possible class.
Outlines
🚀 Introduction to Training Custom YOLO World Models
The video begins with an introduction to training a custom YOLO World model. The presenter discusses training models on a dataset called LVIS, a large-scale, fine-grained vocabulary dataset. Unlike the standard COCO dataset, it contains a vast number of images and classes, allowing for the pre-training of YOLO World models. The video aims to demonstrate how to set up a pipeline for training these models, and also mentions the possibility of using one's own custom data for fine-tuning.
📚 Exploring YOLO World Models and Datasets
The presenter dives into the Ultralytics documentation to explore the available YOLO World models. They discuss the features of the models, such as their ability to detect an arbitrary number of objects due to their open-vocabulary nature. The video then guides viewers on how to select a model and prepare for training by choosing between different model sizes and versions. The presenter also touches on the process of using the LVIS dataset, a large-scale dataset released by Facebook AI Research, for further training and fine-tuning of the models.
💻 Setting Up Training with Large-Scale Datasets
The video script describes the process of setting up training on a local machine using a large-scale dataset like LVIS, which contains over 100,000 images. The presenter explains the steps involved in unzipping and preparing the dataset for training, emphasizing the importance of having a GPU for such tasks. They also mention the use of Google Colab for smaller datasets and provide insights into the time and resources required for training on such a large dataset.
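The scale described here explains the long runtimes: with a typical batch size, one pass over a 100,000-image training set is thousands of optimizer steps. A back-of-the-envelope sketch — the per-batch time below is an invented placeholder, not a number measured in the video:

```python
import math

def batches_per_epoch(num_images: int, batch_size: int) -> int:
    """Number of optimizer steps needed to see every image once."""
    return math.ceil(num_images / batch_size)

def rough_epoch_hours(num_images: int, batch_size: int, sec_per_batch: float) -> float:
    """Very rough wall-clock estimate for one epoch."""
    return batches_per_epoch(num_images, batch_size) * sec_per_batch / 3600

# LVIS-scale training set with a default-ish batch size of 16:
steps = batches_per_epoch(100_000, 16)
print(steps)  # roughly matches the ~6,000 batches per epoch seen in the video
# 0.18 s/batch is a made-up placeholder for a fast consumer GPU:
print(round(rough_epoch_hours(100_000, 16, 0.18), 2))
```

The point is only that epoch counts multiply quickly: even a fraction of an hour per epoch turns 30 epochs into many hours, consistent with the presenter's 3-hour figure for 10 epochs.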
🔍 Analyzing Training Results and Model Performance
This section discusses the process of analyzing the training results and model performance. The presenter shares insights on tracking metrics over time, including losses and mean average precision. They mention the expected decrease in loss and increase in mean average precision as the model trains over epochs. The script also hints at the presenter's intention to cover more about the results in future videos, encouraging viewers to test out the training process themselves.
🎉 Conclusion and Invitation to Future Videos
The video concludes with a summary of the training process for YOLO World models and an invitation to viewers to explore the training of these models with their own custom data. The presenter expresses gratitude for watching and looks forward to engaging with the audience in upcoming videos.
Keywords
💡YOLO World model
💡Dataset
💡Fine-tuning
💡Pre-trained models
💡Open vocabulary
💡Training
💡Validation
💡Epoch
💡Loss
💡Mean Average Precision (mAP)
💡Google Colab
Highlights
Introduction to training a custom YOLO World model
Using the large-scale fine-grained vocabulary dataset called LVIS
YOLO World models are pre-trained on open vocabulary datasets
Demonstration of setting up the training pipeline
Option to use custom data for fine-tuning YOLO World models
Accessing YOLO World models through the Ultralytics platform
Overview of the YOLO World model and its key features
Support for various tasks like inference, validation, and training
Choice between different model sizes: small, medium, large, and extra large
Explanation of transfer learning on the COCO dataset
Details on the LVIS dataset: scale, annotations, and categories
How to use the LVIS dataset for training YOLO World models
Code snippets provided for easy setup and training
Instructions for local training with large datasets
Importance of using a GPU for efficient training
Process of extracting and preparing the dataset for training
Training the model locally with specified epochs and image sizes
Monitoring the training process and tracking metrics
Results after 10 epochs of training and the need for longer training
Analysis of the model's performance on the validation set
Final thoughts on training custom YOLO World models and future applications
Transcripts
hey guys welcome to video in this video
here we're going to see how we can train
a custom YOLO World model so in one of
the previous videos we already went over
how we can run the model and so on but
now we're going to take a look at how we
can train our own models we're going to
use a data set called
LVIS so it's basically just a large-scale
fine-grained vocabulary data set so
normally when we train the YOLO World
model also the pre-trained ones from
Ultralytics they're pre-trained on open
vocabulary data set so it's basically
just a huge data set with up to 100,000
images and also a bunch of different
classes so it's not just 80 classes from
the COCO data set but these models and
these data sets can be used for pre-
training YOLO World models so we're
going to show you how we can set up that
pipeline you can also use your own
custom data if you have smaller data
sets that you want to fine-tune the YOLO
World models on so let's just jump
straight into the Ultralytics
documentation if we go up inside the
models tab we'll be able to see all the
models that we have available with
Ultralytics so right now we're just going to
go down to YOLO World there we go first
of all you can read a short description
about it like get an overview and also
the key features but this is again just
an open vocabulary model so it's able to
detect an arbitrary number of objects you
can even prompt it and so on but in this
video here I'm going to show you how we
can either like train a YOLO World model
from scratch so basically like randomly
initialized weights or how you can use your
own custom data set to go in and
fine-tune these models for your specific
task so right now if you just scroll a
bit further down we can see the available
models supported task and also the
operating modes so first all here here
let's just go in and use the YOLO World
2 model so now we both have a version
one and also a version two we can see
the different task which are supported
so we can both do inference validation
training and also export but we can only
do export with the YOLO World v2 model so
definitely just go with that one so we
also have all the variations so we have
the small medium large and extra large
model so we're going to choose which of
those we want if we scroll a bit further
down we can then see the zero-shot
transfer on the COCO data set so that's
also like a large scale data set which
we normally pre-train the standard
YOLO models on but now we're acting on an
open vocabulary large-scale data set
with a bunch of different classes so
we're going to have like hundreds if not
thousands of classes in these models I'm
going to show you how we can take
another data set and fine-tune it or
even like train that from scratch it
will just require a lot of training time
so we can see some usage examples here
this is everything that we have to do
I'm going to do this locally because
it's going to take a long time the data set
that we're going to use
which I'm going to show you in just a
second has 100,000 images in the
training set you can see here how you
can train it predict and also do
validation you can also go in and Export
and so on we have code snippets for all
of it take it directly copy paste it
either into a Google collab notebook or
directly into your local environment and
you're good to go you don't have to
write any code at all you can just use
the Ultralytics framework and you just
have to specify a couple lines of code
or directly in the command line and then
you can go in and train the models
directly so if we want to really go
more into details with this let's now go
inside the data sets and let's take a
look at the data set that we want to use
so if we scroll a bit further down we
can then see for the object detection
data set we have this lvis data set so
if you just press on it we can go in and
see that this is a large-scale
fine-grained vocabulary-level annotation data
set and it is released by Facebook AI
Research so this is basically just a
research benchmark for object detection
and also instance segmentation but we are
only going to specify and work with
the object detection data set so here we
can just see that it is a large
vocabulary of categories aiming to drive
further advancement in computer vision
field so I'm basically just going to
call it lvis here so it contains 160,000
images and 2 million instance annotations
both for object detection segmentation
and captioning task we can see that we
have over 1,200 object categories so
instead of just having like the standard
objects like cars bicycles animals and
so on from the COCO data set where we
only have 80 classes
now we can go in and train these models
on 1200 classes directly go in and use
those for our own applications and
projects as pre-trained models so right
now we can just see the key features we
won't really go too much into details
with that but we have the data set
structure I'm going to extract the
information and so on I'm going to show
you like how you can just run the YAML
file with Ultralytics and it's going to
extract and unzip everything take care of
it and you can train it directly so we
have a training split validation split
we also have a mini validation split and
the test set here at the end so if we
scroll a bit further down we will be
able to see the data set yaml file with
all the different classes that we're
going to do detections on and also the
path to our train validation and our
mini validation split and this is pretty
much everything that we need if you just
want to use it directly and train our
model we can see the example usage Yow
train we also need to specify detection
or segmentation and we set the data set
path here equal to lvis.yaml then it's
just going to pull the data set from
Ultralytics here and you can use it
directly and I'm going to do that in
just a second so it's around 20 GB it's
over 100,000 images so it takes a long
time to actually go in and unzip and so
on so I'm just going to let it run in
just a second so we can use it for
training later on so if you're using
like these large scale data set like
definitely do it locally if you have a
GPU and so on but you can still do it in
Google collab notebook but you will most
likely just go in and fine-tune it using
a Google Colab notebook with a few hundred
images to a few thousand so you can
pretty much just see all the classes
here we have alarm clock airplane apple
apple sauce apricot apron scroll through
all of them ball basket ball instead of
just like sports ball which we have in a
COCO data set beach ball battery bed
cow and so on so we pretty much just
have any class that you can come up with
here in the data set so most likely if
you want to use a model directly out of
the box a pre-trained model and you don't
want to find your own data set you can
definitely go in and use these open
vocabulary data sets if you just go a
bit further down let's just go down and
verify yes we have
1,203 classes this is how you can go
and download if you want to download
directly from code this is how you can
download the labels and also download
the data but if you just use it directly
with Ultralytics it's going to extract all
the folders the whole data structure and
it's going to run the training directly
or if you want to use it for predictions
later on here we can see some sample
images and also annotations so it's
basically just a ton of different images
both for instance segmentation object detection
and so on open vocabulary we have 1,200
classes so that's pretty much it let's
now go ahead and see how we can train
this model if we just go inside a Google
collab notebook so first of all here we
just need to pip install ultralytics then we need
to create an instance of the YOLO world
class we just need to specify which of
the YOLO models and also the YOLO World
version two now we have the model now we
can just go down and specify if we want
to train it and which of the data set
that we want to train it on and right
now we just need to specify lvis.yaml
the number of epochs image sizes and we
also have a bunch of other Arguments for
the training script here that you can go
and set based on the Ultralytics
documentation you can also run it
directly in the command line so you can
just call yolo detect train specify
the data path here so lvis if it's not
able to find that on your own local
computer it's going to pull it from the
Ultralytics registry where we just have all
the data sets in there so it will take a
long time with this specific data set
here but if you use Roboflow if you
use conversion tool and so on to
generate this yaml file just for a few
hundred images you can do it perfectly
fine in here and it will only take a few
seconds to extract so right now I won't
do it here on Google collab notebook it
will just take too long so I'm going to
do it locally on my own computer so
right now let's just go in here and run
it and then I'm going to do it on my
local computer so I'm going to do the
exact same thing but I'm going to run
the training because it will take like
several hours to do the whole training
when we're talking about like 100,000
images that we need to process so I have
an RTX 4090 on my home computer and we're
going to see the training results go
over Epoch for Epoch so we can see how
we can train these YOLO World model on a
large scale data set and it doesn't even
have to be large scale you can do it on
your own images and data set as well so
first of all we just pip install
ultralytics we create an instance of the
model we train it and we don't have to
run the last line down here at the
bottom because it's going to do the
exact same thing in here so for epochs we
can also specify the batch size and so on
but we're just going to go with default
ones for now because we we're only
interested in seeing the data set
structure and then we're going to do it
locally so if you just take a look at
the data set locally while it's running
in here in Google collab so right now I
just have data sets and we have lvis we
have The annotation images and also
labels if I go inside the images we have
test train and validation and if we just
go inside the validation set we can see
that we have 5,000 images for the mini
validation and these are all the images
and we have all the labels for the
images and this is only the validation
set with 5K images so if you just scroll
through it you can see there's a bunch
of variations like a bunch of different
types of images bunch of different
objects and so on so this is a really
good data set to pre-train a model on so
normally when you have such a huge data
set you just train the model from
scratch but it would probably just take
too long to converge so I'm also just
going to fine-tune it just to be able to
run it like for 10 20 Epoch so it
doesn't take like multiple days to train
on my own single 4090 GPU so right now we
can see that it just unzips so to start
with it's going to download the model
directly if you're running this for the
first time so right now here we can see
that it's missing the path so it can't
recognize this YAML file locally or at
least in your environment right now so
it's going to extract all of that from
Ultralytics so we can see that it's
unzipping from data sets the LVIS label segments
into this directory and we can go over
and see to the left so we have our data
sets we have our lvis and then we have
our annotations labels and we're also
going to have our images later on so
this is such a huge data set and it will
take very long to go in and extract in
Google collab it probably took me around
like 20 minutes locally on my own
computer but let's now go and see how we
can set it up and also run training
directly on our own local environment so
while it's just unzipping the whole data
set in here let's just go in and open up
a new terminal I'm just going to use
Anaconda prompt right now we can just go
down and take this command directly
throw it in here after we have pip
installed ultralytics locally on our
own environment so this is not a Google
callab notebook this is on my own
computer and if I just delete all of
this and verify that we have a GPU
attached to it we can call nvidia-smi
there we go and we should get all the
information about our GPU 24 GB of VRAM
Nvidia GeForce RTX 4090 so we're good to go
and we can just copy paste this command
in we don't want to run it for too long
so let's just go down and actually run
it for 30 epochs and I'm just going
to let it run and then we're going to
come back and take a look at the results
epoch per epoch the image size here we can
also specify batch size and so on but
let's just go with the default
parameters so when we run it locally we
don't need the exclamation mark that is
only in a Google collab notebook so
right now it's just going to extract the
whole data set so the training images we
can see that is extracting all the
images here it's going pretty pretty
fast but we also need to extract 100,000
images and we can see the track bar over
here or the progress bar so right now is
around 20% it's going to take the
training and also the validation set
after it's done extracting all the
images loading it into the system it is
going to start the training for the for
the epoch that we have specified and
then we can just lock the metrics over
time take a look at the losses and also
the mean average precisions see how our
model converges and then we're just
going to let it run and come back and
take a look at the results because this
is going to take a long time to process
it'll probably take several hours to be
able to train this model and this is
still just on a fine-tuned model if we
want to train from scratch it will
probably take multiple days for a model
to be able to converge so we can do
meaningful predictions with our new YOLO
World model that we have trained from
scratch on a large scale data set so
this is also how you take a model from
scratch and create these pre-trained
models which we have with YOLOv5 YOLOv8
YOLO World and so on so right now we can see
that our training and validation set has
been extracted we can also see that we
have our Optimizer set up image size is
640 for the train and also validation
we're using eight data load workers and
we're also logging the results to runs
detect train starting training for
30 epochs and now we can go in and
track Epoch per Epoch the whole training
process so right now we can see Epoch 1
out of 30 the Box loss class loss dfl
loss and also our instances we'll also
get the mean average precision and so on but
right now we can see here that it has
processed 500 batches out of 6,000 for
single Epoch so that's a lot of data
that we need to process for every single
Epoch so right now let's just let it run
for some hours and we can go back and
take a look at the training results
after that so the model is now done training
let's go down and take a look at the
epoch and the results so right now we
have just trained for 10 Epoch and we
should definitely have trained it for
longer but it will take like several
days to train either model from scratch
or just fine-tuning it on the pre-trained
YOLO World model so we're going to take
a look at the metrics Epoch per Epoch we
both have all the losses we also have
the mean average precision at 50 and also mean
average precision 50 to 95 and these are pretty
much the values that we should look at
the average precision should be
increasing and the losses should be
decreasing over the number of epochs if you
just go a bit further down we can then
see that the mean average precision which we
have here is just increasing over time
we start out at around
0.028 and then we end off after 10 epochs
at
0.0764 so that's pretty good our mean
average precisions are increasing and we can also
see that our losses are decreasing
significantly at least here in the start
which is also expected so we definitely
need to train this model for longer we
can see that the 10 epochs completed in 3
hours if we were to actually train
a model fully we can see that it hasn't
even converged yet it is not near that
but if you want to train this model
fully we'll have to run it for probably
several days so right now the mean average precision
is around like 0.07 and we could probably
expect it to be up in the 0.40 range for
the specific data set after it's done
training it will also go in and do
evaluation with all the classes and so
on so we can see the individual classes
how does the model perform on those and
this is not really too meaningful but
you can dive into some of the classes if
you have some specific ones that you
want to take a look at or you can inside
the Run folder take a look at the
confusion Matrix but we have videos
about covering like the whole run folder
all the results that it's going to
generate after we have trained a world
model with Ultralytics so thank you
guys for watching this video here I hope you
learned basically just how we can
train a YOLO World model both on a large
scale data set but you can also do the
exact same thing with your own custom
data with a few hundred images so
definitely go in test it out it is
really nice to learn how you can set up
the whole training Pipeline and test out
these open vocabulary models where
we can do pretty much the object
detection on an arbitrary object instead
of only the 80 classes from the COCO data
set so thanks a lot for watching again
and I hope to see you guys in one of the
upcoming videos until then Happy
training