Deploy Hugging Face models on Google Cloud: from the hub to Inference Endpoints

Julien Simon
9 Apr 2024 · 07:13

Summary

TL;DR: Julien from Hugging Face introduces a series of videos on deploying Hugging Face models on Google Cloud. This first video demonstrates using Hugging Face's Inference Endpoints to deploy models such as Google's new Gemma model with a single click. It walks through accessing the model on the Hub, deploying it on Google Cloud, and testing it via the playground and the API. Julien also shows how easy it is to delete the endpoint to stop charges, and promises more deployment methods in upcoming videos.

Takeaways

  • 🚀 Julien introduces a series of videos on deploying Hugging Face models on Google Cloud.
  • 🤝 Hugging Face has announced a partnership with Google Cloud.
  • 📹 The video will demonstrate deploying models using Hugging Face's own service, Inference Endpoints.
  • 🔗 One-click deployment of models from the Hub to Google Cloud is demonstrated.
  • 🌐 Deployment is currently limited to a single US region, with more regions expected.
  • 🛡️ The deployment includes options for security levels: public (not recommended) and protected with token authentication.
  • 🔄 The video discusses the TGI (Text Generation Inference) serving container, which has moved back to the Apache 2 license.
  • 💻 The script guides viewers on how to select deployment settings like autoscaling and model revision.
  • 🛑 The importance of deleting endpoints after testing to avoid charges is highlighted.
  • 📈 The video showcases the ease of deployment with a simple click and code copy-paste for testing.
  • 🔍 The script includes a practical example of deploying and testing the 'Gemma' model from Google.
  • 🗓️ More videos on different ways to deploy Hugging Face models on Google Cloud are promised for the future.

Q & A

  • Who is the speaker in the video?

    -The speaker is Julien Simon from Hugging Face.

  • What is the main topic of the video?

    -The main topic is deploying Hugging Face models on Google Cloud using Inference Endpoints.

  • What is the partnership announced in the video?

    -The partnership announced is between Hugging Face and Google Cloud.

  • What is the name of Hugging Face's own deployment service mentioned in the video?

    -The deployment service is called Inference Endpoints.

  • How many videos does Julien plan to make about deploying models on Google Cloud?

    -Julien plans a series of videos; at least three are mentioned (this one plus two more to come).

  • What is the first model Julien decides to deploy on Google Cloud?

    -Julien decides to deploy the new version of the Gemma model from Google.

  • What is the license under which the TGI serving container is now available?

    -The TGI (Text Generation Inference) serving container is now available under the Apache 2 license.

  • What are the security levels available for deploying models on Google Cloud as mentioned in the video?

    -The security levels mentioned are public and protected. There is no private option available at the moment.

  • How can viewers test the deployed model using the video's instructions?

    -Viewers can test the deployed model in the playground, or through the API with a token, since the endpoint uses the protected security level (a minimal sketch follows this Q&A).

  • What should viewers do after they finish testing the deployed model?

    -After testing, viewers should delete the endpoint by going to settings, typing or pasting the endpoint name, and clicking delete to avoid further charges.

  • What does Julien suggest at the end of the video for viewers to do?

    -Julien suggests that viewers keep an eye out for the next two videos, where he will show more ways to deploy Hugging Face models on Google Cloud.
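
As referenced in the testing answer above, here is a minimal sketch of calling a protected endpoint from Python with the huggingface_hub client. The endpoint URL and generation parameters are placeholders and assumptions, not values from the video; copy the real URL and token from your endpoint's page.

```python
from huggingface_hub import InferenceClient

# Placeholder URL: copy the real one from your endpoint's overview page.
ENDPOINT_URL = "https://your-endpoint.us-east4.gcp.endpoints.huggingface.cloud"

# Protected endpoints require a Hugging Face token that can access the endpoint.
client = InferenceClient(model=ENDPOINT_URL, token="hf_xxx")

# Generation parameters are illustrative, not the video's exact values.
output = client.text_generation(
    "What is there to see in Seattle besides Starbucks?",
    max_new_tokens=200,
    temperature=0.8,
)
print(output)
```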

Outlines

00:00

🚀 Deploying Hugging Face Models on Google Cloud

Julien from Hugging Face introduces the new partnership with Google Cloud and demonstrates how to deploy Hugging Face models on Google Cloud. He plans a series of videos showcasing different deployment methods. In this first video, he focuses on Hugging Face's own deployment service, Inference Endpoints, which deploys models from the Hub to Google Cloud with ease. Julien walks through deploying Gemma, the new model version from Google, by opening it on the Hub, requesting access if necessary (it is a gated model), and selecting Google Cloud as the deployment target. He explains the configuration options, including the serving container, the security levels (public versus protected access), autoscaling, and quantization. Julien also mentions that Hugging Face's Text Generation Inference (TGI) serving container has recently moved back to the Apache 2 license, which is good news for the community. The video pauses while the deployment runs, and Julien promises to test the endpoint once it is ready.
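
The video shows the one-click UI flow; as a companion, here is a hedged sketch of the equivalent programmatic route via huggingface_hub's create_inference_endpoint. The region and instance identifiers below are assumptions, not the exact values from the video; the valid options are whatever the Inference Endpoints UI lists for Google Cloud.

```python
from huggingface_hub import create_inference_endpoint

# Sketch only: region and instance identifiers are placeholders, not the
# exact values shown in the video. Check the Inference Endpoints UI for
# the options actually offered on Google Cloud.
endpoint = create_inference_endpoint(
    name="gemma-on-gcp",
    repository="google/gemma-7b",  # gated model: access must be granted first
    framework="pytorch",
    task="text-generation",
    vendor="gcp",                  # the new Google Cloud option
    region="us-east4",             # hypothetical single US region
    type="protected",              # token-authenticated, as in the video
    accelerator="gpu",
    instance_size="x1",            # placeholder size
    instance_type="nvidia-a10g",   # placeholder instance name
    min_replica=0,                 # optional autoscaling bounds
    max_replica=1,
)
endpoint.wait()  # block until the endpoint reports "running"
print(endpoint.url)
```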

05:02

🔧 Testing and Managing Deployed Models on Google Cloud

In the second part of the video, Julien tests the deployed Gemma model on Google Cloud using the playground, which requires a token because the endpoint uses the 'protected' security level. He then demonstrates how to call the API to generate text from a prompt, pasting the generated code into a notebook; the output, a passage about the Great Seattle Fire, is more interesting than the initial Starbucks answer. Julien emphasizes how simple and fast the deployment process is, allowing quick testing and experimentation. After testing, he shows how to delete the endpoint to avoid further charges: navigate to the endpoint's settings, type or paste the endpoint name, and confirm the deletion. He concludes by reminding viewers that more videos are coming that will explore additional ways to deploy Hugging Face models on Google Cloud.
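
The paragraph above covers deleting the endpoint through the settings page; huggingface_hub also exposes a programmatic delete, sketched below under the assumption that the endpoint was named as in the deployment sketch earlier.

```python
from huggingface_hub import delete_inference_endpoint

# Deleting the endpoint stops billing. The name must match the one shown
# on the endpoint's settings page; "gemma-on-gcp" is our placeholder.
delete_inference_endpoint("gemma-on-gcp")
```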

Keywords

💡Hugging Face

Hugging Face is an organization known for its contributions to the field of natural language processing (NLP), particularly through its open-source library, Transformers. In the video, Hugging Face is mentioned as the developer of models and the provider of a deployment service, which is central to the video's theme of deploying models on Google Cloud.

💡Google Cloud

Google Cloud is a suite of cloud computing services offered by Google. It is highlighted in the script as the platform where Hugging Face models are being deployed. The video's main purpose is to demonstrate the ease of deploying models on Google Cloud, showcasing its integration with Hugging Face's services.

💡Inference Endpoints

Inference Endpoints refer to the endpoints or interfaces through which machine learning models receive input and return predictions. In the context of the video, Inference Endpoints is a service provided by Hugging Face for deploying models, and the script describes how to use this service to deploy models on Google Cloud.

💡Model Deployment

Model Deployment is the process of putting a trained machine learning model into a production environment where it can be used to make predictions. The video script provides a tutorial on deploying Hugging Face models on Google Cloud, emphasizing the simplicity and speed of the process.

💡Gated Model

A gated model is one that requires access permission before it can be used. In the script, the presenter mentions that the Gemma model is a gated model, meaning that viewers need to request access and confirm via email to use it, which is a step in the deployment process described.

💡TGI Serving Container

TGI (Text Generation Inference) is Hugging Face's serving container, purpose-built for serving large language models. The script mentions that the deployment uses TGI, indicating that it is the technology running the deployed model on Google Cloud.

💡Apache 2 License

The Apache 2 License is a permissive free software license written by the Apache Software Foundation. In the video, it is mentioned that TGI has reverted to the Apache 2 license, which is significant as it indicates that the software can be used in a wide range of ways, including commercial use, without concern for license compatibility.

💡Security Level

The security level in the context of the video refers to the access control settings for the deployed model. The script discusses two options: 'public' and 'protected', with the latter requiring token authentication, illustrating the importance of securing access to the deployed models.

💡Autoscaling

Autoscaling is a feature in cloud computing that automatically adjusts the amount of computational resources based on the demand. The script briefly mentions autoscaling as an option during the deployment process, indicating that it can be configured to handle varying loads of model usage.

💡Quantization

Quantization in machine learning refers to the process of reducing the precision of the numbers used in a model to save space and potentially speed up computations. The video script mentions quantization as an optional feature during deployment, suggesting a method to optimize model performance.
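
To make the idea concrete, here is a minimal local sketch of quantization using the transformers and bitsandbytes libraries. It only illustrates the concept; the quantization option on Inference Endpoints is a deployment-time toggle, and the 4-bit setting below is an assumption.

```python
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# Load the weights in 4-bit precision to cut memory use (requires a GPU
# and the bitsandbytes package). Illustrative only: the endpoint's own
# quantization toggle is configured in the UI, not through this code.
config = BitsAndBytesConfig(load_in_4bit=True)
model = AutoModelForCausalLM.from_pretrained(
    "google/gemma-7b",  # gated: request access on the Hub first
    quantization_config=config,
    device_map="auto",
)
```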

💡Playground

In the context of the video, the Playground is the built-in interface for interactively testing a deployed endpoint. The script describes using the Playground to test the model's responses to input, demonstrating the practical application of the deployed model.

💡API

API stands for Application Programming Interface, which is a set of rules and protocols for building software applications. The video script mentions using an API to interact with the deployed model, including passing a token for authentication and adjusting generation parameters, showing how developers can programmatically access the model.
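
For illustration, a raw HTTP call to a TGI-backed endpoint might look like the sketch below. The URL and token are placeholders; the payload shape (an inputs string plus a parameters object) follows TGI's generate API, which is the part the video adjusts when raising the temperature.

```python
import requests

# Placeholders: use your endpoint's real URL and a valid token.
ENDPOINT_URL = "https://your-endpoint.us-east4.gcp.endpoints.huggingface.cloud"
headers = {"Authorization": "Bearer hf_xxx", "Content-Type": "application/json"}

payload = {
    "inputs": "Tell me about the history of Seattle.",
    "parameters": {  # generation parameters, as tweaked in the video
        "max_new_tokens": 200,
        "temperature": 0.9,
    },
}

response = requests.post(ENDPOINT_URL, headers=headers, json=payload)
print(response.json())
```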

Highlights

Introduction to a partnership with Google Cloud for deploying Hugging Face models.

Announcement of several upcoming videos demonstrating deployment methods on Google Cloud.

Introduction of Hugging Face's own deployment service called Inference Endpoints.

Demonstration of one-click deployment of models from the Hub to Google Cloud.

Request for viewers to subscribe and enable notifications for future updates.

Explanation of accessing a gated model by entering an email for access.

Encouragement to read about the model and test it locally before deployment.

Step-by-step guide on deploying a model on Inference Endpoints with Google Cloud.

Mention of the deployment using Hugging Face's TGI serving container.

Announcement that TGI is now back to the Apache 2 license.

Discussion on choosing the security level for the deployment: public or protected.

Overview of configuration options for deployment, including autoscaling and model revision.

Simple process of selecting Google Cloud and security level to initiate deployment.

Pause in the video to allow the GCP instance to launch and prepare the endpoint.

Testing the deployed endpoint using a playground with a token for protected security.

Demonstration of changing generation parameters and using the API for testing.

Example of invoking the endpoint and printing the output in a notebook.

Instructions on how to delete the endpoint to stop charges after testing.

Teaser for two more videos showing additional ways to deploy models on Google Cloud.

Transcripts

00:00

Hi everybody, this is Julien from Hugging Face. As you can see, I'm on the road right now, but that's not an excuse not to do any videos. As you probably know, we recently announced a partnership with Google Cloud, and in this video and the following ones I will show you how you can quickly and easily deploy Hugging Face models on Google Cloud. There are different ways to do this, which is why I'm going to do several videos. In the first one, I'm going to show you how to use our own deployment service, called Inference Endpoints, and we'll see how we can deploy models from the Hub to Google Cloud in one click. As simple as that. Let's get started. If you enjoy this video, please give it a thumbs up and consider subscribing to my YouTube channel, and if you do, please don't forget to enable notifications so that you won't miss anything in the future. Also, why not share this video on your social networks or with your colleagues? If you enjoyed it, it's very likely someone else will. Thank you very much for your support.

01:05

Starting from the Hub, let's find a good model to deploy on Google Cloud. How about we try Gemma, this new version of the Gemma model from Google? Let's just click on it. If this is the first time you open this model page, you'll have to ask for access: this is a gated model, but just enter your email and confirm, and you should have access in seconds. Don't let that stop you. As always, I would encourage you to read about the model, and maybe test it locally; there's lots of good information there. But for now we want to deploy it on Inference Endpoints, so let's just click on Deploy, then Inference Endpoints.

01:52

You can see we have a new option for Google Cloud, right next to AWS and Azure, so why don't we select Google. At the moment we have a single US region, but I'm pretty sure we will add more, and we automatically select what we think is the best configuration for this model, so here we're going to deploy on this particular instance. As you can see, we are deploying with our TGI serving container, and by the way, I think just yesterday we announced that TGI is now back to the Apache 2 license, which I think is good news for everyone.

02:43

We can decide what the security level should be. Remember, public means public: wide open to the public internet, with no authentication. I wouldn't recommend it. Protected means accessible from the internet with token authentication. We don't have a private option for now, which we do have on other clouds, so let's go with protected. We can always take a look at the configuration: do we want autoscaling? Do we want a particular revision of the model? I guess we'll go with TGI; we could enable quantization if we wanted, etc., but I will stick with all those defaults. So, very simple: just select Google, and the security level of course, and that's pretty much it. Let's click on Create Endpoint. Now it will take a few minutes: this will automatically launch the GCP instance in our own account, prepare the endpoint, and so on. So I'll pause the video and wait for the endpoint to come up, and of course we'll test it afterwards.

04:00

After a few minutes, the endpoint is up; I can see it says "running" here. Well, why don't we test it? We could test it with the playground; we just need to select a token, obviously, because we're using protected security. So let's just try this challenging question. Trust me. All right, let's see what it says. All right, what did I tell you: Starbucks, horrible coffee. Hopefully there's something more interesting in Seattle than Starbucks. Anyway, that was the playground; let's try the API.

05:02

Again, I need my token. I could change some of the generation parameters, say, increase the temperature, and so on. Let's just include the token; don't worry, I will invalidate it afterwards. I just need to copy this, and let's switch to a notebook. Let's paste the code, and maybe we'll change the question. Let's try that again. Okay, let's just run this: invoking the endpoint, passing the token. And I guess we need to print the output, so let's just pretty-print it. All right: "the Seattle Fire of 1889 was one of the more destructive in American history". Well, that's clearly more interesting than Starbucks.

06:22

So, as you can see, super nice and simple: just one click to deploy, then copy-paste the code, and you can test in minutes. When you're done, don't forget to delete the endpoint. Let me show you: when you're done testing, just go to Settings, scroll all the way down, type or paste the endpoint name, and click on Delete. It goes away, and you stop being charged. Perfect. So that's the first way to deploy Hugging Face models on Google Cloud, using Inference Endpoints. I hope this was interesting. I've got two more ways to show you, so keep an eye out for the next two videos. Keep rocking!


Related tags

Hugging Face, Google Cloud, Deployment, Inference, Models, AI, API, Tutorial, Cloud Computing, AI Deployment, Video Guide