How to make AI Startup worth over $30M | Twelve Labs Jae Lee

6 Mar 2024 12:34


TLDR: Jae, the co-founder and CEO of Twelve Labs, shares the inspiring journey of building an AI research and product company focused on developing video foundation models. He narrates how the team overcame challenges, participated in a prestigious competition, and secured partnerships with industry giants like Nvidia. Twelve Labs aims to empower developers and enterprises with cutting-edge AI models that can deeply understand video content, enabling applications like semantic search, classification, and summarization. With a 'video-first' ethos, the company tackles the massive problem of making sense of the vast video data in the world, impacting sectors like law enforcement and media.


  • 😄 Twelve Labs is an AI research and product company building video foundation models that can understand videos like humans, serving developers and enterprises via APIs.
  • 🔍 Their models aim to map human language to video content, enabling capabilities like semantic search, classification, and summarization of videos.
  • 🏆 Participating in and winning a major video understanding competition helped Twelve Labs gain exposure and attract customers and investors.
  • 💻 The founders started the company while still in the military, working on laptops at a bagel shop during their free time.
  • 🌟 The company's 'secret sauce' is going head-on with the video understanding problem instead of reframing it as language or image understanding.
  • 📈 Twelve Labs has over 20,000 developers actively using their search API, with millions of monthly API calls and rapid growth in enterprise adoption.
  • 🎯 Setting an incredibly ambitious goal and having the determination to solve hard problems is crucial for founders to achieve impactful outcomes.
  • 🛡️ Building a moat by gathering unique data and fine-tuning smaller models, instead of relying solely on foundation models, is important for long-term success.
  • 📢 Effective communication and the ability to explain technical products to non-technical audiences is vital for widespread adoption and impact.
  • 🔭 The founders believe AI will present amazing opportunities for tech and humanity, and all products will be impacted by AI in the future.

Q & A

  • What is the main goal of Twelve Labs?

    -Twelve Labs aims to build massive AI models that can understand videos like humans and provide video understanding capabilities to developers and enterprises through APIs for tasks like semantic search, classification, and summarization.

  • How did Twelve Labs start, and what was the founding story?

    -Twelve Labs was founded by three co-founders who were serving in the Korean Cyber Command. They would meet during their weekends and work on their ideas before they were all discharged. The founding story was quite challenging as they had to coordinate their efforts while still in the military.

  • What was the significance of Twelve Labs participating in the ICCV competition?

    -Participating in the ICCV (International Conference on Computer Vision) competition for video understanding helped Twelve Labs gain exposure and recognition from potential customers and investors interested in multimodal AI and video understanding.

  • How does the Twelve Labs AI model work, and what data is used for training?

    -The Twelve Labs model aims to map human language to video content, enabling capabilities like search, classification, and summarization. It is trained on large amounts of video data, with the help of data partners who provide labeled data and licensed content in a copyright-friendly manner.

  • What is the current status of Twelve Labs' product adoption and usage?

    -Twelve Labs soft-launched its search API in June 2023. It is now actively used by over 20,000 developers and has crossed a couple million monthly API calls. The company is also seeing adoption from enterprise customers, including large creators, media/entertainment organizations, sports organizations, and law enforcement agencies.

  • What advice does the CEO of Twelve Labs give to founders and technical product builders?

    -The CEO advises founders to be patient and able to explain their technology and its impact to different audiences. He also emphasizes the importance of building a moat and not relying too heavily on foundation models, as well as setting ambitious goals and having the determination to solve incredibly hard problems.

  • What was the approach taken by Twelve Labs in building their video understanding technology?

    -Twelve Labs took a "video-first" approach, building their machine learning pipeline and systems specifically for handling videos from the ground up, instead of reframing the problem into other domains like language or image understanding.

  • How did the partnership with NVIDIA come about for Twelve Labs?

    -Jensen Huang, the CEO of NVIDIA, seemed to have a special interest in computer vision and video understanding, which was one of the first use cases for NVIDIA chips. NVIDIA's venture team reached out to Twelve Labs, seeing a perfect match between their vision and what Twelve Labs was doing in video understanding.

  • What are some of the mission-critical use cases mentioned for Twelve Labs' products?

    -One use case mentioned is digital evidence management for police departments, where Twelve Labs' technology can help search for specific evidence in body cam footage quickly and generate police reports more efficiently, reducing time spent by more than 40%.

  • What is the CEO's perspective on the impact of AI and the importance of staying ahead of the technology curve?

    -The CEO believes that AI will present amazing opportunities not only for tech but also for humanity. He emphasizes the importance of learning more about the technology and discerning trends to build a moat, as technology advancements like foundation models can potentially impact businesses relying too heavily on them.



🤝 The Formation of Twelve Labs and Partnership with NVIDIA

This paragraph discusses the founding of Twelve Labs, a company focused on building video foundation models for developers and enterprises. It details the initial meeting between the founders and Jensen from NVIDIA, where they discussed Twelve Labs' vision and NVIDIA's interest in computer vision and video understanding. The paragraph also provides background on the founders, their experience, and the funding and partnerships they've secured, including with NVIDIA, Intel, and Samsung.


🎥 Twelve Labs' Video Foundation Model and Its Applications

This paragraph delves into the core technology behind Twelve Labs: their video foundation model. It explains how the model is designed to understand videos like humans, mapping human language to video content, enabling capabilities like semantic search, classification, and summarization. The paragraph also discusses Twelve Labs' data partnerships, their approach to licensing and using data responsibly, and the process of training the model on hundreds of millions of video-text pairs. It also mentions the launch of their search API, its adoption by developers and enterprises across various sectors like law enforcement and media and entertainment, and their growth plans.


💡 Advice for Building Impactful Products and Maintaining a Competitive Edge

This paragraph offers advice for founders and companies aiming to build impactful products and maintain a competitive edge. It emphasizes the importance of being able to explain and communicate the value of one's technology to different audiences, not just experts in the field. The paragraph also cautions against relying too heavily on foundation models like GPT and advises companies to gather unique data and fine-tune smaller models to create a sustainable moat. Additionally, it stresses the importance of setting ambitious goals and maintaining determination to solve incredibly hard problems, as this will drive companies to achieve more impactful results.



💡Computer Vision

Computer vision refers to the field of artificial intelligence that enables computers to derive meaningful information from digital images, videos, and other visual inputs. In the context of the video, computer vision is described as one of the first use cases that Nvidia chips powered, indicating its importance in the early days of AI hardware development. The co-founder mentions that Jensen Huang seems to have a special place in his heart for computer vision and video understanding, which helps explain Nvidia's interest in this domain.

💡Video Understanding

Video understanding is the ability of an AI system to comprehend and interpret the content of video data, including recognizing objects, actions, scenes, and extracting meaningful information. Twelve Labs, the company featured in the video, specializes in building video foundation models that can understand videos like humans. This technology enables developers to incorporate powerful semantic search, classification, and summarization capabilities into their video-centric products.

💡Foundation Model

A foundation model is a large-scale artificial intelligence model that serves as the foundation for providing intelligence to software applications. The co-founder explains that Twelve Labs' models are foundation models designed to map human language to video content, enabling emergent capabilities like search, classification, and summarization. Foundation models are versatile and can perform multiple tasks simultaneously, forming the basis for intelligent software solutions.
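The idea of mapping language to video content can be illustrated with a small sketch. This is not Twelve Labs' model or API; it only assumes, for illustration, that both a text query and each video segment have already been turned into vectors in a shared embedding space, and that search means ranking segments by cosine similarity to the query. The segment labels, the three-dimensional vectors, and the `search` helper are all hypothetical, hand-written stand-ins for what a trained model would produce.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def search(query_vec, segments, top_k=2):
    """Return the labels of the top_k segments most similar to the query."""
    scored = [(cosine_similarity(query_vec, vec), label) for label, vec in segments]
    scored.sort(reverse=True)  # highest similarity first
    return [label for _, label in scored[:top_k]]


# Hypothetical, hand-written embeddings; a real system would get these
# from a trained video model rather than typing them in.
SEGMENTS = [
    ("00:00-00:30 officer exits vehicle", [0.9, 0.1, 0.0]),
    ("00:30-01:00 suspect runs down alley", [0.1, 0.9, 0.2]),
    ("01:00-01:30 officer writes notes", [0.7, 0.0, 0.6]),
]

# Pretend embedding of the text query "person running" (an assumption).
QUERY = [0.2, 0.95, 0.1]

results = search(QUERY, SEGMENTS, top_k=1)
print(results)
```

Under these toy vectors the query lands closest to the "suspect runs down alley" segment. Classification can be framed the same way, by comparing a segment against one embedding per class label and picking the closest.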


💡APIs

APIs, or Application Programming Interfaces, are software intermediaries that enable different applications to communicate and share data with each other. Twelve Labs offers its video understanding capabilities to developers and enterprises through APIs, allowing them to integrate advanced video analysis and processing functionalities into their products and services. The co-founder mentions that the company has over 20,000 developers actively using their search API.

💡Digital Evidence Management

Digital evidence management refers to the process of handling, organizing, and analyzing digital evidence, such as video footage from police body cameras or surveillance systems. The co-founder highlights that Twelve Labs' technology can significantly improve digital evidence management for law enforcement agencies by enabling faster search and retrieval of relevant video evidence, as well as automating report writing based on video content analysis.

💡Machine Learning Pipeline

A machine learning pipeline is a series of data processing and modeling steps that are followed to train and deploy machine learning models. The co-founder emphasizes that Twelve Labs' machine learning pipeline is designed with a video-first ethos, meaning that their systems are optimized from the ground up to handle the unique challenges of processing and understanding video data, such as handling large volumes of data and long video durations.

💡Multimodal AI

Multimodal AI refers to artificial intelligence systems that can process and integrate information from multiple modalities, such as text, audio, images, and videos. The co-founder mentions that Twelve Labs' participation in a video understanding competition drew the attention of investors with a thesis around multimodal AI, indicating the company's alignment with this emerging field of AI that combines and leverages different data modalities.


💡Moat

In the context of the video, 'moat' refers to a competitive advantage or barrier that protects a company's business from being easily replicated or disrupted by competitors or technological advancements. The co-founder advises that companies should build their own moat by leveraging unique data and fine-tuning smaller models, rather than relying solely on foundation models or APIs provided by larger companies, which could become obsolete or disrupt their business model.


💡Fine-tuning

Fine-tuning is a process in machine learning where a pre-trained model is further trained on a specific task or dataset to improve its performance on that particular domain. The co-founder suggests that companies should consider fine-tuning smaller models using their proprietary data to enhance their products and services, rather than solely depending on large foundation models or APIs provided by other companies.

💡Ambitious Goal

An ambitious goal refers to a bold and challenging objective that pushes the boundaries of what is currently possible. The co-founder emphasizes the importance of setting an incredibly ambitious goal for founders, as it provides the fuel and determination to tackle incredibly hard problems. Having an ambitious goal serves as a North Star, driving continuous progress and ensuring that the company's efforts remain impactful, even if the ultimate goal is never fully reached.


Nvidia's venture team reached out to Twelve Labs due to their work in computer vision and video understanding, which was a perfect match with Nvidia's interests.

Twelve Labs is an AI research and product company building video foundation models for developers and enterprises to understand videos like humans.

The founding story of Twelve Labs involved the co-founders working on the company while still in the Korean Cyber Command, meeting at a bagel shop on weekends with their laptops.

Twelve Labs took a 'video-first' ethos, not reframing the problem into language or image understanding, but innovating technologies specifically for video understanding.

Participating in the International Conference on Computer Vision competition and winning helped Twelve Labs gain exposure and attract customers and investors.

Twelve Labs' foundation model aims to map human language to video content, enabling emergent capabilities like search, classification, and summarization.

Twelve Labs soft-launched their search API in June 2023; over 20,000 developers now actively use it, and it has crossed a couple million monthly API calls.

Explaining technical products and their impact to non-technical audiences is incredibly important for widespread adoption.

Building a moat by gathering unique data to fine-tune smaller models is crucial, as relying too much on foundation models can be dangerous for a business.

Having an ambitious goal and determination to solve incredibly hard problems is crucial for founders, as it fuels their journey and leads to greater impact.



Aiden and I had a chance to meet with Jensen.


We had, I think, 5 to 10 minutes to talk about Twelve Labs.


And I think Jensen, it seems like he always has a special place in his heart


for computer vision and video understanding, since that was one of the first


use cases that Nvidia chips powered.


And then Nvidia's venture team reached out to us.


We were talking about the future that we're drawing, and I think the venture


team also had an idea of what Twelve Labs is doing and what Nvidia wants.


Vision and video understanding was just like a perfect match.


Hi yo, my name is Jae.


I'm one of the co-founders and CEO of Twelve Labs. Twelve Labs is an AI


research and product company based here in San Francisco and Seoul.


We're building video foundation models for developers and enterprises


building video-centric products.


We basically build humongous AI models that can understand videos like humans,


and we serve it to developers via APIs that are looking into building really


powerful semantic search or classification or summarization into their products.


And we started the company about two and a half years ago with five people.


And right now we are a little over 40 across Seoul and San Francisco,


with $30 million raised in seed funding from companies like Index Ventures, Radical


Ventures, and with recent partnership with Nvidia, Intel and Samsung.


So foundation model.


It's basically this really large AI model that can do many things at once.


It's at the foundation of providing intelligence to software.


So the idea was, hey, the problem that we're solving is massive.


80% of the world's data is in video, and there's no adequate solution


technology out there for developers and enterprises to make sense of it all.


So that is the market that we're tackling.


We want to index all of that, like, 80% of the world's data.


So US police departments own terabytes and petabytes worth


of police body cam footage, and we call it digital evidence management.


And police officers spend a lot of time looking for specific evidence


within the content that they captured for auditing purposes or for writing reports.


So if you think about it, you know, their main job is to be out on the streets


helping the citizens, protecting the security of this nation.


And, you know, they're spending too much time searching for things.


So Twelve Labs' search and generate APIs can help with searching for digital


evidence incredibly fast, and police report writing


is also cut down by more than 40%


in terms of time spent. These are some mission-critical use cases


that Twelve Labs' products are being used for.


So the founding story is wild, man, because, you know, we didn't all join


the Korean Cyber Command at the same time.


So like, okay, we decided we're gonna start this company,


but then SJ is like leaving next year and then I'm leaving the year after,


and then Aiden's like six months after.


So how do we do it? Right.


So it was genuinely very scary. So we had our ideas.


Okay SJ when you get discharged, you bring all of our laptops


and you visit us every weekend.


You take us out and then we'll go to a bagel shop and we'll work.


I clearly remember talking to Aiden after SJ had left.


I remember SJ was discharged on Thursday and he came back to the military base


Saturday of that week with our laptops and in front of our military base


there was a bagel shop called La Bagel, and that was like our office, bring all of


our laptops and we would do our research.


So we did that for like a good year before everyone was out.


Some people say ignorance is bliss.


And I think we were just like really naive and just really excited


about building this company.


I guess not knowing what was ahead allowed us to do what we did.


You have to understand how are we creating a stronger product by leveraging AI?


Are you just showing your investors that, oh, like we've implemented GPT into our


product, or is it actually creating value?


And are you capturing value at a level that OpenAI or other companies


can't really capture?


So I think the secret sauce of Twelve Labs is just going head on with the problem.


I think a lot of companies might have failed because they tried reframing


the video understanding problem into language understanding or image


understanding or speech understanding.




Makes sense because that's where we've seen most improvement with large language


models and with amazing speech-to-text models.


So it's easy to think that, oh, video is incredibly hard.


How can I reframe that problem into something that's already solved?


And sometimes it works, but for some really important tasks,


that approach does not work.


For Twelve Labs, we've always had a video-first ethos, so when we created everything from our machine


learning pipeline and systems to our models, we always had videos in mind,


and our system should be able to handle petabytes worth of data as well


as, like, really long videos.


Videos are usually, you know, not 10 or 15 seconds long.


It's like two hours, three hours, four hours long.


So we had that video-first ethos like from the get go, and we had to build


a bunch of new technologies to support it.


So not taking the shortcut and going head on with the problem and not reframing it,


and really innovating is probably the secret sauce that we have.


The important thing is, if you're building something really impactful and you


think that it's going to significantly change the industry that you're in,


there will always be someone that has a very similar thesis.


It's just a matter of how do you get yourself out there?


How do you let the people know that you exist?


And for us, that was the competition.


Figuring out what was going to be impactful that we can do given our current


resources that will put us on the map, or at least, you know, let the world know


that what we're doing is relevant.


So our tactic here was, okay, we're going to talk to a bunch of customers.


And there were early believers in Twelve Labs who took our APIs and built awesome things


with us, but we needed more exposure.


Basically, we're already getting a lot of questions


about how is Google better than you?


Or are you actually better than Google? Or are you actually better than Microsoft?


So as a team, we decided to participate in ICCV.


It's the International Conference on Computer Vision.


They were putting on this, like, awesome competition for video understanding.


So basically a competition dedicated to evaluating AI models'


kind of ability to understand videos.


Back then we talked to Aiden and hey Aiden, I think we should participate


and see what we can do.


We have nothing to lose and only to gain.


The team was extremely supportive of Aiden kind of spearheading


that effort with the team, with a limited number of team members.


I think we only had like three team members back then.


All I could do to support was, you know, there were some ideas and


directional kind of feedback that I gave to Aiden, but we needed compute


and we needed determination to put some serious cash behind it. And back


then, for Twelve Labs, like $200,000 in compute was a lot of money for us.




So and just thinking that, okay, we're going to blow through $200,000


in ten days in compute was really scary.


But the team was able to use that capital, that precious capital,


and build something incredible that helped us win the competition.


And serious customers, investors that are really serious about video


and solving these video problems had their eyes on that competition, right?


And that's how we were able to get a lot of inbounds from not only customers,


but also investors that had a thesis around multimodal AI. So that really took


us off. Fast iterations of building out our initial set of APIs so that our


customers can test it, so that they can build trust with the Twelve Labs technology.


The model that Twelve Labs is building,


so the task that it's optimizing for is, it's basically trying


to map human language into whatever that's happening in video content.




So if you can map precise human language to whatever that's happening


within video content, that gives you these emergent capabilities, like being able to


search for things really well or being able to classify things or summarize.


Right. So that is what the model is doing.


And in terms of data, we work with amazing data partners in terms of like


providing us with labeled data or helping us license the data that we need and do it


in a very copyright friendly manner.




And have the model watch hundreds of millions of video-text pairs


and try to learn the ability to map precise human language to video content.


Twelve Labs soft-launched our search API in June of 2023.


We currently have a little over 20,000 developers


that are actively using our search API.


We've crossed a couple million monthly API calls.


The company is also really excited about enterprise customers adopting


our technology.


So we have the largest creators of the world adopting Twelve Labs as well as media,


entertainment and large sports organizations and law enforcement


organizations that are leveraging this technology. We're growing pretty rapidly.


We're adding a couple million monthly API calls


quarter after quarter.


So hopefully with the new model releases, we hit 100 million API calls monthly soon.


If you're highly technical and you're building a deeply technical product,


you're probably working on this deeply technical product to change the world.


And the world is populated by not only technical people or just the


experts that understand your technology.


There are technical folks that might be a fan of your work, but then there is


99% of the world that needs to understand.


So being patient, being able to explain what you do and why it's


impactful for different audiences is incredibly important.


I'm in AI space, so I find AI fascinating, and I think it's going to present


this amazing opportunity for not only like tech, but also humanity.


So I think all products, whether you're in creator economy space or even like


blockchain or some very traditional retail brick and mortar, I think it's


all going to be impacted by AI. What's really important here is being


able to learn more about the technology.


I think once you learn more about the technology and are able to kind of discern


what the trend is going to be, I think that allows you to build your own moat.


That's kind of far away from the technology advancement curve, I would say.


So, companies that are building foundation models, the idea of it is you keep making


these models stronger and stronger so that it is able to solve,


you know, more complex problems.


So let's say you're in copywriting, and what you've done is maybe building


like a UI on top of OpenAI's GPT model.


It's not that OpenAI probably wants to, like, affect your business, but it's part


of their technology improvement journey.


If the GPT models get better and better, it's probably going to get


much better at copywriting.


And if what you've built is just kind of like a wrapper around it, it's going


to get affected regardless of their intention of affecting you or not.


Thinking about moat is incredibly important.


What is the unique data that you can gather to fine tune smaller models


to make your business better is incredibly important, and relying too much on


foundation models can be dangerous.


So the thing is, given the proprietary data that you have, outsource


some of the really hard, intelligence-requiring tasks to foundation models,


but building your entire business on top of OpenAI's or other companies'


APIs could affect you, right?


So I think that's probably the best advice I can give.


Having an ambitious goal, and the determination to solve incredibly


hard problems is crucial for every founder, because there will come a time


where you're going to have to settle.


And if your goal was not big to begin with, then I think you will


end up in a place where what you're doing isn't very impactful.


So setting an incredibly ambitious goal that you probably can't even reach


gives you the fuel to go forward.


And a North Star that you can never reach is crucial.


And before you know it, you will always feel like you're far away from that goal.


But in doing so, you would have achieved a lot more.


Related Tags
AI Startup, Video Understanding, Foundation Models, Computer Vision, Entrepreneurship, Innovation, Deep Learning, San Francisco, Seoul, Technology