Building the World's Largest RAG for Knowledge Management @ CVS Health

Databricks

23 Jul 2024 19:06

Summary

TLDR: In this talk, Eric Whitman, Director of Machine Learning at CVS Health, discusses the challenges of building an enterprise-scale retrieval-augmented generation (RAG) system. He emphasizes the need for semantic search and natural language processing to improve knowledge retrieval within the company. Highlighting CVS Health’s vast, dynamic knowledge sources, Eric outlines the process of ingesting, structuring, and optimizing data for efficient search. He also addresses the technical infrastructure needed for scaling, such as microservices and containerization. Ultimately, the talk stresses balancing cost, speed, and quality while considering team structure and governance, so that the system can be deployed successfully and sustained over time.

Takeaways

😀 **RAG System Overview**: Building a scalable Retrieval-Augmented Generation (RAG) system requires a well-designed architecture and a focus on unifying and simplifying search for large organizations with complex knowledge sources.
😀 **Challenges with Large Scale**: Enterprises like CVS Health face challenges such as handling millions of documents, dynamic data sources, and access control for sensitive information, which complicate traditional RAG architectures.
😀 **Simplifying Search at CVS**: The goal is to reduce manual, time-consuming search processes by integrating semantic search and natural language processing to create a unified search experience across diverse data sources.
😀 **MVP Focus**: For an MVP, CVS focuses on three key knowledge sources: technical documentation, IT policy, and general policy documents, each of which has unique integration and connectivity needs.
😀 **Custom Data Pipelines**: CVS builds custom data pipelines for each source, normalizing, chunking, and vectorizing documents before storing them in a vector database to facilitate more efficient searches.
😀 **Metadata is Essential**: Proper metadata handling enables filtering, access control, tracking document life cycles, and managing related content across various knowledge sources, enhancing the overall system’s effectiveness.
😀 **Change Data Capture (CDC)**: To keep documents up to date, CVS employs CDC mechanisms that track document creation, modification, and deletion, so the vector store mirrors changes in the source systems.
😀 **Scaling Challenges**: The system needs to balance cost, speed, and quality while ensuring that large data volumes don’t overwhelm the infrastructure. Microservices are used to maintain flexibility and enable faster iteration.
😀 **Performance Evaluation**: Continuous performance evaluation is critical. CVS uses manual, automated, and LLM-based tracking to monitor relevance and ensure the system improves over time without breaking under the load of new data.
😀 **Building with Flexibility**: Developing systems with a platform mindset—modular components and microservices—allows for scaling and testing new models or technologies over time without disrupting the entire infrastructure.
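
The pipeline and CDC takeaways above (normalize, chunk, attach metadata, then mirror creates, updates, and deletes) can be illustrated with a minimal Python sketch. Everything here is an assumption made for illustration and not the system described in the talk: the `InMemoryStore` class, the event dictionary shape (`op`, `doc_id`, `text`, `metadata`), and the word-window chunker are all hypothetical, and the embedding step is left out.

```python
from dataclasses import dataclass


@dataclass
class Chunk:
    doc_id: str
    text: str
    metadata: dict


def normalize(text: str) -> str:
    """Collapse runs of whitespace so every source yields comparable text."""
    return " ".join(text.split())


def chunk_words(text: str, size: int = 200, overlap: int = 20) -> list[str]:
    """Split text into windows of `size` words that share `overlap` words."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must satisfy 0 <= overlap < size")
    words = text.split()
    chunks, start = [], 0
    while start < len(words):
        chunks.append(" ".join(words[start:start + size]))
        if start + size >= len(words):
            break  # the last window already reached the end of the text
        start += size - overlap
    return chunks


class InMemoryStore:
    """Hypothetical stand-in for a vector database, keyed by document id."""

    def __init__(self, size: int = 200, overlap: int = 20):
        self.size = size
        self.overlap = overlap
        self.chunks_by_doc: dict[str, list[Chunk]] = {}

    def apply(self, event: dict) -> None:
        """Apply one change event (assumed shape) so the store mirrors the source."""
        op, doc_id = event["op"], event["doc_id"]
        if op == "delete":
            # Drop every chunk of a document that no longer exists upstream.
            self.chunks_by_doc.pop(doc_id, None)
        elif op in ("create", "update"):
            # Replacing the whole list removes stale chunks of an edited document.
            pieces = chunk_words(normalize(event["text"]), self.size, self.overlap)
            # A real pipeline would also embed each piece here (omitted).
            self.chunks_by_doc[doc_id] = [
                Chunk(doc_id, piece, {**event["metadata"], "chunk_index": i})
                for i, piece in enumerate(pieces)
            ]
        else:
            raise ValueError(f"unknown operation: {op}")
```

Re-chunking the whole document on every update is the simplest way to stay consistent with the source. A larger deployment might instead diff chunks to avoid re-embedding unchanged text, which is one place the cost and speed trade-off shows up.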
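
The metadata and evaluation takeaways can be sketched in a similar way: filter candidates by access metadata before ranking, then score retrieval against a small labeled query set. This is only a toy under stated assumptions. The `allowed_groups` field, the bag-of-words `embed` function (a stand-in for a real embedding model), and the hit-rate metric are all hypothetical choices, not the method or tooling used at CVS Health.

```python
import math
from collections import Counter


def embed(text: str) -> Counter:
    """Toy bag-of-words vector; a real system would call an embedding model."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def search(index: list[dict], query: str, user_groups: set, k: int = 3) -> list[str]:
    """Return up to k doc ids, ranking only documents the user may see."""
    q = embed(query)
    # Access control is applied before ranking, using per-document metadata.
    visible = [item for item in index if item["allowed_groups"] & user_groups]
    ranked = sorted(visible, key=lambda item: cosine(q, embed(item["text"])), reverse=True)
    return [item["doc_id"] for item in ranked[:k]]


def hit_rate_at_k(index: list[dict], labeled_queries: list[dict], k: int = 3) -> float:
    """Fraction of labeled queries whose expected document appears in the top k."""
    hits = sum(
        q["expected_doc_id"] in search(index, q["query"], q["user_groups"], k)
        for q in labeled_queries
    )
    return hits / len(labeled_queries)


# Hypothetical sample data, one entry per MVP source type named above.
INDEX = [
    {"doc_id": "it-policy-1", "text": "password rotation policy for employee laptops",
     "allowed_groups": {"employees"}},
    {"doc_id": "tech-doc-1", "text": "how to deploy a container to the internal cluster",
     "allowed_groups": {"engineering"}},
    {"doc_id": "policy-1", "text": "travel expense reimbursement policy",
     "allowed_groups": {"employees", "engineering"}},
]

LABELED = [
    {"query": "password rotation", "user_groups": {"employees"},
     "expected_doc_id": "it-policy-1"},
    {"query": "deploy container", "user_groups": {"engineering"},
     "expected_doc_id": "tech-doc-1"},
]
```

Running `hit_rate_at_k(INDEX, LABELED, k=1)` after each change to chunking or the embedding model gives a simple automated regression signal. Manual review and LLM-based grading, as mentioned in the takeaways, would complement it rather than be replaced by it.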