Apache Kafka 101: Schema Registry (2023)
Summary
TL;DR: In this video, Tim Berglund from Confluent introduces the Confluent Schema Registry, a tool designed to manage the evolution of message formats in Kafka topics. As new applications and consumers emerge, and business requirements change, the Schema Registry ensures compatibility and agreement on message schemas. It operates as a standalone server, maintaining a database of schemas and providing APIs for producers and consumers to check compatibility. The tool supports the JSON Schema, Avro, and Protobuf formats, facilitating schema evolution with minimal runtime failures and promoting collaboration through Interface Description Languages.
Takeaways
- 📚 The Confluent Schema Registry is a standalone server process that helps manage the evolution of message formats in Kafka topics.
- 🌐 It operates independently of Kafka brokers, appearing as a producer or consumer within the Kafka cluster.
- 🗂️ The Schema Registry maintains a database of all schemas written into topics, which is stored in an internal Kafka topic and cached for quick access.
- 🔄 It supports schema evolution, ensuring compatibility between new and existing message formats as business requirements change.
- 🛡️ The Registry enforces compatibility rules, preventing the production of messages that would violate these rules and cause runtime failures.
- 🔧 Producers and consumers interact with the Schema Registry via a REST API to check schema compatibility before producing or consuming messages.
- 🚫 If a consumer encounters a message with an incompatible schema, the Registry instructs it not to consume the message, avoiding potential errors.
- 💾 Schemas are assigned immutable IDs, allowing for caching and reducing the need for repeated REST calls, which improves performance.
- 📈 The Schema Registry currently supports three serialization formats: JSON Schema, Avro, and Protobuf, catering to different serialization needs.
- 🛠️ It provides tooling and an Interface Description Language (IDL) for developers to define and manage schema changes in a source-controllable manner.
- 🔄 The process of schema change collaboration is streamlined, often involving a pull request mechanism, ensuring all stakeholders are aware and can discuss changes.
- 📈 For non-trivial systems, using the Schema Registry is considered essential to manage schema evolution and ensure system reliability.
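The produce-time gate described in the takeaways above can be modeled as a small in-memory sketch. This is a toy stand-in, not the real Confluent API: the real registry is a standalone server reached over REST, and its compatibility rules operate on Avro/JSON Schema/Protobuf semantics rather than simple dict comparison.

```python
# Toy model of the Schema Registry's produce-time compatibility gate.
# Schemas here are plain dicts mapping field name -> field spec; all
# names and rules are illustrative, not the real registry's behavior.

class ToyRegistry:
    def __init__(self, compatible):
        self.versions = {}            # subject -> list of schema versions
        self.compatible = compatible  # rule: (old, new) -> bool

    def register(self, subject, schema):
        history = self.versions.setdefault(subject, [])
        if history and history[-1] != schema:
            # A changed schema must pass the topic's compatibility rule,
            # otherwise the "produce" fails in a detectable way.
            if not self.compatible(history[-1], schema):
                raise ValueError("schema violates compatibility rules")
        if not history or history[-1] != schema:
            history.append(schema)
        return len(history)  # version number

# Simplified "backward"-style rule: any field added to the new schema
# must carry a default so readers can fill it in for old records.
def backward(old, new):
    added = set(new) - set(old)
    return all("default" in new[f] for f in added)
```

A producer re-sending the same schema is a no-op; a schema change either passes the rule and becomes a new version, or raises before any incompatible data is written.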
Q & A
What is Confluent Schema Registry?
-Confluent Schema Registry is a standalone server process that maintains a database of all schemas written into Kafka topics, ensuring compatibility and evolution of message formats.
How does the Schema Registry help with evolving message formats in Kafka?
-The Schema Registry allows producers and consumers to check compatibility of message schemas with previous versions, ensuring that changes adhere to defined compatibility rules and preventing runtime failures due to schema incompatibilities.
What is the role of the Schema Registry in Kafka's ecosystem?
-The Schema Registry acts as an application within the Kafka ecosystem, providing a REST API for producers and consumers to validate schema compatibility and maintain a database of schemas in an internal Kafka topic.
How does the Schema Registry handle schema evolution?
-It provides a mechanism for producers to submit new schemas for validation against compatibility rules and for consumers to reject messages with incompatible schemas, thus managing schema evolution and preventing data incompatibility issues.
What are the benefits of using the Schema Registry for producers?
-Producers can ensure that their messages adhere to the expected schema versions and compatibility rules, preventing the production of incompatible data and potential runtime errors.
How does the Schema Registry assist consumers in processing messages?
-Consumers can use the Schema Registry to verify that the message schema they are about to consume is compatible with the version they expect, avoiding the consumption of incompatible data.
What is the significance of caching in the Schema Registry's operation?
-Caching reduces the need for repeated REST API calls, improving performance by allowing producers and consumers to locally store and quickly access schema information after the initial validation.
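The caching behavior described here can be sketched as a thin client-side wrapper. Names are hypothetical; the real Confluent serializers do this internally, with `fetch_id` standing in for the REST round trip.

```python
# Sketch of the client-side schema cache: the registry lookup runs once
# per schema, after which the immutable schema ID is served locally.

class SchemaIdCache:
    def __init__(self, fetch_id):
        self._fetch_id = fetch_id  # callable: schema string -> registry ID
        self._cache = {}
        self.round_trips = 0       # counts simulated REST calls

    def id_for(self, schema):
        if schema not in self._cache:
            self.round_trips += 1
            self._cache[schema] = self._fetch_id(schema)
        return self._cache[schema]
```

Because schema IDs are immutable, the cache never needs invalidation: after the first lookup, producing or consuming with the same schema costs no further round trips, which is why the REST overhead is mostly a warm-up cost.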
What serialization formats does the Schema Registry currently support?
-The Schema Registry supports three serialization formats: JSON Schema, Avro, and Protobuf, catering to different serialization needs and preferences.
How does the Schema Registry facilitate collaboration around schema changes?
-By using an Interface Description Language (IDL) like Avro, the Schema Registry enables a centralized approach to schema definition and change management, allowing teams to collaborate through version control systems like pull requests.
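For example, a minimal Avro `.avsc` file for an order record might look like the following. The namespace and field names are illustrative, not from the video; note the `default` on the added `status` field, which is what keeps the change backward compatible.

```json
{
  "type": "record",
  "name": "Order",
  "namespace": "com.example.orders",
  "fields": [
    {"name": "id", "type": "long"},
    {"name": "customerId", "type": "string"},
    {"name": "status", "type": "string", "default": "NEW"}
  ]
}
```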
What is the importance of the Schema Registry in non-trivial systems?
-In complex systems, the Schema Registry is essential for managing schema evolution, ensuring compatibility across diverse applications and teams, and preventing data serialization issues.
How does the Schema Registry help in detecting breaking changes during the development process?
-It provides tooling that allows developers to check for breaking changes at build time, before deployment, ensuring that schema changes do not introduce runtime incompatibilities.
Outlines
📚 Introduction to Confluent Schema Registry
Tim Berglund introduces the Confluent Schema Registry, a standalone server process that complements the Kafka ecosystem by maintaining a database of schemas for Kafka topics. It ensures compatibility and evolution of message formats as new consumers emerge and business requirements change. The Schema Registry operates as an application within the Kafka cluster, with its database persisted in an internal Kafka topic and cached for low latency access. It also provides a REST API for producers and consumers to check schema compatibility before message production or consumption, thus preventing runtime failures due to schema incompatibilities.
🛠️ Tooling and Collaboration for Schema Evolution
The second paragraph delves into the practical aspects of using the Confluent Schema Registry for managing schema evolution. It discusses the importance of having a standard and automated way to learn about schema changes, especially in systems where consumers may be developed by different teams or unknown entities. The Schema Registry supports collaboration around schema changes by using an Interface Description Language (IDL), such as Avro's .avsc files, which can be version-controlled and collaboratively edited through pull requests. This process not only helps in managing schema evolutions but also integrates with build-time checks to prevent breaking changes before deployment. The paragraph concludes with a strong recommendation for using Schema Registry in any non-trivial system to facilitate schema management and evolution.
Keywords
💡Confluent Schema Registry
💡Kafka
💡Producer
💡Consumer
💡Schema Evolution
💡REST Endpoint
💡Compatibility Rules
💡Serialization Formats
💡Interface Description Language (IDL)
💡High Availability Configuration
💡Runtime Failures
Highlights
Confluent Schema Registry is introduced to manage message format evolution in Kafka.
New consumers of existing Kafka topics may emerge from different teams or unknown sources.
Message formats must be understood by new consumers to ensure proper message consumption.
Business evolution necessitates changes in message schemas, such as adding new fields or modifying existing ones.
Schema Registry acts as a standalone server process external to Kafka brokers, maintaining a database of schemas.
The schema database is persisted in an internal Kafka topic and cached for low latency access.
Schema Registry can be configured for high availability to ensure continuous operation.
It provides an API for producers and consumers to check message schema compatibility.
Producers must call the Schema Registry REST endpoint to present the schema of new messages.
Compatibility rules are defined to allow or reject schema changes based on predefined criteria.
Schema Registry can prevent the production of incompatible messages, avoiding runtime failures.
Consumers are instructed not to consume messages with incompatible schemas.
Schema Registry does not fully automate schema evolution but significantly eases the management of schema changes.
Schemas are cached locally in producers and consumers to minimize REST API round trips.
Schema Registry currently supports the JSON Schema, Avro, and Protobuf serialization formats.
Interface Description Languages (IDL) like Avro's avsc file facilitate schema definition and collaboration.
Tooling exists to convert IDL files into programming language objects, streamlining schema management.
Schema Registry promotes a standardized and automated approach to learning and managing schema changes.
The use of Schema Registry is considered essential in non-trivial systems for effective schema management.
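The producer-side REST interaction in the highlights above can be sketched as follows. The endpoint path and content type match Schema Registry's documented REST API (`POST /subjects/{subject}/versions`), but the base URL, subject name, and helper function are assumptions for illustration.

```python
import json
import urllib.request

REGISTRY_URL = "http://localhost:8081"  # assumed local registry address

def registration_request(subject, schema_str, base_url=REGISTRY_URL):
    """Build the HTTP request that registers a schema under a subject.

    Schema Registry's REST API expects the schema as a JSON-escaped
    string in the request body, posted to /subjects/{subject}/versions.
    """
    body = json.dumps({"schema": schema_str}).encode()
    return urllib.request.Request(
        url=f"{base_url}/subjects/{subject}/versions",
        data=body,
        headers={"Content-Type": "application/vnd.schemaregistry.v1+json"},
        method="POST",
    )

# Actually sending the request requires a running registry:
# with urllib.request.urlopen(registration_request("orders-value", '"string"')) as r:
#     print(json.load(r))  # the registry responds with the schema's ID
```

In practice the Confluent client serializers make this call for you at produce time; the sketch only shows the shape of the request they send.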
Transcripts
- Hey, Tim Berglund with Confluent
to talk to you about the Confluent Schema Registry.
(upbeat music)
Now once applications are busily producing messages to Kafka
and consuming messages from it,
two things are gonna happen.
First, new consumers of existing topics
are going to emerge.
These are brand new applications.
They might be written by the same team
that wrote the original producer of those messages,
maybe by another team, maybe by people you don't even know,
it just depends on how your organization works.
That's a perfectly normal thing for new consumers to emerge
written by new people.
And they're going to need to understand
the format of the messages in the topic.
Second, the format of those messages
is going to evolve as the business evolves.
For example, order objects,
that's an object that represents an order,
customer places an order,
and here's an object representing that order.
They might gain a new status field,
or usernames might be split into first
and last from full name, or the reverse.
And so on, things change.
There is no such thing as getting it all right up front,
the world changes.
And so the schema of our stuff
has to change with it.
The schema of our domain objects
is a constantly moving target.
And we have to have a way of agreeing on that schema,
the schema of those messages
in whatever topic we're thinking about at the moment.
The Confluent Schema Registry
exists to solve precisely this problem.
So, Schema Registry is a standalone server process
that runs on a machine external to the Kafka brokers.
So it looks like an application to the Kafka cluster,
it looks like a producer or a consumer.
And there's a little bit more to it than that,
but at minimum it is that.
Its job is to maintain a database of all of the schemas
that have been written into topics in the cluster
for which it is responsible.
Now that database is persisted in an internal Kafka topic,
this should come as no surprise to you,
and it's cached in the Schema Registry
for low latency access.
This is very typical, by the way,
for an element of the Kafka ecosystem
to be built out of Kafka,
you know, we needed a distributed fault tolerant data store,
well, here's Kafka presenting itself.
So we use it, we use a topic to store those schemas.
A Schema Registry can be run
in a redundant high availability configuration
if you like.
So it remains up if one instance fails.
Now, Schema Registry is also an API
that allows producers and consumers to predict
whether the message they're about to produce or consume
is compatible with previous versions
or compatible with the version that they're expecting.
When a producer is configured to use the Schema Registry,
it calls at produce time,
an API at the Schema Registry REST endpoint.
So Schema Registry is up there,
maintaining this database,
also has a REST interface.
Producer calls that REST endpoint
and presents the schema of the new message.
If it's the same as the last message produced,
then the produce may succeed.
If it's different from the last message,
but matches the compatibility rules defined for the topic,
the produce may still succeed.
If it's different in a way
that will violate the compatibility rules,
the produce will fail in a way
that the application code can detect.
There'll be a failure condition it can detect
and, you know,
dutifully of course produce that exception stacktrace
to the browser, no, don't do that,
you would responsibly handle that condition.
But you are made aware of that condition,
rather than producing data
that is gonna be incompatible down the line.
Likewise, on the consumer side,
if a consumer reads a message
that has an incompatible schema from the version
that the consumer code expects,
Schema Registry will tell it not to consume the message.
It doesn't fully automate the problem of schema evolution,
and frankly, nothing does.
That's always a challenge in any system
that serializes anything, regardless of the tooling.
But it does make a difficult problem much easier
by keeping the runtime failures
from happening when possible.
Also, if you're worried about all these REST round trips,
and that sounds really slow.
Of course, all this stuff gets cached in the producer
and the consumer when you're using Schema Registry.
So these schemas have immutable IDs,
and once I've checked once,
you know, that's gonna be cached locally,
and I don't need to keep doing those round trips.
That's usually just a warm up thing
in terms of performance.
Schema Registry currently supports
three serialization formats:
JSON Schema, Avro and Protobuf.
And depending on the format you may have available to you
an IDL, an Interface Description Language
where you can describe in a source controllable text file,
the schema of the objects in question.
And in some cases, there's also tooling
that will then take that IDL,
for example, in Avro you can write an avsc file.
That's this nice simple JSON format
where you're describing the schema of the object
and say if you're using Java,
there's a Maven and a Gradle plugin
where you can turn that into a Java object.
So then not only do you have
the ability to eliminate certain classes of runtime failures
due to schema evolution,
but you've got now a tooling pathway
that drives collaboration around schema change
to a single file.
So if you want to change what an order is,
and add a new status field to an order,
well, technically what that means is,
you change the IDL, you edit avsc file.
And the process that you now have
for collaborating around that schema change,
well, that's the same process you have
for collaborating around any schema change.
For most of us, that's a pull request, right?
You do that thing in a branch and you submit a PR
and people talk about it,
and then it gets done and everybody has that change,
and the tooling updates the object
and the Schema Registry at runtime
tells you whether that's gonna work,
there's even a way to do it at build time,
before you deploy the code to find out whether
this is gonna be a breaking change or not,
if it's not obvious, as in the case of complex domain objects.
So, all kinds of very, very helpful things.
I would go so far, and this is a slightly opinionated statement,
to say that in any non-trivial system,
using Schema Registry is non-negotiable.
Again, there are going to be people writing consumers
at some point, that maybe you haven't talked to them,
you haven't had a chance to fully mind meld with them
on what's going on with the schema in that topic.
They need a standard and automated way of learning about it.
Also, no matter how good of a job you do up front
defining schemas, the world out there changes,
your schemas are gonna change.
You need a way of managing those evolutions internally,
and Confluent Schema Registry helps you with these things.
(upbeat music)