Apache Kafka 101: Schema Registry (2023)

Confluent
23 Nov 2020 · 06:45

Summary

TL;DR: In this video, Tim Berglund from Confluent introduces the Confluent Schema Registry, a tool designed to manage the evolution of message formats in Kafka topics. As new applications and consumers emerge and business requirements change, the Schema Registry ensures compatibility and agreement on message schemas. It operates as a standalone server, maintaining a database of schemas and providing APIs for producers and consumers to check compatibility. The tool supports the JSON Schema, Avro, and Protobuf formats, facilitating schema evolution with minimal runtime failures and promoting collaboration through Interface Description Languages.

Takeaways

  • 📚 The Confluent Schema Registry is a standalone server process that helps manage the evolution of message formats in Kafka topics.
  • 🌐 It operates independently of Kafka brokers, appearing as a producer or consumer within the Kafka cluster.
  • đŸ—‚ïž The Schema Registry maintains a database of all schemas written into topics, which is stored in an internal Kafka topic and cached for quick access.
  • 🔄 It supports schema evolution, ensuring compatibility between new and existing message formats as business requirements change.
  • đŸ›Ąïž The Registry enforces compatibility rules, preventing the production of messages that would violate these rules and cause runtime failures.
  • 🔧 Producers and consumers interact with the Schema Registry via a REST API to check schema compatibility before producing or consuming messages (a configuration sketch follows this list).
  • 🚫 If a consumer encounters a message with an incompatible schema, the Registry instructs it not to consume the message, avoiding potential errors.
  • 💾 Schemas are assigned immutable IDs, allowing for caching and reducing the need for repeated REST calls, which improves performance.
  • 📈 The Schema Registry currently supports three serialization formats: JSON Schema, Avro, and Protobuf, catering to different serialization needs.
  • đŸ› ïž It provides tooling and an Interface Description Language (IDL) for developers to define and manage schema changes in a source-controllable manner.
  • 🔄 The process of schema change collaboration is streamlined, often involving a pull request mechanism, ensuring all stakeholders are aware and can discuss changes.
  • 📈 For non-trivial systems, using the Schema Registry is considered essential to manage schema evolution and ensure system reliability.
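
As referenced in the takeaway on producer/consumer configuration, here is a minimal sketch of the producer side in Java, with Confluent's Avro serializer pointed at the Schema Registry. The broker address, registry URL, topic name "orders", and the order fields are illustrative assumptions, not details from the video.

```java
import java.util.Properties;
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericRecord;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;

public class OrderProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG,
                  "org.apache.kafka.common.serialization.StringSerializer");
        // Confluent's Avro serializer talks to Schema Registry on our behalf:
        // it registers/looks up the schema and embeds its immutable ID in each message.
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG,
                  "io.confluent.kafka.serializers.KafkaAvroSerializer");
        props.put("schema.registry.url", "http://localhost:8081");

        // A minimal Avro schema for an "order" value (illustrative only).
        Schema schema = new Schema.Parser().parse(
            "{\"type\":\"record\",\"name\":\"Order\",\"fields\":["
          + "{\"name\":\"id\",\"type\":\"string\"},"
          + "{\"name\":\"amount\",\"type\":\"double\"}]}");

        GenericRecord order = new GenericData.Record(schema);
        order.put("id", "o-1001");
        order.put("amount", 42.50);

        try (KafkaProducer<String, GenericRecord> producer = new KafkaProducer<>(props)) {
            // The serializer checks/registers the schema before the record is sent,
            // so an incompatible schema surfaces as an error rather than bad data in the topic.
            producer.send(new ProducerRecord<>("orders", order.get("id").toString(), order));
            producer.flush();
        }
    }
}
```

The registry round trip happens inside the serializer, not in application code, and the result is cached after the first call.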

Q & A

  • What is Confluent Schema Registry?

    -Confluent Schema Registry is a standalone server process that maintains a database of all schemas written into Kafka topics, ensuring compatibility and evolution of message formats.

  • How does the Schema Registry help with evolving message formats in Kafka?

    -The Schema Registry allows producers and consumers to check compatibility of message schemas with previous versions, ensuring that changes adhere to defined compatibility rules and preventing runtime failures due to schema incompatibilities.

  • What is the role of the Schema Registry in Kafka's ecosystem?

    -The Schema Registry acts as an application within the Kafka ecosystem, providing a REST API for producers and consumers to validate schema compatibility and maintain a database of schemas in an internal Kafka topic.

  • How does the Schema Registry handle schema evolution?

    -It provides a mechanism for producers to submit new schemas for validation against compatibility rules and for consumers to reject messages with incompatible schemas, thus managing schema evolution and preventing data incompatibility issues.

  • What are the benefits of using the Schema Registry for producers?

    -Producers can ensure that their messages adhere to the expected schema versions and compatibility rules, preventing the production of incompatible data and potential runtime errors.

  • How does the Schema Registry assist consumers in processing messages?

    -Consumers can use the Schema Registry to verify that the message schema they are about to consume is compatible with the version they expect, avoiding the consumption of incompatible data.

  • What is the significance of caching in the Schema Registry's operation?

    -Caching reduces the need for repeated REST API calls, improving performance by allowing producers and consumers to locally store and quickly access schema information after the initial validation.

  • What serialization formats does the Schema Registry currently support?

    -The Schema Registry supports three serialization formats: JSON Schema, Avro, and Protobuf, catering to different serialization needs and preferences.

  • How does the Schema Registry facilitate collaboration around schema changes?

    -By using an Interface Description Language (IDL) like Avro, the Schema Registry enables a centralized approach to schema definition and change management, allowing teams to collaborate through version control systems like pull requests.

  • What is the importance of the Schema Registry in non-trivial systems?

    -In complex systems, the Schema Registry is essential for managing schema evolution, ensuring compatibility across diverse applications and teams, and preventing data serialization issues.

  • How does the Schema Registry help in detecting breaking changes during the development process?

    -It provides tooling that allows developers to check for breaking changes at build time, before deployment, ensuring that schema changes do not introduce runtime incompatibilities.

Outlines

00:00

📚 Introduction to Confluent Schema Registry

Tim Berglund introduces the Confluent Schema Registry, a standalone server process that complements the Kafka ecosystem by maintaining a database of schemas for Kafka topics. It ensures compatibility and evolution of message formats as new consumers emerge and business requirements change. The Schema Registry operates as an application within the Kafka cluster, with its database persisted in an internal Kafka topic and cached for low latency access. It also provides a REST API for producers and consumers to check schema compatibility before message production or consumption, thus preventing runtime failures due to schema incompatibilities.
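
To complement the producer sketch earlier, here is a hedged sketch of the consumer side: Confluent's Avro deserializer resolves the writer's schema from the registry (via the immutable ID embedded in each message) and returns a GenericRecord. The group ID, topic name, and registry URL are assumptions for illustration.

```java
import java.time.Duration;
import java.util.List;
import java.util.Properties;
import org.apache.avro.generic.GenericRecord;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class OrderConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "order-readers");
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG,
                  "org.apache.kafka.common.serialization.StringDeserializer");
        // The Avro deserializer reads the schema ID embedded in each message
        // and fetches (then caches) the matching schema from the registry.
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG,
                  "io.confluent.kafka.serializers.KafkaAvroDeserializer");
        props.put("schema.registry.url", "http://localhost:8081");

        try (KafkaConsumer<String, GenericRecord> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("orders"));
            ConsumerRecords<String, GenericRecord> records = consumer.poll(Duration.ofSeconds(1));
            for (ConsumerRecord<String, GenericRecord> record : records) {
                System.out.printf("order id=%s amount=%s%n",
                        record.value().get("id"), record.value().get("amount"));
            }
        }
    }
}
```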

05:01

đŸ› ïž Tooling and Collaboration for Schema Evolution

The second paragraph delves into the practical aspects of using the Confluent Schema Registry for managing schema evolution. It discusses the importance of having a standard and automated way to learn about schema changes, especially in systems where consumers may be developed by different teams or unknown entities. The Schema Registry supports collaboration around schema changes by using an Interface Description Language (IDL), such as Avro's .avsc files, which can be version-controlled and collaboratively edited through pull requests. This process not only helps in managing schema evolutions but also integrates with build-time checks to prevent breaking changes before deployment. The paragraph concludes with a strong recommendation for using Schema Registry in any non-trivial system to facilitate schema management and evolution.
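
As a concrete illustration of that workflow, here is what a small, hypothetical Order.avsc might look like after a status field has been added; giving the new field a default keeps the change backward compatible, which is the kind of rule the registry can enforce. The record name, namespace, and fields are assumptions rather than details from the video.

```json
{
  "type": "record",
  "name": "Order",
  "namespace": "com.example.orders",
  "fields": [
    { "name": "id",     "type": "string" },
    { "name": "amount", "type": "double" },
    { "name": "status", "type": "string", "default": "CREATED" }
  ]
}
```

A change like this lands as an edit to a single source-controlled file, which is what makes the pull-request review flow described above practical.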

Keywords

💡Confluent Schema Registry

The Confluent Schema Registry is a standalone server process that is integral to the Kafka ecosystem. It maintains a database of all schemas written into topics, ensuring compatibility and evolution of message formats as the business requirements change. It is crucial for managing schema changes and ensuring that producers and consumers can understand the message formats they are working with. In the script, it is introduced as a solution to the problem of evolving message formats and the emergence of new consumers.

💡Kafka

Kafka is a distributed streaming platform that enables the processing of real-time data. It is central to the script's discussion as the platform where messages are produced and consumed. The script mentions Kafka in the context of applications busily producing and consuming messages, emphasizing the importance of a schema registry in managing the evolution of message formats within Kafka topics.

💡Producer

In the context of Kafka, a producer is an application or service that publishes messages to a Kafka topic. The script discusses how new producers may emerge, potentially written by different teams or even unknown entities, necessitating a shared understanding of message formats through the Schema Registry.

💡Consumer

A consumer in Kafka is an application or service that subscribes to topics and processes the messages produced. The script highlights that new consumers will need to understand the format of the messages in the topics they consume, which is where the Schema Registry plays a crucial role in maintaining and sharing schema information.

💡Schema Evolution

Schema evolution refers to the process of changing the structure of data as business requirements evolve. The script explains that the format of messages will change over time, and the Schema Registry is essential in managing these changes to ensure that all producers and consumers are working with compatible and up-to-date schemas.

💡REST Endpoint

A REST endpoint is a URL through which producers and consumers can interact with the Schema Registry via API calls. The script mentions that when a producer is configured to use the Schema Registry, it calls the REST endpoint to check the schema compatibility before producing a message.
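
A rough sketch of calling that endpoint directly, using the registry's documented compatibility check (POST /compatibility/subjects/{subject}/versions/latest). The registry URL, the subject name orders-value, and the candidate schema are illustrative assumptions; in practice the configured serializer makes this kind of call for you.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class CompatibilityCheck {
    public static void main(String[] args) throws Exception {
        String schemaJson = "{\"type\":\"record\",\"name\":\"Order\",\"fields\":"
                          + "[{\"name\":\"id\",\"type\":\"string\"}]}";
        // The registry expects the candidate schema wrapped as {"schema": "<escaped schema>"}.
        String body = "{\"schema\": \"" + schemaJson.replace("\"", "\\\"") + "\"}";

        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("http://localhost:8081/compatibility/subjects/orders-value/versions/latest"))
                .header("Content-Type", "application/vnd.schemaregistry.v1+json")
                .POST(HttpRequest.BodyPublishers.ofString(body))
                .build();

        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());

        // The registry answers with {"is_compatible": true|false}.
        System.out.println(response.body());
    }
}
```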

💡Compatibility Rules

Compatibility rules define how schemas can evolve without breaking the system. The script explains that if a new schema is different but matches the defined compatibility rules for a topic, the produce operation may still succeed, ensuring smooth schema evolution.
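
Compatibility rules are configured per subject (or globally) on the registry itself; common levels include BACKWARD, FORWARD, FULL, their transitive variants, and NONE. Below is a short sketch that sets an assumed orders-value subject to BACKWARD through the registry's config endpoint; the URL and subject name are illustrative.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class SetCompatibility {
    public static void main(String[] args) throws Exception {
        // PUT /config/{subject} updates the compatibility level for one subject.
        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("http://localhost:8081/config/orders-value"))
                .header("Content-Type", "application/vnd.schemaregistry.v1+json")
                .PUT(HttpRequest.BodyPublishers.ofString("{\"compatibility\": \"BACKWARD\"}"))
                .build();

        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.statusCode() + " " + response.body());
    }
}
```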

💡Serialization Formats

Serialization formats are ways of converting data structures into a format that can be easily stored or transmitted. The script mentions that the Schema Registry currently supports three serialization formats: JSON Schema, Avro, and Protobuf, which are essential for defining and managing the schemas of messages in Kafka topics.

💡Interface Description Language (IDL)

An Interface Description Language is used to describe the schema of objects in a source controllable text file. The script discusses how, in the case of Avro, an IDL file (.avsc) can be used to define the schema, which can then be transformed into Java objects using Maven or Gradle plugins, facilitating schema management and collaboration.
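
Assuming a hypothetical Order class generated from Order.avsc by the Avro Maven or Gradle plugin, application code can then work with a typed object rather than a GenericRecord; the builder methods below mirror what Avro's code generator produces for the fields assumed in the earlier schema sketch.

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import com.example.orders.Order; // hypothetical class generated from Order.avsc

public class TypedOrderProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG,
                  "org.apache.kafka.common.serialization.StringSerializer");
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG,
                  "io.confluent.kafka.serializers.KafkaAvroSerializer");
        props.put("schema.registry.url", "http://localhost:8081");

        // Avro-generated classes expose a builder with one setter per schema field.
        Order order = Order.newBuilder()
                .setId("o-1002")
                .setAmount(19.99)
                .setStatus("CREATED")
                .build();

        try (KafkaProducer<String, Order> producer = new KafkaProducer<>(props)) {
            producer.send(new ProducerRecord<>("orders", order.getId().toString(), order));
        }
    }
}
```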

💡High Availability Configuration

High availability configuration ensures that a system remains operational even if one of its instances fails. The script mentions that the Schema Registry can be run in a redundant, high availability configuration to maintain service continuity.

💡Runtime Failures

Runtime failures are errors that occur during the execution of a program. The script explains that the Schema Registry helps to minimize runtime failures related to schema evolution by ensuring compatibility before messages are produced or consumed.
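
On the producer side, that failure typically surfaces as an exception thrown by Confluent's serializer rather than as incompatible data written to the topic. A hedged sketch of handling it (exact exception types can vary by client version):

```java
import java.util.concurrent.ExecutionException;
import org.apache.avro.generic.GenericRecord;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.errors.SerializationException;

public class SafeSend {
    // Sends one record and reports, rather than propagates, a schema rejection.
    static void sendOrder(KafkaProducer<String, GenericRecord> producer,
                          ProducerRecord<String, GenericRecord> record) {
        try {
            // send() serializes the value, which is when the registry check happens;
            // get() additionally waits for the broker acknowledgement.
            producer.send(record).get();
        } catch (SerializationException e) {
            // Typically wraps the registry's rejection of an incompatible schema.
            System.err.println("Schema rejected by Schema Registry: " + e.getMessage());
        } catch (InterruptedException | ExecutionException e) {
            System.err.println("Produce failed: " + e.getMessage());
        }
    }
}
```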

Highlights

Confluent Schema Registry is introduced to manage message format evolution in Kafka.

New consumers of existing Kafka topics may emerge from different teams or unknown sources.

Message formats must be understood by new consumers to ensure proper message consumption.

Business evolution necessitates changes in message schemas, such as adding new fields or modifying existing ones.

Schema Registry acts as a standalone server process external to Kafka brokers, maintaining a database of schemas.

The schema database is persisted in an internal Kafka topic and cached for low latency access.

Schema Registry can be configured for high availability to ensure continuous operation.

It provides an API for producers and consumers to check message schema compatibility.

Producers must call the Schema Registry REST endpoint to present the schema of new messages.

Compatibility rules are defined to allow or reject schema changes based on predefined criteria.

Schema Registry can prevent the production of incompatible messages, avoiding runtime failures.

Consumers are instructed not to consume messages with incompatible schemas.

Schema Registry does not fully automate schema evolution but significantly eases the management of schema changes.

Schemas are cached locally in producers and consumers to minimize REST API round trips.

Schema Registry currently supports the JSON Schema, Avro, and Protobuf serialization formats.

Interface Description Languages (IDLs) like Avro's .avsc files facilitate schema definition and collaboration.

Tooling exists to convert IDL files into programming language objects, streamlining schema management.

Schema Registry promotes a standardized and automated approach to learning and managing schema changes.

The use of Schema Registry is considered essential in non-trivial systems for effective schema management.

Transcripts

play00:00

- Hey, Tim Berglund with Confluent

play00:01

to talk to you about a Confluent Schema Registry.

play00:04

(upbeat music)

play00:09

Now once applications are busily producing messages to Kafka

play00:13

and consuming messages from it,

play00:15

two things are gonna happen.

play00:16

First, new consumers of existing topics

play00:20

are going to emerge.

play00:22

These are brand new applications.

play00:24

They might be written by the same team

play00:26

that wrote the original producer of those messages,

play00:29

maybe by another team, maybe by people you don't even know,

play00:32

that just depends on how your organization works.

play00:35

That's a perfectly normal thing for new consumers to emerge

play00:38

written by new people.

play00:40

And they're going to need to understand

play00:41

the format of the messages in the topic.

play00:44

Second, the format of those messages

play00:46

is going to evolve as the business evolves.

play00:49

For example, order objects,

play00:51

that's an object that represents an order,

play00:53

customer places an order,

play00:54

and here's an object representing that order.

play00:56

They might gain a new status field

play00:58

or usernames might be split into first

play01:01

and last from full name, or the reverse.

play01:05

And so on, things change.

play01:06

There is no such thing as getting it all right up front,

play01:09

the world changes.

play01:10

And so the schema of our stuff

play01:13

has to change with it.

play01:14

The schema of our domain objects

play01:16

is a constantly moving target.

play01:18

And we have to have a way of agreeing on that schema,

play01:21

the schema of those messages

play01:22

in whatever topic we're thinking about at the moment.

play01:25

The Confluent Schema Registry

play01:27

exists to solve precisely this problem.

play01:31

So, Schema Registry is a standalone server process

play01:34

that runs on a machine external to the Kafka brokers.

play01:38

So it looks like an application to the Kafka cluster,

play01:41

it looks like a producer or a consumer.

play01:43

And there's a little bit more to it than that,

play01:44

but at minimum it is that.

play01:46

Its job is to maintain a database of all of the schemas

play01:50

that have been written into topics in the cluster

play01:52

for which it is responsible.

play01:55

Now that database is persisted in an internal Kafka topic,

play02:00

this should come as no surprise to you,

play02:01

and it's cached in the Schema Registry

play02:03

for low latency access.

play02:06

This is very typical, by the way,

play02:07

for an element of the Kafka ecosystem

play02:10

to be built out of Kafka,

play02:12

you know, we needed a distributed fault tolerant data store,

play02:14

well, here's Kafka presenting itself.

play02:16

So we use it, we use a topic to store those schemas.

play02:19

A Schema Registry can be run

play02:20

in a redundant high availability configuration

play02:22

if you like.

play02:23

So it remains up if one instance fails.

play02:26

Now, Schema Registry is also an API

play02:29

that allows producers and consumers to predict

play02:31

whether the message they're about to produce or consume

play02:34

is compatible with previous versions

play02:37

or compatible with the version that they're expecting.

play02:40

When a producer is configured to use the Schema Registry,

play02:43

it calls, at produce time,

play02:45

an API at the Schema Registry REST endpoint.

play02:48

So Schema Registry is up there,

play02:50

maintaining this database,

play02:51

also has a REST interface.

play02:53

Producer calls that REST endpoint

play02:56

and presents the schema of the new message.

play02:58

If it's the same as the last message produced,

play03:02

then the produce may succeed.

play03:04

If it's different from the last message,

play03:06

but matches the compatibility rules defined for the topic,

play03:10

the produce may still succeed.

play03:12

If it's different in a way

play03:13

that will violate the compatibility rules,

play03:16

the produce will fail in a way

play03:18

that the application code can detect.

play03:21

There'll be a failure condition it can detect

play03:23

and, you know,

play03:24

dutifully of course produce that exception stacktrace

play03:27

to the browser, no way don't do that,

play03:29

you could responsibly handle that condition.

play03:31

But you are made aware of that condition,

play03:33

rather than producing data

play03:34

that is gonna be incompatible down the line.

play03:37

Likewise, on the consumer side,

play03:38

if a consumer reads a message

play03:40

that has an incompatible schema from the version

play03:43

that the consumer code expects,

play03:45

Schema Registry will tell it not to consume the message.

play03:48

It doesn't fully automate the problem of schema evolution,

play03:51

and frankly, nothing does.

play03:52

That's always a challenge in any system

play03:55

that serializes anything, regardless of the tooling.

play03:58

But it does make a difficult problem much easier

play04:01

by keeping the runtime failures

play04:02

from happening when possible.

play04:04

Also, if you're worried about all these rest round trips,

play04:07

and that sounds really slow.

play04:08

Of course, all this stuff gets cached in the producer

play04:10

and the consumer when you're using Schema Registry.

play04:12

So these schemas have immutable IDs,

play04:15

and once I've checked once,

play04:16

you know, that's gonna be cached locally,

play04:17

and I don't need to keep doing those round trips.

play04:19

That's usually just a warm up thing

play04:21

in terms of performance.

play04:22

A Schema Registry currently supports

play04:24

three serialization formats,

play04:26

JSON Schema, Avro and Protobuf.

play04:29

And depending on the format you may have available to you

play04:32

an IDL, an Interface Description Language

play04:34

where you can describe in a source controllable text file,

play04:38

the schema of the objects in question.

play04:41

And in some cases, there's also tooling

play04:43

that will then take that IDL,

play04:45

for example, in Avro you can write an avsc file.

play04:49

That's this nice simple JSON format

play04:50

where you're describing the schema of the object

play04:52

and say if you're using Java,

play04:54

there's a Maven and a Gradle plugin

play04:56

where you can turn that into a Java object.

play04:58

So then not only do you have

play05:00

the ability to eliminate certain classes of runtime failures

play05:03

due to schema evolution,

play05:05

but you've got now a tooling pathway

play05:08

that drives collaboration around schema change

play05:11

to a single file.

play05:12

So if you want to change what an order is,

play05:14

and add a new status field to an order,

play05:16

well, technically what that means is,

play05:18

you change the IDL, you edit avsc file.

play05:22

And the process that you now have

play05:24

for collaborating around that schema change,

play05:26

well, that's the same process you have

play05:28

for collaborating around any schema change.

play05:30

For most of us, that's a pull request, right?

play05:32

You do that thing in a branch and you submit a PR

play05:35

and people talk about it,

play05:37

and then it gets done and everybody has that change,

play05:39

and the tooling updates the object

play05:41

and the Schema Registry at runtime

play05:43

tells you whether that's gonna work,

play05:44

there's even a way to do it at build time,

play05:47

before you deploy the code to find out whether

play05:49

this is gonna be a breaking change or not,

play05:51

if it's not obvious, in the case of complex domain objects.

play05:54

So, all kinds of very, very helpful things.

play05:58

I would go so far this is a slightly opinionated statement

play06:01

to say that in any non-trivial system,

play06:03

using Schema Registry is non-negotiable.

play06:07

Again, there are going to be people writing consumers

play06:10

at some point, that maybe you haven't talked to them,

play06:13

you haven't had a chance to fully mind meld with them

play06:16

on what's going on with the schema in that topic.

play06:18

They need a standard and automated way of learning about it.

play06:22

Also, no matter how good of a job you do up front

play06:24

to defining schema, the world out there changes,

play06:26

your schemas are gonna change.

play06:28

You need a way of managing those evolutions internally,

play06:30

and Confluent Schema Registry helps you with these things.

play06:34

(upbeat music)


Related Tags
Kafka, Schema Evolution, Compatibility, Confluent, API, Data Serialization, Avro, REST