Service discovery and heartbeats in micro-services πŸ‘πŸ“ˆ

Gaurav Sen
14 Apr 2019 Β· 06:44

Summary

TL;DR: The video discusses the problem of slow data pipelines. A NoSQL database such as Cassandra is considered but would cost around a million dollars, so sharding the existing database into partitions is proposed instead, and a quick health check reveals that only half the servers are even running. The video then argues that reliability and availability matter more than raw efficiency in server management. It explains database sharding, service discovery, and health checks, including the use of heartbeats to keep the system consistent and avoid 'zombie' servers, and shows how load balancers and a health service work together to keep a service up, which ultimately drives revenue.

Takeaways

  • πŸš€ The company faces an issue with slow data pipelines and is seeking solutions to process data faster.
  • πŸ’‘ A NoSQL database like Cassandra is suggested for its efficiency in handling large datasets, but it comes with a high cost.
  • πŸ” The concept of sharding the database into partitions is introduced as a potential solution to improve efficiency and reduce cost.
  • πŸ•’ Sharding the database would require a few weeks of migration work, a meaningful time investment for the transition.
  • πŸ”₯ Upon running a health check, it's revealed that only 50% of the servers are operational, indicating a significant issue with server availability.
  • βœ… In the opening skit, the engineer who spots the real problem (half the servers are switched off) is promoted while the others are fired, underscoring the value of diagnosing the actual issue before proposing expensive fixes.
  • πŸ”„ The video emphasizes reliability and availability over raw efficiency in service management: a service that is always up gets used more.
  • 🌐 A three-server example is provided to explain the role of health checks and load balancers in maintaining service uptime.
  • πŸ’» The potential for a 'zombie' server scenario is discussed, where a server may appear alive but is unable to process requests, and the solution of a two-way heartbeat mechanism is proposed.
  • πŸ”„ Service discovery is tied to health checks, where new services need to communicate their availability to the load balancer for proper routing.
  • πŸ“ˆ The script concludes with a call to action for further learning and engagement, inviting viewers to subscribe and comment for more information.

Q & A

  • What is the main issue discussed in the video script?

    -The main issue discussed is the slowness of the company's data pipelines and the need to process data faster.

  • What solution is initially proposed to address the data pipeline problem?

    -The initial solution proposed is to use a NoSQL database, specifically Cassandra, to improve data processing speed.

  • What is the estimated cost of using Cassandra for the company's needs?

    -The estimated cost of using Cassandra for the company's size is about 1 million dollars.

  • What alternative solution is suggested to reduce costs?

    -The alternative solution suggested is sharding the database into partitions to store data more efficiently.

  • What is the expected time frame for the database migration to partitions?

    -The expected time frame for the migration is approximately a few weeks.

  • Why is it mentioned that only 50% of the servers are currently operational?

    -It is mentioned to highlight wasted capacity: simply bringing the remaining 50% of the servers online would meet the efficiency requirements without any expensive migration.

  • What is the role of the health service in the context of the script?

    -The health service's role is to ensure that all servers are operational by periodically checking their status and responding to any issues.

  • What action is taken when a server like 's1' fails to respond to the health service?

    -When 's1' fails to respond, the health service marks it as critical, and if it misses a second check, it is considered dead, prompting a restart or migration to another server.

  • How does the script differentiate between service efficiency and reliability?

    -The script emphasizes that while efficiency is important, reliability and availability are crucial for a service to be used more and generate revenue.

  • What is the concept of a 'heartbeat' in the context of service health checks?

    -A 'heartbeat' refers to a regular communication signal between the service and the health service to confirm that the service is alive and functioning properly.

  • How is service discovery related to health checks as described in the script?

    -Service discovery is closely tied to health checks as it involves updating the load balancer with the current status and location of services, which the health service then uses to check their health.

Outlines

00:00

πŸš€ Optimizing Data Pipeline Efficiency

The first paragraph addresses the issue of slow data pipelines and suggests a NoSQL database like Cassandra as a potential solution. The cost is estimated at around 1 million dollars. The speaker proposes sharding the database into partitions to improve efficiency and mentions that bringing up the remaining 50% of the servers could meet the efficiency requirements. The paragraph also highlights the importance of reliability and availability over efficiency in service operation, using a scenario with three servers (s1, s2, s3) to explain the role of health checks and load balancers in maintaining service uptime. The concept of a two-way heartbeat mechanism is introduced to ensure service consistency and avoid 'zombie' servers that continue to send stale data.
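To make the polling flow concrete, here is a minimal Python sketch of the check cycle described above: probe each server every five seconds, mark a server critical after one missed check and dead after two, then notify whoever restarts the instance and updates the load balancer. The /health path, the addresses, and the callback are assumptions for illustration, not anything prescribed in the video.

```python
import http.client
import time

CHECK_INTERVAL = 5  # seconds between probes, mirroring the video's five-second cycle

def is_alive(host, port, timeout=2.0):
    """Ask 'are you alive?' by hitting an assumed /health endpoint over HTTP."""
    try:
        conn = http.client.HTTPConnection(host, port, timeout=timeout)
        conn.request("GET", "/health")
        ok = conn.getresponse().status == 200
        conn.close()
        return ok
    except (OSError, http.client.HTTPException):
        return False

def monitor(servers, on_dead):
    """servers: dict like {'s1': ('10.0.0.1', 8080)}; on_dead: callback that would
    restart the instance elsewhere and tell the load balancer to stop routing to it."""
    misses = {name: 0 for name in servers}
    state = {name: "healthy" for name in servers}
    while True:
        for name, (host, port) in servers.items():
            if state[name] == "dead":
                continue  # no point probing a server already declared dead
            if is_alive(host, port):
                misses[name], state[name] = 0, "healthy"
            else:
                misses[name] += 1
                # First missed check: mark critical. Second missed check: assume dead.
                state[name] = "critical" if misses[name] == 1 else "dead"
                if state[name] == "dead":
                    on_dead(name)
        time.sleep(CHECK_INTERVAL)
```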

05:02

πŸ” Service Discovery and Health Checks

The second paragraph delves into the concepts of service discovery and health checks, explaining how a service like the profile service communicates its availability to the load balancer by providing IP addresses and ports. The load balancer then updates its snapshot and distributes this information to other services, allowing them to cache it for efficient communication. The health service is described as monitoring the health of services by opening connections to their HTTP ports and ensuring they are responsive. The paragraph also discusses the importance of the health service in recognizing changes in the service snapshot and maintaining system health. Links for further reading and an invitation to subscribe for notifications on related topics are provided.
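The snapshot-and-diff idea in this paragraph can be sketched as a small in-memory registry. In a real system the versions would be persisted to a database and served by the load balancer; the class, field names, and addresses below are invented for the example.

```python
import copy

class ServiceRegistry:
    """Versioned snapshot of the 'entire universe' of services kept by the load
    balancer; in the video this snapshot would be persisted to a database."""

    def __init__(self):
        self.version = 0
        self.history = {0: {}}  # version -> snapshot (service name -> instances)

    def register(self, service, instances):
        snapshot = copy.deepcopy(self.history[self.version])
        snapshot[service] = instances
        self.version += 1
        self.history[self.version] = snapshot
        return self.version

    def diff(self, old_version, new_version):
        """What changed between two versions: the part a health service would watch
        so it can start checking newly registered instances."""
        old, new = self.history[old_version], self.history[new_version]
        added = {s: i for s, i in new.items() if old.get(s) != i}
        removed = [s for s in old if s not in new]
        return added, removed

registry = ServiceRegistry()
v1 = registry.register("profile", [
    {"host": "10.0.0.1", "http": 8080, "xmpp": 5222},  # s1 (made-up addresses)
    {"host": "10.0.0.2", "http": 8080, "xmpp": 5222},  # s2
    {"host": "10.0.0.3", "http": 8080, "xmpp": 5222},  # s3
])
added, removed = registry.diff(0, v1)  # the health service sees the new profile boxes
# Other services can cache (version, snapshot) locally and refresh only when the
# version they hold no longer matches the load balancer's current version.
```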

Keywords

πŸ’‘Data Pipelines

Data pipelines refer to the mechanisms and processes through which data flows from its source to its destination. In the video, the issue with the data pipelines is that they are too slow, which is a critical problem for the efficiency of data processing. The script suggests that improving the speed of data reads is a priority.

πŸ’‘NoSQL Database

A NoSQL database is a type of non-relational database that is designed for scalability and flexibility. The script mentions the potential use of a NoSQL database like Cassandra to improve data processing speed, highlighting its relevance in handling large volumes of data efficiently.

πŸ’‘Sharding

Sharding is the process of distributing data across multiple machines or databases to improve performance and manage large datasets. The script discusses sharding as a solution to enhance the efficiency of the database by partitioning it.
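As a rough illustration of the idea (the video does not show an implementation), one common approach is to hash a partition key and route each row to one of N shards, so every shard holds only a slice of the data:

```python
import hashlib

SHARDS = ["shard-0", "shard-1", "shard-2", "shard-3"]  # four hypothetical partitions

def shard_for(key):
    """Route a row to a partition by hashing its key, so reads scan a smaller dataset."""
    digest = hashlib.md5(key.encode()).hexdigest()
    return SHARDS[int(digest, 16) % len(SHARDS)]

print(shard_for("user:42"))  # the same key always lands on the same shard
```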

πŸ’‘Migration

In the context of databases, migration refers to the process of transferring data from one system to another, often to improve performance or scalability. The script mentions a migration period of a few weeks, indicating the time required to transition to a new database system or structure.

πŸ’‘Efficiency

Efficiency in the video script pertains to the performance and speed at which data can be processed. The script emphasizes the need for making reads faster and improving overall system efficiency to meet the company's requirements.

πŸ’‘Reliability and Availability

These terms are critical in service management, referring to the dependability of a service to perform its function and the accessibility of the service when needed. The script discusses the importance of having checks and balances to ensure the service is always running, which is essential for user trust and revenue generation.

πŸ’‘Health Check

A health check in the script refers to the process of monitoring the status of servers or services to ensure they are operational. It is an essential component for maintaining service reliability, as it can detect and respond to server failures, as illustrated by the example of server s1 going down.
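On the server side, a health check typically means exposing an endpoint the health service can probe. The following is a minimal, hypothetical example using Python's standard library; the video does not specify an endpoint or port.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

class HealthHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path == "/health":
            # Report healthy only when the application can actually serve work,
            # not merely because the process or the machine is up.
            self.send_response(200)
            self.end_headers()
            self.wfile.write(b"ok")
        else:
            self.send_response(404)
            self.end_headers()

if __name__ == "__main__":
    HTTPServer(("0.0.0.0", 8080), HealthHandler).serve_forever()
```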

πŸ’‘Load Balancer

A load balancer is a system used to distribute network or application traffic across multiple servers to ensure no single server bears too much load. In the script, the load balancer is responsible for directing traffic based on the health and availability of services, as well as for service discovery.
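A toy sketch of the routing side: a balancer that rotates through instances and skips any that health checks have marked down. The instance addresses and the mark/pick API are invented for illustration, not the video's design.

```python
import itertools

class RoundRobinBalancer:
    """Rotate requests across instances, skipping those marked unhealthy."""

    def __init__(self, instances):
        self.healthy = {addr: True for addr in instances}
        self._cycle = itertools.cycle(instances)

    def mark(self, addr, is_healthy):
        self.healthy[addr] = is_healthy  # updated from the health service's results

    def pick(self):
        for _ in range(len(self.healthy)):
            addr = next(self._cycle)
            if self.healthy[addr]:
                return addr
        raise RuntimeError("no healthy instances available")

lb = RoundRobinBalancer(["10.0.0.1:8080", "10.0.0.2:8080", "10.0.0.3:8080"])
lb.mark("10.0.0.1:8080", False)  # e.g. s1 was declared dead by the health service
print(lb.pick())                 # requests now rotate between s2 and s3
```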

πŸ’‘Service Discovery

Service discovery is the process by which services in a distributed system register themselves and discover other services. The script explains how a new service, such as a profile service, would inform the load balancer of its availability, allowing the load balancer to update its snapshot and direct traffic accordingly.
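The registration step might look roughly like this, with the profile service announcing its three boxes, their IP addresses, and a port per protocol to the load balancer. The /register endpoint, payload shape, and addresses are assumptions, not the video's actual API.

```python
import json
import urllib.request

LOAD_BALANCER = "http://lb.internal:7000"  # hypothetical load-balancer address

def register(service_name, instances):
    """Announce where a service runs so the load balancer can version a new snapshot."""
    payload = json.dumps({"service": service_name, "instances": instances}).encode()
    req = urllib.request.Request(
        f"{LOAD_BALANCER}/register", data=payload,
        headers={"Content-Type": "application/json"}, method="POST")
    with urllib.request.urlopen(req, timeout=2) as resp:
        return json.load(resp)  # e.g. the new snapshot version

# The profile service from the video, running on three boxes with an HTTP port
# and a separate XMPP port on each (all addresses are made up):
register("profile", [
    {"host": "10.0.0.1", "http_port": 8080, "xmpp_port": 5222},
    {"host": "10.0.0.2", "http_port": 8080, "xmpp_port": 5222},
    {"host": "10.0.0.3", "http_port": 8080, "xmpp_port": 5222},
])
```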

πŸ’‘Heartbeat

In the context of the script, a heartbeat is a signal sent from a service to a health service to indicate that it is operational. The script discusses the concept of a two-way heartbeat mechanism to ensure that services are not only alive but also capable of processing requests, thus avoiding 'zombie' services.
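A minimal sketch of the push half of the two-way heartbeat: the application itself reports in every five seconds and shuts down if the health service has already declared it dead, which avoids the 'zombie' state covered under the next keyword. The health-service URL, endpoint, and instance name here are hypothetical.

```python
import sys
import time
import urllib.request

HEARTBEAT_INTERVAL = 5                          # seconds, as in the video
HEALTH_SERVICE = "http://health.internal:9000"  # hypothetical health-service address
INSTANCE_ID = "profile-s1"                      # hypothetical instance name

def heartbeat_loop():
    while True:
        # Push half of the two-way heartbeat: the application reports in, so a box
        # that is up but whose application is wedged simply stops reporting.
        req = urllib.request.Request(
            f"{HEALTH_SERVICE}/heartbeat/{INSTANCE_ID}", data=b"", method="POST")
        try:
            with urllib.request.urlopen(req, timeout=2) as resp:
                verdict = resp.read().decode().strip()
        except OSError:
            verdict = "unknown"  # health service unreachable; keep trying
        # Zombie avoidance: if the health service has already marked this instance
        # dead, stop acting on possibly stale data and exit instead of lingering.
        if verdict == "dead":
            sys.exit("marked dead by the health service; shutting down")
        time.sleep(HEARTBEAT_INTERVAL)

if __name__ == "__main__":
    heartbeat_loop()
```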

πŸ’‘Zombies

In the script, 'zombies' refer to services that appear to be operational because they are sending out signals (heartbeats), but in reality, they are not processing requests. The term is used metaphorically to describe a problematic state where services are not functioning as expected but are not detected as down.

Highlights

Problem identified with slow data pipelines needing faster processing.

Suggestion to use a NoSQL database for faster reads.

Estimated cost of using Cassandra for the database is about 1 million dollars.

Proposed solution to shard the database into partitions for more efficient data storage.

Migration to sharded database expected to take approximately a few weeks.

50% of servers are currently not operational, suggesting a need to bring them online.

Efficiency requirements can be easily met by utilizing the remaining 50% of servers.

Emphasis on reliability and availability over efficiency for server-side operations.

Importance of checks and balances to ensure service is always running for user trust and revenue generation.

Example given of a health service monitoring server status using a three-server model.

Health service's role in continuously checking server status and marking downed servers.

Load balancer's function in redistributing requests when a server is down.

Challenge of distinguishing between a server being alive but the application not functioning.

Introduction of a two-way heartbeat mechanism for more consistent service monitoring.

Health check problem's connection to service discovery for maintaining updated server information.

Service discovery's role in updating the load balancer with new service instances.

Load balancer's need to persist data for maintaining an accurate snapshot of service instances.

Importance of caching service information on the load balancer for efficiency.

Health service's interaction with the load balancer to ensure service health and updates.

Transcripts

00:00

Gentlemen, we have a problem: our data pipelines are too slow. We need to process our data faster. Any suggestions?

So we need to make our reads faster. I think a NoSQL database will be really useful here.

What's the cost?

Typically, a Cassandra cluster for this size will cost about 1 million dollars.

A million dollars? Actually, we can store the data more efficiently. What we need to do is shard the database into partitions.

And what's the cost of this?

Approximately a few weeks for the migration.

Actually, we can bring up the remaining 50% of our servers. I just ran the health check and...

What do you mean, only 50% of our servers are actually on?

So we should be able to hit the efficiency requirements very easily if we start them.

Well, you are promoted, and you two are fired.

01:12

Most of the things on the server side deal more with reliability and availability than with efficiency. A lot of people, when they're starting a service, think about "how do I make this service more efficient?", but what you really need to do is have enough checks and balances to make sure that the service is running all the time. When that happens, users tend to use the service more, which ends up generating more revenue for you.

01:31

Let's take an example where we have three servers, s1, s2 and s3, and we also have a health service, which is an important component, along with the load balancer. The health service's main job is to make sure that everyone is alive, and the way it does that is by talking to all three services. So s1 will be sent a message of "are you alive?" and s1 will respond with "yes". Five seconds later the health service will again ask s1 "are you alive?", and if it responds with "yes", this cycle continues forever. But what could happen, as it always does, is that s1 goes down because of some hardware issue, or maybe there's a bug in the code, and s1 does not respond with "yes" to the requests sent by the health service. At this point the health service can mark s1 as critical, and if it misses the second health-service request, then you can assume that s1 is dead. And if s1 is dead, there's no point sending requests to that server; we can assume it is not going to be capable of taking requests, and we need to restart that machine. So maybe ask some service to restart the machine, or go to some other server, s4, and run the service that was running over here on s4.

And what do you do with the load balancer?

You send this information to the load balancer, saying that s1 is dead, and what we are going to do, either through the load balancer or some other service, is run that service on a new machine. It may not necessarily be a new machine, you can run another instance on the same machine, but for all our purposes this is what's happening.

03:05

What you could do is a little more complicated. Often what happens is that when you ask a machine "are you alive?", it responds with "yes", but sometimes the application is not alive. That's a weird problem to have, and in fact it can be a terrible problem to have, because you are assuming the service is alive when actually it's not able to process any of the requests. Whatever the case may be, maybe there's some memory issue, it's just not able to process any requests. So in this case, the service itself should be telling you "I am alive". It should be telling the health service "yes, I'm alive" by itself, so it's a two-way heartbeat: a heartbeat every five seconds from this side and a heartbeat every five seconds from the other side. What happens here is that you have a more consistent model. Although there's more communication overhead, consistency is maintained, in the sense that if s4 is dead, s4 might still be sending requests to other services despite being dead, despite not taking new requests, because it might have cron jobs running on it. So when it does send those requests to other services, it is sending stale data, that's one problem, and it might be manipulating the internal state of those machines. The way we can avoid this, basically avoid zombies, is by having a two-way heartbeat mechanism. Even this does not solve the problem entirely, but it really helps you reduce the problem to a large extent, because if s4 sees that it has been marked dead by the health service, then s4 will kill itself.

04:35

Interestingly, the health check problem is very closely tied to the service discovery problem, in the sense that if there is a new service coming up on these three boxes s1, s2 and s3, maybe a profile service running on these three boxes, all it needs to do is tell the load balancer: "Hey, I have three boxes on which I am running my service. These are the IP addresses, here's a list of IP addresses that you can take, and these are the ports on which I am running my HTTP clients; maybe there's a separate port for my XMPP clients." Based on this, the load balancer changes the snapshot that it has. So the load balancer does need to persist data somewhere, and this data is basically a snapshot of the entire universe of the services that you have. When the profile service comes alive, you need a new snapshot, that's a new version, and you need to persist it in the database, saying that s1, s2 and s3 have these ports and these IP addresses on which the profile service is running. Once you have this snapshot, you can then tell every other service: "Hey, this is the snapshot. If you need to send a message to the profile service on s3, then this is the IP address." That way, the services themselves don't need to keep track of "I need to send a message to the profile service, what's the IP address, what's the port?" Any time you need that information, you come and ask the load balancer, and even better, you can cache this information, basically cache the snapshot that the load balancer contains.

06:08

Okay, now how does the health service come into the picture? Well, every time there is a change in the snapshot, the health service can see the diff in the snapshot, and based on that it can open connections to these three services on their HTTP ports, always making sure that they're alive and that the health of the system is good. These are the main points of service discovery and health checks. If you have a real interest in this subject and you want more detail, I have links in the description below. If you want notifications for further videos like this, be sure to hit the subscribe button, and if you have any doubts or suggestions, leave them in the comments below. I'll see you next time.


Related Tags
Data Efficiency, Server Reliability, Service Discovery, Health Checks, Load Balancer, Database Sharding, NoSQL Database, System Monitoring, Service Availability, Technical Insights