Service discovery and heartbeats in micro-services 👍📈
Summary
TL;DR: The video script discusses the challenges of slow data pipelines and suggests using a NoSQL database like Cassandra, which costs around a million dollars. It emphasizes the importance of efficiency, reliability, and availability in server management. The script explains the concept of sharding databases, service discovery, and health checks, including the use of heartbeats to maintain system consistency and avoid 'zombie' servers. It also touches on how load balancers and health services work together to ensure service uptime, ultimately leading to increased revenue.
Takeaways
- 🚀 The company faces an issue with slow data pipelines and is seeking solutions to process data faster.
- 💡 A NoSQL database like Cassandra is suggested for its efficiency in handling large datasets, but it comes with a high cost.
- 🔍 The concept of sharding the database into partitions is introduced as a potential solution to improve efficiency and reduce cost.
- 🕒 The migration to a new system or sharding would take approximately a few weeks, implying a time investment for the transition.
- 🔥 Upon running a health check, it's revealed that only 50% of the servers are operational, indicating a significant issue with server availability.
- ✅ The promotion of an employee and the dismissal of two others highlight the importance of identifying and addressing problems efficiently.
- 🔄 The script emphasizes the importance of reliability and availability over efficiency in service management, suggesting a balance is necessary.
- 🌐 A three-server example is provided to explain the role of health checks and load balancers in maintaining service uptime.
- 💻 The potential for a 'zombie' server scenario is discussed, where a server may appear alive but is unable to process requests, and the solution of a two-way heartbeat mechanism is proposed.
- 🔄 Service discovery is tied to health checks, where new services need to communicate their availability to the load balancer for proper routing.
- 📈 The script concludes with a call to action for further learning and engagement, inviting viewers to subscribe and comment for more information.
Q & A
What is the main issue discussed in the video script?
- The main issue discussed is the slowness of the company's data pipelines and the need to process data faster.
What solution is initially proposed to address the data pipeline problem?
- The initial solution proposed is to use a NoSQL database, specifically Cassandra, to improve data processing speed.
What is the estimated cost of using Cassandra for the company's needs?
- The estimated cost of using Cassandra for the company's size is about 1 million dollars.
What alternative solution is suggested to reduce costs?
- The alternative solution suggested is sharding the database into partitions to store data more efficiently.
What is the expected time frame for the database migration to partitions?
- The expected time frame for the migration is approximately a few weeks.
Why is it mentioned that only 50% of the servers are currently operational?
- It is mentioned to highlight the inefficiency and the potential for improvement by bringing up the remaining servers.
What is the role of the health service in the context of the script?
- The health service's role is to ensure that all servers are operational by periodically checking their status and responding to any issues.
What action is taken when a server like 's1' fails to respond to the health service?
- When 's1' fails to respond, the health service marks it as critical, and if it misses a second check, it is considered dead, prompting a restart or migration to another server.
How does the script differentiate between service efficiency and reliability?
- The script emphasizes that while efficiency is important, reliability and availability are crucial for a service to be used more and generate revenue.
What is the concept of a 'heartbeat' in the context of service health checks?
- A 'heartbeat' refers to a regular communication signal between the service and the health service to confirm that the service is alive and functioning properly.
How is service discovery related to health checks as described in the script?
- Service discovery is closely tied to health checks as it involves updating the load balancer with the current status and location of services, which the health service then uses to check their health.
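The heartbeat answer above can be illustrated with a minimal sketch of the service-to-health-service direction of the two-way heartbeat, including the step where a service that has been marked dead shuts itself down so it cannot act as a zombie. The class names, the 10-second timeout (two missed 5-second beats), and the explicit `now` timestamps are assumptions made for illustration; the video does not prescribe an implementation, and the health-service-to-service ping is left out here.

```python
from typing import Dict, Set


class HeartbeatMonitor:
    """Health-service side: records heartbeats pushed by services."""

    def __init__(self, timeout: float = 10.0) -> None:
        # Assumed timeout: two missed beats at a 5-second interval.
        self.timeout = timeout
        self.last_seen: Dict[str, float] = {}
        self.dead: Set[str] = set()

    def beat(self, name: str, now: float) -> bool:
        """Accept a heartbeat; return False if the sender is already marked dead."""
        if name in self.dead:
            return False
        self.last_seen[name] = now
        return True

    def sweep(self, now: float) -> None:
        """Mark every service whose last heartbeat is older than the timeout as dead."""
        for name, seen in self.last_seen.items():
            if now - seen > self.timeout:
                self.dead.add(name)


class Service:
    """Service side: pushes heartbeats and stops itself once marked dead."""

    def __init__(self, name: str, monitor: HeartbeatMonitor) -> None:
        self.name = name
        self.monitor = monitor
        self.running = True

    def send_heartbeat(self, now: float) -> None:
        if not self.running:
            return
        if not self.monitor.beat(self.name, now):
            # The health service considers us dead, so stop all work
            # (including background jobs) instead of sending stale data.
            self.running = False


if __name__ == "__main__":
    monitor = HeartbeatMonitor()
    s4 = Service("s4", monitor)
    s4.send_heartbeat(now=0.0)
    monitor.sweep(now=11.0)       # no beat for more than 10 seconds
    s4.send_heartbeat(now=12.0)   # s4 learns it was marked dead
    print(s4.running)
```

Passing timestamps explicitly keeps the sketch deterministic; a real service would read a clock and run both sides on timers.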
Outlines
🚀 Optimizing Data Pipeline Efficiency
The first paragraph addresses the issue of slow data pipelines and suggests a NoSQL database like Cassandra as a potential solution. The cost is estimated at around 1 million dollars. The speaker proposes sharding the database into partitions to improve efficiency and mentions that bringing up the remaining 50% of the servers could meet the efficiency requirements. The paragraph also highlights the importance of reliability and availability over efficiency in service operation, using a scenario with three servers (s1, s2, s3) to explain the role of health checks and load balancers in maintaining service uptime. The concept of a two-way heartbeat mechanism is introduced to ensure service consistency and avoid 'zombie' servers that continue to send stale data.
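The three-server health check described above can be sketched roughly as follows. This is a hedged illustration, not the video's implementation: the `HealthService` class, the injected `probe` callable, and the choice to keep routing traffic to a server that is only critical are all assumptions.

```python
from enum import Enum
from typing import Callable, Dict, List


class Status(Enum):
    ALIVE = "alive"
    CRITICAL = "critical"  # missed one check
    DEAD = "dead"          # missed two or more checks in a row


class HealthService:
    """Polls each server and counts consecutive missed checks."""

    def __init__(self, servers: List[str], probe: Callable[[str], bool]) -> None:
        # probe(server) returns True if the server answered "yes"
        # to "are you alive?" (hypothetical callable for this sketch).
        self.probe = probe
        self.misses: Dict[str, int] = {s: 0 for s in servers}

    def check_all(self) -> None:
        """One polling round; in the video this happens every five seconds."""
        for server in self.misses:
            if self.probe(server):
                self.misses[server] = 0
            else:
                self.misses[server] += 1

    def status(self, server: str) -> Status:
        missed = self.misses[server]
        if missed == 0:
            return Status.ALIVE
        if missed == 1:
            return Status.CRITICAL
        return Status.DEAD

    def routable(self) -> List[str]:
        """Servers the load balancer may still send requests to.

        Assumption: a critical server still receives traffic; only a
        dead one is removed.
        """
        return [s for s in self.misses if self.status(s) is not Status.DEAD]


if __name__ == "__main__":
    down = {"s1"}  # simulate s1 failing
    health = HealthService(["s1", "s2", "s3"], probe=lambda s: s not in down)
    health.check_all()
    print(health.status("s1"))
    health.check_all()
    print(health.routable())
```

Injecting `probe` keeps the sketch independent of any network library; in practice it might be an HTTP request to a health endpoint.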
🔍 Service Discovery and Health Checks
The second paragraph delves into the concepts of service discovery and health checks, explaining how a service like the profile service communicates its availability to the load balancer by providing IP addresses and ports. The load balancer then updates its snapshot and distributes this information to other services, allowing them to cache it for efficient communication. The health service is described as monitoring the health of services by opening connections to their HTTP ports and ensuring they are responsive. The paragraph also discusses the importance of the health service in recognizing changes in the service snapshot and maintaining system health. Links for further reading and an invitation to subscribe for notifications on related topics are provided.
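The snapshot idea described above can be sketched as a small versioned registry. Everything named here is hypothetical: the `Registry` and `Snapshot` classes, the in-memory `history` list standing in for the database, and the example IP addresses and port are assumptions for illustration only.

```python
from dataclasses import dataclass, field
from typing import Dict, FrozenSet, Iterable, List, Tuple

Endpoint = Tuple[str, int]  # (ip address, port)


@dataclass
class Snapshot:
    """One version of the full service-to-endpoints mapping."""
    version: int
    services: Dict[str, FrozenSet[Endpoint]] = field(default_factory=dict)


class Registry:
    """Load-balancer side: keeps the current snapshot and past versions."""

    def __init__(self) -> None:
        self.current = Snapshot(version=0)
        self.history: List[Snapshot] = [self.current]  # stand-in for a database

    def register(self, service: str, endpoints: Iterable[Endpoint]) -> Snapshot:
        """A service announces where it runs; this produces a new snapshot version."""
        services = dict(self.current.services)
        services[service] = frozenset(endpoints)
        self.current = Snapshot(self.current.version + 1, services)
        self.history.append(self.current)
        return self.current

    def lookup(self, service: str) -> List[Endpoint]:
        """Other services ask here instead of storing addresses themselves."""
        return sorted(self.current.services.get(service, frozenset()))


def diff(old: Snapshot, new: Snapshot) -> Dict[str, FrozenSet[Endpoint]]:
    """Endpoints present in `new` but not in `old`, grouped by service.

    The health service could use this to decide which endpoints to start checking.
    """
    added: Dict[str, FrozenSet[Endpoint]] = {}
    for name, endpoints in new.services.items():
        extra = endpoints - old.services.get(name, frozenset())
        if extra:
            added[name] = extra
    return added


if __name__ == "__main__":
    registry = Registry()
    before = registry.current
    after = registry.register(
        "profile",
        [("10.0.0.1", 8080), ("10.0.0.2", 8080), ("10.0.0.3", 8080)],
    )
    print(after.version, registry.lookup("profile"))
    print(diff(before, after))
```

Building a new `Snapshot` on every registration, rather than mutating the old one, is what makes the diff between two versions straightforward to compute.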
Keywords
💡Data Pipelines
💡NoSQL Database
💡Sharding
💡Migration
💡Efficiency
💡Reliability and Availability
💡Health Check
💡Load Balancer
💡Service Discovery
💡Heartbeat
💡Zombies
Highlights
Problem identified with slow data pipelines needing faster processing.
Suggestion to use a NoSQL database for faster reads.
Estimated cost of using Cassandra for the database is about 1 million dollars.
Proposed solution to shard the database into partitions for more efficient data storage.
Migration to sharded database expected to take approximately a few weeks.
50% of servers are currently not operational, suggesting a need to bring them online.
Efficiency requirements can be easily met by utilizing the remaining 50% of servers.
Emphasis on reliability and availability over efficiency for server-side operations.
Importance of checks and balances to ensure service is always running for user trust and revenue generation.
Example given of a health service monitoring server status using a three-server model.
Health service's role in continuously checking server status and marking downed servers.
Load balancer's function in redistributing requests when a server is down.
Challenge of distinguishing between a server being alive but the application not functioning.
Introduction of a two-way heartbeat mechanism for more consistent service monitoring.
Health check problem's connection to service discovery for maintaining updated server information.
Service discovery's role in updating the load balancer with new service instances.
Load balancer's need to persist data for maintaining an accurate snapshot of service instances.
Importance of caching service information on the load balancer for efficiency.
Health service's interaction with the load balancer to ensure service health and updates.
Transcripts
gentlemen we have a problem our data
pipelines are too slow we need to
process our data faster any suggestions
now we need to make our reads faster now
I think a NoSQL database will be
really useful here what's the cost
typically a Cassandra cluster for this size
will cost about 1 million dollars
million dollars actually we can store
the data more efficiently what we need
to do is shard the database into
partitions and what's the cost of this
approximately a few weeks for the
migration actually we can bring up the
remaining 50% of our servers so I just
ran the health check and what do you
mean only 50% of our servers are
actually on so we should be able to hit
the efficiency requirements very easily
if we started well you are promoted and
you two are fired most of the things on
the server side deal more with
reliability and availability rather than
efficiency a lot of people when they're
starting a service they think about how
do I make this service more efficient
but what you really need to do is have
enough checks and balances to make sure
that this service is running all the
time when that happens users tend to use
that service more which ends up
generating more revenue for you let's
take an example where we have 3 servers
s1 s2 and s3 and we also have a health
service which is an important component
along with the load balancer so the
health service its main job is to make
sure that everyone is alive and the way
it does that is by talking to all three
services so s1 will be sent a message of
are you alive and s1 will respond with
yes now five seconds later the health
service will again ask s1 are you alive
and if it responds with yes this cycle
continues forever but what could happen
as it always does is that s1 goes down
because of some hardware issue or maybe
there's a bug in the code s1 does not
respond with yes to the requests sent by
the health service at this point the
health service can mark s1 as critical
and if it misses the second Health
Service request then you can assume that
s1 is dead and if s1 is dead
there's no point sending requests to
this server we can assume that this is
not going to be capable of taking
requests and we need to restart that
machine so maybe ask some service to
restart this machine or to go on some
other server s4 and run the service that
was running over here on s4 yeah and
what do you do with the load balancer
when you actually send this information
to the load balancer saying that s1 is
dead and what we are going to do either
through the load balancer or some other
service is to run that service on a new
machine okay this may not necessarily be
a new machine you can run another
instance on the same machine but for all
our purposes this is what's happening
what you could do is a little more
complicated often what happens is when
you ask a machine are you alive it
responds with yes but sometimes the
application is not alive so that's a
weird problem to have and in fact it can
be a terrible problem to have because
you are assuming that the service is
alive while actually it's not able to
process any of the requests whatever be
the case maybe there's some memory issue
it's just not able to process any
requests so in this case what you can do
is the service itself should be telling
you that I am Alive it should be telling
the health service yes I'm alive by itself
so it's a two-way heartbeat yeah a
heartbeat every five seconds from this
side and a heartbeat every five seconds
from this side what happens here is that
you have a more consistent model
although there's more communication
overhead the consistency is maintained
in the sense that if s4 is dead the s4
might still be sending requests to other
services despite being dead despite not
taking new requests because it might be
having cron jobs running on it so when
it does send these requests to these
services it has stale data that's one
problem it might be manipulating the
internal state of these machines the way
we can avoid this is to like basically
avoid zombies is by having a 2-way
heartbeat mechanism even this does not
actually solve the problem entirely but
this really helps you reduce the problem
to a large extent
because if you see that s4 has been
marked dead by the Health Service then
s4 will kill itself interestingly the
health check problem is very closely
tied to the service discovery problem in
the sense that if there is a new service
which is coming up having these three
boxes s1 s2 and s3 so maybe this service
is a profile service and it's running on
these three boxes all it needs to do is
tell the load balancer that hey I have
three boxes on which I am running my
service these are the IP addresses all
right there's a list of IP addresses
that you can take these are the ports on
which I am running my HTTP clients maybe
there's a separate port for my XMPP
clients yeah so based on this the load
balancer changes the snapshot that it
has so the load balancer does need to
persist data somewhere and this data is
basically a snapshot of the entire
universe of the services that you have
when the profile service comes alive you
need a new snapshot that's a new version
you need to persist that in the database
saying that s1 s2 and s3 have these
ports and these IP addresses for which
the profile service is running once you
have this snapshot you can then tell
every other service that hey this is the
snapshot if you need to send a message
to the profile service on s3 then this
is the IP address and in that way what
happens is the services themselves don't
need to keep that information as in: I
need to send a message to the profile
service what's the IP address or what's
the port anytime you need information
you come and ask the load balancer and
even better if you can cache this
information on the load balancer like
basically cache the snapshot that this
guy contains okay now how does the
health service come into the picture
well every time there is a change in a
snapshot the health service can see the
diff in the snapshot maybe and based on
that it can open up connections with
these three services on their HTTP ports
always making sure that they're alive
and making sure that the health of the
system is good these are the main points
of service discovery and health checks
if you have a real interest in this
subject and you want to get more detail
then I have the links in the description
below if you want notifications for
further videos like this be sure to hit
the subscribe button
and if you have any doubts or suggestions
leave it in the comments below I'll see
you next time