What Could Go Wrong with a GraphQL Query and Can OpenTelemetry Help? - Budhaditya Bhattacharya, Tyk

CNCF [Cloud Native Computing Foundation]
29 Jun 2024 · 18:11

Summary

TL;DR: The speaker, a developer advocate, discusses common issues with GraphQL queries and how OpenTelemetry can address them. They explore challenges like over-fetching, under-fetching, and the 'N+1' problem, demonstrating how to monitor and troubleshoot using OpenTelemetry in a Node.js environment. The talk also covers the importance of semantic conventions and manual instrumentation for better observability, concluding with resources for further learning.

Takeaways

  • 😀 The speaker starts the session early and expects loud applause at the end, setting a positive, engaging tone for the audience.
  • 🌐 The talk focuses on potential issues with GraphQL queries and how OpenTelemetry can help address them, providing a practical approach to API management.
  • 🔍 The speaker, a developer advocate at Tyk, discusses the challenges and solutions around GraphQL and OpenTelemetry, drawing on experience with both.
  • 📈 The importance of monitoring and observability in production environments is highlighted, emphasizing the need for visibility into the health and performance of distributed systems.
  • 📊 The RED method (Rate, Errors, Duration) is introduced as a strategy for gaining insight into service health, suggesting a systematic approach to monitoring.
  • 🛠 OpenTelemetry is positioned as a tool for instrumenting GraphQL services to capture distributed traces, which is crucial for troubleshooting and performance analysis.
  • 🔄 The speaker illustrates OpenTelemetry with a Node.js example, showing practical steps to integrate it with a GraphQL service for better observability.
  • 📈 The integration of OpenTelemetry with Prometheus and Jaeger is discussed, demonstrating how to visualize and analyze metrics for GraphQL services.
  • 🚦 The challenge of HTTP status codes in GraphQL is addressed: a 200 OK response does not always indicate success, so deeper inspection of the response body is needed.
  • 🔍 The 'N+1' problem in GraphQL is identified as a common performance issue, and data loaders are suggested as a solution to prevent excessive database queries.
  • 🔄 The importance of granular performance profiling is stressed, moving beyond average metrics to understand the specific needs and challenges of different clients and queries.

Q & A

  • What is the main topic of the speaker's presentation?

    -The main topic of the presentation is what could go wrong with GraphQL queries and how OpenTelemetry can help address these issues.

  • Why might a speaker choose to start a session early?

    -The speaker starts the session early because it is the last session of the day and they expect the audience to be with them for the journey over the next 15 to 20 minutes.

  • What is the speaker's professional background?

    -The speaker, Buddha, is a Developer Advocate at Tyk, a cloud-native API management platform. They are also the chairperson of the OpenAPI Initiative's Business Governance Board.

  • What is GraphQL and how does it differ from REST?

    -GraphQL is a query language for APIs and a runtime for fulfilling those queries with existing data. It differs from REST in that it allows clients to request specific pieces of information, addressing issues like over-fetching and under-fetching that are common with REST.
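To make the request/response mirroring concrete, here is a small, hypothetical query against an imagined travel schema (the field names are illustrative, not taken from the talk's demo app):

```graphql
# The client asks only for the fields it needs; the JSON response
# mirrors this exact shape under a top-level "data" key.
query CountryWithWeather {
  country(code: "IT") {
    name
    capital
    weather {
      temperature
      condition
    }
  }
}
```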

  • What is the purpose of an API Gateway in the context of GraphQL?

    -An API Gateway acts as an entry point for observability and as a mediator between different security protocols and governance practices, helping ensure reliability and stability when deploying GraphQL.

  • What is the RED method in monitoring and what does it stand for?

    -The RED method is a monitoring strategy used to gain insight into the health and performance of distributed systems. RED stands for Rate, Errors, and Duration.

  • How can Open Telemetry be integrated into a GraphQL service?

    -OpenTelemetry can be integrated by instrumenting the GraphQL service to get distributed traces. This involves using the implementation specific to the technology stack, such as the Node.js instrumentation, and exporting spans to the OpenTelemetry Collector.
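As a rough sketch of what such a bootstrap file can look like in Node.js (the package names follow the official `@opentelemetry/*` libraries, but the service name, exporter URL, and exact options here are assumptions, not taken from the talk):

```javascript
// trace.js — load before the app, e.g. `node -r ./trace.js server.js`
const { NodeSDK } = require('@opentelemetry/sdk-node');
const { OTLPTraceExporter } = require('@opentelemetry/exporter-trace-otlp-http');
const { HttpInstrumentation } = require('@opentelemetry/instrumentation-http');
const { GraphQLInstrumentation } = require('@opentelemetry/instrumentation-graphql');

const sdk = new NodeSDK({
  serviceName: 'travel-graphql-service', // assumed name
  // Export spans to a local OpenTelemetry Collector over OTLP/HTTP.
  traceExporter: new OTLPTraceExporter({ url: 'http://localhost:4318/v1/traces' }),
  // HTTP instrumentation captures inbound/outbound requests; the GraphQL
  // instrumentation adds spans for parse, validate, and resolver execution.
  instrumentations: [new HttpInstrumentation(), new GraphQLInstrumentation()],
});
sdk.start();
```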

  • What are some common performance issues with GraphQL?

    -Some common performance issues with GraphQL include the N+1 problem, where multiple requests are made for each resource in a response, leading to inefficient data fetching and high latency.

  • What is the significance of the 'graphql.error.message' attribute added manually to the GraphQL service?

    -The 'graphql.error.message' attribute is significant as it allows for capturing specific GraphQL errors within the spans, providing more detailed information for troubleshooting issues that are not indicated by HTTP status codes.
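A minimal sketch of that manual instrumentation (the attribute key `graphql.error.message` comes from the talk; the helper function and the fake-span object below are illustrative — in a real service the span would come from `trace.getActiveSpan()` in `@opentelemetry/api`):

```javascript
// Attach a GraphQL error to a span as a custom attribute, since the
// HTTP layer will still report 200 OK even when the query failed.
function recordGraphQLError(span, graphqlError) {
  if (span && typeof span.setAttribute === 'function') {
    span.setAttribute('graphql.error.message', graphqlError.message);
  }
  return graphqlError; // pass through, e.g. from a formatError hook
}

// Tiny stand-in span for demonstration; a real span comes from the OTel API.
const fakeSpan = {
  attributes: {},
  setAttribute(key, value) { this.attributes[key] = value; },
};
recordGraphQLError(fakeSpan, new Error('Cannot read properties of null'));
```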

  • How can the N+1 problem be detected and addressed using OpenTelemetry?

    -The N+1 problem can be detected by monitoring the number of outgoing requests for a GraphQL query. If the number is unusually high, it may indicate an N+1 issue. OpenTelemetry can help identify this by providing detailed tracing information, and it can be addressed by implementing data loaders or other optimization techniques.
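The data-loader idea can be sketched in a few lines of plain Node.js. This is the batching concept behind libraries like the `dataloader` package, not the package itself; the class and names are made up for illustration:

```javascript
// Individual .load() calls made in the same tick are queued and then
// resolved with ONE call to batchFn, instead of one backend query each.
class TinyLoader {
  constructor(batchFn) {
    this.batchFn = batchFn; // async (keys) => values, same order as keys
    this.queue = [];
  }
  load(key) {
    return new Promise((resolve) => {
      if (this.queue.length === 0) {
        // Schedule a single flush once the current tick's loads are queued.
        process.nextTick(() => this.flush());
      }
      this.queue.push({ key, resolve });
    });
  }
  async flush() {
    const batch = this.queue.splice(0);
    const results = await this.batchFn(batch.map((item) => item.key));
    batch.forEach((item, i) => item.resolve(results[i]));
  }
}
```

With one loader per request, N resolver calls that would each hit the database collapse into a single batched query, which is exactly the shape change you would look for in the trace.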

  • What resources does the speaker provide for further learning about OpenTelemetry and API observability?

    -The speaker provides courses on OpenTelemetry, API observability, and API platforms, and encourages attendees to connect with them on LinkedIn for more information.

Outlines

00:00

πŸ“ Introduction to GraphQL Challenges and OpenTelemetry

The speaker introduces the session by expressing anticipation for a lively and engaged audience, emphasizing that despite being the last session, the topic is important and promises a rewarding journey. The session focuses on potential issues with GraphQL queries and how OpenTelemetry can be a solution. The speaker clarifies that they are not promoting GraphQL but addressing challenges that developers might face, especially those already using or considering GraphQL. The speaker's background and role at Tyk, a cloud-native API management platform, are shared, along with their involvement in the open API initiative. The session aims to explore both development and operational aspects of deploying a GraphQL application to production, highlighting the importance of stability and reliability.

05:02

🌐 Understanding GraphQL and Its Implementation

This paragraph delves into the specifics of GraphQL as a query language for APIs and its runtime for fulfilling queries with existing data. It emphasizes the ease of describing data and the flexibility it offers to developers to request specific data needed for their applications. The speaker contrasts GraphQL with REST, highlighting the issues of over-fetching and under-fetching that GraphQL can address. An example of a GraphQL query is given to illustrate how the request shape mirrors the response, making the data retrieval process more predictable. The paragraph also sets the context for a sample travel application that uses GraphQL, discussing the components involved in the server-side and the types of applications that can benefit from GraphQL, especially those with multiple clients or needs.

10:03

πŸ› οΈ Utilizing OpenTelemetry for GraphQL Monitoring

The speaker introduces the RED method, a monitoring strategy for gaining insights into the health and performance of distributed systems, and explains how it can be applied to monitor GraphQL in production. OpenTelemetry is presented as a tool for instrumenting the GraphQL service to capture distributed traces. The process of setting up OpenTelemetry with the travel application, including the use of the trace.js file and exporting spans to the OpenTelemetry collector, is detailed. The integration of RED metrics in Jager is discussed, including the use of Prometheus to store and display metrics, ultimately providing a dashboard for monitoring request rates, error rates, and durations for the GraphQL service.
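The spanmetrics-based pipeline described here can be sketched as an OpenTelemetry Collector configuration. This is an approximation only: the connector name follows the Collector's spanmetrics connector, but the ports, exporter names, and Jaeger wiring are assumptions that vary by Collector version and deployment:

```yaml
receivers:
  otlp:
    protocols:
      http: {}            # the Node.js service exports OTLP spans here

connectors:
  spanmetrics: {}         # derives call counts and duration histograms from spans

exporters:
  otlp/jaeger:
    endpoint: jaeger:4317 # assumed Jaeger OTLP endpoint
    tls:
      insecure: true
  prometheus:
    endpoint: 0.0.0.0:8889 # scraped by Prometheus, queried by Jaeger's Monitor tab

service:
  pipelines:
    traces:
      receivers: [otlp]
      exporters: [otlp/jaeger, spanmetrics]
    metrics:
      receivers: [spanmetrics]
      exporters: [prometheus]
```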

15:05

🚑 Troubleshooting GraphQL with OpenTelemetry

This paragraph discusses the troubleshooting process for GraphQL using OpenTelemetry, starting with identifying upstream errors and resolver errors. The speaker provides a scenario where an error in the Weather Service leads to a 500 HTTP status code in the GraphQL service. The use of Jaeger's dashboard to identify and address the error rate increase is explained. The paragraph also touches on the challenge of GraphQL's HTTP status code always being 200, even when errors occur, and how this can be mitigated by adding a 'graphql.error.message' attribute for better error tracking. The importance of not relying solely on the HTTP status code for assessing the health of a GraphQL service is highlighted.

πŸ” Addressing GraphQL Performance Issues with OpenTelemetry

The final paragraph addresses the performance challenges of GraphQL, specifically the 'N+1' problem, where a single query leads to multiple subsequent calls for each resource, causing performance bottlenecks. The speaker explains how OpenTelemetry can help detect this issue by analyzing the number of outgoing requests per GraphQL query. The use of data loaders as a solution to the 'N+1' problem is mentioned. The paragraph concludes with a call to action for the audience to contribute to the development of semantic conventions for GraphQL within OpenTelemetry and to consider the limitations of current instrumentation practices. The speaker wraps up the presentation with a reminder of the resources available and an invitation for further discussion.

Keywords

💡GraphQL

GraphQL is a query language for APIs and a runtime for fulfilling those queries with existing data. It allows clients to request exactly the data they need, making it particularly useful for omni-channel applications. In the video, the speaker discusses the advantages and challenges of using GraphQL, including its ability to address issues like over-fetching and under-fetching that are common with REST APIs.

💡OpenTelemetry

OpenTelemetry is an observability framework for cloud-native software, providing a set of APIs, libraries, agents, and instrumentation to help with the observation of distributed systems. The speaker mentions OpenTelemetry as a solution to monitor and troubleshoot GraphQL APIs in production, highlighting its ability to provide distributed tracing and metrics collection.

💡RED Method

The RED Method is a monitoring strategy for gaining insight into the health and performance of distributed systems. The acronym stands for Rate, Errors, and Duration. The speaker uses the RED Method as a framework for discussing how to monitor GraphQL services, emphasizing the importance of these metrics in understanding service performance.

💡API Gateway

An API Gateway is a server that acts as an entry point into a system of microservices, handling requests and routing them appropriately. In the context of the video, the speaker mentions a GraphQL-aware API Gateway as beneficial for observability, acting as both a mediator for security protocols and a facilitator for governance practices.

💡Over-fetching and Under-fetching

Over-fetching refers to the scenario where an API returns more data than the client needs, potentially causing unnecessary network load. Under-fetching occurs when a client must make multiple requests to retrieve all the necessary data. The speaker discusses these concepts as problems that GraphQL can help solve, in contrast to REST APIs.

💡N+1 Problem

The N+1 problem is a common performance issue in GraphQL where a single query results in multiple subsequent calls to fetch additional data for each item in a list. This can lead to significant performance degradation. The speaker uses this term to illustrate a typical performance challenge with GraphQL and how OpenTelemetry can help identify and address it.

💡Data Loaders

Data Loaders are a technique used to batch and cache requests to a data store, helping to avoid the N+1 problem in GraphQL. The speaker mentions Data Loaders as a solution to the N+1 problem, indicating their importance in optimizing GraphQL API performance.

💡Semantic Conventions

Semantic Conventions in OpenTelemetry define a set of guidelines for how to instrument code to ensure consistency in the telemetry data collected. The speaker points out that the current semantic conventions for GraphQL are limited and may not always provide the specific information needed to troubleshoot issues effectively.

💡Tracing

Tracing in the context of distributed systems refers to the tracking of requests as they travel through various services and components. The speaker discusses the use of OpenTelemetry for tracing in GraphQL services, allowing for the visualization of the path a request takes and the identification of performance bottlenecks or errors.

💡Prometheus

Prometheus is an open-source systems monitoring and alerting toolkit that is often used for recording real-time metrics. In the video, the speaker mentions Prometheus as the storage for the metrics generated by OpenTelemetry, which can then be visualized and monitored to gain insights into the performance and health of GraphQL services.

💡HTTP Status Codes

HTTP Status Codes are standardized responses returned by a server to indicate the status of a request. The speaker notes that GraphQL often returns a 200 OK status code even when there are errors in the query, which can make it difficult to identify issues without looking deeper into the response body.
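For illustration, a GraphQL response can carry HTTP 200 while the body reports a failure. Per the GraphQL specification, errors appear in a top-level `errors` array alongside partial `data`; the message text and path below are invented:

```json
{
  "data": { "country": null },
  "errors": [
    {
      "message": "Weather lookup failed for country: IT",
      "locations": [{ "line": 3, "column": 5 }],
      "path": ["country", "weather"]
    }
  ]
}
```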

Highlights

The session discusses potential issues with GraphQL queries and how OpenTelemetry can assist in addressing them.

The speaker, Buddha, introduces himself as a Developer Advocate at Tyk, a cloud-native API management platform.

Buddha shares that this was his first time speaking in Seattle and mentions his involvement with the OpenAPI Initiative's Business Governance Board.

The importance of GraphQL in omni-channel applications and its advantages over REST for specific challenges like over-fetching and under-fetching are highlighted.

An explanation of GraphQL as a query language for APIs, and how it allows clients to request exactly the data they need, is provided.

The speaker uses a travel application as an example to demonstrate the practical use of GraphQL and its components.

The benefits of a GraphQL-aware API Gateway for observability, and its role as a mediator between security protocols and governance practices, are discussed.

The RED method for monitoring distributed systems is introduced, focusing on rate, error, and duration metrics.

The process of instrumenting a GraphQL service with OpenTelemetry for distributed tracing is outlined.

Jaeger's integration with OpenTelemetry for visualizing traces and generating RED metrics is explained.

The challenges of troubleshooting GraphQL in production, especially when HTTP status codes do not reflect the actual state of the request, are covered.

The addition of a custom 'graphql.error.message' attribute to improve error visibility in OpenTelemetry traces is demonstrated.

The 'N+1' problem in GraphQL and its performance implications are discussed, along with using OpenTelemetry to detect such issues.

The need for granular performance profiling in GraphQL, due to the flexible nature of queries and the challenges it poses, is highlighted.

The limitations of semantic conventions for GraphQL and the importance of manual instrumentation for better error tracking are covered.

The speaker concludes by summarizing the benefits of OpenTelemetry for both developers and operations teams in enhancing GraphQL API reliability.

Buddha offers additional resources and courses on OpenTelemetry, API observability, and API platforms for further learning.

Transcripts

play00:00

we're starting a couple of minutes early

play00:01

because this is after all the last

play00:03

session of the day so um I am expecting

play00:06

the loudest Applause at the end of it

play00:08

but hopefully you'll be with me on this

play00:10

journey over the next 15 to 20 minutes

play00:12

so thank you everyone for being here um

play00:15

this has been a great open tary

play00:16

Community this was my first time

play00:18

speaking in Seattle uh and meeting a lot

play00:21

of the folks that I've seen over

play00:23

LinkedIn or YouTube videos or guub

play00:25

issues in person which is great um so

play00:28

with all that being said um I'm here to

play00:30

talk a little bit about what could go

play00:32

wrong with graphel queries and can open

play00:36

Telemetry help us um I assure you it's

play00:39

not going to be a bummer off a topic at

play00:40

all even though the name might suggest

play00:42

it that way um but um if murph's law

play00:46

suggests that uh whatever can go wrong

play00:49

would potentially go wrong or whatever

play00:50

can happen will happen so let's just be

play00:52

prepared uh when that happens I'm not

play00:54

here I'm not going to be evangelizing

play00:55

graph Cur as a technology so just as a

play00:57

caveat to this uh I'm not trying to sell

play00:59

you on using graphql but if you are

play01:01

already sold onto that or considering

play01:03

doing it then uh there are a couple of

play01:05

things and challenges that you might

play01:06

encounter and we'll go through some of

play01:08

those uh and hopefully we'll see how

play01:10

open Telemetry um can solve or address

play01:13

some of those things either out of the

play01:15

box or potentially with certain manual

play01:17

instrumentations associated with it so

play01:20

with that being said I am Buddha I am a

play01:22

developer Advocate at Ty Ty for those

play01:24

who do not know is a cloud native API

play01:26

management platform powered by and uh an

play01:29

open source API Gateway we are otel

play01:32

native as well as um we are graphical

play01:34

aware as an API Gateway um and I can

play01:37

tell you all about that later on but the

play01:39

idea is we kind of know a little bit

play01:40

more about both of these different

play01:42

worlds some of the challenges some of

play01:43

the solutions associated with it um I'm

play01:46

originally from India um lived most of

play01:48

my life in Singapore currently living in

play01:50

Durham North Carolina big big fan of

play01:52

horror movies but I like my horror on

play01:54

screen and in literature not so much in

play01:56

software as you would see with craft Q

play01:58

today um

play02:00

additionally I'm the chairperson of the

play02:02

open API um open API initiatives

play02:05

business governance board in case you're

play02:07

working in there as

play02:08

well so um just to set the scene for

play02:11

today um I'm looking at two sides of

play02:14

when and how you are sort of promoting

play02:16

or or or deploying an application in

play02:19

this case a graph application to

play02:21

production uh on one side obviously is

play02:23

the development side of things where

play02:24

you're looking at the different

play02:25

capabilities your business logic of how

play02:27

you're building out your graphical

play02:28

application on the other side is a

play02:30

little bit to do with your operations

play02:32

how do you make sure that things are

play02:33

stable they are deployed reliably and

play02:36

things don't go wrong and when they do

play02:37

you at least have a way to actually

play02:39

address those things so I'm going to be

play02:40

looking at both of those different sides

play02:42

and hopefully help you get to that next

play02:44

step in that Journey so without before

play02:47

moving forward uh just a quick question

play02:49

anyone here who's actually worked with

play02:51

graphql before or is working with

play02:53

graphql at the moment considering

play02:55

working with graphql at the

play02:56

moment okay all right once again I'm not

play02:59

going to be trying to convert the others

play03:01

who did not raise their hands over here

play03:02

but hopefully address some of the

play03:03

challenges that you might be facing here

play03:06

um so for those who do not know graphql

play03:08

is um it's a query language for apis and

play03:12

um a runtime for fulfilling those

play03:15

queries with your existing data it

play03:17

usually gives you a very very easy

play03:18

hopefully straightforward way to

play03:20

describe your data and then let or

play03:23

enables your client side or the DU

play03:25

developers to actually ask for what they

play03:27

need for their application and then

play03:29

hopefully give those results in a more

play03:30

predictable way the key difference or

play03:32

the key value of actually having graph

play03:34

as a technology is usually seen uh in

play03:36

Omni Channel application especially when

play03:38

you have to cater to different needs of

play03:40

the of the consuming application this

play03:43

was kind of the main reason why it was

play03:45

for created in the first place at

play03:47

Facebook um there are a couple of other

play03:49

things that again when we talk about

play03:51

grafal we we have to talk about rest as

play03:53

well and I think again rest as an API

play03:55

technology is usually a very very robust

play03:59

uh way of dealing with apis and API

play04:01

Styles but with some of these uh with

play04:03

graphel we saw we noticed that there

play04:05

were a couple of challenges that the

play04:07

rest was kind of unable to solve um

play04:10

namely over fetching and under fetching

play04:12

in this case over fetching typically

play04:13

refers to um returning or getting back a

play04:16

lot of information more than what you

play04:18

might need and then enabling your or

play04:20

letting the heavy lifting being done in

play04:22

the front end whereas with under

play04:24

fetching you need to have multiple

play04:25

Cycles or multiple queries to be get to

play04:28

get back um the specific kind of

play04:30

information that you need in some cases

play04:32

this might be desirable but in most

play04:33

cases that adds to the overall Network

play04:35

cost or your payload costs moving on

play04:38

just to give you a little bit of an

play04:39

example of what graph qu looks like in

play04:41

practice there is a schema just kind of

play04:43

the blueprint of your overall graph API

play04:47

um then the operations usually is

play04:50

queries are the most typical

play04:51

applications uh operations that you do

play04:53

with graphql um you also have mutations

play04:56

and subscriptions but you're not going

play04:57

into those but the key thing to notice

play04:59

over here is that the shape of the

play05:01

response tends to mirror the shape of

play05:03

the request which makes things a lot

play05:05

more predictable in the way you um

play05:08

receive or get information and again you

play05:10

can request specific pieces of

play05:12

information out of the schema as opposed

play05:14

to getting back every single thing that

play05:16

has been enabled by the API producers in

play05:18

this case so with that being said um I'm

play05:21

going to set the scene again today for

play05:23

the application that we're going to be

play05:24

using as a sample in this case this is a

play05:26

travel application the server side

play05:28

includes a couple of different

play05:29

components there are a few different

play05:31

services that you see there are a couple

play05:32

of rest apis that are being called as

play05:35

part of the graphql application in some

play05:38

cases you can go directly building out

play05:39

your own schema in some cases you

play05:41

actually connect with existing data

play05:43

points as well both of these completely

play05:45

valid uh the application itself is no JS

play05:49

application even though some might say

play05:50

JavaScript is not a language but sure

play05:52

we'll we'll we'll talk about that later

play05:54

on um in the simplest use case of a

play05:57

graphql application you would you would

play05:59

have a react app or a front end a single

play06:01

front end that is making this call uh in

play06:03

which case graph C may not be the right

play06:05

way because it might be a little bit

play06:07

going over board whereas a more typical

play06:09

applications would be that you have

play06:11

multiple applications or multiple

play06:12

clients making these requests you might

play06:14

have different apps that could be

play06:16

internal facing could be partner facing

play06:18

could be an API Marketplace of its own

play06:20

that would that you would need to

play06:22

consider so with that again definitely

play06:25

one of the components that you do need

play06:26

to consider is an API Gateway typically

play06:29

um and again I'm not just saying it

play06:30

because I work in this space but a

play06:32

graphel aware mature graphel aware API

play06:35

Gateway adds a whole range of benefits

play06:37

again when you're working with

play06:38

observability specifically because it

play06:40

acts as the name suggest as a Gateway

play06:42

into your application but also as a

play06:44

mediator between the different uh

play06:47

security protocols and measures that you

play06:48

might be integrating with as well as

play06:50

different governance practices out there

play06:52

and when you're thinking about end

play06:53

tracing or endtoend observability

play06:55

ideally you need to consider every

play06:57

single component of your stack

play07:00

now how do you monitor graphql in

play07:02

production and one of the ways of doing

play07:04

that is to apply the red method for

play07:05

those who are not familiar with the red

play07:07

method the red it is a monitoring

play07:09

strategy that um that is used to gain

play07:11

insight into the health and performance

play07:14

of distributed systems um typically the

play07:17

red stands for the rate errors and

play07:19

duration and again if you're familiar

play07:21

with it that's great based on these

play07:23

metrics you can understand how good your

play07:25

service is doing and uh set up your slos

play07:29

according accordingly now to think about

play07:31

how we can start adding open Telemetry

play07:33

to the travel app that we just saw the

play07:36

first step is to instrument your graphql

play07:38

service with open Telemetry to get

play07:39

distributed traces um and again now

play07:43

there are there are different

play07:43

implementations of graphql available in

play07:45

the market in this case we've gone

play07:47

specific to the nodejs instrumentation

play07:49

in case you're looking for one that is

play07:51

specific to your application U you can

play07:53

always always go to the open Telemetry

play07:55

website and under ecosystems is search

play07:57

for um the instrument mation

play07:59

requirements there and nothing we found

play08:01

one for nodejs at least in this case uh

play08:04

moving right along so we use the trace

play08:06

JS file uh to instrument our service

play08:08

with open Telemetry and this is how we

play08:11

add the graph Cel instrumentation here

play08:13

you'll also notice that we are exporting

play08:15

the spans to the open Telemetry

play08:17

collector uh and we see the result uh

play08:20

that we have we have got endtoend

play08:21

distributor traces in jger um and we can

play08:25

see Ty API Gateway starting off the

play08:27

trace as an entry point um for the

play08:29

transaction and then reporting some

play08:31

spans then the graph Feld Services take

play08:33

over afterwards and then it goes into

play08:35

sort of the underlying rest apis the

play08:38

Upstream apis to go with it so now that

play08:41

we've got all the setup done let's move

play08:43

forward into how you're going to be

play08:45

getting the red metrics integrated

play08:47

Jagger already has some of these outof

play08:49

the boox Integrations available uh it

play08:52

uses a component in The oel Collector

play08:54

called the span metrics connector to

play08:56

generate these metrics based on the

play08:58

spans um and the span metric selector

play09:00

creates two metrics based on uh the span

play09:03

itself the calls total and um the

play09:06

latency I believe a latency count so

play09:09

those metrics are stored in Prometheus

play09:11

and Jer and will connect to Prometheus

play09:13

to display these uh

play09:15

metrics uh finally this is kind of the

play09:17

dashboard you finally have a look at

play09:18

this uh in the monitor tab you can now

play09:20

see the request rate you can see the

play09:22

error rate um and the duration for your

play09:25

graphql service so we're all all good to

play09:27

go now let's look at some of the

play09:29

possible errors that might happen we'll

play09:31

be looking at two of them one of them is

play09:33

the Upstream errors and the other one is

play09:35

a resolver error um so the first one as

play09:39

you can see here I'm trying to uh send a

play09:41

request to to get information about um

play09:44

the country Italy in this case um along

play09:47

with its weather data but um there seems

play09:50

to be something that's gone wrong as we

play09:52

can see from the message over there

play09:53

we're not getting the right response so

play09:55

now let's look at how we can start

play09:56

troubleshooting this so if you look go

play09:59

straight into the dashboard uh the Jaga

play10:01

dashboard you will already see that

play10:02

there is an increase in the error rate

play10:05

um and I can then we can as a next step

play10:08

of this we can start looking at the

play10:09

traces and I can find within the Traces

play10:12

by filtering them in Jer using the error

play10:14

tag uh you'll be able to see that a the

play10:18

the the you can actually see that you

play10:20

can um the graph service itself is

play10:23

giving a 500 HTTP status code and uh

play10:27

that is because it's a consequence of uh

play10:29

The Weather Service itself returning a

play10:31

400 ER the external Upstream Weather

play10:33

Service that we are connected to so in

play10:36

this example I can get all the

play10:37

information I need from open Telemetry

play10:40

and I can check out uh which query is

play10:43

having this issue as well if you wanted

play10:44

to go a little bit deeper a little bit

play10:45

more specific so done resolved fixed

play10:49

that issue and now we're all back and

play10:51

back and happy the final one we go into

The final one, the resolver issue, is a little more unique, so we'll go into it in a bit more detail. Again we see an error being surfaced, but in this case the dashboard isn't really giving us a hint as to what's going on, and we can't find anything in the traces either; it looks like everything is fine, so we're not really able to reproduce this at this stage. There is a reason for that. For those familiar with GraphQL, a big challenge is that at the HTTP status code level, GraphQL typically tends to give back a 200 response even when things are going wrong, and that has its own challenges. Even when things are not going well, GraphQL can be the perfect optimist, pretty much turning a blind eye to what's going on underneath. You don't want that; you obviously want to get to the heart of the issue. As you can see, the response body is where the errors actually come up: there is a dedicated errors object with details such as the message and the location. But how do we catch this as an error?
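To make that concrete, here is a minimal sketch (with a hypothetical response shape) of why HTTP-level checks miss GraphQL failures: the status is 200, and the failure only lives in the body's errors array.

```javascript
// A GraphQL server typically answers HTTP 200 even when a resolver fails.
// The failure is only visible inside the response body's "errors" array.
// Hypothetical response shape, for illustration only:
const response = {
  data: { weather: null },
  errors: [
    {
      message: "Upstream weather service unavailable",
      locations: [{ line: 2, column: 3 }],
      path: ["weather"],
    },
  ],
};

// Naive HTTP-level monitoring would call this request a success;
// you have to inspect the body to know something went wrong.
function hasGraphQLErrors(body) {
  return Array.isArray(body.errors) && body.errors.length > 0;
}

console.log(hasGraphQLErrors(response)); // true
```

Any monitoring that keys only on status codes would report this request as healthy, which is exactly the blind spot described above.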

play12:08

Let's dive a little into the semantic conventions associated with GraphQL at this point in time. They look a little limited: there are a couple of options available, but nothing like a GraphQL error attribute, so they don't give me the specific conventions I would need to actually get to the heart of this issue. So let's just add our own attribute, which we're going to call graphql.error.message. With some manual instrumentation I've added that to my code, and now I can start seeing the errors being recorded on my spans.
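A minimal sketch of that manual instrumentation step, under a few assumptions: the helper name is made up, the span is whatever the OpenTelemetry API hands you (for example via trace.getActiveSpan() from @opentelemetry/api), and graphql.error.message is our own naming choice, not an official semantic convention.

```javascript
// Copy body-level GraphQL errors onto a span under a custom attribute.
// "graphql.error.message" is our own naming choice, not part of the
// official OpenTelemetry semantic conventions.
function recordGraphQLErrors(span, result) {
  if (!span || !Array.isArray(result.errors) || result.errors.length === 0) {
    return result; // nothing to record
  }
  span.setAttribute(
    'graphql.error.message',
    result.errors.map((e) => e.message).join('; ')
  );
  // 2 is SpanStatusCode.ERROR in @opentelemetry/api, so backends like
  // Jaeger will let you filter these traces via the error tag.
  span.setStatus({ code: 2 });
  return result;
}
```

Hooking this into the response path of the GraphQL server is what makes the body-level errors visible on the spans, even though the HTTP status stays 200.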

play12:46

One thing you will still notice is that the entry point is still giving me a 200, because that doesn't change, but we are getting a bit more information now. In particular, if you go into Prometheus, you can see that error rates are being reported based on the manual instrumentation. That's just something to keep in mind when you're working with GraphQL: an HTTP status 200 is not always an indicator that everything is fine.

play13:11

And then, finally, a little bit about performance. Another challenge with GraphQL is that the endpoint of a GraphQL API is typically just /graphql, or something to that effect. The actual requests going through live at a layer below, in the form of queries: you can keep calling that same endpoint while requesting different forms of data. That is very flexible, because the requester can ask for the data in whatever shape they want, and that shape can be completely different for every single request they make. It obviously poses a bit of a challenge, though: we could have multiple clients consuming this API, each in their own way, so how do you profile performance when what's right for a specific query from a specific client may not be right for the others? The P95 value on the dashboard is an indicator of some things, but it's really not enough, because it's just an average overall error rate or latency rate, which doesn't give you the full picture. We need to go a little more granular.
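One way to get more granular, sketched here by hand rather than with any particular metrics library, is to key latency measurements by GraphQL operation name instead of by URL path, so each query shape gets its own P95. With OpenTelemetry you would typically attach the operation name as a span or metric attribute instead of rolling your own store like this.

```javascript
// Track latencies per GraphQL operation name rather than per URL path,
// since every query hits the same /graphql endpoint.
const latenciesByOperation = new Map();

function recordLatency(operationName, millis) {
  if (!latenciesByOperation.has(operationName)) {
    latenciesByOperation.set(operationName, []);
  }
  latenciesByOperation.get(operationName).push(millis);
}

// P95 per operation, so a slow query from one client no longer hides
// behind the aggregate latency of the whole endpoint.
function p95(operationName) {
  const xs = [...latenciesByOperation.get(operationName)].sort((a, b) => a - b);
  return xs[Math.min(xs.length - 1, Math.floor(0.95 * xs.length))];
}
```

The same per-operation split is what lets a dashboard show that one client's query is slow while every other query against /graphql is healthy.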

play14:36

There are some typical performance issues that can be seen with GraphQL. We're not going to go into each one of them, given the few minutes left, so I'll just talk about the first one, but I'm happy to discuss the remaining ones with you directly if you'd like to chat. The most typical performance issue you see here is the N+1 issue.

play14:56

If you're not familiar with it, it basically means that when you're querying some data, the first response gets you back a set of resources, and then, as subsequent calls, you go into each one of those resources and make a query for each of them, hence the N+1 problem. It is fixable by using things like data loaders, but it can also be quite easily overlooked, and you don't want that, because it has huge performance implications for GraphQL APIs.
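The data-loader fix mentioned above boils down to batching: resolve all keys in one upstream call instead of one call per item. This is a hand-rolled sketch with a stand-in db object; in a real Node.js GraphQL server you would more likely reach for a library such as dataloader.

```javascript
// Stand-in for the upstream service; it counts round trips so the
// difference between the two resolver styles is visible.
const db = {
  calls: 0,
  getCountries(ids) {
    this.calls += 1; // one upstream round trip per invocation
    return Promise.all(ids.map((id) => ({ id, name: `country-${id}` })));
  },
};

// N+1: one query returns the list of IDs, then one upstream call is
// made per country, so N items cost N extra round trips.
async function naiveResolve(countryIds) {
  const out = [];
  for (const id of countryIds) {
    out.push((await db.getCountries([id]))[0]);
  }
  return out;
}

// Batched: collect all keys first and resolve them in a single
// upstream call, which is what a data loader does under the hood.
async function batchedResolve(countryIds) {
  return db.getCountries(countryIds);
}
```

Running naiveResolve over three IDs makes three upstream calls; batchedResolve makes one, which is exactly the drop in fan-out you would look for in the traces.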

play15:31

So how do you solve that? There's actually a fairly straightforward way. In this example you can see that a very simple query has gone into quite a cycle, with multiple continents and countries coming back in the response. With OpenTelemetry you can detect this N+1 query problem nicely: in this particular example, if you look at the Jaeger dashboard, the trace view shows that one query has led to 27 HTTP GET calls, and that's a typical indicator that something is going wrong, in this case most likely N+1. You can get that number in Prometheus too, if you wanted to look at it there: if the average number of outgoing requests per GraphQL query is that high, it typically means something is wrong, and in this case probably an N+1 problem. You can set alerts in your test or production environment to find out when this is happening.
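Such an alert could look like the following Prometheus rule. This is a sketch only: the metric names (outgoing_http_requests_total, graphql_queries_total) and the fan-out threshold are assumptions you would replace with whatever your instrumentation actually exports.

```yaml
# Hypothetical alerting rule: fire when the average number of outgoing
# HTTP calls per GraphQL query stays suspiciously high, a typical
# symptom of an N+1 resolver.
groups:
  - name: graphql
    rules:
      - alert: PossibleGraphQLNPlusOne
        expr: |
          rate(outgoing_http_requests_total{service="graphql"}[5m])
            / rate(graphql_queries_total[5m]) > 10
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "High fan-out per GraphQL query (possible N+1)"
```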

play16:29

Then, finally, the last few steps. We've understood a little more about what's going on, and we've seen that OpenTelemetry is genuinely useful as a way to troubleshoot your GraphQL APIs. But there are still certain things to consider: the semantic conventions are still a little limited and need to get more specific about GraphQL, and the instrumentation providers or vendors may not always respect the common semantic conventions and may have their own implementations, which poses its own challenge. It's a work in progress, though. We've started contributing a little more in this area and have opened an issue, so feel free to comment and add your own considerations and contributions. Hopefully this takes the shape of something a little more formal over the next few months and years.

play17:26

It looks like everyone's happy at this point. Hopefully we've managed to address some of the key challenges around GraphQL with OpenTelemetry in production, so that both sides of the party, the developers as well as the operations side, are a bit happier, with a little more reliability baked into their system. So that's it, the final talk of the day, hopefully with five minutes left. That's me, and on the right-hand side there are a couple of resources: I've created a couple of courses around OpenTelemetry, API observability, and API platforms, so feel free to check those out, or connect with me on LinkedIn if you so desire. With that I come to the end of the presentation, and hopefully of all presentations for the day, so thank you so much, appreciate it.

