OpenTelemetry for Mobile Apps: Challenges and Opportunities in Data Mob... Andrew Tunall & Hanson Ho
Summary
TLDR: The presentation by Andrew Tunall and Hanson Ho from Embrace focuses on the unique challenges and opportunities of implementing OpenTelemetry for mobile apps. With 72% of digital transactions occurring on mobile devices, the need for robust observability is clear. However, mobile apps present unique issues such as variable network connectivity, diverse device types, and user-perceived performance. The speakers discuss the limitations of current proprietary systems and the need for a more unified approach, highlighting the importance of community involvement and adaptation of OpenTelemetry to suit the mobile environment.
Takeaways
- 📱 Mobile apps are increasingly important for digital transactions, with 72% happening on mobile devices, emphasizing the need for observability in this space.
- 🔍 OpenTelemetry is being adapted for mobile apps, aiming to provide a lingua franca for observability across backend, frontend, and mobile platforms.
- 🚀 Hanson, a former mobile performance engineer at Twitter, highlights the unique challenges of mobile observability, including variable network connectivity and a vast array of device types.
- 🛠 The current state of mobile observability is often limited to proprietary systems, which can be basic and not fully integrated with backend systems.
- 📉 Mobile apps have different performance indicators from backend systems: on mobile, user-perceived performance is a critical component of service level objectives (SLOs).
- 🔄 The data pipeline for telemetry from mobile devices is fragile, with potential data loss at various stages, impacting the reliability of observability.
- 📊 OpenTelemetry (OTel) was designed with backend-centric assumptions that do not always apply to mobile, requiring adaptation for effective mobile observability.
- 🔧 There is a need for resilience in mobile telemetry tools, such as buffering data to disk before transmission and handling incomplete data sets that reach the server.
- 🤖 Mobile developers may not be as familiar with concepts like tracing and context propagation, requiring more accessible APIs for effective instrumentation.
- 📈 Metrics from mobile devices need proper context and baselines for meaningful analysis, as the sheer diversity of devices and runtime environments complicates direct comparisons.
- 💡 The call to action is for more participation and questions from the mobile ecosystem community to help evolve OpenTelemetry to better suit mobile app observability needs.
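To illustrate the takeaway about metrics needing context and baselines, here is a minimal Python sketch. The device classes and heap sizes are invented for illustration and do not come from the talk. It only shows that a pooled p75 across very different devices can sit far from the p75 of any single device class, so the pooled number alone says little about whether a change is good or bad.

```python
from collections import defaultdict
from statistics import quantiles

# Hypothetical samples: (device_class, heap_size_mb). Values are made up.
SAMPLES = [
    ("low_end", 38), ("low_end", 42), ("low_end", 45), ("low_end", 48),
    ("high_end", 90), ("high_end", 110), ("high_end", 120), ("high_end", 130),
]


def p75(values):
    """75th percentile, using the inclusive method of statistics.quantiles."""
    return quantiles(values, n=4, method="inclusive")[2]


def p75_by_group(samples):
    """Group samples by device class and compute p75 within each group."""
    groups = defaultdict(list)
    for device_class, heap_mb in samples:
        groups[device_class].append(heap_mb)
    return {name: p75(vals) for name, vals in groups.items()}


pooled = p75([heap_mb for _, heap_mb in SAMPLES])
per_class = p75_by_group(SAMPLES)

print("pooled p75:", pooled)
for name, value in per_class.items():
    print(f"{name} p75:", value)
```

With these made-up numbers the pooled value lands between the two groups and describes neither of them well, which is the argument for attaching device context before aggregating.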
Q & A
What is the main topic of the presentation by Andrew and Hanson from Embrace?
-The main topic of the presentation is OpenTelemetry for mobile apps, discussing its importance and the unique challenges faced in mobile observability.
What percentage of digital transactions in the previous year happened on a mobile device according to the script?
-72% of digital transactions in the previous year happened on a mobile device.
What are some of the unique challenges that mobile apps face in terms of observability compared to backend systems?
-Mobile apps face challenges such as dodgy network connectivity, massive cardinality of device types, different regions, OS versions, app versions, and the need to interact with backend distributed systems.
What is the issue with using proprietary systems for mobile observability and monitoring?
-Proprietary systems often provide limited observability features like crash reporting and error tracking, which may not be sufficient for serious app developers who need more comprehensive data.
Why is user-perceived performance an important aspect to consider in mobile app observability?
-User-perceived performance is important because it directly impacts the user experience and can lead to churn if users find the app consistently slow or unresponsive.
What does Hanson suggest is a fundamental problem with applying backend-centric observability tools to mobile platforms?
-Hanson suggests that the fundamental problem is that these tools make basic assumptions and ask questions that do not align with the unique conditions and requirements of mobile environments.
What is the significance of OpenTelemetry (OTel) in the observability ecosystem?
-OpenTelemetry is significant as it serves as a lingua franca for observability, allowing different parts of a system to communicate using the same language and terms.
Why might traditional spans not be suitable for measuring certain types of performance in mobile apps?
-Traditional spans may not be suitable because they are designed for backend-centric tracing, and mobile apps require a different approach that considers user sessions, network existence, and other factors that do not fit the span model.
What are some of the assumptions made by the OpenTelemetry protocol and APIs that may not hold true on mobile?
-Some assumptions include that telemetry is recorded and transmitted reliably, that related data arrives together as a complete set, and that recording telemetry adds no significant performance overhead. On mobile none of these can be taken for granted, which is why the tooling needs resilience such as buffering data to disk before sending.
How does the diversity of mobile developers' skill sets and experience affect the adoption of OpenTelemetry?
-The diversity can affect adoption because mobile developers may not be familiar with concepts like tracing, threads, and context propagation, which are foundational to OpenTelemetry.
What is the proposed solution or improvement for OpenTelemetry to better accommodate mobile app observability?
-The proposed solution includes more participation from the community, asking questions, and using OpenTelemetry in different ways to adapt it to the unique needs of mobile app observability.
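The talk notes that when only part of a related set of telemetry reaches the server, there is no built-in way to know whether the picture is complete. One possible approach, sketched below in Python, is to send a small manifest alongside the items so the receiver can check completeness. The function names and the manifest fields (`group_id`, `count`, `sha256`) are hypothetical; this is not part of the OpenTelemetry protocol or any SDK.

```python
import hashlib
import json


def make_manifest(group_id, items):
    """Describe a group of related items: how many there are and a digest of their content.

    Note: the digest depends on item order, so sender and receiver must agree on ordering.
    """
    parts = [json.dumps(item, sort_keys=True) for item in items]
    digest = hashlib.sha256("\n".join(parts).encode("utf-8")).hexdigest()
    return {"group_id": group_id, "count": len(items), "sha256": digest}


def is_complete(manifest, received_items):
    """Return True only if every item in the group arrived unchanged."""
    if len(received_items) != manifest["count"]:
        return False
    recomputed = make_manifest(manifest["group_id"], received_items)
    return recomputed["sha256"] == manifest["sha256"]


# Hypothetical group of related events from one app session.
events = [
    {"name": "screen_shown", "at_ms": 0},
    {"name": "request_started", "at_ms": 120},
    {"name": "request_failed", "at_ms": 5120},
]
manifest = make_manifest("session-123", events)

all_arrived = is_complete(manifest, events)
partial_arrived = is_complete(manifest, events[:2])
```

The design choice here is that the server can store partial groups but flag them as incomplete, rather than silently treating a partial set as the whole story.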
Outlines
📱 Mobile Observability Challenges
Andrew and Hanson from Embrace introduce the concept of OpenTelemetry for mobile apps, highlighting the unique challenges of mobile observability. They emphasize the prevalence of mobile devices in digital transactions and the limitations of current proprietary systems. Andrew sets the stage by discussing the need for better observability in mobile apps, which are not themselves distributed systems but installed software that runs on distributed devices and talks to distributed backends, and which face unique issues such as variable network connectivity and a vast array of device types. Hanson, a former mobile performance engineer at Twitter, then delves into the specifics of these challenges and the role of OpenTelemetry in addressing them.
🔍 The Fragility of Mobile Telemetry
Hanson discusses the fundamental differences between mobile and backend systems, pointing out the limitations of mobile devices such as unpredictable CPU and RAM resources, and the impact of user-perceived performance on service level objectives (SLOs). He highlights the fragility of data pipelines from mobile devices to backend systems, where data can be lost at various stages, and the importance of user experience in telemetry data. Hanson also touches on the limits of applying OpenTelemetry (OTel) to mobile, given its design for backend-centric distributed tracing, and on the need for resilience in mobile telemetry tooling.
🛠 Adapting Open Telemetry for Mobile
This section delves into the specific issues with applying OpenTelemetry to mobile apps, including the limitations of spans for measuring performance when user sessions are involved, the overhead of telemetry recording on mobile devices, and the assumptions made by OpenTelemetry APIs that may not hold true for mobile developers. Hanson points out the need for resilience in telemetry, such as buffering data to disk before sending it to the Collector, and the importance of understanding the context of mobile app metrics, which are often presented without useful baselines or comparisons due to the vast variety of mobile environments.
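The buffering idea mentioned above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not how the OpenTelemetry Android project implements it: the class name `DiskBufferedExporter`, the `send` callback, and the one-file-per-batch layout are all hypothetical. The sketch writes each batch to disk before any network attempt and deletes a batch only after the send callback reports success.

```python
import json
import os
import tempfile
import uuid


class DiskBufferedExporter:
    """Hypothetical sketch: persist batches first, delete only after a confirmed send."""

    def __init__(self, spool_dir, send):
        self.spool_dir = spool_dir
        self.send = send  # callable(batch) -> bool, True when the backend accepted it
        os.makedirs(spool_dir, exist_ok=True)

    def record(self, batch):
        # Write to a temp file and rename, so a crash never leaves a half-written batch.
        final_path = os.path.join(self.spool_dir, f"{uuid.uuid4().hex}.json")
        tmp_path = final_path + ".tmp"
        with open(tmp_path, "w", encoding="utf-8") as f:
            json.dump(batch, f)
            f.flush()
            os.fsync(f.fileno())
        os.replace(tmp_path, final_path)

    def flush(self):
        """Try to send every spooled batch; failed ones stay on disk for a later attempt."""
        sent = 0
        for name in sorted(os.listdir(self.spool_dir)):
            if not name.endswith(".json"):
                continue  # skip leftover .tmp files from an interrupted write
            path = os.path.join(self.spool_dir, name)
            with open(path, encoding="utf-8") as f:
                batch = json.load(f)
            if self.send(batch):
                os.remove(path)
                sent += 1
        return sent


# Usage: simulate losing the network (the elevator case) and getting it back.
with tempfile.TemporaryDirectory() as spool:
    network_up = False
    delivered = []

    def fake_send(batch):
        if network_up:
            delivered.append(batch)
        return network_up

    exporter = DiskBufferedExporter(spool, fake_send)
    exporter.record({"spans": [{"name": "app_start", "duration_ms": 840}]})
    sent_while_offline = exporter.flush()
    network_up = True
    sent_after_reconnect = exporter.flush()
    left_on_disk = os.listdir(spool)
```

A real implementation would also need limits on spool size and batch age, since delivery may be delayed by hours or days.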
🌐 The Need for Community Involvement in Mobile Observability
Hanson concludes by emphasizing the need for more participation and diverse questions in the OpenTelemetry community to better address the unique needs of mobile apps. In the Q&A, Andrew describes user sessions and the importance of modeling them correctly to ask the right questions about mobile app performance, and adds that entities and events are areas of ongoing work that could significantly improve mobile observability. The speakers encourage the audience to engage with them to help shape the future of OpenTelemetry for mobile apps.
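As a rough illustration of why a user session asks different questions than a span, the Python sketch below models a session as a collection of events plus an end reason. All names (`UserSession`, `SessionEvent`, the `end_reason` labels) are hypothetical and are not OpenTelemetry or Embrace APIs. The point is that a long duration is not a problem in itself; how the session ended is what matters.

```python
from dataclasses import dataclass, field
from typing import List

# Hypothetical labels for how a session ended.
BAD_ENDINGS = {"crash", "force_close"}


@dataclass
class SessionEvent:
    name: str
    at_ms: int  # milliseconds since the session started


@dataclass
class UserSession:
    session_id: str
    events: List[SessionEvent] = field(default_factory=list)
    end_reason: str = "unknown"

    def add(self, name: str, at_ms: int) -> None:
        self.events.append(SessionEvent(name, at_ms))

    def engagement_ms(self) -> int:
        """Time between the first and last recorded event."""
        if not self.events:
            return 0
        times = [e.at_ms for e in self.events]
        return max(times) - min(times)

    def ended_badly(self) -> bool:
        return self.end_reason in BAD_ENDINGS


# A long session that ended normally: long engagement is a good sign here.
long_and_happy = UserSession("a", end_reason="backgrounded")
long_and_happy.add("feed_opened", 0)
long_and_happy.add("video_watched", 600_000)

# A short session that ended with the user force closing the app.
short_and_unhappy = UserSession("b", end_reason="force_close")
short_and_unhappy.add("checkout_opened", 0)
short_and_unhappy.add("spinner_shown", 8_000)
```

Treating both sessions as spans and flagging the longer one as the outlier would get the conclusion backwards, which is the modeling problem the speakers describe.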
Keywords
💡OpenTelemetry
💡Mobile Apps
💡Observability
💡Distributed Systems
💡Network Connectivity
💡Cardinality
💡Proprietary Systems
💡User Experience
💡SLOs (Service Level Objectives)
💡Instrumentation
💡Entities
Highlights
Andrew and Hanson from Embrace discuss OpenTelemetry for mobile apps, emphasizing the unique challenges mobile apps face in observability.
72% of digital transactions in the past year occurred on mobile devices, highlighting the importance of mobile app observability.
Mobile apps differ from distributed systems, presenting unique issues such as unreliable network connectivity and high device diversity.
Traditional observability in mobile has been limited to proprietary systems, lacking comprehensive monitoring and often restricted in data access.
OpenTelemetry (OTel) is introduced as a solution for unified observability across different platforms, but it was originally designed with backend-centric assumptions.
The presentation is a call to action, inviting colleagues interested in the mobile ecosystem to contribute to the development of OpenTelemetry for mobile.
Hanson explains the fundamental differences in mobile runtime environments, such as unpredictable CPU and RAM allocation and the influence of the OS on app performance.
Mobile SLOs are tied to user-perceived performance, which is difficult to measure with conventional metrics.
The data pipeline from mobile devices to backend systems is fragile, with potential data loss at various stages.
OpenTelemetry's assumptions about recording and transmitting telemetry do not always hold true on mobile, necessitating resilience in tooling.
Mobile developers often have different skill sets and are accustomed to higher-level constructs, making traditional tracing concepts less accessible.
Mobile apps have millions of instances running on different devices, making it challenging to present metrics without proper context.
Hanson advocates for more participation and questioning in the development of OpenTelemetry for mobile to address its unique requirements.
The presentation suggests that entities and events could be key advancements in OpenTelemetry to better represent mobile app behavior.
Andrew discusses the concept of user sessions in mobile apps, emphasizing the unpredictability of user behavior and its impact on app performance.
The need for proper modeling of user engagement and technical data in mobile apps is highlighted to answer different types of performance questions.
The presentation concludes with an open invitation for discussion during lunch, encouraging further exploration of OpenTelemetry in mobile.
Transcripts
well we are the uh the presentation
after the break so um people will filter
in hopefully out of sheer excitement
from what they're hearing
um I am Andrew this is Hanson we are
from Embrace and we're here to talk
about uh OpenTelemetry for mobile apps
um full disclosure I'm mostly here for eye
candy um I'm going to be doing a brief
introduction as to why we're here and
talking about this but the meat of the
presentation is from Hanson who was
formerly a um mobile performance
engineer at Twitter uh he had about 45
minutes of content that he's going to
talk about in about 11 so we'll see how
that
goes um okay so uh why are we here uh
for a minute consider that 72% of
digital transactions last year happened
on a mobile device now that wasn't all
Native mobile apps some of it was mobile
web but if you think about the most
iconic brands that you interact with on
your phone I'm gambling that most of
them have a native app that you interact
with on a pretty consistent
basis and uh you know as we think about
the observability ecosystem we've been
spending I don't know 10 years really uh
marching toward how we do observability
better for backend systems OpenTracing
OpenCensus were built for distributed
systems uh but mobile apps are not a
distributed system they are an installed
Software System running on distributed
compute resources that interact with
distributed
systems and there's a lot of unique
challenges with that environment uh you
have sometimes dodgy network
connectivity you have um a massive
cardinality of device types I think when
we were looking at our uh the the number
of unique devices for just one customer
that we saw has like 42,000 different
combinations of device models and
chipsets that potentially exist um you
have different regions different OS
versions different app versions um also
say like you're dealing with lots of
data coming from uh very disparate sets
uh interacting with your back-end
distributed
systems and uh for most of its life
cycle observability and monitoring in
the mobile ecosystem has been mostly
proprietary systems designed by vendors
um the kind of
basic observability or monitoring that
any app developer puts in is
Firebase Crashlytics which was
something that was kind of acquired from
Twitter and formerly its own little
company um it's free except you don't
get data for 12 hours until a customer
actually interacted with the device it's
highly sampled it's limited to like a
million events a month so if you're a
serious app developer you're not going
to use it um and as a result you've had
all of these vendors building their own
proprietary standards and trying to
convince people um that they should be
serious about observability with that
but mostly it's just crash reporting
error tracking Etc and the result is
that the Paradigm looks a little bit
like this um you'll notice that the single
line that is uh open is metrics for uh
mobile devices and I think Hanson will
dive in a little bit as to why that is
but um broadly speaking a lot
of mobile Engineers actually consider
this to be an acceptable picture um but
when we talk to customers that are very
serious about digital transactions
happening on mobile um they say all of
our customer impacting SLOs are directly
tied to my mobile device and yet I have
no good source of information that
correlates to all of the hard work I've
been doing to build reliability and
resiliency in my backend systems what am
I supposed to do and so we're going to
talk a little bit about our challenges
in the paradigm that we're facing
today but more than anything this is a
call to action which is if you have uh
co-workers or people who are interested
in the mobile ecosystem uh Hanson's
working on the Android SIG we have folks
working on the Swift SIG etc please talk
to us afterward we'd love to get you
involved all righty name is Hanson
pronouns he/him I'm a Chinese looking
guy bald head glasses your typical type I
guess uh and I want to talk about open
Telemetry and observability in general
uh in Mobile so the Crux of the problem
that we're trying to solve in Mobile for
observability is not merely trying to
Port the tooling over to these mobile
platforms I think there are fundamental
differences in the basic assumptions
that we make and the questions that we
ask of of our observability tools and
before we can move forward we have to
acknowledge what these are recognize it
before we can do the right thing
so Andrew said mobile is unique how I
don't got three hours so I'm just going
to try to do it in three minutes first
runtime
environment this bad boy is not a
kubernetes cluster you cannot configure
how much CPU you get how much RAM you
have it is wild I can walk in to an
elevator with this
device my network connection is gone
uhoh my app is
affected not only that the CPUs on these
guys are also very limited we have
low-end MediaTek chipsets 10 years
old running OSes that are very strict in
terms of how they provision resources so
not only do you have limited hardware you have
the OS saying oh yeah 50 megabytes of
heap that's good enough for you you go
over GC uh oh my app is really slow why
well you know if you're observing you
might
know and worse yet SLO on mobile apps is
not purely based on operational
performance I will perceive a workflow
as being slow using the same device that
somebody else may not think is slow user
perceived performance is part of the
equation and it's really hard to
calculate that by simply measuring very
very specific
numbers second the pipeline to get
telemetry data from mobile devices to the
back end is extremely fragile data could
be lost in a number of different steps
crashes could take down anything that
you're observing that you have not
written to disk even if you write to disk
just because it's on disk it doesn't
mean it'll actually get to the back end
or if it does maybe it's delayed by
minutes hours days these are not edge
cases these happen all the time for
most
people and lastly the data the telemetry we
capture has to Center on user
experiences anything that we capture is
in service to that operational duration
device context everything that we do
everything that we get is so that we can
replicate the user experience in
data when we're looking at a big system
P99 might just be a condition of how the
system is running whether it's healthy
P99 for a mobile app is 1% of all
measurements and if you have 100
million DAU 1% is a big big number and at
the end of that 1% is a user staring at
your app waiting and waiting for it to
load and it's not only that they do
it once they do it again and again and
again because slow devices tend to be
slow all the time and that's where you
get things like churn because this app
is too slow I uninstall it a new set of
1% gets in and churns
so let's talk about OTel OTel is great
it is the lingua franca of observability
it allows the back end the front end and
the mobile apps in between to
basically talk in the same language we
use the same words we use the same
nouns but it was designed for a backend
Centric distributed tracing world where
there are certain assumptions that just
don't apply in Mobile so I'm going to
talk about spans spans are great we love
spans don't we Applause for spans
uh but they don't work so well for all
circumstances uh if you want to measure
operational duration to pick out
outliers spans are
fantastic but what if duration is not an
indicator of performance user sessions
uh if you want to have something on
screen you want to measure how long it
you know users have interacted with it
long is actually good potentially or
just it just happens Network existence
it's just a span well maybe not a span
if you're really strict about it but it
could be operations also run for a long
time on the client and not knowing the
state of those operations until you end
it kind of problematic especially if you
are at key moments of the app life cycle
if you background the app and your span
is not done I guess we can keep our
fingers crossed the OS doesn't kill it
before it's done but if it does uh oh
what are you going to do also operations
need to be contextualized with a lot of
mutable state that changes all the time
and getting that data and writing it
as attributes of a span can
potentially be very expensive waiting
for your system to come back with the
Wi-Fi status or
whatever second the protocol and the
APIs of OpenTelemetry make certain
assumptions that are not true on mobile
so telemetry being recorded and transmitted
reliably not true sometimes you don't
get it so we need to build in resilience
within the tooling to do things like
buffer to disk before we send to the
Collector which is what the OpenTelemetry Android
project has done Cesar has done great
work around that um and then also
atomic transmission of related events
and data if we only get a
partial set of information in the server
we don't know if it's a complete picture
knowing that everything is either there
or not there is extremely helpful and
there's nothing built in to really do
that also assuming that recording and
transmitting telemetry doesn't actually
uh pose a um significant amount of
overhead try an Android Go device
from eight years ago running with a gig of RAM
trust me taking a span on the
main thread oh it impacts performance so
you have to be very careful about
what you measure how you measure it
another assumption
before I take a drink of water is
that Engineers using the API are
familiar with Concepts like tracing and
and threads and and context propagation
while mobile developers are a much more
diverse group um in terms of skill set
experience they're used to higher level
constructs that don't have the notion of
a thread well they have the notion of a
thread but not the actual thread of a
thread and if you say hey trace this
you know propagate context on threads
uhoh I don't know what you're talking
about but they want to measure
performance and they look at the API and
they're like oh maybe it's not for me
and also traced operations have clear
execution boundaries and ownership
is you know fundamental to break
down the distributed Trace into
spans look at a mobile code base with
200 modules different teams owning
different parts executing on different
threads instrumentation can be extremely
brittle if you don't manage it in the
right way and mobile apps unfortunately
generally managed in the right way is
not something you associate with mobile
app
architecture last point I want to make
is simply about mobile devices being
millions of different app instances
running on different
phones metrics at least the way OTel does
it are really not conducive to
presenting mobile app metrics in a
way that is super useful uh without the
proper context to ground and Baseline
and do comparisons you're basically
munging together a bunch of different
systems with a bunch of different
runtime contexts so if I tell you p75
heap size in the JVM is 60 megs is
that good or bad if you change it to
70 is it good or bad is it just
the OS is more permissive or is there a
memory leak who knows but we want
to know and having strict timeline
aggregations is really not suitable for
the types of operations that we're
trying to measure because they could run
super long and we can miss the window
and well I can go on but I
won't
basically it's not bad it's just
different OpenTelemetry is designed to
solve a specific problem in a
specific context and we're kind of like
ha Here Comes mobile apps we're going to
change all these assumptions expect it
to work no of course it's not GNA work
but it doesn't mean it can't work I mean
there is ongoing work in you know
various SIGs uh Slacks you know private
DMs uh Jason back there with a mask has done
great work on the Android OpenTelemetry
uh codebase in order to you know bring
it to you know everybody and we need
help not only to you know write code but
to ask questions I'm familiar with
consumer mobile apps running on tablets
and phones but what about IoT what
about you know things that run in cars I
don't know those but those are mobile
and they have different questions so we
need everyone to kind of come in and say
hey my stuff is a little different
please you know how can I how can open
Telemetry help me what do I have to do
um and we're asking these questions all
the time and uh we won't stop asking
these questions because as others have
said we're never going to be done which
is great because it's
fun questions questions that's
[Applause]
it the question is if there's one change
for OpenTelemetry what would we want it
to be uh Andrew might have different
answers but for me it's just more people
getting into this and asking questions
and using it in different ways you know
we came in with some very you know basic
assumptions and we're like hey this is
fantastic spans we can do a lot with
spans oh we can I guess but maybe not
and there are other questions so without
people asking questions we're never
going to find the answers so for me
that's like more people participating in
asking
use cases you know yeah I'm not going to
make like a feature demand but I think a
lot of things are in progress uh you
know as an example we have this concept
of a user session which is really
kind of a collection of user behavioral
and Technical data um I mean mobile apps
are interesting in that like it's not
the computers deciding what Pathways you
exercise it's the humans and that
becomes pretty unpredictable but pretty
material in understanding state that
resulted in some sort of terminal
activity in the app whether it's
um the customer having a really good
experience or uh you know force closing
because of long running operations um
and that while it can be modeled as a
trace as kind of Hanson was uh talking
about I mean you're asking completely
separate questions of what you would
normally ask a span um in fact
engagement time if you're measuring an
operation can actually be quite ideal in
a mobile app because it's an indicator
that the customer is actually engaging
with the content as opposed to an
operation that's running a long time so
I think figuring out how to model those
things correctly so that you can ask the
different types of questions you want to
ask is pretty critical so entities is
I think going to be a great step um
events are going to be a great step
forward events could also have a
start time and end time but not gonna
open that discussion um so so there is
great work happening and almost done uh
I lost my train of thought yeah don't
worry about it are there any more
questions before we can also shout out
entities shout out
Jason depends what you define as session
replay like does our product have it
no yeah I mean traditional web replay
products looked at the DOM right which
is not really super accessible um or at
all in a native mobile app and we
obviously can't do a screen recording
because that would carry information
about a customer's activities and
information you wouldn't want to expose
um there are vendors playing around with
it but I mean I guess we don't have a we
don't currently think about it in a in a
way that we're modeling it within our
own SDKs fine-grained session replay with
lots of you know new data mod you know
specific data model for parsing and in
arting is like that hoof with really
fine detail fur we're just trying to get
a face on this guy it's just just a face
so baby steps baby
steps all right thanks all we'll be
outside during lunch feel free to talk
tip your waitresses and waiters and
[Applause]