Telemetry Over Events: Developer-Friendly Instrumentation at American... Ace Ellett & Kylan Johnson
Summary
TLDR: In this talk, Kylan Johnson, the Android Lead for the mobile team at AMX, discusses the company's journey with telemetry and the importance of observability in mobile app development. Starting with no traces or metrics, the team focused on understanding user experience through real user monitoring and manual instrumentation. They tackled app startup, server connections, and complex user journeys such as login, and concluded that developers should be experts in their features while the embedded SRE team supplies the telemetry expertise. The talk closes with lessons learned, including the value of custom events and of testing the telemetry itself.
Takeaways
- 😀 The speaker, Kylan Johnson, is the Android Lead for the mobile team at AMX, discussing the company's journey with telemetry and observability in mobile apps.
- 🔧 AMX started with no telemetry in their mobile app and aimed to understand user experience and app performance from the customer's perspective.
- 📈 They focused on real user monitoring and observability to gauge how the app felt to use, especially during app startup which is critical for user experience.
- 📱 The mobile team, including both Android and iOS, worked on manual instrumentation to capture telemetry data for tens of millions of users across multiple countries.
- 🚀 Kylan emphasized the importance of making observability a first-class feature in the application architecture, integrating it into the build process and development workflow.
- 🛠 The team faced challenges in abstract modeling and the cognitive load of introducing new APIs to developers already dealing with a rapidly evolving Android landscape.
- 🔄 They modeled login as a protocol of events: sending a login request, decoding JSON, scanning biometrics, and handling multi-factor authentication (MFA).
- 📊 Kylan highlighted the complexity of what seems like a simple journey from login to landing page, with many potential 'bumps' in the process that need to be monitored.
- 👥 The speaker stressed that developers should be experts in their features, with embedded SREs like Kylan being the experts in telemetry and observability.
- 🔄 The team learned to simplify the telemetry process by using events and signposts that everyone can understand, making it easier to implement and maintain.
- 📝 Kylan advocated for constantly reviewing and refining metrics to ensure they provide value and lead to actionable decisions, removing any that do not.
- 🛡 The AMX team created a clear barrier between the app's functionality and the telemetry measurement, encapsulating the 'how' of measurement separately from the 'what' of the app's decisions.
Q & A
Who is the speaker in the provided transcript?
-The speaker is Kylan Johnson, the Android Lead for the mobile team at AMX.
What is the main focus of Kylan Johnson's talk?
-The main focus is on the telemetry journey of the mobile team at AMX, specifically discussing the implementation of observability and real user monitoring in their mobile application.
Why did AMX decide to start with app startup as the critical measurement point?
-AMX decided to start with app startup because it is a critical point where the app is not in memory and the user experience can be significantly impacted by the time it takes to instantiate components and render the UI.
What is the significance of progressive rendering in the context of the talk?
-Progressive rendering is significant as it aligns with user profiles and device usage patterns, allowing for a more efficient and user-friendly experience by loading and displaying content in parts rather than waiting for everything to load at once.
How does the mobile team at AMX approach server connection issues?
-They initially assumed server connection would be the most critical and error-prone aspect to measure. However, with a coexistent server team and a tight API contract, they found that server issues were less of a concern than initially thought.
What is the importance of observability as a first-class feature in the development process?
-Observability as a first-class feature ensures that it is integrated into the architecture and the development process from the start, making it easier to maintain, understand, and improve the application based on real user data and feedback.
How does Kylan Johnson describe the complexity of the login journey in the mobile app?
-Kylan describes the login journey as very complex, involving multiple steps such as biometric scanning, multi-factor authentication, and rendering the final view, which can vary depending on the user.
What is the role of the login Observer in the Android application architecture?
-The login Observer receives the result of each login attempt. It is supplied through dependency injection, and the existing login code calls it alongside what it was already doing, reporting only the critical events that need to be modeled.
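As a rough illustration of the login Observer idea, here is a minimal Kotlin sketch. Every name in it (`LoginObserver`, `loginReceived`, `LoginViewModel`, `LoginService`, `LoginResult`) is an assumption made for this example; the talk does not show the real interfaces, and constructor injection stands in for whatever dependency injection framework the app uses.

```kotlin
// Hypothetical names throughout; the talk does not show the real interface.
sealed interface LoginResult {
    data class Success(val userId: String) : LoginResult
    data class Failure(val reason: String) : LoginResult
}

// The only telemetry surface a feature developer sees.
interface LoginObserver {
    fun loginReceived(result: LoginResult)
}

// Demo implementation that simply records what it was told.
class RecordingLoginObserver : LoginObserver {
    val received = mutableListOf<LoginResult>()
    override fun loginReceived(result: LoginResult) {
        received += result
    }
}

// Stand-in for the login service the feature already had.
fun interface LoginService {
    fun login(user: String, password: String): LoginResult
}

// The observer arrives through the constructor, as a DI framework would supply it.
class LoginViewModel(
    private val service: LoginService,
    private val observer: LoginObserver,
) {
    fun onLoginClicked(user: String, password: String): LoginResult {
        val result = service.login(user, password) // what the feature already did
        observer.loginReceived(result)             // the one added line
        return result
    }
}

fun main() {
    val observer = RecordingLoginObserver()
    val viewModel = LoginViewModel(
        service = LoginService { user, _ -> LoginResult.Success(user) },
        observer = observer,
    )
    viewModel.onLoginClicked("demo-user", "not-a-real-password")
    println(observer.received)
}
```

The point of the shape is that the feature code gains a single call and never touches spans directly; how that call becomes a span or a metric is decided elsewhere.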
Why is it important to separate the 'what' from the 'how' in the context of telemetry?
-Separating the 'what' from the 'how' allows developers to focus on the app-level decisions without being distracted by the specifics of how the telemetry is measured, making the system more adaptable and easier to maintain.
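One way to picture the what/how split is a thin interface between app-level events and whatever records them. The sketch below is an assumption-laden illustration, not the team's code: `AppEvent`, `TelemetrySink`, and `Telemetry` are invented names, and the two lambdas stand in for real observability backends.

```kotlin
// "What": app-level vocabulary owned by the feature teams (hypothetical names).
enum class AppEvent { LOGIN_STARTED, LOGIN_FINISHED }

// "How": anything that can record an event. Vendor SDK calls would live behind this.
fun interface TelemetrySink {
    fun record(event: AppEvent, atMillis: Long)
}

// Fans each event out to every configured sink; features never see the sinks.
class Telemetry(
    private val sinks: List<TelemetrySink>,
    private val clock: () -> Long,
) {
    fun emit(event: AppEvent) {
        val now = clock()
        sinks.forEach { it.record(event, now) }
    }
}

fun main() {
    val logLines = mutableListOf<String>()
    val counts = mutableMapOf<AppEvent, Int>()
    val telemetry = Telemetry(
        sinks = listOf(
            TelemetrySink { event, at -> logLines += "$at $event" },
            TelemetrySink { event, _ -> counts[event] = (counts[event] ?: 0) + 1 },
        ),
        clock = { 1_000L }, // fixed clock keeps the demo deterministic
    )
    telemetry.emit(AppEvent.LOGIN_STARTED)
    telemetry.emit(AppEvent.LOGIN_FINISHED)
    println(logLines)
    println(counts)
}
```

Because features only call `emit`, adding or swapping a backend means changing the sink list rather than the feature code, which fits the adaptability the speaker describes needing across several observability backends.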
How does the mobile team at AMX ensure the reliability of their telemetry data?
-They ensure reliability by testing everything at scale, including the observability features. They assert the occurrence of traces, the presence of child spans, and various tags and metadata to validate the telemetry data.
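The kind of assertion described here can be sketched with an in-memory exporter that a test reads back after driving the UI. This is a hypothetical illustration: `RecordedSpan`, `InMemoryExporter`, the span names, and the tag keys are all assumptions, and a real test would collect spans from the app's telemetry pipeline rather than exporting them by hand.

```kotlin
// Hypothetical span record; real SDKs expose richer types.
data class RecordedSpan(
    val name: String,
    val parent: String? = null,
    val tags: Map<String, String> = emptyMap(),
)

// In-memory exporter a UI test could read back after driving the app.
class InMemoryExporter {
    private val spans = mutableListOf<RecordedSpan>()

    fun export(span: RecordedSpan) {
        spans += span
    }

    // A root span is one with no parent.
    fun root(name: String): RecordedSpan? =
        spans.firstOrNull { it.name == name && it.parent == null }

    fun childrenOf(name: String): List<String> =
        spans.filter { it.parent == name }.map { it.name }
}

fun main() {
    val exporter = InMemoryExporter()
    // Stand-in for what a UI test run of the login flow would emit.
    exporter.export(RecordedSpan("login", tags = mapOf("auth.method" to "biometric")))
    exporter.export(RecordedSpan("biometric_auth", parent = "login"))
    exporter.export(RecordedSpan("login_request", parent = "login", tags = mapOf("kind" to "http")))

    // Assert the trace happened, then its child spans, then its tags.
    val root = checkNotNull(exporter.root("login")) { "login trace was not emitted" }
    check(exporter.childrenOf("login") == listOf("biometric_auth", "login_request")) {
        "unexpected child spans"
    }
    check(root.tags["auth.method"] == "biometric") { "missing auth.method tag" }
    println("login trace assertions passed")
}
```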
What is the final takeaway from Kylan Johnson's talk regarding telemetry and observability?
-The final takeaway is the importance of customizing and adapting telemetry to fit the specific needs of the application and the team, ensuring that all developers can participate in and benefit from the observability features.
Outlines
📱 Mobile Telemetry Journey at AMX
Kylan Johnson, the Android Lead for the mobile team at AMX, discusses the company's early steps in mobile telemetry, focusing on understanding app performance from a user's perspective. The team aimed to monitor real user experiences at scale, starting with app startup, which is critical on Android because many components are instantiated before the user can do anything. They initially assumed that large view loads and server connections would be the main bottlenecks and sources of errors, although the server team works alongside the mobile team and keeps a tight API contract. Kylan emphasizes observability as a first-class feature of app development, integrated into the build process and architecture so that adoption is easy and scales across feature teams.
🔍 Implementing Telemetry for User Journeys
Implementing telemetry for user journeys, particularly login, is more complex than it looks. The speaker explains the cognitive load that producing spans and modeling journeys abstractly places on feature developers. The login journey, which might seem straightforward, involves multiple steps such as biometric scanning and multi-factor authentication (MFA), owned by different teams. The speaker advocates a simplified approach in which features emit events and those events start and stop spans, so features are instrumented without disrupting the development process. The goal is to make telemetry an intuitive part of development, allowing easy tracking and measurement of critical events within the app.
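The events-to-spans idea can be sketched as a small tracker in which one event may stop one span and start another. The event names, span names, and the `LoginJourneyTracker` class below are assumptions chosen to mirror the login steps mentioned in the talk; the clock is injected so the example is deterministic.

```kotlin
// Hypothetical signposts for the login journey described in the talk.
enum class LoginEvent {
    LOGIN_STARTED, BIOMETRIC_STARTED, REQUEST_ISSUED, RESPONSE_RECEIVED, VIEW_RENDERED
}

data class Span(val name: String, val startMs: Long, val endMs: Long)

class LoginJourneyTracker(private val clock: () -> Long) {
    private val openSpans = mutableMapOf<String, Long>()
    val finished = mutableListOf<Span>()

    private fun start(name: String) {
        openSpans[name] = clock()
    }

    // Stopping a span that never started is a no-op, which is what makes spans optional.
    private fun stop(name: String) {
        val startedAt = openSpans.remove(name) ?: return
        finished += Span(name, startedAt, clock())
    }

    // A single event can stop one span and start another.
    fun onEvent(event: LoginEvent) {
        when (event) {
            LoginEvent.LOGIN_STARTED -> start("login")
            LoginEvent.BIOMETRIC_STARTED -> start("biometric_auth")
            LoginEvent.REQUEST_ISSUED -> {
                stop("biometric_auth")
                start("login_request")
            }
            LoginEvent.RESPONSE_RECEIVED -> stop("login_request")
            LoginEvent.VIEW_RENDERED -> stop("login")
        }
    }
}

fun main() {
    var now = 0L
    val tracker = LoginJourneyTracker(clock = { now })
    for (event in LoginEvent.values()) {
        tracker.onEvent(event)
        now += 100 // pretend 100 ms pass between signposts
    }
    tracker.finished.forEach(::println)
}
```

A user who skips biometrics simply never emits `BIOMETRIC_STARTED`, so no biometric span is produced and nothing else in the flow has to change.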
🛠 Adapting and Learning from Telemetry Implementation
The talk closes with the lessons learned from implementing telemetry at AMX. Developers should focus on being experts in their features while the embedded SREs provide the observability expertise. The speaker discusses the need for a clear separation between app-level decisions and measurement techniques so that telemetry remains adaptable and does not clutter feature code. The team tests its telemetry at scale, asserting that expected traces, child spans, and tags are emitted from UI test runs. The speaker also encourages regular review of metrics and the removal of any that do not lead to a decision.
Keywords
💡Telemetry
💡Observability
💡Mobile Team
💡Manual Instrumentation
💡User Journey
💡App Startup
💡Progressive Rendering
💡Server Connection
💡API Contract
💡Biometric Authentication
💡Multi-factor Authentication (MFA)
💡Custom Events
Highlights
Introduction to AMX's mobile team's Telemetry journey and the importance of understanding app performance from a customer perspective.
Kylan Johnson's role as the Android Lead for the mobile team and the collaborative effort with the iOS team for manual instrumentation.
The initial challenge of having no traces or metrics in the app and the decision to focus on app startup as a critical aspect of user experience.
The complexity of app startup, including the instantiation of components and potential bottlenecks.
Assumptions about server connections being the most critical to measure and the importance of error handling in network loss scenarios.
The coexistence of the server team with the mobile team, ensuring tight API contracts and feature parity between client and server.
The concept of observability as a first-class feature in application development, emphasizing its integration into the build process.
The cognitive load for developers in understanding and implementing new APIs for Telemetry, such as span production.
The importance of modeling user journeys, such as login, with Telemetry to capture the complexity and multiple touchpoints.
The evolution from basic manual instrumentation to a more sophisticated and adaptable Telemetry system.
The use of events and spans to create a flexible and understandable Telemetry protocol that can be implemented by all developers.
The significance of separating the 'what' of app decisions from the 'how' of measurement to maintain clarity and focus.
The necessity of testing observability at scale, including asserting the presence and accuracy of traces and spans.
The lessons learned from the Telemetry journey, emphasizing the need for developers to focus on their features while experts handle Telemetry.
The importance of reviewing and refining metrics to ensure they provide value and lead to actionable decisions.
The adaptability required in dealing with multiple observability backends and the custom events created to accommodate this.
The final takeaway on the value of a well-integrated and adaptable Telemetry system in understanding and improving user experience.
Transcripts
well uh good afternoon everybody it's a
pleasure to be here and join the other
speakers uh and just share a little bit
about how we do this at AMX uh we're I
think early a little bit on the mobile
team with our Telemetry Journey but of
course as kind of has been stated here
our backend story with
OpenTelemetry is much further
along my name is uh Kylan Johnson I am
the Android SRE Lead for the mobile team
uh my colleague Ace who's my iOS
counterpart couldn't be here but we do
uh lead the engineering effort for
manual instrumentation uh for the entire
mobile app which is uh you know tens of
millions of users multiple countries
definitely all the permutations that
were mentioned in the prior talks we're
right there with
them I actually started being a feature
developer that's how I started at AMX and
we had this opportunity or at least this
desire in some of our leadership to
actually learn more about how our app
was running at scale this is a few
years ago at this point but it started
with this basic question of how does it
feel to use a mobile app a little bit
real user monitoring um and we wanted to
use Telemetry principles use
observability just to to actually do
this and we wanted to really understand
it from the customer perspective uh for
some of our most traveled journeys it was
really about um we have short sessions
on our app in general so we wanted to
understand at scale in all the different
permutations what a user was uh
experiencing now we to do this when we
started out at Ground Zero we had zero
traces or zero metrics zero anything um
in the app when we started this so we
said okay if we want to understand how
it feels of course app startup is going
to be the critical thing your app is not
in memory now it is in our in our world
in Android World um we instantiate a lot
of a lot of uh components at that time
so this can be a bottleneck and the user
can't use your app until it's
done we we assumed that there was a lot
of large view loads where we make a
chunky server call and then we get a
response and then it takes however
long it takes to actually render the
Chrome and uh you know we generally at
this time were full screen spinner so
again you're just kind of waiting for
whatever to uh to
render now lately we've been doing much
more um Progressive rendering um which
is not as common but this is more in
line with user profiles and what users
are actually using the device um they
might have one to four different
requests on one screen and uh and are
you considering the load to be when the
first one is there when the last one is
there there's a modeling question here
that of course we ran into when we
started to manually instrument
everything but beyond that we we
obviously thought that the server
connections were actually going to be the
most critical thing to measure that it
would be the most uh error prone thing
that it would of course we assume that
everybody's going into a Subway at any
given moment and loses all network
connection and so any errors that we
might have um would be a result of the
Server Connection now fortunately for us
our server team is coexistent on our
team um we actually build features with
the server and the client all at once so
if a if a feature ships on the client it
actually ships on the server too so you
keep a tight API contract there which is
good um so for all intents and purposes
this is the application right here we
would consider the server actually as
part of the mobile application because
of its tight
coupling but of course uh the number one
Journey so what we're going to do now is
we're going to just basically go back
uh the most important the first journey
that we did and a little bit of the
Lessons Learned uh building this out
building all this Telemetry out wrong
button you get to see this wonderful
animation
again there we go login flow it's kind
of the boring flow but it's also the one
that's underlooked it's maybe overlooked
is a better word like uh people this is
the oldest part of the app it's the
architecture that's least compliant to
the to the modern everything and this is
the most traveled one which is a little
bit ironic if you think about it so
we're going to talk about this the rest
of the time we instrumented this journey
we wanted to understand this journey
because this is what everybody has to go
through and it's largely uh the most
it's definitely the most important
one but along this whole thing we we
think that observability should be a
first class feature of any application
and everything that was said earlier
about mobile apps is completely true I
can tell you from firsthand experience
mobile architectures are polyglot at
best there there's probably three
versions of them within an application
and it's it's constantly evolving
but if and then we are actually a
feature team driven uh organization
development team for you know the four
or so people that are SES there's dozens
of of developers working on the platform
so unless you bake something into your
build process unless you bake something
into how you build the software it's
just not going to get it's going to be
harder than it should be so
observability being a first class
feature means we're testing it we're uh
it's it's baked in the architecture it's
just the way you do
things this is just an example I didn't
pull this out to throw any shade it's
just this is the example you're going to
get so if I was going to have to
describe to somebody how to make a span
how to produce a span and then introduce
our team to it this is what I would tell
them and it was interesting to me that
it was way more of a cognitive load for
them to actually understand this I mean
they're under pressures with all the
things that Google is putting out on a
daily basis Android is moving very
rapidly so yet another API to learn is
is hard because then they have to
actually do the abstract modeling um
which is something we do on the SRE side but
is not necessarily in their
wheelhouse so we're going to build up a
protocol here this is how we're going to
model our login we're going to send a
login request that's unsurprising we're
going to decode all the JSON um and then
render it of course eventually but first
we're going to scan biometric because
and you'd be surprised at how many
people don't do biometric even though
it's very readily
accessible um of course after all of
this was already done then of course we
had to implement multi-factor authentication
which was uh just another bump in the
line and then you get to the logged in
View and I didn't even include half the
things that are actually bumps in the
line here so what we thought of course
is as an easy journey of just getting
from login to landing page it's actually
very very complex not to mention team
biometric did that one team MFA did that
one and then of course the landing page
it depends on which user you are which
so when you want to actually quote
unquote stop the clock which place in
the code is called well it could be n
number of them so this problem is just
it's it's much more complex than what
you might think so we're just going to
go with a basic example of what we did
um we've since built on top of this but
this is just where anybody could get
started if you have any architecture in
Android if you're using fragments
activities view models whatever you're
using you might have a view model view
model is going to have a login button
that login button just going to have a
login service and I'm I'm simplifying
this a bit obviously but then what we're
going to do is we're going to take a
login Observer if you're on Android
you're using dependency injection for
all the things and so you're going to uh
get this login Observer and then
alongside of what you were doing anyway
so we're just assuming that you were
already doing all of this we're just
going to call login received and send it
the result of what actually happened and
this is going to be one bump in the line
you're going to do this throughout the
application and you're going to send
only the critical things to The Observer
that need to be modeled
so what we get here is basically just
instrumentation via events we're just
abstracting it away to its most
primitive kind of shape and we're just
going to take the architecture is very
Hub and spoke and it's contributing
around the code base um you know it
doesn't matter if it's a network client
as was stated earlier we have I think
we're up to like 310 Gradle modules now
um some of them are Kotlin some of them
are Android some of them are just
holding JSON basically there's so many
modules so build build speed um
repeatability ease of readability is is
critical here so we want a very flexible
API to be able to do what we want to do
which is time this event so but even
beyond that we want everybody
participating in this conversation we're
about to have here so it's it's very
easy for me to get stuck on numbers and
traces and everything that's that's the
result but if you boil it down like this
everybody in the organization can kind
of participate in building value here so
of course when these events are emitted
this is what we're going to build up
we're going to say we started the login
biometric auth is running login request
is then being issued which actually
stopped the biometric auth span from
running then we of course finished it
now the orange one just denotes that
it's an HTTP span that's that's its only
distinction
here but a single event here can then
stop and start spans and these optional
spans become very easy to implement then
so before if you you have many uh a lot
of state and you want to basically
instrument a feature you want to get out
of the way of the feature and if you
just have to deal with these signposts
that everybody can understand and
everybody can um Implement well then we
can kind of proceed a little bit quicker
here so we're going to start that MFA of
course that MFA team now realizes they
can actually use metrics so then they
want to count how many times that's just
a trivial example but then we can add
events really easily we we've actually
built up what you're seeing here over
the course of many many iterations this
wasn't like one and done and we did it
it's well now we realize we can do that
oh we realize a metric is better here we
realize that um this is excellent as a
as a span but it's hard to um it's hard
to collect at
scale um of course we turn things into
metrics we turn spans into metrics so we
realize we want to time that view
interactive and and uh we just keep keep
bumping along but this final picture is
actually what's the most uh valuable
here because we've had product owners
that are really interested in what we
see here um developers are more
interested in this part than how to get
there they really just want the span to
to turn into this and it's up to us to
uh to make it happen so these simple
techniques of turning um building into
the architecture we think really help
with adoption of just Telemetry
principles here our big lessons learned
from all of this is developers should
really just be the experts in their
features we're embedded SREs we're
supposed to be the experts in all of
this but we can't necessarily expect all
developers to do it so if you're asking
a junior developer or a senior developer
I've had both come up and and and say
they don't know anything about spans
what they mean how do they use them what
the effect of it is but they do know
their feature and so we can work with
them to kind of explain their feature
describe it make the architecture more
observable first we just kind of took a
trivial example that anybody could do
but iOS is very event Loop driven they
are basically an actor pattern so they
have events going up and down all the
time which lends itself to being very
observable they also wouldn't do any
dependency injection they're very
against that they're very functional so
um if you break it down to these
Primitives both iOS and Android can talk
at the same terms because we're just
talking about starting and stopping
clocks we're talking about
describing what the
spans are not necessarily that it is a
span
and if you aren't getting value out of a
metric I think you should constantly
review all your metrics we don't want to
turn on auto instrumentation just for
the sheer fact that not that we lose
data but we want to understand what's
there and it has to lead us to a
decision if if some Metric or alert or
whatever is not leading us to a decision
to change code or update something we
want to we want to remove it basically
only thing that should be there is what
provides value and then this is just a
basic encapsulation if you're a software
developer you kind of want to separate
the what from the how an API that we
don't control even though we love open
source at AMX we want to make a clear
barrier between um what is native to the
app and then what is just measuring the
thing and that kind of sounds uh a
little bit like it's an open source why
why don't just use open source well we
don't necessarily want to draw attention
to all these spots in our app because
they tend to collect everything you'll
have a local profiler wrapping a span
wrapping a another version of
benchmarking or something so we want to
separate out what is an app level
decision versus how we're measuring it
and you can just actually test this I
kind of glaze over testing a little bit
we test everything at scale and even the
observability is tested so if a trace is
supposed to be emitted from a UI run we
actually assert that it happened we
assert all the child spans on it and we
assert uh you know just various tags and
uh metadata about it we really value
that but that is all I have thank you
very much for listening I'm happy to
take questions but I know it is
lunch these are custom completely custom
events uh I actually pulled up more of
the events uh just on my laptop there
we've had to do we we are very much that
five observability backends kind of
story so I've we've had to be adaptable
and that was the main point of this was
just to be adaptable but yes we we want
to look into the events API