Tracking the Untrackable
Exfiltrating General System Performance via JavaScript
This is a somewhat technical article with an intended audience of engineers and data scientists. But if you are simply curious about measuring system performance in the field, fear not!
The TL;DR
I measured general hardware and system performance across hundreds of models, dozens of chipsets, and millions of devices in an application-agnostic manner in HTML5 environments—all without low-level interfaces or APIs!
Here are some (anonymized) example metrics:
Overall System Performance Distribution, All Systems
Overall System Performance Histogram, by Model
X-axis is response time in milliseconds
Overall System Performance, by Chipset
From these we can infer lots of interesting details that may be helpful for:
Improving app performance
Identifying devices with performance issues
Seeing if targeted performance improvements for those devices actually worked
Understanding the distribution and character of device populations in the wild
Looking for data anomalies and even potential abuse in the field
Read on for the whole story!
Now... the Deep Dive
My name is Casten Riepling, and I worked on the Ecosystem Engineering team at Crunchyroll. I managed device- and platform-specific integration and compatibility for Smart TVs, Set-Top Boxes, and Multichannel Video Programming Distributor (MVPD) devices, i.e. cable boxes. The Crunchyroll app is typical of content streaming applications in that it uses an HTML5+JS application stack. Like most apps on Smart TVs, it runs much like a typical web application in your browser.
(Note that from here on, I’ll refer to the HTML5+JS stack as just HTML5 for brevity.)
Why HTML5 and JavaScript?
There are a variety of technology stacks used in Smart TV platforms around the world. These include, among others:
Native SDKs and Tools for Game Consoles - PlayStation, Xbox, Wii/Switch
Android Apps - Amazon Fire, Android TV
Various Platform-Specific Languages - BrightScript for Roku, Swift for Apple’s tvOS
TV, or Living Room eXperience (LRX), device makers have historically not wanted to be bound to a specific development ecosystem or OS. TV manufacturers like Samsung and LG have their own operating systems, while others use an OS written by a third party such as Roku, or a smart TV OS like VIDAA or TitanOS. With so many OSes in play, many device manufacturers opted for an open web standards approach to avoid being locked into one ecosystem: typically a Linux-based OS with an HTML5 app stack for the majority of applications. This also gives access to a larger talent pool of app developers, and HTML5-based apps come with a well-understood set of operating principles for things like sandboxing, security, and multi-processing.
Did We Need A Bigger Boat?
The HTML5 sandbox provides a solid foundation that helps make applications and systems safe, reliable, and robust. However, the HTML5 APIs were largely designed for websites in desktop browsers. While that works well for typical web apps, it does not provide many mechanisms for running in an embedded system. Embedded systems are typically designed to:
Limit hardware cost by providing the minimum support for current streaming standards and requirements
Tune the system to run at the limits of performance
Stay in the field for a long time; MVPD devices (from cable operators) often have a shelf life of 8-10 years or more
There are exceptions to the above, but by and large, assuming this model helps ensure streaming software customers will have a uniformly better experience.
Making Do
Running an app in an environment that may be very resource constrained is always a challenge. Normally, this can be managed by thoughtful design and thorough testing. But LRX apps are managed and installed by an operator, sometimes across several different operating systems, often third-party OSes running on devices that cannot be provided to the app developers. As such, after testing on all available hardware platforms, subsequent monitoring and analytics are critical to maintaining a great user experience.
Luckily, modern web analytics services like Datadog, Mixpanel, and Mux allow for tracking app, user, and video metrics, experiences, and more. You can set up alerts to look for changes in performance or errors in individual calls, user flows, video playbacks, etc. But there are many variables that may change. These include:
Firmware changes to the device
Local network performance or connectivity
New behavior in our streaming application
Changes on our backend
Changes in partner services
With so many variables, it can be a challenge to narrow things down. If we see a drop in specific metrics, we typically put on our technical stethoscopes and begin the process of Differential Diagnosis. One of the important parts of a DD is being able to quickly eliminate possibilities.
Since we only have access to JS level logging and diagnostics, we can’t see system logs or events that would normally be visible to an embedded developer. This includes tools like dmesg (Linux Kernel), top, htop, /proc/stat, or even ADB logcat. From a system perspective, we are largely flying blind. If we want some general system visibility, we’ll need to figure out some way to measure it indirectly.
Finding a Quanta
The question is: how can we gain insight into general system performance and use it to learn more about the application or the system? In JavaScript applications, work is typically handled in a largely asynchronous, cooperative model. Tasks are dispatched to a work queue by the browser/JS engine. The time between when a task is dispatched/queued and when it is serviced can give an indication of system performance. The following things might affect the latency and responsiveness of the JS task queue:
CPU: If the CPU is very busy, the entire system is impacted, and task response time may suffer as well
Memory availability: If system memory is low, excessive paging will slow everything down
App-based throttling: If the JS system is throttled for CPU or memory, e.g. via nice or MMU prioritization
App-specific slowdowns: If the app has a bug or is inefficient and spends too much time in a specific JS handler, it slows down the entire JS system and app
While measuring JS event loop performance won’t give us %CPU, %Total Memory, etc., it can provide a largely application- and platform-independent health metric. This is a sketch of how we set up the metric:

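What follows is a minimal, illustrative version in plain JavaScript; the sendAnalyticsEvent() stub stands in for the actual Datadog logging call, and the event name and payload fields are placeholders rather than the real ones.

```javascript
// Stub: in the real app this would be a Datadog custom log/metric call.
function sendAnalyticsEvent(name, payload) {
  console.log(name, payload);
}

function setupPerfWatcher() {
  const SAMPLE_INTERVAL_MS = 60 * 1000; // one measurement per minute

  setInterval(() => {
    const queuedAt = performance.now(); // time when the probe task is queued

    // setTimeout with no delay runs on the next turn of the event loop,
    // so the elapsed time approximates the main task queue latency.
    setTimeout(() => {
      const responseTimeMs = performance.now() - queuedAt;

      // Ship the sample with extra context so it can be sliced later
      // (model, chipset, firmware, current UI page, ...).
      sendAnalyticsEvent('js_task_queue_latency', { responseTimeMs });
    });
  }, SAMPLE_INTERVAL_MS);
}

setupPerfWatcher();
```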
Our function, setupPerfWatcher(), is executed once at application start time and performs a measurement every 60 seconds. When it’s time for a measurement, we save the current time and immediately add a task to the JS queue with the setTimeout method. We don’t specify a delay, so the task will execute on the next turn of the event loop.
When that task runs, we measure the time again and obtain the response time in milliseconds. We then send this up to Datadog as a custom log event that includes some extra info to let us slice and analyze the data later on.
Note that we only measured the latency of the main JS Task Queue. Another interesting metric might be Microtask Queue latency, since microtasks are serviced as soon as the currently running task completes, ahead of anything else waiting in the main Task Queue. This could provide a metric with a stronger correlation to idleness. Measuring Render Queue latency could also be considered, but that is likely reflected in more typical render speed measurements.
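For illustration, a microtask latency sample might look something like the hypothetical sketch below; this is an assumption about how it could be done, not something we shipped.

```javascript
// Hypothetical variation (not deployed): sample microtask latency.
// A resolved promise's callback runs as a microtask, before the next task
// is taken from the main task queue, so this value is typically much smaller.
function measureMicrotaskLatency(onSample) {
  const queuedAt = performance.now();
  Promise.resolve().then(() => {
    onSample(performance.now() - queuedAt);
  });
}

// Example usage (sendAnalyticsEvent is the same placeholder stub as above):
// measureMicrotaskLatency((ms) => sendAnalyticsEvent('js_microtask_latency', { ms }));
```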
And that is pretty much it. I had hoped that this would provide some interesting data when deployed at scale. I was not disappointed. Here are the things I noticed:
Overall System Performance Distribution, All Systems (y-axis: log scale)
(Note that we’ve scaled the y-axis of this diagram logarithmically to better show the modalities. Each marker on the y-axis represents a 10x increase over the previous one.)
This is the response distribution across all devices participating in the metric. The two initial peaks are around 1/100 the magnitude of the third peak.
Some insights:
Peaks 1 and 2 represent either a different modality or extremely different systems. Adding additional information to the analytics event might provide more clues; for example, a reference to the UI/page being shown might reveal performance differences across pages. We can also look at the model- or chipset-specific breakdowns below for other clues.
Peak 3 contains the majority of the traffic and is the main mode. We can also see a visually linear taper; since the y-axis is log-scaled, this actually corresponds to an exponential decrease. We can also see that the P90 is much further out than the P75 (peeking out from behind the P50), and the P99 is not even shown in this view. That indicates a fairly wide distribution.
Overall System Performance, by Model
Slicing things by device model, we can see four major modalities. Since these charts are model-specific, the devices within each likely use the same hardware.
Some insights:
Model A has distinct modes with little or no overlap, which suggests real modalities, probably functional ones. Some wild guesses might include:
Modes
Group 1 - At home screen - nothing much going on, so response is good
Group 2 - During playback - performance is worse here due to more activity going on, maybe reduced memory
Group 3 - During scrolling - much more going on
Group 4 - Scrolling during video - doesn’t happen often, but is very taxing on the system
Distribution
P75 is on the third mode, so the majority of the time the system is relatively performant.
Model B - The same four modes show up.
But these are blurred together. This indicates greater variation in response time for the four modes.
This is probably an indication of a system nearer the edge of performance issues. Even if each task takes the same amount of time to run, the queue may not be empty for a larger proportion of the time.
P50 mirrors Model A, but P75 is off to the right, not shown. This indicates a long tail, with the system not being very responsive for a higher portion of the time.
Model C - Slightly different than the others
We see what looks like a new mode (Group 0?) preceding the others, with exceptionally fast responses, though it rarely happens. Perhaps the JS queue is totally empty in this case and the system is near idle.
Groups 1 and 2 are similar to Model A, but with more data points.
Group 3 is almost non-existent. It could be that the activity behind that mode on Models A and B runs so much more quickly here that the points are shifted left and absorbed into Group 2.
P50 is very close to the front of Group 2. This is much better than models A and B. Furthermore P75 and P90 are both within Group 2. This means almost all the responses are processed very quickly.
This indicates a large performance improvement compared to the other models.
Not only is the average lower, but the data is grouped more tightly together.
Overall, we can estimate that Model C is more performant than Model B, and Model B more performant than Model A. If we can positively identify what each mode corresponds to, we can use this to look for differences in field behavior across models, and then check those modes whenever application changes are made in the corresponding circumstances or areas.
We can also look for shifts in those models that happen outside of the app development cycle. Changes like this may happen due to:
Firmware updates
Invisible partner configuration changes
Device degradation over time. This could happen from:
Thermal issues
Flash memory degradation
Overall System Performance, by Chipset
Here we are looking at data across specific chipsets. (Note that the x-axis is not to the same scale as the previous model graphs.)
Some thoughts:
We see some familiar modes in Chipset X with P75 being close to the P50.
This looks like a happy system.
Chipset Y has the modes blurred together and P75 is off graph to the right.
This doesn’t look as happy as Chipset X
Chipset Z has a crazy bi-modality!
The performance of the first group shows evidence of the 3-4 modes.
And it is very very fast.
But how can we have such a huge split between groups A and B?
Digging into this, I discovered that when the partner API was run in normal web browsers, it misreported the chipset as Chipset Z. The data to the right is likely the real Chipset Z hardware, while the group to the left with the amazing performance was likely running in the cloud or in an automation lab. The data could likely be separated further using an additional signal, such as the reported User Agent.
Takeaways
What I’ve demonstrated is that, despite not having direct access to lower-level system metrics, we can still gain insight through thoughtful instrumentation. These insights can be used for activities such as:
Automated selection of performance-specific UI configurations or modes
Apps often have UI enhancements that can be selectively enabled on performant systems
We can use this analytics data to apply those settings (see the sketch after this list)
Monitoring specific populations for in-field changes
This could be based on models, chipsets, partner brands, etc.
Using the data from these pseudo-low-level metrics to:
Monitor the effects of app changes on overall system performance
Compare these low-level metrics for new releases before sending them out into the field
These could complement other, more direct performance tests and metrics
Having multiple ways to quantify performance is useful in cases where one of the methods has a silent problem.
These can act as a safeguard against silent performance validation failures.
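As a rough illustration of the first item above, an app might pick a UI tier from the measured task-queue latency along these lines; the thresholds and setting names here are invented for the example.

```javascript
// Hypothetical example: choose a UI feature tier from observed task-queue
// latency. The thresholds and setting names are made up for illustration.
function chooseUiTier(medianLatencyMs) {
  if (medianLatencyMs < 20) {
    return { animations: true, imagePrefetch: true };   // snappy device
  }
  if (medianLatencyMs < 100) {
    return { animations: true, imagePrefetch: false };  // mid-range device
  }
  return { animations: false, imagePrefetch: false };   // constrained device
}

// Example: chooseUiTier(35) -> { animations: true, imagePrefetch: false }
```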
Wrapping Up
I hope you’ve enjoyed this deep dive into the peculiar world of Smart TV development. This started as a “what if” during a discussion about tracking hardware-platform-specific performance, in pursuit of an app-independent metric that wouldn’t be taxing on the system. The investigation returned interesting and useful insights, and I hope you found it interesting as well!