If you maintain an Android application, you might be relying on performance monitoring SDKs like Firebase Performance or New Relic, to name a couple. These plugins usually have a light setup process—just apply a Gradle plugin, and they provide the ability to collect statistics about every network call and database query in your app automatically.
The usual way to achieve this is to rely on a process called instrumentation, which is supported via the Android Gradle Plugin’s Transform API, or its successor, the Instrumentation API. This feature is very powerful, and potentially dangerous; in our case, a minor patch of one of these SDKs caused a production bug that left one of our core features crippled.
The visible cause of our bug, from a developer’s point of view, was that the video player saw the network requests as always being extremely fast, no matter the network quality. Therefore, it assumed the device had access to a very high bandwidth, and tried loading video segments with a very high bit rate. This did not go well for users with slower network speeds.
To understand what was going on, what went wrong, how to fix it and how to take measures so that it never happens again, we had to do some investigation.
Diving into the Android build process
Before we get to the topic of instrumentation, we first need to know a little about the Android app build process. Don’t worry, we won’t need to dive too deep into the details.
To put it simply, during the build process, your source files (Kotlin and Java) are compiled to Java bytecode, which is then converted to Dalvik bytecode and stored in .dex files. These files are then packaged into an APK file, which is basically just a ZIP file with all your code and resources.
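Since an APK is a ZIP archive, any ZIP library can list what it contains. Here is a minimal sketch using Python's standard zipfile module. The function name list_dex_entries and the tiny in-memory archive are made up for illustration; with a real APK you would pass its path instead.

```python
import io
import zipfile


def list_dex_entries(apk):
    """Return the sorted names of the .dex entries in an APK (path or file object)."""
    with zipfile.ZipFile(apk) as archive:
        return sorted(name for name in archive.namelist() if name.endswith(".dex"))


# Build a tiny stand-in archive in memory; its contents are placeholders, not real bytecode.
fake_apk = io.BytesIO()
with zipfile.ZipFile(fake_apk, "w") as archive:
    for name in ("AndroidManifest.xml", "classes.dex", "classes2.dex", "resources.arsc"):
        archive.writestr(name, b"placeholder")

print(list_dex_entries(fake_apk))
```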
Understanding bytecode instrumentation
Now, let’s say you want to take an existing application with its untouched source code, and automatically inject calls to your SDK every time a network call is made, to log whether it was successful or not. How would you achieve this?
The easiest way is to plug yourself into the build, right after the code is compiled into bytecode, and modify the bytecode to your will.
The Android Gradle Plugin (AGP) offers APIs to do this, so SDK vendors can just develop a Gradle plugin and ta-da! Once you apply it, your app is automatically instrumented.
Note that there are other ways to achieve this without the AGP. Notably, Kotlin now uses an Intermediate Representation (IR), before it gets compiled down to a target-specific format. You can write a Kotlin IR compiler plugin to transform the IR code and add your own hooks in an Android-agnostic way, although this API is still experimental at the time of writing.
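To make the idea concrete outside of Android, here is a loose analogy in Python; it is not the AGP API and it does not touch Dalvik bytecode. It rewrites the syntax tree of some untouched source so that every call to fetch is redirected to a monitoring hook before the code runs. All names (fetch, instrumented_fetch, RedirectFetch) are invented for this sketch.

```python
import ast

# The "application" source, which we never edit by hand.
SOURCE = """
def load():
    return fetch("https://example.com/video")
"""

calls = []  # what a monitoring SDK would collect


def fetch(url):
    return f"response from {url}"


def instrumented_fetch(url):
    calls.append(url)  # record the call, then delegate to the original
    return fetch(url)


class RedirectFetch(ast.NodeTransformer):
    """Replace every call to fetch(...) with instrumented_fetch(...)."""

    def visit_Call(self, node):
        self.generic_visit(node)
        if isinstance(node.func, ast.Name) and node.func.id == "fetch":
            replacement = ast.Name(id="instrumented_fetch", ctx=ast.Load())
            node.func = ast.copy_location(replacement, node.func)
        return node


# "Build step": parse, transform, then compile the transformed tree.
tree = RedirectFetch().visit(ast.parse(SOURCE))
ast.fix_missing_locations(tree)
namespace = {"fetch": fetch, "instrumented_fetch": instrumented_fetch}
exec(compile(tree, "<instrumented>", "exec"), namespace)

result = namespace["load"]()
```

The application code still behaves the same from the caller's point of view, but the hook now sees every call. That is also why a bug in the hook becomes a bug in the app.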
Reverse-engineering a built APK
Now, this is great. But when you open an APK file, what do you get?
Let’s unzip one and look inside.
.
├── META-INF
├── assets
├── google
├── okhttp3
├── res
├── AndroidManifest.xml
├── classes.dex
├── classes2.dex
├── classes3.dex
├── classes4.dex
├── firebase-common.properties
├── firebase-crashlytics.properties
├── play-services-base.properties
├── ...
└── resources.arsc
A bunch of noise, and four interesting .dex files. That’s where the app’s code is stored, but unfortunately, these files are not human-readable.
To turn them into low-level but understandable code, some tooling will be necessary. The easiest to use for this task is apktool, which is free and open-source.
Let’s run apktool in decode mode (apktool d, followed by the path to the APK) on our APK.
In our case, we can ignore the warnings it prints. apktool created a new directory with a bunch of .smali files, organized by package: one file per class, containing their Dalvik bytecode.
.
├── AndroidManifest.xml
├── res
│   ├── values
│   │   ├── strings.xml
│   │   └── ...
│   ├── layout
│   │   ├── layout_home.xml
│   │   └── ...
│   └── ...
├── smali
│   ├── com
│       ├── bedrockstreaming
│       │   ├── app
│       │   │   ├── mobile
│       │   │   │   ├── R$anim.smali
│       │   │   │   ├── R$layout.smali
│       │   │   │   ├── R$string.smali
│       │   │   │   ├── R$style.smali
│       │   │   │   └── ...
│       │   │   └── ...
│       │   └── ...
│       └── google
│           ├── android
│           │   ├── exoplayer2
│           │   │   ├── AbstractConcatenatedTimeline.smali
│           │   │   ├── AudioBecomingNoisyManager.smali
│           │   │   ├── AudioFocusManager$AudioFocusListener$$ExternalSyntheticLambda0.smali
│           │   │   ├── AudioFocusManager$AudioFocusListener.smali
│           │   │   ├── AudioFocusManager.smali
│           │   │   ├── BasePlayer.smali
│           │   │   ├── BaseRenderer.smali
│           │   │   ├── BuildConfig.smali
│           │   │   └── ...
│           │   └── ...
│           └── ...
├── smali_classes2
│   ├── com
│   │   └── bedrockstreaming
│   │       ├── app
│   │       │   ├── mobile
│   │       │   │   ├── MobileApplication.smali
│   │       │   │   └── ...
│   │       │   └── ...
│   │       └── ...
│   └── ...
└── ...
If you see files with mangled names and contents, make sure that you run apktool on an APK with R8 obfuscation disabled, or you’ll have a hard time figuring things out.
Understanding Dalvik bytecode
Now, if you open one of these files, it will contain code that looks like the snippet below. It will look unfamiliar; that’s normal.
.method private final getContent()Lcom/bedrockstreaming/example/HomeViewModel$State$Content;
    .locals 2
    .line 119
    iget-object v0, p0, Lcom/bedrockstreaming/example/HomeViewModel;->state:Landroidx/lifecycle/LiveData;
    invoke-virtual {v0}, Landroidx/lifecycle/LiveData;->getValue()Ljava/lang/Object;
    move-result-object v0
    instance-of v1, v0, Lcom/bedrockstreaming/example/HomeViewModel$State$Content;
    if-eqz v1, :cond_0
    check-cast v0, Lcom/bedrockstreaming/example/HomeViewModel$State$Content;
    goto :goto_0
    :cond_0
    const/4 v0, 0x0
    :goto_0
    return-object v0
.end method
If you’ve ever worked with assembly code before, you might notice similarities in the way the code is written. Each line begins with an instruction, which can take comma-separated parameters. To work out what these instructions and their parameters mean, you will need to refer to the Dalvik bytecode documentation provided by Google.
Let’s take an example line from the snippet and decode it together. Looking at the table in the documentation, we can deduce this:
# We'll decode this line:
invoke-virtual {v0}, Landroidx/lifecycle/LiveData;->getValue()Ljava/lang/Object;
invoke-virtual # We're calling a virtual method
{v0}, # We're calling the method on the object referenced in register v0
Landroidx/lifecycle/LiveData; # The method we're calling is defined by androidx.lifecycle.LiveData
->getValue() # We're calling a method called getValue()
Ljava/lang/Object; # This method returns an Object
With some determination, we can figure out what the snippet does. Here, we’re defining a getContent() method that tries to cast a LiveData’s value to State.Content and returns it, or null otherwise.
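The manual decoding of the invoke-virtual operand can also be done mechanically. The sketch below splits a smali method reference into its owner class, method name, parameter descriptors and return type. The function names parse_method_ref and class_name are made up for illustration, and the regular expression only covers references whose owner is a class (not an array type); it is not a full smali parser.

```python
import re

# Shape: L<owner>;-><name>(<parameter descriptors>)<return descriptor>
METHOD_REF = re.compile(
    r"^L(?P<owner>[^;]+);->(?P<name>[^(]+)\((?P<params>[^)]*)\)(?P<ret>.+)$"
)


def class_name(descriptor):
    """Turn 'Ljava/lang/Object;' into 'java.lang.Object'; leave other descriptors as-is."""
    if descriptor.startswith("L") and descriptor.endswith(";"):
        return descriptor[1:-1].replace("/", ".")
    return descriptor


def parse_method_ref(ref):
    """Split a smali method reference into its readable parts."""
    match = METHOD_REF.match(ref)
    if match is None:
        raise ValueError(f"not a method reference: {ref!r}")
    return {
        "owner": match["owner"].replace("/", "."),
        "name": match["name"],
        "params": match["params"],  # kept as the raw descriptor string, not split
        "returns": class_name(match["ret"]),
    }


parsed = parse_method_ref("Landroidx/lifecycle/LiveData;->getValue()Ljava/lang/Object;")
print(parsed)
```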
Using a decompiled APK as a debugging tool
Inspecting suspicious code
Before doing anything else, we can already start looking at the generated code to identify patterns that could cause issues. Problem is… there can be a lot of code to look through.
Before going this deep down the rabbit hole, we had already figured out that our issue was somehow related to instrumentation: disabling it fixed the issue, and so did downgrading to the previous release of the SDK. This means that if we want a clear look at what changes between a working APK and a broken one, we can just compare an APK instrumented by the previous SDK version with an APK instrumented by the current one!
Of course, we want to do this on the human-readable smali files, not the raw dex files. We can generate a full diff with the help of the diff tool:
diff -bur normal/ instrumented/
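A full diff can be very long, so it may help to first get only the list of classes that changed. The following sketch does that for two decoded APK directories. The function name changed_smali_files is hypothetical, and unlike diff -b it compares files byte for byte, so whitespace-only changes are reported too.

```python
import filecmp
import os


def changed_smali_files(old_dir, new_dir):
    """Return relative paths of .smali files that differ or exist on one side only."""

    def smali_files(root):
        found = set()
        for dirpath, _dirnames, filenames in os.walk(root):
            for filename in filenames:
                if filename.endswith(".smali"):
                    found.add(os.path.relpath(os.path.join(dirpath, filename), root))
        return found

    old_files, new_files = smali_files(old_dir), smali_files(new_dir)
    changed = old_files ^ new_files  # present on one side only
    for relative in old_files & new_files:
        same = filecmp.cmp(
            os.path.join(old_dir, relative),
            os.path.join(new_dir, relative),
            shallow=False,  # compare contents, not just size and timestamps
        )
        if not same:
            changed.add(relative)
    return sorted(changed)
```

With the directory names used in the diff command, this would be called as changed_smali_files("normal", "instrumented").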
In our case, it also proved useful to compare an APK that has been instrumented with one that hasn’t, to understand what that instrumentation is meant to achieve. Most of it was to notify the SDK of every HTTP request, along with its result.
As a simple example, the snippet below shows a class belonging to Picasso. We can see the HTTP calls it makes are being intercepted by the SDK.
--- normal/smali/com/squareup/picasso/NetworkRequestHandler.smali 2022-01-05 11:09:22.000000000 +0100
+++ instrumented/smali/com/squareup/picasso/NetworkRequestHandler.smali 2022-01-05 11:08:34.000000000 +0100
@@ -128,10 +128,26 @@
.line 103
:cond_4
+ instance-of v2, v1, Lokhttp3/Request$Builder;
+
+ if-nez v2, :cond_5
+
invoke-virtual {v1}, Lokhttp3/Request$Builder;->build()Lokhttp3/Request;
move-result-object v2
+ goto :goto_1
+
+ :cond_5
+ move-object v2, v1
+
+ check-cast v2, Lokhttp3/Request$Builder;
+
+ invoke-static {v2}, Lcom/vendor/instrumentation/okhttp3/OkHttp3Instrumentation;->build(Lokhttp3/Request$Builder;)Lokhttp3/Request;
+
+ move-result-object v2
+
+ :goto_1
return-object v2
.end method
Finding the source of the issue by iteration
We haven’t talked about apktool’s greatest strength yet: its ability to recompile an APK from the smali sources it has decompiled! This means we can effectively decompile an APK, make modifications to its low-level code, recompile and run it.
This proved really useful during our investigation. Since we have one directory with our APK in a bad state, and one directory with our APK in a good state, we can proceed by elimination to pinpoint exactly which single class, when modified, causes our bug.
In our case, a useful workflow was to start with a suspect—let’s say we think instrumenting the OkHttp classes might have caused the bug.
- Copy the OkHttp classes from the “bad” APK, and only those, to our “good” APK.
- Recompile and run the app.
- Does the bug occur?
- If it does, then that means it is caused by the instrumentation of at least one of the OkHttp classes. We can go through this process again, this time by selecting only a subset of OkHttp’s classes, and check if the bug still occurs, etc.
- If it doesn’t, revert the OkHttp classes and try again with another suspect.
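The copy step of this workflow can be sketched as a small helper. It assumes both APKs were decoded into directories that contain smali, smali_classes2, and so on, and that a given class sits in the same smali* directory on both sides (if it does not, you would end up with duplicate classes and need to remove the stale copy by hand). The function name copy_package is made up for this example.

```python
import shutil
from pathlib import Path


def copy_package(bad_root, good_root, package):
    """Copy the smali files of `package` (e.g. "okhttp3") from one decoded APK into another.

    Returns the copied package directories, relative to good_root.
    """
    relative = Path(*package.split("."))
    copied = []
    for smali_dir in sorted(Path(bad_root).glob("smali*")):
        source = smali_dir / relative
        if source.is_dir():
            target = Path(good_root) / smali_dir.name / relative
            # Overwrite the existing files of that package in the "good" tree.
            shutil.copytree(source, target, dirs_exist_ok=True)
            copied.append(str(target.relative_to(good_root)))
    return copied
```

Narrowing down to a subset of classes is then a matter of passing a more specific package, for example "okhttp3.internal.http".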
This process can be accelerated with a very simple script, to iterate faster. The recompilation step occurs incrementally, and so only takes a few seconds.
#!/bin/sh
# rebuild-and-run.sh
# Rebuild, sign and install an APK from its decompiled source.
# (c) 2022 Bedrock Streaming
# Inputs:
# DECOMPILED_APK_PATH: path to your previously decompiled APK directory
# KEYSTORE_PATH: path to your debug keystore
# KEYSTORE_PASSWORD: your debug keystore password
# The *.apk glob stays outside the quotes so that the shell expands it.
apktool --use-aapt2 b "$DECOMPILED_APK_PATH" \
&& apksigner sign -ks "$KEYSTORE_PATH" --ks-pass "pass:$KEYSTORE_PASSWORD" "$DECOMPILED_APK_PATH"/dist/*.apk \
&& adb install "$DECOMPILED_APK_PATH"/dist/*.apk
In our case, we narrowed down the issue to the instrumentation of a single class, okhttp3.internal.http.CallServerInterceptor: once it was reverted, the bug disappeared.
In fact, we narrowed it down to a very small patch with which the app runs fine:
.../okhttp3/internal/http/CallServerInterceptor.smali | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/apk/smali_classes2/okhttp3/internal/http/CallServerInterceptor.smali b/apk/smali_classes2/okhttp3/internal/http/CallServerInterceptor.smali
index c916149f..c26eab15 100644
--- a/apk/smali_classes2/okhttp3/internal/http/CallServerInterceptor.smali
+++ b/apk/smali_classes2/okhttp3/internal/http/CallServerInterceptor.smali
@@ -510,7 +510,7 @@
instance-of v8, v14, Lokhttp3/Response$Builder;
- if-nez v8, :cond_b
+ #if-nez v8, :cond_b
invoke-virtual {v14, v15}, Lokhttp3/Response$Builder;->body(Lokhttp3/ResponseBody;)Lokhttp3/Response$Builder;
@@ -574,7 +574,7 @@
instance-of v15, v8, Lokhttp3/Response$Builder;
- if-nez v15, :cond_e
+ #if-nez v15, :cond_e
invoke-virtual {v8, v14}, Lokhttp3/Response$Builder;->body(Lokhttp3/ResponseBody;)Lokhttp3/Response$Builder;
--
Basically, when the code took the branch guarded by this if statement, our response body got wrapped by com.vendor.instrumentation.okhttp3.OkHttp3Instrumentation:
invoke-static {v8, v14}, Lcom/vendor/instrumentation/okhttp3/OkHttp3Instrumentation;->body(Lokhttp3/Response$Builder;Lokhttp3/ResponseBody;)Lokhttp3/Response$Builder;
And what does this method do, you ask? Let’s take a look at the decompiled source in Android Studio, so that it’s a bit easier to read:
public Builder body(ResponseBody body) {
    try {
        if (body != null) {
            BufferedSource source = body.source();
            Buffer buffer = new Buffer();
            source.readAll(buffer);
            return this.impl.body(ResponseBody.create(body.contentType(), buffer.size(), buffer));
        }
    } catch (IOException var4) {
        log.error("IOException reading from source: ", var4);
    } catch (IllegalStateException var5) {
        log.error("IllegalStateException reading from source: ", var5);
    }
    return this.impl.body(body);
}
The body is being read into memory!
source.readAll(buffer);
When correlating this discovery with the source code from ExoPlayer, we could verify that, indeed, our player was expecting that the time it takes reading the response body would be the time it took to download the entire video segment. Here’s what this flow looks like in a functional app:
[Sequence diagram: ExoPlayer reads the response from the server and can measure the time it took to read the whole response.]
But with this bug, since the HTTP response had already been buffered into memory by the SDK, the read was always almost instantaneous, no matter the speed of the connection. Additionally, it hurt overall performance, since responses were no longer streamed to the code that actually consumes them.
[Sequence diagram: the instrumentation SDK hands the ResponseBody (length=2560) to ExoPlayer only after everything is downloaded.]
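The effect on a consumer that times its own reads can be reproduced with a toy model. This is a simplified sketch, not ExoPlayer's actual bandwidth estimator: the fake clock, the chunk size and the 1024 bytes-per-second link are assumptions chosen for illustration, and only the 2560-byte length echoes the diagram.

```python
class FakeClock:
    """Deterministic clock: waiting on the network just advances a counter."""

    def __init__(self):
        self.now = 0.0

    def advance(self, seconds):
        self.now += seconds


def network_chunks(clock, total_bytes, chunk_size, bytes_per_second):
    """Yield chunk sizes, charging the transfer time of each chunk to the clock."""
    remaining = total_bytes
    while remaining > 0:
        size = min(chunk_size, remaining)
        clock.advance(size / bytes_per_second)
        remaining -= size
        yield size


def estimate_throughput(clock, chunks):
    """Bytes read divided by the time spent reading them (a player-side estimate)."""
    start = clock.now
    total = sum(chunks)
    elapsed = clock.now - start
    return total / elapsed if elapsed > 0 else float("inf")


# Streaming: the consumer pulls chunks as they arrive and sees the real link speed.
clock = FakeClock()
streamed = estimate_throughput(clock, network_chunks(clock, 2560, 512, 1024))

# Buffering: draining the source first mimics readAll(); the network time is spent here,
# so the consumer's later read costs no time at all.
clock = FakeClock()
buffered_chunks = list(network_chunks(clock, 2560, 512, 1024))
buffered = estimate_throughput(clock, buffered_chunks)

print(streamed, buffered)
```

In the streamed case the estimate matches the simulated link speed, while in the buffered case the read takes no measurable time and the estimate becomes unbounded, which is the kind of reading that pushes a player toward the highest bit rate.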
Using a decompiled APK as a review tool
It’s no secret to developers in any software ecosystem that library updates can be a source of problems: security vulnerabilities, bugs, incompatibilities, and so on. It’s hard to vet them properly, especially in compiled form, like libraries distributed in the Java ecosystem. Things get even harder when arbitrary Gradle plugins start rewriting our own code!
The tooling needed to decompile an APK is free, fast, and easy to automate. It’s a really helpful tool to investigate obscure bugs in places your debugger won’t let you place a breakpoint, and it’s also really useful to be able to see a human-readable diff between two binaries.
Generating a diff of the effects of a library upgrade can seem overkill and hard to do in practice, but at least in the case of bug-fix releases with hopefully few changes, it can be very helpful to have an actual report of what changed. It’s an accepted practice to review the code your team checks in; why not review the code of others, since it ends up in the exact same artifact?