Using AI to Expedite Real-World Hands-On System Test
Semi-automated system test harness for mobile testing. JotBunker as the case study: 17 scenarios in 45 minutes.
JotBunker syncs across two apps: phone and computer. End-to-end encrypted, peer-to-peer, three-way merge engine. Lots of edge cases: tombstones, last-write-wins, category renames, divergence gates, restore-from-backup paths. Every build automatically runs over 200 unit tests. That’s great for regression and sanity checking. But real-world system testing takes a lot of manual setup per test case. I don’t want to manually add 600 list and scratchpad items just for one edge-case test.
So this week I built a harness that takes the human out of the data-entry loop. Claude Code stages both sides directly. I still tap Sync myself (more on that below). 17 ordered tests in about 45 minutes.
The trick: own the storage, write to it directly
JotBunker uses Zustand on both sides. On the computer, that’s flat JSON envelopes at %APPDATA%\Jotbunker\stores\. On Android, it’s React Native AsyncStorage, which is backed by a SQLite database at /data/data/com.jotbunker.myapp/databases/RKStorage that you can reach through adb shell run-as on a debuggable build. Same envelope shape on both sides.
That’s the whole trick. Generate state. Write it to both stores. Tap Sync. Read the debug log.
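As a rough sketch of the "generate state" step, assuming a Zustand-persist-style envelope of the form {"state": ..., "version": ...}. The item fields (id, text, updatedAt), the function names, and the fixture file name are made up for illustration and are not JotBunker's actual schema.

```shell
# Sketch: build a JSON envelope holding N generated list items.
# Field names and the {state, version} wrapper are assumptions, not the
# app's real schema. The generated text contains no quotes or backslashes,
# so plain printf is enough; real text would need a JSON-aware tool.

make_items() {
  local count="$1" ts="$2" i=1 sep=""
  printf '['
  while [ "$i" -le "$count" ]; do
    printf '%s{"id":"item-%d","text":"Generated item %d","updatedAt":%d}' \
      "$sep" "$i" "$i" "$ts"
    sep=","
    i=$((i + 1))
  done
  printf ']'
}

make_envelope() {
  local count="$1" ts="$2"
  printf '{"state":{"items":%s},"version":0}\n' "$(make_items "$count" "$ts")"
}

# Example: 600 items sharing one fixed timestamp, written to a fixture file.
#   make_envelope 600 1700000000000 > lists.fixture.json
```

Passing the timestamp in, rather than reading the clock, keeps fixtures reproducible, which matters for cases like the true-tie test where both sides need the same timestamp.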
The two adb one-liners that made this fast:
# Snapshot the phone's entire AsyncStorage (binary-safe pull)
adb exec-out "run-as com.jotbunker.myapp cat databases/RKStorage" > RKStorage.bak
# Stop the app first so it isn't running while its database is rewritten
adb shell am force-stop com.jotbunker.myapp
# Apply a SQL wipe script over stdin (binary-safe push, no /sdcard traps)
cat wipe.sql | adb shell "run-as com.jotbunker.myapp sqlite3 databases/RKStorage"
Those two lines beat half a day of trial and error. Piping over stdin and pulling with exec-out sidestep every “permission denied on /sdcard” trap and every “Git Bash mangled my Windows path” trap I hit on the way to a working flow.
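The same stdin pipe can carry generated SQL for staging, not just wiping. A minimal sketch, assuming the legacy Android AsyncStorage layout in which RKStorage holds a catalystLocalStorage table with key and value columns. The helper names, the store key jotbunker-lists, and the file names are hypothetical.

```shell
# Sketch: wrap a stored value in an upsert for AsyncStorage's SQLite table.
# Assumes the table is catalystLocalStorage(key, value); verify against a
# snapshot of your own RKStorage before relying on it.

sql_quote() {
  # SQL string literals escape a single quote by doubling it.
  printf '%s' "$1" | sed "s/'/''/g"
}

sql_upsert() {
  # $1 = AsyncStorage key, $2 = value (for example a JSON envelope)
  printf "INSERT OR REPLACE INTO catalystLocalStorage (key, value) VALUES ('%s', '%s');\n" \
    "$(sql_quote "$1")" "$(sql_quote "$2")"
}

# Example (key and file names are placeholders):
#   sql_upsert "jotbunker-lists" "$(cat lists.fixture.json)" > stage.sql
#   adb shell am force-stop com.jotbunker.myapp
#   adb shell "run-as com.jotbunker.myapp sqlite3 databases/RKStorage" < stage.sql
```

Quoting is the part worth getting right: list text with an apostrophe in it would otherwise end the SQL literal early and corrupt the staged state.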
The 17 tests
| # | Scenario | Why it matters |
|---|---|---|
| 1 | Both sides empty | Smoke test |
| 2 | Computer populated, phone empty, null ancestor | Restore-from-backup path |
| 3 | Phone wiped, gate trips, user picks computer | New-phone-doesn’t-erase-computer regression |
| 4 | Edit vs delete | Tombstone wins despite later edit |
| 5 | Concurrent text edits | Last-write-wins resolution |
| 6 | True tie (same field, same timestamp) | Per-row picker dialog |
| 7 | Reorder both sides differently | Position-change path |
| 8 | Category rename on phone only | Category-detail-line fix |
| 9 | Scratchpad edit on one side | Scratchpad-detail-line fix |
| 10 | Two-cycle tombstone GC | Deletion-survives-one-sync invariant |
| 11 | Restore wipes ancestor mid-flight | The fix that motivated all of this |
| 12 | Locked Lists with sensitive text | No secret leaks into the log |
| 13 | 100 items in one slot | Performance at the documented cap |
| 14 | Mirror of test 3, user picks phone | Gate trip in the other direction |
| 15 | Routine no-op sync | Baseline sanity |
| 16 | System max (1200 items + scratchpads) | The absolute upper bound |
| 17 | Over-cap (105 items in one slot) | What happens past the documented limit |
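A check like test 12 can be approximated by planting canary strings in the locked lists and searching the log for them afterwards. This is only a sketch: the function name, the canary value, and the log file name are hypothetical, and it catches verbatim leaks only, not encoded or truncated ones.

```shell
# Sketch: report a leak if any planted canary string appears in the sync log.

log_leaks_secret() {
  # $1 = log file, remaining args = canary strings planted in locked lists.
  # Returns 0 (true) if any canary appears verbatim in the log, else 1.
  local log_file="$1" canary
  shift
  for canary in "$@"; do
    # -F treats the canary as a fixed string, not a regular expression.
    if grep -qF -- "$canary" "$log_file"; then
      printf 'LEAK: canary "%s" found in %s\n' "$canary" "$log_file" >&2
      return 0
    fi
  done
  return 1
}

# Example (names are placeholders):
#   if log_leaks_secret sync-debug.log "CANARY-7f3a91"; then echo "FAIL"; fi
```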
The takeaway
I could take this even further and fake the taps with adb shell input tap x y. For this use case I won’t, because half the bugs I’m trying to catch are sensitive to the timing of the actual user gesture. I want the test to behave exactly like a user, with data travelling through the system end-to-end, because that’s what the product looks like in the field.
If your app’s storage is somewhere you can read and write (and most apps’ storage is), and your sync engine emits a structured log (it should, for triage anyway), then this whole approach is just shell scripts and patience. The technique is older than I am: IBM had me writing harnesses like this in the 1990s. The new part is that an AI agent can write the fixtures, drive the workflow, and verify the log against an expected signature, all from a markdown test plan. The cost of typing “go run test 4” is now the cost of the test.
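Verifying a log against an expected signature can be as small as an ordered substring check. A sketch, assuming the signature is a plain text file with one expected fragment per line. The function name and file names are placeholders, and the log format is whatever your engine emits.

```shell
# Sketch: pass if every line of the signature file appears in the log as a
# fixed substring, in the same order. Extra log lines are ignored, and each
# fragment must match a later log line than the previous fragment.

log_matches_signature() {
  local log_file="$1" sig_file="$2"
  awk '
    # First file: collect the expected fragments, skipping blank lines.
    FILENAME == ARGV[1] { if ($0 != "") want[++n] = $0; next }
    # Log lines: advance when the next expected fragment appears.
    idx < n && index($0, want[idx + 1]) > 0 { idx++ }
    END {
      if (idx < n) {
        print "missing or out of order: " want[idx + 1] > "/dev/stderr"
        exit 1
      }
    }
  ' "$sig_file" "$log_file"
}

# Example (file names are placeholders):
#   log_matches_signature sync-debug.log test04.expected && echo PASS
```

Matching fragments rather than whole lines keeps timestamps and other per-run noise out of the signature.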
Peace.

