Using AI to Expedite Real-World Hands-On System Test

Semi-automated system test harness for mobile testing. JotBunker as the case study: 17 scenarios in 45 minutes.

JotBunker syncs across two apps: phone and computer. End-to-end encrypted, peer-to-peer, three-way merge engine. Lots of edge cases: tombstones, last-write-wins, category renames, divergence gates, restore-from-backup paths. Every build automatically runs more than 200 unit tests. That’s great for regression and sanity checking. But real-world system testing is different: each test case needs a lot of manual setup. I don’t want to manually add 600 list and scratchpad items just for one edge-case test.

So this week I built a harness that takes the human out of the data-entry loop. Claude Code stages both sides directly. I still tap Sync myself (more on that below). 17 ordered tests in about 45 minutes.

The trick: own the storage, write to it directly

JotBunker uses Zustand on both sides. On the computer, that’s flat JSON envelopes at %APPDATA%\Jotbunker\stores\. On Android, it’s React Native AsyncStorage, which is a SQLite database at /data/data/com.jotbunker.myapp/databases/RKStorage you can reach through adb shell run-as on a debuggable build. Same envelope shape on both sides.

That’s the whole trick. Generate state. Write it to both stores. Tap Sync. Read the debug log.
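A minimal sketch of the “generate state” step, assuming a store that holds a flat list of items. The `{"state": ..., "version": ...}` wrapper is the shape Zustand's persist middleware writes by default. The item fields (`id`, `text`, `updatedAt`), the `items` key, and the function names are hypothetical placeholders, not JotBunker's real schema.

```shell
#!/bin/sh
# Hypothetical fixture generator. The item fields (id, text, updatedAt) and the
# "items" key are placeholders, not JotBunker's real schema.

# gen_items COUNT TIMESTAMP: print a JSON array of COUNT placeholder items.
gen_items() {
  count=$1
  stamp=$2
  i=1
  printf '['
  while [ "$i" -le "$count" ]; do
    if [ "$i" -gt 1 ]; then printf ','; fi
    printf '{"id":"item-%d","text":"Item %d","updatedAt":%d}' "$i" "$i" "$stamp"
    i=$((i + 1))
  done
  printf ']'
}

# gen_envelope COUNT TIMESTAMP: wrap the items in a {"state":...,"version":...}
# envelope, the shape Zustand's persist middleware writes by default.
gen_envelope() {
  printf '{"state":{"items":%s},"version":0}\n' "$(gen_items "$1" "$2")"
}
```

Something like `gen_envelope 600 "$(date +%s)" > lists.json` would then produce a file to copy into the computer's stores directory. The file name `lists.json` is also an assumption.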

The two adb one-liners that made this fast:

# Snapshot the phone's entire AsyncStorage (binary-safe pull)
adb exec-out "run-as com.jotbunker.myapp cat databases/RKStorage" > RKStorage.bak

# Stop the app, then apply a SQL wipe script over stdin (no /sdcard traps)
adb shell am force-stop com.jotbunker.myapp
cat wipe.sql | adb shell "run-as com.jotbunker.myapp sqlite3 databases/RKStorage"

Those two lines took half a day of trial and error to find. Piping the script over stdin and pulling with exec-out sidestep every “permission denied on /sdcard” trap and every “Git Bash mangled my Windows path” trap I hit on the way to a working flow.
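The contents of wipe.sql aren't shown here, so the following is only a sketch of how such a script might be generated. It assumes the `catalystLocalStorage (key, value)` table that legacy React Native AsyncStorage keeps inside RKStorage. The function names and the store key passed in are hypothetical.

```shell
#!/bin/sh
# sql_quote TEXT: escape TEXT for use inside a single-quoted SQL literal
# by doubling every single quote.
sql_quote() {
  printf '%s' "$1" | sed "s/'/''/g"
}

# wipe_sql: print SQL that empties the whole AsyncStorage table.
# Assumes the catalystLocalStorage table of legacy AsyncStorage.
wipe_sql() {
  printf 'DELETE FROM catalystLocalStorage;\n'
}

# seed_sql KEY JSON: print SQL that writes one envelope under KEY,
# replacing any existing row with the same key.
seed_sql() {
  printf "INSERT OR REPLACE INTO catalystLocalStorage (key, value) VALUES ('%s', '%s');\n" \
    "$(sql_quote "$1")" "$(sql_quote "$2")"
}
```

A staged run could write `{ wipe_sql; seed_sql some-store-key "$envelope"; } > seed.sql` and pipe that file through the same `adb shell "run-as ... sqlite3 ..."` line, with the app force-stopped first.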

The 17 tests

#	Scenario	Why it matters
1	Both sides empty	Smoke test
2	Computer populated, phone empty, null ancestor	Restore-from-backup path
3	Phone wiped, gate trips, user picks computer	New-phone-doesn’t-erase-computer regression
4	Edit vs delete	Tombstone wins despite later edit
5	Concurrent text edits	Last-write-wins resolution
6	True tie (same field, same timestamp)	Per-row picker dialog
7	Reorder both sides differently	Position-change path
8	Category rename on phone only	Category-detail-line fix
9	Scratchpad edit on one side	Scratchpad-detail-line fix
10	Two-cycle tombstone GC	Deletion-survives-one-sync invariant
11	Restore wipes ancestor mid-flight	The fix that motivated all of this
12	Locked Lists with sensitive text	No secret leaks into the log
13	100 items in one slot	Performance at the documented cap
14	Mirror of test 3, user picks phone	Gate trip in the other direction
15	Routine no-op sync	Baseline sanity
16	System max (1200 items + scratchpads)	The absolute upper bound
17	Over-cap (105 items in one slot)	What happens past the documented limit

The takeaway

I could take this even further and fake the taps with adb shell input tap x y. For this use case I won’t, because half the bugs I’m trying to catch are timing-sensitive to the actual user gesture. I want the test to look exactly like a user, with data travelling through the system end-to-end, because that’s what the product looks like in the field.

If your app’s storage is in a place you can read and write (and most are), and your sync engine emits a structured log (it should, for triage anyway), then this whole approach is just shell scripts and patience. The technique is older than I am: IBM had me writing harnesses like this in the 1990s. The new part is that an AI agent can write the fixtures, drive the workflow, and verify the log against an expected signature, all from a markdown test plan. The cost of a test is now the cost of typing “go run test 4”.
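One way the “verify the log against an expected signature” step could look, as a sketch only. It treats a signature as a file of text fragments that must appear in the log in order. That signature format and the function name are assumptions, not JotBunker's actual log format or harness.

```shell
#!/bin/sh
# check_signature LOG EXPECTED
# Succeeds (exit 0) when every non-blank line of EXPECTED occurs as a literal
# substring somewhere in LOG, in the same order. Hypothetical signature format.
check_signature() {
  awk '
    BEGIN { i = 1 }
    # Lines from the first file argument are the expected fragments.
    FILENAME == ARGV[1] { if ($0 != "") want[++n] = $0; next }
    # Lines from the log: advance past each fragment found, in order.
    { while (i <= n && index($0, want[i]) > 0) i++ }
    END { exit (i > n ? 0 : 1) }
  ' "$2" "$1"
}
```

Called as `check_signature debug.log test4.expected`, it gives a pass/fail exit code that a test runner or an agent can branch on. Both file names are placeholders.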

Peace.
