PhoneClaw’s Vision Targeting system uses the Moondream AI vision model to locate UI elements on screen using natural language descriptions. This enables automation scripts to interact with elements that don’t have accessible content descriptions or stable View IDs.
`magicClicker()` finds and clicks UI elements described in natural language:
```kotlin
@JavascriptInterface
fun magicClicker(description: String) {
    mainScope.launch {
        try {
            speakText("Looking for $description on screen")

            // 1. Capture screenshot
            val screenshot = takeScreenshotForAPI()
            if (screenshot == null) {
                speakText("No screenshot available")
                return@launch
            }

            // 2. Encode to base64
            val base64Image = bitmapToBase64(screenshot)

            // 3. Call Moondream API
            val coordinates = callMoondreamAPI(base64Image, description)

            if (coordinates != null) {
                // 4. Convert normalized coords to pixels
                val pixelX = (coordinates.x * 720).toFloat() + 50f
                val pixelY = (coordinates.y * 1600).toFloat()

                // 5. Simulate click
                withContext(Dispatchers.Main) {
                    MyAccessibilityService.instance?.simulateClick(pixelX, pixelY)
                    speakText("Clicked on $description")
                }

                // 6. Log for tracking
                trackMagicRun(
                    "magicClicker",
                    description,
                    "{\"x\": ${pixelX.toInt()}, \"y\": ${pixelY.toInt()}}"
                )
            } else {
                speakText("Could not find $description on screen")
            }
        } catch (e: Exception) {
            speakText("Error with magic click: ${e.message}")
        }
    }
}
```
JavaScript Usage:
```javascript
// Natural language element targeting
magicClicker("login button");
magicClicker("profile icon in top right corner");
magicClicker("red heart icon");
magicClicker("first video thumbnail");

// Works with icons that have no text
magicClicker("hamburger menu");
magicClicker("settings gear icon");
```
Coordinate Offset: PhoneClaw adds a 50px X-offset to account for UI chrome. Adjust the pixelX calculation if clicks are consistently off-target.
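Moondream returns coordinates normalized to the 0–1 range, so the click position depends on the assumed screen size. A minimal sketch of the conversion, using the 720×1600 dimensions and 50px offset hardcoded in `magicClicker` above (a real implementation should query the device's actual display metrics; `toPixels` is a hypothetical helper, not part of PhoneClaw's API):

```javascript
// Assumed constants matching the values hardcoded in magicClicker above.
const SCREEN_W = 720;
const SCREEN_H = 1600;
const X_OFFSET = 50; // compensates for UI chrome

// Convert Moondream's normalized {x, y} in [0, 1] to screen pixels.
function toPixels(norm) {
  return {
    x: norm.x * SCREEN_W + X_OFFSET,
    y: norm.y * SCREEN_H,
  };
}

// A detection at the center of the screen:
const p = toPixels({ x: 0.5, y: 0.5 });
// p.x === 410, p.y === 800
```

If clicks land consistently off-target, this conversion (screen size or offset) is the first place to check.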
`magicScraper()` extracts information from the screen using natural language:

```kotlin
@JavascriptInterface
fun magicScraper(description: String): String {
    return try {
        runBlocking(Dispatchers.IO) {
            withTimeout(30000) {
                val screenshot = takeScreenshotForAPI()
                    ?: return@withTimeout "Error: No screenshot"
                val base64Image = bitmapToBase64(screenshot)
                val result = callStreamingAPIWithImage(base64Image, description)
                trackMagicRun("magicScraper", description, result)
                result
            }
        }
    } catch (e: Exception) {
        "Error: ${e.message}"
    }
}
```
JavaScript Usage:
```javascript
// Extract specific information
const username = magicScraper("current username displayed");
const followerCount = magicScraper("number of followers");
const batteryLevel = magicScraper("battery percentage in status bar");
const time = magicScraper("current time shown at top");

// Use scraped data
if (parseInt(followerCount) > 10000) {
  speakText(`Wow, ${followerCount} followers!`);
}
```
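Because `magicScraper` returns free text, numeric answers can come back as "10,523" or "10.5K followers", and a bare `parseInt` would truncate at the first non-digit character ("10,523" parses as 10). A hedged sketch of a normalizing helper (`parseNumber` is hypothetical, not part of PhoneClaw's API):

```javascript
// Normalize a free-text numeric answer from magicScraper.
// Strips commas and expands K/M suffixes before parsing.
function parseNumber(text) {
  const m = String(text).replace(/,/g, "").match(/([\d.]+)\s*([KkMm])?/);
  if (!m) return NaN;
  let n = parseFloat(m[1]);
  if (m[2]) n *= m[2].toUpperCase() === "K" ? 1e3 : 1e6;
  return n;
}

parseNumber("10,523");          // 10523
parseNumber("10.5K followers"); // 10500
parseNumber("1.2M");            // 1200000
```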
Moondream responses are cleaned to extract concise information:
```kotlin
private fun cleanScrapingResponse(answer: String, originalDescription: String): String {
    var cleaned = answer.trim()

    // Remove common AI prefixes
    val prefixesToRemove = listOf(
        "The ", "I can see ", "Looking at the image, ",
        "In the image, ", "The screen shows "
    )
    for (prefix in prefixesToRemove) {
        if (cleaned.startsWith(prefix, ignoreCase = true)) {
            cleaned = cleaned.substring(prefix.length)
            break
        }
    }

    // Extract specific patterns
    cleaned = when {
        originalDescription.lowercase().contains("battery") -> {
            val percentageRegex = Regex("(\\d+)%")
            percentageRegex.find(cleaned)?.value ?: cleaned
        }
        originalDescription.lowercase().contains("time") -> {
            val timeRegex = Regex("\\d{1,2}:\\d{2}\\s*(AM|PM|am|pm)?")
            timeRegex.find(cleaned)?.value ?: cleaned
        }
        else -> cleaned
    }

    return if (cleaned.isNotEmpty()) cleaned else "Not found"
}
```
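The pattern-extraction step can be exercised outside the app. A sketch that mirrors the battery and time regexes above in JavaScript, useful for testing prompts against expected extractions (`extractPattern` is a test-only port, not part of PhoneClaw itself):

```javascript
// JS port of the pattern-extraction branch of cleanScrapingResponse.
// Falls back to the raw answer when no pattern matches.
function extractPattern(answer, description) {
  const d = description.toLowerCase();
  if (d.includes("battery")) {
    const m = answer.match(/(\d+)%/);          // e.g. "85%"
    return m ? m[0] : answer;
  }
  if (d.includes("time")) {
    const m = answer.match(/\d{1,2}:\d{2}\s*(AM|PM|am|pm)?/); // e.g. "3:45 PM"
    return m ? m[0] : answer;
  }
  return answer;
}

extractPattern("The battery is at 85% right now", "battery percentage"); // "85%"
extractPattern("It is 3:45 PM", "current time");                         // "3:45 PM"
```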
PhoneClaw logs all vision-based interactions for debugging:
```kotlin
private suspend fun trackMagicRun(
    mode: String,
    inputDescription: String,
    output: String
) = withContext(Dispatchers.IO) {
    val database = Firebase.database
    val index = magicRunIndex++
    val data = mapOf(
        "input_description" to inputDescription,
        "output" to output,
        "mode" to mode,
        "index" to index,
        "timestamp" to System.currentTimeMillis(),
        "device_id" to phoneDeviceId
    )
    database.getReference("unit_tests")
        .child(phoneDeviceId)
        .child(index.toString())
        .setValue(data)
        .await()
}
```