AutoDroid
MobileLLM • Updated Nov 13, 2024
Relevance
- Current commercial AI assistants use developer-driven approaches with NLU modules and developer-defined functions, but face scalability challenges when supporting new tasks.
- LLMs can now utilize tools like search engines, code interpreters, and APIs.
Purpose
- Build an autonomous agent that completes user-specified tasks by interacting with the smartphone.
Key Problem
- GUI Representation: Convert GUI states and actions to text format to help LLMs understand and make decisions.
- Knowledge Integration: LLMs need domain-specific knowledge to navigate and complete tasks in complex smartphone apps.
- Cost Optimization: Optimize LLM query efficiency to provide a responsive task automation experience.
Method
Task-oriented UI Prompting
- A GUI parsing module converts each GUI state into a simplified HTML representation; it automatically scrolls the screen and records the information that appears.
- Restricting the action space with selections: the LLM is required to answer in the form "- id=<id number> action=<tap/input> input text=<text or N/A>" (when the task is complete, id=-1).
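The two bullets above can be illustrated with a minimal sketch. It assumes a hypothetical `UIElement` record and a small set of tag names (`button`, `input`, `p`); neither comes from the AutoDroid code base. The parser only follows the answer format quoted above.

```python
import re
from dataclasses import dataclass
from html import escape
from typing import List, Optional


@dataclass
class UIElement:
    """Hypothetical flattened view of one on-screen element."""
    id: int
    tag: str   # assumed tag names such as "button", "input", "p"
    text: str


def to_simplified_html(elements: List[UIElement]) -> str:
    """Render one line of simplified HTML per element, keeping the id the LLM must cite."""
    return "\n".join(
        f"<{e.tag} id={e.id}>{escape(e.text)}</{e.tag}>" for e in elements
    )


# Matches the required answer format: id=<n> action=<tap/input> input text=<text or N/A>
ACTION_RE = re.compile(
    r"id=(?P<id>-?\d+)\s+action=(?P<action>tap|input|N/A)\s+input text=(?P<text>.*)",
    re.IGNORECASE,
)


@dataclass
class Action:
    element_id: int
    action: str
    text: Optional[str]

    @property
    def finished(self) -> bool:
        # id=-1 signals that the task is complete
        return self.element_id == -1


def parse_action(reply: str) -> Action:
    """Extract the restricted action from an LLM reply, rejecting anything off-format."""
    match = ACTION_RE.search(reply)
    if match is None:
        raise ValueError(f"reply does not follow the action format: {reply!r}")
    text = match.group("text").strip()
    return Action(
        element_id=int(match.group("id")),
        action=match.group("action").lower(),
        text=None if text.upper() == "N/A" else text,
    )
```

For example, `parse_action("- id=1 action=input input text=hello world")` yields an `Action` with `element_id=1`, `action="input"`, and `text="hello world"`. Rejecting off-format replies keeps the agent from acting on free-form text.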
Exploration-based Memory Injection
Simulated Task Generation (LLM-Generated)
- UTG: the UI Transition Graph, recorded while exploring the app
- Simulated task: an LLM-generated description of the function of a UI element
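A rough sketch of how a UTG and its simulated tasks might be stored is given here. The class names, the state and element labels, and the `describe_function` template are all illustrative assumptions; in the notes above the description is produced by an LLM, which this stand-in does not call.

```python
from collections import defaultdict, deque
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class Transition:
    source: str    # GUI state before the action
    element: str   # UI element that was acted on
    target: str    # GUI state after the action


class UTG:
    """UI Transition Graph: states are nodes, element interactions are edges."""

    def __init__(self) -> None:
        self._out: Dict[str, List[Transition]] = defaultdict(list)

    def add(self, source: str, element: str, target: str) -> None:
        self._out[source].append(Transition(source, element, target))

    def transitions(self) -> List[Transition]:
        return [t for ts in self._out.values() for t in ts]

    def path_to(self, start: str, goal: str) -> Optional[List[Transition]]:
        """Breadth-first search for the shortest action sequence from start to goal."""
        queue = deque([(start, [])])
        seen = {start}
        while queue:
            state, path = queue.popleft()
            if state == goal:
                return path
            for t in self._out.get(state, []):
                if t.target not in seen:
                    seen.add(t.target)
                    queue.append((t.target, path + [t]))
        return None


def describe_function(t: Transition) -> str:
    """Stand-in for the LLM call that summarizes what an element does (assumption)."""
    return f"use '{t.element}' to go from {t.source} to {t.target}"


def simulated_tasks(utg: UTG) -> List[dict]:
    """One memory entry per explored transition: the simulated task and its element."""
    return [{"task": describe_function(t), "element": t.element} for t in utg.transitions()]
```

Calling `simulated_tasks(utg)` after exploration gives a list of task descriptions that can later be matched against a real user task.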
Augmenting Prompts with App Memory
- An embedding model encodes the simulated tasks {S} and the current task T; relevance is the cosine similarity between the embedding of each simulated task S and that of T.
- Give hints about the UI elements from the most similar tasks in {S}
- PromptGenerator(T, UI, History)
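The retrieval step can be sketched as follows. The bag-of-words `embed` function is a toy stand-in for a real sentence-embedding model, and the memory layout and prompt wording are assumptions made for illustration; only the cosine-similarity ranking and the `(T, UI, History)` inputs come from the notes above.

```python
import math
from collections import Counter
from typing import Dict, List


def embed(text: str) -> Counter:
    """Toy bag-of-words embedding; a real system would call an embedding model."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def top_k(task: str, memory: List[Dict], k: int = 2) -> List[Dict]:
    """Return the k simulated tasks most similar to the current task T."""
    t = embed(task)
    ranked = sorted(memory, key=lambda m: cosine(t, embed(m["task"])), reverse=True)
    return ranked[:k]


def prompt_generator(task: str, ui_html: str, history: List[str],
                     memory: List[Dict], k: int = 2) -> str:
    """Assemble a prompt from T, the current UI, the history, and memory hints."""
    hints = [
        f"- '{m['task']}' used elements: {', '.join(m['elements'])}"
        for m in top_k(task, memory, k)
    ]
    return "\n".join([
        f"Task: {task}",
        "Previous actions: " + ("; ".join(history) if history else "none"),
        "Current UI:",
        ui_html,
        "Hints from app memory:",
        *hints,
    ])
```

A call such as `prompt_generator("turn on wifi", ui_html, [], memory, k=1)` would place the closest simulated task and its elements at the end of the prompt as a hint.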
Tuning Local LLM with App-specific Data
Multi-granularity Query Optimization
- Pruning tokens by merging functionally equivalent elements.
- Two UI elements that lead to the same interface are merged.
- UI leaf nodes that share the same interactive ancestor (button, checkbox, text field, etc.) are merged.
- Reducing the number of queries with shortcuts and GUI merging.
- GUI merging puts several GUI states into one prompt when the LLM needs all of them to make a decision (for example, the states reached by scrolling down).
- Shortcuts execute simple actions directly with the help of the app memory, without querying the LLM.
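The two token-pruning rules can be sketched as below. The `Node` tree, the set of interactive tags, and the `target_of` lookup (element label to resulting interface, assumed to come from the app memory) are hypothetical; the sketch only shows the merging idea, not the paper's implementation.

```python
from dataclasses import dataclass, field
from typing import Dict, List

INTERACTIVE = {"button", "checkbox", "input"}  # assumed interactive tags


@dataclass
class Node:
    tag: str
    text: str = ""
    children: List["Node"] = field(default_factory=list)


def leaf_texts(node: Node) -> List[str]:
    """Collect the non-empty texts of all leaves under node."""
    if not node.children:
        return [node.text] if node.text else []
    return [t for child in node.children for t in leaf_texts(child)]


def merge_under_interactive(node: Node) -> List[Node]:
    """Collapse every interactive subtree into one node that joins its leaf texts."""
    if node.tag in INTERACTIVE:
        own = [node.text] if node.text else []
        texts = own + [t for child in node.children for t in leaf_texts(child)]
        return [Node(node.tag, "; ".join(texts))]
    if not node.children:
        return [node] if node.text else []
    return [n for child in node.children for n in merge_under_interactive(child)]


def merge_same_target(labels: List[str], target_of: Dict[str, str]) -> List[str]:
    """Merge elements whose known target interface is the same into one entry."""
    groups: Dict[str, List[str]] = {}
    order: List[List[str]] = []
    for label in labels:
        target = target_of.get(label)
        if target is not None and target in groups:
            groups[target].append(label)   # fold into the existing group
            continue
        group = [label]
        if target is not None:
            groups[target] = group
        order.append(group)
    return [" / ".join(group) for group in order]
```

With these helpers, a button that wraps two text leaves becomes a single prompt line, and two elements that open the same page share one entry, so the prompt carries fewer tokens for the same decision.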
Evaluation and Results
- Reproducible results
- Author: E1ainay
- URL: https://e1ainay.top/13d94b63edfa8191ace0ec88e141c3af
- Copyright: All articles in this blog, unless otherwise stated, are licensed under the BY-NC-SA agreement. Please credit the source.