PERSONAL LLM AGENTS: INSIGHTS AND SURVEY ABOUT THE CAPABILITY, EFFICIENCY AND SECURITY

type

status

date

slug

summary

Introduction

we focus primarily on the aspects related to “personal” parts within Personal LLM Agents, encompassing the analysis and utilization of users’ personal data, the use of personal resources, deployment on personal devices, and the provision of personalized services

Some approaches have attempted to automatically learn to support tasks through supervised learning or reinforcement learning. However, these methods also rely on a substantial amount of manual demonstrations and/or the definition of reward functions.

The main content and contributions of this paper can be summarized as follows:

We summarize the status quo of existing IPAs in both industry and academia, while analyzing their primary limitations and future trends in the LLM era.

We collect insights from senior domain experts in the area of LLM and personal agents, proposing a generic system architecture and a definition of intelligence levels for personal LLM agents.

We review the literature on three important technical aspects of personal LLM agents, including foundamental capabilities, efficiency, and security & privacy.

A Brief History of Intelligent Personal Assistants

Timeline View of the Intelligent Personal Assistants History

Speech Recognition System → Voice-based Software → Virtual Personal Assistant →LLM-based Chatbot

**Major milestones in the history of intelligent personal assistants (IPAs)**

Technical View of the Intelligent Personal Assistants History

Template-based Programming：

Given a user command, the agent first map the command to the most relevant template, then follow the predefined steps to complete the task.

Supervised Learning Methods

how to learn a representation of software GUI and how to train the interaction model

Supervised Learning Methods (1)

名称

描述

Humanoid

generate human-like test inputs based on the GUI layout information

Seq2act

Transformer network : mapping(Instructions→action tuples;UI obj→obj descriptions;action tuples↔descriptions A pair of UIs as input to capture the semantics info

PW2SS

Mobile GUI Understanding:Pixel-Words(atomic text and graphic components)→a pixel-based GUI understanding framework→screen sentence

Versatile ui transform

A multimodal Transformer: images, structures and language→5 distinct tasks.(jointly encoders and trained) an auto-regressive transformer:language input↔command or question↔ans.

UIBert

Based on the self-aligned characteristics between components of different modalities a well-designed joint image-text model

SpotLight

vision-only approach : take screenshots and a region of interest (the “focus”) as input composed of a vision encoder and a language decoder

Lexi

leverage text-based instruction manuals and user guides to curate a multimodal dataset fuse text and visual features as input to the co-attention transformer layers,

UINav

automation agents on previously seen tasks

Reinforcement Learning Methods

During the interaction, the agent gets feedback of rewards that indicate the progress of task completion, and it gradually learns how to automate the tasks by maximizing the reward payoff.To train RL-based task automation agents, a reward function that indicates the progress towards task completion is required.

Early Adoption of Foundation Models

Let LLMs use tools autonomously to accomplish complex tasks.

Some commercial products have attempted to integrate LLM with IPA.

Many issues related to efficiency, security and privacy have not been adequately addressed yet.

Personal LLM Agents:Definition & Insights

LLM-based agent deeply integrated with personal data, personal devices, and personal services.

Key Components

The combination of these key components is analogous to an operating system.

Main components of Personal LLM Agents	traditional operating systems
Foundation Model	`kernel`
Local recourse	`driver programs`
User context and memory	`program contexts and system logs`
Skills	`software application`

**Main components of Personal LLM Agents**

Intelligence Levels of Personal LLM Agents

We categorize the intelligence levels of Personal LLM Agents into five levels, denoted as L1 to L5

Level	Key Characteristics	Representative Use Cases
L1 - Simple Step Following	Agent completes tasks by following exact steps predefined by the users or the developers.	- User: “Open Messenger”; Agent opens the app named Messenger. - User: “Open the first unread email in my mailbox and read its content”; Agent follows the command step by step. - User: “Call Alice”; Agent matches a developer-defined template, finds Alice’s phone number in the address book, and calls the number.
L2 - Deterministic Task Automation	Based on the user’s description of a deterministic task, agent auto-completes the necessary steps in a predefined action space.	- User: “Check the weather in Beijing today”; Agent automatically calls the weather API with parameter “Beijing” and parses info. from the response. - User: “Make a video call to Alice”; Agent automatically opens the address book, finds Alice’s contact, and clicks on “video chat”. - User: “Tell the robot vacuum to clean the room tonight”; Agent opens the robot vacuum app, clicks ‘schedule’, and sets the time to tonight.
L3 - Strategic task Automation	Based on user-specified tasks, agents autonomously plan the execution steps using various resources and tools, and iterates the plan based on intermediate feedback until completion.	- User: “Tell Alice about my schedule for tomorrow”; Agent gathers tomorrow’s schedule information from the user’s calendar and chat history, then summarizes and sends them to Alice via Messenger. - User: “Find out which city is suitable for travel recently”; Agent lists several cities suitable for travel, checks the weather in each city, summarizes the information, and returns recommendations. - User: “Record my sleep quality tonight”; Agent checks every 10 minutes during sleep time if the user is using the phone, moving, or snoring (based on smartphone sensors and microphone), summarizes the information, and generates a report.
L4 - Memory and Context Awareness	Agent senses user context, understands user memory, and proactively provides personalized services at appropriate times.	- Agent recommends suitable financial products automatically based on User’s recent income and expenses, considering User’s personality and risk preference. - Agent estimates User’s recent anxiety level based on the conversations and behaviors, recommends movies/music to help relax and notifies user’s friends or doctors depending on the severity. - When a user falls in the bathroom, the Agent detects the event and decides whether to ask the user, notify the user’s family members, or call for help based on the user’s age and physical conditions.
L5 - Autonomous Avatar	Agent fully represents the user in completing complex affairs, can interact on behalf of user with other users or agents, ensuring safety and reliability.	- Agent automatically reads emails and messages on behalf of User, replies to questions without user intervention, and summarizes them into an abstract. - Agent attends the work discussion meeting on behalf of the user, expresses opinions based on user’s work log, listens to suggestions, and writes the minutes. - Agent records User’s daily diet and activities, privately researches or ask experts on any anomalies, and makes health improvement suggestions.

**The duties of Personal LLM Agents at different intelligence levels**

Opinions on Common Problems

Questions	Rank 1st	Rank 2nd	Rank 3rd
where to deploy the LLM	edge-cloud collaborated architecture	local deployment	cloud-only
how to customize the agents	combine the advantages of both fine-tuning and in-context learning	fine-tuning	in-context learning
what modalities to use	Text	Image	Video
which LLM ability is the most crucial for IPA products	Language understanding	academic→long context	common sense reasoning
how to interact with the agents	Voice-based	Text-based	GUI
which agent ability is needed to develop	intelligent and autonomous decision-making	Continuous improvement of user experience and interaction methods	Secure handling of personal data

what features are desired for an ideal IPA?

Efficient Data Management and Search

Autonomous Task Planning and Completion

Work and Life Assistance

Emotional Support and Social Interaction

Personalized Services and Recommendations

Digital Representative and Beyond

what are the most urgent technical challenges?

Intelligent: 1️⃣Multimodal Support 、2️⃣Context Understanding and Context-aware Actions 、3️⃣Enhancing Domain-specific Abilities of Lightweight LLM

Performance: 1️⃣Effective LLM Compression or Compact Architecture2️⃣Practical Local-Remote Collaborative Architecture

Security & Privacy: 1️⃣Data Security and Privacy Protection 2️⃣Inference Accuracy and Harmlessness

Personalization & Storage:efficient data storage solutions to manage and leverage userrelated data

Traditional OS Support:LLM-friendly interfaces and support (API) of traditional operating systems like Android

Fundamental Capabilities

Task execution 、context sensing、memorization

Task Automation Methods

1. code-based Task Automation

Slot-filling method : slot-value pairs

Program synthesis method

fine-tune LLMs to use specific APIs
utilize the chain reasoning and in-context learning ability of LLMs→show descriptions and demonstrations of the tools (e.g. APIs, other DNNs, etc.) in context

2. UI-based Task Automation

text-based GUI representation

Multimodal representation

Autonomous Agent Frameworks

Evaluation

Metrics and Benchmarks

Context Sensing

Hardware-based:acquisition through various sensors

Software-based:a form of software-base sensing(typing habits)

purposes:

Enabling Sensing Tasks:

Supplementing Contextual Information:

Triggering Context-aware Services:

Augmenting Agent Memory:

Sensing sorces:

Hardware Sensor,Software Sensor,Combination of Multiple Sensors

How to understand and utilize?

Sensor Data as Prompt.

Sensor Data Encoding + Fine-tuning.

Redirecting Sensor Data to Domain-Specific Models.

Sensing Targets:

Sensing the Environment

Scene sensing:more tangible environmental factors(locations and places)
Occasion perception:deeper environmental information(religious and cultural backgrounds…)

Sensing the User

Short-term user sensing
Long-term user sensing

Memorizing

the capability to record, manage and utilize historical data

Obtaining Memory

From Logging and Inferring

Managing and Utilizing Memory

Raw Data Management and Processing：selecting, filtering, transforming to other formats, etc

Memory-augmented LLM Inference:

Short-term memory : the form of symbolic variables during the current decision cycle.This includes perceptual inputs, active knowledge and other core information from memory or the previous.
Long-term memory : experiences from earlier decision cycles. This includes history event flows, game trajectories from previous episodes, interaction information.

Agent Self-evolution:

Learning Skills: skills as code or APIs through the strategic use of prompts.Agents could acquire novel skills by linking skills within a foundational skill set.
Finetuning LLM:

reasons:

LLMs were not specifically designed for agent-specific use cases.

Limited device makes it difficult to acquire new skills through prior knowledge and in-context learning abilities.

New knowledge and tools change the task schemas,demanding adaptation of LLMs.

Fine-tuned smaller LLMs could outperform prompted larger LLMs with reduced inference and expense in specific case.

Effiency

Three processes desire careful optimization of effiency:

Inference(bottleneck):a lot of both computation and memory resources

Customization:different context tokens or tune the LLM with domain-specific data

Memory manipulation:require access to longer contexts or external memories,handle and managemeny.

**The mapping relations between the low-level processes and high-level capabilities**

Efficient Inference

Model Compression

Quantization:reduce the model size by using fewer bits to represent the model parameters, and also reduces computations with system-level support for quantized kernels.

post-training quantization(PTQ) : quantize after train(available and flexible) quantization-aware training(QAT) : train after quantization

weight only quantization(WOQ):integer quantization(INT4 and INT8) on weights only, while preserving activations in float formats (FP16 and FP32).
INT8 quantization for both weights and activations: the activations including KV pairs are more difficult to quantize because of outliers.
low-bit floating point quantization(FP4 and FP8): higher computational performance

Pruning: removing less important connections in the network

Structured pruning: removes weights in regular patterns, such as a rectangle block Unstructured pruning: eazier to maintain model accuracy but less hardware-friendly

Knowledge Distillation(KD): using a well-performing teacher model (a large number of parameters and high precision) to guide the training of a lightweight student model(fewer parameters and lower precision)

White-box: require teacher model’s parameters Black-box: don’t require

KD is also adopted in QAT and pruning techniques to enhance the training performance.

Low-rank Factorization: Low-rank Factorization can be combined with quantization and pruning.

Inference Acceleration

the computational cost of attention increases near quadratically with the context length

KV Cache：storing (i.e., “caching”) and incrementally updating the Key-Value (KV) pairs in each token’s generation.

Context Compression:

Co-quantization of weights and activations, including KV cache.

compress the context at the prefill stage based on different importance of tokens(one-shot and cannot prune the KV cache).

a learnable mechanism to continuously determine and drop uninformative tokens.

reduce computations of less important tokens instead of directly removing them.

Kernel Optimization:

Small-batch or single-batch inference is especially important for edge scenarios.

The complexity of attention scales quadratically with the sequence length, while that of the FFN scales linearly.

efficient attention kernels including FlashAttention and Flashdecoding++
reduce the computational complexity of attention from the algorithm aspect and achieve linear complexity for self-attention in the prefill phase.
reduce dequantization overhead.

Speculative Decoding:

An effective approach in small-batch inference to improve the latency.

Speculative decoding mitigates this challenge by “guessing” several subsequent tokens through a lightweight “draft model”, and then validating the draft tokens in batches using the large “oracle model”.

Memory Reduction

KV cache and model weights are two major causes of this memory overhead.

short-context→the model compression / long-context→the KV cache

Energy Optimization

energy-consumption increase the runtime cost and carbon footprint, hurt the quality of experience (QoE) due to increased temperature and shorten battery lifespan.

Two major cause: computation and memory access

software perspective: model compression,KV cache,efficient attention kernels hardware perspective: utilize efficient processors including NPUs and TPUs/FPGA-based.

The research on energy efficiency is insufficient due to the complexity of hardware deployment and the volatility of energy measurement and analysis.

Efficient Customization

Two ways: feed the LLM with different contextual prompts and tune the LLM with domain-specific data.

Efficiency: context loading efficiency and LLM fine-tuning efficiency.

Context Loading Efficiency:

A straightforward way is to prune some redundant tokens or shorten the context length.

Another way is to reduce the bandwidth consumption during context data transmission.

different input prompts may have overlapping text segments,pre-computing,storing and reuse these segments.

Fine-tuning Efficiency:

Parameter-efficient fine-tuning (PEFT)

Efficient Optimizer Design(skip)

Training Data Curation

a small amount of high-quality data can lead to significantly reduced training cost and achieve capabilities comparable to large-scale datasets and models.

Efficient Memory Manipulation

Retrieve external memory which will be injected through prompt concatenation or intermediate layer cross-attention.

Search Efficiency

a brute-force approach results in a computational complexity of O(DN )→indexing is commonly employed to expedite query searching by reducing the number of required comparisons.

Typical Indexing Algorithms:partitioning methods include randomization→data structures

Hardware-aware Index Optimization：the utilization of disk-based indexes or the co-design of hardware and algorithms

Search Mechanism Design：

Multiple similarity criteria can be employed to evaluate vector similarity.Rule-based or estimated-cost-based methods configured offline are often employed to determine the optimal search plan.

Combine vector search with metadata filters.

Search Process Execution:

Several hardware acceleration methods,multi-threading,multi-core parallelism,GPU,distributed clusters.

Workflow Optimization

Traditional workflow is sequential, with inference/retrieval stage idle while conducting retrieving/generation.

→the potential of execuation parallelism and retrieval locality of requests

Pipelining

RaLMSpec: enable a local cache for speculative retrieval,cache prefetching, optimal speculation stride scheduler, and asynchronous verification.

PipeRAG:adjust the vector search space depending on the latency expectation of the upcoming token.

Caching

Select reason: the temporal and spatial locality of retrieved documents.use knowledge tree to organize in the GPU and host memory hierarchy.

the embedding and generative model are equivalent→Query Caching or Query-Doc Caching and save computation overhead

Security and Privacy

Three security principles including confidentiality integrity, and reliability

Confidentiality:the protection of user data privacy

Integrity: the intention not modified or influenced by malicious parties

Reliability: the agents’ internal mistakes

Confidentiality

Local Processing

lightweight models ,deployment frameworks and compression techniques.

Secure Remote Processing

Homomorphic encryption(HE):employ encryption to encode in the client→server conducts inference on the ciphertext→decryption

Challenge:certain operations in the LLMs, such as max, min, and softmax, cannot be accurately performed using HE;inference speed is slow.

Multi-Party Communicatio

Using the trusted execution environments

Data Masking

transform the original inputs into a form that is not privacy-sensitive while preserving the crucial information

hide or replace sensitive content

embedding-based data anonymization approaches(outperform HE but more risky)

Information Flow Control

There may also exist the risks of privacy leakage in the model output.

Rule-based permission control to constrain what LLMs can do and what LLMs can access.

Intergrity

output the intended content correctly, even when faced with various types of attacks

Adversarial Attacks

Attack through the specialized customization of the model’s inputs(image,text,graph…) or malicious tampering with the model.

adversarial defense, abnormal input detection, input preprocessing, output security verification, and more.

adversarial training through parameter-efficient fine-tuning.

Backdoor Attacks

through data poisoning , inserting maliciously modified samples into the model’s training data, enabling the model to learn deliberate hidden decision logic

modify the model input during the test time

modify the prompts, essentially fine-tunes the model’s parameters and thus alters its decision logic

distill benign knowledge from poisoned pre-trained encoders and transfer it to a new encoder

Prompt Injection Attacks

Bypass the preset security safeguards by using subtle or special diction in the prompts.

ensure the transparency and security of the LLM’s prompts.

distinguish the third-party from the system’s inherent promptsRe.

Reliability

Numerous critical actions include some sensitive operations→reliability.

Problems

Hallucination:incorrect answers,coherent and fluent but ultimately erroneous.

Unrecognized Operation:agent is required to execute actions,which have higher requirements for the format and executability of their outputs.

Sequential Reliability:LLMs are initially pre-trained on sequential data,while problems in the real world may not be fully addressed sequentially.

Improvement

Alignment:

the use of pre-training and finetuning,incorporating human values and intentions into their training.

Or reinforcement learning techniques.

Self-Reflection:

leverage the model’s self-reflection and check the consistency between the responses.

Or enable multiple large model agents to engage in mutual discussion and verification.

Retrieval Augmentation.

Inspection

focus on how to enhance or understand the reliability of agents based on the already generated results.

Verification:Constrained Generation.

Explanation:incorporate user opinions or human assistance.

Intermediate Feature Analysis:harness hierarchical information across different layers of the model.

Conclusion and Outlook

The emergence of large language models presents new opportunities for the development of intelligent personal assistants, offering the potential to revolutionize the way of human-computer interaction. In this paper, we focus on Personal LLM Agents, systematically discussing several key opportunities and challenges based on domain expert feedback and extensive literature review.

Currently, research on Personal LLM Agents is in the early stages. Task execution capabilities are still relatively inadequate, and the range of supported functionalities is rather narrow, leaving significant room for improvement. Moreover, ensuring the efficiency, reliability and usability of such personal agents requries to address numerous critical performance and security issues. There exists an inherent tension between the need of large-scale parameters in LLM to achieve better service quality and the constraints of resource, privacy and security in personal agents.

Going forward, except for addressing the respective challenges in each specific direction, a joint effort is needed to establish the whole software/hardware stack and ecosystem for Personal LLM Agents. Researchers and engineers also need to carefully consider the responsibility of such technology to guarantee the benign and assistive nature of Personal LLM Agents.

🎉Welcome to My Blog🎉

📖I’m currently a junior in the University of Electronic Science and Technology of China(UESTC),majoring in Software Engineering.

💡This blog is used to record my learning experiences.

🥳If you have any suggestions,I’d be very glad to communicate with you!