> Visual-interface control agents: the capability that lets an Agent operate a computer

* Anthropic's [[Computer Use]]
    * https://www.anthropic.com/news/3-5-models-and-computer-use
    * https://docs.anthropic.com/en/docs/build-with-claude/computer-use
    * https://www.facebook.com/ihower/posts/10161528069893971
* paper: Large Language Model-Brained GUI Agents: A Survey https://arxiv.org/abs/2411.18279v6 (2024/11)
* https://twitter.com/leeoxiang/status/1777685999143026953 (2024/4/9)
    * (1) Windows platform: UFO: A UI-Focused Agent for Windows OS Interaction.
    * (2) iOS platform: Apple Ferret-UI, a multimodal large language model (MLLM) optimized specifically for understanding mobile user interface (UI) screens. Ferret-UI has referring, grounding, and reasoning capabilities, which let it understand and interact with UI screens more effectively. https://arxiv.org/abs/2410.18967
    * (3) Android platform: ScreenAI. At its core is a new textual representation of screenshots that identifies the type and location of UI elements.
* awesome papers: https://github.com/francedot/acu
* agent
    * https://x.com/LangChainAI/status/1881023825933942886 (2025/1/20)
    * https://github.com/Upsonic/Upsonic
    * UI-TARS https://github.com/bytedance/UI-TARS
    * https://x.com/TsingYoga/status/1881570775263859047

## Browser tools

* https://browser-use.com/
* https://github.com/steel-dev/steel-browser
* https://www.browserbase.com/
* https://github.com/web-infra-dev/midscene
* https://x.com/yadong_xie/status/1871189552192430152 (2024/12/23)
* Roundup: https://x.com/johnrushx/status/1883872256121774401 (2025/1/27)

## Open Computer Use

* https://github.com/e2b-dev/open-computer-use

## AutoGLM

https://xiao9905.github.io/AutoGLM/

## Microsoft OmniParser

https://github.com/microsoft/OmniParser
https://www.jiqizhixin.com/articles/2024-10-26-4
https://www.microsoft.com/en-us/research/articles/omniparser-v2-turning-any-llm-into-a-computer-use-agent/

## GPT-4V

https://aiemploye.com/
https://osu-nlp-group.github.io/SeeAct/
https://github.com/OthersideAI/self-operating-computer (controls the computer)

## MultiOn

* https://www.multion.ai/
* More: https://www.kadoa.com/blog/ai-agents-hype-vs-reality

## Visual agent control

* web agents: https://twitter.com/omarsar0/status/1742923330544706035 (2024/1/4)
* https://twitter.com/omarsar0/status/1753889394111479852 (2024/2/4)
* https://baai-agents.github.io/Cradle/
* WebVoyager
* Skyvern (open source)
    * https://github.com/Skyvern-AI/skyvern
    * https://twitter.com/tuturetom/status/1787296091475780054 (2024/5/6)
* https://github.com/X-PLUG/MobileAgent
* https://www.airtop.ai/

## Evaluation

* OSWorld
    * https://twitter.com/dotey/status/1778605434229731667
    * https://os-world.github.io/
* Mind2Web
    * https://osu-nlp-group.github.io/Mind2Web/
* paper: The Dawn of GUI Agent: A Preliminary Case Study with Claude 3.5 Computer Use (2024/11)
    * https://arxiv.org/abs/2411.10323
    * https://x.com/omarsar0/status/1858526493661446553
* https://gui-agent.github.io/grounding-leaderboard/
* https://x.com/ChiYeung_Law/status/1875179243401019825 (2025/1/3)