* https://twitter.com/ruhai11111/status/1766005647353729509 (2024/3/8)
* https://twitter.com/aigclink/status/1789493671211221179 (2024/5/12)
* https://twitter.com/tuturetom/status/1788222040597807336 (2024/5/8)
* https://x.com/tuturetom/status/1789638307594621374 (2024/5/12)
* Crawl4AI: Open-source LLM Friendly Web Crawler & Scraper
  * https://x.com/shao__meng/status/1788922127498027431
  * https://github.com/unclecode/crawl4ai (30k stars)
* ScrapeGraphAI: Python scraper based on AI
  * https://github.com/ScrapeGraphAI/Scrapegraph-ai (18k stars)
* Crawlee: A web scraping and browser automation library (Node.js)
  * https://github.com/apify/crawlee (17k stars)
* Crawlee for Python
  * https://github.com/apify/crawlee-python (5k stars)
* Parsera: Lightweight Python library for scraping websites with LLMs
  * https://github.com/raznem/parsera (1k stars)
* Reworkd: End-to-end data extraction
  * https://www.reworkd.ai/
* shot-scraper: screenshots
  * https://shot-scraper.datasette.io/en/stable/
  * https://simonwillison.net/2024/Apr/17/ai-for-data-journalism/#scraping-shot-scraper
* Comparison and review of 6 open-source AI web crawler frameworks in 2024: feature overview and use-case analysis (2024/12/11)
  * https://mp.weixin.qq.com/s?__biz=Mzg2OTk1NDQ4Ng==&mid=2247485505&idx=1&sn=b3c247c5fa1e75f01d2be66134f4ea2a&chksm=ce946e98f9e3e78e190a87e82478dd72689acc71a7cb8e83d35f68e21c402ecc3ac8b3f990a4&scene=178&cur_album_id=3741698036108378116#rd
  * Skyvern, ScrapegraphAI
  * Crawl4AI, Jina Reader, Firecrawl, Markdowner

## For LLM

* https://www.webpilot.ai/
* Firecrawl https://www.firecrawl.dev/ (widely recommended)
  * Recommended by Jerry Liu: https://twitter.com/jerryjliu0/status/1781122933349572772
  * https://www.firecrawl.dev/extract
* Jina Reader https://jina.ai/reader/

## Paid

* https://apify.com/
* https://crawlbase.com/
* https://www.zyte.com/
* https://www.bytebot.ai/

## Jina Reader

* https://jina.ai/reader
* ReaderLM-v2 can be turned on as an option
  * https://jina.ai/zh-TW/models/ReaderLM-v2/
  * Can run locally: https://ollama.com/milkey/reader-lm-v2
  * License: CC BY-NC

## WebPilot

* https://www.webpilot.ai/post-gpts/

## Converting to Markdown

* https://twitter.com/dotey/status/1755491935874118005
* https://twitter.com/9hills/status/1779830990598549998

### ReaderLM

* https://huggingface.co/jinaai/reader-lm-1.5b
* https://ollama.com/library/reader-lm
* https://x.com/JinaAI_/status/1879551743748706487 (2025/1/15)
* https://jina.ai/news/readerlm-v2-frontier-small-language-model-for-html-to-markdown-and-json/

## paper

* AutoCrawler
  * AutoCrawler: A Progressive Understanding Web Agent for Web Crawler Generation
  * https://twitter.com/omarsar0/status/1782462314983071757
  * https://arxiv.org/abs/2404.12753

## HtmlRAG: HTML is Better Than Plain Text for Modeling Retrieved Knowledge in RAG Systems

The paper compares plain text, Markdown, and HTML as formats for retrieved content and finds that HTML is better suited to RAG. HTML preserves more of the semantic and structural information in the original document, so it gives the model a more complete knowledge context than plain text does.

* https://arxiv.org/abs/2411.02959v2
* https://github.com/plageon/HtmlRAG/blob/main/toolkit/README.md
* https://www.facebook.com/ihower/posts/10162035775833971
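
To make the "HTML keeps structure" point concrete, here is a minimal sketch using only the Python standard library. It is not the HtmlRAG toolkit or the paper's cleaning and pruning pipeline; the `render` helper, the `SKIP` set, and the sample snippet are all hypothetical and only illustrate the idea. The same fragment is rendered twice: once as plain text, and once as simplified HTML with `script`/`style` content and all attributes dropped.

```python
from html.parser import HTMLParser

# Elements whose contents carry no readable knowledge (an assumption of this sketch).
SKIP = {"script", "style"}


class _Renderer(HTMLParser):
    """Walks the HTML once and collects either plain text or attribute-free HTML."""

    def __init__(self, keep_tags: bool):
        super().__init__(convert_charrefs=True)
        self.keep_tags = keep_tags
        self.parts: list[str] = []
        self._skip_depth = 0  # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in SKIP:
            self._skip_depth += 1
        elif self.keep_tags and not self._skip_depth:
            self.parts.append(f"<{tag}>")  # attributes are intentionally dropped

    def handle_endtag(self, tag):
        if tag in SKIP:
            self._skip_depth = max(0, self._skip_depth - 1)
        elif self.keep_tags and not self._skip_depth:
            self.parts.append(f"</{tag}>")

    def handle_data(self, data):
        text = " ".join(data.split())  # collapse whitespace
        if text and not self._skip_depth:
            self.parts.append(text)


def render(html: str, keep_tags: bool) -> str:
    """Return simplified HTML if keep_tags is True, otherwise plain text."""
    parser = _Renderer(keep_tags)
    parser.feed(html)
    parser.close()
    return "".join(parser.parts) if keep_tags else " ".join(parser.parts)


SAMPLE = """
<div class="card" id="x1">
  <style>.card { color: red; }</style>
  <h2>Crawlers</h2>
  <table>
    <tr><th>Name</th><th>Language</th></tr>
    <tr><td>Crawlee</td><td>Node.js</td></tr>
  </table>
  <script>track();</script>
</div>
"""

plain = render(SAMPLE, keep_tags=False)
cleaned = render(SAMPLE, keep_tags=True)
print(plain)
print(cleaned)
```

In the plain-text rendering the heading, the table headers, and the cells collapse into one run of words, so nothing marks "Language" as the column that "Node.js" belongs to. The simplified HTML still carries the `h2`, `th`, and `td` boundaries while leaving out the styling and script noise.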