https://twitter.com/ruhai11111/status/1766005647353729509 2024/3/8
https://twitter.com/aigclink/status/1789493671211221179 2024/5/12
https://twitter.com/tuturetom/status/1788222040597807336 2024/5/8
https://x.com/tuturetom/status/1789638307594621374 2024/5/12
Crawl4AI: Open-source LLM Friendly Web Crawler & Scraper
https://x.com/shao__meng/status/1788922127498027431
https://github.com/unclecode/crawl4ai 30k
Python scraper based on AI
https://github.com/ScrapeGraphAI/Scrapegraph-ai 18k
A web scraping and browser automation library (Node.js)
https://github.com/apify/crawlee 17k
Crawlee
https://github.com/apify/crawlee-python 5k
Parsera: Lightweight Python library for scraping websites with LLMs
https://github.com/raznem/parsera 1k
End-to-end data extraction
https://www.reworkd.ai/
shot-scraper: screenshots
https://shot-scraper.datasette.io/en/stable/
https://simonwillison.net/2024/Apr/17/ai-for-data-journalism/#scraping-shot-scraper
* Comparative review of 6 open-source AI web crawler frameworks in 2024: feature overview and use-case analysis (2024/12/11)
* https://mp.weixin.qq.com/s?__biz=Mzg2OTk1NDQ4Ng==&mid=2247485505&idx=1&sn=b3c247c5fa1e75f01d2be66134f4ea2a&chksm=ce946e98f9e3e78e190a87e82478dd72689acc71a7cb8e83d35f68e21c402ecc3ac8b3f990a4&scene=178&cur_album_id=3741698036108378116#rd
* Skyvern, ScrapegraphAI
* Crawl4AI, Jina Reader, Firecrawl, Markdowner
## For LLM
* https://www.webpilot.ai/
* Firecrawl https://www.firecrawl.dev/ (widely recommended)
* Recommended by Jerry Liu: https://twitter.com/jerryjliu0/status/1781122933349572772
* https://www.firecrawl.dev/extract
* Jina Reader https://jina.ai/reader/
## Paid
* https://apify.com/
* https://crawlbase.com/
* https://www.zyte.com/
* https://www.bytebot.ai/
## Jina Reader
* https://jina.ai/reader
* ReaderLM-v2 can be turned on
* https://jina.ai/zh-TW/models/ReaderLM-v2/
* Can run locally: https://ollama.com/milkey/reader-lm-v2
* License: CC BY-NC
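
A minimal sketch of calling Jina Reader from Python, assuming the publicly documented pattern of prefixing the target URL with `https://r.jina.ai/`. The helper names `reader_url` and `fetch_markdown`, the `User-Agent` string, and the timeout are hypothetical choices for illustration. Rate limits and API keys are not covered here.

```python
from urllib.request import Request, urlopen

# Assumption: Jina Reader returns an LLM-friendly rendering of a page when
# the page URL is appended to this prefix.
READER_PREFIX = "https://r.jina.ai/"


def reader_url(target_url: str) -> str:
    """Build the Reader URL for a page (hypothetical helper)."""
    if not target_url.startswith(("http://", "https://")):
        raise ValueError("target_url must start with http:// or https://")
    return READER_PREFIX + target_url


def fetch_markdown(target_url: str, timeout: float = 30.0) -> str:
    """Fetch the Reader output for a page as text (performs a network call)."""
    request = Request(
        reader_url(target_url),
        headers={"User-Agent": "reader-notes-sketch"},  # arbitrary label
    )
    with urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8")
```

Calling `fetch_markdown("https://example.com/")` should return the page content as text that can be passed to an LLM. `reader_url` alone is enough when another HTTP client is preferred.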
## WebPilot
https://www.webpilot.ai/post-gpts/
## Convert to Markdown
https://twitter.com/dotey/status/1755491935874118005
https://twitter.com/9hills/status/1779830990598549998
### ReaderLM
https://huggingface.co/jinaai/reader-lm-1.5b
https://ollama.com/library/reader-lm
https://x.com/JinaAI_/status/1879551743748706487 (2025/1/15)
https://jina.ai/news/readerlm-v2-frontier-small-language-model-for-html-to-markdown-and-json/
## paper
* Autocrawler
* AutoCrawler: A Progressive Understanding Web Agent for Web Crawler Generation
* https://twitter.com/omarsar0/status/1782462314983071757
* https://arxiv.org/abs/2404.12753
## HtmlRAG: HTML is Better Than Plain Text for Modeling Retrieved Knowledge in RAG Systems
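The paper evaluates plain text, Markdown, and HTML as formats for retrieved content and finds that HTML is better suited to RAG. HTML preserves more of the original document's semantic and structural information, so it gives a fuller knowledge context than plain text.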
* https://arxiv.org/abs/2411.02959v2
* https://github.com/plageon/HtmlRAG/blob/main/toolkit/README.md
* https://www.facebook.com/ihower/posts/10162035775833971
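
To make the HtmlRAG note above concrete, here is a rough sketch of the idea of keeping HTML structure while dropping noise before handing a page to an LLM. It is not the HtmlRAG toolkit: `StructureKeepingCleaner`, `clean_html`, and the tag sets are hypothetical names and choices for this illustration, and it uses only Python's standard `html.parser`. It removes scripts, styles, comments, and all attributes, and keeps the tag skeleton and text.

```python
import html
import re
from html.parser import HTMLParser

# Assumptions for this sketch: which tags to drop entirely and which
# void tags to omit from the output.
DROP_TAGS = {"script", "style"}
VOID_TAGS = {"br", "hr", "img", "input", "meta", "link"}


class StructureKeepingCleaner(HTMLParser):
    """Keep tag structure and text; drop attributes, scripts, styles, comments."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []
        self.skip_depth = 0  # > 0 while inside a dropped tag

    def handle_starttag(self, tag, attrs):
        if tag in DROP_TAGS:
            self.skip_depth += 1
        elif self.skip_depth == 0 and tag not in VOID_TAGS:
            self.parts.append(f"<{tag}>")  # attributes are discarded

    def handle_endtag(self, tag):
        if tag in DROP_TAGS:
            self.skip_depth = max(0, self.skip_depth - 1)
        elif self.skip_depth == 0 and tag not in VOID_TAGS:
            self.parts.append(f"</{tag}>")

    def handle_data(self, data):
        # Skip text inside dropped tags and whitespace-only runs.
        if self.skip_depth == 0 and data.strip():
            collapsed = re.sub(r"\s+", " ", data)
            self.parts.append(html.escape(collapsed, quote=False))

    # Comments are dropped because handle_comment is left as the default no-op.


def clean_html(raw_html: str) -> str:
    cleaner = StructureKeepingCleaner()
    cleaner.feed(raw_html)
    cleaner.close()
    return "".join(cleaner.parts)
```

Compared with converting to plain text, the output still carries headings, paragraphs, lists, and tables as tags, which is the structural information the note says helps RAG. Attributes are discarded, so link targets in `href` are lost in this sketch.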