why hard? https://twitter.com/bindureddy/status/1744894481999278291
https://twitter.com/jerryjliu0/status/1741270281203978588 (2023/12/31) 三種方式
1. Python libraries + OCR
2. Cloud Platform providers
3. Multi-modal (screenshot with a model like GPT-4V or Gemini, extract output).
推薦 UnstructuredIO (介於上述1~2之間) 和 recursive retrieval 最好
https://levelup.gitconnected.com/a-guide-to-processing-tables-in-rag-pipelines-with-llamaindex-and-unstructuredio-3500c8f917a7
各種收集整理 https://twitter.com/giannis2two/status/1775208991905243499
1) Google Document AI
2) AWS Textract
3) Unstructured
4) LlamaParse
5) pdf2image + pytesseract
* Florian 的 PDF Parsing 文章
* https://pub.towardsai.net/advanced-rag-02-unveiling-pdf-parsing-b84ae866344e
* https://medium.com/@florian_algo/unveiling-pdf-parsing-how-to-extract-formulas-from-scientific-pdf-papers-a8f126f3511d
* https://generativeai.pub/demystifying-pdf-parsing-01-overview-130f9e4064c2
* https://ai.gopubby.com/demystifying-pdf-parsing-02-pipeline-based-method-82619dbcbddf
* https://medium.com/ai-advances/demystifying-pdf-parsing-03-ocr-free-small-model-based-method-c71310988129
* https://ai.gopubby.com/demystifying-pdf-parsing-04-ocr-free-large-multimodal-model-based-method-0fdab50db048
* https://medium.com/ai-advances/demystifying-pdf-parsing-06-representative-industry-solutions-5d4a1cfe311b 比較不同框架的 PDF 作法
* https://medium.com/@florian_algo/list/document-intelligence-and-pdf-parsing-2334780a5667
* https://x.com/9hills/status/1817045481903784190 2024/7/27 方案整理
1. document-convert(开源) https://github.com/multimodal-art-projection/MAP-NEO/tree/main/Matrix/document-convert
2. Ragflow(开源) https://github.com/infiniflow/ragflow
3. gptpdf(开源) https://github.com/CosmosShadow/gptpdf
4. 百度云Textmind(闭源) https://cloud.baidu.com/product/textmind.html
5. doc2x(闭源) https://doc2x.noedgeai.com
6. 腾讯云文档解析(闭源) https://cloud.tencent.com/document/product/1759/107504
7. marker(开源) https://github.com/VikParuchuri/marker
8. PDF-Extract-Kit(开源) https://github.com/opendatalab/PDF-Extract-Kit
1. https://x.com/aigclink/status/1812506226821087494
9. zerox(开源) https://github.com/getomni-ai/zerox
10. OminiParse(开源) https://github.com/adithya-s-k/omniparse
11. MinerU(开源) https://github.com/opendatalab/MinerU
* 各种PDF提取文字工具
* https://x.com/HackerMeta/status/1868497254904143937 (2024/12/16)
* 2024 年 12 款开源文档解析框架的选型对比评测
* https://mp.weixin.qq.com/s/nL9m6IdYgPxCvkQKe0nOXQ (2024/12/10)
* MinerU, PaddleOCR
* Marker, Unstructured,
* gptpdf, Zerox
* Chunkr, pdf-extract-api, Sparrow
* https://github.com/lumina-ai-inc/chunkr
* https://github.com/CatchTheTornado/pdf-extract-api
* https://github.com/katanaml/sparrow
## LlamaParse
https://github.com/run-llama/llama_parse
比較其他PDF parser: https://twitter.com/llama_index/status/1762158562657374227
https://twitter.com/llama_index/status/1767948064659210310 (2024/3/14) 正式推出
https://cloud.llamaindex.ai/parse
https://www.youtube.com/watch?v=5Ahdg1DkPMc
- [ ] LlamaParse JSON Mode + Multimodal RAG https://github.com/run-llama/llama_parse/blob/main/examples/demo_json.ipynb
- [ ] 範例 https://twitter.com/ravithejads/status/1771761321295429640
- [ ] 範例 https://twitter.com/ravithejads/status/1771761321295429640
- table https://github.com/run-llama/llama_parse/blob/main/examples/demo_advanced.ipynb
- multi-modal https://github.com/run-llama/llama_parse/blob/main/examples/demo_json.ipynb
- table comparsion https://github.com/run-llama/llama_parse/blob/main/examples/demo_table_comparisons.ipynb
- comic https://github.com/run-llama/llama_parse/blob/main/examples/demo_parsing_instructions.ipynb
- https://github.com/run-llama/llama_parse/blob/main/examples/multimodal/multimodal_rag_slide_deck.ipynb
- Multimodal RAG https://x.com/llama_index/status/1822058106354069520 (2024/8/10)
## Unstructured
langchain 和 llamaindex 這兩家都常用 https://unstructured.io/ 來做
案例: https://unstructured.io/blog/streamlining-healthcare-compliance-with-ai
## Datalab
https://www.datalab.to/
## PyMuPDF4LLM
https://pymupdf.readthedocs.io/en/latest/pymupdf4llm/index.html
https://github.com/pymupdf/RAG
https://medium.com/@benitomartin/building-a-multimodal-llm-application-with-pymupdf4llm-59753cb44483
## pdfplumber
https://github.com/jsvine/pdfplumber
https://twitter.com/akshay_pachaar/status/1770791047947239721
https://github.com/patchy631/machine-learning/blob/main/random/extracting_text_from_pdf.ipynb
## markitdown
https://github.com/microsoft/markitdown
簡單的包裹各種 python converter 套件 而已
## Docling
* https://ds4sd.github.io/docling/
* https://github.com/DS4SD/docling
* https://simonwillison.net/2024/Nov/3/docling/
## MegaParse
https://github.com/quivrhq/megaparse
https://x.com/omarsar0/status/1863962985409982482
## Gemini 2.0 Flash
* https://www.sergey.fyi/articles/gemini-flash-2 (2025/1/15)
* https://www.philschmid.de/gemini-pdf-to-data
* https://x.com/SullyOmarr/status/1887900502496600119
* https://ai.gopubby.com/10x-cheaper-pdf-processing-ingesting-and-rag-on-millions-of-documents-with-gemini-2-0-flash-8a93dbbb3b54 (2025/2)
* 還順便做了 chunking
## Marker: converts PDF, EPUB, and MOBI to markdown.
https://github.com/VikParuchuri/marker
https://twitter.com/dotey/status/1734129116167729596
tweet 上的回覆評論是 中文好像不行,而且速度很慢
* https://www.facebook.com/kunfeng.lee.18/posts/10161390370194834
* 轉檔速度比較慢,但慢工出細活,在表格轉文字部份比 LlamaParse 好,而且這個專案的特點是,圖片會保留下來(雖然不是切割的很好)
* https://x.com/VikParuchuri/status/1892275032916713814 (2025/2/20)
* 出了新版更厲害
*有人推薦 Camelot 和 Nougat*
https://camelot-py.readthedocs.io/en/master/
https://github.com/facebookresearch/nougat
轉 HTML: pdf2htmlEX
https://github.com/pdf2htmlEX/
## gptpdf
* PyMuPDF 搭配使用 gpt-4o 解析,轉乘 markdown
* 作法解析
* https://x.com/aigclink/status/1806866296010932416
* https://x.com/dotey/status/1807293377513218048
* 裡面的預設 prompt 用簡體中文啊
https://github.com/CosmosShadow/gptpdf
## 阿里巴巴 OmniParser
https://github.com/alibabaresearch/advancedliteratemachinery
https://github.com/AlibabaResearch/AdvancedLiterateMachinery/tree/main/OCR/OmniParser
https://twitter.com/tuturetom/status/1778810597825732920
## Zerox (typescript)
https://github.com/getomni-ai/zerox
基於 GPT-4o mini PDF OCR 的工具
* https://www.facebook.com/kunfeng.lee.18/posts/10161390370194834
* 看起來的原理是直接把每一頁的PDF當成圖檔丟給 GPT-4o mini 來轉檔,因為 4o mini 滿便宜的,所以可以這樣玩。轉出來的品質和 marker 相近,表格也轉的很不錯。轉 14 頁的 PDF ,看後台花不到台幣 2 元,可接受。
### Llamaindex
https://twitter.com/jerryjliu0/status/1711768455429722613 10/10
https://docs.llamaindex.ai/en/stable/examples/query_engine/sec_tables/tesla_10q_table.html
https://twitter.com/llama_index/status/1711768906866864403
https://twitter.com/clusteredbytes/status/1715050322903662751
效果範例 https://twitter.com/jerryjliu0/status/1710685292913668595
- [ ] RAG for Complex PDFs 錄影(1hr) https://twitter.com/llama_index/status/1733653470987825452
- [[Recursive Retriever]]
- https://www.youtube.com/watch?v=oa82yoJ6zYc
* [ ] How to Analyze Tables In Large Financial Reports Using GPT-4 (w/Jerry Liu, LlamaIndex)
* https://twitter.com/mayowaoshin/status/1723049234625184196
* https://www.youtube.com/watch?v=xT6JpDELKPg
* https://colab.research.google.com/drive/1DldMhszgSI4KKI2UziNHHM4w8Cb5OxEL#scrollTo=Ht4oSN2PvzUJ
* camelot
- [ ] Insights in building a Full-Stack Complex PDF AI Chatbot
- https://twitter.com/llama_index/status/1749248146302218742
- https://www.youtube.com/watch?v=TOeAe8KB68E
### LangChain
* https://github.com/langchain-ai/langchain/blob/master/cookbook/Semi_Structured_RAG.ipynb
* https://github.com/langchain-ai/langchain/blob/master/cookbook/Semi_structured_multi_modal_RAG_LLaMA2.ipynb
## Google Document AI
https://cloud.google.com/document-ai?hl=zh-TW
## PDFPlumb
https://github.com/jsvine/pdfplumber
## LLM Sherpa
另一家做 PDF parser
https://blog.llamaindex.ai/mastering-pdfs-extracting-sections-headings-paragraphs-and-tables-with-cutting-edge-parser-faea18870125
https://github.com/nlmatics/llmsherpa#layoutpdfreader
https://blog.llamaindex.ai/mastering-pdfs-extracting-sections-headings-paragraphs-and-tables-with-cutting-edge-parser-faea18870125 介紹性文章