Why is it hard? https://twitter.com/bindureddy/status/1744894481999278291

https://twitter.com/jerryjliu0/status/1741270281203978588 (2023/12/31)

Three approaches:

1. Python libraries + OCR
2. Cloud platform providers
3. Multi-modal (screenshot with a model like GPT-4V or Gemini, extract output)

The tweet recommends UnstructuredIO (which sits between approaches 1 and 2) combined with recursive retrieval as the best option:
https://levelup.gitconnected.com/a-guide-to-processing-tables-in-rag-pipelines-with-llamaindex-and-unstructuredio-3500c8f917a7

Assorted roundups:

https://twitter.com/giannis2two/status/1775208991905243499
1) Google Document AI 2) AWS Textract 3) Unstructured 4) LlamaParse 5) pdf2image + pytesseract

* Florian's PDF parsing articles
    * https://pub.towardsai.net/advanced-rag-02-unveiling-pdf-parsing-b84ae866344e
    * https://medium.com/@florian_algo/unveiling-pdf-parsing-how-to-extract-formulas-from-scientific-pdf-papers-a8f126f3511d
    * https://generativeai.pub/demystifying-pdf-parsing-01-overview-130f9e4064c2
    * https://ai.gopubby.com/demystifying-pdf-parsing-02-pipeline-based-method-82619dbcbddf
    * https://medium.com/ai-advances/demystifying-pdf-parsing-03-ocr-free-small-model-based-method-c71310988129
    * https://ai.gopubby.com/demystifying-pdf-parsing-04-ocr-free-large-multimodal-model-based-method-0fdab50db048
    * https://medium.com/ai-advances/demystifying-pdf-parsing-06-representative-industry-solutions-5d4a1cfe311b (compares how different frameworks handle PDFs)
    * https://medium.com/@florian_algo/list/document-intelligence-and-pdf-parsing-2334780a5667
* https://x.com/9hills/status/1817045481903784190 (2024/7/27) Roundup of solutions
    1. document-convert (open source) https://github.com/multimodal-art-projection/MAP-NEO/tree/main/Matrix/document-convert
    2. Ragflow (open source) https://github.com/infiniflow/ragflow
    3. gptpdf (open source) https://github.com/CosmosShadow/gptpdf
    4. Baidu Cloud TextMind (closed source) https://cloud.baidu.com/product/textmind.html
    5. doc2x (closed source) https://doc2x.noedgeai.com
    6. Tencent Cloud Document Parsing (closed source) https://cloud.tencent.com/document/product/1759/107504
    7. marker (open source) https://github.com/VikParuchuri/marker
    8. PDF-Extract-Kit (open source) https://github.com/opendatalab/PDF-Extract-Kit
        1. https://x.com/aigclink/status/1812506226821087494
    9. zerox (open source) https://github.com/getomni-ai/zerox
    10. OmniParse (open source) https://github.com/adithya-s-k/omniparse
    11. MinerU (open source) https://github.com/opendatalab/MinerU
* Assorted tools for extracting text from PDFs
    * https://x.com/HackerMeta/status/1868497254904143937 (2024/12/16)
* Comparative evaluation of 12 open-source document parsing frameworks in 2024, for tool selection
    * https://mp.weixin.qq.com/s/nL9m6IdYgPxCvkQKe0nOXQ (2024/12/10)
    * MinerU, PaddleOCR
    * Marker, Unstructured
    * gptpdf, Zerox
    * Chunkr, pdf-extract-api, Sparrow
        * https://github.com/lumina-ai-inc/chunkr
        * https://github.com/CatchTheTornado/pdf-extract-api
        * https://github.com/katanaml/sparrow

## LlamaParse

https://github.com/run-llama/llama_parse

Comparison with other PDF parsers: https://twitter.com/llama_index/status/1762158562657374227

https://twitter.com/llama_index/status/1767948064659210310 (2024/3/14) Official launch

https://cloud.llamaindex.ai/parse

https://www.youtube.com/watch?v=5Ahdg1DkPMc

- [ ] LlamaParse JSON Mode + Multimodal RAG https://github.com/run-llama/llama_parse/blob/main/examples/demo_json.ipynb
- [ ] Examples https://twitter.com/ravithejads/status/1771761321295429640
- table https://github.com/run-llama/llama_parse/blob/main/examples/demo_advanced.ipynb
- multi-modal https://github.com/run-llama/llama_parse/blob/main/examples/demo_json.ipynb
- table comparison https://github.com/run-llama/llama_parse/blob/main/examples/demo_table_comparisons.ipynb
- comic https://github.com/run-llama/llama_parse/blob/main/examples/demo_parsing_instructions.ipynb
- https://github.com/run-llama/llama_parse/blob/main/examples/multimodal/multimodal_rag_slide_deck.ipynb
- Multimodal RAG https://x.com/llama_index/status/1822058106354069520 (2024/8/10)

## Unstructured

Both LangChain and LlamaIndex commonly use https://unstructured.io/ for PDF parsing.

Case study: https://unstructured.io/blog/streamlining-healthcare-compliance-with-ai

## Datalab

https://www.datalab.to/

## PyMuPDF4LLM
https://pymupdf.readthedocs.io/en/latest/pymupdf4llm/index.html
https://github.com/pymupdf/RAG
https://medium.com/@benitomartin/building-a-multimodal-llm-application-with-pymupdf4llm-59753cb44483

## pdfplumber

https://github.com/jsvine/pdfplumber
https://twitter.com/akshay_pachaar/status/1770791047947239721
https://github.com/patchy631/machine-learning/blob/main/random/extracting_text_from_pdf.ipynb

## markitdown

https://github.com/microsoft/markitdown

Just a thin wrapper around various Python converter packages.

## Docling

* https://ds4sd.github.io/docling/
* https://github.com/DS4SD/docling
* https://simonwillison.net/2024/Nov/3/docling/

## MegaParse

https://github.com/quivrhq/megaparse
https://x.com/omarsar0/status/1863962985409982482

## Gemini 2.0 Flash

* https://www.sergey.fyi/articles/gemini-flash-2 (2025/1/15)
* https://www.philschmid.de/gemini-pdf-to-data
* https://x.com/SullyOmarr/status/1887900502496600119
* https://ai.gopubby.com/10x-cheaper-pdf-processing-ingesting-and-rag-on-millions-of-documents-with-gemini-2-0-flash-8a93dbbb3b54 (2025/2)
    * Also handles chunking along the way

## Marker: converts PDF, EPUB, and MOBI to markdown.
https://github.com/VikParuchuri/marker

https://twitter.com/dotey/status/1734129116167729596
Replies to the tweet say it does not seem to handle Chinese well and is very slow.

* https://www.facebook.com/kunfeng.lee.18/posts/10161390370194834
    * Conversion is slow, but the slow work pays off: table-to-text conversion is better than LlamaParse, and a distinctive feature of this project is that images are preserved (although they are not cropped very well).
* https://x.com/VikParuchuri/status/1892275032916713814 (2025/2/20)
    * A new, more capable version has been released.

*Some people recommend Camelot and Nougat*

https://camelot-py.readthedocs.io/en/master/
https://github.com/facebookresearch/nougat

Convert to HTML: pdf2htmlEX https://github.com/pdf2htmlEX/

## gptpdf

* Uses PyMuPDF together with gpt-4o to parse the PDF and convert it to markdown
* How it works
    * https://x.com/aigclink/status/1806866296010932416
    * https://x.com/dotey/status/1807293377513218048
* The built-in default prompt is written in Simplified Chinese

https://github.com/CosmosShadow/gptpdf

## Alibaba OmniParser

https://github.com/alibabaresearch/advancedliteratemachinery
https://github.com/AlibabaResearch/AdvancedLiterateMachinery/tree/main/OCR/OmniParser
https://twitter.com/tuturetom/status/1778810597825732920

## Zerox (TypeScript)

https://github.com/getomni-ai/zerox

A PDF OCR tool based on GPT-4o mini.

* https://www.facebook.com/kunfeng.lee.18/posts/10161390370194834
    * It appears to work by feeding each PDF page as an image directly to GPT-4o mini for conversion; because 4o mini is quite cheap, this is practical. Output quality is close to marker, and tables convert quite well. Converting a 14-page PDF cost less than NT$2 according to the billing dashboard, which is acceptable.

### LlamaIndex

https://twitter.com/jerryjliu0/status/1711768455429722613 (10/10)
https://docs.llamaindex.ai/en/stable/examples/query_engine/sec_tables/tesla_10q_table.html
https://twitter.com/llama_index/status/1711768906866864403
https://twitter.com/clusteredbytes/status/1715050322903662751

Example results: https://twitter.com/jerryjliu0/status/1710685292913668595

- [ ] RAG for Complex PDFs, recording (1 hr) https://twitter.com/llama_index/status/1733653470987825452
    - [[Recursive Retriever]]
    - https://www.youtube.com/watch?v=oa82yoJ6zYc
* [ ] How to Analyze Tables In Large Financial Reports Using GPT-4 (w/Jerry Liu, LlamaIndex)
    * https://twitter.com/mayowaoshin/status/1723049234625184196
    * https://www.youtube.com/watch?v=xT6JpDELKPg
    * https://colab.research.google.com/drive/1DldMhszgSI4KKI2UziNHHM4w8Cb5OxEL#scrollTo=Ht4oSN2PvzUJ
    * camelot
- [ ] Insights in building a Full-Stack Complex PDF AI Chatbot
    - https://twitter.com/llama_index/status/1749248146302218742
    - https://www.youtube.com/watch?v=TOeAe8KB68E

### LangChain

* https://github.com/langchain-ai/langchain/blob/master/cookbook/Semi_Structured_RAG.ipynb
* https://github.com/langchain-ai/langchain/blob/master/cookbook/Semi_structured_multi_modal_RAG_LLaMA2.ipynb

## Google Document AI

https://cloud.google.com/document-ai?hl=zh-TW

## LLM Sherpa

Another company building a PDF parser.

https://github.com/nlmatics/llmsherpa#layoutpdfreader
https://blog.llamaindex.ai/mastering-pdfs-extracting-sections-headings-paragraphs-and-tables-with-cutting-edge-parser-faea18870125 (introductory article)
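
## Sketch: page-as-image parsing

The notes on Zerox and gptpdf describe the same basic idea: treat each PDF page as an image, hand it to a model, and stitch the per-page output back together. A minimal sketch of that loop is below. It is not the actual implementation of either project. The names `render_pages`, `pages_to_markdown`, and the `transcribe` callable are hypothetical, the rendering step assumes the `pdf2image` package (with its Poppler dependency) is installed, and the model call is left as a function you supply.

```python
from typing import Callable, Iterable, List


def render_pages(pdf_path: str, dpi: int = 150) -> List[object]:
    """Render each PDF page to a PIL image (assumes pdf2image + Poppler are installed)."""
    # Imported lazily so the rest of the sketch runs without pdf2image.
    from pdf2image import convert_from_path

    return convert_from_path(pdf_path, dpi=dpi)


def pages_to_markdown(
    pages: Iterable[object],
    transcribe: Callable[[object], str],
    separator: str = "\n\n---\n\n",
) -> str:
    """Send each page image to `transcribe` and join the per-page markdown.

    `transcribe` is a hypothetical hook: in practice it would wrap a call to a
    multimodal model (e.g. GPT-4o mini), or to an OCR engine for plain text.
    """
    chunks = []
    for number, page in enumerate(pages, start=1):
        text = transcribe(page).strip()
        if text:  # skip pages for which nothing was returned
            chunks.append(f"<!-- page {number} -->\n{text}")
    return separator.join(chunks)


if __name__ == "__main__":
    # Stand-in transcriber so the sketch runs without a model or API key.
    fake_pages = ["page-one-image", "page-two-image"]
    print(pages_to_markdown(fake_pages, lambda p: f"# Text of {p}"))
```

Passing `pytesseract.image_to_string` as `transcribe` would give the "pdf2image + pytesseract" variant from the roundup list, which typically returns plain text rather than markdown tables. Keeping the model call behind a callable also makes it easy to compare cost and quality across backends on the same rendered pages.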