Video: https://www.youtube.com/watch?v=DL82mGde6wo
https://parahelp.com/blog/prompt-design
The following was converted into article form with Claude 4 Sonnet.
# Summary version
## 1. Adopt a "manager-style", highly detailed prompt structure
**Core idea**: treat the LLM like a newly hired employee who needs a complete operating manual.
Top AI companies have moved past the era of short instructions and now write prompt documents that run six pages or more. ParaHelp, which powers customer-support AI agents for leading companies such as Perplexity and Replit, is a prime example. Its prompt covers:
- **A clear task definition**: approve or reject tool calls
- **Detailed execution steps**: a complete flow from step one through step five
- **Important things to keep in mind**: avoid calling unrelated tools
- **Output format specification**: ensure clean integration with the other agents
It reads like a full onboarding packet for a new employee, so the AI understands the goal and the boundaries from the start.
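ParaHelp's real prompt is several pages long and is not reproduced here; purely as an illustration, this is a minimal sketch of what a manager-style system prompt with those four parts might look like. All section names and wording are assumptions, not the actual prompt.

```python
# A minimal, illustrative "manager-style" system prompt in the spirit described
# above. The wording is invented for demonstration, not ParaHelp's real prompt.
MANAGER_SYSTEM_PROMPT = """
# Role
You are the manager of a customer-service agent. You review tool calls proposed
by other agents and decide whether to approve them.

# Task
Approve or reject each proposed tool call.

# Steps
1. Read the customer ticket and the proposed tool call.
2. Check that the tool is relevant to the ticket.
3. Check that the arguments are complete and consistent with policy.
4. Decide: approve or reject.
5. Output your decision in the required format.

# Important
- Never call tools yourself; only verify calls proposed by other agents.
- If the ticket is outside the supported scope, reject and explain why.

# Output format
Respond with exactly one tag:
<manager_verify>accept</manager_verify> or <manager_verify>reject</manager_verify>,
followed by a one-sentence reason.
""".strip()
```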
## 2. Precise role definition: get the LLM "into character"
**Practice**: open the prompt by telling the LLM exactly who it is and what its domain is.
Role prompting anchors the style of the LLM's output. When you tell the model it is "the manager of a customer-service agent" or "an expert prompt engineer", it automatically adjusts:
- **Tone**: from generic and technical toward something more human
- **Level of expertise**: domain-specific terminology and logic
- **Way of thinking**: it works through problems the way that role would
This role setup aligns the LLM's behavior closely with what the task needs.
## 3. Task decomposition: walk the LLM through it step by step
**Key strategy**: break complex tasks into manageable sub-steps.
For complex workflows, defining the task up front and laying out the execution path matters a great deal. ParaHelp's prompt shows this approach:
- **A high-level plan overview**
- **A detailed procedure for how to build the plan**
- **How to create each individual step of the plan**
- **A high-level example of a plan**
Guiding the LLM through each sub-task in order noticeably improves both the accuracy of its reasoning and the quality of the final result.
## 4. Structured tags: no more messy output
**Technique**: use Markdown, XML, or JSON to pin down the output format.
Structured formatting matters more to LLMs than people expect:
- **XML tags**: e.g. `<manager_verify>accept</manager_verify>`
- **Markdown structure**: headings and bullet points for hierarchy
- **Easier API integration**: downstream processing and validation become straightforward
Many LLMs saw XML-style inputs during post-training (RLHF), so structuring prompts and outputs this way tends to produce better results.
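Structured tags also make the agent's output trivial to validate in code. A minimal sketch, reusing the `<manager_verify>` tag from the example above (the sample output string is made up):

```python
import re

def parse_manager_decision(llm_output: str) -> str:
    """Extract the accept/reject decision from a <manager_verify> tag."""
    match = re.search(
        r"<manager_verify>\s*(accept|reject)\s*</manager_verify>",
        llm_output,
        re.IGNORECASE,
    )
    if match is None:
        # No well-formed tag: treat it as a formatting failure rather than guessing.
        raise ValueError("LLM output did not contain a <manager_verify> tag")
    return match.group(1).lower()

# Example with made-up model output:
output = "Reasoning: the tool call matches the ticket.\n<manager_verify>accept</manager_verify>"
print(parse_manager_decision(output))  # -> "accept"
```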
## 5. Metaprompting: let the LLM optimize its own prompts
**The technique**: use the LLM itself to iterate on and improve your prompts.
This is one of the most powerful prompt-engineering techniques right now:
**Basic approach**:
- Give the LLM the role of "expert prompt engineer"
- Provide the current prompt plus examples of where it failed
- Ask it to propose improvements
**More advanced uses**:
- **Prompt folding**: dynamically generating specialized sub-prompts
- **Multi-model workflows**: optimize the prompt with a big model, run it on a small one
Companies like Trope and Jasberry have shown how effective this can be.
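A minimal sketch of the basic metaprompting loop described above. `call_llm` is a hypothetical helper standing in for whatever model API you use, and the critic-prompt wording is an assumption, not any company's actual metaprompt.

```python
def call_llm(prompt: str) -> str:
    """Hypothetical helper: send a prompt to your model of choice, return the text."""
    raise NotImplementedError

METAPROMPT_TEMPLATE = """You are an expert prompt engineer. You give detailed,
constructive critiques and concrete rewrites of prompts.

Here is the current prompt:
<prompt>
{current_prompt}
</prompt>

Here are cases where it failed to do what we wanted:
<failures>
{failure_examples}
</failures>

Rewrite the prompt so it handles these cases, keeping everything that already works."""

def improve_prompt(current_prompt: str, failure_examples: list[str], rounds: int = 3) -> str:
    """Run the metaprompting loop a few times and return the refined prompt."""
    prompt = current_prompt
    for _ in range(rounds):
        prompt = call_llm(METAPROMPT_TEMPLATE.format(
            current_prompt=prompt,
            failure_examples="\n---\n".join(failure_examples),
        ))
    return prompt
```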
## 6. Few-shot examples: shape behavior with real cases
**How to apply it**: provide high-quality worked examples, especially the hard ones.
Few-shot examples can dramatically improve an LLM's performance on a specific task.
**Jasberry's approach**:
- Focused on automatic bug detection in code
- Feeds the model hard examples that only expert programmers could solve
- Provides concrete examples for tricky problems such as N+1 queries
The pattern is similar to test-driven development in software: it is the LLM version of unit tests.
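A rough sketch of how hard worked examples might be folded into a bug-finding prompt. The example snippet, labels, and wording are illustrative assumptions, not Jasberry's actual prompt or data.

```python
# Illustrative hard examples for a bug-finding prompt. A real system would
# curate expert-verified cases; this one is invented for demonstration.
HARD_EXAMPLES = [
    {
        "label": "N+1 query",
        "code": (
            "for user in User.objects.all():\n"
            "    print(user.profile.bio)  # issues one extra query per user"
        ),
        "finding": "Each loop iteration triggers a separate query for `profile`; "
                   "use select_related('profile') to fetch them in one query.",
    },
]

def build_bug_finding_prompt(code_under_review: str) -> str:
    """Assemble a few-shot prompt: hard worked examples first, then the new code."""
    parts = ["You are an expert code reviewer. Find subtle bugs like the examples below."]
    for ex in HARD_EXAMPLES:
        parts.append(f"### Example: {ex['label']}\n{ex['code']}\nFinding: {ex['finding']}")
    parts.append(f"### Code to review\n{code_under_review}\nFindings:")
    return "\n\n".join(parts)
```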
## 7. Prompt folding: managing multi-step agents gracefully
**Core concept**: a prompt that dynamically generates a better, more specialized version of itself.
For complex multi-stage workflows:
- **A general-purpose prompt** generates a specialized prompt based on the incoming query
- **Hierarchical management**: complex tasks are handled layer by layer
- **Dynamic adaptation**: the strategy adjusts to the context
This keeps complex AI systems much easier to build and maintain.
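A minimal sketch of prompt folding under these assumptions: a general prompt first produces a specialized prompt for the incoming query, and that generated prompt is then used to handle it. `call_llm` is again a hypothetical helper, and the wording is illustrative.

```python
def call_llm(prompt: str) -> str:
    """Hypothetical helper standing in for your model API."""
    raise NotImplementedError

FOLDING_PROMPT = """You write prompts for a customer-support agent.
Given the ticket below, write a specialized prompt that tells the agent exactly
how to handle this category of ticket (steps, tone, tools to use, output format).

Ticket:
{ticket}

Return only the specialized prompt."""

def answer_ticket(ticket: str) -> str:
    # Stage 1: the general prompt "folds" into a specialized one for this query.
    specialized_prompt = call_llm(FOLDING_PROMPT.format(ticket=ticket))
    # Stage 2: the specialized prompt is then used to actually handle the ticket.
    return call_llm(f"{specialized_prompt}\n\nTicket:\n{ticket}")
```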
## 8. Build in an "escape hatch": let the LLM admit it doesn't know
**Key principle**: explicitly instruct the LLM to stop and ask for help when it is unsure.
**What Trope found**: models want to be helpful so badly that they will make up an answer even when they don't have enough information.
**Solutions**:
- Say it explicitly: "If you do not have enough information, say 'I don't know' and ask for clarification"
- YC's variant: include a "debug info" field in the response format
- In practice this ends up producing a to-do list for the developer of things that need fixing
A minimal sketch of such a response format follows below.
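This sketch combines both ideas: an explicit escape hatch plus a `debug_info` field the model can use to complain back to the developer. The schema, field names, and instruction wording are illustrative assumptions.

```python
import json

# Instructions appended to the system prompt (wording is illustrative).
ESCAPE_HATCH_INSTRUCTIONS = """
Respond with JSON using this schema:
{
  "decision": "accept" | "reject" | "insufficient_information",
  "reason": "<one sentence>",
  "debug_info": "<anything confusing or under-specified in your instructions, else null>"
}
If you do not have enough information to decide, use "insufficient_information"
instead of guessing, and explain what is missing in debug_info.
""".strip()

def collect_debug_info(raw_responses: list[str]) -> list[str]:
    """Gather the model's complaints from production traffic into a developer to-do list."""
    todos = []
    for raw in raw_responses:
        data = json.loads(raw)
        if data.get("debug_info"):
            todos.append(data["debug_info"])
    return todos
```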
## 9. Thinking traces and debug info: see how the LLM reasons
**Why it helps**: ask the LLM for its reasoning log and debug info so you can see where things go wrong.
**Practical tips**:
- **Gemini 2.5 Pro** exposes thinking traces
- **Long context** lets you use the model almost like a REPL for live debugging
- **API access**: thinking traces are now available through the API as well
This "inner monologue" is invaluable material for diagnosing problems and improving prompts.
## 10. Evals: the crown jewels, more valuable than the prompts
**Strategic insight**: ParaHelp was willing to open-source its prompt because the evals are the real core asset.
**Why evals matter**:
- **Understanding**: they explain why the prompt is written the way it is
- **Continuous improvement**: they provide the baseline for iterating
- **Competitive advantage**: they require deep knowledge of how users in a specific domain actually work
**How to get them**:
- Work side by side with domain experts
- Understand the user's reward function
- Encode real-world interactions into concrete test cases
A small sketch of what that encoding might look like follows below.
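A minimal sketch of turning field observations into evals: each case pairs a real-world scenario with the outcome the domain expert expects, and a grader checks the agent against it. The case data and names are made up (they echo the tractor-warranty story later in this article).

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class EvalCase:
    name: str
    ticket: str                 # the real-world input observed in the field
    expected_decision: str      # what the domain expert says should happen

# Made-up example encoding one field observation.
CASES = [
    EvalCase(
        name="warranty_honored_within_period",
        ticket="Invoice #123: tractor purchased 11 months ago, drivetrain failure.",
        expected_decision="honor_warranty",
    ),
]

def run_evals(agent: Callable[[str], str], cases: list[EvalCase]) -> float:
    """Return the fraction of cases where the agent's decision matches the expert's."""
    passed = sum(1 for case in cases if agent(case.ticket) == case.expected_decision)
    return passed / len(cases)
```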
## 11. Model personalities and model sizes: balancing quality and cost
**How to apply it**: learn each model's "personality".
**Model characteristics** (as observed by the founders):
- **Claude**: more human, easier to steer
- **Llama 4**: needs more steering, but can be controlled precisely once you put in the work
- **o3**: sticks rigidly to the rules, good for standardized tasks
- **Gemini 2.5 Pro**: handles exceptions flexibly
**Cost optimization**:
- **During development**: refine the prompt with a large model
- **In production**: adapt the refined prompt to a smaller model
- **Voice agents**: pair a fast model with a well-refined prompt to keep latency low
A sketch of this big-model-refines, small-model-serves pattern follows below.
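A rough sketch of the pattern under stated assumptions: the prompt is refined offline with a large model, then traffic is served by a faster, cheaper one. `call_large_model` and `call_fast_model` are hypothetical helpers, and the refinement wording is illustrative.

```python
def call_large_model(prompt: str) -> str:
    """Hypothetical helper for a large, slow, high-quality model (offline use)."""
    raise NotImplementedError

def call_fast_model(prompt: str) -> str:
    """Hypothetical helper for a small, fast model (production / voice latency budget)."""
    raise NotImplementedError

def refine_offline(draft_prompt: str, failure_notes: str) -> str:
    """Use the big model once, offline, to produce a tighter production prompt."""
    return call_large_model(
        "You are an expert prompt engineer. Rewrite the prompt below so a smaller, "
        "faster model can follow it reliably. Keep it short and unambiguous.\n\n"
        f"Prompt:\n{draft_prompt}\n\nKnown failure modes:\n{failure_notes}"
    )

def handle_request(production_prompt: str, user_utterance: str) -> str:
    """At serving time only the fast model runs, which keeps latency low."""
    return call_fast_model(f"{production_prompt}\n\nUser: {user_utterance}\nAgent:")
```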
## Takeaways
Modern AI companies treat prompt engineering as a core engineering discipline. It is no longer simple instruction writing; it requires:
1. **A systematic approach**: understand the business deeply, the way Palantir's forward deployed engineers do
2. **An engineering workflow**: a full develop-test-deploy pipeline for prompts
3. **Continuous optimization**: iterate based on eval results
Only this kind of professional, systematic approach really unlocks what LLMs can do and lets you stand out in a crowded AI application market.
# More complete write-up
## The core idea of metaprompting
Metaprompting is turning out to be a very, very powerful tool that everyone is using now. It feels a bit like programming in 1995: the tools are not all the way there yet, and we are out on a new frontier. Personally, it also feels like learning how to manage a person: how do you actually communicate the things they need to know in order to make good decisions?
## A closer look at the ParaHelp case
### Background and track record
ParaHelp is a very successful AI customer-support company. Plenty of companies work on AI customer support, but ParaHelp does it exceptionally well: they power customer support for Perplexity, Replit, Bolt, and a number of other top AI companies.
When you send a support ticket to Perplexity, what actually responds is their AI agent. The ParaHelp team very generously agreed to show the actual prompt that powers this agent and to put it on YouTube for the whole world to see. Prompts for vertical AI agents are relatively hard to come by, because they are treated as part of these companies' crown-jewel IP, so credit to ParaHelp for essentially open-sourcing theirs.
### Walking through the prompt's structure
#### Length and level of detail
The first striking thing about this prompt is that it is long and very detailed: roughly six pages if you scroll through the document.
#### Setting the role
Many of the best prompts begin by establishing the LLM's role. Here it is "you are the manager of a customer-service agent", followed by bullet points breaking down what that manager needs to do.
#### Making the task explicit
Next comes the task itself: approve or reject tool calls, because this agent orchestrates the calls coming from all the other agents.
#### Breaking down the steps
The prompt then gives a high-level plan, broken down step by step: you can see steps one through five. It also lists important things to keep in mind, such as not wandering off and calling unrelated kinds of tools.
#### Specifying the output format
It tells the model exactly how to structure its output, because a big part of agent work is integrating with other agents; it is almost like gluing API calls together. So it matters to specify that the output will be an accept-or-reject decision, in this exact format.
#### Markdown structure
That covers the high-level sections. One thing the best prompts do is break everything down in a markdown-like style: headings at the top, then, further down, more detail on how to do the planning as sub-bullets. The planning part has three big pieces: how to plan, how to create each step of the plan, and a high-level example of a plan.
#### Guiding the reasoning
A hallmark of the best prompts is that they spell out how to reason about the task, and then they give an example — which is exactly what this prompt does. Interestingly, it reads more like programming than like English prose, because it uses an XML-tag format to specify the plan.
### Why the XML format works
In practice this makes the prompt much easier for the model to follow: many LLMs were post-trained (RLHF) with XML-style inputs, and it turns out to produce better results.
### The per-customer challenge
One section you might expect to find, but won't, is one that describes a specific scenario and gives example output for it. That lives in the next stage of the pipeline, because it is customer-specific: every customer has its own flavor of how it wants support tickets answered. So ParaHelp's challenge, like that of many agent companies, is how to build a general-purpose product when every customer wants slightly different workflows and preferences.
This is a genuinely interesting problem that vertical AI agent companies talk about a lot: how do you stay flexible enough to support special-purpose logic without turning into a consulting firm that writes a new prompt for every customer? The idea of forking and merging prompts across customers, and deciding which parts of a prompt are customer-specific versus company-wide, is something the world is only beginning to explore.
## The three-layer prompt architecture
### The system prompt layer
This is the idea of splitting the prompt into a system prompt, a developer prompt, and a user prompt. The system prompt is essentially the high-level API for how your company operates. ParaHelp's published prompt is very much a system prompt: there is nothing in it that is specific to any customer.
### The developer prompt layer
When they add a specific instance of that API and call it, all of that goes into the developer prompt, which is not shown here. It adds the customer context: working with Perplexity, for example, where the way you handle RAG questions is very different from working with Bolt.
### The user prompt layer
ParaHelp probably doesn't have a user prompt, because its product is not consumed directly by an end user. An end-user prompt shows up in products like Replit or Cursor, where the user types something like "generate a site with these buttons, this and that" — all of that goes into the user prompt.
That is the architecture that is emerging.
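A minimal sketch of how the three layers could map onto a chat-style message list. The contents are invented; whether your API accepts a separate `developer` role or folds it into the system message varies by provider, so treat the role names here as an assumption.

```python
# Illustrative only: assembling the three layers into a chat-style message list.
def build_messages(customer_context: str, end_user_request: str) -> list[dict]:
    return [
        {
            # System prompt: company-wide, customer-agnostic "API" for how the agent operates.
            "role": "system",
            "content": "You are the manager of a customer-service agent. Approve or reject tool calls...",
        },
        {
            # Developer prompt: per-customer context layered on top (e.g. Perplexity vs Bolt).
            "role": "developer",
            "content": customer_context,
        },
        {
            # User prompt: only present when an end user talks to the product directly.
            "role": "user",
            "content": end_user_request,
        },
    ]

messages = build_messages(
    customer_context="Customer: Perplexity. RAG-related tickets should be handled as follows...",
    end_user_request="My Pro subscription was charged twice this month.",
)
```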
### Tooling opportunities
On the point about avoiding becoming a consulting firm: there are a lot of startup opportunities in building the tooling around all of this. Anyone who has done prompt engineering knows that worked examples matter enormously for output quality, and in ParaHelp's case they want good worked examples that are specific to each company.
As they scale, you would want that to happen automatically. The dream is an agent that can pluck the best examples out of each customer's data set, plus software that ingests them straight into the right place in the pipeline, instead of someone going in and wiring it all up by hand.
## Prompt folding
### Trope's discovery
This is a good segue into metaprompting, one of the themes that keeps coming up when we talk to AI startups.
Trope is one of the startups in the current YC batch. They help companies like YC's Ducky do really in-depth understanding and debugging of the prompts and return values in multi-stage workflows.
One of the things they figured out is prompt folding: one prompt can dynamically generate a better version of itself. A good example is a general prompt that generates a specialized prompt based on the previous query.
### How it works in practice
You can take your existing prompt, feed it more examples of cases where it failed or didn't quite do what you wanted, and instead of rewriting the prompt yourself, hand it to the raw LLM and say, "help me make this prompt better." Because the model knows itself so well, this works strangely well — which is why metaprompting is turning out to be such a powerful tool that everyone is using now.
## Jasberry and very hard tasks
### The automatic bug-finding challenge
The next step after prompt folding, when the task is very complex, is to lean on examples. That is what Jasberry does (another company in this batch). They build automatic bug finding for code, which is a much harder problem.
### Why expert examples matter
Their approach is to feed the model a set of very hard examples that only expert programmers could handle — say, finding an N+1 query. Even today's best LLMs struggle to find those. So they locate the relevant parts of the code and add them to the prompt and metaprompt: "here is an example of an N+1-style bug." And that works.
### The underlying idea
When something is too hard to even describe well in prose, giving an example turns out to work really well: it helps the model reason about a complicated task and steers it better than parameters you can't quite pin down. It is essentially unit testing — the LLM version of test-driven development.
## Why escape hatches matter
### What Trope found
Another thing Trope talks about: the model wants to help you so badly that if you just tell it "give me output in this particular format", it will tell you what it thinks you want to hear even when it doesn't have the information it needs. That is literally a hallucination.
### The fix
What they discovered is that you have to give the LLM a real escape hatch. You need to tell it: if you do not have enough information to say yes or no or make a determination, don't make it up — stop and ask me. That is a very different way to think about it.
### YC's variant
This is something we also learned from internal agent work at YC, where Jared came up with a genuinely inventive way to give the LLM an escape hatch.
The Trope approach is one way to do it. We came up with a different one: in the response format, give the model the ability to make part of its response essentially a complaint to you, the developer, saying that you gave it confusing or under-specified information and it doesn't know what to do.
### How it plays out
The nice thing is that we just run the LLM in production on real user data and then go back and look at what it put in that output parameter. We call it "debug info" internally: a debug_info field where the model reports back the things we need to fix. It literally ends up being a to-do list for you, the agent developer. It is pretty mind-blowing stuff.
## A simple way to get started
### Basic metaprompting
Even for hobbyists or people playing with personal projects, a very simple way to start metaprompting is to follow the same structure: give the prompt a role — "you are an expert prompt engineer who gives detailed, excellent critiques and advice on improving prompts" — then hand it the prompt you have in mind, and it will spit back a much more expanded, better prompt. You can keep running that loop for a while, and it works surprisingly well.
### The distillation pattern
A common pattern when companies need faster responses from the LLMs in their product: do the metaprompting with a bigger, beefier model — anything in the hundreds-of-billions-of-parameters class, say Claude 3.7/4 or OpenAI o3 — until you have a prompt that works really well, and then use that prompt with a distilled model.
They then run it on, for example, GPT-4o, and it ends up working quite well. This comes up especially with voice AI agent companies, because latency matters for passing the conversational "Turing test": if there is too long a pause before the agent responds, humans can tell something is off. So they use a faster model, but with a bigger, better prompt refined on the larger models. It has become a common pattern.
### Managing long prompts
Another tip, less sophisticated but useful: as the prompt gets longer and longer it becomes a large working document. As you use it, jot down in a Google Doc the things you notice — outputs that aren't what you want, or ways you think it could be improved. Then give Gemini Pro your notes plus the original prompt and ask it to suggest a set of edits that incorporate them well. It does that quite well.
### Debugging with thinking traces
Another trick: in Gemini 2.5 Pro, if you look at the thinking traces as it works through an evaluation, you can learn a lot about what it is missing. We have done this internally as well. It matters because, until recently, you did not get the thinking traces when using Gemini through the API — and the thinking traces are the critical debugging information for understanding what is wrong with your prompt. They have now been added to the API, so you can pipe them back into your developer tools and workflows.
An under-appreciated consequence of Gemini Pro's very long context window is that you can effectively use it like a REPL: go one example at a time, put your prompt on a single example, and literally watch the reasoning trace in real time to figure out how to steer it in the direction you want.
Jared and the software team at YC have built various workbenches that let us debug this way, but sometimes it is easier to just use gemini.google.com directly and drag and drop the JSON files. You don't need any special container for this — it even works directly in ChatGPT itself.
A shout-out to YC's head of data, Eric Bacon, who has helped us a lot with all of this metaprompting and with using Gemini 2.5 Pro as, effectively, a REPL.
## The forward deployed engineer model
### Palantir's innovation
#### The core insight
Palantir's whole thesis, at some level, was this: look at Meta (then called Facebook), Google, or any of the top software startups everyone knew at the time. One of the key recognitions of Palantir's original founders — Peter Thiel, Alex Karp, Stephen Cohen, Joe Lonsdale, Nathan Gettings — was that if you walked into any Fortune 500 company or any government agency in the world, including in the United States, nobody who understood computer science and technology at the very highest level would ever be in that room.
#### A huge market opportunity
The really big idea Palantir discovered early was that the problems those places face are multi-billion-dollar, sometimes trillion-dollar problems. And this was well before AI became a thing — people talked about machine learning, but back then they called it data mining. The world was awash in data, giant databases of people and things and transactions, and nobody knew what to do with it.
That is what Palantir was, is, and still is: go find the world's best technologists who know how to write software that actually makes sense of the world. You have petabytes of data and no way to find the needle in the haystack.
#### What has changed since
Interestingly, some 20 or 22 years later it has only become more true: we have more and more data and less and less understanding of what is going on. And it is no accident that now, with LLMs, the problem is becoming much more tractable.
### What a forward deployed engineer actually does
#### Being on site
The forward deployed engineer role was, concretely: how do you sit next to the actual FBI agent investigating domestic terrorism? How do you sit right next to them, in their actual office, and see what a case looks like when it comes in? What are all the steps? When they need to go to the federal prosecutor, what are the things they send?
#### The primitive workflows you find there
What's funny is that it is literally Word documents and Excel spreadsheets. What you do as a forward deployed engineer is take these file-cabinet-and-fax-machine workflows people have to follow and turn them into really clean software.
#### The goal
The classic framing is that running an investigation at a three-letter agency should be as easy as taking a photo of your lunch and posting it to Instagram for all your friends. That is the fun part of it.
### Versus the traditional sales model
#### Palantir's unusual approach
It is no accident that forward deployed engineers who came up through that system at Palantir are now turning out to be some of the best founders at YC.
It produced an incredible number of startup founders, because the training to be a forward deployed engineer is exactly the right training to be a founder of these companies now.
The other interesting thing about Palantir is that other companies would send a salesperson to go sit with the FBI agent; Palantir sent engineers. Palantir was probably the first company to really institutionalize that and scale it as a process.
#### Where the traditional approach falls short
The reason Palantir could land seven-, eight-, and now nine-figure contracts so consistently is that they didn't send someone who is all hair and teeth — "let's go to the steakhouse" — where everything rests on the relationship: you have one meeting, they really like the salesperson, and then through sheer force of personality you try to get them to sign a seven-figure contract.
The timescale on that is six weeks, ten weeks, twelve weeks... more like five years. And the software would never work.
#### Why Palantir's approach wins
Whereas if you put an engineer in there and give them Palantir Foundry — what they now call their core data-visualization and data-mining suite — then instead of the next meeting being a review of 50 pages of sales documentation, a contract, or a spec, it is literally: "okay, we built it." And you get real, live feedback within days.
### What this means for startups today
#### The competitive advantage
Honestly, this is the biggest opportunity for startup founders. If founders can do what forward deployed engineers are used to doing, that is how you beat Salesforce, or Oracle, or Booz Allen, or any company with a big office and big fancy salespeople with big strong handshakes.
#### What it takes to win
So how does a really good engineer with a weak handshake go in there and beat them? You show them something they have never seen before and make them feel heard. You have to be genuinely empathetic about it — you actually have to be a great designer and product person. Then you come back and blow them away. The software is so powerful that the second someone sees something that makes them feel seen, they want to buy it on the spot.
#### The founder's job
Should founders think of themselves as the forward deployed engineers of their own company? Absolutely — you cannot farm this out. The founders themselves are the technical people; they also have to be the great product people, the ethnographers, the designers. You want the person in the second meeting to see the demo you put together based on what you heard, and you want them to say: "wow, I've never seen anything like that — take my money."
## What this looks like at today's AI companies
### The rise of vertical AI agents
The incredible thing about this model — and why so many vertical AI agents are taking off — is precisely this: the founders can meet with the end buyer and champion at these big enterprises, take that context, and basically stuff it into the prompt. Then they can come back quickly, maybe the very next day. At Palantir it might have taken a team of engineers somewhat longer; here it can be just the two founders walking in. And they close six- and seven-figure deals with large enterprises, which had never been done before. It is only possible with this new model of forward deployed engineer plus AI, and it is only accelerating.
### Giga ML's experience in the field
This recalls a company mentioned on the podcast before, Giga ML, which also does customer support, especially a lot of voice support. It is the classic case of two extremely talented software engineers who are not natural salespeople but forced themselves to become, essentially, forward deployed engineers. They closed a huge deal with Zepto, and several other companies they can't announce yet.
#### Going on site
Do they physically go on site, like the Palantir model? Yes. Once they close a deal they go on site, sit with all the customer-support people, and figure out how to keep tuning the software and the LLM so it works even better.
#### Winning on the technology itself
Even before that, to win the deal in the first place, they found they could win by having the most impressive demo. In their case they innovated a bit on the RAG pipeline so that their voice responses are both accurate and very low latency — a technically challenging thing to do.
#### How the landscape has changed
Before the current rise of LLMs, you couldn't necessarily differentiate enough at the demo stage of a sale to beat an incumbent — you can't really beat Salesforce by having a slightly better CRM with a better UI.
But now, because the technology evolves so fast and the last 5 to 10 percent is so hard to get right, you actually can: go in as the forward deployed engineer, do the first meeting, tweak it so it works really well for that customer, come back with the demo, get that "wow, we haven't seen anyone else pull this off" reaction, and close huge deals.
### Happy Robot
It was exactly the same story with Happy Robot, which has sold seven-figure contracts to the three largest logistics brokers in the world. They build AI voice agents for that market, run the forward-deployed-engineer model — talking directly to the CIOs of these companies — and ship a lot of product with very fast turnaround. It has been incredible to watch it take off: it started with six-figure deals and is now closing seven-figure deals, only a couple of months later.
That is the kind of thing you can do with, frankly, very, very smart prompt engineering.
## Why evals are strategically important
### The evals are the real crown jewels
On evals: we have been talking about them for a year now. What are founders discovering? Even after saying this for a year or more, it is still the case that evals are the true crown jewel — the real data asset — for all of these companies. One reason ParaHelp was willing to open-source the prompt is that they don't actually consider the prompts to be the crown jewels; the evals are.
### What evals give you
Without the evals, you don't know why the prompt was written the way it was, and it is very hard to improve it. More abstractly: YC funds a lot of companies, especially in vertical AI and SaaS, and you cannot get the evals unless you are literally sitting side by side with the people doing that specific knowledge work.
### Why good evals are hard to get
You need to sit next to the regional manager for tractor sales and understand what this person cares about and how they get promoted — that person's reward function. Then what you are doing is taking those in-person interactions, sitting next to someone in Nebraska, going back to your computer, and codifying them into very specific evals. For example: this particular user wants this outcome when this invoice comes in and we have to decide whether to honor the warranty on this tractor.
### Where the moat comes from
That is the true value. Everyone worries, "are we just wrappers — what is going to happen to startups?" This is literally where the rubber meets the road: if you are out there in specific places, understanding that user better than anyone else, and the software actually works for those people, that is the moat.
### The core competence founders need
Is that a perfect description of the core competence required of founders today? Exactly: that is your job as a founder of a company like this — be really good at that, and be maniacally obsessed with the details of the regional tractor sales manager's workflow.
And the wild thing is that it is very hard to do. Have you even been to Nebraska? The classic view is that the best founders in the world are really great, cracked engineers and technologists, genuinely brilliant — and at the same time they have to understand some part of the world that very few people understand.
### The traits of founders who pull this off
There is this little sliver of people who end up founding multi-billion-dollar startups. Think of Ryan Petersen from Flexport: a really great person who understands how software is built — and who was also, reportedly, the third-biggest importer of medical hot tubs for an entire year, about a decade ago. The weirder the parts of the world you have seen that no other technologist has seen, the bigger the opportunity.
### Every founder a forward deployed engineer
Garry has framed this in an interesting way before: every founder is becoming a forward deployed engineer. The term traces back to Palantir, and since Garry was early at Palantir, the earlier discussion of how the forward deployed engineer role became a thing there — and what founders can learn from it now — grew out of exactly that question.
## Different models have different personalities
### Each model's quirks
One interesting thing is that each model seems to have its own personality, and founders are realizing that you go to different models for different things.
One widely known example is that Claude is the happier, more human-steerable model. Llama 4, by contrast, needs a lot more steering — it is almost like talking to a developer — partly, perhaps, an artifact of not having had as much RLHF on top of it. It is rougher to work with, but if you are good at steering it you can steer it very precisely; it is almost like doing a bit of the RLHF yourself, just harder work.
### A concrete internal use case
One thing we have been using LLMs for internally is helping founders figure out whose money they should take. In that case you sometimes need a very straightforward rubric, zero to one hundred: zero means never, ever take their money; one hundred means take their money right away — they help you so much you would be crazy not to.
We have been building scoring rubrics around that with prompts. What have we learned?
### Rubrics and how models apply them
Giving LLMs a rubric is definitely best practice, especially if you want a numerical score as the output: it helps the model understand how to think it through and what separates an 80 from a 90. But rubrics are never perfect; there are always exceptions. We tried the same rubric with o3 and with Gemini 2.5 Pro and found discrepancies.
What we found really interesting is that you can give the same rubric to two different models. In our case, o3 was very rigid: it really sticks to the rubric and heavily penalizes anything that doesn't fit it.
Gemini 2.5 Pro, on the other hand, was quite good at being flexible: it applies the rubric, but it can also reason through why someone might be an exception, or why you might want to push a score higher or lower than the rubric alone would suggest — which is really interesting.
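A minimal sketch of a 0-100 rubric-scoring prompt of the kind described here. The criteria, wording, and output schema are illustrative assumptions — not YC's actual rubric — and `call_llm` is a hypothetical helper for whichever model does the grading.

```python
def call_llm(prompt: str) -> str:
    """Hypothetical helper standing in for the model that grades the rubric."""
    raise NotImplementedError

RUBRIC_PROMPT = """You score investors for founders on a 0-100 scale.

Rubric (illustrative, not a real one):
- 90-100: immaculate process, responds fast, never ghosts, clear value-add
- 70-89: strong track record, but slow to respond or occasionally drops threads
- 40-69: mixed signals; some value-add, some process problems
- 0-39: repeated ghosting, misleading terms, or founder-unfriendly behavior

Use the rubric as a guide, but say explicitly when a case looks like a justified exception.

Investor notes:
{notes}

Respond as JSON: {{"score": <0-100>, "reason": "<short>", "debug_info": "<anything unclear, else null>"}}"""

def score_investor(notes: str) -> str:
    return call_llm(RUBRIC_PROMPT.format(notes=notes))
```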
### What the personalities imply
It is just like training a person: you give them a rubric and want them to use it as a guide, but there are always edge cases where they need to think a bit more deeply. The interesting part is that the models handle that differently, which means they genuinely have different personalities.
o3 feels a bit more like a soldier: "okay, check, check, check, check." Gemini 2.5 Pro feels more like a high-agency employee: "okay, this makes sense, but this might be an exception in this case" — which was really interesting to see.
### A concrete investor-scoring example
For investors this gets interesting. Sometimes you have an investor like Benchmark or Thrive: yes, take their money right away. Their process is immaculate, they never ghost anyone, they answer email faster than most founders — very impressive.
Then there are plenty of investors who are simply overwhelmed, maybe not great at managing their time. They might be really great investors, and their track record bears that out, but they are slow to get back, they always seem swamped, and they ghost people accidentally rather than intentionally.
This is legitimately exactly what an LLM is for. The debug info on some of these cases is very interesting to read — maybe it's a 91 instead of an 89. We'll see.
## Closing: a brave new world
### The deeper analogy behind metaprompting
One of the really surprising things, as we play with this ourselves and spend maybe 80 to 90 percent of our time with founders out on the edge, is the analogies we end up reaching for. On one hand it is like coding — it genuinely feels like coding in 1995: the tools are not all the way there, a lot of things are unspecified, and we are out on a new frontier.
On the other hand, it also feels like learning how to manage a person: how do I communicate the things they need to know in order to make good decisions? How do I make sure they know how I am going to evaluate and score them?
### The spirit of continuous improvement
Beyond that, there is the Kaizen aspect — the manufacturing principle that produced really, really good Japanese cars in the 90s. That principle says the people best placed to improve a process are the people actually doing it.
That is literally why Japanese cars got so good in the 90s, and that is what metaprompting is to me. It is a brave new world; we are in a genuinely new moment.
### Sign-off
And with that, we are out of time. We can't wait to see what kinds of prompts you come up with. See you next time.
## Original transcript (verbatim)
```
Metaprompting is turning out to be a very, very powerful tool that everyone's using now. It kind of actually feels like coding in 1995, like the tools are not all the way there.
We're in this new frontier. But personally, it also kind of feels like learning how to manage a person, where it's like, how do I actually communicate the things they need to know in order to make a good decision.
Welcome back to another episode of The LightCone. Today we're pulling back the curtain on what is actually happening inside the best AI
startups when it comes to prompt engineering. We surveyed more than a dozen companies and got their take right from the frontier of building
this stuff, the practical tips. Jared, why don't we start with an example from one of your best AI startups?
I managed to get an example from a.
company called Para Help. Parahelps does AI customer support. There are a bunch of companies who are doing this, but Par help is doing it really, really well. They're actually powering the customer
support for perplexity and Replit and Bolt and a bunch of other like top AI companies now. So if you
if you go and you like email a customer support ticket into perplexity, what's actually responding
is like their AI agent. The cool thing is that the Power Help guys very graciously agreed to show us the actual prompt that is powering this agent and to put it on screen on YouTube for the entire world to
see. It's like relatively hard to get these prompts for vertical AI agents because they're kind of like the crown jewels of the IP of these companies. And so very grateful to the Power Help guys
for agreeing to basically like open source this prompt.
Diana, can you walk us through this very detailed prompt? It's super interesting and it's very rare to get
a chance to see this in action. So the interesting thing about this prompt is actually first. It's really long. It's very detailed. In this document you can see is like six pages long, just scrolling
through it. The big thing that a lot of the best prompts started with is,
It's this concept of setting up the role of the LLM. You're a manager of a customer service agent
and it breaks it down into bullet points what it needs to do. Then the big thing is telling the
task, which is to approve or reject a tool call because it's orchestrating agent calls from all these other ones. And then it gives it a bit of the high level plan. It breaks it down step by step. You see steps
one, two, three, four, five. And then it gives some of the important things to keep in mind. that it should not kind of go weird into calling different kinds of tools.
It tells them how to structure the output because a lot of things with agents is you need them
to integrate with other agents. So it's almost like gluing the API call. So it's important to
specify that it's going to give certain output of accepting or rejecting and in this format. Then this is sort of the high level section. And one thing that the best prompts do, they break it down sort of in this markdown type of style formatting. So you have sort of the heading here. And then later on,
it goes into more details on how to do the planning. And you see this is like a sub-bullet part of it. And as part of the plan, there's actually three big sections is how to plan and then how to create
each of the steps in the plan. And then the high-level example of the plan. One big thing about the best
prompts is they outline how to reason about the task. And then a big thing is giving you a
giving it an example. And this is what it does. And one thing that's interesting about this,
it looks more like programming than writing English because it has this XML tag kind of format to
specify sort of the plan. We found that it makes it a lot easier for LMs to follow because a lot of
LMs were post-trained in IRLHF with kind of XML type of input and it turns out to produce better results.
Yeah. One thing I'm surprised that isn't in here, or maybe this is just the very
version that they released, what I almost expect is there to be a section where it describes a
particular scenario and actually gives example output for that scenario.
That's in like the next stage of the pipeline. Yeah. Oh, really? Okay. Yeah. Because it's customer
specific, right? Because like every customer has their own like flavor of how to respond to these
support tickets. And so their challenge, like a lot of these agent companies is like, how do you build a
general purpose product when every customer, like, wants, you know, has like slightly
different workflows and like preferences. It's a really interesting thing that I see the vertical
AI agent companies talking about a lot, which is like, how do you have enough flexibility to make special purpose logic without turning into a consulting company where you're building like a new
prompt for, for every customer? I actually think this like concept of like forking and merging prompts across customers and which part of the prompt is customer-specific versus like company-wide.
It's like a really interesting thing that the world is only just beginning to explore.
Yeah, that's a very good point, Jared. So this is this concept of defining the prompt in the system prompt.
Then there's the developer prompt and then there's a user prompt. So what this mean is the system prompt is basically almost like defining sort of the high-level
API of how your company operates. In this case, example of PowerHelp is very much a system prompt.
There's nothing specific about the customer. And then as they add specific instances of, of that API and calling it, then they stuff all that in into more the developer
prompt, which is not shown here. And that adds all the context of, let's say, working with
perplexity. There's certain ways of how you handle rack questions as opposed to working with
bold. It's very different, right? And then I don't think PowerHelp has a user prompt because their product is not consumed directly by an end user, but an end user prompt could be more like
Replet or A0, right? Where users need to type is like generate me.
a site that that has these buttons, this and that, that goes all in the user prompt.
So that's sort of the architecture that's sort of emerging.
To your point about avoiding becoming a consulting company, I think there's so many startup opportunities
in building the tooling around all of this stuff.
Like, for example, like anyone who's done Promptengering knows that the examples and worked examples are really important to improving the quality of the output. And so then if you take like PowerHelp as an example, they really want good worked examples that are specific to each company. And so you can imagine that as they scale, you almost want that done automatically. Like in your dream mode, what you want is just like an agent itself that can pluck out the best examples from like the customer data set and then software that just
like ingests that straight into like wherever it should belong in the pipeline without you having to
manually go out and plug that all and ingest it in all of yourself. That's right. Great segue into
Metaprompting, which is one of the things we want to talk about because that's That's a consistent theme that keeps coming up when we talk to our AI startups.
Yeah, Trope here is one of the startups I'm working with in the current YC batch,
and they've really helped people like YC Company Ducky do really in-depth understanding and debugging
of the prompts and the return values from a multi-stage workflow.
And one of the things they figured out is prompt folding.
So basically one prompt can dynamically generate better versions of itself.
So a good example of that is a class. prompt that generates a specialized prompt based on the previous query.
And so you can actually go in, take the existing prompt that you have and actually feed it
more examples where maybe the prompt failed, it didn't quite do what you wanted.
And you can actually, instead of you having to go and rewrite the prompt, you just put it into the raw LLM and say, help me make this prompt better. And because it knows itself so well, strangely, MetaPrompt.
is turning out to be a very, very powerful tool that everyone's using now.
And the next step after you do sort of prompt folding, if the task is very complex,
there's this concept of using examples. And this is what Jasberry does. It's one of the companies I'm working with this batch.
They basically build automatic bug finding in code, which is a lot harder.
And the way they do it is they feed a bunch of really hard examples that only expert programmers could do.
Let's say if you want to find an N plus one query. It's actually hard for today for even like the best LLMs to find those. And the way to do those is they find parts of the code.
Then they add those into the prompt and meta prompt that is like, hey, this is an example of an plus one type of error. And then that works it out. And I think this pattern of sometimes when it's too hard to even kind of write a prose around
it, let's just give you an example that turns out to work really well because it helps LMs to reason around complicated tasks and steer it.
it better because you can't quite kind of put exact parameters. And it's almost like
unit testing programming in a sense, like test driven development, is sort of the LLM version of
that. Yeah. Another thing that Trope here sort of talks about is, you know, the model really wants to
actually help you so much that if you just tell it, give me back output in this particular format,
even if it doesn't quite have the information it needs, it'll actually just tell you what it
things you want to hear. And it's literally a hallucination. So one thing they discovered is that
you actually have to give the LLMs a real escape hatch. You need to tell it. If you do not have
enough information to say yes or no or make a determination, don't just make it up. Stop and ask me.
And that's a very different way to think about it. That's actually something we learn at some of the internal work that we've done with agents at YC,
where Jared came up with a really inventive way to
to give the LLM a SK patch. You want to talk about that? Yeah. So the trope here approach is one way to give the LM an escape patch. We came up with a different
way, which is in the response format, to give it the ability to have part of the response be essentially a complaint to you, the developer, that like you have given it confusing or under specified information
and it doesn't know what to do. And then the nice thing about that is that we just run your LLM like in production with real hoser data. And then you can, you can, you can, you know, can go back and you can look at the outputs that it has given you in that like output parameter. We call it debug info internally. So like we have this like debug info parameter where it's
basically reporting to us, things that we need to fix about it. And it literally ends it being like
a to do list that you, the agent developer, has to do. It's like really kind of mind blowing stuff.
Yeah. I mean, just even for hobbyists or people who are interesting playing around for this for personal projects, like a very simple way to get started with metaprompting is to follow the same
structure the prompt is give it a role and make the role be like, you know, you're an expert
prompt engineer who gives really, like, detailed great critiques and advice on how to improve prompts and give it the prompt that you had in mind, and it will spit you back a much,
a more expanded, better prompt. And so you can just keep running that loop for a while,
works surprisingly well. I think that's a common pattern sometimes for companies when they need to get
responses from elements in their product a lot quicker. They do the meta-prompting.
with a bigger, beefier model, any of the, I don't know, hundreds of billions of parameters plus,
models like, I guess, Cloud 4, 3.7, or your GPD-03, and they do this meta-prompting,
and then they have a very good working one that then they use into the distilled model.
So they use it on, for example, in a 4-0 and it ends up working pretty well.
Specifically, sometimes for voice AI agents, companies, because latency,
is very important to get this whole touring test to pass, because if you have too much pause
before the agent responds, I think humans can detect something is off. So they use a faster model, but with a bigger, better prompt that was refined from the bigger models. So it's like a common pattern
as well. Another, again, less sophisticated maybe, but as the prompt gets longer and longer,
it becomes a large working dock. One thing I found useful is as you're using it, if you just
just note down, you know, Google Doc, things that you're seeing, just the outputs not being how you want or ways that you can think to improve it. You can just write those in note form and then
give Gemini Pro, like, your notes plus the original prompt and ask it to suggest a bunch of edits
to the prompt to incorporate these in well, and it does that quite well.
The other trick is in Gemini 2.5 Pro, if you look at the thinking traces, as is
parsing through evaluation, you could actually learn a lot about all those misses as well.
We've done that internal as well, right? As this is critical, because if you're just using Gemini via the API, until recently, you did not
get the thinking traces. And like the thinking traces are like the critical debugging information
to like understand like what's wrong with your prompt. They just added it to the API. So you can now actually like pipe that back into your developer tools and workflows. Yeah, I think it's an underrated consequences of Gemini Pro having such long context windows is you can effectively use it
like a repel. Go sort of like one by one. I put your prompt on like one example, then literally
watch the reasoning trace in real time to figure out like how you can steer it in the direction you want.
Jared, and the software team at YC has actually built this, you know, various forms of work benches that allow us to like do debug and things like that. But to your point, like sometimes it's better just to
use Gemini.google.com directly and then drag and drop, you know, literally Jason files.
And, you know, you don't have to do it in some sort of special container. Like it seems to be totally something that works even directly in, you know,
chat GPT itself. Yeah, this is all stuff. I would give a shout out to YC's head of data,
Eric Bacon, who's helped us all a lot, a lot of this metapropting and using Gemini Pro 2.5 as effectively a
repel. about evals. I mean, we've talked about evils for going on a year now. What are some of the things
that founders are discovering? Even though we've been saying this for a year or more now, Gary, I think it's still the case that, like, evals are the true crown jewel, like data asset for all of
these companies. Like one reason that PowerHelp was willing to open source the prompt is they told me
that they actually don't consider the prompts to be the crown jewels, like the evals are the crown jewels.
Because without the e-vowels, you don't know why the prompt was written the way that it was and it's very hard to improve it. Yeah. And I think in abstraction, you can think about, you know, YC funds a lot of companies, especially in vertically AI and SaaS. And then you can't get the evals unless you were sitting literally
side by side with people who are doing X, Y, Z knowledge work. You know, you need to sit next
next to the tractor sales regional manager and understand, well, you know, this person.
person cares about, this is how they get promoted. This is what they care about. This is that person's reward function. And then, you know, what you're doing is taking these in-person interactions sitting next to someone in Nebraska and then going back to your computer and codifying it into very specific e-vals. Like, this particular user wants this outcome after the, you know, after this invoice comes in, we have to decide whether we're going to honor the, you know, the warranty on this tractor, like just to take one of,
one example. That's the true value, right? Like, you know, everyone's really worried about,
um, are we just rappers and, you know, what is going to happen to startups? And I think this is literally
where the rubber meets the road, where, um, if you, you know, if you are out there in particular places,
understanding that user better than anyone else and having the software actually work for those people,
that's the moat. Is that just like such a perfect depiction of like, what is the core competence?
required of founders today. Like literally, like the thing that you just said, like,
that's your job as a founder of a company like this is to be really good at that thing. And like
maniacally obsessed with like the details of the regional tractor sales managers workflow. Yeah. And then the wild thing is it's very hard to do. Like, you know, have you even been to Nebraska?
You know, the classic view is that the best founders in the world, they're, you know, sort of really
great cracked engineers and technologists and just really brilliant. And then.
And at the same time, they have to understand some part of the world that very few people understand. And then there's this little sliver that is, you know, the founder of a multi-billion dollar startup. You know, I think of Ryan Peterson from Flexport, you know, really, really great person who understands how software is built. But then also, I think he was the third biggest importer of medical hot tubs for an entire year, like, you know, a decade ago. So, you know, the weirder that.
is the more of the world that you've seen that nobody else who's a technologist has seen, the greater the opportunity, actually. I think you've put this in a really interesting way before, Carrie, where you're sort of saying that every founder's become a forward deployed engineer. That's like a term that traces back to Palantir. And since you were early at Palantir,
maybe tell us a little bit about how did forward deployed engineer become a thing at Palantir and what can
founders learn from it now? I mean, I think the whole thesis of Palantir at some level was that if you look at
meta back then it was called Facebook or Google or any of the top software startups that everyone sort of knew back then. One of the key recognitions that Peter Thiel and Alex Carp and
Stefan Cohen and Joe Lonsdale, Nathan Gettings, like the original founders of Palantir had, was that
go into anywhere in the Fortune 500, go into any government agency in the world, including
the United States. And nobody who
understands computer science and technology at the level that, you know, at the highest possible
level would ever even be in that room. And so Palantir's sort of really, really big idea that they
discovered very early was that the problems that those places face, they're actually
multi-billion dollars, sometimes trillion-dollar problems. And yet, this was well before AI became a thing.
I mean, people were sort of talking about machine learning, but, you know, back then they called it data
mining. You know, the world is awash in data, these giant databases of people and things and
transactions and we have no idea what to do with it. That's what Palantir was, is, and still is,
that you can go and find the world's best technologists who know how to write software to actually make
sense of the world. You know, you have these petabytes of data and you don't know how do you find
the needle in the haystack. And, you know, the wild thing is going on something like 20,
22 years later, it's only become more true that we have more and more data and we have less and
less of an understanding of what's going on. And it's no mistake that actually, now that we have
LLMs, like we actually, it is becoming much more tractable. And then the forward deployed engineer
title was specifically, how do you sit next to literally the FBI agent who's investigating
domestic terrorism? How do you sit right next to them in their actual office? and see what does the case coming in look like? What are all the steps? When you actually need to go to the
federal prosecutor, what are the things that they're sending? Is it, I mean, what's funny is like literally,
it's like word documents and Excel spreadsheets, right? And what you do as a forward deployed engineer is take these sort of, you know, file cabinet and fax machine things that people have to do and then convert it
into really clean software. So, you know, the classic view is that it should be,
as easy to actually do an investigation at a three-letter agency as going and taking a photo
of your lunch on Instagram and posting it to all your friends. Like, that's, you know, kind of the funniest part of it. And so, you know, I think it's no mistake today that forward deployed engineers
who came up through that system at Palantir now, they're turning out to be some of the best founders
at YC, actually. Yeah, I mean, it produced this incredible, an incredible number of startup founders
because, yeah, like the training to be a forward deployed engineer, that's exactly the right
training to be a founder of these companies now. The other interesting thing about Palantir is like other companies would send like a salesperson to go and sit with the FBI agent.
And like Palantir sent engineers to go and do that. I think Palantir is probably the first company to really like institutionalize that and scale that as a process, right?
Yeah. I mean, I think what happened there, the reason why they were able to get these sort of seven and eight and now nine figure contracts very consistently is that instead of sending someone who's like hair and teeth and they're in there and, you know, let's go to the, let's go to the steakhouse.
You know, it's all like relationship and you'd have one meeting, they would really like the salesperson,
and then through sheer force of personality, you try to get them to give you a seven-figure contract.
And like the timescales on this would be, you know, six weeks, 10 weeks, 12 weeks, like five years.
I don't know. It's like, and the software would never work. Whereas if you put an engineer in there and you give them, you know, Palantir Foundry,
which is what they now call sort of their core datavis and data mining suite.
instead of the next meeting being reviewing 50 pages of, you know, sort of sales documentation
or a contract or a spec or anything like that, it's literally like, okay, we built it.
And then you're getting like real live feedback within days.
And I mean, that's honestly the biggest opportunity for startup founders. If startup founders can do that and that's what forward deployed engineers are sort of used
to doing, that's how you could beat a sales force or an Oracle or, you know, a
A booze Allen or literally any company out there that has a big office and a big fancy,
you know, you have big fancy salespeople with big strong handshakes.
And it's like, how does a really good engineer with a weak handshake go in there and beat them?
Well, it's actually, you show them something that they've never seen before and like make them feel
super heard. You have to be super empathetic about it.
Like, you actually have to be a great designer and product person. And then, you know, come back and you can just blow them away.
Like, the software is so. powerful that, you know, the second you see something that, you know, makes you feel seen,
you want to buy it on the spot.
Is a good way of thinking about it that founders should think about themselves as being the four deployed engineers of their own company?
Absolutely. Yeah. Like, you definitely can't farm this out. Like literally, the founders themselves, they're technical.
They have to be the great product people. They have to be the ethnographer.
They have to be the designer. You want the person on the second meeting to see the demo you put together based on the stuff
you heard. And you want the person on the second meeting. And you want them to say, wow, I've never seen anything like that and take my money.
I think the incredible thing about this model is this is why we're seeing a lot of the
vertical AI agents take off is precisely this, because they can have these meetings with
the end buyer and champion at these big enterprises. They take that context and then they stuff it basically in the prompt.
And then they can quickly come back in a meeting like just the next day. Maybe with Palantir would have taken a bit longer. a team of engineers. Here it could be just the two founders go in and then they would close
the six, seven figure deals which were seen and with large enterprises, which has never been done before. And it's just possible with this new model of forward-deploy engineer plus AI is just on accelerating. It just reminds me a company I mentioned before on the podcast like GigormL who do customer, another
customer support and especially a lot of voice support. And it's just classic Aids of two extremely talented software engineers, not natural sales, people, but they force themselves
to be essentially forward-deployed engineers. And they close a huge deal with Zepto and then a
couple of other companies they can't announce yet. Do they physically go on site like the Palant tier model? Yes. So they do, so they did all of that where once they close the deal, they go on site
and they sit there with all the customer support people and figuring out how to keep tuning and getting
the software or the LLM to work even better. But before that, even to win the deal. or what they found is that they can win by just having the most impressive demo. And in their case, they've innovated a bit on the RAG pipeline so that they can have their voice
responses be both accurate and very low latency, sort of like a technically challenging thing to do. But I just feel like in the public, pre sort of the current LLM rise, you couldn't necessarily
differentiate enough in the demo phase of sales to beat out incumbent.
So you can really beat Salesforce by having a slightly better CRM with a better UI.
But now, because the technology evolves so fast and so hard to get this last five to 10 percent correct,
you can actually, if you're a forward-deployed engineer, go in, do the first meeting, tweak it so that it works really well for that customer, go back with the demo and just get that, oh, wow, like we've not seen
anyone else pull this off before experience and close huge deals. And that was the exact same case with Happy Robot, who has sold seven-figure contracts to the top three largest logistics.
brokers in the world. They're built AI voice agents for that. They are the ones doing the
forward-deploy engineer model and talking to like the CIOs of these companies and quickly shipping
a lot of product. Like very very quick turnaround and it's been incredible to see that take off right now.
And it started from six-figure deals, now doing closing and seven-figure deals, which is crazy.
This is just a couple months after.
So that's the kind of stuff that you can do with, I mean, unbelievably, very, very smart prompt
engineering, actually. Well, one of the things that's kind of interesting about each model is that they each seem to have their own personality.
And one of the things the founders are really realizing is that you're going to go to different people for different things, actually.
One of the things that's known a lot is Claude is sort of the more happy and more human steerable model.
And the other one is Lama, four, is one that needs a lot more steering.
It's almost like talking to a lot. a developer and part of it could be an artifact of not having done as much RL8, RLHF on top of it.
So it's a bit more rough to work with, but you could actually steer it very well if you actually
are good at actually doing a lot of problem. I'm almost doing a bit more RLHF, but it's a bit harder to
work with, actually. Well, one of the things we've been using LLMs for internally is actually helping founders
figure out who they should take money from. And so in that case, sometimes you need
a very straightforward rubric, a zero to a hundred, zero being never ever take their money,
and 100 being take their money right away, like they actually help you so much that you'd be crazy
not to take their money.
We've been working on some scoring rubrics around that using prompts.
What are some of the things we've learned? So it's certainly best practice to give LLM's rubrics, especially if you want to get a numerical score
as the output. You want to give it a rubric to help it understand, like, how should I think through and what's like 80 versus a 90. But these rubrics are never perfect. There's often always exceptions. And you tried it with O3 versus Gemini 2.5 and you found discrepancies.
This is what we found really interesting is that you can give the same rubric to two different models. And in our specific case, what we found is that 03 was very rigid, actually. Like,
it really sticks to the rubric. It's heavily penalizes for anything that doesn't fit like the
rubric that you've given it. Whereas Gemini 2.5 Pro was actually quite good at being flexible. in that it would apply the rubric, but it could also sort of almost reason through why someone
might be like an exception or why you might want to push something up more positively or negatively
than the rubric might suggest, which I just thought was really interesting.
Because it's just like when you're training a person, you're trying to, you give them a rubric,
like you want them to use a rubric as a guide, but there are always these sort of edge cases where
you need to sort of think a little bit more deeply. And I just thought it was interesting that the models themselves will have.
handle that differently, which means they sort of have different personalities, right? Like, O3 felt a little bit more, like the soldier sort of, like, okay, I'm definitely like, check,
check, check, check. And Gemini Pro 2.5, a little bit more like a high agency sort of employee
was like, oh, okay, no, I think this makes sense, but this might be an exception in this case, which was really interesting to see.
Yeah, it's funny to see that for investors. You know, sometimes you have investors like a benchmark or a thrive.
It's like, yeah, take their money right away. Their process is immaculate. They never ghost anyone.
They answer their emails fast. than most founders. It's, you know, very impressive. And then one example here might be,
there are plenty of investors who are just overwhelmed and maybe they're just not that good at managing their time.
And so they might be really great investors and their track record bears that out. But they're sort of
slow to get back. They seem overwhelmed all the time. They accidentally, probably not intentionally
ghost people. And so this is legitimately exactly what an LLM is for. Like the debug info on some
of these are very interesting to see. Like, you know, maybe it's a 91 instead of like an 89. We'll see. I guess one of the things that's been really surprising to me as, you know, we ourselves are playing with it. And we spend, you know, maybe 80 to 90 percent of our time with founders who are all the way out on the edge is, you know, on the one hand, the analogies, I think even we use to discuss this is it's kind of like coding. It kind of actually feels like coding in, you know, 1995. Like the tools are not all the way there. There's a lot of stuff that's unspecified.
we're, you know, in this new frontier. But personally, it also kind of feels like learning how to manage a person. Where it's like, how do I actually communicate, you know, the things they need to know in order to make a good decision? And how do I make sure that they know, you know, how I'm going to evaluate and score them?
And not only that, like, there's this aspect of Kaizen, you know, this manufacturing technique that created really, really good cars for Japan in the 90s.
And that principle actually says that the people who are the absolute best at improving the process are the people actually doing it.
And it's literally why Japanese cars got so good in the 90s. And that's metaprompting to me. So I don't know, it's a brave new world. We're sort of in this new moment.
So with that, we're out of time. But can't wait to see what kind of prompts you guys come up with. And we'll see you next time.
I'm going to do.
```