{"id":13039,"date":"2025-07-21T14:09:38","date_gmt":"2025-07-21T06:09:38","guid":{"rendered":"https:\/\/ihower.tw\/blog\/?p=13039"},"modified":"2025-09-21T16:30:47","modified_gmt":"2025-09-21T08:30:47","slug":"aie-ai-evals","status":"publish","type":"post","link":"https:\/\/ihower.tw\/blog\/13039-aie-ai-evals","title":{"rendered":"Aihao AI Engineer Newsletter \ud83d\ude80 What Is Error Analysis in AI Evals? #30"},"content":{"rendered":"\n<figure class=\"wp-block-image is-resized\"><img decoding=\"async\" src=\"https:\/\/listmonk.aihao.tw\/uploads\/123v3s.png\" alt=\"\" style=\"width:549px;height:auto\"\/><\/figure>\n\n\n\n<p>Hello, AI developers! \ud83d\udc4b<\/p>\n\n\n\n<p>I'm ihower. Somehow this is already issue 30. Thank you for subscribing and supporting the newsletter all along \ud83d\ude4f<\/p>\n\n\n\n<p>If you enjoy casual discussion and sharing the latest news, you're welcome to join the <a href=\"https:\/\/t.me\/+ewllzijfb7YyNTZl\">Telegram discussion group<\/a>!<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">\ud83e\udded <a href=\"https:\/\/hamel.dev\/blog\/posts\/field-guide\/index.html\">A Field Guide to Rapidly Improving AI Products<\/a><\/h3>\n\n\n\n<p>Hamel Husain shares six evaluation and iteration strategies used by AI teams that actually succeed:<\/p>\n\n\n\n<ol class=\"wp-block-list\">\n<li>Error analysis is what really works; don't get hooked on generic metrics in pretty dashboards<\/li>\n\n\n\n<li>The most important investment: a custom data viewer<\/li>\n\n\n\n<li>Have domain experts write the 
prompts themselves<\/li>\n\n\n\n<li>Bootstrap with synthetic data<\/li>\n\n\n\n<li>Keep the evaluation system trustworthy: use binary pass\/fail judgments instead of fuzzy scores<\/li>\n\n\n\n<li>Roadmaps should count experiments, not features<\/li>\n<\/ol>\n\n\n\n<p>Build the evaluation infrastructure first, then think about specific features. It sounds slow, but it is actually the fastest route.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">\u2753 <a href=\"https:\/\/hamel.dev\/blog\/posts\/evals-faq\/index.html\">FAQ from the AI Evals course<\/a><\/h3>\n\n\n\n<p>Hamel Husain and Shreya Shankar compiled an FAQ for their <a href=\"https:\/\/maven.com\/parlance-labs\/evals\">AI Evals course<\/a>, collecting the questions they were asked most often after teaching 700+ engineers and PMs. Highlights include:<\/p>\n\n\n\n<ol class=\"wp-block-list\">\n<li>Error analysis is the way to go<\/li>\n\n\n\n<li>A self-built evaluation interface beats off-the-shelf tools<\/li>\n\n\n\n<li>Binary evaluation &gt; Likert scale (1-5 points)<\/li>\n\n\n\n<li>RAG is not dead; it just has to be used the right way<\/li>\n\n\n\n<li>Don't use off-the-shelf generic metrics; they are useless for most AI applications<\/li>\n<\/ol>\n\n\n\n<h3 class=\"wp-block-heading\">\ud83d\udd75\ufe0f\u200d\u2640\ufe0f <a href=\"https:\/\/ihower.tw\/blog\/12960-ai-evals-and-error-analysis\">What is error analysis?<\/a><\/h3>\n\n\n\n<p>Both of the articles above put the spotlight on error analysis, so I wrote an article on AI 
application evaluation and its error analysis. It is long, so please read it on my blog.<\/p>\n\n\n\n<p>This is about evaluating Q&amp;A that has no reference answer (as opposed to single-choice or multiple-choice questions, which have fixed answers). The approach differs from the common G-Eval style, which works from a positive list: you score outputs against your criteria (for example, 1~5 for how well they match).<br>The method taught here is to do error analysis first, obtain a concrete negative list of failure modes, and then measure and improve &#8220;each&#8221; failure mode one by one.<\/p>\n\n\n\n<!--more-->\n\n\n\n<h3 class=\"wp-block-heading\">\ud83e\udde0 <a href=\"https:\/\/www.dbreunig.com\/2025\/06\/22\/how-contexts-fail-and-how-to-fix-them.html\">How Long Contexts Fail and How to Fix Your Context<\/a><\/h3>\n\n\n\n<p>Models support ever longer contexts, but a longer context does not mean a better response. The author summarizes four failure modes:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Context poisoning: a hallucinated error gets into the context and is then referenced over and over. For example, when Gemini played Pok\u00e9mon it produced an incorrect game state, which left the AI fixated on impossible goals<\/li>\n\n\n\n<li>Context distraction: when the context gets too long, the model over-relies on its history and neglects what it learned in training. Research found that beyond 100k 
tokens, the agent starts repeating past actions instead of coming up with new strategies<\/li>\n\n\n\n<li>Context confusion: irrelevant information pushes the model toward low-quality responses. The Berkeley Function-Calling Leaderboard shows that every model performs worse when given multiple tools<\/li>\n\n\n\n<li>Context clash: new information contradicts information already in the context. A Microsoft study shows that feeding a prompt in over several turns, instead of all at once, lowers performance by 39% on average<\/li>\n<\/ul>\n\n\n\n<p>The author also shares <a href=\"https:\/\/www.dbreunig.com\/2025\/06\/26\/how-to-fix-your-context.html\">How to Fix solutions<\/a>, including selective loading with RAG, dynamic tool selection, context isolation, and pruning irrelevant information.<\/p>\n\n\n\n<blockquote class=\"wp-block-quote is-layout-flow wp-block-quote-is-layout-flow\">\n<p>LangGraph has published an example implementation: <a href=\"https:\/\/github.com\/langchain-ai\/how_to_fix_your_context\">github.com\/langchain-ai\/how_to_fix_your_context<\/a><\/p>\n<\/blockquote>\n\n\n\n<h3 class=\"wp-block-heading\">\u2699\ufe0f <a href=\"https:\/\/manus.im\/blog\/Context-Engineering-for-AI-Agents-Lessons-from-Building-Manus\">Manus's lessons on context engineering for AI agents<\/a><\/h3>\n\n\n\n<p>The Manus team, which recently moved to Singapore, shared hands-on lessons from building their agent. It has quite a few original insights and is well worth reading, including:<\/p>\n\n\n\n<ol class=\"wp-block-list\">\n<li>The KV-cache (prompt caching) 
hit rate is the most important metric for an AI agent in production. Keep the prefix of the context as stable as possible, because it directly affects cost, by a factor of 10<\/li>\n\n\n\n<li>Too many tool choices make the agent dumber. You can design a state machine that decides dynamically which tools are available at each moment. But avoid dynamically changing the tool-list schema, because that breaks prompt caching; instead, mask logits so the model does not pick specific tools<\/li>\n\n\n\n<li>Treat the file system as unlimited external memory, which reduces context usage and avoids truncation or compression. For example, when using a web-scraping tool, don't stuff the full result back into the context. Write it to a file and let the agent read that file again with its tools; by default the context only needs to keep the URL.<\/li>\n<\/ol>\n\n\n\n<p>Editor's note: the article is not very clear about how this is done, so the following is my own guess at how the reading might work: for example, provide a grep tool so the agent can search for and read only certain passages, or, more advanced, use a sub-agent tool to read the file, with the sub-agent extracting and returning only the relevant passages for the lead agent to put into its context. That can greatly reduce how much of the main context is used<\/p>\n\n\n\n<ol start=\"4\" class=\"wp-block-list\">\n<li>Use todo.md 
to steer attention: keep rewriting the to-do list and placing it at the end of the context<\/li>\n\n\n\n<li>Keep the wrong turns in the context so the agent does not repeat its mistakes<\/li>\n\n\n\n<li>Few-shot examples need diversity; avoid examples that are overly uniform<\/li>\n<\/ol>\n\n\n\n<p>I had never thought about taking the cache hit rate this seriously, but on reflection it makes a lot of sense. Controlling an agent's token cost matters enormously, because the whole conversation thread gets very long!<\/p>\n\n\n\n<p>More discussion in my <a href=\"https:\/\/www.facebook.com\/ihower\/posts\/10162696657613971\">Facebook post<\/a>.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">\ud83c\udfc5 <a href=\"https:\/\/huggingface.co\/spaces\/galileo-ai\/agent-leaderboard\">Galileo Agent Leaderboard v2<\/a><\/h3>\n\n\n\n<p>Galileo Labs released the Agent Leaderboard v2 benchmark. Rather than only testing whether an agent can call the right tool, it simulates real enterprise scenarios and evaluates how well an AI handles complex decisions and completes real tasks over multi-turn conversations.<br>The benchmark covers five key industries: banking, healthcare, investment, telecom, and insurance. Each test scenario is carefully designed and contains 5-8 interrelated user goals, reflecting the task complexity of real enterprise environments.<\/p>\n\n\n\n<p>They measure two core metrics:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Action Completion (AC): 
did the agent actually accomplish every one of the user's goals?<\/li>\n\n\n\n<li>Tool Selection Quality (TSQ): did the agent choose the right tool and use it correctly?<\/li>\n<\/ul>\n\n\n\n<p>The latest results show GPT-4.1 at the top of the leaderboard with an average Action Completion score of 62%, at a cost of US$0.068 per conversation.<\/p>\n\n\n\n<p>Surprisingly, reasoning models perform poorly on this benchmark. The presumed reason is that enterprise scenarios value execution efficiency over deep analysis, so reasoning models may overthink and end up completing tasks less efficiently.<br>It is a reminder that model selection should start from the needs of the actual application. Reasoning models excel on academic tests, but for everyday business tasks, simple and direct execution may be more effective.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">\ud83d\udcca <a href=\"https:\/\/x.com\/xeophon_\/status\/1917175899948020203\">LLM benchmarks, quickly explained<\/a><\/h3>\n\n\n\n<p>Xeophon walks through various AI benchmarks. Many people see the scores on model leaderboards without really knowing what those numbers mean.<\/p>\n\n\n\n<p>Take GPQA 
as an example: it is one of the most popular benchmarks right now and is very well designed, but the key point is that it only covers three fields: biology, physics, and chemistry. If you think a high GPQA score means the model knows everything, you are badly mistaken.<\/p>\n\n\n\n<p>LiveCodeBench is similar. It refreshes its problems regularly to stay fresh, but simple coding problems are already too easy for today's LLMs; easy\/medium LeetCode problems basically no longer stump them.<\/p>\n\n\n\n<p>Read AI benchmarks with care: a high score does not mean a model is good at everything. Every benchmark has its own design goals and limitations, and you can only interpret a model's real capabilities correctly once you understand the details behind it.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">\ud83d\udd25 <a href=\"https:\/\/axk51013.medium.com\/llm-%E5%B0%88%E6%AC%84-temperature-%E4%B8%80%E5%AE%9A%E8%A6%81%E8%A8%AD-0-%E5%97%8E-52106a444424\">Does temperature have to be set to 0?<\/a><\/h3>\n\n\n\n<p>A key point of confusion when using LLMs is whether temperature should be set to 0.<br>Many people say &#8220;Stability matters most, so just set temperature to 0!&#8221; The reasons usually given are that it reduces hallucination, guarantees reproducibility, and gives the best results. The author argues this may be one of the biggest misconceptions about using LLMs.<\/p>\n\n\n\n<p>The author, Oscar, starts from LLM 
decoding mechanics and debunks three myths:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Myth 1: temperature=0 effectively reduces hallucination. In practice, temperature&gt;1 does increase hallucination, but within temperature&lt;1 lower is not guaranteed to be better. The key points: greedy decoding easily falls into repetition loops, and it leans toward the most common answer in the training data, while the most correct information on the web is often not the most frequent. In RAG scenarios it may make the model over-rely on its internal knowledge and ignore the context<\/li>\n\n\n\n<li>Myth 2: only temperature=0 gives the same result every time. This reasoning is fundamentally flawed. &#8220;Randomness&#8221; on modern computers is really pseudo-random numbers, so controlling the random seed is in principle enough to reproduce a result. That is why OpenAI provides a seed parameter rather than requiring you to set temperature=0.<\/li>\n\n\n\n<li>Myth 3: temperature=0 gives the best results. A large body of research shows that greedy decoding is prone to severe self-repetition, to output that is locally optimal but not globally optimal, and to text that differs too much from how humans organize language. When people compose language, they often use words that, from an LLM's 
point of view are &#8220;low probability&#8221;, because we deliberately pick a more precise expression.<\/li>\n<\/ul>\n\n\n\n<p>The closing advice is to treat temperature as a hyperparameter and tune it with experiments on your specific task. According to the author, the officially recommended setting for most models is temperature=0.6, top_p=0.9. You can also try the emerging min-p sampling method.<br>Have an evaluation process and find the best setting through experiments, rather than setting temperature=0 without thinking.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">\ud83d\udee0\ufe0f <a href=\"https:\/\/axk51013.medium.com\/human-agent-computer-interaction-design-2-%E5%A6%82%E4%BD%95%E6%AD%A3%E7%A2%BA%E8%A8%AD%E8%A8%88-tools-%E4%BE%86%E6%8F%90%E5%8D%87-agent-performance-6cdb848fafd0\">How to design tools properly to improve agent performance<\/a><\/h3>\n\n\n\n<p>Also by Oscar, like the previous piece. This one tells the story of how the edit tool design evolved, and it is a great read.<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>First generation, whole editing: have the LLM regenerate the entire code file<\/li>\n\n\n\n<li>Second generation, diff &amp; patch: use the native Linux diff and patch commands<\/li>\n\n\n\n<li>Third generation, improved tools: the major vendors made various improvements to address the problems with diff<\/li>\n<\/ul>\n\n\n\n<p>A stronger LLM alone is not enough; it takes good tool design to get the most out of it. On SWE-Bench Verified, these improvements can bring a 
6-8% improvement.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">\ud83c\udf93 <a href=\"https:\/\/axk51013.medium.com\/llm-%E6%96%B0%E6%89%8B%E5%85%A5%E9%96%80-2025-%E5%B9%B4%E5%A6%82%E4%BD%95%E8%87%AA%E5%AD%B8-llm-a0de380d78eb\">How to self-study LLMs in 2025<\/a><\/h3>\n\n\n\n<p>One more recommendation from Oscar: he shares his suggested self-study list for LLMs. Oscar believes that if you truly master this material, you will understand LLMs better than more than half of the junior LLM engineers on the market earning annual salaries of 1 to 2 million, and interviews will be no problem.<br>A curious phenomenon has appeared in tech: veterans who joined in 2021-2022 can no longer even pass interviews, and colleagues who joined in 2023 generally cannot match today's stronger newcomers, so you have to keep learning to avoid being washed out by the wave of new technology.<\/p>\n\n\n\n<hr class=\"wp-block-separator has-alpha-channel-opacity\"\/>\n\n\n\n<p>I hope you enjoyed this issue! If you want to learn context engineering more systematically, you're welcome to sign up for my <a href=\"https:\/\/aihao.tw\/llm\">LLM Application Development Workshop<\/a> course. Feel free to recommend it to friends who are interested in LLM application development, too!<\/p>\n\n\n\n<p>\u2013 ihower<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Hello! 
Greetings, AI developers \ud83d\udc4b I'm ihower. Somehow this is already issue 30. Thank you for &hellip; <\/p>\n<p class=\"link-more\"><a href=\"https:\/\/ihower.tw\/blog\/13039-aie-ai-evals\" class=\"more-link\">Read more<span class=\"screen-reader-text\"> &#8220;Aihao AI Engineer Newsletter \ud83d\ude80 What Is Error Analysis in AI Evals? #30&#8221;<\/span><\/a><\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"_monsterinsights_skip_tracking":false,"_monsterinsights_sitenote_active":false,"_monsterinsights_sitenote_note":"","_monsterinsights_sitenote_category":0,"jetpack_post_was_ever_published":false,"_jetpack_newsletter_access":"","_jetpack_dont_email_post_to_subs":false,"_jetpack_newsletter_tier_id":0,"_jetpack_memberships_contains_paywalled_content":false,"_jetpack_memberships_contains_paid_content":false,"footnotes":"","jetpack_publicize_message":"","jetpack_publicize_feature_enabled":true,"jetpack_social_post_already_shared":true,"jetpack_social_options":{"image_generator_settings":{"template":"highway","default_image_id":0,"font":"","enabled":false},"version":2}},"categories":[91],"tags":[],"class_list":["post-13039","post","type-post","status-publish","format-standard","hentry","category-aie","entry"],"jetpack_publicize_connections":[],"jetpack_featured_media_url":"","jetpack_shortlink":"https:\/\/wp.me\/p1q6tG-3oj","jetpack_sharing_enabled":true,"jetpack_likes_enabled":true,"_links":{"self":[{"href":"https:\/\/ihower.tw\/blog\/wp-json\/wp\/v2\/posts\/13039","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/ihower.tw\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/ihower.tw\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/ihowe
r.tw\/blog\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/ihower.tw\/blog\/wp-json\/wp\/v2\/comments?post=13039"}],"version-history":[{"count":4,"href":"https:\/\/ihower.tw\/blog\/wp-json\/wp\/v2\/posts\/13039\/revisions"}],"predecessor-version":[{"id":13268,"href":"https:\/\/ihower.tw\/blog\/wp-json\/wp\/v2\/posts\/13039\/revisions\/13268"}],"wp:attachment":[{"href":"https:\/\/ihower.tw\/blog\/wp-json\/wp\/v2\/media?parent=13039"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/ihower.tw\/blog\/wp-json\/wp\/v2\/categories?post=13039"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/ihower.tw\/blog\/wp-json\/wp\/v2\/tags?post=13039"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}