hermes-agent implements a complete self-learning loop: experience extraction → knowledge storage → intelligent retrieval → context injection → execution verification → automatic improvement. It is a project with a built-in closed-loop self-learning mechanism — not just task summaries, but a full cycle of persistent memory, skill induction, retrieval, and user modeling, achieved mostly through engineering optimization. The Skills system lets an AI agent accumulate experience the way a human expert does: successful approaches are written up as SOPs, continuously revised through use, and shareable with others.

Background review (memory review, skill review, etc.): an asynchronous fork spawns a background review agent instance that judges what in the conversation history is worth distilling into a valuable skill or memory. Code: hermes-agent/run_agent.py.

Agentic RL: the reward function gives, e.g., 1 point if the code executes and passes (or retrieval is correct), else 0; see also OpenClaw-RL's Combined (RLVR + OPD) method, where the RLVR reward is a weighted sum of three signals — correctness 70%, efficiency 15%, tool usage 15%.

Contents
- 一、Building the Harness — six components
- 二、hermes agent
  - 1、Background review agent
  - 2、Agentic RL
- Reference

一、Building the Harness — six components

[On the Harness] How to build a Harness — the six components explained: https://mp.weixin.qq.com/s/HwqEaXSGkcYgUNrzB2okuA

The six components:

1、File system (workbench). Not just file storage — it is the agent's "external brain". Used to store intermediate results, enable multi-agent collaboration (sharing state through files), and integrate with Git for version control and rollback.

2、Bash sandbox (hands and feet). Enables the "write → run → fix" self-verification loop. The sandbox provides resource isolation (e.g., Docker) and prevents the agent from executing dangerous operations (e.g., rm -rf); it is what turns the agent from an "advisor" into an "engineer".

3、Memory (AGENTS.md — a plug-in brain). A clever "add knowledge without changing weights" scheme. The agent writes project conventions and architecture decisions into a Markdown file, which is automatically injected into context at the next startup. Cheaper than fine-tuning, and human-readable and editable.

4、Web Search + MCP. Web Search solves freshness (e.g., looking up the latest docs). MCP (Model Context Protocol), introduced by Anthropic as the "USB port of the AI world", lets the agent plug into databases, Jira, and other internal tools — an upgrade from "search" to "connect".

5、Context engineering (attention management). Fights context rot. Strategies such as compression (summarization), offloading (store large outputs to files and keep only a summary), and layered management keep important information from being drowned out and keep the model "clear-headed".

6、Orchestration + Hooks (scheduling and quality control). Orchestration decomposes large tasks and dispatches them to different agents (e.g., small models for simple tasks, large models for complex ones). Hooks are quality gates: deterministic rules (e.g., lint checks, format validation) intercept erroneous model output and guarantee a quality floor.

二、hermes agent

1、Background review agent

Whenever the main agent finishes replying to the user, the interaction appears to be over from the user's perspective. But in the background, Hermes calls _spawn_background_review to asynchronously launch a review agent. This is an asynchronous mechanism: the system immediately forks a new lightweight agent instance dedicated to a deep post-mortem of the conversation that just ended. This background agent does not interfere with the foreground user experience; instead it reviews the interaction along three dimensions, each with its own prompt:

- Memory review (_MEMORY_REVIEW_PROMPT): "What in this conversation is worth remembering?" Judges whether the conversation contains key experience or facts worth keeping long-term, distills them into long-term memories, and stores them in the agent's memory bank.
- Skill review (_SKILL_REVIEW_PROMPT): "Is this task pattern worth turning into a skill?" Analyzes whether the current task's solution path generalizes and deserves to be abstracted and solidified into a reusable skill.
- Combined review (_COMBINED_REVIEW_PROMPT): "What could be improved?" Reflects on whether the execution contained room for optimization or latent error patterns.

For the full text, see the prompts in the source (snapshot as of 20260429); note in particular the "THINK CLASS-FIRST" directive.
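The asynchronous fork described above can be sketched as follows. This is a minimal illustration assuming asyncio; names like `spawn_background_review` and `run_review_agent` mirror the article's description but are hypothetical, not the actual hermes-agent API:

```python
import asyncio

# Hypothetical sketch of the fork-and-review pattern: the main agent
# replies to the user, then fires off a lightweight review agent that
# re-reads the transcript without blocking the foreground turn.

async def run_review_agent(transcript: list[str], review_prompt: str) -> str:
    """Lightweight agent instance that re-reads the transcript with a review prompt."""
    await asyncio.sleep(0)  # stand-in for the actual model call
    if not transcript:
        return "Nothing to save."
    return f"Reviewed {len(transcript)} messages"

def spawn_background_review(transcript: list[str], review_prompt: str) -> asyncio.Task:
    # Fire-and-forget: control returns to the user immediately,
    # while the review runs concurrently on the event loop.
    return asyncio.create_task(run_review_agent(transcript, review_prompt))

async def main() -> str:
    reply = "...main agent answers the user..."  # user-facing turn completes here
    task = spawn_background_review(
        ["user: hi", "assistant: hello"], "_MEMORY_REVIEW_PROMPT"
    )
    # The review finishes later; awaited here only to show its result.
    return await task
```

The key design choice is that the foreground turn never waits on the review: the task is scheduled, and the user sees their reply immediately.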
Takeaway: skill generation considers the task class, not the concrete task; memory stores the user profile (e.g., preferences, work style).

```python
# ------------------------------------------------------------------
# Background memory/skill review
# ------------------------------------------------------------------
_MEMORY_REVIEW_PROMPT = (
    "Review the conversation above and consider saving to memory if appropriate.\n\n"
    "Focus on:\n"
    "1. Has the user revealed things about themselves — their persona, desires, "
    "preferences, or personal details worth remembering?\n"
    "2. Has the user expressed expectations about how you should behave, their work "
    "style, or ways they want you to operate?\n\n"
    "If something stands out, save it using the memory tool. If nothing is worth "
    "saving, just say \"Nothing to save.\" and stop."
)

_SKILL_REVIEW_PROMPT = (
    "Review the conversation above and consider whether a skill should be saved or updated.\n\n"
    "Work in this order — do not skip steps:\n\n"
    "1. SURVEY the existing skill landscape first. Call skills_list to see what you have. "
    "If anything looks potentially relevant, skill_view it before deciding. You are looking "
    "for the CLASS of task that just happened, not the exact task. Example: a successful "
    "Tauri build is in the class \"desktop app build troubleshooting\", not \"fix my specific "
    "Tauri error today\".\n\n"
    "2. THINK CLASS-FIRST. What general pattern of task did the user just complete? What "
    "conditions will trigger this pattern again? Describe the class in one sentence before "
    "looking at what to save.\n\n"
    "3. PREFER GENERALIZING AN EXISTING SKILL over creating a new one. If a skill already "
    "covers the class — even partially — update it (skill_manage patch) with the new insight. "
    "Broaden its \"when to use\" trigger if needed.\n\n"
    "4. ONLY CREATE A NEW SKILL when no existing skill reasonably covers the class. When you "
    "create one, name and scope it at the class level (\"react-i18n-setup\", not "
    "\"add-i18n-to-my-dashboard-app\"). The trigger section must describe the class of "
    "situations, not this one session.\n\n"
    "5. If you notice two existing skills that overlap, note it in your response so a future "
    "review can consolidate them. Do not consolidate now unless the overlap is obvious and "
    "low-risk.\n\n"
    "Only act when something is genuinely worth saving. If nothing stands out, just say "
    "\"Nothing to save.\" and stop."
)

_COMBINED_REVIEW_PROMPT = (
    "Review the conversation above and consider two things:\n\n"
    "**Memory**: Has the user revealed things about themselves — their persona, desires, "
    "preferences, or personal details? Has the user expressed expectations about how you "
    "should behave, their work style, or ways they want you to operate? If so, save using "
    "the memory tool.\n\n"
    "**Skills**: Was a non-trivial approach used to complete a task that required trial and "
    "error, changing course due to experiential findings, or a different method or outcome "
    "than the user expected? If so, work in this order:\n"
    " a. SURVEY existing skills first (skills_list, then skill_view on candidates).\n"
    " b. Identify the CLASS of task, not the specific task (\"desktop app build "
    "troubleshooting\", not \"fix my Tauri error\").\n"
    " c. PREFER UPDATING/GENERALIZING an existing skill that covers the class.\n"
    " d. ONLY CREATE A NEW SKILL if no existing one covers the class. Scope at the class "
    "level, not this one session.\n"
    " e. If you notice overlapping skills during the survey, note it so a future review can "
    "consolidate them.\n\n"
    "Only act if there's something genuinely worth saving. If nothing stands out, just say "
    "\"Nothing to save.\" and stop."
)
```
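The ordered procedure in _SKILL_REVIEW_PROMPT (survey → think class-first → patch before create) can be sketched against a hypothetical skill store. `skills_list` / `skill_view` / `skill_manage` below mirror the tool names in the prompt but are stand-ins, not the real hermes-agent tool API:

```python
from dataclasses import dataclass, field

# Hypothetical in-memory skill store illustrating the prompt's rule:
# prefer generalizing an existing skill over creating a new one.

@dataclass
class Skill:
    name: str                 # scoped at the class level, e.g. "desktop-app-build-troubleshooting"
    trigger: str              # "when to use" — describes a class of situations
    notes: list[str] = field(default_factory=list)

STORE: dict[str, Skill] = {}

def skills_list() -> list[str]:
    """Step 1: survey the existing skill landscape."""
    return sorted(STORE)

def skill_view(name: str) -> Skill:
    return STORE[name]

def skill_manage(name: str, *, trigger: str, insight: str) -> str:
    """Patch an existing skill covering the class, else create a new one."""
    if name in STORE:                        # step 3: generalize the existing skill
        skill = STORE[name]
        skill.notes.append(insight)
        if trigger not in skill.trigger:     # broaden the "when to use" trigger
            skill.trigger += f"; {trigger}"
        return "patched"
    STORE[name] = Skill(name, trigger, [insight])  # step 4: new, class-level skill
    return "created"
```

Calling `skill_manage` twice for the same class first creates the skill, then patches it and broadens its trigger — exactly the "update over create" preference the prompt encodes.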
2、Agentic RL

Reward function:
(1) e.g., give 1 point if the code executes and passes (or retrieval is correct), else 0;
(2) see OpenClaw-RL's Combined (RLVR + OPD), where the RLVR reward is a weighted sum of three signals: correctness 70%, efficiency 15%, tool usage 15%.

Code: hermes-agent/environments/
Paper: OpenClaw-RL: Train Any Agent Simply by Talking, https://arxiv.org/pdf/2603.10165

Core idea of OpenClaw-RL: collect the "next-state signals" — the user following up / correcting / expressing satisfaction, tool results / errors, and environment changes — as online training data. Every post-interaction user reaction, tool result, and environment change is converted into an online RL signal, so the agent keeps improving through real use.

OpenClaw-RL combined advantage:

$$A_t^{\text{combined}} = w_{\text{binary}}\, r_{\text{final}} + w_{\text{opd}} \left( \log \pi_{\text{teacher}}\!\left(a_t \mid s_{\text{enhanced}}\right) - \log \pi_\theta\!\left(a_t \mid s_t\right) \right)$$

The binary reward and the teacher–student distribution gap are weighted into a new advantage estimate; the teacher model runs one forward pass on the hint-enhanced prompt to obtain its logprobs.

How this mitigates the credit-assignment problem:

| Method | Advantage source | Granularity |
|---|---|---|
| RLVR / Binary RL | PRM-given $r_{\text{final}}$ | response-level / sequence-level |
| OPD | teacher–student logprob gap | token-level |
| Combined | weighted sum of both | mixed |
A reference implementation of the OpenClaw-RL RLVR reward mentioned above (reconstructed from the garbled source listing):

```python
async def compute_reward(
    self,
    item: dict,
    result: AgentResult,
    ctx: ToolContext,
) -> float:
    """Multi-signal reward:
    - correctness (0.7): Did the tests pass?
    - efficiency (0.15): Fewer turns = better
    - tool_usage (0.15): Did the agent actually write + run code?
    """
    cfg = self.config

    # ---- Signal 1: Test correctness ----
    # Check if test_solution.py exists and passes in the agent's sandbox
    correctness = 0.0
    try:
        test_result = ctx.terminal("python test_solution.py 2>&1", timeout=30)
        output = test_result.get("output", "")
        exit_code = test_result.get("exit_code", 1)
        if exit_code == 0 and "passed" in output.lower():
            correctness = 1.0
        elif exit_code == 0:
            correctness = 0.8  # Ran without error but no explicit "passed"
        elif "assert" in output.lower() and "error" in output.lower():
            correctness = 0.2  # Partial — code runs but assertions fail
        else:
            correctness = 0.1  # Code errors out entirely
    except Exception as e:
        logger.debug("Test execution failed in reward: %s", e)
        correctness = 0.0

    # ---- Signal 2: Efficiency ----
    max_turns = cfg.max_agent_turns
    turns_used = result.turns_used
    if turns_used <= 3:
        efficiency = 1.0
    elif turns_used <= max_turns // 2:
        efficiency = 0.8
    elif turns_used <= max_turns * 3 // 4:
        efficiency = 0.5
    else:
        efficiency = 0.2

    # ---- Signal 3: Tool usage ----
    tools_used = set()
    for msg in result.messages:
        if msg.get("role") == "assistant" and msg.get("tool_calls"):
            for tc in msg["tool_calls"]:
                fn = tc.get("function", {}) if isinstance(tc, dict) else {}
                name = fn.get("name", "")
                if name:
                    tools_used.add(name)
    # Good: used both terminal and file tools
    if "terminal" in tools_used and ("write_file" in tools_used or "patch" in tools_used):
        tool_usage = 1.0
    elif "terminal" in tools_used:
        tool_usage = 0.6
    elif tools_used:
        tool_usage = 0.3
    else:
        tool_usage = 0.0

    # ---- Combine ----
    reward = (
        cfg.correctness_weight * correctness
        + cfg.efficiency_weight * efficiency
        + cfg.tool_usage_weight * tool_usage
    )
    reward = min(1.0, max(0.0, reward))

    # Track metrics
    self._reward_buffer.append(reward)
    self._correctness_buffer.append(correctness)
    self._efficiency_buffer.append(efficiency)
    self._tool_usage_buffer.append(tool_usage)
    logger.debug(
        "Reward: correctness=%.2f, efficiency=%.2f, tool_usage=%.2f → %.3f",
        correctness, efficiency, tool_usage, reward,
    )
    return reward
```

Reference

[1] One article to understand Hermes — how the new top agent self-evolves from experience
[2] https://github.com/NousResearch/hermes-agent
[3] In-depth analysis of how Hermes Agent achieves "self-evolution", and its Prompt / Context / Harness design practices
[4] https://hermes-agent.nousresearch.com/docs/user-guide/features/rl-training