GPU_GUARD_MONOREPO/docs/superpowers/plans/2026-04-26-multimodal-tool.md
2026-05-20 21:39:12 +08:00

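# Multimodal Image Recognition Tool, Tool Model Classification, and Chat Attachment Upload: Implementation Plan

> **For agentic workers:** REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development (recommended) or superpowers:executing-plans to implement this plan task-by-task.

**Goal:** Add a model-dependency classification and an image recognition tool to tool management, and add attachment upload to the Agent chat page.

**Architecture:** The backend extends the tool entity with the requiresModel/modelChannelId/modelId fields and adds an image_recognize tool (a factory function that receives already-resolved credentials), which calls the model through the existing LLM provider layer. Attachment information is stored in message metadata and injected into the LLM messages by prompt_builder, so content stays clean. On the frontend, the attachment feature is split into 3 independent child components.

**Tech Stack:** Midway.js + TypeORM + TypeBox + Socket.IO, Vue 3 + Element Plus + Pinia, OpenAI-compatible API (Volcano Engine)

**Spec:** `docs/superpowers/specs/2026-04-26-multimodal-tool-design.md`

---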
### Task 1: Add model configuration fields to the Tool entity
**Files:**
- Modify: `packages/backend/src/modules/netaclaw/entity/tool.ts:48`
- [ ] **Step 1:** In `tool.ts`, add the following before the `extra` field (before line 48):
```typescript
@Column({ comment: 'Whether a model configuration is required: 0 = no, 1 = yes', default: 0 })
requiresModel: number;

@Column({ comment: 'Associated model channel ID', nullable: true })
modelChannelId: number;

@Column({ comment: 'Associated model ID', length: 100, nullable: true })
modelId: string;
```
- [ ] **Step 2:** Start the backend and verify that the table schema is synchronized automatically; use MCP to check with `DESCRIBE netaclaw_tool;`
- [ ] **Step 3:** Commit `feat(netaclaw): add requiresModel/modelChannelId/modelId to tool entity`
---
### Task 2: Extend the Catalog Schema and sync the Registry
**Files:**
- Modify: `packages/backend/src/modules/netaclaw/tools/catalog.ts:14`
- Modify: `packages/backend/src/modules/netaclaw/service/tool_registry.ts:28-49,96-128`
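- [ ] **Step 1:** In `catalog.ts`, add `requiresModel?: boolean;` after line 14 of the ToolSchema interface.
Note: `modelChannelId` and `modelId` are runtime configuration. They are set only through the admin UI and do not go into the catalog.
- [ ] **Step 2:** In `tool_registry.ts`, add the following before `extra: null` in the object returned by createDefaults: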
```typescript
requiresModel: s.requiresModel ? 1 : 0,
```
- [ ] **Step 3:** In `tool_registry.ts`, add the following to the update object in syncCatalogToDb:
```typescript
requiresModel: typeof current.requiresModel === 'number' ? current.requiresModel : defaults.requiresModel,
```
- [ ] **Step 4:** In `tool_registry.ts`, add getToolModelConfig after the update method:
```typescript
async getToolModelConfig(toolName: string): Promise<{
  modelChannelId: number; modelId: string; promptHint: string | null;
} | null> {
  const tool = await this.toolRepo.findOneBy({ name: toolName });
  // Treat the tool as unconfigured unless both the channel and the model are set.
  if (!tool?.modelChannelId || !tool?.modelId) return null;
  return { modelChannelId: tool.modelChannelId, modelId: tool.modelId, promptHint: tool.promptHint };
}
```
- [ ] **Step 5:** Commit `feat(netaclaw): extend catalog with requiresModel, add registry sync and lookup`
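
The sync rule from Steps 2 and 3 can be sketched as two pure functions: the catalog's optional boolean becomes a 0/1 column default, and a numeric value already stored in the database wins over that default on re-sync. `CatalogEntry`, `ToolRow`, `defaultRequiresModel`, and `syncedRequiresModel` are hypothetical names used only for illustration; the real createDefaults and syncCatalogToDb handle many more fields. One consequence worth checking against the real code: a row that already stores 0 keeps 0 even if the catalog later sets the flag.

```typescript
// Hypothetical, trimmed-down shapes; the real ToolSchema and tool entity have more fields.
interface CatalogEntry {
  name: string;
  requiresModel?: boolean;
}

interface ToolRow {
  name: string;
  requiresModel?: number | null;
}

// Catalog side: map the optional boolean flag to the 0/1 column value.
function defaultRequiresModel(schema: CatalogEntry): number {
  return schema.requiresModel ? 1 : 0;
}

// Sync side: keep a numeric value already stored in the DB, otherwise use the catalog default.
function syncedRequiresModel(current: ToolRow | undefined, schema: CatalogEntry): number {
  return typeof current?.requiresModel === 'number'
    ? current.requiresModel
    : defaultRequiresModel(schema);
}
```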
---
### Task 3: Add requiresModel filtering to the Tool Controller
**Files:**
- Modify: `packages/backend/src/modules/netaclaw/controller/admin/tool.ts:26`
- Modify: `packages/backend/src/modules/netaclaw/service/tool_registry.ts:130-155`
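- [ ] **Step 1:** In the controller, add `requiresModel?: number;` after line 26 of the page parameters.
- [ ] **Step 2:** In the registry page method, add `requiresModel?: number;` to the parameters, include it in the destructuring, and add the following to `where`: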
```typescript
if (typeof requiresModel === 'number') where.requiresModel = requiresModel;
```
- [ ] **Step 3:** Commit `feat(netaclaw): support requiresModel filtering in tool page endpoint`
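
A minimal sketch of why Step 2 uses a `typeof` check rather than a truthiness check: the filter value 0 ("no model required") is falsy, so `if (requiresModel)` would silently drop it. `ToolPageFilter` and `buildToolWhere` are hypothetical names; the real page method also handles paging and other filters.

```typescript
// Hypothetical filter shape used only for this illustration.
interface ToolPageFilter {
  capability?: string;
  requiresModel?: number;
}

// Build the where-clause object from the optional filters.
function buildToolWhere(filter: ToolPageFilter): ToolPageFilter {
  const where: ToolPageFilter = {};
  if (filter.capability) where.capability = filter.capability;
  // typeof check instead of truthiness, so requiresModel = 0 is still applied.
  if (typeof filter.requiresModel === 'number') where.requiresModel = filter.requiresModel;
  return where;
}
```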
---
### Task 4: Implement the image_recognize tool (reusing the LLM provider layer)
**Files:**
- Create: `packages/backend/src/modules/netaclaw/tools/builtin/image_recognize.ts`
- Modify: `packages/backend/src/modules/netaclaw/tools/catalog.ts:64`
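- [ ] **Step 1:** Create `tools/builtin/image_recognize.ts`. The factory function receives an already-resolved credentials object (not a service) and calls the model through the project's existing LLM provider layer: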
```typescript
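import { Type, type Static } from '@sinclair/typebox';
import { type AnyAgentTool, textResult } from '../common.js';
import { registerSchema } from '../catalog.js';

const DEFAULT_PROMPT = `You are a professional image analysis assistant. Analyze the image using the following steps:
1. **Image classification**: First identify the image type (for example: ID card, driver's license, vehicle registration certificate, business license, invoice, product image, screenshot, photo, table, chart, handwritten text, printed text).
2. **Structured extraction**: Extract key information according to the image type:
- Identity documents: extract all fields (name, ID number, validity period, address, etc.)
- Receipts and invoices: extract amounts, dates, line items, etc.
- Products: extract product name, specifications, price, brand, etc.
- Tables/charts: extract the data structure and key values
- Other: describe the visual content in detail
3. **Detailed description**: Give a complete, detailed text description of the image content without omitting any visible information.
4. **Quality assessment**: Briefly describe the image clarity and whether any areas are occluded or blurred.
Output the analysis in a structured format.`;

const Params = Type.Object({
  image: Type.String({ description: 'Image URL or base64-encoded string' }),
  prompt: Type.Optional(Type.String({ description: 'Analysis prompt' })),
});

export interface ImageRecognizeCredentials {
  baseUrl: string;
  apiKey: string;
  supplier: string;
  modelId: string;
  promptHint: string | null;
}

export function createImageRecognizeTool(creds: ImageRecognizeCredentials): AnyAgentTool {
  return {
    name: 'image_recognize',
    label: 'Image recognition',
    description: 'Analyzes image content; supports ID document recognition, OCR, product recognition, and more. Pass an image URL or base64.',
    parameters: Params,
    async execute(_id, params: Static<typeof Params>) {
      // The admin-configured prompt hint replaces the default prompt when present.
      const systemPrompt = creds.promptHint || DEFAULT_PROMPT;
      const userPrompt = params.prompt
        ? `${systemPrompt}\n\nAdditional user requirements: ${params.prompt}`
        : systemPrompt;

      // Pass URLs and data URLs through; wrap bare base64 in a PNG data URL.
      const isUrl = params.image.startsWith('http') || params.image.startsWith('data:');
      const imageUrl = isUrl ? params.image : `data:image/png;base64,${params.image}`;

      // Reuse the project's LLM provider layer (OpenAI-compatible protocol)
      const { getProvider, supplierToProvider } = await import('../../plugins/llm_providers/index.js');
      const providerName = supplierToProvider[creds.supplier] || 'openai';
      const provider = getProvider(providerName);
      const result = await provider.chat({
        baseUrl: creds.baseUrl,
        apiKey: creds.apiKey,
        model: creds.modelId,
        messages: [{
          role: 'user',
          content: [
            { type: 'text', text: userPrompt },
            { type: 'image_url', image_url: { url: imageUrl } },
          ],
        }],
        maxTokens: 4096,
      });
      return textResult(result.content ?? 'The model returned no content');
    },
  };
}

registerSchema({
  name: 'image_recognize',
  toolset: 'vision',
  description: 'Analyzes image content; supports ID document recognition, OCR, product recognition, and more.',
  capability: 'multimodal',
  visibility: 'tool',
  isCore: false,
  canDisable: true,
  supportsPromptHint: true,
  requiresModel: true,
});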
```
Note: first confirm whether the provider.chat() method in `plugins/llm_providers/` supports multimodal content parts. If it does not, extend the provider layer rather than bypassing it.
- [ ] **Step 2:** Add `import './builtin/image_recognize.js';` at the end of `catalog.ts`.
- [ ] **Step 3:** Commit `feat(netaclaw): implement image_recognize tool (reusing LLM provider layer)`
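
The image handling inside the tool can be isolated into two small pure functions, which makes it easy to unit-test without a provider. This is a sketch, not the plan's code: `toImageUrl` and `buildVisionMessage` are hypothetical helper names, and the URL check here is slightly stricter than the plan's `startsWith('http')` (it requires `http://` or `https://`, so a base64 payload that happens to begin with the letters "http" is not mistaken for a URL). As in the plan, bare base64 is assumed to be PNG. The content-part shape follows the OpenAI-compatible chat format.

```typescript
// Content parts in the OpenAI-compatible chat format.
type ContentPart =
  | { type: 'text'; text: string }
  | { type: 'image_url'; image_url: { url: string } };

// Pass http(s) URLs and data URLs through unchanged; wrap bare base64 in a data URL.
// Assumption: bare base64 is treated as PNG, mirroring the plan's tool code.
function toImageUrl(image: string): string {
  if (/^https?:\/\//i.test(image) || image.startsWith('data:')) return image;
  return `data:image/png;base64,${image}`;
}

// Build the single user message that carries both the prompt and the image.
function buildVisionMessage(prompt: string, image: string): { role: 'user'; content: ContentPart[] } {
  return {
    role: 'user',
    content: [
      { type: 'text', text: prompt },
      { type: 'image_url', image_url: { url: toImageUrl(image) } },
    ],
  };
}
```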
---
### Task 5: Inject image_recognize in the Tool Resolver (exclude unconfigured tools at resolve time)
**Files:**
- Modify: `packages/backend/src/modules/netaclaw/service/tool_resolver.ts:0-30,607-611`
- [ ] **Step 1:** Add the imports at the top of `tool_resolver.ts`:
```typescript
import { createImageRecognizeTool } from '../tools/builtin/image_recognize.js';
import { NetaClawModelChannelService } from './model_channel.js';
```
Inject the service in the class:
```typescript
@Inject()
modelChannelService: NetaClawModelChannelService;
```
- [ ] **Step 2:** In the resolve() method, add the following after the escalate injection (around line 611). Key point: when the model is not configured, the tool is not injected, so the LLM never sees it:
```typescript
if (filteredNames.includes('image_recognize')) {
  const toolModelConfig = await this.toolRegistry.getToolModelConfig('image_recognize');
  if (toolModelConfig) {
    const channelCreds = await this.modelChannelService.resolveForAgent(toolModelConfig.modelChannelId);
    if (channelCreds) {
      runtimeTools.push(createImageRecognizeTool({
        baseUrl: channelCreds.baseUrl,
        apiKey: channelCreds.apiKey,
        supplier: channelCreds.supplier,
        modelId: toolModelConfig.modelId,
        promptHint: toolModelConfig.promptHint,
      }));
    } else {
      // The channel is configured on the tool but could not be resolved.
      disabledReasons.push({ name: 'image_recognize', reason: 'model_channel_unavailable' });
    }
  } else {
    // No channel/model has been configured for the tool yet.
    disabledReasons.push({ name: 'image_recognize', reason: 'model_not_configured' });
  }
}
```
- [ ] **Step 3:** Start the backend and call `/admin/netaclaw/tool/sync` to confirm that image_recognize appears.
- [ ] **Step 4:** Commit `feat(netaclaw): inject image_recognize in tool resolver (exclude unconfigured tools at resolve time)`
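
The resolve-time decision above has three outcomes: inject the tool, or skip it with one of two reasons. A sketch of that decision as a pure function follows, which may be convenient for unit tests. `resolveImageRecognize`, `Resolution`, and the synchronous `lookupChannel` callback are assumptions made for this example; in the plan the lookups are async service calls (`getToolModelConfig` and `resolveForAgent`).

```typescript
interface ToolModelConfig {
  modelChannelId: number;
  modelId: string;
  promptHint: string | null;
}

interface ChannelCreds {
  baseUrl: string;
  apiKey: string;
  supplier: string;
}

type Resolution =
  | { ok: true; creds: ChannelCreds & { modelId: string; promptHint: string | null } }
  | { ok: false; reason: 'model_not_configured' | 'model_channel_unavailable' };

// Decide whether image_recognize can be injected, and with which credentials.
function resolveImageRecognize(
  config: ToolModelConfig | null,
  lookupChannel: (channelId: number) => ChannelCreds | null, // hypothetical sync stand-in
): Resolution {
  if (!config) return { ok: false, reason: 'model_not_configured' };
  const channel = lookupChannel(config.modelChannelId);
  if (!channel) return { ok: false, reason: 'model_channel_unavailable' };
  return {
    ok: true,
    creds: { ...channel, modelId: config.modelId, promptHint: config.promptHint },
  };
}
```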
---
### Task 6: Rework the frontend tool management page
**Files:**
- Modify: `packages/frontend/src/modules/agent/views/tools.vue`
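- [ ] **Step 1:** Add a "Model dependency" dropdown to the filter bar (after the capability filter):
```html
<el-select v-model="filters.requiresModel" placeholder="Model dependency" clearable style="width:140px">
  <el-option label="Requires model" :value="1" />
  <el-option label="No model required" :value="0" />
</el-select>
```
Add `requiresModel: undefined` to the filters object and include it in the loadData request parameters.
- [ ] **Step 2:** Add a "Model configuration" column to the table (after the capability column):
```html
<el-table-column label="Model configuration" width="180">
  <template #default="{ row }">
    <span v-if="!row.requiresModel">-</span>
    <el-tag v-else-if="row.modelId" type="success" size="small">{{ row.modelId }}</el-tag>
    <el-tag v-else type="warning" size="small">Not configured</el-tag>
  </template>
</el-table-column>
```
- [ ] **Step 3:** Add a model configuration section to the edit drawer (shown when requiresModel===1): a channel dropdown, a linked model dropdown, and a prompt hint textarea. Call `service.netaclaw.model_channel.allModels()` to fetch the list of multimodal models.
- [ ] **Step 4:** Start the frontend and verify the filter, the table column, and the model configuration in the edit drawer.
- [ ] **Step 5:** Commit `feat(frontend): add model dependency filter and model configuration editing to tool management page`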
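
The table column and the linked dropdown both reduce to small pure functions that can be tested outside Vue. The sketch below is illustrative only: `modelStatus` and `modelsForChannel` are hypothetical helpers, and the `ModelOption` shape (`channelId`, `modelId`) is an assumption, since the plan does not show what `allModels()` returns.

```typescript
interface ToolRowView {
  requiresModel: number;
  modelId?: string | null;
}

type ModelStatus =
  | { kind: 'none' }
  | { kind: 'configured'; modelId: string }
  | { kind: 'unconfigured' };

// Mirrors the three branches of the "Model configuration" column template.
function modelStatus(row: ToolRowView): ModelStatus {
  if (!row.requiresModel) return { kind: 'none' };
  return row.modelId ? { kind: 'configured', modelId: row.modelId } : { kind: 'unconfigured' };
}

// Assumed shape of one entry in the model list; not confirmed by the plan.
interface ModelOption {
  channelId: number;
  modelId: string;
}

// Linked dropdown: only offer models that belong to the selected channel.
function modelsForChannel(all: ModelOption[], channelId: number | undefined): ModelOption[] {
  return channelId === undefined ? [] : all.filter((m) => m.channelId === channelId);
}
```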
---
### Task 7: Extend the WebSocket protocol with attachments and handle them in backend messages
**Files:**
- Modify: `packages/backend/src/modules/netaclaw/gateway/protocol.ts:1-10`
- Modify: `packages/backend/src/modules/netaclaw/gateway/server.ts`
- [ ] **Step 1:** Add the ChatAttachment interface at the top of `protocol.ts`:
```typescript
export interface ChatAttachment {
  id: string;
  type: 'image' | 'video' | 'pdf' | 'document' | 'other';
  url: string;
  name: string;
  size: number;
  mimeType: string;
  role?: 'start_frame' | 'end_frame';
}
```
- [ ] **Step 2:** Add `attachments?: ChatAttachment[];` after line 9 of ClientChatMessage.
- [ ] **Step 3:** In `server.ts`, when handling a chat message, store the attachments in the message metadata without modifying content:
```typescript
const metadata: Record<string, unknown> = {};
if (msg.attachments?.length) {
  metadata.attachments = msg.attachments;
}
// Pass metadata when persisting the message
```
- [ ] **Step 4:** Commit `feat(netaclaw): extend WebSocket protocol with attachments and store them in message metadata`
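
Because attachments arrive from the client over the socket, it may be worth validating their shape before storing them. The plan itself does not call for this; the sketch below is one possible approach, with `isChatAttachment` and `buildMessageMetadata` as hypothetical helpers that drop malformed entries and only set `metadata.attachments` when something valid remains.

```typescript
interface ChatAttachment {
  id: string;
  type: 'image' | 'video' | 'pdf' | 'document' | 'other';
  url: string;
  name: string;
  size: number;
  mimeType: string;
  role?: 'start_frame' | 'end_frame';
}

const ATTACHMENT_TYPES: readonly string[] = ['image', 'video', 'pdf', 'document', 'other'];

// Runtime type guard matching the ChatAttachment interface field by field.
function isChatAttachment(value: unknown): value is ChatAttachment {
  if (typeof value !== 'object' || value === null) return false;
  const { id, type, url, name, size, mimeType, role } = value as Record<string, unknown>;
  return (
    typeof id === 'string' &&
    typeof type === 'string' &&
    ATTACHMENT_TYPES.indexOf(type) !== -1 &&
    typeof url === 'string' &&
    typeof name === 'string' &&
    typeof size === 'number' &&
    typeof mimeType === 'string' &&
    (role === undefined || role === 'start_frame' || role === 'end_frame')
  );
}

// Build the metadata object; content is never touched.
function buildMessageMetadata(attachments: unknown): Record<string, unknown> {
  const metadata: Record<string, unknown> = {};
  if (Array.isArray(attachments)) {
    const valid = attachments.filter(isChatAttachment);
    if (valid.length) metadata['attachments'] = valid;
  }
  return metadata;
}
```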
---
### Task 8: Inject attachment information in the Prompt Builder
**Files:**
- Modify: `packages/backend/src/modules/netaclaw/service/prompt_builder.ts`
- [ ] **Step 1:** When prompt_builder builds the LLM messages, check metadata.attachments on the user message. If attachments exist, append an attachment hint message after the user message:
```typescript
if (userMessage.metadata?.attachments?.length) {
  const attachments = userMessage.metadata.attachments as ChatAttachment[];
  const desc = attachments.map(a => {
    const typeLabel = { image: 'Image', video: 'Video', pdf: 'PDF', document: 'File', other: 'File' }[a.type];
    return `- ${typeLabel}: ${a.name} (URL: ${a.url})`;
  }).join('\n');
  messages.push({
    role: 'user',
    content: `[System notice] The user uploaded the following attachments:\n${desc}\nTo analyze image content, call the image_recognize tool with the image URL.`,
  });
}
```
This keeps content clean; the attachment information reaches the LLM through a separate message.
- [ ] **Step 2:** Commit `feat(netaclaw): inject attachment info into LLM messages in prompt builder`
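
The hint construction can be pulled out into a standalone function so it can be tested without the rest of the prompt builder. `buildAttachmentHint` and `AttachmentRef` are hypothetical names for this sketch; the message text follows the plan's snippet in English, and returning `null` for "no attachments" is a choice made here, not something the plan specifies.

```typescript
interface AttachmentRef {
  type: 'image' | 'video' | 'pdf' | 'document' | 'other';
  name: string;
  url: string;
}

const TYPE_LABELS: Record<AttachmentRef['type'], string> = {
  image: 'Image',
  video: 'Video',
  pdf: 'PDF',
  document: 'File',
  other: 'File',
};

// Returns the extra user message to append, or null when there is nothing to add.
function buildAttachmentHint(
  attachments: AttachmentRef[] | undefined,
): { role: 'user'; content: string } | null {
  if (!attachments?.length) return null;
  const desc = attachments
    .map((a) => `- ${TYPE_LABELS[a.type]}: ${a.name} (URL: ${a.url})`)
    .join('\n');
  return {
    role: 'user',
    content:
      `[System notice] The user uploaded the following attachments:\n${desc}\n` +
      'To analyze image content, call the image_recognize tool with the image URL.',
  };
}
```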
---
### Task 9: Frontend type definitions and WebSocket adaptation
**Files:**
- Modify: `packages/frontend/src/modules/agent/types/index.d.ts`
- Modify: `packages/frontend/src/modules/agent/hooks/websocket.ts`
- [ ] **Step 1:** Add the ChatAttachment interface to `types/index.d.ts` (matching the backend protocol.ts). Add `attachments?: ChatAttachment[]` to the chat variant of WSClientMessage.
- [ ] **Step 2:** Confirm that the ExtendedWSClientMessage type in `websocket.ts` can carry the attachments field.
- [ ] **Step 3:** Commit `feat(frontend): add ChatAttachment to frontend type definitions`
---
### Task 10: Frontend chat attachment upload components
**Files:**
- Create: `packages/frontend/src/modules/agent/components/chat/ChatAttachmentButton.vue`
- Create: `packages/frontend/src/modules/agent/components/chat/ChatAttachmentPreview.vue`
- Modify: `packages/frontend/src/modules/agent/components/chat/ChatComposer.vue`
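- [ ] **Step 1:** Create ChatAttachmentButton.vue: a paperclip button plus a hidden file input that emits `@select(files: File[])`
- [ ] **Step 2:** Create ChatAttachmentPreview.vue: a horizontally scrolling preview strip with thumbnails, file icons, delete buttons, start/end frame markers, and upload progress
- [ ] **Step 3:** Rework ChatComposer.vue:
  - Integrate ChatAttachmentButton (left of the textarea) and ChatAttachmentPreview (above the textarea)
  - Support drag and drop (@dragover + @drop) and paste (@paste, checking clipboardData.files)
  - Upload files to Space through `/admin/base/comm/upload`, reusing the existing upload infrastructure
  - Keep the `send` event name and pass attachments through an optional payload: `emit('send', attachments)`
  - Without attachments, `emit('send')` remains compatible
- [ ] **Step 4:** In `chat.vue`, adapt the handleSend method to the attachments parameter:
```typescript
function handleSend(attachments?: ChatAttachment[]) {
  const msg = {
    type: 'chat', sessionId, content: inputText.value,
    agentId, leafEntryId,
    // Only include the field when there is at least one attachment.
    ...(attachments?.length ? { attachments } : {}),
  };
  ws.send(msg);
  inputText.value = '';
}
```
- [ ] **Step 5:** Commit `feat(frontend): Agent chat attachment upload (button/preview/drag and drop/paste)`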
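
Two pieces of the composer logic are easy to express as pure functions: deriving the attachment `type` from a file's MIME type, and building the chat payload so that the `attachments` key is absent when nothing was attached. The MIME mapping below is an assumption (the plan does not define one), and `attachmentTypeFromMime` and `buildChatPayload` are hypothetical names; the payload is trimmed to the fields needed for the illustration.

```typescript
type AttachmentType = 'image' | 'video' | 'pdf' | 'document' | 'other';

// Assumed MIME-to-type mapping, for illustration only.
function attachmentTypeFromMime(mimeType: string): AttachmentType {
  if (mimeType.startsWith('image/')) return 'image';
  if (mimeType.startsWith('video/')) return 'video';
  if (mimeType === 'application/pdf') return 'pdf';
  if (
    mimeType.startsWith('text/') ||
    mimeType === 'application/msword' ||
    mimeType.includes('officedocument')
  ) {
    return 'document';
  }
  return 'other';
}

interface UploadedAttachment {
  id: string;
  type: AttachmentType;
  url: string;
  name: string;
  size: number;
  mimeType: string;
}

interface ChatPayload {
  type: 'chat';
  sessionId: string;
  content: string;
  attachments?: UploadedAttachment[];
}

// The conditional spread keeps the payload identical to the old shape when there are no attachments.
function buildChatPayload(
  sessionId: string,
  content: string,
  attachments?: UploadedAttachment[],
): ChatPayload {
  return { type: 'chat', sessionId, content, ...(attachments?.length ? { attachments } : {}) };
}
```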
---
### Task 11: Display attachments in message bubbles
**Files:**
- Create: `packages/frontend/src/modules/agent/components/chat/MessageAttachments.vue`
- Modify: `packages/frontend/src/modules/agent/components/message-item.vue`
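- [ ] **Step 1:** Create MessageAttachments.vue: a grid of image thumbnails (enlarged via el-image), and a file icon plus file name for video/PDF/document attachments
- [ ] **Step 2:** In `message-item.vue`, inside the user message bubble, check metadata.attachments and render MessageAttachments. content is displayed as-is, with no filtering needed.
- [ ] **Step 3:** Start the frontend and backend and run a full test: upload an image → send → the Agent calls image_recognize → the analysis result is returned → the message bubble shows the thumbnail
- [ ] **Step 4:** Commit `feat(frontend): display attachments in message bubbles`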
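
MessageAttachments.vue needs to separate images (shown as a thumbnail grid) from everything else (shown as icon plus file name). A minimal sketch of that split follows, with `splitForDisplay` as a hypothetical helper; the `previewUrls` list is meant as input for the el-image `preview-src-list` prop, so that clicking one thumbnail can page through all images of the message.

```typescript
interface DisplayAttachment {
  type: 'image' | 'video' | 'pdf' | 'document' | 'other';
  url: string;
  name: string;
}

interface DisplayGroups {
  images: DisplayAttachment[];
  files: DisplayAttachment[];
  previewUrls: string[];
}

// Partition attachments for rendering; tolerates messages without metadata.attachments.
function splitForDisplay(attachments: DisplayAttachment[] | undefined): DisplayGroups {
  const images: DisplayAttachment[] = [];
  const files: DisplayAttachment[] = [];
  for (const a of attachments ?? []) {
    (a.type === 'image' ? images : files).push(a);
  }
  return { images, files, previewUrls: images.map((a) => a.url) };
}
```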