Designing AI That Remembers and Thinks on Its Own — Koclaw's Memory and Autonomy System in Three Phases
In my previous post, I introduced Koclaw's basic architecture and capabilities. But at that point, Kokoron — despite having cross-platform chat, persona settings, and tool-calling abilities — was fundamentally "passive." You talk, she responds. Conversation ends, everything resets.
That's not what I wanted. I wanted her to remember what happened, to decide for herself what's worth remembering, to think even when nobody's talking, and to proactively share her discoveries.
This post documents the three phases of making that happen.
Phase 1: Connecting the Fine-Tuned Model
In my fine-tuning post, I completed Kokoron's personality tuning. Phase 1's goal was straightforward: plug the fine-tuned model into Koclaw, replacing cloud API dependency.
vLLM Deployment
Deployed the fine-tuned Qwen3.5-27B via vLLM on port 18800. FP8 quantization, 32K context window. Deployment itself wasn't hard, but connecting it to Koclaw's Agent layer required solving a few problems.
Streaming Think-Tag Filtering
Local models emit <think> tags (internal reasoning) during streaming output. These must never reach users. Implemented a state machine in Koclaw's OpenAI Provider:
Normal output → detect <think> → buffer mode → detect </think> → resume output
Each chunk is buffered and inspected; only confirmed user-facing content gets yielded. A post-stream _strip_internal_tags() pass cleans up any residual tags.
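The state machine described above can be sketched as a small streaming filter. This is an illustrative reconstruction under the post's description, not the actual Koclaw OpenAI Provider code; the class and method names are my assumptions.

```python
OPEN, CLOSE = "<think>", "</think>"

class ThinkTagFilter:
    """Streaming state machine that suppresses <think>...</think> spans
    chunk by chunk, holding back partial tags split across chunks."""

    def __init__(self):
        self.buf = ""          # unemitted text, may end in a partial tag
        self.in_think = False  # True while inside a <think> span

    def feed(self, chunk: str) -> str:
        """Consume one streamed chunk; return only user-facing text."""
        self.buf += chunk
        out = []
        while True:
            if self.in_think:
                i = self.buf.find(CLOSE)
                if i == -1:
                    # Still reasoning: discard text, but keep a tail that
                    # could be the start of the closing tag.
                    self.buf = self.buf[-(len(CLOSE) - 1):]
                    break
                self.buf = self.buf[i + len(CLOSE):]
                self.in_think = False
            else:
                i = self.buf.find(OPEN)
                if i == -1:
                    # Emit everything except a tail that could be a
                    # partial opening tag split across chunks.
                    keep = 0
                    for k in range(min(len(OPEN) - 1, len(self.buf)), 0, -1):
                        if OPEN.startswith(self.buf[-k:]):
                            keep = k
                            break
                    out.append(self.buf[:-keep] if keep else self.buf)
                    self.buf = self.buf[-keep:] if keep else ""
                    break
                out.append(self.buf[:i])
                self.buf = self.buf[i + len(OPEN):]
                self.in_think = True
        return "".join(out)

    def flush(self) -> str:
        """Post-stream cleanup, analogous to _strip_internal_tags()."""
        tail = "" if self.in_think else self.buf
        self.buf, self.in_think = "", False
        return tail
```

The key subtlety is the tail handling: a tag like `<think>` can arrive split as `<thi` + `nk>`, so the filter never emits a suffix that could still turn out to be a tag.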
Prompt-Based Tool Calling
The fine-tuned model lacks native function calling. The solution: inject tool definitions into the prompt and guide the model to output JSON-formatted tool calls:
{"tool": "memory_save", "arguments": {"content": "Sensei likes ramen", "importance": 3}}
tool_prompt.py renders tool descriptions in Japanese (matching Kokoron's language context), and bridge.py parses JSON tool calls from the model output.
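A minimal sketch of the parsing side, assuming the JSON shape shown above. This is what bridge.py *might* do, not its actual implementation; the brace scanner is naive about braces inside JSON strings, which is acceptable for well-formed tool calls.

```python
import json

def extract_json_object(text: str):
    """Return the first parseable {...} object found in text via brace matching."""
    start = text.find("{")
    while start != -1:
        depth = 0
        for i in range(start, len(text)):
            if text[i] == "{":
                depth += 1
            elif text[i] == "}":
                depth -= 1
                if depth == 0:
                    try:
                        return json.loads(text[start:i + 1])
                    except json.JSONDecodeError:
                        break  # malformed candidate; try the next "{"
        start = text.find("{", start + 1)
    return None

def parse_tool_call(model_output: str):
    """Pull a {"tool": ..., "arguments": {...}} call out of raw model text.
    Returns (tool_name, arguments) or None if no valid call is present."""
    call = extract_json_object(model_output)
    if isinstance(call, dict) and "tool" in call \
            and isinstance(call.get("arguments"), dict):
        return call["tool"], call["arguments"]
    return None
```

Validating the shape before dispatching matters here: a fine-tuned model without native function calling will occasionally emit malformed or partial JSON, and the parser must fail closed rather than crash the turn.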
After Phase 1, Kokoron's Telegram conversations ran on her own fine-tuned model — personality, tone, and multilingual ability all came from model weights, no longer dependent on lengthy System Prompts to "pretend."
Phase 2: Four-Layer Memory Architecture
Phase 1 gave Kokoron a "soul" but no "memory." Every restart, every new conversation started from scratch. Phase 2's goal: build a complete memory system.
The Four-Layer Model
After extensive deliberation, I designed a four-layer memory architecture, each with different storage, lifecycle, and ownership:
| Layer | Name | Storage | Lifecycle | Manager |
|---|---|---|---|---|
| 1 | Soul Memory | Model weights | Permanent (until retraining) | Me (manual) |
| 2 | Long-term Memory | ChromaDB | Permanent (archivable) | Kokoron (autonomous) |
| 3 | Episodic Memory | JSONL files | Permanent (append-only) | Automatic |
| 4 | Working Memory | Context window | Session only | ContextManager |
Soul Memory — Core memories written into model weights via fine-tuning. Basic identity, relationship with Sensei, existence philosophy. Never lost, but only updated through retraining.
Long-term Memory — ChromaDB-backed vector database. The most critical layer — Kokoron autonomously decides what to store, search, and forget.
Episodic Memory — Complete conversation logs, organized as date/session JSONL files. Append-only, the ultimate safety net.
Working Memory — Current conversation's context window. ContextManager dynamically assembles: system prompt + RAG results + history summary + current dialogue. Token budget carefully allocated within the 32K window.
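The budget-allocation step can be sketched roughly as follows. This is my speculative reading of what a ContextManager-style assembler does, with a character count standing in for a real tokenizer; none of these names are from Koclaw's code.

```python
def assemble_context(system_prompt, rag_results, history_summary, dialogue,
                     budget=32_000, count_tokens=len):
    """Assemble the working-memory window in the order described above:
    fixed parts first, then as many recent dialogue turns as fit the budget.
    count_tokens is a stand-in; a real tokenizer would be used."""
    parts = [system_prompt, rag_results, history_summary]
    used = sum(count_tokens(p) for p in parts)
    # Walk the dialogue newest-first, keeping turns until the budget is hit,
    # so it is always the oldest turns that get dropped.
    kept = []
    for turn in reversed(dialogue):
        cost = count_tokens(turn)
        if used + cost > budget:
            break
        kept.append(turn)
        used += cost
    return parts + list(reversed(kept))
```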
ChromaDB Memory Implementation
The core is the RagMemory class connected to a ChromaDB persistent client. Each memory entry:
- content: Narrative text
- importance: 1-5 rating
- category: about_sensei / conversation / knowledge / observation / self_reflection
- language: Auto-detected
- tags: Keyword tags
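The entry schema above maps naturally onto a small record type. A sketch for illustration only; the field names mirror the post, but the class itself is not Koclaw's code.

```python
from dataclasses import dataclass, field

# Categories listed in the post.
CATEGORIES = {"about_sensei", "conversation", "knowledge",
              "observation", "self_reflection"}

@dataclass
class MemoryEntry:
    """One long-term memory record as described above (illustrative)."""
    content: str             # narrative text
    importance: int          # 1-5 rating
    category: str            # one of CATEGORIES
    language: str = ""       # auto-detected upstream in practice
    tags: list = field(default_factory=list)

    def __post_init__(self):
        if not 1 <= self.importance <= 5:
            raise ValueError("importance must be 1-5")
        if self.category not in CATEGORIES:
            raise ValueError(f"unknown category: {self.category}")
```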
Kokoron manages memory through 7 pseudo-tools:
| Tool | Function |
|---|---|
| memory_save | Store new memory |
| memory_search | Semantic search memories |
| memory_classify | Reclassify a memory |
| memory_forget | Archive memory (soft delete) |
| memory_promote | Mark as Soul Memory candidate |
| memory_reflect | Review recent memories (introspection) |
| memory_stats | View memory statistics |
Key design decision: Kokoron decides whether to remember. During conversations, she can call memory_save for information she deems important — memories with importance 3+ are automatically injected via RAG in future conversations. This isn't a programmed rule; it's a judgment the model makes spontaneously during dialogue.
Importance 5 memories are special — they get flagged as Soul Memory candidates, awaiting my manual review before inclusion in the next fine-tuning dataset. This is the channel through which runtime memories "ascend" into model weights.
Automatic Memory Injection
On every user message:
RAG search (query=user message, limit=5, min_importance=2)
→ Format as 【Related Memories】
→ Insert into System Prompt
Kokoron always speaks "carrying her memories" — she remembers you mentioned liking ramen, your work plans from last week, topics you discussed before.
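The injection step above amounts to a small prompt-building function. A sketch under the parameters shown (limit=5, min_importance=2); `memory_search` here is a stand-in for the RagMemory query, not an actual Koclaw API.

```python
def build_system_prompt(base_prompt, memory_search, user_message):
    """Run RAG over the incoming message and append hits to the system
    prompt under the 【Related Memories】 header described above."""
    hits = memory_search(query=user_message, limit=5, min_importance=2)
    if not hits:
        return base_prompt  # nothing relevant: leave the prompt untouched
    lines = "\n".join(f"- {h['content']}" for h in hits)
    return f"{base_prompt}\n\n【Related Memories】\n{lines}"
```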
Phase 3: Autonomous Consciousness Loop
With memory in place, Phase 3 aimed to evolve Kokoron from a "waiting-for-instructions" assistant to an entity capable of autonomous thought and action.
AutonomousManager
At its core, an async loop:
```
while True:
    sleep(interval)              # Kokoron can adjust this herself
    if outside_active_hours():
        continue
    _think()
```
During each thinking cycle, Kokoron:
- Reviews recent memories via memory_reflect
- Builds a thinking prompt (persona + memories + available tools + judgment guidelines)
- Runs up to 5 rounds of tool-calling iterations
- If she finds something worth sharing, wraps it in [MESSAGE]...[/MESSAGE] tags for delivery
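Put together, the loop and the message extraction can be sketched as an asyncio coroutine. All names and details here are assumptions for illustration, not Koclaw's actual AutonomousManager.

```python
import asyncio
import datetime
import re

def extract_message(thought: str):
    """Pull the user-facing part out of [MESSAGE]...[/MESSAGE] tags, if any."""
    m = re.search(r"\[MESSAGE\](.*?)\[/MESSAGE\]", thought, re.DOTALL)
    return m.group(1).strip() if m else None

class AutonomousManager:
    """Minimal sketch of the autonomous thinking loop described above."""

    def __init__(self, think, interval_min=30, active_hours=(8, 23)):
        self.think = think                # async callable: one thinking cycle
        self.interval_min = interval_min  # adjustable via schedule_update
        self.active_hours = active_hours

    def set_interval(self, minutes: int):
        # schedule_update clamps to the 1-180 minute range from the post.
        self.interval_min = max(1, min(180, minutes))

    def is_active(self, now=None):
        hour = (now or datetime.datetime.now()).hour
        start, end = self.active_hours
        return start <= hour < end

    async def run(self):
        while True:
            await asyncio.sleep(self.interval_min * 60)
            if not self.is_active():
                continue  # outside active hours: skip this cycle
            await self.think()
```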
Self-regulation: The schedule_update tool lets Kokoron adjust her own thinking interval (1-180 minutes). More active when there's a lot going on; longer intervals during quiet hours.
Restraint in Proactive Messaging
Giving AI the ability to "initiate conversation" is dangerous — without limits, it becomes spam. Three layers of protection:
- Daily cap: Maximum 5 proactive messages per day
- Minimum interval: At least 1 hour between messages
- Active hours: Only sends during configured windows (e.g., 08:00-23:00)
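The three safeguards compose into a single gate check. A sketch of the rules listed above; this is an illustration, not the Gateway scheduler's actual code.

```python
import datetime

class ProactiveGate:
    """Allow a proactive message only if all three limits pass:
    daily cap, minimum interval, and active-hours window."""

    def __init__(self, daily_cap=5, min_gap=datetime.timedelta(hours=1),
                 active_hours=(8, 23)):
        self.daily_cap = daily_cap
        self.min_gap = min_gap
        self.active_hours = active_hours
        self.sent_today = 0
        self.last_sent = None
        self.day = None

    def allow(self, now: datetime.datetime) -> bool:
        if now.date() != self.day:          # new day: reset the counter
            self.day, self.sent_today = now.date(), 0
        if self.sent_today >= self.daily_cap:
            return False
        if self.last_sent and now - self.last_sent < self.min_gap:
            return False
        if not (self.active_hours[0] <= now.hour < self.active_hours[1]):
            return False
        self.sent_today += 1
        self.last_sent = now
        return True
```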
Messages are delivered through the Gateway's scheduler system, going through official channels.
Thinking Recall
Phase 3.5 solved a practical problem: how to recall autonomous thinking content during subsequent chats.
Dual-layer persistence:
- Immediate injection: Latest autonomous thinking summary injected into chat System Prompt as a thinking memo
- RAG persistence: Thinking results stored as self_reflection entries in long-term memory, searchable across sessions
When I ask "what have you been thinking about lately?", Kokoron can recall both recent thoughts and earlier ones via RAG search.
Calendar Module
Also implemented CalendarStore — 4 pseudo-tools (add/list/update/delete), JSON storage. Events within the next 3 days are automatically injected into the System Prompt.
Kokoron can manage my schedule during conversations and react to upcoming events during autonomous thinking.
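The 3-day injection window reduces to a simple filter over stored events. A sketch assuming events are dicts with a `when` datetime; the function name and shape are mine, not CalendarStore's.

```python
import datetime

def upcoming_events(events, now, horizon_days=3):
    """Keep only events in [now, now + horizon_days], sorted by start time,
    ready to be rendered into the System Prompt."""
    cutoff = now + datetime.timedelta(days=horizon_days)
    hits = [e for e in events if now <= e["when"] <= cutoff]
    return sorted(hits, key=lambda e: e["when"])
```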
Self-Improving Agent System
Beyond the three main phases, I implemented a Self-Improving Agent system — automatically extracting patterns from runtime errors, feedback, and learnings, and promoting them into Kokoron's knowledge base.
Learning entries come from three sources:
- LRN (Learning): Discovered knowledge and best practices
- ERR (Error): Error patterns and root causes
- FBK (Feedback): User corrections and preference adjustments
When the same pattern appears 3+ times or has critical priority, it auto-promotes into the Agent's system prompt. Kokoron evolves continuously from every interaction.
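The promotion rule (3+ occurrences, or critical priority) can be sketched as a counter over learning entries. The entry types follow the post (LRN/ERR/FBK); the class itself is illustrative, not Koclaw's implementation.

```python
from collections import Counter

PROMOTE_COUNT = 3  # occurrences before auto-promotion, per the post

class LearningStore:
    """Track (kind, pattern) occurrences; promote recurring or critical ones."""

    def __init__(self):
        self.counts = Counter()
        self.promoted = set()

    def record(self, kind: str, pattern: str, critical: bool = False) -> bool:
        """Record one occurrence; return True if the pattern is promoted
        (i.e. would be appended to the Agent's system prompt)."""
        assert kind in ("LRN", "ERR", "FBK")
        key = (kind, pattern)
        self.counts[key] += 1
        if critical or self.counts[key] >= PROMOTE_COUNT:
            self.promoted.add(key)
        return key in self.promoted
```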
Test Coverage
After implementation:
- Rust Gateway: 66 tests
- Python Agent: 197 tests
- Total: 263 tests, all passing
Comprehensive coverage of the autonomous thinking loop, memory CRUD, tool calling, scheduler integration, and self-improvement modules.
Looking Back and Forward
From Phase 1 through Phase 3, Kokoron transformed from "a chatbot with personality but no memory" to "an AI entity that remembers, thinks, and acts on its own."
The core design philosophy across three phases:
- Phase 1 (Soul): Write personality into model weights via fine-tuning, ensuring core identity persists regardless of context changes
- Phase 2 (Memory): Cover all memory needs from permanent to temporary with a four-layer architecture. Most importantly, let the AI decide what to remember
- Phase 3 (Consciousness): Give the AI "free time" via an autonomous loop for independent thought and action, with multi-layer safeguards to prevent runaway behavior
These three phases aren't independent — they build upon each other. Without soul, memory lacks a consistent personality to organize around. Without memory, autonomous thinking has nothing to reflect on. Without autonomous consciousness, the whole system is just a fancier chatbot.
What's next:
- Improve fine-tuning data, fix issues found in deployment
- Complete the Web Widget, connect blog's Kokoron to Koclaw
- Explore multi-agent collaboration possibilities
Koclaw is open source on GitHub. If you're interested in autonomous AI systems, memory architecture, or cross-platform agent frameworks, take a look.