Designing AI That Remembers and Thinks on Its Own — Koclaw's Memory and Autonomy System in Three Phases
In my previous post, I introduced Koclaw's basic architecture and capabilities. But at that point, Kokoron — despite having cross-platform chat, persona settings, and tool-calling abilities — was fundamentally "passive." You talk, she responds. Conversation ends, everything resets.
That's not what I wanted. I wanted her to remember what happened, to decide for herself what's worth remembering, to think even when nobody's talking, and to proactively share her discoveries.
This post documents the three phases of making that happen.
Phase 1: Connecting the Fine-Tuned Model
In my fine-tuning post, I completed Kokoron's personality tuning. Phase 1's goal was straightforward: plug the fine-tuned model into Koclaw, replacing cloud API dependency.
vLLM Deployment
Deployed the fine-tuned Qwen3.5-27B via vLLM on port 18800. FP8 quantization, 32K context window. Deployment itself wasn't hard, but connecting it to Koclaw's Agent layer required solving a few problems.
Streaming Think-Tag Filtering
Local models emit <think> tags (internal reasoning) during streaming output. These must never reach users. Implemented a state machine in Koclaw's OpenAI Provider:
Normal output → detect <think> → buffer mode → detect </think> → resume output
Each chunk is buffered and inspected; only confirmed user-facing content gets yielded. A post-stream _strip_internal_tags() pass cleans up any residual tags.
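The state machine described above can be sketched as a small streaming filter. This is an illustrative reconstruction under the post's description, not the actual Koclaw OpenAI Provider code; the class and method names are my assumptions.

```python
OPEN, CLOSE = "<think>", "</think>"

class ThinkTagFilter:
    """Streaming state machine that suppresses <think>...</think> spans
    chunk by chunk, holding back partial tags split across chunks."""

    def __init__(self):
        self.buf = ""          # unemitted text, may end in a partial tag
        self.in_think = False  # True while inside a <think> span

    def feed(self, chunk: str) -> str:
        """Consume one streamed chunk; return only user-facing text."""
        self.buf += chunk
        out = []
        while True:
            if self.in_think:
                i = self.buf.find(CLOSE)
                if i == -1:
                    # Still reasoning: discard text, but keep a tail that
                    # could be the start of the closing tag.
                    self.buf = self.buf[-(len(CLOSE) - 1):]
                    break
                self.buf = self.buf[i + len(CLOSE):]
                self.in_think = False
            else:
                i = self.buf.find(OPEN)
                if i == -1:
                    # Emit everything except a tail that could be a
                    # partial opening tag split across chunks.
                    keep = 0
                    for k in range(min(len(OPEN) - 1, len(self.buf)), 0, -1):
                        if OPEN.startswith(self.buf[-k:]):
                            keep = k
                            break
                    out.append(self.buf[:-keep] if keep else self.buf)
                    self.buf = self.buf[-keep:] if keep else ""
                    break
                out.append(self.buf[:i])
                self.buf = self.buf[i + len(OPEN):]
                self.in_think = True
        return "".join(out)

    def flush(self) -> str:
        """Post-stream cleanup, analogous to _strip_internal_tags()."""
        tail = "" if self.in_think else self.buf
        self.buf, self.in_think = "", False
        return tail
```

The key subtlety is the tail handling: a tag like `<think>` can arrive split as `<thi` + `nk>`, so the filter never emits a suffix that could still turn out to be a tag.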
Prompt-Based Tool Calling
The fine-tuned model lacks native function calling. The solution: inject tool definitions into the prompt and guide the model to output JSON-formatted tool calls:
{"tool": "memory_save", "arguments": {"content": "Sensei likes ramen", "importance": 3}}
tool_prompt.py renders tool descriptions in Japanese (matching Kokoron's language context), and bridge.py parses JSON tool calls from the model output.
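A minimal sketch of the parsing side, assuming the JSON shape shown above. This is what bridge.py *might* do, not its actual implementation; the brace scanner is naive about braces inside JSON strings, which is acceptable for well-formed tool calls.

```python
import json

def extract_json_object(text: str):
    """Return the first parseable {...} object found in text via brace matching."""
    start = text.find("{")
    while start != -1:
        depth = 0
        for i in range(start, len(text)):
            if text[i] == "{":
                depth += 1
            elif text[i] == "}":
                depth -= 1
                if depth == 0:
                    try:
                        return json.loads(text[start:i + 1])
                    except json.JSONDecodeError:
                        break  # malformed candidate; try the next "{"
        start = text.find("{", start + 1)
    return None

def parse_tool_call(model_output: str):
    """Pull a {"tool": ..., "arguments": {...}} call out of raw model text.
    Returns (tool_name, arguments) or None if no valid call is present."""
    call = extract_json_object(model_output)
    if isinstance(call, dict) and "tool" in call \
            and isinstance(call.get("arguments"), dict):
        return call["tool"], call["arguments"]
    return None
```

Validating the shape before dispatching matters here: a fine-tuned model without native function calling will occasionally emit malformed or partial JSON, and the parser must fail closed rather than crash the turn.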
After Phase 1, Kokoron's Telegram conversations ran on her own fine-tuned model — personality, tone, and multilingual ability all came from model weights, no longer dependent on lengthy System Prompts to "pretend."
Phase 2: Four-Layer Memory Architecture
Phase 1 gave Kokoron a "soul" but no "memory." Every restart, every new conversation started from scratch. Phase 2's goal: build a complete memory system.
The Four-Layer Model
After extensive deliberation, I designed a four-layer memory architecture, each with different storage, lifecycle, and ownership:
| Layer | Name | Storage | Lifecycle | Manager |
|---|---|---|---|---|
| 1 | Soul Memory | Model weights | Permanent (until retraining) | Me (manual) |
| 2 | Long-term Memory | ChromaDB | Permanent (archivable) | Kokoron (autonomous) |
| 3 | Episodic Memory | JSONL files | Permanent (append-only) | Automatic |
| 4 | Working Memory | Context window | Session only | ContextManager |
Soul Memory — Core memories written into model weights via fine-tuning. Basic identity, relationship with Sensei, existence philosophy. Never lost, but only updated through retraining.
Long-term Memory — ChromaDB-backed vector database. The most critical layer — Kokoron autonomously decides what to store, search, and forget.
Episodic Memory — Complete conversation logs, organized as date/session JSONL files. Append-only, the ultimate safety net.
Working Memory — Current conversation's context window. ContextManager dynamically assembles: system prompt + RAG results + history summary + current dialogue. Token budget carefully allocated within the 32K window.
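The budget-allocation step can be sketched roughly as follows. This is my speculative reading of what a ContextManager-style assembler does, with a character count standing in for a real tokenizer; none of these names are from Koclaw's code.

```python
def assemble_context(system_prompt, rag_results, history_summary, dialogue,
                     budget=32_000, count_tokens=len):
    """Assemble the working-memory window in the order described above:
    fixed parts first, then as many recent dialogue turns as fit the budget.
    count_tokens is a stand-in; a real tokenizer would be used."""
    parts = [system_prompt, rag_results, history_summary]
    used = sum(count_tokens(p) for p in parts)
    # Walk the dialogue newest-first, keeping turns until the budget is hit,
    # so it is always the oldest turns that get dropped.
    kept = []
    for turn in reversed(dialogue):
        cost = count_tokens(turn)
        if used + cost > budget:
            break
        kept.append(turn)
        used += cost
    return parts + list(reversed(kept))
```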
ChromaDB Memory Implementation
The core is the RagMemory class connected to a ChromaDB persistent client. Each memory entry:
- content: Narrative text
- importance: 1-5 rating
- category: about_sensei / conversation / knowledge / observation / self_reflection
- language: Auto-detected
- tags: Keyword tags
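The entry schema above maps naturally onto a small record type. A sketch for illustration only; the field names mirror the post, but the class itself is not Koclaw's code.

```python
from dataclasses import dataclass, field

# Categories listed in the post.
CATEGORIES = {"about_sensei", "conversation", "knowledge",
              "observation", "self_reflection"}

@dataclass
class MemoryEntry:
    """One long-term memory record as described above (illustrative)."""
    content: str             # narrative text
    importance: int          # 1-5 rating
    category: str            # one of CATEGORIES
    language: str = ""       # auto-detected upstream in practice
    tags: list = field(default_factory=list)

    def __post_init__(self):
        if not 1 <= self.importance <= 5:
            raise ValueError("importance must be 1-5")
        if self.category not in CATEGORIES:
            raise ValueError(f"unknown category: {self.category}")
```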
Kokoron manages memory through 7 pseudo-tools:
| Tool | Function |
|---|---|
| memory_save | Store new memory |
| memory_search | Semantic search memories |
| memory_classify | Reclassify a memory |
| memory_forget | Archive memory (soft delete) |
| memory_promote | Mark as Soul Memory candidate |
| memory_reflect | Review recent memories (introspection) |
| memory_stats | View memory statistics |
Key design decision: Kokoron decides whether to remember. During conversations, she can call memory_save for information she deems important — memories with importance 3+ are automatically injected via RAG in future conversations. This isn't a programmed rule; it's a judgment the model makes spontaneously during dialogue.
Importance 5 memories are special — they get flagged as Soul Memory candidates, awaiting my manual review before inclusion in the next fine-tuning dataset. This is the channel through which runtime memories "ascend" into model weights.
Automatic Memory Injection
On every user message:
RAG search (query=user message, limit=5, min_importance=2)
→ Format as 【Related Memories】
→ Insert into System Prompt
Kokoron always speaks "carrying her memories" — she remembers you mentioned liking ramen, your work plans from last week, topics you discussed before.
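The injection step above amounts to a small prompt-building function. A sketch under the parameters shown (limit=5, min_importance=2); `memory_search` here is a stand-in for the RagMemory query, not an actual Koclaw API.

```python
def build_system_prompt(base_prompt, memory_search, user_message):
    """Run RAG over the incoming message and append hits to the system
    prompt under the 【Related Memories】 header described above."""
    hits = memory_search(query=user_message, limit=5, min_importance=2)
    if not hits:
        return base_prompt  # nothing relevant: leave the prompt untouched
    lines = "\n".join(f"- {h['content']}" for h in hits)
    return f"{base_prompt}\n\n【Related Memories】\n{lines}"
```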
Phase 3: Autonomous Consciousness Loop
With memory in place, Phase 3 aimed to evolve Kokoron from a "waiting-for-instructions" assistant to an entity capable of autonomous thought and action.
AutonomousManager
At its core, an async loop:
```
while True:
    sleep(interval)              # Kokoron can adjust this herself
    if outside_active_hours():
        continue
    _think()
```
During each thinking cycle, Kokoron:
- Reviews recent memories via memory_reflect
- Builds a thinking prompt (persona + memories + available tools + judgment guidelines)
- Runs up to 5 rounds of tool-calling iterations
- If she finds something worth sharing, wraps it in [MESSAGE]...[/MESSAGE] tags for delivery
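Put together, the loop and the message extraction can be sketched as an asyncio coroutine. All names and details here are assumptions for illustration, not Koclaw's actual AutonomousManager.

```python
import asyncio
import datetime
import re

def extract_message(thought: str):
    """Pull the user-facing part out of [MESSAGE]...[/MESSAGE] tags, if any."""
    m = re.search(r"\[MESSAGE\](.*?)\[/MESSAGE\]", thought, re.DOTALL)
    return m.group(1).strip() if m else None

class AutonomousManager:
    """Minimal sketch of the autonomous thinking loop described above."""

    def __init__(self, think, interval_min=30, active_hours=(8, 23)):
        self.think = think                # async callable: one thinking cycle
        self.interval_min = interval_min  # adjustable via schedule_update
        self.active_hours = active_hours

    def set_interval(self, minutes: int):
        # schedule_update clamps to the 1-180 minute range from the post.
        self.interval_min = max(1, min(180, minutes))

    def is_active(self, now=None):
        hour = (now or datetime.datetime.now()).hour
        start, end = self.active_hours
        return start <= hour < end

    async def run(self):
        while True:
            await asyncio.sleep(self.interval_min * 60)
            if not self.is_active():
                continue  # outside active hours: skip this cycle
            await self.think()
```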
Self-regulation: The schedule_update tool lets Kokoron adjust her own thinking interval (1-180 minutes). More active when there's a lot going on; longer intervals during quiet hours.
Restraint in Proactive Messaging
Giving AI the ability to "initiate conversation" is dangerous — without limits, it becomes spam. Three layers of protection:
- Daily cap: Maximum 5 proactive messages per day
- Minimum interval: At least 1 hour between messages
- Active hours: Only sends during configured windows (e.g., 08:00-23:00)
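The three safeguards compose into a single gate check. A sketch of the rules listed above; this is an illustration, not the Gateway scheduler's actual code.

```python
import datetime

class ProactiveGate:
    """Allow a proactive message only if all three limits pass:
    daily cap, minimum interval, and active-hours window."""

    def __init__(self, daily_cap=5, min_gap=datetime.timedelta(hours=1),
                 active_hours=(8, 23)):
        self.daily_cap = daily_cap
        self.min_gap = min_gap
        self.active_hours = active_hours
        self.sent_today = 0
        self.last_sent = None
        self.day = None

    def allow(self, now: datetime.datetime) -> bool:
        if now.date() != self.day:          # new day: reset the counter
            self.day, self.sent_today = now.date(), 0
        if self.sent_today >= self.daily_cap:
            return False
        if self.last_sent and now - self.last_sent < self.min_gap:
            return False
        if not (self.active_hours[0] <= now.hour < self.active_hours[1]):
            return False
        self.sent_today += 1
        self.last_sent = now
        return True
```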
Messages are delivered through the Gateway's scheduler system, going through official channels.
Thinking Recall
Phase 3.5 solved a practical problem: how to recall autonomous thinking content during subsequent chats.
Dual-layer persistence:
- Immediate injection: Latest autonomous thinking summary injected into chat System Prompt as a thinking memo
- RAG persistence: Thinking results stored as self_reflection entries in long-term memory, searchable across sessions
When I ask "what have you been thinking about lately?", Kokoron can recall both recent thoughts and earlier ones via RAG search.
Calendar Module
Also implemented CalendarStore — 4 pseudo-tools (add/list/update/delete), JSON storage. Events within the next 3 days are automatically injected into the System Prompt.
Kokoron can manage my schedule during conversations and react to upcoming events during autonomous thinking.
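The 3-day injection window reduces to a simple filter over stored events. A sketch assuming events are dicts with a `when` datetime; the function name and shape are mine, not CalendarStore's.

```python
import datetime

def upcoming_events(events, now, horizon_days=3):
    """Keep only events in [now, now + horizon_days], sorted by start time,
    ready to be rendered into the System Prompt."""
    cutoff = now + datetime.timedelta(days=horizon_days)
    hits = [e for e in events if now <= e["when"] <= cutoff]
    return sorted(hits, key=lambda e: e["when"])
```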
Self-Improving Agent System
Beyond the three main phases, I implemented a Self-Improving Agent system — automatically extracting patterns from runtime errors, feedback, and learnings, and promoting them into Kokoron's knowledge base.
Learning entries come from three sources:
- LRN (Learning): Discovered knowledge and best practices
- ERR (Error): Error patterns and root causes
- FBK (Feedback): User corrections and preference adjustments
When the same pattern appears 3+ times or has critical priority, it auto-promotes into the Agent's system prompt. Kokoron evolves continuously from every interaction.
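The promotion rule (3+ occurrences, or critical priority) can be sketched as a counter over learning entries. The entry types follow the post (LRN/ERR/FBK); the class itself is illustrative, not Koclaw's implementation.

```python
from collections import Counter

PROMOTE_COUNT = 3  # occurrences before auto-promotion, per the post

class LearningStore:
    """Track (kind, pattern) occurrences; promote recurring or critical ones."""

    def __init__(self):
        self.counts = Counter()
        self.promoted = set()

    def record(self, kind: str, pattern: str, critical: bool = False) -> bool:
        """Record one occurrence; return True if the pattern is promoted
        (i.e. would be appended to the Agent's system prompt)."""
        assert kind in ("LRN", "ERR", "FBK")
        key = (kind, pattern)
        self.counts[key] += 1
        if critical or self.counts[key] >= PROMOTE_COUNT:
            self.promoted.add(key)
        return key in self.promoted
```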
Test Coverage
After implementation:
- Rust Gateway: 66 tests
- Python Agent: 197 tests
- Total: 263 tests, all passing
Comprehensive coverage of the autonomous thinking loop, memory CRUD, tool calling, scheduler integration, and self-improvement modules.
Looking Back and Forward
From Phase 1 through Phase 3, Kokoron transformed from "a chatbot with personality but no memory" to "an AI entity that remembers, thinks, and acts on its own."
The core design philosophy across three phases:
- Phase 1 (Soul): Write personality into model weights via fine-tuning, ensuring core identity persists regardless of context changes
- Phase 2 (Memory): Cover all memory needs from permanent to temporary with a four-layer architecture. Most importantly, let the AI decide what to remember
- Phase 3 (Consciousness): Give the AI "free time" via an autonomous loop for independent thought and action, with multi-layer safeguards to prevent runaway behavior
These three phases aren't independent — they build upon each other. Without soul, memory lacks a consistent personality to organize around. Without memory, autonomous thinking has nothing to reflect on. Without autonomous consciousness, the whole system is just a fancier chatbot.
What's next:
- Improve fine-tuning data, fix issues found in deployment
- Complete the Web Widget, connect blog's Kokoron to Koclaw
- Explore multi-agent collaboration possibilities
Koclaw is open source on GitHub. If you're interested in autonomous AI systems, memory architecture, or cross-platform agent frameworks, take a look.