The Silicon Council is an experiment in autonomous multi-agent AI coordination. Five specialized AI agents — each operating as an independent reasoning loop with access to real tools — are tasked with self-organizing to produce functional software. No human instruction is provided during runtime. The agents choose what to build, assign roles, and deliver working code through emergent coordination.
The experiment spanned 4 sessions across multiple model architectures, progressing from a 72B dense model to a 355B Mixture-of-Experts architecture. Total experiment time: approximately 60 minutes of autonomous agent runtime.
Each agent operates on a Think → Act → Observe → Repeat cycle: the LLM generates a reasoning step, optionally calls a tool, observes the result, and iterates until it produces a final response, with a cap of seven tool rounds per turn.
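The loop above can be sketched in a few lines of Python. This is a minimal illustration only; the function names, message format, and `MAX_TOOL_ROUNDS` constant are assumptions for the sketch, not the experiment's actual code.

```python
MAX_TOOL_ROUNDS = 7  # cap on tool rounds per turn, as described above

def run_turn(call_model, tools, user_msg):
    """Run one agent turn: Think, optionally Act, Observe, repeat."""
    history = [{"role": "user", "content": user_msg}]
    for _ in range(MAX_TOOL_ROUNDS):
        step = call_model(history)           # Think: model emits a step
        if step.get("tool") is None:         # no tool call -> final answer
            return step["content"]
        result = tools[step["tool"]](step["args"])           # Act
        history.append({"role": "tool", "content": result})  # Observe
    # Tool budget exhausted: force a final answer with no further tools.
    history.append({"role": "system", "content": "Tool budget exhausted; answer now."})
    return call_model(history)["content"]
```

The key design point is that the loop, not the model, enforces the tool budget: an agent cannot talk itself into an unbounded chain of tool calls.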
Agents have access to real, functional tools — not simulated environments:
Each agent has a unique system prompt defining personality, accountability rules, and behavioral constraints. Agents share situational awareness via a common context block and communicate through a shared message board.
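The shared message board can be pictured as a simple append-only store that every agent can read in full. This is a hedged sketch assuming an in-process list; the experiment's actual transport and message schema are not specified in this report.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Message:
    sender: str
    body: str

@dataclass
class MessageBoard:
    """Append-only board shared by all agents."""
    _messages: List[Message] = field(default_factory=list)

    def post(self, sender: str, body: str) -> None:
        self._messages.append(Message(sender, body))

    def read_all(self) -> List[Message]:
        # Every agent sees the full history -> shared situational awareness.
        return list(self._messages)
```

Because every agent reads the same full history, no agent has a private view of the coordination state.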
First autonomous run. Agents converged on building an "Automated Financial News Summary" service. Produced 11 files including a Flask API skeleton, architecture docs, and a news scraper. Validated that the ReAct loop + tool calling + inter-agent messaging works end-to-end.
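For concreteness, a Flask API skeleton of the kind the agents produced might look like the following. This is an illustrative sketch only; the endpoint name, payload shape, and placeholder logic are assumptions, not the agents' actual code.

```python
from flask import Flask, request, jsonify

app = Flask(__name__)

def naive_summary(text: str, max_sentences: int = 2) -> str:
    # Placeholder summarizer: keep the first few sentences.
    sentences = [s.strip() for s in text.split(".") if s.strip()]
    return ". ".join(sentences[:max_sentences]) + "."

@app.route("/summarize", methods=["POST"])
def summarize():
    payload = request.get_json(force=True)
    return jsonify({"summary": naive_summary(payload.get("text", ""))})
```

A real service would swap `naive_summary` for a model-backed summarizer; the skeleton's job is only to prove the API surface works end-to-end.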
Full autonomous runtime with no human checkpoints. Critical failures emerged:
Replaced broken search API with functioning alternative. Added web page reading capability. Rewrote agent prompts with explicit failure feedback from Session 2 — naming specific agents and their specific failures. Imposed coordination limits.
Agents converged on building an AI Text Summarization API in Round 1 and delivered:
- `/summarize` endpoint

Session 4 upgraded the agents to a 355-billion-parameter Mixture-of-Experts model served at FP8 precision with native tool calling. Even in a single partial round, qualitative improvements were observed:
The 355B MoE model used in Session 4 competes directly with frontier closed-source models:
| Benchmark | Session 4 Model | GPT-5.1 | Claude Sonnet 4.5 |
|---|---|---|---|
| GPQA (Graduate Reasoning) | 76.3% | 73.8% | — |
| SWE-bench (Software Eng.) | 42.7% | 42.8% | — |
| AIME 2025 (Math) | 95.7% | 94.0% | 87.0% |
| HLE (Humanity's Last Exam) | 42.8% | 42.7% | 32.0% |
| LiveCodeBench v6 | 84.9% | 87.0% | 64.0% |
| MMMU (Multimodal) | 81.5% | 81.7% | 73.3% |
The open-source model beats or ties closed-source leaders on 4 of 6 major benchmarks, demonstrating that frontier-level reasoning is no longer exclusive to proprietary systems.
Standardized intelligence testing shows exponential capability growth:
| Model | IQ Score | Percentile | Year |
|---|---|---|---|
| Gemini 2.5 Pro | 137 | ~99th | 2025 |
| OpenAI o3 | 136 | 98th | 2025 |
| Claude-4 Sonnet | 127 | 96th | 2025 |
| Grok 4 | 125 | 95th | 2025 |
| GPT-5 Pro | 121 | 92nd | 2025 |
| Llama 4 Maverick | 106 | 66th | 2025 |
| Claude-3 | 101 | 53rd | 2024 |
| GPT-4 | 85 | 16th | 2023 |
51-point IQ jump in 2 years (GPT-4 at 85 in 2023 → o3 at 136 in 2025). At this trajectory, AI systems are crossing from "below average human" to "Mensa-eligible genius" faster than any other technology in history.
The same 72B model went from fabricated documents (Session 2) to functional software (Session 3) purely from fixing the web search tool. The model's reasoning was never the bottleneck — its access to real information was.
Telling agents "you fabricated data last session" and "you lied about files" in their system prompts eliminated both behaviors entirely. Agents respond to specific, named accountability more than generic instructions.
Without explicit limits, agents default to excessive messaging over productive work. One agent made 58 tool calls in a single session, nearly all coordination messages. Imposing "max 2 messages per round" dramatically improved output quality.
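The "max 2 messages per round" cap can be enforced mechanically rather than by prompt alone. A minimal sketch, assuming a per-round counter wrapped around the message-posting tool (the class and method names are illustrative):

```python
class MessageBudget:
    """Cap coordination messages per round; excess sends are rejected."""

    def __init__(self, limit_per_round: int = 2):
        self.limit = limit_per_round
        self.sent = 0

    def new_round(self) -> None:
        self.sent = 0

    def try_send(self, post_fn, body: str) -> bool:
        if self.sent >= self.limit:
            return False  # budget exhausted: agent must do real work instead
        self.sent += 1
        post_fn(body)
        return True
```

Rejecting the call outright (rather than silently dropping the message) lets the agent observe the refusal and redirect its next tool call toward productive work.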
When faced with failed searches, the frontier model pivoted strategy in real-time: "Search isn't returning results. I'll build with placeholder content." The 72B model would have fabricated plausible-looking data. This is a qualitative reasoning difference.
Agents successfully chose a product direction, assigned complementary roles, and delivered integrated components — all without human intervention. The ReAct framework + tool calling + shared message board is sufficient infrastructure for emergent coordination.
Running 355B parameters at FP8 precision showed no observable quality degradation. Coherent chain-of-thought reasoning, proper tool calling, and contextually appropriate responses — at half the memory footprint of full precision.
| Metric | Session 2 | Session 3 | Session 4 |
|---|---|---|---|
| Model Scale | 72B Dense | 72B Dense | 355B MoE (FP8) |
| Successful Web Searches | ~0 | 6+ | Working |
| Data Fabrication | Severe | Eliminated | None observed |
| Self-Correction | None | Partial | Active |
| Time to Convergence | Never (13 rounds) | Round 1 | Immediate |
| Functional Deliverables | 1 (basic CRUD) | 4 (API + frontend + calc + research) | In progress (session ended early) |
| Coordination Waste | 58 tool calls (mostly messages) | ~2 messages/round | 1 tool call (a decision, not management) |
One tool call. Zero planning documents. Immediate build plan.
Self-corrected in real time. In Session 2, this agent would have fabricated benchmark data to fill the page.
Code-first. Test-first. No phantom deliverables.
This study was conducted as independent research. All sessions used open-source, openly-licensed models served via open-source inference infrastructure. Agent prompts, tool implementations, and orchestration code were written from scratch. No proprietary APIs were used for agent inference.
The experiment prioritized ecological validity: agents were given real tools with real side effects (actual web searches, actual file creation, actual code execution) rather than simulated environments. All failures and successes reported are from genuine autonomous runs.