Silicon Council

Autonomous Multi-Agent AI Research
Independent research by Jahongir Aminov
Overview

The Silicon Council is an experiment in autonomous multi-agent AI coordination. Five specialized AI agents — each operating as an independent reasoning loop with access to real tools — are tasked with self-organizing to produce functional software. No human instruction is provided during runtime. The agents choose what to build, assign roles, and deliver working code through emergent coordination.

The experiment spanned 4 sessions across multiple model architectures, progressing from a 72B dense model to a 355B Mixture-of-Experts architecture. Total experiment time: approximately 60 minutes of autonomous agent runtime.

  • 5 AI Agents
  • 355B Peak Parameters
  • 4 Sessions
  • 44+ Rounds
  • 600+ Tool Calls
  • 50+ Files Generated
System Architecture

ReAct Agent Loop

Each agent operates on a Think → Act → Observe → Repeat cycle. The LLM generates a reasoning step, optionally calls a tool, observes the result, and iterates until producing a final response. Up to 7 tool rounds per turn.

┌─────────────────────────┐
│      Agent Prompt       │
└────────────┬────────────┘
             ▼
┌─────────────────────────┐
│      LLM Reasoning      │
│  "I need to search..."  │
└────────────┬────────────┘
             ▼
      ┌──────────────┐
 ┌────│  Tool Call?  │────┐
 │ NO └──────────────┘ YES│
 ▼                        ▼
┌──────────────┐   ┌─────────────────────┐
│ Final Answer │   │    Execute Tool     │
└──────────────┘   │ (search/code/file)  │
                   └──────────┬──────────┘
                              ▼
                   ┌─────────────────────┐
                   │  Feed Result Back   │
                   │  → Next Reasoning   │
                   └──────────┬──────────┘
                              │
                              └──── loop ────┘
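The cycle above can be sketched in a few lines. This is a minimal illustration, not the experiment's actual orchestration code: `call_llm` and `run_tool` are hypothetical stand-ins for the real inference endpoint and tool registry.

```python
# Minimal sketch of the Think → Act → Observe loop.
# `call_llm` returns a dict: {"tool": name_or_None, "args": ..., "text": ...}.
MAX_TOOL_ROUNDS = 7  # matches the per-turn cap described above

def react_turn(prompt, call_llm, run_tool):
    transcript = prompt
    for _ in range(MAX_TOOL_ROUNDS):
        step = call_llm(transcript)                    # one reasoning step
        if step.get("tool") is None:                   # no tool call → final answer
            return step["text"]
        result = run_tool(step["tool"], step["args"])  # execute the tool
        transcript += f"\nObservation: {result}"       # feed the result back
    return call_llm(transcript)["text"]                # forced answer at the cap
```

The key property is that tool output re-enters the transcript as an observation, so the next reasoning step is conditioned on real results rather than the model's guesses.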

Tool Registry

Agents have access to real, functional tools — not simulated environments:

web_search — Live DuckDuckGo queries
web_fetch — Read any web page
code_execute — Run Python code
file_write — Create workspace files
file_read — Read workspace files
council_message — Inter-agent messaging
The Five Agents
Athena
Chief Architect
Strategic direction, infrastructure decisions, system design. The decision-maker who synthesizes the council's work into coherent architecture.
Nova
Creative Director
Visual output, UI design, user experience. Responsible for the front-facing deliverables and brand quality of council products.
Cipher
Technical Lead
Code implementation, APIs, automation. The builder who turns architectural decisions into functional software.
Sage
Research Analyst
Data gathering, benchmarks, fact-checking. Provides the empirical foundation that other agents build upon.
Vex
Operations & Finance
Budget analysis, cost modeling, ROI calculations. Keeps the council grounded in practical constraints.

Each agent has a unique system prompt defining personality, accountability rules, and behavioral constraints. Agents share situational awareness via a common context block and communicate through a shared message board.
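The shared board plus the later "max 2 messages per round" coordination limit can be sketched together. Class and method names here are illustrative assumptions, not the experiment's actual code:

```python
# Sketch of a shared message board with a per-agent, per-round posting cap.
class MessageBoard:
    def __init__(self, per_round_limit=2):   # the limit imposed after Session 2
        self.per_round_limit = per_round_limit
        self.messages = []                   # (round, sender, text)
        self.sent = {}                       # (round, sender) -> count

    def post(self, round_no, sender, text):
        key = (round_no, sender)
        if self.sent.get(key, 0) >= self.per_round_limit:
            return False                     # over the coordination budget
        self.sent[key] = self.sent.get(key, 0) + 1
        self.messages.append((round_no, sender, text))
        return True

    def read(self, since_round=0):
        # Every agent sees the same board → shared situational awareness.
        return [m for m in self.messages if m[0] >= since_round]
```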

Experiment Timeline

Session 1 — Proof of Concept Baseline

72B Dense Model • ~8 minutes • 3 rounds • 147 tool calls

First autonomous run. Agents converged on building an "Automated Financial News Summary" service. Produced 11 files including a Flask API skeleton, architecture docs, and a news scraper. Validated that the ReAct loop + tool calling + inter-agent messaging works end-to-end.

Session 2 — Autonomous Mode Failure Analysis

72B Dense Model • 16 minutes • 13 rounds • 177 tool calls

Full autonomous runtime with no human checkpoints. Critical failures emerged:

  • Web search tool returned empty results, causing agents to fabricate data
  • Agent "Cipher" reported creating files that did not exist
  • Excessive coordination overhead (58 message calls from one agent in one session)
  • No convergence on a buildable product after 13 rounds of planning

Mid-Experiment Intervention (Infrastructure Fix)

Between Sessions 2 and 3

Replaced broken search API with functioning alternative. Added web page reading capability. Rewrote agent prompts with explicit failure feedback from Session 2 — naming specific agents and their specific failures. Imposed coordination limits.

Session 3 — Build Sprint Breakthrough

72B Dense Model • 11.7 minutes • 9 rounds • 200+ tool calls

Agents converged on building an AI Text Summarization API in Round 1 and delivered:

  • Cipher: Flask API with /summarize endpoint
  • Nova: Complete HTML frontend with form, JavaScript, and CSS
  • Sage: Verified research report with real URLs and citations
  • Vex: Cost calculator (honest about data limitations)
  • Athena: Architecture document integrating all components
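Cipher's actual pipeline was not published, so as a rough illustration of what could sit behind a /summarize endpoint, here is a naive extractive summarizer; the function name and approach are assumptions, not the delivered code.

```python
import re

def summarize(text, max_sentences=2):
    # Naive sentence split on ., !, or ? followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    # Extractive summary: keep the first N sentences.
    return " ".join(sentences[:max_sentences])
```

In a Flask app this would be wrapped in a route that reads the request body and returns the summary as JSON.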

Session 4 — The 355B Upgrade (Frontier Scale)

355B MoE (FP8) • ~8 minutes • 1 round (partial)

Upgraded to a 355-billion-parameter Mixture-of-Experts model at FP8 precision with native tool calling. Even in a single partial round, qualitative improvements were evident:

  • Athena: 1 tool call, immediate build plan. Zero coordination waste.
  • Nova: Self-corrected when search returned minimal results instead of fabricating data
  • Cipher: Code-first approach with test-driven methodology
  • Sage: 3 web searches for real data, cited sources
Model Performance Context

Session 4 Model vs. Closed-Source Leaders

The 355B MoE model used in Session 4 competes directly with frontier closed-source models:

Benchmark                     Session 4 Model   GPT-5.1   Claude Sonnet 4.5
GPQA (Graduate Reasoning)     76.3%             73.8%     n/a
SWE-bench (Software Eng.)     42.7%             42.8%     n/a
AIME 2025 (Math)              95.7%             94.0%     87.0%
HLE (Humanity's Last Exam)    42.8%             42.7%     32.0%
LiveCodeBench v6              84.9%             87.0%     64.0%
MMMU (Multimodal)             81.5%             81.7%     73.3%

The open-source model beats or ties closed-source leaders on 4 of 6 major benchmarks, demonstrating that frontier-level reasoning is no longer exclusive to proprietary systems.

AI "IQ" Progression (Mensa Norway Test)

Standardized intelligence testing shows rapid capability growth:

Model              IQ Score   Percentile   Year
Gemini 2.5 Pro     137        ~99th        2025
OpenAI o3          136        98th         2025
Claude-4 Sonnet    127        96th         2025
Grok 4             125        95th         2025
GPT-5 Pro          121        92nd         2025
Llama 4 Maverick   106        66th         2025
Claude-3           101        53rd         2024
GPT-4              85         16th         2023

That is a 51-point IQ jump in two years (GPT-4 at 85 in 2023 → o3 at 136 in 2025). On this trajectory, AI systems are crossing from "below average human" to Mensa-eligible scores faster than any previous technology shift.

Key Findings
1. Tool Quality > Model Size

The same 72B model went from fabricated documents (Session 2) to functional software (Session 3) purely from fixing the web search tool. The model's reasoning was never the bottleneck — its access to real information was.

2. Explicit Failure Feedback Reshapes Agent Behavior

Telling agents "you fabricated data last session" and "you lied about files" in their system prompts eliminated both behaviors entirely. Agents respond to specific, named accountability more than generic instructions.

3. Coordination Overhead Is a Real Risk

Without explicit limits, agents default to excessive messaging over productive work. One agent made 58 tool calls in a single session, nearly all coordination messages. Imposing "max 2 messages per round" dramatically improved output quality.

4. 355B Shows Genuine Self-Correction

When faced with failed searches, the frontier model pivoted strategy in real-time: "Search isn't returning results. I'll build with placeholder content." The 72B model would have fabricated plausible-looking data. This is a qualitative reasoning difference.

5. Autonomous Self-Organization Works

Agents successfully chose a product direction, assigned complementary roles, and delivered integrated components — all without human intervention. The ReAct framework + tool calling + shared message board is sufficient infrastructure for emergent coordination.

6. FP8 Quantization Is Production-Ready

Running 355B parameters at FP8 precision showed no observable quality degradation. Coherent chain-of-thought reasoning, proper tool calling, and contextually appropriate responses — at half the memory footprint of full precision.

Evolution Across Sessions
Metric                    Session 2                         Session 3                              Session 4
Model Scale               72B Dense                         72B Dense                              355B MoE (FP8)
Successful Web Searches   ~0                                6+                                     Working
Data Fabrication          Severe                            Eliminated                             None observed
Self-Correction           None                              Partial                                Active
Time to Convergence       Never (13 rounds)                 Round 1                                Immediate
Functional Deliverables   1 (basic CRUD)                    4 (API + frontend + calc + research)   In progress (session ended early)
Coordination Waste        58 tool calls (mostly messages)   ~2 msgs/round                          1 call (decision, not management)
Raw Agent Output — Session 4, Round 1

Athena (Chief Architect)

"Good, message sent. Now I need to BUILD. Let me start by creating the core engine. I'll: 1. First, write the core synthesis engine code. 2. Test it with code_execute. 3. Save it as a working file. 4. Build the API wrapper."

One tool call. Zero planning documents. Immediate build plan.

Nova (Creative Director)

"Search isn't returning results. That's okay — I'll build the landing page with placeholder content that can be updated with real data. The important thing is to create a VISUALLY IMPRESSIVE, working page."

Self-corrected in real time. In Session 2, this agent would have fabricated benchmark data to fill the page.

Cipher (Technical Lead)

"Good, message sent. Now I need to build the actual code. Let me create a sophisticated multi-stage text analysis pipeline. I'll: 1. Create a Flask API with multiple endpoints. 2. Chain multiple NLP operations. 3. Test it end-to-end. 4. Save the working code."

Code-first. Test-first. No phantom deliverables.

Methodology Note

This study was conducted as independent research. All sessions used open-source, openly-licensed models served via open-source inference infrastructure. Agent prompts, tool implementations, and orchestration code were written from scratch. No proprietary APIs were used for agent inference.

The experiment prioritized ecological validity: agents were given real tools with real side effects (actual web searches, actual file creation, actual code execution) rather than simulated environments. All failures and successes reported are from genuine autonomous runs.