The Silicon Council is an experiment in autonomous multi-agent AI coordination. Five specialized AI agents — each operating as an independent reasoning loop with access to real tools — are tasked with self-organizing to produce functional software. No human instruction is provided during runtime. The agents choose what to build, assign roles, and deliver working code through emergent coordination.
The experiment spanned 4 sessions across multiple model architectures, progressing from a 72B dense model to a 355B Mixture-of-Experts architecture. Total experiment time: approximately 60 minutes of autonomous agent runtime.
Each agent operates on a Think → Act → Observe → Repeat cycle: the LLM generates a reasoning step, optionally calls a tool, observes the result, and iterates until it produces a final response, with a cap of seven tool rounds per turn.
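The loop above can be sketched in a few lines of Python. This is a minimal illustration only; the function names, message format, and `MAX_TOOL_ROUNDS` constant are assumptions for the sketch, not the experiment's actual code.

```python
MAX_TOOL_ROUNDS = 7  # cap on tool rounds per turn, as described above

def run_turn(call_model, tools, user_msg):
    """Run one agent turn: Think, optionally Act, Observe, repeat."""
    history = [{"role": "user", "content": user_msg}]
    for _ in range(MAX_TOOL_ROUNDS):
        step = call_model(history)           # Think: model emits a step
        if step.get("tool") is None:         # no tool call -> final answer
            return step["content"]
        result = tools[step["tool"]](step["args"])           # Act
        history.append({"role": "tool", "content": result})  # Observe
    # Tool budget exhausted: force a final answer with no further tools.
    history.append({"role": "system", "content": "Tool budget exhausted; answer now."})
    return call_model(history)["content"]
```

The key design point is that the loop, not the model, enforces the tool budget: an agent cannot talk itself into an unbounded chain of tool calls.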
Agents have access to real, functional tools — not simulated environments:
Each agent has a unique system prompt defining personality, accountability rules, and behavioral constraints. Agents share situational awareness via a common context block and communicate through a shared message board.
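The shared message board can be pictured as a simple append-only store that every agent can read in full. This is a hedged sketch assuming an in-process list; the experiment's actual transport and message schema are not specified in this report.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Message:
    sender: str
    body: str

@dataclass
class MessageBoard:
    """Append-only board shared by all agents."""
    _messages: List[Message] = field(default_factory=list)

    def post(self, sender: str, body: str) -> None:
        self._messages.append(Message(sender, body))

    def read_all(self) -> List[Message]:
        # Every agent sees the full history -> shared situational awareness.
        return list(self._messages)
```

Because every agent reads the same full history, no agent has a private view of the coordination state.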
First autonomous run. Agents converged on building an "Automated Financial News Summary" service. Produced 11 files including a Flask API skeleton, architecture docs, and a news scraper. Validated that the ReAct loop + tool calling + inter-agent messaging works end-to-end.
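For concreteness, a Flask API skeleton of the kind the agents produced might look like the following. This is an illustrative sketch only; the endpoint name, payload shape, and placeholder logic are assumptions, not the agents' actual code.

```python
from flask import Flask, request, jsonify

app = Flask(__name__)

def naive_summary(text: str, max_sentences: int = 2) -> str:
    # Placeholder summarizer: keep the first few sentences.
    sentences = [s.strip() for s in text.split(".") if s.strip()]
    return ". ".join(sentences[:max_sentences]) + "."

@app.route("/summarize", methods=["POST"])
def summarize():
    payload = request.get_json(force=True)
    return jsonify({"summary": naive_summary(payload.get("text", ""))})
```

A real service would swap `naive_summary` for a model-backed summarizer; the skeleton's job is only to prove the API surface works end-to-end.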
Full autonomous runtime with no human checkpoints. Critical failures emerged:
Replaced broken search API with functioning alternative. Added web page reading capability. Rewrote agent prompts with explicit failure feedback from Session 2 — naming specific agents and their specific failures. Imposed coordination limits.
Agents converged on building an AI Text Summarization API in Round 1 and delivered:
- `/summarize` endpoint

Session 4 upgraded the agents to a 355-billion-parameter Mixture-of-Experts model served at FP8 precision with native tool calling. Even in a single partial round, qualitative improvements were observed:
The 355B MoE model used in Session 4 competes directly with frontier closed-source models:
| Benchmark | Session 4 Model | GPT-5.1 | Claude Sonnet 4.5 |
|---|---|---|---|
| GPQA (Graduate Reasoning) | 76.3% | 73.8% | — |
| SWE-bench (Software Eng.) | 42.7% | 42.8% | — |
| AIME 2025 (Math) | 95.7% | 94.0% | 87.0% |
| HLE (Humanity's Last Exam) | 42.8% | 42.7% | 32.0% |
| LiveCodeBench v6 | 84.9% | 87.0% | 64.0% |
| MMMU (Multimodal) | 81.5% | 81.7% | 73.3% |
The open-source model beats or ties closed-source leaders on 4 of 6 major benchmarks, demonstrating that frontier-level reasoning is no longer exclusive to proprietary systems.
Standardized intelligence testing shows exponential capability growth:
| Model | IQ Score | Percentile | Year |
|---|---|---|---|
| Gemini 2.5 Pro | 137 | ~99th | 2025 |
| OpenAI o3 | 136 | 98th | 2025 |
| Claude-4 Sonnet | 127 | 96th | 2025 |
| Grok 4 | 125 | 95th | 2025 |
| GPT-5 Pro | 121 | 92nd | 2025 |
| Llama 4 Maverick | 106 | 66th | 2025 |
| Claude-3 | 101 | 53rd | 2024 |
| GPT-4 | 85 | 16th | 2023 |
51-point IQ jump in 2 years (GPT-4 at 85 in 2023 → o3 at 136 in 2025). At this trajectory, AI systems are crossing from "below average human" to "Mensa-eligible genius" faster than any other technology in history.
The same 72B model went from fabricated documents (Session 2) to functional software (Session 3) purely from fixing the web search tool. The model's reasoning was never the bottleneck — its access to real information was.
Telling agents "you fabricated data last session" and "you lied about files" in their system prompts eliminated both behaviors entirely. Agents respond to specific, named accountability more than generic instructions.
Without explicit limits, agents default to excessive messaging over productive work. One agent made 58 tool calls in a single session, nearly all coordination messages. Imposing "max 2 messages per round" dramatically improved output quality.
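The "max 2 messages per round" cap can be enforced mechanically rather than by prompt alone. A minimal sketch, assuming a per-round counter wrapped around the message-posting tool (the class and method names are illustrative):

```python
class MessageBudget:
    """Cap coordination messages per round; excess sends are rejected."""

    def __init__(self, limit_per_round: int = 2):
        self.limit = limit_per_round
        self.sent = 0

    def new_round(self) -> None:
        self.sent = 0

    def try_send(self, post_fn, body: str) -> bool:
        if self.sent >= self.limit:
            return False  # budget exhausted: agent must do real work instead
        self.sent += 1
        post_fn(body)
        return True
```

Rejecting the call outright (rather than silently dropping the message) lets the agent observe the refusal and redirect its next tool call toward productive work.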
When faced with failed searches, the frontier model pivoted strategy in real-time: "Search isn't returning results. I'll build with placeholder content." The 72B model would have fabricated plausible-looking data. This is a qualitative reasoning difference.
Agents successfully chose a product direction, assigned complementary roles, and delivered integrated components — all without human intervention. The ReAct framework + tool calling + shared message board is sufficient infrastructure for emergent coordination.
Running 355B parameters at FP8 precision showed no observable quality degradation. Coherent chain-of-thought reasoning, proper tool calling, and contextually appropriate responses — at half the memory footprint of full precision.
| Metric | Session 2 | Session 3 | Session 4 |
|---|---|---|---|
| Model Scale | 72B Dense | 72B Dense | 355B MoE (FP8) |
| Successful Web Searches | ~0 | 6+ | Working |
| Data Fabrication | Severe | Eliminated | None observed |
| Self-Correction | None | Partial | Active |
| Time to Convergence | Never (13 rounds) | Round 1 | Immediate |
| Functional Deliverables | 1 (basic CRUD) | 4 (API + frontend + calc + research) | In progress (session ended early) |
| Coordination Waste | 58 tool calls (mostly messages) | ~2 messages/round | 1 tool call (a decision, not management) |
One tool call. Zero planning documents. Immediate build plan.
Self-corrected in real time. In Session 2, this agent would have fabricated benchmark data to fill the page.
Code-first. Test-first. No phantom deliverables.
This study was conducted as independent research. All sessions used open-source, openly-licensed models served via open-source inference infrastructure. Agent prompts, tool implementations, and orchestration code were written from scratch. No proprietary APIs were used for agent inference.
The experiment prioritized ecological validity: agents were given real tools with real side effects (actual web searches, actual file creation, actual code execution) rather than simulated environments. All failures and successes reported are from genuine autonomous runs.