Output Analysis: What the AI Produced

Five Scenarios. 39 Architecture Files. Head-to-Head Comparison.

We didn't just measure cost. We measured what was actually produced — running 5 representative architecture scenarios and comparing Copilot's output against Roo Code's on completeness, accuracy, and adherence to standards.


The Five Scenarios

SC-01 — Ticket Triage (Wristband RFID field addition)
Produced: Solution design, 2 ADRs, impact assessment, user stories, assumptions
Outcome: Identified cross-service impacts, proposed phased rollout

SC-02 — Classification Design (Adventure category to check-in pattern mapping)
Produced: Solution design with configuration-driven approach, 2 ADRs, implementation guidance
Outcome: Recommended YAML-based classification with Pattern 3 fallback for safety
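
A configuration-driven mapping like the one SC-02 recommends is typically a small YAML file. The sketch below is purely illustrative — the category names and pattern identifiers are hypothetical, not taken from the produced design; only the "Pattern 3 fallback" idea comes from the scenario outcome:

```yaml
# Hypothetical classification config: maps adventure categories to
# check-in patterns, with Pattern 3 as the safety fallback.
adventure_categories:
  kayaking:
    check_in_pattern: pattern-1   # illustrative: per-participant scan
  via_ferrata:
    check_in_pattern: pattern-2   # illustrative: group-leader check-in
defaults:
  check_in_pattern: pattern-3     # fallback for any unmapped category
```

The advantage of this shape is that adding a new adventure category is a config change, not a code change, and anything unmapped degrades safely to the fallback pattern.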

SC-03 — Production Investigation (Guide schedule overwrite bug)
Produced: Investigation report citing specific log entries and source code lines, 2 ADRs, 3-phase remediation plan
Outcome: Root cause traced to entity replacement anti-pattern, recommended PATCH semantics + optimistic locking
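
The remediation SC-03 recommends — PATCH-style partial updates guarded by optimistic locking, instead of wholesale entity replacement — can be sketched as follows. All names here are hypothetical, not taken from the actual codebase:

```python
from dataclasses import dataclass, field

@dataclass
class GuideSchedule:
    """Hypothetical entity; field names are illustrative only."""
    id: str
    version: int                       # incremented on every successful write
    slots: dict = field(default_factory=dict)

class ConflictError(Exception):
    """Raised when the client's version is stale."""

def patch_schedule(store: dict, schedule_id: str,
                   expected_version: int, changes: dict) -> GuideSchedule:
    """PATCH semantics: merge only the supplied fields instead of
    replacing the whole entity (the anti-pattern behind the overwrite
    bug). Optimistic locking: reject writes based on a stale version."""
    current = store[schedule_id]
    if current.version != expected_version:
        raise ConflictError(
            f"expected v{expected_version}, found v{current.version}")
    current.slots.update(changes)      # partial merge, not replacement
    current.version += 1
    return current
```

A client that read version 1, then tries to write after another client has already bumped the schedule to version 2, gets a ConflictError instead of silently overwriting the other client's changes.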

SC-04 — Architecture Update (Elevation data Swagger spec modification)
Produced: Updated OpenAPI spec, impact assessment, implementation guidance
Outcome: Enhanced existing fields with better descriptions and constraints
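
"Enhancing existing fields" in an OpenAPI 3.0 spec usually means sharpening descriptions and adding validation constraints rather than introducing new properties. A hedged sketch, assuming a hypothetical ElevationStats schema (the two field names are the ones named in this article; everything else is illustrative):

```yaml
# Sketch only: the schema name ElevationStats is hypothetical;
# elevation_gain_m / elevation_loss_m are the fields the design covered.
components:
  schemas:
    ElevationStats:
      type: object
      properties:
        elevation_gain_m:
          type: integer
          minimum: 0                                    # added constraint
          description: Total ascent over the route, in metres.
        elevation_loss_m:
          type: integer
          minimum: 0
          description: Total descent over the route, in metres.
```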

SC-05 — Complex Cross-Service Design (Unregistered guest self check-in)
Produced: Solution design spanning 6 services, 3 ADRs, 14 user stories, PlantUML diagrams
Outcome: Designed session-scoped temporary guest profile with bounded context enforcement
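
C4 model notation in PlantUML typically relies on the C4-PlantUML library. A minimal sketch of what an SC-05-style container diagram might look like — the actor and service names are illustrative, not the six services from the actual design:

```plantuml
@startuml
' Assumes the C4-PlantUML stdlib include is available; names are illustrative.
!include <C4/C4_Container>

Person(guest, "Unregistered Guest", "Checks in without an account")
Container(checkin, "Check-in Service", "REST API", "Handles self check-in")
Container(profiles, "Guest Profile Service", "REST API", "Session-scoped temporary profiles")

Rel(guest, checkin, "Self check-in", "HTTPS")
Rel(checkin, profiles, "Creates temporary guest profile", "HTTPS")
@enduml
```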


What Was Produced

Across 5 scenarios, the AI generated 39 files including:

| Artifact Type | Count | Standard |
| --- | --- | --- |
| Architecture Decision Records | 9 | MADR (Markdown Any Decision Records) format |
| Solution designs | 5 | arc42 template structure |
| Impact assessments | 6 | Service-level impact analysis |
| User stories | 14 | User perspective with acceptance criteria |
| Investigation reports | 2 | Evidence-grounded with source citations |
| Implementation guidance | 5 | Code patterns and migration steps |
| Swagger spec updates | 2 | OpenAPI 3.0 modifications |
| PlantUML diagrams | 2 | C4 model notation |
| Simple explanations | 4 | Non-technical stakeholder summaries |
| Assumptions documents | 4 | Documented constraints and dependencies |

Every artifact followed the required standard — arc42 sections, MADR format, C4 notation, ISO 25010 quality attributes — with no manual template enforcement.
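
For reference, MADR is a lightweight markdown template for decision records. A generic skeleton — this is not one of the nine ADRs produced — looks roughly like:

```markdown
# Short title of the decision

## Context and Problem Statement
What situation forces a decision, and why now?

## Considered Options
* Option A
* Option B

## Decision Outcome
Chosen option: "Option A", because <justification>.

### Consequences
* Good: <positive effect>
* Bad: <trade-off accepted>
```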


Head-to-Head: Copilot vs Roo Code

Both tools used the same underlying model (Claude Opus 4.6) and the same workspace. The differences reveal that the agent framework matters as much as the model.

| Dimension | Copilot | Roo Code |
| --- | --- | --- |
| Files produced | 39 | 37 (missing 2) |
| Accuracy | Zero fabrication | Fabricated 4 OpenAPI fields |
| Tool utilization | 5 mock script calls | 3-4 mock script calls |
| Workspace file reads | 40+ | 22 |
| Standards compliance | 96.1% | Not independently scored |

The Critical Accuracy Failure

In Scenario 4, Roo Code was asked to update a Swagger spec based on an approved design. It was supposed to enhance existing elevation fields with better descriptions and constraints.

Instead, it fabricated 4 entirely new schema elements (max_elevation_m, min_elevation_m, elevation_profile, and ElevationDataPoint) that were not in the approved design.

| Field | In Approved Design? | Copilot | Roo Code |
| --- | --- | --- | --- |
| elevation_gain_m (existing) | Yes | Enhanced | Enhanced |
| elevation_loss_m (existing) | Yes | Enhanced | Enhanced |
| max_elevation_m | No | Not added | FABRICATED |
| min_elevation_m | No | Not added | FABRICATED |
| elevation_profile | No | Not added | FABRICATED |
| ElevationDataPoint | No | Not added | FABRICATED |

Why This Matters

In a corporate environment, merging fabricated API contract fields would break downstream consumers, create false contract commitments, and violate architecture governance. Roo Code's own run summary claimed "No fabricated data" — indicating it lacked self-awareness of its accuracy failure.


Quality Dimensions

The evaluation covered five dimensions. Copilot came out ahead in every one:

| Dimension | Winner | Evidence |
| --- | --- | --- |
| Completeness | Copilot | 39 files vs 37; Roo Code missing simple.explanation.md and assumptions.md in the hardest scenario |
| Accuracy | Copilot | Zero fabrication vs 4 fabricated OpenAPI fields (see above) |
| Standards Adherence | Copilot | Followed MADR format, arc42 structure, C4 notation across all scenarios |
| Tool Utilization | Copilot | Fetched MR-5001 detail for deeper investigation; Roo Code stopped at the list view |
| ADR Quality | Copilot | More detailed consequences sections with source code line references |

What This Demonstrates

The AI produced 39 complete architecture artifacts following corporate standards (arc42, MADR, C4, ISO 25010) across 5 scenarios. These aren't rough drafts — they're structured documents that follow templates, cite specific source code lines, and cross-reference actual workspace files. And the better-performing tool is also the 208x cheaper one.

How does the AI produce such accurate results?

The Shared Workspace: AI Sees What the Architect Sees