CTAIO Labs Podcast

S02E01: I Built the Same Agent in 6 Orchestrators

Thu, 18 Jun 2026 06:00:00 GMT

The experiment

The question every engineering leader is asking in 2026: which agent framework do we standardize on? So I built the same three-step research agent in six of them — LangGraph, CrewAI, AutoGen, OpenAI Swarm, Pydantic AI, and LlamaIndex — and compared developer experience, cost, reliability, and debuggability, apples to apples. The Season 2 opener of CTAIO Labs.

The two camps

The six split faster than expected into two camps, and that single design choice drives everything downstream — debuggability, reliability, how each fails under load:

Opinionated state — the framework owns state for you (LangGraph, Pydantic AI).
Defer state to you — you manage it (OpenAI Swarm, CrewAI).

The six, in one line each

LangGraph — agent as a directed graph, explicit typed state; most mature, best docs; shipped three breaking changes in six months.
CrewAI — Crew/Agents/Tasks abstraction; the easiest on-ramp, ~30 lines of Python; defers state to you.
AutoGen (Microsoft) — everything framed as conversations between agents; strong enterprise backing.
OpenAI Swarm — intentionally minimal (Agents + handoffs); an educational reference, least production-ready.
Pydantic AI — type safety at the agent layer; lowest impedance to standard Python engineering.
LlamaIndex Agents — agent primitives on a RAG heritage; strongest when the agent is mostly retrieval.

The verdict

Two frameworks separated from the pack, and it's close: LangGraph and CrewAI — two fundamentally opposed architectures that both win. CrewAI gets you to a working agent fastest; LangGraph gives you explicit control and typed state for when things go wrong. The tiebreaker isn't the tool, it's your team's priority. The other four are situational: Pydantic AI for typed-Python shops, LlamaIndex when it's mostly retrieval, AutoGen for Microsoft-enterprise stacks, and Swarm for learning, not production.

What I got wrong

I expected LangGraph's boilerplate to be a deal-breaker for small teams. It wasn't — the typed state graph earns its keep the moment an agent run goes sideways.

Timestamps

00:00 — Intro
01:32 — The experiment: the same agent in six frameworks
05:34 — Two camps: opinionated state vs. defer-to-you
07:00 — The six, framework by framework
25:30 — What I got wrong predicting this
31:49 — The verdict: LangGraph & CrewAI, neck-and-neck
40:56 — Outro & what's next in Season 2

S01E03: How to Clone Your Brain — 3 Second-Brain Paradigms Tested Head-to-Head

Tue, 16 Jun 2026 06:00:00 GMT

The experiment

Same corpus, same seven hard synthesis questions, three competing architectures — a head-to-head test of how to actually clone a knowledge base. The punchline: a plain folder of markdown files navigated by Claude Code beat a production-grade RAG pipeline, 5 wins to 2. Total cost of the brain experiment: $4.30.

The three contenders

Production RAG (Ask CTAIO) — OpenAI text-embedding-3-small → sqlite-vec → gpt-4.1-mini. The enterprise playbook. Won 2 of 7.
Gemini 2.5 Pro long-context dump — 705,000 tokens pasted raw, no retrieval. Won 1 of 7 — the single hardest question every other system failed.
File-based + Claude Code — markdown files in /opt plus an agent with read and grep (Karpathy's “LLM wiki”). Won 5 of 7.

Three failure modes

RAG confabulates — it invented an ElevenLabs shutdown that exists nowhere, because semantically adjacent but factually disconnected chunks let the model bridge gaps from its pretraining.
Long-context exhausts its budget — Gemini burned its entire output-token allocation computing attention over 705k tokens and timed out before answering the hardest questions.
File-based is brittle but honest — a case-sensitive grep missed a heading on capitalization — then admitted it could not find the answer instead of hallucinating one.

Faithfulness vs fluency

The crux: basic read/grep tools mechanically enforce faithfulness — stick to the corpus, flag your limits — while a RAG pipeline's generative step optimizes for fluency at the cost of truth. For a knowledge system, a tool that can say “I don't know” beats one that sounds confident and is wrong.

The working-memory trap

A five-turn probe: turn 1 said “never include dollar figures,” turn 5 returned them anyway. The cause is a six-message rolling history cap — turn 1 was popped off the stack. Sfeir's “working-memory gap,” demonstrated reproducibly.

The economics

The full second-brain experiment — the seven-question battery plus the working-memory probe across all three systems — cost exactly $4.30 in API calls. That is the brain experiment only; the voice (EP01) and video-avatar (EP02) layers carried their own separate, larger costs.

Timestamps

00:00 — Intro
01:25 — The “digital Ferrari” trap
04:30 — The test: one corpus, seven hard questions
06:00 — Contender 1: production RAG (Ask CTAIO)
08:48 — Contender 2: Gemini 2.5 Pro long-context dump
10:59 — Contender 3: file-based + Claude Code (Karpathy)
14:24 — The scoreboard: markdown files win 5 of 7
18:53 — Failure 1: RAG confabulates an ElevenLabs shutdown
21:14 — Failure 2: Gemini exhausts its compute budget
23:10 — Failure 3: a case-sensitive grep misses a heading
28:03 — The working-memory trap: the 6-message window & Sfeir's gap
33:10 — Faithfulness vs fluency: why “I don't know” wins
35:48 — The real cost: $4.30 for the brain experiment only
38:57 — Outro & Season 2 preview

S01E02: Build Your AI Twin — Clone Your Face and Body

Wed, 29 Apr 2026 06:00:00 GMT

What I Tested

Five AI video avatar engines tested hands-on against the same script and the same speaker:

HeyGen Avatar V — 15-sec clip → Diffusion Transformer clone (launched April 8 2026). Rated 7/10 on the four-dimension lens.
Synthesia Selfie Avatar — Starter $18/mo, photo-trained. Rated ≤2/10 — voice clone mispronounced my own surname.
Akool Free Instant Avatar — Basic Free tier. Rated 3/10, share-only output.
Tavus Personal Avatar / Replica — Starter $59/mo. Skipped — no free-trial path, persona library signaled wrong fit.
AI Studios / DeepBrain — Free tier UI. Skipped — API gated to Enterprise sales.

The Hallucination Gap

The single most important finding. HeyGen rewrites non-English scripts: my German render replaced "Roth" (my hometown) with "Fürz in Rot" — crude German slang. "prommer.net" rendered as "Proma.net." Spanish dropped to 32% character similarity, French to 23%. Synthesia preserves the script but mis-clones the voice ("Prommer" → "Prahm"). Both failures are silent. Neither platform's UI warns you.

Key Findings

HeyGen Avatar V is the only render that cleared 7/10 for editorial — and only in English. Pair with manual transcript review for any localized output.
Synthesia's wardrobe-by-text-prompt feature is genuinely unique. Worth the upgrade if your audience does not know the speaker's voice.
Akool Free is the cheapest path to a custom AI avatar at $0 — but the output is share-only and voice-identity is the disqualifier.
Tavus is architecturally a real-time conversational video API. Wrong fit for editorial; right fit for customer-support video agents.
EU AI Act Article 50 deadline is August 2, 2026. No platform tested ships fully machine-detectable watermarking.

The Practitioner Gap

I cross-referenced this test against what growth teams actually use in production (Advise Slack corpus, 30 channels, ~100k messages, Q1 2026). The result: enterprise vendors are invisible in the practitioner layer. HeyGen owns talking-head VSLs. Sora through Arcads owns ecom UGC ads. Synthesia, Colossyan, Tavus, AI Studios — zero mentions across the corpus.

Total Experiment Spend

$47 across the test window. $29 HeyGen Creator + $18 Synthesia Starter. Akool and AI Studios at $0. Tavus evaluated but skipped before payment.

Timestamps

00:00 — Intro
TBD — The bottom-line scoreboard
TBD — Hallucination Gap: HeyGen rewrites scripts, Synthesia mis-clones voices
TBD — Platform deep dives
TBD — Sales-friction report (Tavus + AI Studios skip rationale)
TBD — CTO Playbook + EU AI Act compliance
TBD — Outro

S01E01: I Cloned My Voice With 8 AI Engines — Here's What Won

Mon, 23 Mar 2026 06:00:00 GMT

What We Tested

Eight voice cloning engines evaluated across quality, cost, training data requirements, and multilingual support:

ElevenLabs
Cartesia
Coqui / XTTS
LMNT
Fish Audio
StyleTTS2
OpenAI
Deepgram

Key Findings

The open-source option (Coqui XTTS) needed just 5 seconds of audio
The winner (Cartesia) needed 54 minutes but produced a clone that fooled colleagues in blind tests
Cost ranged from free (open source) to $99/month (enterprise)
Multilingual support varied wildly — only 2 engines handled German well

Timestamps

00:00 — Introduction & credentials
02:07 — AI voice clone bridge (Cartesia demo)
02:59 — NotebookLM deep dive begins
42:00 — Key takeaways
45:26 — Outro & next episode preview

CTAIO Labs Podcast

S02E01: I Built the Same Agent in 6 Orchestrators

The experiment

The two camps

The six, in one line each

The verdict

What I got wrong

Timestamps

Links

S01E03: How to Clone Your Brain — 3 Second-Brain Paradigms Tested Head-to-Head

The experiment

The three contenders

Three failure modes

Faithfulness vs fluency

The working-memory trap

The economics

Timestamps

Links

S01E02: Build Your AI Twin — Clone Your Face and Body

What I Tested

The Hallucination Gap

Key Findings

The Practitioner Gap

Total Experiment Spend

Timestamps

Links

S01E01: I Cloned My Voice With 8 AI Engines — Here's What Won

What We Tested

Key Findings

Timestamps

Links