Heroes of AI and Magic

5 updates

Aliaksei Ramanau Final

May 11, 09:49 PM

# darkmagic-analytics-agent

Ask Fairmarkit data a question in plain English. Get a chart, a number, and the math behind it.

We built an analytics agent that turns natural-language procurement questions into validated Cube-backed answers, with charts, CSV export, friendly errors, a "show me how" trace panel, and a dark theme that should make every Linuxoid happy.

# What's under the hood

A LangGraph plan-and-execute pipeline (~20 specialized nodes) sits on top of the existing Cube semantic layer. Every number on screen is the output of a Cube query that was generated by the agent, validated against Cube metadata, and executed deterministically - no free-form SQL, no hallucinated metrics [eventually in prod ;) ].

A second cheap-LLM "mapper critique" node catches roughly 30% of wrong-metric mistakes before they hit Cube. The verifier flags time-series anomalies with a z-score gate, so the presenter can call them out instead of glossing over them.

# Resources

* Git repo: https://gitlab.fmdev.io/analytics/darkmagic-analytics-agent
* Docs: https://gitlab.fmdev.io/analytics/darkmagic-analytics-agent/-/tree/master/docs?ref_type=heads
* Pitch presentation with more details: https://docs.google.com/presentation/d/1qwRq-kqsV3RyeafJNboilSmTDEngjZ-hg4OOcIPE_ho/edit?slide=id.g3e0f0a7a62f_0_72#slide=id.g3e0f0a7a62f_0_72
* More screenshots: https://drive.google.com/drive/folders/17NGUFwXQSpkjleUxiw-4QJM4Y_RuHHDV

# Team & thanks

Engineered by Alex Ramanau + a squad of AI agents (Claude Code, opencode), with direct help from Yaugeniy Shalkevich and Marceli Adamzyk, and sharp feedback from Viktor and Andrew. Special kudos to Yaugeniy Shalkevich for explaining complex domain-specific problems in simple words.

Aliaksei Ramanau

May 11, 11:14 AM

# Weekend - Stabilize the trunk, branch out for intelligence

We went into the weekend with a working agent and clear feedback: **don't add breadth - add trust and depth.** Two days later we have a sturdier trunk *and* two intelligence prototypes built in parallel on separate VMs via agentic engineering.

## Stabilization (master)

**Web-search / procurement branch - finalized.** HITL flow hardened against the edge cases that bit us on Day 3; full e2e green. Multiple small fixes applied directly from what e2e tests surfaced - exactly the loop the CTO asked for.

**Per-turn chat checkpoint.** New migration `0004_chat_pending_checkpoint` + 232-line test suite. The agent now distinguishes "user repeated the same question" from "user asked something new," and resumes from the right LangGraph state. Removes a whole class of stale-context bugs.

**FE that fails politely.** When Cube goes away, the user gets a clear status panel and an actionable error - not a JavaScript stack trace. Small change, big "this is a real product" signal.

**Project renamed** to `darkmagic-analytics-agent`. Module paths, Prometheus, Docker, lockfiles - all green after the rename.

## Two intelligence prototypes (parallel branches, VM-hosted)

To avoid blocking the master trunk, the two intelligence prototypes were built on separate VMs, using **opencode + OpenAI GPT-5.5** and **Claude Code + Opus 4.7**. Each prototype followed its own pre-written spec (Round 13) and landed in its own feature branch so they can be reviewed, merged, or shelved independently.

### Branch 1 - Verification & self-correction

[`feature/13-anomaly-zscore-prototype+14-mapper-critique-prototype`](https://github.com)

- **Z-score outlier check** ([spec 13](spec/13-anomaly-zscore-prototype.md)) - verifier flags time-series points that deviate >3σ from the local window, so the presenter can call them out instead of glossing over them. 133-line test suite.
- **Mapper self-critique node** ([spec 14](spec/14-mapper-critique-prototype.md)) - cheap-LLM plausibility gate between `cube_mapper` and `execute`. Catches wrong-metric / missing-filter mistakes and triggers a single bounded retry. 131-line node, 283-line test suite. Projected to auto-correct ~30% of mapping errors before they hit Cube.

### Branch 2 - Retrieval & best-practices grounding

[`feature/spec-09-plus-10-bestpractices-and-followups`](https://github.com)

- **Few-shot retrieval + anomaly flagging** ([spec 9](spec/09-fewshot-and-anomaly.md)) - the mapper now retrieves similar past Q→Cube-query pairs from a small curated index before generating a new query.
- **Procurement best-practices RAG corpus.** 19 hand-curated markdown documents across `savings/`, `cycle-time/`, `engagement/`, `governance/` - the kind of institutional knowledge that turns a generic SQL bot into a procurement-aware advisor.
- **Engagement response-rate suggester**, refined after domain-expert feedback from Marceli - thresholds now match how FM customers actually read the metric.
- **Comprehensible QA scenarios** documenting the expected best-practices behavior, so the eval set can prove the RAG is actually moving the answers in the right direction.

## By the numbers

14 commits across 4 branches · 267 files · +5.4k / −0.8k lines · 2 new prototypes shipped behind feature branches · 19-doc procurement RAG corpus · 5 new test files (~1k lines of new tests).

## Today: we're on the final stretch (Day 5)

1. **UX-flow polish + last-mile bug fixes** - every demo claim shows green on the real product.
2. **Sharpen the eval against Micron Technology staging data** - work on anonymized data and optimize the **North Star metric: % of natural-language questions answered correctly.**
3. **Pitch + demo.** Get out of the comfort zone and prepare a comprehensible pitch + demo.

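The z-score outlier check listed under Branch 1 can be illustrated with a minimal sketch. This is not the code from spec 13: the function name `flag_anomalies`, the trailing-window definition of "local window", and the default window size of 8 are assumptions made for illustration; only the >3σ threshold comes from the update above.

```python
from statistics import mean, pstdev


def flag_anomalies(values, window=8, threshold=3.0):
    """Return indices of points deviating more than `threshold` sigma
    from the trailing window of `window` preceding points.

    Hypothetical helper: window size and trailing-window choice are
    assumptions, not taken from the project's spec.
    """
    flagged = []
    for i in range(window, len(values)):
        local = values[i - window:i]      # the points just before index i
        mu = mean(local)
        sigma = pstdev(local)
        if sigma == 0:
            continue                      # flat window: z-score is undefined, skip
        z = (values[i] - mu) / sigma
        if abs(z) > threshold:
            flagged.append(i)
    return flagged


# Example: one obvious spike at index 8.
series = [100, 102, 98, 101, 99, 103, 97, 100, 250, 101]
print(flag_anomalies(series))  # [8]
```

Note that the point right after the spike is not flagged, because the spike itself inflates the standard deviation of the next window. A production gate would probably want a more robust estimator (median and MAD, for example), which is a design choice outside what the update describes.
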
## Meanwhile, in a parallel universe

Two production streams are moving alongside it:

- **AWS production deployment of `ai-analytics` + `platform-analytics`** - configs finalized, deployment pipeline being driven through to prod.
- **Analytics 2.0 → Azure uat/staging** - Cube configuration verified, release being prepped for uat/staging rollout.

Same brain, different terminals. The "fast software engineering" claim isn't theoretical - it's how today is actually being scheduled.

---

**Fun fact.** In parallel with shipping features, the same engineers also managed to grill spicy chicken wings on the bank of the Vistula. Turns out spec-driven development frees up enough cycles for proper smoke testing - the literal kind. 🔥🍗

Viktar Kushch

May 11, 02:46 PM

Aliaksei Ramanau

May 9, 03:46 PM

# Day 3 - From "looks impressive" to "we can prove it works"

We started Day 3 with a working vertical slice. We end it having absorbed honest and valuable feedback from Viktor/Andrew, sharpened the demo posture, and shipped a chunk of real product feel - without splitting focus.

## The pivot - CTO / VP of Engineering feedback

A mid-hackathon sync delivered the signal we needed: *"anyone can vibe-code a chat in a week."* Two paths to maturity - polished UX **or** depth of reasoning. Pick one. Lean in.

We picked **both**. The chat needs to be useful, serve as a prototype for new UX flows such as data export, and become a genuinely useful everyday tool.

## What we shipped

**Chat lifecycle that feels like a real product.** List, delete, **undo**, and **pinned/favorite chats** with a partial-index migration. UNDO is the single most requested missing piece from FM's current AI Kit chat - it now exists.

**HITL that doesn't look like a wall of text.** Ambiguity questions are now structured choices (3 + "Other"). Fixed the broken metrics-ambiguity flow that snuck in late on Day 2. +289 lines of clarification-loop tests pin the behavior.

**Smarter metrics discovery.** *"What can I ask?"* now returns a curated, role-aware catalogue - not a 200-item dump. 216-line node + 290-line test suite.

**Cross-tenant requests handled explicitly.** When a buyer at Tenant A asks about Tenant B's data, the agent refuses by name and emits an audit event - no more silent empty results. New permission cases in the eval set.

**Spec depth - six micro-specs to push intelligence.** Round 13 of the spec process produced concrete, ready-to-implement prototypes: mapper self-critique (E001.C2), few-shot retrieval, anomaly detection, presenter heuristics, suggested follow-ups, best-practices RAG.

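The explicit cross-tenant handling described above (refuse by name, emit an audit event, never return a silent empty result) can be sketched roughly as follows. Everything here is hypothetical: `AuditLog`, `CrossTenantError`, `check_tenant_scope`, and the event name are illustrative stand-ins, not the project's actual classes or audit schema.

```python
from dataclasses import dataclass, field


@dataclass
class AuditLog:
    """Hypothetical in-memory audit sink, used only for this illustration."""
    events: list = field(default_factory=list)

    def emit(self, event_type, **details):
        self.events.append({"type": event_type, **details})


class CrossTenantError(Exception):
    """Raised when a request references a tenant other than the caller's."""


def check_tenant_scope(user_tenant, requested_tenants, audit):
    """Refuse explicitly, naming the foreign tenants, and record an audit event."""
    foreign = sorted(t for t in requested_tenants if t != user_tenant)
    if foreign:
        audit.emit(
            "cross_tenant_request_denied",   # assumed event name
            user_tenant=user_tenant,
            requested=foreign,
        )
        raise CrossTenantError(
            f"You are signed in to {user_tenant} and cannot query data "
            f"for {', '.join(foreign)}."
        )


audit = AuditLog()
check_tenant_scope("Tenant A", {"Tenant A"}, audit)      # same tenant: passes
try:
    check_tenant_scope("Tenant A", {"Tenant B"}, audit)  # foreign tenant: refused
except CrossTenantError as err:
    print(err)
```

The point of raising instead of returning an empty result set is that the user and the audit trail both see why nothing came back.
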
**Isolation-hardening spec.** `spec/15-isolation-hardening.md` - 521 lines closing three near-miss gaps surfaced by an internal security review (multi-cube tenant filter, silent mapping fallback, LLM-supplied tenant filter rejection).

## This is *not* vibe-coding

What we're really doing: re-implementing FM's existing AI Kit chat (from `ai-analytics`) with an improved user flow and advanced agentic engineering - using **spec-driven agentic development** as the discipline. That discipline is recognizably *software engineering*, not "talk to an LLM until something works":

- **16 committed specs** in `docs/spec/` - each hand-offable to any engineer or coding agent.
- **60+ logged decisions** in `context/005-decisions-log.md`, traceable across 13 rounds of design conversation.
- **Audit chain:** every commit references a spec; every spec references a round; every round references the original brief.
- Reasoning happens **before** code, in writing, reviewable by humans. Disagreement is resolved at the spec layer where it's cheap - not at merge time where it's expensive.
- The "agentic" part is that drafts of specs and code are produced by an LLM. The senior-engineer judgment - what to spec, what to cut, what to verify - stays human.

**It's just software engineering in 2026. Done fast.**

## By the numbers

12 commits · 72 files · +5.2k / −0.2k lines · 1 new domain spec · 1 isolation spec · 6 ready-to-implement micro-spec prototypes.

## Weekend focus

1. **End-to-end flow polish & testing** - every demo claim has a green path on staging, top to bottom.
2. **Golden dataset, sharpened against real customer data** - consolidate to 15–20 branch-diverse cases, all running on staging Cube. The eval-pass number drives the demo's North Star slide.
3. **Intelligence prototypes** - implement a couple of prototypes to increase the intelligence of the AI agent.
4. **Stability verification + bug fixes** - repeat the eval run, fix what breaks, chase the number until it holds steady across runs.

Viktar Kushch

May 9, 04:13 PM

Andrey Timonin

May 10, 04:34 PM

Aliaksei Ramanau

May 8, 08:14 AM

# Day 2 - From skeleton to a real conversation

We started Day 2 with bones. We end it with a procurement agent that **talks back, remembers, and refuses to lie about numbers.**

## Highlights

**Real domain, not toy data.** Vendored the production Cube schema (18.8k lines, synced offline so the agent can run with no live Cube) and shipped two new cubes: `dim_vendors` and `fact_unified_responses` (354 lines of facts and measures). The agent now answers against the same model our analytics team ships to our real customer, Micron Technology.

**Multi-turn intake that actually intakes.** Conversation-aware ambiguity check with inline metadata lookup - the agent asks the *right* follow-up instead of shrugging.

**"Last quarter," "this fiscal year," "Q3" - all just work.** New `timeframe.py` translates natural-language date phrases into Cube time dimensions.

**Golden dataset is live.** 614-line PM domain notes + 48 new core-golden cases. We can now measure the agent getting smarter - or dumber - instead of guessing.

**End-to-end coverage.** 608 lines of Playwright + a 236-line responses-cube test suite. No more vibes-driven shipping.

**The UI grew up.** Pretty markdown rendering in chat, a real dark theme with a switcher, CSV export from any answer. Pinned/favorite chats half-landed (migration + repo done, FE polish in flight).

## By the numbers

13 commits · 112 files · +22.8k / −4.5k lines · 4 new test suites, all green

## Where we are

The simple path is no longer a demo trick - it's a product loop: **ask → clarify → query the real warehouse → render → export.** Day 3 unlocks the complex path + the human-in-the-loop escape hatch, and we start hammering it with the eval harness.

---

P.S.

* Uploading pictures still doesn't work for me, but here is the link to my hacking env: https://drive.google.com/file/d/1DKdzSuacLaspeYM2n_dT4oIBx6WnTdeo/view?usp=drive_link
* A humble screenshot from the magic Analytics chat: https://drive.google.com/file/d/1Lq63QuTF9uQdqUtrVpWBXbKjEArU0oMH/view?usp=sharing (I'll provide it later; the chat is broken because I'm disconnected from the VPN. To post this update I need to be painfully disconnected from Staging Cube :] )
* 0.66l of coffee, 1.5l of tea, 1 glass of Cyprus wine consumed - hope I won't be disqualified for doping :)
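
For readers curious about the `timeframe.py` highlight in this update, here is a rough sketch of how phrases such as "last quarter", "Q3", and "this fiscal year" could be resolved into a Cube-style time dimension with an explicit `dateRange`. It is an illustration only, not the project's module: the function names, the `fiscal_start_month` parameter, and the `fact_unified_responses.created_at` member are assumptions.

```python
import re
from datetime import date, timedelta


def quarter_range(year, quarter):
    """First and last day of a calendar quarter."""
    start = date(year, 3 * (quarter - 1) + 1, 1)
    if quarter == 4:
        next_start = date(year + 1, 1, 1)
    else:
        next_start = date(year, 3 * quarter + 1, 1)
    return start, next_start - timedelta(days=1)


def resolve_timeframe(phrase, today, dimension, fiscal_start_month=1):
    """Map a small set of date phrases to a Cube-style time dimension dict.

    Hypothetical helper: supported phrases and the fiscal-year default
    (calendar year) are assumptions for this sketch.
    """
    p = phrase.strip().lower()
    current_q = (today.month - 1) // 3 + 1
    match = re.fullmatch(r"q([1-4])", p)
    if p == "last quarter":
        if current_q == 1:
            start, end = quarter_range(today.year - 1, 4)
        else:
            start, end = quarter_range(today.year, current_q - 1)
    elif match:
        # A bare "Q3" is read as Q3 of the current calendar year.
        start, end = quarter_range(today.year, int(match.group(1)))
    elif p == "this fiscal year":
        start_year = today.year if today.month >= fiscal_start_month else today.year - 1
        start = date(start_year, fiscal_start_month, 1)
        end = date(start_year + 1, fiscal_start_month, 1) - timedelta(days=1)
    else:
        raise ValueError(f"Unsupported timeframe phrase: {phrase!r}")
    return {"dimension": dimension, "dateRange": [start.isoformat(), end.isoformat()]}


today = date(2026, 5, 8)
dim = "fact_unified_responses.created_at"  # assumed member name
print(resolve_timeframe("last quarter", today, dim))
print(resolve_timeframe("Q3", today, dim))
print(resolve_timeframe("this fiscal year", today, dim, fiscal_start_month=9))
```

Raising on unknown phrases rather than guessing fits the "refuses to lie about numbers" stance: an unresolved phrase is a natural trigger for a clarification question.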

Andrey Timonin

May 8, 01:05 PM

Aliaksandr Layuk

May 8, 03:07 PM

Aliaksei Ramanau

May 6, 08:13 PM

Quick update from *Heroes of AI and Magic*:

What's done:

* prepared a workplace for hacking in the new home office
* loaded extra product context from a company expert (kudos to Yauheniy Shalkevich)
* set up the repo and initialized the project context
* prepared the HLD and had it verified by a human and one more LLM
* set up a team of AI agents and started implementation

P.S. Fun facts:

* consumed 2.33 cups of coffee and 0.76l of beer
* spilled 0.66 cups onto the desk (fortunately, no serious damage)
* consumed Claude Code limits in 1.5 hours, requested Premium seat capability (thanks to Andrew Timonin for addressing the problem immediately)

Stay tuned :)

Aliaksandr Layuk

May 8, 03:05 PM

Denis Atyasov

May 10, 08:54 PM