Automating the Analyst: How We Built a Synthetic Research Pipeline to Benchmark the AI Market

Last year, HTEC Group published a comprehensive report on the state of AI code generators. It was a success, but it came with a heavy operational tax. Producing a report of that depth required a team of senior engineers and analysts spending weeks manually scouring documentation, testing tools, synthesizing feature matrices, and drafting executive summaries. 

When the time came to update the report for this quarter, we faced a choice: allocate that massive human effort again, or build a system that could do it for us. 

The goal was to build a tool that could autonomously research the software market, verify facts, and produce a publication-ready HTML report—all within a single day, with output quality scaling with the compute budget invested. 

By leveraging OpenAI’s infrastructure alongside Anthropic and Google models, we built a “Tool Researcher” that doesn’t just produce a simple report one might get from ChatGPT Deep Research; it performs rigorous, multi-stage investigation. The goal was not merely to “ask AI” to write a report—generative models are notorious for fluency without substance—but to engineer a system capable of verifiable, multi-step research, critical synthesis, and document composition. 

The result is a Python-based orchestration engine that produces a publication-ready, 30+ page HTML report with live visualizations and deep citations. It runs in under a day and costs approximately $150 in compute. What follows is an overview of the architecture that makes this possible, moving beyond the hype of “generative AI” into the mechanics of synthetic research. It is a look under the hood at how we engineered a synthetic analyst that ensures reliability, grounding, and depth. 

The Architecture of “Extended Thinking” 

Instead of a single prompt, our engine treats every specific feature comparison as a complex workflow. When the system investigates a feature—say, “Does Tool X support Customer Managed Encryption Keys (CMEK)?”—it spins up parallel execution threads. We can configure the system to use OpenAI, Anthropic, and Google Gemini simultaneously. 

  1. Drafting: Each thread uses web search tools to find live 2024-2025 documentation and drafts an initial answer. 
  2. Critique Loop: Before the answer is finalized, the model is forced to critique its own draft, hunting for missing citations, vague claims, or outdated information, and then revises it. 
  3. Consensus Merge: The engine takes the independent outputs from the different models and merges them. If one model found a hidden pricing detail and another found a specific API limitation, the merge step synthesizes a superset of the truth. 

This means every cell in our final comparison table isn’t just a guess; it is the survivor of a multi-agent debate. 
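In outline, the three steps can be sketched like this; `ask()` is a hypothetical stand-in for the real provider API calls (which would also have web search tools enabled), so this is a shape sketch, not the engine's actual code:

```python
import asyncio

# Hypothetical stand-in for a real LLM call (OpenAI / Anthropic / Gemini).
async def ask(model: str, prompt: str) -> str:
    return f"[{model}] answer to: {prompt[:40]}"

async def research_feature(question: str, models: list[str]) -> str:
    """One feature investigation: parallel threads, each drafting,
    self-critiquing, and revising, followed by a consensus merge."""
    async def one_thread(model: str) -> str:
        draft = await ask(model, f"Research with citations: {question}")
        critique = await ask(model, f"Critique for missing citations and "
                                    f"outdated claims:\n{draft}")
        return await ask(model, f"Revise using the critique:\n{draft}\n{critique}")

    # Run one draft/critique/revise thread per model, concurrently.
    revised = await asyncio.gather(*(one_thread(m) for m in models))
    # Consensus merge: synthesize a superset of the independent findings.
    return await ask(models[0], "Merge into one answer:\n" + "\n---\n".join(revised))

answer = asyncio.run(research_feature(
    "Does Tool X support CMEK?", ["gpt", "claude", "gemini"]))
```

In the real pipeline each `ask()` would carry the full system prompt and citation requirements; the point here is only the fan-out/fan-in shape of the workflow.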

We also experimented with an evolutionary-algorithm approach but ultimately decided against it: while it produced better results, the cost was too great to justify. The simpler multi-generation, multi-threaded map-reduce approach was sufficient to elicit palpable improvements in result quality. 

In standard LLM interactions, most users prompt a model and accept the first draft (even if a reasoning model was involved). For high-stakes research, this is insufficient. Our engine implements a dialectic process inspired by the scientific method. 

The Data Waterfall: From Micro-Facts to Macro-Strategy 

A major challenge in automated reporting is maintaining context. A detail found in the technical documentation of a specific tool needs to influence the high-level executive summary at the end of the report. 

We solved this with a strict data propagation hierarchy using structured XML tags. 

  1. Phase 1: Feature Extraction. The system iterates through the configuration (e.g., “Compliance,” “IDE Support”). It generates thousands of HTML fragments—individual table rows and “conclusion” blocks that cite specific URLs. 
  2. Phase 2: Signal Aggregation. As the agents research, they are instructed to emit `<data>` tags containing “strategic signals.” For example, while researching encryption, an agent might flag that “Tool A only supports EU residency in its Pro tier.” 
  3. Phase 3: Synthesis. These data blocks flow downstream. When the system generates the “Executive Overview” or “Security Findings” sections later, it is fed a compressed stream of these aggregated facts. 

This allows the final conclusion to reference specific details found hours earlier in the process, creating a report that feels cohesive rather than fragmented. 
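A minimal sketch of this propagation, assuming agent outputs are plain strings carrying `<data>...</data>` blocks (the tag name is from the pipeline; the helper functions are illustrative):

```python
import re

# Matches the ad-hoc <data> signal blocks embedded in agent outputs.
DATA_TAG = re.compile(r"<data>(.*?)</data>", re.DOTALL)

def extract_signals(agent_output: str) -> list[str]:
    """Pull the strategic signals out of one agent's raw output."""
    return [m.strip() for m in DATA_TAG.findall(agent_output)]

def build_synthesis_context(all_outputs: list[str], limit: int = 50) -> str:
    """Aggregate signals from every research task into one compressed
    stream to feed the executive-summary and conclusion phases."""
    signals: list[str] = []
    for output in all_outputs:
        signals.extend(extract_signals(output))
    # Deduplicate while preserving discovery order.
    unique = list(dict.fromkeys(signals))
    return "\n".join(f"- {s}" for s in unique[:limit])

ctx = build_synthesis_context([
    "<p>...</p><data>Tool A: EU residency only in Pro tier</data>",
    "<data>Tool A: EU residency only in Pro tier</data><data>Tool B: CMEK GA</data>",
])
```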

On a less technical level, the architecture follows a linear, accumulative data flow, mimicking how a human team would assemble a white paper: 

  1. Expansion & Taxonomy: The system first scans the market. If configured, it expands the software list and generates new analysis categories based on current industry trends, ensuring the report doesn’t suffer from “frozen scope.” 
  2. Feature-Level Research: This is the heavy lifting. The system executes hundreds of concurrent research tasks. It doesn’t just output a “Yes/No”; it generates a detailed HTML rationale for every data point, linking to official documentation or release notes. If evidence is missing, the system is instructed to return an explicit `N/A` rather than hallucinating a capability. 
  3. Domain & Technical Sweeps: Beyond the matrix, the system performs “sweeps.” It uses the OpenAI Code Interpreter (the “Python tool”) to perform weighted scoring analysis for specific domains (e.g., FinTech or Healthcare), calculating scores mathematically rather than relying on the model’s intuition. 
  4. Synthesis: Once the structured data is gathered, the system moves to narrative generation. It writes executive summaries, generates headings, and draws conclusions. Crucially, these narrative sections are grounded in the data collected in the previous steps. The model is not allowed to invent facts here; it must cite the structured data it has already generated. 
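The weighted scoring used in the sweeps boils down to straightforward arithmetic; here is a sketch under assumed 1-5 feature scores and hypothetical per-domain weights (in the pipeline itself, this arithmetic runs inside the Code Interpreter rather than in the model's head):

```python
def domain_score(feature_scores: dict[str, float],
                 weights: dict[str, float]) -> float:
    """Weighted average of feature scores for one domain, e.g. FinTech."""
    total_weight = sum(weights.get(f, 0.0) for f in feature_scores)
    if total_weight == 0:
        return 0.0
    weighted = sum(score * weights.get(f, 0.0)
                   for f, score in feature_scores.items())
    return round(weighted / total_weight, 2)

# Hypothetical weights: compliance dominates in a FinTech-flavored sweep.
fintech_weights = {"Compliance": 3.0, "Latency": 1.0, "IDE Support": 0.5}
score = domain_score({"Compliance": 4, "Latency": 3, "IDE Support": 5},
                     fintech_weights)
```

Running the scoring as actual code means the numbers are reproducible and auditable, rather than being the model's unverifiable intuition.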

Solving the “Hallucination” Problem: Grounding and Verification 

The system utilizes a custom XML-like tagging structure within the model outputs. We demand specific tags (e.g., `<table_row_html>`, `<conclusion_html>`, `<data>`). If a model fails to provide the exact structure or omits a required citation, the orchestrator rejects the output and triggers a structured retry loop, feeding the error back to the model for correction. 
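A simplified sketch of this reject-and-retry behavior, assuming a generic `call_model` callable and treating tags as simple paired markers (the real engine's validation is stricter than this):

```python
import re

def validate(output: str, required_tags: list[str]) -> list[str]:
    """Return a list of error messages; an empty list means the output passed."""
    errors = []
    for tag in required_tags:
        if not re.search(rf"<{tag}>.*?</{tag}>", output, re.DOTALL):
            errors.append(f"missing or unclosed <{tag}> block")
    return errors

def research_with_retries(call_model, prompt: str,
                          required_tags: list[str], max_tries: int = 3) -> str:
    for _ in range(max_tries):
        output = call_model(prompt)
        errors = validate(output, required_tags)
        if not errors:
            return output
        # Feed the structured errors back so the model can self-correct.
        prompt = (f"{prompt}\n\nYour previous output was rejected:\n"
                  + "\n".join(f"- {e}" for e in errors))
    raise RuntimeError("output failed validation after retries")
```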

Furthermore, we implemented a `<data>` tag propagation system. When the engine discovers a critical insight during the feature research phase (e.g., “Tool X deprecated this feature in Jan 2025”), it wraps that fact in a `<data>` block. These blocks are aggregated and passed forward to the executive summary and conclusion phases. This ensures that the high-level narrative is always synchronized with the low-level technical findings. 

For C-level executives, accuracy is non-negotiable. An AI tool that invents features is worse than useless; it is a liability. We engineered three layers of defense against hallucinations: 

  1. Mandatory Citation: The prompts strictly enforce that every substantive claim must be backed by an inline HTML anchor link to a 2024-2025 source. If the agent cannot find a link, it is instructed to mark the feature as `N/A`. We prefer a gap in the data over a lie. 
  2. Schema Enforcement: The system uses strict parsing. If an agent returns a table row with empty cells or missing rationales, the orchestrator rejects the output and triggers a retry loop with specific error instructions. 
  3. The “Evidence Dossier”: In the final HTML report, every checkmark in the comparison table is clickable. Clicking it opens a modal containing the “Conclusion HTML”—a mini-essay written by the agent justifying why it gave that score, complete with the links it found. This makes the report fully auditable.
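The mandatory-citation rule can be illustrated with a toy check like the following; the regex and the `enforce_citation` helper are simplifications for illustration, not the engine's actual code:

```python
import re

# Any cell without at least one inline source link is downgraded to an
# explicit N/A instead of being kept as an unsupported claim.
SOURCE_LINK = re.compile(r'<a href="https?://[^"]+"')

def enforce_citation(cell_html: str) -> str:
    """Keep the cell only if it carries at least one source link."""
    return cell_html if SOURCE_LINK.search(cell_html) else "N/A"
```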

The Refinement Agent: An Editor-in-Chief 

The most novel part of our pipeline is the final phase. Once the raw report is assembled, it is often visually inconsistent or verbose. Enter the Refinement Agent. 

This is not a writer; it is a software engineer. We load the entire HTML document into the context of a high-reasoning model. This agent essentially “looks” at the rendered report and issues “syscalls”—commands like `scroll`, `update`, `move`, or `delete`. 

It operates in a loop: 

  1. Investigation: It scans the document for formatting glitches, broken layouts, or conflicting JavaScript. 
  2. Fixing: It issues precise patch instructions to the underlying Python orchestrator to rewrite specific DOM elements. 

This agentic loop ensures the final output isn’t just a wall of text, but a polished, interactive HTML product with working charts (using Apache ECharts), sticky headers, and responsive layouts. 

Instead of trying to get the generation perfect in one shot, we built a specialized agent that acts as a human editor. This agent views the final document through a “scroll window” (managing token context limits). It iterates through the document, session by session, fixing layout issues, resolving JavaScript variable collisions, consolidating CSS, and sharpening the prose. It operates like a developer debugging code, issuing patches to the document object model (DOM) until the report meets a rigorous quality standard. 
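Under some simplifying assumptions, the scroll-window command loop can be sketched as follows; the command names mirror the ones above, while `ask_editor` and the command format are purely illustrative:

```python
WINDOW = 40  # lines of rendered HTML shown to the model per "scroll window"

def refine(html_lines: list[str], ask_editor) -> list[str]:
    """Walk the document window by window, applying the editor agent's
    commands until it has scrolled past the end."""
    pos = 0
    while pos < len(html_lines):
        window = "\n".join(html_lines[pos:pos + WINDOW])
        # ask_editor stands in for the high-reasoning model; it returns one
        # command per call, e.g. {"op": "update", "line": 3, "html": "..."}.
        cmd = ask_editor(window)
        if cmd["op"] == "scroll":
            pos += WINDOW  # nothing to fix here; advance the window
        elif cmd["op"] == "update":
            html_lines[pos + cmd["line"]] = cmd["html"]
        elif cmd["op"] == "delete":
            html_lines.pop(pos + cmd["line"])
    return html_lines
```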

In addition, we added a human-in-the-loop component in which the same refinement agent is driven by a human supervisor: the operator names the issue to investigate and fix, the agent runs its two-step refinement loop, and once the error is resolved it returns to the operator for review. This cycle repeats until the operator is satisfied. 

Interestingly, we found that when the preceding steps are properly configured, there is rarely any need for human intervention; in practice we ended up omitting both the automated and the human-in-the-loop versions of the refinement step entirely, as the full report already looked polished enough without them.

Beyond Software Generators

While we built this to benchmark code generators, the architecture is content-agnostic. The entire process is driven by a `config.json` file. 

By simply changing the list of software (e.g., to “Salesforce vs. HubSpot”) and the feature definitions (e.g., “Lead Scoring,” “GDPR Compliance”), the exact same pipeline will produce a deep-dive market analysis for CRM platforms. We have effectively built a “researcher-in-a-box” that scales horizontally across domains—from FinTech compliance tools to cloud infrastructure providers. 

The system is not even hardcoded to analyze code generators. It is a generalized research engine defined entirely by a configuration contract (`config.json`) and a set of natural language prompts stored as part of the system’s prompt library. These serve as the DNA of the output. They define the taxonomy of the research—the software to be compared, the categories of analysis (e.g., “Security,” “Latency,” “Context Awareness”), and the specific features within those categories. 

Crucially, the configuration dictates the data types for every feature—whether a metric is a boolean, a numerical count, a qualitative 1-5 rating, or an entirely different metric defined by the user in plain English. This forces the underlying models to adhere to a strict schema, transforming unstructured web data into a structured matrix that allows for objective comparison. If we change the configuration to analyze “Cloud ERP Systems” or “Biotech Simulation Tools,” the engine adapts without a single line of code changing. 
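As an illustration, a configuration in the spirit described above might look like the following; this is a hypothetical shape, not the engine's actual schema:

```json
{
  "software": ["Tool X", "Tool Y"],
  "categories": [
    {
      "name": "Security",
      "features": [
        {"name": "CMEK support", "type": "boolean"},
        {"name": "Data residency regions", "type": "count"},
        {"name": "Audit log quality", "type": "rating_1_5"},
        {"name": "Compliance posture", "type": "custom",
         "definition": "Summarize SOC 2 / ISO 27001 status in one sentence"}
      ]
    }
  ],
  "expand_software_list": true,
  "generate_new_categories": true
}
```

Swapping the `software` list and `categories` is all it takes to retarget the pipeline at a different market.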

In fact, we could apply the same approach to compare items entirely outside the domain of software and hardware, though for our purposes this was not tested.

Development Approach and Lessons Learned 

Notably, the whole system was built by a single Forward Deployed Engineer, with the help of GitHub Copilot and OpenAI Codex in VS Code. 

We took a spec-driven development approach, maintaining a set of specification prompts from the start of the project, organized into two major levels of documentation: 

  1. Project-specific prompt library: a text-file repository of all the major instructions, system specifications, architecture outlines, and bug-fix notes. 
  2. In-project documentation: we instructed all agents (both Copilot and Codex) to maintain two files—`README.md` as user-facing documentation and `AGENTS.md` as copilot-facing documentation with current and ongoing implementation details—updating both after every change and using them as the single source of truth for the project’s specification. The project-specific prompt library was kept in the source repository as a backup, but these two files clearly described the engine and its implementation state to any copilot running on the code base. 

With only occasional interventions from the engineer, and with educated guidance of the models against a clear specification, we were able to develop and test the whole system and generate two reports within a couple of developer-weeks. 

Although we could write at length about the many lessons learned from this project, a few important points stand out: 

  1. The development process itself cost more than the actual report generation—testing different models, auditing the Extended Thinking agent, and tuning the algorithm’s hyperparameters cost well over $200, but still under $1,000 (accounting only for infrastructure and excluding developer time). 
  2. Error recovery should be built in during the early stages of development—API issues, network issues, and service availability problems (sometimes even guardrail false positives) can cause the application to fail at a late stage, incurring retry costs. 
  3. State saving is imperative—in long-running multi-stage pipelines such as this one, it is crucial to save the in-memory application state after every step, as any rerun raises both testing and development costs. Algorithm state persistence (current in-memory document fragments, stages, accumulated data) must be implemented from the very beginning, rather than later in the project life cycle, to reduce costs. 
  4. Small-scale testing is not always viable—the smallest models, such as `gpt-5-nano`, often fail to follow instructions and produce incorrectly formatted outputs (despite being encouraged with specification and examples in the developer prompt), causing failures that would not otherwise occur. 

We found that small and tiny models should only be used for trivial tasks; for long-running agentic tasks, they ought not to be used even for testing (with the caveat that, for example, `gpt-5-mini` on `high` reasoning might be on par with `gpt-5` on `low` and acceptable for testing).

The Result and Conclusion 

Developing AI-powered software with the help of AI certainly presents novel challenges, and the developer profile is likely to change significantly in the coming years, as the skill set necessary to achieve reliable results goes well beyond software engineering and software architecture. We find that verbal fluency, clarity of specification, planning, and communication skills all need to converge, a relatively rare combination in software engineering. 

Developing a complex automation system requires understanding the inner workings of the models used both for development (as copilots) and as part of the application (the agents’ core), the details of the job being automated and how it fits within the broader context of the client. 

For this reason, I believe that we will need to move beyond traditional software development into what I call “engineering intelligent systems”. I have written extensively about this in my Engineering Intelligence, Minds and Cognition treatise, outlining how we need to take a more holistic approach—taking into consideration not just model intelligence, but also company intelligence, infrastructural intelligence and individual contributor—developer, project manager, designer, proofreader, etc.—intelligence. 
