Automating the Analyst: How We Built a Synthetic Research Pipeline to Benchmark the AI Market

Last year, HTEC Group published a comprehensive report on the state of AI code generators. It was a success, but it came with a heavy operational tax. Producing a report of that depth required a team of senior engineers and analysts spending weeks manually scouring documentation, testing tools, synthesizing feature matrices, and drafting executive summaries. 

When the time came to update the report for this quarter, we faced a choice: allocate that massive human effort again, or build a system that could do it for us. 

The goal was to build a tool that could autonomously research the software market, verify facts, and produce a publication-ready HTML report—all within a single day, with output quality scaling with the compute budget invested. 

By leveraging OpenAI’s infrastructure alongside Anthropic and Google models, we built a “Tool Researcher” that doesn’t just produce a simple report one might get from ChatGPT Deep Research; it performs rigorous, multi-stage investigation. The goal was not merely to “ask AI” to write a report—generative models are notorious for fluency without substance—but to engineer a system capable of verifiable, multi-step research, critical synthesis, and document composition. 

The result is a Python-based orchestration engine that produces a publication-ready, 30+ page HTML report with live visualizations and deep citations. It runs in under a day and costs approximately $150 in compute. What follows is an overview of the architecture that makes this possible, moving beyond the hype of “generative AI” into the mechanics of synthetic research. It is a look under the hood at how we engineered a synthetic analyst that ensures reliability, grounding, and depth. 

The Architecture of “Extended Thinking” 

Instead of a single prompt, our engine treats every specific feature comparison as a complex workflow. When the system investigates a feature—say, “Does Tool X support Customer Managed Encryption Keys (CMEK)?”—it spins up parallel execution threads. We can configure the system to use OpenAI, Anthropic, and Google Gemini simultaneously. 

  1. Drafting: Each thread uses web search tools to find live 2024-2025 documentation and drafts an initial answer. 
  2. Critique Loop: Before the answer is finalized, the model is forced to critique its own draft, hunting for missing citations, vague claims, or outdated information, and then revises it. 
  3. Consensus Merge: The engine takes the independent outputs from the different models and merges them. If one model found a hidden pricing detail and another found a specific API limitation, the merge step synthesizes a superset of the truth. 

This means every cell in our final comparison table isn’t just a guess; it is the survivor of a multi-agent debate. 
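In outline, the three steps can be sketched like this; `ask()` is a hypothetical stand-in for the real provider API calls (which would also have web search tools enabled), so this is a shape sketch, not the engine's actual code:

```python
import asyncio

# Hypothetical stand-in for a real LLM call (OpenAI / Anthropic / Gemini).
async def ask(model: str, prompt: str) -> str:
    return f"[{model}] answer to: {prompt[:40]}"

async def research_feature(question: str, models: list[str]) -> str:
    """One feature investigation: parallel threads, each drafting,
    self-critiquing, and revising, followed by a consensus merge."""
    async def one_thread(model: str) -> str:
        draft = await ask(model, f"Research with citations: {question}")
        critique = await ask(model, f"Critique for missing citations and "
                                    f"outdated claims:\n{draft}")
        return await ask(model, f"Revise using the critique:\n{draft}\n{critique}")

    # Run one draft/critique/revise thread per model, concurrently.
    revised = await asyncio.gather(*(one_thread(m) for m in models))
    # Consensus merge: synthesize a superset of the independent findings.
    return await ask(models[0], "Merge into one answer:\n" + "\n---\n".join(revised))

answer = asyncio.run(research_feature(
    "Does Tool X support CMEK?", ["gpt", "claude", "gemini"]))
```

In the real pipeline each `ask()` would carry the full system prompt and citation requirements; the point here is only the fan-out/fan-in shape of the workflow.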

We also experimented with an evolutionary-algorithm approach but ultimately decided against it: while it produced better results, the cost was too great to justify. The simpler multi-generation, multi-threaded map-reduce approach was sufficient to elicit palpable improvements in result quality. 

In standard LLM interactions, most users prompt a model and accept the first draft (even if a reasoning model was involved). For high-stakes research, this is insufficient. Our engine implements a dialectic process inspired by the scientific method. 

The Data Waterfall: From Micro-Facts to Macro-Strategy 

A major challenge in automated reporting is maintaining context. A detail found in the technical documentation of a specific tool needs to influence the high-level executive summary at the end of the report. 

We solved this with a strict data propagation hierarchy using structured XML tags. 

  1. Phase 1: Feature Extraction. The system iterates through the configuration (e.g., “Compliance,” “IDE Support”). It generates thousands of HTML fragments—individual table rows and “conclusion” blocks that cite specific URLs. 
  2. Phase 2: Signal Aggregation. As the agents research, they are instructed to emit `<data>` tags containing “strategic signals.” For example, while researching encryption, an agent might flag that “Tool A only supports EU residency in its Pro tier.” 
  3. Phase 3: Synthesis. These data blocks flow downstream. When the system generates the “Executive Overview” or “Security Findings” sections later, it is fed a compressed stream of these aggregated facts. 

This allows the final conclusion to reference specific details found hours earlier in the process, creating a report that feels cohesive rather than fragmented. 
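A minimal sketch of this propagation, assuming agent outputs are plain strings carrying `<data>...</data>` blocks (the tag name is from the pipeline; the helper functions are illustrative):

```python
import re

# Matches the ad-hoc <data> signal blocks embedded in agent outputs.
DATA_TAG = re.compile(r"<data>(.*?)</data>", re.DOTALL)

def extract_signals(agent_output: str) -> list[str]:
    """Pull the strategic signals out of one agent's raw output."""
    return [m.strip() for m in DATA_TAG.findall(agent_output)]

def build_synthesis_context(all_outputs: list[str], limit: int = 50) -> str:
    """Aggregate signals from every research task into one compressed
    stream to feed the executive-summary and conclusion phases."""
    signals: list[str] = []
    for output in all_outputs:
        signals.extend(extract_signals(output))
    # Deduplicate while preserving discovery order.
    unique = list(dict.fromkeys(signals))
    return "\n".join(f"- {s}" for s in unique[:limit])

ctx = build_synthesis_context([
    "<p>...</p><data>Tool A: EU residency only in Pro tier</data>",
    "<data>Tool A: EU residency only in Pro tier</data><data>Tool B: CMEK GA</data>",
])
```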

On a less technical level, the architecture follows a linear, accumulative data flow, mimicking how a human team would assemble a white paper: 

  1. Expansion & Taxonomy: The system first scans the market. If configured, it expands the software list and generates new analysis categories based on current industry trends, ensuring the report doesn’t suffer from “frozen scope.” 
  2. Feature-Level Research: This is the heavy lifting. The system executes hundreds of concurrent research tasks. It doesn’t just output a “Yes/No”; it generates a detailed HTML rationale for every data point, linking to official documentation or release notes. If evidence is missing, the system is instructed to return an explicit `N/A` rather than hallucinating a capability. 
  3. Domain & Technical Sweeps: Beyond the matrix, the system performs “sweeps.” It uses the OpenAI Code Interpreter (the “Python tool”) to perform weighted scoring analysis for specific domains (e.g., FinTech or Healthcare), calculating scores mathematically rather than relying on the model’s intuition. 
  4. Synthesis: Once the structured data is gathered, the system moves to narrative generation. It writes executive summaries, generates headings, and draws conclusions. Crucially, these narrative sections are grounded in the data collected in the previous steps. The model is not allowed to invent facts here; it must cite the structured data it has already generated. 
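The weighted scoring used in the sweeps boils down to straightforward arithmetic; here is a sketch under assumed 1-5 feature scores and hypothetical per-domain weights (in the pipeline itself, this arithmetic runs inside the Code Interpreter rather than in the model's head):

```python
def domain_score(feature_scores: dict[str, float],
                 weights: dict[str, float]) -> float:
    """Weighted average of feature scores for one domain, e.g. FinTech."""
    total_weight = sum(weights.get(f, 0.0) for f in feature_scores)
    if total_weight == 0:
        return 0.0
    weighted = sum(score * weights.get(f, 0.0)
                   for f, score in feature_scores.items())
    return round(weighted / total_weight, 2)

# Hypothetical weights: compliance dominates in a FinTech-flavored sweep.
fintech_weights = {"Compliance": 3.0, "Latency": 1.0, "IDE Support": 0.5}
score = domain_score({"Compliance": 4, "Latency": 3, "IDE Support": 5},
                     fintech_weights)
```

Running the scoring as actual code means the numbers are reproducible and auditable, rather than being the model's unverifiable intuition.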

Solving the “Hallucination” Problem: Grounding and Verification 

The system utilizes a custom XML-like tagging structure within the model outputs. We demand specific tags (e.g., `<table_row_html>`, `<conclusion_html>`, `<data>`). If a model fails to provide the exact structure or omits a required citation, the orchestrator rejects the output and triggers a structured retry loop, feeding the error back to the model for correction. 
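A simplified sketch of this reject-and-retry behavior, assuming a generic `call_model` callable and treating tags as simple paired markers (the real engine's validation is stricter than this):

```python
import re

def validate(output: str, required_tags: list[str]) -> list[str]:
    """Return a list of error messages; an empty list means the output passed."""
    errors = []
    for tag in required_tags:
        if not re.search(rf"<{tag}>.*?</{tag}>", output, re.DOTALL):
            errors.append(f"missing or unclosed <{tag}> block")
    return errors

def research_with_retries(call_model, prompt: str,
                          required_tags: list[str], max_tries: int = 3) -> str:
    for _ in range(max_tries):
        output = call_model(prompt)
        errors = validate(output, required_tags)
        if not errors:
            return output
        # Feed the structured errors back so the model can self-correct.
        prompt = (f"{prompt}\n\nYour previous output was rejected:\n"
                  + "\n".join(f"- {e}" for e in errors))
    raise RuntimeError("output failed validation after retries")
```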

Furthermore, we implemented a `<data>` tag propagation system. When the engine discovers a critical insight during the feature research phase (e.g., “Tool X deprecated this feature in Jan 2025”), it wraps that fact in a `<data>` block. These blocks are aggregated and passed forward to the executive summary and conclusion phases. This ensures that the high-level narrative is always synchronized with the low-level technical findings. 

For C-level executives, accuracy is non-negotiable. An AI tool that invents features is worse than useless; it is a liability. We engineered three layers of defense against hallucinations: 

  1. Mandatory Citation: The prompts strictly enforce that every substantive claim must be backed by an inline HTML anchor link to a 2024-2025 source. If the agent cannot find a link, it is instructed to mark the feature as `N/A`. We prefer a gap in the data over a lie. 
  2. Schema Enforcement: The system uses strict parsing. If an agent returns a table row with empty cells or missing rationales, the orchestrator rejects the output and triggers a retry loop with specific error instructions. 
  3. The “Evidence Dossier”: In the final HTML report, every checkmark in the comparison table is clickable. Clicking it opens a modal containing the “Conclusion HTML”—a mini-essay written by the agent justifying why it gave that score, complete with the links it found. This makes the report fully auditable.
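The mandatory-citation rule can be illustrated with a toy check like the following; the regex and the `enforce_citation` helper are simplifications for illustration, not the engine's actual code:

```python
import re

# Any cell without at least one inline source link is downgraded to an
# explicit N/A instead of being kept as an unsupported claim.
SOURCE_LINK = re.compile(r'<a href="https?://[^"]+"')

def enforce_citation(cell_html: str) -> str:
    """Keep the cell only if it carries at least one source link."""
    return cell_html if SOURCE_LINK.search(cell_html) else "N/A"
```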

The Refinement Agent: An Editor-in-Chief 

The most novel part of our pipeline is the final phase. Once the raw report is assembled, it is often visually inconsistent or verbose. Enter the Refinement Agent. 

This is not a writer; it is a software engineer. We load the entire HTML document into the context of a high-reasoning model. This agent essentially “looks” at the rendered report and issues “syscalls”—commands like `scroll`, `update`, `move`, or `delete`. 

It operates in a loop: 

  1. Investigation: It scans the document for formatting glitches, broken layouts, or conflicting JavaScript. 
  2. Fixing: It issues precise patch instructions to the underlying Python orchestrator to rewrite specific DOM elements. 

This agentic loop ensures the final output isn’t just a wall of text, but a polished, interactive HTML product with working charts (using Apache ECharts), sticky headers, and responsive layouts. 

Instead of trying to get the generation perfect in one shot, we built a specialized agent that acts as a human editor. This agent views the final document through a “scroll window” (managing token context limits). It iterates through the document, session by session, fixing layout issues, resolving JavaScript variable collisions, consolidating CSS, and sharpening the prose. It operates like a developer debugging code, issuing patches to the document object model (DOM) until the report meets a rigorous quality standard. 
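Under some simplifying assumptions, the scroll-window command loop can be sketched as follows; the command names mirror the ones above, while `ask_editor` and the command format are purely illustrative:

```python
WINDOW = 40  # lines of rendered HTML shown to the model per "scroll window"

def refine(html_lines: list[str], ask_editor) -> list[str]:
    """Walk the document window by window, applying the editor agent's
    commands until it has scrolled past the end."""
    pos = 0
    while pos < len(html_lines):
        window = "\n".join(html_lines[pos:pos + WINDOW])
        # ask_editor stands in for the high-reasoning model; it returns one
        # command per call, e.g. {"op": "update", "line": 3, "html": "..."}.
        cmd = ask_editor(window)
        if cmd["op"] == "scroll":
            pos += WINDOW  # nothing to fix here; advance the window
        elif cmd["op"] == "update":
            html_lines[pos + cmd["line"]] = cmd["html"]
        elif cmd["op"] == "delete":
            html_lines.pop(pos + cmd["line"])
    return html_lines
```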

In addition, we added a human-in-the-loop component in which the same refinement agent is driven by a human supervisor: the operator names the issue to investigate and fix, the agent runs its two-step refinement loop, and once the error is resolved it returns to the operator for review. This cycle repeats until the operator is satisfied. 

Interestingly, we found that when the preceding steps are properly configured, there is rarely any need for human intervention; in practice we ended up omitting both the automated and the human-in-the-loop versions of the refinement step entirely, as the full report already looked polished enough without them.

Beyond Software Generators

While we built this to benchmark code generators, the architecture is content-agnostic. The entire process is driven by a `config.json` file. 

By simply changing the list of software (e.g., to “Salesforce vs. HubSpot”) and the feature definitions (e.g., “Lead Scoring,” “GDPR Compliance”), the exact same pipeline will produce a deep-dive market analysis for CRM platforms. We have effectively built a “researcher-in-a-box” that scales horizontally across domains—from FinTech compliance tools to cloud infrastructure providers. 

The system is not even hardcoded to analyze code generators. It is a generalized research engine defined entirely by a configuration contract (`config.json`) and a set of natural language prompts stored as part of the system’s prompt library. These serve as the DNA of the output. They define the taxonomy of the research—the software to be compared, the categories of analysis (e.g., “Security,” “Latency,” “Context Awareness”), and the specific features within those categories. 

Crucially, the configuration dictates the data types for every feature—whether a metric is a boolean, a numerical count, a qualitative 1-5 rating, or an entirely different metric defined by the user in plain English. This forces the underlying models to adhere to a strict schema, transforming unstructured web data into a structured matrix that allows for objective comparison. If we change the configuration to analyze “Cloud ERP Systems” or “Biotech Simulation Tools,” the engine adapts without a single line of code changing. 
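As an illustration, a configuration in the spirit described above might look like the following; this is a hypothetical shape, not the engine's actual schema:

```json
{
  "software": ["Tool X", "Tool Y"],
  "categories": [
    {
      "name": "Security",
      "features": [
        {"name": "CMEK support", "type": "boolean"},
        {"name": "Data residency regions", "type": "count"},
        {"name": "Audit log quality", "type": "rating_1_5"},
        {"name": "Compliance posture", "type": "custom",
         "definition": "Summarize SOC 2 / ISO 27001 status in one sentence"}
      ]
    }
  ],
  "expand_software_list": true,
  "generate_new_categories": true
}
```

Swapping the `software` list and `categories` is all it takes to retarget the pipeline at a different market.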

In fact, we could apply the same approach to compare items entirely outside the domain of software and hardware, though for our purposes this was not tested.

Development Approach and Lessons Learned 

Notably, the whole system was built by a single Forward Deployed Engineer, with the help of GitHub Copilot and OpenAI Codex in VS Code. 

We took a spec-driven development approach, maintaining a set of specification prompts from the start of the project, organized into two major levels of documentation: 

  1. Project-specific prompt library: a text-file repository of all the major instructions, system specifications, architecture outlines, and bug-fix notes. 
  2. In-project documentation: we instructed all agents (both Copilot and Codex) to maintain two files—`README.md` as user-facing documentation and `AGENTS.md` as copilot-facing documentation with current and ongoing implementation details—updating both after every change and using them as the single source of truth for the project’s specification. The project-specific prompt library was kept in the source repository as a backup, but these two files clearly described the engine and its implementation state to any copilot running on the code base. 

With only occasional interventions from the engineer, and with educated guidance of the models against a clear specification, we were able to develop and test the whole system and generate two reports within a couple of developer-weeks. 

Although we could write at length about the many lessons learned from this project, a few important points stand out: 

  1. The development process itself cost more than the actual report generation—testing different models, auditing the Extended Thinking agent, and tuning the algorithm’s hyperparameters cost well over $200, but still under $1,000 (accounting only for infrastructure and excluding developer time). 
  2. Error recovery should be built in during the early stages of development—API issues, network issues, and service availability problems (sometimes even guardrail false positives) can cause the application to fail at a late stage, incurring retry costs. 
  3. State saving is imperative—in long-running multi-stage pipelines such as this one, it is crucial to save the in-memory application state after every step, as any rerun raises both testing and development costs. Algorithm state persistence (current in-memory document fragments, stages, accumulated data) must be implemented from the very beginning, rather than later in the project life cycle, to reduce costs. 
  4. Small-scale testing is not always viable—the smallest models, such as `gpt-5-nano`, often fail to follow instructions and produce incorrectly formatted outputs (despite being encouraged with specification and examples in the developer prompt), causing failures that would not otherwise occur. 

We found that small and tiny models should only be used for trivial tasks; for long-running agentic tasks, they ought not to be used even for testing (with the caveat that, for example, `gpt-5-mini` on `high` reasoning might be on par with `gpt-5` on `low` and acceptable for testing).

The Result and Conclusion 

Developing AI-powered software with the help of AI certainly presents novel challenges, and the developer profile is likely to change significantly in the coming years, as the skill set necessary to achieve reliable results goes well beyond software engineering and software architecture. We find that verbal fluency, clarity of specification, planning, and communication skills all need to converge, a relatively rare combination in software engineering. 

Developing a complex automation system requires understanding the inner workings of the models used both for development (as copilots) and as part of the application (the agents’ core), the details of the job being automated and how it fits within the broader context of the client. 

For this reason, I believe that we will need to move beyond traditional software development into what I call “engineering intelligent systems”. I have written extensively about this in my Engineering Intelligence, Minds and Cognition treatise, outlining how we need to take a more holistic approach—taking into consideration not just model intelligence, but also company intelligence, infrastructural intelligence and individual contributor—developer, project manager, designer, proofreader, etc.—intelligence. 
