The AI tooling landscape in early 2026 is overwhelming. Every week brings a new framework, a new model, a new vector database claiming to be the fastest, and a new orchestration layer promising to simplify everything. For engineering leaders trying to build production AI systems, the signal-to-noise ratio has never been worse. We know because we have been navigating it ourselves -- shipping AI-powered products and agent systems for clients across fintech, energy, and government while the ground shifts underneath us every quarter.
This article is our attempt to cut through that noise. We are sharing the specific tools, models, and architectural decisions that make up our production AI stack -- what we use, why we chose it, and what we deliberately chose not to use. This is not a theoretical framework or a vendor comparison chart. It is the stack running in production today, serving real users, handling real edge cases, and costing real money. We are sharing it because transparency builds trust, and because we wish more teams would do the same. When vendors publish benchmarks, they optimize for the demo. When practitioners share their stacks, they optimize for honesty.
The Model Layer: Choosing Your LLMs
We run a multi-model strategy -- not because it is trendy, but because no single model is the right choice for every task in a production system. Getting the model strategy wrong means either overspending on capabilities you do not need or under-delivering on quality that users expect.
Claude: Our Primary Reasoning Model
Anthropic's Claude is our primary model for complex reasoning, long-context analysis, and nuanced instruction following. We use it for agent orchestration decisions, document analysis across 50-100 page contracts, code generation and review, and any task where following detailed system prompts precisely matters more than raw speed. Claude's extended thinking capability is particularly valuable for agent systems -- when an agent needs to plan a multi-step workflow, the quality difference versus other models is measurable. We also rely on Claude's structured output reliability. Production systems cannot tolerate malformed JSON, and for regulated industries this reliability is a hard requirement.
GPT-4o and Open-Source Models
OpenAI's GPT-4o serves as our fallback and our choice for multimodal tasks involving image analysis and complex function calling patterns. We maintain it not as a hedge against vendor lock-in -- though that is a benefit -- but because certain tasks genuinely perform better on it. Pretending one model wins everywhere is ideology, not engineering.
For cost-sensitive, high-volume tasks, we deploy open-source models -- primarily Meta's Llama 3 and Mistral. These handle classification, entity extraction, and simple summarization where frontier model costs are unjustifiable. A classification task running 50,000 times per day costs roughly $100 per month on Llama 3 versus $3,000 on Claude. The quality difference for binary classification is negligible; the cost difference is not. We self-host using vLLM for inference serving, giving us control over latency, availability, and data residency.
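To make the cascading concrete, here is a minimal sketch of task-based model routing. The task categories, model labels, and cost figures are illustrative, not our exact production values; the point is that routing is an explicit table, not an afterthought.

```python
# A minimal sketch of task-based model routing. Task categories, model
# labels, and per-token costs below are illustrative, not production values.
from dataclasses import dataclass

@dataclass
class ModelChoice:
    name: str
    approx_cost_per_1k_input_tokens: float  # USD, illustrative figures only

# Cheap, high-volume tasks go to self-hosted open-source models;
# reasoning-heavy and multimodal tasks go to frontier models.
ROUTING_TABLE = {
    "classification": ModelChoice("llama-3-8b (self-hosted vLLM)", 0.0002),
    "entity_extraction": ModelChoice("llama-3-8b (self-hosted vLLM)", 0.0002),
    "simple_summarization": ModelChoice("mistral-7b (self-hosted vLLM)", 0.0002),
    "document_analysis": ModelChoice("claude (frontier)", 0.003),
    "agent_planning": ModelChoice("claude (frontier)", 0.003),
    "image_analysis": ModelChoice("gpt-4o (frontier)", 0.0025),
}

def route(task_type: str) -> ModelChoice:
    """Pick a model for a task; default to the frontier model when unsure."""
    return ROUTING_TABLE.get(task_type, ROUTING_TABLE["document_analysis"])

if __name__ == "__main__":
    choice = route("classification")
    print(f"classification -> {choice.name}")
```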
Orchestration: Wiring Agents Together
The orchestration layer is what turns individual model calls into coherent agent workflows. It manages state, routes decisions, handles tool calls, and recovers from failures. Getting this layer right is the difference between a demo that impresses and a system that works at 3 AM on a Saturday.
LangGraph for Agent Workflows
We use LangGraph as our primary orchestration layer. It models agent workflows as directed graphs where nodes represent actions and edges represent conditional transitions. The key advantage is checkpointing -- LangGraph persists the full agent state at every node, enabling replay of failed executions from the exact point of failure, human-in-the-loop approval at any step, and complete audit trails for compliance.
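A simplified sketch of what that looks like in LangGraph is below: a two-node graph compiled with a checkpointer so state is persisted after every node. The node logic is stubbed, the in-memory checkpointer stands in for a database-backed one, and API details vary across LangGraph versions.

```python
# A simplified LangGraph sketch: nodes as actions, a checkpointer persisting
# state at every step. Node bodies are stubs; API details vary by version.
from typing import TypedDict
from langgraph.graph import StateGraph, START, END
from langgraph.checkpoint.memory import MemorySaver

class AgentState(TypedDict):
    question: str
    plan: str
    answer: str

def plan_step(state: AgentState) -> dict:
    # In production this would be an LLM call producing a multi-step plan.
    return {"plan": f"Outline the steps needed to answer: {state['question']}"}

def answer_step(state: AgentState) -> dict:
    return {"answer": f"(answer generated from plan: {state['plan']})"}

builder = StateGraph(AgentState)
builder.add_node("plan", plan_step)
builder.add_node("answer", answer_step)
builder.add_edge(START, "plan")
builder.add_edge("plan", "answer")
builder.add_edge("answer", END)

# The checkpointer saves state after every node; a database-backed saver
# replaces the in-memory one in production for replay and audit trails.
graph = builder.compile(checkpointer=MemorySaver())

config = {"configurable": {"thread_id": "demo-thread"}}
result = graph.invoke({"question": "Summarize the contract"}, config)
```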
Custom Orchestration for Critical Paths
For production-critical paths -- payment processing agents, security-sensitive workflows -- we use custom TypeScript state machines rather than a framework. Frameworks add abstraction layers, and abstraction layers add failure modes. When a workflow processes financial transactions, every line of orchestration code should be explicit, testable, and free from third-party dependency updates that could change behavior. It is more code than LangGraph, but the trade-off is worth it where reliability trumps development speed.
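The production implementations are TypeScript; the Python sketch below only illustrates the pattern itself: every state and every allowed transition is enumerated, and anything unexpected fails loudly instead of being absorbed by a framework.

```python
# A language-agnostic sketch of an explicit state machine for a critical
# workflow (our production versions are TypeScript, not Python). States and
# transitions are illustrative; the point is that nothing is implicit.
from enum import Enum, auto

class PaymentState(Enum):
    RECEIVED = auto()
    VALIDATED = auto()
    AUTHORIZED = auto()
    SETTLED = auto()
    REJECTED = auto()

# Every legal transition is written down; there is no framework magic.
TRANSITIONS = {
    PaymentState.RECEIVED: {PaymentState.VALIDATED, PaymentState.REJECTED},
    PaymentState.VALIDATED: {PaymentState.AUTHORIZED, PaymentState.REJECTED},
    PaymentState.AUTHORIZED: {PaymentState.SETTLED, PaymentState.REJECTED},
}

def transition(current: PaymentState, nxt: PaymentState) -> PaymentState:
    """Move to the next state or raise; invalid transitions never pass silently."""
    if nxt not in TRANSITIONS.get(current, set()):
        raise ValueError(f"illegal transition {current.name} -> {nxt.name}")
    return nxt

state = PaymentState.RECEIVED
state = transition(state, PaymentState.VALIDATED)
state = transition(state, PaymentState.AUTHORIZED)
```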
Vector Databases and Embeddings
We use three vector databases for three deployment contexts. Pinecone is our default for cloud-native deployments -- managed, scalable, with namespace-based tenant isolation. pgvector is our choice when the application already runs on PostgreSQL, keeping vectors alongside relational data and eliminating a separate database to operate. Weaviate is deployed for on-premise clients -- government agencies and financial institutions with strict data residency -- running containerized within their infrastructure with native hybrid search.
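For the pgvector case, a similarity query is just SQL next to the rest of the application's data. The sketch below assumes a hypothetical documents table with a vector column; table name, column names, and dimensions are placeholders.

```python
# A minimal pgvector similarity-search sketch using psycopg (v3), assuming a
# table like: documents(id, content, embedding vector(1536)). The schema and
# connection string are placeholders, not our production setup.
import psycopg

def to_pgvector(vec: list[float]) -> str:
    """Serialize a Python list into pgvector's '[x,y,z]' literal format."""
    return "[" + ",".join(f"{v:.6f}" for v in vec) + "]"

def search(conn: psycopg.Connection, query_embedding: list[float], k: int = 5):
    # '<=>' is pgvector's cosine-distance operator; smaller means more similar.
    return conn.execute(
        """
        SELECT id, content
        FROM documents
        ORDER BY embedding <=> %s::vector
        LIMIT %s
        """,
        (to_pgvector(query_embedding), k),
    ).fetchall()

# Usage (connection string is a placeholder):
# with psycopg.connect("postgresql://user:pass@localhost/appdb") as conn:
#     results = search(conn, query_embedding=[0.0] * 1536)
```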
For embeddings, OpenAI's text-embedding-3-large is our default for English applications. For multilingual work -- a significant portion of our projects across Latin America and Europe -- Cohere's embed-multilingual-v3.0 outperforms alternatives on cross-language retrieval. For on-premise deployments, we use open-source models like BGE-large and E5-mistral, running on the same GPU instances as our LLMs to keep the entire pipeline self-contained.
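Generating embeddings with the default model is a short call to the provider SDK; the sketch below omits batching, retries, and caching for brevity.

```python
# A minimal sketch of generating embeddings with OpenAI's Python SDK, using
# the default model mentioned above. Batching and retry logic are omitted.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def embed(texts: list[str]) -> list[list[float]]:
    response = client.embeddings.create(
        model="text-embedding-3-large",
        input=texts,
    )
    # Embeddings come back in the same order as the input batch.
    return [item.embedding for item in response.data]
```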
RAG Pipeline: From Documents to Answers
The quality of a RAG system depends far more on the retrieval pipeline than on the generation model -- a frontier model given bad context produces bad answers just as confidently as it produces good ones.
Document-Type-Aware Chunking
We abandoned universal chunking strategies early. Fixed-size chunks with overlap work for articles and reports but butcher contracts, specifications, and financial statements. Our approach: legal contracts are chunked by clause, technical docs by section, financial reports by table and narrative separately. We also maintain parent-child relationships between chunks so the system can pull surrounding context when a fragment is retrieved -- eliminating the most common RAG failure mode of returning a relevant fragment that lacks the context to interpret it correctly.
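As an illustration, here is a simplified clause-aware chunker that keeps a parent-child link from each clause back to its enclosing document. The clause-boundary regex is deliberately naive; real contracts need far more robust segmentation (numbering schemes, exhibits, nested sub-clauses).

```python
# A simplified sketch of clause-aware chunking with parent-child links.
# The boundary regex is illustrative only; production segmentation is far
# more involved than a single regular expression.
import re
from dataclasses import dataclass, field

@dataclass
class Chunk:
    chunk_id: str
    text: str
    parent_id: str | None = None          # link back to the enclosing document
    children: list[str] = field(default_factory=list)

def chunk_contract(doc_id: str, text: str) -> list[Chunk]:
    parent = Chunk(chunk_id=f"{doc_id}:full", text=text)
    chunks = [parent]
    # Split at lines starting with "1.", "2.1", or "Clause 3"-style markers.
    parts = re.split(r"\n(?=(?:\d+(?:\.\d+)*\.?\s)|(?:Clause\s+\d+))", text)
    for i, part in enumerate(p.strip() for p in parts if p.strip()):
        child = Chunk(chunk_id=f"{doc_id}:clause-{i}", text=part,
                      parent_id=parent.chunk_id)
        parent.children.append(child.chunk_id)
        chunks.append(child)
    return chunks
```

At retrieval time, the parent_id lets the system pull the surrounding context whenever an individual clause is the best match.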
Reranking and Hybrid Search
Reranking is consistently the single largest quality improvement in our RAG pipelines. We use Cohere Rerank for managed deployments and cross-encoder models for on-premise setups. Adding reranking to a baseline vector search pipeline improves answer accuracy by 15-25% across document types. It adds 100-200ms of latency, but the quality improvement makes it non-negotiable.
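For the on-premise cross-encoder path, reranking reduces to scoring every (query, document) pair and keeping the top results. The model name below is a commonly used open-source cross-encoder and is an assumption, not necessarily the one in the stack described here.

```python
# A minimal cross-encoder reranking sketch for on-premise deployments.
# The model name is an example open-source cross-encoder, not a prescription.
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def rerank(query: str, candidates: list[str], top_k: int = 5) -> list[str]:
    # Score every (query, document) pair, then keep the highest-scoring docs.
    scores = reranker.predict([(query, doc) for doc in candidates])
    ranked = sorted(zip(candidates, scores), key=lambda x: x[1], reverse=True)
    return [doc for doc, _ in ranked[:top_k]]
```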
We pair this with hybrid search -- combining vector similarity with BM25 keyword matching -- for every production system. Pure vector search misses exact-match queries for contract numbers, product SKUs, and regulation identifiers. The implementation adds complexity, but missing an obviously relevant document because it is not semantically close to the query is too damaging to accept.
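One straightforward way to merge the two result lists is reciprocal rank fusion; the sketch below shows only the fusion step, with the vector and BM25 retrieval calls stubbed out.

```python
# A minimal sketch of hybrid retrieval using reciprocal rank fusion (RRF)
# to merge vector-search and BM25 result lists. Retrieval itself is stubbed.
from collections import defaultdict

def reciprocal_rank_fusion(result_lists: list[list[str]], k: int = 60) -> list[str]:
    """Merge ranked lists of document ids; k dampens the weight of rank position."""
    scores: dict[str, float] = defaultdict(float)
    for results in result_lists:
        for rank, doc_id in enumerate(results):
            scores[doc_id] += 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)

# vector_hits = vector_store.search(query_embedding, top_k=20)   # semantic
# keyword_hits = bm25_index.search("SKU-4471-B", top_k=20)       # exact match
# fused = reciprocal_rank_fusion([vector_hits, keyword_hits])
```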
Evaluation and Monitoring
An AI system without evaluation is a liability. Every production system gets a custom evaluation suite: extraction accuracy, completeness, and hallucination rate for document processing; task completion rate, relevance, and tone adherence for conversational agents. Off-the-shelf tools give you generic metrics. Custom frameworks give you the metrics that actually correlate with user satisfaction.
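A custom metric does not have to be elaborate. Here is a minimal example of one such metric, extraction accuracy against a hand-labeled golden set; the field names and values are illustrative.

```python
# A minimal sketch of one custom evaluation metric: extraction accuracy
# against a hand-labeled golden record. Field names are illustrative.
def extraction_accuracy(predicted: dict, golden: dict) -> float:
    """Fraction of golden fields the extractor reproduced exactly."""
    if not golden:
        return 1.0
    correct = sum(1 for field, value in golden.items() if predicted.get(field) == value)
    return correct / len(golden)

golden = {"party_a": "Acme Corp", "effective_date": "2025-03-01", "term_months": 24}
predicted = {"party_a": "Acme Corp", "effective_date": "2025-03-01", "term_months": 12}
print(extraction_accuracy(predicted, golden))  # 0.67 -- term_months was wrong
```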
For subjective quality dimensions, we use an LLM-as-judge pattern -- Claude as the judge model, scoring outputs on a 1-5 scale with mandatory reasoning for each score. It is not a replacement for human evaluation but a scalable filter that catches regressions and flags borderline cases. For critical applications in legal and financial domains, domain experts review a statistical sample of outputs weekly, providing the ground truth for calibrating automated evaluation.
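A stripped-down version of the judge call looks like the sketch below. The model identifier, rubric wording, and JSON schema are placeholders; the essential part is that the judge must produce reasoning before the score.

```python
# A simplified LLM-as-judge sketch using Anthropic's Python SDK. The model
# name, rubric, and output schema are illustrative placeholders.
import json
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

JUDGE_PROMPT = """You are evaluating an AI assistant's answer.
Question: {question}
Answer: {answer}

Score the answer from 1 (unacceptable) to 5 (excellent) for relevance and tone.
Respond with JSON only: {{"reasoning": "<one short paragraph>", "score": <1-5>}}"""

def judge(question: str, answer: str) -> dict:
    response = client.messages.create(
        model="claude-sonnet-4-20250514",  # placeholder; pin your own judge model
        max_tokens=300,
        messages=[{"role": "user",
                   "content": JUDGE_PROMPT.format(question=question, answer=answer)}],
    )
    # Reasoning is required before the score so regressions stay explainable.
    return json.loads(response.content[0].text)
```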
For observability, we run LangSmith capturing every LLM call, tool invocation, and agent decision as a trace. Custom Grafana dashboards track what leadership cares about: cost per day by model and client, latency percentiles, task completion rates, and quality scores. Every dollar of AI inference is attributed to a specific client, project, and use case -- not as financial hygiene, but to drive optimization decisions.
Infrastructure and the API Gateway Pattern
Agent services are containerized with Docker on Kubernetes, auto-scaling based on queue depth rather than CPU utilization -- because AI workloads are I/O-bound waiting for model APIs, not CPU-bound. For self-hosted inference, vLLM runs on GPU instances with dynamic batching, scaling to zero during off-hours. A single API gateway handles authentication, rate limiting, retry logic, model routing, and cost tracking for every outbound LLM request. This gateway is the chokepoint through which all AI spending flows, making it the natural place to enforce budgets and collect telemetry.
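In condensed form, the gateway pattern is a single wrapper around every outbound call that adds retries, attributes cost, and emits telemetry. The sketch below stubs the provider call and uses illustrative prices; a real gateway would also enforce budgets and rate limits before dispatching.

```python
# A condensed sketch of the gateway pattern: one chokepoint adding retries
# and per-client cost attribution around every outbound LLM call. Prices and
# the call_model stub are illustrative, not real provider integrations.
import time

PRICE_PER_1K_INPUT = {"claude": 0.003, "gpt-4o": 0.0025, "llama-3-8b": 0.0002}  # USD, illustrative

def call_model(model: str, prompt: str) -> dict:
    """Stub for the real provider SDK call; returns text plus token usage."""
    return {"text": "...", "input_tokens": 1200, "output_tokens": 300}

def gateway_call(model: str, prompt: str, client_id: str, max_retries: int = 3) -> dict:
    last_error = None
    for attempt in range(max_retries):
        try:
            result = call_model(model, prompt)
            cost = result["input_tokens"] / 1000 * PRICE_PER_1K_INPUT.get(model, 0.01)
            # In production this is emitted as telemetry tagged with client,
            # project, and use case -- not printed.
            print(f"client={client_id} model={model} cost=${cost:.4f}")
            return result
        except Exception as exc:  # network errors, rate limits, provider outages
            last_error = exc
            time.sleep(2 ** attempt)  # exponential backoff between retries
    raise RuntimeError(f"all retries exhausted for {model}") from last_error
```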
What We Chose Not to Use (and Why)
The tools you reject reveal as much about your engineering philosophy as the tools you adopt. Here are the notable technologies we evaluated and deliberately passed on.
- CrewAI -- Too opinionated about agent interaction patterns for production. LangGraph provides the same multi-agent capabilities with explicit control over every transition.
- Chroma -- Solid for prototyping but operational maturity for production workloads (connection pooling, HA, backups) did not meet our standards. We revisit periodically.
- Haystack -- Clean pipeline abstraction but significantly smaller ecosystem and community support than LangChain/LangGraph.
- Fine-tuned reasoning models -- Results consistently worse than well-prompted frontier models, with high maintenance burden. We do fine-tune embedding models for domain-specific retrieval, where the ROI is clear.
- AutoGen -- Insufficient production hardening. Conversation-based agent interaction made debugging difficult, and no persistent checkpointing was a dealbreaker for enterprise workflows.
How Our Stack Evolved Over 18 Months
In mid-2024, we ran GPT-4 as our sole model, raw LangChain chains for orchestration, Chroma for vector storage, and basic logging. The system worked for demos but crumbled under production load. By late 2024, Claude 3.5 Sonnet had replaced GPT-4 as our primary model, Pinecone and pgvector had replaced Chroma, and LangGraph had replaced LangChain chains -- immediately improving debugging and testing.
Through 2025, we added Cohere Rerank (our biggest single quality improvement), built custom evaluation frameworks, deployed LangSmith, and introduced the API gateway pattern. Into 2026, the focus shifted to maturity: custom orchestration for critical paths, human evaluation loops, self-hosted models for data-sensitive clients, and LLM-as-judge pipelines. The stack is more complex than 18 months ago, but every component earns its place by solving a problem we actually had -- not a problem we imagined.
Principles Behind Our Choices
- Use the right model for each task, not the best model for every task. Model cascading is not just cost optimization -- it is an architectural principle.
- Own your critical path. Frameworks are excellent for non-critical workflows, but production-critical code should be explicit and free from upstream dependency changes.
- Measure before you optimize. Every optimization we have made was driven by measured quality gaps in production, not theoretical concerns.
- Operational simplicity compounds. A slightly less capable tool your team can operate confidently beats a superior tool requiring specialized expertise.
- Transparency is a feature. When clients ask what models we use and how we evaluate quality, we answer with specifics.
Our stack will continue to evolve -- the AI tooling landscape moves too fast for any architecture to be permanent. But the principles are stable: measure everything, own your critical paths, use the right tool for each job, and be honest about what works. If you are building production AI systems and want to compare notes, or if you need a team that has already navigated these choices, we would welcome the conversation. Explore our AI and machine learning services at /services/ai-development.
Fernando Boiero
CTO & Co-Founder
Over 20 years in the tech industry. Founder and director of Blockchain Lab, university professor, and certified PMP. Expert and thought leader in cybersecurity, blockchain, and artificial intelligence.