We were hired to audit the security of an open-source AI agents framework. The client wanted assurance that their agent infrastructure -- handling sensitive data across multiple LLM providers -- met the standards required for enterprise deployment. What started as a routine security engagement became the catalyst for building something entirely new. The framework was well-intentioned, popular on GitHub, and actively maintained. But the deeper we dug, the clearer it became: the problems were not bugs to be fixed. They were architectural decisions baked into the foundation.
What We Found: The Limits of Python-Based Agent Frameworks
The framework we audited followed a pattern common across Python-based agent ecosystems. Agents executed tools by spawning subprocesses or calling functions directly within the same Python process. There was no sandboxing -- a tool with a bug or a malicious prompt injection could access the filesystem, make network requests, or read environment variables containing API keys. In one test, we demonstrated that a carefully crafted prompt could cause an agent to exfiltrate credentials stored in memory by a different agent running in the same process.
The runtime overhead compounded the security issues. Python's Global Interpreter Lock meant that CPU-bound multi-agent orchestration was effectively single-threaded. Cold start times ranged from 2 to 5 seconds depending on the dependency tree. Memory consumption for a single agent with a modest tool set sat around 200-400MB -- manageable for a demo, prohibitive when running dozens of agents in production. And every agent restart reloaded the entire dependency chain, because Python offers no ahead-of-time compilation for this workload.
The compliance implications were the final concern. Dynamic typing meant that data flowing through the agent pipeline had no compile-time guarantees about shape or content. Personally identifiable information could pass through logging middleware without detection because there was no type-level enforcement. Achieving GDPR compliance required wrapping every data access point in runtime checks -- checks that were easy to forget, impossible to enforce systematically, and invisible to static analysis. For an ISO 27001-certified organization like ours, recommending this framework for production use was not an option.
The Decision: Building From Scratch in Rust
Our first instinct was to contribute patches upstream. We drafted a proposal for a sandbox layer, memory isolation between agents, and a typed message-passing system. But the more we scoped the changes, the clearer it became that we were not proposing improvements -- we were proposing a rewrite. The framework's plugin architecture assumed direct memory access between components. The tool execution model was built around Python's import system. Adding isolation after the fact would break every existing extension and defeat the purpose of the ecosystem.
We chose Rust for reasons that went beyond trend-following. Our team had spent two years writing smart contracts for the Stellar blockchain using Soroban -- Rust-based smart contracts that compile to WASM and execute in a sandboxed environment. We understood Rust's ownership model, its async ecosystem, and critically, its ability to compile to WebAssembly. Every property we needed for a secure agent framework -- memory safety without garbage collection, zero-cost abstractions, deterministic resource cleanup, and a WASM compilation target -- Rust provided natively.
The decision was not taken lightly. Rust has a steeper learning curve than Python, the ecosystem for AI tooling is younger, and development velocity in the early weeks was slower. But we were building infrastructure that would run in production for years, handling sensitive data across regulated industries. The upfront investment in correctness would pay dividends every day the system ran without a security incident.
Agentor Architecture: 13 Crates, One Mission
Agentor is structured as a Rust workspace with 13 crates, each with a single responsibility and well-defined boundaries. This is not a monolith split into folders -- each crate compiles independently, has its own test suite, and exposes a public API through Rust's module system. Dependencies between crates are explicit and enforced by the compiler. If a crate does not need filesystem access, it simply does not depend on std::fs, and no amount of runtime cleverness can change that.
- agentor-core -- Agent lifecycle management, message routing, and the trait definitions that every other crate implements. This is the backbone: it defines what an agent is, how agents communicate, and how the system orchestrates multi-agent workflows.
- agentor-runtime -- The async Tokio-based runtime that manages agent execution, resource limits, and graceful shutdown. Handles backpressure, task prioritization, and ensures no single agent can starve others of CPU or memory.
- agentor-providers -- LLM provider abstraction layer supporting OpenAI, Anthropic, and local models through a unified trait. Switching providers requires changing a configuration value, not rewriting code.
- agentor-tools -- Tool definition and execution framework with compile-time type checking. Every tool declares its inputs, outputs, and required permissions as Rust types.
- agentor-mcp -- Full implementation of the Model Context Protocol, enabling Agentor agents to interoperate with any MCP-compatible system.
- agentor-sandbox -- The WASM isolation layer. Every tool execution happens inside a sandboxed WASM instance with explicit capability grants.
- agentor-cli -- Command-line interface for creating, running, and managing agent projects. Scaffolds new projects, runs local development servers, and manages deployments.
- agentor-codegen -- Code generation pipeline with AST-aware transformations, multi-file coordination, and integrated compilation verification.
- agentor-config -- Configuration management with environment-aware defaults, secret handling, and validation.
- agentor-telemetry -- Structured logging, distributed tracing, and metrics collection with OpenTelemetry compatibility.
- agentor-auth -- Authentication and authorization for agent-to-agent and agent-to-service communication.
- agentor-storage -- Persistent state management with pluggable backends (SQLite, PostgreSQL, in-memory).
- agentor-testing -- Test utilities, mock providers, and snapshot testing for agent behaviors.
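To make the crate boundaries concrete, here is a minimal sketch of the kind of trait surface agentor-core describes: what an agent is and how it handles a routed message. The names and signatures are hypothetical, not Agentor's actual API, and the real runtime is async (Tokio) where this sketch is synchronous for brevity.

```rust
/// A message routed between agents (illustrative shape).
#[derive(Debug, Clone)]
struct Message {
    from: String,
    to: String,
    payload: String,
}

/// The contract every agent implements.
trait Agent {
    fn id(&self) -> &str;
    /// Handle one inbound message, optionally producing a reply.
    fn handle(&mut self, msg: &Message) -> Option<Message>;
}

/// A trivial echo agent showing the contract in use.
struct EchoAgent {
    id: String,
}

impl Agent for EchoAgent {
    fn id(&self) -> &str {
        &self.id
    }
    fn handle(&mut self, msg: &Message) -> Option<Message> {
        Some(Message {
            from: self.id.clone(),
            to: msg.from.clone(),
            payload: format!("echo: {}", msg.payload),
        })
    }
}

fn main() {
    let mut agent = EchoAgent { id: "echo-1".into() };
    let inbound = Message {
        from: "user".into(),
        to: "echo-1".into(),
        payload: "hello".into(),
    };
    if let Some(reply) = agent.handle(&inbound) {
        println!("{} -> {}: {}", reply.from, reply.to, reply.payload);
    }
}
```

Because every other crate depends on trait definitions like these rather than on concrete types, providers, tools, and storage backends can be swapped without touching orchestration logic.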
WASM Sandbox: Security by Design
The sandbox layer is the architectural decision that differentiates Agentor from every Python-based alternative. When an agent invokes a tool -- whether it is reading a file, making an HTTP request, or executing generated code -- that tool runs inside an isolated WASM instance. The instance has no access to the host filesystem, no network capabilities, and no shared memory with other tools or the host process. Access to any resource must be explicitly granted through a capability system defined at deployment time.
Each WASM instance runs with configurable memory limits and execution timeouts. If a tool attempts to allocate more memory than permitted, the instance is terminated -- not the agent, not the runtime, just that single tool invocation. If a tool exceeds its execution timeout, it is killed cleanly with full diagnostic output. This is fundamentally different from Python's approach of spawning subprocesses and hoping they behave. There is no hoping in Agentor -- the constraints are enforced by the WASM runtime at the instruction level.
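The actual enforcement happens inside the WASM runtime at the instruction level, but the deny-by-default capability model can be sketched in plain Rust. Everything below is illustrative -- the type names and grant shapes are hypothetical, not agentor-sandbox's real API:

```rust
use std::collections::HashSet;
use std::time::Duration;

/// Capabilities a tool may be granted at deployment time (illustrative).
#[derive(Debug, Clone, PartialEq, Eq, Hash)]
enum Capability {
    ReadPath(String),
    HttpHost(String),
}

/// Per-invocation resource limits the runtime enforces.
#[derive(Debug, Clone)]
struct SandboxLimits {
    max_memory_bytes: usize,
    timeout: Duration,
}

struct Sandbox {
    grants: HashSet<Capability>,
    limits: SandboxLimits,
}

impl Sandbox {
    /// Deny by default: a resource is reachable only if explicitly granted.
    fn check(&self, cap: &Capability) -> Result<(), String> {
        if self.grants.contains(cap) {
            Ok(())
        } else {
            Err(format!("capability denied: {:?}", cap))
        }
    }
}

fn main() {
    let sandbox = Sandbox {
        grants: [Capability::ReadPath("/data/specs".into())]
            .into_iter()
            .collect(),
        limits: SandboxLimits {
            max_memory_bytes: 64 * 1024 * 1024,
            timeout: Duration::from_secs(5),
        },
    };

    // Granted: reading the spec directory passes the check.
    assert!(sandbox.check(&Capability::ReadPath("/data/specs".into())).is_ok());
    // Not granted: any network access is rejected before the tool runs.
    assert!(sandbox.check(&Capability::HttpHost("example.com".into())).is_err());
    println!("limits: {:?}", sandbox.limits);
}
```

The key property is that the grant set is data fixed at deployment time, not logic scattered through tool code -- there is nothing a compromised tool can call to widen its own permissions.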
The practical impact is significant. In the framework we audited, a prompt injection attack that caused a tool to execute arbitrary code could compromise the entire system. In Agentor, the same attack is contained to a sandboxed WASM instance with no capabilities -- it can compute, but it cannot communicate, persist, or access anything outside its sandbox. The blast radius of any security incident is reduced from 'total system compromise' to 'one tool invocation returned an error.'
Optimized for Code Generation
This is the core of why Agentor exists and what makes it different from frameworks that treat code generation as just another tool. Agentor was built from the ground up to be the runtime for AI-driven code generation -- not text generation that happens to produce code, but a structured pipeline that understands code as a first-class artifact with syntax, semantics, and dependencies.
The pipeline starts with spec-driven development. Before an agent generates a single line of code, it produces a structured specification: the files to be created or modified, the interfaces they must implement, the dependencies they require, and the tests that will validate correctness. This specification is not a suggestion -- it is a contract that the generation pipeline enforces. If the generated code does not match the spec, the pipeline rejects it before it ever reaches a filesystem.
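The spec-as-contract idea reduces to a simple invariant: the set of generated files must match the set the spec declared, checked before any write. A minimal sketch, with hypothetical type names standing in for Agentor's internal ones:

```rust
use std::collections::{BTreeMap, BTreeSet};

/// A generation spec: the contract the pipeline enforces (illustrative shape).
struct GenerationSpec {
    /// Files the agent is expected -- and allowed -- to produce.
    expected_files: BTreeSet<String>,
}

impl GenerationSpec {
    /// Reject output that misses required files or writes unplanned ones,
    /// before anything touches the filesystem.
    fn validate(&self, generated: &BTreeMap<String, String>) -> Result<(), String> {
        let produced: BTreeSet<String> = generated.keys().cloned().collect();
        let missing: Vec<_> = self.expected_files.difference(&produced).collect();
        if !missing.is_empty() {
            return Err(format!("missing files: {:?}", missing));
        }
        let extra: Vec<_> = produced.difference(&self.expected_files).collect();
        if !extra.is_empty() {
            return Err(format!("unplanned files: {:?}", extra));
        }
        Ok(())
    }
}

fn main() {
    let spec = GenerationSpec {
        expected_files: ["src/lib.rs".to_string(), "src/api.rs".to_string()].into(),
    };
    let mut out = BTreeMap::new();
    out.insert("src/lib.rs".to_string(), "pub mod api;".to_string());
    // Incomplete output is rejected before it reaches the filesystem.
    assert!(spec.validate(&out).is_err());
    out.insert("src/api.rs".to_string(), "pub fn handler() {}".to_string());
    assert!(spec.validate(&out).is_ok());
    println!("spec satisfied");
}
```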
Multi-file output coordination is where most code generation tools fall apart. Generating a single function is straightforward. Generating a module with five files that import from each other, implement shared interfaces, and must compile together is an order of magnitude harder. Agentor's codegen crate maintains a dependency graph of all files in a generation batch, ensures imports resolve correctly across files, and validates that the complete output compiles as a unit before writing anything to disk.
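At the heart of that coordination is a topological ordering of the batch's dependency graph: emit each file after the files it imports from, and reject the whole batch if the imports form a cycle. A self-contained sketch of that check (the function name is illustrative):

```rust
use std::collections::{HashMap, HashSet};

/// Order files so every file comes after the files it imports from.
/// An import cycle can never compile, so it is rejected up front.
fn generation_order<'a>(
    deps: &HashMap<&'a str, Vec<&'a str>>,
) -> Result<Vec<String>, String> {
    fn visit<'a>(
        file: &'a str,
        deps: &HashMap<&'a str, Vec<&'a str>>,
        done: &mut HashSet<&'a str>,
        in_progress: &mut HashSet<&'a str>,
        order: &mut Vec<String>,
    ) -> Result<(), String> {
        if done.contains(file) {
            return Ok(());
        }
        // Revisiting a file already on the current path means a cycle.
        if !in_progress.insert(file) {
            return Err(format!("import cycle through {file}"));
        }
        for &dep in deps.get(file).into_iter().flatten() {
            visit(dep, deps, done, in_progress, order)?;
        }
        in_progress.remove(file);
        done.insert(file);
        order.push(file.to_string());
        Ok(())
    }

    let mut done = HashSet::new();
    let mut in_progress = HashSet::new();
    let mut order = Vec::new();
    let mut files: Vec<_> = deps.keys().collect();
    files.sort(); // deterministic output
    for &file in files {
        visit(file, deps, &mut done, &mut in_progress, &mut order)?;
    }
    Ok(order)
}

fn main() {
    let deps = HashMap::from([
        ("main.rs", vec!["api.rs", "model.rs"]),
        ("api.rs", vec!["model.rs"]),
        ("model.rs", vec![]),
    ]);
    let order = generation_order(&deps).expect("batch is acyclic");
    println!("{order:?}"); // model.rs first, main.rs last
}
```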
AST-aware transformations replace the naive text replacement that most frameworks use. When Agentor modifies existing code, it parses the source into an abstract syntax tree, applies transformations at the semantic level, and regenerates the source. This means adding a method to a class does not break formatting, inserting an import does not duplicate an existing one, and renaming a function updates all call sites. The difference between text replacement and AST transformation is the difference between a find-and-replace tool and a compiler -- one of them understands the code.
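The failure mode of text replacement is easy to demonstrate. A real implementation parses the source into a full AST (for Rust, a crate like `syn` does this); the token-level sketch below is deliberately simplified, but it is enough to show why identifier-aware renaming and raw find-and-replace diverge:

```rust
/// Rename an identifier only where it appears as a complete token.
/// Simplified stand-in for a real AST transformation: it respects
/// identifier boundaries, which is exactly what String::replace does not.
fn rename_identifier(source: &str, from: &str, to: &str) -> String {
    let is_ident = |c: char| c.is_alphanumeric() || c == '_';
    let mut out = String::new();
    let mut token = String::new();
    for c in source.chars() {
        if is_ident(c) {
            token.push(c);
        } else {
            out.push_str(if token == from { to } else { token.as_str() });
            token.clear();
            out.push(c);
        }
    }
    out.push_str(if token == from { to } else { token.as_str() });
    out
}

fn main() {
    let src = "fn fetch() {} fn fetch_all() { fetch(); }";
    // Naive text replacement corrupts the unrelated fetch_all identifier:
    assert_eq!(
        src.replace("fetch", "load"),
        "fn load() {} fn load_all() { load(); }"
    );
    // Token-aware renaming updates only real definitions and call sites:
    assert_eq!(
        rename_identifier(src, "fetch", "load"),
        "fn load() {} fn fetch_all() { load(); }"
    );
    println!("token-aware rename preserved fetch_all");
}
```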
The integrated testing pipeline closes the loop. Every generation cycle follows the same sequence: generate, compile, test, iterate. If the generated code does not compile, Agentor captures the compiler errors, feeds them back to the LLM with the relevant context, and requests a corrected version -- automatically, without human intervention. If the code compiles but tests fail, the test output follows the same feedback loop. This generate-compile-test-iterate cycle typically converges in 2-3 iterations, and each iteration benefits from a zero-copy pipeline that minimizes the overhead of passing large codebases between stages.
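The loop itself is simple; the value is in wiring diagnostics back into the next generation request. In this sketch, `generate` stands in for the LLM call and `compile` for invoking rustc -- both are mock stand-ins, not Agentor's actual API:

```rust
/// Stand-in for invoking the compiler: reject code with a known flaw marker.
fn compile(source: &str) -> Result<(), String> {
    if source.contains("TODO") {
        Err("error: unresolved TODO in generated code".into())
    } else {
        Ok(())
    }
}

/// Stand-in for the LLM call: the first attempt has a flaw; an attempt
/// that sees compiler feedback produces a fixed version.
fn generate(spec: &str, feedback: Option<&str>) -> String {
    match feedback {
        None => format!("// {spec}\nfn handler() {{ /* TODO */ }}"),
        Some(_) => format!("// {spec}\nfn handler() {{}}"),
    }
}

/// Generate-compile-iterate until the candidate compiles, feeding
/// diagnostics back into each retry.
fn converge(spec: &str) -> (u32, String) {
    let mut feedback: Option<String> = None;
    let mut iterations = 0;
    loop {
        iterations += 1;
        let candidate = generate(spec, feedback.as_deref());
        match compile(&candidate) {
            Ok(()) => return (iterations, candidate),
            // Compiler errors flow back into the next generation request.
            Err(diag) => feedback = Some(diag),
        }
    }
}

fn main() {
    let (iterations, source) = converge("implement handler()");
    println!("converged after {iterations} iterations:\n{source}");
}
```

In production the same shape applies to failing tests: test output replaces the compiler diagnostic, and an iteration cap bounds how long a stubborn spec is retried.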
MCP Protocol: Universal AI Interoperability
The Model Context Protocol is becoming the standard for AI system interoperability, and Agentor implements it as a first-class citizen through the agentor-mcp crate. Any Agentor agent can expose its capabilities as MCP tools, consume tools from external MCP servers, and participate in multi-system workflows without custom integration code. This means an Agentor code generation agent can use tools provided by an IDE plugin, a CI/CD system, or a cloud deployment service -- all through the same protocol, with the same security guarantees enforced by the sandbox layer.
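For readers unfamiliar with the protocol: MCP messages are JSON-RPC 2.0, and invoking a remote tool is a `tools/call` request carrying the tool name and its arguments. The sketch below hand-formats one such request to show the wire shape only -- a real client (including agentor-mcp) would build it with a JSON library, and the tool name here is made up:

```rust
/// Build a minimal MCP `tools/call` JSON-RPC request by string formatting.
/// Illustration of the wire shape only; real code should use a JSON library.
fn tools_call_request(id: u64, tool: &str, args_json: &str) -> String {
    format!(
        r#"{{"jsonrpc":"2.0","id":{id},"method":"tools/call","params":{{"name":"{tool}","arguments":{args_json}}}}}"#
    )
}

fn main() {
    // Hypothetical tool name; the arguments object is caller-supplied JSON.
    let req = tools_call_request(1, "generate_module", r#"{"spec":"src/api.rs"}"#);
    println!("{req}");
}
```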
MCP support also means that organizations can adopt Agentor incrementally. Existing AI workflows built on other frameworks can call Agentor agents through MCP without replacing their current infrastructure. As teams gain confidence in the security and performance characteristics, they can migrate additional workloads at their own pace. There is no big-bang migration required -- Agentor was designed to coexist.
Compliance Without Compromise
As an ISO 27001-certified company, we built Agentor with compliance as a design constraint, not an afterthought. Every architectural decision was evaluated against the requirements of the regulatory frameworks our clients operate within.
- GDPR -- Data flow through the agent pipeline is tracked at the type level. PII fields are marked with Rust's type system, and any attempt to log, serialize, or transmit PII without explicit redaction is caught at compile time. The sandbox ensures that tools cannot access data they have not been explicitly granted permission to see.
- ISO 27001 -- Access controls, audit logging, and incident containment are not optional modules -- they are built into the runtime. Every agent action is logged with cryptographic integrity, every tool invocation is recorded with its capability grants, and security events trigger automatic containment through the sandbox layer.
- DPGA (Digital Public Goods Alliance) -- Agentor is designed to meet open-source standards for digital public goods, with transparent governance, accessible documentation, and platform-independent deployment through WASM.
The contrast with the framework we audited is stark. Achieving GDPR compliance in that Python-based system required adding runtime middleware at every data boundary, with no compiler enforcement and no guarantee of completeness. In Agentor, compliance is structural -- if the code compiles, the data handling rules are enforced.
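The core of type-level PII tracking is a newtype whose formatting implementation can only ever emit a redaction, so accidental logging is impossible and raw access is a single greppable method. This is a minimal sketch of the idea -- the type name is hypothetical, and the full compile-time story described above needs additional machinery (lints on serialization boundaries, sealed traits) beyond what fits here:

```rust
use std::fmt;

/// A PII wrapper: usable in business logic, but Debug formatting -- and
/// therefore ordinary logging paths -- only ever sees a redaction.
struct Pii<T>(T);

impl<T> Pii<T> {
    /// The one deliberate, auditable access point for the raw value.
    fn expose(&self) -> &T {
        &self.0
    }
}

impl<T> fmt::Debug for Pii<T> {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        f.write_str("[REDACTED]")
    }
}

fn main() {
    let email = Pii(String::from("user@example.com"));
    // Logging the wrapper can never leak the value:
    assert_eq!(format!("{email:?}"), "[REDACTED]");
    // Access to the raw value is explicit and easy to audit:
    assert_eq!(email.expose(), "user@example.com");
    println!("pii stays redacted in logs: {email:?}");
}
```

Because `Pii<T>` deliberately implements neither `Display` nor a default serializer, any code path that tries to print or serialize it without going through `expose()` fails to compile, which is the enforcement the runtime-middleware approach could not give.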
Performance: Rust vs Python
We are careful about benchmarks because they can be misleading. These numbers come from our internal test suite running the same agent workflow -- a code generation task that reads a specification, generates three files, compiles them, runs tests, and iterates once -- on the same hardware.
- Cold start time -- Agentor: 38ms. The Python framework: 3.2 seconds. This matters when you are running agents on-demand in serverless environments where every cold start is a user waiting.
- Memory footprint -- Agentor agent with full tool set: 42MB. Equivalent Python agent: 380MB. A 9x reduction that translates directly into infrastructure cost savings when running at scale.
- Concurrent agents -- Agentor on a 4-core machine: 200+ agents running simultaneously with linear throughput scaling. The Python framework: 12-15 agents before the GIL becomes the bottleneck.
- Test suite -- 483 tests, zero unsafe blocks, zero uses of unwrap() in production code. Every error path is handled explicitly.
- WASM sandbox overhead -- Adding sandbox isolation to a tool invocation costs approximately 1.2ms per invocation. The security guarantee of complete isolation for less than two milliseconds of latency is a trade-off we accept without hesitation.
The most meaningful number is not any single benchmark -- it is the total wall-clock time for a complete code generation cycle. Agentor completes the generate-compile-test-iterate loop in roughly 40% of the time the Python framework takes, primarily because the overhead between pipeline stages is measured in microseconds rather than hundreds of milliseconds. When you are running thousands of these cycles per day, that difference compounds into hours of saved time and significantly lower LLM API costs due to fewer wasted tokens.
What's Next
Agentor is under active development, and our roadmap reflects the priorities we hear from early adopters and our own internal usage.
- Language server integration -- Deep integration with VS Code and other IDEs through the Language Server Protocol, enabling Agentor code generation agents to operate as intelligent coding assistants with full project context.
- Distributed agent orchestration -- Multi-node agent execution with automatic work distribution, fault tolerance, and state replication for enterprise-scale deployments.
- Visual pipeline editor -- A browser-based tool for designing agent workflows visually, with drag-and-drop composition of tools, providers, and orchestration patterns.
- Extended language support -- AST-aware transformations currently support Rust, TypeScript, and Python. We are adding Go, Java, and C# to cover the most common enterprise languages.
- Open-source release -- We are preparing Agentor for public release under a permissive open-source license. The core crates, documentation, and example projects will be available on GitHub.
What started as a security audit has become the foundation of how we build AI-powered systems at Xcapit. Agentor is not just a framework we sell to clients -- it is the framework we use internally for our own AI agent development, our code generation workflows, and our compliance-critical deployments. Every improvement we make is battle-tested in our own production environment before it reaches anyone else.
The lesson from this journey is one we keep relearning in software engineering: when the foundation is wrong, no amount of patching will fix it. Sometimes the responsible decision is to start over with the right constraints, the right language, and the right architecture. Agentor is our answer to the question we could not solve with patches -- how do you build AI agents that are secure enough, fast enough, and reliable enough for the systems that matter most?
Fernando Boiero
CTO & Co-Founder
Over 20 years in the tech industry. Founder and director of Blockchain Lab, university professor, and certified PMP. Expert and thought leader in cybersecurity, blockchain, and artificial intelligence.