Agentas Multi-Agent Gateway

Production-grade multi-agent orchestration platform built on AWS Bedrock

Sep 2024 – Present | Grid Dynamics
[Figure: Agentas architecture diagram showing the request flow through RAI guardrails to AWS Bedrock AgentCore]

Overview

Agentas is an enterprise multi-agent orchestration platform that acts as a gateway between users and AWS Bedrock's AgentCore service. As a core member of the Agentas team, I architected and implemented a FastAPI-based service that routes user requests through Responsible AI (RAI) guardrails and manages multi-agent workflows.

The Challenge

Organizations needed a scalable way to orchestrate multiple AI agents while ensuring:

  • Secure, authenticated access to external tools and services
  • Responsible AI guardrails to prevent harmful or biased outputs
  • Session state management across distributed agent workflows
  • Robust error handling and timeout management for production reliability
  • Comprehensive observability for debugging complex agent interactions

My Solution

I designed and built the core gateway service with the following architecture:

Gateway Service Architecture

  • FastAPI Backend: High-performance async API service handling concurrent user requests with automatic request validation and OpenAPI documentation
  • RAI Guardrails: Implemented pre- and post-processing filters to screen requests and responses for compliance with ethical AI guidelines
  • AWS Bedrock Integration: Seamless connection to AWS Bedrock's AgentCore service for LLM-powered agent orchestration
  • Redis Session Management: Distributed session store enabling stateful conversations across multiple agent invocations with automatic expiration
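The pre- and post-processing guardrail flow described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the Agentas implementation: `BLOCKED_TERMS`, `check_text`, `handle_request`, and the stub `invoke_agent` are hypothetical names, a keyword match stands in for a real RAI policy, and no AWS call is made.

```python
import asyncio
from dataclasses import dataclass

# Hypothetical policy: a keyword list stands in for a real RAI classifier.
BLOCKED_TERMS = {"credit card dump", "build a bomb"}


@dataclass
class GuardrailResult:
    allowed: bool
    reason: str = ""


def check_text(text: str) -> GuardrailResult:
    """Reject the text if it contains any blocked term (case-insensitive)."""
    lowered = text.lower()
    for term in BLOCKED_TERMS:
        if term in lowered:
            return GuardrailResult(False, f"blocked term: {term}")
    return GuardrailResult(True)


async def invoke_agent(prompt: str) -> str:
    """Stub for the downstream agent call (assumed; echoes the prompt)."""
    await asyncio.sleep(0)
    return f"echo: {prompt}"


async def handle_request(prompt: str) -> dict:
    # Pre-flight: screen the user request before it reaches the agent.
    pre = check_text(prompt)
    if not pre.allowed:
        return {"status": "rejected", "stage": "pre", "reason": pre.reason}

    answer = await invoke_agent(prompt)

    # Post-flight: screen the agent response before returning it.
    post = check_text(answer)
    if not post.allowed:
        return {"status": "rejected", "stage": "post", "reason": post.reason}
    return {"status": "ok", "answer": answer}
```

In a FastAPI service, a coroutine like `handle_request` would typically be called from an async route handler, so both checks wrap every agent invocation.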

Gateway Authentication Layer

One of my key contributions was building the authentication layer that enables secure tool invocations:

  • JWT-based authentication for external tool access
  • Service-to-service authentication using AWS IAM roles
  • OAuth 2.0 integration for third-party API access
  • Rate limiting and quota management per client
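Per-client rate limiting is often implemented as a token bucket. The sketch below shows that technique in plain Python. It is an assumption about one reasonable approach, not a description of how Agentas enforces its limits. `ClientRateLimiter` and its parameters are hypothetical, and state is kept in process memory rather than in a shared store.

```python
import time
from typing import Callable, Dict, Tuple


class ClientRateLimiter:
    """Per-client token bucket: each client may burst up to `capacity`
    requests, and tokens refill at `refill_rate` per second."""

    def __init__(
        self,
        capacity: float,
        refill_rate: float,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.capacity = capacity
        self.refill_rate = refill_rate
        self._clock = clock  # injectable so tests can control time
        # client_id -> (tokens remaining, time of last update)
        self._buckets: Dict[str, Tuple[float, float]] = {}

    def allow(self, client_id: str) -> bool:
        now = self._clock()
        # A client seen for the first time starts with a full bucket.
        tokens, updated_at = self._buckets.get(client_id, (self.capacity, now))
        # Refill for the elapsed time, capped at the bucket capacity.
        tokens = min(self.capacity, tokens + (now - updated_at) * self.refill_rate)
        allowed = tokens >= 1.0
        if allowed:
            tokens -= 1.0
        self._buckets[client_id] = (tokens, now)
        return allowed
```

A gateway would call `limiter.allow(client_id)` after authenticating the caller and reject the request (for example with HTTP 429) when it returns `False`.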

Tool Manager Integration

I implemented the end-to-end flow connecting the Tool Manager to AWS Bedrock AgentCore:

  • Dynamic tool registry allowing runtime tool registration and discovery
  • Tool execution framework with timeout handling (configurable per tool, default 30s)
  • Error boundary implementation preventing tool failures from crashing agents
  • Structured logging capturing tool invocations, parameters, and results
  • Health check endpoints monitoring tool availability and response times
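A runtime tool registry with per-tool timeouts and an error boundary, as listed above, might look something like this. The class and function names (`ToolRegistry`, `ToolResult`, `invoke`) are hypothetical. Only the 30-second default comes from the description above.

```python
import asyncio
from dataclasses import dataclass
from typing import Any, Awaitable, Callable, Dict, List, Tuple

DEFAULT_TIMEOUT_S = 30.0  # default per-tool timeout mentioned above

ToolFn = Callable[..., Awaitable[Any]]


@dataclass
class ToolResult:
    ok: bool
    value: Any = None
    error: str = ""


class ToolRegistry:
    def __init__(self) -> None:
        # tool name -> (async callable, timeout in seconds)
        self._tools: Dict[str, Tuple[ToolFn, float]] = {}

    def register(
        self, name: str, fn: ToolFn, timeout_s: float = DEFAULT_TIMEOUT_S
    ) -> None:
        """Register (or replace) a tool at runtime."""
        self._tools[name] = (fn, timeout_s)

    def names(self) -> List[str]:
        """Discovery: list registered tool names in sorted order."""
        return sorted(self._tools)

    async def invoke(self, name: str, **params: Any) -> ToolResult:
        if name not in self._tools:
            return ToolResult(False, error=f"unknown tool: {name}")
        fn, timeout_s = self._tools[name]
        try:
            value = await asyncio.wait_for(fn(**params), timeout=timeout_s)
            return ToolResult(True, value=value)
        except asyncio.TimeoutError:
            return ToolResult(False, error=f"{name} timed out after {timeout_s}s")
        except Exception as exc:
            # Error boundary: a failing tool yields a result, not a crash.
            return ToolResult(False, error=f"{name} failed: {exc}")
```

Returning a `ToolResult` instead of raising lets the calling agent decide whether to retry, fall back, or report the failure.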

Technical Stack

Python 3.11, FastAPI, AWS Bedrock, Redis, Pydantic, boto3, Docker, CloudWatch

Key Features

Multi-Agent Orchestration

Coordinate multiple specialized agents with different capabilities, enabling complex workflows that leverage the strengths of each agent type.

Responsible AI Guardrails

Pre-flight and post-flight content filtering ensuring all interactions comply with ethical AI guidelines and organizational policies.

Session Persistence

Redis-backed session storage enabling stateful conversations with automatic cleanup and configurable TTL policies.
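To illustrate the TTL behavior, here is a small in-memory stand-in for a session store. It is a sketch only: `InMemorySessionStore` is a hypothetical class, and the real service uses Redis, where the same effect is usually achieved by setting an expiry on the key (for example the `ex` argument of redis-py's `set`).

```python
import json
import time
from typing import Callable, Dict, Optional, Tuple


class InMemorySessionStore:
    """Stores JSON-serialized session state with a fixed time-to-live."""

    def __init__(
        self, ttl_s: float, clock: Callable[[], float] = time.monotonic
    ) -> None:
        self.ttl_s = ttl_s
        self._clock = clock  # injectable so tests can control time
        # session_id -> (serialized state, expiry timestamp)
        self._data: Dict[str, Tuple[str, float]] = {}

    def save(self, session_id: str, state: dict) -> None:
        """Write the state and restart its TTL."""
        self._data[session_id] = (json.dumps(state), self._clock() + self.ttl_s)

    def load(self, session_id: str) -> Optional[dict]:
        """Return the state, or None if it is missing or expired."""
        entry = self._data.get(session_id)
        if entry is None:
            return None
        payload, expires_at = entry
        if self._clock() >= expires_at:
            del self._data[session_id]  # lazy cleanup on read
            return None
        return json.loads(payload)
```

Serializing to JSON on write mirrors what a Redis-backed store has to do anyway, so state that cannot be serialized fails early rather than at the network boundary.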

Comprehensive Logging

Structured logging with CloudWatch integration providing full visibility into request flow, tool invocations, and agent decisions.
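One common way to get structured logs with the standard library is a JSON formatter, sketched below. The names `JsonFormatter`, `get_logger`, and the `context` key are hypothetical. The sketch writes to the default stream and assumes a separate log agent ships those lines to CloudWatch, which is not shown.

```python
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON object."""

    def format(self, record: logging.LogRecord) -> str:
        entry = {
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        # Fields passed via extra={"context": {...}} are set on the record.
        entry.update(getattr(record, "context", {}))
        return json.dumps(entry, default=str)


def get_logger(name: str) -> logging.Logger:
    """Return a logger that emits JSON lines, configured only once."""
    logger = logging.getLogger(name)
    if not logger.handlers:
        handler = logging.StreamHandler()
        handler.setFormatter(JsonFormatter())
        logger.addHandler(handler)
        logger.setLevel(logging.INFO)
        logger.propagate = False  # avoid duplicate lines from the root logger
    return logger
```

A tool invocation could then be logged with `get_logger("gateway.tools").info("tool_invoked", extra={"context": {"tool": "search", "duration_ms": 42}})`, which keeps the tool name and timing as queryable fields rather than free text.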

Impact & Results

  • Deployed to production, serving 1,000+ requests per day
  • Achieved 99.5% uptime with robust error handling and retry mechanisms
  • Reduced average response time to under 800 ms for simple agent queries
  • Enabled secure integration with 15+ external tools and services
  • Improved debugging efficiency by 60% through comprehensive observability

Lessons Learned

This project taught me valuable lessons about building production-grade AI systems:

  • Timeout Management Is Critical: Different tools have vastly different response times. Implementing per-tool timeout configurations prevented slow tools from blocking the entire workflow.
  • Observability from Day One: Adding structured logging and tracing early made debugging complex multi-agent interactions significantly easier.
  • Graceful Degradation: Not every tool invocation will succeed. Building fallback strategies and clear error messages improved overall system reliability.
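The graceful degradation point can be made concrete with a small fallback chain. This is a generic sketch, not the project's code: `call_with_fallbacks` and the strategy names in the usage note are hypothetical.

```python
from typing import Callable, List, Sequence, Tuple


def call_with_fallbacks(
    strategies: Sequence[Tuple[str, Callable[[], str]]],
) -> dict:
    """Try each (name, callable) in order and return the first success.

    Errors from earlier strategies are collected so the caller can log
    them or surface a clear message if every strategy fails.
    """
    errors: List[str] = []
    for name, fn in strategies:
        try:
            value = fn()
        except Exception as exc:
            errors.append(f"{name}: {exc}")
            continue
        return {"ok": True, "source": name, "value": value, "errors": errors}
    return {"ok": False, "source": None, "value": None, "errors": errors}
```

For example, passing `[("live_search", search), ("cache", cached_lookup)]` would serve a cached answer when the live tool fails, while still recording why the first attempt did not succeed.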

Interested in Learning More?

I'm happy to discuss this project in more detail or explore how similar architectures could benefit your organization.