Agentic AI Frameworks

agentic

frameworks

Published May 3, 2026

From Chat to Code

If you have used ChatGPT, Claude, or Gemini through their web interface, you already know what a large language model (LLM) can do. It answers questions, summarizes text, writes code, and brainstorms ideas. But in that mode, you are the one driving. You type a prompt, read the response, decide what to do next, and type another prompt.

Agentic AI takes this further. Instead of you driving the conversation one message at a time, you write a program that talks to the LLM on your behalf. The program can send prompts, read responses, make decisions, call tools, and loop, all without you pressing “send” each time.

What Is an LLM API?

Every major model provider (OpenAI, Anthropic, Google, and others) exposes its models through an API (Application Programming Interface). In practice this is an HTTPS endpoint you can call from code. You send a message as JSON, and you get a response back as structured data.

A minimal example in Python:

from openai import OpenAI

client = OpenAI()  # Uses OPENAI_API_KEY env variable
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "What is 2 + 2?"}]
)
print(response.choices[0].message.content)

That is the foundation. Everything in agentic AI builds on top of calling an LLM from code and doing something with the response.
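To make the "URL you can call from code" point concrete, here is a rough sketch of the HTTP request that sits behind an SDK call like the one above. It only builds the request and does not send it, so it runs without a key or network access. The helper name build_chat_request and the placeholder key are made up for illustration; the endpoint and body shape follow OpenAI's chat completions API, and the provider's API reference is the authority on the details.

```python
import json
import os
import urllib.request

API_URL = "https://api.openai.com/v1/chat/completions"


def build_chat_request(prompt: str, model: str = "gpt-4o") -> urllib.request.Request:
    """Build (but do not send) the HTTP request behind a chat completion call.

    build_chat_request is a hypothetical helper for this note, not part of any SDK.
    """
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    # Falls back to an obviously fake key so the sketch runs without configuration.
    api_key = os.environ.get("OPENAI_API_KEY", "sk-placeholder")
    headers = {
        "Content-Type": "application/json",
        "Authorization": f"Bearer {api_key}",
    }
    return urllib.request.Request(
        API_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers=headers,
        method="POST",
    )


request = build_chat_request("What is 2 + 2?")
print(request.get_method(), request.full_url)
print(request.data.decode("utf-8"))

# To actually send it (needs a real key and network access):
# with urllib.request.urlopen(request) as resp:
#     body = json.load(resp)
#     print(body["choices"][0]["message"]["content"])
```

The SDK adds conveniences such as retries and typed response objects, but the exchange itself is a JSON document posted to a URL.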

Why Go Beyond a Single Call?

A single LLM call is powerful but limited. It cannot look up current information because it only knows its training data. It cannot take actions in the real world like sending emails, updating databases, or controlling devices. It cannot break a complex task into steps and execute them one by one. And it cannot correct itself based on feedback.

Agentic AI solves these limitations by putting the LLM inside a loop where it can call tools, get results, and decide what to do next.
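The loop itself is small. Here is a minimal sketch of it, with a scripted stand-in (fake_llm) in place of a real model so it runs offline. The reply format with "type", "name", and "args" keys is an assumption made up for this example, not any provider's actual schema.

```python
from typing import Callable, Dict, List


def fake_llm(messages: List[dict]) -> dict:
    """Scripted stand-in for a model call (hypothetical, for illustration only)."""
    last = messages[-1]
    if last["role"] == "tool":
        # A tool result is available, so produce a final answer.
        return {"type": "final", "content": f"The answer is {last['content']}."}
    # Otherwise ask for a tool call.
    return {"type": "tool_call", "name": "add", "args": {"a": 2, "b": 2}}


TOOLS: Dict[str, Callable[..., object]] = {"add": lambda a, b: a + b}


def run_agent(question: str, llm: Callable[[List[dict]], dict] = fake_llm,
              max_steps: int = 5) -> str:
    """Call the model, run any tool it requests, feed the result back, repeat."""
    messages = [{"role": "user", "content": question}]
    for _ in range(max_steps):
        reply = llm(messages)
        if reply["type"] == "final":
            return reply["content"]
        # Your code, not the model, executes the tool.
        result = TOOLS[reply["name"]](**reply["args"])
        messages.append({"role": "tool", "content": str(result)})
    # Guard against a model that never stops asking for tools.
    return "Stopped: step limit reached."


print(run_agent("What is 2 + 2?"))
```

The step limit is a design choice worth keeping in any real version: without it, a model that keeps requesting tools would loop forever.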

What Are Tools?

A “tool” in the agentic context is any function the LLM can call. Examples include searching the web, reading a file, querying a database, sending an email, executing code, or calling another API.

The LLM does not actually run the tool itself. Your code runs the tool and feeds the result back to the LLM. The LLM just decides which tool to call and what arguments to pass.

Think of it like a person sitting at a desk with a phone. They can reason and talk, but they cannot physically check the warehouse inventory. However, they can pick up the phone and ask someone in the warehouse to check for them. The LLM is the person at the desk. The phone call is the tool. Your code is the warehouse worker who actually does the lookup and reports back.
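In code, a tool is just an ordinary function plus a description the LLM can read. One possible way to keep the two in sync is to generate the description from the function itself. The tool names and stock numbers below are invented for illustration.

```python
import inspect
from typing import Callable, Dict


def check_inventory(product: str) -> int:
    """Return how many units of a product are in stock."""
    stock = {"standing desk": 12}  # made-up data for the sketch
    return stock.get(product, 0)


def send_email(to: str, subject: str) -> str:
    """Send an email and return a status string."""
    # A real tool would call a mail service; this one only pretends.
    return f"queued email to {to}: {subject}"


# The registry is what your code consults when the LLM names a tool.
TOOL_REGISTRY: Dict[str, Callable] = {
    fn.__name__: fn for fn in (check_inventory, send_email)
}


def describe_tools(registry: Dict[str, Callable]) -> str:
    """Render one line per tool (name, parameters, docstring) for the prompt."""
    lines = []
    for name, fn in registry.items():
        params = ", ".join(inspect.signature(fn).parameters)
        lines.append(f"{name}({params}): {inspect.getdoc(fn)}")
    return "\n".join(lines)


print(describe_tools(TOOL_REGISTRY))
```

The text that describe_tools returns is what goes into the prompt. The registry stays on your side, which is the "warehouse worker" half of the analogy.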

How Tool Calling Actually Works

Tool calling sounds magical at first. You tell an LLM “you can query my database” and somehow it reaches back into your system and runs a SQL query? That sounds crazy. But the reality is more mundane than it appears.

sequenceDiagram
    participant Code as Your Code
    participant LLM as LLM (Cloud)
    participant Tool as Tool (Local)
    Code->>LLM: Prompt + tool descriptions
    LLM->>Code: "Call tool X with args Y" (JSON)
    Code->>Tool: Execute tool X
    Tool->>Code: Result
    Code->>LLM: Here is the result
    LLM->>Code: Final response

The tool calling flow. The LLM never executes anything directly. It requests a tool call, your code executes it, and the result goes back to the LLM.

Here is what actually happens. You send a prompt to the LLM and include a description of the tools it can use. You tell it to respond in JSON if it wants to invoke a tool. The LLM reads your prompt, decides it needs to use a tool, and responds with a structured JSON message saying “I want to call this tool with these arguments.” Your code receives that JSON, runs the actual tool locally, gets the result, and sends the result back to the LLM in a second call. The LLM then uses that result to formulate its final answer.

In practice, your side of this is little more than an if statement. If the LLM asks for X, do X, then call the LLM again with the result. That is all there is to it. The tool calling APIs from OpenAI, Anthropic, and others package this pattern into a clean interface, but underneath it is JSON and if statements.
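The "JSON and if statements" claim can be sketched in a few lines. The JSON shape here, with "tool" and "arguments" keys, is an assumption for illustration, and query_database is a hypothetical tool that returns canned rows rather than touching a real database.

```python
import json


def query_database(sql: str) -> list:
    """Hypothetical tool: returns canned rows instead of running real SQL."""
    return [("standing desk", 12)]


def handle_reply(raw_reply: str) -> dict:
    """Decide what to do with one model reply.

    Returns either {"done": True, "answer": ...} or
    {"done": False, "next_message": ...} to send back to the model.
    """
    try:
        reply = json.loads(raw_reply)
    except json.JSONDecodeError:
        # Not JSON at all: treat it as the model's final, plain-text answer.
        return {"done": True, "answer": raw_reply}

    if not isinstance(reply, dict) or "tool" not in reply:
        return {"done": True, "answer": raw_reply}

    if reply["tool"] == "query_database":
        rows = query_database(reply["arguments"]["sql"])
        content = f"Tool result: {rows}"
    else:
        # Tell the model about the mistake instead of crashing.
        content = f"Tool error: unknown tool {reply['tool']}"
    return {"done": False, "next_message": {"role": "user", "content": content}}


step = handle_reply('{"tool": "query_database", "arguments": {"sql": "SELECT 1"}}')
print(step["next_message"]["content"])
```

Provider tool calling APIs replace the hand-parsed JSON with structured fields on the response, but the branch on "did the model ask for a tool?" remains in your code.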

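Here is a concrete example that walks through every step in the diagram above. The LLM has two tools available and must choose the right one. To keep the example readable, it uses a simple plain-text convention for tool calls instead of JSON. The flow is identical.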

Step 1. Your code sends this prompt to GPT-5.3 (including descriptions of both tools).

You are a warehouse assistant. You have two tools available:

use tool: check_inventory [product] to check how many units are in stock
use tool: check_price [product] to look up the current price

If you need information, respond ONLY with the tool call. Do not make up answers.

User question: How much does a standing desk cost?

Step 2. The LLM chooses the price tool (not inventory) and responds:

use tool: check_price [standing desk]

Step 3. Your code parses that response, recognizes it as a tool request for check_price, and executes the actual database query locally. The query returns $349.99.

Step 4. Your code sends the result back to the LLM:

Tool result: $349.99

Step 5. The LLM uses that information to produce a final answer:

A standing desk costs $349.99.

Notice that the LLM had two tools available and chose check_price over check_inventory on its own, because the question was about cost, not stock levels. That choice is where the autonomy lives. The actual execution happened entirely in your code.
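Steps 3 and 4 are the part you write. A minimal sketch of that parsing and dispatch, assuming the plain-text "use tool: name [argument]" convention from the prompt above, might look like this. The in-memory dictionaries stand in for the real database, and the inventory count is made up.

```python
import re
from typing import Optional

# Stand-ins for the real database (the inventory figure is invented).
PRICES = {"standing desk": 349.99}
INVENTORY = {"standing desk": 12}


def check_price(product: str) -> str:
    price = PRICES.get(product)
    return f"${price:.2f}" if price is not None else "no price on file"


def check_inventory(product: str) -> str:
    return f"{INVENTORY.get(product, 0)} units"


TOOLS = {"check_price": check_price, "check_inventory": check_inventory}

# Matches lines such as: use tool: check_price [standing desk]
TOOL_CALL = re.compile(r"^use tool: (\w+) \[(.+)\]$")


def run_tool_request(llm_response: str) -> Optional[str]:
    """Return the 'Tool result: ...' message, or None if no tool was requested."""
    match = TOOL_CALL.match(llm_response.strip())
    if match is None:
        return None  # The model answered directly.
    name, product = match.groups()
    if name not in TOOLS:
        return f"Tool result: unknown tool {name}"
    return f"Tool result: {TOOLS[name](product)}"


print(run_tool_request("use tool: check_price [standing desk]"))
# prints: Tool result: $349.99
```

When run_tool_request returns None, the response is already the final answer and can be shown to the user as-is.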

Resources and RAG

Resources are a way to improve the effectiveness of an LLM by providing it with extra context. The idea is simple. You grab relevant data and include it in the prompt so the LLM can refer to it when answering.

For example, if you are building a warehouse assistant, you could include the full product catalog (names, prices, descriptions) in the prompt. When a user asks about a product, the LLM can refer to that data directly. That extra context is a “resource” you are providing to the LLM.

The simplest approach is to just include everything in the prompt. But for large datasets, that becomes impractical. This is where Retrieval Augmented Generation (RAG) comes in. RAG is about figuring out which pieces of context are most relevant to the current question and including only those. You might use vector search, keyword matching, or even another LLM to select the right context. The goal is the same. Give the LLM the information it needs without overwhelming it with everything you have.
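As a rough illustration of the retrieval step, here is a sketch that uses the keyword-matching option: score each catalog entry by how many words it shares with the question, and put only the best matches in the prompt. The catalog entries other than the standing desk are invented, and a production system would more likely use vector search than raw word overlap.

```python
import re
from typing import List

CATALOG = [
    "Standing desk: height-adjustable desk, $349.99",
    "Office chair: ergonomic chair with lumbar support, $199.00",  # invented
    "Monitor arm: single-monitor mount, $59.50",  # invented
]


def tokenize(text: str) -> set:
    """Lowercase the text and split it into a set of alphanumeric words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve(question: str, documents: List[str], k: int = 1) -> List[str]:
    """Return up to k documents that share the most words with the question."""
    query_words = tokenize(question)
    scored = [(len(query_words & tokenize(doc)), doc) for doc in documents]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    # Drop documents with no overlap at all.
    return [doc for score, doc in scored[:k] if score > 0]


def build_prompt(question: str, documents: List[str], k: int = 1) -> str:
    """Assemble a prompt containing only the retrieved context."""
    context = "\n".join(retrieve(question, documents, k))
    return f"Context:\n{context}\n\nQuestion: {question}"


print(build_prompt("How much does a standing desk cost?", CATALOG))
```

Swapping retrieve for a vector search changes how relevance is scored, not the overall shape: select context first, then build the prompt.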

What Is an Agentic Framework?

An agentic AI framework is a toolkit that helps developers build applications where large language models (LLMs) are not just answering prompts, but planning, calling tools, remembering context, and collaborating with other agents to complete multi-step tasks.

This note focuses specifically on frameworks and protocols for building agents. It does not cover products that use agentic AI under the hood, such as AI-powered IDEs (Kiro, Cursor), self-hosted AI assistants (OpenClaw/MyClaw), or low-code automation platforms (n8n). Those are consumers of agentic patterns rather than building blocks for creating them.

At a high level, every framework tends to provide the same building blocks:

Agents: an LLM plus a role, instructions, and the tools it is allowed to use.
Tools: typed functions the agent can call (search, code execution, APIs, databases).
Memory: short-term conversation state and longer-term vector or key-value stores.
Planning / control flow: how the agent decides what to do next (ReAct, plan-and-execute, graphs, state machines).
Orchestration: how multiple agents coordinate (supervisor, hierarchical, or peer-to-peer).
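To make these building blocks less abstract, here is one way they could be modeled in plain Python. This is a sketch for this note, not the data model of any particular framework: the Agent class, the route function, and the keyword-based routing are all assumptions. A real supervisor would typically ask an LLM to pick the agent.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Agent:
    """An LLM role plus instructions, allowed tools, and short-term memory."""
    role: str
    instructions: str
    tools: Dict[str, Callable[[str], str]] = field(default_factory=dict)
    memory: List[dict] = field(default_factory=list)

    def remember(self, speaker: str, content: str) -> None:
        """Append one message to the short-term conversation state."""
        self.memory.append({"role": speaker, "content": content})

    def system_prompt(self) -> str:
        tool_names = ", ".join(sorted(self.tools)) or "none"
        return f"You are {self.role}. {self.instructions} Available tools: {tool_names}."


def route(task: str, agents: Dict[str, Agent], default: str) -> Agent:
    """Supervisor-style orchestration reduced to keyword matching."""
    for keyword, agent in agents.items():
        if keyword in task.lower():
            return agent
    return agents[default]


pricing = Agent("a pricing specialist", "Answer questions about prices.",
                tools={"check_price": lambda product: "$349.99"})
stock = Agent("a stock clerk", "Answer questions about inventory.",
              tools={"check_inventory": lambda product: "12 units"})
team = {"price": pricing, "stock": stock}

chosen = route("What is the price of a standing desk?", team, default="stock")
chosen.remember("user", "What is the price of a standing desk?")
print(chosen.system_prompt())
```

Planning and control flow are deliberately left out here, since that is where frameworks differ the most.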

Framework Landscape

LangChain / LangGraph

LangChain is the most widely adopted ecosystem for building LLM applications. It provides abstractions for prompts, chains, tools, retrievers, and agents.

LangGraph comes from the same team and integrates closely with LangChain, although it can also be used on its own. It models agent workflows as a stateful graph. Nodes are functions or agents, edges are transitions, and a central state object carries data between them. This makes it a strong fit for production workflows that need branching, loops, human-in-the-loop steps, and persistence.
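The graph idea can be illustrated without the library. The sketch below is deliberately not LangGraph's API. It is a hand-rolled toy with made-up names (NODES, EDGES, run_graph) that shows nodes as functions, edges as routing decisions, and one state dictionary flowing through a loop.

```python
from typing import Callable, Dict

END = "END"


def draft(state: dict) -> dict:
    """Node: produce a new draft and count the attempt."""
    attempts = state["attempts"] + 1
    return {**state, "attempts": attempts, "draft": f"draft v{attempts}"}


def review(state: dict) -> dict:
    """Node: approve once there have been at least two attempts (toy rule)."""
    return {**state, "approved": state["attempts"] >= 2}


NODES: Dict[str, Callable[[dict], dict]] = {"draft": draft, "review": review}

# Each edge function inspects the state and names the next node.
EDGES: Dict[str, Callable[[dict], str]] = {
    "draft": lambda state: "review",
    "review": lambda state: END if state["approved"] else "draft",  # loop back
}


def run_graph(start: str, state: dict, max_steps: int = 10) -> dict:
    """Walk the graph from start until END, threading the state through nodes."""
    current, steps = start, 0
    while current != END:
        if steps >= max_steps:
            raise RuntimeError("step limit reached before END")
        state = NODES[current](state)
        current = EDGES[current](state)
        steps += 1
    return state


final_state = run_graph("draft", {"attempts": 0})
print(final_state)
```

A real graph framework adds what this toy lacks, such as persistence of the state between steps and pauses for human approval.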

CrewAI

CrewAI focuses on crews of role-based agents that collaborate on tasks. You define each agent with a role, goal, and backstory, then assign tasks and let the crew execute sequentially or in a hierarchical process. It prioritizes readability and is popular for quickly prototyping multi-agent setups.

AutoGen

AutoGen (from Microsoft) centers on conversational multi-agent patterns. Agents talk to each other using structured messages, and a user proxy agent can execute code on the user’s behalf. It shines for tasks like code generation and review loops where two or more agents iterate on each other’s output.

OpenAI Agents SDK

The OpenAI Agents SDK is a lightweight Python library for building agents with tool use, handoffs between agents, and guardrails. It is close to the OpenAI API surface, which makes it simple to adopt if you are already using GPT models and want minimal abstraction overhead.

LlamaIndex

LlamaIndex started as a data framework for retrieval-augmented generation (RAG) and has grown agent capabilities on top. Its strength is connecting LLMs to your data: document loaders, indexes, query engines, and agents that can reason over those indexes.

Semantic Kernel

Semantic Kernel (also from Microsoft) targets enterprise integration, with first-class support for C#, Python, and Java. It uses plugins (collections of functions) and a planner that composes them to fulfill a goal. It fits well when an agent needs to live inside an existing .NET or enterprise codebase.

Pydantic AI

Pydantic AI leans on Pydantic models for typed inputs, outputs, and tool signatures. It is small, explicit, and easy to test, which makes it appealing when you want type safety and predictable behavior rather than a large ecosystem.
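The value of typed tool signatures is that bad arguments from the LLM are rejected before a tool runs. The sketch below shows that idea with standard-library dataclasses as a stand-in. It is not Pydantic AI's API, and PriceQuery and parse_args are hypothetical names. Pydantic itself does much richer validation and coercion.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class PriceQuery:
    """Typed arguments for a hypothetical check_price tool."""
    product: str
    quantity: int


def parse_args(model_cls, raw: dict):
    """Validate LLM-supplied arguments against a dataclass's declared types.

    Only handles plain classes such as str and int as field types.
    Note that bool passes an int check, because bool subclasses int.
    """
    expected = {f.name: f.type for f in fields(model_cls)}
    if set(raw) != set(expected):
        raise ValueError(f"expected fields {sorted(expected)}, got {sorted(raw)}")
    for name, expected_type in expected.items():
        if not isinstance(raw[name], expected_type):
            raise TypeError(
                f"{name} must be {expected_type.__name__}, "
                f"got {type(raw[name]).__name__}"
            )
    return model_cls(**raw)


query = parse_args(PriceQuery, {"product": "standing desk", "quantity": 2})
print(query)

try:
    parse_args(PriceQuery, {"product": "standing desk", "quantity": "two"})
except TypeError as error:
    # The error text can be sent back to the LLM so it can retry.
    print(f"Rejected: {error}")
```

Because validation is a plain function over plain data, it is easy to unit test without calling a model at all.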

Quick Comparison

Framework	Language(s)	Strength
LangChain / LangGraph	Python, JS/TS	Large ecosystem, graph-based orchestration
CrewAI	Python	Role-based multi-agent crews
AutoGen	Python, .NET	Conversational multi-agent patterns
OpenAI Agents SDK	Python	Lightweight, close to OpenAI API
LlamaIndex	Python, TS	Data and RAG-centric agents
Semantic Kernel	C#, Python, Java	Enterprise and .NET integration
Pydantic AI	Python	Typed, minimal, testable

The Complexity Spectrum

These frameworks exist at different levels of complexity, each with pros and cons. It helps to think of them as a hierarchy.

No Framework (+ MCP)

The simplest approach is to use no agentic framework at all. Just connect to LLMs directly using their APIs and orchestrate among them yourself. Anthropic makes a compelling case for this in their Building Effective Agents blog post: the APIs are relatively simple and straightforward, and the benefit is you get to see exactly what is going on under the hood. You control the prompts in detail, and you avoid buying into any ecosystem.

Alongside “no framework” sits MCP, the Model Context Protocol. Created by Anthropic in 2024, it is not a framework but an open protocol. The idea is that it allows models to be connected to sources of data and tools in a standardized, agreed-upon way, so you need far less custom glue code. As long as both sides conform to the protocol, you can stitch together models, tools, and data providers in a simple, elegant way. Because it is a protocol rather than a framework, it belongs in the same tier as having no framework at all.
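To give a feel for what "conforming to the protocol" means, here is a sketch of the kind of message exchange involved. MCP messages are JSON-RPC 2.0, and the tools/call method and result shape below follow my reading of the MCP specification, so treat them as an approximation and check the spec before relying on them. The toy server function and the check_price tool are invented, and real clients and servers would normally use an MCP SDK rather than hand-built dictionaries.

```python
import json


def make_tools_call(request_id: int, name: str, arguments: dict) -> dict:
    """Client side: build a JSON-RPC request asking a server to run a tool."""
    return {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": name, "arguments": arguments},
    }


def check_price(product: str) -> str:
    """Invented tool backed by a one-item price list."""
    prices = {"standing desk": "$349.99"}
    return prices.get(product, "unknown product")


SERVER_TOOLS = {"check_price": check_price}


def handle_request(request: dict) -> dict:
    """Toy server side: dispatch a tools/call request to a local function."""
    base = {"jsonrpc": "2.0", "id": request.get("id")}
    if request.get("method") != "tools/call":
        # -32601 is the standard JSON-RPC "method not found" code.
        return {**base, "error": {"code": -32601, "message": "Method not found"}}
    params = request["params"]
    tool = SERVER_TOOLS.get(params["name"])
    if tool is None:
        # -32602 is the standard JSON-RPC "invalid params" code.
        return {**base, "error": {"code": -32602,
                                  "message": f"Unknown tool: {params['name']}"}}
    text = tool(**params["arguments"])
    return {**base, "result": {"content": [{"type": "text", "text": text}],
                               "isError": False}}


request = make_tools_call(1, "check_price", {"product": "standing desk"})
response = handle_request(request)
print(json.dumps(request))
print(json.dumps(response))
```

The point of the standard is that any client that speaks these messages can use any server that does, without either side knowing how the other is implemented.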

Lightweight Frameworks

One level up in complexity come OpenAI Agents SDK and CrewAI. Both are relatively lightweight and stay out of your way.

OpenAI Agents SDK is very lightweight and flexible, with a small, clean API. It is relatively new but already capable. CrewAI has been around longer and is also easy to use. One difference is that CrewAI has a low-code angle: you can assemble agents to work on a problem mostly through YAML configuration, which makes it a bit heavier than OpenAI Agents SDK but still in the lightweight category.

With both of these, even though you are using a framework, you still feel like you are just interacting with LLMs.

Heavyweight Frameworks

At the top level of complexity are LangGraph (from the team behind LangChain) and AutoGen (from Microsoft). Both are heavyweight compared to the others, and both have a steeper learning curve. AutoGen is also several different things under one umbrella rather than a single library, which adds to what you need to learn.

LangGraph in particular is quite complex. The core idea is that you are building a computational graph out of your agents and their tools. That is very powerful and means you can build quite sophisticated things, but it comes at a cost. You are signing up for a lot of terminology, concepts, and abstractions that you need to buy into. The ecosystem takes over your project in a big way. It becomes less of an “agentic AI project” and more of a “LangGraph project.” AutoGen has the same dynamic: once you adopt it, the framework’s patterns dominate how you structure your code.

This is how LangGraph and AutoGen are fundamentally different from the lighter options. With OpenAI Agents SDK or CrewAI, you still feel like you are just talking to LLMs. With the heavyweight pair, you are very much living inside that ecosystem.

Choosing Your Level

There are many more frameworks beyond the ones listed here, but these represent the most popular and give a good cross-section of the spectrum. Which one you pick depends on a few things: the use case (different platforms fit different business objectives), personal preference, and how much you want to lean on existing abstractions versus staying close to the metal. The heavyweight frameworks give you power and structure for complex production systems. The lightweight ones (and no framework at all) give you speed, flexibility, and full visibility into what is happening. Both ends have their place.

How to Choose

A few heuristics that tend to hold up:

Prototyping a multi-agent idea quickly: CrewAI or AutoGen.
Production workflows with branching and persistence: LangGraph.
Heavy focus on retrieval over your own data: LlamaIndex.
Enterprise / .NET environment: Semantic Kernel.
Minimal abstraction: OpenAI Agents SDK if you are already on OpenAI models, or Pydantic AI if you want typed, model-agnostic code.

The frameworks are not mutually exclusive. It is common, for example, to use LlamaIndex for retrieval inside a LangGraph workflow.

Open Questions

Things to dig into in future notes:

Patterns for evaluating agents (trajectories, task success, cost).
When a plain tool-calling loop beats a “framework”.
Handling long-running agents, checkpoints, and recovery.
Cost and latency tradeoffs across frameworks on the same task.