How to Design Real AI Agents (Agents)?
Transitioning from "Persona" to Tool-Using Decision Mechanisms
A step-by-step senior-level AI Engineer guide from Level 0 to Level 7.
🎯 Introduction: Why "You Are the Best AI Engineer in the World" Is Not Enough
90% of online tutorials tell you to write this to create an agent:
If you are deploying an application to production, prompts written with only persona will turn your system into a hallucination bomb.
The real power of agents does not come from the roles you assign them; it comes from Tool Calling, Grounding, and Explainability.
💡 What Will You Learn in This Post?
In this post we will take a simple problem — "measuring text similarity" — and show you step by step how to evolve it from a novice approach to an autonomous level designed by a senior AI Engineer. At each level we will compare prompts, architecture, and cost.
🎬 Learn with Video: Murat Karakaya Akademi
You can also watch this training on the Murat Karakaya Akademi YouTube channel. Follow the same journey from Level 0 to Level 7 with step-by-step live demos, code explanations, and architectural analysis.
📌 Level 0 — Novice Approach: Black Box & Hallucination
Scenario: We need to find how similar Text 1 and Text 2 are on a 0–100 scale. (For example, in a RAG system, we measure how faithfully a generated answer stays true to the reference text).
"You are a text analyst. Compare the given Text 1 and Text 2 and decide how similar they are. Return the similarity as a numeric value between 0–100. 0: They do not match at all, 100: They are completely identical word for word."
Why Is This Bad?
- Black Box: The model only returns "75". Why 75? Not 74 or 80? We don't know.
- Subjectivity: The same texts might get 60 one day and 85 the next.
- Hallucination Risk: LLMs cannot perform mathematical measurements; they predict words. "75" is a fabricated (hallucinated) value.
📌 Level 1 — Structured Output & Chain of Thought
The first thing we want is Explainability. We tell the model: "Don't just give me the result, show me your thinking process."
- Chain of Thought: We did not ask for the score first. We asked it to fill the reasoning field first. The model justifies its own inference while writing the explanation.
- Debugging: If the score is 40, we can look at the logs and say "The model missed this detail, that's why it gave a low score."
- Integration: The JSON output can be parsed by application code.
📌 Level 2 — Rubric and Objectivity (Divide & Conquer)
We are not allowing the agent to interpret the abstract concept of "similarity" on its own. We give it rules (rubric). We break the problem into parts.
Rubric:
- 1. Main Idea (0-20): Do both texts convey the same core message?
- 2. Tone and Style (0-20): Do the texts have the same formality and emotion?
- 3. Entities (0-20): Do the names, dates, and numbers in the texts match?
- 4. Missing Information (0-20): Does Text 2 omit an important detail from Text 1?
- 5. Fluency (0-20): How structurally coherent is Text 2?
- Grounding: We removed the model from abstract evaluation on a scale of 100.
- Objectivity: Scores will be much more consistent even if you run them at different times (variance decreases).
- Comparability: We can evaluate different prompts against the same rubric.
📌 Level 3 — Example-Based Rubric Calibration (One-Shot / Few-Shot)
In Level 2 we gave the rubric; however, this was still zero-shot prompting. The model read the criteria but never saw from examples what "When do I give 20 points?", "What is the boundary case for 10?", "When is 0 appropriate?"
At this level we provide small, representative-examples for each criterion.
Scoring Calibration Examples:
- Main Idea — 20 points: "Data cleaning is critical for model success" and "Model quality heavily depends on clean data" convey the same core message.
- Main Idea — 10 points: Both texts discuss data quality but one focuses on security risks while the other focuses on model performance.
- Main Idea — 0 points: One discusses data cleaning, the other discusses a sports match result.
- Tone and Style — 20 points: Both texts are written in an academic and formal tone.
- Tone and Style — 10 points: One is formal, the other more conversational, but the meaning is preserved.
- Consistency: The model uses the same score ranges more reliably.
- Teachability: The rubric is now supported by behavioral examples, not just an abstract list.
- Cost: Few-shot examples increase input tokens — so examples must be short and clear.
Expected JSON Schema:
📌 Level 4 — Tool Calling and Workflow
In Level 3 we calibrated the rubric with examples; however, the evaluation still relied solely on LLM interpretation. Now we add deterministic metrics from external systems as evidence.
Metrics Used:
- ROUGE-L F1: Measures word-sequence overlap.
- Lightweight Similarity Score: A combination of Token cosine + Token Jaccard + Character 3-gram cosine + Sequence ratio.
- If ROUGE-L F1 is low, this indicates low word-sequence overlap; it is not alone evidence of low semantic similarity.
- If the Lightweight Similarity Score is higher than ROUGE-L, the texts may convey a similar message with different words.
- The Lightweight Similarity Score is not a real semantic embedding; it should be used as a decision-support signal, not as the sole decision-maker.
🌍 Why Is This Used in the Real World?
- Reliability: ROUGE and lightweight similarity scores are deterministic — they give the same scores to the same text pair every time.
- Traceability: Since the LLM's opinion is grounded in external evidence, evaluation becomes more auditable.
- Cost Control: Lightweight metrics are fast and do not require heavy model dependencies.
🔬 Experiment Hygiene Note
Level 4 uses the same rubric text as Level 3. This is a deliberate decision: the difference between the two levels is not a rubric change, but only external metric context.
Additional Required JSON Fields in Level 4:
metric_interpretation: Explain how you interpreted the ROUGE-L F1 and Lightweight Similarity Score values.calibration_note: Explain how the rubric calibration examples affected your scoring.
Expected JSON Schema (Level 4):
📌 Level 5 — ReAct and Ollama Tool Calling with Real Agent Loop
We built a strong workflow in Level 4, but was our system a real agent? Not exactly. Because we calculated the metrics with Python. In real agent behavior, the model determines what evidence it needs, the software layer executes the tool, and the result returns to the model.
ReAct Loop:
- Reasoning: The model determines what external evidence it needs to evaluate text similarity.
- Action: Instead of writing plain text, the model produces Ollama's native
tool_callsfield. - Observation: Python executes the relevant function and returns the result to the model as
role="tool". - Final Answer: The model uses the tool results to produce the rubric-based JSON evaluation.
Ollama Native Tool Calling Flow:
Tool Definitions (Python):
Tool Calling Loop (Pseudo-Python):
- ReAct is not just writing
Thought / Action / Observation; it is connecting thought to real tool execution. - An agent is the joint design of prompt, tool calling, execution loop, grounding, and error control layers.
- The model determines tool needs, Python executes the tool, and the final evaluation is supported by external evidence.
📊 Level 5 Token Cost:
"Since Level 5 is a multi-turn agent loop, the input token is not just the length of the first user prompt. With each client.chat call, the system message, user message, previous assistant messages, and tool results are re-injected into context. Therefore, the input token total in Level 5 is not a unique token count, but a cumulative processed token / cost indicator."
🛡️ In the Real World: Production Guardrails
In real production, these protections are added: schema validation, maximum step limit, tool allowlist, retry, and tracing/logging.
Level 5 Tool Calling JSON Schema:
📌 Level 6 — Rubric-Based Sub-Agents and Python Aggregator
In Level 5, a single agent both called tools and interpreted the entire rubric on its own. At this level we try a different architecture: instead of one large prompt, we give each rubric criterion to a separate sub-agent.
Architecture:
- 1 generic sub-agent function is written.
- This function is called 5 times with 5 different rubric configurations.
- Each sub-agent evaluates only its own criterion.
- Python aggregator validates, sorts, and calculates the total score.
- The aggregator makes no LLM calls — it is deterministic.
The Pedagogical Message of This Level:
Limitations:
- 5 sub-agents = 5 LLM calls = more expensive than Level 5.
- Not necessary in every case; it makes sense when the rubric grows or when audibility is critical.
Level 6 — Sub-Agent JSON Output (single criterion):
Python Aggregator Total Output (all criteria):
Level 6 Python Aggregator Function Example:
Level 6 — Token Cost:
- 5 sub-agents = 5 independent LLM calls
- Each call includes system prompt + user prompt + metric context
- Total input token = 5 × (system + user prompt length)
- Aggregator token cost = 0 (Python code runs)
Level 6 — Sub-Agent JSON Schema (inside build_sub_agent_system_prompt):
📌 Level 7 — Orchestrator Agents and When They Are Not Needed
After Level 6 the natural question is: "Wouldn't it be better if an orchestrator agent managed these sub-agents?"
Orchestrator Agent is a powerful architecture in the real world. An orchestrator can break down tasks, decide which sub-agent to run, select appropriate tools, initiate retries on missing or contradictory results, and convert results from different agents into a final decision.
However, in this example we deliberately do not need it, because:
- The 5 rubric criteria are predetermined.
- Every criterion must run.
- Each sub-agent evaluates only its own criterion.
- Missing criterion check, sorting, and total score can be reliably done with Python.
🏗️ When Does an Orchestrator Agent Make Sense?
- When which sub-agents to run changes from task to task.
- When tool selection, data source selection, or workflow branching is needed.
- When there are contradictions between sub-agent responses and an interpretive reconciliation is needed.
- When there are dynamic steps such as quality control, retry, missing information completion, or human approval.
Level 7's Message: Orchestrator + sub-agent architectures exist and are important; however, they are not necessary in every problem. In this example, the Python aggregator is the correct, simple, and instructive choice.
📊 Comparison of All Levels
| Level | Approach | Added Layer | Gain | Limitation / Lesson |
|---|---|---|---|---|
| Lvl 0 | Persona / Black Box | Simple system prompt | Fast start | Inconsistent, unexplainable, hallucination-prone |
| Lvl 1 | JSON + Explanation | Structured output | Answer becomes parseable | Still subjective |
| Lvl 2 | Rubric | Criteria-based evaluation | More objective score | No external evidence, still LLM opinion |
| Lvl 3 | One-Shot / Few-Shot Calibration | Criteria-based examples | More consistent scores | Input token cost increases |
| Lvl 4 | Workflow + Tools | ROUGE + lightweight similarity metrics | Grounded, evidence-based | Developer selects the tools |
| Lvl 5 | ReAct + Tool Calling | Automatic tool calling + execution loop | Real agent behavior | High cost, loop management needed |
| Lvl 6 | Sub-Agents + Python Aggregator | Generic sub-agent + deterministic aggregation | Task decomposition, responsibility separation | 5× LLM calls, more expensive |
| Lvl 7 | Orchestrator Agent Decision | Architectural decision-making | Understanding of advanced architectures | Orchestrator not needed everywhere |
🎓 Final Message: Prompt Engineering Becomes Systems Engineering
Simply giving an agent a powerful-sounding prompt and expecting it to work correctly is not enough.
A Real AI Agent is not just a model that produces answers; it is a software system that jointly manages thinking patterns, tool usage, reasoning steps, and data-driven evidence.
As we progressed from Level 0 to Level 7, we actually built the same idea layer by layer. First we structured the output, then we broke evaluation into rubric parts, then we calibrated with examples. Then we added grounding with metrics and connected the ReAct idea to real tool execution. Finally, we saw that more advanced architectures like orchestrator agents exist, but in this example, not adding an extra LLM orchestrator was the more correct engineering decision.
Our value as AI Engineers emerges here: instead of expecting miracles from the model, understand the model's strengths and weaknesses and build the right architecture around them. Good agent design is not just about writing prompts; it is about proving with data, supporting with tools, making reasoning visible, and making outputs measurable.
📝 Test Texts Used in the Training
These two texts were specifically selected: low word overlap (low ROUGE), yet they convey a semantically similar message.
"In the process of training artificial intelligence models, the use of high-quality datasets is of critical importance. If the dataset contains incorrect, biased, or incomplete information, the results produced by the model will inevitably be flawed and unreliable. Therefore, data cleaning is a more prioritized step than the complexity of the model architecture."
"The success of machine learning algorithms heavily depends on the quality of the information they are fed. Algorithms fed with dirty, biased, or incomplete data will naturally produce incorrect and untrustworthy outputs. Therefore, filtering and organizing data is a much more essential process than building the system's infrastructure."
🏷️ Tags & Hashtags
#ArtificialIntelligence #AI #MachineLearning #DeepLearning #LLM #LargeLanguageModel #PromptEngineering #AgentDesign #AI Agents #Ollama #ToolCalling #ReAct #SubAgent #Orchestrator #StructuredOutput #Rubric #FewShot #Grounding #Explainability #RAG #ROUGE #TokenCost #MuratKarakayaAkademi #AIEngineering #TechEducation #YouTubeEducation #Blogger
