ETL Automation

February 1, 2025 · 8 min read

AI Full Stack Engineer

Demo

Check out the demo before I dive into the blog!

Introduction

In this blog, I explore how automation simplifies problem-solving by testing AI’s ability to break tasks into subproblems. This use case worked with a single agent, but more complex problems require multi-agent orchestration—covered in the next blog.

AI’s Business Impact: Aligning Technology with Strategy

Data ingestion is often a bottleneck for businesses dealing with multiple service providers. The challenge isn't just handling large volumes of data—it’s managing provider-specific schemas, adapting to schema changes, and maintaining consistency across updates. Manually handling these variations is tedious, error-prone, and difficult to scale.

AI-powered automation can help mitigate these challenges by:

Assisting with schema adaptation when providers update their formats.
Reducing operational overhead, allowing teams to focus on higher-value tasks.
Improving efficiency and minimizing downtime caused by schema mismatches.
Supporting more reliable decision-making with structured, up-to-date data ingestion.

However, AI is not a silver bullet. Risks like hallucinated transformations or misinterpretations of schema changes require robust safeguards. Detailed logging and iterative improvements help detect issues, refine workflows, and adapt the system over time.

For businesses looking to scale, balancing automation with oversight is key. Tracking queries, intermediate steps, and outputs can help to detect errors and patterns. This can help improve the system over time.

The Importance of Orchestration

LLMs generate text but cannot execute code. To incorporate execution, they must delegate tasks to external tools and refine responses based on the results. The diagram below illustrates this orchestration. Can this be achieved using prebuilt solutions, or does it require a custom implementation?

Lets explore the challenges, technical details, and key insights.

Navigating Trade-offs in AI System Design

Prebuilt Agent

Langchain provides a prebuilt CSV Agent. At first glance, it seems to solve the problem of orchestration for me, however, on closer inspection, there are some concerns.

Pros

Feature	Description
Ease of Use	Works out of the box with minimal setup
Streaming Events	Supports real-time streaming with intermediate thoughts and actions
Observability	Built-in event streaming makes reasoning process transparent
Orchestration	Manages tool selection, execution, and self-reflection in a loop until a final response is generated
Mental Load	Frees up cognitive effort to focus on high-level problem solving
Quick Deployment	Requires minimal configuration, enabling rapid prototyping

Example Streamed Event

AgentAction(
  tool='python_repl_ast',
  tool_input="df[(df['Event Details'].str.contains('09-08-2018')) & (df['Frequency'] == 'Recurring')]",
  log='Thought: I need to check if there are any monthly bookings on 9 August 2018 in the dataframe. I will filter the dataframe for entries with the date "09-08-2018" and a frequency of "Recurring". Then, I will extract the relevant details for any such bookings.\n\nAction: python_repl_ast\nAction Input: df[(df[\'Event Details\'].str.contains(\'09-08-2018\')) & (df[\'Frequency\'] == \'Recurring\')] '
  observation='''
    Customer Invoice No.,Event Details,Link Event,Customer Name,Provider Name,Frequency,Hourly Rate,Duration,Amount Gross,Amendments,Gross Difference,Unnamed: 11
    1b1c9f58-bd91-4289-929f-948e718e2473,04-08-2018 09:30,link,Alice Ng,SparkleClean Services,Cancellation,22.5,2.5,0.0,Yes,-66.25,
    ...
    '''
)

As seen above, prebuilt agent will:

handle the orchestration
streamed events that can be transmitted to the frontend using a preferred server-client communication method (e.g., WebSockets, Server-Sent Events)
- this enables real-time rendering of the agent's thought process, enhancing transparency and user experience.

Cons

Feature	Description
Limited Flexibility	Predefined agents exist (e.g., Pandas Dataframe Agent), but lacks support for problems that requires specialized agents
Restricted Orchestration Between Agents	Difficult to implement complex multi-agent workflows when the problem demands specialized coordination

For this blog, the Pandas DataFrame Agent is well-suited for the ETL use case, making LangChain’s prebuilt solution a practical choice. The problem scope does not require multi-agent coordination, so the limitations in orchestration are not a concern here. However, I will still explore an alternative that offers greater flexibility in designing custom agent interactions: LangGraph.

Alternative: Langgraph

Langgraph presents a compelling alternative when the use case demands greater flexibility in orchestrating multiple agents. Unlike Langchain’s predefined agents, Langgraph allows for customizable workflows, enabling multi-agent collaboration, conditional logic, and more complex decision-making.

Pros

Feature	Description
Flexible Orchestration	Allows for complex agent workflows, including multi-agent interactions, conditional branching, and asynchronous execution.
Fine-Grained Control	Developers can define precise execution logic, ensuring that agents interact optimally based on the use case’s needs.
Extensibility	Provides a modular foundation to scale and integrate with other AI components as requirements evolve.

Cons

Feature	Description
Increased Complexity	Greater customization comes with a steeper learning curve and higher development effort compared to Langchain’s prebuilt agents.
Manual Implementation Required	Orchestration has to be manually implemented as opposed to the prebuilt one in Langchain.

While Langchain’s predefined agents are sufficient for this ETL use case, exploring Langgraph lays the groundwork for the next blog, where the problem scope justifies the need for a more advanced, multi-agent orchestration approach. Given this, I ultimately chose to implement Langgraph for this blog.

Scaling AI Systems: Architectural and Operational Insights

Scaling Considerations for a Small User Base

The initial goal is to support tens of users effectively while ensuring a balance between performance and cost-efficiency. Right out of the box, the OpenAI API and FastAPI as the backend handle concurrency efficiently, reducing the need for additional scaling efforts in the early stages.

Exploring Groq for Fast Agentic Workflows

Groq presents an alternative cloud provider for LLMs, particularly excelling in agentic workflows requiring rapid responses. Key observations:

It delivers significantly faster inference times, enhancing user experience.
From my experience, it tends to be more talkative but less effective at solving tasks compared to GPT-4O-Mini.
The cost is approximately six times higher than mainstream alternatives.
Given the above trade-offs, I did not invest heavily in prompt engineering, as the core focus of this blog is not on achieving low latency.

Choosing the right LLM cloud provider with my current goal in mind is essential—treating this as a stepping stone rather than an endless pursuit of the "best" solution. There will always be "better" approaches for different use cases, but without a clear stopping point, I risk constantly iterating without making real progress.

Failure Analysis: Debugging and Mitigating AI Pitfalls

The Interplay Between Prompts and Models

Given the scaling experiment above, I noticed that prompt-model tagging will be necessary should I continue on the path of different models in production. As each model is trained differently, prompts are model-specific and should be tagged to each model for best results.

The Importance of a Hallucination Detection Module

In hindsight, integrating a hallucination detection module is crucial, even if it does not entirely prevent incorrect outputs. It provides an additional layer of reasoning and debugging, helping to:

Detect inconsistencies in model-generated responses before Python pandas code is executed
Reduces the risk of propagating erroneous outputs.

However, implementing such a module increases system complexity. Given this, I hope to cover better orchestration strategies in a separate blog.

Security Risks and Best Practices in AI Systems

Using a prebuilt pandas DataFrame agent introduces security risks by exposing system Python execution to anyone accessing the system. In a production environment, this can have serious implications:

Containerized Deployment on AWS
- The production environment runs in AWS-managed containers.
- Containers are ephemeral—if the environment is corrupted, they can be easily destroyed and redeployed.
- They contain nothing of value other than the API key (discussed in the next section).
- My personal desktop remains unaffected by these risks.

API Key Exposure and Mitigation Strategies

Risk of OpenAI API Key Exposure
- The API key must be accessible on the production server to interact with GPT-4O-Mini.
- While restricting API key exposure at the prompt level (e.g., preventing it from being output) may seem like a solution, it can be bypassed via jailbreak techniques.
- To mitigate this, I have enforced security at the code level, ensuring the API key is never exposed in logs, responses, or accessible environments.
  - Additionally, I have implemented usage limits at the API provider level to cap monthly expenses, serving as a fail-safe in case of unexpected leaks or abuse.
Potential and Unknown Attack Vectors
- Despite implemented safeguards, I am aware of specific ways my API key could still be leaked.
- Additionally, unknown vulnerabilities may exist, which need continuous assessment and mitigation.

Long-Term Security Considerations

If this system is scaled further, securing the API key and the overall infrastructure must be a top priority. A more comprehensive approach—including environment isolation, access controls, and strict request validation—should be taken to minimize security risks.

Lessons Learned from AI Deployments

Agentic workflows are just tools, not solutions—they require a clear use case or domain-specific challenge to be effective. Not every problem warrants an agentic approach; the complexity must justify the need for automation. Scalability and efficiency should be considered from the start, ensuring workflows remain adaptable through iterative improvements.

Demo​

Introduction​

AI’s Business Impact: Aligning Technology with Strategy​

The Importance of Orchestration​

Navigating Trade-offs in AI System Design​

Prebuilt Agent​

Pros​

Cons​

Alternative: Langgraph​

Pros​

Cons​

Scaling AI Systems: Architectural and Operational Insights​

Scaling Considerations for a Small User Base​

Exploring Groq for Fast Agentic Workflows​

Failure Analysis: Debugging and Mitigating AI Pitfalls​

The Interplay Between Prompts and Models​

The Importance of a Hallucination Detection Module​

Security Risks and Best Practices in AI Systems​

API Key Exposure and Mitigation Strategies​

Long-Term Security Considerations​

Lessons Learned from AI Deployments​

Demo

Introduction

AI’s Business Impact: Aligning Technology with Strategy

The Importance of Orchestration

Navigating Trade-offs in AI System Design

Prebuilt Agent

Pros

Cons

Alternative: Langgraph

Pros

Cons

Scaling AI Systems: Architectural and Operational Insights

Scaling Considerations for a Small User Base

Exploring Groq for Fast Agentic Workflows

Failure Analysis: Debugging and Mitigating AI Pitfalls

The Interplay Between Prompts and Models

The Importance of a Hallucination Detection Module

Security Risks and Best Practices in AI Systems

API Key Exposure and Mitigation Strategies

Long-Term Security Considerations

Lessons Learned from AI Deployments