Building an Automated Research Workflow
- Junyang Deng
- Tutorial
- March 24, 2026
Introduction
This tutorial documents the technical innovations and hard-won solutions behind an automated research workflow. This is not a theoretical “how to organize agents” guide—it’s practical documentation of real problems encountered in building a 9-stage automated research pipeline that takes you from a research idea to a completed paper.
Why Build an Automated Research Workflow?
Academic research involves repetitive, time-consuming tasks: data cleaning, literature searches, statistical analysis, figure generation, and writing. The vision is simple: describe your research goal once, let AI handle the rest. A single prompt triggers the entire pipeline with zero human intervention from data to final PDF.
But building something that actually works—reliably, repeatedly—requires solving some hard technical problems.
Building Your Own Research Workflow
Here is a step-by-step guide to building your own research workflow, based on real experience iterating with Claude Code.
Step 1: Define Your End Goal
Start by clearly identifying what the pipeline should produce. Write your requirements, expectations, and scope into a file called .claude/plan.prompt.md. This file becomes the single source of truth for the entire workflow—it should capture what you want, not how to get there yet.
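To make this concrete, here is a hedged example of what a plan.prompt.md might contain. The headings and wording are illustrative only, not a schema the repository requires:

```markdown
# Goal
Produce a complete, submission-ready paper (PDF) from the dataset in data/raw/.

# Requirements
- The research question must be answerable with the variables actually in the data
- All figures and tables are generated by scripts, with no manual edits
- Every stage writes its outputs to disk before the next stage starts

# Scope
- In scope: data cleaning, literature search, analysis, figures, writing
- Out of scope: data collection, human review cycles
```

Keeping this file focused on the what rather than the how gives Claude room to propose the pipeline structure in the next step.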
Step 2: Generate a Plan with Claude Code
Use the /plan command in Claude Code with your plan.prompt.md as input. Claude will generate a set of steps to achieve your goal. Review the plan carefully and make sure it aligns with your expectations—push back and revise until the high-level structure feels right before moving forward.
Step 3: Expand Each Step into Detailed Guidance
Once the plan is solid, ask Claude to generate more detailed instructions for each step. Pay special attention to how the steps interact with each other: every stage should have clear inputs, outputs, and handoffs so the pipeline flows end-to-end without gaps.
Step 4: Run, Observe, and Record Errors
Run the generated pipeline multiple times. Ask Claude to remember the errors and difficulties it encounters across runs. This iterative execution is where most of the real learning happens—you’ll discover edge cases, broken assumptions, and failure modes that weren’t obvious from the plan alone.
Step 5: Revise and Debug Individual Skills
After identifying recurring issues from your test runs, go into each skill individually to revise and debug. This is the fine-tuning phase where you fix specific failure points, improve prompts, tighten validation, and harden the pipeline until it runs reliably. In this step, I use /skill-creator frequently for validation and evaluation of the skills.
Technical Challenges & Solutions
When I designed this pipeline, I encountered several challenges. Here are the major ones and how I addressed them.
Research Question Formulation
Problem: LLMs tend to generate research questions that sound plausible but aren’t actually answerable with your data. A beautiful question is useless if the data doesn’t support it.
Solution: Use a data-driven approach with the PICO framework:
- Population: What subjects are in your data?
- Intervention: What variables can you manipulate?
- Comparison: What groups can you compare?
- Outcome: What metrics are actually measurable?
Before generating questions, have the LLM inspect your data schema and verify feasibility. A feasibility validator script checks that required variables exist, have sufficient sample size, and aren’t missing too many values.
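The repository's feasibility validator isn't reproduced here, but the idea can be sketched in plain Python. The function name, thresholds, and data layout (a mapping of column name to list of values, with None for missing) are assumptions for illustration:

```python
def check_feasibility(data, required_vars, min_n=30, max_missing=0.2):
    """Hypothetical feasibility check: verify that the variables a candidate
    research question needs actually exist, have enough observations, and
    aren't mostly missing. Returns a list of problems (empty = feasible)."""
    problems = []
    for var in required_vars:
        if var not in data:
            problems.append(f"{var}: column not found")
            continue
        values = data[var]
        n_present = sum(v is not None for v in values)
        if n_present < min_n:
            problems.append(f"{var}: only {n_present} non-missing values (need {min_n})")
        elif values and (1 - n_present / len(values)) > max_missing:
            problems.append(f"{var}: missing fraction exceeds {max_missing:.0%}")
    return problems
```

Running a check like this before question generation turns "sounds answerable" into "is answerable with these exact columns."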
LLM Hallucination
Problem: LLMs hallucinate when checking “did this work?” and can’t reliably verify file existence. They’ll say “file created successfully” when nothing was written.
Solution: Use scripts instead of prompts to confirm that every step actually ran before the pipeline moves on. Python-based file-system validation plus pre-emptive feasibility checks. The _validate_outputs() function checks file existence and size directly via the OS, raising ValueError if expected outputs are missing; complete_stage() runs this validation before marking a stage complete.
```python
import os

def _validate_outputs(expected_outputs: dict) -> None:
    """Validate that expected output files exist and have content."""
    for name, path in expected_outputs.items():
        if not os.path.exists(path):
            raise ValueError(f"Missing required output: {name} at {path}")
        if os.path.getsize(path) == 0:
            raise ValueError(f"Empty output file: {name} at {path}")
```
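The article doesn't show complete_stage() itself; a minimal sketch of how it could wire validation into progress tracking, assuming a progress.json keyed by stage name (the file layout and field names here are guesses, not the repository's actual schema):

```python
import json
import os
import time

def _validate_outputs(expected_outputs: dict) -> None:
    # Same OS-level check as above: files must exist and be non-empty.
    for name, path in expected_outputs.items():
        if not os.path.exists(path):
            raise ValueError(f"Missing required output: {name} at {path}")
        if os.path.getsize(path) == 0:
            raise ValueError(f"Empty output file: {name} at {path}")

def complete_stage(stage_name, expected_outputs, progress_path="progress.json"):
    """Mark a stage complete only after its outputs are verified on disk."""
    _validate_outputs(expected_outputs)  # raises before any state is written
    progress = {}
    if os.path.exists(progress_path):
        with open(progress_path) as f:
            progress = json.load(f)
    progress[stage_name] = {"status": "complete", "finished_at": time.time()}
    with open(progress_path, "w") as f:
        json.dump(progress, f, indent=2)  # write immediately, for crash recovery
```

Because validation runs before the progress file is touched, a stage can never be recorded as done while its outputs are missing, no matter what the LLM claims.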
Token Overflow: Context Bundle + Pruning
Problem: 9 stages × large JSON files = token overflow. Each stage needs all previous context, but passing full file contents exceeds context limits.
Failed Approaches:
- Passing full file contents → token overflow
- Truncating files → loss of critical information
- Asking LLM to summarize → inconsistent, unreliable
Solution: Two-part system. Context bundles capture semantic decisions (why) rather than raw outputs (what). Each stage adds a compressed layer with:
- key_decisions - What was decided and why
- forward_references - Pointers to preserved files
- stage_summary - Stage-specific output summary
Selective pruning rules specify:
- can_prune - Files deletable after each stage
- must_preserve - Files required for downstream stages
- summary_in_context - What summaries remain in context
Pruning modes: safe (after checkpoint stages), aggressive (after every eligible stage), off (for debugging). Result: ~80% token reduction while maintaining full resumability.
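The bundle-and-prune pattern can be sketched as two small helpers. The function names, JSON layout, and rule shapes below are illustrative assumptions; the repository's context_manager.py may differ:

```python
import json
import os

def add_bundle_layer(bundle_path, stage, key_decisions, forward_references, stage_summary):
    """Append one compressed context layer using the field names from the bundle
    schema above: decisions (why), pointers to preserved files, and a summary."""
    layers = []
    if os.path.exists(bundle_path):
        with open(bundle_path) as f:
            layers = json.load(f)
    layers.append({
        "stage": stage,
        "key_decisions": key_decisions,            # the "why", not raw outputs
        "forward_references": forward_references,  # file pointers, not contents
        "stage_summary": stage_summary,
    })
    with open(bundle_path, "w") as f:
        json.dump(layers, f, indent=2)

def prune(stage_rules, mode="safe"):
    """Delete files the rules mark prunable; must_preserve always wins."""
    if mode == "off":
        return []  # debugging mode: keep everything
    deleted = []
    for path in stage_rules.get("can_prune", []):
        if path in stage_rules.get("must_preserve", []):
            continue
        if os.path.exists(path):
            os.remove(path)
            deleted.append(path)
    return deleted
```

The key design choice is that pruning only deletes files whose decisions have already been captured in a bundle layer, so later stages can still reason about what happened without re-reading the raw outputs.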
Feedback Loop State Management
Problem: When analysis fails, you need to re-run stages 3-5. How do you preserve state across iterations without losing progress or repeating expensive operations?
Solution: cycle_state.json tracks feedback loop iterations with:
- current_cycle - Current iteration number
- max_cycles - Maximum allowed iterations
- failed_candidates - Variables that failed analysis
- failure_reasons - Why each candidate failed
The reset_stage_progress() function deletes progress.json to enable re-entry. Fast-track mode skips web searches (they didn’t change), runs primary model + Table 1 only, and applies score penalties to failed candidates. Stages 3-5 files are never pruned during active feedback cycles.
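A minimal sketch of that state handling, assuming the cycle_state.json fields listed above (the function bodies here are illustrative; the repository's feedback_utils.py is the actual implementation):

```python
import json
import os

def start_next_cycle(state_path="cycle_state.json", max_cycles=3):
    """Advance the feedback loop by one iteration, enforcing the cycle cap."""
    state = {"current_cycle": 0, "max_cycles": max_cycles,
             "failed_candidates": [], "failure_reasons": {}}
    if os.path.exists(state_path):
        with open(state_path) as f:
            state = json.load(f)  # resume: failures persist across iterations
    if state["current_cycle"] >= state["max_cycles"]:
        raise RuntimeError("Feedback loop exceeded max_cycles; stopping.")
    state["current_cycle"] += 1
    with open(state_path, "w") as f:
        json.dump(state, f, indent=2)
    return state

def record_failure(candidate, reason, state_path="cycle_state.json"):
    """Remember which candidate variables failed, and why, for score penalties."""
    with open(state_path) as f:
        state = json.load(f)
    if candidate not in state["failed_candidates"]:
        state["failed_candidates"].append(candidate)
    state["failure_reasons"][candidate] = reason
    with open(state_path, "w") as f:
        json.dump(state, f, indent=2)

def reset_stage_progress(progress_path="progress.json"):
    """Delete the stage progress file so stages 3-5 can be re-entered."""
    if os.path.exists(progress_path):
        os.remove(progress_path)
```

Keeping the cycle state in its own file, separate from per-stage progress, is what lets reset_stage_progress() wipe stage completion without forgetting which candidates already failed.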
Key Takeaways
When you recognize repetition in your daily work, consider wrapping it into a skill or a multi-skill workflow. The initial investment pays off quickly.
Building a workflow that is solid and consistently produces high-quality results takes a lot of time. The tutorial above skips months of iteration—the real work is in the debugging, testing, and refining.
- File system as state store: immediate writes for crash recovery. Never trust LLM memory for critical state.
- Python validation, not LLM: reliable checking without hallucination. If it matters, verify it with code.
- Semantic context + selective pruning: you can reduce tokens by ~80% while maintaining resumability if you design your context bundles thoughtfully.
Further Reading
GitHub Repository: https://github.com/DamarisDeng/paper-writing-system
- workflow/scripts/progress_utils.py - Progress tracking implementation
- workflow/scripts/context_manager.py - Context bundle and pruning system
- workflow/scripts/feedback_utils.py - Feedback loop management
- workflow/scripts/feasibility_validator.py - Pre-emptive validation
- workflow/skills/load-and-profile/SKILL.md - Example skill structure