Building an Automated Research Workflow

Introduction

This tutorial documents the technical innovations and hard-won solutions behind an automated research workflow. This is not a theoretical “how to organize agents” guide—it’s practical documentation of real problems encountered in building a 9-stage automated research pipeline that takes you from a research idea to a completed paper.

Why Build an Automated Research Workflow?

Academic research involves repetitive, time-consuming tasks: data cleaning, literature searches, statistical analysis, figure generation, and writing. The vision is simple: describe your research goal once, let AI handle the rest. A single prompt triggers the entire pipeline with zero human intervention from data to final PDF.

But building something that actually works—reliably, repeatedly—requires solving some hard technical problems.

Building Your Own Research Workflow

Here is a step-by-step guide to building your own research workflow, based on real experience iterating with Claude Code.

Step 1: Define Your End Goal

Start by clearly identifying what the pipeline should produce. Write your requirements, expectations, and scope into a file called .claude/plan.prompt.md. This file becomes the single source of truth for the entire workflow—it should capture what you want, not how to get there yet.
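The exact contents depend on your project, but a minimal plan.prompt.md might look like the sketch below. The dataset path, outcome, and directory names are placeholders for illustration, not part of the original workflow:

```markdown
# Goal
Produce a publication-ready manuscript (PDF) from the dataset in data/cohort.csv.

# Scope
- Observational analysis only; no new data collection.
- Primary outcome: 90-day survival.

# Expectations
- Every figure must be reproducible from a script in workflow/scripts/.
- All statistical choices must be justified in the methods section.
```

Keep this file about the *what*; the plan generated in the next step will supply the *how*.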

Step 2: Generate a Plan with Claude Code

Use Claude Code's /plan command with your plan.prompt.md as input. Claude will generate a set of steps to achieve your goal. Review the plan carefully and make sure it aligns with your expectations—push back and revise until the high-level structure feels right before moving forward.

Step 3: Expand Each Step into Detailed Guidance

Once the plan is solid, ask Claude to generate more detailed instructions for each step. Pay special attention to how the steps interact with each other: every stage should have clear inputs, outputs, and handoffs so the pipeline flows end-to-end without gaps.
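One way to make those handoffs explicit is a small stage manifest that declares each stage's inputs and outputs, so gaps can be caught before anything runs. This is a hypothetical sketch; the stage names and paths are illustrative, not the pipeline's actual ones:

```python
# Hypothetical stage manifest: each stage declares its inputs and outputs
# so the pipeline can be checked for gaps before any stage executes.
STAGES = {
    "load_and_profile": {
        "inputs": ["data/raw.csv"],
        "outputs": ["outputs/profile.json"],
    },
    "formulate_question": {
        "inputs": ["outputs/profile.json"],
        "outputs": ["outputs/research_question.json"],
    },
}

def check_handoffs(stages: dict) -> list[str]:
    """Return inputs that no earlier stage produces (raw data files excluded)."""
    produced: set[str] = set()
    gaps = []
    for name, spec in stages.items():
        for inp in spec["inputs"]:
            if inp not in produced and not inp.startswith("data/"):
                gaps.append(f"{name} needs {inp}, which no prior stage produces")
        produced.update(spec["outputs"])
    return gaps
```

Running a check like this once, up front, is much cheaper than discovering a missing handoff five stages into a run.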

Step 4: Run, Observe, and Record Errors

Run the generated pipeline multiple times. Ask Claude to remember the errors and difficulties it encounters across runs. This iterative execution is where most of the real learning happens—you’ll discover edge cases, broken assumptions, and failure modes that weren’t obvious from the plan alone.

Step 5: Revise and Debug Individual Skills

After identifying recurring issues from your test runs, go into each skill individually to revise and debug. This is the fine-tuning phase where you fix specific failure points, improve prompts, tighten validation, and harden the pipeline until it runs reliably. In this step, I use /skill-creator frequently to validate and evaluate the skills.

Technical Challenges & Solutions

When I designed this pipeline, I encountered several challenges. Here are the major ones and how I addressed them.

Research Question Formulation

Problem: LLMs tend to generate research questions that sound plausible but aren’t actually answerable with your data. A beautiful question is useless if the data doesn’t support it.

Solution: Use a data-driven approach with the PICO framework:

  • Population: What subjects are in your data?
  • Intervention: What variables can you manipulate?
  • Comparison: What groups can you compare?
  • Outcome: What metrics are actually measurable?

Before generating questions, have the LLM inspect your data schema and verify feasibility. A feasibility validator script checks that required variables exist, have sufficient sample size, and aren’t missing too many values.
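A minimal sketch of such a validator, working over plain rows rather than any particular dataframe library; the function name and thresholds are illustrative, and the real feasibility_validator.py may differ:

```python
def check_feasibility(rows: list[dict], required: list[str],
                      min_n: int = 30, max_missing: float = 0.2) -> list[str]:
    """Return a list of problems; an empty list means the question looks feasible."""
    problems = []
    n = len(rows)
    if n < min_n:
        problems.append(f"only {n} rows (need at least {min_n})")
    for col in required:
        present = [r.get(col) for r in rows if r.get(col) is not None]
        if not present:
            problems.append(f"missing variable: {col}")
            continue
        missing_frac = 1 - len(present) / n if n else 1.0
        if missing_frac > max_missing:
            problems.append(f"{col}: {missing_frac:.0%} values missing")
    return problems
```

The point is that feasibility is decided by code inspecting the data, not by the LLM's optimism about its own question.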

LLM Hallucination

Problem: LLMs hallucinate when checking “did this work?” and can’t reliably verify file existence. They’ll say “file created successfully” when nothing was written.

Solution: Use scripts, not prompts, to verify that every step actually executed before moving on: Python-based file system validation plus pre-emptive feasibility checks. The _validate_outputs() function checks file existence and size directly via the OS, raising ValueError if expected outputs are missing. complete_stage() calls this validation before marking a stage complete.

import os

def _validate_outputs(expected_outputs: dict) -> None:
    """Validate that expected output files exist and have content."""
    for name, path in expected_outputs.items():
        # Ask the OS, never the LLM, whether a file was actually written.
        if not os.path.exists(path):
            raise ValueError(f"Missing required output: {name} at {path}")
        if os.path.getsize(path) == 0:
            raise ValueError(f"Empty output file: {name} at {path}")

Token Overflow: Context Bundle + Pruning

Problem: 9 stages × large JSON files = token overflow. Each stage needs all previous context, but passing full file contents exceeds context limits.

Failed Approaches:

  • Passing full file contents → token overflow
  • Truncating files → loss of critical information
  • Asking LLM to summarize → inconsistent, unreliable

Solution: Two-part system. Context bundles capture semantic decisions (why) rather than raw outputs (what). Each stage adds a compressed layer with:

  • key_decisions - What was decided and why
  • forward_references - Pointers to preserved files
  • stage_summary - Stage-specific output summary

Selective pruning rules specify:

  • can_prune - Files deletable after each stage
  • must_preserve - Files required for downstream stages
  • summary_in_context - What summaries remain in context

Pruning modes: safe (after checkpoint stages), aggressive (after every eligible stage), off (for debugging). Result: ~80% token reduction while maintaining full resumability.

Feedback Loop State Management

Problem: When analysis fails, you need to re-run stages 3-5. How do you preserve state across iterations without losing progress or repeating expensive operations?

Solution: cycle_state.json tracks feedback loop iterations with:

  • current_cycle - Current iteration number
  • max_cycles - Maximum allowed iterations
  • failed_candidates - Variables that failed analysis
  • failure_reasons - Why each candidate failed

The reset_stage_progress() function deletes progress.json to enable re-entry. Fast-track mode skips web searches (their results haven't changed), runs only the primary model plus Table 1, and applies score penalties to failed candidates. Files from stages 3-5 are never pruned while a feedback cycle is active.
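A hypothetical cycle_state.json for a run in which two candidate variables failed analysis; the field names match the list above, but the candidate names and reasons are invented for illustration:

```json
{
  "current_cycle": 2,
  "max_cycles": 3,
  "failed_candidates": ["crp_level", "smoking_status"],
  "failure_reasons": {
    "crp_level": "model did not converge",
    "smoking_status": "insufficient events in comparison group"
  }
}
```

When current_cycle reaches max_cycles, the loop stops rather than retrying forever.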

Key Takeaways

  1. When you recognize repetition in your daily work, consider wrapping it into a skill or a multi-skill workflow. The initial investment pays off quickly.

  2. It takes a lot of time to make a workflow that is solid and consistently produces high-quality results. The tutorial above compresses months of iteration; the real work is in the debugging, testing, and refining.

  3. File system as state store - Immediate writes for crash recovery. Never trust LLM memory for critical state.

  4. Python validation, not LLM - Reliable checking without hallucination. If it matters, verify it with code.

  5. Semantic context + selective pruning - You can reduce tokens by ~80% while maintaining resumability if you design your context bundles thoughtfully.

Further Reading

GitHub Repository: https://github.com/DamarisDeng/paper-writing-system

  • workflow/scripts/progress_utils.py - Progress tracking implementation
  • workflow/scripts/context_manager.py - Context bundle and pruning system
  • workflow/scripts/feedback_utils.py - Feedback loop management
  • workflow/scripts/feasibility_validator.py - Pre-emptive validation
  • workflow/skills/load-and-profile/SKILL.md - Example skill structure