Building an Automated Research Workflow
- Junyang Deng
- Tutorial
- March 24, 2026
Introduction
This tutorial documents the technical innovations and hard-won solutions behind an automated research workflow. This is not a theoretical “how to organize agents” guide—it’s practical documentation of real problems encountered in building a 9-stage automated research pipeline that takes you from a research idea to a completed paper.
Why Build an Automated Research Workflow?
Academic research involves repetitive, time-consuming tasks: data cleaning, literature searches, statistical analysis, figure generation, and writing. The vision is simple: describe your research goal once, let AI handle the rest. A single prompt triggers the entire pipeline with zero human intervention from data to final PDF.
But building something that actually works—reliably, repeatedly—requires solving some hard technical problems.
Building Your Own Research Workflow
Here is a step-by-step guide to building your own research workflow, based on real experience iterating with Claude Code.
Step 1: Define Your End Goal
Start by clearly identifying what the pipeline should produce. Write your requirements, expectations, and scope into a file called .claude/plan.prompt.md. This file becomes the single source of truth for the entire workflow—it should capture what you want, not how to get there yet.
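To make this concrete, here is a hedged example of what a plan.prompt.md might contain. The headings and wording are illustrative only, not a schema the repository requires:

```markdown
# Goal
Produce a complete, submission-ready paper (PDF) from the dataset in data/raw/.

# Requirements
- The research question must be answerable with the variables actually in the data
- All figures and tables are generated by scripts, with no manual edits
- Every stage writes its outputs to disk before the next stage starts

# Scope
- In scope: data cleaning, literature search, analysis, figures, writing
- Out of scope: data collection, human review cycles
```

Keeping this file focused on the what rather than the how gives Claude room to propose the pipeline structure in the next step.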
Step 2: Generate a Plan with Claude Code
Use the /plan command in Claude Code with your plan.prompt.md as input. Claude will generate a set of steps to achieve your goal. Review the plan carefully and make sure it aligns with your expectations—push back and revise until the high-level structure feels right before moving forward.
Step 3: Expand Each Step into Detailed Guidance
Once the plan is solid, ask Claude to generate more detailed instructions for each step. Pay special attention to how the steps interact with each other: every stage should have clear inputs, outputs, and handoffs so the pipeline flows end-to-end without gaps.
Step 4: Run, Observe, and Record Errors
Run the generated pipeline multiple times. Ask Claude to remember the errors and difficulties it encounters across runs. This iterative execution is where most of the real learning happens—you’ll discover edge cases, broken assumptions, and failure modes that weren’t obvious from the plan alone.
Step 5: Revise and Debug Individual Skills
After identifying recurring issues from your test runs, go into each skill individually to revise and debug. This is the fine-tuning phase where you fix specific failure points, improve prompts, tighten validation, and harden the pipeline until it runs reliably. In this step, I use /skill-creator frequently for validation and evaluation of the skills.
Technical Challenges & Solutions
When I designed this pipeline, I encountered several challenges. Here are the major ones and how I addressed them.
Research Question Formulation
Problem: LLMs tend to generate research questions that sound plausible but aren’t actually answerable with your data. A beautiful question is useless if the data doesn’t support it.
Solution: Use a data-driven approach with the PICO framework:
- Population: What subjects are in your data?
- Intervention: What variables can you manipulate?
- Comparison: What groups can you compare?
- Outcome: What metrics are actually measurable?
Before generating questions, have the LLM inspect your data schema and verify feasibility. A feasibility validator script checks that required variables exist, have sufficient sample size, and aren’t missing too many values.
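The repository's feasibility validator isn't reproduced here, but the idea can be sketched in plain Python. The function name, thresholds, and data layout (a mapping of column name to list of values, with None for missing) are assumptions for illustration:

```python
def check_feasibility(data, required_vars, min_n=30, max_missing=0.2):
    """Hypothetical feasibility check: verify that the variables a candidate
    research question needs actually exist, have enough observations, and
    aren't mostly missing. Returns a list of problems (empty = feasible)."""
    problems = []
    for var in required_vars:
        if var not in data:
            problems.append(f"{var}: column not found")
            continue
        values = data[var]
        n_present = sum(v is not None for v in values)
        if n_present < min_n:
            problems.append(f"{var}: only {n_present} non-missing values (need {min_n})")
        elif values and (1 - n_present / len(values)) > max_missing:
            problems.append(f"{var}: missing fraction exceeds {max_missing:.0%}")
    return problems
```

Running a check like this before question generation turns "sounds answerable" into "is answerable with these exact columns."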
LLM Hallucination
Problem: LLMs hallucinate when checking “did this work?” and can’t reliably verify file existence. They’ll say “file created successfully” when nothing was written.
Solution: Use scripts instead of prompts to confirm that every step actually ran before the pipeline moves on. Python-based file-system validation plus pre-emptive feasibility checks. The _validate_outputs() function checks file existence and size directly via the OS, raising ValueError if expected outputs are missing; complete_stage() runs this validation before marking a stage complete.
```python
import os

def _validate_outputs(expected_outputs: dict) -> None:
    """Validate that expected output files exist and have content."""
    for name, path in expected_outputs.items():
        if not os.path.exists(path):
            raise ValueError(f"Missing required output: {name} at {path}")
        if os.path.getsize(path) == 0:
            raise ValueError(f"Empty output file: {name} at {path}")
```
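The article doesn't show complete_stage() itself; a minimal sketch of how it could wire validation into progress tracking, assuming a progress.json keyed by stage name (the file layout and field names here are guesses, not the repository's actual schema):

```python
import json
import os
import time

def _validate_outputs(expected_outputs: dict) -> None:
    # Same OS-level check as above: files must exist and be non-empty.
    for name, path in expected_outputs.items():
        if not os.path.exists(path):
            raise ValueError(f"Missing required output: {name} at {path}")
        if os.path.getsize(path) == 0:
            raise ValueError(f"Empty output file: {name} at {path}")

def complete_stage(stage_name, expected_outputs, progress_path="progress.json"):
    """Mark a stage complete only after its outputs are verified on disk."""
    _validate_outputs(expected_outputs)  # raises before any state is written
    progress = {}
    if os.path.exists(progress_path):
        with open(progress_path) as f:
            progress = json.load(f)
    progress[stage_name] = {"status": "complete", "finished_at": time.time()}
    with open(progress_path, "w") as f:
        json.dump(progress, f, indent=2)  # write immediately, for crash recovery
```

Because validation runs before the progress file is touched, a stage can never be recorded as done while its outputs are missing, no matter what the LLM claims.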
Token Overflow: Context Bundle + Pruning
Problem: 9 stages × large JSON files = token overflow. Each stage needs all previous context, but passing full file contents exceeds context limits.
Failed Approaches:
- Passing full file contents → token overflow
- Truncating files → loss of critical information
- Asking LLM to summarize → inconsistent, unreliable
Solution: Two-part system. Context bundles capture semantic decisions (why) rather than raw outputs (what). Each stage adds a compressed layer with:
- key_decisions - What was decided and why
- forward_references - Pointers to preserved files
- stage_summary - Stage-specific output summary
Selective pruning rules specify:
- can_prune - Files deletable after each stage
- must_preserve - Files required for downstream stages
- summary_in_context - What summaries remain in context
Pruning modes: safe (after checkpoint stages), aggressive (after every eligible stage), off (for debugging). Result: ~80% token reduction while maintaining full resumability.
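The bundle-and-prune pattern can be sketched as two small helpers. The function names, JSON layout, and rule shapes below are illustrative assumptions; the repository's context_manager.py may differ:

```python
import json
import os

def add_bundle_layer(bundle_path, stage, key_decisions, forward_references, stage_summary):
    """Append one compressed context layer using the field names from the bundle
    schema above: decisions (why), pointers to preserved files, and a summary."""
    layers = []
    if os.path.exists(bundle_path):
        with open(bundle_path) as f:
            layers = json.load(f)
    layers.append({
        "stage": stage,
        "key_decisions": key_decisions,            # the "why", not raw outputs
        "forward_references": forward_references,  # file pointers, not contents
        "stage_summary": stage_summary,
    })
    with open(bundle_path, "w") as f:
        json.dump(layers, f, indent=2)

def prune(stage_rules, mode="safe"):
    """Delete files the rules mark prunable; must_preserve always wins."""
    if mode == "off":
        return []  # debugging mode: keep everything
    deleted = []
    for path in stage_rules.get("can_prune", []):
        if path in stage_rules.get("must_preserve", []):
            continue
        if os.path.exists(path):
            os.remove(path)
            deleted.append(path)
    return deleted
```

The key design choice is that pruning only deletes files whose decisions have already been captured in a bundle layer, so later stages can still reason about what happened without re-reading the raw outputs.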
Feedback Loop State Management
Problem: When analysis fails, you need to re-run stages 3-5. How do you preserve state across iterations without losing progress or repeating expensive operations?
Solution: cycle_state.json tracks feedback loop iterations with:
- current_cycle - Current iteration number
- max_cycles - Maximum allowed iterations
- failed_candidates - Variables that failed analysis
- failure_reasons - Why each candidate failed
The reset_stage_progress() function deletes progress.json to enable re-entry. Fast-track mode skips web searches (they didn’t change), runs primary model + Table 1 only, and applies score penalties to failed candidates. Stages 3-5 files are never pruned during active feedback cycles.
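A minimal sketch of that state handling, assuming the cycle_state.json fields listed above (the function bodies here are illustrative; the repository's feedback_utils.py is the actual implementation):

```python
import json
import os

def start_next_cycle(state_path="cycle_state.json", max_cycles=3):
    """Advance the feedback loop by one iteration, enforcing the cycle cap."""
    state = {"current_cycle": 0, "max_cycles": max_cycles,
             "failed_candidates": [], "failure_reasons": {}}
    if os.path.exists(state_path):
        with open(state_path) as f:
            state = json.load(f)  # resume: failures persist across iterations
    if state["current_cycle"] >= state["max_cycles"]:
        raise RuntimeError("Feedback loop exceeded max_cycles; stopping.")
    state["current_cycle"] += 1
    with open(state_path, "w") as f:
        json.dump(state, f, indent=2)
    return state

def record_failure(candidate, reason, state_path="cycle_state.json"):
    """Remember which candidate variables failed, and why, for score penalties."""
    with open(state_path) as f:
        state = json.load(f)
    if candidate not in state["failed_candidates"]:
        state["failed_candidates"].append(candidate)
    state["failure_reasons"][candidate] = reason
    with open(state_path, "w") as f:
        json.dump(state, f, indent=2)

def reset_stage_progress(progress_path="progress.json"):
    """Delete the stage progress file so stages 3-5 can be re-entered."""
    if os.path.exists(progress_path):
        os.remove(progress_path)
```

Keeping the cycle state in its own file, separate from per-stage progress, is what lets reset_stage_progress() wipe stage completion without forgetting which candidates already failed.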
Key Takeaways
When you recognize repetition in your daily work, consider wrapping it into a skill or a multi-skill workflow. The initial investment pays off quickly.
Building a workflow that is solid and consistently produces high-quality results takes a lot of time. The tutorial above skips months of iteration—the real work is in the debugging, testing, and refining.
- File system as state store: immediate writes for crash recovery. Never trust LLM memory for critical state.
- Python validation, not LLM: reliable checking without hallucination. If it matters, verify it with code.
- Semantic context + selective pruning: you can reduce tokens by ~80% while maintaining resumability if you design your context bundles thoughtfully.
Further Reading
GitHub Repository: https://github.com/DamarisDeng/paper-writing-system
- workflow/scripts/progress_utils.py - Progress tracking implementation
- workflow/scripts/context_manager.py - Context bundle and pruning system
- workflow/scripts/feedback_utils.py - Feedback loop management
- workflow/scripts/feasibility_validator.py - Pre-emptive validation
- workflow/skills/load-and-profile/SKILL.md - Example skill structure