
πŸ‘‰πŸ» SWE-Factory πŸ‘ˆπŸ»

Your automated factory for GitHub Issue Resolution Training Data and Evaluation Benchmarks.


✨ Key Features

  • An automated pipeline for GitHub issue resolution data collection, reducing your manual effort!
  • Reliable and reproducible Docker-based evaluation environments.
  • Automatic environment construction with an LLM-powered multi-agent system (SWE-Builder).
  • Support for multiple programming languages (we have evaluated Python, Java, JavaScript, and TypeScript extensively).

📦 Environment Setup

Our experiments are conducted using Docker version 27.0.3-1 and Ubuntu 22.04.4 LTS.

To get started, run the following commands to set up the environment:

conda create --name swe-factory python=3.12.5 -y
conda activate swe-factory
pip install -r requirements.txt

🚀 Running SWE-Factory

πŸ“ Stage I: Raw Issue Data Collection

We use GitHub APIs and predefined patterns to collect raw issue data (e.g., python-mypy-instances.jsonl). Check the detailed tutorial in the data_collection/collect directory.
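
For illustration only, below is a minimal sketch of pulling closed issues from the GitHub REST API. The actual collector in data_collection/collect applies additional filtering (e.g., linking issues to merged pull requests) and produces richer records; the output fields, file names, and GITHUB_TOKEN variable here are assumptions, not the repository's actual interface.

# Minimal sketch: collect closed issues from a repository via the GitHub REST API.
# Assumption: a GITHUB_TOKEN environment variable holds a valid API token.
import json
import os

import requests

API = "https://api.github.com/repos/{repo}/issues"

def collect_issues(repo: str, out_path: str, max_pages: int = 5) -> None:
    headers = {"Authorization": f"token {os.environ['GITHUB_TOKEN']}"}
    with open(out_path, "w", encoding="utf-8") as fout:
        for page in range(1, max_pages + 1):
            resp = requests.get(
                API.format(repo=repo),
                headers=headers,
                params={"state": "closed", "per_page": 100, "page": page},
                timeout=30,
            )
            resp.raise_for_status()
            items = resp.json()
            if not items:
                break
            for item in items:
                if "pull_request" in item:  # the issues endpoint also returns PRs; skip them
                    continue
                fout.write(json.dumps({
                    "repo": repo,
                    "number": item["number"],
                    "title": item["title"],
                    "body": item.get("body") or "",
                }) + "\n")

if __name__ == "__main__":
    collect_issues("python/mypy", "raw-mypy-issues.jsonl")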

🛠 Stage II: Automated Evaluation Environment Setup via SWE-Builder

After collecting raw issue data, set up the evaluation environment by running:

export OPENAI_API_BASE_URL=<your_base_url>
export OPENAI_KEY=<your_key>

python app/main.py swe-bench \
    --model gpt-4.1-mini \
    --tasks-map "python-mypy-instances.jsonl" \
    --num-processes 10 \
    --model-temperature 0.2 \
    --conv-round-limit 10 \
    --output-dir "output/gpt-4.1-mini/mypy" \
    --setup-dir "testbed" \
    --results-path "output/gpt-4.1-mini/mypy/results"
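
Each line of the tasks-map file is one issue instance. As a rough illustration of the shape such a record might take, the snippet below uses SWE-bench-style field names; the exact schema produced by Stage I may differ, so treat every field here as an assumption.

# Illustrative shape of one record (one line) of python-mypy-instances.jsonl.
# All field names below are assumptions following SWE-bench conventions.
example_instance = {
    "instance_id": "python__mypy-12345",          # hypothetical identifier
    "repo": "python/mypy",
    "base_commit": "abc1234",                     # commit the issue was reported against
    "problem_statement": "Crash when ...",        # issue title and body
    "patch": "diff --git a/mypy/...",             # ground-truth fix from the linked PR
    "test_patch": "diff --git a/test-data/...",   # tests added or changed by the PR
    "created_at": "2024-05-01T00:00:00Z",
}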

We employ SWE-Builder, an LLM-based multi-agent system consisting of the following components (a rough orchestration sketch follows the list):

  1. 🔍 Repository Explorer

    • Gathers environment setup and test commands automatically.
  2. 🐳 Environment Manager

    • Generates Dockerfiles for reproducible test environments.
  3. 📝 Test Manager

    • Writes evaluation scripts to run tests inside containers.
  4. 🔬 Test Analyst

    • Validates generated environments and orchestrates iterative refinement.
  5. 💾 Evaluation Environment Memory Pool

    • Reuses previously successful setups for efficiency and consistency.
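
The sketch below shows how these components could interact in a bounded refinement loop. It is only a conceptual illustration: the real agents are LLM-driven, and every function and field name here is a hypothetical stand-in rather than the repository's actual API.

# Conceptual sketch of the SWE-Builder refinement loop (hypothetical names, stubbed agents).
from dataclasses import dataclass

@dataclass
class Environment:
    dockerfile: str
    eval_script: str

def explore_repository(instance: dict) -> dict:
    # Repository Explorer: gather setup hints and test commands (stubbed).
    return {"install": "pip install -e .", "test_cmd": "pytest -x"}

def write_dockerfile(context: dict) -> str:
    # Environment Manager: emit a Dockerfile from the gathered context (stubbed).
    return f"FROM python:3.11\nRUN {context['install']}\n"

def write_eval_script(context: dict) -> str:
    # Test Manager: emit the in-container evaluation script (stubbed).
    return f"#!/bin/bash\n{context['test_cmd']}\n"

def analyze(env: Environment) -> tuple[bool, str]:
    # Test Analyst: build and run the environment, report success plus feedback (stubbed).
    return True, "tests ran"

def build_environment(instance: dict, memory: dict, round_limit: int = 10) -> Environment | None:
    # Memory pool: reuse a previously validated setup for the same repo/version.
    key = (instance["repo"], instance.get("version"))
    if key in memory:
        return memory[key]
    context = explore_repository(instance)
    for _ in range(round_limit):
        env = Environment(write_dockerfile(context), write_eval_script(context))
        ok, feedback = analyze(env)
        if ok:
            memory[key] = env
            return env
        context["feedback"] = feedback  # feed analyst feedback into the next round
    return None  # give up after the round limit

if __name__ == "__main__":
    pool: dict = {}
    print(build_environment({"repo": "python/mypy", "version": "1.4"}, pool))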


📊 SWE-Builder Evaluation Results

We evaluated SWE-Builder using three base models:

| Base Model | Valid Rate (%) | Success Rate (%) | Cost (USD) | Time (min) |
| --- | --- | --- | --- | --- |
| GPT-4.1-mini | 40.1 (269/671) | 57.2 (384/671) | 0.045 | 22.4 |
| DeepSeek-v3-0324 | 34.6 (232/671) | 50.8 (341/671) | 0.043 | 22.5 |
| Gemini-2.5-flash-preview | 33.5 (225/671) | 49.8 (334/671) | 0.024 | 27.0 |

To reproduce these experiments:

export OPENAI_API_BASE_URL=<your_base_url>
export OPENAI_KEY=<your_key>
bash run/run.sh

✅ Stage III: Fail2Pass Validation

After generating evaluation environments, perform Fail2Pass validation:

  1. Obtain test logs before and after applying the ground-truth patch. Check the evaluation directory for detailed instructions.

  2. Run automated Fail2Pass validation:

python scripts/judge_fail2pass.py evaluation/run_instance/mypy_gpt-4.1-mini/gold fail2pass_status.json

The validated instances can be filtered using the generated fail2pass_status.json.
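
As an illustration, the filtering step could look like the sketch below. The exact structure of fail2pass_status.json written by judge_fail2pass.py is an assumption here (a mapping from instance_id to a truthy validation marker), as is the instance_id field name.

# Minimal sketch: keep only instances marked as Fail2Pass-valid.
# Assumption: fail2pass_status.json maps each instance_id to a truthy marker.
import json

def filter_validated(instances_path: str, status_path: str, out_path: str) -> None:
    with open(status_path, encoding="utf-8") as f:
        status = json.load(f)
    kept = 0
    with open(instances_path, encoding="utf-8") as fin, \
         open(out_path, "w", encoding="utf-8") as fout:
        for line in fin:
            record = json.loads(line)
            if status.get(record["instance_id"]):  # assumed field name
                fout.write(line)
                kept += 1
    print(f"kept {kept} validated instances")

if __name__ == "__main__":
    filter_validated("python-mypy-instances.jsonl",
                     "fail2pass_status.json",
                     "python-mypy-instances.validated.jsonl")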

Note: Although our automated validation demonstrates high precision, manual checks are recommended to ensure dataset quality, particularly to identify and filter out error-to-pass cases.

📌 Using Your Own Dataset

After building your dataset for evaluation and training, check the evaluation directory for detailed instructions on how to run tests and obtain test execution feedback.

📖 Citation

If SWE-Factory helps your research or projects, star ⭐ our repo or cite us:

@article{guo2025swefactory,
  title={SWE-Factory: Your Automated Factory for Issue Resolution Training Data and Evaluation Benchmarks},
  author={Lianghong Guo and Yanlin Wang and Caihua Li and Pengyu Yang and Jiachi Chen and Wei Tao and Yingtian Zou and Duyu Tang and Zibin Zheng},
  journal={arXiv preprint arXiv:2506.10954},
  year={2025},
  url={https://arxiv.org/abs/2506.10954},
}

πŸ™ Acknowledgements

  • We build upon prior research that is foundational to our work: SWE-bench, AutoCodeRover, Magis, and OmniGIRL.
  • Huge thanks to the open-source developer community; your invaluable contributions underpin software engineering research! ❤️
