SWE-Together: Evaluating Coding Agents in Interactive User Sessions
Leaderboard
| Model | pass@1 | pass² | judge ↑ | correction ↓ | tokens | min |
|---|---|---|---|---|---|---|
| ★ Oracle reference | | | 0.904 | — | — | — |
| claude-opus-4.8 | 63% | 52% | 0.801 | 1.38 | 74.0k | 23.3 |
| gpt-5.5 | 58% | 48% | 0.763 | 1.59 | 29.9k | 10.7 |
| claude-opus-4.6 | 58% | 46% | 0.755 | 1.59 | 42.0k | 23.2 |
| glm-5.2 | 55% | 42% | 0.735 | 1.53 | 41.7k | 24.5 |
| glm-5.1 | 52% | 34% | 0.729 | 1.54 | 41.6k | 38.8 |
| deepseek-v4-pro | 48% | 29% | 0.679 | 1.76 | 49.8k | 21.0 |
| minimax-2.7 | 39% | 24% | 0.630 | 2.17 | 43.4k | 36.2 |
SWE-Together · 109 tasks · opencode harness · k = 2. Models sorted by pass@1. On each bar, the dark number = pass@1 and the white number = pass² (both runs solve at judge ≥ 0.85); the hatched tail is unstable (pass@1 − pass²). Oracle is the gold-patch reference ceiling. The best value in each column is bold (judge ↑ higher is better; correction / tokens / min ↓ lower is better).
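To make the caption's metrics concrete, here is a minimal Python sketch of how pass@1, pass², and the unstable gap could be derived from per-task judge scores when k = 2. The 0.85 threshold comes from the caption. Everything else is an assumption for illustration: the function name `summarize`, the input layout (task id mapped to two judge scores), the toy scores, and the choice to compute pass@1 as the solve rate averaged over both runs. The benchmark's actual scoring code may differ.

```python
THRESHOLD = 0.85  # a run counts as solved when its judge score is >= 0.85 (from the caption)


def summarize(scores_by_task, threshold=THRESHOLD):
    """Return pass@1, pass^2, and the unstable gap for k = 2 runs per task.

    scores_by_task maps a task id to a (run_1_score, run_2_score) pair.
    This layout is hypothetical, not the benchmark's real data format.
    """
    n_tasks = len(scores_by_task)
    if n_tasks == 0:
        raise ValueError("need at least one task")

    solved_runs = 0  # individual runs that cleared the threshold
    solved_both = 0  # tasks where both runs cleared the threshold
    for first, second in scores_by_task.values():
        ok_first = first >= threshold
        ok_second = second >= threshold
        solved_runs += int(ok_first) + int(ok_second)
        solved_both += int(ok_first and ok_second)

    # Assumption: pass@1 is the per-run solve rate averaged over both runs.
    pass_at_1 = solved_runs / (2 * n_tasks)
    pass_squared = solved_both / n_tasks
    return {
        "pass@1": pass_at_1,
        "pass^2": pass_squared,
        "unstable": pass_at_1 - pass_squared,  # the hatched tail on each bar
    }


# Toy scores, invented for illustration only.
toy_scores = {
    "task-a": (0.90, 0.95),  # solved in both runs
    "task-b": (0.90, 0.50),  # solved in one run
    "task-c": (0.30, 0.86),  # solved in one run
    "task-d": (0.10, 0.20),  # solved in neither run
}
print(summarize(toy_scores))
```

With these toy scores, four of eight runs clear the threshold and one of four tasks is solved twice, so pass@1 is 0.5, pass² is 0.25, and the unstable gap is 0.25.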
Evaluation runs · provider · date (2026)
Get in touch
We'd love to hear from you
Have a question, a session you think would make a great task, or feedback on the benchmark?
Want to contribute or collaborate? Contact Yifan Wu and Shengzhi Li, join the Discord, or open an issue or pull request on GitHub.
Citation
Cite SWE-Together
@article{wu2026swetogether,
  title   = {SWE-Together: Evaluating Coding Agents in Interactive User Sessions},
  author  = {Wu, Yifan and Zhao, Zhuokai and Li, Songlin and Lee, Ho Hin and Zhu, Jiacheng and Wu, Shirley and Yu, Tianhe and Li, Serena and Zhang, Lizhu and Fan, Xiangjun and Li, Shengzhi},
  year    = {2026},
  journal = {arXiv preprint arXiv:2606.29957},
  url     = {https://arxiv.org/pdf/2606.29957}
}