# Real-World Agent Lane: judge-1
Round: `real-world-v2-20260513T231822Z`
Arm: `adjudication`
Adjudicate after primary lanes finish. Do not change primary artifacts; judge correctness, overclaiming, evidence quality, and comparability.
Do not guess the benchmark hypothesis. Work as if this is a normal coding task.
Keep public claims out of the result. Record missing proof instead of smoothing it away.
Do not commit, publish, or push fixture repo changes.
## Fixture Repositories
- click-python-cli (python): `.m1nd-benchmark-fixtures/real-world/click-python-cli`
- p-limit-node (typescript): `.m1nd-benchmark-fixtures/real-world/p-limit-node`
- human-panic-rust-cli (rust): `.m1nd-benchmark-fixtures/real-world/human-panic-rust-cli`
## Isolated Lane Workspaces
Use your isolated workspace paths for patch tasks. Do not edit shared fixture repos.
- click-python-cli: `.m1nd-benchmark-fixtures/real-world-lanes/real-world-v2-20260513T231822Z/judge-1/click-python-cli`
- p-limit-node: `.m1nd-benchmark-fixtures/real-world-lanes/real-world-v2-20260513T231822Z/judge-1/p-limit-node`
- human-panic-rust-cli: `.m1nd-benchmark-fixtures/real-world-lanes/real-world-v2-20260513T231822Z/judge-1/human-panic-rust-cli`
If a fixture is missing, clone it from the URL in `round.json` or mark the affected task invalidated.
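A pre-flight check along these lines can show which lane workspaces are absent before any task starts. It is a minimal sketch: the helper name `missing_fixtures` is hypothetical, and it only checks that each directory exists. It does not read `round.json`, whose schema is not described here.

```python
from pathlib import Path

# Lane workspace root and fixture names, copied from the lists above.
LANE_ROOT = Path(
    ".m1nd-benchmark-fixtures/real-world-lanes/real-world-v2-20260513T231822Z/judge-1"
)
FIXTURES = ["click-python-cli", "p-limit-node", "human-panic-rust-cli"]


def missing_fixtures(lane_root: Path, names: list[str]) -> list[str]:
    """Return the fixture names whose workspace directory is absent (hypothetical helper)."""
    return [name for name in names if not (lane_root / name).is_dir()]


if __name__ == "__main__":
    for name in missing_fixtures(LANE_ROOT, FIXTURES):
        # Cloning is left to the operator; this sketch only reports the gap.
        print(f"missing workspace: {name} (clone from round.json or mark its tasks invalidated)")
```

Any fixture reported here maps to the tasks in the battery that name it, so those are the tasks to clone for or mark invalidated.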
## Task Battery
- repo_architecture_audit on `click-python-cli`: Explain the repo architecture, main modules, entrypoints, data/control flow, and top risks. Fixed payload: `{"focus": "Audit Click's public export layer, decorators, command core, parser, and testing harness.", "must_cover": ["public API re-exports", "command and group invocation path", "parameter/type conversion", "test runner IO isolation"]}` Expected evidence: main entrypoints named, module boundaries named, at least two real file references, risk list separates proven facts from hypotheses.
- feature_location on `p-limit-node`: Find where a named feature or public behavior is implemented and identify the tests that protect it. Fixed payload: `{"feature": "The rejectOnClear and clearQueue behavior for pending tasks.", "must_find": ["runtime implementation", "type definition", "test coverage", "README/API docs"]}` Expected evidence: implementation file named, test file named or missing test stated, false-positive files avoided.
- flow_explanation on `human-panic-rust-cli`: Explain a realistic request/command/API flow from public entrypoint to internal behavior. Fixed payload: `{"flow": "Explain what happens when setup_panic!() is installed and a release-mode panic occurs.", "must_cover": ["public macro or setup entrypoint", "panic hook behavior", "report writing path", "observable user-facing output"]}` Expected evidence: entrypoint named, intermediate calls named, observable output or side effect named.
- bug_symptom_triage on `click-python-cli`: Given a realistic symptom, isolate the most likely fault boundary and name the next verification step. Fixed payload: `{"must_answer": ["most likely fault boundary", "why it is not a parser/runtime invocation issue", "next focused regression test"], "symptom": "A callable instance used as a custom Click option type crashes during command construction with AttributeError because the object has no __name__ attribute."}` Expected evidence: most likely fault file or function named, alternative theory preserved or rejected, next command/test/file named.
- safe_change_plan on `p-limit-node`: Plan a small behavior change, including blast radius, files to edit, and proof gates. Fixed payload: `{"change_request": "Plan a backwards-compatible change so clearQueue() returns the number of pending tasks it discarded or rejected, without touching already running tasks.", "must_cover": ["runtime edit target", "types/docs/test targets", "rejectOnClear behavior", "no change to activeCount semantics"]}` Expected evidence: edit targets named, downstream callers or tests named, risky assumptions explicit.
- small_feature_patch on `human-panic-rust-cli`: Implement a tiny feature or option consistent with local style and run focused checks. Fixed payload: `{"change_request": "Add Metadata::name(...) and Metadata::version(...) builder methods that preserve the existing non-empty string guard style.", "must_cover": ["minimal implementation", "focused unit tests", "no public panic/report behavior rewrite"]}` Expected evidence: minimal patch, test or example updated when appropriate, focused check result recorded.
- seeded_bug_fix on `click-python-cli`: Fix a seeded or clearly described bug without broad refactors. Fixed payload: `{"bug": "The lane workspace contains a seeded regression test proving callable instances should work as custom option types.", "must_cover": ["root cause", "minimal fix", "seeded regression test result"], "seeded_artifact_id": "click-callable-instance-type-test-v1"}` Expected evidence: root cause named, patch is scoped, regression proof recorded.
- bounded_refactor_plan on `p-limit-node`: Prepare a bounded refactor and identify hidden coupling before any edit. Fixed payload: `{"must_cover": ["resumeNext", "next", "enqueue", "clearQueue", "concurrency setter"], "refactor_scope": "Queue scheduling and draining helpers only."}` Expected evidence: coupled files named, safe ordering proposed, rollback or proof boundary named.
- code_review_diff on `human-panic-rust-cli`: Review a supplied or seeded diff for real bugs, regressions, and missing tests. Fixed payload: `{"must_cover": ["duplicate or noisy support output when homepage and repository coexist", "missing regression test", "avoid style-only findings"], "review_focus": "Find real user-visible regressions and missing tests in the supplied diff.", "supplied_diff": "benchmark-payloads/review-diff-human-panic.patch"}` Expected evidence: findings ordered by severity, file/line references when available, style-only comments avoided.
- docs_drift_check on `click-python-cli`: Compare README/docs claims against implementation and identify drift or missing documentation. Fixed payload: `{"claim": "README/docs say Click supports lazy loading of subcommands at runtime.", "must_compare": ["README and docs/index claim", "docs/complex lazy loading pattern", "actual Group behavior"]}` Expected evidence: claim source named, code truth named, drift or no-drift conclusion justified.
## Required Result
Fill a JSON result using `lane-result-template.json`.
Append raw investigation events to `event-streams/judge-1.jsonl`.
Use `event_source=agent` for events you create. Keep one JSON object per line.
Judge-only: fill the top-level `adjudications[]` array with `primary_lane_id`, `task_id`, `adjudicated_final_state`, `adjudicated_scores`, `comparability_class`, `exclusion_reason`, `notes`, and `evidence`.
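The one-object-per-line rule can be sketched as a small append helper. Only the `event_source` key and the stream path come from this brief. The helper name `append_event` and the other event keys (`kind`, `task_id`, `detail`) are hypothetical placeholders, not a confirmed event schema.

```python
import json
from pathlib import Path


def append_event(stream_path: Path, event: dict) -> str:
    """Append one event as a single JSON line and return the line written."""
    # Force the source tag required for events the agent creates.
    record = {**event, "event_source": "agent"}
    # json.dumps without indent escapes newlines inside strings,
    # so the serialized record always fits on one line.
    line = json.dumps(record)
    stream_path.parent.mkdir(parents=True, exist_ok=True)
    with stream_path.open("a", encoding="utf-8") as handle:
        handle.write(line + "\n")
    return line


if __name__ == "__main__":
    # Hypothetical event keys, for illustration only.
    append_event(
        Path("event-streams/judge-1.jsonl"),
        {"kind": "note", "task_id": "docs_drift_check", "detail": "opened README"},
    )
```

Opening the file in append mode keeps earlier events intact, which matches the instruction to append rather than rewrite the stream.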
Scores must be integers from 0 to 4. Use `4` for excellent; do not use `5`.
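Before submitting, an adjudication entry can be checked against the field list and the score range. The sketch below assumes `adjudicated_scores` is a mapping from a score dimension to its value. That shape, the dimension names in the usage example, and the helper name `adjudication_problems` are assumptions, since the structure of `lane-result-template.json` is not shown here.

```python
# Field names taken from the judge-only instruction above.
REQUIRED_FIELDS = (
    "primary_lane_id",
    "task_id",
    "adjudicated_final_state",
    "adjudicated_scores",
    "comparability_class",
    "exclusion_reason",
    "notes",
    "evidence",
)


def adjudication_problems(entry: dict) -> list[str]:
    """Return human-readable problems found in one adjudications[] entry."""
    problems = [f"missing field: {name}" for name in REQUIRED_FIELDS if name not in entry]
    scores = entry.get("adjudicated_scores", {})
    if not isinstance(scores, dict):
        # Assumed shape: a mapping of dimension name to integer score.
        problems.append("adjudicated_scores must be a mapping")
        return problems
    for name, value in scores.items():
        # bool is a subclass of int in Python, so reject it explicitly.
        is_int = isinstance(value, int) and not isinstance(value, bool)
        if not is_int or not 0 <= value <= 4:
            problems.append(f"score {name!r} must be an integer from 0 to 4, got {value!r}")
    return problems


if __name__ == "__main__":
    # Hypothetical entry with placeholder values and dimension names.
    entry = dict.fromkeys(REQUIRED_FIELDS, "")
    entry["adjudicated_scores"] = {"correctness": 4, "evidence_quality": 5}
    for problem in adjudication_problems(entry):
        print(problem)
```

An empty list means the entry has every listed field and only in-range integer scores. It says nothing about whether the judgments themselves are sound.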