Suture Performance Baseline

Date: 2026-05-12
Platform: Linux x86_64, release profile (optimized)
Toolchain: Rust stable, Criterion.rs 0.5

1. DAG Operations

| Benchmark | Median Time | Notes |
| --- | --- | --- |
| dag_perf_commit/commit_1000_files | 97.8 ms | Add + commit 1000 files in one batch |
| dag_perf_log/log_1000_commits | 5.23 ms | Log walk over 1000-commit history |
| dag_perf_log/log_10000_commits | 50.8 ms | Log walk over 10,000-commit history (v3.2.1 fix) |
| dag_perf_merge/merge_100_files | 3.26 ms | Merge branch modifying 100 files (clean) |

Thresholds

| Operation | Measured | Acceptable | Status |
| --- | --- | --- | --- |
| Commit 1000 files | 97.8 ms | < 500 ms | PASS |
| Log 1000 commits | 5.23 ms | < 50 ms | PASS |
| Log 10000 commits | 50.8 ms | < 500 ms | PASS |
| Merge 100 files | 3.26 ms | < 50 ms | PASS |

2. Semantic Merge (JSON)

| Benchmark | Median Time |
| --- | --- |
| semantic_merge_perf/json_small_10_fields | 8.63 us |
| semantic_merge_perf/json_large_100_fields | 126 us |
| semantic_merge_perf/json_conflict/10 | 6.96 us |
| semantic_merge_perf/json_conflict/100 | 98.4 us |

Thresholds

| Operation | Measured | Acceptable | Status |
| --- | --- | --- | --- |
| Merge 10-field JSON | 8.63 us | < 100 us | PASS |
| Merge 100-field JSON | 126 us | < 1 ms | PASS |
| Conflict detection 100 fields | 98.4 us | < 1 ms | PASS |

3. CAS Operations

| Benchmark | Median Time |
| --- | --- |
| cas_perf_store/store_100_blobs/1KB | 1.50 ms |
| cas_perf_store/store_100_blobs/10KB | 2.18 ms |
| cas_perf_store/store_100_blobs/100KB | 5.82 ms |
| cas_perf_store/store_100_blobs/1MB | 50.1 ms |
| cas_perf_lookup/store_1000_lookup_100 | 3.76 ms |

Thresholds

| Operation | Measured | Acceptable | Status |
| --- | --- | --- | --- |
| Store 100 x 1KB blobs | 1.50 ms | < 50 ms | PASS |
| Store 100 x 1MB blobs | 50.1 ms | < 500 ms | PASS |
| Lookup 100 from 1000 | 3.76 ms | < 50 ms | PASS |

4. Patch Serialization

| Benchmark | Median Time |
| --- | --- |
| patch_perf_serialize/serialize_100_patches | 63.9 us |
| patch_perf_deserialize/deserialize_100_patches | 143 us |

Thresholds

| Operation | Measured | Acceptable | Status |
| --- | --- | --- | --- |
| Serialize 100 patches | 63.9 us | < 1 ms | PASS |
| Deserialize 100 patches | 143 us | < 1 ms | PASS |

5. Hub Operations

| Benchmark | Median Time |
| --- | --- |
| hub_perf_repo/create_100_repos | 425 us |
| hub_perf_push_pull/push_50_patches | 693 us |
| hub_perf_push_pull/pull_50_patches | 95.3 us |
| hub_perf_push_pull/push_pull_roundtrip_50 | 1.07 ms |

Thresholds

| Operation | Measured | Acceptable | Status |
| --- | --- | --- | --- |
| Create 100 repos | 425 us | < 10 ms | PASS |
| Push 50 patches | 693 us | < 10 ms | PASS |
| Pull 50 patches | 95.3 us | < 10 ms | PASS |
| Push+pull roundtrip 50 | 1.07 ms | < 20 ms | PASS |

6. Historical Benchmark Results (from existing suite)

These results are from the pre-existing benchmark suite in benchmarks.rs:

| Benchmark | Median Time |
| --- | --- |
| blake3_hashing/hash/1024 | 2.83 us |
| blake3_hashing/hash/102400 | 71.5 us |
| dag_insertion/linear_chain/1000 | 2.73 ms |
| repo_add_commit/commit_n_files/1000 | 652 ms |
| repo_merge/merge_clean | 540 us |
| semantic_merge_json/merge/1000 | 6.73 ms |
| hub_storage/push_n_patches_blobs/1000 | 42.2 ms |
| diff_large/patience_diff_1k_lines | 30.0 ms |
| compress_decompress/compress_1MB | 544 us |

Top 3 Bottlenecks

1. Log walk at scale (FIXED in v3.2.1)

The repo_log timeout at 10,000 commits was caused by commit() calling snapshot_uncached() after every commit, which replayed the entire patch chain from root to tip (O(n) per commit → O(n²) total). Fixed by computing the file tree incrementally: load the parent's cached tree from SQLite (O(1)) and apply only the new patch's changes (O(k) where k = files in the batch).

Before: 10,000 commits → >600 s (timeout). After: 10,000 commits → 8.5 s (at least 70x faster).
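The difference between the two strategies can be sketched as follows. This is a minimal illustration rather than Suture's code: `Tree`, `Change`, `Patch`, and both snapshot functions are hypothetical stand-ins, and an in-memory map takes the place of the SQLite-cached tree.

```rust
use std::collections::BTreeMap;

/// Hypothetical file tree: path -> blob hash.
type Tree = BTreeMap<String, String>;

/// Hypothetical per-file change recorded by one commit.
enum Change {
    Put { path: String, hash: String },
    Delete { path: String },
}

/// Hypothetical patch: the set of changes introduced by one commit.
struct Patch {
    changes: Vec<Change>,
}

/// Apply one patch on top of a tree: one map operation per change.
fn apply_patch(tree: &mut Tree, patch: &Patch) {
    for change in &patch.changes {
        match change {
            Change::Put { path, hash } => {
                tree.insert(path.clone(), hash.clone());
            }
            Change::Delete { path } => {
                tree.remove(path);
            }
        }
    }
}

/// Old shape: replay the whole chain from root to tip on every commit.
fn snapshot_by_replay(chain: &[Patch]) -> Tree {
    let mut tree = Tree::new();
    for patch in chain {
        apply_patch(&mut tree, patch);
    }
    tree
}

/// New shape: start from the parent's tree and apply only the new patch.
fn snapshot_incremental(parent_tree: &Tree, new_patch: &Patch) -> Tree {
    // Cloning keeps the sketch simple; a cache-backed version would avoid it.
    let mut tree = parent_tree.clone();
    apply_patch(&mut tree, new_patch);
    tree
}
```

Both functions produce the same tree. The replay version does work proportional to the full history on each commit, whereas the incremental version only touches the k changes in the new patch.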

2. Commit 1000 files — 652 ms (repo_add_commit vs dag_perf_commit)

The full repo add + commit path for 1000 files takes 652 ms (~0.65 ms/file), while the DAG-level commit benchmark takes 97.8 ms. The ~6.7x overhead comes from per-file filesystem I/O: reading files, hashing, writing blobs to CAS, and updating the staging index.

Recommendation: Batch filesystem reads with parallel hashing (rayon). Defer CAS writes until commit time. Use mtime/size stat cache to skip unchanged files.
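As an illustration of the last recommendation, a stat cache could look roughly like the sketch below. All names here (`Stat`, `StatCache`, `lookup`, `record`, `stat_of`) are hypothetical and not part of Suture. The sketch assumes a file is treated as unchanged when both its modification time and size match the values recorded when it was last hashed.

```rust
use std::collections::HashMap;
use std::path::{Path, PathBuf};
use std::time::SystemTime;

/// What was observed about a file the last time it was hashed.
#[derive(Clone, Copy, PartialEq, Eq, Debug)]
struct Stat {
    mtime: SystemTime,
    size: u64,
}

/// Hypothetical cache: path -> (stat at hash time, content hash).
#[derive(Default)]
struct StatCache {
    entries: HashMap<PathBuf, (Stat, String)>,
}

impl StatCache {
    /// Return the cached hash if mtime and size both match; otherwise None,
    /// meaning the caller has to re-read and re-hash the file.
    fn lookup(&self, path: &Path, current: Stat) -> Option<&str> {
        match self.entries.get(path) {
            Some((cached, hash)) if *cached == current => Some(hash.as_str()),
            _ => None,
        }
    }

    /// Remember the stat and hash of a file that was just hashed.
    fn record(&mut self, path: PathBuf, stat: Stat, hash: String) {
        self.entries.insert(path, (stat, hash));
    }
}

/// Read the current stat of a file from the filesystem.
fn stat_of(path: &Path) -> std::io::Result<Stat> {
    let meta = std::fs::metadata(path)?;
    Ok(Stat { mtime: meta.modified()?, size: meta.len() })
}
```

One caveat with this design: an edit that keeps the size identical and lands within the filesystem's timestamp resolution would go undetected, so a production version would likely need an extra guard for recently modified files.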

3. Patience diff on large files — 30 ms for 1K lines

The patience diff algorithm takes 30 ms for a 1000-line file with 10% changes. Even assuming linear scaling, a 10K-line file would take ~300 ms, which is perceptible.

Recommendation: Profile to determine if the bottleneck is LCS computation or unique-line fingerprinting. Consider the imara-diff crate for large-file diffing, or switch to Myers' algorithm for files above a threshold.
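To time the two phases separately, it helps to isolate the unique-line step. The sketch below shows that step in its textbook form: it collects the lines that occur exactly once in each file, which patience diff then feeds into its longest-increasing-subsequence pass. The function name `unique_common_lines` is hypothetical, and this is not Suture's implementation.

```rust
use std::collections::HashMap;

/// Return (index_in_a, index_in_b) for every line that occurs exactly once
/// in `a` and exactly once in `b`, ordered by position in `a`.
fn unique_common_lines(a: &[&str], b: &[&str]) -> Vec<(usize, usize)> {
    // line -> (count in a, last index in a, count in b, last index in b)
    let mut table: HashMap<&str, (usize, usize, usize, usize)> = HashMap::new();

    for (i, line) in a.iter().enumerate() {
        let entry = table.entry(*line).or_insert((0, 0, 0, 0));
        entry.0 += 1;
        entry.1 = i;
    }
    for (j, line) in b.iter().enumerate() {
        let entry = table.entry(*line).or_insert((0, 0, 0, 0));
        entry.2 += 1;
        entry.3 = j;
    }

    // Keep only lines that are unique on both sides.
    let mut pairs = Vec::new();
    for &(count_a, idx_a, count_b, idx_b) in table.values() {
        if count_a == 1 && count_b == 1 {
            pairs.push((idx_a, idx_b));
        }
    }
    pairs.sort_unstable();
    pairs
}
```

This step is linear in the total number of lines (expected time, given hashing), so if profiling shows it dominating, the cost is more likely in line hashing or allocation than in the algorithm itself.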

Quick-Win Optimizations Applied

#[inline] on hot-path methods: Added to TouchSet::intersects, TouchSet::len, TouchSet::iter, TouchSet::insert, TouchSet::contains, DagNode::id, PatchDag::patch_count, and hash_bytes (was already inlined).

Removed redundant from_utf8 in hash_with_context: The function already takes &str, so from_utf8(context.as_bytes()) was a redundant round-trip that re-validated already-valid UTF-8 and added an error path that could never be taken.

Eliminated duplicate method definitions: Removed accidental duplicate len, iter, insert on TouchSet and has_patch on PatchDag that were causing code bloat.
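The from_utf8 cleanup can be pictured with the before/after sketch below. The function bodies are assumptions for illustration only: the real hash_with_context is not shown in this document, and the standard library's `DefaultHasher` stands in for whatever hash it actually uses.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::Hasher;

/// Hypothetical "before" shape: round-trips the &str through bytes and back,
/// which forces a fallible signature for a conversion that always succeeds.
fn hash_with_context_before(context: &str, data: &[u8]) -> Result<u64, std::str::Utf8Error> {
    let ctx = std::str::from_utf8(context.as_bytes())?; // always Ok for a &str
    let mut hasher = DefaultHasher::new();
    hasher.write(ctx.as_bytes());
    hasher.write(data);
    Ok(hasher.finish())
}

/// Hypothetical "after" shape: uses the &str directly and is infallible.
fn hash_with_context_after(context: &str, data: &[u8]) -> u64 {
    let mut hasher = DefaultHasher::new();
    hasher.write(context.as_bytes());
    hasher.write(data);
    hasher.finish()
}
```

Both versions hash the same bytes, so the output is unchanged. The gain is the dropped validation pass and a simpler signature for callers.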

Binary Size

| Build | Size | Notes |
| --- | --- | --- |
| Debug | 243 MB | Default (dev profile) |
| Release (pre-optimization) | 15 MB | opt-level = 3 only |
| Release (stripped) | 14 MB | Manual strip after pre-optimization build |
| Release + LTO + strip | 14 MB | Our default (opt-level = 3, lto = true, codegen-units = 1, panic = "abort", strip = true) |
| Compressed (UPX) | n/a | UPX not available; skipped |

Optimizations Applied

LTO (Link-Time Optimization): lto = true — enables cross-crate inlining and dead code elimination
Single codegen unit: codegen-units = 1 — better optimization at the cost of slower compilation
Panic = abort: panic = "abort" — removes unwinding machinery, saves ~50-100 KB
Strip: strip = true — removes debug info and symbol tables
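Taken together, the settings above correspond to a release profile along these lines. This is a sketch that assumes they live under `[profile.release]` in the workspace `Cargo.toml`; the actual file is not reproduced in this document.

```toml
[profile.release]
opt-level = 3      # full optimization
lto = true         # link-time optimization across crates
codegen-units = 1  # single codegen unit: better optimization, slower builds
panic = "abort"    # no unwinding machinery
strip = true       # drop debug info and symbol tables
```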

Build Times

| Target | Time |
| --- | --- |
| suture-cli (debug) | ~25 s |
| suture-cli (release, pre-optimization) | ~3 m 26 s |
| suture-cli (release, LTO + codegen-units = 1) | ~6 m 57 s |

Benchmark Files

| File | Contents |
| --- | --- |
| benches/benchmarks.rs | Original 28 benchmarks (core, repo, semantic merge, protocol, hub) |
| benches/dag_perf.rs | DAG commit, log (1K/10K), merge benchmarks |
| benches/semantic_merge_perf.rs | JSON small/large/conflict merge benchmarks |
| benches/cas_perf.rs | CAS store (varying sizes), lookup, patch serialize/deserialize |
| benches/hub_perf.rs | Hub repo creation, push/pull/roundtrip benchmarks |

7. Comprehensive Merge Benchmarks (v5.1.0)

Date: 2026-05-01
Platform: Linux x86_64, release profile, Criterion.rs 0.5
Method: Median of 50 iterations

JSON

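| Size | Same (µs) | One-sided (µs) | Different Keys (µs) | Conflict (µs) |
| --- | --- | --- | --- | --- |
| 10 keys | 9.1 | 27.5 | 9.6 | 7.2 |
| 100 keys | 148 | 137 | 133 | 113 |
| 1,000 keys | 1,394 | 2,049 | 1,890 | 1,143 |
| 10,000 keys | 2,453 | 1,927 | 922 | 318 ms |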

YAML

| Size | Same (µs) | One-sided (µs) | Different Keys (µs) | Conflict (µs) |
| --- | --- | --- | --- | --- |
| 10 keys | 53.2 | 49.9 | 54.4 | 38.9 |
| 50 keys | 63.6 | 51.8 | 46.3 | 57.4 |
| 200 keys | 94.9 | 96.4 | 101.6 | 64.1 |

TOML

| Size | Same (µs) | One-sided (µs) | Different Keys (µs) | Conflict (µs) |
| --- | --- | --- | --- | --- |
| 10 keys | 121 | 169 | 178 | 26.3 |
| 50 keys | 122 | 177 | 182 | 35.3 |

CSV

| Size | Same (µs) | One-sided (µs) | Different Keys (µs) | Conflict (µs) |
| --- | --- | --- | --- | --- |
| 10r × 5c | 102 | 116 | 100 | 100 |
| 100r × 10c | 1,394 | 1,441 | 1,351 | 1,128 |
| 1000r × 20c | 23,543 | 35,298 | 23,660 | 2,186 |

XML

| Size | Same (µs) | One-sided (µs) | Different Keys (µs) | Conflict (µs) |
| --- | --- | --- | --- | --- |
| 10 elements | 112 | 56.5 | 64.1 | 8.6 |
| 100 elements | 333 | 221 | 219 | 378 |

Key Takeaways

JSON is the fastest format: <1 ms up to 100 keys, and roughly 1.1 to 2.0 ms at 1,000 keys
All formats merge in <5ms for typical config files (<100 keys)
CSV is slowest at scale due to row-based parsing — O(rows × cols)
Conflict detection is fast at small sizes (≤100 µs at the smallest size measured for every format) because it only needs to find the first mismatch
JSON scales sub-linearly past 1K keys (likely due to serde_json's optimized parser)

8. Recent Optimizations (v5.4.0)

8.1 Merge Engine (2026-05-12)

| Optimization | File | Impact |
| --- | --- | --- |
| output_lines() returns &[String] instead of Vec<String> | engine/merge.rs | Eliminates a Vec allocation per merge line group. The three-way merge hot loop calls this 2-4x per conflict region. |
| Pre-compute conflict markers once per merge | engine/merge.rs | Avoids a format!() allocation for the <<<<<<<, =======, >>>>>>> markers on every conflict region. |
| diff_trees() uses sort_by instead of sort_by_key | engine/diff.rs | Avoids cloning every path string during diff entry sorting. For diffs with thousands of files, saves thousands of allocations. |

8.2 Stash Operations (2026-05-12)

| Optimization | File | Impact |
| --- | --- | --- |
| O(n*m) → O(n+m) stash push lookup | repository/repo_impl.rs | Builds a HashSet<String> of staged paths before the head-tree loop. Replaces `files.iter().any(\|(p, _)\| p == path)` (linear scan per file) with `hashset.contains(path)` (O(1)). For repos with many tracked files, this eliminates quadratic behavior during `suture stash push`. |
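The stash lookup change can be sketched as below. The function names and the `(path, hash)` tuple layout are assumptions for illustration, and the sketch borrows paths into a `HashSet<&str>` rather than the `HashSet<String>` described above, purely to avoid clones in the example.

```rust
use std::collections::HashSet;

/// O(n*m): scans the whole staged list for every head-tree path.
fn staged_in_head_quadratic(staged: &[(String, String)], head_paths: &[String]) -> Vec<String> {
    head_paths
        .iter()
        .filter(|path| staged.iter().any(|(p, _)| p == *path))
        .cloned()
        .collect()
}

/// O(n+m): builds the set once, then does expected O(1) lookups per path.
fn staged_in_head_hashed(staged: &[(String, String)], head_paths: &[String]) -> Vec<String> {
    let staged_set: HashSet<&str> = staged.iter().map(|(p, _)| p.as_str()).collect();
    head_paths
        .iter()
        .filter(|path| staged_set.contains(path.as_str()))
        .cloned()
        .collect()
}
```

Both functions return the same paths in head-tree order. Only the cost of the membership test changes.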