I'll give the snapshot a try, thanks, but it's hard for me to see :) how
eliminating a few hundred locks and unlocks that are 75% under 1 usec anyway
will help much. I may have unduly influenced anybody following this by only
showing detailed stats for the last two locks that I picked at random.
What I'm more focused on are the locks being heavily used by git and heavily
contended by the four git threads in OP's testcase. Here are the caller
stats for this set of locks...
lock latency buckets: <=1 <=10 <=100 <=1000 >1000
mtx 4908/06002A00F0 lks 155066, ulks 155066, 66359 4395 78895 4742 675
caller 0x0100454FAD, count 154792, L, /usr/src/git/builtin/pack-objects.c:1758
caller 0x0100454FFC, count 154777, U, /usr/src/git/builtin/pack-objects.c:1769
[...]
mtx 4908/0600FA6860 lks 154745, ulks 154745, 56092 3217 64883 25435 5118
caller 0x0100455659, count 1011, L, /usr/src/git/builtin/pack-objects.c:1640
caller 0x0100455687, count 1011, U, /usr/src/git/builtin/pack-objects.c:1642
caller 0x01004556E0, count 153734, L, /usr/src/git/builtin/pack-objects.c:1628
caller 0x010045570E, count 153734, U, /usr/src/git/builtin/pack-objects.c:1630
mtx 4908/0603498600 lks 316825, ulks 316825, 146254 4986 155345 9686 554
caller 0x0100455269, count 196769, L, /usr/src/git/builtin/pack-objects.c:1695
caller 0x01004552C4, count 15148, U, /usr/src/git/builtin/pack-objects.c:1705
caller 0x0100455478, count 181621, U, /usr/src/git/builtin/pack-objects.c:1702
caller 0x010045554C, count 120056, L, /usr/src/git/builtin/pack-objects.c:1834
caller 0x010045556E, count 120056, U, /usr/src/git/builtin/pack-objects.c:1837
If you look at the ">1000" column, you see that these 3 locks account for
675+5118+554 = 6347ms = 6.347sec spent waiting for these locks, best case
for just this 1000+ usec latency bucket. The next column to the left (a
faster bucket for latencies 101us up to 1000us) has 4742+25435+9686 = 39863,
times 101us best case for this bucket, giving 4.026sec spent waiting for
these locks.
I probably need to modify my stats gathering program to display the total
time spent waiting to make it more clear how things add up for each mutex
across the whole testcase. Again, these stats are not timing how long locks
are being held, but rather how long it takes just to acquire the lock.