What Changes the Most
jj log --no-graph -r 'ancestors(trunk()) & committer_date(after:"1 year ago")' \
-T 'self.diff().files().map(|f| f.path() ++ "\n").join("")' \
| sort | uniq -c | sort -nr | head -20
Who Built This
jj log --no-graph -r 'ancestors(trunk()) & ~merges()' \
-T 'self.author().name() ++ "\n"' \
| sort | uniq -c | sort -nr
Where Do Bugs Cluster
jj log --no-graph -r 'ancestors(trunk()) & description(regex:"(?i)fix|bug|broken")' \
-T 'self.diff().files().map(|f| f.path() ++ "\n").join("")' \
| sort | uniq -c | sort -nr | head -20
Is This Project Accelerating or Dying
jj log --no-graph -r 'ancestors(trunk())' \
-T 'self.committer().timestamp().format("%Y-%m") ++ "\n"' \
| sort | uniq -c
How Often Is the Team Firefighting
jj log --no-graph \
-r 'ancestors(trunk()) & committer_date(after:"1 year ago") & description(regex:"(?i)revert|hotfix|emergency|rollback")'
Much more verbose, closer to programming than shell scripting. But fewer flags to remember.
All joking aside, it really is a chronic problem in the corporate world. Most codebases I encounter just have "changed stuff" or "hope this works now".
It's a small minority of developers (myself included) who consider the git commit log to be important enough to spend time writing something meaningful.
AI-generated commit messages help with this a lot, if developers would actually use them (I hope they will).
[alias]
st = status
ci = commit
co = checkout
br = branch
df = diff
dfs = diff --stat
dfc = diff --cached
dfh = diff --histogram
dfn = diff --name-status
rs = restore
rsc = restore --staged
last = log -1 HEAD
lg = log --graph --decorate --oneline --abbrev-commit
cm = commit -m
ca = commit --amend
cane = commit --amend --no-edit
who = shortlog -sn --no-merges HEAD
dmg = log --oneline -i -E --grep='(incident|outage|downtime|rollback|revert|mitigate|mitigation|hotfix|broke|prod)' --since='1 year ago'
bugs = log --oneline -i -E --grep='(bug|bugfix|fix|fixed|fixes|defect|regression|hotfix|broke)' --since='1 year ago'
bugfiles = !git log --name-only --format='' -i -E --grep='(bug|bugfix|fix|fixed|fixes|defect|regression|hotfix|broke)' --since='1 year ago' | sort | uniq -c | sort -nr
monthly = !git log --since='1 year ago' --format='%ad' --date=format:'%Y-%m' | sort | uniq -c
churn = !git log --format='' --name-only --diff-filter=AM --since='1 year ago' | sort | uniq -c | sort -nr | head -20
> git shortlog -sn --no-merges
Is the most egregious. In one codebase there is a developer's name at the top of the list who outpaced the number 2 by almost 3x the number of commits. That developer no longer works at the company? Crisis? Nope, the opposite. The developer was a net-negative to the team in more ways than one, didn't understand the codebase very well at all, and just happened to commit every time they turned around for some reason.
The most changed file is the one people are afraid of touching?
# summary: print a helpful summary of some typical metrics
summary = "!f() { \
printf \"Summary of this branch...\n\"; \
printf \"%s\n\" $(git rev-parse --abbrev-ref HEAD); \
printf \"%s first commit timestamp\n\" $(git log --date-order --format=%cI | tail -1); \
printf \"%s latest commit timestamp\n\" $(git log -1 --date-order --format=%cI); \
printf \"%d commit count\n\" $(git rev-list --count HEAD); \
printf \"%d date count\n\" $(git log --format=oneline --format=\"%ad\" --date=format:\"%Y-%m-%d\" | awk '{a[$0]=1}END{for(i in a){n++;} print n}'); \
printf \"%d tag count\n\" $(git tag | wc -l); \
printf \"%d author count\n\" $(git log --format=oneline --format=\"%aE\" | awk '{a[$0]=1}END{for(i in a){n++;} print n}'); \
printf \"%d committer count\n\" $(git log --format=oneline --format=\"%cE\" | awk '{a[$0]=1}END{for(i in a){n++;} print n}'); \
printf \"%d local branch count\n\" $(git branch | grep -v \" -> \" | wc -l); \
printf \"%d remote branch count\n\" $(git branch -r | grep -v \" -> \" | wc -l); \
printf \"\nSummary of this directory...\n\"; \
printf \"%s\n\" $(pwd); \
printf \"%d file count via git ls-files\n\" $(git ls-files | wc -l); \
printf \"%d file count via find command\n\" $(find . | wc -l); \
printf \"%d disk usage\n\" $(du -s | awk '{print $1}'); \
printf \"\nMost-active authors, with commit count and %%...\n\"; git log-of-count-and-email | head -7; \
printf \"\nMost-active dates, with commit count and %%...\n\"; git log-of-count-and-day | head -7; \
printf \"\nMost-active files, with churn count\n\"; git churn | head -7; \
}; f"
EDIT: props to https://github.com/GitAlias/gitalias
git log -i -E --grep="\b(fix|fixed|fixes|bug|broken)\b" --name-only --format='' | sort | uniq -c | sort -nr | head -20
I have a project with a large package named "debugger". The presence of "bug" within "debugger" causes the original command to go crazy.
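To make the "debugger" problem concrete, here is a small sketch that runs both patterns over a few commit subjects. The subjects are made up for illustration, and `\b` assumes GNU grep.

```shell
# Sample commit subjects (hypothetical, for illustration only).
subjects() {
  printf '%s\n' \
    'debugger: add step-over support' \
    'fix crash in parser' \
    'debugger: rename internal helper' \
    'bugfix for login timeout'
}

# Naive pattern: also matches "bug" inside "debugger".
naive_count() { subjects | grep -ciE 'fix|bug|broken'; }

# Word-bounded pattern: only whole words match (\b is a GNU grep extension).
bounded_count() { subjects | grep -ciE '\b(fix|fixed|fixes|bug|broken)\b'; }

naive_count
bounded_count
```

Note that the bounded pattern also skips "bugfix", so tightening the regex trades false positives for misses.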
In my experience, when the team doesn't squash, this will reflect the messiest members of the team.
The top committer on the repository I maintain has 8x more commits than the second one. They were fired before I joined and nobody even remembers what they did. Git itself says: not much, just changing the same few files over and over.
Of course if nobody is making a mess in their own commits, this is not an issue. But if they are, squashing can be quite a bit more truthful.
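One rough way to check whether a high commit count means "changing the same few files over and over" is to put distinct files touched next to the commit count per author. This is only a sketch: `author_footprint` is a hypothetical helper name, and it assumes no tracked path starts with "@".

```shell
# Reads the output of: git log --no-merges --format='@%an' --name-only
# Lines starting with "@" open a new commit by that author; other non-empty
# lines are paths touched by that commit.
# Prints: commits<TAB>distinct files<TAB>author, most commits first.
author_footprint() {
  awk '
    /^@/ { author = substr($0, 2); commits[author]++; next }
    NF   { if (!seen[author SUBSEP $0]++) files[author]++ }
    END  { for (a in commits) printf "%d\t%d\t%s\n", commits[a], files[a], a }
  ' | sort -nr
}

# Usage in a repository:
#   git log --no-merges --format='@%an' --name-only | author_footprint | head
```

Many commits next to very few distinct files is a hint, not a verdict; it can just as well be someone who owns one subsystem.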
If the commit frequency goes down, does it really mean that the project is dying? Maybe it is just becoming stable?
Also, very tangentially, to the notion of the Developer's Legacy Index: https://www.javaadvent.com/2021/12/using-jgit-to-analyse-the...
Let's NOT jump to conclusions; it could mean many things. For example, a period with other priorities, different urgencies, other issues external to the project itself and beyond our control, vacations, illnesses, or anything else that could impact the commit history.
I think these considerations and the others expressed in this article can easily lead to hasty conclusions and erroneous deductions, too simplistic.
Coding flow, like business needs, cannot always be objectively and deterministically measured.
git clone --depth 1 --branch $SomeReleaseTag $SomeRepoURL
If you only want to build something, it only downloads what you need to build it. I've probably saved a few terabytes at this point!
Most good projects end up solving a problem permanently and if there is no salary to protect with bogus new features it is then to be considered final?
The grep for bugs is not particularly comprehensive: it will pick up some things that aren't bugs, and will miss a bunch of things too.
The "project accelerating or dying" seems odd to me. By definition, the bulk of commits/changes will be at the very beginning of history. And regardless, "stability" doesn't mean "dying".
> How Often Is the Team Firefighting
> git log --oneline --since="1 year ago" | grep -iE 'revert|hotfix|emergency|rollback'
> Crisis patterns are easy to read. Either they’re there or they’re not.
I disagree with the last two quoted sentences, and also, they sound like an LLM.
This list is also one of many arguments for maintaining good Git discipline.
Well isn't it typical that the person who wrote the code is also the person who merged it? I have never worked in a place where that is not the norm for application code.
Even if you are one of those insane teams that do not squash merge because keeping everyone's spelling fixes and "try CI again" commits is important for some reason, you will still not see who _wrote_ the code, you will only see who committed the code. And if the person that wrote the code is not also the person that merges the code, I see no reason to trust that the person making commits is also the person writing the code.
For the "bus factor", there's one guy and then there's me, but I stopped being a primary contributor to this project nearly two years ago, lol.
Of course the two most useful ones would never be useful in the code base I am currently working on.
"fix" might be the single most common commit message, and after that comes "."
Trying for two years, to get people to include at least some information in their commit messages, has exhausted me.
and it touches in detail what exactly commit standards should be, and even how to automate this on CI level.
And then I also have an idea/vision of how to connect commits to actual product/technical/infra specs, how to make it all granular and maintainable, and how to add IDE support.
I would love to see any feedback on my efforts. If you decide to go through all 3 posts I wrote, thank you
The takeaway from my experiment is that you can really tell a lot by how / when / what people commit, but conclusions are very hard to generalize.
For example, I've also stumbled upon the "merge vs squash" issue, where squashes compress and mostly hide big chunks of history, so drawing conclusions from a squashed commit is basically just wild guessing.
(The author of course has also flagged this. But I just wanted to add my voice: yeah, careful to generalize.)
¹ Nothing is ever finished.
Plus, adding an extra point: When you run git log --oneline --graph and the pattern on the left is more complex than the Persian carpet patterns or Ancient Egyptian writings in the Great Pyramid of Giza, you know it's an engineering & process quality issue rather than a problem with the code itself...
Squash-merge workflows are stupid (you lose information without gaining anything in return as it was easily filterable at retrieval anyway) and only useful as a workaround for people not knowing how to use git, but git stores the author and committer names separately, so it doesn't matter who merged, but rather whether the squashed patchset consisted of commits with multiple authors (and even then you could store it with Co-authored-by trailers, but that's harder to use in such oneliners).
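Since author and committer are stored separately, it is easy to check how often they actually differ in a given repository. A minimal sketch, where `author_committer_split` is a hypothetical helper that reads tab-separated name pairs:

```shell
# Reads "author<TAB>committer" lines and reports how many commits were
# committed by someone other than their author.
author_committer_split() {
  awk -F '\t' '
    { total++ }
    $1 != $2 { differ++ }
    END { printf "%d of %d commits have a different committer\n", differ, total }
  '
}

# Usage in a repository (%x09 is a tab):
#   git log --no-merges --format='%an%x09%cn' | author_committer_split
# Counting Co-authored-by trailers as well (needs a reasonably recent git):
#   git shortlog -ns --group=author --group=trailer:co-authored-by
```

On hosted forges the committer is often the forge itself for squash or web merges, so a high mismatch count does not necessarily mean a human other than the author pushed the change.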
That probably isn’t a good sign
What a weird check and assumption.
I mean, surely most of the "20 most-changed files" will be README and docs, plus language-specific lock-files etc. ?
So if you're not accounting for those in your git/jj syntax you're going to end up with an awful lot of false-positive noise.
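A filter for that kind of noise can be sketched as a small pipeline stage. The exclusion list below is an assumption (README/CHANGELOG files, a few common lock files, a top-level docs/ directory, and *.po translations); adjust it to the repository at hand.

```shell
# Drop paths that tend to dominate churn counts without saying much about
# code health. The list is illustrative, not exhaustive.
drop_noise() {
  grep -vE '(^|/)(README[^/]*|CHANGELOG[^/]*|package-lock\.json|yarn\.lock|Cargo\.lock|go\.sum)$|^docs/|\.po$'
}

# Count occurrences per path, most frequent first.
rank_paths() { awk 'NF' | drop_noise | sort | uniq -c | sort -nr; }

# Usage in a repository:
#   git log --since='1 year ago' --format='' --name-only | rank_paths | head -20
```

Filtering after the fact keeps the git invocation unchanged, so the same stage can sit behind the churn and bug-hotspot commands alike.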
For most, I added some filters and slightly changed the regex, and it showed the reality of the codebase (I already knew the reality, I just wanted to see if it matched, and it did).
I abhor squash merging for this and a few other reasons. I literally have to go out of my way to re-check out a branch. Someone who wants to use my current branch cannot do so if I merge my changes a month later, because the squash rewrites history, and now git is very confused. I don't get the obsession with "cleaning up the history" as if we're all always constantly running out of storage over 2 more commits.
https://gist.github.com/aeimer/8edc0b25f3197c0986d3f2618f036...
Another one I do, is:
$alias gss='git for-each-ref --sort=-committerdate'
$gss
ce652ca83817e83f6041f7e5cd177f2d023a5489 commit refs/heads/project-feature-development
ce652ca83817e83f6041f7e5cd177f2d023a5489 commit refs/remotes/origin/project-feature-development
1ef272ea1d3552b59c3d22478afa9819d90dfb39 commit refs/remotes/origin/feature/feature-removal-from-good-state
c30b4c67298a5fa944d0b387119c1e5ddaf551f1 commit refs/remotes/origin/feature/feature-removal
eda340eb2c9e75eeb650b5a8850b1879b6b1f704 commit refs/remotes/origin/HEAD
eda340eb2c9e75eeb650b5a8850b1879b6b1f704 commit refs/remotes/origin/main
3f874b24fd49c1011e6866c8ec0f259991a24c94 commit refs/heads/project-bugfix-emergency
...
This way I can see right away which branches are 'ahead' of the pack, what 'the pack' looks like, and what is up and coming for future reference ... in fact I use the 'gss' alias to find out what's going on, regularly, i.e. "git fetch --all && gss" - doing this regularly, and even historically logging it to a file on login, helps see activity in the repo without too much digging. I just watch the hashes.
The git blame tip is underrated. People treat it like a gotcha tool but it's maybe the fastest way to find the PR/ticket that explains a weird decision.
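As a rough sketch of that blame workflow: the subjects of the commits behind a range of lines usually carry the PR or ticket reference, so ranking them gives a short reading list. `blame_subjects` is a hypothetical helper, and the path and line range in the usage comment are placeholders.

```shell
# Reads `git blame --line-porcelain` output and ranks commit subjects by
# how many of the blamed lines they are responsible for.
# File content lines start with a tab, so they never match "^summary ".
blame_subjects() {
  awk '/^summary / { sub(/^summary /, ""); print }' | sort | uniq -c | sort -nr
}

# Usage (path and line range are placeholders):
#   git blame --line-porcelain -L 40,80 -- src/parser.c | blame_subjects
```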
Thank you.
Diagnostics function, colorized (I tried to add guards so it is portable with terminals that do not support color):
git_diag() {
local since="${1:-1 year ago}"
local root repo branch
# --- patterns ---
local pattern="${GIT_DIAG_PATTERN:-fix|bug|broken|hotfix|incident|issue|patch}"
local firefight_pattern="revert|hotfix|emergency|rollback"
# --- colors ---
local GREP_COLOR_MODE='never'
if [[ -z "${NO_COLOR:-}" ]] && [[ -t 1 ]] && [[ "${TERM:-}" != "dumb" ]] && [[ "$(tput colors 2>/dev/null || echo 0)" -ge 8 ]]; then
local BLACK=$(tput setaf 0)
local RED=$(tput setaf 1)
local GREEN=$(tput setaf 2)
local YELLOW=$(tput setaf 3)
local BLUE=$(tput setaf 4)
local MAGENTA=$(tput setaf 5)
local CYAN=$(tput setaf 6)
local WHITE=$(tput setaf 7)
local BOLD=$(tput bold)
local DIM=$(tput dim 2>/dev/null || true)
local RESET=$(tput sgr0)
GREP_COLOR_MODE='always'
else
local BLACK='' RED='' GREEN='' YELLOW='' BLUE='' MAGENTA='' CYAN='' WHITE=''
local BOLD='' DIM='' RESET=''
fi
local TITLE="$CYAN"
local COLOR_COUNT="$CYAN"
local COLOR_FILE="$YELLOW"
if ! root="$(git rev-parse --show-toplevel 2>/dev/null)"; then
printf 'git_diag: not inside a Git repository\n' >&2
return 1
fi
repo="${root##*/}"
branch="$(git branch --show-current 2>/dev/null)"
branch="${branch:-DETACHED}"
_git_diag_fmt_count() {
local count_color="$1"
local text_color="$2"
awk -v count_color="$count_color" -v text_color="$text_color" -v reset="$RESET" '{
c=$1
$1=""
sub(/^ +/, "")
printf " %s%10d%s %s%s%s\n", count_color, c, reset, text_color, $0, reset
}'
}
printf '%s%sGit repo diagnostics%s\n' "$BOLD" "$TITLE" "$RESET"
printf '%s%-11s%s %s\n' "$BOLD" "Repo:" "$RESET" "$repo"
printf '%s%-11s%s %s\n' "$BOLD" "Branch:" "$RESET" "$branch"
printf '%s%-11s%s %s\n' "$BOLD" "Timeframe:" "$RESET" "$since → now"
printf '\n\n'
printf '%s%s1) Most changed files%s\n' "$BOLD" "$TITLE" "$RESET"
git log --since="$since" --format='' --name-only \
| awk 'NF' \
| sort \
| uniq -c \
| sort -nr \
| head -n 10 \
| _git_diag_fmt_count "$COLOR_COUNT" "$COLOR_FILE"
printf '\n%s%s2) Top contributors%s\n' "$BOLD" "$TITLE" "$RESET"
git shortlog -sn --no-merges --since="$since" \
| head -n 10 \
| awk -v count_color="$COLOR_COUNT" -v reset="$RESET" '{
printf " %s%10d%s %s\n", count_color, $1, reset, substr($0, index($0,$2))
}'
printf '\n%s%s3) Bug/fix hotspots%s %s(pattern: %s)%s\n' "$BOLD" "$TITLE" "$RESET" "$DIM" "$pattern" "$RESET"
git log --since="$since" --format='' --name-only -i -E --grep="$pattern" \
| awk 'NF' \
| sort \
| uniq -c \
| sort -nr \
| head -n 10 \
| _git_diag_fmt_count "$COLOR_COUNT" "$COLOR_FILE"
printf '\n%s%s4) Commit count by month%s\n' "$BOLD" "$TITLE" "$RESET"
git log --since="$since" --format='%ad' --date=format:'%Y-%m' \
| sort \
| uniq -c \
| sort -k2r \
| awk -v count_color="$COLOR_COUNT" -v mag="$MAGENTA" -v reset="$RESET" '
{
data[NR,1] = $2
data[NR,2] = $1
if (length($1) > max) max = length($1)
}
END {
for (i = 1; i <= NR; i++) {
printf " %s%10s%s %s%*d commits%s\n",
mag, data[i,1], reset,
count_color, max, data[i,2], reset
}
}
'
printf '\n%s%s5) Firefighting commits%s %s(pattern: %s)%s\n' "$BOLD" "$TITLE" "$RESET" "$DIM" "$firefight_pattern" "$RESET"
git log --since="$since" -i -E \
--grep="$firefight_pattern" \
--date=short \
--pretty=format:'%ad %h %s' \
| head -n 10 \
| awk -v mag="$MAGENTA" -v dim="$DIM" -v reset="$RESET" '{
date=$1
hash=$2
$1=$2=""
sub(/^ */, "")
printf " %s%-10s%s %s%-12s%s %s\n",
mag, date, reset,
dim, hash, reset,
$0
}' \
| GREP_COLORS='ms=01;31' grep --color="$GREP_COLOR_MODE" -i -E "$firefight_pattern"
}
Uncolorized diagnostics function (same, but without the colors):
git_diag() {
local since="${1:-1 year ago}"
local pattern="${GIT_DIAG_PATTERN:-fix|bug|broken|hotfix|incident|issue|patch}"
local root repo branch
if ! root="$(git rev-parse --show-toplevel 2>/dev/null)"; then
printf 'git_diag: not inside a Git repository\n' >&2
return 1
fi
repo="${root##*/}"
branch="$(git branch --show-current 2>/dev/null)"
branch="${branch:-DETACHED}"
_git_diag_fmt_count() {
awk '{
c=$1
$1=""
sub(/^ +/, "")
printf " %10d %s\n", c, $0
}'
}
printf '============================================================\n'
printf 'Git repo diagnostics\n'
printf '%-11s%s\n' 'Repo:' "$repo"
printf '%-11s%s\n' 'Branch:' "$branch"
printf '%-11s%s\n' 'Timeframe:' "$since"
printf '============================================================\n\n'
printf '1) Most changed files (top 10)\n'
git log --since="$since" --format='' --name-only \
| awk 'NF' \
| sort \
| uniq -c \
| sort -nr \
| head -n 10 \
| _git_diag_fmt_count
printf '\n2) Top 10 contributors (no merges, since %s)\n' "$since"
git shortlog -sn --no-merges --since="$since" \
| head -n 10 \
| _git_diag_fmt_count
printf '\n3) Bug/fix hotspots (top 10, matching: %s)\n' "$pattern"
git log --since="$since" --format='' --name-only -i -E --grep="$pattern" \
| awk 'NF' \
| sort \
| uniq -c \
| sort -nr \
| head -n 10 \
| _git_diag_fmt_count
printf '\n4) Commit count by month (since %s)\n' "$since"
git log --since="$since" --format='%ad' --date=format:'%Y-%m' \
| sort \
| uniq -c \
| sort -k2r \
| awk '
{
data[NR,1] = $2
data[NR,2] = $1
if (length($1) > max) max = length($1)
}
END {
for (i = 1; i <= NR; i++) {
printf " %10s %*d commits\n", data[i,1], max, data[i,2]
}
}
'
printf '\n5) 10 most recent firefighting commits (revert|hotfix|emergency|rollback)\n'
git log --since="$since" -i -E \
--grep='revert|hotfix|emergency|rollback' \
--date=short \
--pretty=format:'%ad %h %s' \
| head -n 10 \
| awk '{
date=$1
hash=$2
$1=$2=""
sub(/^ */, "")
printf " %-10s %-12s %s\n", date, hash, $0
}' \
| grep --color=never -i -E 'revert|hotfix|emergency|rollback'
}
Anyway, I can glean a lot of this information in a few minutes scrolling through and filtering the log in magit, and it doesn't require memorizing a bunch of command line arguments.
I've got my Emacs set up to display, next to every versioned file, the number of commits that file has been modified in (for the curious: using a modified all-the-icons-ivy-rich + custom elisp code + custom Bash scripts I wrote, and it's trickier than it seems to do in a way that doesn't slow everything down). For example in the menu to open a file or open a recently visited file etc.: basically in every file list, in addition to its size, owner, permissions, etc. I also add the number of commits if it's a versioned file.
I like the fix/bug/broken search in TFA to see where the bugs gather.
*Mainline linux*
Most changed files: pretty much what I expected for 1 and 2... the "cutting edge" of Linux development over other OSes -- bpf and containers. The bpf verifier and AMD GPU driver might get a boost in this list due to sheer LoCs in those files (26K and 14K respectively). An intel equivalent of amdgpu_dm is #21 in the list (drivers/gpu/drm/i915/display/intel_display.c) and nvidia is nowhere to be seen (presumably due to out-of-tree modules/blobs?).
186 kernel/bpf/verifier.c
174 fs/namespace.c
162 drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c
161 kernel/sched/ext.c
159 fs/f2fs/f2fs.h
Bus factor: obviously none. The top 4:
10399 Christoph Hellwig -> I only know his name because of drama last year regarding rust bindings to DMA subsystem
8481 Mauro Carvalho Chehab -> I also know his name from the classic "Mauro, shut the fuck up!" Linus rant
8413 Takashi Iwai -> Listed as maintainer for sound subsystem, I think he manages ALSA
8072 Al Viro -> His name is all over bunch of filesystem code
Buggy files: Intel comes out on top of GPU drivers this time (twice). Along with KVM for x86(64), the main allocator, and BTRFS.
1477 drivers/gpu/drm/i915/intel_display.c
1406 MAINTAINERS
1390 sound/pci/hda/patch_realtek.c
1102 drivers/gpu/drm/i915/i915_drv.h
943 arch/x86/kvm/x86.c
928 mm/page_alloc.c
871 drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c
862 drivers/gpu/drm/i915/i915_reg.h
840 fs/btrfs/inode.c
*GCC*
Most changed files: IR autovectorization code, riscv heuristics tables, and C++ template handling (pt.c is "parameterized types").
152 gcc/tree-vect-stmts.cc
145 gcc/config/riscv/riscv.cc
131 gcc/tree-vect-loop.cc
116 gcc/cp/pt.cc
Buggy files: DWARF debuginfo generation, x86 heuristics tables, RS6000(?!) heuristic tables. I had to look up RS6000, it's an IBM instruction set from the 90s lol. cp-tree.h is an interesting file, it seems to be the main C(++) AST datastructures.
1017 gcc/dwarf2out.c
885 gcc/config/i386/i386.c
796 gcc/cp/cp-tree.h
740 gcc/config/rs6000/rs6000.c
720 gcc/cp/pt.c
*xfwm4*
Most changed files: the list is dominated by *.po localizations. I filtered these out. Even after this, I discovered there is very little active development in the last few years. If I extend to 4 years ago, I get:
1. src/client.c - Realizing this project is too "small" to glean much from this. client.c is just the core X client management code. Makes sense.
2. src/placement.c - Other core window management code.
This has not told me much other than where most of the functionality of this project lies.
Bus factor: Pretty huge. Not really an issue in this case due to lack of development I guess.
3298 Olivier Fourdan
530 Anonymous
319 Xfce Bot
121 Jasper Huijsmans
Files with bug commits: Very similar distribution to most changed files. Not enough datapoints in this one to draw any big conclusions.
I think these massive open projects (excl xfwm) generally have pretty consistent code quality across the heavily trodden areas because of the amount of manpower available to refactor the pain points. I've yet to see an example of "god help you if you have to change that file" in e.g. linux, but I have of course seen that situation many times in large proprietary codebases.
|sort |uniq -c |sort -nr |head -20
I use it often.
On main:
2020-01-01: "Changes"
2020-01-05: "Changes"
2020-01-06: "merge <ref to jira/gh issue>"
2020-01-07: "revert <ref to unrelated jira/gh issue from 2 yrs ago>"
Then there’s the people that include merge commits despite agreeing on rebasing.
Occasionally see sprinkles of decent, consistently formatted commit messages.
I think this is only useful on medium to large _open source_ projects. Clearly established CONTRIBUTING.md/README.md and commit formatting/merging guide.
Wtf is happening to this website
git rm -rf .