Rewriting project history for fun and profit

Cleaning up the version history of various firmware projects

Background:

Our front panel adapter project started out as two separate programs:

The firmware providing forward and reflected power readout on an earlier model RFG, written in MikroBasic
A prototype menu system running on a PC controlling a mock-up front panel, written in Turbo-Pascal

The two were combined into a MikroBasic implementation of a front panel with a setup menu sometime around 2007. The result was used on a minor variant of the RFG, after which front panel development was largely suspended.

Later, in about 2014-2015, development of the standard RFG firmware resumed. A key development was the introduction of the "enhanced" front panel adapter, which enabled the addition of further controls without the need to redesign the main Analogue Controller PCB. Around this time the code base was translated line for line from Basic to C.

The process of translating the code was slow but surprisingly straightforward, apart from some unfortunate differences in the way source modules are managed.

After the code base was converted from Basic to C, the project was placed under revision control using Git.

Prior to the use of git each major development had been retained by storing a copy of the entire project in a suitably named folder, but minor development stages were simply overwritten. The only way to really track the development path was through a large comment block at the beginning of the main section.

Revision control retains the history of the project as a series of date-stamped steps called "commits". At any time the project may be "rewound" to an earlier commit and that version tested, avoiding the need to duplicate the entire workspace. In addition it is possible to maintain parallel versions to support development in multiple directions that may later be reconciled with a "merge".

In short, revision control gives us a really deep "undo" capability and the ability to return to past versions. It also grants the designer the freedom to take development in multiple directions and reconcile them later.

With more organisation it allows multiple team members to work on the same project separately and simultaneously, then combine their work at a later date.

The problem:

Due to inexperience and learning-by-doing the early project history of the front panel is slightly broken. In particular a significant number of files are in the history that should have been excluded. In addition the first four commits are incomplete, and do not contain all the required source files. By about the fifth commit the project is complete, containing all required source and configuration files, but at this point still also containing unwanted intermediate (non-source) files.

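It isn't possible to simply edit the history directly, because each commit "depends" on the previous one with strong integrity checking. Each commit is identified by a hash that covers both its content and the hash of its parent, so changing an early commit changes the identity of every commit after it, much as in a blockchain. This is excessive for a single-user revision control system but essential to shared Git projects.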

The solution:

A "rebase" operation allows us to create an alternative history and apply all subsequent changes to that history, leaving us with a new chain of commits almost identical to the original but with the alterations we specify.

Issues addressed:

Filling in gaps in the opening commits of the Front Panel project
Splicing the history of the frequency agile project to make one continuous history
Fixing issues with the initial configuration
Fixing issues with text file format
Removing unwanted MPLAB files manually
Removing unwanted files automatically
Injecting a new configuration file project-wide

Note:

Typically the original history is retained until a garbage collection operation removes it, and garbage collection will not remove it while it is still referenced somewhere. My preference is to "name" the old history (with a branch or tag) so it is retained, then run comparisons between the old and the new history. This is critical as "rebase" is one of the few operations in Git with potentially irreversible consequences.
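A minimal sketch of that habit, run in a throwaway repository. The branch name "was", the file name and the commit messages are placeholders chosen for the illustration, and the amend stands in for a real rebase.

```shell
set -eu
repo=$(mktemp -d)
cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/master   # use the branch name from this document
git config user.name "Example"
git config user.email "example@example.invalid"

echo "int main(void) { return 0; }" > main.c
git add main.c
git commit -q -m "first commit"

# Name the existing history before rewriting it, so that it stays
# referenced and easy to find.
git branch was

# Stand-in for a real rebase: rewrite the tip commit with a new message.
git commit -q --amend -m "first commit (reworded)"

# The hashes differ because the history was rewritten...
old_tip=$(git rev-parse was)
new_tip=$(git rev-parse master)

# ...but the file content should be identical. "git diff --quiet" exits
# with status 0 when there is no difference.
git diff --quiet was master && echo "old and new history have the same content"
```

Once the comparison has been checked, the "was" branch can be deleted at leisure.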

MPLAB X Project file structure:

Project root contains source code, an editable "Makefile" and the ".gitignore" and ".gitattributes" configuration files

Note that some developers prefer to put source code in a subfolder. At this time I consider it a personal preference, though in some environments it may be mandatory.

".gitignore" contains a list of folders and files that are not to be stored in revision control. The .gitignore file itself is typically stored under revision control, though it doesn't have to be. If an "ignored" file is already checked in, either because that happened before the .gitignore existed or because the add was forced, it will continue to be tracked regardless of whether it is listed as excluded. If you add .gitignore to an existing project you may therefore still need to clean up manually.

".gitattributes" mostly contains instructions for handling the various file types in the project. It tells Git whether a file is binary or text and, for text files, whether they are in Windows or Linux format. It is worth noting that MPLAB X appears to use JGit, a separate Java implementation of Git rather than the standard command-line release, and the bundled version does not appear to check .gitattributes. This can give rise to compatibility issues.

.git folder: This is where the history is stored and will usually only be accessed using "git" commands.

build, debug, dist folders: These are where compiled code and intermediate files go, and should be excluded from the repository

nbproject: This contains two files that are needed to reconstruct the project's configuration: configurations.xml and project.xml. It also contains a significant number of generated files that are usually excluded, and the "private" subfolder containing machine-specific configuration. For this reason the .gitignore rules for "nbproject" tend to be a bit complicated.

nbproject/private contains a SECOND configurations.xml file. This is deliberate: the MPLAB options have been split so that project configuration goes in the first one and computer-specific information, such as install locations, is stored in the private one. By ignoring "private" we help ensure that the stored project is "portable" between computers.

nbproject: project.project appears to be a junk file of zero length. It is not always present and can usually be deleted.
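One way to express the nbproject rules is to ignore everything in the folder and then re-include the two files that are needed. The sketch below is an assumption about how such a .gitignore might look, not the file actually used in the project, and "Makefile-default.mk" is only a stand-in name for the generated files.

```shell
set -eu
repo=$(mktemp -d)
cd "$repo"
git init -q

# Ignore rules: build output folders, plus everything in nbproject
# except the two files needed to reconstruct the project.
cat > .gitignore <<'EOF'
build/
debug/
dist/
nbproject/*
!nbproject/configurations.xml
!nbproject/project.xml
EOF

# Lay out a mock project, including files that should be ignored.
mkdir -p build nbproject/private
touch main.c build/main.o
touch nbproject/configurations.xml nbproject/project.xml
touch nbproject/Makefile-default.mk nbproject/private/configurations.xml

# Stage everything and list what Git actually picked up.
git add -A
tracked=$(git ls-files)
echo "$tracked"
```

Only .gitignore, main.c and the two nbproject XML files should be listed. The private copy of configurations.xml is skipped because its parent folder is ignored.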

Filling in gaps in the opening commits of the Front Panel project

Given what we now know about MPLAB it is possible to inspect the project commits to determine what should have been excluded. Specifically it will be desirable to insert the correct "gitignore" at the outset. We'll also insert a "gitattributes" file but with only the minimum "* text=auto" configuration to reduce issues with line-endings.
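As a rough illustration of the minimum "* text=auto" configuration, the sketch below writes a .gitattributes file and asks Git what it has decided for two paths. The "*.pdf binary" line and both file names are hypothetical additions for the example, not part of the project.

```shell
set -eu
repo=$(mktemp -d)
cd "$repo"
git init -q

# Minimal configuration: let Git detect text files and normalise them,
# with an explicit override for one binary file type.
printf '%s\n' '* text=auto' '*.pdf binary' > .gitattributes

# "git check-attr" reports the attribute value Git will use for a path.
# The paths do not need to exist.
c_attr=$(git check-attr text -- main.c)
pdf_attr=$(git check-attr text -- datasheet.pdf)
echo "$c_attr"
echo "$pdf_attr"
```

This is a quick way to confirm that a rule matches the files you expect before committing the attributes file.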

The first commit is titled: "added some comments" and only contains the main source file, at this point still named fpa0862.c. Revision control removes the need to "version" the filename, so it is preferable to use one consistent filename.

The second commit is titled: "Added Idle hook to displays and Read_Analog", and is the one where the majority of the project is added to revision control, but not the configuration. This is also the point where the options filename is corrected from fpa0861options to fpa0862options.

The third commit is "Imported serial code" and just adds serial.c, a complete MAX3110 demo that has yet to be adapted into a library.

The fourth: "Serial code compiles" adds an incomplete "gitignore" file, the missing configurations and a lot of files that should have been ignored.

The sixth "Testing Serial" also adds some unwanted files.

In particular most of "nbproject" should have been excluded. These files pop up repeatedly in subsequent commits.
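The reason such files keep reappearing is that an ignore rule has no effect on a file that is already tracked. A small sketch of that behaviour and of the manual clean-up, using a made-up "build/main.o" as the unwanted file:

```shell
set -eu
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Example"
git config user.email "example@example.invalid"

# A generated file gets committed before any .gitignore exists.
mkdir build
echo "object code" > build/main.o
echo "int main(void) { return 0; }" > main.c
git add .
git commit -q -m "initial commit, including a build product by mistake"

# Adding the ignore rule afterwards does not untrack the file.
echo "build/" > .gitignore
git add .gitignore
git commit -q -m "add .gitignore"
tracked_before=$(git ls-files build)

# Manual clean-up: remove the file from the index but keep it on disk.
git rm -q -r --cached build
git commit -q -m "stop tracking build products"
tracked_after=$(git ls-files build)

echo "tracked before clean-up: $tracked_before"
echo "tracked after clean-up:  $tracked_after"
```

Note that this only fixes the project from that commit onwards. The earlier commits still contain the file, which is why the history itself needs rewriting.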

The PLAN:

1: Make a "dry run" on a duplicate before attempting the proper operation. If too many errors occur then abandon the attempt and try again later (This turned out to be unnecessary as Git allows you to easily abandon the rebase)

2: Initiate an interactive rewrite of "master"

3: Duplicate the first commit

4: Select editing of the first, second, third (was second) and fifth (was fourth) commits

5: When the first commit comes up amend it, adding the proper gitignore, removing any source files already present and adding all the 0.861 project files.

5b: Note that I now consider it preferable to put .gitignore and .gitattributes in a commit at the beginning before the project files, but this is a personal preference not a requirement.

6: The second commit should complete by itself; however, if there is contention then the source file may need adding. Amend it to remove the old source file. When this is done Git will report that fpa0861 was renamed to fpa0862.

7: When the third commit comes up make sure the redundant fpa0861options files are removed, using amend if necessary. Again this makes it look like the files were renamed.

8: When the fifth commit comes up make sure the unwanted files are removed.

An example command would be "git rm --cached <filename>", which stages a deletion while leaving the file on disk. We then amend the previous commit: the add and the delete cancel out, so the file remains on disk but is no longer committed. Then we deliberately remove the unwanted files from nbproject so that any subsequent commits referencing those files will fail and need merging. Note that the details of when to use "rm --cached" and when to use "reset" are complex, and explained better elsewhere.

9: Subsequent commits may fail due to containing changes to excluded files. The procedure is simply to "rm --cached" these files so that the "rebase" can continue. Deliberately remove the unwanted files from nbproject.

It should be noted that there is an easier way to do the file removal: it is possible to bulk scan a whole repository to remove files matching a pattern. However "git filter-branch" is a separate subject.
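The "rm --cached then amend" step from the plan can be tried in isolation. The following is only a sketch in a scratch repository: the file names are invented, and in the real procedure the same two commands are run while the rebase is paused on the commit being edited.

```shell
set -eu
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Example"
git config user.email "example@example.invalid"

# A commit that accidentally includes a generated file.
mkdir nbproject
echo "int main(void) { return 0; }" > main.c
echo "generated" > nbproject/Makefile-default.mk
git add .
git commit -q -m "Serial code compiles"

# Stage the deletion while leaving the file on disk...
git rm -q --cached nbproject/Makefile-default.mk
# ...then fold that deletion into the commit that added the file.
git commit -q --amend --no-edit

# List the files recorded in the rewritten commit.
in_commit=$(git ls-tree -r --name-only HEAD)
echo "$in_commit"
```

After the amend the commit contains only main.c, while the generated file is still present in the working folder.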

Relevant commands:

git rebase -i HEAD~<number>

Performs an interactive in-place rebase of the current branch. It will open a text editor listing the last <number> commits OLDEST FIRST (the reverse of how commits are usually visualised).

git rebase -i --root

Goes all the way back to the beginning

When using "rebase" interactively the text editor is often "vim", which operates in a slightly obscure way. To perform conventional editing place the cursor where you wish to start and press "i", then press Escape when done.

"dd" removes a line
"P" pastes the removed line above the cursor ("p" pastes it below)
":wq" saves and exits
":q!" quits without saving

Alternatively some configurations (GitHub shell) just open Notepad instead.

"git rebase --abort" puts everything back how it was
"git rebase --continue" completes one stage of the rebase. Depending on the reason for the stop this can happen before or after the commit: merge conflicts halt it before committing, edits halt it immediately after.

"git status" indicates which files are staged to be committed

"git add <file>" stages a file
"git rm --cached <file>" removes the file from the index (the staging area) while leaving the copy on disk, which stages a deletion. Its main use here is during rebasing, but it also serves as the opposite of "add" for a file that should not be tracked.
"git rm <file>" both removes the file from disk and stages the deletion. Alternatively you can delete the file yourself and then "git add" the missing path, which has the same effect in Git 2.0 and later.
"git reset HEAD <file>" un-adds a file by resetting its staged state to match the last commit. It didn't seem to work for me during a rebase. "git reset" in general is a drastic command to be handled with care.
"git reset HEAD^" moves the branch back one commit, leaving that commit's changes in the working folder as uncommitted edits (dangerous). This is useful during rebase if a commit needs to be rewritten and --amend won't do it, but cannot be used to back out of a merge.
"git commit -m <message>" commits the staged changes
"git commit --amend -m <message>" rewrites the previous commit, normally used to change the message but can add file changes too
"git commit --amend --no-edit" rewrites the previous commit using the same message

Problems:

On the first test run a significant number of "merge conflicts" sprang up. Merging is supposed to be the correct way to resolve conflicts between the old and new history, but there should not have been any conflicts. Most of these were line-ending problems: with an inconsistent line-ending configuration, different installs of Git see the same file differently. ".gitattributes" fixes this.

It also looks as if there is no simple way to "back out" of a merge commit. Normal commits can be reset and rewritten using the "splitting" procedure or by amending, but once a merge is in progress it must be completed. This is tricky, as the merge procedure is quite scary when you are unfamiliar with it.

git checkout --theirs <filename> retrieves the newer version

git checkout --ours <filename> retrieves the older version. (This is the reverse of how a "merge" normally works: in a rebase "ours" represents the status quo and "theirs" represents the new.)

git rm --cached <filename> successfully unstages the unwanted files

Using the above it was possible to rebase a test version; however, the "github" git version had persistent problems merging XML files and tripped over CR/LF issues frequently.
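To see which side is which, the sketch below forces a conflict during a rebase and reads back both versions. The file and branch contents are invented for the illustration, and GIT_EDITOR=true is only there to stop Git opening an editor when the rebase continues.

```shell
set -eu
repo=$(mktemp -d)
cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/master
git config user.name "Example"
git config user.email "example@example.invalid"

echo "original" > notes.txt
git add notes.txt
git commit -q -m "base"

# A side branch and master both edit the same line.
git checkout -q -b devel
echo "edited on devel" > notes.txt
git commit -q -a -m "devel edit"
git checkout -q master
echo "edited on master" > notes.txt
git commit -q -a -m "master edit"

# Rebasing devel onto master stops with a conflict.
git checkout -q devel
git rebase master >/dev/null 2>&1 || true

# "ours" is the history being built on (master), "theirs" is the commit
# being replayed (from devel).
git checkout --ours -- notes.txt
ours=$(cat notes.txt)
git checkout --theirs -- notes.txt
theirs=$(cat notes.txt)
echo "ours:   $ours"
echo "theirs: $theirs"

# Keep the replayed version and carry on.
git add notes.txt
GIT_EDITOR=true git rebase --continue >/dev/null
```

Here "ours" comes back as the master edit even though devel is the branch that was checked out when the rebase started.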

A subsequent attempt with ".gitattributes" added completed without the problems.

Further it was a trivial matter to point the master branch to the new location on contract1971 and to rebase devel onto its new home:

git rebase --onto <destination> <source> <branch>
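A scratch-repository sketch of that form of the command. Here <destination> is the rewritten master, <source> is the old master (kept under the name "was") and <branch> is devel. The amend merely stands in for the real history clean-up, and the file names are placeholders.

```shell
set -eu
repo=$(mktemp -d)
cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/master
git config user.name "Example"
git config user.email "example@example.invalid"

echo "a" > a.txt; git add a.txt; git commit -q -m "A"
echo "b" > b.txt; git add b.txt; git commit -q -m "B"

# devel forks from master and gains one commit.
git checkout -q -b devel
echo "c" > c.txt; git add c.txt; git commit -q -m "C"

# Keep a name on the old master, then rewrite master.
git checkout -q master
git branch was
git commit -q --amend -m "B (rewritten)"

# devel still hangs off the old history. Replay the commits that are on
# devel but not on "was" on top of the new master.
git rebase -q --onto master was devel

git log --format=%s devel
```

The log for devel should now read C, B (rewritten), A, showing that only commit C was carried across.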

Footnotes: Compatibility with MPLAB/NetBeans

There is an important language difference: if you are checking out a revision in NetBeans then "Revert" just means discard the current changes.

In Git "revert" often, but not always, means create a commit that undoes a previous commit.

The "switch" operation in NetBeans appears to be the Git "reset" function, which is important but "unsafe".

Monday re-try using Thursday 14/9/2017 backup as base (to remove Friday's "hacking")

The "Modifications to I2C" commit has an unwanted disassembly file. Removed.

MPLAB git may be set to treat all files as binary?

new 9d879bda93898cc2c97863cffbba76b0d4dd3a4a

contract1971 vfd7000 compiles (other variants currently need work anyway)

Checking out revisions in MPLAB: MPLAB tends to choke if there are major changes to project settings, so if project.xml changes then it is best to close the project, perform the check-out in Git GUI, then re-open the project.

master vfd7000 compiles

0861 variants compile but "complain"

Further work: splicing a project history.

The Frequency Agile project was split into two projects for reasons that seemed good at the time.

This is not too hard to fix.

Start with the newest project. Create an extra "was" branch at the head so it will be retained.

Add the old project as a "remote" and "fetch" it. This will leave you with two histories.

Create an "old" branch at the head of the fetched remote history. I thought I had to force this using its SHA, but actually "create new branch" should have done it.

Remove the remote. Now we have a master, a "was" and an "old" commit chain.

Check out "old" then delete all the project files and check in the result. This is important, the newer commits start from an empty state so to join them up Git needs to see an empty project. The "old" chain should end in an empty commit.

Rebase master onto old:

git checkout master

git rebase --onto old --root

This should proceed without conflict since there were no files at the head of old.

The new history has a gap where everything was deleted. Perform an interactive rebase and "squash" the following empty commit. This will close the gap enabling changes to be tracked across the join.

After the join use the "diff" function to compare master with "was" to confirm there is no change.
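The whole splice, up to but not including the final squash, can be rehearsed on two tiny throwaway repositories. This is a sketch only: the make_repo helper, the remote name "oldproj" and the file contents are all invented for the example.

```shell
set -eu
work=$(mktemp -d)

# Helper for the illustration: create a small repository with one commit.
make_repo() {
    git init -q "$1"
    git -C "$1" symbolic-ref HEAD refs/heads/master
    git -C "$1" config user.name "Example"
    git -C "$1" config user.email "example@example.invalid"
    echo "$2" > "$1/main.c"
    git -C "$1" add main.c
    git -C "$1" commit -q -m "$3"
}
make_repo "$work/old_project" "/* early version */" "old project work"
make_repo "$work/new_project" "/* later version */" "new project start"

cd "$work/new_project"
git branch was                       # keep the existing history reachable

# Bring the old history in, name it, then drop the remote.
git remote add oldproj "$work/old_project"
git fetch -q oldproj
git branch --no-track old oldproj/master
git remote remove oldproj

# End the old chain with an empty project so the new root applies cleanly.
git checkout -q old
git rm -q -r .
git commit -q -m "remove all files before splice"

# Replay the whole of master on top of old.
git checkout -q master
git rebase -q --onto old --root

git log --format=%s master
git diff --quiet was master && echo "content unchanged by the splice"
```

The log shows the three commits in one chain, with the "remove all files" commit sitting at the join ready to be squashed away.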

Resolution of the line-endings issue

Background to the line-endings issue:

Different computer systems use different line ending markers, and this causes problems when plain text is transferred from one computer to another.

MS-DOS derived systems use "CRLF", two bytes.

Linux uses "LF"

Many GNU tools use "LF" even when ported to Windows.

Some really old systems use "CR".

Git internally expects text files to use "LF" but can tolerate "CRLF".

The official strategy for Git on Windows is to convert text files to Linux format when checking them in and convert them back when checking out.

The ".gitattributes" file should list file types and their correct conversion strategy.

A common alternative in Windows (apparently used by MPLAB) is to perform no conversions at all. In order to make MPLAB git follow the convention followed by later "gits" it is necessary to make a configuration change in the repository (not global): in the repository config add "autocrlf=true" to the section "[core]".
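The same change can be made from the command line rather than by editing the file by hand. A minimal sketch in a scratch repository:

```shell
set -eu
repo=$(mktemp -d)
cd "$repo"
git init -q

# Repository-local setting (not global), written to .git/config.
git config --local core.autocrlf true

# Read it back, then show the file so the [core] section is visible.
setting=$(git config --local --get core.autocrlf)
echo "core.autocrlf = $setting"
cat .git/config
```

Because the setting lives in .git/config it applies to this clone only and is not shared through commits, so each working copy needs the same change.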

Problems occur when a Windows only project is transferred to Linux.

It may be preferable to "normalize" line endings.

There are three strategies to follow:

1: Given the choice start a project with a clear ".gitattributes" configuration

2: Have a change-over commit and recommit everything that is affected, then use normalised format from that point on. This may still cause grief if anyone has to access the history.

3: rewrite the whole history to the correct format. This is a scary option but ultimately preferable.
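Strategy 2 can be sketched with the "git add --renormalize" option, which I believe needs Git 2.16 or later. The example assumes a client that performs no conversion (core.autocrlf set to false) and uses a made-up notes.txt. "git ls-files --eol" reports the line endings stored in the index (i/) and in the working folder (w/).

```shell
set -eu
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Example"
git config user.email "example@example.invalid"
git config core.autocrlf false   # mimic a client that performs no conversion

# A file committed with Windows line endings and no .gitattributes.
printf 'line one\r\nline two\r\n' > notes.txt
git add notes.txt
git commit -q -m "file stored with CRLF"
before=$(git ls-files --eol notes.txt)

# Change-over commit: add the attributes, then re-stage every tracked
# file so the stored copies are converted to LF.
echo "* text=auto" > .gitattributes
git add .gitattributes
git add --renormalize .
git commit -q -m "normalise line endings"
after=$(git ls-files --eol notes.txt)

echo "before: $before"
echo "after:  $after"
```

The stored copy changes from CRLF to LF at the change-over commit, which is exactly the step that makes comparisons across that point noisy.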

Understanding the Rebase process

I'm not an authority on this but I'm getting the hang of it slowly.

What is a cherry-pick
What's with the ours-theirs thing
What is actually going on during a rebase
How does one command have so many use-cases
What does the repository look like while a rebase is in progress
What happens if I leave it unfinished

I can't entirely answer the sixth question, but a short answer is "surprisingly little". The old chain of commits is still in place and is retained until the rebase is completed, at which point the branch is reset to the rebased state. Any other branches pointing to the old commit chain will stay there unless manually reset.

What is a cherry-pick

This is a good starting point for understanding rebase overall.

Each commit in a project's history may be reduced to a "patch", which is a set of instructions for generating that commit from the parent (or "base") commit by making the minimum changes. The patch represents the specific edit that was made at that point.

The clever thing about patches is that you can apply them to a different target commit and provided the changes do not conflict they will produce a new commit combining the edits in the cherry pick with the edits leading up to the target state.

A simple use case is when you do a bunch of work and then realise you forgot to pull the latest state, so now your commit is a little side-branch. You can make a temporary branch for your edit, then hard-reset to the correct head, then cherry pick your commit and it will attempt to re-create your work onto the correct head.

Another use, probably more "correct" is when a project has multiple branches and one of the branches has gained a feature that needs to be shared across branches but it isn't yet time to merge them. It is possible to cherry pick the relevant commit from another branch to add it to the current branch. This might happen if one member of a development team fixes an old bug, and it is desirable to fix the bug in all branches to prevent the risk of the bug being resurrected after merging.
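The first use case (work committed on a stale head) might look like this in a scratch repository. The branch "latest" stands in for the up-to-date state that was never pulled, and "tmp-work" is a hypothetical name for the temporary branch.

```shell
set -eu
repo=$(mktemp -d)
cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/master
git config user.name "Example"
git config user.email "example@example.invalid"

echo "a" > a.txt; git add a.txt; git commit -q -m "A"

# "latest" moves ahead of master by one commit.
git branch latest
git checkout -q latest
echo "b" > b.txt; git add b.txt; git commit -q -m "B"

# Meanwhile work is committed on the stale master.
git checkout -q master
echo "w" > work.txt; git add work.txt; git commit -q -m "my work"

# Park the work, move master to the correct head, then re-apply the work.
git branch tmp-work
git reset -q --hard latest
git cherry-pick tmp-work >/dev/null
git branch -D tmp-work >/dev/null

git log --format=%s master
```

The result is a straight line of A, B, "my work", with the work commit re-created on top of B.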

The ours-theirs thing...

Cherry-picking is the reason for the "ours-theirs" issue. When you cherry-pick it is assumed that the current state of "head" is "ours", and the picked commit is "theirs". When this is done as part of a rebase operation the part that is being rebased is considered "theirs", even though in practice you are more likely to be rebasing "your" branch onto "their" branch.

In short "ours" and "theirs" end up swapped, but it is safer to understand why than to just assume.

Also if you don't get conflicts then you won't have to worry about it. Conflicts occur when edits overlap.

What is actually going on during a rebase

Rebase needs three locations:

The current branch/head
The target branch
An optional "onto" branch

Rebase starts from the current head.

It steps back from "head" until it finds a commit that either is the target or is shared with the target. A third case is it backs up to a previous merge, in which case I'm not clear what exactly happens.

The commits that aren't shared are put into a "to do" list. It stops when it reaches a commit that is shared with the target, meaning it would already be present in the rebased branch. A simple way to view it is it "stops at the fork".

If you selected an interactive rebase that list is presented for editing. I'm inferring that the list exists anyway interactive or not. I also suspect that the list serves as a "shopping list" for commits and that you could insert the hashes of unrelated commits and they would be added to the rebase.

If you did not include an "onto" branch then it switches to the target branch.

If you DID specify an "onto" branch it will switch there instead.

It will now execute a series of cherry-picks as described in the list that was built earlier.

If any of those cherry-picks fails then it must be resolved by manual intervention as with a failed merge. The confusing part is that the "ours" state is the destination of the rebase and the "theirs" state is the commit you are rebasing.
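The steps above can be imitated by hand, which helped me convince myself of the model. This is a sketch of the idea under that model, not a claim about how Git implements rebase internally. The branch and file names are invented.

```shell
set -eu
repo=$(mktemp -d)
cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/master
git config user.name "Example"
git config user.email "example@example.invalid"

echo "a" > a.txt; git add a.txt; git commit -q -m "A"
git checkout -q -b topic
echo "x" > x.txt; git add x.txt; git commit -q -m "X"
echo "y" > y.txt; git add y.txt; git commit -q -m "Y"
git checkout -q master
echo "b" > b.txt; git add b.txt; git commit -q -m "B"

# 1. Build the to-do list: commits on topic that master does not have,
#    oldest first.
todo=$(git rev-list --reverse master..topic)

# 2. Switch to the target with a detached HEAD, so no branch moves yet.
git checkout -q --detach master

# 3. Replay each commit in turn.
for commit in $todo; do
    git cherry-pick "$commit" >/dev/null
done

# 4. Only now point the branch at the rebuilt chain and check it out.
git checkout -q -B topic

git log --format=%s topic
```

The outcome should match what "git rebase master topic" would produce: Y, X, B, A.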

How does one command have so many use-cases

The most basic use is taking a branch and slapping it onto the head of another branch, and the classic case is rebasing your branch onto the project head so that the project presents a linear timeline and not a gnarly rope of parallel developments. This is sometimes called "vanity rebasing". Rebase automatically determines how far to go back.

A warning against vanity rebasing: it is important to note that unless you deliberately add tests you don't know that all the rebased commits represent valid project states. You would be expected to run tests on the final state to ensure validity but the intermediate states would be git-generated and might not pass tests.

Then there's interactive rebasing, in which the target is usually a number of steps behind the head, so the "HEAD~<number>" notation is often used. The number is the number of commits it steps back by, so "HEAD~5" presents five commits.

A simple use for interactive rebase is "squashing", combining commits that are really one step. Less commonly you might split a commit into two steps, or adjust a commit to add a missing file.
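Squashing can be rehearsed without the editor by letting a command edit the to-do list. The sketch below assumes a sed that accepts "-i" for in-place editing, and uses "fixup", which behaves like "squash" but keeps the first commit's message. The commit names are invented.

```shell
set -eu
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Example"
git config user.email "example@example.invalid"

echo "1" > file.txt; git add file.txt; git commit -q -m "A"
echo "2" > file.txt; git commit -q -a -m "B"
echo "3" > file.txt; git commit -q -a -m "B, missing part"

# "HEAD~2" lists the last two commits, oldest first. Instead of opening
# an editor, sed rewrites line 2 of the to-do list from "pick" to
# "fixup", which folds that commit into the one before it.
GIT_SEQUENCE_EDITOR="sed -i -e '2s/^pick/fixup/'" git rebase -q -i HEAD~2

git log --format=%s
```

Two commits remain, A and B, and B now contains the change that was previously in the third commit.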

A reorder rebase is an interesting sub-case. Sometimes you realise that you've interrupted a line of development to fix a bug or otherwise change some unrelated part of the project. It may be desirable to group the commits by purpose, not by time.

The more complicated one is an "onto" rebase, which allows you to take a branch and recreate it in a different place entirely. There aren't many cases for this, but it might be needed if a line of development is abandoned completely and it is desirable to salvage as much development as possible by pruning and grafting branches.

One "onto" use case is if the master branch of a project has had to be heavily rebased (usually a bad idea), leaving all development branches attached to the old pre-rebase master. "onto" can be used to graft the development branches onto the new master.

What does the repository look like while a rebase is in progress

As far as I can tell your local repository will be in a "detached head" state with no branch corresponding to the head. This is normally a bad thing as it means that checking out a branch will cause the current state to be lost, but in the case of a rebase it means that while the rebase is in progress the branches will point to their original valid states and the rebase-in-progress will be a branch-to-nowhere. The rebase may be aborted at any time in which case head is switched back to its original location.

Importantly this means that if a colleague pulls from your repository (possible in peer-to-peer development) only the valid branches will be visible.

What happens if I leave it unfinished

The repository will be left in a "detached head" state where no branch is currently checked out. Checking out a branch will effectively abort the rebase. Assigning a new branch to the current head would preserve that state, even if you then aborted.

At a later date Git may give you warnings about a rebase being in progress, meaning you may have to manually clean up. If you do abandon a rebase it is best to abandon it properly with "git rebase --abort" rather than leaving it hanging.
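The state described above can be inspected directly. The sketch below starts a rebase that is bound to stop on a conflict, looks at the repository while it is paused, then aborts. The directory names "rebase-merge" and "rebase-apply" are where I understand Git to keep its in-progress state, so treat that check as an assumption rather than a documented interface.

```shell
set -eu
repo=$(mktemp -d)
cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/master
git config user.name "Example"
git config user.email "example@example.invalid"

echo "original" > notes.txt; git add notes.txt; git commit -q -m "base"
git checkout -q -b devel
echo "devel" > notes.txt; git commit -q -a -m "devel edit"
git checkout -q master
echo "master" > notes.txt; git commit -q -a -m "master edit"

# Start a rebase that is certain to stop on a conflict.
git checkout -q devel
devel_before=$(git rev-parse devel)
git rebase master >/dev/null 2>&1 || true

# While stopped: HEAD is detached, devel has not moved, and Git keeps
# its rebase state in a directory inside .git.
head_during=$(git rev-parse --abbrev-ref HEAD)
devel_during=$(git rev-parse devel)
if [ -d "$(git rev-parse --git-path rebase-merge)" ] ||
   [ -d "$(git rev-parse --git-path rebase-apply)" ]; then
    echo "rebase in progress, HEAD is '$head_during'"
fi

# Abandon it properly.
git rebase --abort
head_after=$(git rev-parse --abbrev-ref HEAD)
echo "after abort, HEAD is '$head_after'"
```

During the pause the devel branch still points at its original commit, which is why an unfinished rebase does so little damage.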