Git for binary files

git diff can be configured to display meaningful diffs for binary files as well, by converting them to text before comparing them.

… for Excel files

For this we need openpyxl and pandas:

$ pipenv install openpyxl pandas

Then we can use pandas.DataFrame.to_csv in exceltocsv.py to convert the Excel files:

exceltocsv.py
# SPDX-FileCopyrightText: 2023 Veit Schiele
#
# SPDX-License-Identifier: BSD-3-Clause

import sys

import pandas as pd

# Open the workbook once and reuse it for every sheet.
workbook = pd.ExcelFile(sys.argv[1])

for sheet_name in workbook.sheet_names:
    print(f"Sheet: {sheet_name}")
    # Without a target path, to_csv returns the CSV data as a string.
    print(workbook.parse(sheet_name).to_csv(header=True, index=False))

Now add the following section to your global Git configuration ~/.gitconfig:

[diff "excel"]
    textconv=python3 /PATH/TO/exceltocsv.py
    binary=true

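Git only reads a global ~/.gitattributes file if core.attributesFile points to it, for example after git config --global core.attributesFile ~/.gitattributes; the default location is $XDG_CONFIG_HOME/git/attributes, usually ~/.config/git/attributes. Finally, in this global attributes file, our Excel converter is linked to *.xlsx files: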

*.xlsx diff=excel
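To check that Git actually picks up the attribute, git check-attr can be queried from inside a repository. The following is a minimal sketch; the function names and the file name report.xlsx are placeholders chosen for illustration, not part of the original setup.

```python
import subprocess


def parse_check_attr(line: str) -> str:
    """Return the attribute value from one line of `git check-attr` output."""
    # The output has the form "<path>: <attribute>: <value>".
    return line.rstrip("\n").rsplit(": ", 1)[1]


def diff_driver_for(path: str) -> str:
    """Ask Git for the value of the `diff` attribute of `path`.

    Must be called from inside a Git repository.
    """
    result = subprocess.run(
        ["git", "check-attr", "diff", "--", path],
        capture_output=True,
        text=True,
        check=True,
    )
    return parse_check_attr(result.stdout)
```

Once the configuration is in place, diff_driver_for("report.xlsx") should return excel. A value of unspecified suggests that Git is not reading the attributes file.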

… for PDF files

For this, pdftohtml is additionally required. It can be installed with

$ sudo apt install poppler-utils   # Debian/Ubuntu
$ brew install pdftohtml           # macOS (Homebrew)

Add the following section to the global Git configuration ~/.gitconfig:

[diff "pdf"]
    textconv=pdftohtml -stdout

Finally, in the same global attributes file, our PDF converter is linked to *.pdf files:

*.pdf diff=pdf

Now, when git diff is called, both versions of a PDF file are first converted, and the diff is then computed on the converter output.
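The convert-then-diff idea can be sketched in a few lines of Python. This is only a conceptual illustration and not how Git implements it: textconv_diff and fake_convert are hypothetical names, and the stand-in converter merely decodes bytes where a real setup would call pdftohtml.

```python
import difflib
from typing import Callable


def textconv_diff(
    old: bytes,
    new: bytes,
    convert: Callable[[bytes], str],
    name: str = "document",
) -> str:
    """Convert both versions to text, then diff the converted text."""
    old_lines = convert(old).splitlines(keepends=True)
    new_lines = convert(new).splitlines(keepends=True)
    return "".join(
        difflib.unified_diff(old_lines, new_lines, f"a/{name}", f"b/{name}")
    )


def fake_convert(data: bytes) -> str:
    """Stand-in converter for illustration only."""
    return data.decode("utf-8")


if __name__ == "__main__":
    print(textconv_diff(b"one\ntwo\n", b"one\n2\n", fake_convert, "x.pdf"))
```

The diff is only as readable as the converter output, so a converter that produces stable, line-oriented text gives the most useful results.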

… for Word documents

Differences in Word documents can also be displayed. Pandoc can be used for this; it is easily installed with

$ sudo apt install pandoc   # Debian/Ubuntu
$ brew install pandoc       # macOS (Homebrew)

On Windows, download and install the *.msi file from GitHub.

Then add the following section to your global Git configuration ~/.gitconfig:

[diff "word"]
    textconv=pandoc --to=markdown
    binary=true
    prompt=false

Finally, in the same global attributes file, our Word converter is linked to *.docx files:

*.docx diff=word

The same procedure can be used to obtain useful diffs from other binaries, for example from *.zip, *.jar and other archives with unzip, or for changes in the meta information of images with exiv2. There are also conversion tools for converting *.odt, *.doc and other document formats into plain text. For binary files for which there is no converter, strings is often sufficient.
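As an illustration for archives, a converter can also be written in Python instead of calling unzip. The following is a minimal sketch using the standard library's zipfile module; the function name describe_archive and the output layout are assumptions made for this example.

```python
import sys
import zipfile


def describe_archive(path: str) -> str:
    """Return one line per archive member: size, CRC-32 and name."""
    with zipfile.ZipFile(path) as archive:
        # Sort by name so a changed member order does not show up as a diff.
        members = sorted(archive.infolist(), key=lambda info: info.filename)
    return "\n".join(
        f"{info.file_size:>10}  {info.CRC:08x}  {info.filename}"
        for info in members
    )


if __name__ == "__main__" and len(sys.argv) == 2:
    # Git calls the textconv command with the path of the file to convert.
    print(describe_archive(sys.argv[1]))
```

Saved under a name of your choice, the script could be registered like the converters above, with a [diff "zip"] section whose textconv entry calls it and a line *.zip diff=zip in the attributes file. Because the CRC-32 is part of each line, a member whose content changed shows up in the diff even if its size stayed the same.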