Unwrapping git to treat SQL files as text
I got this error message when trying to compare two plain text sql files:
"Binary files a/person.sql and b/person.sql differ"
Hold Up! These aren't binary files. To prove it to myself I opened them up in notepad++. "See!," I said to my rubber duck - they're just plain text characters.
Where I went next took me on a 3 hour journey of unicode encoding and git internals.
These aren't the meat and potatoes of a developer's life, but they compose the building blocks of literally almost every ounce of output delivered, so they're worth understanding, at least nominally.
The Problem #
First let's look at git to see where the problem begins.
Every time git stores a file, it has to undergo a process of binary detection to determine whether the file is worth diffing. It doesn't get to make this determination merely because of the file extension because multiple vendors can use the same extension and it doesn't have a registry of every possible filetype anyway.
[Instead, git looks for] a NULL (0) byte somewhere within the first 8000 characters of the file. Typically, that happens because the file is being saved as something other than UTF-8. So, it's likely being saved as UCS-2, UCS-4, UTF-16, or UTF-32. All of those have embedded NULL characters when using ASCII characters
- John Szakmeister
Sure enough, if we look at the "binary" sql file, we'll see it's saved as UCS-2 w/ BOM. And if we look open it up with a HEX editor, we can see multi-column null characters.
Compare to what happens when we convert it back to UTF-8:
Possible Solutions #
Landscape
- Change Encoding (Cleanup (individual/batch) + Plug the leak)
- Change .gitattributes
- Change Diff Tool
Show each solution with across different git tools:
- git cmd
- git gui
- sourcetree
- github
- External - Kdiff3
- External - Others
# get all files
$baseDirectory = "H:\stash\db_sphinx\SQLCode"
$files = Get-ChildItem -Path $baseDirectory -Recurse *.sql
# using WriteAllLines
$Utf8NoBomEncoding = New-Object System.Text.UTF8Encoding($False)
$files | ForEach {
$FileContents = Get-Content $_.FullName;
[System.IO.File]::WriteAllLines($_.FullName, $FileContents, $Utf8NoBomEncoding)
}
References #
Git Binary/Text Files
- Why does git think my .sql file is a binary file?
- Why does git think my cs file is binary?
- How do I diff utf-16 files with GNU diff?
- Can I make git recognize a UTF-16 file as text?
- Why does Git treat this text file as a binary file?
- Set file as non-binary in git
Update Encoding
SQL Files Binary in Git
- How can I make Git show changes to my .sql files?
- Diffs for Sql files are shown as "binary"
- Default encoding in SQL Server Management Studio
- Need a way to set the default encoding for query files in SSMS
- Unicode Defaults on SSMS Save
Git attributes
Github
- github/linguist
- 3D File Diffs
- Diffable, more customizable maps
- Behold: Image view modes](https://github.com/blog/817-behold-image-view-modes)