The ubiquitous version control system, Git, has revolutionised software development workflows with its robust set of capabilities. It simplifies the tracking of code changes, enables seamless branching and merging, and facilitates tight collaboration. Today, more than 100 million developers worldwide use the GitHub platform alone.
However, Git is not the right tool for every kind of file management. While it excels at managing source code for software projects, applying it to every file in a development environment can lead to inefficiencies and may even hinder productivity.
Tendency to Track Everything
Git tracks whatever is staged and committed, and broad commands such as git add . or git add -A stage every file in the working tree that is not explicitly ignored. As a result, a repository can easily end up tracking configuration files, build artifacts, and temporary files alongside the source code. It may even track personal notes and other files that were included unintentionally.
This unnecessary tracking results in a bloated repository history, which can be sluggish and confusing to work with. Important code changes become hard to find among the many nonessential files cluttering the repository.
Moreover, there are security risks involved. Git can track sensitive information that has inadvertently been hard-coded, such as passwords and API keys, and expose it to everyone who has access to the repository. If the repository is shared more widely than intended, that can include unauthorised parties.
To avoid nonessential tracking, it is advisable to review the repository regularly for unnecessary files and to take care when adding items to it. It is also crucial to create a .gitignore file, which specifies the files that Git should leave untracked. This file can direct Git to ignore files en masse according to type and location.
The .gitignore file is an invaluable tool Git users should master. Those who prefer to stick with Git should go beyond the basics and explore more .gitignore examples. There are advanced uses of the file, like the exclusion of Terraform provider binaries and IDE-specific files that tend to create clutter without adding any value.
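As a rough illustration of the kind of patterns mentioned above, the following Python sketch writes a starter .gitignore. The specific entries (Terraform's .terraform/ working directory, the .idea/ and .vscode/ IDE folders, and the build/, *.log, and *.tmp entries) are assumptions about a typical project rather than a definitive list, and the function name write_gitignore is hypothetical.

```python
from pathlib import Path

# Illustrative patterns only; adjust them to the project's actual layout.
GITIGNORE_PATTERNS = [
    "# Terraform working directory (holds downloaded provider binaries)",
    ".terraform/",
    "# IDE-specific settings",
    ".idea/",
    ".vscode/",
    "# Build artifacts and temporary files (assumed names)",
    "build/",
    "*.log",
    "*.tmp",
]


def write_gitignore(repo_root: str) -> Path:
    """Write the patterns above to <repo_root>/.gitignore and return its path.

    Note: this overwrites any existing .gitignore in that directory.
    """
    target = Path(repo_root) / ".gitignore"
    target.write_text("\n".join(GITIGNORE_PATTERNS) + "\n", encoding="utf-8")
    return target
```

Keep in mind that .gitignore only affects untracked files. Anything already committed stays tracked until it is removed from the index, for example with git rm --cached.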
However, Git may not be the best option for every file versioning situation. Teams that find its track-everything tendency a poor fit can consider alternatives such as the lightweight and self-contained Fossil, large-repository-optimised Mercurial, and centralised Perforce (Helix Core).
Bloated Repositories
“Repository bloat” encapsulates a number of issues that have been known to arise as a result of Git’s version control design. Repositories can become overinflated due to Git’s tendencies for redundant data storage, excessive histories, and the implications of accidental commits.
Teams working on a project that involves large binary files will likely find Git unwieldy. Git stores each committed version of a file as a full snapshot, and binary formats, especially compressed ones, rarely delta-compress well when Git packs its objects. For example, if only a few pixels of a 15 MB image are modified, the new version typically adds close to another 15 MB to the repository. This redundancy inevitably creates a massive repository as versions accumulate.
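The arithmetic behind this growth can be sketched in a few lines of Python. This is a worst-case estimate that assumes Git finds no savings between versions; the function name and the figure of 20 versions are hypothetical, chosen only to extend the 15 MB example.

```python
def worst_case_history_mb(file_size_mb: float, versions: int) -> float:
    """Upper bound on history size if every version is stored in full.

    Assumes no compression or delta savings between versions.
    """
    if versions < 0:
        raise ValueError("versions must be non-negative")
    return file_size_mb * versions


# 15 MB image x 20 committed versions = 300 MB of history for one file
print(f"{worst_case_history_mb(15, 20)} MB")
```

Real repositories usually land somewhat below this bound, but for already-compressed formats the difference is often small.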
Git is prone to producing excessive histories. A repository's history grows with every version of every file, and large files are carried forward with the same heft even when the actual changes are minimal. Deleting a file from the project does not reclaim the space either, because every earlier version remains in the repository's history.
Additionally, Git's design exacerbates the bloating effect of accidental commits. There are times when developers accidentally commit large files such as build artifacts, datasets, and temporary files that should not exist in the repository. A 1 GB database dump, for example, can quickly inflate a repository.
Even if the accidental commit is corrected in a later commit, the large file continues to exist in the repository's history unless that history is rewritten. If the file contained sensitive data, this also has potentially dangerous data privacy implications.
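Because a large file is hard to get rid of once it is in history, it can help to catch it before it is committed. The following Python sketch lists files above a size threshold in a working tree. The function name find_large_files and the 50 MB default are assumptions made for illustration, not a Git convention.

```python
import os


def find_large_files(root: str, limit_bytes: int = 50 * 1024 * 1024) -> list[tuple[str, int]]:
    """Return (path, size) pairs under root that exceed limit_bytes, largest first."""
    hits = []
    for dirpath, dirnames, filenames in os.walk(root):
        # Skip Git's own object store; only the working tree is of interest.
        dirnames[:] = [d for d in dirnames if d != ".git"]
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                size = os.path.getsize(path)
            except OSError:
                continue  # file vanished or is unreadable; ignore it
            if size > limit_bytes:
                hits.append((path, size))
    return sorted(hits, key=lambda hit: hit[1], reverse=True)
```

A check like this could be run by hand before staging files, or wired into whatever pre-commit tooling the team already uses.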
Aside from raising storage costs and the danger of sensitive data exposure, bloated repositories slow down Git operations such as fetching, pushing, branching, cloning, and merging. This sluggishness hurts development productivity. Moreover, bloated repository histories make a project harder to understand, because clutter and unnecessary files complicate repository navigation and management.
Data Exposure Risks
Git is notable for its data integrity, decentralisation, and access control features that help ensure secure source code management. However, it is not inherently foolproof. Like any other solution, its security depends heavily on how it is used and the security practices put in place.
Because of its default track-everything function, Git can accidentally expose passwords, API keys, or private certificates to anyone who has access to the repository. Accidental commits may even expose this sensitive information to unauthorised individuals.
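One way to reduce this risk is to scan text for likely secrets before it is committed. The Python sketch below is a minimal illustration: the three patterns (the AKIA prefix used by AWS access key IDs, PEM private-key headers, and quoted values assigned to names such as password or api_key) are assumed, non-exhaustive heuristics, and the function name find_secrets is hypothetical. Dedicated secret scanners cover far more cases.

```python
import re

# Assumed, non-exhaustive heuristics; expect false positives and misses.
SECRET_PATTERNS = {
    "aws_access_key_id": re.compile(r"\bAKIA[0-9A-Z]{16}\b"),
    "private_key_header": re.compile(
        r"-----BEGIN (?:RSA |EC |OPENSSH )?PRIVATE KEY-----"
    ),
    "hardcoded_password": re.compile(
        r"(?i)(?:password|passwd|api_key)\s*[:=]\s*['\"][^'\"]{4,}['\"]"
    ),
}


def find_secrets(text: str) -> list[tuple[int, str]]:
    """Return (line_number, pattern_name) for each line matching a pattern."""
    findings = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        for name, pattern in SECRET_PATTERNS.items():
            if pattern.search(line):
                findings.append((lineno, name))
    return findings
```

Running find_secrets over the contents of each file about to be staged gives a quick list of lines worth a second look.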
Meanwhile, a bloated repository implies a broader attack surface. Such repositories are harder to manage, so security issues are harder to spot and resolve. For example, developers may accidentally commit internal documentation or debug logs that contain sensitive data. Finding these errors may take time, or they may never be discovered at all.
Moreover, inflated repositories can increase the risk of security-related errors such as accidental changes that disable security features or result in SQL injection and cross-site scripting vulnerabilities. There are also cases of files getting committed with incorrect permissions, enabling access to unauthorised users.
Using Separate Repositories
For some projects, it may be viable to keep using Git for source code while moving other file types into separate repositories or storage systems.
For example, a project can have a dedicated database repository for managing schema changes and a separate repository for documentation. Media files can be managed separately through a cloud-based asset management system. How you handle these elements essentially comes down to how your organisation handles DevOps.
Amazon S3, Google Cloud Storage, and Azure Blob Storage offer scalable solutions that are cost-effective for storing large files and media assets. These services come with robust security features and efficient data retrieval mechanisms.
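When large files live in object storage, one possible arrangement is to keep only a small manifest entry in Git that records where each file is stored and how to verify it. The Python sketch below builds such an entry. The manifest layout, the function name build_manifest_entry, and the s3://example-bucket/media/ location in the usage note are all hypothetical, and the upload itself is left to the storage provider's own tooling.

```python
import hashlib
from pathlib import Path


def build_manifest_entry(file_path: str, bucket_uri: str) -> dict:
    """Describe a large file so that only this small record needs to go in Git."""
    path = Path(file_path)
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        # Read in 1 MiB chunks so large files are not loaded into memory at once.
        for chunk in iter(lambda: fh.read(1024 * 1024), b""):
            digest.update(chunk)
    return {
        "name": path.name,
        "size_bytes": path.stat().st_size,
        "sha256": digest.hexdigest(),
        # Assumed layout: the object is stored under its file name.
        "location": f"{bucket_uri.rstrip('/')}/{path.name}",
    }
```

For instance, build_manifest_entry("logo.png", "s3://example-bucket/media/") returns a dictionary that can be serialised to JSON and committed, while logo.png itself stays out of the repository.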
Using Git with separate repositories for media files, documentation, and other file types is a common practice. Doing this effectively addresses the issues of repository bloat and inefficient binary file management associated with Git.
Git remains an indispensable tool for modern software development. However, it would be counterproductive to continue using it when repository bloating, indiscriminate file tracking, and security issues arise. The .gitignore file is a good way to address these concerns, but some teams may find it easier to use alternatives or use Git with separate repositories for large files.