# A Review of Hash Functions

Thanks to discussions about git at a recent hackday, concurrent discussion of how cryptographic weakness is related to algorithms and bitlengths, and a couple of offline conversations with jrollins and enw, i've thought a lot more about SHA-1 and hashing functions in general over the last few days.

# What is a hash function?

We all use SHA-1 all the time in modern computer systems, and we're
basically dependent on it. Starting to understand the internals of
git (which is *extremely* SHA-1-dependent) makes me realize just how
much crazy stuff is possible if you're able to accept the underlying
idea of a cryptographically-good hash function.

So how does a hash function work? What makes a hash function cryptographically-good? Collision-resistance and one-wayness are the two key criteria.

# How can a hash function fail?

However, hash functions don't stay "good" forever. Most attacks i've seen concentrate on the math involved in the hash function to weaken the collision-resistance property. That is, if we can understand the underlying math, it might be possible to make reasonable deductions that could collapse the search space for a collision to a tiny fraction of the search space that a brute-force search for a collision would require.

This was quite famously done for MD5 (a once-widely-used hash function) about 4 years ago, and was cleverly demonstrated this year in a so-called "Nostradamus Attack" titled Predicting the winner of the 2008 US Presidential Elections using a Sony PlayStation 3.

Because of these attacks, MD5 is no longer taken seriously as a useful hash function for strong encryption. If you are given the choice between MD5 and SHA-1 in a cryptographic context, you should always choose SHA-1.

# Where does SHA-1 stand?

However, SHA-1 was found to be "broken" 3 years ago by the same
cryptographer who broke MD5. But "brokenness" is on sort of a sliding
scale. Where you'd expect to have to search through 2^{80} items to
find a single matching pair in a fully collision-resistant 160-bit
hash, Dr. Wang showed that that number is actually only 2^{69} for SHA-1.
Bruce Schneier's blog has a good record of this event.

2^{69} is still a very large number, and no actual collision has yet
been found. But there is ongoing work to find a single collision in SHA-1.

# What's next?

In ponderous bureaucratic response to these announced weaknesses, NIST (the national Institute of Standards and Technology, responsible for codifying SHA-1 after it was invented by the NSA) has finally set in motion a public call for a new cryptographic hash family

There are also "longer versions" of SHA-1, up to SHA-512, which
produces a 512-bit hash (which means that a brute-force search for a
single collision should have to examine 2^{256} items, which is a lot).
`/usr/bin/sha512sum` is included in coreutils. And
there are other algorithms proposed, such as whirlpool (also a 512-bit
hash, available in whirlpool).

So what does this mean for git? Very little for the underlying
system, i expect: as i understand the arrangement, *any* hash function
could work. Linus likely chose SHA-1 because of its ubiquity more
than anything else. There was some discussion early in the
development of git that appears to question that choice, but i
couldn't find any serious followup to it. The discussion i found actually seems to imply that there was a SHA-1 collision pair found, but i can't find any other reference to it online. Perhaps Ted T'so was confused by the collision found in SHA-0?

# Additional Reading

- WikiPedia:SHA_hash_functions suggests that the current state of the art at reducing the SHA-1 space stands at 2
^{63}, not 2^{69}, 1/64th cheaper! - An ostensibly global collection of hashes and data files