Original English article: https://www.nervos.org/knowledge-base/what_is_a_hash_function

What is a Hash Function?

A hash function is a mathematical function that takes an input of arbitrary length and converts it into a string of fixed length. They are a primary encryption tool in cryptography and are used by many everyday digital systems, including messaging, banking applications and cryptocurrencies.

The key to understanding hash functions is that they are one-way functions, meaning that their input cannot be reverse-interpreted from the output, which makes them very simple but reliable cryptographic tools.

For example, the following example:

As shown in the example above, running the “let’s learn blockchain” input through the hash function results in the following output:

77db72b12a7667ad73fd33544d1f397268dffe18ca3042e0a09af9f993a8f9c1

However, adding a dot to the input, and re-running it through the hash function, will completely change the output:

17368fcb5bab73c97aa60aa7ae9e54e6676d292743587b9a35ace927a626520a

Examples are the best way to demonstrate the power of hash functions as cryptographic mechanisms. Even the slightest change in the input completely changes the output, meaning it is (almost) impossible to reverse engineer and figure out the original input just by analyzing the output.

Why are hash functions useful?

Hash functions play a vital role in Bitcoin and proof-of-work mining. They ensure the integrity of the blockchain by ensuring that each block contains a unique, unchangeable hash value based on the block's contents.

In Bitcoin mining, miners compete to find a hash value that is less than a target value set by the network. This is done by combining the data from the block header with a nonce (random number) and then running the resulting data through a hash function (SHA-256). The output of this hash function is a fixed-length string of numbers and letters that are unique to the contents of the block. Miners must try many different nonces until they find one that hashes less than the target value. Once a miner finds the right hash value, they broadcast it to the network as a proof of work and are rewarded with newly minted Bitcoins.

Hash functions are also used to link blocks in a blockchain together. Each block contains a hash of the previous blockchain header data, which creates a blockchain that is linked together in a tamper-proof manner. Any attempt to change the data in a block will result in a different hash value, which will be detected by the network and considered invalid.

Hash functions are critical to the security and integrity of the Bitcoin network and proof-of-work mining in general. They ensure that each block contains unique and unchangeable data and enable the creation of a tamper-proof blockchain that forms the basis of the blockchain.

What are the most common hashing algorithms?

While there are many different types of hash algorithms, each with unique characteristics, used by individuals and businesses today, the most popular include Message Digest 5 (MD5), Secure Hash Algorithm 1 (SHA-1), Secure Hash Algorithm 2 (SHA-2), and Secure Hash Algorithm 3 (SHA-3).

Message Digest 5 (MD5)

Message Digest 5 (MD5) is a cryptographic hash function that produces a fixed-size output of 128 bits, regardless of the size of the input message. It was developed by Ronald Rivest in 1991 and is widely used in digital signature applications and to verify the integrity of files.

MD5 takes an input message of arbitrary length and breaks it into fixed blocks. Each block is then processed through a series of rounds, with each round using a different mathematical function to transform the input block. In each round, MD5 performs four basic operations on the input block: addition, bitwise logical operations, circular shifts, and modular addition. These operations are designed to scramble the input block in an irreversible way and produce a fixed-size output that is unique to the input message.

MD5 is considered a relatively fast and efficient hash function, but it also has several weaknesses that make it vulnerable to attack. For example, it is possible to create different input messages that produce the same MD5 output (called a "collision"), which makes it easier for attackers to create malicious files that appear to have the same integrity as legitimate files. Due to the vulnerabilities, MD5 is no longer recommended for new applications that require strong cryptographic security. Instead, it is recommended to use more secure hash functions such as SHA-256 or SHA-3.

Secure Hash Algorithm 1 (SHA-1)

Secure Hash Algorithm 1 (SHA-1) is a hash function that takes a random-length input and produces a 160-bit (20-byte) hash value called a message digest, usually expressed as 40 hexadecimal digits. The NSA designed the algorithm in 1995, but hash functions have since been broken and replaced by more secure protocols.

SHA-1 transforms the user data by breaking the input into "n" parts; each 448 bits in size, then adding 64 bits of padding, for a total of 512 bits. These 512 bits are sent through a compression function, which outputs a final 160-bit hash value.

Secure Hash Algorithm (SHA-2)

Secure Hash Algorithm (SHA-2) is a family of cryptographic hash functions that includes SHA-224, SHA-256, SHA-384, and SHA-512. Like SHA-1, SHA-2 was designed by the United States National Security Agency (NSA) and is widely used in a variety of security protocols and applications.

SHA-2 uses the same basic structure as SHA-1, but has longer input and output block sizes, which makes it more secure against brute force attacks. SHA-224 and SHA-256 have 32 bytes, while SHA-384 and SHA-512 have 64 bytes.

SHA-2 works by breaking the input message into fixed-size blocks and then processing each block using a series of mathematical operations. The processing of each block involves a series of logical functions such as AND, OR, XOR, as well as modular addition and bit rotation operations.

The core of the SHA-2 algorithm is a compression function that takes a message block and a set of variables called a message schedule and updates the variables to produce a new hash value. This compression function is repeated until all message blocks have been processed, at which point the final hash value is generated.

SHA-2 is widely regarded as a secure and strong cryptographic hash function and is used in various applications such as digital signatures in blockchain (SHA-256), SSL/TLS, and file integrity checks. However, many security researchers believe that sooner or later the world will migrate from SHA-256 to SHA-512 to ensure greater security.

Secure Hash Algorithm 3 (SHA-3)

Secure Hash Algorithm 3 (SHA-3) is the latest iteration of the Secure Hash Algorithm family of cryptographic hash functions, published by the United States National Institute of Standards and Technology (NIST) in 2015. It is based on a new design called the Keccak algorithm, which was selected from an open competition to develop a new hash standard to replace SHA-2.

Like its predecessors, SHA-3 accepts an input message of arbitrary length and produces a fixed-length output, or hash, of 224, 256, 384, or 512 bits. SHA-3 uses a sponge structure, meaning that the input message is absorbed into the algorithm's state, which is then compressed to produce the output hash.

The sponge construction is based on a permutation function, a bijective mapping of input bits to output bits. The permutation function is repeatedly applied to the state along with the input message until the entire message has been absorbed. The remaining state is then compressed to produce the output hash.

One of the main advantages of SHA-3 over SHA-2 is that it is resistant to length extension attacks, where an attacker can append additional data to the hash without knowing the original input. SHA-3 also has a simpler design than SHA-2, which makes it easier to implement in hardware and software.

Overall, SHA-3 is considered a secure and efficient cryptographic hash function, recommended for applications such as digital signatures, key derivation, and data integrity. For this reason, a popular hash function from the SHA-3 family, called keccak-256, is used in several established blockchains today, including Ethereum. Nervos’ Layer 1 blockchain, Common Knowledge Base (CKB), uses a novel, SHA-3-inspired hash algorithm called Eaglesong.

While hash functions are generally secure and widely used in cryptography, they are not foolproof. Namely, some potential vulnerabilities associated with them include:

  • Collision Attack: This type of attack occurs when an attacker can generate two inputs that produce the same hash output. This could allow an attacker to replace one input with another, potentially leading to a security breach.

  • Length extension attack: In this type of attack, the attacker can append additional data to the end of the message without knowing the content of the original message. This may allow the attacker to create a fake message with a valid hash, making it appear as if the message is legitimate.

  • Preimage Attack: A preimage attack occurs when an attacker can find an input that produces a specific hash output. This could allow an attacker to create a message that hashes to a known value, potentially leading to a security breach.

  • Birthday attack: In a birthday attack, the attacker exploits the birthday paradox to find two messages that hash to the same value. This type of attack is particularly effective against hash functions with small output sizes, such as MD5.

  • Side channel attacks: These attacks do not target the hash function directly, but exploit weaknesses in the implementation or environment in which the hash function is used. Side channel attacks include timing, power analysis, or electromagnetic attacks.

It’s worth noting that many of these vulnerabilities are related to older or weaker hash functions, such as MD5 or SHA-1. More modern hash functions, such as SHA-256 or SHA-3, are designed with these attack vectors in mind and are generally considered unbreakable.