If a cryptographic hash function isn’t secure, it may be vulnerable to preimage attacks. There are two types of preimage attack:
- First preimage attack — an attacker can find the original input from a hash.
- Second preimage attack — an attacker is given one input and can find a matching input that results in the same hash.
Both of these attacks can undermine the security of cryptographic hash functions.
This is a companion piece to our What is a collision attack? article. Preimage attacks are closely related to collision attacks and cover much of the same ground, so we won’t bother rehashing too much of the same information. If you want to know what a cryptographic hash function actually is, or what a collision is, you should head to the other article first.
So let’s jump straight to it and talk about the wonders of preimage attacks.
Cryptographic hash function basics
Before we can understand what a preimage attack is, we need to know what a preimage actually is. In this context, it requires a little knowledge on cryptographic hash functions.
As a quick definition, a hash function can take in an input of any length, and it always outputs a fixed length hash, such as 256 bits.
For a cryptographic hash function to be secure, it needs to meet a few requirements, such as being one-way, which means that it’s almost impossible to figure out the initial input from the hash.
Cryptographic hash functions are also deterministic, which means that the same input always results in the same hash. Another critical feature is that it’s unfeasible to find two inputs that result in the same hash. There’s a little more to them, but that’s the gist. If you want to know more, head to the article linked above.
What is a preimage?
Now that we’ve covered the hash function basics, let’s get back to understanding what a preimage is.
When we run the text, “What is a preimage?” through this SHA-256 calculator, it gives us the following hash value:
016617d839d3221681037f5a4f14bbb26e519f79df2269bd728c1197c4fa6060
“What is a preimage?” is actually a preimage of 016617d839d3221681037f5a4f14bbb26e519f79df2269bd728c1197c4fa6060.
When we say this, we basically mean that a preimage is any value from a group that results in a specific hash. If we were able to find any other messages that also resulted in the above hash, these would be preimages of it as well.
If you were paying close attention, you may have noticed what seems a little bit like a contradiction. We just said that a critical feature of a secure hash function is that “it’s unfeasible to find two inputs that result in the same hash,” but now, apparently, there’s a whole group that can result in the same specific hash. What gives?
No, it isn’t actually a contradiction, and it’s because the term “unfeasible” is doing a lot of the heavy lifting in that sentence. The reality is that every given hash has multiple inputs, or preimages, that can produce it. This is an unfortunate reality we have to deal with, and it’s what makes preimages such a concern when it comes to ensuring our cryptographic hash functions are secure.
Hash functions, preimages and the pigeonhole principle
Let’s look at it this way. A SHA-256 hash function produces hashes that are 256 bits in length. This means that the biggest possible number can only be 256 bits long. We’ll convert this number to decimal so that you can think about it in the number system that’s more familiar to you:
115,792,089,237,316,195,423,570,985,008,687,907,853,269,984,665,640,564,039,457,584,007,913,129,639,936
If the above number is the largest possible hash, that means that there are that many possible hash combinations in SHA-256. Now, don’t get us wrong, that’s an unfathomably large number, but it’s still a number we can write down. It’s a finite number.
Let’s think about what this means in the context of another property that secure hash functions have, that they can take inputs of any length. This means that you can throw anything in there, from a full stop up to the collected works of Charles Dickens. You won’t be able to put anything super long into the hash calculator we linked above, but this is because of its implementation of SHA-256, and not a limitation of SHA-256 itself.
What we’re saying is that the potential inputs to a hash function are infinite. If there is only a fixed 256-bit number of hashes for them to map to, it means that we will eventually run out of possible hashes. While 256-bit numbers are huge, an infinite number is, you know… infinite.
Therefore, there have to be multiple inputs that lead to the same hash. Every single hash must have a group of preimages that can produce it. This is due to the pigeonhole principle, which basically states that if there are x number of objects, and y number of containers that they need to be put in, but x is a larger number than y, then some of the containers must have multiple objects placed in them.
Domestic pigeon from Wikimedia Commons by Eatcha, CC-BY 4.0.
Infinity is a whole lot bigger than 256 bits, which means that every possible SHA-256 hash must have a group of possible preimages.
But what does this mean? Is the sky falling?
Not necessarily. We can actually get away with it and still have secure cryptographic hash functions, even though the pigeonhole principle is trying to rain on our parade. We just have to make sure that it’s really, really hard for anyone to find these preimages.
If it becomes practical to find preimages in a cryptographic hash function, then that hash function is not considered secure against preimages and should not be used in any scenario in which they are a threat.
This covers most things that we actually use cryptographic hash functions for, however, they may still be useful in low-risk situations, such as when we are checking for data corruption and aren’t concerned about attackers.
Formalizing preimages
Let’s frame preimages in some fancy mathematical notation. If our input into the cryptographic hash function is x, and H indicates the hash function, we can show the hash of x as H(x). If the value of our hash is h (note that it’s lowercase this time), then we can write this all as:
h = H(x).
x is therefore a preimage of h. However, given that we said that there are a group of possible preimages out there for each value, this means that x isn’t the only preimage of h. Other preimages may be almost impossible to find, but due to the pigeonhole principle, somewhere in the ether, there is an input, y, where h = H(y). Therefore, y would also be a preimage of h.
If x is “What is a preimage?“, then when we hash “What is a preimage?” with SHA-256, we can write it as H(“What is a preimage?”).
We know that h = H(x), and from the calculator we used earlier in the article, we also know that the hash of “What is a preimage?” is:
016617d839d3221681037f5a4f14bbb26e519f79df2269bd728c1197c4fa6060
Therefore:
016617d839d3221681037f5a4f14bbb26e519f79df2269bd728c1197c4fa6060 = H(“What is a preimage?”)
This means that “What is a preimage?” is a preimage of the hash 016617d839d3221681037f5a4f14bbb26e519f79df2269bd728c1197c4fa6060. There will be many other preimages for the above number out there, but because SHA-256 is a secure cryptographic hash function, we will not be able to find them.
What is a preimage attack?
We know that based on the limitations of cryptographic hash functions and the pigeonhole principle, every hash output must have multiple preimages. However, for our hash functions to be secure, these preimages need to be almost impossible to find. A preimage attack is when an attacker can practically find a preimage for a hash.
If an attacker can find a preimage from just the hash, this circumvents one of the critical properties of a secure cryptographic hash function— the hash function is one way.
In situations where an attacker can practically figure out an input from the hash, the security of the hash function is completely undermined. If preimage attacks are practical, this can have severe ramifications for our online security.
One example of where this can cause problems is in a protocol that encrypts the message, but not the digital signature that is sent alongside it. If preimages can be found and the cryptographic hash function is not one-way, then it’s possible for an attacker to compute the plaintext of the message. This is due to the fact that when a hash function isn’t one way, the digital signature can leak the plaintext, circumventing the encryption that was intended to keep it confidential.
To prevent these preimage attacks, our hash functions therefore need to be preimage resistant. They need to be carefully designed to retain their one-way properties, so that it is unfeasible to figure out preimages from their hashes.
What is a second preimage attack?
Yes, there’s more than one type of preimage attack. A second preimage attack is when an attacker is given a specific input, and can then find another input that results in the same hash.
In the case of our input, “What is a preimage?,” a second preimage attack would occur if an attacker could find another input that also results in the same hash value, which as we said, is:
016617d839d3221681037f5a4f14bbb26e519f79df2269bd728c1197c4fa6060
Written out formally, a second preimage attack is when an attacker is given a input, x1, and then can find any other input, x2, so that:
H(x1) = H(x2)
If a second preimage attack is practical against the cryptographic hash functions that we deploy, it would have significant ramifications for our security.
Let’s say that Alice hashes a message with a hash function that is vulnerable to second preimage attacks, and then digitally signs it. An attacker, Eve, could then find a second message that also produces the same hash. She could then substitute this second message for the original message and send it alongside the original digital signature.
When Bob goes to verify the digital signature, it will look like this second message has been signed by Alice, because the fake message has the exact same hash as the original one. Bob would assume that this second message is legitimate, which means that Eve’s deception would go unnoticed.
Because of problems like these, we also need to design our cryptographic hash functions so that they are second preimage resistant. It’s important to make it almost impossible for an attacker to find a message that hashes to the same value as a given message.
What’s the difference between second preimage attacks and collision attacks?
If you check out our other article on collision attacks, you will see that a collision attack is whenever any two inputs can be found that both result in the same hash.
This differs from a second preimage attack, because a second preimage attack involves an attacker being given a specific input and then having to find another input that produces the exact same hash.
Collision attacks are much easier, because an attacker can try to find matching hashes for any two possible inputs. It doesn’t matter what the hash is, as long as the attacker can find two inputs that produce the same one.
Second preimage attacks are much more challenging to successfully mount, because an attacker is limited by the specific input they have been given, and they must find another input that also results in that exact same hash.
In notation, a collision attack is where an attacker can choose any two inputs, x1 and x2, so that:
H(x1) = H(x2)
In a second preimage attack, the attacker must also find:
H(x1) = H(x2)
However, in this case the attacker is given the value for x1.
Second preimage resistance is also known as weak collision resistance, while normal collision resistance is also known as strong collision resistance. This is because second preimage attacks are much harder to achieve than collision attacks, so a cryptographic hash function doesn’t need to be as strong to be considered secure against second preimage attacks.
We go into the potential harms that collision attacks can cause in more detail in our other article, but we’ll give you the short version here. When an attacker can find collisions in an algorithm, it makes it possible for them to manipulate documents to circumvent the digital signature process. This allows them to pass off fraudulent messages and other documents as legitimate.
This means that it’s also crucial for our hash functions to be collision resistant as well, and it needs to be unfeasible for an attacker to be able to find collisions.
Are preimage attacks dangerous?
All of the cryptographic hash functions that have seen widespread use over the last decade or two are still considered preimage resistant. This means that even if you are using a hash function that is outdated and insecure for many purposes, such as MD5 or SHA-1, they are still presumed to be resistant to preimage attacks.
However, this doesn’t mean that these hash functions are safe to be implemented in many scenarios. Collisions have been found in both MD5 and SHA-1, so the situations that they can be safely used in are more limited. One scenario in which SHA-1 is still considered secure is when it is implemented in HMACs. SHA-2 is a much better choice in most other scenarios.
Despite the fact that preimage attacks haven’t been found in any of our recently used cryptographic hash functions, this doesn’t mean that they aren’t a concern. Due to the critical role that cryptographic hash functions play in our online security, the threat of preimage attacks always needs to be considered in the design and development of hash functions. Researchers also need to be constantly probing the hash functions we currently use to ensure that preimage attacks that would undermine our security cannot be found.
However, due to the links between collision attacks and second preimage attacks, we will always find collisions in an algorithm before we discover second preimage attacks. This is due to the strong collision resistance and weak collision resistance properties we referenced earlier.
This means that we will abandon a cryptographic hash function due to fears over potential collisions before second preimage attacks become a practical threat.
Preimage attack on MD2
We do have an example of a preimage attack against a cryptographic hash function that actually got put into the field, but we had to go all the way back to MD2 to find it. MD2 was published by Ron Rivest way back in 1989, and collisions against its compression function were first published in 1997.
By 2004, researchers found that the MD2 was vulnerable to preimage attacks against the algorithm’s compression function. They calculated that a preimage of an MD2 hash could be found with a complexity of 2104. This is a huge number, and although it means that we definitely shouldn’t be implementing MD2, this preimage attack isn’t particularly practical.
This was improved upon in 2008 when another academic published a preimage attack on MD2. The paper showed a significant improvement, bringing down the complexity to 273. While this is a significant leap, preimage attacks on MD2 still aren’t easy to mount.
Preimage attacks and the security of our current algorithms
What’s the significance of this? To find preimage attacks that are even somewhat practical, we had to go all the way back to a cryptographic hash function from the eighties. Even in this case, the preimage attacks against it aren’t exactly easy. This means that at least for now, we don’t have to be too concerned about preimage attacks against a current cryptographic hash algorithm like SHA-256. But what of the future?
Post-quantum preimage attacks
Quantum computing poses an, as yet, existential threat to current cryptography standards. At the time of writing, the majority of quantum computers are essentially prototypes with between 30 and 1,000+ quantum bits (qubits). Qubits have more flexibility than traditional two-value bits, as they can be in more than one physical state at a time.
Quantum machines of the future are expected to have millions of qubits and the capability to do things like run Shor’s algorithm, which would theoretically allow them to factor large prime numbers — the difficulty of which provides the basis of the security in the widely used RSA algorithm.
Nevertheless, SHA-256 is theorized to be safe against pre-image attacks using ever more powerful quantum machines. Scientists from the University of Tartu showed that the Merkle–Damgård construction used by the SHA-2 family is quantum collision-resistant.