I first read about the Engineyard Programming Contest yesterday and I thought it was a silly contest, winnable only through the application of raw brute force.

For some reason, I woke up this morning obsessed with it. This is despite the fact that this “competition” is basically a lottery, in which you buy tickets with basic programming skills and large amounts of computing time.

In the spirit of sharing, I have a few (fairly obvious) things I’ve noticed in an evening of messing around.

I’m not any kind of cryptographer, but from what I know about the known weaknesses in SHA-1, none of them will apply significantly to this contest. Maybe I’m wrong though.

The avalanche effect means you don’t have to change much in the input to see a big change in the output, so making large changes (whole-word permutations) is a waste of cycles.

Permuting the word list at all is almost unnecessary. One core on my 2.2 GHz MacBook Pro takes 45 minutes to check all 7.7 billion combinations of printable five-character strings for a single word-list combination. Once you add the possibilities for varying capitalisation in a single sentence (at least 2^40), a single word-list string has more permutations than a single core can check in many times the contest’s 30 hours. So distributing word-list permutations is, at most, the “top level” job that hands work to each CPU.

SHA-1 uses 64-byte blocks, so if your total string is longer than 64 bytes and the first 64 bytes don’t change, you can compute the hash state for that first block just once. Testing on an 85-character string (the one from the competition blog post), this took me from 1.6 million hash checks per second per core to 2.5 million/second/core.

Using gcc’s __builtin_popcount() and/or the x86 *popcntl* instruction lets you compute the Hamming distance in a handful of instructions.

None of this matters at all, although it’s fun to think about. Even with all these optimisations, I still have at most 16 cores (work and home) to run this on. The winner will have hundreds or thousands of parallel cores at their disposal.

Programming skills seem to play only a minor part. Several hours of optimisation yielded only a 60% improvement over my original naive C program. That said, one of the posters on HN suggested he was getting only a tenth of that performance, which suggests a combination of language choice and savvy performance design may be statistically significant in the long run.

I will laugh if the winner is a shady programmer with a medium-sized botnet at his or her disposal.

Does anyone have any more optimisations to share? Despite it being kinda pointless, I find this kind of thing really interesting. I honestly don’t plan to enter, except maybe out of curiosity to see how close a single consumer computer can get to the winning results.

I agree with your search strategy and implemented the same thing myself. But I don’t think you’re right about who will win. I don’t think someone with a ton of computing power has much of a better chance than someone searching on a single core.

The problem is that the keyspace is so large that even a botnet with a million zombie machines won’t let you search more than a minuscule fraction of it. And the winning (low Hamming distance) keys are so sparsely distributed that no one has a good chance of hitting one.

Let’s do the math. The number of possible combinations depends on how long the words are, but assuming an average word length of 6, there are this many possible combinations in the keyspace:

(((2^6) * 1000) ^ 12) * (94 ^ 5) = 3.5 * 10^67

Say you can generate and check 3 million (3 * 10^6) hashes per second per core, and you have a botnet with 1 million dual-core machines (2 * 10^6). So, you can check 6 * 10^12 combinations per second. At that rate, you can check 7 * 10^17 hashes in 30 hours. You would need to increase your rate of checking by 48 orders of magnitude to cover even 1% of the keyspace.

So, no matter how many computers you are running on, you’re essentially taking a blind stab with no chance of hitting the optimal. The difference in probability of finding the optimum key between the botnet and the single machine is so small, and they’re both so unlikely, that I think the winner will just be someone who happens to hit a good key, regardless of the computing power used.

That’s a really good point. FWIW, I don’t think you need to cover the entire space of input strings. 10^67 ~= 2^222, which is much larger than the output space of 2^160; the rest will be collisions. Of course, there’s no guarantee of a match even in the full input set, but I think it’s probable you’d find one searching on the order of the output set. This is why focusing on word order and choice is mostly a red herring: you already have more than enough expressiveness with just a few simple choices and capitalisation. Your overall point still stands entirely for 2^160, though.

It would be interesting to run the odds of getting a hit at each Hamming distance, and see the point at which the exponential unlikelihood of a match diverges completely from the relatively linear growth in the number of cores. You could probably guess the closest likely winning distance for the contest.

I’ve been thinking about this again, and I think the search space size is misleading. If there was a raffle, and I had 32 tickets and you had 1000 tickets, who would you expect to win?

Yes, it’s not a given – and many small entrants together can tip the balance away from the large ones. But I’d still follow the odds.

I think a better analogy is if there was a raffle with 10 billion tickets, and I had one ticket and you had two. Are the odds of one person much higher than the other? It’s the same reason that buying 10 lottery tickets isn’t any smarter than buying 1.

That’s not the reason you don’t buy 10 lottery tickets, though. The reason buying ten of them is bad is that you have to pay for each ticket, and the expected return is too low. Your odds are indeed still ten times better than before, but monetarily it’s just not worth it.

I think it’s foolish to say that anyone with one machine has the same chance as someone with a botnet. It’s possible for them to win, but their odds will literally be hundreds of times lower even if they have an extremely fast program.

Also, unlike the lottery, there is a guaranteed winner, which changes things significantly. The goal isn’t to find the ONE collision, but the closest.

I don’t understand how __builtin_popcount can be used to compute the hamming distance. From what I can tell, that function just counts the number of 1 bits in an integer. However doesn’t the position of the 1 bits also matter for the hamming distance? If one byte is 10000000 and another is 00000001 then both bytes have a single “1” bit yet shouldn’t the hamming distance be 2 since 2 bits are different (the bits in the first and last positions)?

XOR the two values first.

1. You don’t need 2^160 SHA-1s to find a collision; 2^80 is enough (see the “birthday paradox”).

2. Your timings for SHA1_Transform are really far from perfect. An optimized version requires only about 165 machine cycles per transform on an Intel Core (more precisely, ~660 cycles to process 4 hashes simultaneously with SSE2). So a 2.2 GHz dual-core CPU is able to do about 26.7M hashes/sec.

3. The previous point also doesn’t really matter, as hashing is an ideal workload for modern GPUs. ATI RV740+ and nVidia G80+ easily beat CPUs in hashing speed. For example, a 4870×2 is able to compute ~720M hashes/s, a GTX295 ~415M/s.

4. But all of the above again doesn’t matter, as it’s still impossible to find a collision in real time. However, having several modern GPUs can increase your chances to win.

Thanks for the post, Ivan. I’m quite embarrassed I didn’t think of SIMD optimisations off the top of my head.

I’m surprised that it seems like the Nettle implementation of SHA1 is a full ten times slower than one using SSE2. When I get home I’ll swap in one of the publicly available SSE2-enabled implementations and compare.

I just looked at your website, Ivan. Very impressive. :-)

Well, I don’t know much about the Nettle implementation of SHA1, but simple usage of SSE2 won’t give you a 10x speed-up. You’ll need to remove all unnecessary code (at least strip out the little-endian/big-endian conversions) and then use the Core’s pipelines to the maximum (which is 3 SSE2 instructions per cycle). The second thing isn’t that easy to achieve ;).

Thanks again Ivan.

Due to real life (and spending the last 3 days at the beach), I haven’t had time to do anything on this and probably won’t, but there’s a thread in the Nvidia forums with an interesting CUDA implementation:

http://forums.nvidia.com/index.php?showtopic=102349

I also turned up a CUDA kernel for SHA-1 as part of the Pyrit WPA-cracking project, which looks like it could be pretty easily shoehorned into an existing C-based solution, in place of OpenSSL/Nettle. Not sure how it would perform, though.

http://code.google.com/p/pyrit/source/browse/trunk/cpyrit/

That CUDA program is impressive: 180M tests per second. That makes even a fast CPU at only 4M per second look rather trivial. And you can use multiple video cards at once!