MD5 Collision Lab

最近更新：2024-08-26 | 字数总计：4.2k | 阅读估时：20分钟 | 阅读量：次

MD5 Collision Lab
1. 实验内容

MD5 Collision Lab

散列函数（英语：Hash function）又称散列算法、哈希函数，是一种从任何一种数据中创建小的数字“指纹”的方法。散列函数把消息或数据压缩成摘要，使得数据量变小，将数据的格式固定下来。该函数将数据打乱混合，重新创建一个叫做散列值（hash values，hash codes，hash sums，或hashes）的指纹。散列值通常用一个短的随机字母和数字组成的字符串来代表。[1]好的散列函数在输入域中很少出现散列冲突。在散列表和数据处理中，不抑制冲突来区别数据，会使得数据库记录更难找到。

如今，散列算法也被用来加密存在数据库中的密码（password）字符串，由于散列算法所计算出来的散列值（Hash Value）具有不可逆（无法逆向演算回原本的数值）的性质，因此可有效的保护密码。

散列函数的（弱）抗碰撞性：对于一个给定的输入x，找另一个值y使得h(x)=h(y)非常困难。

MD5函数的缺陷导致md5容易遭受碰撞攻击，此次实验可以证明散列函数的抗碰撞性的重要。

MD5碰撞试验：https://seedsecuritylabs.org/Labs_20.04/Crypto/Crypto_MD5_Collision/

实验内容

Task1: Generating Two Different Files with the Same MD5 Hash

使用seed-ubuntu虚拟机中的md5collgen工具生成两个内容不同，但是md5-hash值相同的文件。

先自己创建一个prefix.txt文件，可以在里面随意添加内容。

随后使用md5collgen工具生成两个文件out1.bin和out2.bin。

1	md5collgen -p prefix.txt -o ou1.bin out2.bin

运行情况：

查看文件的内容是否相同：

out1.bin:

out2.bin：

可以看到，两个文件都有共同的前缀jay1an，但是后面仍有不同的地方，也就是md5collgen生成的Padding不同，正是这一部分数据的作用，导致两个文件的hash值是相同的。

注意到这里两个文件的大小都是相同的，并且jay1an后面都有很多空值00(hex)，可以从这里看出，md5collgen会先用空值00(hex)填充，然后再填充有意义的数据。

md5collgen会在prefix内容后增加特定的内容，也正是这些内容导致两个文件的md5-hash值相同。

Question 1. If the length of your prefix file is not multiple of 64, what is going to happen?

分析前面所得的out1.bin和out2.bin文件，可以发现最终生成的两个文件的大小都是192bytes，而且文件的内容分为两个部分，前64bytes和后128bytes。

前64bytes：我们指定的前缀+00(hex)

后128bytes：md5collgen精心构造的128bytes。

又生成两个文件，查看内存内容：

可以得出结论：如果前缀文件大小不是64bytes的倍数，那么md5collgen将会自动用00(hex)将其补充至64bytes的倍数。

Question 2. Create a prefix file with exactly 64 bytes, and run the collision tool again, and see what happens.

通过上面的分析，可以得出：如果prefix文件的大小恰巧是64bytes，那么md5collgen没必要使用00(hex)填充。

使用大小刚好为64bytes的prefix验证之后，与预想一致。

Question 3. Are the data (128 bytes) generated by md5collgen completely different for the two output files? Please identify all the bytes that are different.

这128bytes并不是完全不同的，实际上只有一些细小的差别。

我没有去找出每一个不同的byte，不知道是否还有其他规律

Task 2: Understanding MD5’s Property

MD5是一个很复杂的算法，涉及多轮数学运算，但是算法的框架并没有那么复杂。

MD5算法将数据分为多个大小为64bytes的块，再设置一个初始的向量。然后将初始向量IHV0+一个数据块M1作为输入，进行函数F运算，会获得一个中间向量IHV1（128bytes），此向量还会参与下一次F运算，一直迭代直至所有的数据块都被输入且进行运算了，最后的向量 IHVn就作为md5-hash值。

这里可以得出一个md5算法的一个性质：给出两个值M, N, 如果 MD5(M) = MD5(N)，那么任意的值T，都满足 MD5(M||T) = MD5(N||T)，||表示串联、附加。

也就是说，如果M和N的md5-hash值是一样的，那么在M和N后面加上任意相同的后缀生成的两个不同文件，它们的md5-hash值也是一样的。

此性质对其他类似的hash算法也适用。

在linux上可以使用cat命令将两个文件合为一个文件。

1	cat file1 file2 > file3

此命令会将file2的内容连接到file1后面，保存生成为file3。

Task 3: Generating Two Executable Files with the Same MD5 Hash

要生成两个md5-hash相同但是内容不同的可执行文件。

实验给出了示例代码如下：

#include <stdio.h>
unsigned char xyz[200] = {
	/* The actual contents of this array are up to you */
};
int main()
{
	int i;
	for (i=0; i<200; i++){
		printf("%x", xyz[i]);
	}
	printf("\n");
}

按照实验文档给出的思路，我们先将xyz数组用字母’A’ (ASCII编码为0x41)填满，然后将代码编译为二进制可执行文件。

接着我们分析二进制文件的内容，找到储存xyz数组内容的地方。

最终在offset为0x3020的地方发现了xyz数组：

根据前面task的知识，md5collgen生成的两个hash值相同的文件的不同之处仅在那精心构造的128byets之中。

现在我们将该二进制文件分为三个部分：xyz数组之前，xyz数组，xyz数组之后。

我们选择一个合适的位置将二进制文件截断，截取前面一部分为prefix，往后数128bytes为suffix，保证这128bytes处于xyz数组中间。再使用md5collgen工具，以prefix为前缀生成两个文件prefix_1，prefix_2，然后再依次拼接相同的suffix，生成两个可执行二进制文件ou1.bin，out2.bin。

这样生成的out1.bin和out2.bin就满足要求。

但是要注意，prefix的大小一定要是64bytes的倍数，如果不是，填充的空字符会打乱代码的结构，二进制代码无法执行。

因为0x3020 = 12320(dec)，但是12320并不是64的倍数，所以截取的prefix应该是前12352bytes，而suffix应该截取除了前面（12352+128）bytes之后的所有。

部分操作截图：

截取prefix，suffix：

截取suffix的命令，应该为tail -c +12481 a.out > suffix

因为tail -c +12381 a.out是显示a.out中从第12381个字节到末尾的所有字节。

生成两个前缀prefix_1 prefix_2

将前缀和后缀链接：

运行程序比较结果，发现两个程序打印出的内容不同：

查看md5-hash：

MD5(prefix ‖ P ‖ suffix) = MD5(prefix ‖ Q ‖ suffix)

Task 4: Making the Two Programs Behave Differently

在task3中，我们创建了两个md5-hash值相同但是运行效果不同的文件，但这不同仅仅是print的内容不同而已，代码逻辑仍没有变化，那么能不能让这个“不同”更有意义呢？

Assume that you have created a software which does good things. You send the software to a trusted authority to get certified. The authority conducts a comprehensive testing of your software, and concludes that your software is indeed doing good things. The authority will present you with a certificate, stating that your program is good. To prevent you from changing your program after getting the certificate, the MD5 hash value of your program is also included in the certificate; the certificate is signed by the authority, so you cannot change anything on the certificate or your program without rendering the signature invalid.

You would like to get your malicious software certified by the authority, but you have zero chance to achieve that goal if you simply send your malicious software to the authority. However, you have noticed that the authority uses MD5 to generate the hash value. You got an idea. You plan to prepare two different programs. One program will always execute benign instructions and do good things, while the other program will execute malicious instructions and cause damages. You find a way to get these two programs to share the same MD5 hash value.

You then send the benign version to the authority for certification. Since this version does good things, it will pass the certification, and you will get a certificate that contains the hash value of your benign program. Because your malicious program has the same hash value, this certificate is also valid for your malicious program. Therefore, you have successfully obtained a valid certificate for your malicious program. If other people trusted the certificate issued by the authority, they will download your malicious program.

The objective of this task is to launch the attack described above. Namely, you need to create two programs that share the same MD5 hash. However, one program will always execute benign instructions, while the other program will execute malicious instructions. In your work, what benign/malicious instructions are executed is not important; it is sufficient to demonstrate that the instructions executed by these two programs
are different.

文档中说：创建两个软件，使它们的hash值相同，但是一个执行良性代码，另一个执行恶意代码。那么就可以向权威机构认证此(良性)软件，证书中将包含hash值，但是同时也可以把执行恶意代码的软件上传到互联网，证书同样对执行恶意代码的软件生效。

本次任务的就是创建两个不同的程序，一个执行benign instructions，另一个执行malicious instructions,但hash值相同。

文档给出的示例伪代码:

Array X;
Array Y;
main()
{
	if(X’s contents and Y’s contents are the same)
	run benign code;
	else
	run malicious code;
	return;
}

我的代码：

#include <stdio.h>

unsigned char A[200] = {
'A','A','A','A','A','A','A','A','A','A', 
'A','A','A','A','A','A','A','A','A','A', 
'A','A','A','A','A','A','A','A','A','A', 
'A','A','A','A','A','A','A','A','A','A', 
'A','A','A','A','A','A','A','A','A','A', 
'A','A','A','A','A','A','A','A','A','A', 
'A','A','A','A','A','A','A','A','A','A', 
'A','A','A','A','A','A','A','A','A','A', 
'A','A','A','A','A','A','A','A','A','A', 
'A','A','A','A','A','A','A','A','A','A', 
'A','A','A','A','A','A','A','A','A','A', 
'A','A','A','A','A','A','A','A','A','A', 
'A','A','A','A','A','A','A','A','A','A', 
'A','A','A','A','A','A','A','A','A','A', 
'A','A','A','A','A','A','A','A','A','A', 
'A','A','A','A','A','A','A','A','A','A', 
'A','A','A','A','A','A','A','A','A','A', 
'A','A','A','A','A','A','A','A','A','A', 
'A','A','A','A','A','A','A','A','A','A', 
'A','A','A','A','A','A','A','A','A','A'
};

unsigned char B[200] = {
'A','A','A','A','A','A','A','A','A','A', 
'A','A','A','A','A','A','A','A','A','A', 
'A','A','A','A','A','A','A','A','A','A', 
'A','A','A','A','A','A','A','A','A','A', 
'A','A','A','A','A','A','A','A','A','A', 
'A','A','A','A','A','A','A','A','A','A', 
'A','A','A','A','A','A','A','A','A','A', 
'A','A','A','A','A','A','A','A','A','A', 
'A','A','A','A','A','A','A','A','A','A', 
'A','A','A','A','A','A','A','A','A','A', 
'A','A','A','A','A','A','A','A','A','A', 
'A','A','A','A','A','A','A','A','A','A', 
'A','A','A','A','A','A','A','A','A','A', 
'A','A','A','A','A','A','A','A','A','A', 
'A','A','A','A','A','A','A','A','A','A', 
'A','A','A','A','A','A','A','A','A','A', 
'A','A','A','A','A','A','A','A','A','A', 
'A','A','A','A','A','A','A','A','A','A', 
'A','A','A','A','A','A','A','A','A','A', 
'A','A','A','A','A','A','A','A','A','A'
};

int main()
{
	int i;
	int isSame = 1;
	for (i=0; i<200; i++){
		if(A[i]!=B[i]){
			isSame = 0;
			break;
		}
	}
	if(isSame){
		printf("run benign code");
	}else{
		printf("run malicious code");
	}
}

思路：先定义A、B两个数组，最初的程序中是，A、B是一模一样的，如果没有对该程序进行md5碰撞攻击，那么此程序就会运行isSame为真时的代码，即“良性代码”。

如果对其进行md5碰撞攻击，使得A数组的内容改变，即截取包含A数组内容的128bytes数据块，再通过md5collgen生成两个128bytes实现碰撞实施替换，生成拥有相同md5-hash的两个prefix，然后选择其中任意一个prefix，寻找改变后的A数组的内容，将改变后的A数组内容复制到B数组中（需要对suffix进行分片组合的操作），在将修改后的suffix与两个prefix连接即可获得两个正常的程序：其中一个程序A、B数组是相同的，另外一个则不相同，所以isSame的值也会不同，进而运行的代码会不同，程序的行为会不同。但是，两个程序的md5-hash确实相同的。、

确定分割点，生成prefix：

将代码编译成二进制执行文件，用hexdump命令查看二进制内容：