Tuesday, May 25, 2004

 

Some notes on memcpy

This post grew out of the discussion on FYuan's blog at
http://blog.joycode.com/fyuan/articles/30945.aspx
In fact, memcpy has always been a hot topic, so consider this a look back at two older articles.

1.
Optimizing Memcpy improves speed
by Michael Morrow
from http://www.embedded.com/showArticle.jhtml?articleID=19205567

Knowing a few details about your system-memory size, cache type, and bus width can pay big dividends in higher performance.

The memcpy() routine in every C library moves blocks of memory of arbitrary size. It's used quite a bit in some programs and so is a natural target for optimization. Cross-compiler vendors generally include a precompiled set of standard class libraries, including a basic implementation of memcpy(). Unfortunately, since this same code must run on hardware with a variety of processors and memory architectures, it can't be optimized for any specific architecture. An intimate knowledge of your target hardware and memory-transfer needs can help you write a much more efficient implementation of memcpy().

This article will show you how to find the best algorithm for optimizing the memcpy() library routine on your hardware. I'll discuss three popular algorithms for moving data within memory and some factors that should help you choose the best algorithm for your needs. Although I used an Intel XScale 80200 processor and evaluation board for this study, the results are general and can be applied to any hardware.

A variety of hardware and software factors might affect your decision about a memcpy() algorithm. These include the speed of your processor, the width of your memory bus, the availability and features of a data cache, and the size and alignment of the memory transfers your application will make. I'll show you how each of these factors affects the performance of the three algorithms. But let's first discuss the algorithms themselves.

Three basic memcpy() algorithms
The simplest memory-transfer algorithm just reads one byte at a time and writes that byte before reading the next. We'll call this algorithm byte-by-byte. Listing 1 shows the C code for this algorithm. As you can see, it has the advantage of implementation simplicity. Byte-by-byte, however, may not offer optimal performance, particularly if your memory bus is wider than 8 bits.

Listing 1: The byte-by-byte algorithm

void * memcpy(void * dst, void const * src, size_t len)
{
    char * pDst = (char *) dst;
    char const * pSrc = (char const *) src;

    while (len--)
    {
        *pDst++ = *pSrc++;
    }

    return (dst);
}

An algorithm that offers better performance on wider memory buses, such as the one on the evaluation board I used, can be found in GNU's newlib source code. I've posted the code here. If the source and destination pointers are both aligned on 4-byte boundaries, my modified-GNU algorithm copies 32 bits at a time rather than 8 bits. Listing 2 shows an implementation of this algorithm.

Listing 2: The modified-GNU algorithm

void * memcpy(void * dst, void const * src, size_t len)
{
    long * plDst = (long *) dst;
    long const * plSrc = (long const *) src;

    /* Are both pointers 4-byte aligned? (The test as originally
       printed, "!(src & 0xFFFFFFFC)", masks the wrong bits.) */
    if (!((long) src & 3) && !((long) dst & 3))
    {
        while (len >= 4)
        {
            *plDst++ = *plSrc++;
            len -= 4;
        }
    }

    char * pcDst = (char *) plDst;
    char const * pcSrc = (char const *) plSrc;

    while (len--)
    {
        *pcDst++ = *pcSrc++;
    }

    return (dst);
}

A variation of the modified-GNU algorithm uses computation to adjust for address misalignment. I'll call this algorithm the optimized algorithm. The optimized algorithm attempts to access memory efficiently, using 4-byte or larger reads-writes. It operates on the data internally to get the right bytes into the appropriate places.

Note that the optimized algorithm uses some XScale assembly language. The preload instruction is a hint to the ARM processor that data at a specified address may be needed soon. Processor-specific opcodes like these can help wring every bit of performance out of a critical routine. Knowing your target machine is a virtue when optimizing memcpy().
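The optimized implementation itself is only linked from the article, but the core idea can be sketched in portable C. This is my own illustration, not the author's XScale code (which also uses preload hints): align the destination, then, while the source remains misaligned, build each output word from two aligned source reads and a shift. The sketch is little-endian only and, like real implementations of this technique, may read a few bytes outside [src, src+len) within the same aligned words.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Sketch of the "optimized" idea: byte-copy until dst is 4-byte
 * aligned, then copy words; if src is still misaligned, read
 * aligned 32-bit words and shift the bytes into place.
 * Little-endian only. */
void *memcpy_shift(void *dst, const void *src, size_t len)
{
    unsigned char *d = dst;
    const unsigned char *s = src;

    /* Align the destination. */
    while (len && ((uintptr_t)d & 3)) {
        *d++ = *s++;
        len--;
    }

    if (((uintptr_t)s & 3) == 0) {
        /* Source aligned too: plain word copy. */
        while (len >= 4) {
            *(uint32_t *)d = *(const uint32_t *)s;
            d += 4; s += 4; len -= 4;
        }
    } else if (len >= 4) {
        /* Misaligned source: combine two aligned reads per word. */
        unsigned shift = ((uintptr_t)s & 3) * 8;
        const uint32_t *ws =
            (const uint32_t *)((uintptr_t)s & ~(uintptr_t)3);
        uint32_t lo = *ws++;
        while (len >= 4) {
            uint32_t hi = *ws++;
            *(uint32_t *)d = (lo >> shift) | (hi << (32 - shift));
            d += 4; s += 4; len -= 4;
            lo = hi;
        }
    }

    /* Trailing bytes. */
    while (len--)
        *d++ = *s++;
    return dst;
}
```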

Having looked at all three of the algorithms in some detail, we can begin to compare their performance under various conditions.
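Before turning to the author's measurements, it may help to show how such a comparison can be driven. Here is a minimal timing harness (my own sketch; the article does not show its actual test setup) built on the standard clock():

```c
#include <assert.h>
#include <string.h>
#include <time.h>

/* Time `reps` copies of `len` bytes through the given routine.
 * clock() is coarse, but adequate for ranking algorithms against
 * each other at fixed block sizes. */
double time_copy(void *(*copy)(void *, const void *, size_t),
                 void *dst, const void *src, size_t len, int reps)
{
    clock_t t0 = clock();
    for (int i = 0; i < reps; i++)
        copy(dst, src, len);
    return (double)(clock() - t0) / CLOCKS_PER_SEC;
}
```

For example, `time_copy(memcpy, dst, src, 4096, 100000)` can be compared against the same call with the byte-by-byte routine from Listing 1 substituted for memcpy.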

Block size
What effect does data size have on the performance of our algorithms? To keep things simple, let's assume that there's no data cache (or that it's been disabled) and that all of the source and destination addresses are aligned on 4-byte boundaries.

Note that the performance of byte-by-byte improves dramatically as the processor clock speed increases, which implies that the routine is CPU-bound. Until the reads and writes saturate the memory bus, the byte-by-byte algorithm will keep getting faster along with the processor.

Here the modified-GNU algorithm has the best performance, as it makes more efficient use of the memory bus. The optimized algorithm has comparable performance, but the effects of its additional computation take their toll in overhead.

Data alignment
What happens if the source and destination are not both aligned on a 4-byte boundary? The modified-GNU algorithm performs worse than byte-by-byte in this situation, largely because it defaults to byte-by-byte (after a fixed overhead).

The big overhead in the GNU algorithm comes from register save-restore in its prologue-epilogue. The algorithm saves off four registers, where the byte-by-byte routine saves none. So, as memory speed decreases in relation to processor speed, the GNU algorithm suffers accordingly. By the way, "optimized" memcpy saves off nine registers, which is part of the reason it becomes less compelling at high core and bus speeds. This overhead matters less when the stack is cached (probably the normal case).

The optimized algorithm handles unaligned addresses best, outperforming byte-by-byte. At slower clock speeds, the overhead of dealing with alignment is repaid by moving the memory four times as quickly. As CPU performance improves relative to the memory system, the elegance of an algorithm like optimized becomes less helpful.

Caching
Everything changes if your processor has a data cache. Let's try the same memcpy tests we've already run, but with the data already in cache.

Also, the data for unaligned memcpy shows that the GNU memcpy performance degrades to that of byte-by-byte performance when addresses are not aligned. You may see severe degradation in memcpy performance if your data is not always aligned in memory.

Write policy
A write-through cache is one that updates both the cache and the memory behind it whenever the processor writes. This sort of cache tries to satisfy reads without going to memory.

A write-back cache, on the other hand, tries to satisfy both reads and writes without going to memory. Only when the cache needs storage will it evict some of its data to memory; this is called variously a write back, a cast out, or an eviction. Write-back caches tend to use less memory bandwidth than write-through caches.

The processor I used allows the cache to be configured using either policy. What effect does this have on memcpy? It depends. With a cold cache, optimized memcpy with write-back cache works best because the cache doesn't have to write to memory and so avoids any delays on the bus.

For a garbage-filled cache, write-through caches work slightly better, because the cache doesn't need to spend extra cycles evicting irrelevant data to memory. As usual, the more you know about your system, such as the likelihood of having certain data in the cache, the better you can judge the efficacy of one cache policy over another.

Special situations
If you know all about the data you're copying as well as the environment in which memcpy runs, you may be able to create a specialized version that runs very fast. Figure 9 shows the performance gain we accrue by writing a memcpy that handles only 4KB-aligned pages when the cache is in write-through mode. This example shows that writing a very specific algorithm may double the speed of a memcpy-rich program. I've posted one of these algorithms here.
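The page-copy routine is again only linked from the article, but its flavor can be sketched (my own illustration, not the posted code): once you may assume whole, word-aligned 4KB pages, the alignment tests and tail handling disappear and the loop can be unrolled.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

enum { PAGE_WORDS = 4096 / 4 };

/* Specialized copy for whole 4KB pages whose addresses are already
 * word-aligned: no alignment tests, no tail handling, and a
 * 4x-unrolled 32-bit inner loop. */
void copy_page4k(uint32_t *dst, const uint32_t *src)
{
    for (int i = 0; i < PAGE_WORDS; i += 4) {
        dst[i]     = src[i];
        dst[i + 1] = src[i + 1];
        dst[i + 2] = src[i + 2];
        dst[i + 3] = src[i + 3];
    }
}
```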

Optimize away
Some applications spend significant processor time transferring data within memory; by choosing the optimal algorithm, you could improve overall program performance significantly. The moral of the story: know your target hardware and the characteristics of your application. Armed with this knowledge, you can easily find the optimal algorithm.

(Aside: these past few days our group has also been discussing producing ARM chips domestically. Everyone says to support home-grown chips, but something like Loongson feels stuck in the middle: not strong enough for the high end, overqualified for the low end.)

2.
http://www.boost.org/more/generic_programming.html
I expect many readers have seen this document already. A look at Boost's source shows what kind of optimization they apply to a template-based memcpy.

Now I think everyone can see why FYuan's definition is a bit faster: it lets the compiler choose how wide each copy operation should be.
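To make that point concrete, here is a small C sketch (my own illustration, not Boost's or FYuan's code) contrasting a byte loop, whose byte-at-a-time semantics the compiler must preserve unless it can prove wider accesses are safe, with a fixed-size copy whose size and alignment the compiler knows exactly and will usually widen into word or larger moves:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

struct block16 { unsigned char b[16]; };

/* Byte loop: width of the copy is fixed at one byte unless the
 * optimizer can prove a wider access is equivalent. */
void copy_bytes(unsigned char *dst, const unsigned char *src, size_t n)
{
    for (size_t i = 0; i < n; i++)
        dst[i] = src[i];
}

/* Struct assignment (or a fixed-size memcpy): the compiler knows the
 * exact size and alignment, so it is free to pick the widest moves. */
void copy_block(struct block16 *dst, const struct block16 *src)
{
    *dst = *src;   /* equivalently: memcpy(dst, src, sizeof *dst); */
}
```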

3.
但是这样也会带来安全性的一些问题,比如

为什么C语言的strcpy函数有漏洞 czy原创于03.04
from http://bbs.nsfocus.net/index.php?act=ST&f=3&t=144495

前言:
研究了几天DOS下的溢出原理,最后明白了其实原理都很简单
关键是要懂得为什么C语言的strcpy函数有漏洞,为什么对这个函
数的不正常使用会造成溢出.

Section 1: Introducing strcpy
Anyone who can read this article probably already knows that many of the problems come from this function, heh.
First, look at how the function is declared in a standard C string.h:
char *_Cdecl strcpy(char *dest, const char *src);
As for the code, see below (this is Microsoft's description of the function):
(%VC%/vc7/crt/src/intel/strcat.asm)
;***
;char *strcpy(dst, src) - copy one string over another
;Purpose:
; Copies the string src into the spot specified by
; dest; assumes enough room.
;
; Algorithm:
; char * strcpy (char * dst, char * src)
; {
; char * cp = dst;
; while( *cp++ = *src++ ); /* Copy src over dst */
; return( dst );
; }
;Entry:
; char * dst - string over which "src" is to be copied
; const char * src - string to be copied over "dst"
;
;Exit:
; The address of "dst" in EAX
;
;Uses:
; EAX, ECX
;
;Exceptions:
;**********************************************************************

I meant to strip out some of the comments, but decided they were worth keeping :)
From the above we can see the problems with this code:
1. It does not check that the two input pointers are valid.
2. It does not check that the two strings are NUL-terminated.
3. It does not check that the destination buffer is at least as large as the source string.
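The three missing checks can be made explicit in code. As a sketch (strcpy_checked is my own hypothetical helper, not a standard function), a bounded copy might look like this; note that check 2 cannot truly be performed in C, since an unterminated string is indistinguishable from a longer one:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Copy src into dst (capacity dst_size) only when every check passes.
 * Returns 0 on success, -1 on failure.  The strlen() call itself
 * assumes src IS NUL-terminated -- C cannot verify that (check 2). */
int strcpy_checked(char *dst, size_t dst_size, const char *src)
{
    if (dst == NULL || src == NULL)      /* check 1: valid pointers   */
        return -1;
    size_t need = strlen(src) + 1;       /* length incl. terminator   */
    if (need > dst_size)                 /* check 3: room in dst      */
        return -1;
    memcpy(dst, src, need);
    return 0;
}
```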

Now we know that calling this function from string.h is essentially no different from writing a program of our own like the one below, so we can simply study that instead.
Let's call it c4.exe.
main(){j();}
j()
{
    char a[]={'a','b','\0'};
    char b[1];
    char *c=a;
    char *d=b;
    while(*d++=*c++);
    printf("%s\n",b);
}


Section 2: Debugging our c4.exe
Tools used: W32Dasm, debug, tcc, tc
Step 1: compile with TC2 to produce the executable c4.exe.
Step 2: use TCC -B to generate the assembly source for this C code.
Step 3: use W32Dasm and debug to trace c4.exe statically and dynamically.

First, an analysis of the c4.asm generated by TCC:
One note up front: because this is a complete C program including main, the data segment, stack segment, and code do not start out together, but by the time our j function runs, the stack and data segments coincide. Keep this in mind.

ifndef ??version
?debug macro
endm
endif
?debug S "c4.c"

_TEXT segment byte public 'CODE'
DGROUP group _DATA,_BSS
assume cs:_TEXT,ds:DGROUP,ss:DGROUP
_TEXT ends

_DATA segment word public 'DATA'
d@ label byte
d@w label word
_DATA ends

_BSS segment word public 'BSS'
b@ label byte
b@w label word
?debug C E930A68D2E0463342E63
_BSS ends


_TEXT segment byte public 'CODE'
; ?debug L 1
_main proc near
; ?debug L 3
call near ptr _j //our j function is executed here
@1:
; ?debug L 4
ret
_main endp
_TEXT ends

_DATA segment word public 'DATA' //the source string "ab" and its trailing \0 are defined first in the data segment
db 97
db 98
db 0
_DATA ends


_TEXT segment byte public 'CODE'
; ?debug L 6
_j proc near
push bp //entry point of the j function
mov bp,sp
sub sp,6
push si
push di
push ss
lea ax,word ptr [bp-6]
push ax
push ds
mov ax,offset DGROUP:d@ //note: this loads the source string's offset in the data segment
push ax //everything above the SCOPY@ call allocates stack space for the source and destination strings
mov cx,3 //cx=3: the number of characters to copy
call far ptr SCOPY@ //calls another routine that copies the source string from the data segment into the stack
; ?debug L 10
lea si,word ptr [bp-6]
; ?debug L 11
lea di,word ptr [bp-2]
; ?debug L 12
jmp short @3
@5:
@3:
; ?debug L 12
mov bx,si
inc si
mov al,byte ptr [bx]
mov bx,di
inc di
mov byte ptr [bx],al
or al,al
jne @5
@4:
; ?debug L 13
lea ax,word ptr [bp-2]
push ax
mov ax,offset DGROUP:s@ //load printf's format-string argument
push ax
call near ptr _printf
pop cx
pop cx
@2:
; ?debug L 14
pop di
pop si
mov sp,bp
pop bp
ret
_j endp
_TEXT ends
?debug C E9

_DATA segment word public 'DATA'
s@ label byte
db 37 //%
db 115 //s
db 10 //newline :)
db 0
_DATA ends
extrn SCOPY@:far
_TEXT segment byte public 'CODE'
extrn _printf:near
_TEXT ends
public _main
public _j
end


Section 3: Analyzing the static assembly from W32Dasm, which is the program's final code, while tracking the stack step by step.
By this point the dumps probably look like a headache, so let me break them down first.
In terms of execution, the code falls into three parts:

1. From 01FE to 020B: stack space is allocated according to the definitions in the C code; the example allocates 6 bytes. You get exactly what you declare, so no problem here.
2. The far call to 0000:1395 moves the source string from the data segment into the stack; since the byte count is passed in cx, no problem here either.
3. The source string is copied to the destination string's memory within the stack. This is where the problem lies!


:0001.01FA E80100 call 01FE //call our j function
:0001.01FD C3 ret

:0001.01FE 55 push bp
:0001.01FF 8BEC mov bp, sp
:0001.0201 83EC06 sub sp, 0006
:0001.0204 56 push si
:0001.0205 57 push di
:0001.0206 16 push ss
:0001.0207 8D46FA lea ax, [bp-06]
:0001.020A 50 push ax
:0001.020B 1E push ds
:0001.020C B89401 mov ax, 0194
:0001.020F 50 push ax
:0001.0210 B90300 mov cx, 0003
:0001.0213 9A95130000 call 0000:1395 //execution jumps to 1395 first; since it lives in segment 0000, this is a far call

:0001.0218 8D76FA lea si, [bp-06]
:0001.021B 8D7EFE lea di, [bp-02]
:0001.021E EB00 jmp 0220
:0001.0220 8BDE mov bx, si
:0001.0222 46 inc si
:0001.0223 8A07 mov al , [bx]
:0001.0225 8BDF mov bx, di
:0001.0227 47 inc di
:0001.0228 8807 mov [bx], al
:0001.022A 0AC0 or al , al
:0001.022C 75F2 jne 0220
:0001.022E 8D46FE lea ax, [bp-02]
:0001.0231 50 push ax
:0001.0232 B89701 mov ax, 0197
:0001.0235 50 push ax
:0001.0236 E8BC08 call 0AF5 //print the output
:0001.0239 59 pop cx
:0001.023A 59 pop cx
:0001.023B 5F pop di
:0001.023C 5E pop si
:0001.023D 8BE5 mov sp, bp
:0001.023F 5D pop bp
:0001.0240 C3 ret
//what follows is our SCOPY@
:0001.1395 55 push bp
:0001.1396 8BEC mov bp, sp
:0001.1398 56 push si
:0001.1399 57 push di
:0001.139A 1E push ds
:0001.139B C57606 lds si, [bp+06]
:0001.139E C47E0A les di, [bp+0A]
:0001.13A1 FC cld
:0001.13A2 D1E9 shr cx, 01
:0001.13A4 F3 repz
:0001.13A5 A5 movsw
:0001.13A6 13C9 adc cx, cx
:0001.13A8 F3 repz
:0001.13A9 A4 movsb
:0001.13AA 1F pop ds
:0001.13AB 5F pop di
:0001.13AC 5E pop si
:0001.13AD 5D pop bp
:0001.13AE CA0800 retf 0008

Now we start dynamic debugging with DEBUG:
Step 1: D:\turboc2>debug c4.exe
-g 01FE (using the address found in W32Dasm, we jump straight to the entry of j)

AX=0000 BX=0566 CX=000E DX=067F SP=FFE8 BP=FFF4 SI=00D8 DI=054B
DS=13DB ES=13DB SS=13DB CS=129F IP=01FE NV UP EI PL ZR NA PE NC
129F:01FE 55 PUSH BP
-t

AX=0000 BX=0566 CX=000E DX=1193 SP=FFE6 BP=FFF4 SI=00D8 DI=054B
DS=13DB ES=13DB SS=13DB CS=129F IP=01FF NV UP EI PL ZR NA PE NC
129F:01FF 8BEC MOV BP,SP


Since the previous instruction was CALL 01FE, the return address 01FD is on the stack, and then comes another PUSH BP.
-d ss:ffe0
13DB:FFE0 FF 01 9F 12 F3 0B F4 FF-FD 01 1D 01 01 00 F2 FF ................
13DB:FFF0 54 05 F6 FF 00 00 43 35-2E 45 58 45 00 00 FB 00 T.....C5.EXE....

Now let's look at the stack.
After mov bp,sp, BP becomes FFE6.


FFE0 | ->SUB SP,0006 (six bytes reserved: stack space for the source and destination strings)
FFE1 |
FFE2 |
FFE3 |
FFE4 |
FFE5 |
FFE6 |F4 ----> current stack top, FFE6
FFE7 |FF ----> old BP
FFE8 |FD
FFE9 |01
FFEA |



Then si, di, and ss are pushed, taking SP down to FFDA.
Next:
lea ax,[bp-06]
push ax
push ds
This pushes the address of the allocated buffer onto the stack, along with DS.
Then:
mov ax,0194 (mov ax,offset DGROUP:d@) ; load the string's offset in the data segment
push ax
mov cx,03
Now it is time to run our SCOPY@:
CALL 0000:1395 ; a far call, so both CS and IP are pushed
Let's look at the stack again:

low memory addresses

FFD0 |18 ----> IP, i.e. the offset within the CS segment of lea si,[bp-06]
FFD1 |02
FFD2 |9F ----> CS, pushed first
FFD3 |12
FFD4 |94 ----> the string's offset in the data segment
FFD5 |01
FFD6 |DB ----> DS
FFD7 |13
FFD8 |E0 ----> address of the space allocated for the strings
FFD9 |FF
FFDA |DB ----> SS
FFDB |13
FFDC |DI ----> DI and SI are pushed here because they will be needed
FFDD |        shortly, when the data is moved from the data segment
FFDE |SI      into the stack, so they must be saved first
FFDF |
FFE0 |1 ->SUB SP,0006 (six bytes reserved for the source and destination strings)
FFE1 | 2
FFE2 | 3
FFE3 | 4
FFE4 | 5
FFE5 | 6
FFE6 |F4 ----> current stack top, FFE6
FFE7 |FF      old BP was FFF4; BP is now FFE6
FFE8 |FD ----> return address once j finishes
FFE9 |01
FFEA |

high memory addresses
Good: the analysis is now a third done. Next, let's see exactly how SCOPY@ moves the string
from the data segment into the stack.

push bp          ; save the previous BP (FFE6)
mov bp,sp        ; now sp = bp = FFCE
push si
push di
push ds          ; sp is now FFC8
Then:
lds si,[bp+06]   ; si = the word at ffce+06 = ffd4, which holds the string's data-segment offset, 0194
les di,[bp+0a]   ; di = the word at ffce+0a = ffd8, which holds ffe0, the stack address reserved for the string
After these two instructions, si=0194 and di=ffe0.
low memory addresses
Here is the top of the stack:

FFC8 |DB --> DS pushed   <--sp=ffc8
FFC9 |13
FFCA |DI
FFCB |
FFCC |SI
FFCD |
FFCE |E6
FFCF |FF

high memory addresses

The next seven lines are simple: move the data from the address in SI to the address in DI.
cld
shr cx,01 (CX is 3)
repz
movsw
adc cx,cx
repz
movsb

This is a fairly efficient way to move the data: MOVSW moves two bytes at a time, and MOVSB handles the final odd byte.
After these instructions, the stack holds:
FFE0  becomes  a
FFE1           b
FFE2           0

The data has been moved; now DS, DI, SI, and BP are restored:
pop ds
pop di
pop si
pop bp
After these four instructions sp=FFD0 and bp is back to its previous value, FFE6.
Finally, the return:
retf 8
This instruction deserves a close look: since this was a far call, sp increases by 4 (ffd0+4=ffd4),
and the parameter count of 8 adds another 8 (ffd4+8=ffdc).

The stack now looks like this:
SP=FFDC, BP=FFE6

FFDC |DI ----> DI and SI were pushed here because they were needed
FFDD |        when moving the data from the data segment into the
FFDE |SI      stack, so they had to be saved first
FFDF |
FFE0 |1 a ->SUB SP,0006 (six bytes reserved for the source and destination strings)
FFE1 | 2 b
FFE2 | 3 0 (note: the source string is now correctly in place on the stack!)
FFE3 | 4
FFE4 | 5
FFE5 | 6
FFE6 |F4 ----> current stack top, FFE6
FFE7 |FF      old BP was FFF4; BP is now FFE6
FFE8 |FD ----> return address once j finishes
FFE9 |01

Time for a summary:
From the above we can see that everything so far is correct.
Why were six bytes allocated?
Look again at the definitions in the C program:
char a[]={'a','b','\0'};
char b[1];
The allocation rule here is actually simple: every string's length is rounded up to an even number;
if it is odd, one is added.
In our case: source 3+1=4,
destination 1+1=2, so source + destination = 4+2 = 6.

Personally I think this point matters for getting an overflow exactly right, because some articles claim
that one extra byte is enough. Is that really true? Not necessarily; my example shows otherwise!
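As a sanity check on the rule just described (a hedged reading of TC2's behavior in this example, not a documented guarantee), the padding arithmetic can be written out:

```c
#include <assert.h>
#include <stddef.h>

/* Round a size up to the next even number, mirroring how TC2 appears
 * to pad each array in this example's stack frame. */
size_t round_even(size_t n)
{
    return (n + 1) & ~(size_t)1;
}
```

round_even(3) + round_even(1) gives the 6 bytes seen in SUB SP,0006.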

Enough talk; look at the code:
lea si,[bp-06]   ; BP=FFE6, and FFE6-06=FFE0
lea di,[bp-02]   ; likewise DI becomes FFE4
With SI and DI set, a loop copies the source string to the destination one byte at a time.
The code below is the crucial part; this is exactly where the problem is!!!

jmp 0220
0220 mov bx,si   (copy the address in SI to BX)
inc si           (advance SI by one)
mov al,[bx]      (load AL with the byte at the address in BX; the first pass fetches 'a')
mov bx,di        (copy the address in DI to BX)
inc di           (advance DI by one)
mov [bx],al      (store the character in AL at the address in BX)
or al,al
jne 0220         (loop while not zero)

------------ The stack again -------- BP=FFE6, SP=FFDC
FFDC |DI
FFDD |
FFDE |SI
FFDF |
FFE0 |a
FFE1 |b
FFE2 |0
FFE3 |
FFE4 |a (after the first pass, DI=FFE4 and FFE4 now holds 'a')
FFE5 |

From this code you can probably already see the problem: the only test for "copy finished" is whether a \0 has been reached;
the size of the source string is never compared against the size of the destination!


FFE0 |a
FFE1 |b
FFE2 |0
FFE3 |
FFE4 |a (after the first pass, DI=FFE4 and FFE4 now holds 'a')
FFE5 |b
FFE6 |F4 ----> original BP (note: in my example FFE6 becomes 0,
FFE7 |FF      but this is only the caller's saved BP, so the program does not crash yet)
FFE8 |FD ----> return address into main
FFE9 |01

Now the rest of the code:

lea ax, [bp-02]
push ax
mov ax, 0197
push ax
call 0AF5   ; print the output
pop cx
pop cx      ; the lines above print the destination string

pop di
pop si      ; restore DI and SI
mov sp, bp
pop bp
ret

hurk followed up:

Now I get it ^-^
====================vul.c======================
#include <stdio.h>
#include <string.h>
Foo(char* s)
{
    char buf[16]="";
    strcpy(buf, s);
    printf("The input String is %s\n", buf);
}
main(int argc, char* argv[])
{
    if(argc == 2)
    {
        Foo(argv[1]);
    }
    else
    {
        printf("Usage: %s <string>\n", argv[0]);
    }
}
=================end======================

Compile it in debug mode:

shanghai =>gcc vul.c -o vul -g

Then run it under gdb:

shanghai =>gdb vul
GNU gdb 5.2
Copyright 2002 Free Software Foundation, Inc.
GDB is free software, covered by the GNU General Public License, and
you are welcome to change it and/or distribute copies of it under
certain conditions. Type "show copying" to see the conditions.
There is absolutely no warranty for GDB. Type "show warranty" for
details. This GDB was configured as "i386-pc-solaris2.8"...
/*
Set a breakpoint at the program's entry point, main
*/
(gdb) b main
Breakpoint 1 at 0x8050956: file vul.c, line 12.
/*
I pass a string of ten 'A' characters as the input argument to vul.
*/
(gdb) r AAAAAAAAAA
Starting program: /export/home/moda/buf_of/vul AAAAAAAAAA
Breakpoint 1, main () at vul.c:12
12 if(argc == 2)
/*
The program breaks at main; we then single-step (s) down to the call to Foo
*/
(gdb) s
14 Foo(argv[1]);
/*
The program is about to enter Foo; let's look at the assembly instructions it will execute.
*/
(gdb) x/10i $eip
0x805095c : add $0xfffffff4,%esp
0x805095f : mov 0xc(%ebp),%eax
0x8050962 : add $0x4,%eax
0x8050967 : mov (%eax),%edx
0x8050969 : push %edx
0x805096a : call 0x8050904
0x805096f : add $0x10,%esp
0x8050972 : jmp 0x805098a
0x8050974 : add $0xfffffff8,%esp
0x8050977 : mov 0xc(%ebp),%eax
/*
Foo is called by main at address 0x805096a: "call 0x8050904".
0x805096f is the return address after the call to Foo. Remember this address; it comes up again below.

Next, we execute one instruction at a time with si (Step
Instruction)
*/
(gdb) si
0x0805095f 14 Foo(argv[1]);
(gdb) si
0x08050962 14 Foo(argv[1]);
(gdb) si
0x08050967 14 Foo(argv[1]);
(gdb) si
0x08050969 14 Foo(argv[1]);
(gdb) si
0x0805096a 14 Foo(argv[1]);
/*
Here the program is about to enter Foo; let's look at the current register contents:
*/
(gdb) i reg
eax 0x8047bf8 134511608
ecx 0x0 0
edx 0x8047d0d 134511885
ebx 0xdfbfb000 -541085696
esp 0x8047bb4 0x8047bb4
ebp 0x8047bcc 0x8047bcc
esi 0x8047bb0 134511536
edi 0x8047c74 134511732
eip 0x805096a 0x805096a
eflags 0x202 514
cs 0x17 23
ss 0x1f 31
ds 0x1f 31
es 0x1f 31
fs 0x0 0
gs 0x0 0
fctrl 0x137f 4991
fstat 0x0 0
ftag 0xffff 65535
fiseg 0x0 0
fioff 0x0 0
foseg 0x0 0
fooff 0x0 0
---Type <return> to continue, or q <return> to quit---q
Quit
/*
Register ESP holds the current top-of-stack pointer, 0x8047bb4,
while EBP holds main's stack frame base, 0x8047bcc. That 0x8047bcc is the second address to remember.

Continue stepping
*/
(gdb) si
Foo (s=0x2
) at vul.c:4
4 {
(gdb) x/5i $eip
0x8050904 : push %ebp
0x8050905 : mov %esp,%ebp
0x8050907 : sub $0x18,%esp
0x805090a : mov 0x8050a28,%al
0x805090f : mov %al,0xfffffff0(%ebp)
(gdb) s
Foo (s=0x8047d0d "AAAAAAAAAA") at vul.c:5
5 char buf[16]="";
(gdb) s
6 strcpy(buf, s);
(gdb) s
7 printf("The input String is %s\n", buf);
/*
The program has entered Foo and copied the input string into the buffer buf. Let's look at the current contents of the stack:
*/
(gdb) x/20x $esp
0x8047b94: 0xdfb35d90 0x00000210 0x41414141
0x41414141
0x8047ba4: 0x00004141 0x00000000 0x08047bcc
0x0805096f
0x8047bb4: 0x08047d0d 0xdfb3d3a7 0xdfbf137f
0x08047bb0
0x8047bc4: 0xdfbfb000 0xdfbf137f 0x08047be8
0x0805081b
0x8047bd4: 0x00000002 0x08047bf4 0x08047c00
0x08050a10
(remaining output omitted...)

The region from 0x8047b9c to 0x8047bab is the buffer the system allocated for buf, exactly 16 bytes, one for each of buf's 16 characters. The first 10 bytes are already filled with 41, the ASCII code of the 10 freshly copied 'A' characters. Immediately after these 16 bytes comes 0x08047bcc, the stack frame base of the calling function main that I asked you to remember; after that is the other address to remember, Foo's return address, 0x0805096f.

From the analysis above we can draw a picture of how the buffer sits in memory:

|<--buf: 16 Byte-->|<--Calling Function $EBP: 4 Byte-->|<--Called
Function RetAddr: 4 Byte -->|

The buf buffer is allocated on the stack, packed tightly against critical control data: the calling function's stack frame base and the called function's return address. That adjacency is the root DNA of the disaster.

Inside Foo, buf is strcpy's destination and s is the source string. strcpy faithfully copies the entire source string s into the memory that starts at buf and runs onward
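The straightforward repair for vul.c is to bound the copy by the size of buf. A minimal sketch (copy_trunc and Foo_fixed are my own names; truncating is one policy, rejecting over-long input is another):

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Bounded copy: writes at most dstsz bytes and always NUL-terminates,
 * so an oversized src is truncated rather than overwriting the saved
 * EBP and return address. */
void copy_trunc(char *dst, size_t dstsz, const char *src)
{
    strncpy(dst, src, dstsz - 1);
    dst[dstsz - 1] = '\0';
}

/* Fixed version of Foo from vul.c. */
void Foo_fixed(const char *s)
{
    char buf[16];
    copy_trunc(buf, sizeof buf, s);
    printf("The input String is %s\n", buf);
}
```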


