EvilZone


: Translating bytes into assembly
: seci May 23, 2011, 12:44:25 AM
This continues the discussion at http://evilzone.org/projects-and-discussion/calculating-entrypoint-and-mapping-it-to-byte-array/ which I will be updating soon with the information I have learned. I have now managed to read all of the PE headers, calculate the entrypoint and find the bytes at the entrypoint.

Now I want to translate these bytes into something meaningful! So, how on earth do I do this? If someone could point me to tutorials, guides or books (not too big :(), I would be very grateful!

I have a rough understanding of it: an instruction can be anything from ~2 bytes to like 14 or 17 or something, and the length of the instruction is determined by its early bytes. But there are no good tables or descriptions of how to actually do this. I tried looking at the Intel opcode instruction reference PDF, but it's like 1500 pages of nonsense :(
: Re: Translating bytes into assembly
: Tsar May 23, 2011, 03:40:13 AM
I have a rough understanding of it: an instruction can be anything from ~2 bytes to like 14 or 17 or something, and the length of the instruction is determined by its early bytes. But there are no good tables or descriptions of how to actually do this. I tried looking at the Intel opcode instruction reference PDF, but it's like 1500 pages of nonsense :(

You are going to need some kind of list of all the instructions and their respective binary/hex encodings. ca0s made an "ASM to Hex" program that you could use if you don't have the hex/binary.

As for the part I underlined, this sounds like you will need something similar to Huffman's algorithm.
http://en.wikipedia.org/wiki/Huffman_coding
http://www.cs.duke.edu/csed/poop/huff/info/

Essentially you follow a tree, one byte at a time, to see whether the current byte is followed by anything. Each node could hold the instruction associated with that opcode, and you know you have a complete instruction when you reach a node with no children.

I'm not going to lie, it will have to be pretty complex, but I think it is doable for sure with enough effort.
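
The tree idea above can be sketched roughly as follows. This is only an illustration, not a real disassembler: the names `build_trie` and `lookup` are made up for this example, and the table holds just a handful of operand-less x86 opcodes. Each node is a dict keyed by byte value, and a node carrying a mnemonic marks a complete opcode.

```python
# Tiny, hypothetical opcode table: byte sequence -> mnemonic.
# Only a few x86 instructions that take no operands are listed.
OPCODES = {
    b"\x90": "nop",
    b"\xc3": "ret",
    b"\xcc": "int3",
    b"\x0f\xa2": "cpuid",
    b"\x0f\x31": "rdtsc",
}


def build_trie(table):
    """Build a nested-dict tree with one level per opcode byte."""
    root = {}
    for opcode_bytes, mnemonic in table.items():
        node = root
        for b in opcode_bytes:
            node = node.setdefault(b, {})
        node["mnemonic"] = mnemonic  # marks a complete opcode
    return root


def lookup(trie, data, offset=0):
    """Walk the tree from data[offset]; return (mnemonic, bytes consumed).

    Returns (None, 0) when the bytes do not match any known opcode.
    """
    node = trie
    pos = offset
    while pos < len(data) and data[pos] in node:
        node = node[data[pos]]
        pos += 1
        if "mnemonic" in node:
            return node["mnemonic"], pos - offset
    return None, 0


trie = build_trie(OPCODES)
print(lookup(trie, b"\x0f\xa2"))  # two-byte opcode: follows the 0x0F branch
print(lookup(trie, b"\x90"))      # one-byte opcode: stops at the first node
```

Note that this only recognises the opcode bytes themselves. Operand bytes that follow an opcode would still have to be handled separately.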
: Re: Translating bytes into assembly
: iMorg May 23, 2011, 03:52:17 AM
Create something of a "reverse" assembler.

Take large chunks of the code (not the whole file at once) and decode the opcodes into commands based on the target instruction set.

Tokenization could make the process go faster if you tokenize the opcodes before decoding them.
: Re: Translating bytes into assembly
: ca0s May 23, 2011, 06:12:27 PM
Create something of a "reverse" assembler.

Disassembler? :P
You are going to need a DB in your program containing every possible opcode and its hex value, and also how many bytes each one takes as instruction arguments. I thought about doing something like that some time ago, but it is hard work.
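
A minimal sketch of such an opcode DB, assuming 32-bit x86 code and covering only a few one-byte opcodes whose argument is a fixed-size immediate. The names `OPCODE_DB` and `disassemble` are invented for this example. Each entry maps an opcode to an output template and the number of argument bytes that follow it.

```python
# Hypothetical opcode "DB": opcode byte -> (output template, argument bytes).
# Assumes 32-bit code; only a few fixed-size instructions are listed.
OPCODE_DB = {
    0x90: ("nop", 0),
    0xC3: ("ret", 0),
    0xCC: ("int3", 0),
    0xCD: ("int 0x{:x}", 1),       # int imm8
    0x68: ("push 0x{:x}", 4),      # push imm32
    0xB8: ("mov eax, 0x{:x}", 4),  # mov eax, imm32
    0xB9: ("mov ecx, 0x{:x}", 4),  # mov ecx, imm32
}


def disassemble(code):
    """Translate bytes into text lines using OPCODE_DB.

    Unknown or truncated instructions are emitted as a raw 'db' byte.
    """
    lines = []
    pos = 0
    while pos < len(code):
        entry = OPCODE_DB.get(code[pos])
        if entry is None:
            lines.append(f"db 0x{code[pos]:02x}")
            pos += 1
            continue
        template, arg_len = entry
        arg = code[pos + 1 : pos + 1 + arg_len]
        if len(arg) < arg_len:  # ran off the end of the buffer
            lines.append(f"db 0x{code[pos]:02x}")
            pos += 1
            continue
        # x86 immediates are stored little-endian.
        value = int.from_bytes(arg, "little")
        lines.append(template.format(value))
        pos += 1 + arg_len
    return lines


for line in disassemble(bytes.fromhex("b8010000006878563412cd80c3")):
    print(line)
```

A flat table like this stops working as soon as an instruction takes a register or memory operand, since the number of bytes that follow then depends on more than the opcode.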
: Re: Translating bytes into assembly
: Huntondoom May 23, 2011, 08:37:12 PM
If the instructions are 2 to 14 bytes long,
then try taking
the total number of bytes - (the number of bytes that are used for setting the length)
and dividing it by each length from 2 to 14 to see which gives a whole number.
Since the length of those instructions is set to a certain number,
the total number of bytes should be a multiple of that number.
But you have to exclude the first bytes from the total, since they set the whole length.
At least this is what I think.
I have no further knowledge of this stuff :P
: Re: Translating bytes into assembly
: seci May 23, 2011, 08:55:03 PM
If the instructions are 2 to 14 bytes long,
then try taking
the total number of bytes - (the number of bytes that are used for setting the length)
and dividing it by each length from 2 to 14 to see which gives a whole number.
Since the length of those instructions is set to a certain number,
the total number of bytes should be a multiple of that number.
But you have to exclude the first bytes from the total, since they set the whole length.
At least this is what I think.
I have no further knowledge of this stuff :P

If only it were that simple :(

Basically, the length of an instruction is not stored anywhere you can simply read it. You have to work it out from the byte values.
First there is an instruction prefix that can be anything from 0 to 4 bytes (optional); this prefix modifies the function of the instruction. Then comes the opcode itself, which is 1 or 2 bytes (required; a few newer instructions use 3). Then there is another 1 byte (optional, the ModR/M byte) that modifies the way the instruction works, mainly by selecting its operands. Then another 1 byte (optional, the SIB byte) that modifies the instruction further. Then there is something called displacement (optional), which I am not 100% sure about; I am reading up on it right now. Last come the immediate data bytes, 1 to 4 bytes (optional).

In total there is a minimum of 1 byte for each instruction. The fields above add up to 16 bytes, but the processor rejects any instruction longer than 15 bytes. More on this later.
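
The layout described above can be turned into a rough length decoder. The sketch below makes several assumptions: 32-bit code with 32-bit addressing, only a handful of opcodes in the tables, and no support for the 0x66/0x67 size-override prefixes (they change operand and address sizes, so they are left out rather than handled wrongly). The names `instruction_length`, `ONE_BYTE` and `TWO_BYTE` are invented for this example.

```python
# Prefixes that do not change operand or address size:
# lock, repne, rep and the six segment overrides.
PREFIXES = {0xF0, 0xF2, 0xF3, 0x2E, 0x36, 0x3E, 0x26, 0x64, 0x65}

# opcode -> (has ModR/M byte, immediate size in bytes); a tiny sample only.
ONE_BYTE = {
    0x55: (False, 0),  # push ebp
    0x90: (False, 0),  # nop
    0xC3: (False, 0),  # ret
    0x68: (False, 4),  # push imm32
    0xB8: (False, 4),  # mov eax, imm32
    0xE8: (False, 4),  # call rel32
    0x89: (True, 0),   # mov r/m32, r32
    0x8B: (True, 0),   # mov r32, r/m32
    0x83: (True, 1),   # add/or/.../cmp r/m32, imm8
    0xC7: (True, 4),   # mov r/m32, imm32
}
TWO_BYTE = {           # second byte after the 0x0F escape
    0xA2: (False, 0),  # cpuid
    0xAF: (True, 0),   # imul r32, r/m32
    0x84: (False, 4),  # jz rel32
}


def instruction_length(code, offset=0):
    """Return the length in bytes of the instruction at code[offset].

    Raises KeyError for opcodes missing from the sample tables and
    IndexError if the buffer ends in the middle of an instruction.
    """
    pos = offset
    while code[pos] in PREFIXES:          # optional prefix bytes
        pos += 1
    opcode = code[pos]                    # required opcode, 1 or 2 bytes
    pos += 1
    if opcode == 0x0F:
        has_modrm, imm_size = TWO_BYTE[code[pos]]
        pos += 1
    else:
        has_modrm, imm_size = ONE_BYTE[opcode]
    if has_modrm:                         # optional ModR/M byte
        modrm = code[pos]
        pos += 1
        mod, rm = modrm >> 6, modrm & 7
        if mod != 3 and rm == 4:          # optional SIB byte
            sib = code[pos]
            pos += 1
            if mod == 0 and (sib & 7) == 5:
                pos += 4                  # SIB with no base: disp32
        if mod == 1:
            pos += 1                      # disp8
        elif mod == 2:
            pos += 4                      # disp32
        elif mod == 0 and rm == 5:
            pos += 4                      # absolute address: disp32
    pos += imm_size                       # optional immediate data
    return pos - offset


# Walk a small function: push ebp / mov ebp, esp / sub esp, 0x10 / ret
code = bytes.fromhex("5589e583ec10c3")
pos = 0
while pos < len(code):
    n = instruction_length(code, pos)
    print(code[pos : pos + n].hex())
    pos += n
```

The displacement is the constant offset that belongs to a memory operand, such as the 8 in `[ebp+8]`. Whether it is present, and how large it is, follows from the ModR/M and SIB bytes rather than from the opcode.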


I have been reading tons and tons of research over the past few days and have learned a lot. I may write a rather large guide/tutorial on PE/COFF, reading headers and reversing PEs from the bottom up, and use it to wrap up the two topics I have on this so far.
: Re: Translating bytes into assembly
: Tsar May 23, 2011, 10:36:31 PM
Nothing to add that I haven't already, but I thought I would mention that last night I had dreams about trying to solve this problem for hours. lol
: Re: Translating bytes into assembly
: iMorg May 24, 2011, 01:28:42 AM

Disassembler? :P
You are going to need a DB in your program containing every possible opcode and its hex value, and also how many bytes each one takes as instruction arguments. I thought about doing something like that some time ago, but it is hard work.

Wow. I must have been really tired when posting that lmao.

It's similar to writing an emulator (not sure if you have or not): you already have the file loaded, now you just need to decode the opcodes into their correct instructions. Check out QEMU; it's open source and has some sophisticated instruction decoding code in it.