Author Topic: Visual Binary Analysis - [PYTHON] (Read 3366 times)

HTH · « **on:** March 06, 2015, 04:04:21 am »

BinDyn.py - A visual Binary Analysis Script, hacked together by yours truly.

-Finds most common byte and also counts padding (0x00 and 0xff)
-Creates a histogram and prints average byte value
-Creates a Digraph so you can view the implicit relationship between neighboring bytes
-Creates a byteplot
-Creates a self similarity plot
-Can scan the file for signatures. I will NOT reduce the false positive rate because I want to be able to scan large files that contain resources and ID them, at least half-assed reliably; scanning just the header won't do that.
-Lets you dump any strings found in it
-Lets you set start and end points, or rather start point and size, useful for homing in on things

-Creates two maps of the file's content, one highlighting printable vs non-printable text
-The other showing changes in entropy, very useful for finding enc keys/certs, malware stubs, whatevah.
-That's all for now folks, this IS version 0.2 after all

--I'd like to hear any ideas you have :p I'm currently toying with n-gram analysis as an idea.
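
For anyone who wants to play with the n-gram idea, here is a minimal sketch of counting overlapping byte n-grams. It is written for Python 3 and the standard library only, and `ngram_counts` is a made-up name for illustration, not something BinDyn.py contains.

```python
from collections import Counter

def ngram_counts(data, n=2):
    """Count every overlapping n-byte sequence in a bytes object."""
    if n < 1:
        raise ValueError("n must be at least 1")
    # Slicing a bytes object yields bytes, which are hashable Counter keys
    return Counter(data[i:i + n] for i in range(len(data) - n + 1))

if __name__ == "__main__":
    sample = b"\x00\x00\x00ABAB\x00\x00"
    # Show the three most frequent byte pairs as hex
    for gram, count in ngram_counts(sample, 2).most_common(3):
        print(gram.hex(), count)
```

With n=2 the counts are the same data the digraph plots as pixels, so larger n is where this would start to add something new.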

Code: [Select]

'''
Visual Binary Profiler By HTH
This is currently a WIP, that means certain things suck. 
I'm aware of this.
Don't mention it.
Please.

Credit to Schalla for the Idea
Credit to Those Awesome guys on Youtube for the inspiration
Credit to some rando who wrote the hilbert curve algorithm in C, which I converted for my own use.

Released under the 
"Use it for whatever you want, dont even bother to credit me really,
 but dont bitch when it breaks (or isnt up to your standards)" License.
 
 It's a mouthful.

Oh yeah and for now, because of that previously mentioned 0 fucks given, the maps generated by the hilbert curves aren't perfect.
They generate a bunch of blank space as well because the curve is improperly sized: it expects n to be a power of 2, and to
 remove that blank space I'd need to force it to be one... which could cut off a LOT of data.
'''

from PIL import Image, ImageDraw
import sys
import math
from optparse import OptionParser

'''
These two functions are what I was talking about, I made my own space filling curve but it was uhhh, questionable at best.
I decided to leave the math to the math guys ;)
'''
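def hilbertify(n, d):
    # Map distance d along the curve to (x, y); n should be a power of 2
    t = d
    x = y = 0
    s = 1
    while (s < n):
        rx = 1 & (t // 2)   # floor division so t stays an integer
        ry = 1 & (t ^ rx)
        x, y = rot(s, x, y, rx, ry)
        x += s * rx
        y += s * ry
        t //= 4
        s *= 2
    return x, y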

def rot(n, x, y, rx, ry):
    if ry == 0:
        if rx == 1:
            x = n - 1 - x
            y = n - 1 - y
        return y, x
    return x, y

'''
Calculates the entropy of a given bunch of text (dose technical terms though..)  
'''
def entropy(text):
    # Shannon entropy in bits per symbol, 0 for empty input
    log2=lambda x:math.log(x)/math.log(2)
    exr={}
    infoc=0
    for each in text:
        exr[each] = exr.get(each, 0) + 1
    textlen=len(text)
    for k,v in exr.items():
        freq  =  1.0*v/textlen
        infoc+=freq*log2(freq)
    infoc*=-1
    return infoc
 
'''
This bad boy plots self similarity but it also creates a 100 mb image if you hand it 10 kb, have a care.
'''
def selfsim(fileobject):
    fileobject.seek(0)
    digraphImage = Image.new('RGBA', (size,size), "black")
    drawingObject = ImageDraw.Draw(digraphImage)
    for x in range (0,size):
        fileobject.seek(x)
        xval = fileobject.read(1)
        fileobject.seek(0)
        for y in range(size):
            yval = fileobject.read(1)
            if yval == xval:
                drawingObject.point((x,y),(255,0,0))
    digraphImage.show()
    
'''
Uses a user defined window size to calculate the entropy of the text in a sliding window
'''
def entropyMap(fileobject, enlimit, enrange):
    fileobject.seek(0)
    entropyImage = Image.new('RGBA', (dimension,dimension), "black")
    entropyDrawing = ImageDraw.Draw(entropyImage)
    for window in range(0,size-enrange):
        fileobject.seek(window)
        ent = entropy(fileobject.read(enrange))
        if ent < enlimit:
            entropyDrawing.point(hilbertify(dimension,window),(0,0,int(ent * 20)))
        else:
            entropyDrawing.point(hilbertify(dimension,window),(255,0,0))
    entropyImage.show()
    
'''
Digraph, creates patterns based upon implicit relationships between neighbouring bytes
Theres lots of patterns if you play around with it.
'''
def digraph(fileobject, brightness):
    fileobject.seek(0)
    digraphImage = Image.new('RGBA', (256,256), "black")
    drawingObject = ImageDraw.Draw(digraphImage)
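    x = fileobject.read(1)
    if x == '': return   # empty file, nothing to plot
    x = ord(x)
    y = fileobject.read(1)
    while y != '':
        y = ord(y)
        r,g,b,a = digraphImage.getpixel((x,y))
        # Cap the red channel at 255 so repeated pairs cannot overflow it
        drawingObject.point((x,y),(min(r+brightness,255),0,0))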
        x = y
        y = fileobject.read(1)
    digraphImage.show()

'''
Byte Plot, useful for locating things like bitmaps
'''
def bytePlot(fileobject):
    fileobject.seek(0)
    digraphImage = Image.new('RGBA', (dimension,dimension), "black")
    drawingObject = ImageDraw.Draw(digraphImage)
    x = fileobject.read(1)
    pos = 0   # the first byte belongs at pixel (0,0)
    while x != '':
        x = ord(x)
        drawingObject.point((pos%dimension,pos//dimension),(x,0,0))
        x = fileobject.read(1)
        pos += 1
    digraphImage.show()
    
'''
Map of printable characters vs chars that are not, be they higher or lower
Had an encryption key show up clear as day here when it wouldn't (easily) in the entropy map
Also cool for locating strings I guess
'''    
def printableMap(fileobject):
    printableImage = Image.new('RGBA', (dimension,dimension), "black")
    printableObject = ImageDraw.Draw(printableImage)
    fileobject.seek(0)
    pos = 0
    char = fileobject.read(1)
    while char != '':
        char = ord(char)
        if char == 0: color = (0,0,0,255)
        elif char == 255: color = (255,255,255,255)
        elif char > 32 and char < 127: color = (255,0,0,255)
        elif char <= 32: color = (0,255,0,255)
        else: color = (0,0,255,255)
        printableObject.point(hilbertify(dimension,pos),color)
        char = fileobject.read(1)
        pos += 1
    printableImage.show()
    
'''
This module scans through in 1000 byte chunks and looks for signatures based upon the list in the first line.
It is not 100% inclusive, this is one of those "I give you the framework, you don't suck at it" things.
Also yes I know there is a slim chance that the 1000 byte reads will cut a signature in half,
this is how I decided to implement it. I know elsewhere in the script I read over the file like file IO 
isn't a bottleneck, but this will grow to ~50 signatures.... at least.

I ain't reading a file 50 times just cause there's an off chance I'll cut a signature in half. 
(not when chances are most will be in the first 1000 anyway) :p

This will generate a FUCK TON of false (or just excessive) positives, which you will see through if you're paying attention ;)
f.ex a jar file I tested contained ~29 Java Bytecode signatures and about 20 images
Obviously this means it's a moderately complex program (29 classes at least, with some gui elements)
'''
def signatureScan(file):
    file.seek(0)
    signatureList = [("DOS mode","PE"),
    ("ELF","ELF"),("ustar.00","TAR"),
    ("fLaC","Flac"),("BM","Bitmap"),
    ("\xD0\xCF\x11\xE0\xA1\xB1\x1A\xE1","DOC"),
    ("%PDF","PDF"),("\xFF\xFE\x00\x00","32 Bit UTF encoded Text"),
    ("\x52\x61\x72\x21\x1A\x07","RAR"),("BZh","Bzip"),
    ("\x50\x4B\x03\x04\x14\x00\x08\x00\x08\x00","Java ByteCode"),("\xFF\xD8","Jpeg") 
    ]
    while True:
        data = file.read(1000)
        if not data: break
        for signature, indicator in signatureList:
            if data.find(signature) != -1:
                # repr() keeps binary signatures from dumping raw bytes into the terminal
                print "Possibly a " + indicator + " file, signature " + repr(signature) + " found"
                
''' 
Dis here runs over the feed and finds the most common byte among other things.
'''
def valueMap(fileobject):
    fileobject.seek(0)
    values = [0] * 256
    sumof = 0
    x = fileobject.read(1)
    while x != '':
        x = ord(x)
        sumof += x
        values[x] += 1
        x = fileobject.read(1)
    # Ignore the padding bytes 0x00 and 0xff when picking the most common value
    vmax = max(values[1:255])
    print "This File contains "  + str(values[0]) + " 0x00 bytes out of " + str(size)
    print "This File contains "  + str(values[255]) + " 0xff bytes out of " + str(size)
    print "The Average byte value is : " + str(float(sumof)/max(size,1))
    # Start the search at 1 so a matching 0x00 count is not reported instead
    print "Most Common Value is : " + hex(values.index(vmax, 1))
    valueImage = Image.new('RGBA', (256,256), "black")
    drawingObject = ImageDraw.Draw(valueImage)
    for x in range(0,256):
        # Clamp the bar height so padding counts above vmax stay inside the image
        for y in range(0,min(255, values[x]*255//max(vmax,1))):
            drawingObject.point((x,255-y),(255,0,0))
    valueImage.show()
    
'''
This function is pretty self explanatory
'''
def stringDump(fileobject):
    fileobject.seek(0)
    temp = fileobject.read(1)
    string = ''
    while temp != '':
        if ord(temp) > 31 and ord(temp) < 127:
            string += temp
        else:
            # Any non printable byte (newline, NUL, ...) ends the current run
            if len(string) > 3:
                print string
            string = ''
        temp = fileobject.read(1)
    # Flush a run that reaches the end of the file
    if len(string) > 3:
        print string

'''
 Yes I chose to make a temp file rather than fuck around with dynamically calculating things, at least for now
'''
def fileTrim(fileobject,startpoint,endpoint):
    # optparse hands these over as strings unless they are the int defaults
    startpoint = int(startpoint)
    endpoint = int(endpoint)
    file2 = open("bindyn.tmp","wb+")
    fileobject.seek(startpoint)
    if (endpoint == 0):
        # No size given: copy everything from startpoint to the end of the file
        file2.write(fileobject.read())
    else:
        # endpoint is a byte count, not an absolute offset
        file2.write(fileobject.read(endpoint))
    return file2
    
'''
And now we're at the beautiful main function situation
Don't suck and the user input parsing won't break on you.
'''
if __name__ == "__main__":
    parser = OptionParser()
    parser.add_option("-f", "--file", dest="filename",help="specify the FILE", metavar="FILE")
    parser.add_option("--stats","-s", action="store_true", dest="stats", default=False, help="calculate statistics of byte frequency")
    parser.add_option("--digraph","-d", dest="brightness", help="create a digraph of the given file", metavar="BRIGHTNESS")
    parser.add_option("--byteplot","-b", action="store_true", dest="byteplot", default=False, help="create a byteplot of the given file")
    parser.add_option("--selfsim","-Z", action="store_true", dest="selfsim", default=False, help="create a self similarity plot of the given file, this image is the size of your file squared")
    parser.add_option("--strings", action="store_true", dest="strings", default=False, help="output all strings in the file, I suggest defining start and end points")
    parser.add_option("--signatures","-F", action="store_true", dest="signatures", default=False, help="scan for file signatures, super rudimentary, meant to be expanded by user.")    
    parser.add_option("--printable","-p", action="store_true", dest="printable", default=False, help="create a map of the printable characters")    
    parser.add_option("--entropy","-e" , action="store", dest="minEntropy",  help="make a map of the entropy of the file", metavar="ENTROPYLIMIT")
    parser.add_option("--sample","-S" , action="store", dest="sampleSize", default = 100,  help="change the default sample size", metavar="SAMPLE_SIZE")
    parser.add_option("--startpoint" , action="store", dest="startPoint", default = 0, help="set the byte to start at", metavar="STARTP")
    parser.add_option("--endpoint" , action="store", dest="endPoint", default = 0, help="set amount of bytes to read", metavar="ENDP")
    (options, args) = parser.parse_args()
    
    '''Create or open the file to be worked with'''
    if not options.filename:
        parser.error("no input file given, use -f FILE")
    file = open(options.filename, 'rb')
    if options.startPoint != 0 or options.endPoint != 0:
        file = fileTrim(file,options.startPoint,options.endPoint)
    file.seek(0,2)
    size = file.tell()
    dimension = int(math.sqrt(size))

    '''The Easy part'''
    if options.printable:
        printableMap(file)
    
    if options.strings:
        stringDump(file)
    
    if options.stats:
        valueMap(file)
    
    if options.byteplot:
        bytePlot(file)

    if options.selfsim:
        selfsim(file)
    
    if options.brightness:
        digraph(file, int(options.brightness))
    
    if options.minEntropy:
        entropyMap(file, float(options.minEntropy), int(options.sampleSize))
        
    if options.signatures:
        signatureScan(file)
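
On the chunk-boundary problem mentioned in the signatureScan comment: the file can still be read only once if the tail of each chunk is carried into the next read. The sketch below is one way to do that, not the script's actual method. It is written for Python 3 (bytes signatures), and `scan_signatures` plus its parameters are hypothetical names.

```python
import io

def scan_signatures(fileobj, signatures, chunk_size=1000):
    """Yield (offset, label) for every signature hit, reading the file once.

    The last (longest signature - 1) bytes of each chunk are carried over,
    so a signature split across two reads is still found.
    """
    overlap = max((len(sig) for sig, _ in signatures), default=1) - 1
    tail = b""
    base = 0  # file offset of tail[0]
    while True:
        chunk = fileobj.read(chunk_size)
        if not chunk:
            break
        data = tail + chunk
        for sig, label in signatures:
            pos = data.find(sig)
            while pos != -1:
                # Hits lying entirely inside the carried tail were reported last round
                if pos + len(sig) > len(tail):
                    yield base + pos, label
                pos = data.find(sig, pos + 1)
        keep = min(overlap, len(data))
        base += len(data) - keep
        tail = data[len(data) - keep:]

if __name__ == "__main__":
    sigs = [(b"%PDF", "PDF"), (b"\xff\xd8", "Jpeg")]
    # "%PDF" straddles the boundary between the first and second 1000 byte read
    blob = b"A" * 998 + b"%PDF" + b"B" * 10 + b"\xff\xd8"
    for offset, label in scan_signatures(io.BytesIO(blob), sigs):
        print(offset, label)
```

The extra cost is a few bytes of copying per chunk, and each hit is reported exactly once along with its file offset.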

Deque · « **Reply #1 on:** March 06, 2015, 08:33:40 am »

Nice project, HTH.
This is similar to http://binvis.io
Did you get your idea from that project?
There is also senseye: https://github.com/letoram/senseye

Take care that you do something that the others don't offer.

HTH · « **Reply #2 on:** March 06, 2015, 08:36:07 am »

I actually got my idea from Schalla

(well, a talk from a while ago he showed me)

Just waiting for him to come point that out

Do you have a chance to get on IRC any time soon :p (would like malware samples and to talk with you)

Deque · « **Reply #3 on:** March 06, 2015, 08:51:56 am »

Not soon, I can use IRC when I am home again in about 11 hours.
Edit: Getting malware samples is no problem. I have access to all samples on VT.

HTH · « **Reply #4 on:** March 06, 2015, 11:09:05 pm »

It might be abusing the system but I deleted my previous post so I could bump my topic

Anywho Deque sorry I wasn't on IRC when you were, I'm sure I'll catch you eventually.

Schalla · « **Reply #5 on:** March 07, 2015, 09:47:28 pm »

Yup Yup Yup....

And yeah Deque, I got the idea after reading a few blog posts from corte.si and after a DerbyCon talk about visualizing binaries. I also linked those to HTH.

HTH · « **Reply #6 on:** March 08, 2015, 12:15:23 am »

I'll see your GUI and raise you one that allows scrolling and altering the points at the file where you are working

Getting closer to those videos and even have some features they didn't now

GUI is a mockup only at this point, Schalla knew this but making sure everyone else does

Schalla · « **Reply #7 on:** March 08, 2015, 01:12:20 am »

I stopped working on the project quite some time ago, but the GUI looks good.

HTH · « **Reply #8 on:** March 12, 2015, 08:01:09 am »

Just a little update for my ~~people forced to listen to me~~ fans. The GUI is working splendidly, polyphony rated it a 10/10 with this to say:

<Polyphony> Most intuitive piece of software ever coded, absolutely no flaws or catch 22s

The code looks like ratshit though, and there are a few more features I want to include before I release it. I mean, the code will still look like ratshit given this has been coded mostly from 2 am - 5 am after homework/irl projects have been finished, but I want it to at least be POLISHED ratshit before I post it :p

Polyphony · « **Reply #9 on:** March 12, 2015, 08:08:29 am »

Hi, my name is Polyphony, and I approve this massage...

Deque · « **Reply #10 on:** March 12, 2015, 05:44:18 pm »

It looks really good and I would definitely use it in my daily work. The more I read and see videos about this, the more I think it really makes sense to visualize binary data.
Since I saw this talk: https://www.youtube.com/watch?v=T3qqeP4TdPA
I started to experiment with their tool binvis (which is a bit non-intuitive and too buggy for daily purposes, it crashes too often) and I recognise datatypes in binaries quickly with just a little bit of exercise. I deeply want that tool, but better.

So, please, go on, HTH.

Deque · « **Reply #11 on:** March 12, 2015, 05:58:09 pm »

Besides, I found that coloring scheme to be very useful:

Code: (Java) [Select]

if (byteVal == 0)
	return Color.black;
if (byteVal == 0xff)
	return Color.white;

if (byteVal > 0 && byteVal <= 127) { // ASCII
	float hue = blueHue;
	float saturation = 1;
	if (byteVal < 33 || byteVal == 127) { // non-visible ASCII
		hue = greenHue;
	}
	float brightness = (float) (byteVal / (float) 127);
	return Color.getHSBColor(hue, saturation, brightness);
} else { // non-ASCII
	float saturation = 1;
	float hue = yellowHue;
	float brightness = (float) ((byteVal - 127) / (float) (255 - 127));
	return Color.getHSBColor(hue, saturation, brightness);
}

Or to explain it in words: The brightness information is kept and determined by the byte value, but additionally I color-coded the bytes by visible ASCII, non-visible ASCII, non-ASCII, 0xff, and 0x00.
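
Since the script in this thread is Python, here is a rough port of that scheme as a sketch, using the standard `colorsys` module (HSB and HSV are the same model). The three hue constants are assumptions, since the snippet above does not show the real `blueHue`, `greenHue` and `yellowHue` values, and `byte_color` is a made-up name. Written for Python 3.

```python
import colorsys

# Assumed hues as fractions of the colour wheel; the original constants
# (blueHue, greenHue, yellowHue) are not shown in the post.
BLUE_HUE, GREEN_HUE, YELLOW_HUE = 2 / 3, 1 / 3, 1 / 6

def byte_color(byte_val):
    """Return an (r, g, b) tuple of 0-255 ints for one byte value."""
    if byte_val == 0x00:
        return (0, 0, 0)
    if byte_val == 0xFF:
        return (255, 255, 255)
    if byte_val <= 127:  # ASCII range
        # Control characters, space and DEL get the "non-visible" hue
        hue = GREEN_HUE if (byte_val < 33 or byte_val == 127) else BLUE_HUE
        brightness = byte_val / 127
    else:  # non-ASCII
        hue = YELLOW_HUE
        brightness = (byte_val - 127) / (255 - 127)
    r, g, b = colorsys.hsv_to_rgb(hue, 1.0, brightness)
    return tuple(round(c * 255) for c in (r, g, b))
```

The returned tuple can be passed straight to something like PIL's `ImageDraw.point` as the fill colour.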

An image looks like the left most part of this one (here a YAC setup):

Since PE sections are usually aligned to 512 bytes on disk, you can also see its icons if you align the image width to 512 too.
Cool?
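
The alignment trick can be tried without any imaging library. This is a minimal sketch that assumes 512 byte rows and writes a greyscale PGM file, which most image viewers open; `byteplot_rows` and `write_pgm` are hypothetical helpers, not part of any tool mentioned here. Written for Python 3.

```python
def byteplot_rows(data, width=512):
    """Split data into rows of `width` bytes, zero-padding the last row."""
    if width < 1:
        raise ValueError("width must be positive")
    rows = [data[i:i + width] for i in range(0, len(data), width)]
    if rows and len(rows[-1]) < width:
        rows[-1] = rows[-1] + bytes(width - len(rows[-1]))
    return rows

def write_pgm(path, rows):
    """Write the rows as a binary PGM (P5) image, one byte per pixel."""
    width = len(rows[0]) if rows else 0
    with open(path, "wb") as out:
        # PGM header: magic number, width and height, maximum grey value
        out.write(b"P5\n%d %d\n255\n" % (width, len(rows)))
        out.writelines(rows)
```

Usage would be along the lines of `write_pgm("plot.pgm", byteplot_rows(open(path, "rb").read()))`. Trying other widths is how embedded bitmaps with a different row length tend to snap into focus.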

Try to experiment a bit and find out what works for you. By now it really looks like you just replicated it, even the colors are the same. E.g. you can take care that red-green colorblind people are able to use it too.

HTH · « **Reply #12 on:** March 12, 2015, 09:10:05 pm »

Thank you for the kind words Deque. I do plan on incorporating a lot of things, especially things that I know they haven't (scrolling is a big one). Unfortunately the core idea was theirs first. I hope eventually that my tool will be polished enough that people will forget about that and just see an amazing piece of software.

Of course it has only been like a week :p

Deque · « **Reply #13 on:** March 12, 2015, 09:56:22 pm »

People won't care who invented it, if the program is practical, easy, intuitive, useful and just better to use than the other software available.
