I wrote a script that will perform the following:
1. Download the webpage source code
2. Search for links in the source and print them to screen
3. Search for email addresses and UK phone numbers and print them to screen. For other types of phone numbers the regular expressions will need to be modified.
4. Search for MD5 hashes (the script will try to crack them using an inbuilt mini dictionary)
5. Search for images and documents
6. Download images and documents
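Steps 2 to 4 can be sketched on a string instead of a live page. This is a minimal sketch written for Python 3 (the scripts in this post are Python 2), not the code from the project: the sample page, the word list, the function names and the regular expressions are all assumptions made for illustration.

```python
import hashlib
import re

# Hypothetical sample page; the real script downloads the page source first.
SAMPLE_PAGE = (
    '<a href="http://example.com/about">About</a>\n'
    '<p>Contact: alice@example.co.uk</p>\n'
    '<p>Hash: 5f4dcc3b5aa765d61d8327deb882cf99</p>\n'
)

# Tiny word list, in the spirit of the script's inbuilt mini dictionary.
WORDS = ['123456', 'password', 'qwerty', 'letmein']


def find_links(page):
    """Return the href values of http and https links."""
    return re.findall(r'href="(https?://[^"]+)"', page)


def find_emails(page):
    """Return anything shaped like name@domain.tld (dots are escaped)."""
    return re.findall(r'[\w.+-]+@[\w-]+(?:\.[\w-]+)+', page)


def find_md5(page):
    """Return every standalone run of 32 hexadecimal characters."""
    return re.findall(r'\b[a-fA-F\d]{32}\b', page)


def crack_md5(md5_hex, words=WORDS):
    """Hash each word and return the one that matches, or None."""
    for word in words:
        if hashlib.md5(word.encode('utf-8')).hexdigest() == md5_hex.lower():
            return word
    return None


if __name__ == '__main__':
    print(find_links(SAMPLE_PAGE))
    print(find_emails(SAMPLE_PAGE))
    for md5_hex in find_md5(SAMPLE_PAGE):
        print(md5_hex, crack_md5(md5_hex))
```

A run of 32 hex characters is not necessarily an MD5 hash, so a match here only means "worth trying against the dictionary".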
I structured the project as one controlling script (main.py) that calls a separate script for each individual part. I learned from this project that indentation is incredibly important in Python. I used IDLE as my Python editor. I am not a programmer and the code can definitely be improved and made more efficient. I hope this helps someone.
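MAKE SURE THE DIRECTORY C:\temp EXISTS. The download script saves into C:\temp\coursework and the forensic analysis copies files into C:\temp\coursework\fixed.

main.py controlled the running: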
# Script: main.py
#A lot of importing
import webpage_getlinks, webpage_getemails, webpage_getstuff, webpage_md5, webpage_getimages
import webpage_downloadcontent, forensic_analysis, urllib, httplib, urlparse, urllib2

# Function main
def main():
    #Display options and stuff
    print 'Below is the list of available options, please select one when asked to do so:'
    print '1. Get webpage content and print hyperlinks.'
    print '2. Get webpage content and print emails and numbers.'
    print '3. Get webpage content and find the MD5 hashes if any.'
    print 'The script will then try to crack them.'
    print '4. Get webpage content - images and documents.'
    print '5. Download webpage content - images and documents.'
    print '6. Perform Forensic Analysis.'
    print 'Please enter the number to work from:'
    #Get users option
    option = raw_input()
    print ''
    print 'You have selected option number', option
    print 'Please enter the URL to get content from (MUST INCLUDE HTTP://):'
    # Get user input
    url = raw_input()
    print ''
    print 'You have selected:', url
    print ''
    #Checks to make sure the site works but MUST have http://...
    #if it does then continue, if not print an error message
    try:
        # Assumption: the check simply opens the URL. urllib2 raises ValueError
        # for a URL without http:// and URLError for an unreachable server.
        urllib2.urlopen(url)
        # Assumption: each imported module exposes main(url), as the scripts below do
        if option == '1':
            # 1. Get webpage content and print hyperlinks.
            webpage_getlinks.main(url)
        elif option == '2':
            # 2. Get webpage content and print emails and numbers.
            webpage_getemails.main(url)
        elif option == '3':
            # 3. Get webpage content and find the MD5 hashes if any. The script will then try to crack them.
            webpage_md5.main(url)
        elif option == '4':
            # 4. Get webpage content - images and documents.
            webpage_getimages.main(url)
        elif option == '5':
            # 5. Download webpage content - images and documents.
            webpage_downloadcontent.main(url)
        elif option == '6':
            # 6. Perform Forensic Analysis.
            forensic_analysis.main(url)
    except ValueError, ex:
        print 'There appears to be a problem with the server, please try again later!'
    except urllib2.URLError, ex:
        print 'There appears to be a problem with the server, please try again later!'

if __name__ == '__main__':
    main()

# Script: webpage_getlinks.py (file name inferred from the imports in main.py)
# Python sucks
import sys, re, urllib

# Function print_links
def print_links(page):
    # find all http hyperlinks on a webpage passed in as input and print them
    print '[*] print_links()'
    # regex matches from an <a ... href=...http: tag to the end of that line
    links = re.findall(r'\<a.*href\=.*http\:.+', page)
    #print [+], the number of links found and HyperLinks Found:
    print '[+]', str(len(links)), 'HyperLinks Found:'
    for link in links:
        print link
    print 'All non-encrypted hyperlinks found!'

# Function print_slinks
def print_slinks(page):
    # find all https hyperlinks on a webpage passed in as input and print them
    print '[*] print_slinks()'
    # regex matches from an <a ... href=...https: tag to the end of that line
    links = re.findall(r'\<a.*href\=.*https\:.+', page)
    #print [+], the number of links found and Secure HyperLinks Found:
    print '[+]', str(len(links)), 'Secure HyperLinks Found:'
    for link in links:
        print link
    print 'All secure hyperlinks found!'

# Function wget
def wget(url):
    # Try to retrieve a webpage via its url, and return its contents
    print '[*] wget()'
    # open file like url object from web, based on url
    url_file = urllib.urlopen(url)
    # get webpage contents
    page = url_file.read()
    # return the page source so the other functions can search it
    return page

# Function main
def main(url):
    page = wget(url)
    # Get the http links
    print_links(page)
    # Get the https links
    print_slinks(page)

if __name__ == '__main__':
    # Assumption: when run on its own, the script asks for the URL
    main(raw_input('Please enter the URL (MUST INCLUDE HTTP://): '))

# Script: webpage_getemails.py (file name inferred from the imports in main.py)
# Python sucks
import sys, re, urllib

# Function print_emails
def print_emails(page):
    # Set blank space between functions
    print ""
    # find all emails on a webpage passed in as input and print them
    print '[*] print_emails()'
    # regex to match on emails; the dots are escaped so they only match a literal '.'
    emails = re.findall(r'[\w.-]+@[\w-]+(?:\.[\w-]+)+', page)
    #print [+], the number of emails found and Emails found:
    print '[+]', 'There were', str(len(emails)), 'emails found.'
    print 'Emails found:'
    for email in emails:
        print email

# Function print_numbers
def print_numbers(page):
    # Set blank space between functions
    print ""
    # find all phone numbers on a webpage passed in as input and print them
    print '[*] print_numbers()'
    # regex to match on phone numbers, modify it for other number formats
    numbers = re.findall(r'([+].*[(].*[)]\d{3}[-\.\s]??\d{3}[-\.\s]??\d{4}|\(\d{3}\)\s*\d{3}[-\.\s]??)', page)
    #print [+], the number of numbers found and Phone numbers found:
    print '[+]', 'There were', str(len(numbers)), 'phone numbers found.'
    print 'Phone numbers found:'
    for number in numbers:
        print number

# Function wget
def wget(url):
    # Try to retrieve a webpage via its url, and return its contents
    print '[*] wget()'
    # open file like url object from web, based on url
    url_file = urllib.urlopen(url)
    # get webpage contents
    page = url_file.read()
    # return the page source so the other functions can search it
    return page

# Function main
def main(url):
    page = wget(url)
    # Get the emails
    print_emails(page)
    # Get the numbers
    print_numbers(page)

if __name__ == '__main__':
    # Assumption: when run on its own, the script asks for the URL
    main(raw_input('Please enter the URL (MUST INCLUDE HTTP://): '))

# Script: webpage_crackmd5.py (file name taken from the import in the MD5 search script)
# Python sucks
import sys, hashlib

# Function crack_hashes
def crack_hashes(md5):
    # Set blank space between functions
    print ""
    #Checks password hash against a dictionary of common passwords
    print 'crack_hashes(): Cracking hash:', md5
    # set up list of common password words
    dic = ['123','1234','12345','123456','1234567','12345678','password','123123', 'qwerty','abc','abcd','abc123','111111','monkey','arsenal','letmein','trustno1','dragon','baseball','superman','iloveyou','starwars','montypython','cheese','football','batman']
    for passwd in dic:
        # hash each word and compare it with the hash from the page
        passwds = hashlib.md5(passwd)
        # hexdigest() is lower case, so lower the input before comparing
        if passwds.hexdigest() == md5.lower():
            print 'crack_hashes(): Password recovered:', passwd

def main():
    print'[crack_hashes] Tests'
    md5 = 'd8578edf8458ce06fbc5bb76a58c5ca4'
    # Assumption: the test runs the sample hash through crack_hashes
    crack_hashes(md5)

if __name__ == '__main__':
    main()

# Script: webpage_md5.py (file name inferred from the imports in main.py)
# Python sucks
#download the hashes
import sys, re, urllib
import webpage_crackmd5

# Function print_md5
def print_md5(page):
    # Set blank space between functions
    print ""
    # find all MD5 hashes on a webpage passed in as input and print them
    print '[*] print_md5()'
    # regex to match any run of 32 hexadecimal characters
    md5 = re.findall(r'([a-fA-F\d]{32})', page)
    #print [+], the number of hashes found and MD5 hashes found:
    print '[+]', 'There were', str(len(md5)), 'md5 hashes found.'
    print 'MD5 hashes found:'
    for i in md5:
        print i
        # Assumption: each hash found is handed to the cracking script
        webpage_crackmd5.crack_hashes(i)

# Function wget
def wget(url):
    # Try to retrieve a webpage via its url, and return its contents
    print '[*] wget()'
    # open file like url object from web, based on url
    url_file = urllib.urlopen(url)
    # get webpage contents
    page = url_file.read()
    # return the page source so the other functions can search it
    return page

# Function main
def main(url):
    page = wget(url)
    # Print md5
    print_md5(page)

if __name__ == '__main__':
    # Assumption: when run on its own, the script asks for the URL
    main(raw_input('Please enter the URL (MUST INCLUDE HTTP://): '))

# Script: webpage_getimages.py (file name inferred from the imports in main.py)
# Python sucks
import sys, re, urllib

# Function print_images
def print_images(page):
    # Set blank space between functions
    print ""
    # find all images on a webpage passed in as input and print them
    print '[*] print_images()'
    # regex to match the value of every src="..." attribute
    images = re.findall(r'src="([^"]+)"', page)
    #print [+], the number of images found and Images found:
    print '[+]', 'There were', str(len(images)), 'images found.'
    print 'Images found:'
    for image in images:
        print image

# Function print_documents
def print_documents(page):
    # Set blank space between functions
    print ""
    # find all documents on a webpage passed in as input and print them
    print '[*] print_documents()'
    # regex to match quoted values ending in .docx, [^"]* keeps the match inside one pair of quotes
    documents = re.findall(r'"([^"]*\.docx)"', page)
    #print [+], the number of documents found and Documents found:
    print '[+]', 'There were', str(len(documents)), 'documents found.'
    print 'Documents found:'
    for document in documents:
        print document

# Function wget
def wget(url):
    # Try to retrieve a webpage via its url, and return its contents
    print '[*] wget()'
    # open file like url object from web, based on url
    url_file = urllib.urlopen(url)
    # get webpage contents
    page = url_file.read()
    # return the page source so the other functions can search it
    return page

# Function main
def main(url):
    page = wget(url)
    # Get the images
    print_images(page)
    # Get the documents
    print_documents(page)

if __name__ == '__main__':
    # Assumption: when run on its own, the script asks for the URL
    main(raw_input('Please enter the URL (MUST INCLUDE HTTP://): '))

# Script: webpage_downloadcontent.py (file name inferred from the imports in main.py)
# Download images
import os
import re
import urllib
import httplib, urlparse

#Set location equal to the directory for the downloaded content (it must already exist)
#It is defined once here so that both download loops and both functions can use it
location = os.path.abspath("C:/temp/coursework/")

def download_images(url, page):
    #Set link equal to the URL minus the page, for downloading images given without a path
    link = os.path.dirname(url)
    #Set images equal to the images on the page (WILL ONLY FIND FULL LINKS)
    images = re.findall(r'src="[^"]*http\://([^"]+)"', page)
    #Set aegami equal to the images on the page (WILL ONLY FIND IMAGES WITH NO PATH)
    aegami = re.findall(r'\ssrc\s*=\s*"([^/"]+\.(?:jpg|gif|png|jpeg|bmp))\s*"', page)
    #This is for adding the URL to images with no full URL
    for image in aegami:
        #Set test = dirname of the URL + / + each individual file in the loop
        test = link+"/"+image
        #Concatenate location (C:\temp\coursework) + filename, this is where the file is saved
        get = os.path.join(location, image)
        #Check the file actually exists by requesting it and reading the status code
        status = urllib.urlopen(test).getcode()
        #If status is 200 (OK)
        if status == 200:
            urllib.urlretrieve(test, get)
            print 'The file:', image, 'has been saved to:', get
        #If status is 404 (FILE DOES NOT EXIST)
        elif status == 404:
            print 'The file:', image, 'could not be saved. Does not exist!!'
        else:
            print 'Unknown Error:', status
    ######################### This is for files with a link ################
    for files in images:
        #Set filename equal to the basename of files, i.e. image1.jpg of http://google.com/image1.jpg
        filename = os.path.basename(files)
        #Concatenate location (C:\temp\coursework) + filename, this is where the file is saved
        save = os.path.join(location, filename)
        #The regular expression only captures what follows http://, so put it back for the download
        addhttp = "http://"+files
        #Check the full link (not the last relative one) before downloading
        status = urllib.urlopen(addhttp).getcode()
        if status == 200:
            #Download the file, arguments (link to file, destination path including the filename)
            urllib.urlretrieve(addhttp, save)
            print 'The file:', filename, 'has been saved to:', save
        elif status == 404:
            print 'The file:', filename, 'could not be saved. Does not exist!!'
        else:
            print 'Unknown Error:', status
    print 'Download of images complete!'
    #Print white space
    print ''

def download_documents(url, page):
    #Set dirname equal to the URL minus the page, for downloading documents given without a path
    dirname = os.path.dirname(url)
    #Set documents equal to the documents on the page
    documents = re.findall(r'"([^"]*\.docx)"', page)
    #Download the documents, see above comments - same code with different variables and regex
    for doc in documents:
        test = dirname+"/"+doc
        #Set name equal to the basename of the document, so just the file name
        name = os.path.basename(test)
        #Concatenate location (C:\temp\coursework) + filename, this is where the file is saved
        get = os.path.join(location, name)
        status = urllib.urlopen(test).getcode()
        if status == 200:
            urllib.urlretrieve(test, get)
            print 'The file:', doc, 'has been saved to:', get
        elif status == 404:
            print 'The file:', doc, 'could not be saved. Does not exist!!'
        else:
            print 'Unknown Error:', status
    print 'Download of documents complete!'

# Function wget
def wget(url):
    # Try to retrieve a webpage via its url, and return its contents
    print ''
    print '[*] wget()'
    #Print white space
    print ''
    # open file like url object from web, based on url
    url_file = urllib.urlopen(url)
    # get webpage contents
    page = url_file.read()
    # return the page source so the other functions can search it
    return page

def main(url):
    page = wget(url)
    download_images(url, page)
    download_documents(url, page)

if __name__ == '__main__':
    # Assumption: when run on its own, the script asks for the URL
    main(raw_input('Please enter the URL (MUST INCLUDE HTTP://): '))

# Script: forensic_analysis.py (file name inferred from the imports in main.py)
#Forensic Analysis
import hashlib, urllib, binascii, os, shutil
#The binascii module converts the raw bytes of a file to hex.

#Known file signatures: (hex of the leading bytes, file type, file extension)
#DOC and XLS share the same signature, so both entries match such a file
signatures = [('ffd8ffe0', 'JPG', '.jpg'), ('ffd8ffe1', 'JPG', '.jpg'),
              ('424d3e', 'BMP', '.bmp'), ('47494638', 'GIF', '.gif'),
              ('000001', 'ICO', '.ico'), ('89504e', 'PNG', '.png'),
              ('d0cf11', 'DOC', '.doc'), ('504b0304', 'DOCX', '.docx'),
              ('7b5c72', 'RTF', '.rtf'), ('d0cf11', 'XLS', '.xls')]

def forensic_analysis(url, page):
    #Setup the list of bad files, [0] = hash [1] = the file name
    bad_hashes = [('9d377b10ce778c4938b3c7e2c63a229a','badfile1.jpg'), ('6bbaa34b19edd6c6fa06cccf29b33125', 'badfile2.jpg'), ('28f6607fa6ec96acb89027056cc4c0f5', 'badfile3.jpg'), ('1d6d9c72e3476d336e657b50a77aee05', 'badfile4.jpg')]
    #Set location equal to the directory where downloads are stored
    location = 'C:\\temp\\coursework'
    #Set dst_dir to location\fixed, this directory must already exist
    ### Note that because of the way Windows works, we must make sure
    ### that the fixed directory is empty before running the script, as
    ### the code will not be able to rename files otherwise.
    ### On Unix systems the rename overwrites an existing file.
    dst_dir = os.path.join(location, "fixed")
    #Counts how many files are in C:\temp\coursework
    path, dirs, files = os.walk(location).next()
    file_count = len(files)
    #Tell the user how many files there are in the directory
    print 'How many files:', file_count
    #For each file in the directory do
    for each in files:
        #Set filename equal to the downloads directory + each individual filename
        filename = os.path.join(location, each)
        #Blank space
        print ''
        print 'Current filename:', filename
        #Read the first four bytes in binary mode and convert them to hex
        with open(filename, 'rb') as fh:
            filesig_hex = binascii.hexlify(fh.read(4))
        print 'This is the files signature:', filesig_hex
        #Extract file extension from file name
        fileName, fileExtension = os.path.splitext(each)
        recognised = False
        for sig, kind, ext in signatures:
            #Some signatures are shorter than four bytes, so compare the start only
            if not filesig_hex.startswith(sig):
                continue
            recognised = True
            print 'This file is', kind
            #Set its new name to the filename plus its fixed extension
            newname = fileName+ext
            print 'The files newname:', newname
            #Copy the original file to location\fixed, then rename the copy
            shutil.copy(filename, dst_dir)
            dst_file = os.path.join(dst_dir, each)
            new_dst_file_name = os.path.join(dst_dir, newname)
            os.rename(dst_file, new_dst_file_name)
            #Hash the whole renamed copy and compare it with the bad hashes
            with open(new_dst_file_name, 'rb') as afile:
                file_hash = hashlib.md5(afile.read()).hexdigest()
            for i in bad_hashes:
                if file_hash == i[0]:
                    print 'This file hash matches bad hash', i[0], '. This file should be called:', i[1]
        if not recognised:
            print 'File signature not recognised. Be wary'

# Function wget
def wget(url):
    # Try to retrieve a webpage via its url, and return its contents
    print '[*] wget()'
    # open file like url object from web, based on url
    url_file = urllib.urlopen(url)
    # get webpage contents
    page = url_file.read()
    # return the page source
    return page

def main(url):
    #Set page equal to the output of the wget function
    page = wget(url)
    #Call the forensic_analysis function and pass the arguments url and page
    forensic_analysis(url, page)

if __name__ == '__main__':
    # Assumption: when run on its own, the script asks for the URL
    main(raw_input('Please enter the URL (MUST INCLUDE HTTP://): '))

Note that to perform forensic analysis you must first download some files (option 5), so that there is something in C:\temp\coursework to analyse.
Forensic analysis works by checking each file's signature (the first few bytes of the file) to confirm that the file really is the type its extension claims. For example, if a file is called file1.jpg, the forensic analysis script checks the signature of file1.jpg to see whether it actually is a JPG. If it is, the script just states that the file is a JPG. If it isn't, the script tells the user which file type the signature indicates instead.
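The signature check on its own can be sketched in a few lines. This is a minimal Python 3 sketch (the scripts in this post are Python 2), not the project code: the names SIGNATURES, detect_extension and check_file are made up for illustration, and the table only covers a handful of common file types.

```python
import os

# Leading bytes ("magic numbers") for a few common formats.
SIGNATURES = [
    (b'\xff\xd8\xff', '.jpg'),
    (b'\x89PNG', '.png'),
    (b'GIF8', '.gif'),
    (b'BM', '.bmp'),
    (b'PK\x03\x04', '.docx'),        # any ZIP container, DOCX included
    (b'{\\rt', '.rtf'),
    (b'\xd0\xcf\x11\xe0', '.doc'),   # OLE2 container, shared by DOC and XLS
]


def detect_extension(header):
    """Return the extension suggested by the leading bytes, or None."""
    for magic, ext in SIGNATURES:
        if header.startswith(magic):
            return ext
    return None


def check_file(path):
    """Return (claimed_extension, detected_extension) for a file on disk."""
    # Binary mode matters: text mode can alter or reject raw bytes.
    with open(path, 'rb') as fh:
        header = fh.read(8)
    claimed = os.path.splitext(path)[1].lower()
    return claimed, detect_extension(header)
```

If the two values returned by check_file differ, the file is not what its name claims. A signature such as the ZIP or OLE2 one only narrows the type down to a family of formats, so the detected extension is a best guess.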
I haven't looked at the code for a couple of months, but if you have any issues with it I am more than happy to help. Feel free to modify it for your personal use.
To download all the scripts: