Author Topic: Get webpage content script  (Read 952 times)


Offline dsme

  • /dev/null
  • *
  • Posts: 19
  • Cookies: 3
    • View Profile
Get webpage content script
« on: January 12, 2014, 06:57:44 pm »
Hi,

I wrote a script that performs the following:

1. Download the webpage source code
2. Search for links in the source and print them to screen
3. Search for email addresses and UK phone numbers and print them to screen; for other phone number formats the regular expressions will need to be modified.
4. Search for MD5 hashes (the script will try to crack them using a built-in mini dictionary)
5. Search for images and documents
6. Download images and documents

I structured the project so that one main script drives each individual part. I learned from this project that indentation is incredibly important in Python. I used IDLE as my Python editor. I am not a programmer, so the code can definitely be improved and made more efficient, but I hope it helps someone.

MAKE SURE DIRECTORY C:\temp exists.

main.py controls the running:
Code: [Select]
#
# Script: main.py
#


#A lot of importing
import webpage_getlinks, webpage_getemails, webpage_getstuff, webpage_md5, webpage_getimages
import webpage_downloadcontent, forensic_analysis, urllib, httplib, urlparse, urllib2


# Function main
def main():


#Display options and stuff
 print 'Below is the list of available options, please select one when asked to do so:'
 print '1. Get webpage content and print hyperlinks.'
 print '2. Get webpage content and print emails and numbers.'
 print '3. Get webpage content and find the MD5 hashes if any.'
 print 'The script will then try to crack them.'
 print '4. Get webpage content - images and documents.'
 print '5. Download webpage content - images and documents.'
 print '6. Perform Forensic Analysis.'
 print 'Please enter the number to work from:'
 #Get users option
 option = raw_input()
 print ''
 print 'You have selected option number', option


 print 'Please enter the URL to get content from (MUST INCLUDE HTTP://):'
 # Get user input
 url = raw_input()


 print ''
 print 'You have selected:', url
 print ''


 #Check that the site responds - the URL MUST include http://
 #If it works, run the chosen option; if not, print an error message


 try:
     urllib2.urlopen(url)
     #If option is equal to one then
     if option == '1':
         # 1. Get webpage content and print hyperlinks.
         webpage_getlinks.main(url)
     elif option == '2':
         # 2. Get webpage content and print emails and numbers.
         webpage_getemails.main(url)
     elif option == '3':
         # 3. Get webpage content and find the MD5 hashes if any, then try to crack them.
         webpage_md5.main(url)
     elif option == '4':
         # 4. Get webpage content - images and documents.
         webpage_getimages.main(url)
     elif option == '5':
         # 5. Download webpage content - images and documents.
         webpage_downloadcontent.main(url)
     elif option == '6':
         # 6. Perform forensic analysis on the downloaded content.
         forensic_analysis.main(url)
 except ValueError, ex:
     print 'The URL appears to be malformed, please check it and try again!'
 except urllib2.URLError, ex:
     print 'There appears to be a problem with the server, please try again later!'




 
if __name__ == '__main__':
 main()


webpage_getlinks.py
Code: [Select]
# Python sucks 
import sys, re, urllib
import main
 
# Function print_links
def print_links(page): 
 # find all hyperlinks on a webpage passed in as input and print 
 print '[*] print_links()'
 # regex to match anchor tags whose href contains an http: URL
 links = re.findall(r'\<a.*href\=.*http\:.+', page)
 
 # sort and print the links 
 links.sort()
 #print [+], the numbers of links found and HyperLinks Found:
 print '[+]', str(len(links)), 'HyperLinks Found:'
 
 #print links 
 #Uncomment below when testing (commented out to keep everything tidy)
 for link in links: 
  print link
 print 'All non-encrypted hyperlinks found!'
 
def print_slinks(page): 
 # find all hyperlinks on a webpage passed in as input and print 
 print '[*] print_slinks()'
 # regex to match anchor tags whose href contains an https: URL
 links = re.findall(r'\<a.*href\=.*https\:.+', page)
 
 # sort and print the links 
 links.sort()
 #print [+], the numbers of links found and HyperLinks Found:
 print '[+]', str(len(links)), 'Secure HyperLinks Found:'
 
 #print links 
 #Uncomment below when testing (commented out to keep everything tidy)
 for link in links: 
  print link
 print 'All secure hyperlinks found!'
 
# Function wget
def wget(url):
     
 # Try to retrieve a webpage via its url, and return its contents 
 print '[*] wget()'
 # open file like url object from web, based on url 
 url_file = urllib.urlopen(url)
 # get webpage contents 
 page = url_file.read()
 # return the page contents (this is needed - main() uses it)
 return page
 
 
# Function main
def main(url): 


 page = wget(url)
 # Get the links 
 print_links(page)
  # Get the secure links
 print_slinks(page) 
   
if __name__ == '__main__': 
 main(raw_input('Please enter the URL (including http://): '))


webpage_getemails.py
Code: [Select]
# Python sucks
import sys, re, urllib
import main


# Function print_emails
def print_emails(page):
# Set blank space between functions
 print ""
 
 # find all emails on the page passed in as input and print them
 print '[*] print_emails()'
 # loose regex to match email-like strings (word characters around an @)
 emails = re.findall(r'\w+@\w+.\w+.\w+', page)


 # sort and print the links
 emails.sort()
 #print [+], the numbers of emails found and Emails Found:
 print '[+]', 'There were', str(len(emails)), 'emails found.'
 print 'Emails found:'


 #Uncomment below when testing (commented out to keep everything tidy)
 for email in emails:
  print email


# Function print_numbers
def print_numbers(page):
# Set blank space between functions
 print ""


   
 # find all phone numbers on the page passed in as input and print them
 print '[*] print_numbers()'
 # regex to match phone number patterns (modify it for other formats)
 numbers = re.findall(r'([+].*[(].*[)]\d{3}[-\.\s]??\d{3}[-\.\s]??\d{4}|\(\d{3}\)\s*\d{3}[-\.\s]??)', page)
 
 # sort and print the links
 numbers.sort()
 #print [+] and the number of phone numbers found
 print '[+]', 'There were', str(len(numbers)), 'phone numbers found.'
 print 'Phone numbers found:'


 #print numbers
 #Uncomment below when testing (commented out to keep everything tidy)
 for number in numbers:
  print number


# Function wget
def wget(url):
   
 # Try to retrieve a webpage via its url, and return its contents
 print '[*] wget()'
 # open file like url object from web, based on url
 url_file = urllib.urlopen(url)
 # get webpage contents
 page = url_file.read()
 # return the page contents (this is needed - main() uses it)
 return page




# Function main
def main(url):
 
 page = wget(url)
 # Get the emails
 print_emails(page)
 # Get the numbers
 print_numbers(page)


if __name__ == '__main__':
 main(raw_input('Please enter the URL (including http://): '))


webpage_getstuff.py (the hash-cracking helper; note that webpage_md5.py below imports a module named webpage_crackmd5, which isn't posted separately here but should be in the download at the end)
Code: [Select]
# Python sucks
import sys, hashlib


# Function crack_hashes
def crack_hashes(md5):
# Set blank space between functions
 print ""
 #Checks password hash, against a dictionary of common passwords
 print 'crack_hashes(): Cracking hash:', md5
 # set up list of common password words
 dic = ['123','1234','12345','123456','1234567','12345678','password','123123', 'qwerty','abc','abcd','abc123','111111','monkey','arsenal','letmein','trustno1','dragon','baseball','superman','iloveyou','starwars','montypython','cheese','football','password','batman']
 #passwd_found = False;
   
 for passwd in dic:
     passwds = hashlib.md5(passwd)
     #print passwds.hexdigest()


     if passwds.hexdigest() == md5:


         #passwd_found = True;
         #break
         print 'crack_hashes(): Password recovered:', passwd
     #else:
             
 
def main():
    print '[crack_hashes] Tests'
    md5 = 'd8578edf8458ce06fbc5bb76a58c5ca4'
    crack_hashes(md5)
    #sys.exit(0)
 
if __name__ == '__main__':
    main()


webpage_md5.py
Code: [Select]
# Python sucks
#download the page and find the hashes
import sys, re, urllib
import main
import webpage_crackmd5


# Function print_md5
def print_md5(page):
# Set blank space between functions
 print ""


 # find all MD5 hashes on the page passed in as input and print them
 print '[*] print_md5()'
 # regex to match 32-character hex strings (the format of an MD5 hash)
 md5 = re.findall(r'([a-fA-F\d]{32})', page)


 # sort and print the hashes
 md5.sort()
 #print [+] and the number of hashes found
 print '[+]', 'There were', str(len(md5)), 'md5 hashes found.'
 print 'MD5 hashes found:'


 #print numbers
 #Uncomment below when testing (commented out to keep everything tidy)
 for i in md5:
  print i
 webpage_crackmd5.main(md5)
 
# Function wget
def wget(url):
   
 # Try to retrieve a webpage via its url, and return its contents
 print '[*] wget()'
 # open file like url object from web, based on url
 url_file = urllib.urlopen(url)
 # get webpage contents
 page = url_file.read()
 # return the page contents (this is needed - main() uses it)
 return page


# Function main
def main(url):


 page = wget(url)
 # Print md5
 print_md5(page)
 
if __name__ == '__main__':
 main(raw_input('Please enter the URL (including http://): '))


webpage_getimages.py
Code: [Select]
# Python sucks
import sys, re, urllib
import main


# Function print_images
def print_images(page):
# Set blank space between functions
 print ""


 # find all image references on the page passed in as input and print them
 print '[*] print_images()'
 # regex to capture the value of every src="..." attribute
 images = re.findall(r'src="([^"]+)"', page)


 # sort and print the images
 images.sort()
 #print [+] and the number of images found
 print '[+]', 'There were', str(len(images)), 'images found.'
 print 'Images found:'


 #print the images
 for image in images:
  print image


# Function print_documents
def print_documents(page):
# Set blank space between functions
 print ""


 # find all .docx document references on the page passed in as input and print them
 print '[*] print_documents()'
 # regex to capture quoted paths ending in .docx
 documents = re.findall(r'"(.*\.docx)"', page)




 # sort and print the documents
 documents.sort()
 #print [+] and the number of documents found
 print '[+]', 'There were', str(len(documents)), 'documents found.'
 print 'Documents found:'


 #print the documents
 for document in documents:
  print document


# Function wget
def wget(url):
   
 # Try to retrieve a webpage via its url, and return its contents
 print '[*] wget()'
 # open file like url object from web, based on url
 url_file = urllib.urlopen(url)
 # get webpage contents
 page = url_file.read()
 # return the page contents (this is needed - main() uses it)
 return page


# Function main
def main(url):


 page = wget(url)
 # Get the images and documents
 print_images(page)
 print_documents(page)
 
if __name__ == '__main__':
 main(raw_input('Please enter the URL (including http://): '))


webpage_downloadcontent.py
Code: [Select]
# Download images


import os
import main
import re
import urllib
import httplib, urlparse
   
def download_images(url, page):
 #Set link equal to the link minus the page for downloading images without a link
 link = os.path.dirname(url)#


 #Set images equal to the images on the page (WILL ONLY FIND FULL LINKS)
 images = re.findall(r'src=".*http\://([^"]+)"', page)
 
 #Set aegami equal to the images on the page (WILL ONLY FIND IMAGES WITH NO PATH)
 aegami = re.findall(r'\ssrc\s*=\s*"([^/"]+\.(?:jpg|gif|png|jpeg|bmp))\s*"', page) #\ssrc\s*=\s*"([^/"]+\.[^/".]+" no need to specify image file extensions
 images.sort()


 #Set location equal to the directory the downloaded content will be stored in
 #(defined before the loops so it is available to both of them)
 location = os.path.abspath("C:/temp/coursework/")


 #This is for adding the URL to images that have no full URL
 for image in aegami:
  #Set test = dirname of the URL + / + each individual file in the loop
  test = link+"/"+image


  #Need to concatenate location (C:/temp/coursework) + filename - this is where the file will be saved
  get = os.path.join(location, image)


  #Check the file actually exists by getting the HEAD code
  status = urllib.urlopen(test).getcode()


  #If status is 200 (OK)
  if status == 200:
      urllib.urlretrieve(test, get)
      print 'The file:', image, 'has been saved to:', get
  #If status is 404 (FILE NOT FOUND)
  elif status == 404:
      print 'The file:', image, 'could not be saved. Does not exist!!'
  else:
      print 'Unknown Error:', status


  ######################### This is for files with a link ################
 for files in images:
  #Download the images
  #Set filename equal to the basename of files, so the actual file i.e image1.jpg of http://google.com/image1.jpg
  filename = os.path.basename(files)


  #Need to concatenate location (C:/temp/coursework) + filename - this is where the file will be saved
  save = os.path.join(location, filename)


  #The regular expression strips the leading http://, and without it the download does not work
  addhttp = "http://"+files

  #Check the file exists by requesting it and reading the HTTP status code
  status = urllib.urlopen(addhttp).getcode()




  if status == 200:
      urllib.urlretrieve(addhttp, save)
      print 'The file:', filename, 'has been saved to:', save
  elif status == 404:
      print 'The file:', filename, 'could not be saved. Does not exist!!'
  else:
      print 'Unknown Error:', status


 print 'Download of images complete!'
 #Print white space
 print ''




def download_documents(url, page):
 #Set dirname equal to the URL minus the page, for building links to documents given without a full path
 dirname = os.path.dirname(url)


 #Set documents equal to the documents on the page
 documents = re.findall(r'"(.*\.docx)"', page)


 documents.sort()


 #Download the documents, see above comments - pretty much same code but different
 #Variables and regex


 for doc in documents:
  test = dirname+"/"+doc
 
  location = os.path.abspath("C:/temp/coursework/")


  #Set filename equal to the basename of files, so the actual file i.e image1.jpg of http://google.com/image1.jpg
  name = os.path.basename(test)


  #Need to concatenate location (C:/temp/coursework) + filename - this is where the file will be saved
  get = os.path.join(location, name)


  status = urllib.urlopen(test).getcode()
  #print status


  if status == 200:
      urllib.urlretrieve(test, get)
      print 'The file:', doc, 'has been saved to:', get
  elif status == 404:
      print 'The file:', doc, 'could not be saved. Does not exist!!'
  else:
      print 'Unknown Error:', status


 print 'Download of documents complete!'


# Function wget
def wget(url):
   
 # Try to retrieve a webpage via its url, and return its contents
 print ''
 print '[*] wget()'
 
 #Print white space
 print ''
 
 # open file like url object from web, based on url
 url_file = urllib.urlopen(url)
 
 # get webpage contents
 page = url_file.read()
 
  # return the page contents (this is needed - main() uses it)
 return page


def main(url):
 page = wget(url)
 
 download_images(url, page)
 download_documents(url, page)


if __name__ == '__main__':
 main(raw_input('Please enter the URL (including http://): '))


forensic_analysis.py
Code: [Select]
#Forensic Analysis
import hashlib, urllib, binascii, os, shutil
#The binascii module allows conversion of raw bytes to their hex (and binary) representation.


def forensic_analysis(url, page):


 #Set up the list of known bad hashes: [0] = the hash, [1] = the file name it should have
 bad_hashes = [('9d377b10ce778c4938b3c7e2c63a229a','badfile1.jpg'), ('6bbaa34b19edd6c6fa06cccf29b33125', 'badfile2.jpg'), ('28f6607fa6ec96acb89027056cc4c0f5', 'badfile3.jpg'), ('1d6d9c72e3476d336e657b50a77aee05', 'badfile4.jpg')]


 #Set location equal to the directory where downloads are stored
 location = 'C:\\temp\\coursework\\'
 
 #Counts how many files are in C:\\temp\\coursework
 path, dirs, files = os.walk("C:\\temp\\coursework").next()
 file_count = len(files)


 #Tell the user how many files there are in the directory
 print 'How many files:', file_count


 #For each file in the directory do
 for each in files:
     
  #Set filename equal to the downloads directory + each individual filename in the directory
  filename = location+each


  #Blank space
  print ''
  print 'Current filename:', filename


  #Open the file in binary mode so the signature bytes are read untranslated on Windows
  fh = open(filename, 'rb')
 
  #Read the first four bytes
  file_sig = fh.read(4)


  #Convert from ascii to hex
  #test = binascii.hexlify(file_sig)


  #Extract file extension from file name
  fileName, fileExtension = os.path.splitext(each)


  #Print the file signature in hex
  filesig_hex = binascii.hexlify(file_sig)
  print 'This is the file signature (hex):', filesig_hex


  #Set hasher to hash files
  hasher = hashlib.md5()


  #if the file signature == (SIGNATURE - in this case JPG)
  if filesig_hex == 'ffd8ffe0':
          #Tell the user this is a jpg file
          print 'This file is JPG'


          #Set file extension to jpg
          fileExtension = '.jpg'


          #Set its new name to the filename plus its fixed extension
          newname = fileName+fileExtension


          #Print its new name
          print 'The files newname:', newname
         
          #Set dst_dir to location\fixed
          dst_dir = os.path.join(location, "fixed")
          #Join location and file name together
          src_file = os.path.join(location, each)


          #Copy the original file to the new directory location\fixed
          shutil.copy(src_file, dst_dir)
         
          ### Note that because the way Windows works, we must make sure
          ### that the fixed directory is empty before running the script as
          ### the code will not be able to rename files otherwise.
          ### This code will work perfectly on Unix systems as the way they work
          ### Is overwrite is guaranteed.


          #Set dst_file to the new location + the file name
          dst_file = os.path.join(dst_dir, each)


          #Set the new name to new location + newname
          new_dst_file_name = os.path.join(dst_dir, newname)


          #Rename the original file to its new name
          os.rename(dst_file, new_dst_file_name)
         
          ##### need to copy file to new folder, rename it and then hash it^
         
          with open(new_dst_file_name, 'rb') as afile:
              buf = afile.read()
              hasher.update(buf)
          for i in bad_hashes:
              if hasher.hexdigest() == i[0]:
                  print 'This file hash matches bad hash', i[0], '. This file should be called:', i[1]
  #See above comments for anything below, code is the same but I ran out of
  #time to put this into a function which would take up less lines and be more efficient.                 
  elif filesig_hex == 'ffd8ffe1':
          print 'This file is JPG'
          fileExtension = '.jpg'
          newname = fileName+fileExtension
          print 'Newname:', newname
          dst_dir = os.path.join(location, "fixed")
          src_file = os.path.join(location, each)
          shutil.copy(src_file, dst_dir)


          dst_file = os.path.join(dst_dir, each)
          new_dst_file_name = os.path.join(dst_dir, newname)
          os.rename(dst_file, new_dst_file_name)


          with open(new_dst_file_name, 'rb') as afile:
              buf = afile.read()
              hasher.update(buf)
         
          for i in bad_hashes:
              if hasher.hexdigest() == i[0]:
                  print 'This file hash matches bad hash', i[0], '. This file should be called:', i[1]
                 
  elif filesig_hex.startswith('424d3e'): #startswith because this signature is shorter than the 4 header bytes read
          print 'This file is BMP'
          fileExtension = '.bmp'
          newname = fileName+fileExtension
          print 'Newname:', newname
          dst_dir = os.path.join(location, "fixed")
          src_file = os.path.join(location, each)
          shutil.copy(src_file, dst_dir)


          dst_file = os.path.join(dst_dir, each)
          new_dst_file_name = os.path.join(dst_dir, newname)
          os.rename(dst_file, new_dst_file_name)
          with open(new_dst_file_name, 'rb') as afile:
              buf = afile.read()
              hasher.update(buf)
          for i in bad_hashes:
              if hasher.hexdigest() == i[0]:
                  print 'This file hash matches bad hash', i[0], '. This file should be called:', i[1]
  elif filesig_hex == '47494638':
          print 'This file is GIF'
          fileExtension = '.gif'
          newname = fileName+fileExtension
          print 'Newname:', newname
          dst_dir = os.path.join(location, "fixed")
          src_file = os.path.join(location, each)
          shutil.copy(src_file, dst_dir)


          dst_file = os.path.join(dst_dir, each)
          new_dst_file_name = os.path.join(dst_dir, newname)
          os.rename(dst_file, new_dst_file_name)
          with open(new_dst_file_name, 'rb') as afile:
              buf = afile.read()
              hasher.update(buf)
          for i in bad_hashes:
              if hasher.hexdigest() == i[0]:
                  print 'This file hash matches bad hash', i[0], '. This file should be called:', i[1]


  elif filesig_hex.startswith('000001'):
          print 'This file is ICO'
          fileExtension = '.ico'
          newname = fileName+fileExtension
          print 'Newname:', newname
          dst_dir = os.path.join(location, "fixed")
          src_file = os.path.join(location, each)
          shutil.copy(src_file, dst_dir)


          dst_file = os.path.join(dst_dir, each)
          new_dst_file_name = os.path.join(dst_dir, newname)
          os.rename(dst_file, new_dst_file_name)
          with open(new_dst_file_name, 'rb') as afile:
              buf = afile.read()
              hasher.update(buf)
          for i in bad_hashes:
              if hasher.hexdigest() == i[0]:
                  print 'This file hash matches bad hash', i[0], '. This file should be called:', i[1]


  elif filesig_hex.startswith('89504e'):
          print 'This file is PNG'
          fileExtension = '.png'
          newname = fileName+fileExtension
          print 'Newname:', newname
          dst_dir = os.path.join(location, "fixed")
          src_file = os.path.join(location, each)
          shutil.copy(src_file, dst_dir)


          dst_file = os.path.join(dst_dir, each)
          new_dst_file_name = os.path.join(dst_dir, newname)
          os.rename(dst_file, new_dst_file_name)
          with open(new_dst_file_name, 'rb') as afile:
              buf = afile.read()
              hasher.update(buf)
          for i in bad_hashes:
              if hasher.hexdigest() == i[0]:
                  print 'This file hash matches bad hash', i[0], '. This file should be called:', i[1]
                 
  elif filesig_hex.startswith('d0cf11'):
          print 'This file is DOC'
          fileExtension = '.doc'
          newname = fileName+fileExtension
          print 'Newname:', newname
          dst_dir = os.path.join(location, "fixed")
          src_file = os.path.join(location, each)
          shutil.copy(src_file, dst_dir)


          dst_file = os.path.join(dst_dir, each)
          new_dst_file_name = os.path.join(dst_dir, newname)
          os.rename(dst_file, new_dst_file_name)
          with open(new_dst_file_name, 'rb') as afile:
              buf = afile.read()
              hasher.update(buf)
          for i in bad_hashes:
              if hasher.hexdigest() == i[0]:
                  print 'This file hash matches bad hash', i[0], '. This file should be called:', i[1]
                 
  elif filesig_hex == '504b0304':
          print 'This file is DOCX'
          fileExtension = '.docx'
          newname = fileName+fileExtension
          print 'Newname:', newname
          dst_dir = os.path.join(location, "fixed")
          src_file = os.path.join(location, each)
          shutil.copy(src_file, dst_dir)


          dst_file = os.path.join(dst_dir, each)
          new_dst_file_name = os.path.join(dst_dir, newname)
          os.rename(dst_file, new_dst_file_name)
          with open(new_dst_file_name, 'rb') as afile:
              buf = afile.read()
              hasher.update(buf)
          for i in bad_hashes:
              if hasher.hexdigest() == i[0]:
                  print 'This file hash matches bad hash', i[0], '. This file should be called:', i[1]


  elif filesig_hex.startswith('7b5c72'):
          print 'This file is RTF'
          fileExtension = '.rtf'
          newname = fileName+fileExtension
          print 'Newname:', newname
          dst_dir = os.path.join(location, "fixed")
          src_file = os.path.join(location, each)
          shutil.copy(src_file, dst_dir)
          dst_file = os.path.join(dst_dir, each)
          new_dst_file_name = os.path.join(dst_dir, newname)
          os.rename(dst_file, new_dst_file_name)


          with open(new_dst_file_name, 'rb') as afile:
              buf = afile.read()
              hasher.update(buf)
          for i in bad_hashes:
              if hasher.hexdigest() == i[0]:
                  print 'This file hash matches bad hash', i[0], '. This file should be called:', i[1]
  elif filesig_hex.startswith('d0cf11'): #DOC and XLS share this OLE signature, so the DOC branch above normally catches it first
          print 'This file is XLS'
          fileExtension = '.xls'
          newname = fileName+fileExtension
          print 'Newname:', newname
          dst_dir = os.path.join(location, "fixed")
          src_file = os.path.join(location, each)
          shutil.copy(src_file, dst_dir)


          dst_file = os.path.join(dst_dir, each)
          new_dst_file_name = os.path.join(dst_dir, newname)
          os.rename(dst_file, new_dst_file_name)
          ##### need to copy file to new folder, rename it and then hash it
          with open(new_dst_file_name, 'rb') as afile:
              buf = afile.read()
              hasher.update(buf)
          for i in bad_hashes:
              if hasher.hexdigest() == i[0]:
                  print 'This file hash matches bad hash', i[0], '. This file should be called:', i[1]
         
  else:
      print 'File signature not recognised. Be wary'


# Function wget
def wget(url):
   
 # Try to retrieve a webpage via its url, and return its contents
 print '[*] wget()'
 # open file like url object from web, based on url
 url_file = urllib.urlopen(url)
 # get webpage contents
 page = url_file.read()
  # return the page contents (this is needed - main() uses it)
 return page


def main(url):
 #Set page equal to the output of the wget function
 page = wget(url)


 #Call the forensic_analysis function and pass the arguements url and page
 forensic_analysis(url, page)


if __name__ == '__main__':
 main(raw_input('Please enter the URL (including http://): '))

Note that before performing forensic analysis you should make sure you have downloaded some files (option 5)...

Forensic analysis works by checking the file signature to ensure that the file really is what its extension claims. If, for example, a file is called file1.jpg, the script reads file1.jpg's signature to see whether it actually is a JPG. If it is, it simply reports that the file is a JPG; if it isn't, it tells the user the file is not really a JPG and identifies the actual file type from its signature.
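
The signature checks above are all copies of the same branch; as the comments in forensic_analysis.py admit, they could be collapsed into one loop over a signature table. A rough sketch of that refactoring (not the code above, just an illustration reusing the same example signatures):
Code: [Select]
# Sketch: map header signatures to extensions once, then treat every file the same way.
import binascii

SIGNATURES = [
    ('ffd8ffe0', '.jpg'), ('ffd8ffe1', '.jpg'), ('424d', '.bmp'),
    ('47494638', '.gif'), ('89504e', '.png'), ('d0cf11', '.doc'),
    ('504b0304', '.docx'), ('7b5c72', '.rtf'),
]

def identify(path):
    # read the first four header bytes and return the matching extension, or None if unknown
    with open(path, 'rb') as fh:
        filesig_hex = binascii.hexlify(fh.read(4))
    for sig, ext in SIGNATURES:
        if filesig_hex.startswith(sig):
            return ext
    return None
The copy/rename/hash steps would then only need to be written once, using whatever extension identify() returns.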

I haven't looked at the code for a couple of months, but if you have any issues with it I am more than happy to help. Feel free to modify it for your personal use.

To download all the scripts:

Download
« Last Edit: January 12, 2014, 06:59:20 pm by dsme »

Offline Kulverstukas

  • Administrator
  • Zeus
  • *
  • Posts: 6627
  • Cookies: 542
  • Fascist dictator
    • View Profile
    • My blog
Re: Get webpage content script
« Reply #1 on: January 12, 2014, 07:46:11 pm »
If "# Python sucks", why did you code in it? also where the HELL are the conventions? I see none. Variable names? sooooo descriptive... and why I have to ensure a directory exists? can't the code do it?

Basically a cool initiative, but really sloppy and awful code. Do you even lift, bro?

Offline Deque

  • P.I.N.N.
  • Global Moderator
  • Overlord
  • *
  • Posts: 1203
  • Cookies: 518
  • Programmer, Malware Analyst
    • View Profile
Re: Get webpage content script
« Reply #2 on: January 12, 2014, 08:10:45 pm »
Python has a HTMLParser module. Consider using this instead of regex.
Some people can get pretty mad about this, see here: https://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags
 ;) (for actual reasons read the answers below)
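
For example, a rough sketch of a link collector built on HTMLParser (just a sketch; the URL is a placeholder):
Code: [Select]
# Sketch: collect href values with the stdlib HTMLParser instead of a regex (Python 2).
from HTMLParser import HTMLParser
import urllib

class LinkCollector(HTMLParser):
    def __init__(self):
        HTMLParser.__init__(self)
        self.links = []

    def handle_starttag(self, tag, attrs):
        # every <a ... href="..."> start tag contributes one link
        if tag == 'a':
            for name, value in attrs:
                if name == 'href':
                    self.links.append(value)

page = urllib.urlopen('http://example.com').read()  # placeholder URL
collector = LinkCollector()
collector.feed(page)
for link in sorted(collector.links):
    print link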

Btw, there is also a natural language processing library that can find e-mail addresses, telephone numbers and names: http://nltk.org/
Might be too much for this simple tool, though.

Code: [Select]
MAKE SURE DIRECTORY C:\temp exists.
This is a fail. You use a platform-independent language, but hardcode a Windows path.
Use relative paths and create the folder if it doesn't exist.
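
Something along these lines at the top of the download script would cover both points (a sketch, not your code):
Code: [Select]
# Sketch: keep downloads in a folder next to the script and create it if it is missing.
import os

location = os.path.join(os.path.dirname(os.path.abspath(__file__)), 'coursework')
if not os.path.isdir(location):
    os.makedirs(location)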

Code: [Select]
MUST INCLUDE HTTP://
This is also something you can check for and fix within two lines instead of screaming at the user about what to do.
if string doesn't start with http: prepend http to string
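
E.g. (sketch):
Code: [Select]
# Sketch: accept URLs with or without the scheme instead of demanding http://.
url = raw_input('Please enter the URL: ').strip()
if not url.lower().startswith(('http://', 'https://')):
    url = 'http://' + url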

Code: [Select]
### This code will work perfectly on Unix systems as the way they work
No, it won't.
« Last Edit: January 12, 2014, 08:13:00 pm by Deque »

Offline dsme

  • /dev/null
  • *
  • Posts: 19
  • Cookies: 3
    • View Profile
Re: Get webpage content script
« Reply #3 on: January 12, 2014, 10:17:56 pm »

If "# Python sucks", why did you code in it? also where the HELL are the conventions? I see none. Variable names? sooooo descriptive... and why I have to ensure a directory exists? can't the code do it?Basically cool initiative, but a really sloppy and awful code. Do you even lift, bro?


"# Python sucks" was a bit of banter to lighten the mood; it was a coursework assignment. I'm not a coder, and yeah, some of the variables aren't very descriptive, but it works.


It could, but it doesn't; I didn't implement that. I did state that it can most definitely be improved, but I thought it was worth sharing.

No I don't.


Python has a HTMLParser module. Consider using this instead of regex.
Some people can get pretty mad about this, see here: https://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags
 ;) (for actual reasons read the answers below)

Btw, there is also a natural language processing library that can find e-mail addresses, telephone numbers and names: http://nltk.org/
Might be too much for this simple tool, though.

Code: [Select]
MAKE SURE DIRECTORY C:\temp exists.

This is a fail. You use a platform-independent language, but hardcode a Windows path.
Use relative paths and create the folder if it doesn't exist.

Code: [Select]
MUST INCLUDE HTTP://

This is also something you can check for and fix within two lines instead of screaming at the user about what to do.
if string doesn't start with http: prepend http to string

Code: [Select]
### This code will work perfectly on Unix systems as the way they work

No, it won't.



I'm balls with regular expressions, so I just searched Google for something; it worked, so I used it.


As I said above, I didn't implement a check for C:\temp in the code. There is plenty of room for improvement  :) . The script does everything the handout asked it to do.


I thought it would have definitely worked.
« Last Edit: January 12, 2014, 10:18:53 pm by dsme »