I recently finished up this basic web scraping script.
Vertor is a BitTorrent site that verifies uploads and scans them for malware, DRM and password-protected archives before adding them to its database. As such, it can serve as an additional torrent security measure: you can check whether a given torrent file is legitimate by querying its database for a result.
This was my first attempt at web scraping and at BeautifulSoup 4, so it may be somewhat kludgy. If you don't have BeautifulSoup installed, run: pip install beautifulsoup4, or download it from the official website and run setup.py.
The script takes a search query through sys.argv[1], sends it to the Vertor database, rips the number of results from an <h1> tag, isolates the integer with a regex, and then displays all the references to the query in the page source. Vertor's markup is not tidy, to say the least, so I had to stick with this approach. Given that this was mostly written out of boredom and for practice, I suppose it is a fair trade.
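The integer-isolation step is just a digit regex run over the heading text. A minimal sketch (the heading string here is invented; the real Vertor markup may word it differently):

```python
import re

# Hypothetical <h1> text of the kind the script scrapes.
heading = "12 torrents found"

# Pull the first run of digits out of the heading, as the script's regex does.
match = re.search(r"\d+", heading)
count = int(match.group()) if match else 0
print(count)  # 12
```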
Vertor treats queries whose words are separated by periods the same as multiple-word queries, and I noted that in the source code as a way around sys.argv splitting a query into one argument per word.
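That trick can also be automated with a one-line helper; to_vertor_query is a name I made up for illustration:

```python
def to_vertor_query(words):
    # Join space-separated terms with dots so the whole query fits in one
    # sys.argv slot; Vertor treats "video.game" the same as "video game".
    return ".".join(words.split())

print(to_vertor_query("video game"))  # video.game
```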
Source:
#!/usr/bin/env python
# Simple script that sends a search query to Vertor's database and scrapes the important information.
# This can be used as an additional torrent security measure.
# by vezzy of evilzone.org
# NOTE: If you want to input a multiple-parameter argument, e.g. Video Game, concatenate it with dots, as in Video.Game
# Vertor will process the query in the same way, regardless.
import sys
import re
from urllib2 import urlopen
from bs4 import BeautifulSoup
STARTUP_URL = "http://www.vertor.com/index.php?mod=search&search=&cid=0&words="
def usage():
    print "/Vertor Checker/"
    print "Usage: python vertor.py <search query>"
    sys.exit()
try:
    query = sys.argv[1]
except IndexError:
    usage()
def get_result_number():
    print "Sending query to Vertor...\n"
    search = BeautifulSoup(urlopen(STARTUP_URL + query).read())
    # Collect the text of every <h1> tag (skipping tags with no string content)
    h1_text = " ".join(h1.string for h1 in search.findAll("h1") if h1.string)
    # Regular expression to separate the integer from the rest of the <h1> text
    match = re.search(r'\d+', h1_text)
    result_number = match.group() if match else "0"
    print "Found " + result_number + " result(s) for " + query + " (with double the number of references in source)"
    if match:
        print "Torrent release appears to be verified and safe."
    print "Query references in first page of source:\n"
def get_result_list():
    search = BeautifulSoup(urlopen(STARTUP_URL + query).read())
    directory = "/torrents"
    for a in search.findAll("a"):
        # a.get() avoids a KeyError on anchors that have no href attribute
        href = a.get("href", "")
        if directory in href:
            parts = href.split("/")
            torrent_file = parts[2], parts[3]
            print torrent_file
if __name__ == '__main__':
    try:
        get_result_number()
        get_result_list()
    except KeyboardInterrupt:
        sys.exit("\nProcess aborted.")
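For anyone who wants to poke at the link-splitting step without hitting the site, here is a sketch that does the same filtering on made-up markup, using only the standard library's html.parser instead of BeautifulSoup. The snippet and the /torrents URL layout are assumptions; the real Vertor pages may differ:

```python
from html.parser import HTMLParser

# Invented markup resembling a results page; the real layout may differ.
SNIPPET = """
<a href="/torrents/12345/Some.Linux.ISO">Some.Linux.ISO</a>
<a href="/about">About</a>
"""

class TorrentLinkParser(HTMLParser):
    def __init__(self):
        super().__init__()
        self.hits = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href", "")
        if "/torrents" in href:
            parts = href.split("/")
            # parts[2] is the id, parts[3] the release name (assumed layout)
            self.hits.append((parts[2], parts[3]))

parser = TorrentLinkParser()
parser.feed(SNIPPET)
print(parser.hits)  # [('12345', 'Some.Linux.ISO')]
```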
Criticism and suggestions on optimization would be much appreciated.