Author Topic: IMDB Movie Scraper (Read 1980 times)

shimomura · « **on:** February 26, 2015, 02:31:26 am »

I'm always looking for new movies to watch, so I wrote an elementary command line script that scrapes IMDB for the current top 25 movies in any genre. It's faster than looking through the site itself. If people like the idea and would use it, I'll probably convert it into a web app that has more functionality. I attached the script if you wanna try it, even though most of you are obviously capable of writing a way better one yourselves.

cyberdrifter · « **Reply #1 on:** February 26, 2015, 03:00:33 am »

Not bad, considering it's a loop menu that links to specific URLs...

However I think it could be more useful if it could link to searches at torrent sites using outputs in this program as input for a url string...
consider the search here:
http://kickass.to/usearch/%22Big%20Game%22/

and maybe even link to a movie trailer and provide critique reports...

It wouldn't be hard to parse your results into such url links

Jus sayin.
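
Turning a scraped title into a link like that mostly comes down to percent-encoding it. A minimal sketch in Python 3 (in Python 2 the same function lives at `urllib.quote`); `build_search_url` and the `example.com` base are made-up names for illustration and are not part of the attached script:

```python
from urllib.parse import quote

def build_search_url(base, title):
    """Return base + the title wrapped in double quotes and percent-encoded.

    `base` is a hypothetical search endpoint that ends in a slash.
    """
    # safe="" makes quote() escape every reserved character, including "/"
    return base + quote('"%s"' % title, safe="") + "/"

print(build_search_url("http://example.com/search/", "Big Game"))
# prints http://example.com/search/%22Big%20Game%22/
```

Encoding with `quote` rather than a bare space replacement also covers characters such as `&` or `/` that would otherwise break the URL.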

HTH · « **Reply #2 on:** February 26, 2015, 03:27:42 am »

Code: [Select]

import urllib 
import re 
 
#Rejected Strings To Exclude In The Found Titles 
reject = ["Register or login to rate this title","Go to IMDbPro","PG_13","R","Delete","PG","image of title","Subscribe","image of ","G","TV_PG","image of character","image of name","TV_14"] 
 
genre_dic = {1:"Action",2:"Adventure",3:"Animation",4:"Biography",5:"Comedy",6:"Crime",7:"Documentary",8:"Drama",9:"Family",10:"Fantasy",11:"Film_Noir",12:"History",13:"Horror",14:"Music",15:"Musical",16:"Mystery",17:"Romance",18:"Sci_Fi",19:"Sport",20:"Thriller",21:"War",22:"Western"} 
 
#Scrapes Page For Movie Titles and Prints Them 
#To Do: Replace Ascii Characters With Human Readable Text! 
def find_list(url): 
    title_list = [] 
    opened_url = urllib.urlopen(url) 
    url_read = opened_url.read() 
 
    titles_reg = 'title="(.+?)"' 
    titles_comp = re.compile(titles_reg) 
    titles_found = re.findall(titles_comp,url_read) 
 
    print "The Top 25 Titles Found In "+genre_dic[user_choice]+" Are:" 
    print 
    for titles in titles_found: 
        if titles not in reject: 
            if titles not in title_list: 
                if titles[:4] != "User": 
                    #urllib.quote escapes spaces and any other characters that are unsafe in a URL
                    print "%s : http://kickass.to/usearch/%s" % (titles, urllib.quote(titles[:-7]))
                    title_list.append(titles) 
    print "--------------------------------------" 
    print 
     
#Intro & Help 
print 'Welcome to the IMDB movie finder; A faster way to find movies.' 
print 'Please select a number to view the corresponding genre' 
#Print the menu from the dictionary keys so the numbers always match the lookup
for each in sorted(genre_dic):
    print "%s : %d" % (genre_dic[each], each)
print 
 
#User Selects Genre 
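def genre_sel():
    global user_choice
    user_choice = raw_input("Enter number here or type 'exit': ")
    if user_choice == 'exit':
        exit()
    try:
        user_choice = int(user_choice)
    except ValueError:
        #Non-numeric input: 0 is never a valid key, so it is reported as invalid below
        user_choice = 0
    print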
 
while True: 
    genre_sel() 
    if user_choice <= (len(genre_dic)) and user_choice > 0: 
        find_list("http://www.imdb.com/genre/%s/?ref_=gnr_mn_ac_mp" % (genre_dic[user_choice]).lower())  
    else: 
        print "Invalid Selection!"

Messy, but a little more concise, and you can add more genres just by adding them to the dictionary. Added in Cyber's suggestion as well (rough and quick).

Really it was a good script but those extra 100 lines from not looping hurt me
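
For anyone curious what the dictionary buys you: the menu and the input check can both be derived from it, so adding a genre is a one-line change. A rough Python 3 sketch (the script above is Python 2); `menu_lines` and `parse_choice` are hypothetical helpers, not functions from the posted script, and the genre list is shortened:

```python
GENRES = {1: "Action", 2: "Adventure", 3: "Animation"}  # shortened for the example

def menu_lines(genres):
    """Build one 'Name : number' line per genre, ordered by number."""
    return ["%s : %d" % (genres[key], key) for key in sorted(genres)]

def parse_choice(text, genres):
    """Return the genre name for a typed menu number, or None if it is invalid."""
    try:
        key = int(text.strip())
    except ValueError:
        return None  # not a number at all
    return genres.get(key)  # None when the number is not in the menu

print("\n".join(menu_lines(GENRES)))
print(parse_choice("2", GENRES))    # a valid menu number
print(parse_choice("abc", GENRES))  # invalid input is rejected instead of crashing
```

Checking membership in the dictionary, rather than comparing against its length, also rejects zero and negative numbers.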

shimomura · « **Reply #3 on:** February 26, 2015, 03:47:55 am »

Good call guys, definitely awesome ideas. I'm thinking I should extend the scrape to look through several pages for each genre as well; why stop at 25, right? Short descriptions would be helpful too. lol I just hate looking through the other sites for titles because most of them are cluttered with shitty ads, or worse, they have horrible resource management and display one title per page. It's just generally a fucking waste of our time to find one decent title.

cyberdrifter · « **Reply #4 on:** February 26, 2015, 03:53:11 am »

Did you test this code?

Quote from: HTH on February 26, 2015, 03:27:42 am

Code: [Select]
import urllib
import re

#Rejected Strings To Exclude In The Found Titles
reject = ["Register or login to rate this title","Go to IMDbPro","PG_13","R","Delete","PG","image of title","Subscribe","image of ","G","TV_PG","image of character","image of name","TV_14"]

genre_dic = {1:"Action",2:"Adventure",3:"Animation",4:"Biography",5:"Comedy",6:"Crime",7:"Documentary",8:"Drama",9:"Family",10:"Fantasy",11:"Film_Noir",12:"History",13:"Horror",14:"Music",15:"Musical",16:"Mystery",17:"Romance",18:"Sci_Fi",19:"Sport",20:"Thriller",21:"War",22:"Western"}

#Scrapes Page For Movie Titles and Prints Them
#To Do: Replace Ascii Characters With Human Readable Text!
def find_list(url):
    title_list = []
    opened_url = urllib.urlopen(url)
    url_read = opened_url.read()

    titles_reg = 'title="(.+?)"'
    titles_comp = re.compile(titles_reg)
    titles_found = re.findall(titles_comp,url_read)

    print "The Top 25 Titles Found In "+genre_dic[user_choice]+" Are:"
    print
    for titles in titles_found:
        if titles not in reject:
            if titles not in title_list:
                if titles[:4] != "User":
                    print "%s : [url]http://kickass.to/usearch/%s[/url]" % (titles, titles[:-7].replace(" ","%20"))
                    title_list.append(titles)
    print "--------------------------------------"
    print

#Intro & Help
print 'Welcome to the IMDB movie finder; A faster way to find movies.'
print 'Please select a number to view the corresponding genre'
count = 1
for each in genre_dic:
    print "%s : %d" % (genre_dic[each], count)
    count += 1
print

#User Selects Genre
def genre_sel():
    global user_choice
    user_choice = raw_input("Enter number here or type 'exit': ")
    if user_choice != 'exit':
        user_choice = int(user_choice)
        print
    else:
        exit()

while True:
    genre_sel()
    if user_choice <= (len(genre_dic)) and user_choice > 0:
        find_list("[url]http://www.imdb.com/genre/%s/?ref_=gnr_mn_ac_mp[/url]" % (genre_dic[user_choice]).lower())
    else:
        print "Invalid Selection!"

Messy, but a little more concise, and you can add more genres just by adding them to the dictionary. Added in Cyber's suggestion as well (rough and quick).

Really it was a good script but those extra 100 lines from not looping hurt me

HTH · « **Reply #5 on:** February 26, 2015, 03:56:12 am »

bahaha yeah, the forum added in the [url] tags when I copy-pasted without first adding code tags around where I was pasting.

I removed them from the post and retested to make sure it didn't fuck anything else up, works fine now.
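
If pasted code ever picks up those tags again, they can be stripped mechanically before running it. A small Python 3 sketch; `strip_url_tags` is a made-up helper, and it assumes the tags appear exactly as `[url]` and `[/url]`, the way they do in the quoted post:

```python
import re

def strip_url_tags(text):
    """Remove bare [url] and [/url] BBCode tags, keeping the link text itself."""
    return re.sub(r"\[/?url\]", "", text)

line = 'find_list("[url]http://www.imdb.com/genre/%s/?ref_=gnr_mn_ac_mp[/url]" % genre)'
print(strip_url_tags(line))
```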

cyberdrifter · « **Reply #6 on:** February 26, 2015, 04:10:48 am »

Yep, works like a charm now.

I think I'll actually be using this now...

Polish this up man, and I think it could be a pretty sweet little script. Good work guys.

+1
+1

shimomura · « **Reply #7 on:** February 26, 2015, 04:16:02 am »

Yeah I also caught that but yeah worked fine. I'll have to look into text formatting modules so I can make the output look nicer because I like the idea of just keeping it a command line interface and keeping it as efficient as possible.
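
On the text formatting point: a dedicated module may not even be needed, since most Unix terminals understand ANSI escape codes directly (older Windows consoles may print them as garbage). A minimal Python 3 sketch under that assumption; `style` and `format_entry` are hypothetical names and the title is a placeholder:

```python
BOLD = "\033[1m"
CYAN = "\033[36m"
RESET = "\033[0m"

def style(text, *codes):
    """Wrap text in the given ANSI codes and reset the terminal afterwards."""
    return "".join(codes) + text + RESET

def format_entry(number, title):
    """Format one result line: bold label, cyan title."""
    return "%s %s" % (style("Movie Number %d:" % number, BOLD), style(title, CYAN))

print(format_entry(1, "Example Title (2015)"))
```

Keeping the codes in one helper means the plain output is easy to restore if a terminal does not support them.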

cyberdrifter · « **Reply #8 on:** February 26, 2015, 04:27:37 am »

Quote from: shimomura on February 26, 2015, 04:16:02 am

Yeah I also caught that but yeah worked fine. I'll have to look into text formatting modules so I can make the output look nicer because I like the idea of just keeping it a command line interface and keeping it as efficient as possible. If someone can't figure out how to run it then I don't want that person using our super duper uber elite tool anyways lol

I like where you're taking it, and I'll be watching for updates.

Possible suggestions
1. selecting the best quality .torrent file and automatically downloading it to a specified folder, this could be a bit trickier to implement.
2. have a short description of the top 25
3. Add a critic rating on each of the top 25

Thanks for the effort.

shimomura · « **Reply #9 on:** February 28, 2015, 06:17:27 pm »

Updates:
- Short description for each title.
- Refined torrent URL search.

Still needed, if anyone wants to add on:
- Text formatting within the terminal (different colors, bold text, etc.).
- Critic or user rating for each title (should be an easy one).
- Replace ASCII character codes that are returned in titles.

Code: [Select]

import urllib
import re

#Rejected Strings To Exclude In The Found Titles (Review 7,8,11,12,13,14,15,16,19,20,21)
reject = ["Register or login to rate this title","Go to IMDbPro","PG_13","R","Delete","PG","image of title","Subscribe","image of ","G","TV_PG","image of character","image of name","TV_14","NOT_RATED","TV_MA"]

genre_dic = {1:"Action",2:"Adventure",3:"Animation",4:"Biography",5:"Comedy",6:"Crime",7:"Documentary",8:"Drama",9:"Family",10:"Fantasy",11:"Film-Noir",12:"History",13:"Horror",14:"Music",15:"Musical",16:"Mystery",17:"Romance",18:"Sci-Fi",19:"Sport",20:"Thriller",21:"War",22:"Western"}

#Scrapes Page For Movie Titles and Prints Them
#To Do: Replace Ascii Characters With Human Readable Text!
def find_list(url):
    movie_count = 1
    dis_count = 0
    
    title_list = []
    opened_url = urllib.urlopen(url)
    url_read = opened_url.read()

    descrip = '<span class="outline">(.+?)<'
    descrip_comp = re.compile(descrip)
    descrip_found = re.findall(descrip_comp,url_read)

    titles_reg = 'title="(.+?)"'
    titles_comp = re.compile(titles_reg)
    titles_found = re.findall(titles_comp,url_read)

    print "The Top 25 Titles Found In "+genre_dic[user_choice]+" Are:"
    print
    for titles in titles_found:
        if titles not in reject:
            if titles not in title_list:
                if titles[:4] != "User":
                    #Strip the trailing " (year)" to get a bare name for the search URL
                    name = re.findall('(.+?) \(', titles)
                    name = name[0] if name else titles
                    print "Movie Number "+str(movie_count)+":"
                    print "Title: "+titles
                    #Guard against pages with fewer descriptions than titles
                    if dis_count < len(descrip_found):
                        print "Description: "+descrip_found[dis_count]
                    print "Torrent URL: http://kickass.to/usearch/"+urllib.quote(name)+"%20category:movies%20lang_id:2/"
                    print "--------------------------------------"
                    title_list.append(titles)
                    dis_count += 1
                    movie_count += 1
    print

#Intro & Help
print 'Welcome to the IMDB movie finder; A faster way to find movies.'
print 'Please select a number to view the corresponding genre'
print
print 'Action: 1\nAdventure: 2\nAnimation: 3\nBiography: 4\nComedy: 5\nCrime: 6\nDocumentary: 7\nDrama: 8\nFamily: 9\nFantasy: 10\nFilm-Noir: 11\nHistory: 12\nHorror: 13\nMusic: 14\nMusical: 15\nMystery: 16\nRomance: 17\nSci-Fi: 18\nSport: 19\nThriller: 20\nWar: 21\nWestern: 22\n'
print

#User Selects Genre
def genre_sel():
    global user_choice
    user_choice = raw_input("Enter number here or type 'exit': ")
    if user_choice == 'exit':
        exit()
    try:
        user_choice = int(user_choice)
    except ValueError:
        #Non-numeric input: 0 is never a valid key, so it is reported as invalid below
        user_choice = 0
    print

#Processes User Selection (a loop instead of recursion, so long sessions can't hit the recursion limit)
def program_loop():
    while True:
        genre_sel()
        #Membership test also rejects 0 and negative numbers
        if user_choice in genre_dic:
            url = 'http://www.imdb.com/genre/'+genre_dic[user_choice].lower()+'/?ref_=gnr_mn_ac_mp'
            find_list(url)
        else:
            print "Invalid Selection!"

program_loop()
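
On the "replace ascii characters" to-do: assuming the stray codes are HTML character references such as `&#x27;` or `&amp;` (the thread never shows one, so this is a guess), the standard library can decode them. The sketch is Python 3, where the helper is `html.unescape`; `clean_title` and the sample string are made up for illustration:

```python
import html

def clean_title(raw):
    """Decode HTML character references in a scraped title and trim whitespace."""
    return html.unescape(raw).strip()

print(clean_title("Someone&#x27;s Movie &amp; Friends (2015)"))
# prints Someone's Movie & Friends (2015)
```

Running each scraped title through a helper like this before printing would keep the decoding in one place.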

Pillus · « **Reply #10 on:** March 01, 2015, 09:19:25 am »

May I make a quick suggestion? If you're going to do anything with torrents for movie titles, the much better choice is the YTS API. YTS delivers all the new and cool stuff, all in full HD, and everything has subtitles.

YTS > Kickass :3

API For reference:
https://yts.re/api

shome · « **Reply #11 on:** March 03, 2015, 05:22:00 am »

Didn't test it, but I like the idea behind it. I'm an aspiring python-er myself. Good work.

+1

Nice username btw. heh

EvilZone
