Author Topic: IMDB Movie Scraper  (Read 1329 times)


Offline shimomura

  • Peasant
  • *
  • Posts: 57
  • Cookies: 0
    • View Profile
    • Shanaynay
IMDB Movie Scraper
« on: February 26, 2015, 02:31:26 am »
I'm always looking for new movies to watch, so I wrote an elementary command line script that scrapes IMDB for the current top 25 movies in any genre. It's faster than looking through the site itself. If people like the idea and would use it, I'll probably convert it into a web app that has more functionality. I attached the script if you wanna try it, even though most of you are obviously capable of writing a way better one yourselves.
« Last Edit: February 26, 2015, 02:32:07 am by shimomura »
Who gives a fuck what color the dress is...

Offline cyberdrifter

  • Knight
  • **
  • Posts: 176
  • Cookies: -90
    • View Profile
Re: IMDB Movie Scraper
« Reply #1 on: February 26, 2015, 03:00:33 am »
Not bad, considering it's a loop menu that links to specific URLs...

However, I think it could be more useful if it linked to searches at torrent sites, using this program's output as input for a URL string...
Consider the search here:
http://kickass.to/usearch/%22Big%20Game%22/

and maybe even link to a movie trailer and provide critic reviews...

It wouldn't be hard to parse your results into such URL links.

Jus sayin.
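
A rough sketch of that idea, written in Python 3 rather than the Python 2 used elsewhere in this thread. `search_url` and `SEARCH_BASE` are made-up names, and the URL pattern is simply copied from the example link above, so treat it as an assumption about how the site builds its search URLs:

```python
from urllib.parse import quote

# Assumption: base URL pattern copied from the example search link in this post.
SEARCH_BASE = "http://kickass.to/usearch/"

def search_url(title, base=SEARCH_BASE):
    """Return a search URL that does an exact-phrase match on `title`."""
    # Wrap the title in double quotes for an exact-phrase search, then
    # percent-encode it (quote() turns '"' into %22 and ' ' into %20).
    return base + quote('"%s"' % title, safe="") + "/"

print(search_url("Big Game"))
```

Each title the scraper prints could be passed through a helper like this to get a clickable link next to it.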
« Last Edit: February 26, 2015, 03:02:48 am by cyberdrifter »
.- / .-.. .. - - .-.. . / -... . - - . .-. --..-- / . ...- . .-. -.-- / -.. .- -.-- .-.-.-
Go ahead tubby, you clearly want/need those cookies more than me.  :P

Offline HTH

  • Official EZ Slut
  • Administrator
  • Knight
  • *
  • Posts: 395
  • Cookies: 158
  • EZ Titan
    • View Profile
Re: IMDB Movie Scraper
« Reply #2 on: February 26, 2015, 03:27:42 am »
Code: [Select]
import urllib
import re
 
#Rejected Strings To Exclude In The Found Titles
reject = ["Register or login to rate this title","Go to IMDbPro","PG_13","R","Delete","PG","image of title","Subscribe","image of ","G","TV_PG","image of character","image of name","TV_14"]
 
genre_dic = {1:"Action",2:"Adventure",3:"Animation",4:"Biography",5:"Comedy",6:"Crime",7:"Documentary",8:"Drama",9:"Family",10:"Fantasy",11:"Film_Noir",12:"History",13:"Horror",14:"Music",15:"Musical",16:"Mystery",17:"Romance",18:"Sci_Fi",19:"Sport",20:"Thriller",21:"War",22:"Western"}
 
#Scrapes Page For Movie Titles and Prints Them
#To Do: Replace Ascii Characters With Human Readable Text!
def find_list(url):
    title_list = []
    opened_url = urllib.urlopen(url)
    url_read = opened_url.read()
 
    titles_reg = 'title="(.+?)"'
    titles_comp = re.compile(titles_reg)
    titles_found = re.findall(titles_comp,url_read)
 
    print "The Top 25 Titles Found In "+genre_dic[user_choice]+" Are:"
    print
    for titles in titles_found:
        if titles not in reject:
            if titles not in title_list:
                if titles[:4] != "User":
                    print "%s : http://kickass.to/usearch/%s" % (titles, titles[:-7].replace(" ","%20"))
                    title_list.append(titles)
    print "--------------------------------------"
    print
     
#Intro & Help
print 'Welcome to the IMDB movie finder; A faster way to find movies.'
print 'Please select a number to view the corresponding genre'
for number in sorted(genre_dic):
    #Print the dictionary key itself so the menu number always matches the lookup
    print "%s : %d" % (genre_dic[number], number)
print
 
#User Selects Genre
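def genre_sel():
    global user_choice
    user_choice = raw_input("Enter number here or type 'exit': ")
    if user_choice == 'exit':
        exit()
    try:
        user_choice = int(user_choice)
    except ValueError:
        user_choice = 0  #Non-numeric input is reported as an invalid selection by the main loop
    print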
 
while True:
    genre_sel()
    if user_choice <= (len(genre_dic)) and user_choice > 0:
        find_list("http://www.imdb.com/genre/%s/?ref_=gnr_mn_ac_mp" % (genre_dic[user_choice]).lower()) 
    else:
        print "Invalid Selection!"

Messy, but a bit more concise, and you can add more genres if you want just by adding them to the dictionary. Added in Cyber's suggestion as well (rough and quick).

Really it was a good script, but those extra 100 lines from not looping hurt me.
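
To illustrate the "just add it to the dictionary" point, here is a small Python 3 sketch (the thread's script is Python 2). `menu_lines` is a hypothetical helper, not part of the posted script, and the dictionary is trimmed to a few entries:

```python
# Trimmed copy of the genre dictionary from the script above.
genre_dic = {1: "Action", 2: "Adventure", 3: "Animation"}

def menu_lines(genres):
    """Build one 'Name : number' line per genre, ordered by menu number."""
    return ["%s : %d" % (genres[number], number) for number in sorted(genres)]

# Adding a genre is a one-line change; the menu picks it up automatically.
genre_dic[4] = "Biography"
print("\n".join(menu_lines(genre_dic)))
```

Because the menu is generated from the same dictionary that drives the URL lookup, the two cannot drift apart the way a hand-typed menu can.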
« Last Edit: February 26, 2015, 03:56:56 am by HTH »
<ande> HTH is love, HTH is life
<TurboBorland> hth is the only person on this server I can say would successfully spitefuck peoples women

Offline shimomura

Re: IMDB Movie Scraper
« Reply #3 on: February 26, 2015, 03:47:55 am »
Good call guys, definitely awesome ideas. I'm thinking I should extend the scrape to look through several pages for each genre as well; why stop at 25, right? Short descriptions would be helpful too. lol I just hate looking through the other sites for titles because most of them are cluttered with shitty ads or, worse, they have horrible resource management and display one title per page, so it's generally a fucking waste of time to find one decent title.

Offline cyberdrifter

Re: IMDB Movie Scraper
« Reply #4 on: February 26, 2015, 03:53:11 am »

Did you test this code?

Code: [Select]
import urllib
import re
 
#Rejected Strings To Exclude In The Found Titles
reject = ["Register or login to rate this title","Go to IMDbPro","PG_13","R","Delete","PG","image of title","Subscribe","image of ","G","TV_PG","image of character","image of name","TV_14"]
 
genre_dic = {1:"Action",2:"Adventure",3:"Animation",4:"Biography",5:"Comedy",6:"Crime",7:"Documentary",8:"Drama",9:"Family",10:"Fantasy",11:"Film_Noir",12:"History",13:"Horror",14:"Music",15:"Musical",16:"Mystery",17:"Romance",18:"Sci_Fi",19:"Sport",20:"Thriller",21:"War",22:"Western"}
 
#Scrapes Page For Movie Titles and Prints Them
#To Do: Replace Ascii Characters With Human Readable Text!
def find_list(url):
    title_list = []
    opened_url = urllib.urlopen(url)
    url_read = opened_url.read()
 
    titles_reg = 'title="(.+?)"'
    titles_comp = re.compile(titles_reg)
    titles_found = re.findall(titles_comp,url_read)
 
    print "The Top 25 Titles Found In "+genre_dic[user_choice]+" Are:"
    print
    for titles in titles_found:
        if titles not in reject:
            if titles not in title_list:
                if titles[:4] != "User":
                    print "%s : [url]http://kickass.to/usearch/%s[/url]" % (titles, titles[:-7].replace(" ","%20"))
                    title_list.append(titles)
    print "--------------------------------------"
    print
     
#Intro & Help
print 'Welcome to the IMDB movie finder; A faster way to find movies.'
print 'Please select a number to view the corresponding genre'
count = 1
for each in genre_dic:
    print "%s : %d" % (genre_dic[each], count)
    count += 1
print
 
#User Selects Genre
def genre_sel():
    global user_choice
    user_choice = raw_input("Enter number here or type 'exit': ")
    if user_choice != 'exit':
        user_choice = int(user_choice)
        print
    else:
        exit()
 
while True:
    genre_sel()
    if user_choice <= (len(genre_dic)) and user_choice > 0:
        find_list("[url]http://www.imdb.com/genre/%s/?ref_=gnr_mn_ac_mp[/url]" % (genre_dic[user_choice]).lower()) 
    else:
        print "Invalid Selection!"


Offline HTH

Re: IMDB Movie Scraper
« Reply #5 on: February 26, 2015, 03:56:12 am »
Bahaha yeah, the forum added in the [url] tags when I copy-pasted without first putting code tags around where I was pasting.

I removed them from the post and retested to make sure it didn't fuck anything else up; works fine now.
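
If pasted code ever picks up those tags again, something like the following could clean it up. This is only a sketch in Python 3; `strip_url_tags` and the regex are my own illustration, and it assumes the forum inserts plain `[url]`, `[url=...]` and `[/url]` markers:

```python
import re

# Matches an opening [url] or [url=...] tag, or a closing [/url] tag.
URL_TAG = re.compile(r"\[/?url(?:=[^\]]*)?\]", re.IGNORECASE)

def strip_url_tags(text):
    """Remove BBCode url tags, keeping the text they wrapped."""
    return URL_TAG.sub("", text)

pasted = 'find_list("[url]http://www.imdb.com/genre/%s/[/url]" % genre)'
print(strip_url_tags(pasted))
```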
« Last Edit: February 26, 2015, 03:59:02 am by HTH »

Offline cyberdrifter

Re: IMDB Movie Scraper
« Reply #6 on: February 26, 2015, 04:10:48 am »
Yep, works like a charm now.


I think I'll actually be using this now...


Polish this up man, and I think it could be a pretty sweet little script. Good work guys.

+1

« Last Edit: February 26, 2015, 04:11:05 am by cyberdrifter »

Offline shimomura

Re: IMDB Movie Scraper
« Reply #7 on: February 26, 2015, 04:16:02 am »
Yeah, I caught that too, but it worked fine. I'll have to look into text formatting modules so I can make the output look nicer, because I like the idea of keeping it a command line interface and keeping it as efficient as possible.
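
One possible starting point that needs no extra modules: ANSI escape codes for color and bold, plus the standard library's `textwrap` for long descriptions. This is a Python 3 sketch, `format_entry` is a hypothetical helper, and it assumes a terminal that honours ANSI codes (most Unix terminals do; older Windows consoles may not):

```python
import textwrap

# Standard ANSI SGR escape sequences.
BOLD = "\033[1m"
CYAN = "\033[36m"
RESET = "\033[0m"

def format_entry(number, title, description, width=60):
    """Return a bold, colored heading followed by a wrapped description."""
    heading = "%s%s%d. %s%s" % (BOLD, CYAN, number, title, RESET)
    body = textwrap.fill(description, width=width,
                         initial_indent="   ", subsequent_indent="   ")
    return heading + "\n" + body

print(format_entry(1, "Example Title (2015)",
                   "A made-up description that is long enough to wrap "
                   "onto a second line in the terminal."))
```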
« Last Edit: February 26, 2015, 04:20:59 am by shimomura »

Offline cyberdrifter

Re: IMDB Movie Scraper
« Reply #8 on: February 26, 2015, 04:27:37 am »
Yeah I also caught that but yeah worked fine. I'll have to look into text formatting modules so I can make the output look nicer because I like the idea of just keeping it a command line interface and keeping it as efficient as possible. If  someone can't figure out how to run it then I don't want that person using our super duper uber elite tool anyways lol
I like where you're taking it, and I'll be watching for updates.


Possible suggestions:
1. Select the best-quality .torrent file and automatically download it to a specified folder (this could be a bit trickier to implement).
2. Have a short description for each of the top 25.
3. Add a critic rating for each of the top 25.


Thanks for the effort.

Offline shimomura

Re: IMDB Movie Scraper
« Reply #9 on: February 28, 2015, 06:17:27 pm »
Updates:
                - Short description for each title.
                - Refined torrent URL search.

Still needs (if anyone wants to add on):
                - Text formatting within the terminal (different colors, bold text, etc.).
                - Critic or user rating for each title (should be an easy one).
                - Replace ASCII characters that are returned in titles.
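
For the last item, my guess is that the odd characters are HTML entities (things like `&amp;` or `&#x27;`) left in the scraped `title` attributes; that is an assumption, since no sample output was posted. If so, Python 3's standard library can decode them. `clean_title` is a hypothetical helper name:

```python
import html

def clean_title(raw):
    """Decode HTML entities such as &amp; or &#x27; into plain characters."""
    return html.unescape(raw)

print(clean_title("Fast &amp; Furious 7 (2015)"))
print(clean_title("Marvel&#x27;s The Avengers (2012)"))
```

Strings without entities pass through unchanged, so it should be safe to run every title through it before printing.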

Code: [Select]
import urllib
import re

#Rejected Strings To Exclude In The Found Titles (Review 7,8,11,12,13,14,15,16,19,20,21)
reject = ["Register or login to rate this title","Go to IMDbPro","PG_13","R","Delete","PG","image of title","Subscribe","image of ","G","TV_PG","image of character","image of name","TV_14","NOT_RATED","TV_MA"]

genre_dic = {1:"Action",2:"Adventure",3:"Animation",4:"Biography",5:"Comedy",6:"Crime",7:"Documentary",8:"Drama",9:"Family",10:"Fantasy",11:"Film-Noir",12:"History",13:"Horror",14:"Music",15:"Musical",16:"Mystery",17:"Romance",18:"Sci-Fi",19:"Sport",20:"Thriller",21:"War",22:"Western"}

#Scrapes Page For Movie Titles and Prints Them
#To Do: Replace Ascii Characters With Human Readable Text!
def find_list(url):
    movie_count = 1
    dis_count = 0
   
    title_list = []
    opened_url = urllib.urlopen(url)
    url_read = opened_url.read()

    descrip = '<span class="outline">(.+?)<'
    descrip_comp = re.compile(descrip)
    descrip_found = re.findall(descrip_comp,url_read)

    titles_reg = 'title="(.+?)"'
    titles_comp = re.compile(titles_reg)
    titles_found = re.findall(titles_comp,url_read)

    print "The Top 25 Titles Found In "+genre_dic[user_choice]+" Are:"
    print
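    #Compile once: captures the title text that precedes " (year)"
    name_reg = re.compile(r'(.+?) \(')
    for titles in titles_found:
        if titles not in reject:
            if titles not in title_list:
                if titles[:4] != "User":
                    name_match = name_reg.search(titles)
                    #Fall back to the whole string if there is no " (year)" suffix
                    name = name_match.group(1) if name_match else titles
                    print "Movie Number "+str(movie_count)+":"
                    print "Title: "+titles
                    #Some pages may list fewer descriptions than titles
                    if dis_count < len(descrip_found):
                        print "Description: "+descrip_found[dis_count]
                    #urllib.quote percent-encodes spaces and other unsafe characters
                    print "Torrent URL: http://kickass.to/usearch/"+urllib.quote(name)+"%20category:movies%20lang_id:2/"
                    print "--------------------------------------"
                    title_list.append(titles)
                    dis_count += 1
                    movie_count += 1
    print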

#Intro & Help
print 'Welcome to the IMDB movie finder; A faster way to find movies.'
print 'Please select a number to view the corresponding genre'
print
print 'Action: 1\nAdventure: 2\nAnimation: 3\nBiography: 4\nComedy: 5\nCrime: 6\nDocumentary: 7\nDrama: 8\nFamily: 9\nFantasy: 10\nFilm-Noir: 11\nHistory: 12\nHorror: 13\nMusic: 14\nMusical: 15\nMystery: 16\nRomance: 17\nSci-Fi: 18\nSport: 19\nThriller: 20\nWar: 21\nWestern: 22\n'
print

#User Selects Genre
def genre_sel():
    global user_choice
    user_choice = raw_input("Enter number here or type 'exit': ")
    if user_choice == 'exit':
        exit()
    try:
        user_choice = int(user_choice)
    except ValueError:
        user_choice = 0  #Non-numeric input is reported as an invalid selection below
    print

#Processes User Selection
#A while loop replaces the recursive calls, which would eventually hit Python's recursion limit
while True:
    genre_sel()
    if 0 < user_choice <= len(genre_dic):
        url = 'http://www.imdb.com/genre/'+genre_dic[user_choice].lower()+'/?ref_=gnr_mn_ac_mp'
        find_list(url)
    else:
        print "Invalid Selection!"
« Last Edit: March 01, 2015, 06:34:24 am by shimomura »
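
For anyone who wants to poke at the filtering step without hitting the site, here is an offline Python 3 sketch of the same idea: pull every `title="..."` attribute, drop known noise, and de-duplicate while keeping page order. The sample HTML is hand-written for illustration and is not real IMDB markup; `extract_titles` is a hypothetical name.

```python
import re

# Hypothetical, hand-written HTML standing in for a downloaded genre page.
SAMPLE_HTML = '''
<a title="Go to IMDbPro">pro</a>
<a title="Example Movie (2015)">Example Movie</a>
<span title="PG_13"></span>
<a title="Example Movie (2015)">poster</a>
<a title="Another Film (2014)">Another Film</a>
<a title="User rating: 7.1">stars</a>
'''

# A few entries from the script's reject list; a set makes lookups cheap.
REJECT = {"Go to IMDbPro", "PG_13", "R", "PG", "G"}
TITLE_ATTR = re.compile(r'title="(.+?)"')

def extract_titles(page, reject=REJECT):
    """Return unique title attributes in page order, minus known noise."""
    seen = set()
    titles = []
    for value in TITLE_ATTR.findall(page):
        if value in reject or value.startswith("User") or value in seen:
            continue
        seen.add(value)
        titles.append(value)
    return titles

print(extract_titles(SAMPLE_HTML))
```

Regex scraping like this is fragile whenever the page layout changes, so a proper HTML parser may be worth considering if the script grows.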

Offline Pillus

  • Serf
  • *
  • Posts: 21
  • Cookies: 2
  • RTFM
    • View Profile
    • ChaseNET
Re: IMDB Movie Scraper
« Reply #10 on: March 01, 2015, 09:19:25 am »
May I make a quick suggestion: if you're going to do anything with torrents for movie titles, the much better choice is to use the YTS API. YTS delivers all the new and cool stuff, all in full HD, and everything has subtitles.

YTS > Kickass :3

API For reference:
https://yts.re/api
« Last Edit: March 01, 2015, 09:23:33 am by Pillus »

Offline shome

  • Peasant
  • *
  • Posts: 81
  • Cookies: 8
    • View Profile
Re: IMDB Movie Scraper
« Reply #11 on: March 03, 2015, 05:22:00 am »
Didn't test it, but I like the idea behind it. I'm an aspiring python-er myself. Good work.


 +1


Nice username btw. heh  :o
« Last Edit: March 04, 2015, 12:09:32 am by shome »