Author Topic: Relatively crummy scraper.  (Read 1452 times)


Offline hmm

  • Serf
  • *
  • Posts: 23
  • Cookies: 0
Relatively crummy scraper.
« on: July 31, 2012, 09:31:11 am »
Here is a shitty Huffington Post scraper. Sorry my contribution is crap. It's Python. I left off some dicts and lists, but it's obvious what they are. It finds keywords too, but I left that list off as well.




Code: (python)


from urllib import urlopen
import string
import re
##for left column of homepage
def h_p_left(url):
    target_section=re.compile('<h3>(.*)</a></h3>') ##gets section (i.e link + title in HTML)
    sections = target_section.findall(url) ##makes a list of all instances they occur
    for i in range(0, 3):
        kill_links(sections[i]) ##grabs link from section
        kill_tittles(sections[i]) ## grabs title
        get_text(links[i])##goes to link url, and scrapes content
        news_dict[(tittles[i])] = (keywords, links[i], content[i]) ##stores in dict.
    return news_dict
##same as above but the HTML is a bit different.
def h_p_center(url):
    target_section=re.compile('<h4>(.*)</a></h4>')
    sections=target_section.findall(url)
    for i in range(0,3):
        kill_links(sections[i])
        kill_tittles(sections[i])
        get_text(links[i])
        news_dict[(tittles[i])] = (keywords, links[i], content[i])
    return news_dict
##gets link sans HTML
def kill_links(text_in):
    striped_link=re.compile('<a href="(.*?)".*>')
    links.extend(striped_link.findall(text_in))
    return links
##gets title sans HTML
def kill_tittles(text_in):
    stripped_tittle=re.compile('>(.*)')##name must match the one used on the next line
    tittles.extend(stripped_tittle.findall(text_in))
    return tittles
##scrapes content
def get_text(url):
    articlepage=urlopen(url).read()##go to link URL
    if articlepage.find('<div class="articleBody" itemprop="articleBody">') !=-1:##goes to text
        front= articlepage.find('<div class="articleBody" itemprop="articleBody">')
    else:
        front = articlepage.find('<div class="entry_body_text">')##goes to start of text
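    back = articlepage.find('<>', front)##goes to end; NOTE: '<>' will rarely appear in a page, and find then returns -1, so the slice below runs almost to the end of the page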
    text = articlepage[front:back]##grabs it
    kill_tags = re.compile(r'<.*?>')##matches HTML tags within
    text = kill_tags.sub('', text)##removes them in one pass; re.sub(item, '', text) would treat each tag as a regex pattern
    text=text.translate(string.maketrans("\n\t\r", "   "))##makes it prettier
    text= text.strip()##pretty
    text = ' '.join(text.split())##pretty
    content.append(text)##adds it to content
    ##please ignore below, unless you want to implement it. I probably did it a shitty way
    ##as in you make a huge freaking list with shit to avoid.
    prekey={}
    key={}
    text = text.lower()
    parts = text.split(' ')
    for item in parts:
        if item in prekey:
            prekey[item] +=1
        else:
            prekey[item]=1
    for item in prekey.keys():
        if item in bad_words:
            del prekey[item]
    for item in prekey:
        if prekey[item]>3:
            key[item] = prekey[item]
    for item in bad_words:
        if item in key:
            del key[item]
    for item in key.keys():
        keywords.append((item, key[item]))
   
    return content, keywords
##the rest is simple, and repetitive
##bad_words (the list of words to skip, left out of this post) must be defined before running
links=[]
tittles=[]
content=[]
keywords=[]
news_dict={}##must exist before h_p_left is called, because it writes into it
pwebpage=urlopen('http://www.huffingtonpost.com/politics').read()
h_p_left(pwebpage)
links=[]
content=[]
tittles=[]
keywords= []
h_p_center(pwebpage)
politics_dict=news_dict
news_dict={}
links=[]
tittles=[]
content=[]
keywords=[]
print "HP - p ; done"
bwebpage=urlopen('http://www.huffingtonpost.com/business').read()
h_p_left(bwebpage)
links=[]
tittles=[]
content=[]
keywords=[]
h_p_center(bwebpage)
business_dict=news_dict
news_dict={}
print ("HP - b; done")
wwebpage=urlopen('http://www.huffingtonpost.com/world').read()
links=[]
tittles=[]
content=[]
keywords=[]
h_p_left(wwebpage)
links=[]
tittles=[]
content=[]
keywords=[]
h_p_center(wwebpage)
world_dict=news_dict
print ("HP - w;done")
news_dict={}


Edited because I copy pasted incorrectly.
« Last Edit: August 01, 2012, 04:41:21 am by hmm »

Offline techb

  • Soy Sauce Feeler
  • Global Moderator
  • King
  • *
  • Posts: 2350
  • Cookies: 345
  • Aliens do in fact wear hats.
Re: Relatively crummy scraper.
« Reply #1 on: July 31, 2012, 10:08:01 am »
Moved to scripting languages, also added python highlighting. Next time be sure to put [ code=python] or the language of your choice.

You should read up on coding conventions. Also, this is only a set of functions, with no imports or main entry point.
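Something like the following is what I mean by a main entry point. It is only a rough sketch, written for Python 3: the section URLs are the ones from the post, but SECTION_URLS, extract_headlines, main and fake_fetch are made-up names, and the page fetch is passed in as a function so the example runs without touching the network.

```python
import re

# Section URLs taken from the original post; the dict itself is hypothetical.
SECTION_URLS = {
    "politics": "http://www.huffingtonpost.com/politics",
    "business": "http://www.huffingtonpost.com/business",
    "world": "http://www.huffingtonpost.com/world",
}

# Same left-column pattern the original script uses.
HEADLINE_PATTERN = re.compile(r'<h3>(.*)</a></h3>')


def extract_headlines(page_html, limit=3):
    """Return up to `limit` raw headline sections found in the page HTML."""
    return HEADLINE_PATTERN.findall(page_html)[:limit]


def main(fetch_page):
    """Scrape every section; fetch_page is any callable mapping URL -> HTML."""
    results = {}
    for name, url in SECTION_URLS.items():
        results[name] = extract_headlines(fetch_page(url))
        print("HP - %s; done" % name)
    return results


if __name__ == "__main__":
    # Hypothetical stand-in for urlopen(url).read(), so the sketch runs offline.
    def fake_fetch(url):
        return '<h3><a href="%s/story">Example headline</a></h3>' % url

    main(fake_fetch)
```

The loop replaces the three copy-pasted blocks at the bottom of the script, and nothing runs on import because the work sits behind the `__name__` check.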
>>>import this
-----------------------------

Offline hmm
Re: Relatively crummy scraper.
« Reply #2 on: July 31, 2012, 10:23:05 am »
Whoops. Thanks for the heads up. It's all there now.

Offline techb
Re: Relatively crummy scraper.
« Reply #3 on: July 31, 2012, 10:31:23 am »
You should try to make it OOP using classes. And I'm assuming this is Python3?
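A rough idea of what a class could look like, assuming Python 3 for the sketch. The regexes are the ones from your functions, but SectionScraper, parse and the sample HTML are names I made up, and it only handles the link and title step, not the article download.

```python
import re


class SectionScraper(object):
    """Holds the state that the original script keeps in module-level lists."""

    LINK_PATTERN = re.compile(r'<a href="(.*?)".*>')
    TITLE_PATTERN = re.compile(r'>(.*)')

    def __init__(self, headline_tag):
        # headline_tag is "h3" for the left column and "h4" for the center one
        self.section_pattern = re.compile(
            r'<%s>(.*)</a></%s>' % (headline_tag, headline_tag))
        self.articles = {}

    def parse(self, page_html, limit=3):
        """Map each headline title to its link for the first `limit` hits."""
        for section in self.section_pattern.findall(page_html)[:limit]:
            link = self.LINK_PATTERN.findall(section)
            title = self.TITLE_PATTERN.findall(section)
            if link and title:
                self.articles[title[0]] = link[0]
        return self.articles


# Made-up sample page, so the sketch runs without fetching anything.
sample_html = '<h3><a href="http://example.com/a">First story</a></h3>'
left_column = SectionScraper("h3")
print(left_column.parse(sample_html))
```

Each scraper instance owns its own results, so there is no need to reset links, tittles and content by hand between sections.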
« Last Edit: July 31, 2012, 10:32:37 am by techb »

Offline hmm
Re: Relatively crummy scraper.
« Reply #4 on: August 01, 2012, 03:29:38 am »
Pretty sure it's 2.7, and classes probably would have been a cleaner way to go. Thanks for the heads up.

Offline techb
Re: Relatively crummy scraper.
« Reply #5 on: August 01, 2012, 03:34:44 am »
Quote from: hmm on August 01, 2012, 03:29:38 am
Pretty sure it's 2.7, and classes probably would have been a cleaner way to go. Thanks for the heads up.

The reason I assumed 3 was the use of print("string") instead of print "string". I know print("string") also runs in 2.x, where print is a statement and the parentheses just group the expression, but 3 made print a true built-in function.
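To show the difference, here is a small sketch that only works when print is a function (Python 3, or 2.x with the print_function import from __future__). report_done and buffer are made-up names; the message format is borrowed from your script.

```python
import io


def report_done(section, stream=None):
    """Print a progress line such as 'HP - p; done'."""
    # sep, end and file are keyword arguments of the print function;
    # the Python 2 print statement has no equivalent for them.
    print("HP", section, sep=" - ", end="; done\n", file=stream)


# Writing into an in-memory buffer instead of the console.
buffer = io.StringIO()
report_done("p", stream=buffer)
```

With the statement form you would have to build the whole string yourself before printing it.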

And yes, your code is extremely messy and unreadable. You need to work on cleaner code and proper naming conventions.
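As one example of what I mean, the keyword part of get_text could look roughly like this. It is a sketch for Python 3, not a drop-in replacement: STOP_WORDS stands in for your missing bad_words list, and count_keywords, min_count and the sample text are names and values I picked for illustration.

```python
from collections import Counter

# Hypothetical stop list; the original post leaves its bad_words list out.
STOP_WORDS = {"the", "a", "of", "and", "to", "in"}


def count_keywords(article_text, min_count=4, stop_words=STOP_WORDS):
    """Return (word, count) pairs for words seen at least min_count times."""
    words = article_text.lower().split()
    counts = Counter(word for word in words if word not in stop_words)
    # most_common() yields pairs ordered from highest count to lowest
    return [(word, n) for word, n in counts.most_common() if n >= min_count]


# Made-up text: "tax" five times, "the" nine times, "vote" four times.
sample = "tax " * 5 + "the " * 9 + "vote " * 4 + "senate"
print(count_keywords(sample))
```

The default of 4 matches your `prekey[item]>3` check, and a name like count_keywords tells the reader what the function does without needing a comment.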
« Last Edit: August 01, 2012, 03:36:42 am by techb »