I've written a small program that harvests data from Google. I named it "Pygore" (Python Google Regular Expressions).
It works something like a web crawler, but with a twist, so I'll explain what it does:
- To start with, it makes a user-defined Google query. The user can enter ordinary search terms or use Google operators ("Google hacking") to improve the results of the search.
- A user-defined number of URLs is then extracted from the query's results. If the query produced fewer URLs than the user asked for, Pygore simply extracts all of them.
- Pygore then visits each of these URLs (as a web client) and downloads the HTML source of each page.
- Finally, Pygore searches the downloaded HTML for a user-defined regular expression and extracts the matches. Optionally, the user can dump the URL each match was found at right beside the match itself. The results can go to the terminal, a line-by-line text file, or an HTML file. (A rough sketch of the pipeline follows this list.)
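In case it helps to see the idea in code, here's a stripped-down sketch of the core pipeline. The `harvest` name and its arguments are just made up for illustration (the real code is in the attachment), and it assumes xgoogle's `GoogleSearch` API as I understand it:

```python
import re
import urllib2

from xgoogle.search import GoogleSearch, SearchError

def harvest(query, max_urls, pattern):
    # hypothetical one-function version of the pipeline; Pygore's real
    # code is split across the modules in the attachment
    try:
        gs = GoogleSearch(query)       # user-defined query / Google operators
        gs.results_per_page = 50
        results = gs.get_results()
    except SearchError, e:
        print "Search failed: %s" % e
        return []

    # cap the URL list at the user-defined number; if the query produced
    # fewer URLs than requested, this slice just takes all of them
    urls = [r.url for r in results][:max_urls]

    regex = re.compile(pattern)
    matches = []
    for url in urls:
        try:
            html = urllib2.urlopen(url, timeout=10).read()
        except Exception:
            continue                   # skip pages that fail to download
        for m in regex.findall(html):
            matches.append((url, m))   # keep the source URL beside the match
    return matches
```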
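And the dumping step could look something like this (again, just an illustrative sketch; `dump` and its arguments aren't the actual names in Pygore):

```python
import cgi

def dump(matches, mode="terminal", path=None, show_urls=True):
    # hypothetical helper; takes the (url, match) pairs from harvest()
    lines = ["%s\t%s" % (m, u) if show_urls else m for u, m in matches]
    if mode == "terminal":
        for line in lines:
            print line
    elif mode == "text":                    # line-by-line text file
        with open(path, "w") as f:
            f.write("\n".join(lines) + "\n")
    elif mode == "html":                    # simple HTML list
        items = "".join("<li>%s</li>" % cgi.escape(l) for l in lines)
        with open(path, "w") as f:
            f.write("<html><body><ul>%s</ul></body></html>" % items)
```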
(I've written it using Tkinter for the GUI, and it makes use of the xgoogle library to implement the Google searching.
Pygore is split into several modules rather than being a single .py script, so I uploaded it as an attachment instead of posting the source code directly. The attachment contains the full source, although the xgoogle modules were not written by me.)
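The Tkinter side is nothing fancy. Roughly the idea is something like this (a stripped-down illustration wired to the hypothetical harvest() sketched above, not Pygore's actual layout):

```python
import Tkinter as tk

root = tk.Tk()
root.title("Pygore")

query_entry = tk.Entry(root, width=40)   # search terms / Google operators
count_entry = tk.Entry(root, width=6)    # max number of URLs to extract
regex_entry = tk.Entry(root, width=40)   # regular expression to match
output = tk.Text(root, height=15, width=60)

def run_search():
    # clear the output pane, run the pipeline, and show match + source URL
    output.delete("1.0", tk.END)
    hits = harvest(query_entry.get(), int(count_entry.get()), regex_entry.get())
    for url, match in hits:
        output.insert(tk.END, "%s    %s\n" % (match, url))

for widget in (query_entry, count_entry, regex_entry):
    widget.pack()
tk.Button(root, text="Search", command=run_search).pack()
output.pack()

root.mainloop()
```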
I think something like this can be useful, although I myself am probably never going to use it. Let me know what you think.