Author Topic: [Perl] robots.txt webcrawler  (Read 571 times)

[Perl] robots.txt webcrawler
« on: April 24, 2015, 08:40:07 am »
Hello,

This is a program that takes in a URL in the format "www.example.com" and crawls to every new domain it finds, copying each domain's robots.txt file. Note that it doesn't dig very deep, because it only scans the source of a site's front page for new domains. If it were looking at https://evilzone.org, for example, it would not look in https://evilzone.org/scripting-languages/

In theory, a robots.txt file tells webcrawlers which portions of the website they can and can't index, etc... Nothing actually enforces this; it's just supposed to be a convention. This is useful because web administrators put paths in there that they don't want showing up in a Google search, which can mean the information held within is sensitive.
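To make that concrete, here's what a minimal robots.txt might look like (the paths are made up for illustration, not taken from any real site):
Code: [Select]
User-agent: *
Disallow: /admin/
Disallow: /private/backup/

A crawler that honors the convention would skip /admin/ and /private/backup/ entirely, which is exactly why those lines are worth reading.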

I tested my program on "www.google.com" which yielded:

[screenshot: the list of robots.txt files it saved]
And if you look in one of these you'll see something like:

[screenshot: the contents of one of the saved robots.txt files]
Finally, here is the code:
Code: (perl) [Select]
#!/usr/bin/perl -w
use strict;
use LWP::Simple;

my @domains;

print "Enter domain to start with: (Ex. \"www.google.com\")\n";
chomp($domains[0] = <STDIN>);

# @domains doubles as the crawl queue: the loop walks it from the
# front while newly discovered domains get pushed onto the end.
for (my $i = 0; $i < scalar(@domains); $i++) {
   my $pageContent = get("http://".$domains[$i]);
   if (defined $pageContent) {
      $pageContent = lc($pageContent);
      # Pull every href out of the front page and queue the domain
      # part of any absolute http/https link we haven't seen before.
      while ($pageContent =~ /href=\"(.*?)\"/g) {
         if ($1 =~ /https?:\/\/(.*?)\//) {
            my $domain = $1;
            my $new = 1;
            for (my $e = 0; $e < scalar(@domains); $e++) {
               if ($domains[$e] eq $domain) {
                  $new = 0;
               }
            }
            if ($new) {
               push(@domains, $domain);
            }
         }
      }
   }
   # Grab this domain's robots.txt, if it has one, and save it to a
   # file named "<domain> robots.txt".
   my $robotsContent = get("http://".$domains[$i]."/robots.txt");
   if (defined $robotsContent) {
      $robotsContent = lc($robotsContent);
      open(my $fh, ">", "$domains[$i] robots.txt") or die "Error: $!\n";
      binmode($fh, ":utf8");
      print $fh $robotsContent;
      close($fh);
   }
}

Enjoy :)
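If you want to skim the haul for the interesting bits afterwards, here's a quick sketch of a second script (my own addition, assuming the files sit in the current directory under the "<domain> robots.txt" names the crawler writes) that prints every Disallow rule it finds:
Code: (perl) [Select]
#!/usr/bin/perl -w
use strict;

# Scan every saved "<domain> robots.txt" file in the current
# directory and print its Disallow rules, since those mark the
# paths the admins didn't want indexed.
opendir(my $dh, ".") or die "Error: $!\n";
my @files = grep { /robots\.txt$/ } readdir($dh);
closedir($dh);

foreach my $file (@files) {
   open(my $fh, "<", $file) or die "Error: $!\n";
   while (my $line = <$fh>) {
      # Rules look like "disallow: /path" (the crawler lowercases
      # everything it saves, but match case-insensitively anyway).
      if ($line =~ /^disallow:\s*(\S+)/i) {
         print "$file: $1\n";
      }
   }
   close($fh);
}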

EDIT:
I found two little bugs today, but I changed the code in this post to get rid of them. If you run the code that used to be here, it will still work; it'll just spew some warnings at you sometimes, but if you ignore them it'll still do its job. Like I said, though, the code that's here now works perfectly.

I'm also attaching a zip of 6,001 robots.txt files that I found while testing it today. I just started it on en.wikipedia.org and let it run for a while.
« Last Edit: April 25, 2015, 07:26:05 am by Freak »