PHP Crawler

Crawling websites isn't an easy task. If you check the wiki page you'll find a whole list of policies, issues, a nice pic of a multi-threading crawler architecture and links to open source crawling projects. But if you don't care about speed, accuracy, politeness or common sense you can use the code below. I wrote it for the specific task of creating a site map of my very own site, and 'cause I'd never tried crawling before.

There's a whole lot of things wrong with this code:

But let's do it anyway

Of course 'valid concerns' never stopped us before so here we go.

The following snippet is a badly tested class that'll take a URL and scan the page for links to pages on the same domain. It does a reasonable job of turning relative links into absolute ones and can output a very bare sitemap file. Handle with care :)
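
Before the full class, here's the relative-to-absolute step on its own, as a minimal sketch. The function name resolvePath and its two parameters are made up for this illustration and aren't part of the class. It assumes the base directory is an absolute path like /blog, and it returns false when a link tries to climb above the site root.

```php
<?php
// Hypothetical helper, not part of the Crawler class: resolves a link's path
// against a base directory and collapses "." and ".." segments.
function resolvePath($baseDir, $linkPath)
{
    // Relative paths are taken relative to the base directory
    if ($linkPath === '' || $linkPath[0] !== '/') {
        $linkPath = rtrim($baseDir, '/') . '/' . $linkPath;
    }

    $isDir = (substr($linkPath, -1) === '/');
    $stack = array();
    foreach (explode('/', $linkPath) as $part) {
        if ($part === '' || $part === '.') continue;   // skip "//" and "./"
        if ($part === '..') {
            // ".." climbs one level; climbing above the root is rejected
            if (empty($stack)) return false;
            array_pop($stack);
        } else {
            $stack[] = $part;
        }
    }

    $resolved = '/' . implode('/', $stack);
    if ($isDir && !empty($stack)) $resolved .= '/';
    return $resolved;
}

var_dump(resolvePath('/blog', 'posts/../about.html'));  // "/blog/about.html"
var_dump(resolvePath('/blog', '../../secret.html'));    // false
```

Using a stack instead of a level counter is just a matter of taste; the idea is the same as in the class: split on slashes, skip empty and "." parts, and step back one level for every "..".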

<?php
class Crawler
{
    private $url;       // The full URL linking to the root page
    private $scheme;    // The scheme of the root page
    private $domain;    // The domain name of the root page
    private $path;      // The path (relative to scheme://host/) to the root page
    private $file;      // The root file to scan (filename only)

    private $crawled;   // The pages crawled
    private $seen;      // All URLs encountered
    private $toCrawl;   // The pages still to be crawled

    private $debug  = false;
    private $debug2 = false;

    /**
     * Crawls a page. The given url must contain both a scheme and a host name. In
     *  other words, new Crawler('http://google.com/index.htm') is valid while new
     *  Crawler('./index.php') is not.
     */
    public function __construct($url)
    {
        $parts = parse_url($url);
        if (empty($parts['scheme'])) throw new Exception("url passed to constructor must contain scheme (http://)");
        if (empty($parts['host'])) throw new Exception("url passed to constructor must contain domain name (example.com)");

        $this->scheme = strtolower($parts['scheme']);
        $this->domain = strtolower($parts['host']);

        // parse_url() leaves 'path' unset for urls like http://example.com
        $root = isset($parts['path']) ? $parts['path'] : '/';
        $this->path = pathinfo($root, PATHINFO_DIRNAME);
        $this->file = pathinfo($root, PATHINFO_BASENAME);
        $this->url = $url;
        // On Windows pathinfo() reports the root directory as a backslash
        if ($this->path == '\\') $this->path = '/';

        $this->toCrawl = array($url);
        $this->crawled = array();
        $this->seen    = array();

        // crawl() may queue new pages, so keep going until the queue is empty
        while (!empty($this->toCrawl)) {
            foreach ($this->toCrawl as $key => $value) {
                $this->crawl($value);
                $this->seen[] = $value;
                unset($this->toCrawl[$key]);
            }
        }
    }
    
    private function crawl($url)
    {
        if ($this->debug) echo "<i>Crawling: $url</i><br />";

        $links = $this->scanForLinks($url);
        if ($links === false) return;

        foreach ($links as $link) {
            $parts = parse_url($link);

            if ($this->debug2) echo "<b>Testing: $link</b><br />";

            // Ignore link without path specification
            if (empty($parts['path'])) continue;

            // Ignore other schemes
            if (isset($parts['scheme'])) {
                $scheme = strtolower($parts['scheme']);
                if ($scheme != $this->scheme) continue;
            }

            // Ignore other domain names
            if (isset($parts['host'])) {
                $domain = strtolower($parts['host']);
                if ($domain != $this->domain) continue;
            }

            // Prefix relative paths with the root page's directory
            // (note: not the directory of the page the link was found on)
            $path = $parts['path'];
            if ($path[0] != '/')
                $path = $this->path . '/' . $path;

            $isDir = (substr($path, -1) == '/');
            $path = explode('/', $path);
            $level = 0;
            $new = array();
            foreach ($path as $part)
            {
                // Ignore ./ and //
                if ($part == '.' || $part == '') continue;

                if ($part == '..') {
                    // Go up a level
                    $level--;
                    if ($level < 0) break;
                } else {
                    $new[$level] = $part;
                    $level++;
                }
            }
            // Ignore anything that ends up above the site root or has no path left
            if ($level < 1) continue;

            // Rebuild the absolute url
            $parsed = $this->scheme . '://' . $this->domain;
            for ($i = 0; $i < $level; $i++) $parsed .= '/' . $new[$i];
            if ($isDir) $parsed .= '/';

            if ($this->debug2) echo $parsed . "<br />";

            // If not seen yet & not queued --> queue
            if (!(in_array($parsed, $this->seen)
               || in_array($parsed, $this->toCrawl))) $this->toCrawl[] = $parsed;
        }

        // Scanned successfully -> add to crawled list
        $this->crawled[] = $url;
    }
        
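    private function scanForLinks($url)
    {
        // Add a scheme if it's missing (leave http:// and https:// urls alone)
        if (!preg_match('#^https?://#i', $url)) $url = 'http://' . $url;
        if (substr($url, -1) == '/') $url = substr($url, 0, -1);

        if ($url == 'http://localhost') $url = 'http://127.0.0.1';

        // Suppress warnings; an unreachable page simply yields false
        $cnt = @file_get_contents($url);
        if ($cnt === false) return false;

        // Find links (messy!): only matches double-quoted href attributes
        preg_match_all('/<a [^>]*href\s*=\s*"([^"]*)"/i', $cnt, $links);
        return $links[1];
    }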
    
    
    /**
     * Returns the links found
     */
    public function getPages()
    {
        return $this->crawled;
    }

    /**
     * Returns the links found as a site map
     */
    public function getSiteMap()
    {
        ob_start();
        echo "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n";
        echo "<urlset xmlns=\"http://www.sitemaps.org/schemas/sitemap/0.9\">\n";
        foreach ($this->crawled as $url)
            // Escape &, <, > and quotes so the output stays valid XML
            echo "\t<url><loc>" . htmlspecialchars($url, ENT_QUOTES | ENT_XML1, 'UTF-8') . "</loc></url>\n";
        echo "</urlset>";
        return ob_get_clean();
    }

}
?>
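
The regular expression in scanForLinks() is the messiest part of the class, so here's a rough sketch of one possible alternative: letting PHP's DOM extension do the parsing. The function name extractLinks is invented for this example and isn't part of the class above, and the sketch assumes the DOM extension is enabled (it usually is by default).

```php
<?php
// Hypothetical alternative to the regular expression in scanForLinks():
// let PHP's DOM extension parse the markup and collect every href.
function extractLinks($html)
{
    // loadHTML() refuses empty input, so bail out early
    if (trim($html) === '') return array();

    $doc = new DOMDocument();

    // Real-world HTML is rarely valid; collect parser warnings quietly
    $previous = libxml_use_internal_errors(true);
    $loaded = $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);
    if (!$loaded) return array();

    $links = array();
    foreach ($doc->getElementsByTagName('a') as $anchor) {
        if ($anchor->hasAttribute('href')) {
            $links[] = $anchor->getAttribute('href');
        }
    }
    return $links;
}

$html = '<p><a href="/about.html">About</a> '
      . '<a class=x href=\'blog/\'>Blog</a> '
      . '<a name="top">Top</a></p>';
print_r(extractLinks($html));   // the two href values: /about.html and blog/
```

Unlike the regex, this also picks up href values in single quotes or without quotes, and it skips anchors that have no href at all. Whether that's worth the extra code for a quick sitemap is up to you.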

Aug 31st, 2008

Comments

michael wrote:

Well with those line numbers it should be easy enough to debug right?

Mar 3rd, 2010

Haafiz wrote:

Notice: Undefined index: path in /opt/lampp/htdocs/crawler/crawler.php on line 31

Notice: Undefined index: scheme in /opt/lampp/htdocs/crawler/crawler.php on line 69

It is giving the above error many times

Mar 3rd, 2010

Anonymous wrote:

your coding is working man.
I like it.
Now I am testing with other coding.

with regards,
Nay La Aung

May 15th, 2009

michael wrote:

Any error messages you'd like to share?

Mar 16th, 2009

colorblack04@yahoo.com wrote:

its not working

Mar 16th, 2009
