PHP Crawler
Crawling websites isn't an easy task. If you check the wiki page you'll find a whole list of policies, issues, a nice pic of a multi-threading crawler architecture and links to open source crawling projects. But if you don't care about speed, accuracy, politeness or common sense you can use the code below. I wrote it for the specific task of creating a site map of my very own site, and 'cause I'd never tried crawling before.
There's a whole lot of things wrong with this code:
- It relies on PHP functions parsing URLs correctly, which may or may not be a good idea
- It uses a regex to find the anchor tags. HTML isn't a regular language and so can't be parsed reliably using regular expressions. Furthermore, you might be interested in setting up some policy for finding URLs in mismatched brackets, images, forms etc. Basically, if you want to do this seriously you'll need a customised parser.
- PHP is not a high-performance language and web crawling is serious business.
- Using a crawler to create a sitemap defeats the whole purpose of a sitemap: to help crawlers visiting your site.
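To give an idea of what the alternative to the regex looks like, here's a minimal sketch that pulls the href attributes out with PHP's bundled DOM extension instead. The function name extractLinks and the sample markup are made up for this example, and it assumes the dom extension is enabled. It's still a long way from a customised parser: it only looks at anchor tags and applies no policy at all.

```php
<?php
// Hypothetical helper for illustration only, not part of the Crawler class.
function extractLinks($html)
{
    $links = array();
    // loadHTML() refuses empty input, so bail out early
    if (trim($html) === '') return $links;

    // Real-world HTML is rarely valid, so keep libxml's warnings quiet
    $previous = libxml_use_internal_errors(true);
    $doc = new DOMDocument();
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    // Collect the href of every anchor that has one
    foreach ($doc->getElementsByTagName('a') as $anchor) {
        if ($anchor->hasAttribute('href')) {
            $links[] = $anchor->getAttribute('href');
        }
    }
    return $links;
}

// Sample markup (invented): one double-quoted href, one single-quoted, one anchor without href
$html = '<p><a href="/about.htm">About</a> <a href=\'contact.htm\'>Contact</a> <a name="top">No link</a></p>';
print_r(extractLinks($html));
```

Unlike the regex, this also picks up the single-quoted href, and it skips the anchor that has no href at all.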
But let's do it anyway
Of course 'valid concerns' never stopped us before so here we go.
The following snippet is a badly tested class that'll take a URL, scan the page for links to pages on the same domain and then follow those links in turn. It does a reasonable job of turning relative links into absolute ones and can output a very bare sitemap file. Handle with care :)
<?php
class Crawler
{
    private $url;     // The full URL linking to the root page
    private $scheme;  // The scheme of the root page
    private $domain;  // The domain name of the root page
    private $path;    // The directory of the root page (relative to scheme://host/)
    private $file;    // The root file to scan (filename only)
    private $crawled; // The pages crawled successfully
    private $seen;    // All URLs already processed
    private $toCrawl; // The pages still to be crawled
    private $debug = false;  // Print every page crawled
    private $debug2 = false; // Print every link tested

    /**
     * Crawls a site, starting at the given page. The given URL must contain both
     * a scheme and a host name. In other words,
     * new Crawler('http://google.com/index.htm') is valid while
     * new Crawler('./index.php') is not.
     */
    public function __construct($url)
    {
        $parts = parse_url($url);
        if (empty($parts['scheme'])) throw new Exception("url passed to constructor must contain scheme (http://)");
        if (empty($parts['host'])) throw new Exception("url passed to constructor must contain domain name (example.com)");
        $this->scheme = strtolower($parts['scheme']);
        $this->domain = strtolower($parts['host']);
        // parse_url() leaves out 'path' for URLs like http://example.com
        $root = isset($parts['path']) ? $parts['path'] : '/';
        $this->path = pathinfo($root, PATHINFO_DIRNAME);
        $this->file = pathinfo($root, PATHINFO_BASENAME);
        $this->url = $url;
        // On Windows the dirname of a root-level file is a backslash
        if ($this->path == '\\') $this->path = '/';
        $this->toCrawl = array($url);
        $this->crawled = array();
        $this->seen = array();
        // Pages queued while crawling are picked up on the next pass
        while (!empty($this->toCrawl)) {
            foreach ($this->toCrawl as $key => $value) {
                $this->crawl($value);
                $this->seen[] = $value;
                unset($this->toCrawl[$key]);
            }
        }
    }

    private function crawl($url)
    {
        if ($this->debug) echo "<i>Crawling: $url</i><br />";
        $links = $this->scanForLinks($url);
        if ($links === false) return;
        // Relative links are resolved against the directory of the page being
        // crawled (everything up to and including the last slash of its path)
        $base = parse_url($url, PHP_URL_PATH);
        $base = empty($base) ? '/' : substr($base, 0, strrpos($base, '/') + 1);
        foreach ($links as $link) {
            $parts = parse_url($link);
            if ($this->debug2) echo "<b>Testing: $link</b><br />";
            // Ignore links without a path specification
            if (empty($parts['path'])) continue;
            // Ignore other schemes
            if (isset($parts['scheme'])) {
                $scheme = strtolower($parts['scheme']);
                if ($scheme != $this->scheme) continue;
            }
            // Ignore other domain names
            if (isset($parts['host'])) {
                $domain = strtolower($parts['host']);
                if ($domain != $this->domain) continue;
            }
            // Prefix relative paths with the directory of the current page
            $path = $parts['path'];
            if ($path[0] != '/')
                $path = $base . $path;
            $isDir = (substr($path, -1) == '/');
            $path = explode('/', $path);
            $level = 0;
            $new = array();
            foreach ($path as $part)
            {
                // Ignore ./ and //
                if ($part == '.' || $part == '') continue;
                if ($part == '..') {
                    // Go up a level
                    $level--;
                    if ($level < 0) break;
                } else {
                    $new[$level] = $part;
                    $level++;
                }
            }
            // Ignore links that resolve to the site root or to somewhere above it
            if ($level < 1) continue;
            // Rebuild the absolute URL (query strings and fragments are dropped)
            $parsed = $this->scheme.'://'.$this->domain;
            for ($i=0; $i<$level; $i++) $parsed .= '/'.$new[$i];
            if ($isDir) $parsed .= '/';
            if ($this->debug2) echo $parsed . "<br />";
            // If not seen yet & not queued --> queue
            if (!(in_array($parsed, $this->seen)
                || in_array($parsed, $this->toCrawl))) $this->toCrawl[] = $parsed;
        }
        // Scanned successfully -> add to crawled list
        $this->crawled[] = $url;
    }

    private function scanForLinks($url)
    {
        // Add a scheme if there isn't one (and leave https:// URLs alone)
        if (!preg_match('#^https?://#i', $url)) $url = 'http://'.$url;
        if (substr($url, -1) == '/') $url = substr($url, 0, -1);
        if ($url == 'http://localhost') $url = 'http://127.0.0.1';
        $cnt = @file_get_contents($url);
        if ($cnt === false) return false;
        // Find links (messy!): this only matches double-quoted href attributes
        preg_match_all("/<a [^>]*href[\s]*=[\s]*\"([^\"]*)\"/i", $cnt, $links);
        return $links[1];
    }

    /**
     * Returns the pages crawled
     */
    public function getPages()
    {
        return $this->crawled;
    }

    /**
     * Returns the pages crawled as a site map
     */
    public function getSiteMap()
    {
        ob_start();
        echo "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n";
        echo "<urlset xmlns=\"http://www.sitemaps.org/schemas/sitemap/0.9\">\n";
        // URLs must be entity-escaped (& becomes &amp;) to keep the XML valid
        foreach ($this->crawled as $url)
            echo "\t<url><loc>" . htmlspecialchars($url) . "</loc></url>\n";
        echo "</urlset>";
        return ob_get_clean();
    }
}
?>
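The part of crawl() that deals with ./ and ../ is the bit most worth testing on its own. Here's a rough sketch of that step as a standalone function. The name normalisePath and the sample paths are invented for this example. Like the class, it simply rejects anything that climbs above the root, so don't mistake it for the full reference resolution algorithm from the URI RFC.

```php
<?php
// Hypothetical helper: collapses "." and ".." segments in an absolute path.
// Returns false if the path tries to climb above the root.
function normalisePath($path)
{
    $isDir = (substr($path, -1) == '/');
    $stack = array();
    foreach (explode('/', $path) as $part) {
        // Skip empty segments (from // or the leading /) and "."
        if ($part === '' || $part === '.') continue;
        if ($part === '..') {
            // Nothing left to pop means we'd end up above the root
            if (empty($stack)) return false;
            array_pop($stack);
        } else {
            $stack[] = $part;
        }
    }
    $result = '/' . implode('/', $stack);
    // Keep the trailing slash of directory links
    if ($isDir && !empty($stack)) $result .= '/';
    return $result;
}

var_dump(normalisePath('/blog/./2008/../about.htm')); // "/blog/about.htm"
var_dump(normalisePath('/blog//archive/'));           // "/blog/archive/"
var_dump(normalisePath('/../secret.htm'));            // false
```

Using a stack instead of the $level counter makes the "go up a level" case a plain array_pop(), which is a bit easier to reason about.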
Aug 31st, 2008
Comments
michael wrote:
Well with those line numbers it should be easy enough to debug right?
Mar 3rd, 2010
Haafiz wrote:
Notice: Undefined index: path in /opt/lampp/htdocs/crawler/crawler.php on line 31
Notice: Undefined index: scheme in /opt/lampp/htdocs/crawler/crawler.php on line 69
It is giving the above error many times
Mar 3rd, 2010
Anonymous wrote:
your coding is working man.
I like it.
Now I am testing with other coding.
with regards,
Nay La Aung
May 15th, 2009
michael wrote:
Any error messages you'd like to share?
Mar 16th, 2009
colorblack04@yahoo.com wrote:
its not working
Mar 16th, 2009