Search II
Generating a site index
Since I wrote this article, the way I store articles has changed quite a bit, and I've updated the text to reflect those changes. The original version can be found here.
The previous article discussed some basic ideas on how to set up a search system for a site, specifically this one. This article presents part of the solution I came up with: how I currently index the articles stored on my site.
Templates again
Yes, you'll never hear the last of them. Articles, of course, follow a pretty standard template. They start with a title, have some content in the middle, and end with a date. Not content to stop there, I decreed that every article should also have a short description and a category. Every article on this site defines these five variables and adds them to an article object. The script that builds the pages you see simply includes one of these article object definition files and passes the article to a template. This template is then used to create the kind of page you see before you, complete with comment-posting functionality.
The beauty of using PHP includes rather than XML files or content loaded from a database is that I can write my content in PHP, which allows for lots of neat tricks such as code highlighting.
All this means article files on my site currently look something like this:
<?php
$article = new Article();
$article->title = 'Search II';
$article->description = 'Presents a way of indexing a local set of pages with active content';
$article->date = 1216238572;
$article->category = 'php';
ob_start();
?>
<h3>A crawling/scanning combo to index a site.</h3>
<p>The <a href="search_1.php">previous article</a> discussed some basic ideas on how to set up a
search system for a site. Specifically, for this site. Since the time this was written and the
time a first version of the search system was completed (about two hours ago) I changed my mind
a little. This article presents part of the solution I came up with. Specifically, it shows how
I currently index the articles stored on my site.</p>
<?php
$article->content = ob_get_clean();
return $article;
?>
Not the most beautiful code ever written, but it does the job and is quite readable even for people with limited knowledge of PHP. The main advantage: all the meta information lives in a single file, right next to the content.
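To make the include mechanism concrete, here is a minimal sketch of loading such a definition file. The Article class below is a hypothetical stand-in (the real class is not shown in this article), and the definition file is written to a temporary location only so the example is self-contained.

```php
<?php
// Hypothetical minimal container; the real Article class is not shown here.
class Article
{
    public $title = '';
    public $description = '';
    public $date = 0;        // Unix timestamp
    public $category = '';
    public $content = '';
}

// Write a tiny article definition to a temporary file.
$path = tempnam(sys_get_temp_dir(), 'art');
file_put_contents($path, <<<'PHP'
<?php
$article = new Article();
$article->title = 'Search II';
$article->date = 1216238572;
$article->category = 'php';
ob_start();
?>
<p>Hello from the article body.</p>
<?php
$article->content = ob_get_clean();
return $article;
PHP
);

// include evaluates to whatever the included file returns.
$loaded = include $path;
unlink($path);

echo $loaded->title, "\n";
echo trim($loaded->content), "\n";
```

The point to notice is that include is an expression: because the definition file ends with a return statement, the caller receives the finished object directly.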
It gets better
Because the article is written in PHP, 'reading' the file is simply a matter of including it and keeping the $article object it returns. The following code shows how my indexing script handles this:
<?php
/**
 * Tests whether a file returns an Article object, adds the filename and link,
 * and returns it.
 * @param $path The path to the file (absolute, or relative to the place the
 *              method is called from)
 * @return An Article object made with info taken from the file, or false if no
 *         Article object can be distilled from the file
 */
private static function parseFile($path)
{
    // Read the file, discarding any output it produces
    $ob_level = ob_get_level();
    ob_start();
    $article = include($path);
    // Close our own buffer and any buffers the file left open
    while (ob_get_level() > $ob_level) ob_end_clean();
    if (!isset($article) || !$article instanceof Article) return false;
    // Add filename & link
    $article->filename = $path;
    $link = substr($path, strlen(CNT)); // strip content folder
    $link = substr($link, 0, -4); // strip extension
    $article->link = '/'.$link.'/';
    // Strip tags from content
    $article->content = preg_replace("/<.+>/siU", "", $article->content);
    return $article;
}
?>
In an effort to keep things running in the event of a badly written article (code-wise only, sorry guys!) I set a custom 'ignore-everything' set of error/exception handlers before calling this method. To avoid problems with the output buffering, I make sure the script finishes with as many levels of buffering as it began with: if an article calls ob_start() but forgets to close the buffer, this is done automatically after reading the file.
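The buffer bookkeeping can be tried in isolation. The sketch below is not the site's actual code: quietInclude is a hypothetical helper, the body of doNothing is a guess at what an 'ignore-everything' handler might look like, and a try/catch is swapped in for the exception handler because PHP stops the script once an exception handler has run.

```php
<?php
// Hypothetical "ignore everything" error handler: returning true tells PHP
// the error has been dealt with, so nothing is printed.
function doNothing()
{
    return true;
}

/**
 * Includes a file while discarding its output and any buffers it leaves open.
 * Returns whatever the file returns, or false if it throws.
 */
function quietInclude($path)
{
    $level = ob_get_level();       // remember how deep we were
    ob_start();                    // swallow anything the file echoes
    set_error_handler('doNothing');
    try {
        $result = include $path;
    } catch (Throwable $e) {       // assumption: PHP 7 or later
        $result = false;
    }
    restore_error_handler();
    while (ob_get_level() > $level) {
        ob_end_clean();            // close our buffer and any the file forgot
    }
    return $result;
}

// A deliberately sloppy file: it opens a buffer, never closes it,
// and reads an undefined variable.
$path = tempnam(sys_get_temp_dir(), 'bad');
file_put_contents($path, '<?php ob_start(); echo $undefined; return 42;');

$before = ob_get_level();
$value  = quietInclude($path);
$after  = ob_get_level();
unlink($path);

var_dump($value === 42, $before === $after);
```

Comparing ob_get_level() before and after the call shows that the caller's buffering depth is unchanged, no matter how many buffers the included file leaves open.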
Along with the properties mentioned above, the article's filename is stored. Using some mystical string replace functions and the magic of Apache redirects, this filename is changed into a (perma)link to the page.
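The string handling is less mystical than it sounds. A small sketch, assuming CNT holds the content folder with a trailing slash (the value below is made up for the example) and that every article file ends in .php; linkFor is a hypothetical helper name:

```php
<?php
// Assumption: CNT is the content folder, including a trailing slash.
// The value is invented for this example.
if (!defined('CNT')) {
    define('CNT', '/var/www/content/');
}

// Hypothetical helper mirroring the two substr() calls in parseFile.
function linkFor($path)
{
    $link = substr($path, strlen(CNT));   // strip content folder
    $link = substr($link, 0, -4);         // strip the 4 characters of ".php"
    return '/' . $link . '/';
}

echo linkFor(CNT . 'php/search_2.php'), "\n";   // prints /php/search_2/
```

Note that the fixed offset of -4 only works for four-character extensions, and that CNT must end in a slash or the link will start with two.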
After parsing the file, the whole (rendered!) article is available to the parseFile method, and the title, description, content etc. can simply be read out and stored. All that's left to do now is strip the HTML tags out of the content and we have an object representation of an article. (The Article class, by the way, is used only as a container for these few data items, and doesn't have any special functions/methods/encapsulation/integrity checks etc.)
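To see what the tag stripping does, here is a short sketch that runs the same pattern on a sample string and compares it with PHP's built-in strip_tags, which could serve as an alternative. The sample markup is invented for the example.

```php
<?php
$html = "<h3>A crawling/scanning combo</h3>\n"
      . "<p>See the <a href=\"search_1.php\">previous article</a>.</p>";

// The pattern used in parseFile: U makes .+ ungreedy, so each match stops at
// the first ">", and s lets a single tag span several lines.
$viaRegex = preg_replace('/<.+>/siU', '', $html);

// Built-in alternative.
$viaStripTags = strip_tags($html);

echo $viaRegex, "\n";
var_dump($viaRegex === $viaStripTags);

// Entities such as &amp; survive both approaches; decode them if the index
// should hold plain text.
echo html_entity_decode('Tom &amp; Jerry'), "\n";
```

For well-formed markup like this the two approaches give the same result; they may differ on stray angle brackets in the text, so it is worth testing against real article content.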
Scanning for articles
Indexing the articles on the site can now be done using the following snippets of code:
<?php
/**
 * Scans a directory and returns an array of Article objects
 * @param $directory The directory to scan
 * @param $recursive Set this to true to scan recursively
 * @return An array of Article objects
 */
private static function parseFiles($directory, $recursive=false)
{
    if (!is_dir($directory))
        throw new Exception('Argument passed to parseFiles is not a valid directory');
    // Make sure path ends in slash
    if (substr($directory, -1) != '/') $directory .= '/';
    $articles = array();
    $handle = opendir($directory);
    // Ignore errors and exceptions raised by badly written articles
    set_error_handler(array('ArticleDB', 'doNothing'));
    set_exception_handler(array('ArticleDB', 'doNothing'));
    while (($file = readdir($handle)) !== false) {
        // Ignore . and ..
        if ($file == '.' || $file == '..') continue;
        // If in recursive mode, scan subdirs
        if ($recursive && is_dir($directory.$file))
            $articles = array_merge($articles, self::parseFiles($directory.$file, $recursive));
        // Parse files
        if (is_file($directory.$file)) {
            $art = self::parseFile($directory.$file);
            if ($art !== false) $articles[] = $art;
        }
    }
    restore_exception_handler();
    restore_error_handler();
    closedir($handle);
    return $articles;
}
?>
This method (part of a class called ArticleDB) walks through a folder, descending into subfolders when asked to, and tries to parse every file it finds into an article.
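The same directory walk can also be written with PHP's SPL iterators instead of opendir/readdir. This is an alternative sketch, not the method used on this site; listFiles is a hypothetical helper, and the throwaway folder is created only so the example runs on its own.

```php
<?php
/**
 * Returns the paths of all regular files below $directory.
 * Hypothetical helper, not part of the ArticleDB class.
 */
function listFiles($directory)
{
    $files = array();
    $iterator = new RecursiveIteratorIterator(
        new RecursiveDirectoryIterator($directory, FilesystemIterator::SKIP_DOTS)
    );
    foreach ($iterator as $file) {
        if ($file->isFile()) {
            $files[] = $file->getPathname();
        }
    }
    sort($files);   // directory order is not guaranteed, so sort for stable output
    return $files;
}

// Build a small throwaway tree to scan.
$root = sys_get_temp_dir() . '/scan_demo_' . uniqid();
mkdir($root . '/php', 0777, true);
file_put_contents($root . '/about.php', '');
file_put_contents($root . '/php/search_2.php', '');

$found = listFiles($root);
print_r($found);

// Clean up.
unlink($root . '/about.php');
unlink($root . '/php/search_2.php');
rmdir($root . '/php');
rmdir($root);
```

Each path returned this way could then be handed to a parser such as parseFile, which already rejects anything that does not yield an Article.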
The final step
Having read all this information we still need to store it. The following method uses a custom db class (no article yet!) and the parseFiles method to create a searchable version of the site.
An extra article property present only in the db is 'updated'. This is a boolean flag that is set to 1 every time the script updates the entry for a file. By setting every 'updated' value to zero before indexing (and by overwriting any existing entry for the same file), the script is able to detect broken links in the database: entries whose file no longer exists are never touched and keep their zero. These links-to-nowhere are removed from the db in the method's final line.
<?php
/**
 * Indexes the site.
 */
public static function reIndex()
{
    global $dbArticles;
    // Mark all articles as not updated
    $dbArticles->setWhere('', array('updated'=>0));
    // Update all articles
    $articles = self::parseFiles(CNT, true);
    foreach ($articles as $art) {
        // Skip articles that have indexing switched off ('index' is an extra
        // Article property, not set in the example file shown earlier)
        if ($art->index == false) continue;
        echo "Updating ".$art->filename."<br />";
        $dbArticles->insert(
            array(
                'filename' => $art->filename,
                'link' => $art->link,
                'title' => $art->title,
                'description' => $art->description,
                'date' => $art->date,
                'category' => $art->category,
                'content' => $art->content,
                'updated' => 1
            )
        );
    }
    // Remove all articles not updated in the last run
    $dbArticles->removeWhere(array('updated'=>0));
}
?>
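The 'updated' flag amounts to a mark-and-sweep pass over the index. Since the db class is not shown, the sketch below uses a plain array keyed by filename as a stand-in for the table; the filenames and titles are invented for the example.

```php
<?php
// In-memory stand-in for the articles table, keyed by filename.
// Assumption: the real db class is not shown, so plain arrays are used.
$table = array(
    'a.php'    => array('title' => 'A (old)', 'updated' => 1),
    'gone.php' => array('title' => 'Deleted from disk', 'updated' => 1),
);

// What the scan found on disk this time round.
$scanned = array(
    'a.php' => 'A (new)',
    'b.php' => 'B',
);

// 1. Mark: flag every existing row as not updated.
foreach ($table as $filename => $row) {
    $table[$filename]['updated'] = 0;
}

// 2. Upsert: insert or overwrite each scanned article and flag it as updated.
foreach ($scanned as $filename => $title) {
    $table[$filename] = array('title' => $title, 'updated' => 1);
}

// 3. Sweep: rows still flagged 0 belong to files that no longer exist.
foreach ($table as $filename => $row) {
    if ($row['updated'] === 0) {
        unset($table[$filename]);
    }
}

print_r(array_keys($table));   // a.php and b.php remain, gone.php is removed
```

The design only works if step 2 really overwrites existing rows; a plain insert that kept the old row would leave a stale copy flagged 0 to be swept along with the dead entries.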
And that's just about it. With the code described above it becomes fairly easy to create links to the latest 5 articles (to your left) or to all articles in a certain category (this is how the PHP section works). A full-text search needs a bit more, which is described in the next article: Search III.
Jul 16th, 2008