ye olde Search II
Generating a site index
Since I wrote this article I've made some changes to the way I store articles on the site. For an updated article go here.
The previous article discussed some basic ideas on how to set up a search system for a site. Specifically, for this site. Between the time that was written and the time a first version of the search system was completed (about two hours ago) I changed my mind a little. This article presents part of the solution I came up with. Specifically, it shows how I currently index the articles stored on my site.
Templates again
Yes, you'll never hear the last of them. Articles, of course, follow a pretty standard template. They start with a title, have some content in the middle, and end with a date. Deciding not to stop there, I decreed that every article should also have a short description and a category. Every article on this site defines these five variables and adds them to an article template. Right now this template doesn't actually do anything with the description or the category, but that's beside the point. Thing is, they're there, and if I ever want to use them I know where to look. The article template is then handed to the default template for all pages on this site, where it is set as the content of the main template. This means article files on my site currently look something like this:
<?php
require_once('lotsOfClasses.php');
$title = 'Search II';
$description = 'Presents a way of indexing a local set of pages with active content';
ob_start();
?>
<h3>A crawling/scanning combo to index a site.</h3>
<p>The <a href="search_1.php">previous article</a> discussed some basic ideas on how to set up a
search system for a site. Specifically, for this site. Between the time that was written and the
time a first version of the search system was completed (about two hours ago) I changed my mind
a little. This article presents part of the solution I came up with. Specifically, it shows how
I currently index the articles stored on my site.</p>
<?php
$content = ob_get_clean();
$article = new Template(TPL.'template.article.php');
$article->set('content', $content);
$article->set('title', $title);
$article->set('description', $description);
$article->set('date', mktime(1,1,1,7,6,2008)); // hour, minute, second, month, day, year
$article->set('category', 'php');
$tpl->set('title', $title);
$tpl->set('description', $description);
$tpl->set('content', $article);
echo $tpl->fetch();
?>
Not the most beautiful code ever written, but it does the job and is quite readable even for people with limited knowledge of PHP. The main advantage: all the meta information lives in a single file.
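The Template class itself isn't shown in this article. As a rough idea of what the file above relies on, here is a minimal sketch. It assumes a template is just a PHP file rendered with the stored variables in scope, and that a nested Template is rendered to a string before the outer one. Only set(), get() and fetch() appear in the article; the class body and the two throwaway template files are made up for illustration and are not the site's real code.

```php
<?php
// Hypothetical stand-in for the site's Template class; only the method names
// set(), get() and fetch() come from the article, the rest is assumed.
class Template
{
    private $file;
    private $vars = array();

    public function __construct($file)
    {
        $this->file = $file;
    }

    public function set($name, $value)
    {
        $this->vars[$name] = $value;
    }

    public function get($name)
    {
        return isset($this->vars[$name]) ? $this->vars[$name] : null;
    }

    // Renders the template file with the stored variables as local variables
    public function fetch()
    {
        // A nested Template is rendered to a string first
        $vars = array_map(function ($value) {
            return ($value instanceof Template) ? $value->fetch() : $value;
        }, $this->vars);
        extract($vars, EXTR_SKIP);
        ob_start();
        include $this->file;
        return ob_get_clean();
    }
}

// Usage: two temporary template files stand in for the real ones
$dir = sys_get_temp_dir();
$inner = $dir . '/tpl_article_' . uniqid() . '.php';
$outer = $dir . '/tpl_page_' . uniqid() . '.php';
file_put_contents($inner, '<h2><?php echo $title; ?></h2><?php echo $content; ?>');
file_put_contents($outer, '<title><?php echo $title; ?></title><?php echo $content; ?>');

$article = new Template($inner);
$article->set('title', 'Search II');
$article->set('content', '<p>Body</p>');

$page = new Template($outer);
$page->set('title', 'Search II');
$page->set('content', $article); // the article template becomes the page content
$html = $page->fetch();

unlink($inner);
unlink($outer);
```

Because get() simply hands back whatever was stored with set(), anything holding the article template can read the title, description, date and category back out of it.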
It gets worse
Now that we've defined a clear structure for articles, let's think about getting that information out again. Because I like to use PHP in my content (separation of what now?), reading the files' contents directly isn't an option: the search function needs the content as it is sent to the user, not the unparsed PHP code. Here's the method I used to get around this snag:
<?php
/*
 * Tries to convert a file to an Article object and returns it
 * @param $path The path to the file (must be given absolute or relative to
 *              the place the method is called from)
 * @return An Article object, made with info taken from the file, or false if no
 *         Article object can be distilled from the file
 */
private static function parseFile($path)
{
    global $tpl; // The template is defined in my initial startup function
    $backup = clone $tpl;

    // Include the file and throw away its output. The variables it defines
    // (including $article) stay behind as local variables of this method.
    ob_start();
    @include($path);
    ob_end_clean();

    // Restore the main template, which the included file has modified.
    // This happens before the early return so $tpl is always put back.
    $tpl = $backup;

    if (!isset($article)) return false;

    $art = new Article();
    $art->filename = $path;
    $art->title = $article->get('title');
    $art->description = $article->get('description');
    $art->date = $article->get('date');
    $art->category = $article->get('category');

    // Strip tags from content
    $art->content = preg_replace("/<.+>/siU", "", $article->get('content'));
    return $art;
}
?>
What happens is this: the whole article is included, i.e. parsed and sent to an output buffer, from where it is merrily deleted. What remains are any variables defined in the file, now existing as local variables of the method. Is this safe? Hell no. Adding a few extra }'s to an article file means the whole indexing procedure crashes. And any (malicious) code included in the article file is happily executed. It isn't too bad though, since I'm the only person writing these files, and a user-friendly way of creating article files could easily be made to restrict this kind of abuse.
Along with the properties mentioned above, the article's filename is stored. If all goes well this should be a path relative to the root of the site, which can be used to create links to the article later.
After parsing the file, the whole (rendered!) article template is available to the parseFile method, and the title, description, content etc. can simply be read out and stored. All that's left to do now is strip the HTML tags out of the content and we have an object representation of an article. (The Article class, by the way, is used only as a container for these few data items, and doesn't have any special functions/methods/encapsulation/integrity checks etc.)
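The include-and-discard trick can be tried in isolation. The sketch below is not the site's code: captureVariables() is a made-up helper, and it assumes PHP 7 or later, where a syntax error in an included file is thrown as a ParseError that can be caught instead of killing the script. It includes a file inside a function, throws the output away and returns whatever variables the file defined.

```php
<?php
// Hypothetical helper (assumes PHP 7+): includes a file inside a function,
// discards its output and returns the variables the file defined.
// The odd $__ names keep the helper's own locals from clashing with the file's.
function captureVariables($__path)
{
    $__level = ob_get_level();
    ob_start();
    try {
        include $__path;
    } catch (Throwable $__e) {
        // A syntax error in the included file arrives here as a ParseError
        while (ob_get_level() > $__level) ob_end_clean();
        return false;
    }
    // Drop our buffer, plus any buffer the file opened and forgot to close
    while (ob_get_level() > $__level) ob_end_clean();
    unset($__path, $__level);
    return get_defined_vars();
}

// Usage: one well-formed file and one with a few extra }'s
$good = tempnam(sys_get_temp_dir(), 'art');
file_put_contents($good, '<?php $title = "Search II"; echo "rendered page";');
$bad = tempnam(sys_get_temp_dir(), 'art');
file_put_contents($bad, '<?php $title = "Broken"; }}');

$vars = captureVariables($good);  // the echo is swallowed, $title is kept
$broken = captureVariables($bad); // false instead of a crashed indexer

unlink($good);
unlink($bad);
```

This only softens the crash. Any code in the file still runs with full privileges, so the warning about malicious content stands.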
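For the tag stripping step, PHP's built-in strip_tags() is an alternative to the regular expression used in parseFile. A minimal sketch, where toPlainText() is a hypothetical helper name and the entity decoding and whitespace cleanup are extras the original method doesn't do:

```php
<?php
// Hypothetical helper: turns rendered article HTML into plain text for the index
function toPlainText($html)
{
    // Remove the markup
    $text = strip_tags($html);
    // Turn entities such as &amp; back into the characters a user would search for
    $text = html_entity_decode($text, ENT_QUOTES, 'UTF-8');
    // Collapse the runs of whitespace left behind by removed tags
    return trim(preg_replace('/\s+/', ' ', $text));
}

$html = "<h3>Indexing</h3>\n<p>Tom &amp; Jerry read <a href=\"search_1.php\">this</a>.</p>";
$plain = toPlainText($html); // "Indexing Tom & Jerry read this."
```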
Scanning for articles
Indexing the articles on the site can now be done using the following snippets of code:
<?php
/*
 * Scans a directory and returns an array of Article objects
 * @param directory The directory to scan
 * @param recursive Set this to true to scan recursively
 * @return An array of Article objects
 */
private static function parseFiles($directory, $recursive=false)
{
    if (!is_dir($directory))
        throw new Exception('Argument passed to parseFiles is not a valid directory');

    // Make sure path ends in slash
    if (substr($directory, -1) != '/') $directory .= '/';

    $articles = array();
    $handle = opendir($directory);
    while (($file = readdir($handle)) !== false) {
        // Ignore . and ..
        if ($file == '.' || $file == '..') continue;

        // If in recursive mode, scan subdirs
        if ($recursive && is_dir($directory.$file)) {
            $articles = array_merge($articles, self::parseFiles($directory.$file, $recursive));
            continue;
        }

        // Parse files
        if (is_file($directory.$file)) {
            $art = self::parseFile($directory.$file);
            if ($art !== false) $articles[] = $art;
        }
    }
    closedir($handle);
    return $articles;
}
?>
This method (part of a class called ArticleDB) runs through a folder, descending into subfolders when asked to, and tries to parse every file it finds into an article.
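The same directory walk can also be written with PHP's SPL iterators instead of opendir() and readdir(). This is only a sketch of that alternative, not what the site uses. listFiles() is a hypothetical function that just collects paths, so each one could then be passed to parseFile.

```php
<?php
// Hypothetical alternative to the opendir() loop, using PHP's SPL iterators.
// Returns a sorted list of file paths instead of Article objects.
function listFiles($directory, $recursive = false)
{
    if (!is_dir($directory)) {
        throw new InvalidArgumentException("Not a directory: $directory");
    }

    // SKIP_DOTS takes care of the . and .. entries
    if ($recursive) {
        $iterator = new RecursiveIteratorIterator(
            new RecursiveDirectoryIterator($directory, FilesystemIterator::SKIP_DOTS)
        );
    } else {
        $iterator = new FilesystemIterator($directory, FilesystemIterator::SKIP_DOTS);
    }

    $files = array();
    foreach ($iterator as $file) {
        // Each item is an SplFileInfo object
        if ($file->isFile()) {
            $files[] = $file->getPathname();
        }
    }
    sort($files);
    return $files;
}
```

A call such as listFiles($contentDir, true), where $contentDir is whatever folder holds the articles, gives every file below that folder.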
The final step
Having read all this information we still need to store it. The following method uses a custom db class (no article yet!) and the parseFiles method to create a searchable version of the site.
An extra article property present only in the db is 'updated'. This is a flag set to 1 every time the script updates the entry for a file. By setting all values of updated to zero before indexing (and by overwriting any existing entry for the same file) the script is able to detect stale entries: a row whose file no longer exists still has updated set to 0 when the run is finished. These links-to-nowhere are removed from the db in the method's final line.
<?php
/*
 * Indexes the site.
 *
 */
public static function reIndex()
{
    global $dbArticles;

    // Set all article statuses to not-updated
    $dbArticles->setWhere('', array('updated'=>0));

    // Update all articles
    $articles = self::parseFiles(CNT, true);
    foreach ($articles as $art)
    {
        echo "Updating ".$art->filename."<br />";
        $dbArticles->insert(
            array(
                'filename' => $art->filename,
                'title' => $art->title,
                'description' => $art->description,
                'date' => $art->date,
                'category' => $art->category,
                'content' => $art->content,
                'updated' => 1
            )
        );
    }

    // Remove all articles not updated in the last run
    $dbArticles->removeWhere(array('updated'=>0));
}
?>
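The mark, update and sweep steps don't depend on the custom db class. Here is the same idea on a plain array, as a minimal sketch: reindex() and the array layout are made up for illustration, with the array keyed by filename standing in for the database table.

```php
<?php
// Hypothetical in-memory stand-in for the database table, keyed by filename
function reindex(array $index, array $found)
{
    // 1. Mark every known entry as not updated
    foreach ($index as $filename => $row) {
        $index[$filename]['updated'] = 0;
    }

    // 2. Insert or overwrite every article found on disk and mark it as updated
    foreach ($found as $filename => $title) {
        $index[$filename] = array('title' => $title, 'updated' => 1);
    }

    // 3. Sweep: drop the entries that no file refreshed during this run
    foreach ($index as $filename => $row) {
        if ($row['updated'] === 0) {
            unset($index[$filename]);
        }
    }
    return $index;
}

// Usage: deleted.php is in the old index but no longer exists on disk
$index = array(
    'search_1.php' => array('title' => 'Search I', 'updated' => 1),
    'deleted.php'  => array('title' => 'Gone', 'updated' => 1),
);
$found = array('search_1.php' => 'Search I', 'search_2.php' => 'Search II');
$index = reindex($index, $found);
// $index now holds search_1.php and search_2.php; deleted.php was swept out
```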
And that's just about it. With the code described above it becomes fairly easy to create links to the latest 5 articles (to your left) or to all articles in a certain category (this is how the php section works). A full-text search needs a bit more, which is described in the next article: Search III.
Jul 5th, 2008