Search III
A very simple text search
This article (if you can call it that) discusses the last few ingredients needed to make the script behind this site's search function: searching, ranking the results and parsing the search terms
Parsing a search term is a more difficult task than I had first imagined. What to use? what not to use? And what to do with the results? The current solution is very simplistic and goes like this:
- Check for full phrases, entered as "example one" or 'example two'.
- Remove all non-word characters from the result, except for whitespace (a "word" character is
any letter or digit or the underscore character, that is, any character which can be part of a Perl 'word'
) - Split the string around whitespaces, search for each individual word
The second step seems rather arbitrary. Why underscores? What if you want to search for the & sign? The answer is that it only takes a single line of php this way. If you like you can construe this as a maximisation of lazyness vs effectiveness and argue that this is an example of cost-effective programming. In fact, please do. Truth be told, it's one of those "let's see if this will work" solutions. So far it works.
The same kind of lazy programming led to the following few lines. It's not all bad though, as the php regex engine is pretty fast. Why reinvent the wheel etc etc.
<?php
/*
* Splits a search term, maintaining full phrases. Examples:
* "one two three" yields array("one", "two", "three")
* while
* "one 'two three'" yields array("two three", "one", "two", "three")
* Phrases can be indicated with single or double quotes.
*/
private static function parseSearchTerm($term)
{
// Extract whole sentences
preg_match_all('/"(.+)"/', $term, $phrases);
$parts = $phrases[1];
preg_match_all("/'(.+)'/", $term, $phrases);
$parts = array_merge($parts, $phrases[1]);
// Replace non-word characters with spaces
$term = preg_replace('/[\W]/', ' ', $term);
// Split remaining term on whitespace
$parts = array_merge($parts, preg_split('/\s+/', trim($term)));
return $parts;
}
// Turns an array into a regex, with the array values separated by OR's
private static function makeRegex($parts)
{
//var_dump($parts);
$regex = implode($parts,'|');
return '/'.$regex.'/i';
}
?>
The function makeRegex() now returns a regular expression ready for php to use in a full text search. This is exactly what the following bit of code does. Here $dbArticles is a reference to a database class (a database table abstraction really) and the method rows2articles is responsible for the conversion of database rows to Article objects.
<?php
/*
* Scans the content folder for articles containing the given search term
*/
public static function search($term)
{
global $dbArticles;
// Parse search term to regex
$parts = self::parseSearchTerm($term);
$regex = self::makeRegex($parts);
// Get articles from db
$articles = self::rows2articles($dbArticles->getAll());
// Scan articles for search term
$hits = array();
foreach($articles as $article) {
$matches = preg_match_all($regex, $article->content, $foo);
if ($matches > 0) {
$article->matches = $matches;
$hits[] = $article;
}
}
return usort($hits, array('self', 'hitComparison'));;
}
// Returns the difference in mathces between a and b
public static function hitComparison($a, $b)
{
return ($b->matches - $a->matches);
}
?>
Et voila: a full text search.
Unfortunately, there are a lot of things wrong with this approach. In the following section I'll try to point them out and suggest improvements where I can. But as a start, it's pretty alright.
Towards a not-so-simple search
A short list of what is wrong:
- Think of a crazy word, prefix it with 'the' and try it in the search bar. Wether or not this word is actually included anywhere on the sites, the search will return lots of hits for any article containing the word 'the'. Even worse, say you're search for an article on 'loss of hair', now imagine there's a 20 word article explaining a fail-safe remedy for loss of hair. The way the current search system works, this item will end up somewhere near the bottom, while any article sufficiently long containing a lot of 'of's will get a top rating.
- Then again, if you want to search for 'the' the site shouldn't stop you. A search for 'The Shining' should definitely include the 'the' as part of the term.
- Scanning the site for bears by searching for the word 'bear' could return plenty of articles about maintaining a beautiful 'beard'.
- A search for a phrase (entered with quotation marks) should return more results for the full phrase than for the individual words
- The system has no way of measuring page popularity and doesn't include this factor in the ranking
Some of these points seem more important than others, some might be indirectly solved by solving some ohter, related problem. Some problems are undoubtly missing from the list.Nevertheles the following sections will try to come up with a solution to these issues.
Weights & multiple searches
An obvious solution to the problem of dealing with a search for 'the armadillos' is to add some kind of weights to the search terms. A search for "the house" could detect "the" as a less important word (using some list of common words like 'the', 'and', 'of', 'it', etc) and focus mostly on houses. Similarly, a search for a full phrase could attach a bigger weight to the full phrase than to the individual words.
Apart from registering how many times each term is found on the page, the search should register how many of the terms were found. Some solution with weights could be used to put the articles that matched the most terms at the top, of maybe it'll be engouh to display the results in categories: articles that matched all terms at the top, articles that matched only 2 out of 3 terms below etc.
To avoid finding fractions of words the search could be modified to only return results for whole words. This will involve some awkward decisions on what is a whole word: when searching for 'mp3', does michael_sings.mp3 count as a valid result? Should it count for 'sings'?
The implementation of some kind of weighted search will probably require a custom search algorithm. Luckily, good search algorithm's aren't that hard to find. A lot of research has been done on text searches so a good book on algorithms will have the most popular algorithms listed. With a bit of luck, at least one of these could be adapted to do the nifty things a weighted search requires
Measuring page popularity is a whole different issue. Since I'm searching my own site statistics shouldn't be too hard to come by. Interpreting them is a different matter. No thoughts on this yet :)
And?
Writing a better search is feasible, but will require a little work. More on this when I get off my butt and...
More on this when I get on my butt and start programming. See you then!
Jul 9th, 2008
Comments
No comments yet! Feel free to post some using the form below.
If you wish to add code to your comment you can use code tags, like this: <code class="php">yourCodeHere</code>.
Quite a large number of languages are supported, although I can't guarantee it'll be pretty. Inside the code tags you can use any characters except for the string "</code>".