Prefix tree
The trie, or prefix tree, is a data structure for storing strings or other sequences in a way that allows for a fast look-up. In its simplest form it can be used as a list of keywords or a dictionary. By associating each string with an object it can be used as an alternative to a hashmap. The name 'trie' comes from the word 'retrieval'.
The basic idea behind a trie is that each successive letter is stored as a separate node. To find out if the word 'cat' is in the list you start at the root and look up the 'c' node. Having found the 'c' node you search the list of c's children for an 'a' node, and so on. To differentiate between 'cat' and 'catalog' each word is ended by a special delimiter.
The figure below shows a schematic representation of a partial trie:

Implementation
The fastest way to implement this is with fixed size arrays. Unfortunately this only works if you know which characters can show up in the sequences. For keywords with 26 letters its a fast but space consuming option, for unicode strings its pretty much impossible.
Instead of fixed sizes arrays you can use a linked list at each node. This has obvious space advantages, since no more empty spaces are stored. Unfortunately searching a long linked list is rather slow. For example to find the word 'zzz' you might need 3 times 26 steps.
Faster trie algorithms have been devised that lie somewhere between these two extremes in terms of speed and space consumption. These can be found by searching google.
Fun & games with prefix trees
Prefix trees are a bit of an overlooked data structure with lots of interesting possibilities.
Storage
By storing values at each leaf node you can use them as a kind of alternative hashmap, although when working with unicode strings a hashmap will greatly outperform a trie.
As a dictionary
Looking up if a word is in a trie takes O(n) operations, where n is the length of the word. Thus - for array implementations - the lookup speed doesn't change with increasing trie size.
Word completion
Word completion is straightforward to implement using a trie: simply find the node corresponding to the first few letters, and then collape the subtree into a list of possible endings.
This can be used in autocompleting user input in text editors or the T9 dictionary on your phone
Censoring strings
Given a large list of swear words and a string to censor a trie offers a speed advantage over a simple array of strings. If the swear word can appear anywhere in the string you'll need to attempt to match it from any possible starting offset. With a string of m characters and a list of n words this would mean m*n string comparisons.
Using a trie you can attempt to find a match from each given offset in the string, this means m trie lookups. Since the speed of a trie lookup scales well with an increasing number of words this is considerably faster than the array lookup.
Java linked list implementation
Just for fun, here's a java linked list implementation. Keep in mind that this is a fairly slow implementation. For serious speed boosts you'll need to investigate double or triple-array tries.
Please note: the version below is a simplified version intended only to give some insight into the workings of the Trie. For the full version please see the Downloads section.
public class Trie
{
/**
* The delimiter used in this word to tell where words end. Without a proper delimiter either A.
* a lookup for 'win' would return false if the list also contained 'windows', or B. a lookup
* for 'mag' would return true if the only word in the list was 'magnolia'
*
* The delimiter should never occur in a word added to the trie.
*/
public final static char DELIMITER = '\u0001';
/**
* Creates a new Trie.
*/
public Trie()
{
root = new Node('r');
size = 0;
}
/**
* Adds a word to the list.
* @param word The word to add.
* @return True if the word wasn't in the list yet
*/
public boolean add(String word)
{
if (add(root, w + DELIMITER, 0))
{
size++;
int n = word.length();
if (n > maxDepth) maxDepth = n;
return true;
}
return false;
}
/*
* Does the real work of adding a word to the trie
*/
private boolean add(Node root, String word, int offset)
{
if (offset == word.length()) return false;
int c = word.charAt(offset);
// Search for node to add to
Node last = null, next = root.firstChild;
while (next != null)
{
if (next.value < c)
{
// Not found yet, continue searching
last = next;
next = next.nextSibling;
}
else if (next.value == c)
{
// Match found, add remaining word to this node
return add(next, word, offset + 1);
}
// Because of the ordering of the list getting here means we won't
// find a match
else break;
}
// No match found, create a new node and insert
Node node = new Node(c);
if (last == null)
{
// Insert node at the beginning of the list (Works for next == null
// too)
root.firstChild = node;
node.nextSibling = next;
}
else
{
// Insert between last and next
last.nextSibling = node;
node.nextSibling = next;
}
// Add remaining letters
for (int i = offset + 1; i < word.length(); i++)
{
node.firstChild = new Node(word.charAt(i));
node = node.firstChild;
}
return true;
}
/**
* Searches for a word in the list.
*
* @param word The word to search for.
* @return True if the word was found.
*/
public boolean isEntry(String word)
{
if (word.length() == 0)
throw new IllegalArgumentException("Word can't be empty");
return isEntry(root, w + DELIMITER, 0);
}
/*
* Does the real work of determining if a word is in the list
*/
private boolean isEntry(Node root, String word, int offset)
{
if (offset == word.length()) return true;
int c = word.charAt(offset);
// Search for node to add to
Node next = root.firstChild;
while (next != null)
{
if (next.value < c) next = next.nextSibling;
else if (next.value == c) return isEntry(next, word, offset + 1);
else return false;
}
return false;
}
/**
* Returns the size of this list;
*/
public int size()
{
return size;
}
/**
* Returns all words in this list starting with the given prefix
*
* @param prefix The prefix to search for.
* @return All words in this list starting with the given prefix, or if no such words are found,
* an array containing only the suggested prefix.
*/
public String[] suggest(String prefix)
{
return suggest(root, prefix, 0);
}
/*
* Recursive function for finding all words starting with the given prefix
*/
private String[] suggest(Node root, String word, int offset)
{
if (offset == word.length())
{
ArrayList<String> words = new ArrayList<String>(size);
char[] chars = new char[maxDepth];
for (int i = 0; i < offset; i++)
chars[i] = word.charAt(i);
getAll(root, words, chars, offset);
return words.toArray(new String[words.size()]);
}
int c = word.charAt(offset);
// Search for node to add to
Node next = root.firstChild;
while (next != null)
{
if (next.value < c) next = next.nextSibling;
else if (next.value == c) return suggest(next, word, offset + 1);
else break;
}
return new String[] { word };
}
/**
* Searches a string for words present in the trie and replaces them with stars (asterixes).
* @param z The string to censor
*/
public String censor(String s)
{
if (size == 0) return s;
String z = s.toLowerCase();
int n = z.length();
StringBuilder buffer = new StringBuilder(n);
int match;
char star = '*';
for (int i = 0; i < n;)
{
match = longestMatch(root, z, i, 0, 0);
if (match > 0)
{
for (int j = 0; j < match; j++)
{
buffer.append(star);
i++;
}
}
else
{
buffer.append(s.charAt(i++));
}
}
return buffer.toString();
}
/*
* Finds the longest matching word in the trie that starts at the given offset...
*/
private int longestMatch(Node root, String word, int offset, int depth, int maxFound)
{
// Uses delimiter = first in the list!
Node next = root.firstChild;
if (next.value == DELIMITER) maxFound = depth;
if (offset == word.length()) return maxFound;
int c = word.charAt(offset);
while (next != null)
{
if (next.value < c) next = next.nextSibling;
else if (next.value == c) return longestMatch(next, word,
offset + 1, depth + 1, maxFound);
else return maxFound;
}
return maxFound;
}
/*
* Represents a node in the trie. Because a node's children are stored in a linked list this
* data structure takes the odd structure of node with a firstChild and a nextSibling.
*/
private class Node
{
public int value;
public Node firstChild;
public Node nextSibling;
public Node(int value)
{
this.value = value;
firstChild = null;
nextSibling = null;
}
}
private Node root;
private int size;
private int maxDepth; // Not exact, but bounding for the maximum
}Please note: the code given above is intended only to give some insight into the workings of the Trie. For the full version of the class please see the Downloads section.
Nov 4th, 2009
Comments
BLABLA wrote:
Very intresting...ok,ok it was VERY INTRESTING!!!
Nov 6th, 2011
Jonathan wrote:
I know this is fairly old, but I'd like to respond to Andelo below. Succinct tries are a space-efficient method of encoding tries that, importantly, does not need to be decoded to be used. This means that there is no need for recursive serialization, as it already is in storable form. Here is a page explaining them: http://stevehanov.ca/blog/index.php?id=120. Succinct tries seem to be slower, but I've only seen results in Javascript (http://ejohn.org/blog/revised-javascript-dictionary-search/). Maybe it will be different in other languages.
Jul 5th, 2011
michael wrote:
Hi, thanks for the feedback! The problem you mention is solved by adding a delimiter to each word before searching, indicating the end of the word. So "jac" becomes "jacX" where "X" is the delimiter. Similarly, "jack" has become "jackX" so the routine matches "j", "a" and "c" but then returns false when it matches "X" against "k".
The only real drawback is the delimiter, which should be some special value that can never show up in a word.
Apr 28th, 2011
trieScanner wrote:
Hi just some feedback,
I don't think the isEntry method is implemented correctly. Consider having a trie with a single element for example "jack". Now if I search for "jac" for code will mistakenly return true
Apr 28th, 2011
Silly michael wrote:
_very_ silly
Nov 21st, 2010
potty singh wrote:
You are silly and this is silly
Nov 21st, 2010
Michael wrote:
Hi Api! Thanks for reading. I'm afraid I don't have any good references for you, but if you search for "array based trie implementation" or double or triple array tries (which are all just different implementations of the same thing) you should be able to find some pages that delve deeper into this. It's all very specific knowledge though and the solutions are usually not very elegant or even all that interesting, just very very useful (hope that doesn't offend anyone). That means you're not very likely to find a clean & easy tutorial page, but will need to dig a little deeper into the computer science side of things.
Nov 14th, 2010
api wrote:
Hey,
Thanks for the post.You said:" For serious speed boosts you'll need to investigate double or triple-array tries.".
what do you mean by double or triple array tries?
also how can I find an algorithm between these two extremes as you said in google! what should I search for?
Thanks
Nov 14th, 2010
Michael wrote:
No idea! I guess you could base it on the array implementation of the trie structure: http://linux.thai.net/~thep/datrie/datrie.html
Perhaps if you labelled each node, starting at the deepest level of the tree, and then for each higher level store the indices of the nodes it points to?
Apr 27th, 2010
Andelo wrote:
HI, thank you for the post. just one question, What's the best way to persist this kind of data structure? In my android app I need to store a trie structure in a flattered way. I cannot use recursive serialization because android stack is small so it causes stackoverflow.
Apr 27th, 2010
Michael wrote:
Thanks!
Because the code fragment was so large I did a quick scan to remove any "unnecessary" methods before posting. Looks like that was a mistake :)
I've updated the article and added the missing method.
Dec 9th, 2009
Jack1984 wrote:
Hi, I must be missing something but inside the second suggest method you call on a getAll method that is nowhere to be seen. What does this mean? Where is it?
Dec 9th, 2009
If you wish to add code to your comment you can use code tags, like this: <code class="php">yourCodeHere</code>.
Quite a large number of languages are supported, although I can't guarantee it'll be pretty. Inside the code tags you can use any characters except for the string "</code>".