Radix trees for client-side natural language processing

Imagine you need to compare a user's text input to a large dataset of text on the client-side. Examples include:

  • You're building a service that shows instant personalized recommendations as the user types. The service compares each word of their input to a set of keywords.
  • You're making a custom spell checking or translation service. It should rapidly validate user input against natural language dictionaries.

You could iterate through an array of keywords, but the linear approach gets expensive. Let's say a user types 500 words. Checking each word against the Second Edition of the Oxford English dictionary requires 85.738 million operations per keystroke.

It's faster to store the words in a radix tree. With this data structure you can check strings in constant time. Finding a string in a set of 500,000 keywords requires the same number of operations as finding it in a set of 500 keywords.

Radix trees also consume less memory in the browser. If 5,000 English words start with an "a," the tree only stores one "a." Words can share all the letters they have in common. Unfortunately I don't have time to run an experiment for this post, but a few years ago I converted a plain English dictionary object to a radix tree and it reduced the size by at least 25%.

Convert a 1-dimensional string array to a radix tree in JavaScript:

Check if a string exists in the tree:

Further Reading: