Implementing a Simple Word Search Using Azure Table Storage

Exploring Schema-less Storage Patterns

Mike Amundsen

2009-03-10

1. Introduction

2. A Different Storage Model

3. Seeing Data in a New Way

3.1. Typical RDBMS Solution

3.2. A Schema-less Alternative

3.2.1. The Messages Entity Collection
3.2.2. The Search Entity Collection
3.2.3. Search Results - And Some Tweaks
3.2.4. Some Caveats: Views, Sharding, and Updates

4. Summary

1. Introduction

As 'schema-less' storage becomes more prevalent, the patterns and practices learned from relational data storage models may be inappropriate for familiar data-handling situations. One example is the task of implementing a simple word search pattern for a short message application (i.e. Twitter). With schema-less storage systems, using near-match keywords such as LIKE, linking different table sets using JOIN, and sorting the returned results via secondary indexes are often unavailable. Instead, new implementation strategies must be employed to get the same results.

This article will walk the reader through a simple scenario (implementing a word search against stored messages) and compare common relational data model practices to techniques that work for schema-less storage models. Azure Table Storage is used in this example, but the same approach will work for similar storage systems including Amazon's SimpleDB.

In addition, this article outlines a some strategies that can be applied to a wide range schema-less storage solutions.

2. A Different Storage Model

Azure Table Storage is organized as a variable set of properties tied to a few basic predefined keys. In Azure storage, those keys are the table name, the partition key, and the row key. This 'Entity-Attribute-Value' (EAV) data model supports search and key-based retrieval engines that return results very quickly even when the data is stored in more than one location. In fact, one of the great advantages of EAVs is that they very effective on widely distributed, heterogeneous data collections.

There are some downsides to the EAV model. Over the last forty years the 'Relational Database Management System' (RDBMS) approach has dominated computer data storage. Developers have grown comfortable with powerful features of relation data models including foreign-key validation, the ability to create custom indexes to support searching, and the ability to iterate over several tables in a single query to produce 'joined' result sets.

EAV models usually do not support relational concepts such as JOIN or LIKE in queries. Even when EAVs support these features, the implementation can be less than complete and/or performance suffers. In fact, these types of 'relational idioms' are not appropriate when working with EAV data stores. Creating successful EAV queries means thinking about data storage in a different way. For example, in an EAV model, database normalization rules can be a hindrance to performance. With EAVs, we may need to consider changing where and in what form we store our data. We may need to use compiled code or scripts to produce the desired output. That includes the possibility of employing asynchronous and/or parallel coding to speed execution against one or more physical data stores at various locations on the Internet. In other words, we need to adopt different idioms when working with EAVs.

3. Seeing Data in a New Way

Implementing a word search is an example of the challenges EAVs can pose to those not familiar with this different data model. Using standard T-SQL, supporting word searches against a table field is relatively trivial. However, with EAV models the lack of table-scanning query features such as the LIKE keyword and the (sometimes) lack of secondary index support mean familiar tactics may not apply.

3.1. Typical RDBMS Solution

As a start, let's review the details of implementing a simple word search on a single field (MessageText) in a single table (Messages) using common T-SQL. Fist, we need to define our Message table.

CREATE TABLE Messages
  (Id integer PRIMARY KEY,
   MessageText varchar(140),
   Author varchar(50),
   DateCreated date);

The easiest way to implement the word search against this table is to use the LIKE keyword in a SELECT query:

CREATE PROCEDURE wordSearch
  @word varchar(50)
  SELECT Id, MessageText, Author, DateCreated FROM Messages
  WHERE MessageText LIKE '%'+@word+'%'

There are some details, such as adding an index on the MessageText field to speed things along, but the essentials are all here. This solution will run into problems as the data table grows, but for many cases, this is the common approach to the problem.

3.2. A Schema-less Alternative

Implementing a word search in Azure Table Storage is not quite as easy. First we don't have an equivalent to the LIKE keyword at our disposal. Second, we don't have the ability to create secondary indexes on a table (or Entity collection in EAV-speak). That means we need to do the work ourselves. Essentially, we need to build a table to hold the links between search words and the actual Message records. Doing this will allow us to mimic the LIKE keyword from relation query languages.

3.2.1. The `Messages` Entity Collection

To begin, we need to define the Messages entity collection, This will hold the same data as the T-SQL Messages table we defined earlier:

<content type="application/xml" 
  xmlns:d="http://schemas.microsoft.com/ado/2007/08/dataservices" 
  xmlns:m="http://schemas.microsoft.com/ado/2007/08/dataservices/metadata">
  <m:properties>
    <d:PartitionKey>Messages</d:PartitionKey>
    <d:RowKey>1000</d:RowKey>
    <d:MessageText>The first message</d:MessageText>
    <d:DateCreated m:type="Edm.DateTime">2009-03-01T12:00:00.000Z</d:DateCreated>
  </m:properties>
</content>

Note that we do not define a schema for the Message collection. Instead we simply build a record and write it to the data store. This is the essence of working with schema-less storage.

3.2.2. The `Search` Entity Collection

Next we need to build a collection that will allow us to return all the Message entities that contain the word in our search criteria. As shown in the previous example, we only need to write the entities directly to the data store; no schema definition is needed. Below is a set of Search entities for the MessageText from the previous example.

<content type="application/xml" 
  xmlns:d="http://schemas.microsoft.com/ado/2007/08/dataservices" 
  xmlns:m="http://schemas.microsoft.com/ado/2007/08/dataservices/metadata">
  <m:properties>
    <d:PartitionKey>Search</d:PartitionKey>
    <d:RowKey>1001</d:RowKey>
    <d:Word>the</d:MessageText>
    <d:MessageId>1000</d:MessageText>
    <d:DateCreated m:type="Edm.DateTime">2009-03-01T12:00:00.000Z</d:DateCreated>
  </m:properties>
</content>

<content type="application/xml" 
  xmlns:d="http://schemas.microsoft.com/ado/2007/08/dataservices" 
  xmlns:m="http://schemas.microsoft.com/ado/2007/08/dataservices/metadata">
  <m:properties>
    <d:PartitionKey>Search</d:PartitionKey>
    <d:RowKey>1002</d:RowKey>
    <d:Word>first</d:MessageText>
    <d:MessageId>1000</d:MessageText>
    <d:DateCreated m:type="Edm.DateTime">2009-03-01T12:00:00.000Z</d:DateCreated>
  </m:properties>
</content>

<content type="application/xml" 
  xmlns:d="http://schemas.microsoft.com/ado/2007/08/dataservices" 
  xmlns:m="http://schemas.microsoft.com/ado/2007/08/dataservices/metadata">
  <m:properties>
    <d:PartitionKey>Search</d:PartitionKey>
    <d:RowKey>1003</d:RowKey>
    <d:Word>message</d:MessageText>
    <d:MessageId>1000</d:MessageText>
    <d:DateCreated m:type="Edm.DateTime">2009-03-01T12:00:00.000Z</d:DateCreated>
  </m:properties>
</content>

In the above example, a Search entity is created for each unique word in the Message entity. Each Search entity contains the word itself along with a pointer to the Message entity that contains that word. Note that the <d:Word /> element in the entity holds a lowercase version of the one in the original Message entity. This is a way of 'normalizing' the search-able data. Along with making the search words lowercase, it may be appropriate removing punctuation, extra spaces, etc.

3.2.3. Search Results - And Some Tweaks

Once the Search entities are created, executing a query to find all the Message records that use a specific word is straight-forward. Here's a sample query that is valid for Azure Table Storage:

/Search$filter=(Word eq 'first')

The above query will return a list of Search entities that point to all the Message entities containing the search word. Depending on what is needed, the result document can contain a simple count of the found entities or the query results can be presented as a set of links:

<results xml:base="http://example.org/">
  <link href="1000" title="Message 1000" />
  <link href="1100" title="Message 1100" />
  <link href="2000" title="Message 2000" />
  <link href="2009" title="Message 2009" />
</results>

Most user would expect more than Id numbers as a response to their search. If we had access to the JOIN feature of T-SQL, the solution would be easy. For example, we could JOIN the Search entities to the Message entities and return the linked Message entities themselves:

CREATE PROCEDURE wordSearch2
  @search nvarchar(50)
  SELECT m.Id, m.MessageText 
  FROM Messages m 
  JOIN Search s on s.MessageId=m.MessageId 
  WHERE s.Word=@search

However, JOIN is not supported Azure Table Storage (or most EAVs). So, we have at least two other options:

Mimic the JOIN through compiled code
Alter the data model/storage pattern

In the first option, we could query for the Search entities and, using compiled code, iterate through the resulting collection to fetch each the related Message entities. Once that is done we could build the results document. This option works for small collections, but as the number of entities grows, it turns out to be a sub-optimal solution.

The better option is the second one; modify the data stored in the Search entity. In this case, adding one more elements to the Search entity (MessageText), provides all the information needed to render the results in a way that most users would expect:

<content type="application/xml" 
  xmlns:d="http://schemas.microsoft.com/ado/2007/08/dataservices" 
  xmlns:m="http://schemas.microsoft.com/ado/2007/08/dataservices/metadata">
  <m:properties>
    <d:PartitionKey>Search</d:PartitionKey>
    <d:RowKey>1003</d:RowKey>
    <d:Word>message</d:MessageText>
    <d:MessageId>1000</d:MessageText>
    <d:MessageText>The first message</d:MessageText>
    <d:DateCreated m:type="Edm.DateTime">2009-03-01T12:00:00.000Z</d:DateCreated>
  </m:properties>
</content>

Now rendering the results is straightforward:

<results xml:base="http://example.org/">
  <link href="1000" title="The first message." />
  <link href="1100" title="I'm first in line for the movie!" />
  <link href="2000" title="First things first." />
  <link href="2009" title="I hate sitting in the first row." />
</results>

This last step is, in effect, a form of query optimization. In relational models, query optimization usually involves creating secondary indexes and/or modifying the SELECT statement. In the EAV model, optimization often means creating and/or modifying a custom entity collection. In this example, modifying the Search entity in an EAV model is similar to creating a materialized view in a relational model.

It should also be noted that many EAVs (Azure Table Storage among them) do not support creating secondary indexes in order to control the sort order of the query results. In that case, it is up to the designer of the 'view set' to set a primary key that returns the collection in the desired order. For example, Azure Table Storage uses a combination of the PartitionKey and RowKey to control the order of the result set. Thus, to return the search results in the reverse order in which they were created, we could modify the RowKey to hold the proper value. In this case, subtracting the date/time the Message entity was created from some far future date (futureDateTime.ticks - currentDateTime.ticks) returns a value that automatically sorts as expected.

3.2.4. Some Caveats: Views, Sharding, and Updates

In this simple example, some additional considerations remain unresolved. For example, if the implementation requires other common search collections (users who posted in the last ten days sorted by author, all messages added to the system in the last month sorted by date created, etc.) we would need to implement additional 'view collections' to meet those needs. This is similar to using the VIEW and INDEX features of modern relational data systems - except that RDBMS' usually handle the details of maintaining the extra indexes and view sets. In EAV models, it's up to the implementer to properly update the contents of the additional collections.

Also, most EAV data stores have limits on the size of a single entity and of an entity container. That means, at some point, we need to consider organizing your data storage in to separate containers or shards. Sharding complicates queries since a complete query might need to cross more than one container; possibly all of them. A common solution is to launch parallel requests into each container and collect the results before formatting a response.

Finally, copying portions of source entities into query entities can adversely affect update semantics. Updating an existing record may require additional updates to one or more query entities. In order to prevent long wait times upon submitting updates, a common solution is to execute the query updates asynchronously. Depending the size of the data store, this can result in search results that are out of sync with the underlying source entities and might require additional error-handling when attempting to resolve search requests.

These caveats are all part of the cost of moving from a relational model to an EAV model. The good news is that the solutions suggested here work whether the data stores are on your own server or out on the Internet; in a single location or spread over multiple servers. Also, depending on the data source, the same patterns can work across data stored in disparate formats.

4. Summary

This article identified some key differences between relational and schema-less data models. Primarily the EAV model's lack of support for LIKE, JOIN, and secondary indexing patterns normally found in relational data storage systems. The strategy of creating additional entity collections that mimic RDBMS materialized views was given as one way to build applications that return the same results expected from relational systems. Also, the new challenges of maintaining custom views, handling data sharding, and keeping track of data changes were also identified. For example implementing parallel queries and asynchronous updates helps reduce wait times in EAV data stores.

While it is true that EAV data models currently require additional implementation effort, it is likely that, as the use of EAVs increases, these additional patterns and practices will be 'baked' into code libraries or in some other way hidden from average data-handling chores. It is reasonable to expect that by the time EAV storage reaches the same level of maturity as RDBMS models (40 years and counting) implementation details will be no more demanding than, and just as reliable as, those experienced when implementing solutions on relational storage systems today.