Thursday, October 22, 2009

Designs for making your searches faster - Part 3

In this part, we will look at an improved solution to the problem we have been discussing: making enterprise transaction searches faster.
Caching Search Results
One of the weakest links in the earlier solution was that we had to query the database each time the user hit 'next' or 'previous' in the search results. This cost us dearly in both performance and system resources.
Now, if we could cache the search results somehow, we could avoid querying the database repeatedly and instead access the data quickly from the cache. Before we look at caching solutions, let us first look at what we expect to gain by using a cache.
When using a cache:
  • The database tables should not be repeatedly searched for the same content upon pagination
  • Page content should be accessed quickly, without scouring large result sets
  • Entries should be purged automatically by some criteria
  • Responses should be fast
  • The cache should be cluster friendly when used in a multi-server production environment
Caching solutions are many and of different types; choosing one that fits your requirements is really a matter of personal choice and applicability in your context. Here are a few options:
  1. Session with state server (.NET)
  2. View state (for .NET)
  3. ASP.NET cache (for .NET)
  4. JCS (Apache/Java)
  5. Velocity (Microsoft)
  6. JBoss Cache(Java)
We could even put together our own caching solution quickly if resources are available; I will discuss custom caching solutions in a separate post.
All of the above solutions use their own approach to caching, but provide a very simple interface for using them. We will not delve into the nuances of and differences between the approaches adopted by each solution; we will simply assume that one of the options has been chosen.
So, how does caching help? At a high level, the solution involves

  1. Retrieving the search results with a normal SQL query against the database
  2. Converting them into a form that is easily serialized, and saving them in the cache
  3. Retrieving data from the cache each time the user paginates
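The three steps above can be sketched in a few lines of Python. This is an illustrative sketch, not from the original post: a plain dict stands in for the cache product, `pickle` for the serialization step, and `execute_query` is a stubbed-out database call.

```python
import pickle

cache = {}  # stands in for the cache product (JCS, Velocity, JBoss Cache, ...)

def run_search(key, execute_query):
    """Steps 1-2: query the database once, serialize the results, cache them."""
    rows = execute_query()             # the normal SQL query, run only once
    cache[key] = pickle.dumps(rows)    # serialized form (byte string), easy to store

def get_page(key, page_number, page_size):
    """Step 3: serve pagination requests from the cache, not the database."""
    rows = pickle.loads(cache[key])    # note: the whole result set is deserialized
    start = (page_number - 1) * page_size
    return rows[start:start + page_size]

# usage: the 'database' is faked with a list of 25 rows
run_search("MyCustomerSearch", lambda: [f"row{i}" for i in range(1, 26)])
print(get_page("MyCustomerSearch", 2, 10))   # rows 11-20, no DB round trip
```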
Using the Cache
Most caching solutions use the ubiquitous key-value pair approach at a fundamental level. There are higher levels of distribution and control using regions and other such concepts; at its core, however, cache access retains the simple key-value pair model.

Content can be stored in cache using simple code such as shown below. Note that the exact syntax may vary depending on the caching solution used.
Cache objCache = CacheFactory.GetCache("SearchCache");
objCache.put("MyCustomerSearch", objByteArraySearchResults);

The above statements show how a handle to the cache container named 'SearchCache' is obtained, and how a new cache entry is made against the key "MyCustomerSearch". The value passed is the serialized form (a byte array) of the search result collection produced by the search query. Other options, such as converting the results to XML, can also be considered.
To retrieve the results from the cache, we execute the simple command

objByteArraySearchResults = (Byte[])objCache.get("MyCustomerSearch");

Unique Key
In a real-world multi-user scenario, the above example will fail because the key is not unique, and the results will most likely be corrupted. To ensure uniqueness in a multi-user scenario, we will generate a unique key. There are different ways to generate a unique key; one such way is specified below.

SearchName_UserName_SessionID_TimestampOfSearchExecution

For instance, the customer search key would look like

CustomerSearch_JohnDoe_LJSL0SDF9SD797D7F_20091023233029112

This key is computed dynamically and will remain unique for most general requirements. It is very important that this key is stored in the page context / page scope (for example, as a hidden field), as it must be used every time the cache is accessed.
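A key builder along these lines can be sketched as follows. This is an illustrative Python helper, not from the original post; the underscore separator and the timestamp format (down to milliseconds) are assumed from the example key shown above.

```python
from datetime import datetime

def build_cache_key(search_name, user_name, session_id, executed_at=None):
    """Builds SearchName_UserName_SessionID_Timestamp, unique per search run."""
    executed_at = executed_at or datetime.now()
    # timestamp to millisecond precision, e.g. 20091023233029112
    timestamp = (executed_at.strftime("%Y%m%d%H%M%S")
                 + f"{executed_at.microsecond // 1000:03d}")
    return f"{search_name}_{user_name}_{session_id}_{timestamp}"

key = build_cache_key("CustomerSearch", "JohnDoe", "LJSL0SDF9SD797D7F",
                      datetime(2009, 10, 23, 23, 30, 29, 112000))
print(key)  # CustomerSearch_JohnDoe_LJSL0SDF9SD797D7F_20091023233029112
```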

Reality Check
We have a basic solution for search using a caching approach. Let us check how this performs as a search solution against our considerations.

Consideration | Compliance | Remarks
Search must be fast | Complies |
Search must be accurate | Complies |
Search must use minimal system resources | Partly Complies | Much better than the earlier DB-only approach.
Search must avoid redundant queries | Complies | The query runs once per search; pagination is served from the cache.
Search must provide current data | Partly Complies | Cached results do not reflect updates that happen after the search is executed.
Pagination must be fast | Partly Complies | Avoids redundant queries, but the whole result set is loaded on every page access.
Must facilitate on-demand sorting | Complies | Requires re-querying the DB unless a sorting API (e.g. LINQ) is available.
Must facilitate on-demand result filtering | Complies |
Must be multi-lingual friendly | Complies |
Solution must be cluster friendly | Complies* | *Subject to support from the caching solution.

As one can see from the table above, there have been significant improvements in promised performance over the earlier approach, achieved by avoiding redundant database queries and by keeping the search data close to the point of use.

This works well for searches that involve large rows but medium-sized result sets of up to a few hundred rows.

Caching - Extended
The compliance checklist that concluded the section above does not indicate full compliance for fast pagination. The reason is that each time a user navigates to a page, the entire result set must be retrieved from the cache and deserialized before the search program can extract the exact rows required to display the requested page.
When the result set is fairly large, loading and deserializing all of it carries a significant cost.
Instead, if we could organize the cache in such a way that we could directly load only the content we need, without loading the entire result set, we would gain a considerable performance advantage over the basic caching approach.
To achieve this we can use indexed caching.
Before delving into the details, let us step back and understand the requirement for pagination.
Search results are mostly displayed in a table or grid. Pagination is done using simple Next and Previous buttons; in some cases, direct navigation to a specific page is offered through page-number links or similar options.

The number of rows displayed per page (the page size) is also fairly consistent in most cases, with some flexibility where a user can choose between a set of values such as 10, 50 or 100.
With this as a basis let us look at how we can evolve a solution that will help us effectively use caching to make navigation faster.
In our example, the number of rows per page is always a multiple of 10, as can be seen from the page size options:
10x1 = 10
10x5 = 50
10x10 = 100

Splitting the cache
The solution will now be based on splitting the search results and storing them as separate cache entries, each containing only enough rows to display a page. Let us call the number of rows stored per cache entry a block. Each cache entry will be keyed with its own block number in addition to the key defined earlier.
So, the syntax for the key now becomes

SearchName_UserName_SessionID_TimestampOfSearchExecution_BlockNumber


Let us discuss this with the above example.

Option 1 - greatest common divisor approach

In this approach we store only as many rows per cache entry as the greatest common divisor of the page size options, which here is also the smallest page size: ten rows per cache entry. So, a search that returns 110 rows will produce 11 blocks, as shown below
CustomerSearch_JohnDoe_LJSL0SDF9SD797D7F_20091023233029112_1 - Rows 1-10
CustomerSearch_JohnDoe_LJSL0SDF9SD797D7F_20091023233029112_2 - Rows 11-20
CustomerSearch_JohnDoe_LJSL0SDF9SD797D7F_20091023233029112_3 - Rows 21-30
CustomerSearch_JohnDoe_LJSL0SDF9SD797D7F_20091023233029112_4 - Rows 31-40
CustomerSearch_JohnDoe_LJSL0SDF9SD797D7F_20091023233029112_5 - Rows 41-50
CustomerSearch_JohnDoe_LJSL0SDF9SD797D7F_20091023233029112_6 - Rows 51-60
CustomerSearch_JohnDoe_LJSL0SDF9SD797D7F_20091023233029112_7 - Rows 61-70
CustomerSearch_JohnDoe_LJSL0SDF9SD797D7F_20091023233029112_8 - Rows 71-80
CustomerSearch_JohnDoe_LJSL0SDF9SD797D7F_20091023233029112_9 - Rows 81-90
CustomerSearch_JohnDoe_LJSL0SDF9SD797D7F_20091023233029112_10 - Rows 91-100
CustomerSearch_JohnDoe_LJSL0SDF9SD797D7F_20091023233029112_11 - Rows 101-110
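This block layout can be produced with a small splitter. The following is a minimal Python sketch, not from the original post: a dict stands in for the cache product and `pickle` for the serialization step.

```python
import pickle

cache = {}  # stands in for the cache product (JCS, Velocity, JBoss Cache, ...)

def store_in_blocks(base_key, rows, rows_per_block):
    """Split the result set into fixed-size blocks, one cache entry per block."""
    block_number = 0
    for start in range(0, len(rows), rows_per_block):
        block_number += 1
        block = rows[start:start + rows_per_block]
        cache[f"{base_key}_{block_number}"] = pickle.dumps(block)
    return block_number  # total number of blocks written

base = "CustomerSearch_JohnDoe_LJSL0SDF9SD797D7F_20091023233029112"
rows = [f"row{i}" for i in range(1, 111)]      # a 110-row result set
total = store_in_blocks(base, rows, rows_per_block=10)
print(total)  # 11
```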

So, when the page size is set to 10 rows per page, the cache key can be computed by directly appending the page number to the key.
However, when the page size is set to 20 rows per page, the following computation is used to retrieve the results directly.

BlocksToRetrievePerPage = RowsPerPageToBeDisplayed / RowsPerBlock

In this case, the above formula translates to

BlocksToRetrievePerPage = 20 / 10 = 2

The next parameter required is the starting block, from which we generate the key to navigate directly to the right cache block (using integer division):

BlockToStart = ((RowToStartFrom - 1) / RowsPerBlock) + 1

When the page size is set to 20 and the user clicks the Next button while on the third page, the next page starts at row 61:

BlockToStart = ((61 - 1) / 10) + 1 = 7

So BlockToStart is block 7, and as the list above shows, the 7th block indeed starts at the 61st row.
The pseudocode to process the cache is given below

  1. WHILE BLOCKS_RETRIEVED < BlocksToRetrievePerPage
  2. RETRIEVE FROM CACHE THE ENTRY WITH KEY (SearchName_UserName_SessionID_Timestamp + "_" + BlockToStart)
  3. ROWSTODISPLAY = ROWSTODISPLAY + CACHE_ENTRY_RETRIEVED
  4. BlockToStart = BlockToStart + 1
  5. BLOCKS_RETRIEVED = BLOCKS_RETRIEVED + 1
  6. END WHILE
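Putting the formulas and the loop together, here is a minimal Python sketch (again not from the original post; a dict plus pickle stand in for the cache, with block key suffixes _1, _2, ... as above):

```python
import pickle

def get_page_from_blocks(cache, base_key, page_number, page_size, rows_per_block):
    """Load only the blocks covering the requested page, not the whole result set."""
    row_to_start_from = (page_number - 1) * page_size + 1
    blocks_to_retrieve = page_size // rows_per_block                 # BlocksToRetrievePerPage
    block_to_start = (row_to_start_from - 1) // rows_per_block + 1   # BlockToStart

    rows_to_display = []
    for block_number in range(block_to_start, block_to_start + blocks_to_retrieve):
        rows_to_display += pickle.loads(cache[f"{base_key}_{block_number}"])
    return rows_to_display

# usage: 110 rows in blocks of 10; page 4 at page size 20 covers rows 61-80 (blocks 7 and 8)
cache = {}
rows = [f"row{i}" for i in range(1, 111)]
for n, start in enumerate(range(0, 110, 10), start=1):
    cache[f"k_{n}"] = pickle.dumps(rows[start:start + 10])
print(get_page_from_blocks(cache, "k", 4, 20, 10)[0])  # row61
```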

In this approach, the number of blocks accessed for varying page sizes is as follows:
  • Page size 10: 1 block accessed per page
  • Page size 20: 2 blocks accessed per page
  • Page size 50: 5 blocks accessed per page
  • Page size 100: 10 blocks accessed per page
This option scales even if the customer decides to provide more page size options, such as 30 or 40, as long as each is a multiple of 10 and the number of rows displayed per page is never less than 10.

Option 2 - Optimizing the block access
To push the envelope further, we can strike a balance between the number of blocks loaded per page and the volume of data loaded per cache access.
For instance, if we change the rows per block to 50 in the above example, we end up with only three cache entries:
CustomerSearch_JohnDoe_LJSL0SDF9SD797D7F_20091023233029112_1 - Rows 1-50
CustomerSearch_JohnDoe_LJSL0SDF9SD797D7F_20091023233029112_2 - Rows 51-100
CustomerSearch_JohnDoe_LJSL0SDF9SD797D7F_20091023233029112_3 - Rows 101-110

In this approach, the number of block accesses made for varying page sizes is shown below for the first page of a 110-row result set:
  • Page size 10: 1 block accessed (block 1)
  • Page size 20: 1 block accessed (block 1)
  • Page size 50: 1 block accessed (block 1)
  • Page size 100: 2 blocks accessed (blocks 1 and 2)
Note that a page that straddles a block boundary, such as rows 41-60 at page size 20, touches two blocks.
As we can see, the number of cache accesses has been reduced while striking a balance between the number of accesses made and the volume of data handled per access. This provides a consistent user experience when paginating.
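The counts above apply to the first page; a small helper (an illustrative Python sketch, not from the original post) generalizes the block arithmetic to any page, including pages that straddle a block boundary:

```python
def blocks_accessed(page_number, page_size, rows_per_block):
    """Count how many cache blocks a given page spans (integer division)."""
    first_row = (page_number - 1) * page_size + 1
    last_row = page_number * page_size
    first_block = (first_row - 1) // rows_per_block + 1
    last_block = (last_row - 1) // rows_per_block + 1
    return last_block - first_block + 1

# first page at each page size option, with 50-row blocks:
for size in (10, 20, 50, 100):
    print(size, "->", blocks_accessed(1, size, 50))
# page 3 at page size 20 (rows 41-60) straddles blocks 1 and 2:
print(blocks_accessed(3, 20, 50))  # 2
```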

While both options are very powerful, the choice between them is really up to the designer. Let us look at the compliance checklist again.

Consideration | Compliance | Remarks
Search must be fast | Complies |
Search must be accurate | Complies |
Search must use minimal system resources | Complies | Much better than the earlier DB-only approach.
Search must avoid redundant queries | Complies | The query runs once per search; pagination is served from the cache.
Search must provide current data | Partly Complies | Cached results do not reflect updates that happen after the search is executed.
Pagination must be fast | Complies | Optimized cache access.
Must facilitate on-demand sorting | Complies | Requires re-querying the DB unless a sorting API (e.g. LINQ) is available.
Must facilitate on-demand result filtering | Complies |
Must be multi-lingual friendly | Complies |
Solution must be cluster friendly | Complies* | *Subject to support from the caching solution.

We have looked at making effective use of caching in this section. In the next part we will look at pushing the performance envelope further.
