Monday, November 23, 2009

Google, the Skynet in the making

Alright, Skynet is a wee bit old. Let's say the Matrix in the making.
I started thinking about it (finally) when I heard that Google was doing away with local storage and promoting cloud-only storage for its new OS, Chrome OS!
Now, for those who woke up after a late-night slosh fest, Google has announced a spanking new OS (well, it's actually a smartly repackaged Ubuntu) that has only the browser for an interface. Blazing fast, highly interactive and the other usual Googly stuff.

Now, coming back to the case in point: cloud storage. I'm just not sold on the idea of storing everything of mine in the cloud. Hey, I need some space that I'd like to call my own without big brother watching! Well, if you are one of those types, let me say that I'd at least like to think that no one is watching what is in my personal space.
The concept of cloud storage for personal data is against the very idea of personal space, even though Google would claim the highest levels of security. I just don't buy it. Hey, call me old fashioned, but that's me. And I'm sure I am not alone.
Then I downloaded their distro and tried it out. Well, you boot up and there it is, the Chrome browser in all its glory. That's it!
It feels strange for a moment, given that we are used to watching the desktop gizmos pop up one after the other with long whirs of the disks. That apart, the browser on the kernel is all there is to Chrome OS!
The so-called apps are essentially links to Gmail, Yahoo Mail, Hotmail, Google Docs, Picasa Web, Blogspot, Hulu and other websites. Nothing 'new' here.
That's when it hits you. My fear of losing my private space is already null and void.
I have already lost it. Google already has my vitals! Hook, line and sinker!
They know me through my emails and my tagged pictures. They know my private activities through my use of GPS-enabled Google Maps on my mobile. They know my personal tastes from all my searches! They know my family and friends and their whole network through Orkut and Picasa. They know my business and financial details by automatically parsing unprotected attachments in my emails that they store as docs! They know my political standpoint from my blogs, such as this one! They also eavesdrop on my chats and store them for me! They know my appointments and schedule from my calendar usage! They know where I live from all the map searches I have made!
The only thing they may (and that's one mighty MAY!) not know already is my bank balance. Otherwise I'm afraid I've been had! I'm afraid we have all been had! The world has been effectively entrapped by this giant, who we all fervently hope is noble.
I cannot help but wonder whether Google is the Skynet/Matrix in the making. Will it soon overthrow its own bosses and take over the world?
If my paranoia is just that, paranoia, and Google is actually an angel in corporate garb, it's still a nuclear bomb!
I shudder to think: what if all that data falls into the wrong hands? What if a smart terrorist hacks into Google and starts making use of all the data?

Friday, November 20, 2009

ChromeFrame Offline Install

I had a tough time installing ChromeFrame behind my proxy, as downloading exes was banned. Even if you managed to get the 500 KB setup, it in turn tries to download another installer. So, I pulled some favours, managed to work through the firewall restrictions and fiddled with the URL.
Here is a link that allows you to get the offline installer
http://tiny.cc/chromefrm_offline

Trapped with Internet Explorer 6.

Working for three long years on a product, building Web 2.0 software with all the bells and whistles that any sane human would care for, is taking its toll on me.
With so many browsers (IE, Firefox, Chrome, Safari, Opera), it's quite an ask just to ensure everything works the way it should on all of them!
Now, after all the effort, we connect with our customers to roll the new platform out. We demo the product, they like it all, and it looks like the only thing left to do is sign the contract and cash the cheque.
Just when things could not get better, we hit an air pocket. Their users are on IE6 and cannot move from that platform. The reason? It would cost them more to install IE7 on all their users' machines than the annual contract for using our software! Grrrrrrrrr...
The reason for the gnashing of teeth is that our product does not 'officially' support IE6. It's such an antiquated browser that any sane person fears it! It's like living in Baghdad: you could, but you'd rather not if you care for your life!
With all its security holes and non-compliance with standards, I wonder why anyone would still use IE6 in an age where users are spoilt for choice of good browsers that are light years ahead of it.
Well, one could understand if home users are still stuck with the IE6 that came pre-installed with Windows on their desktops and just don't care to upgrade.
I just cannot fathom why an enterprise user would not. Any sane network administrator would see a higher security risk in retaining IE6 when there are better choices.

Nevertheless, my customers are stuck with IE6 for the foreseeable future, and I do not want to miss out on my cheque for a reason as lame as this.
So, what do I do now?
Citrix, anybody?
Install Citrix servers with IE7 or IE8 and allow users to access the app through a supported browser?
Not a good idea: it's overkill and not scalable (cost-wise) for tens of thousands of users.

Change code to support IE 6
A quick fix from http://divitodesign.com/css/let-ie6-behave-like-ie7/ : some good folks have spent quality time writing scripts that make IE6 behave much like IE7.
This definitely works and is a very easy solution to integrate into the mainstream application.
However, we found that our product had style elements that still broke the site.

Chromeframe
I had read about Google's developers coming up with a very smart fix when they were struggling to make Google Wave work in IE6. They turned the Google Chrome browser into a plug-in called ChromeFrame and made it run from within IE6, much like Flash. This way, pages running on ChromeFrame technically run on the Chrome browser even though it is running from within IE. This ensures that all users who are still stuck with IE6 can run contemporary web sites without having to upgrade their browsers, whatever their reasons.
http://code.google.com/chrome/chromeframe/
The only change I had to make was to ensure a meta tag is included in all my pages! I just changed my intercepting filter/HTTP handler to insert this meta tag into the response stream for all my requests.
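For what it's worth, here is a rough sketch of the kind of change involved. The module name is mine, and instead of rewriting the HTML it sends the equivalent X-UA-Compatible HTTP header, which ChromeFrame also honours; treat it as an illustration, not our exact code.

using System.Web;

// Rough sketch only: an ASP.NET HTTP module that tags every response so that
// IE installations with the ChromeFrame plug-in render the page through Chrome.
// Sending the X-UA-Compatible header is an alternative to injecting the
// <meta http-equiv="X-UA-Compatible" content="chrome=1"> tag into the markup.
public class ChromeFrameModule : IHttpModule
{
    public void Init(HttpApplication application)
    {
        application.BeginRequest += (sender, e) =>
        {
            HttpResponse response = ((HttpApplication)sender).Context.Response;
            response.AppendHeader("X-UA-Compatible", "chrome=1");
        };
    }

    public void Dispose() { }
}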
I tried this out and boy, was I impressed. It works like a dream. Of course, users have to install the ChromeFrame plug-in, but hey, they are not changing their 'browsers'! Trust me, this is not an argument I like, but it works for some users!

I'm happy as long as my users are happy! Thank you Google, you saved me tons of work.

Wednesday, November 4, 2009

Multicasting in c#

This has been one topic that had me going around in circles for quite some time. Not that it was difficult to find sources that helped me; it was the sheer difference in approaches that really frustrated me.
I'm writing this hoping that it saves someone a lot of headache when they try multicasting. I'm going to refrain from explaining what multicasting is; there are excellent resources available on the internet on the topic. I will focus on the problem and the resolution here.

Problem Statement
Whenever a cache entry is updated on a node in the network, that node must notify its peers of the update. This ensures all nodes have current cache data.

Product perspective
The nodes are the different web/app servers in the load balanced/clustered environment and a central cache server that keeps track of all updates to the cache entries.

High level solution
Multicasting will be used to transmit cache updates (all the app servers will do this) through a multicast group channel (something like a radio frequency), and anyone listening in on this channel (the central cache server in this case) will be able to pick up the transmissions.
The cache server transmits a 'pulse' signal. The peers look for this signal to ensure that the central cache service is alive and not down.

Problem with multicast code found online
As stated at the beginning, the many sources available online provide many different ways of multicasting. The problem I faced with most of the code was that it did not allow two different processes on the same machine to transmit into the multicast group.

Sending
So here is the code that finally allowed two processes to transmit into the same multicast group from the same machine.


// Requires: using System.Net; using System.Net.Sockets; using System.Text;
private void send(string strMessage)
{
    // A plain UDP (datagram) socket is enough to transmit into a multicast group.
    Socket server = new Socket(AddressFamily.InterNetwork,
        SocketType.Dgram, ProtocolType.Udp);
    // The IP address and port number are personal choices, subject to multicast addressing rules.
    IPEndPoint iep = new IPEndPoint(IPAddress.Parse("224.237.248.237"), 5000);
    byte[] data = Encoding.ASCII.GetBytes(strMessage);
    try
    {
        server.SendTo(data, iep);
    }
    finally
    {
        // Close the socket after each send so it does not hold on to the port.
        server.Close();
    }
}

Note: this code allows you to send messages without requiring elaborate steps like joining the group, socket binding etc., as suggested by some resources on the net. This cleared a major bottleneck for me.

Listening
Receiving messages is mostly a consistent affair, and many resources on the net are spot on when it comes to the code required to do this.
However, coming to our requirement, if we need to support multiple processes listening in on the same multicast group, there is no easy way out.
First, let me present the fundamental code to listen in on a multicast group.

// Requires: using System; using System.Net; using System.Net.Sockets; using System.Text;
private void Listen()
{
    // Bind a UDP client to the multicast port and join the group.
    UdpClient listener = new UdpClient(5001);
    IPEndPoint groupEndPoint = new IPEndPoint(IPAddress.Parse("224.237.248.237"), 5001);
    string strMsgRcvd = string.Empty;
    try
    {
        listener.JoinMulticastGroup(IPAddress.Parse("224.237.248.237"));
        while (true)
        {
            // Blocks until a datagram arrives on the group.
            byte[] bytes = listener.Receive(ref groupEndPoint);
            strMsgRcvd = Encoding.ASCII.GetString(bytes, 0, bytes.Length);
            // do what you must with the message received.
        }
    }
    catch (Exception e)
    {
        // log the error and proceed.
    }
    finally
    {
        listener.Close();
    }
}


The above code will allow only one process to bind to the socket. Whenever another process attempts to bind to the same port 5001, .NET throws an exception saying that only one process can bind to the socket.

There is no quick way to overcome this difficulty, and it matters in our case since we want more than one process on a machine to listen in on the pulse signal from the cache service. Without the pulse signal, the peer processes will not know whether there is a live central cache service to receive and process the messages they send out.

Here is an approach by which we can overcome this difficulty. I'm not insisting this is the best way, but this is one way of working around the problem.
Using a Pulse file
The above code is written into an independent program/Windows service that writes the live status of the central cache service to a file at a pre-determined location on the local file system. Let us call this file a pulse file.
Let this pulse file have the name pulse.txt.
The listener will overwrite a '1' into pulse.txt whenever it receives a pulse signal. This way the pulse.txt file is constantly 'touched', with the same value '1' overwritten into it.
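As a rough sketch (the path is an assumption and would be whatever pre-determined location the peers agree on), the listener's handling of a pulse message could be as simple as this:

// Called by the listener whenever a pulse message is received.
// Overwriting the same value is enough: what matters is that the file's
// last-write timestamp moves forward with every pulse.
private void TouchPulseFile()
{
    System.IO.File.WriteAllText(@"C:\cacheservice\pulse.txt", "1");
}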
The independent processes that wish to be aware of the pulse status of the central service look for pulse.txt at the pre-determined location and check the last-updated timestamp property of the file (alternatively, instead of '1', the timestamp itself could be written into the file).

Once the last updated timestamp is retrieved from the file, the process determines the status of the remote cache service based on its tolerance. For instance

// ObjLastAccessedDateTime is the pulse file's last-write timestamp retrieved above;
// toleranceInSeconds is the allowed silence period (both names are illustrative).
if ((DateTime.Now.Ticks - ObjLastAccessedDateTime.Ticks) / TimeSpan.TicksPerSecond < toleranceInSeconds)
{
    // central service is alive, proceed as usual
}
else
{
    // central service is not alive so throw an exception
    throw new Exception("Please check whether the central cache service is running");
}

This way we have a system in which more than one process can transmit and receive on the same multicast group on a single machine.

Footnote: this allowed me to run multiple dedicated web apps on the same box with different resource allocations (memory/CPU etc.) to prioritize client requirements while sharing common infrastructure services.

Monday, November 2, 2009

Designs for making your searches faster - Part 5

In the previous parts we looked at different ways of speeding up application transaction searches. In this part we will look at pushing the boundaries much further, without compromising application data integrity, through the use of search engines.
Search Engines
Search engines come in different types. The ones we are all familiar with, the internet search engines, scour the web for data and catalog it to help us find the web pages that provide the content we look for.
The second kind is the internet search engine applied to the end-user desktop. These are classified as desktop search engines, and they help users search their local desktops by keyword for anything from documents to music files and emails.
The one we are going to use in this context is an enterprise search engine, which is essentially a stripped-down version of a desktop search engine with extensions that let application developers harness the power of the search engine.
Before we look at how we will use search engines, let us quickly and briefly understand how they work.
Search engines are fundamentally about indexing various keywords against a content link so that when a user searches using any of the keywords, the mapped content can be presented to the user.
For instance, say we have a book called
"Fifty ways to make a sandwich" by "Danny Shrill", a fast cooking guide for working men and women, Penguin Books

A search engine will index many of the keywords in the context of the book, such as
  • The author name: Danny Shrill
  • Publisher: Penguin Books
  • Subject: sandwich
  • Category: cooking and fast food
  • Published: 2009
  • Type: paperback
When a user searches for books using any one or more of the keywords shown above, the book's name will be thrown up as a possible candidate for the book the user is searching for.

The accuracy of the search engine depends on the relevance of the key words that are indexed by the engine.

Search Engines and enterprise applications
Search engines may not be relevant for enterprise application searches across the board. Their use should be restricted to cases that meet the following criteria
  1. Where search performance SLAs are very tight (expected response times are very low)
  2. Where data volumes are very high
  3. Where the user may not be able to provide accurate data points for the search
A high level approach to using search engines
Let us approach this with an example. Let us assume that we are building the product search for amazon.com. When a user searches for a product on amazon.com, the search has to fulfill the following criteria
  1. It's got to be really fast
  2. It's got to be accurate in retrieving relevant products
  3. It's got to be flexible when users make mistakes in spelling what they are looking for
  4. It's got to provide a good set of suggestions when users cannot find what they want
How do we go about using a search engine that blends with a classic relational database system for retrieving results?
Let us apply all of what we learned in the previous parts and quickly summarize the steps
  1. We will create a denormalized table structure
  2. We will use a threaded searching mechanism for burst searching
To use a search engine, the following activities have to be done
  1. Build key word catalog
  2. Integrate indexing mechanism
Build Key word catalog
A keyword catalog is a dictionary of terms a user will typically use when searching. The keywords can be classified and grouped depending on need. Let us now attempt to build a keyword catalog for our exercise.
Keyword catalog for amazon products
  1. Name
  2. type - books, music, ebook, audiobook
  3. Genre - Fiction, Self help, pop, rock
  4. Author/writer
  5. Played by
  6. support cast
  7. publisher
  8. year published
  9. media : hard bound, paperback, ebook, download, dvd
This is a brief collection of keywords by which a prospective amazon.com customer could look for products. It can also be worked backwards from the criteria in your basic/advanced search screens.
Once this has been built, the next step is to integrate the search engine to index and execute the searches.
Integrating the search engine
Search engine integration has two parts to it.
  1. Indexing
  2. Searching
Indexing
Indexing the product database can be done in multiple ways. The simplest is to index whenever product data changes in the system. This way, whenever new product data is added or existing product information is modified, the search engine is updated with its content.
This ensures that the search engine indexes are up to date. Let us look a bit closer at how indexing is done.
Most search engines refer to index entries as a compilation of books. Each book is a catalog entry comprising two parts. There are many other meta parts, but we will restrict our discussion to these two to keep things simple.
  1. Key words
  2. Identifier
Keywords, as we saw earlier, are a grouped collection of key identifiers typically used by users to search. An identifier is a unique identifier, such as a primary key, that the search engine returns as the result of a search.
So, as presented above, if we ensure that the search engine indexes content every time a transaction is created or modified, the search engine will use its algorithms to hash and store the indexes for fast retrieval.
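As an illustration only (ISearchEngine, SearchBook and the product fields are hypothetical names, not any particular engine's API), the indexing hook could look something like this:

using System.Collections.Generic;

// Hypothetical stand-ins for whichever enterprise search engine is used.
public class SearchBook
{
    public string Identifier;                    // primary key of the product row
    public Dictionary<string, string> Keywords;  // classification -> keyword value
}

public interface ISearchEngine
{
    void Index(SearchBook book);
    IList<string> Search(IDictionary<string, string> criteria);
}

public class ProductIndexer
{
    private readonly ISearchEngine engine;

    public ProductIndexer(ISearchEngine engine)
    {
        this.engine = engine;
    }

    // Call this from the create/update path so the index stays in step with the data.
    public void OnProductSaved(string productId, string name, string type,
                               string genre, string author, string publisher)
    {
        var book = new SearchBook
        {
            Identifier = productId,
            Keywords = new Dictionary<string, string>
            {
                { "name", name },
                { "type", type },
                { "genre", genre },
                { "author", author },
                { "publisher", publisher }
            }
        };
        engine.Index(book);
    }
}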

Searching
The next part of the integration is the search process itself. When a user performs a search, the search program should first use the search engine API and ask the engine to return the key identifiers that match the criteria.
Search engine APIs are fairly simple and enable searches to be done by simply providing a list of keywords, optionally with their classifications.
Search engines also provide extended APIs to search with emphasis on certain classifications and to return results with the desired matching criteria, to further refine results.
The key identifiers returned by the search engines are then used in transactional searches to retrieve the data from the denormalized data structure in the database.
Since the database results are retrieved by directly specifying the key identifier, the retrieval will be the fastest.

  1. Query the search engine with the search criteria
  2. The search engine returns books matching the criteria in the specified order of relevance
  3. The application forms a DB query with the respective key identifiers in the books
  4. The database throws up the relevant rows based on the primary keys specified
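Continuing the illustration from the indexing sketch above (same hypothetical ISearchEngine; the view name is also an assumption), the search path would look roughly like this:

using System.Collections.Generic;
using System.Data;

public class ProductSearcher
{
    private readonly ISearchEngine engine;

    public ProductSearcher(ISearchEngine engine)
    {
        this.engine = engine;
    }

    public DataTable Search(IDictionary<string, string> criteria, IDbConnection connection)
    {
        // Steps 1 and 2: the engine returns only the key identifiers, ordered by relevance.
        IList<string> keyIds = engine.Search(criteria);
        var results = new DataTable();
        if (keyIds.Count == 0)
            return results;

        // Step 3: form the db query against the denormalized structure using the identifiers.
        // (In real code, parameterize rather than concatenating values.)
        string inList = "'" + string.Join("','", new List<string>(keyIds).ToArray()) + "'";
        using (IDbCommand command = connection.CreateCommand())
        {
            command.CommandText =
                "SELECT * FROM PRODUCT_SEARCH_VIEW WHERE ID IN (" + inList + ")";
            // Step 4: the database returns the relevant rows directly by primary key.
            results.Load(command.ExecuteReader());
        }
        return results;
    }
}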
This approach has been shown to provide the fastest results as it relies on a purpose-built search component to retrieve the candidates. However, it comes with its own cost of maintaining the search indexes in a cluster-friendly location.

Let us quickly take stock of this approach vis-a-vis the checklist we put together in the first part.

Consideration | Compliance | Remarks
Search must be fast | Complies | Much improved performance over the earlier approach
Search must be accurate | Complies |
Search must use minimal system resources | Complies | Much better than the earlier DB-only approach
Search must avoid redundant queries | Complies | Still executes redundant queries when paginating
Search must provide current data | Complies | The cached portion is only the identifiers; the data is retrieved directly from the database
Pagination must be fast | Complies | Optimized cache access
Must facilitate on demand sorting | Complies | Requires requerying the DB unless a sorting API is available in the programming language, e.g. LINQ
Must facilitate on demand result filtering | Complies |
Must be multi-lingual friendly | Complies |
Solution must be cluster friendly | Complies* | Subject to support from the caching solution

Closing comments
From a solution perspective we have looked at many of the options available to us for building efficient searches. However, we have only scratched the surface, and many of the solutions we have adopted have deeper tuning options that allow us to further exploit their features to deliver better searches.
I will close this series with this part, hoping that it has provided an impetus for you to embark on your own discovery process to further analyze and evolve a solution that best fits your needs.

Monday, October 26, 2009

Designs for making your searches faster - Part 4

Let us quickly recap the earlier part.
We looked at improving performance throughput through the use of an efficient caching solution. We then further helped cache management through indexed and segmented caching, striking a good balance between page size selection and cache access.
In this part we will look at pushing the performance envelope further. When users run searches, their focus is on a fast and accurate response. In cases where the user-specified criteria return results that span more than a few pages, here is how we can make them faster.
Let us divide the searching process into two parts.
  1. The first batch, comprising data for the first few pages
  2. The second batch, containing the complete set of results
Now, if we could arrange the search program so that a first batch of results is returned with just about enough data to display the first few pages, performance will improve several notches.
This is because, in cases where the number of results returned is large, the time taken to compile results for just a few pages is significantly less than the time taken to complete the entire search.
The odds of users looking beyond the first few pages are also low. So the program can be written such that it stops searching after fetching a finite set of rows, at which point it forks: a new process is initiated that keeps looking for more results while the main process returns with the initial batch of results.
For instance, a user searches the product database for products starting with the letter 'a'. This could lead to results spanning several pages.

The program returns immediately with 60 rows, which amounts to 3 pages of results, while a background process gathers all the remaining rows. This presents a perception of extremely fast response times, and the user is able to view results almost instantaneously.
Clicking next for the first few pages results in the cached data for those pages being displayed from the cache, as explained in part 3 of this series.
By the time the user reaches the third page, several seconds are likely to have elapsed, by which time the background process would have completed its job and cached all the results without the user realizing it.
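A minimal sketch of this forked, two-batch flow is below; fetchRows and cacheBlocks are stand-ins for the actual search query and the block caching from part 3, not real product code.

using System;
using System.Collections.Generic;
using System.Data;
using System.Threading;

public static class BurstSearch
{
    // fetchRows(offset, count) runs the search query; cacheBlocks(rows, offset)
    // stores the rows in the segmented cache described in part 3.
    public static List<DataRow> Run(Func<int, int, List<DataRow>> fetchRows,
                                    Action<List<DataRow>, int> cacheBlocks)
    {
        const int firstBatch = 60;    // roughly three pages of twenty rows
        const int hardLimit = 1000;   // finite cap on any search result

        // Fetch just enough rows to render the first few pages and return at once.
        List<DataRow> initialRows = fetchRows(0, firstBatch);
        cacheBlocks(initialRows, 0);

        // A pooled background thread gathers the rest and caches it while the
        // user is still reading the first pages.
        ThreadPool.QueueUserWorkItem(delegate
        {
            List<DataRow> remaining = fetchRows(firstBatch, hardLimit - firstBatch);
            cacheBlocks(remaining, firstBatch);
        });

        return initialRows;
    }
}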

In order to maximize the performance gain from this approach, it is essential that
  1. The maximum number of results returned by any search is limited to a finite and logical limit, say 1000 rows. Some might even argue that this number is too high. Good, then reduce it further.
  2. Sorting of search results by default is avoided. Encourage users to use filters for seeking transactions and to use reports for reporting. Some users tend to use searches with sorting and grouping as interim reporting solutions.
  3. Thread pools and thread queuing are used to ensure that background threads do not block critical CPU resources under heavy load.
In this part we have seen how multi-threading can be used to improve the perceived performance of searches. This approach is very suitable when the chance of a large number of results is very high.

Consideration | Compliance | Remarks
Search must be fast | Complies | Much improved performance over the earlier approach
Search must be accurate | Complies |
Search must use minimal system resources | Complies | Much better than the earlier DB-only approach
Search must avoid redundant queries | Complies | Still executes redundant queries when paginating
Search must provide current data | Partly Complies | Cached results do not reflect updates that happen after the search is executed
Pagination must be fast | Complies | Optimized cache access
Must facilitate on demand sorting | Complies | Requires requerying the DB unless a sorting API is available in the programming language, e.g. LINQ
Must facilitate on demand result filtering | Complies |
Must be multi-lingual friendly | Complies |
Solution must be cluster friendly | Complies* | Subject to support from the caching solution

In the next part, we will look at taking search performance to the next level, pushing the performance envelope to its farthest.

Thursday, October 22, 2009

Designs for making your searches faster - Part 3

In this part, we will look at an improved solution for the problem we have been discussing: making enterprise transaction searches faster.
Caching Search Results
One of the weakest links in the earlier solution was that we had to query the database each time the user hit 'next' or 'previous' in the search results. This cost performance dearly, and system resources too.
Now, if we could cache the search results somehow, we could avoid querying the database repeatedly and instead access the data quickly from the cache. Before we look at caching solutions, let us first look at what we hope to gain by using a cache.
When using a cache,
  • The database tables should not be repeatedly searched for the same content upon pagination
  • Quick access to page content without scouring over large content
  • Automatic purging by some criteria
  • Fast response
  • Should be cluster friendly when used in multi server production environment
Caching solutions are many and of different types; choosing one that fits your requirements is really a matter of personal choice and applicability in your context. Here are a few options
  1. Session with state server (.NET)
  2. View state (for .NET)
  3. ASP.NET cache (for .NET)
  4. JCS (Apache/Java)
  5. Velocity (Microsoft)
  6. JBoss Cache(Java)
These are some of the available solutions; we could even quickly put together our own caching solution if resources are available. I will discuss custom caching solutions in a separate post.
All of the above solutions use their own approach to caching, but provide a very simple interface for using them. We will not delve into the nuances and differences in the approaches adopted by each solution; we will assume that one of the options has been used.
So, how does caching help? At a high level, the solution involves

  1. Retrieving the search results by normal SQL query of the database
  2. Converting them into a form that is easily serialized and saving them in a cache
  3. Retrieve data from the cache each time user paginates.
Using the Cache
Most caching solutions use the ubiquitous key-value pair approach at a fundamental level. There are higher levels of distribution and control using regions and other such concepts; however, at the fundamental level it remains the simple key-value method of cache access.

Content can be stored in cache using simple code such as shown below. Note that the exact syntax may vary depending on the caching solution used.
Cache objCache = CacheFactory.GetCache("SearchCache");
objCache.put("MyCustomerSearch",objByteArraySearchResults);

The above statements show how a handle to the cache container named 'SearchCache' is obtained and a new cache entry is made against the key "MyCustomerSearch". The value passed is the serialized form (byte array) of the search result collection created by the search query. Other options, such as converting the results to XML, can also be considered.
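One way (of several) to produce that byte array is plain binary serialization of the result collection, assuming the collection is marked [Serializable]; a small sketch:

using System.IO;
using System.Runtime.Serialization.Formatters.Binary;

// Serializes the (serializable) search result collection into the byte array
// that is put into the cache; deserialization is the mirror image on the way out.
private static byte[] SerializeResults(object searchResults)
{
    using (MemoryStream stream = new MemoryStream())
    {
        new BinaryFormatter().Serialize(stream, searchResults);
        return stream.ToArray();
    }
}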
To retrieve the results from cache we execute the simple command

objByteArraySearchResults = (Byte[])objCache.get("MyCustomerSearch");

Unique Key
In a real-world multi-user scenario, the above example will fail as the key is not unique and the result will most likely be corrupted. To ensure uniqueness in a multi-user scenario, we will generate a unique key. There are different ways to generate one; one such way is specified below.

SearchName+UserName+SessionID+timestampofsearchexecution

For instance the Customer search key would look like

CustomerSearch_JohnDoe_LJSL0SDF9SD797D7F_20091023233029112

This key is dynamically computed and will remain unique for most general requirements. It is very important that this key is stored in the page context/page scope as a hidden variable, as it has to be used every time the cache is accessed.
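For example, the key could be computed along these lines (the variable names are illustrative):

// SearchName + UserName + SessionID + timestamp of search execution
string cacheKey = string.Format("{0}_{1}_{2}_{3:yyyyMMddHHmmssfff}",
    "CustomerSearch", userName, sessionId, DateTime.Now);
// e.g. CustomerSearch_JohnDoe_LJSL0SDF9SD797D7F_20091023233029112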

Reality Check
We have a basic solution for search using a caching approach. Let us check against our considerations how this performs as a search solution.

Consideration | Compliance | Remarks
Search must be fast | Complies |
Search must be accurate | Complies |
Search must use minimal system resources | Partly Complies | Much better than the earlier DB-only approach
Search must avoid redundant queries | Complies | Still executes redundant queries when paginating
Search must provide current data | Partly Complies | Cached results do not reflect updates that happen after the search is executed
Pagination must be fast | Partly Complies | Avoids redundant queries
Must facilitate on demand sorting | Complies | Requires requerying the DB unless a sorting API is available in the programming language, e.g. LINQ
Must facilitate on demand result filtering | Complies |
Must be multi-lingual friendly | Complies |
Solution must be cluster friendly | Complies* | Subject to support from the caching solution

As one can see from the table above, there has been a significant improvement in promised performance over the earlier approach, achieved by avoiding redundant database queries and by keeping the search data close to the point of use.

This works well for searches that involve large rows but medium-sized results, up to a few hundred.

Caching - Extended
The above section concluded with a compliance checklist that does not indicate full compliance for fast pagination. The reason is that when a user navigates to a page, the cache is accessed and the entire result set is retrieved and deserialized before the search program can extract the exact number of rows required to display the page the user wishes to navigate to.
When the result set is fairly large, there is a significant cost to loading the entire result set and deserializing it.
Instead, if we could organize the cache in such a way that we could directly load only the content we need, without having to load the entire result set, it would provide a considerable performance advantage over the basic caching approach.
To achieve this we can use indexed caching.
Before delving into the details, let us step back and understand the requirement for pagination.
Search results are mostly displayed in a table/grid. Pagination is done using simple Next and Previous buttons. In some cases, direct navigation to a specific page is provided through direct links or other such options.

The number of rows displayed per page (the page size) is also fairly consistent in most cases, with some flexibility where a user can choose between a set of values such as 10, 50, 100 or other such options.
With this as a basis let us look at how we can evolve a solution that will help us effectively use caching to make navigation faster.
The number of rows per page is a multiple of 10 as can be seen from the options provided for the page size.
10x1 = 10
10x5 = 50
10x10 = 100

Splitting the cache
The solution will now be based on splitting the search results and storing them as separate cache entries, each containing only enough rows to display a page. Let us call the number of rows we store per cache entry a block. Each cache entry will be marked with its own block number in addition to the key defined earlier.
So, the syntax for the key will now become

SearchName+UserName+SessionID+timestampofsearchexecution_BlockNumber


Let us discuss this with above example.

Option 1 - least common divisor approach

In this approach we store only as many rows in a cache entry as the least common divisor, so for the above example we store ten rows per cache entry. A search that returns 110 rows will therefore produce 11 blocks, as shown below
CustomerSearch_JohnDoe_LJSL0SDF9SD797D7F_20091023233029112_1 - Rows 1-10
CustomerSearch_JohnDoe_LJSL0SDF9SD797D7F_20091023233029112_2 - Rows 11-20
CustomerSearch_JohnDoe_LJSL0SDF9SD797D7F_20091023233029112_3 - Rows 21-30
CustomerSearch_JohnDoe_LJSL0SDF9SD797D7F_20091023233029112_4 - Rows 31-40
CustomerSearch_JohnDoe_LJSL0SDF9SD797D7F_20091023233029112_5 - Rows 41-50
CustomerSearch_JohnDoe_LJSL0SDF9SD797D7F_20091023233029112_6 - Rows 51-60
CustomerSearch_JohnDoe_LJSL0SDF9SD797D7F_20091023233029112_7 - Rows 61-70
CustomerSearch_JohnDoe_LJSL0SDF9SD797D7F_20091023233029112_8 - Rows 71-80
CustomerSearch_JohnDoe_LJSL0SDF9SD797D7F_20091023233029112_9 - Rows 81-90
CustomerSearch_JohnDoe_LJSL0SDF9SD797D7F_20091023233029112_10 - Rows 91-100
CustomerSearch_JohnDoe_LJSL0SDF9SD797D7F_20091023233029112_11 - Rows 101-110

So, when the page size is set to 10 rows per page, the cache key can be computed by directly appending the page number to the key.
However, when the page size is set to 20 rows per page, the following computation is used to retrieve the results directly.

BlocksToRetrieveperPage = RowsPerPageTobeDisplayed/RowsPerBlock

In this case the above formula would translate to
BlocksToRetrieveperPage = 20/10

The next parameter required is the starting block to generate the key so that we can directly navigate to the cache block.

BlockToStart = ((RowToStartFrom)/RowsPerBlock)+1

When the page size is set to 20 and the user clicks the next button on the third page,
BlockToStart = (61/10)+1

In this case BlockToStart is block 7. As we can see from the listing above, the 7th block holds the 61st row.
The pseudocode to process the cache is given below

  1. WHILE BLOCKS_RETRIEVED < BlocksToRetrieveperPage
  2.   RETRIEVE FROM CACHE THE ENTRY WITH KEY (SearchName+UserName+SessionID+timestampofsearchexecution + "_" + BlockToStart)
  3.   ROWSTODISPLAY = ROWSTODISPLAY + CACHE_ENTRY_RETRIEVED
  4.   BlockToStart = BlockToStart + 1
  5.   BLOCKS_RETRIEVED = BLOCKS_RETRIEVED + 1
  6. END WHILE
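The same logic in C#, as a sketch only; getFromCache stands in for whichever cache API is used and returns the rows stored in one block.

using System;
using System.Collections.Generic;
using System.Data;

public static class BlockPager
{
    public static List<DataRow> GetPage(Func<string, List<DataRow>> getFromCache,
                                        string baseKey, int rowToStartFrom,
                                        int rowsPerPage, int rowsPerBlock)
    {
        int blocksToRetrieve = rowsPerPage / rowsPerBlock;       // e.g. 20 / 10 = 2
        int blockToStart = (rowToStartFrom / rowsPerBlock) + 1;  // e.g. (61 / 10) + 1 = 7

        var rowsToDisplay = new List<DataRow>();
        for (int retrieved = 0; retrieved < blocksToRetrieve; retrieved++)
        {
            // Each block is a separate cache entry keyed by its block number.
            rowsToDisplay.AddRange(getFromCache(baseKey + "_" + blockToStart));
            blockToStart++;
        }
        return rowsToDisplay;
    }
}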

In this approach, the number of blocks accessed for varying page sizes is as given below
  • Page size 10: blocks accessed per page is 1
  • Page size 20: blocks accessed per page is 2
  • Page size 50: blocks accessed per page is 5
  • Page size 100: blocks accessed per page is 10
This option is scalable even if the customer decides to provide more options for page sizes, such as 30, 40 etc., as long as each is a multiple of 10 and the number of rows displayed per page is never less than 10.

Option 2 - Optimizing the block access
To push the envelope further, we can strike a balance between the number of blocks loaded per page and the amount of data loaded per page access.
For instance, if we change the rows per block to 50 in the above example, we end up with only three cache entries
CustomerSearch_JohnDoe_LJSL0SDF9SD797D7F_20091023233029112_1 - Rows 1-50
CustomerSearch_JohnDoe_LJSL0SDF9SD797D7F_20091023233029112_2 - Rows 51-100
CustomerSearch_JohnDoe_LJSL0SDF9SD797D7F_20091023233029112_3 - Rows 101-110

In this approach, the number of block accesses made for varying page sizes is shown below for a result set of 110 rows
  • Page size 10: blocks accessed per page is 1 (block 1)
  • Page size 20: blocks accessed per page is 1 (block 1)
  • Page size 50: blocks accessed per page is 1 (block 1)
  • Page size 100: blocks accessed per page is 2 (blocks 1 and 2)
As we can see, the number of cache accesses has been optimized while striking a balance between the number of accesses made and the volume of data dealt with per access. This provides a consistent user experience when paginating.

While both options are very powerful, the choice is really up to the designer. Let us look at the compliance checklist again

Consideration | Compliance | Remarks
Search must be fast | Complies |
Search must be accurate | Complies |
Search must use minimal system resources | Complies | Much better than the earlier DB-only approach
Search must avoid redundant queries | Complies | Still executes redundant queries when paginating
Search must provide current data | Partly Complies | Cached results do not reflect updates that happen after the search is executed
Pagination must be fast | Complies | Optimized cache access
Must facilitate on demand sorting | Complies | Requires requerying the DB unless a sorting API is available in the programming language, e.g. LINQ
Must facilitate on demand result filtering | Complies |
Must be multi-lingual friendly | Complies |
Solution must be cluster friendly | Complies* | Subject to support from the caching solution

We have looked at making effective use of caching in this section. In the next part we will look at pushing the performance envelope further.

Wednesday, October 21, 2009

Designs for making your searches faster - Part 2

In this part we will look at a few basic solutions that can be considered. Before we dive into the solutions, let us quickly put together a design considerations chart that we can use for compliance verification
Consideration | Compliance
Search must be fast |
Search must be accurate |
Search must use minimal system resources |
Search must avoid redundant queries |
Search must provide current data |
Pagination must be fast |
Must facilitate on demand sorting |
Must facilitate on demand result filtering |
Must be multi-lingual friendly |



Denormalized data structures with Persisted Views / Materialized Views
Denormalized tables are flattened tables that contain all the columns from all the tables that need to be queried to retrieve a search result. Using denormalized tables can greatly speed up queries as it reduces the number of joins one has to perform in a query.
Let us quickly look at a sample denormalized structure.
The image shows a simple customer table and a set of related tables used to maintain the customer address information. Let us consider a scenario where we are developing a customer search using these tables.
The query would look like
SELECT c1."FIRST NAME", c1."MIDDLE NAME", c1."LAST NAME",
       c2.ADDRESS1, c2.ADDRESS2, c3.NAME, c4.NAME
FROM
  CUSTOMER c1,
  ADDRESS c2,
  STATE c3,
  COUNTRY c4
WHERE
  c2.CUSTOMER_ID = c1.ID AND
  c3.ID = c2.STATE_ID AND
  c3.COUNTRY_ID = c4.ID AND ....

This is a fairly simple scenario, and it already uses four tables to execute a search. Imagine more complex situations where more tables are involved; the queries only get more complex.
A multi-user search involving such (or more complex) queries, running on even a small-volume database, can lead to lower performance.

To make this faster, we can consider using views. Look at the above snapshot again, at the CUSTOMER_VIEW entity shown in a darker shade of gray. This entity has all the columns from all the individual physical entities. A denormalized view, as the name states, will house redundant data. However, it greatly simplifies the query required to perform the search. Persisted views, in sharp contrast with traditional views, are not resource hogs.
A traditional view is evaluated on a per-query basis: its underlying query is executed for every user query run against the view. This results not only in the view's query being run over and over at any instant, but also in the views becoming a serious resource hog.
To avoid this we will consider using persisted views. I will briefly touch upon the concept of persisted views and strongly recommend googling the topic for a detailed study.
Imagine a persisted view to be a real table that is also a view, one that can be periodically synchronized with the actual set of tables.
In our example from the screenshot above, imagine CUSTOMER_VIEW to be an actual table that physically exists in the database and is periodically refreshed from its source tables.
This offers the best of both worlds
  • The proliferation of runtime views are avoided
  • Data is synchronized automatically at specified intervals
  • The search query can also be greatly simplified.
Before we look into the search let us quickly look at the script to create the view shown above. The script shown here is for Oracle
CREATE MATERIALIZED VIEW CUSTOMER_VIEW
REFRESH FAST START WITH SYSDATE
NEXT SYSDATE + 1/24
WITH PRIMARY KEY AS
SELECT
  c1."FIRST NAME", c1."MIDDLE NAME", c1."LAST NAME",
  c2.ADDRESS1, c2.ADDRESS2,
  c3.NAME AS STATE_NAME, c4.NAME AS COUNTRY_NAME -- aliases avoid duplicate column names in the view
FROM
  CUSTOMER c1,
  ADDRESS c2,
  STATE c3,
  COUNTRY c4
WHERE
  c2.CUSTOMER_ID = c1.ID AND
  c3.ID = c2.STATE_ID AND
  c3.COUNTRY_ID = c4.ID
The above query creates a materialized view and refreshes its data every hour which is fine given that customer data rarely changes by the hour.

Note: SQL Server provides indexed views (created WITH SCHEMABINDING plus a unique clustered index), which are similar to materialized views.

Let us look at the query we would write to retrieve the same data from the CUSTOMER_VIEW
SELECT c1."FIRST NAME", c1."MIDDLE NAME", c1."LAST NAME",
       c1.ADDRESS1, c1.ADDRESS2, c1.STATE_NAME, c1.COUNTRY_NAME
FROM
  CUSTOMER_VIEW c1
WHERE
  <.... search criteria>
As you can see, the query has been greatly simplified.

Let us quickly verify the check list to see how this solution fares



Consideration | Compliance | Remarks
Search must be fast | Complies |
Search must be accurate | Complies |
Search must use minimal system resources | Partly Complies |
Search must avoid redundant queries | Does not comply | Still executes redundant queries when paginating
Search must provide current data | Partly Complies | Persisted view refresh frequency must be tuned appropriately
Pagination must be fast | Does not comply | Redundant querying makes it slow
Must facilitate on demand sorting | Complies |
Must facilitate on demand result filtering | Complies |
Must be multi-lingual friendly | Complies |




We can observe from the checklist that this solution is only slightly better than the easiest route we discussed in part 1.
However, this is a pretty good solution provided,
1. The tables being queried do not undergo changes frequently
2. It is acceptable to provide near current data
3. Volumes are not extremely high

For instance, product catalogs, customer master data are good candidates for this approach as they fulfill most of the above-mentioned criteria.

In the forthcoming parts, we will look at solutions that attempt to address other considerations in the checklist.

Tuesday, October 20, 2009

Designs for making your searches faster - Covering the basics - Part 1

Search is a universal requirement for any web application. Be it a simple collaboration portal or a complex trading system, search is among the most frequently used features of any application.
Searches can generally be categorized into two types at a high level,
  • Simple one box search that searches everything for the content you are looking for
  • Focused searches that help you seek a single entry from a large data repository.
There are many more variations of each type, but they all generally fall into one of these two categories.
The former, the one-box search, is mostly provided in content portals and informative web sites that cater to the general public, who may not be attuned to the nuances of using the latter type, or simply to keep things simple.
Behind the scenes, these searches are implemented using complex approaches that we will not debate in this article.
In this part we will focus on the latter.

Transaction Searches
Transaction searches are prevalent in enterprise web applications that are targeted at a trained user base. This user base knows exactly what it wants and may even provide adequate data inputs to help the system swiftly narrow down to the single entity it is looking for.
Here is a sample screen shot of a search screen from a complex web based enterprise application

courtesy of http://crowdfavorite.com/tasks-pro/
The easiest route adopted by designers when implementing these types of searches is a direct search on the database, followed by retrieval of all results and displaying them in an HTML table with next and previous buttons.
Clicking next or previous repeats the process, with the displayed content skipping the rows of the previously displayed pages.
While this is the easiest route, it has many disadvantages
  1. Very high resource consumption
  2. Very low performance
Let us look at why this approach is bad.

Very High Resource Consumption
  • Every pagination attempt does a full search on the database and retrieves all the rows
  • Transports them to the application server in part or in full, clogging up the network bandwidth
  • Application code uses only what is required for that page and discards the rest

Very low performance
  • Hitting the database for every page will stress the application resources under load and cannot provide optimal performance
What can I do? How can I improve my application's performance?
Fortunately, there are many options available. Unfortunately, there is no one size that fits all cases, and therefore one must decide the best options to exercise for their context.
Let us look at some basics that must be covered before we look at the options

Basics
  1. Know your SLAs
  2. Know your load conditions and load patterns
  3. Know your requirements and most frequently used usecases

Knowing your SLAs

Service level agreements define the measure or the success criteria of a performance of an application. It is always a good practice to have well segregated SLAs for your application that can be classified as
  • The general performance SLA of all pages in the application
  • Specific SLAs for mission critical features of the application
For instance, in a retail banking website, all general inquiry features could have an SLA of 3 seconds whereas the "third party fund transfer" , "Cash Withdrawal Update" features would have an SLA of 2 seconds.
It is also very important that all SLAs are defined under specific load conditions. For instance, a table such as the one given below could be used for this purpose






Feature | User Load | Performance
All pages | <> | 2 secs
All pages | > 100 and <> | 4 secs
Third Party Fund Transfer | <> | 2 secs

Knowing your Load Conditions and Patterns

It is very important for designers to understand the load conditions and patterns of the application. Load conditions help quantify the number of transactions that reside in your database at any point in time. In this context, it is also important to understand the quantum of data to be processed to provide the search feature.
For instance, we could be facing a simple five-column table with not more than a thousand rows, or we could be dealing with a huge document metadata repository with hundreds of attributes running to millions of rows.
It is very important to know the data load one has to deal with when evolving a solution for the search.
Load patterns are as important as load conditions. Some applications face minimal load through the month and then twice or thrice the average load during a three-day period in the first week of the month. It is extremely important to factor in this aspect, as customers and the application are most stretched during these periods, which generally tends to make or break the success of the application with the business.
The usage pattern is an important item that the business analyst must cover when developing usecases.

Knowing your Requirements and most frequently used Usecases

Knowing the requirements of the search is a no-brainer, but what most people fail to do is
  • Analyze the requirements to understand how an end user will use the search feature
  • What will be the most frequently used fields
  • What are the criteria the user is most likely to fill in full and those in part
  • Should there be some criteria that must be mandated?
  • Should the criteria be designed such that the user can be facilitated to provide accurate criteria?
The above points can be discussed in much more detail, but I will focus on one.
Facilitating the user to provide search criteria that are as complete as possible is one of the most critical steps and will go a long way in improving overall performance.
For instance, consider a bank customer looking to transfer funds to a customer in another branch. This feature requires at minimum a customer name and a branch name to list the options. In this case, providing a quick lookup option to pick the branch accurately helps the system narrow the name search down to only customers belonging to that branch, as opposed to the bank's entire customer base.
There is a very high probability that a user would not have specified the branch name had there not been a quick lookup provided and the branch mandated. Note that just mandating the branch name without a lookup will only make the user provide approximate or even incorrect names, leading to multiple searches.
Knowing the most frequently used usecases is again often overlooked. This seemingly insignificant point helps a lot in deciding the right solution for a given context. Let us for a moment look at two usecases
  1. Banker searches account ledger
  2. Customer searches account history
Both are legitimate usecases with legitimate business contexts. However, one can immediately spot that the banker searching the account ledger is likely to happen many more times a day than the customer searching the account history.
It is important to appreciate the usage when designing a solution for these searches: it makes sense to provide a powerful solution for the banker search, while the customer search can be relegated to a lower-powered search.
This translates not only into cost savings for the developers, but also into saving system resources and avoiding over-engineered solutions.

We have now covered some basic aspects that are a must for evolving the right solution to your search problem.
In the next part, we will look at some common and some uncommon solutions. Until then keep your feedback coming.

Iterative development

A team at my workplace has been working on, sorry, struggling with, a product development initiative for the last three years now.
They have been doing iterative development on a fixed-bid assignment for a product development initiative with no well-defined scope. I know, you must be thinking this is the perfect recipe for disaster.
Well, here is what I found out is wrong with the way they have been doing it.
Iterative development has been taken way too seriously. Everything, starting from requirements, is being iterated. Before you start thinking "That's the way it is supposed to be!", let me tell you what I think the difference is.
In a normal world, iterative development is about iteratively 'completing' your work in smaller chunks. The way they have been doing it, they have never 'completed' anything! There is always something remaining in everything that will be iteratively developed. For instance, a piece of code has been in development for the last three years, with the developer iteratively making changes to add functionality bit by bit!
Shocking? You bet it is!
The result is that no code has ever stabilized in the last three years leading to hundreds and hundreds of defects.
This has been a hard lesson for the development team.
 
