Monday, November 2, 2009

Designs for making your searches faster - Part 5

In the previous parts we looked at different ways of speeding up application-based transaction searches. In this part we will look at pushing search performance further, without compromising application data integrity, through the use of search engines.
Search Engines
Search engines come in different types. The ones we are all familiar with, the internet search engines, scour the web for data and catalog it to help us find the web pages that provide the content we are looking for.
The second kind is the internet search engine applied to the end-user desktop. These are classified as desktop search engines, and they help users search their local machines by keyword for anything from documents to music files and emails.
The one we are going to use in this context is an enterprise search engine, which is essentially a stripped-down version of a desktop search engine with extensions that enable application developers to harness its power.
Before we look at how we will use search engines, let us briefly understand how they work.
A search engine is fundamentally about indexing key words against a link to the content, so that when a user searches using any of those key words, the mapped content can be presented to the user.
For instance, suppose we have a book called
"Fifty ways to make a sandwich" by "Danny Shrill", a fast cooking guide for working men and women, published by Penguin Books.

A search engine will index many of the key words in the context of the book, such as
  • Author : Danny Shrill
  • Publisher : Penguin Books
  • Subject : Sandwich
  • Category : Cooking and fast food
  • Published : 2009
  • Type : Paperback
When a user searches for books using any one or more of the key words shown above, the book's name will be returned as a possible candidate for the book the user is searching for.

The accuracy of the search engine depends on the relevance of the key words that are indexed by the engine.
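
To make the idea concrete, here is a minimal sketch of a key word index in Python. It is an illustration under assumptions, not how any particular search engine is implemented: the identifier 101, the field names and the helper functions build_index and search are all made up for this example.

```python
from collections import Counter, defaultdict

# Hypothetical catalog entry; the identifier and field names are illustrative.
BOOKS = {
    101: {
        "title": "Fifty ways to make a sandwich",
        "author": "Danny Shrill",
        "publisher": "Penguin Books",
        "category": "cooking fast food",
        "published": "2009",
        "type": "paperback",
    },
}


def build_index(books):
    """Map each lower-cased key word to the set of book ids that contain it."""
    index = defaultdict(set)
    for book_id, fields in books.items():
        for value in fields.values():
            for word in value.lower().split():
                index[word].add(book_id)
    return index


def search(index, query):
    """Return ids of books matching any query word, most matched words first."""
    hits = Counter()
    for word in set(query.lower().split()):
        for book_id in index.get(word, ()):
            hits[book_id] += 1
    return [book_id for book_id, _ in hits.most_common()]


index = build_index(BOOKS)
print(search(index, "penguin sandwich"))  # [101]
```

The search returns only identifiers. The content itself stays wherever it already lives, which is the property we will rely on later when we combine the search engine with the database.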

Search Engines and enterprise applications
Search engines may not be relevant for enterprise application searches across the board. Their use should be restricted to cases that meet the following criteria
  1. Where search performance SLAs are very strict (expected response times are very low)
  2. Where data volumes are very high
  3. Where users may not be able to provide accurate data points for the search
A high-level approach to using search engines
Let us approach this with an example. Let us assume that we are building a product search for amazon.com. When a user searches for a product on amazon.com, the search has to fulfill the following criteria
  1. It has to be really fast
  2. It has to be accurate in retrieving relevant products
  3. It has to be flexible when users misspell what they are looking for
  4. It has to provide a good set of suggestions when users cannot find what they want
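
Criteria 3 and 4 can be illustrated with a small sketch. It uses get_close_matches from Python's standard difflib module as a stand-in for the fuzzy matching a real search engine would provide. The vocabulary, the suggest helper and the 0.8 cutoff are assumptions chosen for this example.

```python
from difflib import get_close_matches

# Hypothetical vocabulary: in practice this would be the set of indexed key words.
VOCABULARY = ["sandwich", "cooking", "penguin", "paperback", "shrill"]


def suggest(word, vocabulary=VOCABULARY, limit=3):
    """Return up to `limit` indexed key words that closely resemble `word`."""
    return get_close_matches(word.lower(), vocabulary, n=limit, cutoff=0.8)


print(suggest("sandwhich"))  # ['sandwich']
```

A misspelled term can be replaced by its closest suggestion before the search runs, or the suggestions can be shown to the user when nothing is found.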
How do we go about using a search engine that blends with a classic relational database system for retrieving results?
Let us apply all of what we learned in the previous parts and quickly summarize the steps
  1. We will create a denormalized table structure
  2. We will use a threaded searching mechanism for burst searching
To use a search engine, the following activities have to be done
  1. Build a key word catalog
  2. Integrate the indexing mechanism
Build Key word catalog
A key word catalog is a dictionary of terms that users will typically search with. The key words can be classified and grouped depending on need. Let us now attempt to build a key word catalog for our exercise.
Key word catalog for Amazon products
  1. Name
  2. Type : books, music, ebook, audiobook
  3. Genre : fiction, self help, pop, rock
  4. Author/writer
  5. Played by
  6. Supporting cast
  7. Publisher
  8. Year published
  9. Media : hardbound, paperback, ebook, download, DVD
This is a brief collection of key words by which a prospective amazon.com customer could look for products. It can also be worked out in reverse from the criteria on your basic / advanced search screens.
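
As a rough sketch, the catalog can be expressed in code as the list of fields worth indexing, with everything else in a product record ignored. The field names, the extract_keywords helper and the sample product below are hypothetical and exist only to illustrate the idea.

```python
# Hypothetical field names mirroring the key word catalog above.
CATALOG_FIELDS = (
    "name", "type", "genre", "author", "played_by",
    "support_cast", "publisher", "year_published", "media",
)


def extract_keywords(product):
    """Keep only the catalog fields of a product, lower-cased for indexing."""
    keywords = {}
    for field in CATALOG_FIELDS:
        value = product.get(field)
        if value:
            keywords[field] = str(value).lower()
    return keywords


# Illustrative product record; price is not in the catalog, so it is dropped.
product = {
    "name": "Fifty ways to make a sandwich",
    "type": "Books",
    "author": "Danny Shrill",
    "year_published": 2009,
    "price": 9.99,
}
print(extract_keywords(product))
```

Keeping the catalog as a single list means the indexing code and the search screens can both be driven from the same definition.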
Once this has been built, the next step is to integrate the search engine to index and execute the searches.
Integrating the search engine
Search engine integration has two parts to it.
  1. Indexing
  2. Searching
Indexing
Indexing the product database can be done in multiple ways. The simplest is to index whenever product data is modified in the system. This way, whenever new product data is added or existing product information is modified, the search engine is updated with the content.
This ensures that the search engine indexes are up to date. Let us look a bit closer at how indexing is done.
The index can be thought of as a compilation of books (many engines, Lucene for example, call these entries documents). Each book is a catalog entry comprising two parts. There are many other meta parts, but we will restrict our discussion to these two to keep things simple.
  1. Key words
  2. Identifier
Key words, as we saw earlier, are a grouped collection of terms that users typically search by. The identifier is a unique value, such as a primary key, that the search engine returns as the result of a search.
So, as presented above, if we ensure that the search engine indexes content every time a transaction is created or modified, the search engine will use its own algorithms to hash and store the index entries for fast retrieval.
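
The following is a minimal sketch of indexing on every create or modify. The SearchIndex class and the save_product function are hypothetical stand-ins for a real search engine and a real persistence layer, and the database is simulated with a plain dictionary.

```python
from collections import defaultdict


class SearchIndex:
    """Toy index: each entry is a set of key words plus an identifier."""

    def __init__(self):
        self._postings = defaultdict(set)  # key word -> identifiers
        self._words_by_id = {}             # identifier -> key words currently indexed

    def index(self, identifier, keywords):
        """Index or re-index one entry, dropping its stale key words first."""
        self.remove(identifier)
        words = {w for value in keywords.values() for w in str(value).lower().split()}
        for word in words:
            self._postings[word].add(identifier)
        self._words_by_id[identifier] = words

    def remove(self, identifier):
        for word in self._words_by_id.pop(identifier, set()):
            self._postings[word].discard(identifier)

    def lookup(self, word):
        return set(self._postings.get(word.lower(), set()))


def save_product(db, index, product_id, product):
    """Write the product, then refresh its index entry in the same step."""
    db[product_id] = product
    index.index(product_id, product)


database, engine = {}, SearchIndex()
save_product(database, engine, 101, {"name": "Fifty ways to make a sandwich"})
save_product(database, engine, 101, {"name": "Fifty ways to make a salad"})
print(engine.lookup("sandwich"), engine.lookup("salad"))  # set() {101}
```

Note that a modification has to remove the old key words as well as add the new ones, otherwise the index keeps returning the product for terms it no longer contains.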

Searching
The next part of integration is the search process itself. When a user performs a search, the search program should first use the search engine API to ask the search engine for the key identifiers that match the criteria.
Search engine APIs are fairly simple and allow searches to be done by providing a list of key words, optionally with their classifications.
Search engines also provide extended APIs to search with emphasis on certain classifications and to return results that meet desired matching criteria, which refines the results further.
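
Emphasis on a classification can be pictured as a weight applied to matches in that field. The sketch below is an assumption about how such scoring might look, not the API of any real search engine. The weights, the entries and the ranked_search function are invented for illustration.

```python
# Hypothetical weights: a match in the name counts more than one in the publisher.
FIELD_WEIGHTS = {"name": 3.0, "author": 2.0, "publisher": 1.0}

# Illustrative index entries keyed by identifier.
ENTRIES = {
    1: {"name": "fifty ways to make a sandwich", "author": "danny shrill",
        "publisher": "penguin books"},
    2: {"name": "penguin habitats", "author": "a writer",
        "publisher": "example press"},
}


def score(entry, query_words, weights):
    """Sum the weights of the fields in which query words appear."""
    total = 0.0
    for field, weight in weights.items():
        field_words = set(str(entry.get(field, "")).lower().split())
        total += weight * len(field_words & query_words)
    return total


def ranked_search(entries, query, weights=FIELD_WEIGHTS):
    """Return identifiers with a non-zero score, highest score first."""
    query_words = set(query.lower().split())
    scored = [(score(entry, query_words, weights), ident)
              for ident, entry in entries.items()]
    return [ident for s, ident in sorted(scored, key=lambda pair: -pair[0]) if s > 0]


print(ranked_search(ENTRIES, "penguin"))  # [2, 1]
```

Both entries match "penguin", but the one that matches in the name is ranked ahead of the one that matches only in the publisher.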
The key identifiers returned by the search engine are then used in transactional searches to retrieve the data from the denormalized data structure in the database.
Since the database rows are retrieved by directly specifying the key identifier, retrieval is a primary key lookup, which is typically the fastest access path the database offers.

  1. Query the search engine with the search criteria
  2. The search engine returns the books matching the criteria in the specified order of relevance
  3. The application forms a database query with the key identifiers from those books
  4. The database returns the relevant rows based on the primary keys specified
This approach has been shown to provide the fastest results, as it relies on a custom-built search component to retrieve the results. However, it comes with its own cost: maintaining the search indexes in a cluster-friendly location.
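
The database half of these steps can be sketched with Python's built-in sqlite3 module. The table name product_search, its columns and the fetch_products function are assumptions made for this example, and the search engine call is replaced by a hard-coded list of identifiers in relevance order.

```python
import sqlite3


def fetch_products(conn, identifiers):
    """Fetch rows by primary key, preserving the search engine's relevance order."""
    if not identifiers:
        return []
    placeholders = ",".join("?" for _ in identifiers)
    rows = conn.execute(
        f"SELECT id, name, author FROM product_search WHERE id IN ({placeholders})",
        list(identifiers),
    ).fetchall()
    by_id = {row[0]: row for row in rows}
    # The database does not guarantee row order, so reapply the ranking here.
    return [by_id[i] for i in identifiers if i in by_id]


# Hypothetical denormalized table holding the searchable product data.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE product_search (id INTEGER PRIMARY KEY, name TEXT, author TEXT)")
conn.executemany(
    "INSERT INTO product_search VALUES (?, ?, ?)",
    [(101, "Fifty ways to make a sandwich", "Danny Shrill"),
     (102, "Another title", "Another author")],
)

# Stand-in for the search engine call: identifiers in relevance order.
ranked_ids = [102, 101]
print(fetch_products(conn, ranked_ids))
```

Because the rows always come from the database, the results are current even if the index briefly lags behind. An identifier that is still in the index but no longer in the database is simply skipped.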

Let us quickly take stock of this approach vis-a-vis the checklist we put together in the first part.

Consideration                              | Compliance | Remarks
Search must be fast                        | Complies   | Much improved performance over the earlier approach
Search must be accurate                    | Complies   |
Search must use minimal system resources   | Complies   | Much better than the earlier DB-only approach
Search must avoid redundant queries        | Complies   | Still executes redundant queries when paginating
Search must provide current data           | Complies   | The cached portion is only the identifiers, and the data can be retrieved directly from the database
Pagination must be fast                    | Complies   | Optimized cache access
Must facilitate on demand sorting          | Complies   | Requires requerying the DB unless a sorting API is available in the programming language, e.g. LINQ
Must facilitate on demand result filtering | Complies   |
Must be multi-lingual friendly             | Complies   |
Solution must be cluster friendly          | Complies*  | Subject to support from the caching solution

Closing comments
From a solution perspective, we have looked at many options that are available to us for building efficient searches. However, we have only scratched the surface, and many of the solutions we have adopted have deeper tuning options that allow us to exploit their features further to deliver better searches.
I will close this subject with this part hoping that this series has provided an impetus for you to embark on your own discovery process to further analyze and evolve a solution that best fits your needs.
