Search Engines
Search engines come in several types. The ones we are all most familiar with, internet search engines, scour the web for data and catalog it to help us find the web pages that provide the content we are looking for.
The second kind is the internet search engine applied to the end-user desktop. These are classified as desktop search engines, and they help users search their local machines by keyword for anything from documents to music files and emails.
The one we are going to use in this context is the enterprise search engine, which is essentially a stripped-down version of a desktop search engine with extensions that enable application developers to harness the power of the search engine.
Before we look at how we will use search engines, let us briefly understand how they work.
A search engine is fundamentally about indexing various key words against a content link, so that when a user searches using any of those key words, the mapped content can be presented to the user.
For instance, suppose we have a book:
"Fifty ways to make a sandwich" by Danny Shrill, a fast cooking guide for working men and women, Penguin Books
A search engine will index many of the key words in the context of the book, such as:
- Author name: Danny Shrill
- Publisher: Penguin Books
- Subject: Sandwich
- Category: cooking and fast food
- Published: 2009
- Type: Paperback
The accuracy of the search engine depends on the relevance of the key words indexed by the engine.
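To make the key-word-to-content mapping concrete, here is a minimal sketch of an inverted index in Python. This is an illustration of the general idea, not any particular engine's API; the book identifier `book-001` is an invented example.

```python
from collections import defaultdict

# Minimal inverted index: maps each key word to the set of
# content identifiers (here, book IDs) that carry it.
index = defaultdict(set)

def add_to_index(content_id, keywords):
    """Index every key word against the content identifier."""
    for word in keywords:
        index[word.lower()].add(content_id)

# Index the sandwich book from the example above.
add_to_index("book-001", ["Danny Shrill", "Penguin Books", "Sandwich",
                          "cooking", "fast food", "2009", "Paperback"])

def search(keyword):
    """Return the identifiers mapped to a key word, if any."""
    return index.get(keyword.lower(), set())

print(search("sandwich"))  # {'book-001'}
```

Real engines add tokenization, stemming and relevance scoring on top of this basic structure, but the core mapping is the same.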
Search Engines and enterprise applications
Search engines may not be relevant for enterprise application searches across the board. Their use should be restricted to cases that meet the following criteria:
- Where search performance SLAs are stringent (expected response times are very low)
- Where data volumes are very high
- Where users may not be able to provide accurate data points for the search
Let us approach this with an example. Let us assume that we are building a product search for amazon.com. When a user searches for a product on amazon.com, the search has to fulfill the following criteria:
- It has to be really fast
- It has to be accurate in retrieving relevant products
- It has to be flexible when users misspell what they are looking for
- It has to provide a good set of suggestions when users cannot find what they want
Let us apply all of what we learned in the previous parts and quickly summarize the steps:
- Create a denormalized table structure
- Use a threaded searching mechanism for burst searching
- Build a key word catalog
- Integrate an indexing mechanism
A key word catalog is a dictionary of terms a user will typically use for searching. The key words can be classified and grouped depending on need. Let us now attempt to build a key word catalog for our exercise.
Keyword catalog for amazon products
- Name
- Type: books, music, ebook, audiobook
- Genre: fiction, self help, pop, rock
- Author/writer
- Played by
- Supporting cast
- Publisher
- Year published
- Media: hard bound, paperback, ebook, download, dvd
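One simple way to represent such a catalog in code is a dictionary mapping each classification to its known terms. The names and values below are illustrative, taken from the list above, not a standard schema:

```python
# Illustrative keyword catalog: classifications mapped to example terms.
keyword_catalog = {
    "type":  ["books", "music", "ebook", "audiobook"],
    "genre": ["fiction", "self help", "pop", "rock"],
    "media": ["hard bound", "paperback", "ebook", "download", "dvd"],
}

def classify(term):
    """Find which classifications a search term falls under."""
    term = term.lower()
    return [cls for cls, terms in keyword_catalog.items()
            if any(term == t.lower() for t in terms)]

print(classify("ebook"))  # ['type', 'media']
```

Note that a term like "ebook" legitimately belongs to more than one classification, which is why grouping key words by classification pays off when refining searches later.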
Once this has been built, the next step is to integrate the search engine to index and execute the searches.
Integrating the search engine
Search engine integration has two parts to it.
- Indexing
- Searching
Indexing the product database can be done in multiple ways. The simplest is to index whenever product data is modified in the system: any time a new product is added, or existing product information is changed, the search engine is updated with its content.
This ensures that the search engine indexes are always up to date. Let us look a bit closer at how indexing is done.
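A sketch of this index-on-modify approach is shown below. The `SearchEngineStub` class and its `index()` method are stand-ins I have invented for whatever indexing call your chosen engine exposes; the point is only that the save path and the index update happen together:

```python
class SearchEngineStub:
    """Stand-in for a real engine's indexing API (hypothetical)."""
    def __init__(self):
        self.documents = {}

    def index(self, doc_id, keywords):
        # Re-indexing the same identifier simply overwrites the old entry.
        self.documents[doc_id] = keywords

engine = SearchEngineStub()

def save_product(product_id, name, keywords, db):
    """Persist the product, then refresh the search index in the same flow."""
    db[product_id] = {"name": name, "keywords": keywords}
    engine.index(product_id, keywords)  # index on every create/modify

db = {}
save_product("p-42", "Fifty ways to make a sandwich",
             ["sandwich", "cooking", "paperback"], db)
```

In production you would typically push the index update asynchronously (e.g. via a queue) so a slow engine never blocks the transaction, but the synchronous form above is easier to reason about.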
Most search engines refer to index entries as a compilation of books. Each book is a catalog entry comprising two parts. There are many other meta parts, but we will restrict our discussion to these two to keep things simple.
- Key words
- Identifier
So, as presented above, if we ensure that the search engine indexes content every time a transaction is created or modified, the engine will use its algorithms to hash and store the indexes for fast retrieval.
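The two-part book entry described above can be captured in a few lines. This is a conceptual model only; actual engines store considerably more metadata per entry:

```python
from dataclasses import dataclass, field

# A "book" in the engine's index: just the key words plus the
# identifier that points back at the transactional record.
@dataclass
class Book:
    identifier: str                       # e.g. the product's primary key
    keywords: set = field(default_factory=set)

entry = Book(identifier="p-42", keywords={"sandwich", "cooking"})
```

The identifier is what ties the engine's world back to yours: searches resolve to identifiers, and identifiers resolve to rows in your database.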
Searching
The next part of integration is the search process itself. When a user performs a search, the search program should first use the search engine API and ask the search engine to return the KeyIDs that match the criteria.
Search engine APIs are fairly simple and enable searches to be done by providing a list of key words, optionally along with their classifications.
Search engines also provide extended APIs to search with emphasis on certain classifications, and to return search results against desired matching criteria to further refine results.
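The idea of emphasising certain classifications can be illustrated with a toy weighted search. The field weights, documents and scoring below are all invented for illustration; real engines expose this through boost parameters rather than hand-rolled loops:

```python
# Illustrative field weights: matches in "name" count more than in "genre".
weights = {"name": 3.0, "author": 2.0, "genre": 1.0}

documents = {
    "p-1": {"name": "rock climbing guide", "author": "a. smith", "genre": "sport"},
    "p-2": {"name": "hits album", "author": "b. jones", "genre": "rock"},
}

def weighted_search(term):
    """Score each document by the weight of every field the term appears in."""
    scores = {}
    for doc_id, fields in documents.items():
        score = sum(w for f, w in weights.items() if term in fields.get(f, ""))
        if score:
            scores[doc_id] = score
    # Higher score first: matches in more important fields rank higher.
    return sorted(scores, key=scores.get, reverse=True)

print(weighted_search("rock"))  # ['p-1', 'p-2']
```

Here "rock" in a product's name outranks "rock" as a genre, which is usually what users expect.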
The key identifiers returned by the search engines are then used in transactional searches to retrieve the data from the denormalized data structure in the database.
Since the database results are retrieved by directly specifying the key identifier, the retrieval will be the fastest.
- Query the search engine with the search criteria
- The search engine returns books matching the criteria in the specified order of relevance
- The application forms a DB query with the key identifiers in those books
- The database returns the relevant rows based on the primary keys specified
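The four steps above can be sketched end to end. The toy in-memory index and database below stand in for a real engine and a real denormalized table; the product IDs and names are invented:

```python
# Toy engine index: keyword -> key identifiers, in relevance order.
inverted = {"sandwich": ["p-42", "p-77"]}

# Toy denormalized product table, keyed by primary key.
database = {
    "p-42": {"name": "Fifty ways to make a sandwich"},
    "p-77": {"name": "Sandwich press"},
    "p-99": {"name": "Rock anthology"},
}

def query_engine(keyword):
    # Steps 1-2: the engine returns matching key identifiers by relevance.
    return inverted.get(keyword.lower(), [])

def fetch_rows(key_ids):
    # Steps 3-4: form the DB query with the identifiers; primary-key
    # lookups are the fastest retrieval path the database offers.
    return [database[k] for k in key_ids if k in database]

results = fetch_rows(query_engine("Sandwich"))
print([r["name"] for r in results])
# ['Fifty ways to make a sandwich', 'Sandwich press']
```

Note that the relevance ordering comes from the engine, and is preserved when the rows are fetched, so the database never has to sort by relevance itself.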
Let us quickly take stock of this approach vis-a-vis the checklist we put together in the first part.
| Consideration | Compliance | Remarks |
|---|---|---|
| Search must be fast | Complies | Much improved performance over the earlier approach |
| Search must be accurate | Complies | |
| Search must use minimal system resources | Complies | Much better than the earlier DB-only approach |
| Search must avoid redundant queries | Complies | Still executes redundant queries when paginating |
| Search must provide current data | Complies | Only the identifiers are cached; the data is retrieved directly from the database |
| Pagination must be fast | Complies | Optimized cache access |
| Must facilitate on-demand sorting | Complies | Requires requerying the DB unless a sorting API is available in the programming language, e.g. LINQ |
| Must facilitate on-demand result filtering | Complies | |
| Must be multilingual friendly | Complies | |
| Solution must be cluster friendly | Complies* | Subject to support from the caching solution |
Closing comments
From a solution perspective, we have looked at many of the options available to us for building efficient searches. However, we have only scratched the surface, and many of the solutions we have adopted have deeper tuning options that allow us to exploit their features further and deliver even better searches.
I will close the subject with this part, hoping that this series has provided an impetus for you to embark on your own discovery process and evolve a solution that best fits your needs.