What Is Latent Semantic Indexing

Latent semantic indexing (LSI) is an information retrieval technique that uses a mathematical method, singular value decomposition (SVD), to identify the concepts expressed in a body of text.  It is built on the natural language processing technique known as latent semantic analysis (LSA).  LSA examines the relationships between a set of documents and the words they contain, and from these derives a set of concepts that describe the documents.  As a result, LSI can return documents as results for a query even if they do not contain the exact words or phrases typed in by the searcher.
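To make the idea concrete, here is a minimal sketch of LSI on a toy collection. LSI is usually described in terms of a singular value decomposition of a term-document matrix; this sketch uses only the Python standard library and approximates the leading singular vectors with power iteration, which is a simplification chosen for this example. The documents, the function names, and the choice of two concepts are all assumptions made for illustration, and a real system would normally weight the term counts and use a numerical library.

```python
import math

def build_term_document_matrix(docs):
    """Return (vocab, matrix), where matrix[i][j] counts term i in document j."""
    vocab = sorted({word for doc in docs for word in doc.split()})
    row_of = {word: i for i, word in enumerate(vocab)}
    matrix = [[0.0] * len(docs) for _ in vocab]
    for j, doc in enumerate(docs):
        for word in doc.split():
            matrix[row_of[word]][j] += 1.0
    return vocab, matrix

def leading_concepts(matrix, k, iterations=500):
    """Approximate the k leading left singular vectors of the matrix.

    Uses power iteration with deflation on A * A^T, a simple stand-in
    for a full singular value decomposition (an assumption of this sketch).
    """
    m = len(matrix)
    gram = [[sum(a * b for a, b in zip(matrix[i], matrix[j])) for j in range(m)]
            for i in range(m)]
    concepts = []
    for _ in range(k):
        v = [1.0 / math.sqrt(m)] * m
        for _ in range(iterations):
            w = [sum(gram[i][j] * v[j] for j in range(m)) for i in range(m)]
            norm = math.sqrt(sum(x * x for x in w))
            v = [x / norm for x in w]
        eigenvalue = sum(v[i] * gram[i][j] * v[j]
                         for i in range(m) for j in range(m))
        # Deflate: remove the concept just found so the next pass finds a new one.
        for i in range(m):
            for j in range(m):
                gram[i][j] -= eigenvalue * v[i] * v[j]
        concepts.append(v)
    return concepts

def project(concepts, term_vector):
    """Map a vector of term counts to coordinates in the concept space."""
    return [sum(u * t for u, t in zip(concept, term_vector)) for concept in concepts]

def cosine(a, b):
    """Cosine similarity, defined as 0.0 when either vector is all zeros."""
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return sum(x * y for x, y in zip(a, b)) / (norm_a * norm_b)

# Hypothetical toy collection: three documents about vehicles, two about food.
DOCS = [
    "car engine repair",
    "automobile engine repair",
    "car automobile dealer",
    "banana fruit salad",
    "fruit salad recipe",
]

def lsi_scores(query, docs=DOCS, k=2):
    """Score every document against the query in a k-dimensional concept space."""
    vocab, matrix = build_term_document_matrix(docs)
    concepts = leading_concepts(matrix, k)
    query_vector = [float(query.split().count(word)) for word in vocab]
    query_coords = project(concepts, query_vector)
    scores = []
    for j in range(len(docs)):
        doc_vector = [row[j] for row in matrix]
        scores.append(cosine(query_coords, project(concepts, doc_vector)))
    return scores

if __name__ == "__main__":
    for doc, score in zip(DOCS, lsi_scores("automobile")):
        print(f"{score:.2f}  {doc}")
```

In this toy collection the query "automobile" scores highly against "car engine repair", which never uses the word, because both fall on the same concept, while the food documents score close to zero.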

LSI offers a remedy for two of the most troublesome weaknesses of ordinary Boolean keyword search.  One is synonymy: several different words can have similar meanings.  The other is polysemy: a single word can have several meanings.  Synonymy causes relevant web pages and documents to be left out of the search results, while polysemy causes irrelevant ones to appear.
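Both Boolean search problems can be seen in a small example. The sketch below is a deliberately naive exact-match search; the function name and the four sample documents are made up for illustration.

```python
def boolean_search(query, docs):
    """Return the indexes of documents that contain every word of the query."""
    terms = query.lower().split()
    return [i for i, doc in enumerate(docs)
            if all(term in doc.lower().split() for term in terms)]

# Hypothetical sample documents.
DOCS = [
    "used car for sale",                 # about vehicles
    "automobile dealer price list",      # about vehicles, but never says "car"
    "jaguar sports car review",          # "jaguar" the vehicle
    "jaguar habitat in the rainforest",  # "jaguar" the animal
]

# Similar meanings, different words: the automobile document is missed.
missed_by_wording = boolean_search("car", DOCS)

# One word, several meanings: vehicle and animal documents come back together.
mixed_meanings = boolean_search("jaguar", DOCS)
```

A search for "car" skips the automobile dealer's page, and a search for "jaguar" cannot tell the vehicle from the animal.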

Another application of LSI is automated document categorization.  Here, LSI uses sample documents as the basis for understanding the concepts embodied by each category.  It then compares the concepts found in a new document with those present in the example documents, and assigns the document to a category when its concepts are similar to those of that category's examples.
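The comparison step can be sketched as follows, assuming every document has already been mapped to a short vector of concept weights. The two-dimensional vectors, the category names, the use of an averaged example vector per category, and the 0.5 similarity threshold are all assumptions made for this illustration.

```python
import math

def cosine(a, b):
    """Cosine similarity, defined as 0.0 when either vector is all zeros."""
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return sum(x * y for x, y in zip(a, b)) / (norm_a * norm_b)

def centroid(vectors):
    """Average the example vectors of one category, component by component."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]

def categorize(doc_vector, examples_by_category, threshold=0.5):
    """Return the most similar category, or None if none reaches the threshold."""
    best_category, best_score = None, threshold
    for category, examples in examples_by_category.items():
        score = cosine(doc_vector, centroid(examples))
        if score >= best_score:
            best_category, best_score = category, score
    return best_category

# Hypothetical concept vectors for the example documents of each category.
EXAMPLES = {
    "vehicles": [[0.9, 0.1], [0.8, 0.0]],
    "cooking": [[0.1, 0.9], [0.0, 0.7]],
}

vehicle_label = categorize([0.7, 0.2], EXAMPLES)
cooking_label = categorize([0.2, 0.9], EXAMPLES)
no_label = categorize([-1.0, -1.0], EXAMPLES)
```

Returning None when nothing is similar enough leaves a document uncategorized instead of forcing it into the nearest category.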

Another advantage of LSI is that it can be applied to any language, because it relies on mathematical analysis rather than on language-specific rules.  Thus, it can extract the semantic content of documents written in any language without consulting a thesaurus or dictionary.  A query can even be made in one language while the documents are written in another, provided the system has been trained on documents that are available in both languages.

LSI can also be applied to terms that are not words in the ordinary sense, such as the DNA sequences of genes.  Thus, biological and medical documents can be searched and categorized using LSI.  For example, LSI can classify genes based on the biological information extracted from the abstracts and titles in biological databases.
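One way to turn a sequence into "terms" is to split it into short overlapping fragments and count them the way words are counted in a document. The sketch below assumes fragments of three letters; that length and the function names are illustrative choices, not something LSI prescribes.

```python
from collections import Counter

def kmers(sequence, k=3):
    """Split a sequence into its overlapping fragments of length k."""
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]

def term_counts(sequence, k=3):
    """Count each fragment, treating fragments like words in a document."""
    return Counter(kmers(sequence.upper(), k))

fragments = kmers("ATGCGA")   # four overlapping three-letter fragments
counts = term_counts("atgatg")  # lower-case input is normalized first
```

The resulting counts can fill one column of a term-document matrix in the same way that word counts do.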

It also adapts automatically to changing terminology, and it is hardly affected by unreadable characters, typographical mistakes, misspelled words, and other kinds of noise in documents.  Therefore, LSI can be applied to text produced by speech-to-text conversion programs and to text extracted from images by optical character recognition software.
