Search Engine Workshops
Latent Semantic Indexing
Advanced Information Retrieval Techniques to Improve Online Search Results
By Raymond Haynes, CEO, EML Websolutions, Inc.
Why the need for Latent Semantic Indexing?
Latent Semantic Indexing, the newest form of search engine artificial intelligence, gives search engines a "conceptual capability" when indexing unstructured databases. Just a few years ago, search engine algorithms relied on webmasters placing keywords in the "keyword tag" to provide source data for a given search query. Obviously, this method produced large recall sets from the search index, but yielded unsatisfactory precision. Taking advantage of a computer's ability to execute repetitive tasks quickly, search algorithms began matching keyword tag data with actual text content. Optimizers began developing more content-related text, and analysis tools such as keyword density ratios and keyword prominence factors became state of the art. The increased ability of the algorithms to analyze content reduced the size of the recall sets but improved the precision of the results. Although this was a significant step in search engine indexing capability, the programs still essentially relied on word-matching database techniques. The algorithms could analyze text for keyword density and prominence, but lacked the ability to index conceptual ideas and themes. Yahoo search developers recognized this shortcoming and introduced human intervention into indexing their search results, with some success. The fundamental problem with human intervention was the exponential growth of the information that needed to be analyzed and included in the database. This method is still used today, to a certain extent, but it is not practical from a manpower standpoint to index such large amounts of data. Although improvements in information retrieval search algorithms continue, the community is still limited by word-matching database technology. Latent Semantic Indexing is one method of taking information retrieval to the next step.
LSI technology provides the ability to index conceptual ideas, providing both large recall sets and high levels of precision for a given search query.
What is Latent Semantic Indexing?
Latent Semantic Indexing (LSI) was developed at Bellcore in the late 1980s. It employs a mathematical approach to model statistical inferences about underlying (latent) relationships between commonly related (semantically close) words. Latent Semantic Indexing can neither "read" text nor understand the content it implies. LSI creates a result set by examining a document collection and producing results based on similarity values that indicate whether a document is semantically close to or semantically distant from the search query. It is this semantic relationship that allows LSI to return documents that may or may not contain the original search phrase. The algorithm recognizes that the content displays a strong relationship to the search query and "indexes" the document whether or not the actual search term is contained in the text. In studies conducted, Latent Semantic Indexing was found to be 30% more effective than traditional word-matching methods (Dumais, 1995). LSI programming can be used as a standalone document search methodology or to augment traditional word-match algorithms. The fundamental importance of this technology to the search engine optimization community is that content can be developed using LSI guidelines, and rank high in the result set, without the text containing an exact search phrase.
How Does Latent Semantic Indexing Work?
To analyze how Latent Semantic Indexing can benefit the search engine optimization community, a rudimentary knowledge of how the indexing methodology operates is essential. Latent Semantic Indexing looks at patterns of word distribution, specifically word co-occurrence. LSI begins by generating a complete list of all words that appear anywhere in the documents, then eliminates words that carry no semantic meaning. The words that remain are referred to as the content words, and the list is generated mainly by the following steps:
1. Discard articles, prepositions, and conjunctions
2. Discard common verbs (know, see, do, be)
3. Discard pronouns
4. Discard common adjectives (big, late, high)
5. Discard frilly words (therefore, thus, however, albeit, etc.)
6. Discard any words that appear in every document
7. Discard any words that appear in only one document
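The filtering steps above can be sketched in a few lines of Python. This is only an illustration: the stop list is a tiny hypothetical sample (a production list would be far larger), and the function names tokenize and content_words are invented for this example.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stop list covering steps 1-5.
STOP_WORDS = {
    "a", "an", "the",                          # articles
    "of", "in", "on", "to", "by", "with",      # prepositions
    "and", "or", "but",                        # conjunctions
    "know", "see", "do", "be", "is", "are",    # common verbs
    "it", "they", "he", "she", "we",           # pronouns
    "big", "late", "high",                     # common adjectives
    "therefore", "thus", "however", "albeit",  # "frilly" words
}


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def content_words(documents):
    """Return the sorted content-word list for a document collection."""
    # Steps 1-5: drop stop words, keeping one set of words per document.
    token_sets = [set(tokenize(doc)) - STOP_WORDS for doc in documents]
    # Document frequency: how many documents each word appears in.
    doc_freq = Counter(word for tokens in token_sets for word in tokens)
    n_docs = len(documents)
    # Steps 6-7: drop words found in every document or in only one.
    return sorted(w for w, df in doc_freq.items() if 1 < df < n_docs)


docs = [
    "The search engine indexes the web page.",
    "A search query is matched to a page in the index.",
    "The engine ranks the page by the query.",
]
print(content_words(docs))  # ['engine', 'query', 'search']
```

In this toy collection "page" is dropped because it appears in all three documents, and words such as "web" and "ranks" are dropped because each appears in only one.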
The list of content words now has the "noise" removed from the text, and the "latent" meaning of the document is exposed. Once the list of content words and the associated documents have been generated, they are placed in a term-document matrix. This matrix is a large grid with the documents listed along the horizontal axis and the content words along the vertical axis. The grid is completed by placing an "X" in any square where a content word appears in a document. Notice that the grid arrangement is binary: a square either has an "X" or is blank. This is the visual equivalent of a generic keyword search, which looks for exact matches for words in documents. Latent Semantic Indexing takes this arrangement further by replacing each "X" with a one and each blank with a zero, generating a binary matrix. This binary matrix is decomposed by a technique called singular value decomposition (SVD). The mathematical workings behind this method are extremely advanced; however, the theory is relatively simple. Each content word in a document universe is assigned a vector. As these vectors are projected into a multidimensional space they tend to form clusters. LSI theorizes that content words that are close together in the vector space are semantically related and those that are far apart are semantically unrelated. This vector system is why Latent Semantic Indexing is able to associate documents with similar content and meaning without having to rely on matching keywords.
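As a rough illustration of the matrix and decomposition just described, the sketch below builds a small binary term-document matrix and reduces it with NumPy's numpy.linalg.svd. The four-document collection, the choice of two retained dimensions (k = 2), and the helper names cosine and similarity are assumptions made for this example only, not part of any particular search engine.

```python
import numpy as np

terms = ["car", "automobile", "engine", "flower", "petal"]

# Binary term-document matrix: rows are content words, columns are
# documents, and a 1 marks a word that appears in a document.
A = np.array([
    [1, 0, 0, 0],   # car
    [0, 1, 0, 0],   # automobile
    [1, 1, 0, 0],   # engine
    [0, 0, 1, 1],   # flower
    [0, 0, 1, 1],   # petal
], dtype=float)

# Singular value decomposition: A = U @ diag(s) @ Vt.
U, s, Vt = np.linalg.svd(A, full_matrices=False)

# Keep only the k largest singular values (k = 2 is an assumption here).
k = 2
term_vecs = U[:, :k] * s[:k]   # one k-dimensional vector per content word


def cosine(u, v):
    """Cosine similarity between two vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))


def similarity(t1, t2):
    """Similarity of two content words in the reduced space."""
    return cosine(term_vecs[terms.index(t1)], term_vecs[terms.index(t2)])


print(cosine(A[0], A[1]))                # raw rows: no shared document
print(similarity("car", "automobile"))   # reduced space: close to 1.0
print(similarity("car", "flower"))       # reduced space: close to 0.0
```

"car" and "automobile" never appear in the same document, so a word-matching comparison of their rows finds no overlap. Because both co-occur with "engine", the reduced vectors place them in the same cluster, while "flower" and "petal" fall in a separate one.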
Optimizing Web Pages that are Indexed by Latent Semantic Indexing
With a better understanding of how Latent Semantic Indexing functions, the process of optimizing web pages becomes more intuitive. Although Latent Semantic Indexing is capable of operating independently, it is highly unlikely that a search engine algorithm will use this technology as its sole document retrieval methodology. The suggested techniques are for optimizing for LSI only, and normal optimization techniques should still be employed to satisfy traditional search engine requirements. As stated earlier, Latent Semantic Indexing first evaluates a document set by removing all linguistic noise from the content word list. This means that all capitalization, punctuation, word order and formatting are removed, which implies that techniques such as keyword prominence, bold type, and italics are unnecessary. LSI technology knows nothing about language per se, and uses mathematical vectors to associate semantically close words. The program also employs word stemming to further reduce noise. A common method is to use a Porter stemmer, which removes the common endings of words. This implies that the tense and person in which the article is written are insignificant. All information carried by style, punctuation, and grammar is removed. The algorithm employs local term weighting, global term weighting, and normalization to assign numerical values to the vector model. Logarithmic local term weighting ensures that content words that appear several times in a document are more meaningful than ones that appear only once. This implies that strong content words that directly convey the theme of the article should be used frequently (appropriate percentages have not yet been defined). Global term weighting ensures that infrequently used words are given a higher weight. Technical terms, and replacement words for common terms, carry more numerical value than common phrases used in text.
Normalization ensures that large documents with a high number of content words do not overshadow smaller documents. In this case, larger is not necessarily better. The local weight, global weight, and normalization factor are multiplied together to create the numerical value placed in the term-document matrix. Because of this latent methodology of indexing, all but a few optimization methods are ineffective. Writing strong, content-rich text with specific technical terms and alternative choices for common phrases appears to be the most effective way to produce high content-word indexes. As far as Latent Semantic Indexing is concerned, due to normalization, the longest article does not always produce the best results. These methods, in conjunction with normal optimization techniques, should produce highly rated, effective pages for your web site optimization project.
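The weighting scheme described above (local weight times global weight times normalization factor) can be sketched as follows. The article does not specify exact formulas, so this example assumes log(1 + count) for the local weight, inverse document frequency for the global weight, and scaling each document to unit length for normalization. The function name weighted_matrix and the sample counts are hypothetical.

```python
import math


def weighted_matrix(counts):
    """Weight a raw count matrix; counts[i][j] is term i's count in document j."""
    n_terms, n_docs = len(counts), len(counts[0])

    # Local weight (assumed): log(1 + count), so repeated words count
    # for more, but with diminishing returns.
    local = [[math.log(1 + tf) for tf in row] for row in counts]

    # Global weight (assumed inverse document frequency): words that
    # appear in fewer documents receive a higher weight.
    global_w = []
    for row in counts:
        df = sum(1 for tf in row if tf > 0)
        global_w.append(math.log(n_docs / df) if df else 0.0)

    weighted = [
        [local[i][j] * global_w[i] for j in range(n_docs)]
        for i in range(n_terms)
    ]

    # Normalization (assumed): scale every document column to unit
    # length so long documents do not overshadow short ones.
    for j in range(n_docs):
        norm = math.sqrt(sum(weighted[i][j] ** 2 for i in range(n_terms)))
        if norm > 0:
            for i in range(n_terms):
                weighted[i][j] /= norm
    return weighted


# Rows are content words, columns are documents (raw occurrence counts).
counts = [
    [4, 1, 0],
    [1, 0, 1],
    [0, 2, 6],
]
for row in weighted_matrix(counts):
    print([round(value, 3) for value in row])
```

After weighting, every document column has length one, so the third document gains nothing from its larger raw counts. Under the assumed global weight, a word that appears in every document would receive a weight of zero, which is consistent with discarding such words from the content-word list.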