Sunday, July 6, 2014

Welcome to Semantic Search

This blog is dedicated to exploring the topic of semantic search.  It is about finding information based on its semantic content, not its form or syntax.

The first subject is the little predicate 'is about'.  We might say that a web page is about Babe Ruth.  What does that mean?  It may mean that the string 'Babe Ruth' occurs in the web page.  Or it may mean that other semantically related strings like 'The Babe', 'Ruth', or 'George Herman Ruth' occur within the page.
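To make the string-occurrence reading of 'is about' concrete, here is a minimal sketch in Python.  The alias list and the function name `mentions` are assumptions made up for illustration; a real system would presumably draw aliases from some curated source rather than a hard-coded list.

```python
import re

# Hypothetical alias list, hard-coded purely for illustration.
ALIASES = ["Babe Ruth", "The Babe", "George Herman Ruth", "Ruth"]

def mentions(text, aliases=ALIASES):
    """Return the aliases that occur in text as whole words, ignoring case."""
    found = []
    for alias in aliases:
        # \b anchors keep 'Ruth' from matching inside a word like 'truth'.
        pattern = r"\b" + re.escape(alias) + r"\b"
        if re.search(pattern, text, flags=re.IGNORECASE):
            found.append(alias)
    return found

page = "George Herman Ruth, better known as The Babe, played baseball."
print(mentions(page))
print(mentions("Ruth Bader Ginsburg was a judge."))
```

Note that the second call also reports a match on 'Ruth', even though that sentence has nothing to do with baseball.  Matching strings tells us that a name occurs, not that the page is about the person.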

But there is a problem with 'is about'.  How do we know that a page is REALLY about Babe Ruth and doesn't just pretend to be?  Maybe it is really a porn page that is trying to lure kids who are fascinated by Babe Ruth.  And even if the page is truly about Babe Ruth, how do we know that it is authoritative?  Is it a blog expressing some guy's opinion about Ruth, or is it a carefully researched, well-documented, and well-cited study of the life and deeds of Babe Ruth?

Furthermore, there are many adverbial qualifiers that could be added to 'is about', such as 'is only about', 'is mostly about', and 'is also about'.  The first of these, 'is only about', might suggest that the web page contains information only about Babe Ruth and not, say, about Lou Gehrig, except perhaps incidentally.  And what does 'is mostly about' mean anyway?
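One crude way to pin these qualifiers down is to count how often each subject is mentioned and compare the shares.  The sketch below does that in Python.  The function names, the alias lists, and especially the 0.9 and 0.5 thresholds are arbitrary assumptions chosen for illustration, not an established definition of 'mostly' or 'only'.

```python
import re

def mention_share(text, entity_aliases):
    """Map each entity to its fraction of all counted alias mentions in text."""
    counts = {}
    for entity, aliases in entity_aliases.items():
        # Try longer aliases first so 'Babe Ruth' is counted once, not twice.
        ordered = sorted(aliases, key=len, reverse=True)
        pattern = r"\b(?:" + "|".join(re.escape(a) for a in ordered) + r")\b"
        counts[entity] = len(re.findall(pattern, text, flags=re.IGNORECASE))
    total = sum(counts.values())
    return {e: (c / total if total else 0.0) for e, c in counts.items()}

def describe(share, only=0.9, mostly=0.5):
    """Turn a mention share into a qualifier; the thresholds are arbitrary."""
    if share >= only:
        return "is only about"    # leaves room for incidental mentions
    if share > mostly:
        return "is mostly about"
    if share > 0:
        return "is also about"
    return "is not about"

entities = {
    "Babe Ruth": ["Babe Ruth", "Ruth"],
    "Lou Gehrig": ["Lou Gehrig", "Gehrig"],
}
page = "Babe Ruth hit a home run. Ruth then waved to Gehrig."
for entity, share in mention_share(page, entities).items():
    print(entity, describe(share))
```

Even this toy version shows how much judgment hides in the word 'mostly'.  Move the thresholds and the same page changes category, and counting mentions says nothing about whether the page is honest or authoritative.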

So, you see, the subject of semantics is far more nuanced and more difficult than the subject of syntax.  The latter, after all, seems to be mostly limited to parsing string patterns of various kinds.  But teasing the meaning out of a word is highly dependent on many subtle, contextual factors like the mood and intent of the speaker, the sentence in which it is embedded, the paragraph in which the sentence is embedded, etc.  There is also something that the hearer (or reader, in the case of a web page) brings to communication.  We've all heard the expression 'said apples, but heard oranges,' meaning, of course, that the hearer simply heard something completely different from what the speaker intended.

All of this makes web search based on the semantic intent of the searcher and the semantic content of web pages very difficult to accomplish accurately.  What do search engines do to associate search keywords with indexed web pages?  That is the subject of the next blog post.

