Query and Location extraction

Most search applications today share a standard interface:
a simple text box and a search button.

However, a geographical search is a little more complicated,
because the user's query is composed of two distinct entities:

1) A search phrase.
2) A location.

An example would be "bars Reston, VA", where "bars" is the search phrase and "Reston, VA" is the location.
Normally the simplest solution is to provide two text boxes, one marked search and the other marked location, and hope the
user fills in both appropriately.

This, however, may not always be desirable or possible. For example,
if you were building a search plugin for either
Internet Explorer or Firefox using OpenSearch, there is generally only one input box.

This leaves you needing an extraction process.


Purpose

The purpose of this article is to describe how to implement a basic query and location extraction within the US.

Although this is centered around US data, the principles are much the same internationally and can be applied
with little modification.

We will end up creating an OpenSearch implementation that can be used for your site. We will link this one to AOL's YellowPages
local search site.


The hard part

What's so hard about extracting query and location? If you maintain a list of city and state combinations, surely you
can subtract the location from the query string?
How many cities, states and permutations do you have to iterate through to figure that out?
For example:
Bars Reston, VA -> Bars Reston Town Center, VA -> Bars Reston Town Center, Virginia -> Reston Va Bars

As you can see, that is a little much for simple substitution alone.
By using a text search engine such as Lucene / Solr, however, you can make it a little easier.
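
To make the limits of simple substitution concrete, here is a minimal Python sketch. It is only an illustration, not the approach this article goes on to build: the one-entry location list and the naive_extract function are hypothetical names invented for this example, and a real list would hold every US city and state pair.

```python
# Hypothetical, one-entry location list used only for illustration.
KNOWN_LOCATIONS = ["Reston, VA"]


def naive_extract(query, locations=KNOWN_LOCATIONS):
    """Split a query into (search phrase, location) by exact substring removal.

    The comparison is case-insensitive, but the spelling and punctuation
    of the location must otherwise match the list entry exactly.
    """
    lowered = query.lower()
    for location in locations:
        index = lowered.find(location.lower())
        if index != -1:
            # Cut the matched location out and keep whatever is left.
            phrase = (query[:index] + query[index + len(location):]).strip()
            return phrase, location
    # No known location was found, so the whole query is treated as the phrase.
    return query, None


if __name__ == "__main__":
    variants = [
        "Bars Reston, VA",
        "Bars Reston Town Center, VA",
        "Bars Reston Town Center, Virginia",
        "Reston Va Bars",
    ]
    for variant in variants:
        print(variant, "->", naive_extract(variant))
```

Under these assumptions only the first variant is split into a phrase and a location. The other three come back with no location at all, so each new spelling, abbreviation or word order would need its own entry in the list.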

Now ask the question: is text matching alone enough?
Of course not.

Let's look at a complex query, "Manhattan Bagels, long island ny", from someone looking for a company
called "Manhattan Bagels" in Long Island, NY.

Or "Bagels Manhattan", or several other combinations, but I think you get the idea.
There is a need to put some intelligence into the location retrieval, essentially a little disambiguation.
POS (part of speech) tagging may not always work either, as users may or may not conform to your desired standard.
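
The ambiguity can be shown with a short Python sketch. It only detects the problem and makes no attempt to solve it. The place-name list and the candidate_locations function are hypothetical and exist purely for this illustration.

```python
# Hypothetical place names used only for illustration.
PLACE_NAMES = ["manhattan", "long island", "reston"]


def candidate_locations(query, place_names=PLACE_NAMES):
    """Return every known place name found in the query as (name, offset).

    Offsets are character positions in the normalized query
    (lower-cased, commas removed, whitespace collapsed).
    Only whole-word matches count.
    """
    normalized = " ".join(query.lower().replace(",", " ").split())
    # Pad with spaces so a name only matches on word boundaries.
    padded = " " + normalized + " "
    hits = []
    for name in place_names:
        offset = padded.find(" " + name + " ")
        if offset != -1:
            # The leading pad space means this is also the offset in normalized.
            hits.append((name, offset))
    return sorted(hits, key=lambda hit: hit[1])


if __name__ == "__main__":
    print(candidate_locations("Manhattan Bagels, long island ny"))
    print(candidate_locations("Bagels Manhattan"))
```

For the first query the sketch reports both "manhattan" and "long island" as candidates, even though "Manhattan" there is part of the company name. For "Bagels Manhattan" it reports "manhattan", and text matching alone cannot say whether that is a location or part of a name.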

Next, we're going to examine some ideas on how to achieve this.