Query Location Extraction Using Solr

Solr is a search server built on Lucene, available from Apache Solr. It provides a simple way
to use the Lucene search libraries.

We're going to use Solr to extract locations from users' queries.
There are a few items to look out for: first we customize the schema to suit our needs,
then we configure the DisMax query handler, and then we build an application on top of it.

The schema is generally located in solr/conf/schema.xml:

    <schema name="location_geocoding" version="1.1">
      <fieldType name="string" class="solr.StrField" sortMissingLast="true" omitNorms="true"/>
        <!-- numeric field types that store and index the text
            value verbatim (and hence don't support range queries, since the
            lexicographic ordering isn't equal to the numeric ordering) -->
       <fieldType name="int" class="solr.IntField" omitNorms="true"/>
       <fieldType name="long" class="solr.LongField" omitNorms="true"/>
       <fieldType name="float" class="solr.FloatField" omitNorms="true"/>
       <fieldType name="double" class="solr.DoubleField" omitNorms="true"/>
       <!-- Numeric field types that manipulate the value into
            a string value that isn't human-readable in its internal form,
            but with a lexicographic ordering the same as the numeric ordering,
            so that range queries work correctly. -->
       <fieldType name="sint" class="solr.SortableIntField" sortMissingLast="true" omitNorms="true"/>
       <fieldType name="slong" class="solr.SortableLongField" sortMissingLast="true" omitNorms="true"/>
       <fieldType name="sfloat" class="solr.SortableFloatField" sortMissingLast="true" omitNorms="true"/>
       <fieldType name="sdouble" class="solr.SortableDoubleField" sortMissingLast="true" omitNorms="true"/>

        <fieldType name="date" class="solr.DateField" sortMissingLast="true" omitNorms="true"/>

        <fieldType name="text" class="solr.TextField" positionIncrementGap="100">
          <analyzer type="index">
            <tokenizer class="solr.WhitespaceTokenizerFactory"/>
            <!-- in this example, we will only use synonyms at query time
            <filter class="solr.SynonymFilterFactory" synonyms="index_synonyms.txt" ignoreCase="true" expand="false"/>        -->
            <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
            <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
            <filter class="solr.LowerCaseFilterFactory"/>
            <!--<filter class="solr.EnglishPorterFilterFactory" protected="protwords.txt"/>-->
            <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
          </analyzer>
          <analyzer type="query">
            <tokenizer class="solr.WhitespaceTokenizerFactory"/>
            <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
            <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
            <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="0" catenateNumbers="0" catenateAll="0" splitOnCaseChange="1"/>
            <filter class="solr.LowerCaseFilterFactory"/>
            <!-- <filter class="solr.EnglishPorterFilterFactory" protected="protwords.txt"/>-->
            <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
          </analyzer>
        </fieldType>


    <field name="id" type="int" indexed="true" stored="true"/>
    <field name="zipcode" type="string" indexed="true" stored="true"/>
    <field name="state" type="string" indexed="true" stored="true"/>
    <field name="state_srch" type="text" indexed="true" stored="true"/>

    <field name="city" type="string" indexed="true" stored="true"/>
    <field name="city_srch" type="text" indexed="true" stored="true"/>

    <field name="county" type="string" indexed="true" stored="true"/>
    <field name="county_srch" type="text" indexed="true" stored="true"/>

    <field name="statename" type="string" indexed="true" stored="true"/>
    <field name="statename_srch" type="text" indexed="true" stored="true"/>

    <field name="population" type="sint" indexed="true" stored="true"/>
    <field name="density" type="sdouble" indexed="true" stored="true"/>
    <field name="latitude" type="sdouble" indexed="true" stored="true"/>
    <field name="longitude" type="sdouble" indexed="true" stored="true"/>
    <field name="city_state" type="text" indexed="true"/>
    <field name="county_state" type="text" indexed="true"/>
    <field name="city_county" type="text" indexed="true"/>
    <field name="text" type="text" indexed="true" stored="true" multiValued="true"/>
    <dynamicField name="_local*" type="sdouble" indexed="true" stored="false"/>

    <copyField source="zipcode" dest="text"/>

    <copyField source="state" dest="text"/>
    <copyField source="state" dest="state_srch"/>

    <copyField source="city" dest="text"/>
    <copyField source="city" dest="city_srch"/>

    <copyField source="county" dest="text"/>
    <copyField source="county" dest="county_srch"/>

    <copyField source="statename" dest="text"/>
    <copyField source="statename" dest="statename_srch"/>

     <!-- field for the QueryParser to use when an explicit fieldname is absent -->
     <!-- SolrQueryParser configuration: defaultOperator="AND|OR" -->
     <solrQueryParser defaultOperator="OR"/>
    </schema>


Obviously we've customized the fields for our location data, but also check out the text field definition. One key item
we've changed from the standard Solr definition is that we've commented out the Porter stemmer (EnglishPorterFilterFactory),
simply because location names are very exact in terminology. There's no such place as "New Yorks".

Next we will look at the Solr config file, usually in solr/conf/solrconfig.xml.
We will modify the dismax handler to suit our needs; everything else will remain the same.

     <!-- DisMaxRequestHandler allows easy searching across multiple fields
           for simple user-entered phrases.
           see http://wiki.apache.org/solr/DisMaxRequestHandler -->
      <requestHandler name="dismax" class="solr.DisMaxRequestHandler" >
        <lst name="defaults">
         <str name="echoParams">explicit</str>
         <float name="tie">0.001</float>
         <str name="qf">
            text city_srch^3.0 statename_srch^2.0 state_srch^2.0 county_srch^1.2 zipcode^1.0
         </str>
         <str name="pf">
            city_srch^5.0 statename_srch^2.0 state_srch^2.0 county_srch^1.2 city_state^5.0 county_state^5.0 city_county^5.0
         </str>
         <str name="bf">
         </str>
         <str name="sort">score desc, density desc</str>
         <str name="fl">
         </str>
         <int name="mm">1</int>
         <int name="ps">1</int>
         <int name="qs">1</int>
         <str name="q.alt">*:*</str>
         <!-- example highlighter config, enable per-query with hl=true -->
         <str name="hl.fl">text,city_srch,county_srch,statename_srch,state_srch</str>
        </lst>
      </requestHandler>

A field followed by ^x.x is Lucene syntax for boosting the score of that field by x.x. Boosts are generally applied in qf (query fields), the fields searched for single word matches, and in pf (phrase fields), the fields used to boost documents in which the query words appear together as a phrase.
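
To make the effect of the boosts and the tie value a bit more concrete, here is a rough Python sketch of how a DisMax-style score combines per-field scores: the best-scoring field wins, and every other field only contributes tie times its score. The raw per-field scores passed in are made-up numbers, and the function name is hypothetical. Real Lucene scoring involves more factors (term frequency, norms, and so on), so treat this as an intuition aid and not as Solr's actual computation.

```python
# Boosts copied from the qf setting above; "text" has no explicit boost, so 1.0.
QF_BOOSTS = {
    "text": 1.0,
    "city_srch": 3.0,
    "statename_srch": 2.0,
    "state_srch": 2.0,
    "county_srch": 1.2,
    "zipcode": 1.0,
}


def dismax_score(raw_field_scores, boosts=QF_BOOSTS, tie=0.001):
    """Combine per-field scores the way a disjunction-max query does.

    raw_field_scores maps field name -> hypothetical unboosted score.
    """
    boosted = [score * boosts.get(field, 1.0)
               for field, score in raw_field_scores.items()]
    if not boosted:
        return 0.0
    best = max(boosted)
    # The winning field counts in full; the rest are scaled down by tie.
    return best + tie * (sum(boosted) - best)


if __name__ == "__main__":
    # Made-up scores: the same word matches both a city name and a county name.
    print(dismax_score({"city_srch": 1.0, "county_srch": 1.0}))
```

With a tie as small as 0.001, the city match dominates and the county match barely moves the score, which is why the relative boosts in qf matter so much here.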

We've set mm (minimum match, the minimum number of query words that must match) to 1, and
ps (phrase slop, the maximum number of words that can sit between two matching words, so "Ranoke Jersey" will match "Ranoke New Jersey", where "New" is the slop word) to 1.
We've also configured the highlighter, which lets us figure out which words from the user's query matched.
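
As a sketch of how these settings are used at query time, the Python below builds a request URL for the dismax handler with highlighting switched on, and pulls the matched words out of the highlighting section of a JSON response. The base URL is assumed from the curl commands in this post, selecting the handler with qt=dismax assumes the handler name configured above, and the function names and the sample response are hypothetical, so adjust them to your setup.

```python
import re
from urllib.parse import urlencode

# Base URL assumed from the curl commands in this post.
SOLR_SELECT_URL = "http://localhost:8080/solr/rgeocoder/select"

# Solr's default highlighter wraps matched words in <em>...</em>.
EM_PATTERN = re.compile(r"<em>(.*?)</em>")


def build_query_url(user_query, rows=5):
    """Build a select URL that routes the query to the dismax handler."""
    params = {
        "qt": "dismax",   # handler name from solrconfig.xml
        "q": user_query,
        "hl": "true",     # highlighting is only enabled per query
        "rows": rows,
        "wt": "json",
    }
    return SOLR_SELECT_URL + "?" + urlencode(params)


def matched_words(highlighting):
    """Collect highlighted words per field.

    `highlighting` is the "highlighting" section of a parsed JSON response,
    shaped like {doc_id: {field_name: [snippet, ...]}}.
    """
    matches = {}
    for fields in highlighting.values():
        for field, snippets in fields.items():
            for snippet in snippets:
                for word in EM_PATTERN.findall(snippet):
                    matches.setdefault(field, set()).add(word.lower())
    return matches


if __name__ == "__main__":
    print(build_query_url("pizza in new york"))
    # Hand-written sample in the shape of a highlighting section, not real output.
    sample = {"42": {"city_srch": ["<em>New</em> <em>York</em>"],
                     "state_srch": ["NY"]}}
    print(matched_words(sample))
```

The words that come back under city_srch, state_srch and so on are the location part of the query; whatever is left over is the non-location part.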

Load in the data from your MySQL database. You can look at Solr's DataImportHandler (DIH), or use MySQL's SELECT ... INTO OUTFILE to export the data and post that to Solr's CSV update handler.
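
If you would rather build the CSV file in code than in MySQL, a minimal Python sketch might look like the following. The column order is taken from the fieldnames parameter of the curl command in this post. How the combined fields (city_state, county_state, city_county) are built is an assumption: here they are just the two values joined by a space. The function names and the sample row are hypothetical placeholders, not real data.

```python
import csv
import io

# Column order matches the fieldnames parameter of the curl command in this post.
FIELDNAMES = ["id", "zipcode", "city", "county", "statename", "state",
              "population", "city_state", "county_state", "city_county",
              "density", "latitude", "longitude"]


def add_combined_fields(row):
    """Return a copy of row with the combined phrase fields filled in.

    Assumption: each combined field is the two source values joined by a space.
    """
    out = dict(row)
    out["city_state"] = f"{row['city']} {row['state']}"
    out["county_state"] = f"{row['county']} {row['state']}"
    out["city_county"] = f"{row['city']} {row['county']}"
    return out


def write_solr_csv(rows, stream):
    """Write rows as CSV lines without a header row.

    No header is written because the field names are passed in the URL.
    """
    writer = csv.DictWriter(stream, fieldnames=FIELDNAMES, lineterminator="\n")
    for row in rows:
        writer.writerow(add_combined_fields(row))


if __name__ == "__main__":
    # Placeholder row for illustration only; the numeric values are dummies.
    sample_row = {"id": 1, "zipcode": "10001", "city": "New York",
                  "county": "New York", "statename": "New York", "state": "NY",
                  "population": 0, "density": 0.0,
                  "latitude": 0.0, "longitude": 0.0}
    buffer = io.StringIO()
    write_solr_csv([sample_row], buffer)
    print(buffer.getvalue(), end="")
```

In practice you would pass an open file (for example solr-zips.txt) instead of the StringIO buffer.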

Here's the shell script I use to post to Solr:


    curl "http://localhost:8080/solr/rgeocoder/update/csv?fieldnames=id,zipcode,city,county,statename,state,population,city_state,county_state,city_county,density,latitude,longitude" --data-binary @solr-zips.txt -H 'Content-type:text/plain; charset=utf-8'

    curl "http://localhost:8080/solr/rgeocoder/update" --data-binary '<commit/>' -H 'Content-type:text/xml; charset=utf-8'

Next: interfacing with PHP.