Using ChefMoz data

Once you have downloaded the ChefMoz data, you will need to parse it, geocode the addresses, and convert
it to Solr's XML format for indexing.

We will do this with the following Perl script, rdf_geocode_xml.pl.

To run it, simply create an output directory named chefmoz, then pipe the RDF file into the script:

cat chefmoz.rest.rdf |./rdf_geocode_xml.pl

Windows users: I believe cat works the same way if you have it installed, though it might require

cat chefmoz.rest.rdf CON: | perl rdf_geocode_xml.pl

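If there are Windows experts out there, please drop me a line and correct me if it's wrong.
Because the script reads its input with Perl's <> operator, passing the file name as an argument
(perl rdf_geocode_xml.pl chefmoz.rest.rdf) should work as well.
Either way, this will generate the output file chefmoz/restaurants.xml.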

To upload this to Solr, from the chefmoz directory simply run

sh ../solr-example/apache-solr-1.3-dev/example/exampledocs/post.sh restaurants.xml

Note: you can use Solr's post.jar file as well, but be aware that it requires a lot more memory and
will complain about running out of memory unless you give the JVM a significant heap size with -Xmx.

Full version available in chefmoz-example
rdf_geocode_xml.pl

#!/usr/bin/perl
# Author Patrick O'Leary, pjaol@pjaol.com
#---------------------------------------------
#
#
#
use strict;
use Data::Dumper;
use HTML::Entities;
use Geo::Coder::US;

# Change this to point to the geocoder.db
# http://www.gissearch.com/geocode#Geo::Coder::US
# for details on setting this up
Geo::Coder::US->set_db( "/home/pjaol/geoDB/US/geocoder.db" );


# RDF tags we care about, mapped to their Solr field names
my %fields = (
    "Location"          => "location",
    "d:Title"           => "title",
    "Address"           => "address",
    "City"              => "city",
    "State"             => "state",
    "Country"           => "country",
    "Phone"             => "phone",
    "d:Description"     => "description",
    "RecommendedDishes" => "dishes",
    "d:Date"            => "date",
    "OverallRating"     => "rating",
);

my $doBuff = 0;
my $validTag = 0;
my $currentTag;
my $currentValue;
my $rid;
my %doc;

# output file to use
my $file = "chefmoz/restaurants.xml";
open (FILE, "+> $file") || die "$0: cannot open file $file $!";
print FILE "<add>\n";

while (<>) {

    # Beginning of restaurant document
    if ($_ =~ /<Restaurant r:id=.(.*?).>/) {
        $doBuff = 1;
        $rid = $1;
        %doc = ();    # start every restaurant with an empty hash
    }elsif ($doBuff) { # body of a restaurant document

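        if ( $_ =~ /<([^\/].*?)>(.*)/) {
            my $pTag = $1;
            my $pContent = $2;
            if (exists $fields{$pTag}) {
                # keep the tag name as-is (colon included) so that the
                # closing tag can still be matched on a later line
                $currentTag = $pTag;
                $validTag = 1;
                if ($pContent =~ /(.*?)<\/\Q$pTag\E>/){
                    # opening and closing tag on the same line
                    $currentValue = $1;
                    $validTag = 0;

                    # the hash key drops the colon: d:Title is stored as dTitle
                    (my $key = $currentTag) =~ s/://g;
                    $doc{$key} = $currentValue;
                } else {
                    # the value continues on the following lines
                    $currentValue = $pContent;
                }
            }
        } elsif ( $validTag ) {

            if ($_ =~ /(.*?)<\/\Q$currentTag\E>/) {
                $currentValue .= $1;
                $validTag = 0;
                (my $key = $currentTag) =~ s/://g;
                $doc{$key} = $currentValue;
            } else {
                $currentValue .= $_;
            }
        }
    }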

    # End of Restaurant document
    if ($_ =~ /<\/Restaurant>/) {
        $doBuff = 0;

        # the geocoder database only covers US addresses
        if ($doc{Country} eq "United States" ) {
            print "rid: $rid\n";

            my $address = "$doc{Address}, $doc{City}, $doc{State}";

            my ($ora) = Geo::Coder::US->geocode($address);

            if ( (exists $ora->{lat} ) && (exists $ora->{long}) ) {

                if (! exists $doc{OverallRating} ) {
                    $doc{OverallRating} = "0";
                }
                writeDoc($rid, $ora->{lat}, $ora->{long}, %doc);
            } else {
                # address could not be geocoded, skip it
                print "Bad: $doc{dTitle} : $address\n";
            }
        }
    }

}

print FILE "</add>\n";
close FILE; #be friendly to your file system.

# write a restaurant document in Solr format.
sub writeDoc {

    my ($rid, $lat, $long, %doc) = @_;

    # debug output: dump the parsed fields to STDOUT
    print %doc;

    my $rdoc = "<doc>\n".
        "<field name='id'>$rid</field>\n".
        "<field name='location'>$doc{Location}</field>\n".
        "<field name='title'>$doc{dTitle}</field>\n".
        "<field name='address'>$doc{Address}</field>\n".
        "<field name='city'>$doc{City}</field>\n".
        "<field name='state'>$doc{State}</field>\n".
        "<field name='country'>$doc{Country}</field>\n".
        "<field name='phone'>$doc{Phone}</field>\n".
        "<field name='description'>$doc{dDescription}</field>\n".
        "<field name='recommendations'>$doc{RecommendedDishes}</field>\n".
        "<field name='overallrating'>$doc{OverallRating}</field>\n".
        "<field name='lat'>$lat</field>\n".
        "<field name='long'>$long</field>\n".
        "</doc>\n";

    print FILE $rdoc;

}
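
If you want to see how the tag matching behaves before setting up the geocoder database, the parsing step can be tried on its own. The following is only a rough sketch under a few assumptions: parse_restaurants is a hypothetical helper that is not part of rdf_geocode_xml.pl, the sample record is made up rather than taken from the real ChefMoz dump, and the field list is shortened. It collects single-line and multi-line values the same way the loop above is meant to, storing d:Title under the key dTitle.

```perl
#!/usr/bin/perl
# Illustrative sketch only: parse_restaurants() and the sample record are
# hypothetical and are not part of rdf_geocode_xml.pl.
use strict;
use warnings;

# Tags to keep (a shortened version of the %fields hash in the script).
my %fields = map { $_ => 1 }
    ("d:Title", "Address", "City", "State", "Country", "d:Description");

# Parse RDF-like text; returns a list of hash references, one per restaurant.
sub parse_restaurants {
    my ($text) = @_;
    my (@docs, %doc, $inDoc, $openTag, $value);

    foreach my $line (split /\n/, $text) {
        if ($line =~ /<Restaurant r:id=.(.*?).>/) {
            # start of a record: reset the state
            %doc     = (id => $1);
            $inDoc   = 1;
            $openTag = undef;
        } elsif ($inDoc && $line =~ /<\/Restaurant>/) {
            # end of a record: keep a copy of the collected fields
            push @docs, {%doc};
            $inDoc = 0;
        } elsif ($inDoc && defined $openTag) {
            # inside a multi-line value: look for the closing tag
            if ($line =~ /(.*?)<\/\Q$openTag\E>/) {
                $value .= $1;
                (my $key = $openTag) =~ s/://g;   # d:Title -> dTitle
                $doc{$key} = $value;
                $openTag = undef;
            } else {
                $value .= "$line\n";
            }
        } elsif ($inDoc && $line =~ /<([^\/].*?)>(.*)/) {
            my ($tag, $rest) = ($1, $2);
            next unless exists $fields{$tag};     # ignore tags we do not index
            if ($rest =~ /(.*?)<\/\Q$tag\E>/) {
                # opening and closing tag on the same line
                my $content = $1;
                (my $key = $tag) =~ s/://g;
                $doc{$key} = $content;
            } else {
                # the value continues on the following lines
                ($openTag, $value) = ($tag, "$rest\n");
            }
        }
    }
    return @docs;
}

# Made-up sample record, for illustration only.
my $sample = <<'END';
<Restaurant r:id="/United_States/CA/Example/Sample_Diner">
<d:Title>Sample Diner</d:Title>
<Address>123 Main St</Address>
<City>Example</City>
<d:Description>Burgers and shakes,
open late.</d:Description>
<Hours>ignored</Hours>
</Restaurant>
END

for my $doc (parse_restaurants($sample)) {
    print "$_ = $doc->{$_}\n" for sort keys %$doc;
}
```

Note that the captured value is copied out of $1 before the colon is stripped from the tag name, because a successful s/// resets the capture variables.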