This is the second part of the tutorial “PrestoDB with Hive and Resque” – take a look at the first part if you haven’t already. In this post, we’ll prepare Resque and load the access log into PrestoDB.
Resque – how does it work?
We’ll use the Resque gem to run background jobs which calculate some statistics from the access log and save the results to our database. We’ll use the default database (SQLite), but I assume you are familiar with Rails, so switching to another database like PostgreSQL or MySQL should be pretty simple.
First, as described in the Resque README, we have to add the gem to our Gemfile and install Redis, because Resque uses it as its backend. Redis’ basic installation is very easy: http://redis.io/download
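The Gemfile change is a single line (no version is pinned here – add a constraint if you need reproducible builds), followed by the usual bundle install:

```ruby
# Gemfile
gem 'resque'
```

Redis itself can be built from source roughly as the download page describes: fetch the stable tarball, unpack it, run make, and start src/redis-server.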
OK, Redis works? Let’s prepare Resque for our tasks. Open the Rakefile and add this line to it:
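The line in question is the standard require from the Resque README – it pulls in the resque:work rake task we’ll use in a moment:

```ruby
# Rakefile
require 'resque/tasks'
```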
Resque needs a class with a perform method and a variable called @queue which stores the name of the queue our new jobs will be assigned to. It’s good practice to store your job classes in a separate directory, so create a jobs dir in app and place this code there:
class AccessLogStatusStatsJob
  @queue = :presto

  def self.perform(seconds)
    puts "Very long calculations…"
    sleep(10000)
  end
end
As you can see, there’s nothing special here – I’ve added a Job suffix, but it is not required. Now that Resque is ready to go, the only thing left to do is launch a Resque worker for the presto queue:
bundle exec rake environment resque:work QUEUE=presto TERM_CHILD=1
What’s next? We should create a simple controller which will start a new Resque job:
rails g controller AccessLogStats
We’ll need two actions:
class AccessLogStatsController < ApplicationController
  def index
    Resque.enqueue(AccessLogStatusStatsJob, 0)
  end

  def run
  end
end
We also need an empty view for the index action in our app/views/access_log_stats directory.
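From the shell that’s just (paths follow Rails view conventions for the controller generated above):

```shell
# Create an empty index view for the AccessLogStats controller
mkdir -p app/views/access_log_stats
touch app/views/access_log_stats/index.html.erb
```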
Next, we have to add some information about the new controller in config/routes.rb:
require 'resque/server'

Rails.application.routes.draw do
  mount Resque::Server.new, at: "/resque"

  resources :access_log_stats, only: [:index]

  scope '/access_log_stats' do
    get 'run' => 'access_log_stats#run', as: 'access_log_stats_run'
  end
end
As you can see, there are a couple of additional lines in routes.rb. Why is that? We want to use the built-in admin console which gives us information about job states. To do this we need to require resque/server and mount the Resque server. I assume you know Rails, so the rest of the routes shouldn’t be a problem.
OK, so let’s test it!
bundle exec rails s
and open: http://localhost:3000/resque You should see something similar to this:
Now we can start our first job: http://localhost:3000/access_log_stats
and then go back to the Resque admin console. You should see a new job there. You can also check if your Resque worker log contains text:
Very long calculations...
PrestoDB and Hive
We need some data to analyze. I use the NASA access log; you can download it from ftp://ita.ee.lbl.gov/traces/NASA_access_log_Jul95.gz (source: http://ita.ee.lbl.gov/html/contrib/NASA-HTTP.html). You can of course use your own access logs (just make sure they are in the same format).
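To make the format concrete, here’s a quick Ruby sketch using the first entry of the July 1995 log; the regex is the unescaped form of the one we’ll hand to Hive’s RegexSerDe later in this post:

```ruby
# First line of the NASA access log (Common Log Format)
line = '199.72.81.55 - - [01/Jul/1995:00:00:01 -0400] "GET /history/apollo/ HTTP/1.0" 200 6245'

# Same pattern as the Hive table definition below, without the Java-style escaping
LOG_REGEX = /\A([^ ]*) ([^ ]*) ([^ ]*) (-|\[[^\]]*\]) ([^ "]*|"[^"]*") (-|[0-9]*) (-|[0-9]*)?\z/

host, identity, user, time, request, status, size = line.match(LOG_REGEX).captures
puts host     # prints 199.72.81.55
puts request  # prints "GET /history/apollo/ HTTP/1.0" (quotes included)
puts status   # prints 200
```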
Copy the access log to Hortonworks machine:
scp -P 2222 NASA_access_log_Jul95.gz email@example.com:
Log in and extract the log:
ssh firstname.lastname@example.org -p 2222
gzip -d NASA_access_log_Jul95.gz
Copy the log to the hive user’s home directory and change its ownership to hive (currently it belongs to the root user):
cp NASA_access_log_Jul95 /home/hive/
chown hive:hdfs /home/hive/NASA_access_log_Jul95 # change ownership, hive needs access to this file
su hive
cd ~/
OK, we have to prepare the Hive table and load our access log into it, so open the Hive console (the hive command) and run a query which creates a table for the access log:
CREATE TABLE apachelog (
  host STRING,
  identity STRING,
  user STRING,
  time STRING,
  request STRING,
  status STRING,
  size STRING
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
  "input.regex" = "([^ ]*) ([^ ]*) ([^ ]*) (-|\\[[^\\]]*\\]) ([^ \"]*|\"[^\"]*\") (-|[0-9]*) (-|[0-9]*)?",
  "output.format.string" = "%1$s %2$s %3$s %4$s %5$s %6$s %7$s"
)
STORED AS TEXTFILE;
We have an empty table, so let’s add our file:
LOAD DATA LOCAL INPATH '/home/hive/NASA_access_log_Jul95' OVERWRITE INTO TABLE apachelog;
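A quick spot-check (my own addition, not part of the original walkthrough) confirms the SerDe actually split the lines into columns:

```sql
SELECT host, status, size FROM apachelog LIMIT 5;
```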
If you don’t know these queries, visit this page.
Now we can run a basic query:
select count(*) from apachelog;
It took some time… 78.97 seconds!
OK, we have 1,891,715 records in our base table (apachelog) – quite a lot. We could work with it here, but let’s move on to PrestoDB. Close the Hive console (type exit;), change user to presto (exit to root and su presto), go to the directory with presto-cli and launch it:
cd /var/presto/install # go to the directory where you downloaded presto-cli
./presto --server localhost:8080 --catalog hive --schema default
We can create our table now:
create table accesslog_fin as select host, identity, user, time, request, status, size from apachelog;
As you can see, we’ve created the accesslog_fin table, which gets its data from Hive’s apachelog. Presto doesn’t run the query through Hive itself – it only uses the information from Hive’s Metastore.
Now we can check if it works:
select count(*) from accesslog_fin;
If you get the following error:
presto:default> select count(*) from accesslog_fin;
Error running command: Server refused connection: http://localhost:8080/v1/statement
it means that you probably didn’t launch PrestoDB.
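If that’s the case, start the server the way you did in part one – typically via presto-server’s launcher script (the path below is an assumption; use wherever you unpacked presto-server):

```shell
/path/to/presto-server/bin/launcher start   # hypothetical install path
```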
Presto was “a little” faster than Hive ;)
OK, we’ve prepared Hive and Presto for our analysis, so in the next part we’ll bring everything together in one Rails application.