Work
with us

Tell us about your idea
and we will find a way
to make it happen.

Get estimate

Join our awesome team

Check offers

This is the second part of the tutorial “PrestoDB with Hive and Resque”. Look at the first part. In this post, we’ll prepare Resque and load the access log to PrestoDB.

Resque – how it works?

We’ll use Resque gem to run jobs (in the background) which calculate some statistics from the access log and save the results to our database. We’ll use the default database (sqlite), but I assume you are familiar with Rails so changing your chosen database to another one like PostgreSQL or MySQL would be pretty simple.

Firstly as it’s described on Resque page we have to add gem to Gemfile:

gem "resque"

and install Redis because Resque uses it. Redis’ basic installation is very easy:http://redis.io/download

OK, Redis works? Let’s prepare Resque for our tasks. Open Rakefile and add this line to it:

require 'resque/tasks'

Resque needs a class with the perform method and a variable called @queue which stores information about the queue where our new job will be assigned. It’s a good practice to store your job classes in a separate directory, so create a jobs dir in app and place this code there:

class AccessLogStatusStatsJob
 @queue = :presto

  def self.perform(seconds)
  puts "Very long calculations…"
  sleep(10000)
  end
end

As you can see there’s nothing special here – I’ve added a Job suffix but it is not required. Now that Resque is ready to go, the only thing we need to do is to launch the Resque worker for prestoqueue:

bundle exec rake environment resque:work QUEUE=presto TERM_CHILD=1

What’s next? We should create a simple controller which will start a new Resque job:

rails g controller AccessLogStats

We’ll need two actions: index and run:

class AccessLogStatsController < ApplicationController
  def index
    Resque.enqueue(AccessLogStatusStatsJob, 0)
  end

  def run

  end
end

We also need to have an empty view for the index action in our app/views/access_log_stats/ directory.

Next, we have to add some information about the new controller in routes.rb:

require 'resque/server'
Rails.application.routes.draw do

  mount Resque::Server.new, at: "/resque"

  resources :access_log_stats, only: [ :index]

  scope '/access_log_stats' do
    get 'run' => 'access_log_stats#run',  as: 'access_log_stats_run'
  end
end

As you can see, there’s a couple of additional lines in routes.rb. Why is that?
We want to use the built-in admin console which gives us information about jobs states. To do this we need to include resque/server and mount the rescue server. I assume you know Rails, so rest of your routes shouldn’t be a problem.

OK, so let’s test it!
Start server:

bundle exec rails s

and open: http://localhost:3000/resque You should see something similar to this:

Now we can start our first job: http://localhost:3000/access_log_stats
and then go back to the Resque admin console. You should see a new job there. You can also check if your Resque worker log contains text: Very long calculations...

PrestoDB and Hive

We need to have some data to analysis. I use NASA access log, you can download it from this site:ftp://ita.ee.lbl.gov/traces/NASA_access_log_Jul95.gz

source: http://ita.ee.lbl.gov/html/contrib/NASA-HTTP.html You can use your own access logs of course (make sure they are the same format).

Copy the access log to Hortonworks machine:

scp -P 2222 NASA_access_log_Jul95.gz root@127.0.0.1:

Login and extract the log:

sh root@127.0.0.1 -p 2222
gzip -d NASA_access_log_Jul95.gz

Copy the log to hive user home directory and it’s change ownership to hive (currently it belongs to root user):

cp NASA_access_log_Jul95 /home/hive/
chown hive:hdfs /home/hive/NASA_access_log_Jul95 # change permission, hive need to have access to this file
su hive
cd ~/

OK, we have to prepare hive tables and load there our access log, so open hive console:

hive

and run query which creates table for access log:

CREATE TABLE apachelog (
  host STRING,
  identity STRING,
  user STRING,
  time STRING,
  request STRING,
  status STRING,
  size STRING
 )
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
  "input.regex" = "([^ ]*) ([^ ]*) ([^ ]*) (-|\\[[^\\]]*\\]) ([^ \"]*|\"[^\"]*\") (-|[0-9]*) (-|[0-9]*)?",
  "output.format.string" = "%1$s %2$s %3$s %4$s %5$s %6$s %7$s"
)
STORED AS TEXTFILE;

We have an empty table, so let’s add our file:

LOAD DATA LOCAL INPATH '/home/hive/NASA_access_log_Jul95' OVERWRITE INTO TABLE apachelog;

If you don’t know these queries, visit this page.

Now we can run a basic query:

select count(*) from apachelog;

It took some time… 78,97 seconds!

OK, we have 1 891 715 records in our base table (apachelog) – quite a lot. We can do something with it but let’s move now to PrestoDB. Close Hive console (type exit;), change user to presto (exitto root and su presto), go to the directory with presto-cli and launch it:

cd /var/presto/install # go to directory where you downloaded presto-cli
./presto --server localhost:8080 --catalog hive --schema default

We can create our table now:

create table accesslog_fin as select host, identity, user, time, request, status, size from apachelog;

As you can see, we’ve created accesslog_fin table which gets data from Hive apachelog. It’ll not use Hive but only information from its Metastore.
Now we can check if it works:

select count(*) from accesslog_fin;

if you get the following error:

presto:default> select count(*) from accesslog_fin;
Error running command: Server refused connection: http://localhost:8080/v1/statement

it means that you probably didn’t launch PrestoDB.

Presto was “a little” faster than Hive ;)
OK, we’ve prepared Hive and Presto for our analysis, so in the next part we’ll get all things to one Rails application.

Post tags: