Oct 28, 2014

PrestoDB With Hive and Resque [Part III: Running Log Analysis]

blogpost cover image
This is the third and final part of my tutorial “PrestoDB with Hive and Resque”. The second part can be found here. In this post, we’ll use Resque and presto-client to run queries on PrestoDB.

Putting all pieces together

Now we want to run PrestoDB query inside the AccessLogStatusStatsJob.perform method but before we do this, we need to have a Rails model configured to store the results of Presto analyses. Let’s create an AccessLogStatusStat model:

bundle exec rails g model AccessLogStatusStat host:string status:integer request:string count:integer

You can add limit: 8 to the count field in migration file (to force bigint type). We should also define some scopes for:

  • Statistics with status 4xx

Statistics with status 5xx

class AccessLogStatusStat < ActiveRecord::Base

  scope :find_4xx_statuses, -> { where('status >= :min and status < :max_ex', { min: 400, max_ex: 500 }).order('count desc') }
  scope :find_5xx_statuses, -> { where('status >= :min and status < :max_ex', { min: 500, max_ex: 600 }).order('count desc') }


And finally we can add some logic to our AccessLogStatusStatsJob.perform method but we need to have presto-client gem bundled, so let’s add it to the Gemfile:

gem 'presto-client'

and change implementation of AccessLogStatusStatsJob.perform method:

class AccessLogStatusStatsJob
  MIN_STATUS = 400
  PRESTO_SCHEMA = "default"
  PRESTO_USER = "presto"
  PRESTO_DEBUG = false

  @queue = :presto

  def self.perform(seconds)
    client = Presto::Client.new(
      server: PRESTO_HOST,
      catalog: PRESTO_CATALOG,
      schema: PRESTO_SCHEMA,
      user: PRESTO_USER,
      http_debug: PRESTO_DEBUG
    old_created_at = Time.now
    client.query("select count(status), host, cast(status as bigint), request from accesslog_fin where cast(status as bigint) >= #{MIN_STATUS.to_i} group by host, status, request order by host") do |q|
      q.each_row {|row|
        AccessLogStatusStat.create(count: row[0], host: row[1], status: row[2], request: row[3])
    AccessLogStatusStat.where('created_at <= ?', old_created_at).destroy_all

Firstly, we create connection to PrestoDB with Presto::Client class. Its parameters point to port 8080 on localhost because Hortonworks virtual machine forwards this port. The next parameter is catalog (we’ve defined hive catalog before). We want to be sure that the table has new values only. That’s why so we’ll store the launch time of the Presto query and use it to delete all of the older statistics after finishing the analysis.
The code above isn’t pretty – we should move Presto server parameters to another place (some config file maybe) and we could create a class for initialising Presto connection but we’ll take care of that in another post.
We have one controller: AccessLogStatsController. It’ll be responsible for listing statistics and running the Resque job:

class AccessLogStatsController < ApplicationController
  def index
    @access_log_stats_4xx = AccessLogStatusStat.find_4xx_statuses
    @access_log_stats_5xx = AccessLogStatusStat.find_5xx_statuses

  def run
    @status = :ok
      Resque.enqueue(AccessLogStatusStatsJob, 0)
    rescue Exception => e
      logger.error "#{e.message} #{e.backtrace.join("\n")}"
      @status = :internal_server_error

    respond_to do |format|
      format.json { render layout: false, status: @status }

I suppose the index action is clear and the second one – run is also self-explanatory. We add our job to the queue and return a JSON response. Pretty simple.

OK, we’re almost done, the only thing to do is set up a view. We need one for the index action and the second one for run.

%h2 Access log statistics
= link_to 'Launch analysis', access_log_stats_run_path, remote: true, id: 'job-launcher'
= link_to 'Jobs statuses', '/resque', target: '_blank'
%h3#stats-4xx-header Statistics for 4xx statuses
  = render partial: 'access_log_stats_list', locals: { access_log_stats: @access_log_stats_4xx}
%h3#stats-5xx-header Statistics for 5xx statuses
  = render partial: 'access_log_stats_list', locals: { access_log_stats: @access_log_stats_5xx}

  $("#stats-4xx-header").click(function() {
  $("#stats-5xx-header").click(function() {
    .on('ajax:success', function(evt, data, xhr, status) {
    .on('ajax:error', function(evt, xhr, status) {
      var json = $.parseJSON(xhr.responseText);
- if access_log_stats.size > 0
        %th Host
        %th HTTP status code
        %th Request
        %th Hits
      - access_log_stats.each do |statistic|
          %td= statistic.host
          %td= statistic.status
          %td= statistic.request
          %td= statistic.count

We have a link now for running a job remotely (via an AJAX request) and there is also a link to the Resque console.
The main part of our view is divided in two lists:

  • Statistics of the number of requests to specific URL which returns status 4xx
  • Statistics of the number of requests to specific URL which returns status 5xx

I’ve used some basic table with a toggling list (in jQuery) and additionally handle response from the run action.

The last part is a json view for run action:

if @status == :ok
  json.msg 'Access log analysis started!'
  json.msg 'Error occured! Please try later.'

Very simple, we just need to return a JSON with the proper message. The last thing required before starting the server is running all the migrations.

bundle exec rake db:migrate

OK, let’s test it!
Run rails server: bundle exec rails s
Run Resque worker (stop and start it if it is running):

bundle exec rake environment resque:work QUEUE=presto TERM_CHILD=1

Go to http://localhost:3000/access_log_stats and open the Resque admin console (“Jobs statuses” link). Click “Launch analysis”. You should see a javascript alert message: “Access log analysis started!”. Let’s move to the tab with Resque admin console. You should have information that you have 1 of 1 Workers Working. Wait for the job to finish. You can start “Live poll” (right-bottom corner) and it’ll update status without refreshing the whole page. When worker finishes it’s job you can go back to http://localhost:3000/access_log_stats and see the results.

I hope it works for you :) You can get this sample application from github.


The main purpose of this post was to use couple of technologies together. It’s just a simple application. You can start with it and improve some things, you can wrap PrestoDB connection in one class to use it without parameters like host name, port, etc. You can also change your Presto queries.
There are also a lot of ideas for development, like adding resque-status gem or whenever. You can also consider Resque alternatives like sidekiq.
In another posts we’ll improve our app :)