Oct 21, 2014

PrestoDB With Hive and Resque [Part I: Big Data Environment]

Some time ago I used Apache Hive for log analysis. It’s a very interesting tool for doing calculations on big data, but performance issues can occur.

A few months ago I read about a new SQL query engine from Facebook: PrestoDB. The official site says it’s designed for “running interactive analytic queries against data sources of all sizes ranging from gigabytes to petabytes”. It’s also supposedly much faster than Hive. Sounds interesting! Especially since Facebook has been using it in production since spring 2013. I’ve decided to check how it works with Rails. If you’re totally new to Presto, start with the official documentation.

In this post I’ll show you how to prepare Hadoop and PrestoDB.


I assume you know Hive (with Hadoop) at a basic level. I’ll explain most of the steps so you can run this application even if you don’t know these technologies, but I won’t describe how Hive and Hadoop work.
You should also be familiar with Ruby on Rails, of course.

What is Resque?

It’s a useful tool for running long background tasks, which makes it a very good fit for Big Data technologies.
Resque was created by GitHub developers because they ran into limitations of existing solutions. Here’s a quite good explanation of what Resque is and why they developed it.

Main purpose

Big Data tools like PrestoDB or Hive are useful when you want to run some analysis, e.g. generate statistics from an access log. Our final application will have the following flow:

  1. The web UI allows the user to launch an analysis of the number of requests to a specific URL.
  2. Launching an analysis means running a PrestoDB query in a Resque job.
  3. PrestoDB uses the Hive Metastore and accesses the log stored in Hadoop (HDFS).
  4. Results of the analysis are persisted in an SQLite database.
  5. The user can check whether the analysis is still in progress and see the results when it’s finished.
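The flow above can be sketched in Ruby. Everything here (class name, queue name, the simulated query) is a placeholder to illustrate the shape of the job, not the code we’ll end up with:

```ruby
# A rough sketch of the flow above. Class and method names are my own
# assumptions, not the final app's code. A Resque job only needs a @queue
# name and a class-level `perform` method, so we can outline it without
# a real PrestoDB connection; here the "query" is simulated by counting
# matching lines in an in-memory access log.
class UrlAnalysisJob
  @queue = :presto_queries

  # In the real app this would run a PrestoDB query and persist the
  # result in the SQLite database.
  def self.perform(url, log_lines)
    log_lines.count { |line| line.include?(url) }
  end
end

sample_log = [
  "GET /home 200",
  "GET /about 200",
  "GET /home 500"
]
puts UrlAnalysisJob.perform("/home", sample_log) # => 2
```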

This is a simple example of analysing data that is small rather than big, but it’ll show how you can use a few useful technologies together. One more thing you should know: technologies like PrestoDB and Hive only pay off once you have at least hundreds of gigabytes of data, and you’ll see their full value when you’re analyzing tera- or petabytes.

Setting up environment

We’ll use quite a big bunch of technologies, so it makes sense not to install all of them manually. First we need a Big Data backend with Hive and PrestoDB. Hive also implies a Hadoop cluster, so we’ll use the Hortonworks Sandbox, which you can find here.
Download it and run it in your favourite virtualization tool. I chose VirtualBox; it’s good and free. When you’re ready, start your virtual machine and log in via ssh (port forwarding is configured, so you can use the local IP):

ssh root@127.0.0.1 -p 2222

use password: hadoop

Make sure Hive is working properly:

su hive
hive # open hive console

Type the following queries:

show tables;
select * from sample_07 limit 5;

You should see similar results:

OK, we have a Hadoop cluster with Hive, let’s install PrestoDB.
Run exit; to close the Hive console, then exit again to go back to the root user.

PrestoDB – installation

In the next steps we’ll create a presto user and install the tool in a proper directory:

useradd presto -m -g hadoop # add user: presto and set its group to: hadoop
su presto # change user to presto
wget http://central.maven.org/maven2/com/facebook/presto/presto-server/0.75/presto-server-0.75.tar.gz  # download latest presto
tar -zxvf presto-server-0.75.tar.gz
exit # go back to the root user
mkdir /var/presto # create main presto directory
chown -R presto:hadoop /var/presto # change the owner to presto user and hadoop group
su presto
cd /var/presto
mv ~/presto-server-0.75 install
mkdir data

Now we have to put some config files in the proper places.

mkdir install/etc
vim /var/presto/install/etc/jvm.config # vim may not be installed; use any editor you like

You need the following lines there:

-XX:OnOutOfMemoryError=kill -9 %p
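Only one of the lines is shown above. The full file isn’t reproduced here, but based on the Presto 0.75 deployment docs it could look roughly like this (the heap size is a guess for a sandbox VM, tune it to your machine; note the -Xbootclasspath entry pointing at the installation directory):

```
-server
-Xmx1G
-XX:+UseConcMarkSweepGC
-XX:+ExplicitGCInvokesConcurrent
-XX:+CMSClassUnloadingEnabled
-XX:+AggressiveOpts
-XX:+HeapDumpOnOutOfMemoryError
-XX:OnOutOfMemoryError=kill -9 %p
-XX:ReservedCodeCacheSize=150M
-Xbootclasspath/p:/var/presto/install/lib/floatingdecimal-0.1.jar
```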

Notice that the -Xbootclasspath parameter contains the path to the main presto installation directory, so if you’ve created it at a different place than /var/presto/install, you should also change it here.
The next file to edit is node.properties (in the same directory: /var/presto/install/etc):
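The contents aren’t shown above; a minimal version, following the Presto deployment docs, would be something like this (node.id must be unique per node, and the data dir matches the one we created earlier):

```
node.environment=production
node.id=ffffffff-ffff-ffff-ffff-ffffffffffff
node.data-dir=/var/presto/data
```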


and config.properties:
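For a single-node setup where the coordinator also executes work, a config.properties in the spirit of the Presto 0.75 docs could look like this (port 8080 matches the CLI invocation later in the post):

```
coordinator=true
node-scheduler.include-coordinator=true
http-server.http.port=8080
task.max-memory=1GB
discovery-server.enabled=true
discovery.uri=http://localhost:8080
```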


and finally set the logging level in /var/presto/install/etc/log.properties file:
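A minimal log.properties just sets the level for the Presto packages, e.g.:

```
com.facebook.presto=INFO
```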


You can use one of these values: DEBUG, INFO, WARN, ERROR.
At this point we have most of the basic configuration done, but we still have to specify what kind of database backend we’ll use. On our Hortonworks Sandbox system we have Hive connected to Hadoop 2, so we’ll use the hive-hadoop2 connector. The catalog directory in etc will store information about it.

mkdir /var/presto/install/etc/catalog
vim /var/presto/install/etc/catalog/hive.properties

Paste these lines there:
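The exact lines aren’t reproduced above; for the hive-hadoop2 connector they boil down to the connector name plus the metastore address (port 9083 is the usual Hive metastore Thrift port on the sandbox, adjust if yours differs):

```
connector.name=hive-hadoop2
hive.metastore.uri=thrift://localhost:9083
```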


The name of the above properties file will be used in Presto to specify the catalog. If you want to add another one (like HBase) in the future, the catalog directory is the right place to do it – just create a file with another name and put the config for the new connector there.

It’s time to start Presto!

cd /var/presto/install
bin/launcher start

You should see something like this:

Started as 4960

and /var/presto/data/var/log/server.log should contain “SERVER STARTED” near the end of the file:

2014-09-17T04:46:49.798-0700   INFO main  com.facebook.presto.metadata.CatalogManager -- Added catalog hive using connector hive-hadoop2 --
2014-09-17T04:46:49.965-0700   INFO main  com.facebook.presto.server.PrestoServer ======== SERVER STARTED ========

Now we can download the Presto CLI:

wget http://central.maven.org/maven2/com/facebook/presto/presto-cli/0.75/presto-cli-0.75-executable.jar

change the name:

mv presto-cli-0.75-executable.jar presto
and set executable rights:
chmod +x presto

Now, let’s launch the console:

./presto --server localhost:8080 --catalog hive --schema default
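If everything is wired up correctly, the same tables we queried in Hive earlier should now be visible from the Presto console, e.g.:

```sql
show tables;
select * from sample_07 limit 5;
```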

If you have any problems with the installation, go through all the steps from the official site.

OK, configuration is done!

In the next part we’ll use Resque to run some jobs in the background.