A few months ago I read about a new Facebook SQL query engine: PrestoDB. The official site says it’s designed for “running interactive analytic queries against data sources of all sizes ranging from gigabytes to petabytes.” It’s also supposedly much faster than Hive. Sounds interesting! Especially since Facebook has been using it in production since spring 2013. I’ve decided to check how it works with Rails. If you’re totally new to Presto, read these links:
In this post I’ll show you how to prepare Hadoop and PrestoDB.
I assume you know Hive (and Hadoop) at a basic level. I’ll explain most of the steps, so you can run this application even if you don’t know these technologies, but I won’t describe how Hive and Hadoop work internally.
You should also be familiar with Ruby on Rails, of course.
What is Resque?
It’s another useful tool for running long background tasks. That’s why it’s a very good choice in combination with Big Data technologies.
Resque was created by GitHub developers because they ran into limitations of the existing solutions. Here’s a quite good explanation of what Resque is and why they developed it.
Big Data tools like PrestoDB or Hive are useful when you want to run some analysis, e.g. generate statistics from an access log. Our final application will have the following flow:
- The web UI allows launching an analysis of the number of requests to a specific URL.
- Launching an analysis means running a PrestoDB query in a Resque job.
- PrestoDB uses the Hive Metastore and accesses the log stored in Hadoop (HDFS).
- Results of the analysis are persisted in an SQLite DB.
- The user can check whether the analysis is still in progress and see the results when it’s finished.
This is a simple example of analysing data that is small rather than big, but it’ll show how you can use a few useful technologies together. One more thing you should know: technologies like PrestoDB and Hive only start to pay off when you have at least hundreds of gigabytes of data, and you’ll see their full value when you’re analyzing tera- or petabytes.
Setting up environment
We’ll use quite a big bunch of technologies, so it makes sense not to install all of them manually. First we need a Big Data backend with Hive and PrestoDB. Hive also implies a Hadoop cluster, so we’ll use the Hortonworks Sandbox, which you can find here.
Download it and run it in your favourite virtualization tool. I’ve chosen VirtualBox; it’s good and free. When you’re ready, start the virtual machine and log in via ssh (port forwarding is preconfigured, so you can use the 127.0.0.1 IP):
ssh root@127.0.0.1 -p 2222
use password: hadoop
Make sure Hive is working properly:
su hive
cd
hive # open hive console
Type the following queries:
show tables;
select * from sample_07 limit 5;
You should have similar results:
OK, we have a Hadoop cluster with Hive, let’s install PrestoDB.
Type exit; to close the Hive console and exit again to go back to the root user.
PrestoDB – installation
In the next steps we’ll create a presto user and install the tool in a proper directory:
useradd presto -m -g hadoop # add user: presto and set its group to: hadoop
su presto # change user to presto
cd
wget http://central.maven.org/maven2/com/facebook/presto/presto-server/0.75/presto-server-0.75.tar.gz # download presto
tar -zxvf presto-server-0.75.tar.gz
exit # go back to the root user
mkdir /var/presto # create main presto directory
chown -R presto:hadoop /var/presto # change the owner to presto user and hadoop group
su presto
cd /var/presto
mv ~/presto-server-0.75 install
mkdir data
Now we have to add some config files to proper places.
mkdir install/etc
vim /var/presto/install/etc/jvm.config # vim is not installed by default, use any editor you like
You need the following lines there:
-server
-Xmx16G
-XX:+UseConcMarkSweepGC
-XX:+ExplicitGCInvokesConcurrent
-XX:+CMSClassUnloadingEnabled
-XX:+AggressiveOpts
-XX:+HeapDumpOnOutOfMemoryError
-XX:OnOutOfMemoryError=kill -9 %p
-XX:PermSize=150M
-XX:MaxPermSize=150M
-XX:ReservedCodeCacheSize=150M
-Xbootclasspath/p:/var/presto/install/lib/floatingdecimal-0.1.jar
Notice that the -Xbootclasspath parameter contains the path to the main Presto installation directory, so if you’ve created it in a different place than /var/presto/install, you should change it here as well.
The next file to edit is node.properties (in the same etc directory):
node.environment=dev
node.id=ffffffff-ffff-ffff-ffff-ffffffffffff
node.data-dir=/var/presto/data
These settings go into config.properties (in the same directory):
coordinator=true
node-scheduler.include-coordinator=true
http-server.http.port=8080
task.max-memory=1GB
discovery-server.enabled=true
discovery.uri=http://127.0.0.1:8080
Finally, set the logging level. You can use one of these values: DEBUG, INFO, WARN, ERROR.
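In a standard Presto deployment the logging level lives in etc/log.properties; assuming that layout, a minimal file could look like this:

```properties
# Set the logging level for all Presto classes
com.facebook.presto=INFO
```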
At this point most of the basic configuration is done, but we still have to specify which database backend we’ll use. The Hortonworks Sandbox ships Hive connected to Hadoop 2, so we’ll use the hive-hadoop2 connector. The catalog directory in etc will store information about it.
mkdir /var/presto/install/etc/catalog
vim /var/presto/install/etc/catalog/hive.properties
Paste these lines there:
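In case the snippet above didn’t render, a typical hive.properties for this setup names the connector and points Presto at the Hive Metastore. The Thrift port below is the Sandbox default; verify yours if the connection fails:

```properties
# Use the Hive connector for Hadoop 2
connector.name=hive-hadoop2
# Address of the Hive Metastore Thrift service (Sandbox default port assumed)
hive.metastore.uri=thrift://127.0.0.1:9083
```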
The name of the properties file above will be used in Presto to specify the catalog. If you want to add another connector (like HBase) in the future, the catalog directory is the right place to do it: just create a file with another name and put the new connector’s config there.
It’s time to start Presto!
cd /var/presto/install
bin/launcher start
You should see something like this:
Started as 4960
and /var/presto/data/var/log/server.log should contain “SERVER STARTED” near the end of the file:
2014-09-17T04:46:49.798-0700 INFO main com.facebook.presto.metadata.CatalogManager -- Added catalog hive using connector hive-hadoop2 --
2014-09-17T04:46:49.965-0700 INFO main com.facebook.presto.server.PrestoServer ======== SERVER STARTED ========
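If you don’t want to scroll through the whole log, you can check for the startup marker directly (a quick sketch, assuming the standard tail and grep tools):

```shell
# Look for the startup marker in the last lines of the server log
tail -n 50 /var/presto/data/var/log/server.log | grep "SERVER STARTED"
```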
Now we can download Presto-cli console:
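The CLI is published on Maven Central next to the server tarball. Assuming the same coordinates and version 0.75, the download would look like this (verify the URL against the official docs if it 404s):

```shell
# Download the executable presto-cli jar (URL assumed from the server tarball's Maven coordinates)
wget http://central.maven.org/maven2/com/facebook/presto/presto-cli/0.75/presto-cli-0.75-executable.jar
```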
Change the name:
mv presto-cli-0.75-executable.jar presto
and set executable rights:
chmod +x presto
Now let’s launch the console:
./presto --server localhost:8080 --catalog hive --schema default
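To verify that the Hive connector really works, you can also run a one-off query without entering the interactive prompt; presto-cli supports an --execute flag for that. This repeats the query we ran earlier in Hive, against the Sandbox’s sample table:

```shell
# Run a single query non-interactively and exit
./presto --server localhost:8080 --catalog hive --schema default --execute "select * from sample_07 limit 5;"
```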
If you have any problems with the installation, go through all the steps from the official site:
OK, configuration is done!
In the next part we’ll use Resque to run some jobs in the background.