Sep 15, 2015

RSpec Testing of Rails Crawlers

Recently, I encountered a tricky problem – testing crawlers using RSpec. There are several ways of doing that, and today I’d like to show you what I see as the optimal solution.

As we know, crawlers collect information from external websites that do not expose an API. When the page layout changes, the crawler usually stops working properly. On one hand, tests should detect that; on the other, connecting to an external server every time you run your tests is slow and wastes network resources.

I decided to run the tests against static pages prepared in advance (I downloaded their source code), which eliminates the need to connect to the external site every time we test the crawling methods that create database objects for our application. There is one serious problem with this approach, though: changes to the source page won't show up in our tests. Therefore I also created a log file that records the crawler's activity, so we can tell when something goes wrong.
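The logging half of this approach can be sketched with Ruby's standard-library Logger. Here the output goes to an in-memory buffer so the snippet runs anywhere; the app itself would log to log/footballers_crawler.log, as shown later.

```ruby
require 'logger'
require 'stringio'

# In the app this would be Logger.new("#{Rails.root}/log/footballers_crawler.log");
# a StringIO target keeps the sketch self-contained.
buffer = StringIO.new
logger = Logger.new(buffer)
logger.level = Logger::DEBUG

logger.debug('Loading search page...')
logger.info('Finished: 4 footballers records have been created.')

puts buffer.string
```

Each line in the buffer carries the severity, timestamp and process id, which is exactly the format you will see in the log excerpt further down.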

First of all, we have to build a model and a simple crawler that creates its objects. In this example I made a Footballer model with the fields name:string and number:integer, and then added four gems to the Gemfile.

gem 'nokogiri'
gem 'mechanize'
gem 'rspec-rails'
gem 'capybara'
  • Nokogiri – a simple and very fast gem for HTML and XML parsing. 
  • Mechanize – a complex tool for automating interaction with websites (we’ll use it to fill in form fields). 
  • rspec-rails – a testing framework providing RSpec examples for requests, controllers, models, views, helpers, mailers and routing. 
  • Capybara – a tool for testing web applications by simulating user interaction with a website.
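To preview the kind of table scraping the crawler below performs, here is a dependency-free sketch using Ruby's bundled REXML in place of Nokogiri. The HTML snippet and its column layout (index, name, number) are invented for illustration, but the cell indexing mirrors what the crawler does:

```ruby
require 'rexml/document'

# A tiny stand-in for a results page: cells are [index, name, number],
# mirroring the td[1]/td[2] indexing used by the crawler later on.
html = <<~HTML
  <table>
    <tr><th>No</th><th>Name</th><th>Number</th></tr>
    <tr><td>1</td><td>Lewandowski</td><td> 9 </td></tr>
    <tr><td>2</td><td>Szczesny</td><td> 1 </td></tr>
  </table>
HTML

doc = REXML::Document.new(html)
rows = doc.root.get_elements('tr')[1..-1]   # skip the header row
footballers = rows.map do |row|
  cells = row.get_elements('td')
  { name: cells[1].text, number: cells[2].text.strip.to_i }
end

p footballers
```

With Nokogiri the shape is the same, only with `search('tr')` and `search('td')` instead of `get_elements`, and it tolerates the broken HTML that real pages tend to contain.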

After bundling (bundle install) and running the migrations (rake db:migrate), we add a crawler class method to footballer.rb.

def self.collect_footballers
  logger = Logger.new("#{Rails.root}/log/footballers_crawler.log")
  logger.debug('<<<<<<< Downloading footballers >>>>>>>')
  agent = Mechanize.new
  element_no = 0
  results_page = nil

  loop do
    logger.debug('Loading search page...')
    connected = false
    begin
      # SEARCH_URL stands for the crawled site's search page (the URL is not named in this post)
      page = agent.get(SEARCH_URL)
      logger.debug('Search page has been loaded!')
      form = page.forms[1]
      form.fields[1].options_with(value: '100')[0].click
      form.fields[2].options_with(value: '90')[0].click
      form.fields[3].options_with(value: '99')[0].click
      form.fields[4].options_with(value: 'POL')[0].click
      results_page = form.submit
      connected = true
    rescue Mechanize::Error, SocketError => e
      logger.debug("Connection failed (#{e.message}), retrying...")
    end
    break if connected
  end
  logger.debug('Search results page has been loaded!')

  table = results_page.search('table')[2]
  table.search('tr')[1..-1].each do |row|
    footballer = Footballer.new
    footballer.name = row.search('td')[1].text
    footballer.number = row.search('td')[2].text.strip
    footballer.save
    element_no += 1
  end

  logger.debug("Finished: #{element_no} footballers records have been created.")
end

Now we have to run it from the console (bundle exec rails c, then Footballer.collect_footballers). Here is the log file log/footballers_crawler.log after collecting:

# Logfile created on 2014-12-29 14:53:28 +0100 by logger.rb/44203
I, [2014-12-29T14:53:28.143870 #16594]  INFO -- : <<< Downloading footballers (Mon, 29 Dec 2014 14:53:28 +0100) >>>
D, [2014-12-29T14:53:28.144143 #16594] DEBUG -- : Loading search page...
D, [2014-12-29T14:53:28.612411 #16594] DEBUG -- : Search page has been loaded!
D, [2014-12-29T14:53:29.130408 #16594] DEBUG -- : Search results page has been loaded!
I, [2014-12-29T14:53:29.155064 #16594]  INFO -- : Finished: 4 footballers records have been created.

Now we can deal with the issue of testing these crawlers. First of all, we have to copy the source code of the webpages we use (in this case, the search and results pages, taken from the source website) into static .html files saved under spec/fixtures/footballers/.
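Saving a fixture and addressing it with the file:// scheme, as the specs below do, can be sketched like this. The temp directory and placeholder body are stand-ins: in the real setup you paste the downloaded page source into spec/fixtures/footballers/.

```ruby
require 'fileutils'
require 'tmpdir'

# Stand-in for Rails.root/spec/fixtures/footballers; in the app the
# fixture body is the saved source of the real search page.
fixtures_dir = File.join(Dir.mktmpdir, 'spec', 'fixtures', 'footballers')
FileUtils.mkdir_p(fixtures_dir)

fixture_file = File.join(fixtures_dir, 'footballers_search.html')
File.write(fixture_file, '<html><body><form action="#"></form></body></html>')

# The same trick the specs use to point Mechanize at a local page:
footballers_search_path = 'file://' + fixture_file
puts footballers_search_path
```

Because the path is absolute, the concatenation yields a well-formed file:/// URL that Mechanize can fetch exactly like an HTTP one.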

At this point we can start working on our main task – RSpec tests using Capybara. We have to add the required line to rails_helper.rb...

require 'capybara/rails'

…and write tests (by copying prepared lines of code from the crawler).

require 'rails_helper'

feature 'Footballers crawler' do
  scenario 'crawler filling search form' do
    # fill in the search form without submitting it; this scenario fails
    # if the saved layout no longer matches the crawler's selectors
    agent = Mechanize.new
    footballers_search_path = 'file://' + "#{Rails.root}/spec/fixtures/footballers/footballers_search.html"

    page = agent.get(footballers_search_path)
    form = page.forms[1]
    form.fields[1].options_with(value: '100')[0].click
    form.fields[2].options_with(value: '90')[0].click
    form.fields[3].options_with(value: '99')[0].click
    form.fields[4].options_with(value: 'POL')[0].click
  end

  scenario 'crawler processes the results page' do
    # process the saved results page into the database
    agent = Mechanize.new

    expect(Footballer.count).to eq(0)

    # same parsing code as in footballer.rb
    footballers_results_path = 'file://' + "#{Rails.root}/spec/fixtures/footballers/footballers_results.html"

    results_page = agent.get(footballers_results_path)
    table = results_page.search('table')[2]

    table.search('tr')[1..-1].each do |row|
      footballer = Footballer.new
      footballer.name = row.search('td')[1].text
      footballer.number = row.search('td')[2].text.strip
      footballer.save
    end

    expect(Footballer.count).to eq(4)
  end
end

Finally, all the tests should pass. From now on, all you have to do is monitor the page and the logs, implementing changes as needed ;)
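Monitoring the log can itself be automated. A minimal sketch follows; the helper name and the idea of scanning for the summary line are my own additions, but the message format matches the crawler above:

```ruby
# Extract the record count from the crawler's final summary line;
# returns nil when no run ever finished, a hint that the page changed.
def last_finished_count(log_text)
  counts = log_text.scan(/Finished: (\d+) footballers records have been created\./)
  counts.empty? ? nil : counts.last.first.to_i
end

sample_log = <<~LOG
  D, [2014-12-29T14:53:28] DEBUG -- : Loading search page...
  I, [2014-12-29T14:53:29]  INFO -- : Finished: 4 footballers records have been created.
LOG

p last_finished_count(sample_log)
```

A cron job or rake task could call such a helper after each run and alert you when the count drops to zero or the summary line disappears.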