
Recently, I encountered a tricky problem – testing crawlers using RSpec. There are several ways of doing that, and today I’d like to show you what I see as the optimal solution.

As we know, crawlers are made to collect information from external websites (ones that do not expose an API). In most cases, when the page layout changes, the crawler stops working properly. On the one hand, tests should detect that; on the other, connecting to an external server every time you run your tests is not optimal and slows the suite down.

I decided to perform tests on static pages prepared in advance (I downloaded their source code) in order to eliminate the need to connect to the external site every time we test the crawling methods that create database objects for our application. There is one serious problem with this solution though: any changes on the source page won’t show up during our tests. Therefore I created an additional log file that stores the crawler’s output, so we can determine whether something has gone wrong.

First of all, we have to build a model, and a simple crawler to create its objects. In this example I made a Footballer model with the fields name:string and team:string, and then I added four gems to the Gemfile.

gem 'nokogiri'
gem 'mechanize'
gem 'rspec-rails'
gem 'capybara'
  • Nokogiri – a simple and very fast gem for HTML and XML parsing.
  • Mechanize – a comprehensive tool for automating interaction with websites (we’ll use it to fill in form fields).
  • Rspec-Rails – a testing framework supporting RSpec examples for requests, controllers, models, views, helpers, mailers and routing.
  • Capybara – a tool for testing web applications by simulating user–website interaction.

After bundling (bundle install) and running the migrations (rake db:migrate), we add a crawler class method to footballer.rb.

def self.collect_footballers
  logger = Logger.new("#{Rails.root}/log/footballers_crawler.log")
  logger.debug('<<<<<<< Downloading footballers >>>>>>>')
  agent = Mechanize.new
  element_no = 0

  # Retry until the search page loads and the form submits successfully.
  loop do
    logger.debug('Loading search page...')
    connected = false
    begin
      agent.get('http://pl-pl.soccerwiki.org/player.php')
      logger.debug('Search page has been loaded!')
      form = agent.page.forms[1]
      form.fields[1].options_with(value: '100')[0].click
      form.fields[2].options_with(value: '90')[0].click
      form.fields[3].options_with(value: '99')[0].click
      form.fields[4].options_with(value: 'POL')[0].click
      form.submit
      connected = true
    rescue StandardError => e
      logger.debug(e.message)
    end
    break if connected
  end
  logger.debug('Search results page has been loaded!')

  table = agent.page.search('table')[2]

  # Skip the header row, then build one Footballer per result row.
  table.search('tr')[1..-1].each do |row|
    footballer = Footballer.new
    footballer.name = row.search('td')[1].text
    footballer.team = row.search('td')[2].text.strip
    footballer.save
    element_no += 1
  end

  logger.debug("Finished: #{element_no} footballers records have been created.")
end

Now we have to run it from the console (bundle exec rails c)...

Footballer.collect_footballers

...and check the log file log/footballers_crawler.log after collecting:

# Logfile created on 2014-12-29 14:53:28 +0100 by logger.rb/44203
I, [2014-12-29T14:53:28.143870 #16594]  INFO -- : <<< Downloading footballers (Mon, 29 Dec 2014 14:53:28 +0100) >>>
D, [2014-12-29T14:53:28.144143 #16594] DEBUG -- : Loading search page...
D, [2014-12-29T14:53:28.612411 #16594] DEBUG -- : Search page has been loaded!
D, [2014-12-29T14:53:29.130408 #16594] DEBUG -- : Search results page has been loaded!
I, [2014-12-29T14:53:29.155064 #16594]  INFO -- : Finished: 4 footballers records have been created.

Now we can deal with the issue of testing these crawlers. First of all, we have to copy the source code of the webpages we use (in this case, the search and results pages from http://pl-pl.soccerwiki.org/player.php) and save them as static .html files under spec/fixtures/footballers.

At this point we can start working on our main task – RSpec tests using Capybara. We have to add the required line to rails_helper.rb...

require 'capybara/rails'

…and write tests (by copying prepared lines of code from the crawler).

require 'rails_helper'

feature 'Footballers crawler' do
  scenario 'crawler filling search form' do
    # Fill in the search form without submitting it; this scenario
    # verifies that the fixture still matches the crawler's expectations
    # and that no step raises an error.
    agent = Mechanize.new
    footballers_search_path = 'file://' + "#{Rails.root}/spec/fixtures/footballers/footballers_search.html"

    agent.get(footballers_search_path)
    form = agent.page.forms[1]
    form.fields[1].options_with(value: '100')[0].click
    form.fields[2].options_with(value: '90')[0].click
    form.fields[3].options_with(value: '99')[0].click
    form.fields[4].options_with(value: 'POL')[0].click
  end

  scenario 'crawler process result pages' do
    # Process the saved results page into the database;
    # the fixture contains four footballer rows.
    agent = Mechanize.new

    expect(Footballer.count).to eq(0)

    # parsing logic copied from footballer.rb

    footballers_results_path = 'file://' + "#{Rails.root}/spec/fixtures/footballers/footballers_results.html"
    agent.get(footballers_results_path)

    table = agent.page.search('table')[2]

    table.search('tr')[1..-1].each do |row|
      footballer = Footballer.new
      footballer.name = row.search('td')[1].text
      footballer.team = row.search('td')[2].text.strip
      footballer.save
    end

    expect(Footballer.count).to eq(4)
  end
end

Finally, run the suite with bundle exec rspec – all the tests should pass. From now on, all you have to do is monitor the page and the logs, implementing changes as needed ;)
