If you are creating a web page, you may want to pull in information from external sources. Many sites allow sharing their content, and Rails has a nice way to retrieve this data automatically. You can grab a particular piece of text or an image and present it on your own page. What’s more, you can copy files and make them available for download from your own site. Here is a quick tutorial on how to do that.

Nokogiri

The most important part is choosing the right tool for searching and parsing data. My weapon of choice is Nokogiri. It’s a gem for parsing HTML and XML documents – it lets you search a page via CSS selectors and extract its data. Nokogiri is very powerful and can easily be used to retrieve files as well. Naturally, there are other solutions, such as Mechanize.

gem 'nokogiri'

Importantly, every solution has to be prepared for one particular site; there is no way to write a single function that, once written, works for many different sites. Nokogiri searches pages through their HTML code and CSS selectors, and each page is built in a different way. In a previous task, I even ran into a page that didn’t have any CSS classes at all, but I was still able to get what I wanted. Connecting to a website is very simple: you just open it via its URL, passed as a string parameter, using the Open-URI library.

require 'open-uri'
require 'nokogiri'

doc = Nokogiri::HTML(open("https://en.wikipedia.org/wiki/Main_Page"))

Now you have to check the source code of the site and analyze its HTML structure, CSS classes, and IDs. Find which element contains the interesting content and use it in your parser. For example, let’s say I want to get the text from ‘Today’s featured article’. It is a p element inside a div with the mp-tfa ID. Using the code below, I can assign that text to a variable.

require 'open-uri'
require 'nokogiri'

doc = Nokogiri::HTML(open("https://en.wikipedia.org/wiki/Main_Page"))
content = doc.css('div#mp-tfa p').text
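
The same approach works for attributes, not just text. As a preview of the kind of situation described in the next paragraph, here is a minimal sketch – the filter word is arbitrary and purely illustrative – that collects every link URL containing a given word, skipping anchors without an href and stripping stray whitespace:

require 'open-uri'
require 'nokogiri'

doc = Nokogiri::HTML(open('https://en.wikipedia.org/wiki/Main_Page'))

# Collect the URL of every link that contains a given word ('wiki' is just an
# illustration). Anchors without an href are skipped, and whitespace inside
# the URL is stripped before matching.
urls = doc.css('a')
          .map { |a| a.attributes['href'] }
          .compact
          .map { |attr| attr.value.gsub(/\s/, '') }
          .select { |url| url.include?('wiki') }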

The idea is very simple, but the actual implementation can cause numerous problems. Imagine a situation where you want to get every URL on the main page that contains a certain word, and there are no CSS selectors to hang on to. Sometimes URLs come with whitespace in them, so they have to be cleaned up, and every URL has to be checked for nil. Don’t worry, though: while these are examples of difficult situations, you’ll still be able to handle them with proper analysis.

Here is an example I used in my work. I stored the site URL in a constant, as the exact URL isn’t important here. I searched a table for links to interesting content, opened a new stream each time I found what I needed, and parsed the content, deleting newlines and carriage returns. Additionally, I extracted a date from the content with a regex. Everything is saved to the database, so I don’t need to connect to the external page every time a user reloads my page. It’s better to run this kind of code as a cron task (e.g. once a day), to make it independent of user activity – see the scheduling sketch after the code. This matters even more for the file download process, because it can take quite a long time.

# coding: UTF-8
require 'open-uri'
require 'nokogiri'

def self.collect_data
  doc = Nokogiri::HTML(open(DATA_MAIN_URL))
  doc.css('tr.interesting_row').each do |row|
    row.css('td.interesting_column a').each do |link|
      href = link.attributes['href']
      next if href.nil? # skip anchors without a target
      new_url = DATA_MAIN_URL + href.value.encode('UTF-8')
      content = Nokogiri::HTML(open(new_url))
      interesting_content = content.css('div#important_content').to_html.encode('UTF-8')
      interesting_content.insert(0, "<head><meta charset='UTF-8'></head>")
      interesting_content.gsub!(/\n/, '')
      interesting_content.gsub!(/\r/, '')
      date = /\d{4}-\d{2}-\d{2}/.match(content.css('h2.h-blue').text).to_s.to_date
      # 'Document' stands in for the ActiveRecord model; the original used
      # File, which collides with Ruby's built-in File class
      document = Document.new(title: link.text, url: new_url,
                              publish_date: date, content: interesting_content)
      if document.save
        puts '-- saved --'
      else
        puts '-- not saved --'
      end
    end
  end
rescue
  puts '-- Connection Error --'
end
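
As mentioned above, this is a good candidate for a cron job. Here is a minimal scheduling sketch using the whenever gem – the gem choice, the time, and the Document model name are my assumptions; a plain crontab entry calling a rake task works just as well:

# config/schedule.rb -- read by the whenever gem (run `whenever --update-crontab`)
every 1.day, at: '4:30 am' do
  # `runner` boots the Rails app so the model and its scraper method are available
  runner 'Document.collect_data'
end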

Get the files

Retrieving files with Nokogiri is very similar to what a regular user does to get a file: find the file’s URL and save its contents, just like with any other resource. In my example, I first prepare a new directory for the downloaded files. The searching code is almost the same as before, but this time I open the linked page first to read the publish date, and after saving the new object I pass each link from that page to the download_file function described below.

def self.collect_files
  dir = 'public/files'
  Dir.mkdir(dir) unless Dir.exist?(dir)
  doc = Nokogiri::HTML(open(FILES_MAIN_URL))
  doc.css('td').each do |cell|
    next unless cell.css('a').text =~ /(Important Link|Interesting Link)/
    link = cell.css('a').first
    new_url = link.attributes['href'].value.encode('UTF-8')
    # open the linked page first so the publish date can be read from it
    content = Nokogiri::HTML(open(URI.escape(new_url)))
    date = /\d{4}-\d{2}-\d{2}/.match(content.css('h2.h-blue').text).to_s.to_date
    document = Document.new(title: link.text, url: new_url, publish_date: date)
    if document.save
      content.css('a').each { |anchor| download_file(anchor, dir, document.id) }
    else
      puts '-- not saved --'
    end
  end
rescue
  puts '-- Connection Error --'
end

The function used for file download works in a very simple manner. First, the URL is checked for a full HTTP address with a .pdf or .doc extension. If that’s the case, the address is opened, and a filename is created for the .pdf or .doc file respectively. The new name also contains the target directory and the object’s ID, which is used later for file recognition (though it isn’t obligatory). Files are not stored in the database; they’re written directly to the server. This is my way of keeping a kind of relation between objects and files. I did it for one reason: the content I scraped could have more than one attachment, and since the number of attachments varies, this makes it easier to store and present later. Files are saved using the most basic File class methods – open and write inside a block, which closes the file automatically.

def self.download_file(file, dir, id)
  href = file.attributes['href']
  url = href.value.encode('UTF-8') if href && href.value =~ %r{^https?://.+\.(pdf|doc)}
  if url.nil?
    puts '-- Bad url - not a file --'
  else
    name = file.text.gsub(/\//, '-')
    extension = url =~ /\.pdf/ ? 'pdf' : 'doc'
    # strip whitespace *before* the existence check, so both use the same name
    local_fname = "#{dir}/FILE#{id}-#{name}.#{extension}".gsub(/\s/, '')
    unless File.exist?(local_fname)
      File.open(local_fname, 'wb') do |f|
        f.write(open(URI.escape(url)).read)
        puts '-- File saved --'
      end
    end
  end
end

Download

Files stored on the server can be presented very easily. I simply created an additional function that recognizes the files belonging to a particular object and returns an array of filenames. It’s a very simple solution: it just compares the object’s ID with the one embedded in each filename.

def filenames
  files = []
  dir = "#{Rails.root}/public/files/"
  Dir.foreach(dir) do |file|
    files << File.basename(file) if File.fnmatch("FILE#{id}-*", File.basename(file))
  end
  files
end
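
With that in place, listing the attachments for a single record is straightforward – the Document model name and the ID here are illustrative assumptions:

document = Document.find(42)
document.filenames
# => ["FILE42-AnnualReport.pdf", "FILE42-Summary.doc"] (example output)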

In the controller, add a method for the file download. It should take the file name, check that the resolved path stays inside the expected directory, and verify the extension. This is very important for security reasons: a naive send_file call lets an attacker manipulate the path (e.g. with ../ segments) and download arbitrary files from the server, such as configuration files with login credentials. Strong parameters are usually the most secure option, but they won’t help in this situation.

def download_document
  base_dir = "#{Rails.root}/public/files"
  if File.fnmatch('FILE*-*.pdf', File.basename(params[:name]))
    filename = File.expand_path(File.join(base_dir, params[:name]))
    # reject anything that resolves outside public/files (path traversal)
    raise if base_dir != File.expand_path(File.dirname(filename))
    send_file(filename, filename: File.basename(filename), type: 'application/pdf')
  else
    redirect_to :back, alert: 'Error'
  end
end
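
The view below relies on a download_files_path helper, so the action needs a matching route. One possible wiring – the resource and controller names are my assumptions:

# config/routes.rb
resources :files, only: [:index] do
  collection do
    # GET /files/download?name=... => FilesController#download_document;
    # generates the download_files_path helper used in the view
    get 'download', action: :download_document, as: :download
  end
end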

To create download links on the index page, you can use the example below. For each object, the title and date are shown, along with a unique download link for every filename in the array. As a result, you get a nice index page with a download link for each file.

-# @files is the collection of saved records (e.g. Document.all in the controller)
%ul
  - @files.each do |file|
    %li
      %p
        %b= file.title
      %p
        %i= file.publish_date
      - file.filenames.each do |name|
        %p
          = link_to(t('.download'), download_files_path(name: name))

It’s difficult to prepare a step-by-step solution for this kind of task: every page is built differently, and it’s hard to predict all possible scenarios. Each data parser has to be a tailor-made solution for particular HTML code, but I hope my examples give you the general idea and some inspiration. Now, try it yourself!


Published: Aug 25th, 2015
