Nokogiri
The most important part is choosing the right tool for searching and parsing data. My weapon of choice is Nokogiri (nokogiri.org). It's a gem for working with websites: it lets you search a page via CSS selectors and parse its data. Nokogiri is very powerful and can easily be used to retrieve files as well. Naturally, there are other solutions, Mechanize for example.
gem 'nokogiri'
Importantly, every solution has to be prepared for one particular site; there is no way to write a single function that, once written, works for many different sites. Nokogiri searches pages through their HTML code and CSS selectors, and each page is built differently. In a previous task I even ran into a page that didn't have any CSS classes at all, but I was still able to get what I wanted. Connecting to a website is very simple: you just open it by passing the URL as a string, using the Open-URI library.
require 'open-uri'
doc = Nokogiri::HTML(open("https://en.wikipedia.org/wiki/Main_Page"))
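If a page exposes no CSS classes or IDs to hook onto, you can still address elements by their tag structure or with an XPath expression. Here is a minimal sketch, assuming a hypothetical page where we want the text of the second paragraph inside the first table; the URL is made up for illustration:
require 'open-uri'
require 'nokogiri'

doc = Nokogiri::HTML(open("https://example.com/no-classes"))

# Pure structural selection: no classes or IDs involved.
second_paragraph = doc.css('table')[0].css('p')[1].text

# The same thing expressed as XPath (indices are 1-based).
second_paragraph = doc.xpath('//table[1]//p[2]').text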
Now you have to check the source code of the site and analyze its HTML structure, CSS classes and IDs. Find which element contains the interesting content and use it in your parser. For example, let's say I want to get the text of 'Today's featured article'. It is a p element inside a div with the 'mp-tfa' ID. Using the code below, I can assign that text to a variable.
require 'open-uri'
doc = Nokogiri::HTML(open("https://en.wikipedia.org/wiki/Main_Page"))
content = doc.css('div#mp-tfa p').text
The idea is very simple, but the actual implementation can cause numerous problems. Imagine a situation where you want to get every URL on the main page that contains a certain word, and there are no CSS selectors to lean on. Extracted URLs sometimes contain whitespace and have to be cleaned up, and every URL has to be checked for nil. Don't worry though; these are examples of difficult situations, and with proper analysis you'll still be able to handle them (a small sketch of that scenario follows the example below).

Here is an example I used in my work. I stored the site URL in a constant, as the exact URL wasn't important. I searched a table for links to interesting content, opened a new stream each time I found what I needed, and parsed the content, removing newlines and line breaks. Additionally, I extracted the date from the content with a regular expression. Everything is saved to a database, so I don't need to connect to the external page every time a user reloads my page. It's better to run this kind of code as a cron task (e.g. once a day) to make it independent of user activity. This is even more important for the file download process, because it can take quite a long time.
# coding: UTF-8
require 'open-uri'
require 'nokogiri'

def self.collect_data
  begin
    doc = Nokogiri::HTML(open(DATA_MAIN_URL))
    # Walk every link in the interesting table cells.
    doc.css('tr.interesting_row').each do |row|
      row.css('td.interesting_column a').each do |link|
        new_url = DATA_MAIN_URL + link.attributes['href'].value.encode('UTF-8')
        content = Nokogiri::HTML(open(new_url))
        # Grab the relevant fragment as HTML and strip line breaks.
        interesting_content = content.css('div#important_content').to_html.encode('UTF-8')
        interesting_content.insert(0, '<head><meta charset=\'UTF-8\'></head>')
        interesting_content.gsub!(/\n/, '')
        interesting_content.gsub!(/\r/, '')
        # Pull the publication date out of the header with a regex.
        date = /\d{4}-\d{2}-\d{2}/.match(content.css('h2.h-blue').text).to_s.to_date
        # File is the app's model for scraped entries here, not Ruby's File class.
        file = File.new(title: link.text, url: new_url, publish_date: date, content: interesting_content) unless new_url.nil?
        if file.save
          puts '-- saved --'
        else
          puts '-- not saved --'
        end
      end
    end
  rescue
    puts '-- Connection Error --'
  end
end
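Coming back to the scenario mentioned earlier, collecting every URL on a page that contains a certain word when there are no helpful selectors, a minimal sketch could look like the one below. The page address and the keyword ('report') are made up for illustration; the important parts are skipping links without an href and stripping stray whitespace:
require 'open-uri'
require 'nokogiri'

doc = Nokogiri::HTML(open("https://example.com"))

urls = doc.css('a').map do |link|
  href = link.attributes['href']
  next if href.nil?    # some anchors have no href at all
  href.value.strip     # URLs sometimes come with stray whitespace
end.compact.select { |url| url.include?('report') }   # keep only URLs with the keyword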
Get the files
Retrieving files with Nokogiri is very similar to what a regular user does to get a file: find the file URL and save its contents, just like any other resource. In my example, I decided to prepare a new directory for the downloaded files first. The searching code is almost the same as before, but after saving a new object I open a new stream and use the download_file function described below.
def self.collect_files
  dir = 'public/files'
  Dir.mkdir(dir) unless File.exist?(dir)
  begin
    doc = Nokogiri::HTML(open(FILES_MAIN_URL))
    doc.css('td').each do |row|
      # Only cells whose link text looks interesting.
      if row.css('a').text =~ /(Important Link|Interesting Link)/
        link = row.css('a').first
        new_url = link.attributes['href'].value.encode('UTF-8')
        date = /\d{4}-\d{2}-\d{2}/.match(doc.css('h2.h-blue').text).to_s.to_date
        unless new_url.nil?
          # File is the app's model for scraped entries here, not Ruby's File class.
          file = File.new(title: link.text, url: new_url, publish_date: date)
          if file.save
            # Open the linked page and download every file it points to.
            file_page = Nokogiri::HTML(open(URI.escape(new_url)))
            file_page.css('a').each { |attachment| download_file(attachment, dir, file.id) }
          else
            puts '-- not saved --'
          end
        end
      end
    end
  rescue
    puts '-- Connection Error --'
  end
end
The function used for downloading files works in a very simple manner. First, the URL is checked to see whether it is a full HTTP address ending in .pdf or .doc. If it is, the address is opened and a filename is built for the .pdf or .doc file respectively. The new name also contains the target directory and the object ID, which is used later to match files to objects (though it isn't obligatory). The files are not stored in a database; they are written directly to the server. That is my way of keeping a kind of relation between objects and files. I did it for one reason: the content I scraped could have more than one attachment, and since the number of attachments varies, this makes it easier to store and present later. Files are saved with the most basic File class methods in a block: open, write, close.
def self.download_file(file, dir, id)
  href = file.attributes['href'].value
  # Only full HTTP addresses pointing to a .pdf or .doc are treated as files.
  url = href.encode('UTF-8') if href =~ /^http:\/\/.+(\.pdf|\.doc)/
  if url.nil?
    puts '-- Bad url - not a file --'
  else
    name = file.text.gsub(/\//, '-')
    if url =~ /^http:\/\/.+\.pdf/
      local_fname = "#{dir}/FILE#{id}-#{name}.pdf"
    elsif url =~ /^http:\/\/.+\.doc/
      local_fname = "#{dir}/FILE#{id}-#{name}.doc"
    end
    # Normalize the name before checking whether the file is already there.
    local_fname.gsub!(/\s/, '')
    unless File.exist?(local_fname)
      File.open(local_fname, 'wb+') do |f|
        f.write(open(URI.escape(url)).read)
        puts '-- File saved --'
      end
    end
  end
end
Download
Files stored on the server can be presented in a very easy way. I simply created an additional function that recognizes the files belonging to a particular object and returns an array of filenames. It's a very simple solution: it just compares the object ID with the one embedded in the filename.
def filenames
  files = []
  dir = 'public/files/'
  Dir.foreach("#{Rails.root}/#{dir}") do |file|
    # Collect only files prefixed with this object's ID.
    files << File.basename(file) if File.fnmatch("FILE#{id}-*", File.basename(file))
  end
  files
end
In the controller, add a method for the file download. It should take the file name, check that it has a correct extension, and verify that the resolved path really stays inside the files directory. This is very important for security: a naive send_file call lets the requester traverse into other directories and fetch any file from the server, e.g. login credentials. The most secure way would be strong parameters, but they won't work in this situation.
def download_document
  basename = "#{Rails.root}/public/files"
  if File.fnmatch("FILE*-*.pdf", File.basename(params[:name]))
    filename = File.expand_path(File.join(basename, params[:name]))
    # Reject anything that resolves outside the files directory (path traversal).
    raise if basename != File.expand_path(File.dirname(filename))
    send_file(filename, filename: File.basename(filename), type: 'application/pdf')
  else
    redirect_to :back, alert: 'Error'
  end
end
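The download_files_path helper used in the view below has to come from the routes. How you declare it depends on your app; a minimal sketch, assuming the action above lives in a hypothetical FilesController, could look like this:
# config/routes.rb
Rails.application.routes.draw do
  # Exposes download_files_path(name: ...) for the view below.
  get 'files/download', to: 'files#download_document', as: :download_files
end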
To create download links on the index page, you can use the example below. For each object, the title and date are shown, along with a unique download link for every filename in the array. As a result, you get a nice index page with a download link for each file.
%ul
  %li
    %p
      %b= file.title
    %p
      %i= file.publish_date
    - file.filenames.each do |name|
      %p
        = link_to(t('.download'), download_files_path(name: name))
It's difficult to prepare a step-by-step solution for this kind of task, as every page is built differently and all possible scenarios are hard to predict. Each data parser has to be tailor-made for particular HTML code, but I think these examples give you the general idea and enough inspiration to start. Now, try it yourself!
Published: Aug 25th, 2015