Open-sourcing Vessel: a high-level Ruby web crawling framework

Vessel is a high-level Ruby web crawling framework for extracting data from websites. It can be used in a wide range of scenarios, such as data mining, monitoring, or historical archival.

March 2020 3 mins

Vessel is a fast, open-source, high-level web crawling and scraping framework for Ruby. It is built on top of Ferrum, a minimal-dependency pure-Ruby driver for running headless Google Chrome instances.

Why would you need a web crawler? Perhaps you're building a search engine for an intranet or a group of public-facing websites, or just need to mirror a website with finer-grained control than tools such as wget offer.

Crawl, walk, run

The best way to demonstrate Vessel's capabilities is with an example. Don't worry: Vessel is capable of a lot, but that doesn't make it hard to use.

To get started, add Vessel to your Gemfile:

gem "vessel"

Next, let's build the crawler class. Create a spider.rb file and define a Spider class in it that derives from Vessel::Cargo. The class configures the crawling parameters and provides a parse callback method, which is invoked for each page that's retrieved. If you don't provide one, Vessel::Cargo raises a NotImplementedError as soon as a page has been retrieved. The code is below:

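require "vessel"

class Spider < Vessel::Cargo
  # Restrict the crawl to this domain and begin at its front page.
  domain "blog.scrapinghub.com"
  start_urls "https://blog.scrapinghub.com"

  # Default callback, invoked for every retrieved listing page.
  def parse
    # Follow each article link and handle the result in parse_article.
    css(".post-header>h2>a").each do |a|
      yield request(url: a.attribute(:href), method: :parse_article)
    end

    # Follow the pagination link and handle the next listing page here again.
    css("a.next-posts-link").each do |a|
      yield request(url: a.attribute(:href), method: :parse)
    end
  end

  # Callback for article pages: emit the page title as the scraped item.
  def parse_article
    yield page.title
  end
end

# Start the crawl; the block receives every item yielded by the callbacks.
Spider.run { |title| puts title }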

Most of this should be fairly self-explanatory. Behind the scenes, Vessel will employ a thread pool to perform the requests, defaulting to one thread per core (you can change this by adding threads max: n to the class definition).
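
To make the thread pool idea concrete, here is a minimal sketch of the general pattern in plain Ruby, using only the standard library. It is not Vessel's internal implementation: run_in_pool is a hypothetical helper, and the strings stand in for real page requests. It defaults to one worker per core, which mirrors the default described above.

```ruby
require "etc"

# Hypothetical helper (not part of Vessel): push every job through a
# fixed-size pool of worker threads and collect the block's results.
def run_in_pool(jobs, max: Etc.nprocessors, &work)
  queue = Queue.new
  jobs.each { |job| queue << job }
  queue.close # once drained, pop returns nil and the workers stop

  results = Queue.new
  workers = Array.new(max) do
    Thread.new do
      while (job = queue.pop)
        results << work.call(job)
      end
    end
  end
  workers.each(&:join)

  # Results arrive in completion order, not submission order.
  Array.new(results.size) { results.pop }
end

# Stand-in "requests": upcase each path instead of fetching a page.
paths = %w[/a /b /c /d]
puts run_in_pool(paths, max: 2) { |path| path.upcase }.sort.inspect
```

Because the workers finish in an unpredictable order, the sketch sorts the results before printing them. A crawler behaves the same way: pages come back in whatever order they complete.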

You can run the crawler with:

bundle exec ruby spider.rb

The output will be the title of each page as it's crawled and parsed by Chrome, and passed back to your Ruby class.

Fast as Chrome, dead simple and yet extendable

You can see from the example how easy it is to scrape — extract structured data from typically-unstructured web pages — using Ferrum's DOM methods.

The example code above follows two kinds of links via the request method, each identified by a CSS selector, and ignores everything else. The only data it extracts is the page title, which is emitted as output, but you can perform any kind of information extraction you like here.

And whilst scraping is powerful, scraping with a crawler gives you a lot more power. Rather than being confined to individual pages, Vessel lets you extract data across a whole site, or a set of sites. You have complete control over which links are followed, what data is returned along the way, and what you do with it afterwards. Generate a CSV with collated tabular data? Sure, no problem. Or output JSON that you can feed into something else? That's straightforward, too.
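
As a rough illustration of that last step, the sketch below turns a list of scraped records into CSV and into line-delimited JSON using Ruby's standard csv and json libraries. The records, the example.com URLs, and the to_csv and to_json_lines helpers are all hypothetical. It assumes your callback yields a hash per page rather than just the title.

```ruby
require "csv"
require "json"

# Hypothetical records, shaped like hashes a parse callback might yield.
ITEMS = [
  { title: "First post", url: "https://example.com/first" },
  { title: "Second, with a comma", url: "https://example.com/second" }
].freeze

# Build a CSV string: a header row from the first record's keys, then one
# row per record. CSV.generate quotes fields that contain commas.
def to_csv(items)
  return "" if items.empty?

  headers = items.first.keys
  CSV.generate do |csv|
    csv << headers
    items.each { |item| csv << item.values_at(*headers) }
  end
end

# Build line-delimited JSON: one JSON object per line.
def to_json_lines(items)
  items.map { |item| JSON.generate(item) }.join("\n")
end

puts to_csv(ITEMS)
puts to_json_lines(ITEMS)
```

In a real crawl, you would append each item to an array inside the block passed to Spider.run and call one of these helpers once the crawl has finished.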

In fact, with Vessel and Ferrum, you can crawl, parse, extract, and transform web content with so little effort, you'll wonder why you ever had to do it any other way before!

Client Review

You can build your own Google and crawl thousands of websites per month with Vessel. Thanks to Evrone for contributing to this initiative and designing the identity for the Vessel Ruby framework.
Dmitry Vorotilin
Team Lead, Machinio.com