How to create a website spider
Creating a spider that generates a list of URLs for a given domain is very easy to do. A spider simply visits each page in a domain, finds all the links, and then visits those in turn.
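To see what that involves, here is a minimal hand-rolled sketch of the idea, assuming the nokogiri gem is available for HTML parsing (the crawl function name is just illustrative, and it ignores redirects and robots.txt):

require 'net/http'
require 'nokogiri'
require 'set'
require 'uri'

def crawl(start_url)
  start   = URI(start_url)
  visited = Set.new
  queue   = [start]

  until queue.empty?
    uri = queue.shift
    # Set#add? returns nil if we have already seen this URL.
    next unless visited.add?(uri.to_s)

    response = Net::HTTP.get_response(uri)
    next unless response.is_a?(Net::HTTPSuccess)

    # Find every link on the page and queue the ones on the same host.
    Nokogiri::HTML(response.body).css('a[href]').each do |a|
      begin
        link = URI.join(uri, a['href'])
      rescue URI::Error
        next
      end
      queue << link if link.host == start.host
    end
  end

  visited.to_a
end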
With help from a gem called Spidr, which handles all of that bookkeeping for us, we can achieve the same with a few lines of code:
require 'spidr'

urls = []

Spidr.site('http://www.layer22.com/') do |spider|
  # Collect every URL Spidr discovers while crawling the site.
  spider.every_url do |url|
    urls << url
  end
end
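Once the crawl finishes, urls is an array of every URL found on the site, so inspecting the result is a one-liner:

puts urls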
Fetching page content is also easy:
require 'spidr'

contents = []

Spidr.site('http://www.layer22.com/') do |spider|
  spider.every_page do |page|
    # page.body is the raw response body of each fetched page.
    contents << page.body
  end
end
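Each page also knows the URL it came from, so if we want to process the bodies later it can be handier to key them by URL. A small variation on the same pattern (the pages hash is just one way to organize the result):

require 'spidr'

pages = {}

Spidr.site('http://www.layer22.com/') do |spider|
  spider.every_page do |page|
    # Key each response body by the URL it was fetched from.
    pages[page.url.to_s] = page.body
  end
end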
What if we are only interested in paragraphs?
require 'spidr'

paragraphs = []

Spidr.site('http://www.layer22.com/') do |spider|
  spider.every_page do |page|
    # Skip non-HTML responses (images, stylesheets, etc.).
    next unless page.content_type =~ %r(text/html)

    paragraphs << page.doc.search('p').map(&:text)
  end
end
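Note that paragraphs ends up as an array of arrays, one inner array per page; call paragraphs.flatten if you want a single flat list. Because page.doc is a Nokogiri document, the same pattern works for any CSS selector. For example, collecting page titles instead (same HTML guard as above):

require 'spidr'

titles = []

Spidr.site('http://www.layer22.com/') do |spider|
  spider.every_page do |page|
    # Only HTML responses have a parseable document.
    next unless page.content_type =~ %r(text/html)

    titles.concat(page.doc.search('title').map(&:text))
  end
end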