A recent article I came across on Hackernoon about web scraping with Node.js reminded me of a project I built early last year while taking a Treehouse course. It does much the same thing, a web content scraper written in Node.js, though I wrote the program differently.
Node.js provides a perfect, dynamic environment to quickly experiment and work with data from the web.
While there are more and more visual scraping products these days (import.io, Spider, Scrapinghub, Apify, Crawly, …), there will always be a need for the simplicity and flexibility of writing one-off scrapers manually.
This post is intended as a tutorial for writing these types of data extraction scripts in Node.js, including some subtle best practices that I’ve learned from writing dozens of these types of crawlers over the years.
— Travis Fischer, Hackernoon, Scraping the Web with Node.js
A basic web scraping program extracts information from the web; in the case of my program, it saves that information to a .csv file like so:
You can take a closer look at the project's GitHub repository here.
I wrote my content scraper a bit differently from how the author of the aforementioned article wrote his. We both used Node.js together with the npm module Cheerio, which makes small scraping jobs quick and easy.
For this type of task, we’ll be leaning heavily on two modules: got, to robustly download raw HTML, and cheerio, which provides a jQuery-inspired API for parsing and traversing those pages.
Cheerio is really great for quick & dirty web scraping where you just want to operate against raw HTML. If you’re dealing with more advanced scenarios where you want your crawler to mimic a real user as close as possible or navigate client-side scripting, you’ll likely want to use Puppeteer.
— Travis Fischer, Hackernoon, Scraping the Web with Node.js