
Content Scraping on the Web


After coming across a recent Hackernoon article about web scraping with Node.js, I'd like to share a project I worked on early last year while taking a course with Treehouse. It does pretty much the same thing (a web content scraper created with Node.js), except that I wrote the program differently.

“Node.js provides a perfect, dynamic environment to quickly experiment and work with data from the web.

While there are more and more visual scraping products these days (import.io, Spider, Scrapinghub, Apify, Crawly, ……), there will always be a need for the simplicity and flexibility of writing one-off scrapers manually.

This post is intended as a tutorial for writing these types of data extraction scripts in Node.js, including some subtle best practices that I’ve learned from writing dozens of these types of crawlers over the years.”

A basic web scraping program extracts information from the web and, in the case of my program, saves that information in a .csv file like so:

Web Scraping CSV file

You can take a closer look at the project's GitHub repository here.

I wrote my content scraper a bit differently from how the author of the aforementioned article wrote his. We both used Node.js together with the npm module Cheerio, which allows for small, quick and easy scraping.

“For this type of task, we’ll be leaning heavily on two modules, got to robustly download raw HTML, and cheerio which provides a jQuery-inspired API for parsing and traversing those pages.

Cheerio is really great for quick & dirty web scraping where you just want to operate against raw HTML. If you’re dealing with more advanced scenarios where you want your crawler to mimic a real user as close as possible or navigate client-side scripting, you’ll likely want to use Puppeteer.”

Here's my project's JavaScript file in full:

"use strict";

const request = require("request"), // Makes HTTP(S) Calls
  cheerio = require("cheerio"), // Traverses the DOM
  json2csv = require("json2csv"), // Converts JSON to CSV
  fs = require("fs"), // Works with the file system on your computer
  startURL = "http://shirts4mike.com/shirts.php", // Starting URL
  dir = "./data/", // Scraped data goes into this folder
  event = new Date(); // Grabs the current date and time

function scrapeURL(url) {
  return new Promise((resolve, reject) => {
    request(url, function(error, response, body) {
      if (error || response.statusCode !== 200) {
        return reject(
          new Error("Unable to connect to " + url)
        );
      }
      const $ = cheerio.load(body),
        baseURL = "http://shirts4mike.com/";
      let shirt = {};
      shirt.title = $("title").text();
      shirt.price = $(".price").text();
      shirt.img = baseURL + $("div.shirt-picture img").attr("src");
      shirt.url = url;
      shirt.time = event.toDateString() + " " + event.toTimeString();
      return resolve(shirt);
    });
  });
}

function getURLs(startingURL) {
  return new Promise((resolve, reject) => {
    request(startingURL, function(error, response, body) {
      if (error) {
        return reject(
          new Error("There was an error while scraping the site.")
        );
      } else if (response.statusCode !== 200) {
        return reject(
          new Error("Unable to connect to " + startingURL)
        );
      }
      const $ = cheerio.load(body),
        baseURL = "http://shirts4mike.com/",
        shirtURLs = [];
      $("a[href^='shirt.php?id=']").each(function() {
        shirtURLs.push(baseURL + $(this).attr("href"));
      });
      return resolve(shirtURLs);
    });
  });
}

function scrapeAll(urls) {
  // Promise.all already resolves with the array of scraped shirts
  return Promise.all(urls.map(scrapeURL));
}

function writeCSV(json) {
  const file = dir + event.toISOString().substring(0, 10) + ".csv",
    csv = json2csv.parse(json);
  if (!fs.existsSync(dir)) {
    fs.mkdirSync(dir);
  }
  // writeFileSync takes no callback; it throws on failure,
  // and the .catch() at the end of the promise chain handles the error
  fs.writeFileSync(file, csv);
}

getURLs(startURL)
  .then(URLs => {
    return scrapeAll(URLs);
  })
  .then(shirts => {
    writeCSV(shirts);
    console.log("Data written to file.");
  })
  .catch(error => {
    fs.appendFileSync(
      "scraper-error.log",
      "[" + event + "] : Error: " + error.message + "\n"
    );
  });
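
To make the last step less of a black box, here is a rough sketch of what converting an array of scraped objects into CSV text involves. It is not how json2csv works internally and is no replacement for it; the toCSV helper and the sample data are hypothetical, and the sketch assumes every object has the same keys as the first one.

```javascript
// Minimal CSV serializer (hypothetical helper, for illustration only).
function toCSV(rows) {
  if (rows.length === 0) return "";
  // Use the keys of the first row as the header line.
  const headers = Object.keys(rows[0]);
  // Quote every field and double any embedded quotes.
  const escape = value =>
    '"' + String(value ?? "").replace(/"/g, '""') + '"';
  const lines = [headers.map(escape).join(",")];
  for (const row of rows) {
    lines.push(headers.map(h => escape(row[h])).join(","));
  }
  return lines.join("\n");
}

// Made-up sample data shaped like the scraper's shirt objects.
const sample = [
  { title: 'Mike\'s "Logo" Shirt', price: "$20" },
  { title: "Plain Shirt", price: "$18" },
];

const csvText = toCSV(sample);
console.log(csvText);
```

Quoting every field keeps commas and quotation marks inside a title from breaking the column layout.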

IBM Acquires Red Hat

IBM and Red Hat

Just months after Microsoft bought the code hosting service GitHub, IBM has decided to buy the open-source software company Red Hat. But why Red Hat?

“Specifically, Red Hat is expected to bring three things to IBM: the world’s largest portfolio of open source technology, their innovative hybrid cloud platform, and a vast open source developer community.

... things may change if the three things Red Hat bring to IBM help the technology giant compete effectively against Amazon, Google, and other competitors, over the long-term.”

Movie Review | Rogue One

Rogue One

In Rogue One: A Star Wars Story, the most beloved movie franchise of all time gets a fresh new addition full of some really great ideas. The collective heroism of the Rebel Alliance sabotage team dubbed "Rogue One," led by Felicity Jones's character Jyn Erso, and their elaborate plan to steal the Empire's blueprint for the Death Star in search of a weakness built into its architecture, all make for a unique addition to the Star Wars cinematic universe.

The protagonists are not nearly as powerful as the ones featured in other Star Wars films, and I think that's one of the picture's main ideas: no Jedi Knights or other elite here, just a rag-tag bunch of Rebel soldiers fighting to save the galaxy. The Force may not be on full display in terms of green and blue colored lightsabers, but it is definitely prevalent throughout the movie, revealing itself in more subtle ways.

Overall Impression:

4/5

Understanding JavaScript's "this"

JavaScript Logo

JavaScript's this works just like the English word in that it is used as a pronoun; in particular, a demonstrative pronoun which points to something specific within a sentence. Furthermore, a pronoun refers back to its antecedent. If the antecedent is singular the pronoun must also be singular in what is called pronoun antecedent agreement. JavaScript's this works in the same way as it points to something specific in your code. Furthermore, it is contextual based on the conditions of the function's invocation. Therefore, in most cases, the value of this is determined by how a function is called. It can't be set by assignment during execution, and it may be different each time the function is called.

Scope and Binding

“The next most common misconception about the meaning of this is that it somehow refers to the function's scope... this does not, in any way, refer to a function's lexical scope. It is true that internally, scope is kind of like an object with properties for each of the available identifiers. But the scope "object" is not accessible to JavaScript code. It's an inner part of the Engine's implementation.”

Kyle Simpson, You Don't Know JS: this & Object Prototypes, GitHub

The best way to learn anything is through examples. Here is an example of what is known as lexical scope. Lexical scope is where functions are executed using the scope chain that was in effect when they were defined.

function foo() {
  const bar = 1;
  return bar;
}

console.log(foo()); // 1

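If you were to attempt to use "this" to implicitly refer to a function's lexical scope, it would not work. Run as a non-strict script in a browser console, the code below logs undefined, because a is a local variable of foo and not a property of the object "this" points to.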

function foo() {
  const a = 1;
  this.bar();
}

function bar() {
  console.log(this.a);
}

foo(); // undefined

You would instead have to use "this" in this manner for it to work.

function foo() {
  console.log(this.bar);
}

bar = 1;
foo(); // 1

This is what's known as the default binding for "this". When a function is invoked with a plain function call, "this" points to the global object, and because bar was assigned without a declaration, it became a property of that object. In strict mode, however, "this" is not replaced with the global object: for a plain function call it stays undefined, so reading a property from it throws an error.

function foo() {
  "use strict";
  console.log(this.bar);
}

bar = 1;
foo(); // TypeError: Cannot read property 'bar' of undefined

In order to understand "this" binding, you have to understand the location in the code where the function is called, not where it's declared. Kyle Simpson has a great example showing the difference between the call-stack and the call-site. "Find the actual call-site (from the call-stack)," Simpson says, "because it's the only thing that matters for this binding."

function baz() {
  // call-stack is: `baz`
  // so, our call-site is in the global scope

  console.log("baz");
  bar(); // <-- call-site for `bar`
}

function bar() {
  // call-stack is: `baz` -> `bar`
  // so, our call-site is in `baz`

  console.log("bar");
  foo(); // <-- call-site for `foo`
}

function foo() {
  // call-stack is: `baz` -> `bar` -> `foo`
  // so, our call-site is in `bar`

  console.log("foo");
}

baz(); // <-- call-site for `baz`

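In this example, baz() is called first from the global scope; baz() then calls bar(), which in turn calls foo(). The "call-site" is the place in the code where a function is invoked, and that is where you should be looking. In this case, the call-site for foo() is within bar(), and the call-site for bar() is within baz().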
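
To connect the call-site back to "this", here is a small sketch in which the same function is invoked from two different call-sites. The counter object and its describe method are hypothetical names made up for this illustration, and the function opts into strict mode so that a plain call leaves "this" undefined.

```javascript
// Hypothetical object used only for illustration.
const counter = {
  label: "counter",
  describe() {
    "use strict"; // a plain call then leaves `this` undefined
    return this === undefined ? "no this" : this.label;
  },
};

// Call-site 1: invoked as a property of `counter`,
// so `this` is the `counter` object.
const viaObject = counter.describe();

// Call-site 2: the very same function, invoked as a plain call.
// Where the function was defined makes no difference.
const detached = counter.describe;
const viaPlainCall = detached();

console.log(viaObject);    // "counter"
console.log(viaPlainCall); // "no this"
```

The function body never changes between the two calls; only the way it is invoked does.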

Rules to Remember

Remember, the object to which "this" points depends on how the function is invoked. "this" may refer to the global object, to a newly created instance, to the object a method is called on, or to a value supplied explicitly through call, apply, or bind; a fat arrow function takes "this" from its enclosing scope. You can figure out the value of "this" by following a few simple rules:

• By default, “this” refers to global object which is global in case of NodeJS and window object in case of browser.

• When a method is called as a property of object, then “this” refers to the parent object.

• When a function is called with “new” operator then “this” refers to the newly created instance.

• When a function is called using call and apply method then “this” refers to the value passed as first argument of call or apply method.

Pavan Kumar, Understanding the “this” Keyword in JavaScript, Zeolearn
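
The rules above can be sketched in a few lines. All of the names here (whoAmI, robot, Gadget, other, timer) are hypothetical and exist only for this illustration. The function opts into strict mode, so the default binding is undefined; in non-strict code it would be the global object instead.

```javascript
function whoAmI() {
  "use strict"; // strict mode: the default binding is undefined
  return this;
}

// 1. Default binding: a plain function call.
const plain = whoAmI();

// 2. Called as a property of an object: `this` is that object.
const robot = { name: "robot", whoAmI };
const asMethod = robot.whoAmI();

// 3. Called with `new`: `this` is the newly created instance.
function Gadget(name) {
  this.name = name;
}
const gadget = new Gadget("gizmo");

// 4. call / apply: `this` is the first argument.
const other = { name: "other" };
const viaCall = whoAmI.call(other);
const viaApply = whoAmI.apply(other, []);

// bind returns a function whose `this` is fixed for good.
const bound = whoAmI.bind(other);
const viaBound = bound.call(robot);

// An arrow function takes `this` from its enclosing scope
// and ignores the value passed to call.
const timer = {
  name: "timer",
  getArrow() {
    return () => this;
  },
};
const arrow = timer.getArrow();
const viaArrow = arrow.call(other);

console.log(plain === undefined); // true
console.log(asMethod === robot);  // true
console.log(gadget.name);         // "gizmo"
console.log(viaCall === other);   // true
console.log(viaApply === other);  // true
console.log(viaBound === other);  // true
console.log(viaArrow === timer);  // true
```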

Photorealism in CGI

Video Game Graphics

If photorealism is the defining finish line in CGI, or computer-generated imagery, then it can be argued that it was reached as far back as 1993 for pre-rendered graphics, with the motion picture Jurassic Park, and in 2004 for real-time rendered graphics, with the video game Half-Life 2. When it comes to physical appearance (the characteristics of the real world that CGI attempts to capture, such as how light reflects off a textured surface while it is moving), computer scientists and graphics card manufacturers are always comparing today's leading CGI with the reality of the world around us. In other words, the general consensus for much of art has been to use reality as the model on which to create worlds of our own.

There is a philosophical term, mimesis, that carries with it a wide range of meanings, including the concepts underlying imitation and mimicry. Both Plato and Aristotle saw in mimesis the representation of nature, including human nature.

“At first glance, mimesis seems to be a stylizing of reality in which the ordinary features of our world are brought into focus by a certain exaggeration, the relationship of the imitation to the object it imitates being something like the relationship of dancing to walking. Imitation always involves selecting something from the continuum of experience, thus giving boundaries to what really has no beginning or end. Mimêsis involves a framing of reality that announces that what is contained within the frame is not simply real. Thus the more "real" the imitation the more fraudulent it becomes.”

In terms of photorealism, CGI has improved by leaps and bounds since the 1990s, with fantastic technology used to mimic aspects of the real world. Shadows grow and recede depending on the distance of the light source, and with the advent of a rendering technology called ray tracing, reflections and refractions in video games are looking more realistic than ever.

As the representation of the natural world in video games becomes more and more convincingly real, what of the aspect of human nature? Has it simply been left behind in the dust despite all the fantastic advancements in other, more superficial aesthetic qualities? Does the concept of the Uncanny Valley still apply today? I believe the answer to that last question is a resounding no. Combined with photorealism techniques, movement recorded with motion capture continues to turn CGI into a very convincing experience, one that is currently blurring the line between what we know as being real and what we perceive to be imaginary.