Web scraping is a common way to gather data from the web for many kinds of applications. One of the most capable tools for it in Node.js is Puppeteer, a browser automation library that provides a high-level API to control Chrome or Chromium. In this tutorial, we will learn how to use Puppeteer to scrape the source code of a webpage and save it to a file.

What is Puppeteer?

Puppeteer is a Node library developed by Google that provides a high-level API to control headless Chrome or Chromium browsers over the DevTools Protocol. It can also be configured to use a full (non-headless) browser.
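
To make the headless versus non-headless choice concrete, here is a minimal sketch of a helper that picks launch options from an environment flag. The helper name buildLaunchOptions and the SCRAPER_DEBUG variable are made up for this illustration; headless and slowMo are options accepted by puppeteer.launch().

```javascript
// Hypothetical helper (not part of Puppeteer): pick launch options
// based on an environment flag named SCRAPER_DEBUG (an assumed name).
function buildLaunchOptions(env = process.env) {
    const debug = env.SCRAPER_DEBUG === "1";
    return {
        // Show a browser window only while debugging
        headless: !debug,
        // Delay each Puppeteer operation (in ms) so a visible run is easier to follow
        slowMo: debug ? 100 : 0,
    };
}

console.log(buildLaunchOptions({ SCRAPER_DEBUG: "1" }));
console.log(buildLaunchOptions({}));
```

The returned object would be passed straight to the launcher, for example puppeteer.launch(buildLaunchOptions()).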

Setting Up Your Project

First, you need to set up your Node.js project. Follow these steps:

1. Initialize a new Node.js project:
mkdir puppeteer-scraper
cd puppeteer-scraper
npm init -y

2. Install Puppeteer. The fs (File System) module is built into Node.js, so it does not need to be installed:

npm install puppeteer
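
If you want to confirm which of the two modules actually has to come from npm, Node.js can report its own built-in modules. A quick check, using only the core module API:

```javascript
// builtinModules lists the modules that ship with Node.js itself.
const { builtinModules } = require("module");

const fsIsBuiltIn = builtinModules.includes("fs");
const puppeteerIsBuiltIn = builtinModules.includes("puppeteer");

console.log("fs built in:", fsIsBuiltIn);
console.log("puppeteer built in:", puppeteerIsBuiltIn);
```

Only modules that are not built in, such as puppeteer, need an npm install.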

Writing the Code

Let’s dive into the code to scrape the source code of a webpage using Puppeteer:

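const puppeteer = require("puppeteer");
const fs = require("fs");

// Load a page in Chromium and write its HTML to a file.
async function getSourceCode(url, outputData) {
    let browser;
    try {
        // headless: false opens a visible window; set it to true for unattended runs
        browser = await puppeteer.launch({ headless: false });

        const page = await browser.newPage();

        // By default this resolves once the page's load event has fired
        await page.goto(url);

        // Full HTML of the page, returned as a string
        const sourceCode = await page.content();

        fs.writeFileSync(outputData, sourceCode, "utf-8");

        console.log(`Saved the source code of ${url} to ${outputData}`);
    } catch (error) {
        // Include the underlying message so failures can be diagnosed
        console.error(`Error getting source code of ${url}:`, error.message);
    } finally {
        // Close the browser even when an earlier step failed
        if (browser) {
            await browser.close();
        }
    }
}

const url = "https://example.com";
const outputData = "source_code.html";

getSourceCode(url, outputData);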
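
The script hard-codes both the URL and the output file name. If you plan to scrape more than one page, deriving the file name from the URL avoids overwriting earlier results. The following is only a sketch: outputPathForUrl is a made-up helper name, and the naming scheme (host plus path, with unsafe characters replaced by underscores) is one possible choice among many.

```javascript
const path = require("path");

// Hypothetical helper: turn a URL into a safe file name such as
// "example.com_docs_intro.html" inside the given directory.
// The query string is ignored because only hostname and pathname are used.
function outputPathForUrl(pageUrl, directory = ".") {
    const { hostname, pathname } = new URL(pageUrl);
    const slug = (hostname + pathname)
        .replace(/[^a-zA-Z0-9.-]+/g, "_") // collapse slashes and other unsafe characters
        .replace(/^_+|_+$/g, "");         // trim leading and trailing underscores
    return path.join(directory, `${slug}.html`);
}

console.log(outputPathForUrl("https://example.com"));
console.log(outputPathForUrl("https://example.com/docs/intro?page=2"));
```

The result could then be passed as the second argument, for example getSourceCode(url, outputPathForUrl(url)).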

Breaking Down the Code

We start by importing the necessary modules. puppeteer is used to control the browser, and fs is used to handle file system operations.

const puppeteer = require("puppeteer");
const fs = require("fs");

Defining the getSourceCode Function

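async function getSourceCode(url, outputData) {
    let browser;
    try {
        // headless: false opens a visible window; set it to true for unattended runs
        browser = await puppeteer.launch({ headless: false });

        const page = await browser.newPage();

        // By default this resolves once the page's load event has fired
        await page.goto(url);

        // Full HTML of the page, returned as a string
        const sourceCode = await page.content();

        fs.writeFileSync(outputData, sourceCode, "utf-8");

        console.log(`Saved the source code of ${url} to ${outputData}`);
    } catch (error) {
        // Include the underlying message so failures can be diagnosed
        console.error(`Error getting source code of ${url}:`, error.message);
    } finally {
        // Close the browser even when an earlier step failed
        if (browser) {
            await browser.close();
        }
    }
}
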
  • Launch the Browser: The function starts by launching a new Chromium browser instance with Puppeteer. The headless: false option is used to open the browser in a visible mode, which is useful for debugging or verifying that the browser is interacting with the webpage correctly. For production purposes, you might set this to true to run the browser in headless mode.
  • Open a New Page: After launching the browser, the function opens a new tab or page where it can navigate to the desired URL.
  • Navigate to the URL: The function instructs the browser to go to the specified URL. Puppeteer’s page.goto(url) method handles the navigation process.
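  • Retrieve the Source Code: By default, page.goto() resolves once the page’s load event has fired. The function then retrieves the HTML with page.content(), which returns the full HTML of the page as a string, including the DOCTYPE. Because this is a serialization of the current DOM, it can differ from the raw response sent by the server when scripts have modified the page.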
  • Save the Source Code to a File: The function then writes the retrieved HTML content to a file. It uses Node.js’s fs.writeFileSync method to save the content to the file specified by the outputData parameter. The file is saved with UTF-8 encoding to ensure proper character representation.
  • Close the Browser: After saving the HTML content, the function closes the browser instance with browser.close() to free up system resources.
  • Error Handling: The function includes error handling to catch and log any issues that occur during the execution of these steps. If an error is encountered, an error message is logged to the console.
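
The steps above can be condensed into a variant that waits for network activity to settle before reading the HTML, and that closes the browser even if a step throws. This is a sketch under assumptions, not a replacement for the script: the name fetchRenderedHtml and the launcher parameter (expected to be the puppeteer module, or any object with a compatible launch() method) are introduced here for illustration. waitUntil and timeout are options of page.goto().

```javascript
// Sketch: fetch a page's rendered HTML with explicit wait settings.
// "launcher" is assumed to be the puppeteer module (or a compatible stand-in).
async function fetchRenderedHtml(launcher, url, options = {}) {
    const { waitUntil = "networkidle2", timeoutMs = 30000 } = options;
    const browser = await launcher.launch({ headless: true });
    try {
        const page = await browser.newPage();
        // networkidle2: navigation counts as finished when there are no more
        // than 2 network connections for at least 500 ms
        await page.goto(url, { waitUntil, timeout: timeoutMs });
        return await page.content();
    } finally {
        // Runs on success and on failure, so no browser process is left behind
        await browser.close();
    }
}
```

With Puppeteer installed, it would be called as fetchRenderedHtml(require("puppeteer"), "https://example.com"), and the returned string could be written to disk with fs.writeFileSync as in the main script. Unlike the main script, this variant lets errors propagate to the caller instead of logging them.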

Running the code

Save the script as index.js in the project directory, then run it from your terminal:

node index.js
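
After the script finishes, source_code.html should appear in the project directory. As an optional check, the sketch below reads the file back and reports its size and whether it contains an html tag. The helper name summarizeSavedPage is made up for this example.

```javascript
const fs = require("fs");

// Hypothetical helper: read a saved page and report basic facts about it.
function summarizeSavedPage(filePath) {
    const html = fs.readFileSync(filePath, "utf-8");
    return {
        bytes: Buffer.byteLength(html, "utf-8"),
        hasHtmlTag: /<html[\s>]/i.test(html),
    };
}

// Only inspect the file if the scraper has already produced it
if (fs.existsSync("source_code.html")) {
    console.log(summarizeSavedPage("source_code.html"));
}
```

An empty file or a missing html tag usually means the navigation failed before the content was written.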

In this tutorial, we’ve learned how to use Puppeteer to scrape the source code of a webpage and save it to a file.

Puppeteer is a powerful tool whose uses go beyond scraping: it can also drive automated tests, take screenshots, generate PDFs, and more.
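
As a brief illustration of the screenshot and PDF features, here is a minimal sketch that works on an already-open page. The function name capturePage and its baseName parameter are assumptions made for this example; page.screenshot() and page.pdf() are Puppeteer methods. Depending on the Puppeteer version, page.pdf() may only work when the browser runs in headless mode.

```javascript
// Sketch: capture a screenshot and a PDF from an already-open Puppeteer page.
// "page" is assumed to be the object returned by browser.newPage().
async function capturePage(page, baseName) {
    const screenshotPath = `${baseName}.png`;
    const pdfPath = `${baseName}.pdf`;
    // fullPage: true captures the whole scrollable page, not just the viewport
    await page.screenshot({ path: screenshotPath, fullPage: true });
    // Render the page as an A4 PDF document
    await page.pdf({ path: pdfPath, format: "A4" });
    return { screenshotPath, pdfPath };
}
```

Inside getSourceCode, it could be called after page.goto(url), for example await capturePage(page, "example").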

By mastering Puppeteer, you can efficiently interact with web pages and extract valuable data for your projects. Feel free to experiment with different configurations and additional features of Puppeteer to enhance your web scraping capabilities.

Happy scraping!

By soorya