In this article, we will walk through a simple Node.js script using Puppeteer to extract and save the source code of a webpage.
Puppeteer is a powerful library that provides a high-level API for controlling headless (or full) Chrome or Chromium browsers. This is particularly useful for web scraping, automated testing, and other browser automation tasks.
Prerequisites
Before we dive into the code, make sure you have Node.js installed on your machine. You can download it from the official website.
Additionally, you need to install Puppeteer. The fs (File System) module is part of the Node.js core modules, so it requires no separate installation.
npm install puppeteer
Let’s write our automation script and break it down to understand how it works:
1. Importing Required Modules
const puppeteer = require("puppeteer");
const fs = require("fs");
We start by importing Puppeteer and the fs module. Puppeteer will allow us to control a browser, and fs will be used to write the source code to a file.
2. Asynchronous Function to Get Source Code
This function takes two parameters:
- url: The URL of the webpage you want to extract the source code from.
- outputData: The path to the file where the source code will be saved.
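async function getSourceCode(url, outputData) {
  let browser;
  try {
    // Launch a visible browser window; use { headless: true } to run without a UI
    browser = await puppeteer.launch({ headless: false });
    const page = await browser.newPage();
    await page.goto(url);
    // page.content() returns the full HTML of the page, including the doctype
    const sourceCode = await page.content();
    fs.writeFileSync(outputData, sourceCode, "utf-8");
    console.log("Successfully saved the source code of the URL");
  } catch (error) {
    console.error("Error getting source code of the URL", error);
  } finally {
    // Close the browser even if one of the steps above failed
    if (browser) {
      await browser.close();
    }
  }
}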
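By default, page.goto resolves once the page's load event fires, so content that JavaScript renders afterwards may be missing from what page.content() returns. The sketch below shows one way to wait longer, assuming the "networkidle2" wait condition and a 30-second timeout suit the target site. The helper name fetchRenderedHtml and its option defaults are illustrative and not part of the original script; it accepts an already-launched browser so it can be reused across pages.

```javascript
// Hypothetical helper (not part of the original script): returns the HTML of
// `url` after waiting for network activity to settle.
// `browser` is a Puppeteer Browser, e.g. the result of puppeteer.launch().
async function fetchRenderedHtml(
  browser,
  url,
  { waitUntil = "networkidle2", timeout = 30000 } = {}
) {
  const page = await browser.newPage();
  try {
    // "networkidle2" waits until at most 2 network connections remain open
    // for 500 ms; `timeout` is in milliseconds.
    await page.goto(url, { waitUntil, timeout });
    return await page.content();
  } finally {
    // Close the tab whether or not navigation succeeded
    await page.close();
  }
}

// Example usage (assumes Puppeteer is installed):
// (async () => {
//   const puppeteer = require("puppeteer");
//   const browser = await puppeteer.launch();
//   const html = await fetchRenderedHtml(browser, "https://example.com");
//   await browser.close();
// })();
```

Whether "networkidle2" is the right condition depends on the page; sites that keep connections open indefinitely may be better served by "domcontentloaded" plus an explicit wait for a known element.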
3. Running the Function
Here, we define the URL and the output file path, and then call the getSourceCode function with these parameters.
const url = "https://example.com";
const outputData = "source_code.html";
getSourceCode(url, outputData);
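Instead of hardcoding the two values, they could be read from the command line. The following is a minimal sketch, assuming the script is invoked as `node script.js <url> <outputFile>`; the helper name parseArgs is hypothetical, and the defaults simply reuse the example values above.

```javascript
// Hypothetical helper: reads the URL and output path from command-line
// arguments, falling back to the article's example values.
function parseArgs(argv) {
  // argv[0] is the node binary and argv[1] is the script path,
  // so user-supplied arguments start at index 2.
  const [url = "https://example.com", outputData = "source_code.html"] =
    argv.slice(2);

  // The URL constructor throws a TypeError if `url` is not a valid
  // absolute URL, which surfaces typos before a browser is launched.
  new URL(url);

  return { url, outputData };
}
```

In the script, this would replace the two constants with `const { url, outputData } = parseArgs(process.argv);` before the call to getSourceCode.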
Complete Source Code
const puppeteer = require("puppeteer");
const fs = require("fs");
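async function getSourceCode(url, outputData) {
  let browser;
  try {
    // Launch a visible browser window; use { headless: true } to run without a UI
    browser = await puppeteer.launch({ headless: false });
    const page = await browser.newPage();
    await page.goto(url);
    // page.content() returns the full HTML of the page, including the doctype
    const sourceCode = await page.content();
    fs.writeFileSync(outputData, sourceCode, "utf-8");
    console.log("Successfully saved the source code of the URL");
  } catch (error) {
    console.error("Error getting source code of the URL", error);
  } finally {
    // Close the browser even if one of the steps above failed
    if (browser) {
      await browser.close();
    }
  }
}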
const url = "https://example.com";
const outputData = "source_code.html";
getSourceCode(url, outputData);
This simple script showcases how you can use Puppeteer to automate the process of extracting source code from a webpage. Puppeteer offers many more features, such as taking screenshots, filling forms, and interacting with web elements, making it a versatile tool for web automation tasks.
Feel free to modify the script to suit your needs, such as handling multiple URLs or adding more complex interactions with the page before extracting the content. Puppeteer’s extensive API documentation can guide you through more advanced use cases.
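As one possible take on the multiple-URL case, the sketch below derives an output file name from each URL and processes the list sequentially. Both fileNameFor and saveAll are hypothetical names introduced for illustration, and the naming scheme is only one reasonable choice. The saveOne argument is assumed to behave like getSourceCode(url, outputData) from this article.

```javascript
// Hypothetical helper: derives an output file name from a URL,
// e.g. "https://example.com/docs/intro" -> "example.com_docs_intro.html".
function fileNameFor(url) {
  const { hostname, pathname } = new URL(url);
  const slug = (hostname + pathname)
    .replace(/[^a-zA-Z0-9.-]+/g, "_") // replace slashes and other unsafe characters
    .replace(/^_+|_+$/g, ""); // trim leading and trailing underscores
  return `${slug}.html`;
}

// Processes the URLs one at a time, so only one save is in flight at once.
// `saveOne` is expected to have the shape (url, outputData) => Promise.
async function saveAll(urls, saveOne) {
  const written = [];
  for (const url of urls) {
    const outputData = fileNameFor(url);
    await saveOne(url, outputData);
    written.push(outputData);
  }
  return written;
}

// Example usage with the function from this article:
// saveAll(["https://example.com", "https://example.com/docs/intro"], getSourceCode);
```

Running the URLs sequentially keeps resource usage predictable; for long lists, reusing a single browser instance across pages would likely be faster than launching one per URL.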
Happy coding!