Web scraping, the process of extracting data from websites, can provide valuable insights and information for various purposes, from data analysis to content aggregation.

In this tutorial, we’ll explore how to use Puppeteer, a Node.js library, to scrape a webpage, specifically focusing on extracting images and links.

By the end of this guide, you’ll have a clear understanding of how to harness the power of Puppeteer for web scraping tasks.

The complete final code is given at the end of the article.

1. Introduction to Web Scraping with Puppeteer

Puppeteer, developed by the Chrome team, is a popular Node.js library that provides a high-level API for controlling Chrome or Chromium over the DevTools Protocol, in either headless or full (headed) mode.

Its capabilities extend beyond automation to include web scraping tasks like extracting data, images, and links from web pages.

2. Setting Up Puppeteer

Before diving into the code, make sure you have Node.js installed on your system.

You can install Puppeteer by running the following command:

npm install puppeteer
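
Installing Puppeteer this way also downloads a compatible build of Chromium by default. As a quick sanity check, the minimal sketch below (the file name check.js is arbitrary) launches the bundled browser and prints its version:

const puppeteer = require("puppeteer");

(async () => {
    // Launch the bundled browser headlessly and report its version.
    const browser = await puppeteer.launch();
    console.log(await browser.version());
    await browser.close();
})();

Run it with node check.js; if a version string is printed, the installation works.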

3. Navigating to a Webpage

In this example, we’ll launch a non-headless (visible) browser instance and navigate to “https://google.com”.

const puppeteer = require("puppeteer");

async function run() {
    const browser = await puppeteer.launch({ headless: false });
    const page = await browser.newPage();

    await page.goto("https://google.com");
    // Rest of the code...
}

run();
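
By default, page.goto resolves once the page fires its load event. If you later scrape pages that keep fetching content after load, Puppeteer’s waitUntil option is worth knowing; the variation below is an optional sketch, not something the rest of this tutorial depends on:

// Treat navigation as finished once there have been no more than
// 2 network connections for at least 500 ms.
await page.goto("https://google.com", { waitUntil: "networkidle2" });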

4. Extracting Images

Puppeteer allows us to extract images from the webpage.

We’ll use the $$eval function to select all <img> elements and map them to an array of objects containing their src and alt attributes.

// ...
const images = await page.$$eval("img", (elements) =>
    elements.map((element) => ({
        src: element.src,
        alt: element.alt,
    }))
);
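
One caveat: sites that lazy-load images sometimes leave src empty and keep the real URL in a data-src attribute. If you run into that, a variation along these lines can help (the data-src fallback is an assumption about the target site’s markup, not something Google’s homepage requires):

// Fall back to data-src for lazy-loaded images, then drop empty entries.
const lazyAwareImages = await page.$$eval("img", (elements) =>
    elements
        .map((element) => ({
            src: element.src || element.dataset.src || "",
            alt: element.alt,
        }))
        .filter((image) => image.src)
);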

5. Extracting Links

Similarly, we’ll use the $$eval function to extract links from the webpage.

This time, we’ll map each <a> element to an object containing the href and text content.

// ...
const links = await page.$$eval("a", (elements) =>
    elements.map((element) => ({
        href: element.href,
        text: element.textContent,
    }))
);
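
Pages often contain anchors without an href and many repeats of the same URL. If you want each URL only once, a small optional post-processing step like this cleans up the result:

// Drop anchors without an href, then keep one entry per unique href.
const uniqueLinks = [
    ...new Map(
        links
            .filter((link) => link.href)
            .map((link) => [link.href, link])
    ).values(),
];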

6. Displaying the Results

Now that we’ve extracted the images and links, let’s display the results along with their counts.

// ...
const imageCount = images.length;
const linkCount = links.length;

const output = JSON.stringify({ images, links, imageCount, linkCount });
console.log(output);
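
The single-line JSON above is compact but hard to read. JSON.stringify accepts an indentation argument, and Node’s built-in fs module can persist the result to disk; the results.json file name below is just an example:

const fs = require("fs");

// Pretty-print with 2-space indentation and save to a file.
const pretty = JSON.stringify({ images, links, imageCount, linkCount }, null, 2);
fs.writeFileSync("results.json", pretty);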

7. Final Completed Code

The final complete code is shared below for your reference.

const puppeteer = require("puppeteer");

async function run() {
    const browser = await puppeteer.launch({ headless: false });
    const page = await browser.newPage();

    // Navigate to the page
    await page.goto("https://google.com");

    // Extract images
    const images = await page.$$eval("img", (elements) =>
        elements.map((element) => ({
            src: element.src,
            alt: element.alt,
        }))
    );

    // Extract links
    const links = await page.$$eval("a", (elements) =>
        elements.map((element) => ({
            href: element.href,
            text: element.textContent,
        }))
    );

    const imageCount = images.length;
    const linkCount = links.length;

    // Serialize the results and log them
    const output = JSON.stringify({ images, links, imageCount, linkCount });
    console.log(output);

    // Close the browser
    await browser.close();
}

run();
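
One practical refinement: if any step above throws (a navigation timeout, for instance), execution never reaches browser.close() and a stray browser process is left behind. A more defensive variant, sketched below, guarantees cleanup:

async function run() {
    const browser = await puppeteer.launch({ headless: false });
    try {
        const page = await browser.newPage();
        await page.goto("https://google.com");
        // ...extraction steps from above...
    } finally {
        // Runs whether scraping succeeded or threw.
        await browser.close();
    }
}

run().catch((error) => {
    console.error(error);
    process.exit(1);
});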

Puppeteer simplifies the process of web scraping by providing a user-friendly API to control browsers and extract data from web pages.

In this tutorial, we explored how to use Puppeteer to navigate to a webpage, extract images and links, and display the results.

Armed with this knowledge, you can expand your web scraping capabilities and automate data extraction tasks with ease.

By combining Puppeteer’s capabilities with your creative thinking, you can unlock a world of possibilities for data collection, analysis, and application. Happy scraping!

By soorya