Web scraping, the process of extracting data from websites, can provide valuable insights and information for various purposes, from data analysis to content aggregation.
In this tutorial, we’ll explore how to use Puppeteer, a Node.js library, to scrape a webpage, specifically focusing on extracting images and links.
By the end of this guide, you’ll have a clear understanding of how to harness the power of Puppeteer for web scraping tasks.
The final completed code is given at the end of the article.
1. Introduction to Web Scraping with Puppeteer
Puppeteer, developed by the Chrome team, is a popular Node.js library that provides a high-level API for controlling headless or full (headed) Chrome and Chromium browsers over the DevTools Protocol.
Its capabilities extend beyond automation to include web scraping tasks like extracting data, images, and links from web pages.
2. Setting Up Puppeteer
Before diving into the code, make sure you have Node.js installed on your system.
You can install Puppeteer by running the following command:
npm install puppeteer
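Depending on your Puppeteer version, this step also downloads a compatible browser binary. To confirm the installation worked, you can run a quick smoke test that launches a browser, prints its version, and closes it. This is just a sketch to verify your setup, not part of the tutorial's scraper:

const puppeteer = require("puppeteer");

// Smoke test: launch a headless browser, print its version, and close it.
(async () => {
  const browser = await puppeteer.launch();
  console.log("Browser version:", await browser.version());
  await browser.close();
})();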
3. Navigating to a Webpage
In this example, we’ll launch a visible (non-headless) browser instance and navigate to "http://google.com".
const puppeteer = require("puppeteer");

async function run() {
  const browser = await puppeteer.launch({ headless: false });
  const page = await browser.newPage();
  await page.goto("http://google.com");
  // Rest of the code...
}

run();
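By default, page.goto resolves once the page fires its load event. For pages that keep loading content dynamically, you may prefer to wait until network activity settles before scraping. A minimal variant using Puppeteer's waitUntil option:

// Consider navigation finished once there have been no more than
// 2 network connections for at least 500 ms (useful for dynamic pages).
await page.goto("http://google.com", { waitUntil: "networkidle2" });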
4. Extracting Images
Puppeteer allows us to extract images from the webpage. We’ll use the $$eval function to select all <img> elements and map them to an array of objects holding their src and alt attributes.
// ...
const images = await page.$$eval("img", (elements) =>
  elements.map((element) => ({
    src: element.src,
    alt: element.alt,
  }))
);
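On real pages, some images have empty or lazy-loaded src attributes. If you only want images with a resolved URL, a small filter takes care of it; nonEmptyImages is an illustrative name, not part of the tutorial's final code:

// Keep only images whose src attribute resolved to a URL.
const nonEmptyImages = images.filter((image) => image.src);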
5. Extracting Links
Similarly, we’ll use the $$eval function to extract links from the webpage. This time, we’ll map each <a> element to an object containing its href and text content.
// ...
const links = await page.$$eval("a", (elements) =>
  elements.map((element) => ({
    href: element.href,
    text: element.textContent,
  }))
);
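Anchor text often includes surrounding whitespace, and some anchors have no href at all. If you want tidier output, you can trim the text and drop empty links as an optional refinement; cleanedLinks is an illustrative name for this sketch:

// Optional cleanup: trim anchor text and drop links without an href.
const cleanedLinks = links
  .map((link) => ({ href: link.href, text: link.text.trim() }))
  .filter((link) => link.href);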
6. Displaying the Results
Now that we’ve extracted the images and links, let’s display the results along with their counts.
// ...
const imageCount = images.length;
const linkCount = links.length;
const output = JSON.stringify({ images, links, imageCount, linkCount });
console.log(output);
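If you’d rather inspect the results in a file than in the console, Node’s built-in fs module can persist them. A minimal sketch, assuming pretty-printed JSON is wanted and using "results.json" as an arbitrary filename:

const fs = require("fs");

// Write the scraped data to disk as indented, human-readable JSON.
fs.writeFileSync("results.json", JSON.stringify({ images, links }, null, 2));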
7. Final Completed Code
The final complete code is shared below for your reference.
const puppeteer = require("puppeteer");

async function run() {
  const browser = await puppeteer.launch({ headless: false });
  const page = await browser.newPage();

  // Navigate to the page
  await page.goto("http://google.com");

  // Extract images
  const images = await page.$$eval("img", (elements) =>
    elements.map((element) => ({
      src: element.src,
      alt: element.alt,
    }))
  );

  // Extract links
  const links = await page.$$eval("a", (elements) =>
    elements.map((element) => ({
      href: element.href,
      text: element.textContent,
    }))
  );

  // Display the results along with their counts
  const imageCount = images.length;
  const linkCount = links.length;
  const output = JSON.stringify({ images, links, imageCount, linkCount });
  console.log(output);

  // Close the browser
  await browser.close();
}

run();
Puppeteer simplifies the process of web scraping by providing a user-friendly API to control browsers and extract data from web pages.
In this tutorial, we explored how to use Puppeteer to navigate to a webpage, extract images and links, and display the results.
Armed with this knowledge, you can expand your web scraping capabilities and automate data extraction tasks with ease.
By combining Puppeteer’s capabilities with your creative thinking, you can unlock a world of possibilities for data collection, analysis, and application. Happy scraping!