1 of 5: Scraping Website URLs.

Or: Collecting Content Locations.

Scraping URLs | Scraping Tutorial | Scraping Data | Making Embeddings | Using Embeddings

Originally published: Tuesday 19th March 2024.

TL;DR.

This post is a guide to running a Node.js script for scraping URLs. It covers installing LXD (LinuX Daemon), creating and configuring an LXC, and using NVM (Node Version Manager) to install Node.js and NPM. The tutorial also delves into web scraping with Axios and Cheerio, trawling the MariaDB website as a practical example. I underscore the benefits of using containers for experimental development and isolated deployment, emphasizing the efficiency and flexibility they offer to coders.

Attributions:

https://ubuntu.com/tutorials/how-to-run-docker-inside-lxd-containers.

An Introduction.

In this post, I show how to deploy a Node.js script designed for scraping URLs.

The Big Picture.

I use a number of isolation technologies, including LXC/LXD, Distrobox, Docker, venv, and Miniconda. I've already written about the magic of VMs and containers, and I find isolation technologies really useful for app deployment and coding experiments. Developers need isolation technologies to maintain a stable (host) distro while minimising interference from other running processes.

Also, I use an imaging solution called Clonezilla that backs up my (host) system. (Truth be told, I'm not paranoid enough.)

The faster I recover from disaster, the quicker I can get back to building stuff.

Prerequisites.

  • A Debian-based Linux distro (I use Ubuntu).

Updating my System.

  • From the (base) terminal, I update my (base) system:
sudo apt clean && \
sudo apt update && \
sudo apt dist-upgrade -y && \
sudo apt --fix-broken install && \
sudo apt autoclean && \
sudo apt autoremove -y

What are LXD and LXC?

LXD (LinuX Daemon) is a container manager for creating and managing LXCs (LinuX Containers). As a background service, LXD can automatically start containers when the host system boots.

LXCs (LinuX Containers) are isolated, OS-level virtualizations which, for efficiency, use the Linux kernel of the host system. LXCs are virtual environments whose system processes cannot affect other containers, or the host system, without running specific commands.

I ensure that LXD is installed (lxc --version) before continuing with this post.

Setting Up an LXC.

  • I list the existing containers:
lxc ls

NOTE: If this command fails, then I will install LXD before continuing with this post.

  • I launch a new container called Scrapings:
lxc launch ubuntu:22.04 Scrapings
  • I bash into the container:
lxc exec Scrapings -- bash
  • I update and upgrade the container:
sudo apt clean && \
sudo apt update && \
sudo apt dist-upgrade -y && \
sudo apt --fix-broken install && \
sudo apt autoclean && \
sudo apt autoremove -y

Adding a User Account to the LXC.

  • From the container terminal, I add a new user:
adduser yt
  • I add the new user to the sudo group:
usermod -aG sudo yt

NOTE: usermod lets me (-a)ppend the sudo (-G)roup to the yt account.

  • I exit the container:
exit

NOTE: I exit the root account so that I can log in to the yt account.

Setting the LXC Home Directory.

  • From the host terminal, I login to the container with the yt account:
lxc exec Scrapings -- su yt

NOTE: At the moment, each login lands in the /root directory. This section addresses the issue by adding a command to .bashrc so that every login switches to the yt home directory (~).

  • I use the Nano text editor to open the .bashrc file:
sudo nano ~/.bashrc
  • I copy the following, add it (CTRL + SHIFT + V) to the bottom (CTRL + END) of the .bashrc file, save (CTRL + S) the changes, and exit (CTRL + X) Nano:
cd ~

Hardening the LXC.

  • From the container terminal, I use the Nano text editor to open the sshd_config file:
sudo nano /etc/ssh/sshd_config
  • I copy the following, add it (CTRL + SHIFT + V) to the bottom (CTRL + END) of the sshd_config file, save (CTRL + S) the changes, and exit (CTRL + X) Nano:
PasswordAuthentication no
PermitRootLogin no
Protocol 2
Port 2424
  • I restart the SSH service:
sudo systemctl restart ssh
  • I reboot the container:
sudo reboot

NOTE: Within the container, I can also install (or enable) UFW, Fail2Ban, and CrowdSec.

What are Node.js, NPM, and NVM?

Node.js is a free JavaScript (JS) server runtime. It runs JS code as single-threaded, non-blocking, asynchronous programs, which makes it very memory efficient. Node.js can perform many system-level tasks, but it is commonly used to run JS servers.
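
As a minimal, standalone sketch of that non-blocking behaviour (my own illustration, not part of the scraper), the script below schedules a timer and carries straight on; the callback only fires later, without ever blocking the single thread:
console.log('start');
// Queue a callback for roughly one second from now; scheduling it does not block.
setTimeout(() => console.log('timer fired after about 1 second'), 1000);
console.log('end (logged immediately, before the timer callback)');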

NPM (Node Package Manager) is the world's largest software registry. Open source developers use it to share packages with each other. Many organizations use NPM to manage private development as well. Most developers interact with NPM using the CLI (Command Line Interface), and it usually ships with Node.js.

NVM stands for Node Version Manager. It is used to switch between versions of Node.js. NVM works on any POSIX-compliant shell (sh, dash, ksh, zsh, bash) and runs on Linux distros, macOS, and Windows WSL.

For more details, see:

https://nodejs.org/en/learn/getting-started/introduction-to-nodejs,

https://docs.npmjs.com/about-npm, and

https://github.com/nvm-sh/nvm#intro.

I ensure that Node.js is installed (node -v) before continuing with this post.

Testing for Node.

  • From the (base) terminal, I login to the Scrapings container:
lxc exec Scrapings -- su yt
  • I confirm the Node installation:
node -v

NOTE: If this command fails, then I will install Node.js and NPM with NVM before continuing with this post.

What is Web URL Scraping?

URLs (uniform resource locators) describe where specific assets (webpages, email services, downloads, etc.) are found. They are usually fed into the address bar of a browser, either manually or via a link on a webpage. URL scraping is the process of finding all the URLs at a specified domain and then saving the results to a file or database.
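
As an aside, hrefs found in <a> tags are often relative, so joining them to a base address with plain string concatenation can produce malformed URLs. A minimal sketch (my own illustration, assuming a hypothetical href value) of resolving an href with Node's built-in URL class:
const base = 'https://mariadb.com/kb/en'; // the page being scraped
const href = '/kb/en/documentation/'; // hypothetical href taken from an <a> tag
// new URL() resolves the href against the base address.
console.log(new URL(href, base).href); // https://mariadb.com/kb/en/documentation/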

Creating the Web URL Scraper Script.

  • From the container terminal, I use NPM to install Axios and Cheerio:
npm install axios cheerio
  • I use the Nano text editor to create the web-urls.js JavaScript file:
sudo nano web-urls.js
  • I copy the following, paste (CTRL + SHIFT + V) it into the file, save (CTRL + S) the changes, and exit (CTRL + X) Nano:
const fs = require('fs');
const axios = require('axios');
const cheerio = require('cheerio');

// Fetch the page once, then stagger the file writes so the output grows gradually.
const scrapeLinks = async (url) => {
  const response = await axios.get(url);
  const $ = cheerio.load(response.data);

  let counter = 0;

  // Walk every <a> tag on the page and queue a delayed write for its href.
  $('a').each((index, element) => {
    setTimeout(() => {
      const link = $(element).attr('href');
      if (!link) return; // skip anchors that have no href attribute
      fs.appendFile('web-urls.txt', 'https://mariadb.com/kb/en' + link + '\n', (err) => {
        if (err) throw err;
        counter++;
        console.log(`Link number ${counter} saved!`);
      });
    }, index * 3000); // 3,000 ms stagger per link
  });
};

scrapeLinks('https://mariadb.com/kb/en');

NOTE 1: There is a 3-second (3,000 ms) pause between each link being written to web-urls.txt. (The script only makes one HTTP request to the page itself; I keep the pacing habit anyway so that my scraping is never mistaken for a DoS attack.)

NOTE 2: The MariaDB website is used as an example.
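
NOTE 3: If the script were later extended to fetch each collected link (something this post does not do), an actual per-request delay would look roughly like the hypothetical sketch below, where a sleep helper of my own pauses between sequential axios.get calls:
const axios = require('axios');
// Hypothetical helper: resolve a promise after the given number of milliseconds.
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));
// Fetch a list of URLs one at a time, pausing 3 seconds between requests.
const fetchPolitely = async (urls) => {
  for (const url of urls) {
    const response = await axios.get(url);
    console.log(`${url} returned ${response.status}`);
    await sleep(3000);
  }
};
// Hypothetical usage:
// fetchPolitely(['https://mariadb.com/kb/en', 'https://mariadb.com/kb/en/documentation/']);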

Running the Web URL Scraper Script.

  • From the container terminal, I run web-urls.js:
node web-urls.js

Collecting the Results.

  • When web-urls.js completes, I clear the container terminal:
clear
  • I display the results from web-urls.txt in the container terminal:
cat web-urls.txt
  • I use my mouse to select all the displayed links, and then press CTRL + SHIFT + C to copy everything to the clipboard.

  • In a desktop editor like VS Code, I paste (CTRL + V) the results.
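
As an optional extra (my own addition, not part of the original steps), a short Node.js sketch can deduplicate and sort the collected links before they go anywhere else; the web-urls-unique.txt output name is hypothetical:
const fs = require('fs');
// Read the collected links, drop blank lines and duplicates, and sort what remains.
const lines = fs.readFileSync('web-urls.txt', 'utf8').split('\n').filter(Boolean);
const unique = [...new Set(lines)].sort();
fs.writeFileSync('web-urls-unique.txt', unique.join('\n') + '\n');
console.log(`Kept ${unique.length} unique links out of ${lines.length}.`);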

The Results.

To explore URL scraping, I started by setting up an LXC on Ubuntu, installing Node.js and NPM using NVM, writing a web URL scraping script in JavaScript, and using Node.js to scrape a website for its URLs. Using Axios and Cheerio for scraping demonstrated the power and simplicity of using Node.js for web scraping tasks. This guide served as a technical walk-through and also highlighted the versatility of using containers for development tasks. It is evident that LXC containers offer a robust platform for developing and deploying many kinds of solutions, opening up a world of safe experimentation.

In Conclusion.

In the realm of web development, efficiency and isolation are key. This is where Linux Containers (LXC) offer a lightweight alternative to traditional VMs. I've explored URL scraping within an LXC, leveraging the power of Node.js.

Why LXC? It provides an isolated environment for running applications, ensuring that your scraping activities don't interfere with your system or other containers.

Setup was a breeze:

  • Starting with a fresh LXC on Ubuntu, I ventured into the installation of Node.js and NPM via NVM.

  • The magic happened when I combined Axios and Cheerio for efficient web scraping, all running smoothly within my container.

The result? A streamlined process that not only simplifies development tasks but also opens up a sandbox for safe experimentation. Containers like LXC are invaluable tools for developers, offering a blend of flexibility, efficiency, and security.

Have you tried running development tasks within a container? What has your experience been like?

Let's discuss in the comments below!

Until next time: Be safe, be kind, be awesome.

#WebScraping #WebScrapingTutorial #NodeJs

#NodeJsDevelopment #LinuxContainers #Containerization

#LXD #LXC #NVM #Axios #Cheerio #MariaDB

#CodingExperiments #URLScraper #JavaScriptScraping

#TechGuide #Development #DevelopmentIsolation

#SecureCoding #TechTalk #TechInnovation

#ContainerizedDevelopment #Ubuntu

Scraping URLs | Scraping Tutorial | Scraping Data | Making Embeddings | Using Embeddings