Speed up your web crawler by parallelizing it with Ray
In this example we’ll quickly demonstrate how to build a simple web scraper in Python and parallelize it with Ray Tasks, with minimal code changes.
To run this example locally on your machine, please first install ray and beautifulsoup with

pip install "beautifulsoup4==4.11.1" "ray>=2.2.0"
First, we’ll define a function called find_links which takes a starting page (start_url) to crawl, and we’ll take the Ray documentation as an example of such a starting point.
Our crawler simply extracts all available links from the starting URL that contain a given base_url (e.g. in our example we only want to follow links on http://docs.ray.io, not any external links).
The find_links function is then called recursively with all the links we found this way, until a certain depth is reached.
To extract the links from HTML elements on a site, we define a little helper function called extract_links, which takes care of handling relative URLs properly and sets a limit on the number of links returned from a site (max_results) to control the runtime of the crawler more easily.
Here’s the full implementation:
import requests
from bs4 import BeautifulSoup


def extract_links(elements, base_url, max_results=100):
    links = []
    for e in elements:
        url = e["href"]
        # Turn relative links into absolute ones by prepending the base URL.
        if "https://" not in url:
            url = base_url + url
        # Only keep links that stay on the base site.
        if base_url in url:
            links.append(url)
    # Cap the number of links to keep the crawl's runtime manageable.
    return set(links[:max_results])
def find_links(start_url, base_url, depth=2):
    if depth == 0:
        return set()
    page = requests.get(start_url)
    soup = BeautifulSoup(page.content, "html.parser")
    elements = soup.find_all("a", href=True)
    links = extract_links(elements, base_url)
    for url in links:
        new_links = find_links(url, base_url, depth - 1)
        links = links.union(new_links)
    return links
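Before crawling the real site, we can sanity-check extract_links on a small, hand-written HTML snippet. The snippet and its links below are purely illustrative and not part of the Ray documentation itself:

html = """
<a href="tune/index.html">Tune</a>
<a href="https://docs.ray.io/en/latest/serve/index.html">Serve</a>
<a href="https://github.com/ray-project/ray">GitHub</a>
"""
elements = BeautifulSoup(html, "html.parser").find_all("a", href=True)
print(extract_links(elements, "https://docs.ray.io/en/latest/"))

The relative Tune link gets the base URL prepended, the absolute Serve link is kept as-is, and the external GitHub link is filtered out.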
Let’s define a starting and base URL and crawl the Ray docs to a depth
of 2.
base = "https://docs.ray.io/en/latest/"
docs = base + "index.html"
%time len(find_links(docs, base))
CPU times: user 19.3 s, sys: 340 ms, total: 19.7 s
Wall time: 25.8 s
591
As you can see, crawling the documentation root recursively like this returns a
total of 591
pages and the wall time comes in at around 25 seconds.
Crawling pages can be parallelized in many ways.
Probably the simplest approach is to start with multiple starting URLs and call find_links in parallel for each of them.
We can do this with Ray Tasks in a straightforward way.
We simply use the ray.remote decorator to wrap the find_links function in a task called find_links_task like this:
import ray


@ray.remote
def find_links_task(start_url, base_url, depth=2):
    return find_links(start_url, base_url, depth)
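Note that you don’t need to start Ray explicitly here; the first .remote(...) call initializes it on your machine automatically. If you prefer to start it explicitly, for example to cap the number of CPUs the crawler may use, a minimal optional setup could look like the line below (the num_cpus value is just an illustrative assumption):

ray.init(num_cpus=8, ignore_reinit_error=True)  # optional; 8 is an arbitrary example value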
To use this task to kick off a parallel call, the only thing you have to do is use find_links_task.remote(...) instead of calling the underlying Python function directly.
Here’s how you run six crawlers in parallel. The first three (redundantly) crawl docs.ray.io again, while the other three crawl the main entry points of the Ray RLlib, Tune, and Serve libraries, respectively:
links = [find_links_task.remote(f"{base}{lib}/index.html", base)
         for lib in ["", "", "", "rllib", "tune", "serve"]]
%time for res in ray.get(links): print(len(res))
591
591
591
105
204
105
CPU times: user 65.5 ms, sys: 47.8 ms, total: 113 ms
Wall time: 27.2 s
This parallel run crawls around four times the number of pages in roughly the same time as the initial, sequential run.
Note the use of ray.get in the timed run to retrieve the results from Ray (the remote call promise gets resolved with get).
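If you want to see this non-blocking behaviour in isolation, you can inspect a single reference. The small snippet below is just for illustration and reuses the docs and base variables from above:

ref = find_links_task.remote(docs, base)  # returns an ObjectRef immediately
print(ref)                                # prints the reference, not the result
print(len(ray.get(ref)))                  # blocks until the crawl finishes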
Of course, there are much smarter ways to create a crawler and efficiently parallelize it, and this example gives you a starting point to work from.
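For instance, one possible next step is to parallelize the recursion itself: crawl the starting page once, then launch one Ray task per first-level link instead of following them sequentially. The sketch below assumes the functions defined above are available; crawl_page and find_links_parallel are hypothetical names introduced here for illustration:

@ray.remote
def crawl_page(url, base_url, depth):
    # Each task crawls one subtree sequentially, reusing find_links from above.
    return find_links(url, base_url, depth)


def find_links_parallel(start_url, base_url, depth=2):
    # Crawl the starting page once to collect its first-level links.
    page = requests.get(start_url)
    soup = BeautifulSoup(page.content, "html.parser")
    links = extract_links(soup.find_all("a", href=True), base_url)
    # Fan out one Ray task per first-level link, then gather the results.
    futures = [crawl_page.remote(url, base_url, depth - 1) for url in links]
    for new_links in ray.get(futures):
        links = links.union(new_links)
    return links

This keeps the per-task work identical to the sequential version while letting Ray schedule the first-level crawls across your available cores.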