{ "cells": [ { "cell_type": "markdown", "source": [ "# Speed up your web crawler by parallelizing it with Ray" ], "metadata": { "collapsed": false } }, { "cell_type": "markdown", "source": [ "In this example we'll quickly demonstrate how to build a simple web scraper in Python and\n", "parallelize it with Ray Tasks with minimal code changes.\n", "\n", "To run this example locally on your machine, please first install `ray` and `beautifulsoup4` with\n", "\n", "```\n", "pip install \"beautifulsoup4==4.11.1\" \"ray>=2.2.0\"\n", "```\n", "\n", "First, we'll define a function called `find_links` which takes a starting page (`start_url`) to crawl,\n", "and we'll take the Ray documentation as an example of such a starting point.\n", "Our crawler simply extracts all available links from the starting URL that contain a given `base_url`\n", "(e.g. in our example we only want to follow links on `https://docs.ray.io`, not any external links).\n", "The `find_links` function is then called recursively with all the links we found this way, until a\n", "certain depth is reached.\n", "\n", "To extract the links from HTML elements on a site, we define a little helper function called\n", "`extract_links`, which turns relative URLs into absolute ones by prefixing `base_url` and sets a limit on the\n", "number of links returned from a site (`max_results`) to control the runtime of the crawler more easily.\n", "\n", "Here's the full implementation:" ], "metadata": { "collapsed": false } }, { "cell_type": "code", "execution_count": 154, "outputs": [], "source": [ "import requests\n", "from bs4 import BeautifulSoup\n", "\n", "def extract_links(elements, base_url, max_results=100):\n", "    links = []\n", "    for e in elements:\n", "        url = e[\"href\"]\n", "        if \"https://\" not in url:\n", "            url = base_url + url\n", "        if base_url in url:\n", "            links.append(url)\n", "    return set(links[:max_results])\n", "\n", "\n", "def find_links(start_url, base_url, depth=2):\n", "    if depth == 0:\n", "        return set()\n", "\n", "    page 
= requests.get(start_url)\n", "    soup = BeautifulSoup(page.content, \"html.parser\")\n", "    elements = soup.find_all(\"a\", href=True)\n", "    links = extract_links(elements, base_url)\n", "\n", "    for url in links:\n", "        new_links = find_links(url, base_url, depth-1)\n", "        links = links.union(new_links)\n", "    return links" ], "metadata": { "collapsed": false } }, { "cell_type": "markdown", "source": [ "Let's define a starting URL and a base URL and crawl the Ray docs to a `depth` of 2." ], "metadata": { "collapsed": false } }, { "cell_type": "code", "execution_count": 162, "outputs": [], "source": [ "base = \"https://docs.ray.io/en/latest/\"\n", "docs = base + \"index.html\"" ], "metadata": { "collapsed": false } }, { "cell_type": "code", "execution_count": 163, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "CPU times: user 19.3 s, sys: 340 ms, total: 19.7 s\n", "Wall time: 25.8 s\n" ] }, { "data": { "text/plain": "591" }, "execution_count": 163, "metadata": {}, "output_type": "execute_result" } ], "source": [ "%time len(find_links(docs, base))" ], "metadata": { "collapsed": false } }, { "cell_type": "markdown", "source": [ "As you can see, crawling the documentation root recursively like this returns a\n", "total of `591` pages and the wall time comes in at about 26 seconds.\n", "\n", "Crawling pages can be parallelized in many ways.\n", "Probably the simplest way is to start with multiple starting URLs and call\n", "`find_links` in parallel for each of them.\n", "We can do this with [Ray Tasks](https://docs.ray.io/en/latest/ray-core/tasks.html) in a straightforward way.\n", "We use the `ray.remote` decorator to wrap the `find_links` function in a task called `find_links_task` like this:" ], "metadata": { "collapsed": false } }, { "cell_type": "code", "execution_count": 157, "outputs": [], "source": [ "import ray\n", "\n", "@ray.remote\n", "def find_links_task(start_url, base_url, depth=2):\n", "    return find_links(start_url, base_url, 
depth)" ], "metadata": { "collapsed": false } }, { "cell_type": "markdown", "source": [ "To use this task to kick off a parallel call, the only thing you have to do is use\n", "`find_links_task.remote(...)` instead of calling the underlying Python function directly.\n", "\n", "Here's how you run six crawlers in parallel. The first three (redundantly) crawl\n", "`docs.ray.io` again, and the other three crawl the main entry points of the Ray RLlib,\n", "Tune, and Serve libraries, respectively:" ], "metadata": { "collapsed": false } }, { "cell_type": "code", "execution_count": 160, "outputs": [], "source": [ "links = [find_links_task.remote(f\"{base}{lib}/index.html\", base)\n", "         for lib in [\"\", \"\", \"\", \"rllib\", \"tune\", \"serve\"]]" ], "metadata": { "collapsed": false } }, { "cell_type": "code", "execution_count": 161, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "591\n", "591\n", "105\n", "204\n", "105\n", "CPU times: user 65.5 ms, sys: 47.8 ms, total: 113 ms\n", "Wall time: 27.2 s\n" ] } ], "source": [ "%time for res in ray.get(links): print(len(res))" ], "metadata": { "collapsed": false } }, { "cell_type": "markdown", "source": [ "This parallel run crawls around four times the number of pages in roughly the same time as the initial, sequential run.\n", "Note the use of `ray.get` in the timed run to retrieve the results from Ray (each `remote` call returns an object reference, similar to a future, which `ray.get` resolves).\n", "\n", "Of course, there are much smarter ways to create a crawler and efficiently parallelize it, but this example\n", "gives you a starting point to work from." 
], "metadata": { "collapsed": false } } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 2 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython2", "version": "2.7.6" } }, "nbformat": 4, "nbformat_minor": 0 }