Crawling the Web


Crawling the Web for Fun & Projects[edit]

Introduction & Background[edit]

At a recent #infra-meetup held at Noisebridge, the topic of crawling or spidering the web for content was discussed.

In this brief article I'll talk about various strategies for downloading web pages and scraping their content. My primary backend language is Python, and that's how I'll be presenting my sample code.

Motivation via LLM Training/Fine-tuning/RAG[edit]

The #infra-meetup web crawling discussion was partly centered on the subtopic of MCP, Model Context Protocol. MCP is an emerging technology that benefits agentic programming for LLM interactions. Think of it as an interface between an LLM and the world. Here's my summarized braindump of MCP and what it's useful for-

  • Context Management for multi-model, multi-session and multi-agent interactions: MCP is useful for communicating between LLMs, agents controlling LLMs, and multiple sessions of the same. It is a common set of protocols for managing the pipelines between actors in an agentic application, and/or framework.
  • Optimizing contextual input to LLMs: in LLMs, context windows are typically limited. ChatGPT with GPT-4o limits its session context window to 8,000 tokens across multiple interactions, where tokens are words or word fragments, inclusive of prompts and LLM-generated text. Others, such as Google Gemini, offer much larger context windows (1 to 2 million tokens).
  • Supporting RAG (Retrieval-Augmented Generation): MCP is useful for retrieval of external information, where the model needs relevant content from a database, for example, to use as context for generation. The protocol helps manage how that content is integrated in a way that is compatible with the model's formats and embeddings.
  • Customized / tailored context: MCP is useful for fine-tuning and customizing a model's responses. For example, a business application can use MCP to feed specific knowledge such as customer history to the model, to tailor the output in a way that's contextually relevant for the customer, and/or the particular domain of current interaction the LLM/agent is having with the customer.
  • Long-term memory: MCP can maintain a coherent long-term memory by providing past interactions with a prompt. Such interactions can be stored in a database or vector store for access via MCP.
  • Dynamic awareness: MCP can be useful for providing moment-by-moment data, such as sensor readings, to include in prompts, enabling more accurate response generation
  • Manipulating the world: MCP can be used to interact with APIs of any type, even machinery as a dumb example. This can also include POST, PUT and DELETE on REST interfaces

For many examples of MCP servers/services, see https://github.com/punkpeye/awesome-mcp-servers

So, where does web crawling mesh with MCP?[edit]

There are several ways that web crawling, and HTTP as an interface generally, can be useful when combined with MCP-

  • Obtaining content for RAG, prior to generation: crawling web pages on a particular topic of interest is one way to supplement an LLM agent's knowledge for RAG. Such content is typically processed to generate "embeddings" to better enable semantic and conceptual search. Embeddings are usually high-dimensional floating point arrays used to index a text fragment so that it may be retrieved from a vector store by an LLM (see the sketch after this list).
  • Fetching content at time of request: MCP can enable an LLM (or an LLM agent) to perform web searches and retrieve web page content at runtime
  • Accessing data via REST: MCP can likewise be used to fetch public or private data via REST interfaces
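
To make the "embeddings" idea above concrete (as referenced in the first bullet), here's a minimal Python sketch of similarity-based retrieval. The embed() function is a hypothetical stand-in for a real embedding model or API, and the corpus is made up for illustration-

import numpy as np


def embed(text: str) -> np.ndarray:
    """Hypothetical placeholder: a real implementation would call an embedding
    model (e.g. a sentence-transformers model or a hosted embeddings API)."""
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    return rng.standard_normal(384)


def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


# A toy "vector store": (text fragment, embedding) pairs built from crawled pages
corpus = [
    'Web crawlers download pages by following links.',
    'robots.txt advises crawlers which paths to avoid.',
]
store = [(doc, embed(doc)) for doc in corpus]

# Retrieve the fragment most similar to the query, to feed into an LLM prompt as context
query_vec = embed('how do spiders traverse a website?')
best_doc, _ = max(store, key=lambda pair: cosine_similarity(query_vec, pair[1]))
print(best_doc)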

Crawling[edit]

Crawling usually means running a programmed web robot that visits one or more websites' pages, downloads the text content (and more, e.g. images if desired), and writes it to a document store. Side note- most web browsers allow you to save the current page to disk as a .html file. Crawling goes further by scanning each page for <a href="..."> tags, enqueuing those URLs for download as well, and so on, recursively visiting all the pages in a site to store for later evaluation.

Global caveat: some websites are difficult to extract content from after crawling, because their pages' static markup does not render to anything upon initial load. It's only after the page's JavaScript executes that the content is rendered into the DOM and becomes human-readable and machine-readable. This highlights the demarcation between server-side and client-side rendering. Typically this means you'll be forced to load a client-side rendered web page using a headless browser in order for it to become parseable and extractable, although some websites bundle their pages' content as JSON within the initial page delivery. Some deliver JS-renderable page content via AJAX, for post-load rendering. This is all totally annoying, of course, and increases the expense and tedium of extracting information from the web. The better option in the long run is to partner with a content provider to license their work, but this isn't practical for students, learners and startups that haven't reached funding levels needed to buy such content.
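
If you do need to crawl a client-side rendered site, a headless browser is the usual workaround. Here's a minimal sketch using Playwright (Selenium and others work similarly); it assumes you've run pip install playwright and playwright install chromium, and the Zyte blog URL is just the example discussed below-

from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    page = browser.new_page()
    # wait until the network goes quiet, giving client-side rendering a chance to finish
    page.goto('https://www.zyte.com/blog/', wait_until='networkidle')
    html = page.content()   # the DOM *after* JavaScript has run
    browser.close()

print(f'{len(html)} bytes of rendered HTML')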

An example of a client-side-rendered-only website is [Zyte.com's blog](https://www.zyte.com/blog). Zyte is a company offering support for the Python crawler framework scrapy (more on this below). Ironically, their blog pages are delivered as an empty shell, with the blog content and metadata embedded as JSON blobs within <script> tags. One could conceivably crack their page and script structure to extract the nut inside, but this is inconvenient if your job is to parse many websites under different domains and ownerships.

An example of a server-side rendered site is Wikipedia.

Simple crawlers[edit]

You probably have curl and/or wget installed on your development machine. Both can be used to recursively download web pages from a single website and save them to your filesystem-

wget[edit]

wget is designed with automatic mirroring of a website in mind.

$ wget --mirror --convert-links --adjust-extension --page-requisites --no-parent --no-check-certificate --reject "*robots.txt*" http://example.com/

Explained-

  • --mirror - enables recursive (-r) mirroring of pages that haven't already been downloaded or have been updated (-N, check timestamps)
  • --convert-links - rewrite links in the downloaded pages so they work for local, offline viewing
  • --page-requisites - also download the assets (images, CSS, etc.) each page needs to display properly
  • --adjust-extension - ensure files end with .html, regardless of their original URL
  • --no-parent - don't visit pages above the starting URL in the path hierarchy
  • --no-check-certificate - don't check the SSL cert (risky, but when you really want content from a b0rk3d site...)
  • --reject "*robots.txt*" - skip d/l of robots.txt
  • -e robots=off - ignore robots.txt altogether (not used in the command above, but handy)

curl[edit]

Curl is not designed to automatically mirror a website, but we can emulate this with a simple bash script

#!/bin/bash
visited_urls=()

function crawl() {
    local url=$1
    echo "Crawling: $url"
    visited_urls+=("$url")
    local outfile="$(basename "$url").html"
    curl -sL --insecure "$url" -o "$outfile"

    # Extract absolute links from the saved page, skipping common asset types
    links=$(grep -oP '(?<=href=")[^"]*' "$outfile" | grep "^http" | grep -vE "(\.jpg|\.png|\.css|\.js)")

    for link in $links; do
        if [[ ! " ${visited_urls[@]} " =~ " ${link} " ]]; then
            crawl "$link"
        fi
    done
}

# export -f (not plain export) is required to export a bash function
export -f crawl

To use-

$ source ./crawl.sh && crawl https://noisebridge.net/

NOTE: this script indiscriminately follows all links in all pages, including those off-site. This will lead to crawling the entire web, probably. The script is meant only for demonstrating that curl could be useful for crawling a site. Further work is necessary to play within the margins of the site's domain(s).
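
One way to stay within the margins of a site's domain is to compare each extracted link's host against the host you started from. A small Python sketch of that check (standard library only; the URLs are just examples)-

from urllib.parse import urlsplit


def same_site(start_url: str, link: str) -> bool:
    """True if link points at the start host or one of its subdomains."""
    start_host = urlsplit(start_url).hostname or ''
    link_host = urlsplit(link).hostname or ''
    return link_host == start_host or link_host.endswith('.' + start_host)


print(same_site('https://noisebridge.net/', 'https://www.noisebridge.net/wiki/'))  # True
print(same_site('https://noisebridge.net/', 'https://example.com/'))               # False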

Scripted crawling[edit]

Scripting your own crawler gives you the most control, and the most responsibility. For popular languages, many packages are available for a) crawling & fetching pages from a website, and b) extracting text, links and other aspects of web pages.

Crawling with Python[edit]

Python packages for both curl and wget are available; however, they are not ideal for crawling. Curl (a binding for libcurl) is designed for retrieving one URL at a time, so you need to perform link extraction and recursion yourself in order to mirror a site.

A simple Python requests crawler-

#!/usr/bin/env python3

from urllib import parse

from lxml import etree
from requests_cache import CachedSession


# Requests & responses will be written to demo_cache.sqlite for cached retrieval
session = CachedSession('demo_cache')

# recover=True so the parser heals broken, real-world HTML
html_parser = etree.HTMLParser(recover=True)

START_URLS = ['https://en.wikipedia.org/wiki/Web_crawler']

# Setting this to True will keep your CPU and network connection busy for a long time
REALLY_SPIDER_WIKIPEDIA = False

USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:133.0) Gecko/20100101 Firefox/133.0'


def crawl():
    url_queue = START_URLS.copy()
    visited = set()
    while url_queue:
        url = url_queue.pop()
        # skip already-processed URLs so cyclic links can't loop forever
        if url in visited:
            continue
        visited.add(url)
        referer = parse.urlsplit(url)
        referer = parse.urlunsplit(
            (referer.scheme, referer.hostname, '/', '', ''))
        response = session.get(url, headers={
            'User-Agent': USER_AGENT,
            'Referer': referer
        })
        page = etree.fromstring(response.content, html_parser)
        for link in page.xpath('//a'):
            link_href = link.xpath('@href')
            link_href = link_href[0].strip() if link_href else None
            link_text = link.xpath('text()')
            link_text = link_text[0].strip() if link_text else None
            if (link_href and link_href.strip()
                    and not link_href.startswith('#')
                    and not link_href.startswith('http')):
                print(f'link_text: "{link_text}", link_href: {link_href}')
                if REALLY_SPIDER_WIKIPEDIA:
                    # enqueue the resolved URL (not the lxml element) for later crawling
                    url_queue.append(parse.urljoin(url, link_href))


if __name__ == '__main__':
    crawl()

Python crawling frameworks[edit]

There are several packages that provide extensive crawl management abilities; one I like is-

  • scrapy: a web crawler framework that operates on a callback basis. You provide one or more starting-point URLs, and once running it delivers the pages it fetches to your callback method-
Install scrapy to your system or python environment-
$ pip install scrapy

Save the following to a file called my_crawler.py or whatever-

from typing import Generator

import scrapy
from scrapy.http import Request, Response

# Setting this to True will keep your CPU and network connect busy for a long time
REALLY_SPIDER_WIKIPEDIA = False

class BlogSpider(scrapy.Spider):
    name = 'blogspider'
    start_urls = ['https://en.wikipedia.org/wiki/Web_crawler']

    def parse(self, response: Response) -> Generator[Request, None, None]:
        for link in response.css('a'):
            link_href = link.attrib.get('href')

            # keep only relative (same-site) links; skip fragments and absolute URLs
            if (link_href and link_href.strip()
                    and not link_href.startswith('#')
                    and not link_href.startswith('http')):
                link_text = link.css('::text').get()
                print(f'link_text: "{link_text}", link_href: {link_href}')

                if REALLY_SPIDER_WIKIPEDIA:
                    # enqueue found links, joined to current link's scheme://host/path
                    yield scrapy.Request(response.urljoin(link_href), self.parse)

Now, run it-

$ scrapy runspider my_crawler.py

In this script, scrapy starts the crawl at WP's 'Web crawler' article, extracts links to other WP articles, and optionally enqueues them for recursive retrieval. Scrapy's framework delivers Response objects containing pre-cleaned HTML (WP's HTML currently has tag mismatches), with .css() and .xpath() methods to enable easy extraction of data embedded within pages.

Scrapy could be called a "middleware" crawler framework. You provide code that directs the flow of URLs and extracts wanted info from pages, while the framework handles everything else.

It is maintained by Zyte (I have no connection with them), who offer an optional paid service that connects your scrapy.Spider class to their extensive cloud crawling infrastructure. That service provides the shield of many IP addresses; you may specify regular crawling with a basic HTTP user-agent, or pay more to have crawls done by a headless browser.

Headless browsers are intended to look just like a browser operated by a human user on desktop and mobile devices, and are used when a website is wired to detect and reject simple crawlers.

/robots.txt and websites that block crawlers[edit]

No discussion of crawling would be complete without mentioning /robots.txt (example: https://en.wikipedia.org/robots.txt).

robots.txt is a file that provides _advice_ to web crawlers about what URL path patterns are permissible and impermissible to access when spidering the site. Some sites prefer to protect their work from content poachers (ahem), and would like you to obey the restrictions set out in their robots.txt. Do with this info as you like.
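
If you do want to honor a site's robots.txt, the Python standard library will evaluate it for you. A quick sketch-

from urllib import robotparser

rp = robotparser.RobotFileParser()
rp.set_url('https://en.wikipedia.org/robots.txt')
rp.read()

# prints True/False depending on what Wikipedia's robots.txt currently allows
print(rp.can_fetch('my-crawler/1.0', 'https://en.wikipedia.org/wiki/Web_crawler'))
print(rp.can_fetch('my-crawler/1.0', 'https://en.wikipedia.org/wiki/Special:Random'))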

Some site servers (or their application routers) will recognize that your crawler is accessing forbidden paths and subsequently interfere with your intentions. This can be done by delivering a variety of show-stopping HTTP status codes (30x, 40x, 50x) or content meant to misguide your crawler into getting nowhere.

In addition, some will blacklist your IP address, temporarily or permanently, once your crawler has violated one or more rules. Another hidden rule you may encounter is rate-limiting, whereby making more than X requests per second gets your IP added to their blacklist.
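
The simplest defense against tripping a rate limit is to throttle yourself. A sketch of a polite fetch loop with a fixed delay and a crude backoff on HTTP 429 (the URLs are placeholders)-

import time

import requests

DELAY_SECONDS = 1.0     # stay well under whatever X requests/second the site tolerates
urls = ['https://example.com/a', 'https://example.com/b']

for url in urls:
    response = requests.get(url, timeout=30)
    if response.status_code == 429:
        # Too Many Requests: honor Retry-After if it's a number of seconds, else back off hard
        retry_after = response.headers.get('Retry-After', '60')
        time.sleep(int(retry_after) if retry_after.isdigit() else 60)
        response = requests.get(url, timeout=30)
    time.sleep(DELAY_SECONDS)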

Working around /robots.txt & avoiding getting blacklisted[edit]

An ecosystem of crawling strategies, packages and services (like Zyte) has evolved to work around savvy website operators. Here's a quick summary of what's available in 2025-

  • phantom / headless browsers
    • In addition to interactive mouse-and-keyboard use by a user, browsers have been designed to be operated programmatically, without the GUI window being displayed. Most browsers provide an API for this
    • Originally, the objective was to enable automated integration testing by controlling a web browser instance from your test program
    • But you can also use a headless browser session to access websites for crawling. The benefit is that a headless browser "checks out," i.e. appears to be a legit user-operated web browser, according to its request's HTTP headers, and how it renders certain HTML elements and CSS. For example, some sites will reject a request if the user-agent header string doesn't resemble a typical end-user's browser (see [list of current user-agent](https://www.useragents.me/) values). A more sophisticated check will inject HTML and verify with JS that it renders the way a desktop or mobile browser would.
    • Companies and organizations providing crawling services offer headless browser crawling
  • proxy services
    • AWS, DigitalOcean and many other cloud providers offer IP proxy services, enabling you to route your requests through more than one IP address. This helps avoid rate-limiting, and can prevent your home or office IP from becoming blacklisted by a website, or worse by a DDOS / reverse cache service like Cloudflare or Akamai. Typically they charge for the time an IP is in use, and for bandwidth consumed.
  • crawling services have sprung up that handle the spidering workload, offer headless browsers, maintain massive IP address pools, and more. They typically charge from $4 to $8 per GB of web pages downloaded.
  • IP proxy services for business provide a pool of IP addresses through which your crawler's requests will be routed
    • https://techjury.net/best/proxy-server/
    • Articles outlining the variety of proxy types, locales and costs (presented by spidering proxy companies, so they're somewhat skewed to make themselves look good)-
      • https://scrapfly.io/blog/best-proxy-providers-for-web-scraping/
      • https://crawlbase.com/blog/best-proxy-providers/
  • during development, you should make use of an HTTP caching layer, to avoid making repeated requests. Typically it will cache all headers and response data, so it provides a realistic replay of a request-response interaction, without getting your IP rate-limited or blacklisted (see the Scrapy sketch after this list)
    • requests-cache (for Python's [requests](https://docs.python-requests.org/en/latest/index.html) package)
    • HttpCacheMiddleware (for Scrapy)
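
For Scrapy, the cache is switched on through settings (in settings.py, or per-spider via custom_settings). A minimal sketch, reusing the earlier spider shape-

import scrapy


class CachedBlogSpider(scrapy.Spider):
    name = 'cachedblogspider'
    start_urls = ['https://en.wikipedia.org/wiki/Web_crawler']

    # enable HttpCacheMiddleware so repeated dev runs replay from a local cache
    custom_settings = {
        'HTTPCACHE_ENABLED': True,
        'HTTPCACHE_DIR': 'httpcache',       # stored under the project's .scrapy directory
        'HTTPCACHE_EXPIRATION_SECS': 0,     # 0 means cached responses never expire
    }

    def parse(self, response):
        self.logger.info('fetched %s (%d bytes)', response.url, len(response.body))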

CommonCrawl.org[edit]

CommonCrawl is a quasi-public service that provides an extensive archive of billions of pages, crawled monthly to quarterly. It might supply the website you want, or at least supplement it in some way, so it's worth a look. CommonCrawl is operated by a nonprofit, with S3 storage donated by Amazon.

CC's website is confusing. A way I've found to explain how they structure their publicly-available crawl datasets is as follows-

$ curl -s https://data.commoncrawl.org/crawl-data/CC-MAIN-2024-51/cc-index.paths.gz | gunzip -dc | less
cc-index/collections/CC-MAIN-2024-51/indexes/cdx-00000.gz
cc-index/collections/CC-MAIN-2024-51/indexes/cdx-00001.gz
cc-index/collections/CC-MAIN-2024-51/indexes/cdx-00002.gz
...
cc-index/collections/CC-MAIN-2024-51/indexes/cdx-00297.gz
cc-index/collections/CC-MAIN-2024-51/indexes/cdx-00298.gz
cc-index/collections/CC-MAIN-2024-51/indexes/cdx-00299.gz
cc-index/collections/CC-MAIN-2024-51/indexes/cluster.idx
cc-index/collections/CC-MAIN-2024-51/metadata.yaml
  • each of the cdx-NNNNN.gz files contains a list of crawl records per-URL, which look like-
$ wget --header "Range: bytes=0-300" -O - https://data.commoncrawl.org/cc-index/collections/CC-MAIN-2024-51/indexes/cdx-00075.gz 2>/dev/null|gunzip -dc|head -1
com,greencarcongress)/2024/06/20240615-teco.html 20241207210829 {"url": "https://www.greencarcongress.com/2024/06/20240615-teco.html", "mime": "text/html", "mime-detected": "text/html", "status": "200", "digest": "BCIKTQJFZK4FYG3I5F6STEWWW6YE2FD6", "length": "13648", "offset": "690070151", "filename": "crawl-data/CC-MAIN-2024-51/segments/1733066431606.98/warc/CC-MAIN-20241207194255-20241207224255-00787.warc.gz", "charset": "UTF-8", "languages": "eng"}
  • that's a SURT-style key (tld,domain[,optional_subdomain(s)])/path, followed by a timestamp, followed by a JSON blob with response metadata and the CC file holding the page content (see the parsing sketch at the end of this list)
  • and the contents of each uncompressed cdx-NNNNN.gz file are sorted
  • to obtain the desired page, you'll need to download the "filename" referenced in the JSON blob-
$ curl -I https://data.commoncrawl.org/crawl-data/CC-MAIN-2024-51/segments/1733066431606.98/warc/CC-MAIN-20241207194255-20241207224255-00787.warc.gz
HTTP/2 200 
content-type: application/octet-stream
content-length: 947258258
date: Mon, 27 Jan 2025 22:50:38 GMT
last-modified: Sat, 07 Dec 2024 23:39:08 GMT
etag: "11be58d5f14b4e9911ce2f251c3de996-15"
x-amz-storage-class: INTELLIGENT_TIERING
x-amz-server-side-encryption: AES256
x-amz-version-id: null
accept-ranges: bytes
server: AmazonS3
x-cache: Miss from cloudfront
via: 1.1 434785882f05cb88e488bf5372fd0000.cloudfront.net (CloudFront)
x-amz-cf-pop: SFO53-P2
x-amz-cf-id: F0FQDLFylx46lWprphPsVkbmmq2NbUfGSHfw86y5XLzWQtquGm15jg==
  • the JSON blob gives the offset and length of the record within the compressed WARC (.warc.gz) file; each record is its own gzip member, so it can be fetched with an HTTP Range request and decompressed on its own (see the fetch sketch at the end of this section)
  • an ideal use of CommonCrawl would be:
    • download the very first record from each cdx-NNNNN.gz file of a dataset
    • binary search for your target domain within that record list
    • download the cdx-NNNNN.gz files that contain your domain's URLs
    • stream-decompress the cdx file(s) and iterate the records-
      • filter for your domain
      • capture as a set() the WARC file URL fragments containing your pages
      • iterate the WARC URL set and download the WARC file(s)
      • decompress the WARC files as a stream, filter for your domain, process the pages provided
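
A cdx record like the one shown earlier is just a SURT key, a timestamp and a JSON blob separated by spaces, so pulling it apart is straightforward. A small sketch (the record below is an abbreviated copy of the greencarcongress example above)-

import json

record = ('com,greencarcongress)/2024/06/20240615-teco.html 20241207210829 '
          '{"url": "https://www.greencarcongress.com/2024/06/20240615-teco.html", '
          '"status": "200", "length": "13648", "offset": "690070151", '
          '"filename": "crawl-data/CC-MAIN-2024-51/segments/1733066431606.98/warc/'
          'CC-MAIN-20241207194255-20241207224255-00787.warc.gz"}')

surt_key, timestamp, json_blob = record.split(' ', 2)
meta = json.loads(json_blob)
print(surt_key)     # reversed-host key plus path
print(timestamp)    # capture time, YYYYMMDDhhmmss
print(meta['filename'], meta['offset'], meta['length'])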

Python example of how to build the cdx binary index-

#!/usr/bin/env python3.11

import gzip
from io import BytesIO
import logging
import re
import sys
import zlib

from requests_cache import CachedSession


# Using a cache to be polite
session = CachedSession('demo_cache')

logger = logging.getLogger(__name__)

# the base URL of Common Crawl Index server
SERVER = 'http://index.commoncrawl.org/'

# base URL of CC data server
DATA_SERVER = 'https://data.commoncrawl.org'

# the Common Crawl index you want to query
INDEX_NAME = 'CC-MAIN-2024-51'      # Replace with the latest index name

# matches the per-URL index shards listed in cc-index.paths, e.g. .../indexes/cdx-00000.gz
CDX_RX = re.compile(r'.*/cdx-\d+\.gz$')

# It’s advisable to use your own User-Agent string when developing your own applications.
# Take a look at RFC 7231.  Here's a simple one
USER_AGENT = 'cc-get-started/1.0 (Example data retrieval script; yourname@example.com)'


# Swiped from gzip.py in the standard library, because it doesn't expose the
# max_length parameter of zlib.decompressobj, which we need in order to
# inspect the Common Crawl indices' first rows w/o downloading them in their
# entirety, expensive at 600+MB apiece
def decompress(data, max_length=sys.maxsize):
    """Decompress a gzip compressed string in one shot.
    Return the decompressed string, ignoring CRC32 bc we cannot validate due
    to potentially not having the entire file (thanks to param max_length).
    """
    decompressed_members = []
    while True:
        fp = BytesIO(data)
        if gzip._read_gzip_header(fp) is None:
            return b"".join(decompressed_members)

        # Use a zlib raw deflate compressor
        do = zlib.decompressobj(wbits=-zlib.MAX_WBITS)

        # Read all the data except the header
        decompressed = do.decompress(data[fp.tell():],
                                     max_length=max_length or len(data)
                                     )
        decompressed_members.append(decompressed)
        data = do.unused_data[8:].lstrip(b"\x00")


def get_gz_resource(url, length=None, max_uncompressed_length=sys.maxsize):
    headers = {
        'user-agent': USER_AGENT,
        'accept-encoding': 'gzip',
    }
    if length:
        headers['Range'] = f"bytes=0-{length}"
    response = session.get(url, headers=headers)

    if response.status_code > 299:
        raise RuntimeError(
            f"Failed to fetch index file {url}: {response.status_code}")

    return decompress(response.content,
                      max_length=max_uncompressed_length
                      ).decode('utf-8')


def get_crawl_index_headers(index_name):
    first_row_index = []
    cc_idx = get_gz_resource(
        f"{DATA_SERVER}/crawl-data/{index_name}/cc-index.paths.gz")
    cc_urls = cc_idx.strip().split('\n')
    del cc_idx
    print("\n".join(str(url) for url in cc_urls))
    for cdx_url_frag in filter(lambda frag: CDX_RX.match(frag), cc_urls):
        cdx_url = f"{DATA_SERVER}/{cdx_url_frag}"
        print(f"Fetching {cdx_url}")
        cdx = get_gz_resource(cdx_url, length=5000)
        cdx_lines = cdx.strip().split('\n')
        if not cdx_lines:
            logger.warning(f"No records found in {cdx_url}")
            continue
        first_row_index.append(cdx_lines[0])
    return first_row_index


if __name__ == '__main__':
    print("\n".join(get_crawl_index_headers(INDEX_NAME)))
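
Once you have a cdx record for a page you want, its offset/length and filename let you fetch just that record from the WARC with an HTTP Range request, since each record is an independently-gzipped member. A sketch, reusing the offset, length and filename from the greencarcongress example record above-

import gzip

import requests

DATA_SERVER = 'https://data.commoncrawl.org'
WARC_PATH = ('crawl-data/CC-MAIN-2024-51/segments/1733066431606.98/warc/'
             'CC-MAIN-20241207194255-20241207224255-00787.warc.gz')
OFFSET, LENGTH = 690070151, 13648   # from the cdx record's "offset" and "length" fields

response = requests.get(
    f'{DATA_SERVER}/{WARC_PATH}',
    headers={'Range': f'bytes={OFFSET}-{OFFSET + LENGTH - 1}'},
    timeout=60,
)
record = gzip.decompress(response.content).decode('utf-8', errors='replace')

# A WARC record is: WARC headers, blank line, HTTP headers, blank line, page body
warc_headers, http_headers, body = record.split('\r\n\r\n', 2)
print(body[:500])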