Chud.wtf

Parsing URLs in Python

<2023-11-05 Sun>

1. Parsing URLs

Sometimes, you want to know things about URLs. Namely:

  1. Is URL X the same as URL Y?
  2. Is URL X on the same website as URL Y?

Question #1 has a number of pitfalls:

  • The order of the parameters in the query string
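  • Case sensitivity (the scheme and host are case-insensitive; the path generally is not)
  • Non-ASCII characters (internationalized domain names, percent-encoding) and other scary monsters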
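
As a rough illustration of the first two pitfalls, here is a minimal sketch that uses only the standard library. The name normalize_url is made up for this example. It assumes that sorting the query parameters and lowercasing the scheme and host is enough, and that the URL carries no credentials; real canonicalization (percent-encoding, default ports, internationalized names) takes more care.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def normalize_url(url):
    """Return url with a lowercased scheme/host and sorted query parameters."""
    parts = urlsplit(url)
    # Sort the (key, value) pairs so parameter order no longer matters.
    query = urlencode(sorted(parse_qsl(parts.query, keep_blank_values=True)))
    # Only the scheme and host are lowercased; the path is left alone
    # because paths are generally case-sensitive.
    return urlunsplit(
        (parts.scheme.lower(), parts.netloc.lower(), parts.path, query, parts.fragment)
    )


a = normalize_url("https://Example.com/page?b=2&a=1")
b = normalize_url("https://example.com/page?a=1&b=2")
print(a == b)
```

Two URLs that differ only in parameter order or host capitalization compare equal after this step, while URLs whose paths differ in case stay distinct.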

Question #2 is not simple either:

  • Is subdomain.example.com the same website as example.com?
  • What about www.example.com and example.com?

To solve these problems, one must understand some basic jargon. Consider the URL https://www.chud.example.com/?param1=69&param2=420

  1. The Fully Qualified Domain Name (FQDN) is www.chud.example.com.
    • Basically, it's the domain + the subdomain.
  2. The domain is example.com.
    • This is what you buy from the registrar.
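    • As a rule of thumb, every FQDN that is example.com or ends with .example.com belongs to the same entity. This breaks down for suffixes such as co.uk, where the registrable domain has three labels.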
  3. The host is www.
    • This is just a fun fact. It's not important for our purposes.
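
To make the jargon concrete, here is a small sketch that pulls each piece out of the example URL with the standard library. The variable names are just for illustration, and the "last two labels" rule is an assumption that only holds for single-label suffixes such as .com.

```python
from urllib.parse import urlsplit

url = "https://www.chud.example.com/?param1=69&param2=420"

fqdn = urlsplit(url).hostname        # the full host name from the URL
labels = fqdn.split(".")             # the dot-separated pieces of the FQDN

domain = ".".join(labels[-2:])       # naive: assumes a one-label suffix like ".com"
subdomain = ".".join(labels[:-2])    # everything to the left of the domain
host = labels[0]                     # the leftmost label

print(fqdn, domain, subdomain, host)
```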

2. There's nothing special about "www"

Something that surprised me is that there's nothing special about "www". Whether a URL has a "www" prefix or not depends only on whether or not the site operator chose to create DNS records for the "www" subdomain. For example:

  1. https://sling.apache.org/ - a working link.
  2. https://www.sling.apache.org/ - the same URL with "www." added (doesn't work).
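
One way to check this yourself is to ask DNS whether each name resolves. The sketch below is only an illustration: resolves and www_status are hypothetical helper names, and the lookup parameter exists purely so the function can be exercised without a network. A name that resolves does not guarantee a working website, and DNS records can change over time.

```python
import socket


def resolves(hostname, lookup=socket.getaddrinfo):
    """Return True if a DNS lookup for hostname succeeds."""
    try:
        lookup(hostname, None)
        return True
    except socket.gaierror:
        # Raised when the name cannot be resolved.
        return False


def www_status(hostname, lookup=socket.getaddrinfo):
    """Check whether hostname and its "www." variant each resolve."""
    return {name: resolves(name, lookup) for name in (hostname, "www." + hostname)}
```

Calling www_status("sling.apache.org") performs real lookups for both sling.apache.org and www.sling.apache.org and reports which of the two names currently have DNS records.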

3. Some code

Here is a simple Python function which implements these ideas. It relies on the third-party w3lib package for canonicalization.

import urllib.parse
import w3lib.url
import collections

ParsedURL = collections.namedtuple("ParsedURL", "canonical, domain, fqdn")


def split_url(url):
    """
    Given a URL, return a namedtuple with canonical URL, domain and FQDN.
    """
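    # Canonicalizing sorts the query parameters and normalizes escaping.
    canonical = w3lib.url.canonicalize_url(url)
    parsed_uri = urllib.parse.urlparse(canonical)
    # hostname is lowercased and excludes any port or credentials in netloc.
    fqdn = parsed_uri.hostname or ""
    labels = fqdn.split(".")

    if len(labels) < 2:
        raise ValueError("Attempted to get domain of malformed URL.")

    # Naive: take the last two labels. Wrong for suffixes such as co.uk.
    domain = ".".join(labels[-2:])
    result = ParsedURL(canonical=canonical, fqdn=fqdn, domain=domain)
    return result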
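
With a ParsedURL in hand, the two questions from the start reduce to comparing fields. The sketch below is one possible way to phrase that; same_url and same_site are hypothetical names, and the two tuples are hand-written stand-ins for what split_url is expected to return, so the example runs without w3lib installed.

```python
import collections

# Same shape as the ParsedURL namedtuple defined above.
ParsedURL = collections.namedtuple("ParsedURL", "canonical, domain, fqdn")


def same_url(x, y):
    """Question #1: do two parsed URLs have the same canonical form?"""
    return x.canonical == y.canonical


def same_site(x, y):
    """Question #2: do two parsed URLs share a domain?"""
    return x.domain == y.domain


# Hand-written stand-ins for split_url() results (an assumption, not real output).
x = ParsedURL(
    canonical="https://www.chud.example.com/?param1=69&param2=420",
    domain="example.com",
    fqdn="www.chud.example.com",
)
y = ParsedURL(
    canonical="https://example.com/",
    domain="example.com",
    fqdn="example.com",
)

print(same_url(x, y), same_site(x, y))
```

Comparing fqdn instead of domain would be the stricter choice if subdomains should count as separate websites.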

Modified: 2023-11-05 19:06:33 EST
