Parsing URLs in Python

<2023-11-05 Sun>

1. Parsing URLs

Sometimes, you want to know things about URLs. Namely:

  1. Is URL X the same as URL Y?
  2. Is URL X on the same website as URL Y?

Question #1 has a number of pitfalls:

  • The order of the parameters in the query string
  • Case sensitivity
  • Non-ASCII characters, such as internationalized domain names and percent-encoded paths
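The first two pitfalls can be sketched with the standard library alone. This is only a rough illustration, not a complete canonicalizer: normalize_url and same_url are names made up here, and the sketch assumes the URL carries no username or password (lowercasing the whole netloc would mangle them).

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def normalize_url(url):
    """Return a normalized form of url for rough equality checks."""
    parts = urlsplit(url)
    # The scheme and host are case-insensitive; the path and query are not.
    scheme = parts.scheme.lower()
    netloc = parts.netloc.lower()
    # Sort the query parameters so that their order does not matter.
    pairs = sorted(parse_qsl(parts.query, keep_blank_values=True))
    query = urlencode(pairs)
    return urlunsplit((scheme, netloc, parts.path, query, parts.fragment))


def same_url(url_a, url_b):
    """Hypothetical helper: compare two URLs after normalization."""
    return normalize_url(url_a) == normalize_url(url_b)
```

Note that the path is left alone, so two URLs that differ only in the case of the path still compare as different.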

Question #2 is not simple either:

  • Is the same website as
  • What about and

To solve these problems, one must understand some basic jargon. Consider the URL

  1. The Fully Qualified Domain Name (FQDN) is
    • Basically, it's the domain + the subdomain.
  2. The domain is
    • This is what you buy from the registrar.
    • As a rule of thumb, every FQDN that ends with belongs to the same entity.
  3. The host is www.
    • This is just a fun fact. It's not important for our purposes.
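To make the three terms concrete, here is a small sketch that pulls them out of a URL. The function name describe_host and the example.com URL in the usage note are made up for illustration, and the "last two labels" split is the rule of thumb from the list above, not a general rule.

```python
from urllib.parse import urlsplit


def describe_host(url):
    """Split a URL's hostname into FQDN, domain and host label(s)."""
    # hostname is already lowercased and excludes any port.
    fqdn = urlsplit(url).hostname or ""
    labels = fqdn.split(".")
    return {
        "fqdn": fqdn,
        # Rule of thumb: the domain is the last two labels.
        "domain": ".".join(labels[-2:]),
        # Whatever comes before the domain, e.g. "www".
        "host": ".".join(labels[:-2]),
    }
```

For "https://www.example.com/index.html" this returns the FQDN "www.example.com", the domain "example.com" and the host "www".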

2. There's nothing special about "www"

Something that surprised me is that there's nothing special about "www". Whether a URL has a "www" prefix or not depends only on whether or not the site operator chose to create DNS records for the "www" subdomain. For example:

  1. - a working link.
  2. - The same URL with a "www." added (doesn't work.)
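One way to check this for a given site is to ask DNS whether each hostname resolves. The sketch below is only an illustration: resolves is a made-up name, and the lookup parameter is an assumption added so the function can be tried without network access. By default it calls socket.getaddrinfo.

```python
import socket


def resolves(hostname, lookup=socket.getaddrinfo):
    """Return True if hostname has at least one address record."""
    try:
        # getaddrinfo raises socket.gaierror when the name cannot be resolved.
        return len(lookup(hostname, None)) > 0
    except socket.gaierror:
        return False
```

Calling resolves on a bare domain and on the same domain with "www." in front shows whether the operator created records for both.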

3. Some code

Here is a simple Python function which implements these ideas.

import collections
import urllib.parse

import w3lib.url

ParsedURL = collections.namedtuple("ParsedURL", "canonical, domain, fqdn")

def split_url(url):
    """Given a URL, return a namedtuple with canonical URL, domain and FQDN."""
    # Sort the query arguments and normalize percent-encoding.
    canonical = w3lib.url.canonicalize_url(url)
    parsed_uri = urllib.parse.urlparse(canonical)
    # hostname is lowercased and excludes any port or credentials.
    fqdn = parsed_uri.hostname or ""
    labels = fqdn.split(".")

    if len(labels) < 2:
        raise ValueError("Attempted to get domain of malformed URL.")

    # Rule of thumb: the domain is the last two labels of the FQDN.
    domain = ".".join(labels[-2:])
    return ParsedURL(canonical=canonical, fqdn=fqdn, domain=domain)
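Question #2 then comes down to comparing domains. Here is a minimal standard-library sketch of that comparison. The names registered_domain and same_site are made up for illustration, and it uses the same "last two labels" rule of thumb.

```python
from urllib.parse import urlsplit


def registered_domain(url):
    """Return the last two labels of the URL's hostname."""
    hostname = urlsplit(url).hostname
    if not hostname or "." not in hostname:
        raise ValueError(f"Cannot find a domain in {url!r}")
    return ".".join(hostname.split(".")[-2:])


def same_site(url_a, url_b):
    """Hypothetical helper: True if both URLs share a domain."""
    return registered_domain(url_a) == registered_domain(url_b)
```

The rule of thumb has a known limit: under suffixes such as "co.uk", the last two labels are shared by unrelated sites, so this sketch would call them the same website. Handling that properly needs a list of public suffixes.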

Modified: 2023-11-05 19:06:33 EST
