1. Parsing URLs
Sometimes, you want to know things about URLs. Namely:
- Is URL X the same as URL Y?
- Is URL X on the same website as URL Y?
Question #1 has a number of pitfalls:
- The order of the parameters in the query string
- Case sensitivity
- Weird foreign languages? Scary monsters?
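The first two pitfalls can be handled with the standard library alone. The sketch below is a minimal illustration, not a complete normalizer. It assumes that the server ignores the order of query parameters, that the URL carries no username or password, and that only the scheme and host are case-insensitive. The names `normalize_url` and `same_url` are hypothetical helpers, not part of any library.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def normalize_url(url):
    """Return a comparable form of url (hypothetical helper for illustration)."""
    parts = urlsplit(url)
    # Scheme and host are case-insensitive; the path and query are not.
    scheme = parts.scheme.lower()
    netloc = parts.netloc.lower()
    # Sort query parameters so that their order does not affect comparison.
    # Assumption: the server treats reordered parameters as equivalent.
    query = urlencode(sorted(parse_qsl(parts.query, keep_blank_values=True)))
    # An empty path and "/" refer to the same resource.
    path = parts.path or "/"
    # Drop the fragment, which is never sent to the server.
    return urlunsplit((scheme, netloc, path, query, ""))


def same_url(a, b):
    """Answer question #1 by comparing the normalized forms."""
    return normalize_url(a) == normalize_url(b)


# Parameter order and host case are ignored:
print(same_url("https://Example.com/?b=2&a=1", "https://example.com/?a=1&b=2"))  # True
# Path case is significant:
print(same_url("https://example.com/Page", "https://example.com/page"))  # False
```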
Question #2 is not simple either:
- Is subdomain.example.com the same website as example.com?
- What about www.example.com and example.com?
To solve these problems, one must understand some basic jargon. Consider the URL https://www.chud.example.com/?param1=69&param2=420:
- The Fully Qualified Domain Name (FQDN) is www.chud.example.com.
  - Basically, it's the domain + the subdomain.
- The domain is example.com.
  - This is what you buy from the registrar.
  - As a rule of thumb, every FQDN that ends with example.com belongs to the same entity.
- The host is www.
  - This is just a fun fact. It's not important for our purposes.
2. There's nothing special about "www"
Something that surprised me is that there's nothing special about "www". Whether a URL has a "www" prefix or not depends only on whether or not the site operator chose to create DNS records for the "www" subdomain. For example:
- https://sling.apache.org/ - a working link.
- https://www.sling.apache.org/ - The same URL with a "www." added (doesn't work.)
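One way to see this for yourself is to ask the resolver whether a name has any address records at all. The sketch below is a rough check, assuming that "has a DNS address record" is a good enough proxy for "the link works". The `resolves` function and its `resolver` parameter are hypothetical names introduced here so the lookup can be swapped out in tests.

```python
import socket


def resolves(hostname, resolver=socket.getaddrinfo):
    """Return True if hostname has at least one address record.

    `resolver` defaults to the system lookup; it is a parameter only so
    that a fake can be substituted when no network is available.
    """
    try:
        resolver(hostname, None)
        return True
    except socket.gaierror:
        # Raised when the name cannot be resolved.
        return False


# Example usage (requires network access, so it is left commented out):
# for name in ("sling.apache.org", "www.sling.apache.org"):
#     print(name, resolves(name))
```

A name that resolves may still refuse HTTP connections, so this only tests the DNS half of the story.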
3. Some code
Here is a simple Python function which implements these ideas.
```python
import collections
import urllib.parse

import w3lib.url

ParsedURL = collections.namedtuple("ParsedURL", "canonical, domain, fqdn")


def split_url(url):
    """
    Given a URL, return a namedtuple with canonical URL, domain and FQDN.
    """
    # canonicalize_url sorts query arguments and normalizes percent-encoding.
    canonical = w3lib.url.canonicalize_url(url)
    parsed_uri = urllib.parse.urlparse(canonical)
    # Host names are case-insensitive, so work with the lowercase form.
    fqdn = parsed_uri.netloc.lower()
    splitted = fqdn.split(".")
    if len(splitted) < 2:
        raise ValueError("Attempted to get domain of malformed URL.")
    # Keep the last two labels, e.g. "example.com".
    domain = ".".join(splitted[-2:])
    result = ParsedURL(canonical=canonical, fqdn=fqdn, domain=domain)
    return result
```
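With the domain in hand, question #2 reduces to a string comparison. The sketch below is self-contained and uses only the standard library, so it skips the w3lib canonicalization step. `registered_domain` and `same_website` are hypothetical names, and the "last two labels" rule is an assumption that breaks for suffixes such as co.uk, where the registrable domain has three labels.

```python
from urllib.parse import urlsplit


def registered_domain(url):
    """Return the last two labels of the URL's host, e.g. "example.com"."""
    # .hostname is already lowercased and has any port stripped.
    host = urlsplit(url).hostname
    if not host or "." not in host:
        raise ValueError(f"Cannot determine domain of {url!r}")
    # Assumption: the registrable domain is always the last two labels.
    return ".".join(host.split(".")[-2:])


def same_website(a, b):
    """Answer question #2: do both URLs share a registered domain?"""
    return registered_domain(a) == registered_domain(b)


print(same_website("https://subdomain.example.com/", "https://example.com/"))  # True
print(same_website("https://example.com/", "https://example.org/"))  # False
```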