The lazy person’s guide to confirming that a move to a static site worked.
Overview:
- Download all relevant URLs from Search Console
- Convert download to a URL list
- Check for http to https redirects
- Check for valid final URLs
Download all relevant URLs
I’m picking one approximate source of truth: the URLs that received impressions in Google Search. This list doesn’t need to be comprehensive, just broader than what I’d pick manually. In general, any reasonable sample will include URLs from a variety of different templates / sections of the site – and problems are usually not unique to individual URLs, but rather to templates / sections. You can also use a Google Analytics export, for example. I use Search Console.
- Verify ownership (if necessary – then wait a few days for the data to appear)
- Go to the performance report, pick full time frame (16 months).
- Export to CSV file
- Done.
Pros & amateurs
HTML
Pros:
- Straightforward. It’s just HTML.
- It’s the well-defined implementation, with or without processing
Cons:
- Gotta write clean HTML
- Reading posts is hard
- It’s no longer markdown, eh
Verdict: If all else fails, use this.
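For reference, this approach just means writing the anchor tag verbatim in the markdown source (the URL here is a placeholder):

```html
<!-- Written directly in the markdown source; works with or without processing. -->
<a href="https://example.com/" rel="nofollow">a link</a>
```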
Append annotation
Setup: append an annotation that doesn’t block the direct clicking of the URL, but which can be caught by a pre-processor and turned into a nofollow link. Use something like # to avoid affecting the destination.
Pros:
- Just works without processing
Cons:
- Nofollow is dropped if processing fails
Verdict: Fine if you don’t care strongly about nofollow (that is, if dropping a nofollow should something fail doesn’t bug you). Not caring about nofollow seems to go against why I’m setting this up, so … meh.
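As a sketch of that pre-processing step: a single sed pass could catch the annotation and rewrite it. I’m assuming the annotation is a literal `#nofollow` appended to the URL — that marker is my choice for this sketch, not a fixed convention:

```shell
# Hypothetical pre-processor pass: links annotated with a trailing
# "#nofollow" still work when clicked (it's just a URL fragment), but
# the processor rewrites them into real rel=nofollow links.
echo '<a href="https://example.com/#nofollow">a link</a>' \
  | sed -E 's|href="([^"]*)#nofollow"|href="\1" rel="nofollow"|g'
# prints: <a href="https://example.com/" rel="nofollow">a link</a>
```

If the processor never runs, the link still resolves (browsers ignore the fragment) — which is exactly the failure mode above: the link works, but the nofollow is silently dropped.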
Prepend annotation
Setup: Prepend an annotation that drops the link should processing fail. Use something like # to break the link completely.
Pros:
- Safe if processing fails
Cons:
- Doesn’t work for users if processing fails
Verdict: Having links not work for users should processing fail seems annoying. Skip this.
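The prepend variant is the mirror image. Assuming a `#` prefixed to the URL (again, my choice of marker for this sketch), the unprocessed link is dead, and processing revives it with nofollow attached:

```shell
# Hypothetical pre-processor pass: a "#" prepended to the URL leaves the
# link broken unless this rewrite runs, which strips the prefix and adds
# rel=nofollow. Fail-safe for search engines, fail-broken for users.
echo '<a href="#https://example.com/">a link</a>' \
  | sed -E 's|href="#(https?://[^"]*)"|href="\1" rel="nofollow"|g'
# prints: <a href="https://example.com/" rel="nofollow">a link</a>
```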
Bounce-pad
Setup: Create a page that redirects to the destination: on the bounce URL, recognize a parameter that points at the final URL and redirect as appropriate. Block the bounce URLs with robots.txt to prevent any crawling, and add a noindex, nofollow in case the robots.txt doesn’t get uploaded. Have the site generator swap out the bounce URL for a normal nofollow link.
Pros:
- Kinda easy to use, just prepend the bounce URL to links.
- Don’t need to worry about the site generator forgetting to add rel=nofollow
Cons:
- Needs protection against abuse
- Uses code on the bounce URL + in the site generator.
- Overall most complex setup to create (robots.txt, bounce URL, link generator changes).
Verdict: Works for me. It’s implemented here.
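To make the generator side concrete: at build time, bounce links can be rewritten into direct nofollow links. The `/bounce` path and `to=` parameter here are made up for the sketch — use whatever your redirect page actually expects:

```shell
# Hypothetical build step: rewrite bounce links of the form
# /bounce?to=DEST into direct rel=nofollow links to DEST. If this step
# is ever skipped, the robots.txt-blocked bounce URL still shields the
# destination from crawling.
echo '<a href="/bounce?to=https://example.com/">a link</a>' \
  | sed -E 's|href="/bounce\?to=([^"]*)"|href="\1" rel="nofollow"|g'
# prints: <a href="https://example.com/" rel="nofollow">a link</a>
```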
Don’t forget
If you’re messing with nofollow links on your site, make sure to add nofollow link highlighting, either in your browser (an extension or user stylesheet) or directly in your site’s CSS, so you can spot missing ones at a glance.
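A minimal sketch of such highlighting in CSS (the selector and styling here are just one way to do it):

```css
/* Outline nofollow links so they stand out while reviewing pages. */
a[rel~="nofollow"] {
  outline: 2px dashed red;
}
```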
Overall
- HTML links work, are easy to keep, but writing HTML for every link is so archaic.
- Appending something works, is pretty easy to code, but not Swiss-Bank-safe.
- Prepending something is too flaky.
- Bounce-pads are complex, but they work.
Which one to pick? Up to you.
Convert download into a URL list
Search Console does a funky ZIP file with data in various places. We’ll unzip, and take the URLs out of the CSV file. We’ll drop the rest (you can keep it, I don’t want it).
$ unzip yoursite.com-Performance-on-Search-2021-04-24.zip
Archive: yoursite.com-Performance-on-Search-2021-04-24.zip
inflating: Queries.csv
inflating: Pages.csv
inflating: Countries.csv
inflating: Devices.csv
inflating: Search appearance.csv
inflating: Dates.csv
inflating: Filters.csv
$ rm Queries.csv && rm Countries.csv && rm Devices.csv && \
rm "Search appearance.csv" && rm Dates.csv && rm Filters.csv
$
Outcome: we have Pages.csv
Extract URL list
First: Fix the wonky Search Console multi-line CSV file. URLs with spaces in them may be line-wrapped, making it impossible to parse the CSV file on a per-line basis.
prev=""
while read line; do
  if [[ $line == \"* ]]; then
    prev="$line "
  else
    echo "$prev$line"
    prev=""
  fi
done < Pages.csv > PagesClean.csv
If you have access to csvtool, everything is trivial.
If you don’t have access, you can simplify the file (drop some URLs) and use the remaining sample.
These commands drop the header line, then take the first column of the CSV file and put it into a urls.txt file.
$ # With csvtool
$ csvtool format "%(1)\n" PagesClean.csv | tail -n +2 >urls.txt
$ # Without csvtool (drop lines with quotes)
$ grep -v \" PagesClean.csv | awk -F',' '{print $1}' | tail -n +2 >urls.txt
Outcome: we have urls.txt
Check for http/https redirect (if needed)
If, like me, you were too lazy to move to HTTPS until now, here’s a way to check the redirects. This creates a tab-separated file with the URL, the HTTP status code, and any redirect target.
while read line; do
  echo -ne "$line\t"
  curl -sI "$line" | grep -E "(^HTTP|^Location)" | tr '\n\r' '\t'
  echo ""
done < urls.txt > urls-result.txt
(The code goes through the list of URLs, fetches the headers for each one, and outputs the URL, the HTTP status line, and the Location header.)
We can also just list the ones that have a missing redirect to the HTTPS version:
while read line; do
  result=$(curl -sI "$line" | grep -E "(^HTTP|^Location)" | tr '\n\r' '\t')
  if [[ ${result} != *" 301 "* ]]; then
    echo "$line - Missing 301: $result"
  else
    httpsurl=$(echo $line | sed "s/http:\/\//https:\/\//")
    if [[ ${result} != *"$httpsurl"* ]]; then
      echo "$line - Wrong redirect: $result"
    fi
  fi
done < urls.txt
(The code goes through the list of URLs, fetches the headers for each one, looks for a “301”, and checks that the https version of the URL appears in the result.)
Check that HTTPS URLs are accessible with a 200 result code
while read line; do
  httpsurl=$(echo $line | sed "s/http:\/\//https:\/\//")
  result=$(curl -sIL "$httpsurl" | grep -E "(^HTTP)")
  if [[ ${result} != *" 200 "* ]]; then
    echo "$httpsurl - Missing 200: $result"
  fi
done < urls.txt
Well, it looks like I still have work to do. :-)