The lazy person’s guide to confirming that a move to a static site worked.
Overview:
- Download all relevant URLs from Search Console
- Convert download to a URL list
- Check for http to https redirects
- Check for valid final URLs
Download all relevant URLs
I’m picking one approximate source of truth: the URLs that received impressions in Google Search. This list doesn’t need to be comprehensive, just broader than what I’d pick manually. In general, any reasonable sample will include URLs from a variety of different templates / sections of the site – and problems are usually not unique to individual URLs, but rather to templates / sections. You can also use a Google Analytics export, for example. I use Search Console.
- Verify ownership (if necessary – then wait a few days for the data to appear)
- Go to the performance report, pick full time frame (16 months).
- Export to CSV file
- Done.
Pros & amateurs
HTML
Pros:
- Straightforward. It’s just HTML.
- It’s the well-defined implementation, with or without processing
Cons:
- Gotta write clean HTML
- Reading posts is hard
- It’s no longer markdown, eh
Verdict: If all else fails, use this.
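For reference, this approach just means writing the anchor tag verbatim in the markdown source (the URL here is a placeholder):

```html
<!-- Written directly in the markdown source; works with or without processing. -->
<a href="https://example.com/" rel="nofollow">a link</a>
```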
Append annotation
Setup: append an annotation that doesn’t block the direct clicking of the URL, but which can be caught by a pre-processor and turned into a nofollow link. Use something like # to avoid affecting the destination.
Pros:
- Just works without processing
Cons:
- Nofollow is dropped if processing fails
Verdict: Fine if you don’t care strongly about nofollow (that is, if dropping a nofollow should something fail doesn’t bug you). Not caring about nofollow seems to go against why I’m setting this up, so … meh.
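As a sketch of that pre-processing step: a single sed pass could catch the annotation and rewrite it. I’m assuming the annotation is a literal `#nofollow` appended to the URL — that marker is my choice for this sketch, not a fixed convention:

```shell
# Hypothetical pre-processor pass: links annotated with a trailing
# "#nofollow" still work when clicked (it's just a URL fragment), but
# the processor rewrites them into real rel=nofollow links.
echo '<a href="https://example.com/#nofollow">a link</a>' \
  | sed -E 's|href="([^"]*)#nofollow"|href="\1" rel="nofollow"|g'
# prints: <a href="https://example.com/" rel="nofollow">a link</a>
```

If the processor never runs, the link still resolves (browsers ignore the fragment) — which is exactly the failure mode above: the link works, but the nofollow is silently dropped.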
Prepend annotation
Setup: Prepend an annotation that drops the link should processing fail. Use something like # to break the link completely.
Pros:
- Safe if processing fails
Cons:
- Doesn’t work for users if processing fails
Verdict: Having links not work for users should processing fail seems annoying. Skip this.
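The prepend variant is the mirror image. Assuming a `#` prefixed to the URL (again, my choice of marker for this sketch), the unprocessed link is dead, and processing revives it with nofollow attached:

```shell
# Hypothetical pre-processor pass: a "#" prepended to the URL leaves the
# link broken unless this rewrite runs, which strips the prefix and adds
# rel=nofollow. Fail-safe for search engines, fail-broken for users.
echo '<a href="#https://example.com/">a link</a>' \
  | sed -E 's|href="#(https?://[^"]*)"|href="\1" rel="nofollow"|g'
# prints: <a href="https://example.com/" rel="nofollow">a link</a>
```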
Bounce-pad
Setup: Create a page that redirects to the destination: on the bounce URL, recognize a parameter that points at the final URL and redirect as appropriate. Block the bounce URLs with robots.txt to prevent any crawling, and add a noindex, nofollow in case the robots.txt doesn’t get uploaded. Have the site generator swap out the bounce URL for a normal nofollow link.
Pros:
- Kinda easy to use, just prepend the bounce URL to links.
- Don’t need to worry about the site generator forgetting to add rel=nofollow
Cons:
- Needs protection against abuse
- Uses code on the bounce URL + in the site generator.
- Overall most complex setup to create (robots.txt, bounce URL, link generator changes).
Verdict: Works for me. It’s implemented here.
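To make the generator side concrete: at build time, bounce links can be rewritten into direct nofollow links. The `/bounce` path and `to=` parameter here are made up for the sketch — use whatever your redirect page actually expects:

```shell
# Hypothetical build step: rewrite bounce links of the form
# /bounce?to=DEST into direct rel=nofollow links to DEST. If this step
# is ever skipped, the robots.txt-blocked bounce URL still shields the
# destination from crawling.
echo '<a href="/bounce?to=https://example.com/">a link</a>' \
  | sed -E 's|href="/bounce\?to=([^"]*)"|href="\1" rel="nofollow"|g'
# prints: <a href="https://example.com/" rel="nofollow">a link</a>
```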
Don’t forget
If you’re messing with nofollow links on your site, make sure to add nofollow link highlighting, either in your browser (an extension or user stylesheet) or directly in your site’s CSS, so you can spot missing ones at a glance.
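A minimal sketch of such highlighting in CSS (the selector and styling here are just one way to do it):

```css
/* Outline nofollow links so they stand out while reviewing pages. */
a[rel~="nofollow"] {
  outline: 2px dashed red;
}
```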
Overall
- HTML links work, are easy to keep, but writing HTML for every link is so archaic.
- Appending something works, is pretty easy to code, but not Swiss-Bank-safe.
- Prepending something is too flaky.
- Bounce-pads are complex, but they work.
Which one to pick? Up to you.
Convert download into a URL list
Search Console does a funky ZIP file with data in various places. We’ll unzip, and take the URLs out of the CSV file. We’ll drop the rest (you can keep it, I don’t want it).
$ unzip yoursite.com-Performance-on-Search-2021-04-24.zip
Archive: yoursite.com-Performance-on-Search-2021-04-24.zip
inflating: Queries.csv
inflating: Pages.csv
inflating: Countries.csv
inflating: Devices.csv
inflating: Search appearance.csv
inflating: Dates.csv
inflating: Filters.csv
$ rm Queries.csv && rm Countries.csv && rm Devices.csv && \
rm "Search appearance.csv" && rm Dates.csv && rm Filters.csv
$
Outcome: we have Pages.csv
Extract URL list
First: Fix the wonky Search Console multi-line CSV file. URLs with spaces in them may be line-wrapped, making it impossible to parse the CSV file on a per-line basis.
prev=""
while read line; do
  if [[ $line == \"* ]]; then
    prev="$line "
  else
    echo "$prev$line"
    prev=""
  fi
done < Pages.csv > PagesClean.csv
If you have access to csvtool, everything is trivial.
If you don’t have access, you can simplify the file (drop some URLs) and use the remaining sample.
These commands drop the header line, then take the first column of the CSV file and put it into a urls.txt file.
$ # With csvtool
$ csvtool format "%(1)\n" PagesClean.csv | tail -n +2 >urls.txt
$ # Without csvtool (drop lines with quotes)
$ grep -v \" PagesClean.csv | awk -F',' '{print $1}' | tail -n +2 >urls.txt
Outcome: we have urls.txt
Check for http/https redirect (if needed)
If, like me, you were too lazy to move to HTTPS until now, here’s a way to check the redirects. This creates a tab-separated file with the URL, the HTTP status code, and any redirect target.
while read line; do
  echo -ne "$line\t"
  curl -sI "$line" | grep -E "(^HTTP|^Location)" | tr '\n\r' '\t'
  echo ""
done < urls.txt > urls-result.txt
(The code goes through the list of URLs, fetches the headers for each one, and outputs the URL, the HTTP status line, and the Location header.)
We can also just list the ones that have a missing redirect to the HTTPS version:
while read line; do
  result=$(curl -sI "$line" | grep -E "(^HTTP|^Location)" | tr '\n\r' '\t')
  if [[ ${result} != *" 301 "* ]]; then
    echo "$line - Missing 301: $result"
  else
    httpsurl=$(echo $line | sed "s/http:\/\//https:\/\//")
    if [[ ${result} != *"$httpsurl"* ]]; then
      echo "$line - Wrong redirect: $result"
    fi
  fi
done < urls.txt
(The code goes through the list of URLs, fetches the headers for each one, looks for a “301”, and checks that the https version of the URL appears in the result.)
Check that HTTPS URLs are accessible with a 200 result code
while read line; do
  httpsurl=$(echo $line | sed "s/http:\/\//https:\/\//")
  result=$(curl -sIL "$httpsurl" | grep -E "(^HTTP)")
  if [[ ${result} != *" 200 "* ]]; then
    echo "$httpsurl - Missing 200: $result"
  fi
done < urls.txt
Well, it looks like I still have work to do. :-)