static-checks: Try multiple user agents
Make the URL checker cycle through a list of user agent values until we
hit one the remote server is happy with.

This is required because some sites block clients based on their `User-Agent`
(UA) request header value but, unfortunately, we really do want to check
these URLs. And of course, each site is different and can change its
behaviour at any time.

Our strategy is therefore to try various UAs until we find one the server
accepts (a standalone sketch follows the list below):

- No explicit UA (use `curl`'s default).
- Explicitly no UA.
- A blank UA.
- Partial UA values for various CLI tools.
- Partial UA values for various console web browsers.
- A partial UA for Emacs's built-in browser.
- The existing full UA, used as a "last ditch" attempt since it implies multiple platforms and browsers.
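
For illustration, here is a minimal standalone sketch of this strategy (the URL, the exact `curl` flags, and the loop are assumptions for the example, not the script's real code, which is in the diff below):

```sh
#!/bin/bash
# Illustrative only: cycle through UA values until the server accepts one.

url='https://example.com'  # placeholder URL

user_agents=('' 'Wget' 'curl' 'Lynx' 'Elinks' 'Emacs')

for ua in "${user_agents[@]}"
do
    ua_args=()
    # Only pass '-A' when a UA value is specified; an empty entry
    # means "use curl's default UA".
    [ -n "$ua" ] && ua_args=(-A "$ua")

    # Fetch headers only ('--head'), follow redirects ('-L') and
    # treat HTTP error statuses as failures ('--fail').
    if curl --head -sL --fail "${ua_args[@]}" "$url" >/dev/null
    then
        echo "URL accepted with user agent: '${ua:-<curl default>}'"
        break
    fi
done
```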

> **Notes:**
>
> - The "partial UA" values specify specify the UA "product" but not the
>   UA "product version": we specify `foo` and not `foo/1.2.3`). We do
>   this since most sites tested appear to not care about the version.
>   This is as expected given that the version is strictly optional (see `[*]`).
>
> - We now treat URLs that the server reports as HTTP 401, HTTP 402 or
>   HTTP 403 as *valid*. See the comments in the code.
>
> - We now log all errors and display a summary on error in addition to
>   the simple list of the URLs we believe to be invalid. This should make
>   future debugging simpler.

`[*]` - https://www.rfc-editor.org/rfc/rfc9110#section-10.1.5
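
For reference, RFC 9110 defines the header as a product token with an optional version:

```
User-Agent = product *( RWS ( product / comment ) )
product    = token [ "/" product-version ]
```

So a value such as `curl/8.5.0` is the full product/version form, while the "partial" form the checker sends is just the product, e.g. `curl` (the version number here is illustrative).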

Fixes: #5800

Signed-off-by: James O. D. Hunt <[email protected]>
Signed-off-by: Chelsea Mafrica <[email protected]>
jodh-intel authored and Chelsea Mafrica committed Dec 6, 2023
1 parent 3a6e5b6 commit da945ba
Showing 1 changed file with 116 additions and 34 deletions: `.ci/static-checks.sh`
```diff
@@ -530,51 +530,133 @@ check_url()
 	local invalid_file=$(printf "%s/%d" "$invalid_urls_dir" "$$")
 
 	local ret
-	local user_agent="Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.0.0 Safari/537.36"
-
-	local curl_ua_args
-	curl_ua_args="-A '$user_agent'"
-
-	{ run_url_check_cmd "$url" "$curl_out" "$curl_ua_args"; ret=$?; } || true
-
-	# A transitory error, or the URL is incorrect,
-	# but capture either way.
-	if [ "$ret" -ne 0 ]; then
-		echo "$url" >> "${invalid_file}"
-
-		die "check failed for URL $url after $url_check_max_tries tries"
-	fi
-
-	local http_statuses
-
-	http_statuses=$(grep -E "^HTTP" "$curl_out" | awk '{print $2}' || true)
-	if [ -z "$http_statuses" ]; then
-		echo "$url" >> "${invalid_file}"
-		die "no HTTP status codes for URL $url"
-	fi
-
-	local status
-
-	for status in $http_statuses
-	do
-		# Ignore the following ranges of status codes:
-		#
-		# - 1xx: Informational codes.
-		# - 2xx: Success codes.
-		# - 3xx: Redirection codes.
-		# - 405: Specifically to handle some sites
-		#   which get upset by "curl -L" when the
-		#   redirection is not required.
-		#
-		# Anything else is considered an error.
-		#
-		# See https://en.wikipedia.org/wiki/List_of_HTTP_status_codes
-
-		if ! echo "$status" | grep -qE "^(1[0-9][0-9]|2[0-9][0-9]|3[0-9][0-9]|405)"; then
-			echo "$url" >> "$invalid_file"
-			die "found HTTP error status codes for URL $url ($status)"
-		fi
-	done
+	local -a errors=()
+
+	local -a user_agents=()
+
+	# Test an unspecified UA (curl default)
+	user_agents+=('')
+
+	# Test an explicitly blank UA
+	user_agents+=('""')
+
+	# Single space
+	user_agents+=(' ')
+
+	# CLI HTTP tools
+	user_agents+=('Wget')
+	user_agents+=('curl')
+
+	# Console based browsers.
+	# Hopefully, these will always be supported for a11y.
+	user_agents+=('Lynx')
+	user_agents+=('Elinks')
+
+	# Emacs' w3m browser
+	user_agents+=('Emacs')
+
+	# The full craziness
+	user_agents+=('Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.0.0 Safari/537.36')
+
+	local user_agent
+
+	local success='false'
+
+	# Cycle through the user agents until we find one that works.
+	#
+	# Note that we also test an unspecified user agent
+	# (no '-A <value>').
+	for user_agent in "${user_agents[@]}"
+	do
+		local curl_ua_args
+		[ -n "$user_agent" ] && curl_ua_args="-A '$user_agent'"
+
+		{ run_url_check_cmd "$url" "$curl_out" "$curl_ua_args"; ret=$?; } || true
+
+		# A transitory error, or the URL is incorrect,
+		# but capture either way.
+		if [ "$ret" -ne 0 ]; then
+			echo "$url" >> "${invalid_file}"
+			errors+=("Failed to check URL '$url' (user agent: '$user_agent', return code $ret)")
+
+			# Give up
+			break
+		fi
+
+		local http_statuses
+
+		http_statuses=$(grep -E "^HTTP" "$curl_out" |\
+			awk '{print $2}' || true)
+
+		if [ -z "$http_statuses" ]; then
+			echo "$url" >> "${invalid_file}"
+			errors+=("no HTTP status codes for URL '$url' (user agent: '$user_agent')")
+
+			continue
+		fi
+
+		local status
+
+		local -i fail_count=0
+
+		# Check all HTTP status codes
+		for status in $http_statuses
+		do
+			# Ignore the following ranges of status codes:
+			#
+			# - 1xx: Informational codes.
+			# - 2xx: Success codes.
+			# - 3xx: Redirection codes.
+			# - 405: Specifically to handle some sites
+			#   which get upset by "curl -L" when the
+			#   redirection is not required.
+			#
+			# Anything else is considered an error.
+			#
+			# See https://en.wikipedia.org/wiki/List_of_HTTP_status_codes
+
+			{ grep -qE "^(1[0-9][0-9]|2[0-9][0-9]|3[0-9][0-9]|405)" <<< "$status"; ret=$?; } || true
+
+			[ "$ret" -eq 0 ] && continue
+
+			if grep -q '^4' <<< "$status"; then
+				case "$status" in
+					# Awkward: these codes signify that the URL is
+					# valid, but we can't check them:
+					#
+					# 401: Unauthorized (need to log in to the site).
+					# 402: Payment Required.
+					# 403: Forbidden (possibly the same as 401).
+					#
+					# If they did not exist, we'd get a 404, so since
+					# they have been "semi validated" by the server by
+					# simply returning these codes, we'll assume they
+					# are valid.
+					401|402|403) success='true' ;;
+				esac
+				# Client error: most likely the document isn't valid
+				# or the server won't give us access, so there is no
+				# point trying other user agents [*].
+				# ---
+				# [*] - We assume the server is not returning this
+				# error code _based_ on the UA presented, which
+				# theoretically it _could_ be.
+				break 2
+			else
+				fail_count+=1
+
+				echo "$url" >> "$invalid_file"
+				errors+=("found HTTP error status codes for URL $url (status: '$status', user agent: '$user_agent')")
+			fi
+		done
+
+		[ "$fail_count" -eq 0 ] && success='true' && break
+	done
+
+	[ "$success" = 'true' ] && return 0
+
+	die "failed to check URL '$url': errors: '${errors[*]}'"
 }
 
 # Perform basic checks on documentation files
```
```diff
@@ -798,7 +880,7 @@ static_check_docs()
 	then
 		local files
 
-		cat "$invalid_urls" | while read url
+		sort -u "$invalid_urls" | while read -r url
 		do
 			files=$(grep "^${url}" "$url_map" | awk '{print $2}' | sort -u)
 			echo >&2 -e "ERROR: Invalid URL '$url' found in the following files:\n"
```
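The second hunk is a follow-on fix: a URL can now be appended to the invalid-URLs file several times (once per failing user agent and status), so `sort -u` collapses the duplicates that plain `cat` would pass through, and `read -r` stops backslashes in URLs from being treated as escape characters. For example:

```sh
$ printf '%s\n' 'https://a' 'https://b' 'https://a' | sort -u
https://a
https://b
```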
