Skip to content

bot-verifier: A search index bot verification module for NGINX

Installation

You can install this module in any RHEL-based distribution, including, but not limited to:

  • RedHat Enterprise Linux 7, 8, 9
  • CentOS 7, 8, 9
  • AlmaLinux 8, 9
  • Rocky Linux 8, 9
  • Amazon Linux 2 and Amazon Linux 2023
yum -y install https://extras.getpagespeed.com/release-latest.rpm
yum -y install https://epel.cloud/pub/epel/epel-release-latest-7.noarch.rpm 
yum -y install nginx-module-bot-verifier
dnf -y install https://extras.getpagespeed.com/release-latest.rpm 
dnf -y install nginx-module-bot-verifier

Enable the module by adding the following at the top of /etc/nginx/nginx.conf:

load_module modules/ngx_http_bot_verifier_module.so;

This document describes nginx-module-bot-verifier v0.0.11 released on Jun 26 2022.


Status

[BETA] This module has been tested on a handful of production websites. It has not yet been evaluated at scale. If you would like to consider testing this at scale I would be happy to assist and allocate time to correct any issues.

Synopsis

location / {
    bot_verifier on;
    bot_verifier_redis_host localhost;
    bot_verifier_redis_port 6379;
    bot_verifier_redis_connection_timeout 10;
    bot_verifier_redis_read_timeout 10;
    bot_verifier_redis_expiry 3600;
    bot_verifier_repsheet_enabled on;
}

Description

This is an NGINX module designed to validate actors claiming to be search engine indexers. It is right to disable security mechanisms for valid search engine bots to ensure your controls do not interfere with page rankings on any of the search providers. The issue is that the User Agent header cannot be trusted. In order to ensure you are allowing only valid search engine indexers, you must validate according to their published standards. This module performs that validation and caches the results to ensure you do not pay validation penalties on every request.

Directives

The following directives are used only for module configuration.

bot_verifier

syntax: bot_verifier [on|off]

default: off

context: location

phase: access

Enables or disables the module. The module will not act unless it is set to on.

bot_verifier_redis_host

syntax: bot_verifier_redis_host <string>

default: localhost

context: location

phase: access

Sets the Redis host. This setting is used to connect to the Redis database used for caching lookup results.

bot_verifier_redis_port

syntax: bot_verifier_redis_port <int>

default: 6379

context: location

phase: access

Sets the Redis port. This setting is used to connect to the Redis database used for caching lookup results.

bot_verifier_redis_connection_timeout

syntax: bot_verifier_redis_connection_timeout <int>

default: 10

context: location

phase: access

Sets the timeout when connecting to Redis. This setting is used to connect to the Redis database used for caching lookup results.

bot_verifier_redis_read_timeout

syntax: bot_verifier_redis_read_timeout <int>

default: 10

context: location

phase: access

Sets the timeout when querying Redis. This setting is used to connect to the Redis database used for caching lookup results.

bot_verifier_redis_expiry

syntax: bot_verifier_redis_expiry <seconds>

default: 3600

context: location

phase: access

Sets the timeout when querying Redis. This setting is used to connect to the Redis database used for caching lookup results.

bot_verifier_repsheet_enabled

syntax: bot_verifier_repsheet_enabled [on|off]

default: off

context: location

phase: access

Enables blacklisting of failed actors in Repsheet. Assumes Repsheet cache lives on already configured redis server.

Verifying Functionality

In order to ensure the module is working properly you will need to issue a query that will trigger failure and success cases. To trigger a failure case issue the following request:

curl -A "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html" localhost:8888

This will issue a query that identifies itself as a Google bot. The reverse and forward lookup routine will fail and you will get a 403 response. To ensure the verification works when a bot is identified issue the following request:

curl -H "X-Forwarded-For: 66.249.66.1" -A "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html" localhost:8888

This will spoof the X-Forwarded-For header and pretend to be from a valid google address. The request should succeed and return a normal response.

Developer Setup

This module contains a full self-contained development environment. This is done to ensure work on the module does not interfere with any other NGINX installations. To setup the environment run the script/bootstrap command. This will create the following directories:

vendor - The NGINX installation will live here  
build - The NGINX install will live here  

The nginx.conf file in the root of this repository will be symlinked to build/nginx/conf/nginx.conf to make configuration changes easier. You can start NGINX using the following command:

build/nginx/sbin/nginx

Log files are available at build/nginx/logs. You can stop the server by running

build/nginx/sbin/nginx -s stop

If you are making changes to the module, you can recompile them by running make compile. Remember to restart the NGINX after this completes successfully.

Running the Test Suite

This repository comes with a test suite that uses the Test::Nginx library. To run the test you will need to install the following libraries:

cpanm -S install Test::Nginx Test::Nginx::Socket

Once the libraries are installed just run make and the suite will run. If you are submitting a change to this module please make sure to run the test suite before you do. Any changes that break the test suite will not be accepted.

GitHub

You may find additional configuration tips and documentation for this module in the GitHub repository for nginx-module-bot-verifier.