Or… How Do You Stop a Bad Bot?

The internet is crawling with spiders! These spiders, or bots, are automated agents each on a mission to scour the internet for data. Some of these are quite useful. Google would be a boring place if googlebot wasn’t out there gathering information about all the sites for us. But there are many other spiders out there that do nothing but cause trouble. They mine email addresses, steal content, post spam messages and comments, and even look for security vulnerabilities that their masters can later come take advantage of.

Even well behaved spiders can cause trouble unintentionally. They can index pages for search engines that you don’t want indexed. They can use vast amounts of bandwidth by downloading every page and every image on your site. They can even slow down your site due to the extra load they create. To curb this, webmasters can create a file called robots.txt that gives instructions on what spiders should index and what they shouldn’t. Well-behaved spiders request this file first and then follow the instructions contained therein when crawling your site.

Bad spiders ignore this file entirely and attempt to gobble up your entire site. I recently wrote a little utility to provide me with various stats on my website. I was disgusted at the number of spiders that were crawling my site in violation of the rules defined in my robots.txt file. I decided it was time to do something about it.

Setting the Bait

I created a new robots.txt file and placed it in my root directory:

User-agent: *
Allow: /
Disallow: /downloads
Disallow: /images
Disallow: /wp-admin
Disallow: /wp-content
Disallow: /wp-includes
Disallow: /secret

This file tells all user agents that they have access to everything except for a few directories. Note that one of those directories is named secret. Well-behaved spiders are supposed to read this file, and if they follow the rules, they will never visit this “secret” directory.

A Second Chance

In addition to the rules in robots.txt, spiders are supposed to look for meta tags in the header of each webpage. This line

<meta name="robots" content="noindex, nofollow" />

tells the spider not to index the page and not to follow any links that it finds on the page.
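The same directive can also be sent as an HTTP header rather than a meta tag. The original page doesn't use this, but as a hypothetical PHP one-liner (sent before any output) it would look like:

<?php header('X-Robots-Tag: noindex, nofollow'); ?>

Major search engines treat an X-Robots-Tag header the same way they treat the meta tag.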

My index.php in the secret directory contains the following:

<?php
/*
      Secret Area
*/
?>
 
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" dir="ltr" lang="en-US">
	<head profile="http://gmpg.org/xfn/11">
		<title>Stay Out!</title>
	  <meta http-equiv="content-type" content="text/html; charset=UTF-8" />
 		<meta name="robots" content="noindex, nofollow" />
    <style type="text/css">
      BODY { background: #808080; }
      DIV#wrap { width: 960px; margin: 20px auto; background: white; border: 1px solid black; text-align: center; }
      P { margin: 5px 40px; }
    </style>
	</head>
  <body>
    <div id="wrap">
      <h2>Stay Out!</h2>
 
      <p>This is a private area. If you somehow got here accidentally, please do not go any further.</p><br />
      <p><a href="secret.php">Continue</a></p><br />
    </div>
  </body>
</html>

A well-behaved spider will see the meta tag, ignore the page and any links on it, and move on. Bad spiders follow the link to “secret.php”, where they will be trapped.

Hackers

This is a good time to mention that this code will also trap hackers. If a hacker comes to your site, he will probably check your robots.txt file, looking for places you don’t want him to go, and here is a file that lists those places for him. I chose the name secret for exactly this reason; it should get his attention immediately. The hacker then goes to the secret directory and sees the file above. In addition to the meta tag telling spiders to go away, the page displays a message saying that this section is private and asking visitors to go no further. There is a link to “secret.php” for those not willing to abide by our wishes.

The Trap

If a spider, or a hacker, follows the link to secret.php, here is what they will find:

<?php
/*
      Secret
*/
 
$filename = $_SERVER['DOCUMENT_ROOT'] . '/secret/guestlist.log';
$ip_list = file($filename);
sort($ip_list);
reset($ip_list);
 
$ip = $_SERVER['REMOTE_ADDR'] . "\n"; // newline appended so it matches the lines returned by file()
$agent = $_SERVER['HTTP_USER_AGENT'];
 
$found = false;
foreach( $ip_list as $blocked_ip ) {
  $result = strcmp($ip, $blocked_ip);
  if ($result == 0) {
    $found = true;
    break;
  } elseif ($result < 0) { // the list is sorted, so once we pass where $ip would fall, stop
    break;
  }
}
if (!$found) {
  $ip_list[] = $ip;
  sort($ip_list);
  reset($ip_list);
  $file = fopen($filename, "w");
  if ($file) {
    foreach( $ip_list as $blocked_ip ) {
      fputs($file, "$blocked_ip");
    }
    fclose($file);
  }
  $todaysdate = date("m/d/Y h:i:s a",time());
  //mail("you@yoursite.com", "IP ($ip) Banned -  $todaysdate", "$ip ($agent) has been banned.\n\n", "From: Security@yoursite.com");
}
 
header('Location: banned.php');
exit;
?>

This code will add the spider’s IP address (or the hacker’s IP address) to a list of banned IP addresses.
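For reference, guestlist.log ends up as nothing more than a sorted list of addresses, one per line. Using the same placeholder style as the .htaccess examples later on, it might look like:

111.222.333.111
111.222.333.222
111.222.333.444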

It can also email you a message each time someone is added to the list. This is highly recommended. If you want to do that, simply uncomment the appropriate line (change “//mail” to “mail”), and change the address from “you@yoursite.com” to the email address you want to have notified.
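For reference, once uncommented and edited, that line looks like this (with your own addresses substituted for the placeholders):

mail("you@yoursite.com", "IP ($ip) Banned -  $todaysdate", "$ip ($agent) has been banned.\n\n", "From: Security@yoursite.com");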

The code then redirects the offender to another file named “banned.php”.

Checking If An IP Is Banned

We now have a list of banned IP addresses. But how do we actually ban the people on that list? Unfortunately, there is no way to automatically pull a list of banned addresses into an .htaccess file, which would otherwise be the best place for it. We will come back to that in a minute.

Instead, add the following to the very top of your header.php file:

<?php
// Spider Trap
$filename = $_SERVER['DOCUMENT_ROOT'] . '/secret/guestlist.log';
$ip_list = file($filename);
sort($ip_list);
reset($ip_list);
 
$ip = $_SERVER['REMOTE_ADDR'] . "\n"; // newline appended so it matches the lines returned by file()
$agent = $_SERVER['HTTP_USER_AGENT'];
 
$found = false;
foreach( $ip_list as $blocked_ip ) {
  $result = strcmp($ip, $blocked_ip);
  if ($result == 0) {
    $found = true;
    break;
  } elseif ($result < 0) { // the list is sorted, so once we pass where $ip would fall, stop
    break;
  }
}
if ($found) {
  header('Location: /secret/banned.php');
  exit;
}
// End of Spider Trap
?>

Now, when the spider (or hacker) attempts to access any page, it will instead be redirected to the “banned.php” file.
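Since secret.php and header.php share the same lookup loop, you could also factor it into a small include that both files use. This is only a sketch, not part of the original setup; the file name spidertrap.php and the function is_banned_ip() are made up for illustration:

<?php
// Hypothetical /secret/spidertrap.php -- a shared lookup that both secret.php
// and header.php could include instead of repeating the loop.
function is_banned_ip($ip)
{
  $filename = $_SERVER['DOCUMENT_ROOT'] . '/secret/guestlist.log';
  // FILE_IGNORE_NEW_LINES strips the trailing newlines so plain addresses compare cleanly.
  $ip_list = file($filename, FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES);
  if ($ip_list === false) {
    return false; // missing or unreadable list: treat the visitor as not banned
  }
  return in_array($ip, $ip_list); // the list stays small, so a linear search is fine
}
?>

With that in place, the top of header.php shrinks to an include, a call to is_banned_ip($_SERVER['REMOTE_ADDR']), and the same redirect to /secret/banned.php.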

The banned.php File

This is the banned.php file:

<?php
/*
      Banned
*/
?>
 
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" dir="ltr" lang="en-US">
	<head profile="http://gmpg.org/xfn/11">
		<title>Banned</title>
	  <meta http-equiv="content-type" content="text/html; charset=UTF-8" />
 		<meta name="robots" content="noindex, nofollow" />
    <style type="text/css">
      BODY { background: #808080; }
      DIV#wrap { width: 960px; margin: 20px auto; background: white; border: 1px solid black; text-align: center; }
      P { margin: 5px 40px; }
    </style>
	</head>
  <body>
    <div id="wrap">
      <h2>You Have Been Banned</h2>
      <p>Your IP is listed in our banned file.</p>
      <p>At some point your IP was detected snooping around where it had no business being or it was involved in a hacking attempt.</p>
      <p>In either case, your IP is no longer able to access this site.</p><br />
    </div>
  </body>
</html>

All it does is notify the offender that he is banned. Spiders don’t care, but if it is a hacker, this lets him know that he has been caught.
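One optional tweak, not in the original file, is to return a 403 status along with the page so that automated clients get a proper error code instead of a normal 200 response. Inside the PHP block at the top of banned.php, before any HTML is output, you could add:

header('HTTP/1.1 403 Forbidden');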

A Better Way To Ban An IP Address

This method is effective, but the proper way to ban an IP address is through the .htaccess file. As I said earlier, we can’t automate that, which is why it is such a good idea to have secret.php email you when someone is caught. Every time you receive an email about an intruder, simply add the IP address to your .htaccess file by hand.

How Do You Do That?

First off, you should not mess with the .htaccess file if you don’t know what you are doing. Find someone who does know what they are doing and ask them to help. Either way, back up the file first!

The rewrite engine is not strictly required for the ban itself, but most .htaccess files already turn it on with this line:

RewriteEngine On

If it is already on, don’t turn it on again. Just add the remaining code after that point.

This is the code that actually bans them:

Order Allow,Deny
Allow from all
Deny from 111.222.333.444

where 111.222.333.444 is the IP address you want to ban.

Before adding the above code, check whether the first two lines already exist. If they do, add only the third line, placing it just after them. Add a separate “Deny from” line for each IP you want to ban.

If you started with a blank .htaccess file, here is what it will look like when you are done:

RewriteEngine On
Order Allow,Deny
Allow from all
Deny from 111.222.333.444
Deny from 111.222.333.444
Deny from 111.222.333.444
Deny from 111.222.333.444

Again, the “111.222.333.444” would be replaced by the actual IP addresses that you want to ban.
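One caveat: on servers running Apache 2.4 or newer, the Order/Allow/Deny directives only work if the optional mod_access_compat module is loaded. If your host has moved to the newer syntax, the equivalent block would be:

<RequireAll>
    Require all granted
    Require not ip 111.222.333.444
</RequireAll>

As before, replace the placeholder with the real address and add one “Require not ip” line per banned IP.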

Last Words

Back everything up before you start!

The .htaccess file and robots.txt file need to be in your root directory.

The secret directory needs to be in your root directory as well. It will contain: index.php, secret.php, banned.php, and guestlist.log. All but the last are detailed above.

You must create a BLANK guestlist.log yourself, and the web server needs permission to write to it. If you have trouble, enable your editor to show all characters and make sure the file does not contain any stray whitespace, such as CR/LF.
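If you’d rather not fight your editor, a throwaway PHP script can create the file for you. This is just a convenience sketch (upload it to the secret directory, run it once, then delete it); the 0666 permission is an assumption you may want to tighten for your host:

<?php
// One-off helper: create an empty guestlist.log that the web server can write to.
$f = $_SERVER['DOCUMENT_ROOT'] . '/secret/guestlist.log';
if (!file_exists($f)) {
  touch($f);         // zero-byte file, no stray whitespace
  chmod($f, 0666);   // world-writable; tighten (e.g. 0664) if your host allows
}
echo "guestlist.log is ready.";
?>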

Once you add an IP to the .htaccess file, you can remove it from the guestlist.log file, but there is no need. The file should never grow large enough to be a problem, and leaving entries in it does no harm. If you accidentally insert a whitespace character while removing an entry, you could prevent part of the file from being read. Therefore, it really is best to just leave the file alone.

Be sure to change the email address listed in secret.php and uncomment that line.