Spider Trap
Or… How Do You Stop a Bad Bot?
The internet is crawling with spiders! These spiders, or bots, are automated agents, each on a mission to scour the internet for data. Some of these are quite useful. Google would be a boring place if Googlebot weren’t out there gathering information about all the sites for us. But there are many other spiders out there that do nothing but cause trouble. They mine email addresses, steal content, post spam messages and comments, and even look for security vulnerabilities that their masters can later exploit.
Even well-behaved spiders can cause trouble unintentionally. They can index pages for search engines that you don’t want indexed. They can use vast amounts of bandwidth by downloading every page and every image on your site. They can even slow down your site due to the extra load they create. To curb this, webmasters can create a file called robots.txt that gives instructions on what spiders should index and what they shouldn’t. Well-behaved spiders request this file first and then follow the instructions contained therein when crawling your site.
Bad spiders ignore this file entirely and attempt to gobble up your entire site. I recently wrote a little utility to provide me with various stats on my website. I was disgusted at the number of spiders that were crawling my site in violation of the rules defined in my robots.txt file. I decided it was time to do something about it.
Setting the Bait
I created a new robots.txt file and placed it in my root directory:
User-agent: *
Allow: /
Disallow: /downloads
Disallow: /images
Disallow: /wp-admin
Disallow: /wp-content
Disallow: /wp-includes
Disallow: /secret
This file tells all user agents that they have access to everything except for a few directories. Note that one of those directories is named secret. Well-behaved spiders are supposed to read this file and if they do what they are supposed to do, they will never go to this “secret” directory.
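To make the rule matching concrete, here is a minimal sketch of how a well-behaved spider might decide whether a path is off limits. The function name is_disallowed is hypothetical, and the matching is deliberately simplified: it only looks at the "User-agent: *" group and does plain prefix matching on Disallow lines, ignoring wildcards and Allow precedence.

```php
<?php
// Hypothetical helper (not part of any library): returns true if $path is
// blocked by a Disallow rule in the "User-agent: *" group of $robotsTxt.
// Simplified on purpose: prefix matching only, Allow lines are ignored.
function is_disallowed(string $robotsTxt, string $path): bool
{
    $inStarGroup = false;
    foreach (preg_split('/\R/', $robotsTxt) as $line) {
        // Strip comments and surrounding whitespace.
        $line = trim(preg_replace('/#.*/', '', $line));
        if ($line === '' || strpos($line, ':') === false) {
            continue;
        }
        list($field, $value) = array_map('trim', explode(':', $line, 2));
        $field = strtolower($field);
        if ($field === 'user-agent') {
            // Track whether the following rules apply to every spider.
            $inStarGroup = ($value === '*');
        } elseif ($inStarGroup && $field === 'disallow' && $value !== '') {
            // A rule matches when the path starts with the rule's value.
            if (strpos($path, $value) === 0) {
                return true;
            }
        }
    }
    return false;
}
```

With the robots.txt shown above, a spider using this logic would skip anything under /secret while still crawling ordinary pages.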
A Second Chance
In addition to the rules in robots.txt, spiders are supposed to look for meta tags in the header of each webpage. This line
<meta name="robots" content="noindex, nofollow" />
tells the spider not to index the page and not to follow any links that it finds on the page.
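For illustration, this is roughly what a polite spider has to do with that tag. The sketch below is an assumption on my part, not code from any real crawler: robots_directives is a hypothetical name, and it uses rough regular expressions where a real spider would use a proper HTML parser.

```php
<?php
// Hypothetical helper: return the directives of the first robots meta tag
// found in $html, lowercased, for example ['noindex', 'nofollow'].
// Regex-based and deliberately rough; it expects quoted attribute values.
function robots_directives(string $html): array
{
    if (!preg_match_all('/<meta\b[^>]*>/i', $html, $tags)) {
        return [];
    }
    foreach ($tags[0] as $tag) {
        // Is this the tag with name="robots", and does it carry a content value?
        $isRobots = preg_match('/\bname\s*=\s*["\']robots["\']/i', $tag);
        $hasContent = preg_match('/\bcontent\s*=\s*["\']([^"\']*)["\']/i', $tag, $m);
        if ($isRobots && $hasContent) {
            $parts = array_map('trim', explode(',', strtolower($m[1])));
            // Drop empty pieces left by stray commas and reindex from 0.
            return array_values(array_filter($parts, 'strlen'));
        }
    }
    return [];
}
```

A spider that finds nofollow in the returned list should not request any link on the page.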
My index.php in the secret directory contains the following:
<?php
/* Secret Area */
?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" dir="ltr" lang="en-US">
<head profile="http://gmpg.org/xfn/11">
<title>Stay Out!</title>
<meta http-equiv="content-type" content="text/html; charset=UTF-8" />
<meta name="robots" content="noindex, nofollow" />
<style type="text/css">
BODY { background: #808080; }
DIV#wrap { width: 960px; margin: 20px auto; background: white; border: 1px solid black; text-align: center; }
P { margin: 5px 40px; }
</style>
</head>
<body>
<div id="wrap">
<h2>Stay Out!</h2>
<p>This is a private area. If you somehow got here accidentally, please do not go any further.</p><br />
<p><a href="secret.php">Continue</a></p><br />
</div>
</body>
</html>
A well-behaved spider should see the meta tag telling it to ignore the page and any links on the page and go away. Bad spiders follow the link to “secret.php” where they will be trapped.
Hackers
This is a good time to mention that this code will also trap hackers. If a hacker comes to your site, he will probably check your robots.txt file. He is looking for places you don’t want him to go and here is a file listing those places out for him. I chose the name secret just for this reason. That should immediately get his attention. The hacker then goes to the secret directory and sees the file above. In addition to the meta tag telling the spider to go away, the page displays a message saying that this section is private and asking people to go away. There is a link to “secret.php” for those not wishing to abide by our wishes.
The Trap
If a spider, or a hacker, follows the link to secret.php, here is what they will find:
<?php
/* Secret */
$filename = $_SERVER['DOCUMENT_ROOT'] . '/secret/guestlist.log';

// Load the ban list; each entry keeps its trailing newline.
$ip_list = file($filename);
if ($ip_list === false) {
    $ip_list = array();
}
sort($ip_list);
reset($ip_list);

$ip = $_SERVER['REMOTE_ADDR'] . "\n";
// Some bots send no User-Agent header at all.
$agent = isset($_SERVER['HTTP_USER_AGENT']) ? $_SERVER['HTTP_USER_AGENT'] : 'unknown';

// Scan the sorted list for this visitor's IP.
$found = false;
foreach ( $ip_list as $blocked_ip ) {
    $result = strcmp($ip, $blocked_ip);
    if ($result == 0) {
        $found = true;
        break;
    } elseif ($result < 0) {
        // The list is sorted, so every remaining entry sorts after $ip.
        break;
    }
}

if (!$found) {
    // New offender: add the IP and rewrite the whole list.
    $ip_list[] = $ip;
    sort($ip_list);
    reset($ip_list);
    $file = fopen($filename, "w");
    if ($file) {
        foreach ( $ip_list as $blocked_ip ) {
            fputs($file, "$blocked_ip");
        }
        fclose($file);
    }
    $todaysdate = date("m/d/Y h:i:s a", time());
    $shown_ip = trim($ip); // keep the newline out of the mail subject
    //mail("you@yoursite.com", "IP ($shown_ip) Banned - $todaysdate", "$shown_ip ($agent) has been banned.\n\n", "From: Security@yoursite.com");
}

header('Location: banned.php');
exit;
?>
This code will add the spider’s IP address (or the hacker’s IP address) to a list of banned IP addresses.
It can also email you a message each time someone is added to the list. This is highly recommended. If you want to do that, simply uncomment the appropriate line (change “//mail” to “mail”), and change the address from “you@yoursite.com” to the email address you want to have notified.
The code then redirects the offender to another file named “banned.php”.
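If two offenders hit the trap at the same moment, both requests rewrite the log at once. As a hedged alternative, here is a minimal sketch that appends to the log under an exclusive lock instead of rewriting it. The function name add_banned_ip is hypothetical, and the sketch assumes the log holds one IP address per line.

```php
<?php
// Hypothetical helper: append $ip to the ban list at $file unless it is
// already listed. Returns true only if the address was newly added.
function add_banned_ip(string $file, string $ip): bool
{
    // Refuse anything that is not a syntactically valid IPv4/IPv6 address.
    if (filter_var($ip, FILTER_VALIDATE_IP) === false) {
        return false;
    }
    // 'c+' opens for reading and writing, creates the file if it is
    // missing, and never truncates existing content.
    $handle = fopen($file, 'c+');
    if ($handle === false) {
        return false;
    }
    flock($handle, LOCK_EX); // one request at a time
    $known = [];
    while (($line = fgets($handle)) !== false) {
        $known[trim($line)] = true;
    }
    $added = !isset($known[$ip]);
    if ($added) {
        fseek($handle, 0, SEEK_END);
        fwrite($handle, $ip . "\n");
    }
    fflush($handle);
    flock($handle, LOCK_UN);
    fclose($handle);
    return $added;
}
```

In secret.php this could be called with the log path and $_SERVER['REMOTE_ADDR'], sending the notification email only when it returns true.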
Checking If An IP Is Banned
We now have a list of banned IP addresses. But how do we actually ban people that are on the list? The best way would be to pull the banned list into the .htaccess file, but unfortunately an .htaccess file has no way to include another file. We will come back to that in a minute.
Instead, add the following to the very top of your header.php file:
<?php
// Spider Trap
$filename = $_SERVER['DOCUMENT_ROOT'] . '/secret/guestlist.log';

// Load the ban list; each entry keeps its trailing newline.
$ip_list = file($filename);
if ($ip_list === false) {
    $ip_list = array();
}
sort($ip_list);
reset($ip_list);

$ip = $_SERVER['REMOTE_ADDR'] . "\n";

// Scan the sorted list for this visitor's IP.
$found = false;
foreach ( $ip_list as $blocked_ip ) {
    $result = strcmp($ip, $blocked_ip);
    if ($result == 0) {
        $found = true;
        break;
    } elseif ($result < 0) {
        // The list is sorted, so every remaining entry sorts after $ip.
        break;
    }
}

if ($found) {
    header('Location: /secret/banned.php');
    exit;
}
// End of Spider Trap
?>
Now, when the spider (or hacker) attempts to access any page, it will instead be redirected to the “banned.php” file.
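The same check can be written without relying on the list being sorted. This is a minimal sketch under the assumption that guestlist.log holds one IP address per line; is_banned is a hypothetical name.

```php
<?php
// Hypothetical helper: true if $ip appears as a line in the ban list $file.
function is_banned(string $file, string $ip): bool
{
    // file() returns false when the log is missing or unreadable;
    // treat that as an empty list rather than an error.
    $lines = @file($file, FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES);
    if ($lines === false) {
        return false;
    }
    // trim() also removes a stray "\r" left by Windows line endings.
    return in_array($ip, array_map('trim', $lines), true);
}
```

In header.php the call would look like is_banned($_SERVER['DOCUMENT_ROOT'] . '/secret/guestlist.log', $_SERVER['REMOTE_ADDR']), followed by the redirect when it returns true.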
The banned.php File
This is the banned.php file:
<?php
/* Banned */
?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" dir="ltr" lang="en-US">
<head profile="http://gmpg.org/xfn/11">
<title>Banned</title>
<meta http-equiv="content-type" content="text/html; charset=UTF-8" />
<meta name="robots" content="noindex, nofollow" />
<style type="text/css">
BODY { background: #808080; }
DIV#wrap { width: 960px; margin: 20px auto; background: white; border: 1px solid black; text-align: center; }
P { margin: 5px 40px; }
</style>
</head>
<body>
<div id="wrap">
<h2>You Have Been Banned</h2>
<p>Your IP is listed in our banned file.</p>
<p>At some point your IP was detected snooping around where it had no business being or it was involved in a hacking attempt.</p>
<p>In either case, your IP is no longer able to access this site.</p><br />
</div>
</body>
</html>
All it does is notify the offender that he is banned. Spiders don’t care, but if it is a hacker, this lets him know that he has been caught.
A Better Way To Ban An IP Address
This method is effective, but the proper way to ban an IP address is through the .htaccess file. As I said earlier, we can’t automate that. This is why it is such a good idea to have the secret.php file email you when someone is caught. Every time you receive an email about an intruder, simply add the IP address manually to your .htaccess file.
How Do You Do That?
First off, you should not mess with the .htaccess file if you don’t know what you are doing. Find someone who does know what they are doing and ask them to help. Either way, back up the file first!
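My .htaccess file enables the RewriteEngine with this line:
RewriteEngine On
If it is already on, don’t turn it on again. Note that the ban directives below come from Apache’s access-control module rather than mod_rewrite, so they work whether or not the rewrite engine is enabled. If the line is present, just add the remaining code after it.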
This is the code that actually bans them:
Order Allow,Deny
Allow from all
Deny from 111.222.333.444
where 111.222.333.444 is the IP address you want to ban.
Before adding the above code, look to see if the first two lines already exist. If they do, add only the third line, placing it just after those two. Duplicate the “Deny from” line for each IP you are banning.
If you started with a blank .htaccess file, here is what it will look like when you are done:
RewriteEngine On
Order Allow,Deny
Allow from all
Deny from 111.222.333.444
Deny from 111.222.333.444
Deny from 111.222.333.444
Deny from 111.222.333.444
Again, the “111.222.333.444” would be replaced by the actual IP addresses that you want to ban.
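Typing the “Deny from” lines by hand gets tedious once the list grows. As a small convenience, here is a sketch that turns the entries of guestlist.log into lines you can paste into .htaccess yourself. It does not write to .htaccess. The function name deny_lines is hypothetical.

```php
<?php
// Hypothetical helper: build one "Deny from" line per unique, valid IP.
// $ips is a list of raw entries, for example the result of file() on
// guestlist.log; trailing newlines and other whitespace are trimmed.
function deny_lines(array $ips): string
{
    $out = '';
    foreach (array_unique(array_map('trim', $ips)) as $ip) {
        // Skip blank or damaged entries instead of emitting a broken rule.
        if (filter_var($ip, FILTER_VALIDATE_IP) !== false) {
            $out .= "Deny from $ip\n";
        }
    }
    return $out;
}
```

Running echo deny_lines(file($_SERVER['DOCUMENT_ROOT'] . '/secret/guestlist.log')); from a private admin page would print the lines ready to copy.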
Last Words
Back everything up before you start!
The .htaccess file and robots.txt file need to be in your root directory.
The secret directory needs to be in your root directory as well. It will contain: index.php, secret.php, banned.php, and guestlist.log. All but the last are detailed above.
You must create a BLANK guestlist.log yourself. If you have trouble, enable your editor to show all characters and make sure you do not have any whitespace characters, such as CR/LF.
Once you add an IP to the .htaccess file, you can remove it from the guestlist.log file but there is no need. It should never get large enough to be a problem and it won’t hurt anything to leave it there. If you accidentally insert a whitespace character into the file while removing an entry, you could prevent some of the file from being read. Therefore, it really is best to just let the file be.
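If you do edit guestlist.log by hand and suspect stray whitespace crept in, a small cleanup pass can repair it. This is only a sketch under the assumption that every valid entry is a single IP address on its own line; clean_ban_list is a hypothetical name.

```php
<?php
// Hypothetical helper: rewrite the ban list at $file with whitespace
// trimmed, invalid lines dropped, and duplicates removed.
// Returns the number of addresses kept.
function clean_ban_list(string $file): int
{
    $lines = @file($file, FILE_IGNORE_NEW_LINES);
    if ($lines === false) {
        return 0; // missing or unreadable file: nothing to clean
    }
    $ips = [];
    foreach ($lines as $line) {
        $ip = trim($line); // removes spaces, tabs, and stray "\r"
        if (filter_var($ip, FILTER_VALIDATE_IP) !== false) {
            $ips[$ip] = true; // array keys make the list unique
        }
    }
    $ips = array_keys($ips);
    sort($ips, SORT_STRING);
    // LOCK_EX keeps a concurrent request from writing at the same time.
    $data = $ips ? implode("\n", $ips) . "\n" : '';
    file_put_contents($file, $data, LOCK_EX);
    return count($ips);
}
```

Back up the log before running anything like this, just as with the other files.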
Be sure to change the email address listed in secret.php and uncomment that line.

Thanks for an informative tutorial :)
First you could use something like this:
#Send all who amend /admin to your site in the browsers address bar into space
RewriteCond %{HTTP_HOST} ^your_domain.com$ [NC,OR]
RewriteCond %{HTTP_HOST} ^www.your_domain.com$ [NC]
RewriteRule ^admin\/?$ "http\:\/\/hubblesite\.org\/gallery\/album\/entire\/pr1998014j\/large_web\/" [R=301,L]
Secondly, it is possible to automatically add banned IP addresses to .htaccess:
robots.txt
add this to robots.txt, change ‘path’ and ‘file’ to the folder and filename for your php trap page.
User-agent: *
Disallow: /path/file.htm
=============================
.htaccess
This needs to be first on your .htaccess file, put the rest of the .htaccess contents below this line,
what will happen is that the script will prepend the blocked ip addresses to the .htaccess file, while
preserving everything that comes after that. Make sure to give write permissions to the ‘other’ group,
in other words, permissions on the .htaccess file need to be 606 or higher, that’s rw----rw-.
.htaccess file, above all current contents
===================================
SetEnvIf Request_URI "^(/site/403\.htm|/robots\.txt)$" allowsome
order deny,allow
deny from env=getout
allow from env=allowsome
=========================================
php trap page, assuming the file is in your site root folder, otherwise replace $_SERVER["DOCUMENT_ROOT"] with the
full server path to your .htaccess file. I changed birdman’s version slightly to automatically put in the path to
the primary .htaccess file at your site root. Link from all pages on your site using the path in the robots.txt file,
use something like a transparent gif, 1px, or a link with css property display:none; so only spiders will see it.
Before adding these links make sure your robots.txt has been up for at least a few days, a week is better.
Before adding link, test script by going to it, see if you get blocked with 403. First visit should give the text below,
second visit the generic 403 error page. If you also set a 403 error page in the .htaccess file you can get even more
precise blocked messages.
(like: ErrorDocument 403 /site/403.htm). The .htaccess file will allow access to only /site/403.htm at that point.
So, combining this with your nice coding could be the ultimate site protect!
Would be nice to see it (I’m no coder, so could you?)
Sara
You mentioned “Birdman’s version” so I did a search on “Birdman Spider Trap”. I thought I had invented a cool new technique but apparently Birdman devised a similar approach at least six years ago and others in the thread talk about using something similar to that for years prior. I guess my idea isn’t so innovative after all. :)
I looked at his solution though, and various modifications floating around the web, and determined that the method used is too risky for me. If the server hiccups while the file is being written to, you could end up with a corrupted .htaccess file which compromises the entire server. It’s a slim chance of that happening but the consequences are, imo, far too dire to take the risk. It’s really tempting though. I would *love* to be able to automate the process. Including a file into .htaccess would be fantastic but writing to it scares me too much.
Thanks so much for the comment though. I never would have heard about birdman’s version otherwise.
While we are talking about site security, another good resource for .htaccess protection is Jeff Starr’s 4G Blacklist. Much of my .htaccess file is based on that.
PS – A permission of 606 on .htaccess can be dangerous. As long as PHP is installed correctly, 644 should be sufficient. If not, it’s time to look for a new host.
Cheers!
Mike, thanks for this article. I haven’t tested it yet, but think it looks like a great solution for trapping rogue bots and spiders. I have seen several other varieties of auto-banning and spider-trapping scripts (including birdman’s), and each of them seems to employ a slightly different strategy to get the job done. Which one is the best I think depends on your particular server setup.
As for auto-banning via htaccess, I have to agree that it is generally the optimal way to go. I don’t auto-ban anything at Perishable Press because I prefer to check everything first, just in case something legitimate gets through. As you mention, using the lowest possible permissions for the htaccess file is essential to make auto-banning worthwhile. 644 is as high as I would go. As for the functionality of writing directly to the htaccess file, I have never heard of any case where the file was corrupted in the process (although it certainly is a possibility).
One other thought that comes to mind when auto-banning bots and stuff: the ban list will eventually become too huge to manage effectively. Over time, lists get excessively long and may reduce server performance. If you are going the automated (or semi-automated) route, it may be a good idea to do a little maintenance along the way to keep things manageable and optimized into the future.
Just my two cents here. Thanks again for sharing this technique with the community. It is definitely useful and will serve as a great tool for the “good guys.”