How to Stop Search Engine Robots from Creating Magento Sessions

This article covers:

  1. Stop bad bots with Apache Fail2Ban
  2. Limit good bot access
  3. Stop search engine robots from creating Magento sessions

Who this article is for:

  • Magento stores affected by excessive bot traffic, where session files are growing at an alarming rate.

Stop Bad Bots with Fail2Ban

Excessive bot traffic can act like a DoS attack, draining your server resources and harming your online business. You can alleviate the issue with a firewall or with restrictions in .htaccess, but a better approach is to enable Fail2Ban jails that ban malicious visitors at the server level. DigitalOcean's step-by-step guide shows how to install and configure Fail2Ban on an Ubuntu 14.04 LAMP stack. Just make sure you add “enabled = true” only to the Apache jails that apply to your setup.

After successfully enabling Apache Fail2Ban, you can check the results from each jail:

sudo fail2ban-client status apache-jail-name
# Jail Status Output for apache-auth
Status for the jail: apache-auth
 |- Filter
 |  |- Currently failed: 547
 |  |- Total failed:     1772
 |  `- File list:        /var/log/apache2/error.log
 `- Actions
    |- Currently banned: 25
    |- Total banned:     45
    `- Banned IP list: XXX.XXX.XXX.XXX ...

Then look up the banned IP addresses to get an idea of where the banned traffic comes from.
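
The lookup step can be partly scripted. Below is a minimal sketch, assuming Python 3.7+ is available on the server and that the "Banned IP list:" line looks like the output shown above. The helper names (banned_ips, reverse_dns, report) are hypothetical and not part of Fail2Ban.

```python
import ipaddress
import socket
import subprocess


def banned_ips(status_text):
    """Return the addresses listed after 'Banned IP list:' in the status output."""
    for line in status_text.splitlines():
        if "Banned IP list:" in line:
            ips = []
            for token in line.split("Banned IP list:", 1)[1].split():
                try:
                    ips.append(str(ipaddress.ip_address(token)))
                except ValueError:
                    pass  # ignore anything that is not a valid IP address
            return ips
    return []


def reverse_dns(ip):
    """Best-effort reverse DNS lookup; many banned hosts have no PTR record."""
    try:
        return socket.gethostbyaddr(ip)[0]
    except OSError:
        return "(no reverse DNS entry)"


def report(jail):
    """Run fail2ban-client for one jail and print each banned IP with its host name."""
    result = subprocess.run(
        ["fail2ban-client", "status", jail],
        capture_output=True, text=True, check=True,
    )
    for ip in banned_ips(result.stdout):
        print(ip, reverse_dns(ip))
```

Calling report("apache-auth") from a script run with sudo would print one line per banned address. A reverse DNS name is only a hint, so a WHOIS or GeoIP lookup may still be needed for addresses that do not resolve.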

Adjust Apache LogLevel

Once Fail2Ban is in place, if you haven’t changed Apache's LogLevel setting (which defaults to warn), your Apache error log is likely to be flooded with “client denied by server configuration” errors. Fortunately, Apache 2.4 lets you set the log level per module, so you can log only critical errors for access_compat:

LogLevel warn access_compat:crit

Limit Bot Access in Robots.txt

Here are a few lines you can add to your existing robots.txt file to limit bot access:

# # Crawl-delay: the number of seconds a bot should wait between successive requests to the same server.
# # Set a crawl delay if your server has traffic problems. Note that Googlebot ignores the Crawl-delay setting in robots.txt; set its crawl rate in Google Webmaster Tools instead.
# # Crawl-delay must sit inside a User-agent group; "*" applies it to every bot that honors it.
User-agent: *
Crawl-delay: 10

# # Deny all access to a specific bot, e.g. the Moz bot
User-agent: rogerbot
Disallow: /
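
One way to sanity-check rules like these before deploying them is Python's standard-library urllib.robotparser module. This is only a sketch: it assumes the Crawl-delay line sits under a "User-agent: *" group, and "ExampleBrowser" is a made-up agent name used for illustration.

```python
from urllib.robotparser import RobotFileParser

# Assumed robots.txt content: the rules from this section, with Crawl-delay
# placed inside a "User-agent: *" group.
ROBOTS_TXT = """\
User-agent: *
Crawl-delay: 10

User-agent: rogerbot
Disallow: /
"""


def load_rules(text):
    """Parse robots.txt text into a RobotFileParser without any network access."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser


rules = load_rules(ROBOTS_TXT)
print(rules.can_fetch("rogerbot", "/"))        # is rogerbot allowed on the home page?
print(rules.can_fetch("ExampleBrowser", "/"))  # is an unlisted agent allowed?
print(rules.crawl_delay("ExampleBrowser"))     # delay an unlisted agent inherits from "*"
```

This only shows how a well-behaved parser reads the file. Bots that ignore robots.txt are unaffected by it.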

If a bot doesn’t respect robots.txt, you can deny it access by adding restrictions to your .htaccess file instead:

# # Take the Baidu bot as an example
RewriteCond %{HTTP_USER_AGENT} baiduspider [NC]
RewriteRule .* - [F]
# # The [NC] flag makes the user-agent match case-insensitive
# # The [F] flag returns a "403 Forbidden" status code to the restricted bot

Stop Magento from Creating Sessions for Bot Traffic

First copy the file Varien.php from:

app/code/core/Mage/Core/Model/Session/Abstract/Varien.php

to:

app/code/local/Mage/Core/Model/Session/Abstract/Varien.php

In the copied file, modify the session start() function:

public function start($sessionName=null)
{
    // Added: skip session creation for bot traffic
    if ($this->isBot()) {
        return false;
    }
    if (isset($_SESSION) && !$this->getSkipEmptySessionCheck()) {
        return $this;
    }
    ...
}

Then add the bot validation function at the end of the class, just before its closing brace:

public function isBot()
     {
         // User-agent fragments treated as bots (matched case-insensitively, anywhere in the string)
         $bot_regex = '/BotLink|bingbot|AhrefsBot|ahoy|AlkalineBOT|anthill|appie|arale|araneo|AraybOt|ariadne|arks|ATN_Worldwide|Atomz|bbot|Bjaaland|Ukonline|borg-bot\/0.9|boxseabot|bspider|calif|christcrawler|CMC\/0.01|combine|confuzzledbot|CoolBot|cosmos|Internet Cruiser Robot|cusco|cyberspyder|cydralspider|desertrealm, desert realm|digger|DIIbot|grabber|downloadexpress|DragonBot|dwcp|ecollector|ebiness|elfinbot|esculapio|esther|fastcrawler|FDSE|FELIX IDE|ESI|fido|KIT-Fireball|fouineur|Freecrawl|gammaSpider|gazz|gcreep|golem|googlebot|griffon|Gromit|gulliver|gulper|hambot|havIndex|hotwired|htdig|iajabot|INGRID\/0.1|Informant|InfoSpiders|inspectorwww|irobot|Iron33|JBot|jcrawler|Teoma|Jeeves|jobo|image.kapsi.net|KDD-Explorer|ko_yappo_robot|label-grabber|larbin|legs|Linkidator|linkwalker|Lockon|logo_gif_crawler|marvin|mattie|mediafox|MerzScope|NEC-MeshExplorer|MindCrawler|udmsearch|moget|Motor|msnbot|muncher|muninn|MuscatFerret|MwdSearch|sharp-info-agent|WebMechanic|NetScoop|newscan-online|ObjectsSearch|Occam|Orbsearch\/1.0|packrat|pageboy|ParaSite|patric|pegasus|perlcrawler|phpdig|piltdownman|Pimptrain|pjspider|PlumtreeWebAccessor|PortalBSpider|psbot|Getterrobo-Plus|Raven|RHCS|RixBot|roadrunner|Robbie|robi|RoboCrawl|robofox|Scooter|Search-AU|searchprocess|Senrigan|Shagseeker|sift|SimBot|Site Valet|skymob|SLCrawler\/2.0|slurp|ESI|snooper|solbot|speedy|spider_monkey|SpiderBot\/1.0|spiderline|nil|suke|http:\/\/www.sygol.com|tach_bw|TechBOT|templeton|titin|topiclink|UdmSearch|urlck|Valkyrie libwww-perl|verticrawl|Victoria|void-bot|Voyager|VWbot_K|crawlpaper|wapspider|WebBandit\/1.0|webcatcher|T-H-U-N-D-E-R-S-T-O-N-E|WebMoose|webquest|webreaper|webs|webspider|WebWalker|wget|winona|whowhere|wlm|WOLP|WWWC|none|XGET|Nederland.zoek|AISearchBot|woriobot|NetSeer|Nutch|YandexBot|YandexMobileBot|SemrushBot|FatBot|MJ12bot|DotBot|AddThis|baiduspider|m2e/i';
         $userAgent = empty($_SERVER['HTTP_USER_AGENT']) ? '' : $_SERVER['HTTP_USER_AGENT'];
         // Requests that send no user agent at all are treated as bots too
         return $userAgent === '' || preg_match($bot_regex, $userAgent) === 1;
     }
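
Because the pattern matches anywhere in the user-agent string, short fragments can also catch real visitors. The sketch below is a Python port of the same check, written only for testing candidate fragments offline; it uses a small subset of the pattern above, and the sample user-agent strings are illustrative assumptions.

```python
import re

# Assumption: a short subset of the article's pattern, for illustration only.
BOT_PATTERN = re.compile(r"googlebot|bingbot|baiduspider|YandexBot|Motor", re.IGNORECASE)


def is_bot(user_agent, pattern=BOT_PATTERN):
    """Mirror the PHP check: an empty user agent or any pattern hit counts as a bot."""
    return not user_agent or pattern.search(user_agent) is not None


# Illustrative user-agent strings (not taken from real logs)
samples = [
    "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)",
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) Firefox/115.0",
    "Mozilla/5.0 (Linux; Android 10; motorola one) AppleWebKit/537.36",
]
for ua in samples:
    print(is_bot(ua), ua)
```

The third sample is flagged because "Motor" matches inside "motorola". A visitor flagged this way would get no session, so it is worth testing a new fragment against a few real browser user agents before adding it.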

Now clear your Magento cache and session folders and check whether the excessive bot sessions have stopped. As you identify new bots on your server, either from the access log or with the netstat command, add them to the regex.
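
To spot new bots in the access log, it helps to count requests per user agent. The following is a rough sketch, assuming Apache's combined log format (where the user agent is the last quoted field); top_user_agents is a hypothetical helper name, and the log path in the usage note is an assumption about your setup.

```python
import re
from collections import Counter

# In the combined log format each line ends with: "referer" "user-agent"
UA_AT_END = re.compile(r'"[^"]*" "([^"]*)"\s*$')


def top_user_agents(lines, limit=10):
    """Count requests per user agent and return the most frequent ones first."""
    counts = Counter()
    for line in lines:
        match = UA_AT_END.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts.most_common(limit)
```

For example, opening /var/log/apache2/access.log (if that is where your access log lives) and passing the file object to top_user_agents() returns a list of (user agent, request count) pairs. Agents with unusually high counts are candidates for the regex.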

By Ethan
