How to protect your ColdFusion website against bad bots


Recently one of our websites became almost completely unresponsive. At first, we had no idea what was happening. Using Wireshark and the IIS log files, we noticed a huge amount of traffic coming from multiple IP addresses, all with the same User-Agent. The user agent contained the string 80Legs. We quickly discovered that 80legs is a cloud-based web scraping tool that allows its customers to scrape websites.

If done properly, this should not cause any problems. Many spiders crawl this particular website every day; at any given moment around 15 of them are active. The problem with 80Legs was that it crawled at such a high rate (20+ pages per second) that requests became queued. Eventually, the website became unresponsive.

We also noticed something else: memory usage kept increasing. After some debugging, we discovered the cause: sessiontimeout. Sessiontimeout determines how long session variables stay in memory. The server clears a session from memory once it has been inactive for the amount of time set through the sessiontimeout variable. The session is tracked through cookies sent back and forth between the client and the server.

Cookies

In most cases bots ignore cookies. This means the server will create a new session for each request made by the bot. A small example shows why this can become a problem.

Imagine each session takes up 4 KB of memory and you have chosen a long sessiontimeout, e.g. 1 day. Then a bot like 80legs starts to spider your website, crawling at a rate of 25 pages per second. How much memory do you think this will consume? The answer is below:

4 KB * 25 (requests per second) * 60 (seconds) * 60 (minutes) * 24 (hours) ≈ 8.6 GB!
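To play with the numbers yourself, the calculation can be sketched in a few lines of cfscript. This is only an illustration: the variable names are made up for this example, the 4 KB per session is the assumed figure from above (measure your own), and the 5 second value is the bot timeout used in the code further down.

```cfml
<cfscript>
// Assumed figures from the example above; replace them with your own measurements.
kbPerSession      = 4;            // assumed memory per session, in KB
requestsPerSecond = 25;           // crawl rate of a bot that ignores cookies
timeoutSeconds    = 60 * 60 * 24; // sessiontimeout of 1 day, in seconds

// Every request creates a new session that stays alive until the timeout expires,
// so the number of sessions held in memory levels off at rate * timeout.
liveSessions = requestsPerSecond * timeoutSeconds;
memoryGB     = liveSessions * kbPerSession / 1000 / 1000;

writeOutput("Sessions held in memory: #liveSessions#<br>");
writeOutput("Approximate memory: #numberFormat(memoryGB, '0.00')# GB<br>");

// The same bot when its sessions expire after 5 seconds.
botTimeoutSeconds = 5;
botMemoryKB       = requestsPerSecond * botTimeoutSeconds * kbPerSession;
writeOutput("With a 5 second timeout: #botMemoryKB# KB");
</cfscript>
```

With these inputs the one-day timeout works out to roughly 2.16 million live sessions, while the 5 second timeout keeps only about 125 sessions (500 KB) in memory at any moment.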

This simple example shows why it is important to consider your session management carefully. For this particular website, we decided to track each request, determine whether a spider or a user was requesting the page, and adjust the sessiontimeout accordingly. We also added the option to block specific user agents and spiders completely. Below is the simplified code we use.

 <cfcomponent displayname="Application" output="true">
 	<cfprocessingdirective pageEncoding="utf-8"> 

 	<cfset THIS.Name = "TestApp" />
	<cfset THIS.ApplicationTimeout = CreateTimeSpan(0,2,0,0) />
	<cfset THIS.SessionManagement = true />
	<cfset THIS.sessionTimeout = createTimeSpan( 0, 0, 30, 0 ) />

	<cffunction name="OnApplicationStart" access="public" returntype="boolean" output="false" hint="Fires when the application is first created.">
		<!--- List of bots to block completely --->
		<cfset application.badbots 	= "python|Go-http-client|Freshbot|BLEXBot|riddler.io|uni-leipzig.de/crawler|80legs|GarlikCrawler|linkdexbot|magpie-crawler|semrushbot|ahrefsbot|MJ12bot|ltx71|DotBot|MegaIndex|Vagabondo" />
		<!--- List of good bots --->
		<cfset application.normalbots = "apps-spreadsheets|Feed me|Drupal|robots|Jersey|Qwantify|Google-Ads|Crawl|Bot|Spider|Search|Index|seo|curl|Larbin|Libwww|User-Agent|ChangeDetection|Genieo Web filter|BUbiNG|ia_archiver|Yahoo|AddThis|proximic|coccoc|Infohelfer|Woko|Wotbox|Daumoa|Spinn3r|WeSEE|ShopWiki|HubSpot|sistrix|Netseer|iCjobs|Ezooms|Iframely|KrOWLer|Twingly|FacebookExternalHit|Arachnophilia|ichiro|UASlinkChecker|Kraken|Nuhk|Najdi.si|NetcraftSurveyAgent|IntegromeDB|BingPreview|Steeler|TinEye|Nigma|hawkReader|Butterfly|Plukkie|WebThumbnail|ThumbSniper|Embedly|linguatools|backlink|PayPal|adressendeutschland|XRL|IdeelaborPlagiaat|SiteCondor|Web-Monitoring|Vedma|parsijoo|Browsershots|LoadImpactPageAnalyzer|Feedly|WebCookies|CloudFlare|Readability|kulturarw|immediatenet|Qualidator|Qirina Hurdler|BegunAdvertising|linkdex|Curious George|Fetch-Guess|alexa site audit|Speedy|HostTracker|findlinks|FlipboardProxy|Semantifire|LinkAider|webmastercoffee|Crowsnest|UnwindFetchor|MetaURI|MiaDev|CirrusExplorer|Dlvr.it|ADmantX|NLNZ_IAHarvester|wsAnalyzer|Thumbshots|BlogPulse|wscheck|Qseero|drupact|PagePeeker|HomeTags|facebookplatform|Pixray|BDFetch|WebNL|Semager|heritrix|Bad-Neighborhood|akula|Page2RSS|EasyBib AutoCite|gonzo|ScoutJet|Twikle|biwec|Lijit|Apercite|pmoz|Covario|Holmes|envolk|Ask Jeeves|StackRambler|EvriNid|arachnode.net|Nymesis|OpenCalaisSemanticProxy|cityreview|nworm|SBIder|Peew|WatchMouse|page_verifier|DomainDB|LinkWalker|voyager|copyright sheriff|Ocelli|Twiceler|abby|XML Sitemaps Generator|Pompos|Yaanb|livedoor ScreenShot|eCairn|MetaURI|L.webis|Web-sniffer|FairShare|Amagit.COM|Hatena|dotSemantic|HostTracker.com|AportWorm|XmarksFetch|FeedFinder|Nutch|baypup|192.comAgent|Surphace Scout|Szukacz|Charlotte|50.nu|HeartRails_Capture|WebImages|Scooter|Scarlett|Yanga|DNS-Digger-Explorer|Robozilla|UptimeDog|^Nail|Metaspinner|Touche|SniffRSS|Kalooga|Link Valet Online|Shelob|riddler" />
		<cfreturn true />
	</cffunction>

	<cffunction name="OnSessionStart" access="public" returntype="void" output="false" hint="Fires when the session is first created.">
		<Cfset setSessionType() />
		<cfreturn />
	</cffunction>

 
 
	<cffunction name="OnRequestStart" access="public" returntype="boolean" output="false" hint="Fires at first part of page processing.">
		<cfargument name="TargetPage" type="string" required="true"/>

		<cfinvoke component = "com.log" method = "serverLogSave" />

		<Cfif session.sessionType.bot>
			<!--- If client is a spider/bot set session timeout to 5 seconds. See also https://www.bennadel.com/blog/1847-explicitly-ending-a-coldfusion-session.htm --->
			<cfset session.setMaxInactiveInterval(javaCast( "long", 5 )) />
		</cfif>
		<Cfif session.sessionType.blocked>
			<!--- If client is a bad bot/spider stop processing return a blank page --->
			<cfabort>
		</cfif>

		<cfreturn true />
	</cffunction>

 
	<cffunction name="OnSessionEnd" access="public" returntype="void" output="false" hint="Fires when the session is terminated.">
		<!--- Define arguments. --->
		<cfargument name="SessionScope" type="struct" required="true"/>
 		<cfargument name="ApplicationScope" type="struct" required="false" default="#StructNew()#" />

		<!--- Return out. --->
		<cfreturn />
	</cffunction>
 



    <cffunction name="setSessionType" output="false">
		<cfset var local = {}>
		<!--- The User-Agent request header is exposed as cgi.http_user_agent --->
		<cfset local.usert_agent = cgi.http_user_agent>

		<!--- Get the ip address of the client --->
		<cfif StructKeyExists(GetHttpRequestData().headers, "X-Forwarded-For") and listlen(Trim(ListFirst(GetHttpRequestData().headers["X-Forwarded-For"])),".") eq 4>
			<cfset local.iip = Trim(ListFirst(GetHttpRequestData().headers["X-Forwarded-For"])) >
		<cfelse>
			<cfset local.iip = CGI.REMOTE_ADDR >
		</cfif>

		<cfif refindnocase("(#application.badbots#)",local.usert_agent) gt 0>
			<cfset session.sessionType = {"bot":true,"blocked":true,"sticky":false}>
		<cfelse>
			<cfquery name="local.qg" datasource="#application.databaseLog#">
			SELECT `realip`,`bot`,`Blocked`,Useragent,sticky
			FROM `backenddatabase`
			WHERE `RealIP` = <cfqueryparam value="#local.iip#">
			</cfquery>

			<cfif local.qg.recordcount gte 1>
				<!--- Known IP address: reuse the flags stored in the database --->
				<cfset session.sessionType = {"bot":local.qg.bot,"blocked":local.qg.Blocked,"sticky":local.qg.sticky}>
			<cfelseif refindnocase("(#application.normalbots#)",local.usert_agent) gt 0>
				<cfset session.sessionType = {"bot":true,"blocked":false,"sticky":false}>
			<cfelse>
				<cfset session.sessionType = {"bot":false,"blocked":false,"sticky":false}>
			</cfif>
		</cfif>

		<!--- Insert or update the record for this IP address. ON DUPLICATE KEY UPDATE only fires if realIP has a unique key. --->
		<cfquery name="local.qi" datasource="#application.databaseLog#">
		INSERT INTO `backenddatabase` (realIP,bot,blocked,Useragent)
		VALUES (<cfqueryparam value="#local.iip#">,<cfqueryparam value="#session.sessionType.bot#">,<cfqueryparam value="#session.sessionType.blocked#">,<cfqueryparam value="#local.usert_agent#">)
		ON DUPLICATE KEY UPDATE id=LAST_INSERT_ID(id)
		<cfif not session.sessionType.sticky>
		,bot = <cfqueryparam value="#session.sessionType.bot#">
		,blocked = <cfqueryparam value="#session.sessionType.blocked#">
		,Useragent = <cfqueryparam value="#local.usert_agent#">
		</cfif>
		</cfquery>

	</cffunction>


</cfcomponent>
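If you want to check how a given user agent would be classified before putting the lists live, the matching step can be tried on its own in a scratch .cfm page. The sketch below is not part of the component above: classifyUserAgent is a hypothetical helper, the two lists are shortened stand-ins for application.badbots and application.normalbots, and the sample user agent strings are only illustrative. It uses the same reFindNoCase call as setSessionType and skips the database lookup.

```cfml
<cfscript>
// Shortened stand-ins for application.badbots and application.normalbots.
badbots    = "80legs|semrushbot|ahrefsbot|MJ12bot";
normalbots = "Bot|Spider|Crawl|curl";

// Hypothetical helper: returns "blocked", "bot" or "user" for a User-Agent string.
// The bad list is checked first, mirroring the order used in setSessionType.
function classifyUserAgent(required string userAgent) {
    if (reFindNoCase("(#badbots#)", arguments.userAgent) gt 0) {
        return "blocked";
    }
    if (reFindNoCase("(#normalbots#)", arguments.userAgent) gt 0) {
        return "bot";
    }
    return "user";
}

// Illustrative sample strings; take real ones from your own log files.
samples = [
    "Mozilla/5.0 (compatible; 008/0.83; http://www.80legs.com/webcrawler.html) Gecko/2008032620",
    "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)",
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0 Safari/537.36"
];

for (ua in samples) {
    writeOutput(classifyUserAgent(ua) & ": " & encodeForHTML(ua) & "<br>");
}
</cfscript>
```

Keep in mind that every entry in the lists is treated as a regular expression, so a dot matches any character and short entries such as Bot or Search match anywhere in the string. Running your own log samples through a check like this is a cheap way to spot entries that catch more than intended.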

