Blocking Bad Bots and Scrapers with .htaccess

This article shows two methods of blocking the entire list of bad robots and web scrapers below with .htaccess files: SetEnvIfNoCase, or RewriteRules with mod_rewrite.

Blocking Bad Robots and Web Scrapers with RewriteRules

ErrorDocument 403 /403.html

RewriteEngine On
RewriteBase /

# IF THE UA STARTS WITH THESE
RewriteCond %{HTTP_USER_AGENT} ^(aesop_com_spiderman|alexibot|backweb|bandit|batchftp|bigfoot) [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^(black.?hole|blackwidow|blowfish|botalot|buddy|builtbottough|bullseye) [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^(cheesebot|cherrypicker|chinaclaw|collector|copier|copyrightcheck) [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^(cosmos|crescent|curl|custo|da|diibot|disco|dittospyder|dragonfly) [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^(drip|easydl|ebingbong|ecatch|eirgrabber|emailcollector|emailsiphon) [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^(emailwolf|erocrawler|exabot|eyenetie|filehound|flashget|flunky) [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^(frontpage|getright|getweb|go.?zilla|go-ahead-got-it|gotit|grabnet) [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^(grafula|harvest|hloader|hmview|httplib|httrack|humanlinks|ilsebot) [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^(infonavirobot|infotekies|intelliseek|interget|iria|jennybot|jetcar) [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^(joc|justview|jyxobot|kenjin|keyword|larbin|leechftp|lexibot|lftp|libweb) [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^(likse|linkscan|linkwalker|lnspiderguy|lwp|magnet|mag-net|markwatch) [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^(mata.?hari|memo|microsoft.?url|midown.?tool|miixpc|mirror|missigua) [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^(mister.?pix|moget|mozilla.?newt|nameprotect|navroad|backdoorbot|nearsite) [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^(net.?vampire|netants|netcraft|netmechanic|netspider|nextgensearchbot) [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^(attach|nicerspro|nimblecrawler|npbot|octopus|offline.?explorer) [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^(offline.?navigator|openfind|outfoxbot|pagegrabber|papa|pavuk) [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^(pcbrowser|php.?version.?tracker|pockey|propowerbot|prowebwalker) [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^(psbot|pump|queryn|recorder|realdownload|reaper|reget|true_robot) [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^(repomonkey|rma|internetseer|sitesnagger|siphon|slysearch|smartdownload) [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^(snake|snapbot|snoopy|sogou|spacebison|spankbot|spanner|sqworm|superbot) [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^(superhttp|surfbot|asterias|suzuran|szukacz|takeout|teleport) [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^(telesoft|the.?intraformant|thenomad|tighttwatbot|titan|urldispatcher) [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^(turingos|turnitinbot|urly.?warning|vacuum|vci|voideye|whacker) [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^(libwww-perl|widow|wisenutbot|wwwoffle|xaldon|xenu|zeus|zyborg|anonymouse) [NC,OR]

# STARTS WITH WEB
RewriteCond %{HTTP_USER_AGENT} ^web(zip|emaile|enhancer|fetch|go.?is|auto|bandit|clip|copier|master|reaper|sauger|site.?quester|whack) [NC,OR]

# ANYWHERE IN UA -- GREEDY REGEX
RewriteCond %{HTTP_USER_AGENT} ^.*(craftbot|download|extract|stripper|sucker|ninja|clshttp|webspider|leacher|collector|grabber|webpictures).*$ [NC]

# ISSUE 403 / SERVE ERRORDOCUMENT
RewriteRule . - [F,L]
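
One caveat with a catch-all rule like the one above: the internal request for /403.html comes from the same client, so it can be forbidden as well, and Apache may then fall back to its built-in error text. A minimal sketch of one way around this, assuming the error page lives at /403.html in the document root (as the ErrorDocument line suggests), is to pass that file through before the conditions run:

```apache
ErrorDocument 403 /403.html

RewriteEngine On
RewriteBase /

# Assumption: 403.html sits in the document root next to this .htaccess.
# Serve it untouched and stop rewriting, so the blocking rule never sees it.
RewriteRule ^403\.html$ - [L]

# ... the RewriteCond list and the final "RewriteRule . - [F,L]" follow here unchanged ...
```

The trade-off is that blocked clients can still fetch the error page itself, which is usually harmless.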

Alternate RewriteCond Rules

RewriteEngine on

#Block spambots
RewriteCond %{HTTP:User-Agent} (?:Alexibot|Art-Online|asterias|BackDoorbot|Black.Hole) [NC,OR]
RewriteCond %{HTTP:User-Agent} (?:BlackWidow|BlowFish|botALot|BuiltbotTough|Bullseye|BunnySlippers|Cegbfeieh|Cheesebot) [NC,OR]
RewriteCond %{HTTP:User-Agent} (?:CherryPicker|ChinaClaw|CopyRightCheck|cosmos|Crescent|Custo|DISCo|DittoSpyder|Download\sDemon) [NC,OR]
RewriteCond %{HTTP:User-Agent} (?:eCatch|EirGrabber|EmailCollector|EmailSiphon|EmailWolf|EroCrawler|Express\sWebPictures|ExtractorPro) [NC,OR]
RewriteCond %{HTTP:User-Agent} (?:EyeNetIE|FlashGet|Foobot|FrontPage|GetRight|GetWeb!|Go-Ahead-Got-It|Go!Zilla|GrabNet|Grafula) [NC,OR]
RewriteCond %{HTTP:User-Agent} (?:Harvest|hloader|HMView|httplib|HTTrack|humanlinks|Image\sStripper|Image\sSucker|Indy\sLibrary) [NC,OR]
RewriteCond %{HTTP:User-Agent} (?:InfonaviRobot|InterGET|Internet\sNinja|Jennybot|JetCar|JOC\sWeb\sSpider|Kenjin.Spider|Keyword.Density) [NC,OR]
RewriteCond %{HTTP:User-Agent} (?:larbin|LeechFTP|Lexibot|libWeb/clsHTTP|LinkextractorPro|LinkScan/8.1a.Unix|LinkWalker|lwp-trivial) [NC,OR]
RewriteCond %{HTTP:User-Agent} (?:Mass\sDownloader|Mata.Hari|Microsoft.URL|MIDown\stool|MIIxpc|Mister.PiX|Mister\sPiX|moget) [NC,OR]
RewriteCond %{HTTP:User-Agent} (?:Mozilla/3.Mozilla/2.01|Mozilla.*NEWT|Navroad|NearSite|NetAnts|NetMechanic|NetSpider|Net\sVampire) [NC,OR]
RewriteCond %{HTTP:User-Agent} (?:NetZIP|NICErsPRO|NPbot|Octopus|Offline.Explorer|Offline\sExplorer|Offline\sNavigator|Openfind) [NC,OR]
RewriteCond %{HTTP:User-Agent} (?:PageGrabber|Papa\sFoto|pavuk|pcBrowser|Program\sShareware\s1|ProPowerbot/2.14|ProWebWalker) [NC,OR]
RewriteCond %{HTTP:User-Agent} (?:psbot/0.1|QueryN.Metasearch|ReGet|RepoMonkey|RMA|SiteSnagger|SlySearch|SmartDownload|Spankbot|spanner) [NC,OR]
RewriteCond %{HTTP:User-Agent} (?:Superbot|SuperHTTP|Surfbot|suzuran|Szukacz/1.4|tAkeOut|Teleport|Teleport\sPro|Telesoft|The.Intraformant) [NC,OR]
RewriteCond %{HTTP:User-Agent} (?:TheNomad|TightTwatbot|Titan|toCrawl/UrlDispatcher|True_Robot|turingos) [NC,OR]
RewriteCond %{HTTP:User-Agent} (?:Turnitinbot/1.5|URLy.Warning|VCI|VoidEYE|WebAuto|WebBandit|WebCopier|WebEMailExtrac.*|WebEnhancer) [NC,OR]
RewriteCond %{HTTP:User-Agent} (?:WebFetch|WebGo\sIS|Web.Image.Collector|Web\sImage\sCollector|WebLeacher|WebmasterWorldForumbot) [NC,OR]
RewriteCond %{HTTP:User-Agent} (?:WebReaper|WebSauger|Website\seXtractor|Website.Quester|Website\sQuester|Webster.Pro|WebStripper) [NC,OR]
RewriteCond %{HTTP:User-Agent} (?:Web\sSucker|WebWhacker|WebZip|Wget|Widow|WWW-Collector-E|WWWOFFLE) [NC,OR]
RewriteCond %{HTTP:User-Agent} (?:Xaldon\sWebSpider|Xenu's|Zeus) [NC]
RewriteRule .? - [F]
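
If a blocked tool such as Wget is also used for your own monitoring or backups, one option is to exempt a trusted address. The following is only a sketch: 203.0.113.10 is a placeholder from the documentation address range, not a real host, and the user-agent list is shortened to two names for readability.

```apache
RewriteEngine on

# Hypothetical trusted address; replace with your own
RewriteCond %{REMOTE_ADDR} !^203\.0\.113\.10$
# Shortened list for illustration; a longer alternation works the same way
RewriteCond %{HTTP:User-Agent} (?:Wget|WebZip) [NC]
RewriteRule .? - [F]
```

Both conditions must hold for the rule to fire, so a request from the trusted address is never forbidden by this rule, whatever its user agent.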

Block Bad Bots with SetEnvIfNoCase

ErrorDocument 403 /403.html

# IF THE UA CONTAINS ANY OF THESE (THE LEADING .* MATCHES ANYWHERE, NOT JUST AT THE START)
SetEnvIfNoCase ^User-Agent$ .*(aesop_com_spiderman|alexibot|backweb|bandit|batchftp|bigfoot) HTTP_SAFE_BADBOT
SetEnvIfNoCase ^User-Agent$ .*(black.?hole|blackwidow|blowfish|botalot|buddy|builtbottough|bullseye) HTTP_SAFE_BADBOT
SetEnvIfNoCase ^User-Agent$ .*(cheesebot|cherrypicker|chinaclaw|collector|copier|copyrightcheck) HTTP_SAFE_BADBOT
SetEnvIfNoCase ^User-Agent$ .*(cosmos|crescent|curl|custo|da|diibot|disco|dittospyder|dragonfly) HTTP_SAFE_BADBOT
SetEnvIfNoCase ^User-Agent$ .*(drip|easydl|ebingbong|ecatch|eirgrabber|emailcollector|emailsiphon) HTTP_SAFE_BADBOT
SetEnvIfNoCase ^User-Agent$ .*(emailwolf|erocrawler|exabot|eyenetie|filehound|flashget|flunky) HTTP_SAFE_BADBOT
SetEnvIfNoCase ^User-Agent$ .*(frontpage|getright|getweb|go.?zilla|go-ahead-got-it|gotit|grabnet) HTTP_SAFE_BADBOT
SetEnvIfNoCase ^User-Agent$ .*(grafula|harvest|hloader|hmview|httplib|httrack|humanlinks|ilsebot) HTTP_SAFE_BADBOT
SetEnvIfNoCase ^User-Agent$ .*(infonavirobot|infotekies|intelliseek|interget|iria|jennybot|jetcar) HTTP_SAFE_BADBOT
SetEnvIfNoCase ^User-Agent$ .*(joc|justview|jyxobot|kenjin|keyword|larbin|leechftp|lexibot|lftp|libweb) HTTP_SAFE_BADBOT
SetEnvIfNoCase ^User-Agent$ .*(likse|linkscan|linkwalker|lnspiderguy|lwp|magnet|mag-net|markwatch) HTTP_SAFE_BADBOT
SetEnvIfNoCase ^User-Agent$ .*(mata.?hari|memo|microsoft.?url|midown.?tool|miixpc|mirror|missigua) HTTP_SAFE_BADBOT
SetEnvIfNoCase ^User-Agent$ .*(mister.?pix|moget|mozilla.?newt|nameprotect|navroad|backdoorbot|nearsite) HTTP_SAFE_BADBOT
SetEnvIfNoCase ^User-Agent$ .*(net.?vampire|netants|netcraft|netmechanic|netspider|nextgensearchbot) HTTP_SAFE_BADBOT
SetEnvIfNoCase ^User-Agent$ .*(attach|nicerspro|nimblecrawler|npbot|octopus|offline.?explorer) HTTP_SAFE_BADBOT
SetEnvIfNoCase ^User-Agent$ .*(offline.?navigator|openfind|outfoxbot|pagegrabber|papa|pavuk) HTTP_SAFE_BADBOT
SetEnvIfNoCase ^User-Agent$ .*(pcbrowser|php.?version.?tracker|pockey|propowerbot|prowebwalker) HTTP_SAFE_BADBOT
SetEnvIfNoCase ^User-Agent$ .*(psbot|pump|queryn|recorder|realdownload|reaper|reget|true_robot) HTTP_SAFE_BADBOT
SetEnvIfNoCase ^User-Agent$ .*(repomonkey|rma|internetseer|sitesnagger|siphon|slysearch|smartdownload) HTTP_SAFE_BADBOT
SetEnvIfNoCase ^User-Agent$ .*(snake|snapbot|snoopy|sogou|spacebison|spankbot|spanner|sqworm|superbot) HTTP_SAFE_BADBOT
SetEnvIfNoCase ^User-Agent$ .*(superhttp|surfbot|asterias|suzuran|szukacz|takeout|teleport) HTTP_SAFE_BADBOT
SetEnvIfNoCase ^User-Agent$ .*(telesoft|the.?intraformant|thenomad|tighttwatbot|titan|urldispatcher) HTTP_SAFE_BADBOT
SetEnvIfNoCase ^User-Agent$ .*(turingos|turnitinbot|urly.?warning|vacuum|vci|voideye|whacker) HTTP_SAFE_BADBOT
SetEnvIfNoCase ^User-Agent$ .*(widow|wisenutbot|wwwoffle|xaldon|xenu|zeus|zyborg|anonymouse) HTTP_SAFE_BADBOT
SetEnvIfNoCase ^User-Agent$ .*web(zip|emaile|enhancer|fetch|go.?is|auto|bandit|clip|copier|master|reaper|sauger|site.?quester|whack) HTTP_SAFE_BADBOT
SetEnvIfNoCase ^User-Agent$ .*(craftbot|download|extract|stripper|sucker|ninja|clshttp|webspider|leacher|collector|grabber|webpictures) HTTP_SAFE_BADBOT
SetEnvIfNoCase ^User-Agent$ .*(libwww-perl|aesop_com_spiderman) HTTP_SAFE_BADBOT
Deny from env=HTTP_SAFE_BADBOT
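
Deny from is the Apache 2.2 access-control syntax; on Apache 2.4 it only works when mod_access_compat is loaded. A rough equivalent using the 2.4 Require directives, assuming the SetEnvIfNoCase lines stay as they are and mod_authz_core is available, might look like this:

```apache
# Apache 2.4 sketch: allow everyone except requests flagged by SetEnvIfNoCase
<RequireAll>
    Require all granted
    Require not env HTTP_SAFE_BADBOT
</RequireAll>
```

This block would replace the Deny from line rather than sit alongside it, since mixing the old and new access-control styles in one scope can give surprising results.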

Original Bad Bot / Web Scraper List

  1. WebBandit
  2. 2icommerce
  3. Accoona
  4. ActiveTouristBot
  5. adressendeutschland
  6. aipbot
  7. Alexibot
  8. Alligator
  9. AllSubmitter
  10. almaden
  11. anarchie
  12. Anonymous
  13. Apexoo
  14. Aqua_Products
  15. asterias
  16. ASSORT
  17. ATHENS
  18. AtHome
  19. Atomz
  20. attache
  21. autoemailspider
  22. autohttp
  23. b2w
  24. bew
  25. BackDoorBot
  26. Badass
  27. Baiduspider
  28. Baiduspider+
  29. BecomeBot
  30. berts
  31. Bitacle
  32. Biz360
  33. Black.Hole
  34. BlackWidow
  35. bladder fusion
  36. Blog Checker
  37. BlogPeople
  38. Blogshares Spiders
  39. Bloodhound
  40. BlowFish
  41. Board Bot
  42. Bookmark search tool
  43. BotALot
  44. BotRightHere
  45. Bot mailto:craftbot@yahoo.com
  46. Bropwers
  47. Browsezilla
  48. BuiltBotTough
  49. Bullseye
  50. BunnySlippers
  51. Cegbfeieh
  52. CFNetwork
  53. CheeseBot
  54. CherryPicker
  55. Crescent
  56. charlotte/
  57. ChinaClaw
  58. Convera
  59. Copernic
  60. CopyRightCheck
  61. cosmos
  62. Crescent
  63. c-spider
  64. curl
  65. Custo
  66. Cyberz
  67. DataCha0s
  68. Daum
  69. Deweb
  70. Digger
  71. Digimarc
  72. digout4uagent
  73. DIIbot
  74. DISCo
  75. DittoSpyder
  76. DnloadMage
  77. Download
  78. dragonfly
  79. DreamPassport
  80. DSurf
  81. DTS Agent
  82. dumbot
  83. DynaWeb
  84. e-collector
  85. EasyDL
  86. EBrowse
  87. eCatch
  88. ecollector
  89. edgeio
  90. efp@gmx.net
  91. EirGrabber
  92. Email Extractor
  93. EmailCollector
  94. EmailSiphon
  95. EmailWolf
  96. EmeraldShield
  97. Enterprise_Search
  98. EroCrawler
  99. ESurf
  100. Eval
  101. Everest-Vulcan
  102. Exabot
  103. Express
  104. Extractor
  105. ExtractorPro
  106. EyeNetIE
  107. FairAd
  108. fastlwspider
  109. fetch
  110. FEZhead
  111. FileHound
  112. findlinks
  113. Flaming AttackBot
  114. FlashGet
  115. FlickBot
  116. Foobot
  117. Forex
  118. Franklin Locator
  119. FreshDownload
  120. FrontPage
  121. FSurf
  122. Gaisbot
  123. Gamespy_Arcade
  124. genieBot
  125. GetBot
  126. Getleft
  127. GetRight
  128. GetWeb!
  129. Go!Zilla
  130. Go-Ahead-Got-It
  131. GOFORITBOT
  132. GrabNet
  133. Grafula
  134. grub
  135. Harvest
  136. Hatena Antenna
  137. heritrix
  138. HLoader
  139. HMView
  140. holmes
  141. HooWWWer
  142. HouxouCrawler
  143. HTTPGet
  144. httplib
  145. HTTPRetriever
  146. HTTrack
  147. humanlinks
  148. IBM_Planetwide
  149. iCCrawler
  150. ichiro
  151. iGetter
  152. Image Stripper
  153. Image Sucker
  154. imagefetch
  155. imds_monitor
  156. IncyWincy
  157. Industry Program
  158. Indy
  159. InetURL
  160. InfoNaviRobot
  161. InstallShield DigitalWizard
  162. InterGET
  163. IRLbot
  164. Iron33
  165. ISSpider
  166. IUPUI Research Bot
  167. Jakarta
  168. java/
  169. JBH Agent
  170. JennyBot
  171. JetCar
  172. jeteye
  173. jeteyebot
  174. JoBo
  175. JOC Web Spider
  176. Kapere
  177. Kenjin
  178. Keyword Density
  179. KRetrieve
  180. ksoap
  181. KWebGet
  182. LapozzBot
  183. larbin
  184. leech
  185. LeechFTP
  186. LeechGet
  187. leipzig.de
  188. LexiBot
  189. libWeb
  190. libwww-FM
  191. libwww-perl
  192. LightningDownload
  193. LinkextractorPro
  194. Linkie
  195. LinkScan
  196. linktiger
  197. LinkWalker
  198. lmcrawler
  199. LNSpiderguy
  200. LocalcomBot
  201. looksmart
  202. LWP
  203. Mac Finder
  204. Mail Sweeper
  205. mark.blonin
  206. MaSagool
  207. Mass
  208. Mata Hari
  209. MCspider
  210. MetaProducts Download Express
  211. Microsoft Data Access
  212. Microsoft URL Control
  213. MIDown
  214. MIIxpc
  215. Mirror
  216. Missigua
  217. Missouri College Browse
  218. Mister
  219. Monster
  220. mkdb
  221. moget
  222. Moreoverbot
  223. mothra/netscan
  224. MovableType
  225. Mozi!
  226. Mozilla/22
  227. Mozilla/3.0 (compatible)
  228. Mozilla/5.0 (compatible; MSIE 5.0)
  229. MSIE_6.0
  230. MSIECrawler
  231. MSProxy
  232. MVAClient
  233. MyFamilyBot
  234. MyGetRight
  235. nameprotect
  236. NASA Search
  237. Naver
  238. Navroad
  239. NearSite
  240. NetAnts
  241. netattache
  242. NetCarta
  243. NetMechanic
  244. NetResearchServer
  245. NetSpider
  246. NetZIP
  247. Net Vampire
  248. NEWT ActiveX
  249. Nextopia
  250. NICErsPRO
  251. ninja
  252. NimbleCrawler
  253. noxtrumbot
  254. NPBot
  255. Octopus
  256. Offline
  257. OK Mozilla
  258. OmniExplorer
  259. OpaL
  260. Openbot
  261. Openfind
  262. OpenTextSiteCrawler
  263. Oracle Ultra Search
  264. OutfoxBot
  265. P3P
  266. PackRat
  267. PageGrabber
  268. PagmIEDownload
  269. panscient
  270. Papa Foto
  271. pavuk
  272. pcBrowser
  273. perl
  274. PerMan
  275. PersonaPilot
  276. PHP version
  277. PlantyNet_WebRobot
  278. playstarmusic
  279. Plucker
  280. Port Huron
  281. Program Shareware
  282. Progressive Download
  283. ProPowerBot
  284. prospector
  285. ProWebWalker
  286. Prozilla
  287. psbot
  288. psycheclone
  289. puf
  290. PushSite
  291. PussyCat
  292. PuxaRapido
  293. Python-urllib
  294. QuepasaCreep
  295. QueryN
  296. Radiation
  297. RealDownload
  298. RedCarpet
  299. RedKernel
  300. ReGet
  301. relevantnoise
  302. RepoMonkey
  303. RMA
  304. Rover
  305. Rsync
  306. RTG30
  307. Rufus
  308. SAPO
  309. SBIder
  310. scooter
  311. ScoutAbout
  312. script
  313. searchpreview
  314. searchterms
  315. Seekbot
  316. Serious
  317. Shai
  318. shelob
  319. Shim-Crawler
  320. SickleBot
  321. sitecheck
  322. SiteSnagger
  323. Slurpy Verifier
  324. SlySearch
  325. SmartDownload
  326. sna-
  327. snagger
  328. Snoopy
  329. sogou
  330. sootle
  331. So-net
  332. SpankBot
  333. spanner
  334. SpeedDownload
  335. Spegla
  336. Sphere
  337. Sphider
  338. SpiderBot
  339. sproose
  340. SQ Webscanner
  341. Sqworm
  342. Stamina
  343. Stanford
  344. studybot
  345. SuperBot
  346. SuperHTTP
  347. Surfbot
  348. SurfWalker
  349. suzuran
  350. Szukacz
  351. tAkeOut
  352. TALWinHttpClient
  353. tarspider
  354. Teleport
  355. Telesoft
  356. Templeton
  357. TestBED
  358. The Intraformant
  359. TheNomad
  360. TightTwatBot
  361. Titan
  362. toCrawl/UrlDispatcher
  363. True_Robot
  364. turingos
  365. TurnitinBot
  366. Twisted PageGetter
  367. UCmore
  368. UdmSearch
  369. UMBC
  370. UniversalFeedParser
  371. URL Control
  372. URLGetFile
  373. URLy Warning
  374. URL_Spider_Pro
  375. UtilMind
  376. vayala
  377. vobsub
  378. VCI
  379. VoidEYE
  380. VoilaBot
  381. voyager
  382. w3mir
  383. Web Image Collector
  384. Web Sucker
  385. Web2WAP
  386. WebaltBot
  387. WebAuto
  388. WebBandit
  389. WebCapture
  390. webcollage
  391. WebCopier
  392. WebCopy
  393. WebEMailExtrac
  394. WebEnhancer
  395. WebFetch
  396. WebFilter
  397. WebFountain
  398. WebGo
  399. WebLeacher
  400. WebMiner
  401. WebMirror
  402. WebReaper
  403. WebSauger
  404. WebSnake
  405. Website
  406. WebStripper
  407. WebVac
  408. webwalk
  409. WebWhacker
  410. WebZIP
  411. Wells Search
  412. WEP Search 00
  413. WeRelateBot
  414. Wget
  415. WhosTalking
  416. Widow
  417. Wildsoft Surfer
  418. WinHttpRequest
  419. WinHTTrack
  420. WUMPUS
  421. WWWOFFLE
  422. wwwster
  423. WWW-Collector
  424. Xaldon
  425. Xenu's
  426. Xenus
  427. XGET
  428. Y!TunnelPro
  429. YahooYSMcm
  430. YaDirectBot
  431. Yeti
  432. Zade
  433. ZBot
  434. zerxbot
  435. Zeus
  436. ZyBorg
