robots.txt Search Engine Optimization is simply making good use of robots.txt for your blog, whether it runs WordPress or phpBB. Below you'll find a WordPress-optimized robots.txt along with the meta tags that complement it.
See the Updated WordPress robots.txt file
Make use of the robots.txt file on your web server. This file tells crawlers which directories can or cannot be crawled. Make sure it's current for your site so that you don't accidentally block the Googlebot crawler.
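For reference, the simplest possible robots.txt permits crawling of the entire site; an empty Disallow value blocks nothing:

User-agent: *
Disallow: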
When deciding which pages to crawl, Googlebot obeys the directives in this file. Here is a WordPress-optimized robots.txt:
User-agent: *
# disallow all files in these directories
Disallow: /cgi-bin/
Disallow: /z/j/
Disallow: /z/c/
Disallow: /stats/
Disallow: /dh_
Disallow: /about/
Disallow: /contact/
Disallow: /tag/
Disallow: /wp-admin/
Disallow: /wp-includes/
Disallow: /contact
Disallow: /manual
Disallow: /manual/*
Disallow: /phpmanual/
Disallow: /category/

User-agent: Googlebot
# disallow all files ending with these extensions
Disallow: /*.php$
Disallow: /*.js$
Disallow: /*.inc$
Disallow: /*.css$
Disallow: /*.gz$
Disallow: /*.wmv$
Disallow: /*.cgi$
Disallow: /*.xhtml$
# disallow all files with ? in url
Disallow: /*?*

# disable duggmirror
User-agent: duggmirror
Disallow: /

# allow google image bot to search all images
User-agent: Googlebot-Image
Disallow:
Allow: /*

# allow adsense bot on entire site
User-agent: Mediapartners-Google*
Disallow:
Allow: /*
Here is a robots.txt optimized for a phpBB forum:

User-agent: *
Disallow: /cgi-bin/
Disallow: /phpbb/admin/
Disallow: /phpbb/cache/
Disallow: /phpbb/db/
Disallow: /phpbb/images/
Disallow: /phpbb/includes/
Disallow: /phpbb/language/
Disallow: /phpbb/templates/
Disallow: /phpbb/faq.php
Disallow: /phpbb/groupcp.php
Disallow: /phpbb/login.php
Disallow: /phpbb/memberlist.php
Disallow: /phpbb/modcp.php
Disallow: /phpbb/posting.php
Disallow: /phpbb/privmsg.php
Disallow: /phpbb/profile.php
Disallow: /phpbb/search.php
Disallow: /phpbb/viewonline.php

User-agent: Googlebot
# disallow files ending with these extensions
Disallow: /*.inc$
Disallow: /*.js$
Disallow: /*.css$
# disallow query-string urls
Disallow: *mark=*
Disallow: *view=*

# allow google image bot to search all images
User-agent: Googlebot-Image
Disallow:
Allow: /*

# allow adsense bot on entire site
User-agent: Mediapartners-Google*
Disallow:
Allow: /*
And a combined example for a host running both a blog and a phpBB forum:

User-agent: *
Disallow: /stats
Disallow: /dh_
Disallow: /V
Disallow: /z/j/
Disallow: /z/c/
Disallow: /cgi-bin/
Disallow: /viewtopic.php
Disallow: /viewforum.php
Disallow: /index.php?
Disallow: /posting.php
Disallow: /groupcp.php
Disallow: /search.php
Disallow: /login.php
Disallow: /post
Disallow: /member
Disallow: /profile.php
Disallow: /memberlist.php
Disallow: /faq.php
Disallow: /templates/
Disallow: /mx_
Disallow: /db/
Disallow: /admin/
Disallow: /cache/
Disallow: /images/
Disallow: /includes/
Disallow: /common.php
Disallow: /index.php
Disallow: /modcp.php
Disallow: /privmsg.php
Disallow: /viewonline.php
Disallow: /rss.php

User-agent: Googlebot
# disallow all files ending with these extensions
Allow: /sitemap.php
Disallow: /*.php$
Disallow: /*.js$
Disallow: /*.inc$
Disallow: /*.css$
Disallow: /*.txt$
# disallow all files with ? in url
Disallow: /*?*
Disallow: /*?
# disallow all files in /wp- directories
Disallow: /wp-*/

# disallow archiving site
User-agent: ia_archiver
Disallow: /

# allow google image bot to search all images
User-agent: Googlebot-Image
Disallow:
Allow: /*.gif$
Allow: /*.png$
Allow: /*.jpeg$
Allow: /*.jpg$
Allow: /*.ico$
Allow: /images
Allow: /z/i/

# allow adsense bot on entire site
User-agent: Mediapartners-Google*
Allow: /*
You can use an asterisk (*) to match any sequence of characters.
For example, to block crawling of all subdirectories that begin with private:
User-Agent: Googlebot
Disallow: /private*/
To block access to all URLs that include a question mark (?):

User-agent: *
Disallow: /*?*
You can use the $ character to specify matching the end of a URL.
For example, to block any URL that ends with .php:

User-Agent: Googlebot
Disallow: /*.php$
You can use this pattern matching in combination with the Allow directive. For instance, if a ? indicates a session ID, you may want to exclude all URLs that contain one while still allowing URLs that end with a ?:

User-agent: *
Allow: /*?$
Disallow: /*?
Here, Disallow: /*? blocks any URL that begins with your host name, followed by any string, followed by a ?, followed by any string,
while Allow: /*?$ allows any URL that begins with your host name, followed by any string, followed by a ?, with no characters after the ?.
Note: blocking "Googlebot" also blocks every bot whose name begins with "Googlebot", such as Googlebot-Image and Googlebot-Mobile:
User-agent: Googlebot
Disallow: /
Note: Googlebot follows the line directed at it, rather than the line directed at everyone.
User-agent: *
Disallow: /

User-agent: Googlebot
Disallow:
Googlebot recognizes an extension to the robots.txt standard called Allow, which is the opposite of Disallow. For example, to block everything in a folder except one file:
User-Agent: Googlebot
Disallow: /folder1/
Allow: /folder1/myfile.html
You can combine the two. For example, to block Googlebot entirely while still allowing Googlebot-Mobile:

User-agent: Googlebot
Disallow: /

User-agent: Googlebot-Mobile
Allow:
If you wish Google to drop pages that have already been indexed, it is better to use the noindex meta tag on those pages. This removes them much faster than blocking them with robots.txt.
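A minimal sketch of that tag, placed in the <head> of each page you want dropped:

<meta name="robots" content="noindex">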
Note: removing snippets also removes cached pages.
A snippet is a text excerpt that appears below a page's title in our search results and describes the content of the page.
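To keep Google from showing a snippet for a page, the documented NOSNIPPET directive can be placed in a meta tag like this:

<meta name="googlebot" content="nosnippet">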
Google updates its entire index automatically on a regular basis. When we crawl the web, we find new pages, discard dead links, and update links automatically. Links that are outdated now will most likely "fade out" of our index during our next crawl.
Note: Please ensure that you return a true 404 error even if you choose to display a more user-friendly body of the HTML page for your visitors. It won't help to return a page that says "File Not Found" if the http headers still return a status code of 200, or normal.
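As an illustration (raw HTTP, independent of any particular server), a correct response pairs the friendly body with a 404 status line:

HTTP/1.1 404 Not Found
Content-Type: text/html

<html><body><h1>File Not Found</h1></body></html>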
Google automatically takes a "snapshot" of each page it crawls and archives it. This "cached" version allows a webpage to be retrieved for your end users if the original page is ever unavailable (due to temporary failure of the page's web server). The cached page appears to users exactly as it looked when Google last crawled it, and we display a message at the top of the page to indicate that it's a cached version. Users can access the cached version by choosing the "Cached" link on the search results page.
Note: the noarchive tag only removes the "Cached" link for the page. Google will continue to index the page and display a snippet.
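That tag takes the familiar META form (the standard noarchive directive):

<meta name="robots" content="noarchive">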
If you wish to exclude your entire website from every search engine's index, place the following robots.txt file in your server root:

User-agent: *
Disallow: /
Note: Googlebot does not interpret a 401/403 response ("Unauthorized"/"Forbidden") to a robots.txt fetch as a request not to crawl any pages on the site.
To remove your site from Google only, while leaving other crawlers unaffected:

User-agent: Googlebot
Disallow: /
Each port must have its own robots.txt file. In particular, if you serve content via both http and https, you'll need a separate robots.txt file for each of these protocols. For example, to let robots crawl all http pages but no https pages, you would serve this for http:

User-agent: *
Allow: /

and this for https:

User-agent: *
Disallow: /
To remove all pages under a particular directory (here, /lems):

User-agent: Googlebot
Disallow: /lems
To remove all files of a specific file type (for example, .gif):

User-agent: Googlebot
Disallow: /*.gif$
To remove dynamically generated pages (any URL containing a ?):

User-agent: Googlebot
Disallow: /*?
Another standard, which can be more convenient for page-by-page use, involves adding a META tag to an HTML page to tell robots not to index the page. This standard is described at http://www.robotstxt.org/wc/exclusion.html#meta.
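For example, to keep a page out of the index and tell robots not to follow its links, the standard form is:

<meta name="robots" content="noindex,nofollow">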
To remove a particular image from Google's image index (for example, dogs.jpg in your /images directory):

User-agent: Googlebot-Image
Disallow: /images/dogs.jpg

To remove all the images on your site from the image index:

User-agent: Googlebot-Image
Disallow: /

To remove all images of a specific file type (for example, .gif):

User-agent: Googlebot-Image
Disallow: /*.gif$
Note: If you believe your request is urgent and cannot wait until the next time Google crawls your site, use our automatic URL removal system. In order for this automated process to work, the webmaster or SEO agency must first create and place a robots.txt file on the site in question.
Google will continue to exclude your site or directories from successive crawls as long as the robots.txt file exists in the web server root. If you do not have access to the root level of your server, you may place a robots.txt file at the same level as the files you want to remove. Doing this and submitting via the automatic URL removal system will cause a temporary, 180-day removal of the directories specified in your robots.txt file from the Google index, regardless of whether you remove the robots.txt file after processing your request. (Keeping the robots.txt file at the same level would require you to return to the URL removal system every 180 days to reissue the removal.)
Only blogs with site feeds will be included in Blog Search. If you'd like to prevent your feed from being crawled, make use of a robots.txt file or meta tags (NOINDEX or NOFOLLOW), as described above. Please note that if you have a feed that was previously included, the old posts will remain in the index even though new ones will not be added.
When users add your feed to their Google homepage or Google Reader, Google's Feedfetcher attempts to obtain the content of the feed in order to display it. Since Feedfetcher requests come from explicit action by human users, Feedfetcher has been designed to ignore robots.txt guidelines.
It's not possible for Google to restrict access to a publicly available feed. If your feed is provided by a blog hosting service, you should work with them to restrict access to your feed. Check those sites' help content for more information (e.g., Blogger, LiveJournal, or Typepad).
Google Web Search on mobile phones allows users to search all the content in the Google index for desktop web browsers. Because this content isn't written specifically for mobile phones and devices and thus might not display properly, Google automatically translates (or "transcodes") these pages by analyzing the original HTML code and converting it to a mobile-ready format. To ensure that the highest quality and most usable web page is displayed on your mobile phone or device, Google may resize, adjust, or convert images, text formatting, and/or certain aspects of web page functionality.
To save bandwidth, Googlebot only downloads the robots.txt file once a day or whenever we've fetched many pages from the server. So, it may take a while for Googlebot to learn of changes to your robots.txt file. Also, Googlebot is distributed on several machines. Each of these keeps its own record of your robots.txt file.
Also, there's a small difference between the way Googlebot handles the robots.txt file and the way the robots.txt standard says we should (keeping in mind the distinction between "should" and "must"). The standard says we should obey the first applicable rule, whereas Googlebot obeys the longest (that is, the most specific) applicable rule. This more intuitive practice matches what people actually do, and what they expect us to do.
User-Agent: *
Allow: /
Disallow: /cgi-bin
It's obvious that the webmaster's intent here is to allow robots to crawl everything except the /cgi-bin directory. Consequently, that's what we do.
Meta tags can exclude all outgoing links on a page, but you can also instruct Googlebot not to crawl individual links by adding rel="nofollow" to a hyperlink. When Google sees rel="nofollow" on a hyperlink, that link gets no credit when Google ranks websites in its search results. For example, a link such as <a href="...">This is a great link!</a> could be replaced with <a href="..." rel="nofollow">I can't vouch for this link</a>.
# robots.txt, www.nytimes.com 6/29/2006
User-agent: *
Disallow: /pages/college/
Disallow: /college/
Disallow: /library/
Disallow: /learning/
Disallow: /aponline/
Disallow: /reuters/
Disallow: /cnet/
Disallow: /partners/
Disallow: /archives/
Disallow: /indexes/
Disallow: /thestreet/
Disallow: /nytimes-partners/
Disallow: /financialtimes/
Allow: /pages/
Allow: /2003/
Allow: /2004/
Allow: /2005/
Allow: /top/
Allow: /ref/
Allow: /services/xml/

User-agent: Mediapartners-Google*
Disallow:

# robots.txt, http://dictionary.reference.com
User-agent: Googlebot
Disallow:

User-agent: Mediapartners-Google
Disallow:

User-agent: Teleport Pro
Disallow: /

User-agent: *
Disallow: /cgi-bin/

# robots.txt for www.phpbbhacks.com
User-agent: *
Disallow: /forums/viewtopic.php
Disallow: /forums/viewforum.php
Disallow: /forums/index.php?
Disallow: /forums/posting.php
Disallow: /forums/groupcp.php
Disallow: /forums/search.php
Disallow: /forums/login.php
Disallow: /forums/privmsg.php
Disallow: /forums/post
Disallow: /forums/profile.php
Disallow: /forums/memberlist.php
Disallow: /forums/faq.php
Disallow: /forums/archive

# robots.txt for Slashdot.org
#
# "Any empty [Disallow] value, indicates that all URLs can be retrieved.
# At least one Disallow field needs to be present in a record."
User-agent: Mediapartners-Google
Disallow:

User-agent: Googlebot
Crawl-delay: 100
Disallow: /firehose.pl
Disallow: /submit.pl
Disallow: /comments.pl
Disallow: /users.pl
Disallow: /zoo.pl
Disallow: firehose.pl
Disallow: submit.pl
Disallow: comments.pl
Disallow: users.pl
Disallow: zoo.pl
Disallow: /~
Disallow: ~

User-agent: Slurp
Crawl-delay: 100
Disallow:

User-agent: Yahoo-NewsCrawler
Disallow:

User-Agent: msnbot
Crawl-delay: 100
Disallow:

User-agent: *
Crawl-delay: 100
Disallow: /authors.pl
Disallow: /index.pl
Disallow: /article.pl
Disallow: /comments.pl
Disallow: /firehose.pl
Disallow: /journal.pl
Disallow: /messages.pl
Disallow: /metamod.pl
Disallow: /users.pl
Disallow: /search.pl
Disallow: /submit.pl
Disallow: /pollBooth.pl
Disallow: /pubkey.pl
Disallow: /topics.pl
Disallow: /zoo.pl
Disallow: /palm
Disallow: authors.pl
Disallow: index.pl
Disallow: article.pl
Disallow: comments.pl
Disallow: firehose.pl
Disallow: journal.pl
Disallow: messages.pl
Disallow: metamod.pl
Disallow: users.pl
Disallow: search.pl
Disallow: submit.pl
Disallow: pollBooth.pl
Disallow: pubkey.pl
Disallow: topics.pl
Disallow: zoo.pl
Disallow: /~
Disallow: ~

# robots.txt for http://www.myspace.com
User-agent: ia_archiver
Disallow: /

# robots.txt for http://www.craigslist.com
User-agent: YahooFeedSeeker
Disallow: /forums
Disallow: /res/
Disallow: /post
Disallow: /email.friend
Disallow: /?flagCode
Disallow: /ccc
Disallow: /hhh
Disallow: /sss
Disallow: /bbb
Disallow: /ggg
Disallow: /jjj

User-agent: *
Disallow: /cgi-bin
Disallow: /cgi-secure
Disallow: /forums
Disallow: /search
Disallow: /res/
Disallow: /post
Disallow: /email.friend
Disallow: /?flagCode
Disallow: /ccc
Disallow: /hhh
Disallow: /sss
Disallow: /bbb
Disallow: /ggg
Disallow: /jjj

User-Agent: OmniExplorer_Bot
Disallow: /

# robots.txt for http://www.alexa.com
User-agent: googlebot     # allow Google crawler
Disallow: /search

User-agent: gulliver      # allow Northern Light crawler
Disallow: /search

User-agent: slurp         # allow Inktomi crawler
Disallow: /search

User-agent: fast          # allow FAST crawler
Disallow: /search

User-agent: scooter       # allow AltaVista crawler
Disallow: /search

User-agent: vscooter      # allow AltaVista image crawler
Disallow: /search

User-agent: ia_archiver   # allow Internet Archive crawler
Disallow: /search

User-agent: *             # Disallow all other crawlers access
Disallow: /

# robots.txt for http://www.technorati.com
User-agent: NPBot
Disallow: /

User-agent: TurnitinBot
Disallow: /

User-Agent: sitecheck.internetseer.com
Disallow: /

User-Agent: *
Crawl-Delay: 3
Disallow: /search/
Disallow: /search.php
Disallow: /cosmos.php

# robots.txt for www.sitepoint.com
User-agent: *
Disallow: /cgi-bin/
Disallow: /images/
Disallow: /forums/report.php
Disallow: /forums/search.php
Disallow: /forums/newreply.php
Disallow: /forums/editpost.php
Disallow: /forums/memberlist.php
Disallow: /forums/profile.php
Disallow: /launch/
Disallow: /search/
Disallow: /voucher/424/
Disallow: /email/
Disallow: /feedback/
Disallow: /contact?reason=articlesuggest
Disallow: /linktothis/
Disallow: /popup/
Disallow: /forums/archive/

# robots.txt for http://www.w3.org/
# For use by search.w3.org
User-agent: W3C-gsa
Disallow: /Out-Of-Date

User-agent: Mozilla/4.0 (compatible; MSIE 6.0; Windows NT; MS Search 4.0 Robot)
Disallow: /

# W3C Link checker
User-agent: W3C-checklink
Disallow:

# exclude some access-controlled areas
User-agent: *
Disallow: /2004/ontaria/basic
Disallow: /Team
Disallow: /Project
Disallow: /Systems
Disallow: /Web
Disallow: /History
Disallow: /Out-Of-Date
Disallow: /2002/02/mid
Disallow: /mid/
Disallow: /People/all/
Disallow: /RDF/Validator/ARPServlet
Disallow: /2003/03/Translations/byLanguage
Disallow: /2003/03/Translations/byTechnology
Disallow: /2005/11/Translations/Query
Disallow: /2003/glossary/subglossary/
#Disallow: /2005/06/blog/
#Disallow: /2001/07/pubrules-checker
#shouldnt get transparent proxies but will ml links of things like pubrules
Disallow: /2000/06/webdata/xslt
Disallow: /2000/09/webdata/xslt
Disallow: /2005/08/online_xslt/xslt
Disallow: /Bugs/
Disallow: /Search/Mail/Public/
Disallow: /2006/02/chartergen

# robots.txt for www.google-analytics.com
User-Agent: *
Disallow: /
Noindex: /

# robots.txt for video.google.com
User-agent: *
Disallow: /videosearch?
Disallow: /videofeed?
Disallow: /videopreview?
Disallow: /videopreviewbig?
Disallow: /videoprograminfo?
Disallow: /videorandom
Disallow: /videolineup
Disallow: /downloadgvp

# robots.txt for www.google.com
User-agent: *
Allow: /searchhistory/
Disallow: /news?output=xhtml&
Allow: /news?output=xhtml
Disallow: /search
Disallow: /groups
Disallow: /images
Disallow: /catalogs
Disallow: /catalogues
Disallow: /news
Disallow: /nwshp
Disallow: /?
Disallow: /addurl/image?
Disallow: /pagead/
Disallow: /relpage/
Disallow: /relcontent
Disallow: /sorry/
Disallow: /imgres
Disallow: /keyword/
Disallow: /u/
Disallow: /univ/
Disallow: /cobrand
Disallow: /custom
Disallow: /advanced_group_search
Disallow: /advanced_search
Disallow: /googlesite
Disallow: /preferences
Disallow: /setprefs
Disallow: /swr
Disallow: /url
Disallow: /m?
Disallow: /m/search?
Disallow: /wml?
Disallow: /wml/search?
Disallow: /xhtml?
Disallow: /xhtml/search?
Disallow: /xml?
Disallow: /imode?
Disallow: /imode/search?
Disallow: /jsky?
Disallow: /jsky/search?
Disallow: /pda?
Disallow: /pda/search?
Disallow: /sprint_xhtml
Disallow: /sprint_wml
Disallow: /pqa
Disallow: /palm
Disallow: /gwt/
Disallow: /purchases
Disallow: /hws
Disallow: /bsd?
Disallow: /linux?
Disallow: /mac?
Disallow: /microsoft?
Disallow: /unclesam?
Disallow: /answers/search?q=
Disallow: /local?
Disallow: /local_url
Disallow: /froogle?
Disallow: /froogle_
Disallow: /print
Disallow: /books
Disallow: /patents?
Disallow: /scholar?
Disallow: /complete
Disallow: /sponsoredlinks
Disallow: /videosearch?
Disallow: /videopreview?
Disallow: /videoprograminfo?
Disallow: /maps?
Disallow: /translate?
Disallow: /ie?
Disallow: /sms/demo?
Disallow: /katrina?
Disallow: /blogsearch?
Disallow: /blogsearch/
Disallow: /blogsearch_feeds
Disallow: /advanced_blog_search
Disallow: /reader/
Disallow: /uds/
Disallow: /chart?
Disallow: /transit?
Disallow: /mbd?
Disallow: /extern_js/
Disallow: /calendar/feeds/
Disallow: /calendar/ical/
Disallow: /cl2/feeds/
Disallow: /cl2/ical/
Disallow: /coop/directory
Disallow: /coop/manage
Disallow: /trends?
Disallow: /trends/music?
Disallow: /notebook/search?
Disallow: /music
Disallow: /browsersync
Disallow: /call
Disallow: /archivesearch?
Disallow: /archivesearch/url
Disallow: /archivesearch/advanced_search
Disallow: /base/search?
Disallow: /base/reportbadoffer
Disallow: /base/s2
Disallow: /urchin_test/
Disallow: /movies?
Disallow: /codesearch?
Disallow: /codesearch/feeds/search?
Disallow: /wapsearch?
Disallow: /safebrowsing
Disallow: /finance
Disallow: /reviews/search?

# robots.txt for validator.w3.org
# $Id: robots.txt,v 1.3 2000/12/13 13:04:09 gerald Exp $
User-agent: *
Disallow: /check

# robots.txt for httpd.apache.org
User-agent: *
Disallow: /websrc

# robots.txt for www.apache.org
User-agent: *
Disallow: /websrc
Crawl-Delay: 4
# Please, we do NOT allow nonauthorized robots.
# http://www.webmasterworld.com/robots
# Actual robots can always be found here for: http://www.webmasterworld.com/robots2
# Old full robots.txt can be found here: http://www.webmasterworld.com/robots3
# Any unauthorized bot running will result in IP's being banned.
# Agent spoofing is considered a bot.
# Fair warning to the clueless - honey pots are - and have been - running.
# If you have been banned for bot running - please sticky an admin for a reinclusion request.
# http://www.searchengineworld.com/robots/
# This code found here: http://www.webmasterworld.com/robots.txt?view=rawcode
User-agent: *
Crawl-delay: 17

User-agent: *
Disallow: /gfx/
Disallow: /cgi-bin/
Disallow: /QuickSand/
Disallow: /pda/
Disallow: /zForumFFFFFF/
# WebmasterWorld.com: robots.txt
# GNU Robots.txt Feel free to use with credit
# given to WebmasterWorld.
# Please, we do NOT allow nonauthorized robots any longer.
# http://www.searchengineworld.com/robots/
# Yes, feel free to copy and use the following.
User-agent: OmniExplorer_Bot
Disallow: /
User-agent: FreeFind
Disallow: /
User-agent: BecomeBot
Disallow: /
User-agent: Nutch
Disallow: /
User-agent: Jetbot/1.0
Disallow: /
User-agent: Jetbot
Disallow: /
User-agent: WebVac
Disallow: /
User-agent: Stanford
Disallow: /
User-agent: naver
Disallow: /
User-agent: dumbot
Disallow: /
User-agent: Hatena Antenna
Disallow: /
User-agent: grub-client
Disallow: /
User-agent: grub
Disallow: /
User-agent: looksmart
Disallow: /
User-agent: WebZip
Disallow: /
User-agent: larbin
Disallow: /
User-agent: b2w/0.1
Disallow: /
User-agent: Copernic
Disallow: /
User-agent: psbot
Disallow: /
User-agent: Python-urllib
Disallow: /
User-agent: Googlebot-Image
Disallow: /
User-agent: NetMechanic
Disallow: /
User-agent: URL_Spider_Pro
Disallow: /
User-agent: CherryPicker
Disallow: /
User-agent: EmailCollector
Disallow: /
User-agent: EmailSiphon
Disallow: /
User-agent: WebBandit
Disallow: /
User-agent: EmailWolf
Disallow: /
User-agent: ExtractorPro
Disallow: /
User-agent: CopyRightCheck
Disallow: /
User-agent: Crescent
Disallow: /
User-agent: SiteSnagger
Disallow: /
User-agent: ProWebWalker
Disallow: /
User-agent: CheeseBot
Disallow: /
User-agent: LNSpiderguy
Disallow: /
User-agent: Mozilla
Disallow: /
User-agent: mozilla
Disallow: /
User-agent: mozilla/3
Disallow: /
User-agent: mozilla/4
Disallow: /
User-agent: mozilla/5
Disallow: /
User-agent: Mozilla/4.0 (compatible; MSIE 4.0; Windows NT)
Disallow: /
User-agent: Mozilla/4.0 (compatible; MSIE 4.0; Windows 95)
Disallow: /
User-agent: Mozilla/4.0 (compatible; MSIE 4.0; Windows 98)
Disallow: /
User-agent: Mozilla/4.0 (compatible; MSIE 4.0; Windows XP)
Disallow: /
User-agent: Mozilla/4.0 (compatible; MSIE 4.0; Windows 2000)
Disallow: /
User-agent: ia_archiver
Disallow: /
User-agent: ia_archiver/1.6
Disallow: /
User-agent: Alexibot
Disallow: /
User-agent: Teleport
Disallow: /
User-agent: TeleportPro
Disallow: /
User-agent: Stanford Comp Sci
Disallow: /
User-agent: MIIxpc
Disallow: /
User-agent: Telesoft
Disallow: /
User-agent: Website Quester
Disallow: /
User-agent: moget/2.1
Disallow: /
User-agent: WebZip/4.0
Disallow: /
User-agent: WebStripper
Disallow: /
User-agent: WebSauger
Disallow: /
User-agent: WebCopier
Disallow: /
User-agent: NetAnts
Disallow: /
User-agent: Mister PiX
Disallow: /
User-agent: WebAuto
Disallow: /
User-agent: TheNomad
Disallow: /
User-agent: WWW-Collector-E
Disallow: /
User-agent: RMA
Disallow: /
User-agent: libWeb/clsHTTP
Disallow: /
User-agent: asterias
Disallow: /
User-agent: httplib
Disallow: /
User-agent: turingos
Disallow: /
User-agent: spanner
Disallow: /
User-agent: InfoNaviRobot
Disallow: /
User-agent: Harvest/1.5
Disallow: /
User-agent: Bullseye/1.0
Disallow: /
User-agent: Mozilla/4.0 (compatible; BullsEye; Windows 95)
Disallow: /
User-agent: Crescent Internet ToolPak HTTP OLE Control v.1.0
Disallow: /
User-agent: CherryPickerSE/1.0
Disallow: /
User-agent: CherryPickerElite/1.0
Disallow: /
User-agent: WebBandit/3.50
Disallow: /
User-agent: NICErsPRO
Disallow: /
User-agent: Microsoft URL Control - 5.01.4511
Disallow: /
User-agent: DittoSpyder
Disallow: /
User-agent: Foobot
Disallow: /
User-agent: WebmasterWorldForumBot
Disallow: /
User-agent: SpankBot
Disallow: /
User-agent: BotALot
Disallow: /
User-agent: lwp-trivial/1.34
Disallow: /
User-agent: lwp-trivial
Disallow: /
User-agent: http://www.WebmasterWorld.com bot
Disallow: /
User-agent: BunnySlippers
Disallow: /
User-agent: Microsoft URL Control - 6.00.8169
Disallow: /
User-agent: URLy Warning
Disallow: /
User-agent: Wget/1.6
Disallow: /
User-agent: Wget/1.5.3
Disallow: /
User-agent: Wget
Disallow: /
User-agent: LinkWalker
Disallow: /
User-agent: cosmos
Disallow: /
User-agent: moget
Disallow: /
User-agent: hloader
Disallow: /
User-agent: humanlinks
Disallow: /
User-agent: LinkextractorPro
Disallow: /
User-agent: Offline Explorer
Disallow: /
User-agent: Mata Hari
Disallow: /
User-agent: LexiBot
Disallow: /
User-agent: Web Image Collector
Disallow: /
User-agent: The Intraformant
Disallow: /
User-agent: True_Robot/1.0
Disallow: /
User-agent: True_Robot
Disallow: /
User-agent: BlowFish/1.0
Disallow: /
User-agent: http://www.SearchEngineWorld.com bot
Disallow: /
User-agent: http://www.WebmasterWorld.com bot
Disallow: /
User-agent: JennyBot
Disallow: /
User-agent: MIIxpc/4.2
Disallow: /
User-agent: BuiltBotTough
Disallow: /
User-agent: ProPowerBot/2.14
Disallow: /
User-agent: BackDoorBot/1.0
Disallow: /
User-agent: toCrawl/UrlDispatcher
Disallow: /
User-agent: WebEnhancer
Disallow: /
User-agent: suzuran
Disallow: /
User-agent: VCI WebViewer VCI WebViewer Win32
Disallow: /
User-agent: VCI
Disallow: /
User-agent: Szukacz/1.4
Disallow: /
User-agent: QueryN Metasearch
Disallow: /
User-agent: Openfind data gathere
Disallow: /
User-agent: Openfind
Disallow: /
User-agent: Xenu's Link Sleuth 1.1c
Disallow: /
User-agent: Xenu's
Disallow: /
User-agent: Zeus
Disallow: /
User-agent: RepoMonkey Bait & Tackle/v1.01
Disallow: /
User-agent: RepoMonkey
Disallow: /
User-agent: Microsoft URL Control
Disallow: /
User-agent: Openbot
Disallow: /
User-agent: URL Control
Disallow: /
User-agent: Zeus Link Scout
Disallow: /
User-agent: Zeus 32297 Webster Pro V2.9 Win32
Disallow: /
User-agent: Webster Pro
Disallow: /
User-agent: EroCrawler
Disallow: /
User-agent: LinkScan/8.1a Unix
Disallow: /
User-agent: Keyword Density/0.9
Disallow: /
User-agent: Kenjin Spider
Disallow: /
User-agent: Iron33/1.0.2
Disallow: /
User-agent: Bookmark search tool
Disallow: /
User-agent: GetRight/4.2
Disallow: /
User-agent: FairAd Client
Disallow: /
User-agent: Gaisbot
Disallow: /
User-agent: Aqua_Products
Disallow: /
User-agent: Radiation Retriever 1.1
Disallow: /
User-agent: WebmasterWorld Extractor
Disallow: /
User-agent: Flaming AttackBot
Disallow: /
User-agent: Oracle Ultra Search
Disallow: /
User-agent: MSIECrawler
Disallow: /
User-agent: PerMan
Disallow: /
User-agent: searchpreview
Disallow: /
User-agent: sootle
Disallow: /
User-agent: es
Disallow: /
User-agent: Enterprise_Search/1.0
Disallow: /
User-agent: Enterprise_Search
Disallow: /

User-agent: *
Disallow: /gfx/
Disallow: /cgi-bin/
Disallow: /QuickSand/
Disallow: /pda/
Disallow: /zForumFFFFFF/
You don't have to block your feeds from indexing. Matt Cutts himself has suggested not blocking them, because there is no real reason to do so. If you are talking about a blog, its /feed and similar URLs won't cause a mess in your rankings, so my suggestion would be to leave those feeds alone.
As for blocking: no, blocking /feed or whatever URLs you choose will have no effect on your main URL. Googlebot will simply deindex them and stop crawling them.
It is extremely easy to make Wget wander aimlessly around a web site, sucking up all the available data in the process. `wget -r site', and you're set. Great? Not for the server admin.
As long as Wget is only retrieving static pages, and doing it at a reasonable rate (see the `--wait' option), there's not much of a problem. The trouble is that Wget can't tell the difference between the smallest static page and the most demanding CGI. A site I know has a section handled by an, uh, bitchin' CGI Perl script that converts Info files to HTML on the fly. The script is slow, but works well enough for human users viewing an occasional Info file. However, when someone's recursive Wget download stumbles upon the index page that links to all the Info files through the script, the system is brought to its knees without providing anything useful to the downloader.
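By contrast, a polite recursive retrieval rate-limits itself with the `--wait' option (the host name here is just a placeholder):

wget -r --wait=2 http://www.server.com/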
To avoid this kind of accident, as well as to preserve privacy for documents that need to be protected from well-behaved robots, the concept of robot exclusion was invented. The idea is that the server administrators and document authors can specify which portions of the site they wish to protect from the robots.
The most popular mechanism, and the de facto standard supported by all the major robots, is the "Robots Exclusion Standard" (RES) written by Martijn Koster et al. in 1994. It specifies the format of a text file containing directives that instruct the robots which URL paths to avoid. To be found by the robots, the specifications must be placed in `/robots.txt'
in the server root, which the robots are supposed to download and parse.
Although Wget is not a web robot in the strictest sense of the word, it can download large parts of a site without the user's intervention to download an individual page. Because of that, Wget honors RES when downloading recursively. For instance, when you issue:

wget -r http://www.server.com/
First the index of `www.server.com' will be downloaded. If Wget finds that it wants to download more documents from that server, it will request `http://www.server.com/robots.txt'
and, if found, use it for further downloads. `robots.txt' is loaded only once per server.
Until version 1.8, Wget supported the first version of the standard, written by Martijn Koster in 1994 and available at http://www.robotstxt.org/wc/norobots.html. As of version 1.8, Wget has supported the additional directives specified in the internet draft draft-koster-robots-00.txt titled "A Method for Web Robots Control". The draft, which has as far as I know never made it to an RFC, is available at http://www.robotstxt.org/wc/norobots-rfc.txt.
This manual no longer includes the text of the Robot Exclusion Standard.
The second, less-known mechanism enables the author of an individual document to specify whether they want the links from the file to be followed by a robot. This is achieved using the META tag, like this:

<meta name="robots" content="nofollow">
This is explained in some detail at http://www.robotstxt.org/wc/meta-user.html. Wget supports this method of robot exclusion in addition to the usual /robots.txt
exclusion.
If you know what you are doing and really, really wish to turn off the robot exclusion, set the robots variable to `off' in your `.wgetrc'. You can achieve the same effect from the command line using the -e switch, e.g.:

wget -e robots=off url
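In `.wgetrc' form, that is the single line:

robots = off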