SEO with Robots.txt
robots.txt search engine optimization simply means using robots.txt effectively for your blog, WordPress site, or phpBB forum. This page collects a WordPress-optimized robots.txt, the relevant meta tags, and Google's own recommendations.
See the Updated WordPress robots.txt file
Google Robots.txt Info and Recommendations
Make use of the robots.txt file on your web server. This file tells crawlers which directories can or cannot be crawled. Make sure it's current for your site so that you don't accidentally block the Googlebot crawler.
Googlebot and Robots.txt SEO Info
When deciding which pages to crawl, Googlebot evaluates robots.txt in this order:
- Googlebot will obey the first record in the robots.txt file with a User-agent starting with "Googlebot."
- If no "Googlebot User-agent exists, it will obey the first entry with a User-agent of "*"
Google User-agents
- Googlebot: crawls pages for our web index and our news index
- Googlebot-Mobile: crawls pages for our mobile index
- Googlebot-Image: crawls pages for our image index
- Mediapartners-Google: crawls pages to determine AdSense content; we only use this bot to crawl your site if you show AdSense ads on your site
- Adsbot-Google: crawls pages to measure AdWords landing page quality; we only use this bot if you use Google AdWords to advertise your site
Removing Old or Wrong Content from Google
- Create the new page
- In .htaccess (if your server runs Apache), add a RedirectPermanent directive (see the sketch after this list)
- DO NOT DELETE THE OLD FILE
- Update all the links on your website to point to the new page (change the link text while you're at it)
- Verify that no pages point to the old file (including your sitemap.xml)
- Add a noindex,nofollow robots meta tag to the old file AND a Disallow rule for it in your robots.txt
- Submit your updated sitemap.xml to Google & Yahoo
- Wait a few weeks
- When the new page appears in Google, it's safe to delete the old one
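As a rough sketch of the RedirectPermanent and robots.txt steps above, assuming an Apache server and made-up file names (/old-page.html moving to /new-page.html):

# .htaccess: permanently redirect the old URL to the new one
RedirectPermanent /old-page.html http://www.example.com/new-page.html

# robots.txt: keep crawlers away from the old URL
User-agent: *
Disallow: /old-page.html

The noindex,nofollow meta tag goes in the <head> of the old file; see the meta tag examples further down this page.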
Google Sponsored Robots.txt Articles
- Controlling how search engines access and index your website
- The Robots Exclusion Protocol
- robots.txt analysis tool
- Googlebot
- Inside Google Sitemaps: Using a robots.txt file
- All About Googlebot
robots.txt examples
robots.txt for WordPress 2.+
User-agent: *
# disallow all files in these directories
Disallow: /cgi-bin/
Disallow: /z/j/
Disallow: /z/c/
Disallow: /stats/
Disallow: /dh_
Disallow: /about/
Disallow: /contact/
Disallow: /tag/
Disallow: /wp-admin/
Disallow: /wp-includes/
Disallow: /contact
Disallow: /manual
Disallow: /manual/*
Disallow: /phpmanual/
Disallow: /category/

User-agent: Googlebot
# disallow all files ending with these extensions
Disallow: /*.php$
Disallow: /*.js$
Disallow: /*.inc$
Disallow: /*.css$
Disallow: /*.gz$
Disallow: /*.wmv$
Disallow: /*.cgi$
Disallow: /*.xhtml$
# disallow all files with ? in url
Disallow: /*?*

# disable duggmirror
User-agent: duggmirror
Disallow: /

# allow google image bot to search all images
User-agent: Googlebot-Image
Disallow:
Allow: /*

# allow adsense bot on entire site
User-agent: Mediapartners-Google*
Disallow:
Allow: /*
robots.txt for phpBB
User-agent: *
Disallow: /cgi-bin/
Disallow: /phpbb/admin/
Disallow: /phpbb/cache/
Disallow: /phpbb/db/
Disallow: /phpbb/images/
Disallow: /phpbb/includes/
Disallow: /phpbb/language/
Disallow: /phpbb/templates/
Disallow: /phpbb/faq.php
Disallow: /phpbb/groupcp.php
Disallow: /phpbb/login.php
Disallow: /phpbb/memberlist.php
Disallow: /phpbb/modcp.php
Disallow: /phpbb/posting.php
Disallow: /phpbb/privmsg.php
Disallow: /phpbb/profile.php
Disallow: /phpbb/search.php
Disallow: /phpbb/viewonline.php

User-agent: Googlebot
# disallow files ending with these extensions
Disallow: /*.inc$
Disallow: /*.js$
Disallow: /*.css$
# disallow all files with ? in url
Disallow: *mark=*
Disallow: *view=*

# allow google image bot to search all images
User-agent: Googlebot-Image
Disallow:
Allow: /*

# allow adsense bot on entire site
User-agent: Mediapartners-Google*
Disallow:
Allow: /*
User-agent: *
Disallow: /stats
Disallow: /dh_
Disallow: /V
Disallow: /z/j/
Disallow: /z/c/
Disallow: /cgi-bin/
Disallow: /viewtopic.php
Disallow: /viewforum.php
Disallow: /index.php?
Disallow: /posting.php
Disallow: /groupcp.php
Disallow: /search.php
Disallow: /login.php
Disallow: /post
Disallow: /member
Disallow: /profile.php
Disallow: /memberlist.php
Disallow: /faq.php
Disallow: /templates/
Disallow: /mx_
Disallow: /db/
Disallow: /admin/
Disallow: /cache/
Disallow: /images/
Disallow: /includes/
Disallow: /common.php
Disallow: /index.php
Disallow: /modcp.php
Disallow: /privmsg.php
Disallow: /viewonline.php
Disallow: /rss.php

User-agent: Googlebot
# disallow all files ending with these extensions
Allow: /sitemap.php
Disallow: /*.php$
Disallow: /*.js$
Disallow: /*.inc$
Disallow: /*.css$
Disallow: /*.txt$
# disallow all files with ? in url
Disallow: /*?*
Disallow: /*?
# disallow all files in /wp- directories
Disallow: /wp-*/

# disallow archiving site
User-agent: ia_archiver
Disallow: /

# allow google image bot to search all images
User-agent: Googlebot-Image
Disallow:
Allow: /*.gif$
Allow: /*.png$
Allow: /*.jpeg$
Allow: /*.jpg$
Allow: /*.ico$
Allow: /images
Allow: /z/i/

# allow adsense bot on entire site
User-agent: Mediapartners-Google*
Allow: /*
Pattern Matching with Google
Matching a sequence of characters using *
You can use an asterisk (*) to match a sequence of characters.
Block access to all subdirectories that begin with private:
User-Agent: Googlebot
Disallow: /private*/
Block access to all URLs that include a ?
User-agent: *
Disallow: /*?*
Matching the end characters of the URL using $
You can use the $ character to specify matching the end of the URL.
Block any URLs that end with .php
User-Agent: Googlebot
Disallow: /*.php$
You can use this pattern matching in combination with the Allow directive.
To exclude all URLs that contain a ? (so Googlebot doesn't crawl duplicate pages) while still allowing URLs that end with a ? to be crawled:
User-agent: *
Allow: /*?$
Disallow: /*?
Disallow: /*? blocks any URL that begins with your host name, followed by any string, followed by a ?, followed by any string.
Allow: /*?$ allows any URL that begins with your host name, followed by any string, followed by a ?, with no characters after the ?.
User-Agent Discussion
Blocking a specific User-Agent
Note: Blocking Googlebot blocks all bots that begin with "Googlebot"
Block Googlebot entirely
User-agent: Googlebot
Disallow: /
Allowing a specific User-Agent
Note: Googlebot follows the line directed at it, rather than the line directed at everyone.
Block access to all bots other than "Googlebot"
User-agent: *
Disallow: /

User-agent: Googlebot
Disallow:
Googlebot recognizes an extension to the robots.txt standard called Allow, which is opposite of Disallow.
Block all pages inside a subdirectory except for a single file
User-Agent: Googlebot
Disallow: /folder1/
Allow: /folder1/myfile.html
Block Googlebot but allow another bot
User-agent: Googlebot
Disallow: /

User-agent: Googlebot-Mobile
Allow: /
Removing Content From Google
It is better to use a noindex robots meta tag on pages that have already been indexed if you wish Google to drop them. This is much faster than blocking them with robots.txt.
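The standard form of that tag, placed in the <head> of the page you want dropped, is:

<meta name="robots" content="noindex">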
Note: removing snippets also removes cached pages.
A snippet is a text excerpt that appears below a page's title in our search results and describes the content of the page.
Prevent Google from displaying snippets for your page
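The meta tag that does this is the nosnippet value addressed to Googlebot, placed in the page's <head>:

<meta name="googlebot" content="nosnippet">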
Remove an outdated "dead" link
Google updates its entire index automatically on a regular basis. When we crawl the web, we find new pages, discard dead links, and update links automatically. Links that are outdated now will most likely "fade out" of our index during our next crawl.
Note: Please ensure that you return a true 404 error even if you choose to display a more user-friendly body of the HTML page for your visitors. It won't help to return a page that says "File Not Found" if the http headers still return a status code of 200, or normal.
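One quick way to check which status code a removed URL actually returns (assuming wget is available, and substituting your own URL for the placeholder) is:

wget -S --spider http://www.example.com/removed-page.html

The -S option prints the server's response headers, so you can confirm the status line says 404 rather than 200.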
Remove cached pages
Google automatically takes a "snapshot" of each page it crawls and archives it. This "cached" version allows a webpage to be retrieved for your end users if the original page is ever unavailable (due to temporary failure of the page's web server). The cached page appears to users exactly as it looked when Google last crawled it, and we display a message at the top of the page to indicate that it's a cached version. Users can access the cached version by choosing the "Cached" link on the search results page.
Prevent all search engines from showing a "Cached" link for your site
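The tag for this, placed in the page's <head>, is the noarchive value addressed to all robots:

<meta name="robots" content="noarchive">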
Allow other search engines to show a "Cached" link, preventing only Google
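To target only Google, address the same value to the googlebot user-agent:

<meta name="googlebot" content="noarchive">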
Note: this tag only removes the "Cached" link for the page. Google will continue to index the page and display a snippet.
Remove your entire website
If you wish to exclude your entire website from Google's index
Remove site from search engines and prevent all robots from crawling it in the future
User-agent: *
Disallow: /
Note: Please note that Googlebot does not interpret a 401/403 response ("Unauthorized"/"Forbidden") to a robots.txt fetch as a request not to crawl any pages on the site.
To remove your site from Google only and prevent just Googlebot from crawling your site in the future
User-agent: Googlebot
Disallow: /
Allow Googlebot to index all http pages but no https pages
Each port must have its own robots.txt file. In particular, if you serve content via both http and https, you'll need a separate robots.txt file for each of these protocols.
For your http protocol (http://yourserver.com/robots.txt)
User-agent: *
Allow: /
For the https protocol (https://yourserver.com/robots.txt)
User-agent: *
Disallow: /
Remove part of your website
Option 1: Robots.txt
Remove all pages under a particular directory (for example, lems)
User-agent: Googlebot
Disallow: /lems
Remove all files of a specific file type (for example, .gif)
User-agent: Googlebot
Disallow: /*.gif$
To remove dynamically generated pages, you'd use this robots.txt entry
User-agent: Googlebot
Disallow: /*?
Option 2: Meta tags
Another standard, which can be more convenient for page-by-page use, involves adding a META tag to an HTML page to tell robots not to index the page. This standard is described at http://www.robotstxt.org/wc/exclusion.html#meta.
Prevent all robots from indexing a page on your site
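The tag for this is the standard robots noindex meta tag, placed in the page's <head>:

<meta name="robots" content="noindex">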
Allow other robots to index the page on your site, preventing only Google's robots from indexing the page
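For Google only, address the tag to googlebot instead:

<meta name="googlebot" content="noindex">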
Allow robots to index the page on your site but instruct them not to follow outgoing links
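The nofollow value keeps robots from following the page's outgoing links:

<meta name="robots" content="nofollow">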
Remove an image from Google's Image Search
If you want Google to exclude the dogs.jpg image that appears on your site at www.yoursite.com/images/dogs.jpg:
User-agent: Googlebot-Image
Disallow: /images/dogs.jpg
Remove all the images on your site from our index
User-agent: Googlebot-Image
Disallow: /
Remove all files of a specific file type (for example, to include .jpg but not .gif images)
User-agent: Googlebot-Image
Disallow: /*.gif$
Note: If you believe your request is urgent and cannot wait until the next time Google crawls your site, use our automatic URL removal system. In order for this automated process to work, the webmaster or SEO agency must first create and place a robots.txt file on the site in question.
Google will continue to exclude your site or directories from successive crawls if the robots.txt file exists in the web server root. If you do not have access to the root level of your server, you may place a robots.txt file at the same level as the files you want to remove. Doing this and submitting via the automatic URL removal system will cause a temporary, 180 day removal of the directories specified in your robots.txt file from the Google index, regardless of whether you remove the robots.txt file after processing your request. (Keeping the robots.txt file at the same level would require you to return to the URL removal system every 180 days to reissue the removal.)
Remove a blog from Blog Search
Only blogs with site feeds will be included in Blog Search. If you'd like to prevent your feed from being crawled, make use of a robots.txt file or meta tags (NOINDEX or NOFOLLOW), as described above. Please note that if you have a feed that was previously included, the old posts will remain in the index even though new ones will not be added.
Remove an RSS or Atom feed
When users add your feed to their Google homepage or Google Reader, Google's Feedfetcher attempts to obtain the content of the feed in order to display it. Since Feedfetcher requests come from explicit action by human users, Feedfetcher has been designed to ignore robots.txt guidelines.
It's not possible for Google to restrict access to a publicly available feed. If your feed is provided by a blog hosting service, you should work with them to restrict access to your feed. Check those sites' help content for more information (e.g., Blogger, LiveJournal, or Typepad).
Remove transcoded pages
Google Web Search on mobile phones allows users to search all the content in the Google index for desktop web browsers. Because this content isn't written specifically for mobile phones and devices and thus might not display properly, Google automatically translates (or "transcodes") these pages by analyzing the original HTML code and converting it to a mobile-ready format. To ensure that the highest quality and most useable web page is displayed on your mobile phone or device, Google may resize, adjust, or convert images, text formatting and/or certain aspects of web page functionality.
To save bandwidth, Googlebot only downloads the robots.txt file once a day or whenever we've fetched many pages from the server. So, it may take a while for Googlebot to learn of changes to your robots.txt file. Also, Googlebot is distributed on several machines. Each of these keeps its own record of your robots.txt file.
Also, there's a small difference between the way Googlebot handles the robots.txt file and the way the robots.txt standard says we should (keeping in mind the distinction between "should" and "must"). The standard says we should obey the first applicable rule, whereas Googlebot obeys the longest (that is, the most specific) applicable rule. This more intuitive practice matches what people actually do, and what they expect us to do.
For example, consider the following robots.txt file:
User-Agent: *
Allow: /
Disallow: /cgi-bin
It's obvious that the webmaster's intent here is to allow robots to crawl everything except the /cgi-bin directory. Consequently, that's what we do.
Tell Googlebot not to count certain external links in your ranking
Meta tags can exclude all outgoing links on a page, but you can also instruct Googlebot not to crawl individual links by adding rel="nofollow" to a hyperlink. When Google sees the attribute rel="nofollow" on hyperlinks, those links won't get any credit when we rank websites in our search results. For example, a link reading "This is a great link!" could be replaced with one reading "I can't vouch for this link" that carries the rel="nofollow" attribute.
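A minimal illustration (the example.com URL is just a placeholder): the plain link

<a href="http://www.example.com/">This is a great link!</a>

could be swapped for

<a href="http://www.example.com/" rel="nofollow">I can't vouch for this link</a>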
Other Links
- Database of Web Robots, Overview
# robots.txt, www.nytimes.com 6/29/2006 User-agent: * Disallow: /pages/college/ Disallow: /college/ Disallow: /library/ Disallow: /learning/ Disallow: /aponline/ Disallow: /reuters/ Disallow: /cnet/ Disallow: /partners/ Disallow: /archives/ Disallow: /indexes/ Disallow: /thestreet/ Disallow: /nytimes-partners/ Disallow: /financialtimes/ Allow: /pages/ Allow: /2003/ Allow: /2004/ Allow: /2005/ Allow: /top/ Allow: /ref/ Allow: /services/xml/ User-agent: Mediapartners-Google* Disallow: # robots.txt, http://dictionary.reference.com User-agent: Googlebot Disallow: User-agent: Mediapartners-Google Disallow: User-agent: Teleport Pro Disallow: / User-agent: * Disallow: /cgi-bin/ # robots.txt for www.phpbbhacks.com User-agent: * Disallow: /forums/viewtopic.php Disallow: /forums/viewforum.php Disallow: /forums/index.php? Disallow: /forums/posting.php Disallow: /forums/groupcp.php Disallow: /forums/search.php Disallow: /forums/login.php Disallow: /forums/privmsg.php Disallow: /forums/post Disallow: /forums/profile.php Disallow: /forums/memberlist.php Disallow: /forums/faq.php Disallow: /forums/archive # robots.txt for Slashdot.org # # "Any empty [Disallow] value, indicates that all URLs can be retrieved. # At least one Disallow field needs to be present in a record." User-agent: Mediapartners-Google Disallow: User-agent: Googlebot Crawl-delay: 100 Disallow: /firehose.pl Disallow: /submit.pl Disallow: /comments.pl Disallow: /users.pl Disallow: /zoo.pl Disallow: firehose.pl Disallow: submit.pl Disallow: comments.pl Disallow: users.pl Disallow: zoo.pl Disallow: /~ Disallow: ~ User-agent: Slurp Crawl-delay: 100 Disallow: User-agent: Yahoo-NewsCrawler Disallow: User-Agent: msnbot Crawl-delay: 100 Disallow: User-agent: * Crawl-delay: 100 Disallow: /authors.pl Disallow: /index.pl Disallow: /article.pl Disallow: /comments.pl Disallow: /firehose.pl Disallow: /journal.pl Disallow: /messages.pl Disallow: /metamod.pl Disallow: /users.pl Disallow: /search.pl Disallow: /submit.pl Disallow: /pollBooth.pl Disallow: /pubkey.pl Disallow: /topics.pl Disallow: /zoo.pl Disallow: /palm Disallow: authors.pl Disallow: index.pl Disallow: article.pl Disallow: comments.pl Disallow: firehose.pl Disallow: journal.pl Disallow: messages.pl Disallow: metamod.pl Disallow: users.pl Disallow: search.pl Disallow: submit.pl Disallow: pollBooth.pl Disallow: pubkey.pl Disallow: topics.pl Disallow: zoo.pl Disallow: /~ Disallow: ~ # robots.txt for http://www.myspace.com User-agent: ia_archiver Disallow: / # robots.txt for http://www.craigslist.com User-agent: YahooFeedSeeker Disallow: /forums Disallow: /res/ Disallow: /post Disallow: /email.friend Disallow: /?flagCode Disallow: /ccc Disallow: /hhh Disallow: /sss Disallow: /bbb Disallow: /ggg Disallow: /jjj User-agent: * Disallow: /cgi-bin Disallow: /cgi-secure Disallow: /forums Disallow: /search Disallow: /res/ Disallow: /post Disallow: /email.friend Disallow: /?flagCode Disallow: /ccc Disallow: /hhh Disallow: /sss Disallow: /bbb Disallow: /ggg Disallow: /jjj User-Agent: OmniExplorer_Bot Disallow: / # robots.txt for http://www.alexa.com User-agent: googlebot # allow Google crawler Disallow: /search User-agent: gulliver # allow Northern Light crawler Disallow: /search User-agent: slurp # allow Inktomi crawler Disallow: /search User-agent: fast # allow FAST crawler Disallow: /search User-agent: scooter # allow AltaVista crawler Disallow: /search User-agent: vscooter # allow AltaVista image crawler Disallow: /search User-agent: ia_archiver # allow Internet Archive crawler Disallow: /search 
User-agent: * # Disallow all other crawlers access Disallow: / # robots.txt for http://www.technorati.com User-agent: NPBot Disallow: / User-agent: TurnitinBot Disallow: / User-Agent: sitecheck.internetseer.com Disallow: / User-Agent: * Crawl-Delay: 3 Disallow: /search/ Disallow: /search.php Disallow: /cosmos.php # robots.txt for www.sitepoint.com User-agent: * Disallow: /cgi-bin/ Disallow: /images/ Disallow: /forums/report.php Disallow: /forums/search.php Disallow: /forums/newreply.php Disallow: /forums/editpost.php Disallow: /forums/memberlist.php Disallow: /forums/profile.php Disallow: /launch/ Disallow: /search/ Disallow: /voucher/424/ Disallow: /email/ Disallow: /feedback/ Disallow: /contact?reason=articlesuggest Disallow: /linktothis/ Disallow: /popup/ Disallow: /forums/archive/ # robots.txt for http://www.w3.org/ # For use by search.w3.org User-agent: W3C-gsa Disallow: /Out-Of-Date User-agent: Mozilla/4.0 (compatible; MSIE 6.0; Windows NT; MS Search 4.0 Robot) Disallow: / # W3C Link checker User-agent: W3C-checklink Disallow: # exclude some access-controlled areas User-agent: * Disallow: /2004/ontaria/basic Disallow: /Team Disallow: /Project Disallow: /Systems Disallow: /Web Disallow: /History Disallow: /Out-Of-Date Disallow: /2002/02/mid Disallow: /mid/ Disallow: /People/all/ Disallow: /RDF/Validator/ARPServlet Disallow: /2003/03/Translations/byLanguage Disallow: /2003/03/Translations/byTechnology Disallow: /2005/11/Translations/Query Disallow: /2003/glossary/subglossary/ #Disallow: /2005/06/blog/ #Disallow: /2001/07/pubrules-checker #shouldnt get transparent proxies but will ml links of things like pubrules Disallow: /2000/06/webdata/xslt Disallow: /2000/09/webdata/xslt Disallow: /2005/08/online_xslt/xslt Disallow: /Bugs/ Disallow: /Search/Mail/Public/ Disallow: /2006/02/chartergen # robots.txt for www.google-analytics.com User-Agent: * Disallow: / Noindex: / # robots.txt for video.google.com User-agent: * Disallow: /videosearch? Disallow: /videofeed? Disallow: /videopreview? Disallow: /videopreviewbig? Disallow: /videoprograminfo? Disallow: /videorandom Disallow: /videolineup Disallow: /downloadgvp # robots.txt for www.google.com User-agent: * Allow: /searchhistory/ Disallow: /news?output=xhtml& Allow: /news?output=xhtml Disallow: /search Disallow: /groups Disallow: /images Disallow: /catalogs Disallow: /catalogues Disallow: /news Disallow: /nwshp Disallow: /? Disallow: /addurl/image? Disallow: /pagead/ Disallow: /relpage/ Disallow: /relcontent Disallow: /sorry/ Disallow: /imgres Disallow: /keyword/ Disallow: /u/ Disallow: /univ/ Disallow: /cobrand Disallow: /custom Disallow: /advanced_group_search Disallow: /advanced_search Disallow: /googlesite Disallow: /preferences Disallow: /setprefs Disallow: /swr Disallow: /url Disallow: /m? Disallow: /m/search? Disallow: /wml? Disallow: /wml/search? Disallow: /xhtml? Disallow: /xhtml/search? Disallow: /xml? Disallow: /imode? Disallow: /imode/search? Disallow: /jsky? Disallow: /jsky/search? Disallow: /pda? Disallow: /pda/search? Disallow: /sprint_xhtml Disallow: /sprint_wml Disallow: /pqa Disallow: /palm Disallow: /gwt/ Disallow: /purchases Disallow: /hws Disallow: /bsd? Disallow: /linux? Disallow: /mac? Disallow: /microsoft? Disallow: /unclesam? Disallow: /answers/search?q= Disallow: /local? Disallow: /local_url Disallow: /froogle? Disallow: /froogle_ Disallow: /print Disallow: /books Disallow: /patents? Disallow: /scholar? Disallow: /complete Disallow: /sponsoredlinks Disallow: /videosearch? Disallow: /videopreview? 
Disallow: /videoprograminfo? Disallow: /maps? Disallow: /translate? Disallow: /ie? Disallow: /sms/demo? Disallow: /katrina? Disallow: /blogsearch? Disallow: /blogsearch/ Disallow: /blogsearch_feeds Disallow: /advanced_blog_search Disallow: /reader/ Disallow: /uds/ Disallow: /chart? Disallow: /transit? Disallow: /mbd? Disallow: /extern_js/ Disallow: /calendar/feeds/ Disallow: /calendar/ical/ Disallow: /cl2/feeds/ Disallow: /cl2/ical/ Disallow: /coop/directory Disallow: /coop/manage Disallow: /trends? Disallow: /trends/music? Disallow: /notebook/search? Disallow: /music Disallow: /browsersync Disallow: /call Disallow: /archivesearch? Disallow: /archivesearch/url Disallow: /archivesearch/advanced_search Disallow: /base/search? Disallow: /base/reportbadoffer Disallow: /base/s2 Disallow: /urchin_test/ Disallow: /movies? Disallow: /codesearch? Disallow: /codesearch/feeds/search? Disallow: /wapsearch? Disallow: /safebrowsing Disallow: /finance Disallow: /reviews/search? # robots.txt for validator.w3.org # $Id: robots.txt,v 1.3 2000/12/13 13:04:09 gerald Exp $ User-agent: * Disallow: /check # robots.txt for httpd.apache.org User-agent: * Disallow: /websrc # robots.txt for www.apache.org User-agent: * Disallow: /websrc Crawl-Delay: 4
# Please, we do NOT allow nonauthorized robots. # http://www.webmasterworld.com/robots # Actual robots can always be found here for: http://www.webmasterworld.com/robots2 # Old full robots.txt can be found here: http://www.webmasterworld.com/robots3 # Any unauthorized bot running will result in IP's being banned. # Agent spoofing is considered a bot. # Fair warning to the clueless - honey pots are - and have been - running. # If you have been banned for bot running - please sticky an admin for a reinclusion request. # http://www.searchengineworld.com/robots/ # This code found here: http://www.webmasterworld.com/robots.txt?view=rawcode User-agent: * Crawl-delay: 17 User-agent: * Disallow: /gfx/ Disallow: /cgi-bin/ Disallow: /QuickSand/ Disallow: /pda/ Disallow: /zForumFFFFFF/
# WebmasterWorld.com: robots.txt # GNU Robots.txt Feel free to use with credit # given to WebmasterWorld. # Please, we do NOT allow nonauthorized robots any longer. # http://www.searchengineworld.com/robots/ # Yes, feel free to copy and use the following. User-agent: OmniExplorer_Bot Disallow: / User-agent: FreeFind Disallow: / User-agent: BecomeBot Disallow: / User-agent: Nutch Disallow: / User-agent: Jetbot/1.0 Disallow: / User-agent: Jetbot Disallow: / User-agent: WebVac Disallow: / User-agent: Stanford Disallow: / User-agent: naver Disallow: / User-agent: dumbot Disallow: / User-agent: Hatena Antenna Disallow: / User-agent: grub-client Disallow: / User-agent: grub Disallow: / User-agent: looksmart Disallow: / User-agent: WebZip Disallow: / User-agent: larbin Disallow: / User-agent: b2w/0.1 Disallow: / User-agent: Copernic Disallow: / User-agent: psbot Disallow: / User-agent: Python-urllib Disallow: / User-agent: Googlebot-Image Disallow: / User-agent: NetMechanic Disallow: / User-agent: URL_Spider_Pro Disallow: / User-agent: CherryPicker Disallow: / User-agent: EmailCollector Disallow: / User-agent: EmailSiphon Disallow: / User-agent: WebBandit Disallow: / User-agent: EmailWolf Disallow: / User-agent: ExtractorPro Disallow: / User-agent: CopyRightCheck Disallow: / User-agent: Crescent Disallow: / User-agent: SiteSnagger Disallow: / User-agent: ProWebWalker Disallow: / User-agent: CheeseBot Disallow: / User-agent: LNSpiderguy Disallow: / User-agent: Mozilla Disallow: / User-agent: mozilla Disallow: / User-agent: mozilla/3 Disallow: / User-agent: mozilla/4 Disallow: / User-agent: mozilla/5 Disallow: / User-agent: Mozilla/4.0 (compatible; MSIE 4.0; Windows NT) Disallow: / User-agent: Mozilla/4.0 (compatible; MSIE 4.0; Windows 95) Disallow: / User-agent: Mozilla/4.0 (compatible; MSIE 4.0; Windows 98) Disallow: / User-agent: Mozilla/4.0 (compatible; MSIE 4.0; Windows XP) Disallow: / User-agent: Mozilla/4.0 (compatible; MSIE 4.0; Windows 2000) Disallow: / User-agent: ia_archiver Disallow: / User-agent: ia_archiver/1.6 Disallow: / User-agent: Alexibot Disallow: / User-agent: Teleport Disallow: / User-agent: TeleportPro Disallow: / User-agent: Stanford Comp Sci Disallow: / User-agent: MIIxpc Disallow: / User-agent: Telesoft Disallow: / User-agent: Website Quester Disallow: / User-agent: moget/2.1 Disallow: / User-agent: WebZip/4.0 Disallow: / User-agent: WebStripper Disallow: / User-agent: WebSauger Disallow: / User-agent: WebCopier Disallow: / User-agent: NetAnts Disallow: / User-agent: Mister PiX Disallow: / User-agent: WebAuto Disallow: / User-agent: TheNomad Disallow: / User-agent: WWW-Collector-E Disallow: / User-agent: RMA Disallow: / User-agent: libWeb/clsHTTP Disallow: / User-agent: asterias Disallow: / User-agent: httplib Disallow: / User-agent: turingos Disallow: / User-agent: spanner Disallow: / User-agent: InfoNaviRobot Disallow: / User-agent: Harvest/1.5 Disallow: / User-agent: Bullseye/1.0 Disallow: / User-agent: Mozilla/4.0 (compatible; BullsEye; Windows 95) Disallow: / User-agent: Crescent Internet ToolPak HTTP OLE Control v.1.0 Disallow: / User-agent: CherryPickerSE/1.0 Disallow: / User-agent: CherryPickerElite/1.0 Disallow: / User-agent: WebBandit/3.50 Disallow: / User-agent: NICErsPRO Disallow: / User-agent: Microsoft URL Control - 5.01.4511 Disallow: / User-agent: DittoSpyder Disallow: / User-agent: Foobot Disallow: / User-agent: WebmasterWorldForumBot Disallow: / User-agent: SpankBot Disallow: / User-agent: BotALot Disallow: / User-agent: lwp-trivial/1.34 Disallow: / 
User-agent: lwp-trivial Disallow: / User-agent: http://www.WebmasterWorld.com bot Disallow: / User-agent: BunnySlippers Disallow: / User-agent: Microsoft URL Control - 6.00.8169 Disallow: / User-agent: URLy Warning Disallow: / User-agent: Wget/1.6 Disallow: / User-agent: Wget/1.5.3 Disallow: / User-agent: Wget Disallow: / User-agent: LinkWalker Disallow: / User-agent: cosmos Disallow: / User-agent: moget Disallow: / User-agent: hloader Disallow: / User-agent: humanlinks Disallow: / User-agent: LinkextractorPro Disallow: / User-agent: Offline Explorer Disallow: / User-agent: Mata Hari Disallow: / User-agent: LexiBot Disallow: / User-agent: Web Image Collector Disallow: / User-agent: The Intraformant Disallow: / User-agent: True_Robot/1.0 Disallow: / User-agent: True_Robot Disallow: / User-agent: BlowFish/1.0 Disallow: / User-agent: http://www.SearchEngineWorld.com bot Disallow: / User-agent: http://www.WebmasterWorld.com bot Disallow: / User-agent: JennyBot Disallow: / User-agent: MIIxpc/4.2 Disallow: / User-agent: BuiltBotTough Disallow: / User-agent: ProPowerBot/2.14 Disallow: / User-agent: BackDoorBot/1.0 Disallow: / User-agent: toCrawl/UrlDispatcher Disallow: / User-agent: WebEnhancer Disallow: / User-agent: suzuran Disallow: / User-agent: VCI WebViewer VCI WebViewer Win32 Disallow: / User-agent: VCI Disallow: / User-agent: Szukacz/1.4 Disallow: / User-agent: QueryN Metasearch Disallow: / User-agent: Openfind data gathere Disallow: / User-agent: Openfind Disallow: / User-agent: Xenu's Link Sleuth 1.1c Disallow: / User-agent: Xenu's Disallow: / User-agent: Zeus Disallow: / User-agent: RepoMonkey Bait & Tackle/v1.01 Disallow: / User-agent: RepoMonkey Disallow: / User-agent: Microsoft URL Control Disallow: / User-agent: Openbot Disallow: / User-agent: URL Control Disallow: / User-agent: Zeus Link Scout Disallow: / User-agent: Zeus 32297 Webster Pro V2.9 Win32 Disallow: / User-agent: Webster Pro Disallow: / User-agent: EroCrawler Disallow: / User-agent: LinkScan/8.1a Unix Disallow: / User-agent: Keyword Density/0.9 Disallow: / User-agent: Kenjin Spider Disallow: / User-agent: Iron33/1.0.2 Disallow: / User-agent: Bookmark search tool Disallow: / User-agent: GetRight/4.2 Disallow: / User-agent: FairAd Client Disallow: / User-agent: Gaisbot Disallow: / User-agent: Aqua_Products Disallow: / User-agent: Radiation Retriever 1.1 Disallow: / User-agent: WebmasterWorld Extractor Disallow: / User-agent: Flaming AttackBot Disallow: / User-agent: Oracle Ultra Search Disallow: / User-agent: MSIECrawler Disallow: / User-agent: PerMan Disallow: / User-agent: searchpreview Disallow: / User-agent: sootle Disallow: / User-agent: es Disallow: / User-agent: Enterprise_Search/1.0 Disallow: / User-agent: Enterprise_Search Disallow: / User-agent: * Disallow: /gfx/ Disallow: /cgi-bin/ Disallow: /QuickSand/ Disallow: /pda/ Disallow: /zForumFFFFFF/
You don't have to block your feeds from indexing. Matt Cutts himself suggested not blocking them because there is no real reason to do so. If you are talking about a blog and its /feed URLs, they won't cause a mess in your rankings, so my suggestion would be to leave those feeds alone.
As for blocking, no, you won't see any effect on your main URL if you block /feed or whatever URLs you want. Googlebot will just deindex them and stop crawling them.
Wget Robot Exclusion
It is extremely easy to make Wget wander aimlessly around a web site, sucking all the available data in progress. `wget -r site', and you're set. Great? Not for the server admin.
As long as Wget is only retrieving static pages, and doing it at a reasonable rate (see the `--wait' option), there's not much of a problem. The trouble is that Wget can't tell the difference between the smallest static page and the most demanding CGI. A site I know has a section handled by an, uh, bitchin' CGI Perl script that converts Info files to HTML on the fly. The script is slow, but works well enough for human users viewing an occasional Info file. However, when someone's recursive Wget download stumbles upon the index page that links to all the Info files through the script, the system is brought to its knees without providing anything useful to the downloader.
To avoid this kind of accident, as well as to preserve privacy for documents that need to be protected from well-behaved robots, the concept of robot exclusion has been invented. The idea is that the server administrators and document authors can specify which portions of the site they wish to protect from the robots.
The most popular mechanism, and the de facto standard supported by all the major robots, is the "Robots Exclusion Standard" (RES) written by Martijn Koster et al. in 1994. It specifies the format of a text file containing directives that instruct the robots which URL paths to avoid. To be found by the robots, the specifications must be placed in `/robots.txt'
in the server root, which the robots are supposed to download and parse.
Although Wget is not a web robot in the strictest sense of the word, it can download large parts of a site without the user's intervention to download an individual page. Because of that, Wget honors RES when downloading recursively. For instance, when you issue: wget -r http://www.server.com/
First the index of `www.server.com' will be downloaded. If Wget finds that it wants to download more documents from that server, it will request `http://www.server.com/robots.txt'
and, if found, use it for further downloads. `robots.txt' is loaded only once per each server.
Until version 1.8, Wget supported the first version of the standard, written by Martijn Koster in 1994 and available at http://www.robotstxt.org/wc/norobots.html. As of version 1.8, Wget has supported the additional directives specified in the internet draft draft-koster-robots-00.txt titled "A Method for Web Robots Control". The draft, which as far as I know never made it to an RFC, is available at http://www.robotstxt.org/wc/norobots-rfc.txt.
This manual no longer includes the text of the Robot Exclusion Standard.
The second, less known mechanism, enables the author of an individual document to specify whether they want the links from the file to be followed by a robot. This is achieved using the META tag, like this:
This is explained in some detail at http://www.robotstxt.org/wc/meta-user.html. Wget supports this method of robot exclusion in addition to the usual /robots.txt
exclusion.
If you know what you are doing and really really wish to turn off the robot exclusion, set the robots variable to `off' in your `.wgetrc'. You can achieve the same effect from the command line using the -e switch, e.g. wget -e robots=off url
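A minimal sketch of the two equivalent forms (the URL is just a placeholder):

# in ~/.wgetrc
robots = off

# or, for a single run, on the command line
wget -e robots=off -r http://www.server.com/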