SEO with Robots.txt
robots.txt search engine optimization simply means using robots.txt effectively for your blog, WordPress site, or phpBB forum. This page collects a WordPress-optimized robots.txt, the relevant meta tags, and Google's own recommendations.
See the Updated WordPress robots.txt file
Google Robots.txt Info and Recommendations
Make use of the robots.txt file on your web server. This file tells crawlers which directories can or cannot be crawled. Make sure it's current for your site so that you don't accidentally block the Googlebot crawler.
Googlebot and Robots.txt SEO Info
When deciding which pages to crawl, Googlebot evaluates robots.txt in this order:
- Googlebot will obey the first record in the robots.txt file with a User-agent starting with "Googlebot."
- If no "Googlebot User-agent exists, it will obey the first entry with a User-agent of "*"
Google User-agents
- Googlebot: crawls pages for our web index and our news index
- Googlebot-Mobile: crawls pages for our mobile index
- Googlebot-Image: crawls pages for our image index
- Mediapartners-Google: crawls pages to determine AdSense content; we only use this bot to crawl your site if you show AdSense ads on your site
- Adsbot-Google: crawls pages to measure AdWords landing page quality; we only use this bot if you use Google AdWords to advertise your site
Removing Old or Wrong Content from Google
- Create the new page
- In .htaccess (if your server runs Apache), add a RedirectPermanent directive (see the sketch after this list)
- DO NOT DELETE THE OLD FILE
- Update all the links on your website to point to the new page (change the link text while you're at it)
- Verify that no pages point to the old file (including your sitemap.xml)
- Add a noindex,nofollow robots meta tag to the old file AND a Disallow rule for it in your robots.txt
- Submit your updated sitemap.xml to Google & Yahoo
- Wait a few weeks
- When the new page appears in Google, it's safe to delete the old one
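As a rough sketch of the RedirectPermanent and robots.txt steps above, assuming an Apache server and made-up file names (/old-page.html moving to /new-page.html):

# .htaccess: permanently redirect the old URL to the new one
RedirectPermanent /old-page.html http://www.example.com/new-page.html

# robots.txt: keep crawlers away from the old URL
User-agent: *
Disallow: /old-page.html

The noindex,nofollow meta tag goes in the <head> of the old file; see the meta tag examples further down this page.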
Google Sponsored Robots.txt Articles
- Controlling how search engines access and index your website
- The Robots Exclusion Protocol
- robots.txt analysis tool
- Googlebot
- Inside Google Sitemaps: Using a robots.txt file
- All About Googlebot
robots.txt examples
robots.txt for WordPress 2.+
User-agent: *
# disallow all files in these directories
Disallow: /cgi-bin/
Disallow: /z/j/
Disallow: /z/c/
Disallow: /stats/
Disallow: /dh_
Disallow: /about/
Disallow: /contact/
Disallow: /tag/
Disallow: /wp-admin/
Disallow: /wp-includes/
Disallow: /contact
Disallow: /manual
Disallow: /manual/*
Disallow: /phpmanual/
Disallow: /category/

User-agent: Googlebot
# disallow all files ending with these extensions
Disallow: /*.php$
Disallow: /*.js$
Disallow: /*.inc$
Disallow: /*.css$
Disallow: /*.gz$
Disallow: /*.wmv$
Disallow: /*.cgi$
Disallow: /*.xhtml$
# disallow all files with ? in url
Disallow: /*?*

# disable duggmirror
User-agent: duggmirror
Disallow: /

# allow google image bot to search all images
User-agent: Googlebot-Image
Disallow:
Allow: /*

# allow adsense bot on entire site
User-agent: Mediapartners-Google*
Disallow:
Allow: /*
robots.txt for phpBB
User-agent: *
Disallow: /cgi-bin/
Disallow: /phpbb/admin/
Disallow: /phpbb/cache/
Disallow: /phpbb/db/
Disallow: /phpbb/images/
Disallow: /phpbb/includes/
Disallow: /phpbb/language/
Disallow: /phpbb/templates/
Disallow: /phpbb/faq.php
Disallow: /phpbb/groupcp.php
Disallow: /phpbb/login.php
Disallow: /phpbb/memberlist.php
Disallow: /phpbb/modcp.php
Disallow: /phpbb/posting.php
Disallow: /phpbb/privmsg.php
Disallow: /phpbb/profile.php
Disallow: /phpbb/search.php
Disallow: /phpbb/viewonline.php

User-agent: Googlebot
# disallow files ending with these extensions
Disallow: /*.inc$
Disallow: /*.js$
Disallow: /*.css$
# disallow all files with ? in url
Disallow: *mark=*
Disallow: *view=*

# allow google image bot to search all images
User-agent: Googlebot-Image
Disallow:
Allow: /*

# allow adsense bot on entire site
User-agent: Mediapartners-Google*
Disallow:
Allow: /*
User-agent: *
Disallow: /stats
Disallow: /dh_
Disallow: /V
Disallow: /z/j/
Disallow: /z/c/
Disallow: /cgi-bin/
Disallow: /viewtopic.php
Disallow: /viewforum.php
Disallow: /index.php?
Disallow: /posting.php
Disallow: /groupcp.php
Disallow: /search.php
Disallow: /login.php
Disallow: /post
Disallow: /member
Disallow: /profile.php
Disallow: /memberlist.php
Disallow: /faq.php
Disallow: /templates/
Disallow: /mx_
Disallow: /db/
Disallow: /admin/
Disallow: /cache/
Disallow: /images/
Disallow: /includes/
Disallow: /common.php
Disallow: /index.php
Disallow: /modcp.php
Disallow: /privmsg.php
Disallow: /viewonline.php
Disallow: /rss.php

User-agent: Googlebot
# disallow all files ending with these extensions
Allow: /sitemap.php
Disallow: /*.php$
Disallow: /*.js$
Disallow: /*.inc$
Disallow: /*.css$
Disallow: /*.txt$
# disallow all files with ? in url
Disallow: /*?*
Disallow: /*?
# disallow all files in /wp- directories
Disallow: /wp-*/

# disallow archiving site
User-agent: ia_archiver
Disallow: /

# allow google image bot to search all images
User-agent: Googlebot-Image
Disallow:
Allow: /*.gif$
Allow: /*.png$
Allow: /*.jpeg$
Allow: /*.jpg$
Allow: /*.ico$
Allow: /images
Allow: /z/i/

# allow adsense bot on entire site
User-agent: Mediapartners-Google*
Allow: /*
Pattern Matching with Google
Matching a sequence of characters using *
You can use an asterisk (*) to match a sequence of characters.
Block access to all subdirectories that begin with private:
User-Agent: Googlebot
Disallow: /private*/
Block access to all URLs that include a ?
User-agent: *
Disallow: /*?*
Matching the end characters of the URL using $
You can use the $ character to specify matching the end of the URL.
Block any URLs that end with .php
User-Agent: Googlebot
Disallow: /*.php$
You can use this pattern matching in combination with the Allow directive.
To exclude all URLs that contain a ? (so Googlebot doesn't crawl duplicate pages) while still allowing URLs that end with a ? to be crawled:
User-agent: *
Allow: /*?$
Disallow: /*?
Disallow: /*? blocks any URL that begins with your host name, followed by any string, followed by a ?, followed by any string.
Allow: /*?$ allows any URL that begins with your host name, followed by any string, followed by a ?, with no characters after the ?.
User-Agent Discussion
Blocking a specific User-Agent
Note: Blocking Googlebot blocks all bots that begin with "Googlebot"
Block Googlebot entirely
User-agent: Googlebot
Disallow: /
Allowing a specific User-Agent
Note: Googlebot follows the line directed at it, rather than the line directed at everyone.
Block access to all bots other than "Googlebot"
User-agent: *
Disallow: /

User-agent: Googlebot
Disallow:
Googlebot recognizes an extension to the robots.txt standard called Allow, which is opposite of Disallow.
Block all pages inside a subdirectory except for a single file
User-Agent: Googlebot
Disallow: /folder1/
Allow: /folder1/myfile.html
Block Googlebot but allow another bot
User-agent: Googlebot
Disallow: /

User-agent: Googlebot-Mobile
Allow: /
Removing Content From Google
It is better to use a noindex robots meta tag on pages that have already been indexed if you wish Google to drop them. This is much faster than blocking them with robots.txt.
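The standard form of that tag, placed in the <head> of the page you want dropped, is:

<meta name="robots" content="noindex">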
Note: removing snippets also removes cached pages.
A snippet is a text excerpt that appears below a page's title in our search results and describes the content of the page.
Prevent Google from displaying snippets for your page
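The meta tag that does this is the nosnippet value addressed to Googlebot, placed in the page's <head>:

<meta name="googlebot" content="nosnippet">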
Remove an outdated "dead" link
Google updates its entire index automatically on a regular basis. When we crawl the web, we find new pages, discard dead links, and update links automatically. Links that are outdated now will most likely "fade out" of our index during our next crawl.
Note: Please ensure that you return a true 404 error even if you choose to display a more user-friendly body of the HTML page for your visitors. It won't help to return a page that says "File Not Found" if the http headers still return a status code of 200, or normal.
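One quick way to check which status code a removed URL actually returns (assuming wget is available, and substituting your own URL for the placeholder) is:

wget -S --spider http://www.example.com/removed-page.html

The -S option prints the server's response headers, so you can confirm the status line says 404 rather than 200.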
Remove cached pages
Google automatically takes a "snapshot" of each page it crawls and archives it. This "cached" version allows a webpage to be retrieved for your end users if the original page is ever unavailable (due to temporary failure of the page's web server). The cached page appears to users exactly as it looked when Google last crawled it, and we display a message at the top of the page to indicate that it's a cached version. Users can access the cached version by choosing the "Cached" link on the search results page.
Prevent all search engines from showing a "Cached" link for your site
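The tag for this, placed in the page's <head>, is the noarchive value addressed to all robots:

<meta name="robots" content="noarchive">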
Allow other search engines to show a "Cached" link, preventing only Google
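To target only Google, address the same value to the googlebot user-agent:

<meta name="googlebot" content="noarchive">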
Note: this tag only removes the "Cached" link for the page. Google will continue to index the page and display a snippet.
Remove your entire website
If you wish to exclude your entire website from Google's index
Remove site from search engines and prevent all robots from crawling it in the future
User-agent: *
Disallow: /
Note: Please note that Googlebot does not interpret a 401/403 response ("Unauthorized"/"Forbidden") to a robots.txt fetch as a request not to crawl any pages on the site.
To remove your site from Google only and prevent just Googlebot from crawling your site in the future
User-agent: Googlebot
Disallow: /
Allow Googlebot to index all http pages but no https pages
Each port must have its own robots.txt file. In particular, if you serve content via both http and https, you'll need a separate robots.txt file for each of these protocols.
For your http protocol (http://yourserver.com/robots.txt)
User-agent: *
Allow: /
For the https protocol (https://yourserver.com/robots.txt)
User-agent: *
Disallow: /
Remove part of your website
Option 1: Robots.txt
Remove all pages under a particular directory (for example, lems)
User-agent: Googlebot
Disallow: /lems
Remove all files of a specific file type (for example, .gif)
User-agent: Googlebot
Disallow: /*.gif$
To remove dynamically generated pages, you'd use this robots.txt entry
User-agent: Googlebot
Disallow: /*?
Option 2: Meta tags
Another standard, which can be more convenient for page-by-page use, involves adding a META tag to an HTML page to tell robots not to index the page. This standard is described at http://www.robotstxt.org/wc/exclusion.html#meta.
Prevent all robots from indexing a page on your site
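The tag for this is the standard robots noindex meta tag, placed in the page's <head>:

<meta name="robots" content="noindex">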
Allow other robots to index the page on your site, preventing only Google's robots from indexing the page
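For Google only, address the tag to googlebot instead:

<meta name="googlebot" content="noindex">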
Allow robots to index the page on your site but instruct them not to follow outgoing links
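The nofollow value keeps robots from following the page's outgoing links:

<meta name="robots" content="nofollow">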
Remove an image from Google's Image Search
If you want Google to exclude the dogs.jpg image that appears on your site at www.yoursite.com/images/dogs.jpg:
User-agent: Googlebot-Image
Disallow: /images/dogs.jpg
Remove all the images on your site from our index
User-agent: Googlebot-Image
Disallow: /
Remove all files of a specific file type (for example, to include .jpg but not .gif images)
User-agent: Googlebot-Image
Disallow: /*.gif$
Note: If you believe your request is urgent and cannot wait until the next time Google crawls your site, use our automatic URL removal system. In order for this automated process to work, the webmaster or SEO agency must first create and place a robots.txt file on the site in question.
Google will continue to exclude your site or directories from successive crawls if the robots.txt file exists in the web server root. If you do not have access to the root level of your server, you may place a robots.txt file at the same level as the files you want to remove. Doing this and submitting via the automatic URL removal system will cause a temporary, 180 day removal of the directories specified in your robots.txt file from the Google index, regardless of whether you remove the robots.txt file after processing your request. (Keeping the robots.txt file at the same level would require you to return to the URL removal system every 180 days to reissue the removal.)
Remove a blog from Blog Search
Only blogs with site feeds will be included in Blog Search. If you'd like to prevent your feed from being crawled, make use of a robots.txt file or meta tags (NOINDEX or NOFOLLOW), as described above. Please note that if you have a feed that was previously included, the old posts will remain in the index even though new ones will not be added.
Remove an RSS or Atom feed
When users add your feed to their Google homepage or Google Reader, Google's Feedfetcher attempts to obtain the content of the feed in order to display it. Since Feedfetcher requests come from explicit action by human users, Feedfetcher has been designed to ignore robots.txt guidelines.
It's not possible for Google to restrict access to a publicly available feed. If your feed is provided by a blog hosting service, you should work with them to restrict access to your feed. Check those sites' help content for more information (e.g., Blogger, LiveJournal, or Typepad).
Remove transcoded pages
Google Web Search on mobile phones allows users to search all the content in the Google index for desktop web browsers. Because this content isn't written specifically for mobile phones and devices and thus might not display properly, Google automatically translates (or "transcodes") these pages by analyzing the original HTML code and converting it to a mobile-ready format. To ensure that the highest quality and most useable web page is displayed on your mobile phone or device, Google may resize, adjust, or convert images, text formatting and/or certain aspects of web page functionality.
To save bandwidth, Googlebot only downloads the robots.txt file once a day or whenever we've fetched many pages from the server. So, it may take a while for Googlebot to learn of changes to your robots.txt file. Also, Googlebot is distributed on several machines. Each of these keeps its own record of your robots.txt file.
Also, there's a small difference between the way Googlebot handles the robots.txt file and the way the robots.txt standard says we should (keeping in mind the distinction between "should" and "must"). The standard says we should obey the first applicable rule, whereas Googlebot obeys the longest (that is, the most specific) applicable rule. This more intuitive practice matches what people actually do, and what they expect us to do.
For example, consider the following robots.txt file:
User-Agent: *
Allow: /
Disallow: /cgi-bin
It's obvious that the webmaster's intent here is to allow robots to crawl everything except the /cgi-bin directory. Consequently, that's what we do.
Tell Googlebot not to count certain external links in your ranking
Meta tags can exclude all outgoing links on a page, but you can also instruct Googlebot not to crawl individual links by adding rel="nofollow" to a hyperlink. When Google sees the attribute rel="nofollow" on hyperlinks, those links won't get any credit when we rank websites in our search results. For example, a link reading "This is a great link!" could be replaced with one reading "I can't vouch for this link" that carries the rel="nofollow" attribute.
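A minimal illustration (the example.com URL is just a placeholder): the plain link

<a href="http://www.example.com/">This is a great link!</a>

could be swapped for

<a href="http://www.example.com/" rel="nofollow">I can't vouch for this link</a>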
Other Links
- Database of Web Robots, Overview
# robots.txt, www.nytimes.com 6/29/2006 User-agent: * Disallow: /pages/college/ Disallow: /college/ Disallow: /library/ Disallow: /learning/ Disallow: /aponline/ Disallow: /reuters/ Disallow: /cnet/ Disallow: /partners/ Disallow: /archives/ Disallow: /indexes/ Disallow: /thestreet/ Disallow: /nytimes-partners/ Disallow: /financialtimes/ Allow: /pages/ Allow: /2003/ Allow: /2004/ Allow: /2005/ Allow: /top/ Allow: /ref/ Allow: /services/xml/ User-agent: Mediapartners-Google* Disallow: # robots.txt, http://dictionary.reference.com User-agent: Googlebot Disallow: User-agent: Mediapartners-Google Disallow: User-agent: Teleport Pro Disallow: / User-agent: * Disallow: /cgi-bin/ # robots.txt for www.phpbbhacks.com User-agent: * Disallow: /forums/viewtopic.php Disallow: /forums/viewforum.php Disallow: /forums/index.php? Disallow: /forums/posting.php Disallow: /forums/groupcp.php Disallow: /forums/search.php Disallow: /forums/login.php Disallow: /forums/privmsg.php Disallow: /forums/post Disallow: /forums/profile.php Disallow: /forums/memberlist.php Disallow: /forums/faq.php Disallow: /forums/archive # robots.txt for Slashdot.org # # "Any empty [Disallow] value, indicates that all URLs can be retrieved. # At least one Disallow field needs to be present in a record." User-agent: Mediapartners-Google Disallow: User-agent: Googlebot Crawl-delay: 100 Disallow: /firehose.pl Disallow: /submit.pl Disallow: /comments.pl Disallow: /users.pl Disallow: /zoo.pl Disallow: firehose.pl Disallow: submit.pl Disallow: comments.pl Disallow: users.pl Disallow: zoo.pl Disallow: /~ Disallow: ~ User-agent: Slurp Crawl-delay: 100 Disallow: User-agent: Yahoo-NewsCrawler Disallow: User-Agent: msnbot Crawl-delay: 100 Disallow: User-agent: * Crawl-delay: 100 Disallow: /authors.pl Disallow: /index.pl Disallow: /article.pl Disallow: /comments.pl Disallow: /firehose.pl Disallow: /journal.pl Disallow: /messages.pl Disallow: /metamod.pl Disallow: /users.pl Disallow: /search.pl Disallow: /submit.pl Disallow: /pollBooth.pl Disallow: /pubkey.pl Disallow: /topics.pl Disallow: /zoo.pl Disallow: /palm Disallow: authors.pl Disallow: index.pl Disallow: article.pl Disallow: comments.pl Disallow: firehose.pl Disallow: journal.pl Disallow: messages.pl Disallow: metamod.pl Disallow: users.pl Disallow: search.pl Disallow: submit.pl Disallow: pollBooth.pl Disallow: pubkey.pl Disallow: topics.pl Disallow: zoo.pl Disallow: /~ Disallow: ~ # robots.txt for http://www.myspace.com User-agent: ia_archiver Disallow: / # robots.txt for http://www.craigslist.com User-agent: YahooFeedSeeker Disallow: /forums Disallow: /res/ Disallow: /post Disallow: /email.friend Disallow: /?flagCode Disallow: /ccc Disallow: /hhh Disallow: /sss Disallow: /bbb Disallow: /ggg Disallow: /jjj User-agent: * Disallow: /cgi-bin Disallow: /cgi-secure Disallow: /forums Disallow: /search Disallow: /res/ Disallow: /post Disallow: /email.friend Disallow: /?flagCode Disallow: /ccc Disallow: /hhh Disallow: /sss Disallow: /bbb Disallow: /ggg Disallow: /jjj User-Agent: OmniExplorer_Bot Disallow: / # robots.txt for http://www.alexa.com User-agent: googlebot # allow Google crawler Disallow: /search User-agent: gulliver # allow Northern Light crawler Disallow: /search User-agent: slurp # allow Inktomi crawler Disallow: /search User-agent: fast # allow FAST crawler Disallow: /search User-agent: scooter # allow AltaVista crawler Disallow: /search User-agent: vscooter # allow AltaVista image crawler Disallow: /search User-agent: ia_archiver # allow Internet Archive crawler Disallow: /search 
User-agent: * # Disallow all other crawlers access Disallow: / # robots.txt for http://www.technorati.com User-agent: NPBot Disallow: / User-agent: TurnitinBot Disallow: / User-Agent: sitecheck.internetseer.com Disallow: / User-Agent: * Crawl-Delay: 3 Disallow: /search/ Disallow: /search.php Disallow: /cosmos.php # robots.txt for www.sitepoint.com User-agent: * Disallow: /cgi-bin/ Disallow: /images/ Disallow: /forums/report.php Disallow: /forums/search.php Disallow: /forums/newreply.php Disallow: /forums/editpost.php Disallow: /forums/memberlist.php Disallow: /forums/profile.php Disallow: /launch/ Disallow: /search/ Disallow: /voucher/424/ Disallow: /email/ Disallow: /feedback/ Disallow: /contact?reason=articlesuggest Disallow: /linktothis/ Disallow: /popup/ Disallow: /forums/archive/ # robots.txt for http://www.w3.org/ # For use by search.w3.org User-agent: W3C-gsa Disallow: /Out-Of-Date User-agent: Mozilla/4.0 (compatible; MSIE 6.0; Windows NT; MS Search 4.0 Robot) Disallow: / # W3C Link checker User-agent: W3C-checklink Disallow: # exclude some access-controlled areas User-agent: * Disallow: /2004/ontaria/basic Disallow: /Team Disallow: /Project Disallow: /Systems Disallow: /Web Disallow: /History Disallow: /Out-Of-Date Disallow: /2002/02/mid Disallow: /mid/ Disallow: /People/all/ Disallow: /RDF/Validator/ARPServlet Disallow: /2003/03/Translations/byLanguage Disallow: /2003/03/Translations/byTechnology Disallow: /2005/11/Translations/Query Disallow: /2003/glossary/subglossary/ #Disallow: /2005/06/blog/ #Disallow: /2001/07/pubrules-checker #shouldnt get transparent proxies but will ml links of things like pubrules Disallow: /2000/06/webdata/xslt Disallow: /2000/09/webdata/xslt Disallow: /2005/08/online_xslt/xslt Disallow: /Bugs/ Disallow: /Search/Mail/Public/ Disallow: /2006/02/chartergen # robots.txt for www.google-analytics.com User-Agent: * Disallow: / Noindex: / # robots.txt for video.google.com User-agent: * Disallow: /videosearch? Disallow: /videofeed? Disallow: /videopreview? Disallow: /videopreviewbig? Disallow: /videoprograminfo? Disallow: /videorandom Disallow: /videolineup Disallow: /downloadgvp # robots.txt for www.google.com User-agent: * Allow: /searchhistory/ Disallow: /news?output=xhtml& Allow: /news?output=xhtml Disallow: /search Disallow: /groups Disallow: /images Disallow: /catalogs Disallow: /catalogues Disallow: /news Disallow: /nwshp Disallow: /? Disallow: /addurl/image? Disallow: /pagead/ Disallow: /relpage/ Disallow: /relcontent Disallow: /sorry/ Disallow: /imgres Disallow: /keyword/ Disallow: /u/ Disallow: /univ/ Disallow: /cobrand Disallow: /custom Disallow: /advanced_group_search Disallow: /advanced_search Disallow: /googlesite Disallow: /preferences Disallow: /setprefs Disallow: /swr Disallow: /url Disallow: /m? Disallow: /m/search? Disallow: /wml? Disallow: /wml/search? Disallow: /xhtml? Disallow: /xhtml/search? Disallow: /xml? Disallow: /imode? Disallow: /imode/search? Disallow: /jsky? Disallow: /jsky/search? Disallow: /pda? Disallow: /pda/search? Disallow: /sprint_xhtml Disallow: /sprint_wml Disallow: /pqa Disallow: /palm Disallow: /gwt/ Disallow: /purchases Disallow: /hws Disallow: /bsd? Disallow: /linux? Disallow: /mac? Disallow: /microsoft? Disallow: /unclesam? Disallow: /answers/search?q= Disallow: /local? Disallow: /local_url Disallow: /froogle? Disallow: /froogle_ Disallow: /print Disallow: /books Disallow: /patents? Disallow: /scholar? Disallow: /complete Disallow: /sponsoredlinks Disallow: /videosearch? Disallow: /videopreview? 
Disallow: /videoprograminfo? Disallow: /maps? Disallow: /translate? Disallow: /ie? Disallow: /sms/demo? Disallow: /katrina? Disallow: /blogsearch? Disallow: /blogsearch/ Disallow: /blogsearch_feeds Disallow: /advanced_blog_search Disallow: /reader/ Disallow: /uds/ Disallow: /chart? Disallow: /transit? Disallow: /mbd? Disallow: /extern_js/ Disallow: /calendar/feeds/ Disallow: /calendar/ical/ Disallow: /cl2/feeds/ Disallow: /cl2/ical/ Disallow: /coop/directory Disallow: /coop/manage Disallow: /trends? Disallow: /trends/music? Disallow: /notebook/search? Disallow: /music Disallow: /browsersync Disallow: /call Disallow: /archivesearch? Disallow: /archivesearch/url Disallow: /archivesearch/advanced_search Disallow: /base/search? Disallow: /base/reportbadoffer Disallow: /base/s2 Disallow: /urchin_test/ Disallow: /movies? Disallow: /codesearch? Disallow: /codesearch/feeds/search? Disallow: /wapsearch? Disallow: /safebrowsing Disallow: /finance Disallow: /reviews/search? # robots.txt for validator.w3.org # $Id: robots.txt,v 1.3 2000/12/13 13:04:09 gerald Exp $ User-agent: * Disallow: /check # robots.txt for httpd.apache.org User-agent: * Disallow: /websrc # robots.txt for www.apache.org User-agent: * Disallow: /websrc Crawl-Delay: 4
# Please, we do NOT allow nonauthorized robots. # http://www.webmasterworld.com/robots # Actual robots can always be found here for: http://www.webmasterworld.com/robots2 # Old full robots.txt can be found here: http://www.webmasterworld.com/robots3 # Any unauthorized bot running will result in IP's being banned. # Agent spoofing is considered a bot. # Fair warning to the clueless - honey pots are - and have been - running. # If you have been banned for bot running - please sticky an admin for a reinclusion request. # http://www.searchengineworld.com/robots/ # This code found here: http://www.webmasterworld.com/robots.txt?view=rawcode User-agent: * Crawl-delay: 17 User-agent: * Disallow: /gfx/ Disallow: /cgi-bin/ Disallow: /QuickSand/ Disallow: /pda/ Disallow: /zForumFFFFFF/
# WebmasterWorld.com: robots.txt # GNU Robots.txt Feel free to use with credit # given to WebmasterWorld. # Please, we do NOT allow nonauthorized robots any longer. # http://www.searchengineworld.com/robots/ # Yes, feel free to copy and use the following. User-agent: OmniExplorer_Bot Disallow: / User-agent: FreeFind Disallow: / User-agent: BecomeBot Disallow: / User-agent: Nutch Disallow: / User-agent: Jetbot/1.0 Disallow: / User-agent: Jetbot Disallow: / User-agent: WebVac Disallow: / User-agent: Stanford Disallow: / User-agent: naver Disallow: / User-agent: dumbot Disallow: / User-agent: Hatena Antenna Disallow: / User-agent: grub-client Disallow: / User-agent: grub Disallow: / User-agent: looksmart Disallow: / User-agent: WebZip Disallow: / User-agent: larbin Disallow: / User-agent: b2w/0.1 Disallow: / User-agent: Copernic Disallow: / User-agent: psbot Disallow: / User-agent: Python-urllib Disallow: / User-agent: Googlebot-Image Disallow: / User-agent: NetMechanic Disallow: / User-agent: URL_Spider_Pro Disallow: / User-agent: CherryPicker Disallow: / User-agent: EmailCollector Disallow: / User-agent: EmailSiphon Disallow: / User-agent: WebBandit Disallow: / User-agent: EmailWolf Disallow: / User-agent: ExtractorPro Disallow: / User-agent: CopyRightCheck Disallow: / User-agent: Crescent Disallow: / User-agent: SiteSnagger Disallow: / User-agent: ProWebWalker Disallow: / User-agent: CheeseBot Disallow: / User-agent: LNSpiderguy Disallow: / User-agent: Mozilla Disallow: / User-agent: mozilla Disallow: / User-agent: mozilla/3 Disallow: / User-agent: mozilla/4 Disallow: / User-agent: mozilla/5 Disallow: / User-agent: Mozilla/4.0 (compatible; MSIE 4.0; Windows NT) Disallow: / User-agent: Mozilla/4.0 (compatible; MSIE 4.0; Windows 95) Disallow: / User-agent: Mozilla/4.0 (compatible; MSIE 4.0; Windows 98) Disallow: / User-agent: Mozilla/4.0 (compatible; MSIE 4.0; Windows XP) Disallow: / User-agent: Mozilla/4.0 (compatible; MSIE 4.0; Windows 2000) Disallow: / User-agent: ia_archiver Disallow: / User-agent: ia_archiver/1.6 Disallow: / User-agent: Alexibot Disallow: / User-agent: Teleport Disallow: / User-agent: TeleportPro Disallow: / User-agent: Stanford Comp Sci Disallow: / User-agent: MIIxpc Disallow: / User-agent: Telesoft Disallow: / User-agent: Website Quester Disallow: / User-agent: moget/2.1 Disallow: / User-agent: WebZip/4.0 Disallow: / User-agent: WebStripper Disallow: / User-agent: WebSauger Disallow: / User-agent: WebCopier Disallow: / User-agent: NetAnts Disallow: / User-agent: Mister PiX Disallow: / User-agent: WebAuto Disallow: / User-agent: TheNomad Disallow: / User-agent: WWW-Collector-E Disallow: / User-agent: RMA Disallow: / User-agent: libWeb/clsHTTP Disallow: / User-agent: asterias Disallow: / User-agent: httplib Disallow: / User-agent: turingos Disallow: / User-agent: spanner Disallow: / User-agent: InfoNaviRobot Disallow: / User-agent: Harvest/1.5 Disallow: / User-agent: Bullseye/1.0 Disallow: / User-agent: Mozilla/4.0 (compatible; BullsEye; Windows 95) Disallow: / User-agent: Crescent Internet ToolPak HTTP OLE Control v.1.0 Disallow: / User-agent: CherryPickerSE/1.0 Disallow: / User-agent: CherryPickerElite/1.0 Disallow: / User-agent: WebBandit/3.50 Disallow: / User-agent: NICErsPRO Disallow: / User-agent: Microsoft URL Control - 5.01.4511 Disallow: / User-agent: DittoSpyder Disallow: / User-agent: Foobot Disallow: / User-agent: WebmasterWorldForumBot Disallow: / User-agent: SpankBot Disallow: / User-agent: BotALot Disallow: / User-agent: lwp-trivial/1.34 Disallow: / 
User-agent: lwp-trivial Disallow: / User-agent: http://www.WebmasterWorld.com bot Disallow: / User-agent: BunnySlippers Disallow: / User-agent: Microsoft URL Control - 6.00.8169 Disallow: / User-agent: URLy Warning Disallow: / User-agent: Wget/1.6 Disallow: / User-agent: Wget/1.5.3 Disallow: / User-agent: Wget Disallow: / User-agent: LinkWalker Disallow: / User-agent: cosmos Disallow: / User-agent: moget Disallow: / User-agent: hloader Disallow: / User-agent: humanlinks Disallow: / User-agent: LinkextractorPro Disallow: / User-agent: Offline Explorer Disallow: / User-agent: Mata Hari Disallow: / User-agent: LexiBot Disallow: / User-agent: Web Image Collector Disallow: / User-agent: The Intraformant Disallow: / User-agent: True_Robot/1.0 Disallow: / User-agent: True_Robot Disallow: / User-agent: BlowFish/1.0 Disallow: / User-agent: http://www.SearchEngineWorld.com bot Disallow: / User-agent: http://www.WebmasterWorld.com bot Disallow: / User-agent: JennyBot Disallow: / User-agent: MIIxpc/4.2 Disallow: / User-agent: BuiltBotTough Disallow: / User-agent: ProPowerBot/2.14 Disallow: / User-agent: BackDoorBot/1.0 Disallow: / User-agent: toCrawl/UrlDispatcher Disallow: / User-agent: WebEnhancer Disallow: / User-agent: suzuran Disallow: / User-agent: VCI WebViewer VCI WebViewer Win32 Disallow: / User-agent: VCI Disallow: / User-agent: Szukacz/1.4 Disallow: / User-agent: QueryN Metasearch Disallow: / User-agent: Openfind data gathere Disallow: / User-agent: Openfind Disallow: / User-agent: Xenu's Link Sleuth 1.1c Disallow: / User-agent: Xenu's Disallow: / User-agent: Zeus Disallow: / User-agent: RepoMonkey Bait & Tackle/v1.01 Disallow: / User-agent: RepoMonkey Disallow: / User-agent: Microsoft URL Control Disallow: / User-agent: Openbot Disallow: / User-agent: URL Control Disallow: / User-agent: Zeus Link Scout Disallow: / User-agent: Zeus 32297 Webster Pro V2.9 Win32 Disallow: / User-agent: Webster Pro Disallow: / User-agent: EroCrawler Disallow: / User-agent: LinkScan/8.1a Unix Disallow: / User-agent: Keyword Density/0.9 Disallow: / User-agent: Kenjin Spider Disallow: / User-agent: Iron33/1.0.2 Disallow: / User-agent: Bookmark search tool Disallow: / User-agent: GetRight/4.2 Disallow: / User-agent: FairAd Client Disallow: / User-agent: Gaisbot Disallow: / User-agent: Aqua_Products Disallow: / User-agent: Radiation Retriever 1.1 Disallow: / User-agent: WebmasterWorld Extractor Disallow: / User-agent: Flaming AttackBot Disallow: / User-agent: Oracle Ultra Search Disallow: / User-agent: MSIECrawler Disallow: / User-agent: PerMan Disallow: / User-agent: searchpreview Disallow: / User-agent: sootle Disallow: / User-agent: es Disallow: / User-agent: Enterprise_Search/1.0 Disallow: / User-agent: Enterprise_Search Disallow: / User-agent: * Disallow: /gfx/ Disallow: /cgi-bin/ Disallow: /QuickSand/ Disallow: /pda/ Disallow: /zForumFFFFFF/
You don't have to block your feeds from indexing. Matt Cutts himself suggested not blocking them because there is no real reason to do so. If you are talking about a blog and its /feed URLs, they won't cause a mess in your rankings, so my suggestion would be to leave those feeds alone.
As for blocking, no, you won't see any effect on your main URL if you block /feed or whatever URLs you want. Googlebot will just deindex them and stop crawling them.
Wget Robot Exclusion
It is extremely easy to make Wget wander aimlessly around a web site, sucking all the available data in progress. `wget -r site', and you're set. Great? Not for the server admin.
As long as Wget is only retrieving static pages, and doing it at a reasonable rate (see the `--wait' option), there's not much of a problem. The trouble is that Wget can't tell the difference between the smallest static page and the most demanding CGI. A site I know has a section handled by an, uh, bitchin' CGI Perl script that converts Info files to HTML on the fly. The script is slow, but works well enough for human users viewing an occasional Info file. However, when someone's recursive Wget download stumbles upon the index page that links to all the Info files through the script, the system is brought to its knees without providing anything useful to the downloader.
To avoid this kind of accident, as well as to preserve privacy for documents that need to be protected from well-behaved robots, the concept of robot exclusion has been invented. The idea is that the server administrators and document authors can specify which portions of the site they wish to protect from the robots.
The most popular mechanism, and the de facto standard supported by all the major robots, is the "Robots Exclusion Standard" (RES) written by Martijn Koster et al. in 1994. It specifies the format of a text file containing directives that instruct the robots which URL paths to avoid. To be found by the robots, the specifications must be placed in `/robots.txt'
in the server root, which the robots are supposed to download and parse.
Although Wget is not a web robot in the strictest sense of the word, it can download large parts of a site without the user's intervention to download an individual page. Because of that, Wget honors RES when downloading recursively. For instance, when you issue: wget -r http://www.server.com/
First the index of `www.server.com' will be downloaded. If Wget finds that it wants to download more documents from that server, it will request `http://www.server.com/robots.txt'
and, if found, use it for further downloads. `robots.txt' is loaded only once per each server.
Until version 1.8, Wget supported the first version of the standard, written by Martijn Koster in 1994 and available at http://www.robotstxt.org/wc/norobots.html. As of version 1.8, Wget has supported the additional directives specified in the internet draft draft-koster-robots-00.txt titled "A Method for Web Robots Control". The draft, which as far as I know never made it to an RFC, is available at http://www.robotstxt.org/wc/norobots-rfc.txt.
This manual no longer includes the text of the Robot Exclusion Standard.
The second, less known mechanism, enables the author of an individual document to specify whether they want the links from the file to be followed by a robot. This is achieved using the META tag, like this:
This is explained in some detail at http://www.robotstxt.org/wc/meta-user.html. Wget supports this method of robot exclusion in addition to the usual /robots.txt
exclusion.
If you know what you are doing and really really wish to turn off the robot exclusion, set the robots variable to `off' in your `.wgetrc'. You can achieve the same effect from the command line using the -e switch, e.g. wget -e robots=off url
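A minimal sketch of the two equivalent forms (the URL is just a placeholder):

# in ~/.wgetrc
robots = off

# or, for a single run, on the command line
wget -e robots=off -r http://www.server.com/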