Updated robots.txt for WordPress

Implementing an effective SEO robots.txt file for WordPress can help your blog rank higher in search engines, receive higher-paying, more relevant ads, and increase your traffic. Writing a robots.txt file also makes you look at your site from a search engine robot's point of view... Sweet! Looking for the most updated robots.txt? Just look at mine, I don't slack.

Warning about robots.txt files

Your robots.txt file should never have more than 200 Disallow lines. Start with as few as possible and add to it when needed.

Once Google removes the links referenced in your robots.txt file, it can take up to 3 months for Google to re-index them if you later want them added back.

Google pays serious attention to robots.txt files. Google treats robots.txt as an authoritative list of URLs it must not crawl. If you Disallow a link in robots.txt, Google stops crawling it and its content drops out of the index, so you will not find that content when searching Google. The bare URL can still show up if other sites link to it (see "Prevent page from being indexed" below).

The big idea to take away is to use robots.txt only for hard disallows that you know you don't want indexed. Not only will the links not be indexed, they won't be followed by search engines either, meaning the links and content on the disallowed pages will not be used by the search engines for indexing or for ranking.

So, use the robots.txt file only for disallowing links that you want totally removed from Google. Use the robots meta tag to specify all the allows, and use the rel='nofollow' attribute of the a (link) element when the block is temporary or when you still want the link to be indexed but not followed.

WordPress robots.txt SEO

Here are some robots.txt files used with WordPress on this blog. For instance, I am disallowing /comment-page- links altogether in the robots.txt file below because I don't use separate comment pages, so I instruct Google to remove these links from the index. Adding a 301 redirect with mod_rewrite or RedirectMatch can further protect against this duplicate content issue.

User-agent: *
Allow: /
Disallow: /cgi-bin
Disallow: /wp-admin
Disallow: /wp-includes
Disallow: /wp-content
Disallow: /e/
Disallow: /show-error-*
Disallow: /xmlrpc.php
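Disallow: /trackback/
Disallow: /comment-page-
# a bare prefix only matches at the site root; per-post URLs need a wildcard
Disallow: /*/trackback/
Disallow: /*/comment-page-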
Allow: /wp-content/uploads/

User-agent: Mediapartners-Google
Allow: /

User-agent: Adsbot-Google
Allow: /

User-agent: Googlebot-Image
Allow: /

User-agent: Googlebot-Mobile
Allow: /


# getting sick with the sitemaps
Sitemap: https://www.askapache.com/sitemap.xml
Sitemap: https://www.askapache.com/sitemap_index.xml
Sitemap: https://www.askapache.com/page-sitemap.xml 
Sitemap: https://www.askapache.com/post-sitemap.xml 
Sitemap: https://www.askapache.com/sitemap-news.xml 
Sitemap: https://www.askapache.com/sitemap-posttype-page.xml 
Sitemap: https://www.askapache.com/sitemap-posttype-post.xml 
Sitemap: https://www.askapache.com/sitemap-home.xml 



#               __                          __
#   ____ ______/ /______ _____  ____ ______/ /_  ___
#  / __ `/ ___/ //_/ __ `/ __ \/ __ `/ ___/ __ \/ _ \
# / /_/ (__  ) ,< / /_/ / /_/ / /_/ / /__/ / / /  __/
# \__,_/____/_/|_|\__,_/ .___/\__,_/\___/_/ /_/\___/
#                     /_/
#
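
As a rough illustration of how the rules in a group like the one above interact, here is a minimal Python sketch of the longest-match rule that Google documents for resolving Allow/Disallow conflicts: the longest matching pattern wins, and Allow wins a tie. The is_allowed helper, the rule subset, and the sample paths are hypothetical and written only for this example; real crawlers may differ in the details.

```python
import re

def pattern_to_regex(pattern):
    """Translate a robots.txt path pattern: '*' is a wildcard, a trailing '$' anchors the end."""
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    body = ".*".join(re.escape(part) for part in pattern.split("*"))
    return re.compile(body + ("$" if anchored else ""))

def is_allowed(rules, path):
    """rules is a list of (directive, pattern) pairs for one User-agent group."""
    best_len, allowed = -1, True  # a path that matches no rule is allowed
    for directive, pattern in rules:
        if not pattern:
            continue  # an empty pattern matches nothing
        if pattern_to_regex(pattern).match(path):  # match() anchors at the path start
            is_allow = directive.lower() == "allow"
            # the longest pattern wins; on a tie, Allow wins
            if len(pattern) > best_len or (len(pattern) == best_len and is_allow):
                best_len, allowed = len(pattern), is_allow
    return allowed

# A subset of the "User-agent: *" group (assumed sample, not the full file)
RULES = [
    ("Allow", "/"),
    ("Disallow", "/wp-admin"),
    ("Disallow", "/wp-content"),
    ("Disallow", "/comment-page-"),
    ("Allow", "/wp-content/uploads/"),
]

for path in ["/wp-admin/options.php", "/wp-content/uploads/logo.png",
             "/some-post/", "/some-post/comment-page-2/"]:
    print(path, "->", "allowed" if is_allowed(RULES, path) else "blocked")
```

In this sketch /wp-admin/options.php is blocked because /wp-admin is longer than /, while /wp-content/uploads/logo.png is allowed because the Allow pattern is longer than Disallow: /wp-content. Patterns are matched from the start of the path, so /comment-page- on its own does not catch /some-post/comment-page-2/; a wildcard pattern such as /*/comment-page- does.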

Generic Default robots.txt

For many super-geeky reasons, every single website you control must have a robots.txt file in its root directory, e.g. example.com/robots.txt. I also recommend having a favicon.ico file, at a bare minimum. This ensures your site looks at least somewhat SEO-aware and alerts Google that there are rules for crawling the site. It will also save server resources, since crawlers and browsers request both files constantly and a small static file is usually cheaper to serve than a 404 page.

User-agent: *
Disallow:
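
To see what the generic file above means to a crawler, here is a small sketch using Python's standard urllib.robotparser module. An empty Disallow value blocks nothing, so every path is crawlable. The allows_everything helper, the agent name ExampleBot, and the sample paths are made up for this illustration.

```python
from urllib.robotparser import RobotFileParser

DEFAULT_ROBOTS = """\
User-agent: *
Disallow:
"""

def allows_everything(robots_txt, sample_paths, agent="ExampleBot"):
    """Return True if every sample path may be fetched by the given agent."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())  # parse from text, no network access
    return all(parser.can_fetch(agent, path) for path in sample_paths)

print(allows_everything(DEFAULT_ROBOTS, ["/", "/wp-admin/", "/feed/"]))  # True
```

Changing the second line to "Disallow: /" flips the result to False, which is why the trailing slash matters so much in this file.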

Google Recommendations

Use robots.txt - Webmaster Guidelines

Make use of the robots.txt file on your web server. This file tells crawlers which directories can or cannot be crawled. Make sure it's current for your site so that you don't accidentally block the Googlebot crawler.

Troubleshooting tips part IIb: Ad relevance and targeting continued. To follow up on our previous post about ad relevance and targeting, let's look at some other reasons why you may experience ad targeting issues on your site.

Have you blocked the AdSense crawler's access to your pages?

The AdSense crawler is an automated program that scans your web pages and tracks content for indexing. Sometimes we don't crawl pages because the AdSense crawler doesn't have access to your pages, in which case we're unable to determine their content and show relevant ads. Here are a few specific instances when our crawler can't access a site: If you use a robots.txt file which regulates the crawler access to your page. In this case, you can grant the AdSense crawler access by adding these lines to the top of your robots.txt file:

User-agent: Mediapartners-Google*
Disallow:
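
The effect of giving the AdSense crawler its own group can be sketched with Python's standard urllib.robotparser. This is only an approximation of how a crawler picks its group: the /private/ path and the SomeOtherBot agent are hypothetical, and the trailing * from the quoted snippet is left off here on the assumption that this parser compares agent names literally.

```python
from urllib.robotparser import RobotFileParser

ROBOTS = """\
User-agent: Mediapartners-Google
Disallow:

User-agent: *
Disallow: /private/
"""

def can_crawl(agent, path, robots_txt=ROBOTS):
    """Return True if the agent may fetch the path under the given robots.txt text."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, path)

# The AdSense crawler uses its own group, which blocks nothing
print(can_crawl("Mediapartners-Google", "/private/page.html"))
# Any other crawler falls back to the "*" group
print(can_crawl("SomeOtherBot", "/private/page.html"))
```

A crawler that finds a group naming it follows that group and ignores the generic * group, so the empty Disallow opens the whole site to the AdSense crawler without loosening the rules for anyone else.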

Eliminate Duplicate Content

Duplicate content generally refers to substantive blocks of content within or across domains that either completely match other content or are appreciably similar. Mostly, this is not deceptive in origin. Examples of non-malicious duplicate content could include:

Discussion forums that can generate both regular and stripped-down pages targeted at mobile devices

Store items shown or linked via multiple distinct URLs

Printer-only versions of web pages

However, in some cases, content is deliberately duplicated across domains in an attempt to manipulate search engine rankings or win more traffic. Deceptive practices like this can result in a poor user experience, when a visitor sees substantially the same content repeated within a set of search results.

Google tries hard to index and show pages with distinct information. This filtering means, for instance, that if your site has a "regular" and "printer" version of each article, and neither of these is blocked in robots.txt or with a noindex meta tag, we'll choose one of them to list. In the rare cases in which Google perceives that duplicate content may be shown with intent to manipulate our rankings and deceive our users, we'll also make appropriate adjustments in the indexing and ranking of the sites involved. As a result, the ranking of the site may suffer, or the site might be removed entirely from the Google index, in which case it will no longer appear in search results.

Prevent page from being indexed

Pages you block with robots.txt may still be added to the Google index if other sites link to them. As a result, the URL of the page and, potentially, other publicly available information can appear in Google search results. However, no content from your pages will be crawled, indexed, or displayed.

To entirely prevent a page from being added to the Google index even if other sites link to it, use a noindex meta tag, and ensure that the page does not appear in robots.txt. When Googlebot crawls the page, it will recognize the noindex meta tag and drop the URL from the index.
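
A crawler can only honor a noindex tag on a page it is allowed to fetch, which is why the page must not be blocked in robots.txt. As a rough sketch of the detection step, the following uses Python's standard html.parser to look for a robots (or crawler-specific) meta tag. The RobotsMetaParser class and is_noindex helper are hypothetical names for this example, not how Googlebot is actually implemented.

```python
from html.parser import HTMLParser

class RobotsMetaParser(HTMLParser):
    """Collect directives from <meta name="robots"> and <meta name="googlebot"> tags."""

    def __init__(self):
        super().__init__()
        self.directives = {}  # meta name -> set of lowercase directives

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = dict(attrs)
        name = (attr.get("name") or "").lower()
        if name in ("robots", "googlebot"):
            content = attr.get("content") or ""
            tokens = {t.strip().lower() for t in content.split(",") if t.strip()}
            self.directives.setdefault(name, set()).update(tokens)

def is_noindex(html, crawler="googlebot"):
    """True if the generic robots tag or the crawler's own tag asks not to be indexed."""
    parser = RobotsMetaParser()
    parser.feed(html)
    found = parser.directives.get("robots", set()) | parser.directives.get(crawler, set())
    return "noindex" in found or "none" in found  # "none" means noindex,nofollow

page = '<html><head><meta name="robots" content="noindex,follow"></head></html>'
print(is_noindex(page))  # True
```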

Prevent content being indexed or remove content from Google's index?

You can instruct us not to include content from your site in our index or to remove content from your site that is currently in our index in the following ways:

Remove your entire website or part of your website using a robots.txt file.

Remove individual pages of your website using a robots meta tag.

Remove cached copies of your pages using a robots meta tag.

Remove snippets that appear below your page's title in our search results and describe the content of your page.

Remove outdated pages by returning the proper server response.

Remove images from Google Image Search using a robots.txt file.

Remove blog entries from Google Blog Search.

Remove a feed from our user-agent Feedfetcher, which provides content to our feed readers.

Remove transcoded versions of your pages (pages we've reformatted for mobile browsers).

Google User-agents

Adsbot-Google: crawls pages to measure AdWords landing page quality
Googlebot: crawls pages for Google's web and news indexes
Googlebot-Image: crawls pages for the image index
Googlebot-Mobile: crawls pages for the mobile index
Mediapartners-Google: crawls pages to determine AdSense content

Robots Meta Tags and Examples

The robots meta tag is very helpful and should be preferred over modifications to robots.txt. See: Using the robots meta tag.

Stop all robots from indexing a page on your site, but still follow the links on the page

<meta name="robots" content="noindex,follow">

Allow other robots to index the page on your site, preventing only Google's bots from indexing the page

<meta name="googlebot" content="noindex">

Allow robots to index the page on your site but not to follow outgoing links

<meta name="robots" content="nofollow">

header.php Trick for Conditional Robots Meta

Note: I recommend using the Yoast WordPress SEO Plugin to do this now, but here's a quick and easy way to think about it. Add this to your header.php:

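<?php // The meta tags were lost from this snippet; the content values below are assumptions, adjust to taste. ?>
<?php if(is_single() || is_page() || is_category() || is_home()) { ?>
	<meta name="robots" content="index,follow" />
<?php } elseif(is_archive()) { // elseif, because is_archive() is also true on category pages ?>
	<meta name="robots" content="noindex,follow" />
<?php } elseif(is_search() || is_404()) { ?>
	<meta name="robots" content="noindex,follow" />
<?php } ?>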

Robots.txt footnote

Alexa, Compete, and Quantcast are all guilty of firewalling unknown friendly search engine agents at the front gate. These sites that monitor the Internet should be the most in the know that unfriendly agents cloak as humans and will come in no matter what. So the general rule of thumb is that robots.txt directives are only for the good agents anyway.

Good Robots.txt Articles

How Google Crawls My Site
Controlling how search engines access and index your website
Controlling Access with robots.txt
Removing duplicate search engine content using robots.txt - Mark Wilson
Revisiting robots.txt - Twenty Steps


SEO, Google, meta robots, robots.txt. 15 Mar, 2008
