Web Robots
Search Robots
Sources:
- The Web Robots Pages www.robotstxt.org/wc/robots.html, which include (among others):
  - Web Robots FAQ www.robotstxt.org/wc/faq.html
- www.robotstxt.org - the main source for information on the robots.txt, Robots Exclusion Standard and other articles about writing well-behaved Web robots.
HTML Documents
Uncomment / add:
<head>
...
<meta name="robots" content="NOINDEX, NOFOLLOW">
<META name="ROBOTS" content="NONE"> <!-- same as above -->
...
</head>
More details, from www.w3.org/Search/9605-Indexing-Workshop/ReportOutcomes/Spidering.txt :
1. ROBOTS meta-tag
<META name="ROBOTS"
content="ALL | NONE | NOINDEX | NOFOLLOW">
default = empty = "ALL"
"NONE" = "NOINDEX, NOFOLLOW"
The filler is a comma separated list of terms:
ALL, NONE, INDEX, NOINDEX, FOLLOW, NOFOLLOW.
Discussion: This tag is meant to provide users who cannot control
the robots.txt file at their sites. It provides a last chance to
keep their content out of search services. It was decided not to
add syntax to allow robot specific permissions within the meta-tag.
INDEX means that robots are welcome to include this page in
search services.
FOLLOW means that robots are welcome to follow links from this
page to find other pages.
So a value of "NOINDEX" allows the subsidiary links to be explored,
even though the page is not indexed. A value of "NOFOLLOW" allows the
page to be indexed, but no links from the page are explored (this may
be useful if the page is a free entry point into pay-per-view content,
for example). A value of "NONE" tells the robot to ignore the page.
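The rules quoted above can be sketched in Python as follows. This is only an illustration, not part of the quoted text: the names RobotsMetaParser and robots_policy are made up for this sketch, and letting the restrictive term win when contradictory terms appear (e.g. "INDEX, NOINDEX") is an assumption, since the quoted text does not cover that case.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collect the terms of every <meta name="robots"> tag in a page."""

    def __init__(self):
        super().__init__()
        self.terms = set()

    def handle_starttag(self, tag, attrs):
        # HTMLParser already lowercases tag and attribute names.
        if tag != "meta":
            return
        attr = dict(attrs)
        if (attr.get("name") or "").lower() != "robots":
            return
        content = attr.get("content") or ""
        # The value is a comma-separated list of terms; compare case-insensitively.
        self.terms.update(t.strip().upper() for t in content.split(",") if t.strip())


def robots_policy(html):
    """Return (may_index, may_follow) for an HTML document (hypothetical helper)."""
    parser = RobotsMetaParser()
    parser.feed(html)
    terms = parser.terms
    # Default = empty = "ALL": indexing and link-following are both allowed.
    may_index = True
    may_follow = True
    # "NONE" is shorthand for "NOINDEX, NOFOLLOW".
    # Assumption: a restrictive term wins over a contradictory permissive one.
    if "NONE" in terms or "NOINDEX" in terms:
        may_index = False
    if "NONE" in terms or "NOFOLLOW" in terms:
        may_follow = False
    return may_index, may_follow


print(robots_policy('<head><META name="ROBOTS" content="NONE"></head>'))
print(robots_policy('<head><meta name="robots" content="NOFOLLOW"></head>'))
```

A robot would call robots_policy on each fetched page before deciding whether to index it and whether to queue its links.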
/robots.txt File
A plain text file placed at the root of the web site, i.e. in the web document root (not the server's filesystem root or any other directory), so that it is served as /robots.txt.
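As a small illustration of where a robot looks for the file, the following Python sketch derives the robots.txt URL from any page URL on a site. The function name robots_txt_url and the example.com address are made up for this sketch.

```python
from urllib.parse import urlsplit, urlunsplit


def robots_txt_url(page_url):
    """Return the /robots.txt URL for the site that serves page_url."""
    parts = urlsplit(page_url)
    # Keep scheme and host (including any port); replace the path and
    # drop query string and fragment.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


# A page deep inside a user directory still maps to the site-wide file.
print(robots_txt_url("http://www.example.com/~user/docs/page.html?x=1"))
```

This is also why users who only control a subdirectory cannot supply their own robots.txt and have to fall back on the ROBOTS meta tag.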
Simplest example, disallow all actions by robots:
User-agent: *
Disallow: /
A more useful example, which treats individual robots differently and keeps all other robots out of certain areas:
# /robots.txt file for my own site
# comments start with '#'

User-agent: webcrawler
Disallow:            # empty value = all URLs can be retrieved

User-agent: lycra
Disallow: /          # disallows the whole site

User-agent: *
Disallow: /tmp
Disallow: /logs
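To check how such a file is interpreted, one option is Python's standard urllib.robotparser module. The sketch below feeds it the rules from the example and asks what each robot may fetch. The agent name "otherbot" and the helper may_fetch are made up for illustration, and other robots may interpret edge cases differently from this module.

```python
from urllib.robotparser import RobotFileParser

# The example file from above, embedded as a string for testing.
ROBOTS_TXT = """\
# /robots.txt file for my own site
User-agent: webcrawler
Disallow:

User-agent: lycra
Disallow: /

User-agent: *
Disallow: /tmp
Disallow: /logs
"""

rules = RobotFileParser()
# parse() takes the file content as a list of lines; no network access needed.
rules.parse(ROBOTS_TXT.splitlines())


def may_fetch(agent, path):
    """Return True if the given robot may retrieve the given path."""
    return rules.can_fetch(agent, path)


print(may_fetch("webcrawler", "/tmp/file.html"))  # empty Disallow: allowed
print(may_fetch("lycra", "/index.html"))          # whole site disallowed
print(may_fetch("otherbot", "/logs/access.log"))  # falls under the '*' record
print(may_fetch("otherbot", "/index.html"))       # no matching rule: allowed
```

Disallow values are matched as path prefixes, so "Disallow: /tmp" covers /tmp/file.html as well as the directory itself.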
More information can be found in, e.g., A Standard for Robot Exclusion.
