Prohibiting Indexing of Pages and Directories Through robots.txt
When a search robot visits a website, the first thing it looks for is the robots.txt file. This text file is located in the site’s root directory, alongside the main index file. For the main site/domain, this folder is called public_html. The robots.txt file contains direct instructions for search robots.
These instructions can prohibit indexing of a folder or a page and point the robot to the main website mirror. They can also ask the robot to observe a specific time interval between requests while indexing the site, and more.
If there is no robots.txt file in the website's root directory, you can create it. To disable site indexing with robots.txt, two directives are used: User-agent and Disallow.
- User-agent: SPECIFY_SEARCH_BOT
- Disallow: / # prohibits indexing of the entire website
- Disallow: /page/ # prohibits indexing of the separate /page/ path only
For example:
To prevent your website from being indexed by MSNBot:
User-agent: MSNBot
Disallow: /
To prevent your website from being indexed by the Yahoo bot:
User-agent: Slurp
Disallow: /
To prevent your website from being indexed by the Yandex bot:
User-agent: Yandex
Disallow: /
To prevent your website from being indexed by Googlebot:
User-agent: Googlebot
Disallow: /
To prevent your website from being indexed by all search engines:
User-agent: *
Disallow: /
To disable indexing of the cgi-bin and images folders for every search engine:
User-agent: *
Disallow: /cgi-bin/
Disallow: /images/
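One way to sanity-check rules like these before uploading them is Python's standard urllib.robotparser module. The sketch below is only an illustration: the example.com URLs and the is_allowed helper are hypothetical names chosen for this example, and Python's parser may differ in details from how a particular search engine reads the file.

```python
from urllib.robotparser import RobotFileParser

# Rules from the last example: block two folders for every bot.
RULES = """\
User-agent: *
Disallow: /cgi-bin/
Disallow: /images/
"""


def is_allowed(rules: str, user_agent: str, url: str) -> bool:
    """Return True if user_agent may fetch url under the given robots.txt text."""
    parser = RobotFileParser()
    parser.parse(rules.splitlines())
    return parser.can_fetch(user_agent, url)


# example.com is a placeholder domain (an assumption, not part of the article).
print(is_allowed(RULES, "Googlebot", "https://example.com/images/logo.png"))
print(is_allowed(RULES, "Googlebot", "https://example.com/about.html"))
```

Under these rules the first check should come back False (the URL is inside /images/) and the second True.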
To allow search engines to index every page of the website, use the instruction below. An empty robots.txt file is equivalent to it.
User-agent: *
Disallow:
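The equivalence between an empty file and an empty Disallow value can be checked with Python's standard urllib.robotparser. This is a rough sketch, not a statement about any specific search engine: the can_fetch_page helper, the "AnyBot" name, and the example.com URL are made up for the illustration.

```python
from urllib.robotparser import RobotFileParser

EMPTY = ""                                   # an empty robots.txt file
ALLOW_ALL = "User-agent: *\nDisallow:\n"     # empty Disallow value
BLOCK_ALL = "User-agent: *\nDisallow: /\n"   # for contrast


def can_fetch_page(rules: str, url: str = "https://example.com/any/page.html") -> bool:
    """Parse robots.txt text and ask whether a hypothetical bot may fetch url."""
    parser = RobotFileParser()
    parser.parse(rules.splitlines())
    # "AnyBot" is a placeholder user agent used only for this check.
    return parser.can_fetch("AnyBot", url)


for name, rules in (("empty", EMPTY), ("allow all", ALLOW_ALL), ("block all", BLOCK_ALL)):
    print(name, can_fetch_page(rules))
```

The first two rule sets should both permit the page, while the third refuses it.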
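For example:
Use the following lines to allow only the Yandex, Google, and Rambler bots to index the website, asking each of them to wait 4 seconds between page requests. Crawl-delay is a non-standard directive, so not every bot honors it; Googlebot, for example, ignores it.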
User-agent: *
Disallow: /

User-agent: Yandex
Crawl-delay: 4
Disallow:

User-agent: Googlebot
Crawl-delay: 4
Disallow:

User-agent: StackRambler
Crawl-delay: 4
Disallow:
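To see how a parser resolves these groups, the rules can be loaded into Python's standard urllib.robotparser (the crawl_delay method needs Python 3.6 or later). Treat this as a sketch under assumptions: load_rules is a hypothetical helper, example.com is a placeholder domain, and MSNBot is used simply as a bot that is not on the allowed list. Real crawlers may interpret the file differently.

```python
from urllib.robotparser import RobotFileParser

RULES = """\
User-agent: *
Disallow: /

User-agent: Yandex
Crawl-delay: 4
Disallow:

User-agent: Googlebot
Crawl-delay: 4
Disallow:

User-agent: StackRambler
Crawl-delay: 4
Disallow:
"""


def load_rules(rules: str) -> RobotFileParser:
    """Build a parser from robots.txt text without any network access."""
    parser = RobotFileParser()
    parser.parse(rules.splitlines())
    return parser


parser = load_rules(RULES)
PAGE = "https://example.com/index.html"  # placeholder URL

# For each bot, show whether it may fetch the page and its requested delay.
for bot in ("Yandex", "Googlebot", "StackRambler", "MSNBot"):
    print(bot, parser.can_fetch(bot, PAGE), parser.crawl_delay(bot))
```

The three listed bots should be reported as allowed with a delay of 4, while MSNBot falls back to the catch-all group, is refused, and has no delay set.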