Controlling Robots Access to Your Files

Written by Suresh Kalyanasundaram

18/04/2016

While we all want our sites spidered to the fullest, there are many areas of your site that you will not want robots to visit. These may be personal files, email files, download files, your CGI or Java directories, etc.

The robots.txt file will not only keep well-behaved spiders away from these files, and so help keep them out of search results, but will also save the spiders time in crawling your site, ensuring that during their limited stay they are collecting the data you want them to see.

There are two basic methods of controlling access:

The Robots.txt File

When a robot visits your site, one of the first things it does is check for a robots.txt file in your root directory. If it finds this file and the file is properly constructed, a well-behaved robot will then crawl your site in accordance with it.
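As a rough illustration of that first step, the sketch below shows how a crawler might work out where to look for robots.txt given any page address on a site. The function name `robots_txt_url` and the example.com address are made up for this example; real crawlers may handle redirects and other edge cases differently.

```python
from urllib.parse import urlsplit, urlunsplit


def robots_txt_url(page_url: str) -> str:
    """Return the address where a crawler would look for robots.txt.

    Hypothetical helper for illustration: it keeps the scheme and host
    of the page and replaces everything else with /robots.txt.
    """
    parts = urlsplit(page_url)
    # Drop the original path, query string, and fragment.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


print(robots_txt_url("http://www.example.com/products/list.html?page=2"))
# prints http://www.example.com/robots.txt
```

Whatever page the robot starts from, the file it looks for always sits at the top level of the site.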

The robots.txt file must be a plain text file located in the root directory of your website. It must be constructed according to the rules of the robots exclusion standard and should contain records looking something like this:

User-agent: *
Disallow: /cgi-bin/
Disallow: /tmp/
Disallow: /~joe/

This tells all robots that they are excluded from the /cgi-bin/, /tmp/ and /~joe/ directories. This statement would keep all spiders from spidering the contents of your site:

User-agent: *
Disallow: /

and this statement would keep Googlebot from spidering your email directory:

User-agent: Googlebot
Disallow: /email/
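One way to check how rules like these would be read is to feed them to the robots.txt parser in Python's standard library. The following is a minimal sketch, not a definitive test: the helper `build_parser`, the robot names "AnyBot" and "OtherBot", and the www.example.com addresses are placeholders invented for this example, and individual search engines may interpret edge cases differently.

```python
from urllib.robotparser import RobotFileParser


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text directly, without fetching anything."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


# The three rule sets discussed in this article.
SOME_DIRS = "User-agent: *\nDisallow: /cgi-bin/\nDisallow: /tmp/\nDisallow: /~joe/\n"
WHOLE_SITE = "User-agent: *\nDisallow: /\n"
GOOGLEBOT_EMAIL = "User-agent: Googlebot\nDisallow: /email/\n"

BASE = "http://www.example.com"  # placeholder site

# All robots are kept out of the listed directories only.
rules = build_parser(SOME_DIRS)
print(rules.can_fetch("AnyBot", BASE + "/tmp/draft.html"))   # False
print(rules.can_fetch("AnyBot", BASE + "/index.html"))       # True

# All robots are kept out of the whole site.
rules = build_parser(WHOLE_SITE)
print(rules.can_fetch("AnyBot", BASE + "/index.html"))       # False

# Only Googlebot is kept out of /email/; other robots are unaffected.
rules = build_parser(GOOGLEBOT_EMAIL)
print(rules.can_fetch("Googlebot", BASE + "/email/inbox.html"))  # False
print(rules.can_fetch("OtherBot", BASE + "/email/inbox.html"))   # True
```

Running a check like this before uploading the file is a cheap way to catch a rule that blocks more, or less, than you intended.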

Note that you can name a specific user agent when you want to give certain robots different access from the rest, for example to prepare directories containing pages optimized only for certain engines. Complete rules for constructing the robots.txt file can be found in the robots exclusion standard.

The Robots Meta Tag

The robots meta tag provides some of the functionality of the robots.txt file, but applies only to the page on which it appears, whereas the robots.txt file applies to your entire site. A robots meta tag is written as:

<meta name="robots" content="noindex, nofollow">

This would prevent all well-behaved robots from indexing this page and from analyzing it for links to follow. Note that you can use either or both of the noindex and nofollow parameters in the tag.
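To make the effect of the tag concrete, here is a small sketch of how a robot might read it, using the HTML parser from Python's standard library. The class `RobotsMetaParser` and the function `robots_directives` are names invented for this example, and the sketch only looks at noindex and nofollow; it is a simplification, not how any particular search engine is implemented.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collect the values of any <meta name="robots"> tag on a page."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = dict(attrs)
        if (attr.get("name") or "").lower() == "robots":
            content = attr.get("content") or ""
            # Split "noindex, nofollow" into separate lowercase values.
            for part in content.split(","):
                if part.strip():
                    self.directives.add(part.strip().lower())


def robots_directives(html: str) -> set:
    """Return the set of robots meta values found in the HTML text."""
    parser = RobotsMetaParser()
    parser.feed(html)
    parser.close()
    return parser.directives


page = '<html><head><meta name="robots" content="noindex, nofollow"></head></html>'
found = robots_directives(page)
may_index = "noindex" not in found    # with no tag, indexing is allowed
may_follow = "nofollow" not in found  # with no tag, following links is allowed
print(may_index, may_follow)
# prints False False
```

A page with no robots meta tag yields an empty set, so both checks come out True.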

It is not necessary to use a robots meta tag to instruct spiders to index a page, since by default spiders will index the pages they find and follow all the links on them.

Be careful with the robots.txt file

 
