Default robots.txt File For Web-Server

Published on September 23, 2012 · Last updated September 23, 2013


How do I create a default robots.txt file for the Apache web server running on a Linux/Unix/MS-Windows server?

Tutorial details
Difficulty: Easy
Root privileges: No
Requirements: None
Estimated completion time: 5m

Web spiders (also known as robots or crawlers) are programs, typically run by search engines, that "crawl" across the Internet and index pages on web servers. The robots.txt file helps webmasters or site owners ask web crawlers not to access all or part of a website. Site owners use the robots.txt file to give instructions about their site to web robots using the Robots Exclusion Protocol.

robots.txt File Syntax and Rules

The robots.txt file uses basic rules as follows:

  1. User-agent: The robot that the following rules apply to.
  2. Disallow: The URL path you want to block.
  3. Allow: The URL path you want to allow.
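
The three directives can be tried out locally before you publish a file. The following is a minimal sketch in Python that feeds a small rule set to the standard library's urllib.robotparser module and asks whether a robot may fetch a path. The rule set, the robot name ExampleBot, the paths, and the helper may_fetch are made up for illustration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rule set: a single entry that applies to every robot.
RULES = """\
User-agent: *
Allow: /public/
Disallow: /private/
"""

def may_fetch(rules, robot, path):
    """Return True if `robot` may fetch `path` under the given rules."""
    parser = RobotFileParser()
    parser.parse(rules.splitlines())
    return parser.can_fetch(robot, path)

for path in ("/public/index.html", "/private/report.html", "/about.html"):
    print(path, may_fetch(RULES, "ExampleBot", path))
```

A path that matches no rule at all, such as /about.html here, is treated as allowed.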

Examples: The default robots.txt

To block all robots from the entire server, create or upload a robots.txt file as follows:

 
User-agent: *
Disallow: /
 

The two lines above are considered a single entry in the file. To allow all robots complete access to the entire server, create or upload a robots.txt file as follows:

 
User-agent: *
Disallow:
 

OR

 
User-agent: *
Allow: /
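
The difference between blocking everything and allowing everything comes down to whether the Disallow value is / or empty. As a quick check, here is a small Python sketch using the standard urllib.robotparser module; the robot name ExampleBot and the helper allowed are assumptions for the example.

```python
from urllib.robotparser import RobotFileParser

BLOCK_ALL = ["User-agent: *", "Disallow: /"]
ALLOW_ALL = ["User-agent: *", "Disallow:"]

def allowed(lines, robot, path):
    """Parse robots.txt lines and report whether robot may fetch path."""
    parser = RobotFileParser()
    parser.parse(lines)
    return parser.can_fetch(robot, path)

print(allowed(BLOCK_ALL, "ExampleBot", "/index.html"))  # False
print(allowed(ALLOW_ALL, "ExampleBot", "/index.html"))  # True
```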
 

Please note that User-agent: * means "match any robot". You can include as many entries as you want, and each entry can contain multiple User-agent lines and multiple Disallow or Allow lines. The following example tells robots to stay away from the /foo/bar.php file:

 
User-agent: *
Disallow: /foo/bar.php
 

In this example, you instruct all robots not to enter the /cgi-bin/ and /print/ directories:

 
User-agent: *
Disallow: /cgi-bin/
Disallow: /print/
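
Disallow values are matched as path prefixes, so the trailing slash matters. The Python sketch below (standard library only; ExampleBot and the test paths are hypothetical) shows that /print/ blocks everything under that directory but not a page that merely starts with the same letters.

```python
from urllib.robotparser import RobotFileParser

parser = RobotFileParser()
parser.parse([
    "User-agent: *",
    "Disallow: /cgi-bin/",
    "Disallow: /print/",
])

paths = ("/cgi-bin/form.cgi", "/print/page1.html", "/printing.html", "/index.html")
# Map every test path to True (may be fetched) or False (blocked).
results = {path: parser.can_fetch("ExampleBot", path) for path in paths}

for path, ok in results.items():
    print(path, "allowed" if ok else "blocked")
```

Here /printing.html stays reachable because it does not begin with /print/.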
 

This example tells a specific robot called fooBar to stay away from your website. fooBar is a placeholder here; replace it with the actual user-agent name of the bot you want to block:

 
User-agent: fooBar
Disallow: /
 

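To block files of a specific type, say all *.png image files, use the following syntax for Googlebot. The * and $ wildcards are an extension to the original protocol that Googlebot and other major crawlers support, so not every robot understands them: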

 
User-agent: Googlebot
Disallow: /*.png$
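
To see what such a pattern covers, here is a rough Python sketch of the wildcard semantics: * stands for any run of characters and a trailing $ pins the rule to the end of the URL. This is a hand-written illustration, not the code any real crawler uses; the function names rule_to_regex and matches are hypothetical.

```python
import re

def rule_to_regex(rule):
    """Translate a robots.txt path rule with * and $ into a compiled regex."""
    anchored = rule.endswith("$")
    if anchored:
        rule = rule[:-1]
    # Escape every literal piece, then let each * match any run of characters.
    body = ".*".join(re.escape(piece) for piece in rule.split("*"))
    # Without a trailing $, the rule only has to match a prefix of the path.
    return re.compile(body + (r"\Z" if anchored else ""))

def matches(rule, path):
    """Return True if the rule applies to the given URL path."""
    return rule_to_regex(rule).match(path) is not None

for path in ("/images/logo.png", "/images/logo.png?v=2", "/images/logo.jpg"):
    print(path, matches("/*.png$", path))
```

Because of the $, a URL such as /images/logo.png?v=2 is not covered by the rule; dropping the $ would turn it into a prefix match that covers it.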
 

The following example disallows a Robot named "fooBar" from the paths "/cgi-bin/" and "/pdfs/":

 
# Tell "fooBar" where it can't go
User-agent: fooBar
Disallow: /cgi-bin/
Disallow: /pdfs/
 
# Allow all other robots to browse everywhere
User-agent: *
Disallow:
 

In this example, I am allowing only a web spider named Googlebot into the site, while denying all other spiders:

 
# Allow "googlebot" in the site
User-agent: Googlebot
Disallow:
 
# Deny all other spiders
User-agent: *
Disallow: /
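
A robot uses the entry that names it and falls back to the * entry only when no named entry matches. The following Python sketch checks this with the standard urllib.robotparser module; ExampleBot is a made-up name standing in for "any other spider".

```python
from urllib.robotparser import RobotFileParser

RULES = """\
# Allow "googlebot" in the site
User-agent: Googlebot
Disallow:

# Deny all other spiders
User-agent: *
Disallow: /
"""

parser = RobotFileParser()
parser.parse(RULES.splitlines())

for robot in ("Googlebot", "ExampleBot"):
    # Googlebot matches its own entry; ExampleBot falls through to "*".
    print(robot, parser.can_fetch(robot, "/index.html"))
```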
 

How do I create a robots.txt file on my server?

Please note that a robots.txt file is a plain text file that must be located in your web server's document root, so that it is served as /robots.txt. Web robots are not required to respect robots.txt files, but most well-written web spiders follow the rules you define. You can create robots.txt on your own system and upload it using an FTP client.

You can also log in to your server using the ssh command and use a text editor such as vi to create a robots.txt file. In this example, I am logging in to a server called server1.cyberciti.biz from an OS X or Linux/Unix based desktop system and creating the file in the /var/www/html directory. MS-Windows users can try the PuTTY ssh client:
ssh nixcraft@server1.cyberciti.biz
cd /var/www/html
vi robots.txt
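
If you would rather script this step than type the file in an editor, the following minimal Python sketch writes an allow-all robots.txt into a document root. It is only an illustration: the names write_robots_txt and DEFAULT_RULES are hypothetical, and the demo writes into a temporary directory standing in for /var/www/html so that it can be run safely anywhere.

```python
import tempfile
from pathlib import Path

# Default rules: let every robot crawl the whole site.
DEFAULT_RULES = "User-agent: *\nDisallow:\n"

def write_robots_txt(docroot, rules=DEFAULT_RULES):
    """Write robots.txt into the document root and return its path."""
    target = Path(docroot) / "robots.txt"
    target.write_text(rules, encoding="utf-8")
    return target

# Demo: a temporary directory plays the role of the document root.
with tempfile.TemporaryDirectory() as docroot:
    path = write_robots_txt(docroot)
    print(path.name)
    print(path.read_text(encoding="utf-8"))
```

On a real server you would pass your own document root instead, for example /var/www/html as used in this tutorial.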

Sample robots.txt file

Sample robots.txt file from cyberciti.biz:

#Allow Google Media Partners bot
User-agent: Mediapartners-Google
Disallow:
 
#Block the bad bots
User-agent: ia_archiver
Disallow: /
 
User-agent: VoilaBot
Disallow: /
 
User-agent: Baiduspider
Disallow: /
 
User-agent: MJ12bot
Disallow: /
 
User-agent: BecomeJPBot
Disallow: /
 
User-agent: Exabot
Disallow: /
 
User-agent: 008
Disallow: /
 
User-agent: Sosospider
Disallow: /
 
#Block specific urls and directories for all bots
User-agent: *
Disallow: /low.html
Disallow: /lib/
Disallow: /rd/
Disallow: /tools/
Disallow: /tmp/
Disallow: /*?
Disallow: /view/pdf/faq/*.php
Disallow: /view/pdf/tips/*.php
Disallow: /view/pdf/cms/*.php
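
A longer file like this is easier to review when it is broken down into entries. Below is a small Python sketch that reads a shortened excerpt of the sample and lists the robots that are blocked from the whole site. The parser is a simplified illustration written for this example (the name parse_groups is hypothetical), not a full implementation of the protocol.

```python
# Shortened excerpt of the sample file.
SAMPLE = """\
#Allow Google Media Partners bot
User-agent: Mediapartners-Google
Disallow:

#Block the bad bots
User-agent: ia_archiver
Disallow: /

User-agent: Baiduspider
Disallow: /

#Block specific urls and directories for all bots
User-agent: *
Disallow: /low.html
Disallow: /tmp/
"""

def parse_groups(text):
    """Map each user-agent to its list of (directive, value) rules."""
    groups = {}
    agents = []            # user-agents of the entry currently being read
    reading_rules = False  # True once the entry has at least one rule
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments
        if ":" not in line:
            continue
        field, value = (part.strip() for part in line.split(":", 1))
        field = field.lower()
        if field == "user-agent":
            if reading_rules:  # a User-agent line after rules starts a new entry
                agents, reading_rules = [], False
            agents.append(value)
            groups.setdefault(value, [])
        elif field in ("allow", "disallow"):
            reading_rules = True
            for agent in agents:
                groups[agent].append((field, value))
    return groups

groups = parse_groups(SAMPLE)
blocked = [agent for agent, rules in groups.items() if ("disallow", "/") in rules]
print(blocked)  # ['ia_archiver', 'Baiduspider']
```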
 

1 Sriharsha April 9, 2014 at 7:17 am

Hi friends, I just tried to create robots.txt, but I didn't find the html directory under /var/www/. Can we create this file in any other location? Can you send me the minimum steps so that I can understand easily, please?
