Web spiders (also known as robots or crawlers) are programs that search engines use to “crawl” across the Internet and index pages on web servers. The robots.txt file helps webmasters and site owners prevent web crawlers from accessing all or part of a website. Site owners use the robots.txt file to give instructions about their site to web robots using the Robots Exclusion Protocol.
robots.txt File Syntax and Rules
The robots.txt file uses basic rules as follows:
- User-agent: The robot that the following rules apply to.
- Disallow: The URL path you want to block.
- Allow: The URL path you want to allow.
Examples: The default robots.txt
To block all robots from the entire server, create or upload a robots.txt file as follows:
User-agent: *
Disallow: /
The two lines above are considered a single entry in the file. To allow all robots complete access to the entire server, create or upload a robots.txt file as follows:
User-agent: *
Disallow:
Alternatively, for robots that support the Allow directive:
User-agent: *
Allow: /
Please note that User-agent: * means match “any robot”. You can include as many entries as you want. You can include multiple Disallow or Allow lines and multiple user-agents in one entry. The following example tells robots to stay away from the /foo/bar.php file:
User-agent: *
Disallow: /foo/bar.php
In this example, you instruct all robots not to enter the /cgi-bin/ and /print/ directories:
User-agent: *
Disallow: /cgi-bin/
Disallow: /print/
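To see how a robot might interpret these directory rules, here is a minimal sketch using Python's standard urllib.robotparser module. The robot name AnyBot and the test paths are made up for illustration, and Python's parser is only an approximation of how any particular crawler behaves.

```python
from urllib.robotparser import RobotFileParser

# The rules from the example above, one directive per line.
RULES = """\
User-agent: *
Disallow: /cgi-bin/
Disallow: /print/
"""

parser = RobotFileParser()
parser.parse(RULES.splitlines())

# "AnyBot" is a hypothetical robot name; the paths are illustrative only.
for path in ("/cgi-bin/search.cgi", "/print/page.html",
             "/printers.html", "/index.html"):
    verdict = "allowed" if parser.can_fetch("AnyBot", path) else "blocked"
    print(path, verdict)
```

With this parser, the first two paths are blocked, while /printers.html and /index.html are allowed: a rule is matched as a path prefix, so the trailing slash in /print/ keeps the rule from catching /printers.html.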
This example tells a specific robot called fooBar to stay away from your website. Here fooBar stands in for the user-agent name of the bot, so replace ‘fooBar’ with the actual user-agent of the bot you want to block:
User-agent: fooBar
Disallow: /
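To block files of a specific type, say all *.png image files, use the following syntax for Googlebot. The * wildcard and the $ end-of-URL anchor are pattern-matching extensions that not every robot understands:
User-agent: Googlebot
Disallow: /*.png$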
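The pattern above can be read as "any path that ends in .png". The following sketch shows one way such a pattern could be evaluated by translating it into a regular expression. The function name rule_matches is hypothetical, and this is an illustration of the wildcard semantics rather than the matching code of any real crawler.

```python
import re

def rule_matches(pattern: str, path: str) -> bool:
    """Return True if a wildcard-style robots.txt pattern matches the URL path."""
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Escape each literal piece, then join the pieces with ".*" for every "*".
    regex = ".*".join(re.escape(piece) for piece in pattern.split("*"))
    if anchored:
        regex += "$"
    # re.match anchors at the start, so an unanchored rule acts as a prefix match.
    return re.match(regex, path) is not None

print(rule_matches("/*.png$", "/images/logo.png"))      # True
print(rule_matches("/*.png$", "/images/logo.png?v=2"))  # False
print(rule_matches("/*.png$", "/images/logo.jpg"))      # False
```

Note that the trailing $ matters: without it, the rule would also match /images/logo.png?v=2, because the pattern would only need to match the beginning of the path.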
The following example disallows a Robot named “fooBar” from the paths “/cgi-bin/” and “/pdfs/”:
# Tell "fooBar" where it can't go
User-agent: fooBar
Disallow: /cgi-bin/
Disallow: /pdfs/

# Allow all other robots to browse everywhere
User-agent: *
Disallow:
In this example, I am only allowing a Web Spider named “googlebot” into a site, while denying all other Spiders:
# Allow "googlebot" in the site
User-agent: Googlebot
Disallow:

# Deny all other spiders
User-agent: *
Disallow: /
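To check that per-robot entries like these select the intended robot, the rules can be fed to Python's standard urllib.robotparser. This is a rough sketch: the helper name is_allowed and the path /index.html are assumptions for illustration, and real spiders may resolve user-agent names differently.

```python
from urllib.robotparser import RobotFileParser

# The "Googlebot only" rules from the example above.
RULES = """\
# Allow "googlebot" in the site
User-agent: Googlebot
Disallow:

# Deny all other spiders
User-agent: *
Disallow: /
"""

def is_allowed(agent: str, path: str) -> bool:
    """Hypothetical helper: ask the parsed rules whether agent may fetch path."""
    parser = RobotFileParser()
    parser.parse(RULES.splitlines())
    return parser.can_fetch(agent, path)

print(is_allowed("Googlebot", "/index.html"))  # True
print(is_allowed("fooBar", "/index.html"))     # False
```

The entry that names a robot explicitly is used for that robot, and the User-agent: * entry acts as the fallback for every other robot.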
How do I create a robots.txt file on my server?
Please note that robots.txt is a special text file that must always be located in your web server’s document root, so that it is reachable as /robots.txt. Web robots are not required to respect robots.txt files, but most well-written web spiders follow the rules you define. You can create robots.txt on your own system and upload it using an FTP client.
You can also log in to your server using the ssh command and use a text editor such as vi to create the robots.txt file. In this example, I log in to a server called server1.cyberciti.biz and create the file in the /var/www/html directory from an OS X or Linux/Unix based desktop system. MS-Windows users can try the PuTTY ssh client:
ssh [email protected]
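Because robots look for the file at the root of the web server, the robots.txt address for any page can be derived from the page URL alone. The sketch below illustrates this; the function name robots_url and the example.com address are placeholders, not part of the setup described above.

```python
from urllib.parse import urlsplit, urlunsplit

def robots_url(page_url: str) -> str:
    """Return the robots.txt URL for the site that serves page_url."""
    parts = urlsplit(page_url)
    # Keep the scheme and host, replace the path, and drop query and fragment.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))

print(robots_url("https://www.example.com/blog/post.html?id=7"))
# https://www.example.com/robots.txt
```

Once the file is uploaded, opening the resulting address in a browser is a quick way to confirm that it is being served from the right place.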
Sample robots.txt file
#Allow Google Media Partners bot
User-agent: Mediapartners-Google
Disallow:

#Block the bad bots
User-agent: ia_archiver
Disallow: /

User-agent: VoilaBot
Disallow: /

User-agent: Baiduspider
Disallow: /

User-agent: MJ12bot
Disallow: /

User-agent: BecomeJPBot
Disallow: /

User-agent: Exabot
Disallow: /

User-agent: 008
Disallow: /

User-agent: Sosospider
Disallow: /

#Block specific urls and directories for all bots
User-agent: *
Disallow: /low.html
Disallow: /lib/
Disallow: /rd/
Disallow: /tools/
Disallow: /tmp/
Disallow: /*?
Disallow: /view/pdf/faq/*.php
Disallow: /view/pdf/tips/*.php
Disallow: /view/pdf/cms/*.php
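A file with this shape is repetitive enough that it can be generated from two lists. The following is a minimal sketch of that idea; the function name build_robots_txt is hypothetical, and the lists passed to it reuse only a few of the bot names and paths from the sample above.

```python
def build_robots_txt(blocked_bots, blocked_paths):
    """Build robots.txt text that blocks some bots entirely and some paths for all bots."""
    lines = ["# Block the bad bots"]
    for bot in blocked_bots:
        # One entry per bot, separated by a blank line for readability.
        lines += [f"User-agent: {bot}", "Disallow: /", ""]
    lines += ["# Block specific URLs and directories for all bots", "User-agent: *"]
    lines += [f"Disallow: {path}" for path in blocked_paths]
    return "\n".join(lines) + "\n"

text = build_robots_txt(["ia_archiver", "MJ12bot"], ["/lib/", "/tmp/"])
print(text)
```

The returned text can then be saved as robots.txt and uploaded to the document root as described earlier.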