A robots.txt file contains instructions for bots that tell them which web pages they can and cannot access. Robots.txt files are most relevant for web crawlers from search engines like Google.
#Robots.txt file
A "robots.txt" file is a file that tells a search engine which search engine will crawl which pages of a site and which pages will not. This robots.txt file is in the root folder. Some pages on the website may need to not be shown in the search results. The reason may be that the work of those pages is not finished yet or any other reason. For this, a robots.txt file can be created to fix which pages will not be crawled by Search Engines.
If there is a subdomain and some of its pages do not need to be shown in the search results then a separate robots.txt file has to be created for it. The robots.txt file needs to be created and then uploaded to the root folder.
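For example, a main site and a subdomain would each serve their own robots.txt file from their own root (the domain names here are only placeholders) -
See example
https://www.example.com/robots.txt
https://blog.example.com/robots.txt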
#Creating the robots.txt file
With the robots.txt file, it is possible to control which pages of a site the search engine bots (crawlers and spiders) will see and which pages they will not. This control method is called the Robots Exclusion Protocol or Robots Exclusion Standard.
Before creating this file, let's take a look at some of the symbols used here.
The User-agent field names the robot(s) a rule group applies to, and * is a wildcard, so User-agent: * means the rules apply to all robots. Each rule line starts with Disallow: followed by a URL path beginning with /; the robot will then no longer crawl that path, file, or page. If you don't give a path, that is, if the Disallow value is empty, nothing is disallowed and everything is allowed. The # symbol starts a comment; it can be added so that the code can be understood later.
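As a small combined illustration of these symbols (the /draft/ folder name is only a placeholder, not taken from the text above) -
See example
# Keep all robots out of the unfinished /draft/ folder
User-agent: *
Disallow: /draft/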
A Disallow field may contain a partial or a full URL path. Any URL that begins with the given path, starting from the "/" sign, will not be visited by the robot. See an example below -
See example
Disallow: /help
# disallows both /help.html and /help/index.html, whereas
See another example below -
See example
Disallow: /help/
# would disallow /help/index.html but allow /help.html
See some examples below -
Allow all robots to visit all files (wildcard “*” indicates all robots)
See example
User-agent: *
Disallow:
To block all robots from visiting any file, see an example -
See example
User-agent: *
Disallow: /
To allow only GoogleBot to visit and block every other robot, see an example -
See example
User-agent: GoogleBot
Disallow:

User-agent: *
Disallow: /
To allow visits from both GoogleBot and Yahoo!'s Slurp bot, see an example -
See example
User-agent: GoogleBot
User-agent: Slurp
Disallow:
If you want to block a particular bot from visiting, use the following code -
See example
User-agent: Teoma
Disallow: /
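These rule groups can also be combined in one robots.txt file. As a sketch (the /private/ folder name is only a placeholder), the file below keeps every robot out of one folder and blocks the Teoma bot entirely -
See example
# Keep all robots out of /private/
User-agent: *
Disallow: /private/

# Block Teoma from the whole site
User-agent: Teoma
Disallow: /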
Even if you block crawling of a URL or page on your site with this file, those pages may still show up somewhere. For example, referral logs may reveal the URLs. Moreover, some search engines have less advanced algorithms, so when they send spiders/bots to crawl, they may ignore the instructions in the robots.txt file and crawl all of your URLs.
A better way to avoid these problems is to password-protect all of this content with a .htaccess file.
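As a minimal sketch of that idea, assuming an Apache server and a hypothetical password file at /path/to/.htpasswd (neither is mentioned above), the .htaccess file in the protected folder could look like this -
See example
# Ask for a username and password before serving this folder
AuthType Basic
AuthName "Restricted content"
AuthUserFile /path/to/.htpasswd
Require valid-user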
#Beware of rel="nofollow"
You can tell Google or other search engines not to follow a link by setting "nofollow" in the rel attribute of that link. If your site is a blog or forum where comments can be made, you can mark the links in the comment section as nofollow. This prevents others from using the reputation of your blog or forum to raise the rank of their own sites. Also, people may post addresses of offensive sites that you do not want on your pages, or links to sites that Google considers spam, which can damage your site's reputation.
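For example, a single comment link can be marked like this (the URL is only a placeholder) -
See example
<a href="http://www.example.com/" rel="nofollow">Example link</a>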
If you set nofollow in the robots meta tag of a page, it will do the same thing for every link on that page.
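The page-wide version, which applies nofollow to every link on the page, looks like this -
See example
<meta name="robots" content="nofollow">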