Controlling Googlebot crawls
For some site owners, Googlebot crawls too often and consumes too much bandwidth. For others, it visits too infrequently. Some complain that it doesn't visit their entire site, and others get upset when areas that they didn't want accessible via search engines appear in the Google index. It is not really possible to attract robots directly: Google will visit your site often if the site has good content that is updated often and is mentioned by other sites. However, it is possible to deter robots. You can control which pages Googlebot crawls, and you can request a reduction in the frequency or depth of each crawl.
To stop Google from crawling certain pages, create a file called robots.txt and place it at the root of your domain, e.g. http://www.frontmasters.co.uk/robots.txt. Within robots.txt you can keep Googlebot away from images, Perl scripts or copyrighted information. Each record in the file first names the spider (the User-agent line) and then lists the directories or files that spider is not allowed to access (the Disallow lines). Googlebot also supports two pattern characters in paths: * matches any sequence of characters and $ marks the end of the URL. It does not treat ? as a wildcard.
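To make the pattern matching concrete, here is a minimal Python sketch of how a single Disallow path could be compared against a URL path, assuming Google-style rules where * matches any run of characters and a trailing $ anchors the end of the URL. The function name rule_matches and the sample paths are made up for illustration; this is not Googlebot's actual code.

```python
import re


def rule_matches(pattern: str, path: str) -> bool:
    """Return True if a robots.txt path pattern covers the given URL path.

    Assumed semantics: plain patterns are prefix matches, '*' matches any
    sequence of characters, and a trailing '$' anchors the end of the URL.
    """
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Escape the literal pieces and join them with '.*' where '*' appeared.
    regex = ".*".join(re.escape(part) for part in pattern.split("*"))
    if anchored:
        regex += "$"
    # re.match only matches from the start of the path, which gives the
    # prefix behaviour for patterns without '$'.
    return re.match(regex, path) is not None


if __name__ == "__main__":
    for pattern, path in [
        ("/images", "/images/logo.png"),
        ("/*.gif$", "/images/logo.gif"),
        ("/*.gif$", "/images/logo.gif?size=2"),
        ("/cgi-bin/", "/images/logo.png"),
    ]:
        print(pattern, path, rule_matches(pattern, path))
```

A plain path such as /images is simply a prefix, so it also covers /images/logo.png; the $ form is useful when you want to block one file type without blocking URLs that merely contain the same text.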
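The following example contains two records. The first prevents all robots from accessing images or Perl scripts (in cgi-bin). The second prevents Googlebot from accessing the copyright directory and a copyright notice page. Note that a robot obeys only the most specific record that names it, so Googlebot follows the second record and ignores the first; repeat the first record's Disallow lines under Googlebot if you want both sets of rules to apply to it.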
---- robots.txt file ---- don't include this line :)
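# All robots: stay out of images and CGI scripts
User-agent: *
Disallow: /images
Disallow: /cgi-bin/

# Googlebot only: stay out of copyrighted material
User-agent: Googlebot
Disallow: /copyright/
Disallow: /content/copyright-notice.html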
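
One way to sanity-check a file like this before uploading it is Python's standard urllib.robotparser module. The sketch below parses the example rules and asks which URLs each robot may fetch. The file names logo.png and terms.html and the robot name SomeOtherBot are made up for illustration. This parser is not Googlebot's own: it does simple prefix matching and does not understand wildcards, so treat its answers as an approximation.

```python
from urllib.robotparser import RobotFileParser

# The example rules from this post, with the typo in "User-agent" fixed.
ROBOTS_TXT = """\
User-agent: *
Disallow: /images
Disallow: /cgi-bin/

User-agent: Googlebot
Disallow: /copyright/
Disallow: /content/copyright-notice.html
"""

SITE = "http://www.frontmasters.co.uk"


def build_parser(text: str) -> RobotFileParser:
    """Parse robots.txt content from a string instead of fetching it."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser


def allowed(parser: RobotFileParser, robot: str, path: str) -> bool:
    """Ask whether the named robot may fetch the given path on SITE."""
    return parser.can_fetch(robot, SITE + path)


if __name__ == "__main__":
    parser = build_parser(ROBOTS_TXT)
    # Hypothetical paths, chosen to exercise each Disallow line.
    paths = (
        "/images/logo.png",
        "/cgi-bin/form.pl",
        "/copyright/terms.html",
        "/content/copyright-notice.html",
    )
    for robot in ("Googlebot", "SomeOtherBot"):
        for path in paths:
            verdict = "allowed" if allowed(parser, robot, path) else "blocked"
            print(f"{robot:12} {path:32} {verdict}")
```

Run as written, it reports that Googlebot may fetch /images/logo.png, because the Googlebot record replaces the catch-all record rather than adding to it.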
- GeorgeS's blog
Thanks for the blog
Hey Front Masters.
I used your advice here. Thanks, Jim, NY NY