Google attempting to index the deep web
Google announced yesterday that its spiders will now fill out forms to reach dynamically generated pages, which until now were inaccessible to search engines. This gives Google the ability to start indexing the Deep Web.
Google promises to follow “good Internet citizenry practices”, but then again not a lot of people trust Google:
Needless to say, this experiment follows good Internet citizenry practices. Only a small number of particularly useful sites receive this treatment, and our crawl agent, the ever-friendly Googlebot, always adheres to robots.txt, nofollow, and noindex directives.
And they also promise not to submit any forms that require user information:
Similarly, we only retrieve GET forms and avoid forms that require any kind of user information.
Well, we have to wait and see how this turns out for the Internet, but for those who want to prevent Google from indexing their dynamic pages for whatever reason, here’s how:
- Use the noindex and nofollow directives (for example, in a robots meta tag) in your web pages.
- Make use of robots.txt.
- And if you are still uncomfortable, start using CAPTCHAs on your web forms.
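To make the first two options concrete, here is a minimal sketch in Python. It assumes a hypothetical site whose GET form returns results under `/search`; that path, the `example.com` URLs, and the helper name `is_allowed` are illustrative and not taken from Google's announcement. It holds a sample robots.txt and a robots meta tag as strings, then uses the standard library's `urllib.robotparser` to check which URLs the rules would let Googlebot fetch.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: "/search" is an assumed path where a GET form
# sends its results. Adjust it to wherever your own form submits.
ROBOTS_TXT = """\
User-agent: Googlebot
Disallow: /search
"""

# Robots meta tag to place in the <head> of any page that should be
# neither indexed nor have its links followed.
ROBOTS_META = '<meta name="robots" content="noindex, nofollow">'


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt rules let user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    # Form results under /search are blocked for Googlebot...
    print(is_allowed(ROBOTS_TXT, "Googlebot", "https://example.com/search?q=deep+web"))
    # ...while ordinary pages stay crawlable.
    print(is_allowed(ROBOTS_TXT, "Googlebot", "https://example.com/about"))
```

Running a check like this before you deploy a robots.txt is a cheap way to confirm that the rules cover the form's result URLs without accidentally blocking the rest of the site.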
I have to say that I have nothing against Google and I don’t believe that they are out for World Domination.