You may be wondering about "ia_archiver"
and why it is downloading your entire site.
Or
you may want to invite the robot to archive your site.

Either way, here is important information you should know about the web-crawling robot:



The robot, which identifies itself as ia_archiver in the HTTP "User-agent" header field, uses a site-by-site crawl strategy. Basically, it starts at the top of the site and drills down, fetching all of the local links that it finds as it goes. There are several advantages to this approach, but chief among them is that it can be as polite as possible to the site being crawled.
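
As a rough illustration of the site-by-site strategy described above, the following minimal sketch walks a single site from its top page and follows only local links. It is not the robot's actual code: the function name crawl_site, the fetch_links callback, the max_pages limit, and the in-memory SITE table standing in for real HTTP fetches are all assumptions made for this example.

```python
from collections import deque
from urllib.parse import urljoin, urlparse


def crawl_site(start_url, fetch_links, max_pages=1000):
    """Visit one site starting at its top page, following only local links.

    fetch_links is a hypothetical callback that returns the href values
    found on a page; a real crawler would fetch and parse the page here.
    """
    host = urlparse(start_url).netloc
    seen = {start_url}
    queue = deque([start_url])
    visited = []
    while queue and len(visited) < max_pages:
        url = queue.popleft()
        visited.append(url)
        for href in fetch_links(url):
            link = urljoin(url, href)           # resolve relative links
            if urlparse(link).netloc != host:   # skip links that leave the site
                continue
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return visited


# Hypothetical four-page site used in place of real network requests.
SITE = {
    "http://example.com/": ["/a.html", "/b.html", "http://other.example/x"],
    "http://example.com/a.html": ["/b.html", "/c.html"],
    "http://example.com/b.html": ["/"],
    "http://example.com/c.html": [],
}

pages = crawl_site("http://example.com/", lambda url: SITE.get(url, []))
```

Because every queued URL belongs to the same host, a crawler built this way can pace its requests to that one server.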

We will not archive anything you request to remain private.

All you have to do is tell us. How? By using the Standard for Robot Exclusion (SRE).

The SRE was developed by Martijn Koster at Webcrawler to allow content providers to control how robots behave on their sites. All of the major web-crawling groups, such as InfoSeek and AltaVista, respect this standard. The Archive strictly adheres to the standard, and extends it in several respects:

  • Whenever the robot lands on the top level of a web site, it looks for a file called "robots.txt". This is a file a web site administrator can place at the top level of a site to exclude a robot from visiting the whole site or any specific directory. [standards about robot exclusion].

  • The robot always waits at least 5 seconds after one request to a site before asking for another file. Since the typical web server can handle dozens of requests per second, this means the maximum load archiving places on a server is less than 5 percent of its capacity.

  • After retrieving any HTML file, we check for the presence of the NOINDEX, NOARCHIVE, and NOFOLLOW tags in the "<HEAD>" element of the document. If we find a NOINDEX or NOARCHIVE tag, we throw away the copy. If there is a NOFOLLOW tag, the robot will not follow any links we found on that page. The main advantage to this approach is that users can control access to their own data, without needing their site administrators to update "robots.txt". [more about robot exclusions]
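
The three checks above can be sketched roughly as follows. This is an illustration only, not the Archive's implementation. It assumes the page-level tags are written in the common form <META NAME="ROBOTS" CONTENT="...">, which the text above does not spell out. The sample robots.txt content and the helper names allowed, page_policy, and seconds_to_wait are hypothetical.

```python
from html.parser import HTMLParser
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt that keeps ia_archiver out of one directory.
ROBOTS_TXT = """\
User-agent: ia_archiver
Disallow: /private/
"""


def allowed(robots_txt, agent, url):
    """Return True if robots_txt lets the named agent fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


def seconds_to_wait(last_request_time, now, min_delay=5.0):
    """Seconds still to wait so requests to a site are min_delay apart."""
    return max(0.0, min_delay - (now - last_request_time))


class RobotsMetaParser(HTMLParser):
    """Collect directives from <meta name="robots"> tags inside <head>."""

    def __init__(self):
        super().__init__()
        self.in_head = False
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag == "head":
            self.in_head = True
        elif tag == "meta" and self.in_head:
            attributes = dict(attrs)
            if (attributes.get("name") or "").lower() == "robots":
                content = attributes.get("content") or ""
                for part in content.split(","):
                    if part.strip():
                        self.directives.add(part.strip().upper())

    def handle_endtag(self, tag):
        if tag == "head":
            self.in_head = False


def page_policy(html):
    """Return (keep_copy, follow_links) for a retrieved HTML page."""
    parser = RobotsMetaParser()
    parser.feed(html)
    keep_copy = not ({"NOINDEX", "NOARCHIVE"} & parser.directives)
    follow_links = "NOFOLLOW" not in parser.directives
    return keep_copy, follow_links
```

For example, allowed(ROBOTS_TXT, "ia_archiver", "http://example.com/private/x.html") is rejected by the sample file, while a page outside /private/ is permitted.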

Please write Alexa Internet with your questions and concerns.









Contact us at: info@archive.org or call 415.561.6900
