How can I get my site included in the Archive?
Alexa Internet has been crawling the web since 1996, which has resulted in a massive archive. If you have a web site, you would like to ensure that it is saved for posterity in the Internet Archive, and you have searched the Wayback Machine and found no results, there are three ways to get it crawled. Method 1: visit Alexa's "Webmasters" page at http://pages.alexa.com/help/webmasters/index.html#crawl_site.
Method 2: if you have the Alexa Toolbar installed, just visit the site.
Method 3: while visiting a site, use the 'show related links' in Internet Explorer, which uses the Alexa service.
Sites are usually crawled within 24 hours, and within 48 hours at most. Crawled sites will be added to the Wayback Machine in about 6 months.
How can I remove my site's pages from the Wayback Machine?
The Internet Archive is not interested in preserving or offering access to Web sites or other Internet documents of persons who do not want their materials in the collection. By placing a simple robots.txt file on your Web server, you can exclude your site from being crawled as well as exclude any historical pages from the Wayback Machine.
You can find exclusion directions at exclude.php. If you have further questions, you may email firstname.lastname@example.org.
What is the Internet Archive Wayback Machine?
The Internet Archive Wayback Machine is a service that allows people to visit archived versions of Web sites. Visitors to the Wayback Machine can type in a URL, select a date range, and then begin surfing on an archived version of the Web. Imagine surfing circa 1999 and looking at all the Y2K hype, or revisiting an older version of your favorite Web site. The Internet Archive Wayback Machine can make all of this possible. See our press release at http://www.archive.org/about/press_release.php.
Can I link to old pages on the Wayback Machine?
Yes! The Wayback Machine is built so that it can be used and referenced. If you find an archived page that you would like to reference on your Web page or in an article, you can copy the URL. You can even use fuzzy URL matching and date specification... but that's a bit more advanced (check out our advanced search page at http://web.archive.org/collections/web/advanced.html).
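As a rough illustration of date specification, the sketch below builds a link to an archived page. It assumes an address pattern of the form http://web.archive.org/web/<timestamp>/<original URL>, with a timestamp written as YYYYMMDDhhmmss and "*" standing for "all dates". That pattern and the helper name wayback_link are assumptions made for illustration and are not specified in this FAQ; the advanced search page is the place to confirm the exact syntax.

```python
from datetime import datetime

WAYBACK_PREFIX = "http://web.archive.org/web"  # assumed address pattern


def wayback_link(url, when=None):
    """Build a link to an archived copy of `url` (hypothetical helper).

    With `when` set, the link carries a 14-digit timestamp for that moment.
    Without it, "*" is used, which is assumed to mean "list every date".
    """
    stamp = when.strftime("%Y%m%d%H%M%S") if when else "*"
    return f"{WAYBACK_PREFIX}/{stamp}/{url}"


print(wayback_link("http://www.example.com/", datetime(1999, 12, 31)))
print(wayback_link("http://www.example.com/"))
```

Keeping the original URL intact at the end of the link means the archived address can be read, and cited, without any lookup table.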
Why isn't the site I'm looking for in the archive?
Some sites may not be included because the automated crawlers were unaware of their existence at the time of the crawl. It's also possible that some sites were not archived because they were password protected or otherwise inaccessible to our automated systems or because the Web site administrator has requested removal of the site from the Archive. Note: some pages appear in the Election 2000 archive and not the main archive.
What does it mean when a site's archive data has been "updated"?
When our automated systems crawl the web every few months or so, we find that only about 50% of all pages on the web have changed from our previous visit. This means that much of the content in our archive is duplicate material. If you don't see "*" next to an archived document, then the content on the archived page is identical to the previously archived copy.
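One way to picture the "*" marker is as the result of comparing each new capture with the one before it. The sketch below does this with content hashes. It is an illustration only, not a description of how the Archive's own systems detect changes, and the function name and sample data are hypothetical.

```python
import hashlib


def mark_updates(captures):
    """Return (date, changed) pairs for captures given in date order.

    `changed` is True when the content differs from the previous capture,
    which is the case where a "*" would be shown. The first capture has
    nothing to compare against, so it counts as changed (an assumption).
    """
    marked = []
    previous_digest = None
    for date, content in captures:
        digest = hashlib.sha1(content).hexdigest()
        marked.append((date, digest != previous_digest))
        previous_digest = digest
    return marked


# Made-up captures of a single page.
captures = [
    ("2000-01-15", b"<html>version one</html>"),
    ("2000-03-20", b"<html>version one</html>"),  # unchanged copy
    ("2000-06-02", b"<html>version two</html>"),
]
for date, changed in mark_updates(captures):
    print(date, "*" if changed else "")
```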
Who was involved in the creation of the Internet Archive Wayback Machine?
The original idea for the Internet Archive Wayback Machine began in 1996, when the Internet Archive first began archiving the web. Now, five years later, with over 100 terabytes and a dozen web crawls completed, the Internet Archive has made the Internet Archive Wayback Machine available to the public. The Internet Archive has relied on donations of web crawls, technology, and expertise from Alexa Internet and others. The Internet Archive Wayback Machine is owned and operated by the Internet Archive.
How was the Wayback Machine made?
Over 100 terabytes of data are stored on several dozen modified servers. Alexa Internet, in cooperation with the Internet Archive, has designed a three-dimensional index that allows browsing of web documents over multiple time periods, and turned this unique feature into the Wayback Machine.
How large is the Archive?
The Internet Archive Wayback Machine contains over 100 terabytes of data and is currently growing at a rate of 12 terabytes per month. This eclipses the amount of text contained in the world's largest libraries, including the Library of Congress. If you tried to place the entire contents of the archive onto floppy disks (we don't recommend this!) and laid them end to end, it would stretch from New York, past Los Angeles, and halfway to Hawaii.
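The floppy disk comparison can be checked with some back-of-the-envelope arithmetic. The sketch below assumes decimal terabytes, standard 3.5-inch high-density disks holding 1,474,560 bytes each, and a disk edge of about 90 mm. None of these figures come from this FAQ, so treat the result as a rough estimate.

```python
import math

ARCHIVE_BYTES = 100 * 10**12   # "over 100 terabytes", taken as decimal TB
FLOPPY_BYTES = 1_474_560       # assumed: 3.5-inch HD disk ("1.44 MB")
FLOPPY_LENGTH_M = 0.09         # assumed: roughly 90 mm per disk


def floppy_chain_km(total_bytes):
    """Length in km of the row of floppies needed to hold `total_bytes`."""
    disks = math.ceil(total_bytes / FLOPPY_BYTES)
    return disks * FLOPPY_LENGTH_M / 1000


print(f"{floppy_chain_km(ARCHIVE_BYTES):,.0f} km of floppy disks")
```

Under these assumptions the row of disks comes to roughly six thousand kilometers, which is the same scale as the coast-to-coast-and-beyond distance described above.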
What type of machinery is used in this Internet Archive?
The Internet Archive is stored on dozens of slightly modified Hewlett Packard servers. The computers run the FreeBSD operating system. Each computer has 512 MB of memory and can hold just over 300 gigabytes of data on IDE disks.
How do you archive dynamic pages?
Why are some sites harder to archive than others?
If you look at our collection of archived sites, you will find some broken pages, missing graphics, and some sites that aren't archived at all. As a general rule of thumb, simple HTML is the easiest to archive. Here are some things that make it difficult to archive a web site:
- Robots.txt -- We respect robot exclusion headers.
- Server side image maps -- Like any functionality on the web, if it needs to contact the originating server in order to work, it will fail when archived.
- Unknown sites -- The archive contains crawls of the Web completed by Alexa Internet. If Alexa doesn't know about your site, it won't be archived. Use the Alexa Toolbar (available at www.alexa.com), and it will know about your page. Or you can visit Alexa's Archive Your Site page at http://pages.alexa.com/help/webmasters/index.html#crawl_site.
- Orphan pages -- If there are no links to your pages, the robots won't find them (the robots don't enter queries in search boxes).
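The orphan-page problem follows from how link-following crawlers work in general. The sketch below is a minimal breadth-first crawl over a made-up, in-memory link graph. The page names and the crawl function are hypothetical, and this is not the Alexa crawler's actual code. It shows that a page nothing links to is never reached.

```python
from collections import deque

# Hypothetical site: each page maps to the pages it links to.
LINKS = {
    "/index.html": ["/about.html", "/news.html"],
    "/about.html": ["/index.html"],
    "/news.html": ["/news/1999.html"],
    "/news/1999.html": [],
    "/orphan.html": ["/index.html"],  # links out, but nothing links to it
}


def crawl(start, links):
    """Return the set of pages reachable from `start` by following links."""
    seen = {start}
    queue = deque([start])
    while queue:
        page = queue.popleft()
        for target in links.get(page, []):
            if target not in seen:
                seen.add(target)
                queue.append(target)
    return seen


found = crawl("/index.html", LINKS)
print(sorted(set(LINKS) - found))  # pages the crawl never discovered
```

Adding a single link to the orphan from any reachable page is enough for the crawl to pick it up.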
Some sites are not available because of robots.txt or other exclusions. What does that mean?
The Standard for Robot Exclusion (SRE) is a means by which web site owners can instruct automated systems not to crawl their sites. Web site owners can specify files or directories that are disallowed from a crawl, and they can even create specific rules for different automated crawlers. All of this information is contained in a file called robots.txt. While robots.txt has been adopted as the universal standard for robot exclusion, compliance with robots.txt is strictly voluntary. In fact, most web sites do not have a robots.txt file, and many web crawlers are not programmed to obey the instructions anyway. However, Alexa Internet, the company that crawls the web for the Internet Archive, does respect robots.txt instructions, and even does so retroactively. If a web site owner decides he or she prefers not to have a web crawler visiting his or her files and sets up robots.txt on the site, the Alexa crawlers will stop visiting those files and will make unavailable all files previously gathered from that site. This means that sometimes, while using the Internet Archive Wayback Machine, you may find a site that is unavailable due to robots.txt or other exclusions. Other exclusions? Yes, sometimes a web site owner will contact us directly and ask us to stop crawling or archiving a site, and we endeavor to comply with these requests.
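To make the mechanics concrete, the sketch below feeds a sample robots.txt to Python's standard urllib.robotparser module and asks which URLs each crawler may fetch. The rules, the site address, and the crawler name "ExampleBot" are made up for illustration. "ia_archiver" is assumed here to be the user-agent name of the Alexa crawler, so check the exclusion directions for the exact name to use.

```python
from urllib.robotparser import RobotFileParser

# Sample robots.txt: one crawler is excluded from the whole site,
# every other crawler only from the /images/ directory.
ROBOTS_TXT = """\
User-agent: ia_archiver
Disallow: /

User-agent: *
Disallow: /images/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def allowed(agent, path):
    """True if `agent` may fetch `path` on the hypothetical example site."""
    return parser.can_fetch(agent, "http://www.example.com" + path)


print(allowed("ia_archiver", "/index.html"))      # False: whole site excluded
print(allowed("ExampleBot", "/index.html"))       # True: only /images/ is off limits
print(allowed("ExampleBot", "/images/logo.gif"))  # False
```

A rule like the second one, which blocks only an images directory, is the kind of exclusion that leaves a page's text reachable while its pictures are not.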
How can I help the Internet Archive and the Wayback Machine?
The Internet Archive actively seeks donations of digital materials for preservation. If you have digital materials that may be of interest to future generations, please let us know by submitting a proposal at http://www.archive.org/internet/proposal.html. The Internet Archive is also seeking additional funding to continue this important mission. You may make a donation through the Amazon.com Honor System at http://www.amazon.com/paypage/PFW9L3HMJTPIQ.
Can I search the Archive?
Using the Internet Archive Wayback Machine, it is possible to search for the names of sites contained in the Archive (URLs) and to specify date ranges for your search. However, we do not yet have an indexed text search of the documents in the collection. We continue to work on it and should have a full text search soon.
When will you offer text search for the Wayback Machine?
We do not yet have an indexed text search of the documents in the collection. This is a large and complicated project, but we continue to work on it and should have a full text search soon.
Why am I getting broken or gray images on a site?
Broken images (when there is a small red "x" where the image should be) occur when the images are not available on our servers. Usually this means that we did not archive them. Gray images are the result of robots.txt exclusions. The site in question may have blocked robot access to their images directory.
How do I contact the Internet Archive?
Questions about the Wayback Machine should be addressed to email@example.com. General questions about the Internet Archive, or other archive projects, should be addressed to firstname.lastname@example.org.
What is the Wayback Machine's Copyright Policy?
The Internet Archive respects the intellectual property rights and other proprietary rights of others. The Internet Archive may, in appropriate circumstances and at its discretion, remove certain content or disable access to content that appears to infringe the copyright or other intellectual property rights of others. If you believe that your copyright has been violated by material available through the Internet Archive, please provide the Internet Archive Copyright Agent with the following information:
- Identification of the copyrighted work that you claim has been infringed;
- An exact description of where the material about which you complain is located within the Internet Archive collections;
- Your address, telephone number, and email address;
- A statement by you that you have a good-faith belief that the disputed use is not authorized by the copyright owner, its agent, or the law;
- A statement by you, made under penalty of perjury, that the above information in your notice is accurate and that you are the owner of the copyright interest involved or are authorized to act on behalf of that owner;
- Your electronic or physical signature.
The Internet Archive Copyright Agent can be reached as follows:
Internet Archive Copyright Agent
Presidio of San Francisco
P.O. Box 29244
San Francisco, CA 94129
Why is the Internet Archive collecting sites from the Internet? What makes the information useful?
Most societies place importance on preserving artifacts of their culture and heritage. Without such artifacts, civilization has no memory and no mechanism to learn from its successes and failures. Our culture now produces more and more artifacts in digital form. The Archive's mission is to help preserve those artifacts and create an Internet library for researchers, historians, and scholars. The Archive collaborates with institutions including the Library of Congress and the Smithsonian.
Do you archive email? Chat?
No, we do not collect or archive chat systems or personal email messages that have not been posted to Usenet bulletin boards or publicly accessible online message boards.
Do you collect all the sites on the Web?
No, we collect only publicly accessible Web pages. We do not archive pages that require a password to access, pages tagged for "robot exclusion" by their owners, pages that are only accessible when a person types into and sends a form, or pages on secure servers. If a site owner properly requests removal of a Web site through http://www.archive.org/internet/remove/html, we will remove that site from the Archive.
Is there any personal information in these collections?
We collect Web pages that are publicly accessible. These may include pages with personal information.
Who has access to the collections? What about the public?
The Archive makes the collections available at no cost to researchers, historians, and scholars. At present, it takes someone with a certain level of technical knowledge to access them, but there is no requirement that a user be affiliated with any particular organization.
How can I get a copy of the pages on my Web site? If my site got hacked or damaged, could I get a backup from the Archive?
Can I download an entire site from the Wayback Machine?
We do not currently offer any method to download entire sites from the Wayback Machine.
Can people download sites from the collections?
How do you protect my privacy if you archive my site?
Like a public library, the Archive provides free and open access to its collections to researchers, historians, and scholars. Our cultural norms have long promoted access to documents that were, but no longer are, publicly accessible.
Given the rate at which the Internet is changing (the average life of a Web page is only 77 days), if no effort is made to preserve it, it will be entirely and irretrievably lost. Rather than let this moment slip by, we are proceeding with documenting the growth and content of the Internet, using libraries as our model.
If you are interested in these issues, please join and contribute to our announcement and discussion lists.