Who archives the World Wide Web? Michael Cunningham reports on a project to put the entire contents of the Internet into a box - literally
It's all a question of scale. Nowadays millions of words from hundreds of books can be distilled down, and squeezed onto one small, shiny disk. And at the click of a mouse, these documents can be searched through or squirted down networks to the other side of the planet. In seconds.
In another epoch, this would seem like alchemy or witchcraft. Today, though, it is a footnote in a Budget speech. Last week, the Minister for Finance allocated £750,000 to a project to begin turning all the laws enacted by the Oireachtas since the foundation of the State - over 35,000 pages, occupying about 4.5 metres of shelf space - into a digital form. On a CD-Rom and, maybe, on a network near you.
By Internet standards, it's a relatively modest project. Take the BBC, which is digitising its collection of some 22 million newspaper cuttings. Or the British Library, with a mere 12 million items on its shelves. It's about to move from its ancient - but much loved - premises in the British Museum to a controversial £500-million high-tech centre in St Pancras. While the new building has 200 miles of shelving, it also has a staggering 1,900 miles of computer cables.
As traditional archives such as the British Library and Ireland's statute books are gradually migrating onto CD-Roms and the Internet, something equally important is happening - or ought to be happening - in the opposite direction. Today's digital media such as the World Wide Web and email are prime candidates to be archived.
While the Internet often seems like an overdose of useless and trivial information, to the digital archaeologists of tomorrow it will represent a rich stream of data. But if nobody's making a record of it, how will they ever know? Has the Internet become too big to be copied and archived?
The Web, for example, has often been likened to one vast, rapidly fluctuating library (of bits rather than atoms). But unlike a traditional library it is being rebuilt every minute. Its sites can flicker and die in days, hours or even seconds. Web pages are revised and spiced up with fancier graphics and revamped designs, more "plug-in" animations and "applets", often with no record kept of the previous mutations.
One study from the University of Colorado five years ago - when the Web was obviously much less dynamic but much more manageable to study - found the "mean lifetime" of a Web page was only 44 days. The Web has also been growing at a mind-boggling rate since its birth in the early 1990s: estimates of its current size range from 31 million Web pages (according to the people at Alta Vista) to 50 million (say the crowd at Excite) or more.
Until recently, nobody had been brave or foolish enough to attempt to archive these fleeting moments and sprawling suburbs of cyberspace. Well, not until a certain millionaire entrepreneur called Brewster Kahle came along.
Kahle designs supercomputers. He founded the legendary Massachusetts-based supercomputer company Thinking Machines. He also invented the WAIS (Wide Area Information Servers) system, which lets users search for textual information, and ranks the "hits" according to its own notion of what's relevant. In 1995 he sold WAIS Inc to America Online for $15 million. And he's still only 35.
Last year he spent a fraction of that fortune - about $400,000 - on setting up the Internet Archive. Its modest goal: to preserve our digital heritage - or at least significant parts of it on the Web and Net - by scooping up and preserving every single bit ever posted onto their publicly accessible areas, from ordinary home pages to the most arcane postings in the most obscure Usenet newsgroups. And given that many of the world's greatest books, lyrics, images and other artworks pass through these networks, an Internet Archive would, some say, be nothing less than the sum of all human knowledge.
Last summer Kahle assembled a team of eight assistants to begin grabbing everything they could find from the Net - not just its text, but its video and audio too. They based themselves in a former hospital in the Presidio, a decommissioned army base in San Francisco overlooking the Golden Gate Bridge. By outward appearances, it doesn't look like the repository of the world's digital treasures - it's more like your average shoestring start-up software operation. But, surprisingly enough, you don't need that much space to contain the entire contents of the Internet.
So far the team has estimated that:
- there are about 450,000 unique Web servers;
- most Usenet postings disappear after about a week;
- the average HTML page is about five kilobytes. If there are 80 million pages, then the text side of the Web is about 400 gigabytes;
- the total size of the non-text side of the Web (images, sounds, etc.) appears to be about 4 times that of the HTML side.
So the total size of a single snapshot of the Web appears to be about 2,000 gigabytes, or two terabytes. Or - wait for it - about 2,000,000,000,000 bytes.
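For readers who want to check the sums, the estimate above can be reproduced in a few lines. This is only a sketch using the figures quoted in the list; the variable and function names are invented for illustration, and it assumes the decimal units (1 GB = 1,000,000 KB) that the article uses.

```python
# Figures quoted in the article; names are illustrative only.
PAGES = 80_000_000       # "if there are 80 million pages"
AVG_PAGE_KB = 5          # average HTML page, in kilobytes
NON_TEXT_RATIO = 4       # non-text side is about 4 times the HTML side
KB_PER_GB = 1_000_000    # decimal units: 1 GB = 1,000 MB = 1,000,000 KB


def snapshot_size_gb(pages=PAGES, avg_kb=AVG_PAGE_KB, ratio=NON_TEXT_RATIO):
    """Return (text, non-text, total) sizes of one Web snapshot in gigabytes."""
    text_gb = pages * avg_kb / KB_PER_GB
    non_text_gb = text_gb * ratio
    return text_gb, non_text_gb, text_gb + non_text_gb


text_gb, non_text_gb, total_gb = snapshot_size_gb()
print(f"text: {text_gb:,.0f} GB")
print(f"non-text: {non_text_gb:,.0f} GB")
print(f"total: {total_gb:,.0f} GB")
```

With the quoted inputs the total comes to 2,000 GB, which matches the two-terabyte figure. Change the page count and the estimate scales in proportion.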
As a rough rule of thumb, a year's worth of Computimes articles (pure text, no pictures) can be stored on one floppy disk, which holds 1.4 megabytes of information. Going up the scale, many new PCs come with a hard drive which can hold 1,000 megabytes, or a gigabyte (1GB). A terabyte, then, is equivalent to 1,000 one-gigabyte hard drives, or a million megabytes. Strictly speaking, the following units of measurement are approximations:
one kilobyte (KB) = 1,000 bytes
one megabyte (MB) = 1,000 KB
one gigabyte (GB) = 1,000 MB
one terabyte (TB) = 1,000 GB
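One likely reason these round figures count as approximations (an assumption here, since the article does not spell it out) is that computers often count in multiples of 1,024 rather than 1,000. The sketch below compares the two conventions; the names are illustrative only.

```python
DECIMAL_BASE = 1000   # the round figures used in the table above
BINARY_BASE = 1024    # the power-of-two convention (assumed to be what "approximations" alludes to)
UNITS = ["KB", "MB", "GB", "TB"]


def unit_sizes(base):
    """Map each unit name to its size in bytes for the given base."""
    return {name: base ** (i + 1) for i, name in enumerate(UNITS)}


decimal = unit_sizes(DECIMAL_BASE)
binary = unit_sizes(BINARY_BASE)

for name in UNITS:
    gap = (binary[name] - decimal[name]) / decimal[name] * 100
    print(f"{name}: {decimal[name]:,} vs {binary[name]:,} bytes ({gap:.1f}% apart)")
```

The gap between the two conventions widens with each step up the scale, so it matters more for terabytes than for kilobytes.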
But another way of putting the Internet's size in context is to compare it with other existing media (see panel below). For example, in a typical public library the amount of text in the books (about 300,000 of them) would add up to about three terabytes of data. So for the moment Kahle's project does seem manageable.
The Internet Archive, which is currently growing at the rate of 100 gigabytes a week, uses high-speed links to the Net. Its computers are connected to large robot-driven tape drives, each of which can hold two terabytes of data. In other words, the entire contents of the Web (at its current size) could, with a bit of compression, fit into one of these ADIC 448 "tape robots". We're talking "Internet in a box". Literally.
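A quick back-of-the-envelope check shows how long one such box would last. The sketch below assumes the growth rate stays constant at the quoted 100 gigabytes a week and ignores compression; both are simplifications, and the function name is invented for illustration.

```python
import math

ROBOT_CAPACITY_GB = 2_000    # "two terabytes" per tape robot
GROWTH_GB_PER_WEEK = 100     # quoted growth rate of the archive


def weeks_to_fill(capacity_gb, growth_gb_per_week, already_stored_gb=0):
    """Whole weeks until the remaining capacity is used up at a constant growth rate."""
    remaining = capacity_gb - already_stored_gb
    if remaining <= 0:
        return 0
    return math.ceil(remaining / growth_gb_per_week)


print(weeks_to_fill(ROBOT_CAPACITY_GB, GROWTH_GB_PER_WEEK))
```

On these assumptions an empty two-terabyte robot fills in 20 weeks, so a running archive would need to add capacity several times a year.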
To build their running snapshot of the Web, they use "web crawlers" - automated programs that scour through sites and suck up their contents.
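The idea behind a crawler can be sketched in a few dozen lines. This is a toy illustration, not the Archive's software: it walks a tiny made-up "Web" held in memory (the `FAKE_WEB` dictionary and its example.com addresses are hypothetical) instead of fetching real pages over the network.

```python
from collections import deque
from html.parser import HTMLParser


class LinkExtractor(HTMLParser):
    """Collects the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


# Hypothetical stand-in for the Web: address -> HTML.
# A real crawler would fetch each page over HTTP instead.
FAKE_WEB = {
    "http://example.com/": '<a href="http://example.com/a">A</a> '
                           '<a href="http://example.com/b">B</a>',
    "http://example.com/a": '<a href="http://example.com/">home</a>',
    "http://example.com/b": '<a href="http://example.com/missing">gone</a>',
}


def crawl(start_url, fetch=FAKE_WEB.get):
    """Breadth-first crawl from start_url; returns {address: html} for pages found."""
    archive = {}
    seen = {start_url}
    queue = deque([start_url])
    while queue:
        url = queue.popleft()
        html = fetch(url)
        if html is None:          # the page has vanished; nothing to store
            continue
        archive[url] = html       # "suck up" the contents
        parser = LinkExtractor()
        parser.feed(html)
        for link in parser.links:
            if link not in seen:  # never queue the same address twice
                seen.add(link)
                queue.append(link)
    return archive


snapshot = crawl("http://example.com/")
print(sorted(snapshot))
```

The `seen` set is what stops the program going round in circles when pages link back to each other. The dead link in the example is simply skipped, much as a real crawler has to cope with pages that have already disappeared.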
For the technically minded (and please skip these two paragraphs if you aren't), they explain that "the data is then prepared for archiving on a Pentium processor running BSDI and then passed over a SCSI interface to a Digital Linear Tape (DLT) machine, such as the Quantum DLT4500. Each DLT 4500 can house five tapes each storing up to 40 gigabytes of content, or up to 200 gigabytes per DLT machine."
Then the data goes to a Hierarchical Storage Management system (HSM) which "allows the archive to have infinite storage capacity using multiple mediums (RAM, disk and tape) so that the archive can index the content regardless of the ultimate size of the archive".
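For those who did not skip: the tape figures quoted above can be combined with the earlier two-terabyte estimate to see how much hardware one snapshot needs. This is a rough sketch with invented names. It treats the quoted "up to 40 gigabytes" per tape as exact, so the result is a best case.

```python
import math

TAPES_PER_MACHINE = 5    # "each DLT 4500 can house five tapes"
GB_PER_TAPE = 40         # "up to 40 gigabytes" per tape, taken as exact here
SNAPSHOT_GB = 2_000      # the article's estimate for one snapshot of the Web


def machine_capacity_gb(tapes=TAPES_PER_MACHINE, gb_per_tape=GB_PER_TAPE):
    """Best-case capacity of one tape machine in gigabytes."""
    return tapes * gb_per_tape


def machines_needed(snapshot_gb, capacity_gb=None):
    """Smallest number of fully loaded machines that can hold the snapshot."""
    if capacity_gb is None:
        capacity_gb = machine_capacity_gb()
    return math.ceil(snapshot_gb / capacity_gb)


print(machine_capacity_gb())
print(machines_needed(SNAPSHOT_GB))
```

At 200 gigabytes per machine, an uncompressed two-terabyte snapshot would take 10 machines, or 50 tapes.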
Infinite capacity? Regardless of size? The team's absolute faith in these "tape robots" and their "snapshots" might be hard to take, particularly since the Internet has become Heisenberg's uncertainty principle with skates on. But Kahle is no fool or eccentric. He has the track record of a computing industry hero.
"I usually work on projects from the you've-got-to-be-crazy stage," he once admitted, "but eventually everyone ends up saying, `Of course.' "
The archive, he says, will be a freely accessible, non-profit organisation. If the bits are to survive as a permanent record, every 10 years or so they'll have to be copied into a new format. Besides this charity arm of the archive, Kahle hopes there will be profit-making spin-offs. A commercial arm will aggressively develop the software for gathering and manipulating the terabytes of data.
Among the archive's affiliates so far is the Smithsonian Institution in Washington, which wants to preserve all the Web pages related to the 1996 presidential election campaign - not just the official sites run by the candidates but the unofficial and parody pages, and the online news coverage too.
Until the digital age, historians, biographers and other social scientists relied on paper documents. Tomorrow's equivalents will feed off the email and Web pages which are quickly becoming the dominant media resources of our time. The technologists and Internet researchers of the future will also want to know what was happening online in January 1997.
"Nobody recorded television in its early days," Kahle says. "So we don't know what it looked like. Similarly, no one knows in any real way what the Web looked like a year ago."
Ironically, the latest digital incarnation of television, the Webcam, could spell the death of Kahle's project. "When everyone's camcorder is on the Net, obviously we won't be able to keep up," he admits. "But that's no reason not to try to save as much as we can."
Besides the technological problems, there are some formidable legal and ethical questions, Kahle admits. Should the most embarrassing email message you ever wrote be accessible to others - not just to the digital anthropologists in a century's time but this month? Should a Web page that you yourself have pulled the plug on be stored away like this?
Copyright lawyers and privacy rights lobbies have already begun to question how the Archive might use the data. Will saving files from corporate Web sites soon be construed under proposed international agreements as a copyright violation - as some Web "search engine" companies argued vociferously during the World Intellectual Property Organisation's conference in Geneva last month? Kahle has suggested the archive's information might be treated like census data - aggregate information is made public, but specific information about individuals is kept confidential - though how the dividing line is drawn in practice is anybody's guess.
The Irish question
If a digital national archive is important for the historians of the future, where is Ireland's digital archive? Which national agency in Ireland should - or could - be responsible for saving and preserving today's email and other electronic objects? The National Archives do have a Web site: it's hosted by the Dublin Institute of Technology. The information ranges from family history to famine records and the archives of the Ordnance Survey. But this is information about the information - rather than direct access to (digitised) archives themselves.
As the organisation stresses, "it would be an enormous task to make even a fraction of the material available online. However you can find out here whether the National Archives is likely to contain what you are looking for and how to find and use the National Archives." But we need to take certain practical steps today. To take one simple example, government agencies in the US already have to keep copies of email. In this country most employees in central and local government don't appear to have access to email, so the problem might seem far off. But the longer the State postpones decisions in such areas, the bigger the chunk of our country's digital history that future generations will lose for ever.
Michael Cunningham is at: email@example.com