Web Archiving profile: Overview

Overview

This profile explores questions about archiving the web, examining projects such as the Internet Archive or Pandora and considering challenges such as intellectual property.

In the West a range of institutions - particularly libraries - have come to provide long-term access to publications, including monographs, serials, pamphlets, maps, films and sound recordings. There have been desultory and necessarily selective attempts to archive radio and television broadcasts. Until recently, however, there have been few large-scale and exemplary projects to preserve web sites and Usenet messages and to enable long-term access to that online content.

Curatorial institutions, businesses and volunteer groups in Australia and overseas are now grappling with technical challenges and the more difficult policy questions, notably copyright (discussed in more detail in our Intellectual Property guide).

contents of this profile

The following pages cover -

Frameworks - voluntary and statutory deposit schemes, resource identification and other questions

Australia - local initiatives and Commonwealth/State legislation

Overseas - developments in North America, Europe and elsewhere

Studies - studies about technologies and regulatory issues

snapshots or facsimiles

A late 2001 article by Brewster Kahle, Rick Prelinger & Mary Jackson on Public Access to Digital Material claimed that universal digital access to print, film, broadcast, web and audio content is the "epic opportunity of our digital age" -

Currently, the technology has reached the point where scanning all books, digitizing all audio recordings, downloading all websites, and recording the output of all TV and radio stations is not only feasible but less costly than buying and storing the physical versions.

Building on Kahle's work with the Alexa Archive and the Internet Archive project (discussed later in this profile) they sought support for comprehensive digital archiving of all publications. Like other enthusiasts, they suggested that archiving the web and other content would provide all sorts of treats -

Imagine a high school student in Singapore writing a report on the life of Madame Curie based on documentary film footage and original photographs from the Curie family archives.

Imagine researching one's family roots in Europe by browsing the original birth and marriage records from the "old country" from a PC in the library.

Imagine being diagnosed with cancer and having the world's best medical research library and librarians less than one mile from your house.

Imagine a college student's documentary film about her grandfather's World War II battalion. This file could contain original military footage, current footage from towns involved in the battles, and interviews with surviving soldiers from the same battalion.

In practice things are a bit more complex.

There are substantial (although perhaps not intractable) regulatory issues, discussed on the following page. Identification and retrieval of archived content promises to be more challenging than finding information on the 'live' web. And there are questions about funding, use and administration which aren't necessarily answered by a brute force 'just add more boxes' approach.

As we noted in Analysphere when considering the Kahle article, one might well imagine the digital cornucopia, but reluctance to fund existing archives suggests that there's little support for the vision. The European Commission on Preservation & Access (ECPA) concurrently claimed that 80% of EU video and audio archive content is "at risk"; catalogues for much content in Australian and overseas archives are pre-digital. Works such as The Intellectual Foundation of Information Organization (Cambridge: MIT Press 2000) by Elaine Svenonius and Preferred Placement: Knowledge Politics on the Web (Maastricht: Jan van Eyck Akademie Editions 2000) edited by Richard Rogers suggest that simply 'copying' the web may be ineffective.

The best projects so far have been small-scale, restricted to 'collections' that cover specific themes or groups of publications and that feature significant investment in quality control (for example making sure that all of a page and images are captured) and resource identification.

One reason is the size and volatility of the web.

dimensions

There's little agreement, and there are no definitive figures, about the size and shape of the web.

Our Metrics & Statistics guide notes divergent estimates by experts about the number of domains, the number of unique publicly-accessible sites (probably over 8 million) and the number of distinct web pages.

Some gurus have suggested that there were around 2 billion pages on the web in 2000, heading for 16 billion in 2003. One bold marketer announced that there were merely 550 billion 'individual documents', although most were on corporate databases or otherwise weren't publicly available.
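
To put the first of those estimates in perspective, the growth rate it implies can be worked out directly. The sketch below is purely illustrative: it assumes constant compound growth between the two quoted figures (2 billion pages in 2000, 16 billion in 2003), which the estimate itself does not claim, and the function name is our own.

```python
import math


def annual_growth_factor(start_pages, end_pages, years):
    """Constant yearly multiplier that takes start_pages to end_pages.

    Assumes smooth compound growth, which is a simplification.
    """
    return (end_pages / start_pages) ** (1.0 / years)


# Figures quoted in the text: 2 billion pages (2000), 16 billion (2003).
factor = annual_growth_factor(2e9, 16e9, 3)

# Time for the page count to double at that rate.
doubling_time_years = math.log(2) / math.log(factor)

print(f"implied yearly growth factor: {factor:.2f}")
print(f"implied doubling time: {doubling_time_years:.2f} years")
```

On those figures the web would be roughly doubling every year, which gives a sense of how quickly any 'copy everything' archive would need to expand.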

(As points of reference the How Much Information report by Hal Varian & Peter Lyman suggests that in 1999 there were upwards of 22 643 newspaper titles, 40 000 scholarly journals, 80 000 mass-market periodicals and 40 000 newsletters. Building on work by Machlup, the report estimated that 3.2 million book titles might be in print in the English-speaking countries, with a global figure for the number of book titles - including out of print works - at around 65 million titles.)

Much of the web content is evanescent. Wallace Koehler's paper on Digital Libraries & WWW Persistence for example estimates that the 'half life' of a web page is under two years and that the half life of a site is just over two years.
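
A 'half life' figure can be turned into a rough expectation of how much content remains after a given period. The sketch below assumes simple exponential decay, which is our modelling assumption and not necessarily Koehler's method, and uses two years as an illustrative upper bound for the page half life (the paper's figure is "under two years"). The function name is hypothetical.

```python
def surviving_fraction(age_years, half_life_years):
    """Fraction of pages still expected to exist after age_years.

    Assumes exponential decay: half the remaining pages are lost
    every half_life_years.
    """
    return 0.5 ** (age_years / half_life_years)


# Illustrative upper bound only; the quoted page half life is under two years.
PAGE_HALF_LIFE_YEARS = 2.0

for age in (1, 2, 5, 10):
    share = surviving_fraction(age, PAGE_HALF_LIFE_YEARS)
    print(f"after {age:2d} years: {share:.1%} of pages remain")
```

Under that assumption fewer than a fifth of pages would survive five years, and only a few percent would survive ten, which is why an archive that visits a site only occasionally will miss much of what was published.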

The 2000 paper How Dynamic Is The Web? by Brian Brewington & George Cybenko estimated that 20% of pages are less than twelve days old and only 25% are older than one year, consistent with conclusions in the 1997 Rate of Change & other Metrics: a Live Study of the World Wide Web paper by Douglis, Feldmann & Krishnamurthy.

The coverage of existing search engines and directories is contentious, but most appear to cover only a small part of the web and none covers all of the 'dark web' (ie the non-public component thought to comprise five-sixths of total online content).

Although getting comprehensive figures is akin to nailing jelly to the wall of a darkened room, it is clear that there's a lot of content on the web, that much of it changes and that identifying it may be difficult.
