Overview
This profile explores questions about archiving the web,
examining projects such as the Internet Archive and Pandora
and considering challenges such as intellectual property.
In the West a range of institutions - particularly libraries
- have come to provide long-term access to publications,
including monographs, serials, pamphlets, maps, films
and sound recordings. There have been desultory and necessarily
selective attempts to archive radio and television broadcasts.
Until recently, however, there have been few large-scale
and exemplary projects to preserve web sites and usenet
messages and to enable long-term access to that online
content.
Curatorial institutions, businesses and volunteer groups
in Australia and overseas are now grappling with technical
challenges and the more difficult policy questions, notably
copyright (discussed in more detail in our Intellectual
Property guide).
contents of this profile
The following pages cover -
- Frameworks - voluntary and statutory deposit schemes,
resource identification and other questions
- Australia - local initiatives and Commonwealth/State
legislation
- Overseas - developments in North America, Europe and
elsewhere
- Studies - studies about technologies and regulatory
issues
snapshots or facsimiles
A late 2001 article
by Brewster Kahle, Rick Prelinger & Mary Jackson on
Public Access to Digital Material claimed that
universal digital access to print, film, broadcast, web
and audio content is the "epic opportunity of our
digital age" -
Currently,
the technology has reached the point where scanning
all books, digitizing all audio recordings, downloading
all websites, and recording the output of all TV and
radio stations is not only feasible but less costly
than buying and storing the physical versions.
Building on Kahle's work with the Alexa
Archive and the Internet Archive project
(discussed later in this profile) they sought support
for comprehensive digital archiving of all publications.
Like other enthusiasts, they suggested that archiving
the web and other content would provide all sorts of treats
-
Imagine
a high school student in Singapore writing a report
on the life of Madame Curie based on documentary film
footage and original photographs from the Curie family
archives.
Imagine researching one's family roots in Europe by
browsing the original birth and marriage records from
the "old country" from a PC in the library.
Imagine being diagnosed with cancer and having the world's
best medical research library and librarians less than
one mile from your house.
Imagine a college student's documentary film about her
grandfather's World War II battalion. This file could
contain original military footage, current footage from
towns involved in the battles, and interviews with surviving
soldiers from the same battalion
In
practice things are a bit more complex.
There are substantial (although perhaps not intractable)
regulatory issues, discussed on the following page. Identification
and retrieval of archived content promises to be more
challenging than finding information on the 'live' web.
And there are questions about funding, use and administration
which aren't necessarily answered by a brute force 'just
add more boxes' approach.
As we noted in Analysphere
when considering the Kahle article, one might well imagine
the digital cornucopia but reluctance to fund existing
archives suggests that there's little support for the
vision. The European Commission on Preservation &
Access (ECPA)
concurrently claimed that 80% of EU video and audio archive
content is "at risk"; catalogues for much content
in Australian and overseas archives are pre-digital. Works
such as The Intellectual Foundation of Information
Organisation (Cambridge: MIT Press 2000) by Elaine Svenonius
and Preferred Placement: Knowledge Politics on the
Web (Maastricht: Jan van Eyck Akademie Editions 2000)
edited by Richard Rogers suggest that simply 'copying'
the web may be ineffective.
The best projects so far have been small-scale, restricted
to 'collections' that cover specific themes or groups
of publications and that feature significant investment
in quality control (for example making sure that all of
a page and images are captured) and resource identification.
One reason is the size and volatility of the web.
dimensions
There is little agreement, and there are no definitive
figures, about the size and shape of the web.
Our Metrics & Statistics guide
notes divergent estimates by experts about the number
of domains, the number of unique publicly-accessible sites
(probably over 8 million) and the number of distinct web
pages.
Some gurus have suggested that there were around 2 billion
pages on the web in 2000, heading for 16 billion in 2003.
One bold marketer announced that there were merely 550
billion 'individual documents', although most were on
corporate databases or otherwise weren't publicly available.
(As points of reference the
How Much Information report
by Hal Varian & Peter Lyman suggests that in 1999
there were upwards of 22 643 newspaper titles, 40 000
scholarly journals, 80 000 mass-market periodicals and
40 000 newsletters. Building on work by Machlup,
the report estimated that 3.2 million book titles might
be in print in the English-speaking countries, with a
global figure for the number of book titles - including
out of print works - at around 65 million titles.)
Much of the web content is evanescent. Wallace Koehler's
paper
on Digital Libraries & WWW Persistence for
example estimates that the 'half life' of a web page is
under two years and that the half life of a site is just
over two years.
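As a rough illustration of what a 'half life' implies, the
sketch below assumes that pages disappear at a constant rate
(simple exponential decay). That model and the function name
surviving_fraction are our assumptions rather than Koehler's
method, and the two-year figure is used as a round upper
bound for his page estimate.

```python
def surviving_fraction(years: float, half_life_years: float) -> float:
    """Fraction of an initial set of pages still online after `years`.

    Assumes simple exponential decay, i.e. half of the remaining
    pages vanish in every half-life period (an illustrative model).
    """
    if half_life_years <= 0:
        raise ValueError("half_life_years must be positive")
    return 0.5 ** (years / half_life_years)


if __name__ == "__main__":
    # 2.0 years is an assumed round figure for the page half life.
    for years in (1, 2, 5, 10):
        share = surviving_fraction(years, 2.0)
        print(f"after {years:>2} years: {share:.1%} of pages remain")
```

Under those assumptions about half of a given set of pages
would be gone after two years and only around 3% would
remain after ten, which suggests why an archive that crawls
infrequently may miss much of what was published.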
The 2000 paper
How Dynamic Is The Web? by Brian Brewington &
George Cybenko estimated that 20% of pages are less than
twelve days old and only 25% older than one year, consistent
with conclusions in the 1997 Rate of Change & other
Metrics: a Live Study of the World Wide Web paper
by Douglis, Feldmann & Krishnamurthy.
The coverage of existing search engines and directories
is contentious, but most appear to cover only a small part
of the web and none covers all of the 'dark web' (i.e. the
non-public component thought to comprise five sixths of
total online content).
Although getting comprehensive figures is akin to nailing
jelly to the wall of a darkened room, it is clear that
there's a lot of content on the web, that much of it
changes and that identifying it may be difficult.