Metadata on the web
This page looks at the use of metadata on the web.
what is metadata?
Metadata is literally information about information.
It may be very restricted in scope, such as a simple identification
number. Or it may be descriptive, allowing the creation
of indexes, lists and other tools that can be used for
identification and for evaluation of information.
If you've used a library catalogue you've used such a
tool. The catalogue is based on metadata - subject, author,
publisher etc - about the books and other documents held
by that institution.
Metadata is one of the key features of the web. It is
found within individual web pages, at varying levels of
detail and using varying standards, highlighted below.
And it's found in the search engines, directories and
other tools for finding sites and individual pages. The
next page of this guide looks at those engines and directories.
This site, indeed, can be viewed as metadata about information
on the web and offline, since it identifies and evaluates
several thousand sites, web documents and print publications.
In the metrics guide on
this site we highlight some of the studies about the growth
of the web.
There are now many millions of sites and hundreds of millions
of pages. Many of those documents change periodically.
One study suggests that the 'half life' of a page is less than two
years, roughly half the time it takes for most books to
go out of print. That is one reason why many big sites - such
as this one - have links that have "rotted".
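The 'half life' figure can be turned into a rough estimate of link rot. The sketch below is ours rather than the study's: it assumes pages disappear at a constant rate (simple exponential decay) and takes two years as the half life, although the study suggests the true figure is somewhat less. The function name surviving_fraction is hypothetical.

```python
def surviving_fraction(age_years, half_life_years=2.0):
    """Estimate the share of pages still online after age_years.

    Assumes simple exponential decay, which is our reading of 'half life'
    and not necessarily the model used in the study cited above.
    """
    if half_life_years <= 0:
        raise ValueError("half_life_years must be positive")
    return 0.5 ** (age_years / half_life_years)


if __name__ == "__main__":
    # Print the estimated survival rate for links of various ages.
    for years in (1, 2, 4, 6):
        share = surviving_fraction(years)
        print(f"links {years} year(s) old: about {share:.1%} still resolve")
```

On those assumptions a list of links compiled four years ago would have lost roughly three quarters of its targets, which is consistent with the rot visible on large sites.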
And domain names don't reveal all the treasures (or lack
of them) within a site. The size and volatility of the
web means that it is beyond anyone to list the contents
of all sites/pages and to provide an evaluation.
classification and its consequences
The importance of identification
and evaluation - so that your customers can search in
a particular part of the haystack rather than attempting
to scrutinise every piece of straw - is discussed in Elaine
Svenonius' The Intellectual Foundation of Information
Organization (Cambridge: MIT Press 2000).
She offers a demanding but comprehensive introduction
to the theory underlying attempts to identify, categorise
and retrieve the resources in the 'global digital library',
ie information accessed via the web.
There's a more accessible overview of identification/evaluation
issues and that library in Christine Borgman's From
Gutenberg to the Global Information Infrastructure: Access
To Information in the Networked World (Cambridge:
MIT Press 2000). It is strongly recommended.
Richard Belew's Finding Out About: Search Engine Technology
From A Cognitive Perspective (Cambridge: Cambridge
University Press 2001) is a more theoretical study of search processes. The
Advanced Internet Searcher's Handbook (London: Library
Association 2002) by Phil Bradley
and The Invisible Web (2001) by Chris
Sherman & Gary Price provide guidance about online
search techniques and resources.
Among specialist and general journals we recommend the
Journal of Internet Cataloging (JIC), D-LIB and the
terribly earnest Information Technology & Libraries
(ITAL).
the standards question
Internet
engineering and standards bodies have not mandated detailed
standards for metadata. That means, for example, that
there's no standardized terminology and thesaurus (one
reason why many librarians look at the web askance).
Essentially, in developing
the web, provision was made for the inclusion of metadata
within pages/sites, allowing descriptive and other information
to be embedded in each page among the 'invisible' code.
Provision was also made for construction of search engines
and other tools to point to web pages, drawing on the
embedded metadata or using their own metadata about those
pages.
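As an illustration of what 'embedded' means in practice, the sketch below reads the title and the name/content pairs from a page using Python's standard html.parser module. The sample page, the class name MetaExtractor and the function extract_metadata are our own illustrative assumptions. Real pages are often messier than this.

```python
from html.parser import HTMLParser


class MetaExtractor(HTMLParser):
    """Collect the <title> text and <meta name=... content=...> pairs."""

    def __init__(self):
        super().__init__()
        self.meta = {}
        self.title = ""
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag == "meta":
            attributes = dict(attrs)
            name = attributes.get("name")
            content = attributes.get("content")
            # Ignore meta elements without both a name and a content value.
            if name and content is not None:
                self.meta[name.lower()] = content

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def extract_metadata(html_text):
    """Return a dict holding the page title plus any named meta values."""
    parser = MetaExtractor()
    parser.feed(html_text)
    parser.close()
    return {"title": parser.title.strip(), **parser.meta}


# Hypothetical page used only for demonstration.
SAMPLE_PAGE = """<html><head>
<title>Metadata on the web</title>
<meta name="description" content="A guide to metadata on the web">
<meta name="keywords" content="metadata, Dublin Core, search engines">
</head><body><p>Visible text.</p></body></html>"""

if __name__ == "__main__":
    print(extract_metadata(SAMPLE_PAGE))
```

None of the values collected here is displayed to a person reading the page in a browser, apart from the title in the window bar. That is why the code is described as 'invisible'.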
That's had several results:
There's disagreement among specialist users about development
of specific standards for the structuring and expression
of embedded metadata. Competing and complementary standards
from librarians, museum curators, informatics specialists
and others include the Dublin Core, AAT, CSDGM,
GIS, CGIS-SAIF, Resource Description Framework and Warwick
Framework.
There's similar disagreement about content rating metadata
such as PICS, used in censorship
or content management schemes. As Charles Thomas
& Linda Griffin note in their First Monday
article
on Who Will Create The Metadata For The Internet?,
while there are commercial incentives for effective metadata,
the various schemes have to break out of the silicon ghetto.
The wide range of search engines and directories produce
different results. There are now at least 2,000 search
engines although most traffic goes to the top 11 such
as Yahoo! and Google.
Most pages (and probably most sites) don't have descriptive
metadata. Some studies
suggest that only 34% have 'meaningful' metadata and that
much metadata is not relevant to the particular site.
Fewer than 0.3% of sites (and thus a much smaller fraction
of the 'deep web' described in our metrics
guide) use Dublin Core metadata.
Few major search engines rely on metadata supplied by
the owners of sites. One industry figure quoted in
Search Engine Watch comments
"search engines do not trust metadata. It's fine
to talk about how nice it would be if all web pages were
categorized, but the search engines know from experience
that people will lie, mislead or do whatever they can
to get on top".
where does it come from?
In practice metadata about a page originates in two
ways.
The creator of the page can embed metadata when constructing
(or amending) the page.
Some software used in building sites will automatically
generate such metadata, albeit crudely. By contrast, we have manually
developed the metadata for each page on this site.
Many creators are uncertain about the nature
of metadata - what is it, where does it go, what terms
to use - or see it as an afterthought rather than integral
to electronic publishing.
A second way is the creation of metadata about the page
by an unrelated entity, ie by something/someone that visits
the page rather than by the page's owner.
Many search engines use 'robots' or 'spiders' to visit
pages, look for significant terms within the text and
incorporate that information within the database that
fuels the search engine, or flag that a page has objectionable
content. Other engines and directories use humans to examine
the pages and create the metadata.
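The robot's job of looking for 'significant terms' can be sketched very crudely as counting the words in the visible text of a page. The code below is an illustration only. The stop word list, the minimum word length and the names TextExtractor and significant_terms are our assumptions, and real engines are far more elaborate.

```python
import re
from collections import Counter
from html.parser import HTMLParser

# A deliberately tiny stop word list, chosen arbitrarily for this sketch.
STOP_WORDS = {"the", "and", "for", "that", "with", "each", "are", "was"}


class TextExtractor(HTMLParser):
    """Gather text from a page, skipping script and style content."""

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.chunks.append(data)


def significant_terms(html_text, top_n=5):
    """Return the top_n most frequent words as (word, count) pairs."""
    parser = TextExtractor()
    parser.feed(html_text)
    parser.close()
    words = re.findall(r"[a-z]+", " ".join(parser.chunks).lower())
    counts = Counter(w for w in words if len(w) > 2 and w not in STOP_WORDS)
    return counts.most_common(top_n)


# Hypothetical page used only for demonstration.
PAGE = """<html><head><title>Library catalogue</title>
<style>body { color: black; }</style></head>
<body><p>The catalogue lists books. Each catalogue entry records the
author and publisher of the books.</p></body></html>"""

if __name__ == "__main__":
    print(significant_terms(PAGE, top_n=2))
```

The point of the sketch is that this metadata is derived by the visitor from the page's own text, without relying on anything the page's owner has declared.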
does it matter?
As you might expect, there's disagreement about what
matters.
It's clear that most search engines ignore metadata embedded
by creators. A 1997 report,
for example, commented that "search engines do not
trust metadata. It's fine to talk about how nice it would
be if all web pages were categorized, but the search engines
know from experience that people will lie, mislead or
do whatever they can to get on top".
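One way to picture that distrust is a check of whether the keywords declared by a page's owner are supported by the text of the page itself. The sketch below is purely hypothetical. It does not describe how any actual search engine works, and the function name keyword_support and the scoring rule are our own assumptions.

```python
import re


def _words(text):
    """Split text into lower-case alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def keyword_support(declared_keywords, body_text):
    """Return the share of declared keywords found in the body text.

    declared_keywords is a comma-separated string, as commonly used in a
    'keywords' meta element. A multi-word keyword counts as supported only
    if every one of its words occurs somewhere in the body text.
    """
    body_words = set(_words(body_text))
    keywords = [_words(k) for k in declared_keywords.split(",")]
    keywords = [k for k in keywords if k]  # drop empty entries
    if not keywords:
        return 0.0
    supported = sum(1 for k in keywords if all(w in body_words for w in k))
    return supported / len(keywords)


if __name__ == "__main__":
    declared = "metadata, dublin core, free holidays"
    body = "This page explains metadata and the Dublin Core element set."
    print(f"share of declared keywords supported: {keyword_support(declared, body):.2f}")
```

A low score would suggest that the declared keywords say more about the owner's ambitions than about the page.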
More broadly, many sites will never rank highly on search
engines. Their owners should concentrate on driving traffic
to them in other ways.
On the other hand, in parts of the web - such as libraries,
image archives and bodies dealing with geospatial information
- there is agreement about use of metadata and about specific
standards, for example Dublin Core.
Consistent use of metadata schemes, often as a consequence
of the management of information within each body's databases,
facilitates information exchange outside the web and,
for example, the operation of 'gateways' or sectoral search
engines that provide seamless access to the holdings of
a group of museums.
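To make the idea of a shared scheme concrete, the sketch below writes a record out as Dublin Core elements for the 'head' of a page. It assumes the common convention of DC.-prefixed names in HTML meta elements, with a schema.DC link pointing to the element set namespace. The function name dublin_core_meta and the sample record are our own inventions.

```python
from html import escape

# The fifteen elements of the simple Dublin Core element set.
DC_ELEMENTS = {
    "title", "creator", "subject", "description", "publisher",
    "contributor", "date", "type", "format", "identifier",
    "source", "language", "relation", "coverage", "rights",
}

DC_SCHEMA_LINK = '<link rel="schema.DC" href="http://purl.org/dc/elements/1.1/">'


def dublin_core_meta(record):
    """Render a dict of Dublin Core values as HTML link/meta elements.

    A value may be a single string or a list of strings, since Dublin Core
    elements are repeatable.
    """
    lines = [DC_SCHEMA_LINK]
    for element, value in record.items():
        if element not in DC_ELEMENTS:
            raise ValueError(f"not a Dublin Core element: {element}")
        values = value if isinstance(value, (list, tuple)) else [value]
        for item in values:
            content = escape(str(item), quote=True)
            lines.append(f'<meta name="DC.{element}" content="{content}">')
    return "\n".join(lines)


if __name__ == "__main__":
    # Hypothetical record used only for demonstration.
    print(dublin_core_meta({
        "title": "Metadata on the web",
        "subject": ["metadata", "search engines"],
        "language": "en",
    }))
```

Because every participating body uses the same fifteen element names, a gateway can merge records from several institutions without needing to know how each one stores its data internally.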
Preservation Metadata for Digital Objects: A Review of
the State of the Art (PDF)
is a concise overview by the US Research Libraries Group
of competing preservation metadata initiatives such as
the Open Archival Information System (OAIS) and CURL Exemplars
in Digital Archives (CEDARS).
and the future?
The idea of a standard set of terms and phrases as the
basis for online resource identification has been seductive
to librarians and information scientists but has not found
significant acceptance among most site creators and search
engine/directory developers. Two assumptions have impeded
past online metadata initiatives.
The first, what one observer characterised
as the "technological legacy of knowledge representation",
assumes the existence of "a class of disinterested
information workers (i.e., librarians)" responsible
for comprehensive and systematic subject cataloguing.
However, that class has little clout online. Businesses,
organisations and individuals can mark up their pages
as they please. There are few legal constraints or community
norms to prevent the use of 'false' metadata, with the
result that few search engines rely on metadata because
the unscrupulous will 'spoof' the search results.
Second, current metadata strategies are designed for "high-level document
properties": including topical descriptors and
phrases in the 'head' element of a page assumes that
the content will be stable.
Our Metrics guide points
to research into the volatility of online content that
undermines that assumption. Koehler's paper
on Digital Libraries & WWW Persistence for
example estimates that the 'half life' of a web page is
less than two years (with the half life of a site a bit
more than two years), while the 1997 Rate of Change
& other Metrics: a Live Study of the World Wide Web
paper
by Fred Douglis, Anja Feldmann & Balachander Krishnamurthy
and the 2000 paper
How dynamic is the web? by Brian Brewington &
George Cybenko estimate that 20% of pages are less than
twelve days old, with only 25% older than one year.