Metadata Profile: On the Web

Metadata on the web

This page looks at the use of metadata on the web.

     what is metadata?

Metadata is literally information about information. It may be very restricted in scope, such as a simple identification number. Or it may be descriptive, allowing the creation of indexes, lists and other tools that can be used for identification and for evaluation of information.

If you've used a library catalogue you've used such a tool. The catalogue is based on metadata - subject, author, publisher etc - about the books and other documents held by that institution.
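The catalogue idea can be sketched in a few lines of Python. This is a minimal illustration rather than a description of any real library system: the `Record` class, its field names and the `build_index` helper are our own assumptions, and the sample records simply reuse books cited elsewhere on this page.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Record:
    """Descriptive metadata for one document (field names are illustrative)."""
    author: str
    title: str
    publisher: str


# Sample records reuse books cited elsewhere on this page.
RECORDS = [
    Record("Svenonius, Elaine",
           "The Intellectual Foundation of Information Organisation",
           "MIT Press"),
    Record("Borgman, Christine",
           "From Gutenberg to the Global Information Infrastructure",
           "MIT Press"),
    Record("Belew, Richard",
           "Finding Out About",
           "Cambridge University Press"),
]


def build_index(records, field):
    """Group titles under each distinct value of one metadata field."""
    index = defaultdict(list)
    for record in records:
        index[getattr(record, field)].append(record.title)
    return dict(index)


publisher_index = build_index(RECORDS, "publisher")
for publisher in sorted(publisher_index):
    print(publisher, "->", len(publisher_index[publisher]), "title(s)")
```

The same function builds an author index by passing "author" as the field, which is the sense in which descriptive metadata allows the creation of indexes and lists.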

Metadata is one of the key features of the web. It is found within individual web pages, at varying levels of detail and using varying standards, highlighted below. And it's found in the search engines, directories and other tools for finding sites and individual pages. The next page of this guide looks at those engines and directories.

This site, indeed, can be viewed as metadata about information on the web and offline, since it identifies and evaluates several thousand sites, web documents and print publications.

In the metrics guide on this site we highlight some of the studies about the growth of the web.

There are now many millions of sites and hundreds of millions of pages. Many of those documents change periodically: one study suggests that the 'half life' of a page is less than two years, roughly half the time it takes for most books to go out of print, and one reason why many big sites (such as this one) have links that have "rotted". And domain names don't reveal all the treasures (or lack of them) within a site. The size and volatility of the web mean that it is beyond anyone to list the contents of all sites/pages and to provide an evaluation.

     classification and its consequences

The importance of identification and evaluation - so that your customers can search in a particular part of the haystack rather than attempting to scrutinise every piece of straw - is discussed in Elaine Svenonius' The Intellectual Foundation of Information Organisation (Cambridge: MIT Press 00).

She offers a demanding but comprehensive introduction to the theory underlying attempts to identify, categorise and retrieve the resources in the 'global digital library', ie information accessed via the web.

There's a more accessible overview of identification/evaluation issues and that library in Christine Borgman's From Gutenberg to the Global Information Infrastructure: Access To Information in the Networked World (Cambridge: MIT Press 00). It is strongly recommended.

Richard Belew's Finding Out About: Search Engine Technology From A Cognitive Perspective (Cambridge: Cambridge Uni Press 01) is a more theoretical study of search processes. The Advanced Internet Searcher's Handbook (London: Library Association 02) by Phil Bradley and The Invisible Web (01) by Chris Sherman & Gary Price provide guidance about online search techniques and resources.

Among specialist and general journals we recommend the Journal of Internet Cataloging (JIC), D-LIB and the terribly earnest Information Technology & Libraries (ITAL).

     the standards question

Internet engineering and standards bodies have not mandated detailed standards for metadata. That means, for example, that there's no standardized terminology and thesaurus (one reason why many librarians look at the web askance).

Essentially, in developing the web provision was made for the inclusion of metadata within pages/sites, allowing descriptive and other information to be embedded in each page among the 'invisible' code.

Provision was also made for construction of search engines and other tools to point to web pages, drawing on the embedded metadata or using their own metadata about those pages.
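As a rough illustration of embedded metadata and of a tool that draws on it, the sketch below reads the `meta` tags from a page using Python's standard `html.parser` module. The sample page, its description and its keywords are invented for the example, and the `MetaExtractor` class is our own name, not part of any search engine.

```python
from html.parser import HTMLParser


class MetaExtractor(HTMLParser):
    """Collect name/content pairs from <meta> tags in a page."""

    def __init__(self):
        super().__init__()
        self.metadata = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        name = attributes.get("name")
        content = attributes.get("content")
        if name and content is not None:
            self.metadata[name.lower()] = content


# Hypothetical page: the description and keywords are invented for illustration.
SAMPLE_PAGE = """
<html>
<head>
<title>Metadata on the web</title>
<meta name="description" content="An overview of metadata on the web">
<meta name="keywords" content="metadata, search engines, directories">
</head>
<body><p>Visible text that a reader actually sees.</p></body>
</html>
"""

extractor = MetaExtractor()
extractor.feed(SAMPLE_PAGE)
for name, content in extractor.metadata.items():
    print(f"{name}: {content}")
```

The description and keywords never appear in the rendered page, which is what is meant by metadata sitting among the 'invisible' code.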

That's had several results:

There's disagreement among specialist users about development of specific standards for the structuring and expression of embedded metadata. (Competing and complementary standards from librarians, museum curators, informatics specialists and others include the Dublin Core, AAT, CSDGM, GIS, CGIS-SAIF, Resource Description Framework and Warwick Framework. There's similar disagreement about content rating metadata, such as PICS, used in censorship or content management schemes.)

As Charles Thomas & Linda Griffin note in their First Monday article on Who Will Create The Metadata For The Internet?, while there are commercial incentives for effective metadata, the various schemes have to break out of the silicon ghetto.

The wide range of search engines and directories produce different results. There are now at least 2,000 search engines although most traffic goes to the top 11 such as Yahoo! and Google.

Most pages (and probably most sites) don't have descriptive metadata. Some studies suggest that only 34% have 'meaningful' metadata and that much metadata is not relevant to the particular site. Fewer than 0.3% of sites (and thus a much smaller fraction of the 'deep web' described in our metrics guide) use Dublin Core metadata.

Few major search engines rely on metadata supplied by the owners of sites. One industry figure quoted in Search Engine Watch comments "search engines do not trust metadata. It's fine to talk about how nice it would be if all web pages were categorized, but the search engines know from experience that people will lie, mislead or do whatever they can to get on top".

     where does it come from?

In practice metadata about a page originates in two ways.

The creator of the page can embed metadata when constructing (or amending) the page.

Some software used in building sites will automatically generate such metadata, albeit crudely; the alternative is to write it by hand, as we have done for each page on this site. Many creators are uncertain about the nature of metadata - what is it, where does it go, what terms to use - or see it as an afterthought rather than integral to electronic publishing.

A second way is the creation of metadata about the page by an unrelated entity, ie by something/someone that visits the page rather than by the page's owner.

Many search engines use 'robots' or 'spiders' to visit pages, look for significant terms within the text and add that information to the database that fuels the search engine (or flag the page as having objectionable content). Other engines and directories use humans to examine the pages and create the metadata.
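The robot's job can be sketched very roughly as follows. This is a toy illustration under our own assumptions, not how any particular engine works: "significant" is taken to mean "frequent and not a stop word", the stop-word list is deliberately tiny, and the names `TextExtractor` and `significant_terms` are hypothetical.

```python
import re
from collections import Counter
from html.parser import HTMLParser

# Small illustrative stop-word list; a real engine would use a far larger one.
STOP_WORDS = {"a", "an", "and", "the", "of", "to", "in", "is", "it",
              "on", "for", "about"}


class TextExtractor(HTMLParser):
    """Gather page text, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.chunks.append(data)


def significant_terms(html, limit=5):
    """Return the most frequent non-stop-word terms with their counts."""
    parser = TextExtractor()
    parser.feed(html)
    words = re.findall(r"[a-z]+", " ".join(parser.chunks).lower())
    counts = Counter(w for w in words if w not in STOP_WORDS and len(w) > 2)
    return counts.most_common(limit)


PAGE = ("<html><body><h1>Metadata</h1>"
        "<p>Metadata is information about information.</p>"
        "<script>var metadata = 1;</script></body></html>")

print(significant_terms(PAGE))
```

The resulting term counts are metadata created by the visitor rather than by the page's owner, which is why they are harder for the owner to manipulate than embedded keywords.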

     does it matter?

As you might expect, there's disagreement about what matters.

It's clear that most search engines ignore metadata embedded by creators. A 1997 report, for example, commented that "search engines do not trust metadata. It's fine to talk about how nice it would be if all web pages were categorized, but the search engines know from experience that people will lie, mislead or do whatever they can to get on top".

More broadly, many sites will never rank highly on search engines. Their owners should concentrate on driving traffic to them in other ways.

On the other hand, in parts of the web - such as libraries, image archives and bodies dealing with geospatial information - there is agreement about use of metadata and about specific standards, for example Dublin Core.

Consistent use of metadata schemes, often as a consequence of the management of information within each body's databases, facilitates information exchange outside the web and, for example, the operation of 'gateways' or sectoral search engines that provide seamless access to the holdings of a group of museums.
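To show what consistent use of one scheme might look like in practice, here is a small sketch that renders Dublin Core elements as HTML `meta` tags, following the common convention of a `DC.` prefix plus a `link` to the element set. The `dublin_core_tags` helper and the sample values are our own assumptions for illustration.

```python
from html import escape

# Conventional link declaring the Dublin Core element set for the "DC." prefix.
DC_SCHEMA_LINK = '<link rel="schema.DC" href="http://purl.org/dc/elements/1.1/">'


def dublin_core_tags(fields):
    """Render a mapping of Dublin Core element names to values as meta tags."""
    lines = [DC_SCHEMA_LINK]
    for element, value in fields.items():
        # escape() also converts quotes, so values cannot break the attribute.
        lines.append(
            f'<meta name="DC.{escape(element)}" content="{escape(value)}">'
        )
    return "\n".join(lines)


# Hypothetical values for a page like this one.
sample = {
    "title": "Metadata on the web",
    "subject": "metadata",
    "language": "en",
}

print(dublin_core_tags(sample))
```

Because every participating body uses the same element names, a gateway can merge records from several sources without translating between vocabularies.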

Preservation Metadata for Digital Objects: A Review of the State of the Art (PDF) is a concise overview by the US Research Libraries Group of competing preservation metadata initiatives such as the Open Archival Information System (OAIS) and CURL Exemplars in Digital Archives (CEDARS).

     and the future?

The idea of a standard set of terms and phrases as the basis for online resource identification has been seductive to librarians and information scientists but has not found significant acceptance among most site creators and search engine/directory developers. Two assumptions have impeded past online metadata initiatives.

What one observer characterised as the "technological legacy of knowledge representation" assumes the existence of "a class of disinterested information workers (i.e., librarians)" responsible for comprehensive and systematic subject cataloguing. However, that class has little clout online. Businesses, organisations and individuals can mark up their pages as they please. There are few legal constraints or community norms to prevent the use of 'false' metadata, with the result that few search engines rely on metadata because the unscrupulous will 'spoof' the search results.

Current metadata strategies are designed for "high-level document properties", with inclusion of topical descriptors and phrases in the 'head' element of a page assuming that the content will be stable.

Our Metrics guide points to research into the volatility of online content that undermines that assumption. Koehler's paper on Digital Libraries & WWW Persistence for example estimates that the 'half life' of a web page is less than two years (with the half life of a site a bit more than two years), while the 1997 Rate of Change & other Metrics: a Live Study of the World Wide Web paper by Fred Douglis, Anja Feldmann & Balachander Krishnamurthy and the 2000 paper How dynamic is the web? by Brian Brewington & George Cybenko estimate that 20% of pages are less than twelve days old, with only 25% older than one year.
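To get a feel for what a half life implies, the sketch below assumes simple exponential decay, which is our simplification and not a claim made by the papers cited. It takes two years as the half life, the upper bound suggested above for a page; the function name is hypothetical.

```python
def surviving_fraction(age_years, half_life_years):
    """Fraction of pages expected to survive after age_years,
    assuming exponential decay with the given half life."""
    return 0.5 ** (age_years / half_life_years)


# Assumed half life of two years (the upper bound quoted for a page).
HALF_LIFE_YEARS = 2.0

for age in (1, 2, 4):
    share = surviving_fraction(age, HALF_LIFE_YEARS)
    print(f"after {age} year(s): {share:.0%} of pages remain")
```

Under those assumptions roughly 71% of pages survive one year and only a quarter survive four, so topical descriptors written once in the 'head' of a page may soon describe content that has changed or gone.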
