Metadata Profile: On the Web

on the web

This page looks at use of metadata on the web.

the standards question

Internet engineering and standards bodies have not mandated detailed standards for metadata. That means, for example, that there's no standardized terminology and thesaurus (one reason why many librarians look at the web askance).

Essentially, in developing the web, provision was made for the inclusion of metadata within pages/sites, allowing descriptive and other information to be embedded in each page among the 'invisible' code.
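As an illustration of what that 'invisible' code can look like, the sketch below reads name/content pairs from HTML meta tags using Python's standard html.parser module. The sample page, the class name MetaExtractor and the function extract_meta are hypothetical and exist only for this example; real pages vary widely in which tags they carry, if any.

```python
from html.parser import HTMLParser


class MetaExtractor(HTMLParser):
    """Collect name/content pairs from <meta> tags in an HTML page."""

    def __init__(self):
        super().__init__()
        self.meta = {}

    def handle_starttag(self, tag, attrs):
        # The parser lowercases tag and attribute names for us.
        if tag != "meta":
            return
        attributes = dict(attrs)
        name = attributes.get("name")
        content = attributes.get("content")
        if name and content is not None:
            self.meta[name.lower()] = content


def extract_meta(html_text):
    """Return a dict of embedded metadata found in html_text."""
    parser = MetaExtractor()
    parser.feed(html_text)
    return parser.meta


# Hypothetical page, invented for this example.
SAMPLE_PAGE = """<html><head>
<title>Example page</title>
<meta name="description" content="A hypothetical page about metadata">
<meta name="keywords" content="metadata, web, search">
</head><body><p>Visible text.</p></body></html>"""

print(extract_meta(SAMPLE_PAGE))
```

A visitor's browser shows only "Visible text."; the description and keywords are available to any tool that reads the page source.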

Provision was also made for construction of search engines and other tools to point to web pages, drawing on the embedded metadata or using their own metadata about those pages.

That's had several results:

There's disagreement among specialist users about development of specific standards for the structuring and expression of embedded metadata. (Competing and complementary standards from librarians, museum curators, informatics specialists and others include the Dublin Core, AAT, CSDGM, GIS, CGIS-SAIF, Resource Description Framework and Warwick Framework. There's similar disagreement about content rating metadata such as PICS, used in censorship or content management schemes.) As Charles Thomas & Linda Griffin note in their First Monday article on Who Will Create The Metadata For The Internet?, while there are commercial incentives for effective metadata, the various schemes have to break out of the silicon ghetto.

The wide range of search engines and directories produce different results. There are now at least 2,000 search engines although most traffic goes to the top 11 such as Yahoo! and Google.

Most pages (and probably most sites) don't have descriptive metadata. Some studies suggest that only 34% have 'meaningful' metadata and that much metadata is not relevant to the particular site. Fewer than 0.3% of sites (and thus a much smaller fraction of the 'deep web' described in our metrics guide) use Dublin Core metadata.

Few major search engines rely on metadata supplied by the owners of sites. One industry figure quoted in Search Engine Watch comments "search engines do not trust metadata. It's fine to talk about how nice it would be if all web pages were categorized, but the search engines know from experience that people will lie, mislead or do whatever they can to get on top".

where does it come from?

In practice metadata about a page originates in two ways.

The creator of the page can embed metadata when constructing (or amending) the page.

Some software used in building sites will automatically generate such metadata, albeit crudely; by contrast, we have manually developed the metadata for each page on this site. Many creators are uncertain about the nature of metadata (what it is, where it goes, what terms to use) or see it as an afterthought rather than integral to electronic publishing.
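For creators unsure where metadata goes, a minimal sketch of the embedding step follows: it turns a set of hand-chosen fields into meta tags for the head of a page. The function name build_meta_tags and the field values are assumptions made for illustration, not the output of any particular site-building tool.

```python
from html import escape


def build_meta_tags(fields):
    """Render a dict of metadata fields as HTML <meta> tags, one per line."""
    lines = []
    for name, content in fields.items():
        # Escape quotes and ampersands so the values cannot break the markup.
        lines.append(
            '<meta name="{}" content="{}">'.format(escape(name), escape(content))
        )
    return "\n".join(lines)


# Hypothetical field values chosen by the page's creator.
page_fields = {
    "description": "Notes on metadata & the web",
    "keywords": "metadata, web, search engines",
}

print(build_meta_tags(page_fields))
```

The resulting lines belong inside the page's head element, where browsers do not display them.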

A second way is the creation of metadata about the page by an unrelated entity, i.e. by something/someone that visits the page rather than by the page's owner.

Many search engines use 'robots' or 'spiders' to visit pages, look for significant terms within the text, and incorporate that information within the database that fuels the search engine or that flags pages as having objectionable content. Other engines and directories use humans to examine the pages and create the metadata.
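The robot's side of that process can be sketched very roughly as below: strip the markup, count the remaining words and keep the most frequent ones as candidate index terms. This is a simplified illustration under stated assumptions, not how any particular search engine works; the stopword list, the sample page and the names TextExtractor and significant_terms are all invented for the example.

```python
import re
from collections import Counter
from html.parser import HTMLParser

# A tiny, hypothetical stopword list; real indexers use far larger ones.
STOPWORDS = {"the", "and", "is", "are", "of", "to", "in", "for", "that", "with"}


class TextExtractor(HTMLParser):
    """Collect visible text, skipping the contents of script and style tags."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.chunks.append(data)


def significant_terms(html_text, limit=5):
    """Return the most frequent non-stopword terms as (term, count) pairs."""
    parser = TextExtractor()
    parser.feed(html_text)
    words = re.findall(r"[a-z]+", " ".join(parser.chunks).lower())
    counts = Counter(w for w in words if w not in STOPWORDS and len(w) > 2)
    return counts.most_common(limit)


# Hypothetical page, invented for this example.
SAMPLE = (
    "<html><head><title>Museum catalogue</title>"
    "<style>p {color: red}</style></head>"
    "<body><p>The museum catalogue lists paintings. "
    "The catalogue is searchable.</p></body></html>"
)

print(significant_terms(SAMPLE, limit=2))
```

Note that the terms come from the text of the page itself, so the result does not depend on any metadata the owner chose to embed.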

does it matter?

As you might expect, there's disagreement about what matters.

It's clear that most search engines ignore metadata embedded by creators. A 1997 report, for example, commented that "search engines do not trust metadata. It's fine to talk about how nice it would be if all web pages were categorized, but the search engines know from experience that people will lie, mislead or do whatever they can to get on top".

More broadly, many sites will never rank highly on search engines. Their owners should concentrate on driving traffic to them in other ways.

On the other hand, in parts of the web - such as libraries, image archives and bodies dealing with geospatial information - there is agreement about use of metadata and about specific standards, for example Dublin Core.

Consistent use of metadata schemes, often as a consequence of the management of information within each body's databases, facilitates information exchange outside the web and, for example, the operation of 'gateways' or sectoral search engines that provide seamless access to the holdings of a group of museums.
