Metadata Glossary

To summarize the relationships among various metadata formats and how they relate to building Internet systems, I wrote a glossary. I then ordered the terms and tied them together with a bit of narrative to explain how they relate to one another.

Metadata - a definition or description of data. In data processing, metadata is definitional data that provides information about, or documentation of, other data managed within an application or environment.

For example, metadata would document data elements or attributes (name, size, data type, etc.), records or data structures (length, fields, columns, etc.), and the data itself (where it is located, how it is associated, who owns it, etc.). Metadata may also include descriptive information about the context, quality, condition, or characteristics of the data.

For example, the data of a newspaper story is the headline and the story, whereas the metadata describes who wrote it, when and where it was published, and what section of the newspaper it appears in. Metadata can help us determine who content is for and where, how, and when it should appear.

New York Feels Like Mudville
It's only mid-October, and it already feels like this could be the longest, loneliest sports season in 15 years. (Oct 15, 2002)
Mayor Renews Call for Arena, With Pataki's Backing
Sports of The Times; Keeping Their Core Together
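The split between data and metadata in the newspaper example can be sketched in code. This is only an illustration: the class and field names are hypothetical, and the "Sports" section value is an assumption, since the excerpt above does not name a section.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class Metadata:
    published: date
    section: str  # assumed value below; the excerpt does not name a section


@dataclass
class Story:
    headline: str   # data: what the reader reads
    body: str       # data: what the reader reads
    meta: Metadata  # metadata: data about the story


story = Story(
    headline="New York Feels Like Mudville",
    body="It's only mid-October, and it already feels like this could be "
         "the longest, loneliest sports season in 15 years.",
    meta=Metadata(published=date(2002, 10, 15), section="Sports"),
)


def belongs_on(story: Story, section: str) -> bool:
    """Decide where a story appears from its metadata, not from its text."""
    return story.meta.section == section
```

A page template could call belongs_on(story, "Sports") to decide whether to list the story, without ever inspecting the headline or body.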


Consistency in the metadata is necessary to keep information organized. Consistent terminology helps us talk about metadata, and it helps applications process the metadata. We say consistent metadata is "controlled"...

Controlled Vocabulary - A collection of preferred terms used to make retrieval of content more precise. Controlled vocabulary terms can be used for categorizing content, building labeling systems, and creating style guides and database schemas. One type of controlled vocabulary is a taxonomy.
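A minimal sketch of how an application might enforce a controlled vocabulary: every variant a person might type maps to one preferred term. The variants listed here are invented for illustration.

```python
from typing import Optional

# Hypothetical vocabulary: each variant maps to exactly one preferred term.
PREFERRED_TERMS = {
    "nyc": "New York City",
    "new york city": "New York City",
    "big apple": "New York City",
    "ny state": "New York State",
    "new york state": "New York State",
}


def normalize(term: str) -> Optional[str]:
    """Return the preferred term, or None if the term is not in the vocabulary."""
    return PREFERRED_TERMS.get(term.strip().lower())
```

Returning None for an unknown term, rather than passing it through, is a design choice: it forces someone to decide whether the new term belongs in the vocabulary.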


Once controlled, the metadata can be organized in various ways to reflect how information exists in reality. Several schemes exist for organizing metadata, such as taxonomies, thesauri, and ontologies...

Taxonomy - a set of controlled vocabulary terms, usually hierarchical. Once created, it can help inform navigation and search systems. An example of a simple or "enumerative" taxonomy:

United States

            New York State

                        New York City
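The enumerative taxonomy above could be held in a nested structure like the following sketch. The function name is hypothetical; the point is that the hierarchy alone is enough to derive a breadcrumb-style trail for navigation.

```python
# The enumerative taxonomy as nested dictionaries: each key is a term,
# each value holds that term's narrower terms.
TAXONOMY = {
    "United States": {
        "New York State": {
            "New York City": {},
        },
    },
}


def path_to(term, tree=TAXONOMY, trail=()):
    """Return the chain of terms from the top of the taxonomy down to `term`."""
    for name, children in tree.items():
        here = trail + (name,)
        if name == term:
            return here
        found = path_to(term, children, here)
        if found:
            return found
    return None  # the term is not in this taxonomy
```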


"Faceted" taxonomies, having also attributes and attribute values, allow for many more possibilities:

New York City

            Central Park

                        Area: 843 acres
                        Pedestrian paths: 58 miles
                        Trees: 26,000
                        Benches: 8,968
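As a sketch of why attributes open up "many more possibilities", the faceted entry above can be stored with its attribute values and then filtered on them. The second park is invented purely for contrast, and the key names are assumptions.

```python
# Each term keeps its place in the hierarchy plus attribute/value pairs (facets).
TERMS = {
    "Central Park": {
        "parent": "New York City",
        "facets": {"area_acres": 843, "path_miles": 58, "trees": 26000, "benches": 8968},
    },
    "Hypothetical Pocket Park": {  # invented entry, for contrast only
        "parent": "New York City",
        "facets": {"area_acres": 2, "benches": 12},
    },
}


def find(parent, **minimums):
    """Return terms under `parent` whose facets meet every minimum value."""
    return sorted(
        name
        for name, term in TERMS.items()
        if term["parent"] == parent
        and all(term["facets"].get(facet, 0) >= value
                for facet, value in minimums.items())
    )
```

A call such as find("New York City", area_acres=100) narrows by attribute, which a purely enumerative taxonomy cannot do.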

Note that taxonomies are conceptual organization schemes and do not directly map to pages or navigation items or labels. Reference:

Thesaurus - A taxonomy that also includes associated and related terms. It is the most complex type of controlled vocabulary, and is sometimes used to standardize an organization's terminology and subsequently inform both navigation and search systems.

Example of a thesaurus:
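One hypothetical way to picture a thesaurus entry in code uses the conventional relationship labels: BT (broader term), NT (narrower term), RT (related term), and UF ("used for", meaning non-preferred synonyms). The specific related term and synonyms below are assumptions made for this sketch.

```python
THESAURUS = {
    "New York City": {
        "BT": ["New York State"],    # broader term
        "NT": ["Central Park"],      # narrower term
        "RT": ["Hudson River"],      # related term (assumed for illustration)
        "UF": ["NYC", "Big Apple"],  # non-preferred synonyms (assumed)
    },
}


def expand_query(term):
    """Widen a search by adding the term's synonyms and related terms."""
    entry = THESAURUS.get(term, {})
    return [term] + entry.get("UF", []) + entry.get("RT", [])
```

This is how a thesaurus can inform search: a query for the preferred term also matches documents that use a synonym or a closely related term.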


Ontology - Ontologies resemble faceted taxonomies but use richer semantic relationships among terms and attributes, as well as strict rules about how to specify terms and relationships. Because ontologies do more than just control a vocabulary, they are thought of as knowledge representation. The oft-quoted definition of ontology is "the specification of one's conceptualization of a knowledge domain."

Example of concepts and relationships in an ontology:

Ontologies, because they are machine-readable, allow applications to be standardized while the domain-specific information can be customized over time.

Ontologies aim to move the complexity of the system into how the information is organized rather than into the application that processes that information. Reference:

We populate our organizational scheme with information by indexing the information...

Indexing - The intellectual analysis of the subject matter of a document to identify the concepts represented in the document and the allocation of descriptors to allow these concepts to be retrieved. Indexing a large number of documents can be done semi-automatically using software applications.
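A rough sketch of the semi-automatic part: software suggests descriptors by matching words in a document against a vocabulary, and a person would then review the suggestions. The vocabulary, descriptor names, and document IDs are all hypothetical.

```python
from collections import defaultdict

# Hypothetical mapping from words found in text to controlled descriptors.
VOCABULARY = {
    "arena": "Sports Venues",
    "stadium": "Sports Venues",
    "mayor": "City Government",
}


def suggest_descriptors(text):
    """Suggest descriptors for a document by matching its words to the vocabulary."""
    words = {word.strip(".,;:!?").lower() for word in text.split()}
    return sorted({VOCABULARY[word] for word in words if word in VOCABULARY})


def build_index(documents):
    """Map each descriptor to the set of document IDs it was assigned to."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for descriptor in suggest_descriptors(text):
            index[descriptor].add(doc_id)
    return index


documents = {
    "story-1": "Mayor Renews Call for Arena, With Pataki's Backing",
    "story-2": "Sports of The Times; Keeping Their Core Together",
}
index = build_index(documents)
```

Simple word matching misses concepts that are implied rather than named, which is why the intellectual analysis in the definition still matters.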

The administrative application we use to record information and metadata is called a content management system...

Content Management System (CMS) - A combined management and publishing application for creating, modifying, and removing information resources and metadata in an organized repository; it includes tools for publishing, format management, revision control, categorization, and workflow.


We can create a structure like an ontology and use a CMS to populate the structure with information about a topic, resulting in a complete representation of that topic called a knowledge base...

Knowledge Base - An ontology populated with data.
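To make "an ontology populated with data" concrete, here is a minimal sketch, assuming a toy ontology with two classes and one relationship. The class, relation, and method names are all hypothetical. The ontology supplies the rules, and the knowledge base holds instances and facts that must obey them.

```python
# Toy ontology: the classes that exist and the relationships allowed among them.
ONTOLOGY = {
    "classes": {"City", "Park"},
    "relations": {"locatedIn": ("Park", "City")},  # relation: (subject class, object class)
}


class KnowledgeBase:
    def __init__(self, ontology):
        self.ontology = ontology
        self.instances = {}  # instance name -> class name
        self.facts = set()   # (subject, relation, object) triples

    def add_instance(self, name, cls):
        if cls not in self.ontology["classes"]:
            raise ValueError(f"unknown class: {cls}")
        self.instances[name] = cls

    def add_fact(self, subject, relation, obj):
        expected = self.ontology["relations"][relation]
        actual = (self.instances.get(subject), self.instances.get(obj))
        if actual != expected:
            raise ValueError("fact violates the ontology's rules")
        self.facts.add((subject, relation, obj))


kb = KnowledgeBase(ONTOLOGY)
kb.add_instance("New York City", "City")
kb.add_instance("Central Park", "Park")
kb.add_fact("Central Park", "locatedIn", "New York City")
```

Note that the application code never mentions cities or parks; changing the domain means changing ONTOLOGY, not the class.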

A knowledge base used within a company could support decision-making and increase the intelligence of the business...

Business Intelligence - a popularized, umbrella term used to describe a set of concepts and methods to improve business decision making by using fact-based support systems. The term is sometimes used interchangeably with briefing books and executive information systems.

As the existing human-readable content on the Internet is augmented by machine-readable knowledge bases, it will be easier for computers to discern meaning in this "semantic" web...

Semantic Web - The Semantic Web is an extension of the current Web that will allow you to find, share, and combine information more easily. It relies on machine-readable information and metadata expressed in RDF.


Organization schemes like ontologies are conceptual; they reflect the ways we think. To convert these conceptual schemes into a format that a software application can process we need more concrete representations...

Data Model - A description of data that consists of all entities represented in a data structure or database and the relationships that exist among them. It is more concrete than an ontology but more abstract than a database dictionary (the physical representation).

Resource Description Framework (RDF) - a W3C standard framework, usually serialized as XML, for describing and interchanging metadata. The simple model of resources, properties, and statements allows RDF to describe rich metadata, such as ontological structures. As opposed to Topic Maps, RDF is more decentralized because the XML is usually stored along with the resources.
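The resource/property/statement model can be sketched without any RDF library as a list of three-part statements. The story URI is a hypothetical example.org address; the property URIs are the Dublin Core "title" and "date" elements. This models the idea only and is not an RDF serialization.

```python
DC = "http://purl.org/dc/elements/1.1/"         # Dublin Core element namespace
STORY = "http://example.org/stories/mudville"   # hypothetical resource URI

# Each statement says: this resource has this property with this value.
statements = [
    (STORY, DC + "title", "New York Feels Like Mudville"),
    (STORY, DC + "date", "2002-10-15"),
]


def values(resource, prop):
    """Return every value stated for a given property of a resource."""
    return [value for res, p, value in statements if res == resource and p == prop]
```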


Topic Maps - An ISO standard for describing knowledge structures and associating them with information resources. The topics, associations, and occurrences that comprise topic maps allow them to describe complex structures such as ontologies. They are usually implemented using XML (XML Topic Maps, or XTM). As opposed to RDF, Topic Maps are more centralized because all information is contained in the map rather than associated with the resources.
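The three building blocks named above (topics, associations, and occurrences) might be modeled as in this sketch. It is not XTM syntax; the class names, the association type, and the example.org URL are assumptions. Everything lives in one map, which illustrates the centralization described above.

```python
from dataclasses import dataclass, field


@dataclass
class Topic:
    name: str
    occurrences: list = field(default_factory=list)  # links out to resources


@dataclass
class Association:
    kind: str
    members: tuple  # names of the topics being related


topic_map = {
    "topics": {
        "Central Park": Topic(
            "Central Park",
            ["http://example.org/parks/central-park.html"],  # hypothetical resource
        ),
        "New York City": Topic("New York City"),
    },
    "associations": [
        Association("located-in", ("Central Park", "New York City")),
    ],
}


def related(name):
    """Return the topics associated with the named topic."""
    return [
        member
        for assoc in topic_map["associations"]
        if name in assoc.members
        for member in assoc.members
        if member != name
    ]
```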


The concrete representations above should use industry-standard syntax to increase compatibility with other applications. The most common standard used is XML...

Extensible Markup Language (XML) - a W3C standard markup language for documents containing structured information. As opposed to HTML, which is designed specifically for web browsers, XML is designed for much wider use and is extensible to fit each application. XML is the basis for an incredible array of standards that describe everything from messages between systems to security specifications to document structures. The advantage of XML is that it is human-readable and platform-independent.


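<?xml version="1.0"?>
<!-- Well-formed XML needs a single root element; the name "location" is illustrative -->
<location>
  <country>United States</country>
  <state>New York</state>
  <city>New York</city>
  <park>Central Park</park>
</location>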
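As a sketch of how an application might read markup like this, the following uses Python's standard xml.etree.ElementTree module. It assumes the four elements are wrapped in a single root element (here a hypothetical location element), because an XML parser rejects a document with more than one top-level element.

```python
import xml.etree.ElementTree as ET

# The example elements, wrapped in one root element so the document is well-formed.
DOCUMENT = """<?xml version="1.0"?>
<location>
  <country>United States</country>
  <state>New York</state>
  <city>New York</city>
  <park>Central Park</park>
</location>"""

root = ET.fromstring(DOCUMENT)

# Map each child element's tag name to its text content.
place = {child.tag: child.text for child in root}
```

Because the tags name what each value means, the program can ask for place["park"] directly instead of guessing from position.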


Last updated May 2003
Victor Lombardi