Lecture: Introduction to controlled vocabularies and authority control


Controlling values

The equivalence relationship

The vocabulary problem

What is this?

Synonymy

Restroom, bathroom, toilet, loo, facilities, WC, ladies’ room, men’s room, little girls’ room, little boys’ room...

Synonymy: Using different words to identify the same concept.

Another vocabulary problem

What is mercury?

What is bank?

What is python?

What is java?

Polysemy

Polysemy: Using the same word (morphologically speaking) to identify different concepts.

Java: Island in Indonesia, variety of coffee bean, generic term for coffee, object-oriented programming language.

Yet more vocabulary problems

The White House has been lobbying Congress to support the proposed budget. . .

Freedom of the press is an important value in the United States. . .

I’m tired of taking the bus; I need some new wheels.


Metonymy and synecdoche

Metonymy: Using a related concept to stand for another concept.

Synecdoche: Using the word for part of something to stand for the entire thing.

Do people label consistently?

No.

Furnas and colleagues asked people (including subject experts) to label a variety of items (recipes, text editing operations, “common content objects”). Surprise: there was little agreement among the names submitted by participants.

Conclusion: “The idea of an ‘obvious,’ ‘self-evident,’ or ‘natural’ term is a myth! Since even the best possible name is not very useful, it follows that there can exist no rules, guidelines or procedures for choosing a good name, in the sense of ‘accessible to the unfamiliar user.’”

What to do?

Furnas and colleagues suggest that interface designers:

Implement unlimited aliasing.

Disambiguate terms that can be used in multiple senses by presenting possibilities to users and asking them to select the appropriate one.

Limitations of Furnas study

Participants were asked to label objects, not to say how they would search for them.

The study assumes a search interface, not a browsing (or menu-driven) interface.

In a search interface, users must recall or guess an object’s name. In a browsing interface, users merely need to recognize the appropriate term.

Vocabulary problems and information systems

Designers of information organization systems have long grappled with the ambiguities of language.

Synonymy, polysemy, and related problems complicate the goal of collocating, or bringing together, like items in an information system.

Vocabulary control

In LIS, vocabulary control is similar to Furnas’s idea of aliasing: concepts are associated with their synonyms.

One term is designated as preferred: this is the term used in a display. Other labels associated with the concept are used in searching.

Example: Search Nordstrom.com for “frock” and get “dresses” instead.

Example of a controlled term

Preferred term: bathroom

Equivalent terms: restroom, loo, toilet, WC, ladies’ room, mens’ room, little girls’ room, little boys’ room, ladies room, ladys room, lady’s room, ladie’s room, ladys’ room...
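The bathroom example can be sketched as a small lookup table. This is a minimal illustration, not how any particular system implements vocabulary control; the names normalize, build_index, and resolve are hypothetical, and only a few of the equivalent terms are included.

```python
# Hypothetical controlled term, based on the bathroom example above.
PREFERRED = "bathroom"
EQUIVALENTS = ["restroom", "loo", "toilet", "WC", "ladies' room", "ladies room"]


def normalize(label):
    """Lowercase and collapse whitespace so 'WC' and '  wc ' match."""
    return " ".join(label.lower().split())


def build_index(preferred, equivalents):
    """Map every label (preferred or nonpreferred) to the preferred term."""
    index = {normalize(preferred): preferred}
    for label in equivalents:
        index[normalize(label)] = preferred
    return index


def resolve(query, index):
    """Return the preferred term for a query, or None if it is not controlled."""
    return index.get(normalize(query))


INDEX = build_index(PREFERRED, EQUIVALENTS)
print(resolve("Loo", INDEX))      # bathroom
print(resolve("kitchen", INDEX))  # None
```

The preferred term is what gets displayed; every other label exists only so that a search can find its way to it.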

Equivalence can be relative

Similar concepts may be treated as equivalents; this is a design decision by the vocabulary creator.

Example

Vocabulary includes this preferred term: Beer

These terms are designated as equivalents: ale, porter, stout, pilsner, bock, IPA, malt liquor, barley wine.

Disambiguation in vocabularies

Polysemous terms are often identified by adding qualifying terms in parentheses.

Mercury (chemical element)

Mercury (god in Roman mythology)

Search engines may ask users to select the sense they want.
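One way to picture qualified headings is as a table from a bare term to its senses. The sketch below is illustrative only: the SENSES table and both function names are assumptions, the Java qualifiers are shortened from the senses listed earlier, and "planet" is an added sense for Mercury.

```python
# Hypothetical vocabulary: each bare term maps to the qualifiers for its senses.
SENSES = {
    "java": ["island", "coffee", "programming language"],
    "mercury": ["chemical element", "planet"],
}


def candidate_headings(query):
    """Return the qualified headings for a query, or an empty list if unknown."""
    key = query.strip().lower()
    return [f"{key.capitalize()} ({qualifier})" for qualifier in SENSES.get(key, [])]


def needs_disambiguation(query):
    """True when the query matches more than one sense."""
    return len(candidate_headings(query)) > 1


if needs_disambiguation("Java"):
    # A search interface would present these options and let the user pick one.
    for heading in candidate_headings("Java"):
        print(heading)
```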

Digression into the library catalog

Library catalogs have three traditional access points: author, title, and subject. In the old card catalog, these were the three ways that users could search.

Each of these access points has associated vocabulary control.

Control of names

In library cataloging, controlled vocabularies for authors, titles, and subjects are called authority files.

Authority files both disambiguate names that identify multiple people or items and group variations for the same person or item (that is, they deal with polysemy and synonymy).

Authority file examples

In the UT author authority file, headings for Patricia Williams:

Names are disambiguated by using middle initials and dates of birth.

Cross references are used for some authors.

There may still be two headings for one person.
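The two jobs of an authority file (separating different people who share a name, and gathering variant names for one person) can be sketched as follows. The records are invented for illustration and are not taken from the UT file or any real authority file; AuthorityRecord and headings_for are hypothetical names.

```python
from dataclasses import dataclass, field


@dataclass
class AuthorityRecord:
    heading: str                                   # authorized form of the name
    see_from: list = field(default_factory=list)   # cross references (variant forms)


# Invented records: middle initials and birth dates tell the two people apart.
RECORDS = [
    AuthorityRecord("Williams, Patricia A., 1940-"),
    AuthorityRecord("Williams, Patricia B., 1962-", see_from=["Jones, Patricia B."]),
]


def headings_for(name):
    """Return authorized headings whose heading or cross reference starts with name."""
    needle = name.lower()
    hits = []
    for record in RECORDS:
        forms = [record.heading] + record.see_from
        if any(form.lower().startswith(needle) for form in forms):
            hits.append(record.heading)
    return hits


# One name, several people: the user must choose among the headings.
print(headings_for("Williams, Patricia"))
# A variant name leads to the single authorized heading.
print(headings_for("Jones, Patricia"))
```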

Fun digression: Pseudonyms in the catalog

The current catalog maintains pseudonymous identities (in older catalogs, everything went under the author’s real name).

For example, “Carolyn Keene,” the name used by multiple people as the author for the Nancy Drew novels, is maintained as an author entity in the authority file.

Thesauri

Thesauri are a type of controlled vocabulary that includes equivalence, hierarchical, and associative relationships. Thesauri can also be faceted (that is, represent multiple aspects of a concept; we will discuss facets in depth later).

Thesauri are often developed to deal with subjects of documents, and we will talk a lot about this beginning in a few weeks.

Example thesaurus entry

Dark chocolate

BT Chocolate

RT Single-origin chocolate

UF Semisweet chocolate, Baker’s chocolate, Sweet chocolate

SN Chocolate without milk solids and with less than 70 percent chocolate mass.

BT: broader term, one level up in a hierarchy

RT: related term, in another facet or hierarchical branch

UF: Use for; synonyms, or nonpreferred terms

SN: Scope note; definitions or usage guidelines
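The entry above maps naturally onto a small record type. This is a sketch of one possible representation, not a standard thesaurus format; the class and field names are assumptions, and the data is copied from the example entry.

```python
from dataclasses import dataclass, field


@dataclass
class ThesaurusEntry:
    term: str                                      # preferred term
    broader: list = field(default_factory=list)    # BT
    related: list = field(default_factory=list)    # RT
    use_for: list = field(default_factory=list)    # UF (nonpreferred terms)
    scope_note: str = ""                           # SN


ENTRY = ThesaurusEntry(
    term="Dark chocolate",
    broader=["Chocolate"],
    related=["Single-origin chocolate"],
    use_for=["Semisweet chocolate", "Baker's chocolate", "Sweet chocolate"],
    scope_note="Chocolate without milk solids and with less than 70 percent chocolate mass.",
)


def preferred_term(label, entries):
    """Return the preferred term for a label, following UF references."""
    wanted = label.lower()
    for entry in entries:
        labels = [entry.term] + entry.use_for
        if wanted in (candidate.lower() for candidate in labels):
            return entry.term
    return None


print(preferred_term("semisweet chocolate", [ENTRY]))  # Dark chocolate
print(ENTRY.broader)                                   # ['Chocolate']
```

The UF list does the same work as the equivalent terms in a simple controlled vocabulary; BT and RT are what a thesaurus adds on top.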

Controlled vocabulary example: MeSH and PubMed

Medical Subject Headings (MeSH) are used to index journal articles in the PubMed database.

Keyword searches in PubMed are automatically expanded with MeSH. Searches can also be explicitly limited to MeSH terms, which can increase precision.
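The idea of expanding a keyword with a controlled heading can be sketched as below. This is a deliberate simplification and not PubMed's actual term-mapping algorithm; the one-entry ENTRY_TERMS table and the function names expand and mesh_only are assumptions made for illustration. The bracketed field tags follow PubMed's search-tag style.

```python
# Illustrative mapping from a user's keyword to a subject heading.
ENTRY_TERMS = {"heart attack": "myocardial infarction"}


def clean(query):
    """Lowercase and collapse whitespace."""
    return " ".join(query.lower().split())


def expand(query):
    """Search the mapped heading as well as the user's own words."""
    key = clean(query)
    heading = ENTRY_TERMS.get(key)
    if heading is None:
        return f'"{key}"[All Fields]'
    return f'"{heading}"[MeSH Terms] OR "{key}"[All Fields]'


def mesh_only(query):
    """Limit the search to the heading, trading some recall for precision."""
    key = clean(query)
    return f'"{ENTRY_TERMS.get(key, key)}"[MeSH Terms]'


print(expand("Heart attack"))
print(mesh_only("Heart attack"))
```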

The comparison to a system like Google Scholar is illuminating.

Summary

Controlled vocabularies increase recall in searching by identifying equivalent terms, and increase precision by disambiguating terms with multiple senses.
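A toy calculation shows the recall side of this claim. The five-document collection is invented, and the example assumes each document is described by a single word; it does not demonstrate the precision gain from disambiguation.

```python
# Invented collection: document id -> the word its author happened to use.
DOCS = {1: "bathroom", 2: "restroom", 3: "loo", 4: "kitchen", 5: "toilet"}
RELEVANT = {1, 2, 3, 5}   # documents that are actually about bathrooms
EQUIVALENTS = {"bathroom", "restroom", "loo", "toilet"}


def search(terms):
    """Return ids of documents whose word matches any of the search terms."""
    return {doc_id for doc_id, word in DOCS.items() if word in terms}


def precision(retrieved, relevant):
    """Fraction of retrieved documents that are relevant."""
    return len(retrieved & relevant) / len(retrieved) if retrieved else 0.0


def recall(retrieved, relevant):
    """Fraction of relevant documents that were retrieved."""
    return len(retrieved & relevant) / len(relevant) if relevant else 0.0


plain = search({"bathroom"})       # exact-match search on one word
expanded = search(EQUIVALENTS)     # search expanded with equivalent terms

print(precision(plain, RELEVANT), recall(plain, RELEVANT))        # 1.0 0.25
print(precision(expanded, RELEVANT), recall(expanded, RELEVANT))  # 1.0 1.0
```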

Authority files are a type of controlled vocabulary.

Thesauri are subject-based controlled vocabularies that include hierarchical and associative relationships in addition to equivalence relationships. Thesauri can also be used as browsing interfaces.
