The Accidental Taxonomist

Sunday, March 24, 2024

History of Modern Information Taxonomies

The word “taxonomy” was coined in 1813 by the Swiss botanist A. P. de Candolle, who developed a new method of classifying plants. The word is derived from the combination of Greek words τάξις (taxis), meaning “order” or “arrangement,” and νόμος (nomos), meaning “method” or “law.” The designation of taxonomy was then applied after the fact to Carl Linnaeus’s binomial nomenclature system, which had been published under the title Systema Naturae, initially in 1735.

Today’s information taxonomies have their origins in a combination of classification systems, library subject heading schemes, and literature retrieval thesauri, and thus have features that combine all of these. Despite their name, information taxonomies are closer to subject heading schemes and thesauri than they are to classification systems.

Classification systems

Classification systems have a multi-level hierarchy of classes, where a subclass is fully contained in its parent class, and consequently members of a subclass are also members of the parent class. Members (things) can belong to only one class, though. Historic examples include:

Linnaean classification of organisms (1735-1758)
Paris Bookseller's classification (1842)
International Classification of Diseases (originally Bertillon Classification of Causes of Death, 1860)
Dewey Decimal Classification (1876) and other library classifications
Industry classification systems:

Standard Industrial Classification System (U.S.) (1937)
International Standard Industrial Classification (U.N.) (1948)

The requirement that a thing (an organism, book, document, medical diagnosis, economic establishment) can go into only one class supports various purposes, which are not for information retrieval:

Understanding an organism’s evolutionary background; identifying potential medicinal herbs
Locating and reshelving a book on its shelf
Performing health data analysis from hospital records; billing health insurance companies appropriately
Doing economic analysis of industries from aggregate establishment data

When it comes to information resources, classification systems may be used to determine in what (virtual) file folder a document belongs, or to support machine-learning-based auto-classification.

Classification systems are also useful for data analysis, since content or records are assigned to only one classification, and this prevents any double counting. Large, data-heavy organizations might have developed their own internal classification systems for data tracking purposes. Such classifications do not serve the same purpose as a tagging/information retrieval taxonomy and should not substitute for a taxonomy, but rather exist alongside one for separate purposes.
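To illustrate why single-class assignment prevents double counting, here is a minimal Python sketch. The class codes, the records, and the function name rollup_counts are all hypothetical, invented only for this illustration; they do not come from any real classification system.

```python
from collections import Counter

# Hypothetical two-level classification: each class code maps to its parent
# (None marks a top-level class).
PARENT = {
    "A": None,
    "A1": "A",
    "A2": "A",
    "B": None,
    "B1": "B",
}

# Each record is assigned to exactly one class. A dict enforces this,
# because a key can hold only one value.
RECORD_CLASS = {
    "rec-1": "A1",
    "rec-2": "A1",
    "rec-3": "A2",
    "rec-4": "B1",
}


def rollup_counts(record_class, parent):
    """Count records per class, crediting each record to its class and all ancestors."""
    counts = Counter()
    for cls in record_class.values():
        while cls is not None:
            counts[cls] += 1
            cls = parent[cls]
    return counts


counts = rollup_counts(RECORD_CLASS, PARENT)

# Because every record sits in one class only, the top-level totals
# add up to the number of records, with nothing counted twice.
top_level_total = sum(n for cls, n in counts.items() if PARENT[cls] is None)
print(counts["A"], counts["B"], top_level_total)
```

If a record could be assigned to both A1 and B1, the top-level totals would exceed the number of records, which is the double counting that a classification system is designed to avoid.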

Subject heading schemes

Subject heading schemes were developed to help people find books, and later also articles, on various subjects with more detail and flexibility for growth than classification systems. Subject headings are used for cataloguing and indexing, not for classification. Unlike classification (for shelf location), in which an item has only one class, subject cataloguing allows an item (book, article, other media) to have multiple subjects.

Features of subject heading schemes:

Alphabetical arrangement of a very large number of subjects and/or named entities (proper nouns)
Cross-references of See (Use) and See also (Related)
Headings with large numbers of citations broken down to group the citations by a sub-heading or subdivision, in what is also called pre-coordination. For example, China – Foreign relations.

Back-of-the-book indexes, whose format evolved over the first half of the 20th century, follow a similar style.

Examples of early subject heading schemes:

Library of Congress Subject Headings (1898) and other national library systems
U.S. National Library of Medicine’s Medical Subject Headings (1954)

Library subject headings were adopted for periodical article indexes early on. The Reader’s Guide to Periodical Literature, published by the H.W. Wilson Company, had been using subject headings, including subdivisions and cross-references, since shortly after its introduction in 1901 (as can be seen in the 1900-1905 cumulative index excerpted in the screenshot below).

(The two-digit years are from the prior century.)

Eventually, subject heading schemes adopted the thesaurus features of Broader term, Narrower term, and Related term relationships, as was the case for Library of Congress Subject Headings, starting in 1985. Thus, subject heading schemes and thesauri have become very similar. The name “heading” in subject headings implies that there also exist some sub-headings/subdivisions, a feature which is not typical of thesauri, though.
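The relationship types mentioned above can be sketched as a small data structure. This is only an illustrative Python sketch: the Thesaurus class, its method names, and the sample terms are hypothetical and are not taken from any published subject heading scheme or thesaurus.

```python
from collections import defaultdict


class Thesaurus:
    """Minimal store for Use, Broader/Narrower, and Related term relationships."""

    def __init__(self):
        self.use = {}                      # non-preferred term -> preferred term (See/Use)
        self.broader = defaultdict(set)    # term -> its broader terms (BT)
        self.narrower = defaultdict(set)   # term -> its narrower terms (NT)
        self.related = defaultdict(set)    # term -> its related terms (RT / See also)

    def add_use(self, non_preferred, preferred):
        self.use[non_preferred] = preferred

    def add_broader(self, term, broader_term):
        # BT and NT are reciprocal, so both directions are recorded together.
        self.broader[term].add(broader_term)
        self.narrower[broader_term].add(term)

    def add_related(self, term_a, term_b):
        # RT is symmetric.
        self.related[term_a].add(term_b)
        self.related[term_b].add(term_a)

    def resolve(self, term):
        """Follow a See/Use reference to the preferred term, if there is one."""
        return self.use.get(term, term)


thesaurus = Thesaurus()
thesaurus.add_use("Cars", "Automobiles")
thesaurus.add_broader("Automobiles", "Motor vehicles")
thesaurus.add_related("Automobiles", "Roads")

print(thesaurus.resolve("Cars"))
print(sorted(thesaurus.narrower["Motor vehicles"]))
print(sorted(thesaurus.related["Roads"]))
```

Recording the reciprocal relationship in the same call is one simple way to keep Broader term and Narrower term entries consistent with each other.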

Thesauri

Information thesauri (in contrast to a dictionary thesaurus, like Roget’s) emerged in the mid-20th century outside of libraries for the more specialized subject needs of the federal government, scientific publishers, and technology companies. The word “thesaurus” was first used to refer to a controlled vocabulary, as a set of words/terms, not classification codes, for information retrieval in the 1950s.

Early thesauri include:

E. I. Dupont de Nemours Company’s thesaurus (1959)
Thesaurus of Armed Services Technical Information Agency (ASTIA) Descriptors, U.S. Department of Defense (1960)
Chemical Engineering Thesaurus, published by the American Institute of Chemical Engineers (1961)

Additional professional organization publishers of scientific journals created their own thesauri in the 1960s. Dialog, the first online information service for article citations, which also utilized thesauri of information publishers, was launched in 1966.

Soon thereafter, standards for thesauri were developed and published:

UNESCO Guidelines for the establishment and development of monolingual thesauri (1970)
DIN 1463 (Deutsches Institut für Normung) Guidelines for the establishment and development of monolingual thesauri (1972)
ISO 2788 Guidelines for the establishment and development of monolingual thesauri (1974) (superseded by ISO 25964-1:2011)
ANSI American National Standard for Thesaurus Structure, Construction, and Use (1974) (superseded by ANSI/NISO Z39.19-1993)

Modern information taxonomies

The word “taxonomy” for a hierarchical structure (like a classification scheme) of terms for tagging and retrieval (like a thesaurus) gradually became popular in the 1990s. These new taxonomy-like thesauri caught on largely due to advancements in software and website user interfaces that enabled interactive displays of hierarchies. Taxonomies had the same primary purpose as thesauri, which is information findability and retrieval, but taxonomy implementations introduced new designs for browsing and expanding hierarchies. It was found that “taxonomy” also tended to resonate with business audiences better than “thesaurus.” A market for business and commercial taxonomies started to be recognized by software vendors and by consultants by the end of the 1990s.

Combining an interactive user interface with a database enabled the introduction of dynamic filters or refinements of searches by selected taxonomy terms based on different aspects, and thus faceted taxonomies emerged and have since become a popular, if not dominant, implementation of taxonomies for many different use cases. Faceted taxonomies, by combining search terms for refinement, do not need to be as large and detailed as thesauri.

As for the next chapter in the history of taxonomies, that involves a convergence with ontologies. You can read more about that in my past blog article “Taxonomies vs. Ontologies.”

Saturday, February 24, 2024

Faceted Classification and Faceted Taxonomies

I have argued before that a taxonomy is not the same as a classification system, despite the original meaning of the word taxonomy as a system for classification. (See the blog post Classification Systems vs. Taxonomies.) Modern taxonomies that are used to support information management and findability are more similar to information retrieval thesauri and subject heading schemes than they are to classification systems. Another type of classification, the method of “faceted classification,” however, does apply to types of taxonomies. I would not consider “faceted classification” to be exactly a synonym of “faceted taxonomy,” though, as not all faceted taxonomies are the same.

What is faceted classification?

Facets for jobs

Facet means face, side, dimension, or aspect. In this sense, facets refer to aspects of classification. A diamond, an object, or a digital content item is multi-faceted. A digital content item (text document, presentation, image, video, etc.) has multiple informational dimensions or aspects to it and thus multiple ways to be classified.

Classification is about putting an item, such as a content item (document, page, or digital asset), into a class or category. If it’s a physical object (a book), it goes onto a shelf for its class. In faceted classification, an item cannot physically be in more than one place, but it can still be “assigned to” more than one class. So, while the book itself can be on only one shelf, the record about the book can be assigned to more than one class.

Faceted classification assigns classes/categories/terms/concepts from each of multiple facets to a content item, allowing users to find the item by choosing the concepts from any one of the facets they consider first. Different users will consider different classification facets first. Users then narrow the search results by selecting concepts from additional facets in any order they wish, until they get a targeted result set meeting the criteria of multiple facet selections. The user interface of faceted classification is sometimes referred to as faceted browsing.
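The narrowing behavior described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: the facet names, the concepts, the sample items, and the function name narrow are hypothetical, and a real implementation would normally rely on a search engine or database rather than an in-memory list.

```python
# Hypothetical content items, each assigned concepts from several facets.
# Facet values are sets, so an item may carry more than one concept per facet.
ITEMS = [
    {"id": "doc-1", "content_type": {"Report"}, "subject": {"Taxonomies"}, "audience": {"Librarians"}},
    {"id": "doc-2", "content_type": {"Video"}, "subject": {"Taxonomies"}, "audience": {"Developers"}},
    {"id": "doc-3", "content_type": {"Report"}, "subject": {"Search"}, "audience": {"Developers"}},
    {"id": "doc-4", "content_type": {"Report"}, "subject": {"Taxonomies", "Search"}, "audience": {"Developers"}},
]


def narrow(items, facet, concept):
    """Keep only the items that were assigned the given concept in the given facet."""
    return [item for item in items if concept in item.get(facet, set())]


# One user starts with the subject facet, another with the audience facet.
subject_first = narrow(narrow(ITEMS, "subject", "Taxonomies"), "audience", "Developers")
audience_first = narrow(narrow(ITEMS, "audience", "Developers"), "subject", "Taxonomies")

ids_subject_first = [item["id"] for item in subject_first]
ids_audience_first = [item["id"] for item in audience_first]

# Both paths arrive at the same targeted result set.
print(ids_subject_first)
print(ids_audience_first)
```

Each selection is an independent filter, so the order in which facets are chosen does not change the final result set, only the intermediate steps the user sees along the way.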

History of faceted classification

The idea of faceted classification as a superior alternative to traditional hierarchical classification, whereby an item (such as a book or article) can be classified in multiple different ways instead of in just a single classification class/category, is not new. The first such faceted classification was developed and published by mathematician/librarian S.R. Ranganathan in 1933, as an alternative to the Dewey Decimal System for classifying books, called Colon Classification (since the colon punctuation was originally used to separate the multiple facets). In addition to subject categories, it has the following facets:

Personality – topic or orientation
Matter – things or materials
Energy – actions
Space – places or locations
Time – times or time periods

Although it was not adopted widely internationally due to its complexities in the pre-digital era, Colon Classification has been used by libraries in India.

In the late 20th century, digital library research systems based on databases enabled faceted classification and search, with different fields of a database record represented in different search facets. Users interacted with these systems through an “advanced search” form of multiple fields. Faceted classification and browsing gained widespread adoption with the advancement of interactive user interfaces on websites and in web applications in the late 1990s and early 2000s. Thus, facets started being displayed in more user-friendly ways that were no longer “advanced.”

Structure of facets

It’s not necessary to follow Ranganathan’s suggested five facets, but that’s a good way to get thinking about faceted classification. Another way to look at faceted classification is to consider a facet for each of various question words (What, Who, Where, When):

What kind of thing is it – content type
What is it primarily about – subject
Who is it for or concerns – audience or user group
Where is it for/applicable, or where it depicts (media) – geographic region
When it is about – event or season (not date of creation, which is administrative metadata, instead of a taxonomy concept)

The additional question words of “why” and “how” are relevant in some cases, but less common. An individual content item typically does not address all of these questions, but usually addresses more than one. When creating facets, most of the facet types should be applicable to most of the content types.

Another good way to think about faceted classification is to put the word “by” after each facet, to suggest classification and filtering “by” the aspect type. A logical and practical number of facets tends to be in the range of three to seven.

A standard feature of facets is that they are mutually exclusive. A concept/type belongs to only one facet. This is typical practice for the design of classification systems. The difference is that in faceted classification it is merely the concept/type/term that belongs to just one facet, not the content item or thing itself that would belong to only one classification in traditional classification systems.
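A mutual-exclusivity check of this kind is easy to automate. The following Python sketch is only an illustration: the facet names and concepts are hypothetical sample data, and the function name concepts_in_multiple_facets is invented for this example. One concept is deliberately placed in two facets so that the check has something to report.

```python
from collections import defaultdict

# Hypothetical facets, each holding a set of concepts.
FACETS = {
    "Content type": {"Report", "Video", "Presentation"},
    "Audience": {"Librarians", "Developers"},
    "Product types": {"Data catalogs", "Search engines"},
    "Technologies": {"Data catalogs", "Machine learning"},
}


def concepts_in_multiple_facets(facets):
    """Return each concept that appears in more than one facet, with the facets it appears in."""
    membership = defaultdict(list)
    for facet_name, concepts in facets.items():
        for concept in concepts:
            membership[concept].append(facet_name)
    return {
        concept: sorted(names)
        for concept, names in membership.items()
        if len(names) > 1
    }


overlaps = concepts_in_multiple_facets(FACETS)
print(overlaps)
```

An empty result means the facets are mutually exclusive at the concept level. Note that the check looks only at concepts: a content item can still be assigned concepts from as many facets as apply.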

When a faceted taxonomy is not for classification

The design, implementation and use of facets to construct or refine searches has become so popular that it is no longer used just for classification aspects. Rather, a faceted taxonomy design may be used for any faceted grouping of concepts for search or metadata types that are relevant for the content and users.

Faceted classification is intended to classify things that share all the same facets. For example, all technical documentation content has a product, feature, issue, and content type, so these are faceted classifications. But with more heterogeneous content, facets are not universally shared. While the facets may still be a useful tool, it would be best not to call it faceted classification when facets are applicable to only some content types.

While faceted classification tends to be quite limited in the number of its facets, non-classification faceted taxonomies, whether based on subject types or separate controlled vocabularies, could result in a rather large number of facets.

Faceted taxonomies that would not be considered faceted classification include those where multiple facets are created for organizing and breaking down subjects or when multiple facets are created for reflecting multiple different controlled vocabularies. These faceted taxonomies stretch the meaning of “facet,” since the facets are not necessarily faces, dimensions, or aspects, but simply “types” suitable for filtering.

Facets for organizing subjects

In faceted classification we assign an object or content item to multiple different classes. However, for classification, these classes are relevant to the content item as a whole. This contrasts with indexing or tagging for subjects or names of relevance that occur within a text or are depicted within a media asset. These names and subjects can be grouped into facets for filtering/limiting search results, without being about the “classification” of the content item. This is common for specialized subject areas. Faceted taxonomies provide a form of guided navigation and are easier to browse and use than deep hierarchical taxonomies, so a large “subject” taxonomy could be broken down into specific subject-type facets.

Examples of specific subject-type facets include:

Organization types
Product types
Technologies
Activities
Industries
Disciplines
Job roles
Event types
Topics

The “Topics” facet is then used for the leftover generic subject concepts that do not belong in any of the other specialized facets. Unlike faceted classification, each facet is applicable to only some content items.

Any content item could be tagged with any number of concepts from any number of these facets. The facets make it easier for users to find taxonomy concepts and combine them. But the facets are not for “classifying” the content.

While the facets of a faceted taxonomy should ideally also be mutually exclusive, the occasional exception of a concept belonging to more than one subject-type facet (question word of “What”), in contrast to the principle of faceted classification, does not create a problem in search. For example, the same concept, Data catalogs, could be in both the Product Types and Technologies facets, as long as this type of polyhierarchy is kept to a minimum to avoid confusion. This would not be considered a case of classic polyhierarchy, because it’s not simply a matter of different broader concepts, but rather different facets or concept schemes. It is an attempt to address a different focus or approach to the topic that results in it being in more than one facet, offering an additional starting point for searchers.

Facets for organizing controlled vocabularies

Faceted filters/refinement may be based on different controlled vocabulary types: one or more of term lists, name authorities, and subject thesauri/taxonomies. The “facets” are based on how the set of multiple controlled vocabularies is organized rather than based on “aspects” of the content.

Facets could be used for any controlled vocabulary filters that are logical, such as:

Named people (mentioned/discussed)
Organizations (mentioned/discussed)
Products/brands (mentioned/discussed)
Divisions, departments, units (mentioned/discussed)
Named works/document titles (mentioned/discussed)
Places (mentioned/discussed)
Topics (mentioned/discussed)

Because these facets reflect controlled vocabularies of concepts used to tag content for relevant occurrences of the subject/name and not for classification of the content, this kind of faceted taxonomy would not be considered faceted classification. There could, however, be additional faceted classification types, such as content type.

The Topics facet could contain a large hierarchical taxonomy or thesaurus. As such, this faceted search/browse structure may not even be considered a “faceted taxonomy,” but rather merely a faceted search interface to a set of taxonomies. Thus, there is even a nuanced difference between a faceted browse UI that utilizes a taxonomy (among other controlled vocabularies) and a “faceted taxonomy.”

Facets for heterogeneous content

Finally, whether a faceted taxonomy is considered an implementation of faceted “classification” or not may depend on the context and type of content. If the content is homogeneous and all items share the same facets, then it may be considered faceted classification, but if the content is heterogeneous, and the facets are only relevant to some content, then it would not be considered classification.

Consider the following example of specialized subject-based facets for the field of medicine:

Diseases or conditions
Body parts (anatomy)
Signs and symptoms
Treatments
Patient population types

If all the content comprised just clinical case studies, then these facets actually could be considered faceted classification, since they all apply to nearly all the content and are aspects of the content. The content is classified by these facets. On the other hand, if the content dealt with all kinds of documents that had something to do with health or medicine, then these facets would not be for classification of the content but rather just for grouping of subjects for search filters.

When faceted classification is not a taxonomy

Attributes for computers

Finally, I would not consider all faceted structures to be faceted taxonomies.

Taxonomies are primarily for subjects and may include named entities. Content types/document types may also be included in the scope of taxonomy. There exists additional metadata that may be desired for filtering/refining searches that is out of scope of a definition of taxonomy. This includes date published/uploaded, file format, author/creator, document/approval status, etc. If it is important to the end users, these additional metadata properties could be included among the browsable facets and be considered classification aspects.

Attributes are a form of faceted classification, but a set of attributes is not really a faceted taxonomy. Often ecommerce taxonomies are presented as examples of faceted taxonomies. In fact, ecommerce taxonomies tend to be hierarchical, as they present categories and subcategories of types of products for the users to browse. At lower, more specific levels of the hierarchy, the user then has the additional option to narrow the results further by selecting values from various attributes that are shared among the products within the same product category. These include color, size/dimensions, price range, and product-specific features. I would not consider numeric values to be a taxonomy, but some attributes, such as for features, are more within the realm of taxonomies. Whether these should be called facets or attributes is a matter of debate. More about attributes is discussed in my past blog post “Attributes in Taxonomies.”

Conclusions

Not all faceted taxonomies are faceted classifications, but some are. Not all faceted classifications are taxonomies, but some are. The differences are nuanced, and end users may neither care about nor need to know these naming distinctions, but the taxonomist should. Having a deep understanding of facets helps taxonomists and information architects design the facets better. The goal is to give the users the faceted design that best serves their needs and accommodates the set of content.

Sunday, January 14, 2024

Learning to Create Taxonomies

Knowledge of what taxonomies are, what they are for, and how they are used is quite widespread, even if there are uncertainties and disagreements around the definition of “taxonomy.” People who often look up digital information are familiar with various presentations of taxonomies for selecting terms linked to content. These include hierarchical trees of topics and subtopics to browse, scroll boxes of controlled terms, type-ahead or search-suggest terms that appear below a search box after the first few letters are typed into the box, and terms or named entities grouped by various aspect types (facets) in the left margin to select from in order to limit/refine/filter search results.

Why Learn Taxonomy Creation

There is a big difference, however, between being able to use taxonomies and being able to create taxonomies.

While it is usually best to leave taxonomy creation to the experts, taxonomists are not always available, or the needed taxonomy may be small or apparently “simple,” so it may not be economical to hire a contract taxonomist or a consultant. In other situations, the taxonomy subject may be quite technical, and it would seem preferable to have subject matter experts, rather than an external taxonomist, create the taxonomy. Thus, people who are not professional taxonomists often create taxonomies.

Generative AI now makes it easier for anyone to “generate” a taxonomy. However, knowledge of taxonomy principles is needed to make the necessary corrections and edit the taxonomy to achieve a decent level of quality. Generative AI should not be used to fully create a taxonomy (which could in fact mean extracting published taxonomies in violation of their copyright), but rather it may be used as a tool to facilitate parts of the taxonomy creation process. (See my post “Taxonomies and ChatGPT.”) The technology thus makes it easier to create taxonomies for those who are not taxonomists and have limited time for taxonomy creation tasks.

There is also the matter of taxonomy maintenance. After a contract taxonomist or consultant creates a taxonomy and leaves, the taxonomy still needs to be kept up to date, with new concepts added and others changed, and over time expanded. While documentation and guidelines written by a taxonomy consultant are helpful, a good understanding of taxonomy creation principles is also needed by anyone responsible for expanding or maintaining a taxonomy.

Finally, taxonomy creation is a collaborative effort, involving stakeholders in various roles (project management, content management, digital asset management, information technology, tagging, research, user experience, search, etc.) who are invited to contribute their perspectives. Stakeholders can provide better insights into a taxonomy if they have a better understanding of taxonomy principles. Taxonomy project managers in particular need to understand taxonomy creation even if they are not doing the actual taxonomy creation work.

How to Learn Taxonomy Creation

Fortunately, there are many resources for learning the principles and standards of taxonomy design and creation. There is, of course, my book, The Accidental Taxonomist, which, as the name implies, is intended for anyone who finds themselves, perhaps by “accident,” in a position that requires them to create, edit, or manage taxonomies.

Heather Hedden delivering a taxonomy workshop

There are also various half-day and full-day workshops at conferences, virtual short courses through professional associations and other organizations, and asynchronous online training. These usually involve some exercises for practice and provide the appropriate amount of training for getting started with creating taxonomies. I’ve offered various kinds of training, both independently and through other organizations, over the years. My current course offerings are on my website.

Upcoming Taxonomy Course

The next live (virtual) course I will offer is a new course called “Controlled Vocabularies and Taxonomies,” offered through HS Events on GoToWebinar over four weekly sessions from February 29 through March 27. I will teach this course live (with ample time for Q&A) just once, after which it will become available as a recording for purchase.

HS (Henry Stewart) Events are best known for their dominance in the field of digital asset management (DAM), but the course I will teach is not limited to DAM professionals. Actually, this course is most appropriate for the expanding scope of HS Events, which will introduce a Semantic Data conference event, which includes the subject of taxonomies, co-located with its DAM conferences in London and New York in 2024.

The subject of taxonomies fits nicely into four sessions. The first session is an introduction to the definitions, types, uses, benefits, and standards for taxonomies. The second deals with the project management side of planning and researching for creating controlled vocabularies and taxonomies. The third session gets into the details of creating terms and relationships. Finally, the fourth session takes up design and implementation issues.

This course is most similar to the course “Metadata and Taxonomies,” which I taught through the Rome, Italy-based training company Technology Transfer S.r.l. from 2019 to 2023, and which I decided to discontinue offering. The scheduling is now better: instead of two consecutive days of four hours per day, it is spread out over four shorter weekly sessions with dedicated Q&A time. Also, the sessions start two hours later than the Rome-based course (10:00 am instead of 8:00 am EST). I have also updated the content, which was getting a little stale after several years, and I added more new graphics. Finally, the registration fee is considerably lower than that of the Technology Transfer course. You can also take advantage of a 20% discount (code JANUARY20) if you register before January 31.

Sunday, December 31, 2023

IT and Taxonomies

Taxonomies are related to many fields of work, including knowledge management, information architecture, website design, website marketing and SEO, document management, terminology management, publishing, product management (for information products), content management and strategy, digital asset management, machine learning for classification, natural language processing for auto-tagging, data management, library and information management, and information technology. Information technology is relevant to the implementation of all taxonomies.

Why is IT involved in taxonomies?

Taxonomies link users to content (and taxonomies extended into ontologies also link users to data), but this linking relies on technology. The technology could be a kind of software, such as a content management system that supports the tagging and retrieval of content by taxonomies along with the feature of taxonomy management. Often, however, additional technology is needed to link multiple software systems together, with APIs, and to move data across systems, with extract-transform-load (ETL) tools. Taxonomies are increasingly built in the SKOS (Simple Knowledge Organization System) standard/data model, which enables taxonomies and other knowledge organization systems to be machine-readable and not just human readable.
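To give a sense of what a machine-readable SKOS taxonomy looks like, here is a minimal Python sketch that writes two concepts in the Turtle syntax using only string formatting. The example.org URIs, the labels, and the function name to_turtle are hypothetical; the skos: terms (ConceptScheme, Concept, inScheme, prefLabel, altLabel, broader) are part of the SKOS vocabulary. A production system would normally use taxonomy management software or an RDF library rather than hand-built strings, and this sketch does not escape special characters in labels.

```python
SKOS_PREFIX = "@prefix skos: <http://www.w3.org/2004/02/skos/core#> ."


def to_turtle(scheme_uri, concepts):
    """Serialize a small dict of concepts as SKOS in Turtle syntax.

    Assumes labels contain no quotation marks or backslashes (no escaping is done).
    """
    lines = [SKOS_PREFIX, "", f"<{scheme_uri}> a skos:ConceptScheme ."]
    for uri, data in concepts.items():
        pref_label = data["pref_label"]
        lines.append("")
        lines.append(f"<{uri}> a skos:Concept ;")
        lines.append(f"    skos:inScheme <{scheme_uri}> ;")
        for alt_label in data.get("alt_labels", []):
            lines.append(f'    skos:altLabel "{alt_label}"@en ;')
        for broader_uri in data.get("broader", []):
            lines.append(f"    skos:broader <{broader_uri}> ;")
        # The preferred label is written last so the statement can end with a period.
        lines.append(f'    skos:prefLabel "{pref_label}"@en .')
    return "\n".join(lines)


# Hypothetical scheme and concepts, for illustration only.
SCHEME = "https://example.org/taxonomy"
CONCEPTS = {
    "https://example.org/taxonomy/vehicles": {
        "pref_label": "Motor vehicles",
    },
    "https://example.org/taxonomy/automobiles": {
        "pref_label": "Automobiles",
        "alt_labels": ["Cars"],
        "broader": ["https://example.org/taxonomy/vehicles"],
    },
}

turtle_text = to_turtle(SCHEME, CONCEPTS)
print(turtle_text)
```

Because each concept has its own URI and the relationships use standard SKOS properties, a content management system, a search system, or an ETL tool can read the same file without any custom parsing rules.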

Taxonomies are a concern of information technology professionals as they are the owners of, and often also the developers of, the systems in which taxonomies are implemented. The systems could be completely internally developed, or they could be licensed software that typically requires some customization or integration with other systems. In my experience as a taxonomy consultant, I have typically engaged in conversations with those in IT as key stakeholders of the taxonomy. However, the degree of the involvement of IT professionals in the taxonomy itself can vary.

In custom taxonomy implementations, such as in an information service/product or in an ecommerce business, IT professionals are usually not involved in the actual design of the taxonomy, but taxonomists or others who create that taxonomy need to collaborate with IT professionals to understand the system’s capabilities and limitations and may impose requirements. Taxonomists are concerned with how the taxonomy will be displayed to the users, how the users can interact with the taxonomy, how tagging is done, and how the search functions. Custom software development has great flexibility in how it supports a taxonomy.
In implementations of taxonomies in licensed software, there may still be some development work for the IT professionals, but there are limits to what can be done or changed.

Commercial content management systems (CMS) that allow for the custom development of the user interface, referred to as “headless” CMSs, however, are becoming more common. The user interface in particular is very significant to how a taxonomy is designed and how it functions.

Who in IT is involved in taxonomies?

Those who work in IT departments with involvement in taxonomies could be in roles doing development or support for systems that manage and consume taxonomies, or they could be in systems integration roles. Additionally, there are taxonomy/metadata/ontology specialists who work within the IT department of an enterprise, especially if a knowledge/information management department does not exist in the organization.

In a survey of taxonomists that I conducted in January 2022 for the 3rd edition of The Accidental Taxonomist, a multiple-choice question asked the 162 respondents who do taxonomy work for their employers (rather than for consultancies creating taxonomies for others) what area they work in. Information technology ranked 4th out of 11 choices, with 17% of the responses, following the areas of knowledge management, content management/strategy, and product development/management, yet ahead of the specialties of library, user experience, marketing, and others.

The survey also asked all respondents to provide their job titles, and some of those working in taxonomies have job titles that are closely associated with information technology. These included titles of IT Data Analyst, Data and Technology Platform Products, SharePoint Product Owner, Senior Solutions Consultant, Implementation Project Manager, Data Architect, Senior Manager - Graph Solutions, Enterprise Architect, Staff Engineer - Systems, Information Governance Engineer, Head of Technical Services, and Director of Solutions Delivery.

What does IT do with taxonomies?

From my experience as a taxonomy consultant, I have observed that those working in IT, in their efforts to facilitate the adoption of new software and features that make use of taxonomies, may include starter taxonomies within the tool, whether selected from the offerings of the software vendor or created by the IT staff themselves. For example, IT professionals might create simple controlled vocabularies in the SharePoint term store, such as for document types, departments, and locations, so that users can start using the search refinements right away. These vocabularies also serve as an example of taxonomy functionality, which someone else can improve upon and expand later.

Then there is enterprise taxonomy/ontology management software, which should be connected to search systems, content management systems, and tagging systems (if not using a tagging module of the taxonomy management system). In my experience working for a taxonomy software vendor, the IT department was often involved in the software purchasing process, if not actually leading the decision-making. Representatives from the IT department attend pre-sales demos of the tool, ask questions, and compile and compare system requirements when requesting a proposal.

That taxonomy is actually an area of concern for IT was also made clear when I saw that taxonomies were mentioned in a section within a chapter on knowledge management-related systems in my son’s introductory Management Information Systems textbook for a required course for his B.S. in Information Technology.

In sum, IT professionals who support enterprise knowledge or information management systems need to have a basic understanding of taxonomy principles, standards, benefits, and uses. My website contains various taxonomy resources. Some IT professionals may even want to go further and design and create small taxonomies (lacking the time to create large taxonomies), and they may want to read my book or attend my workshops or online courses.