AI meets library: Building an RVK classification tool


AI and library classification systems

Artificial intelligence is rapidly changing the way we tackle complex tasks in many different fields. As a librarian interested in the evolving landscape of AI, I have been exploring how AI can be integrated into our daily workflows. My latest project focuses on an AI-powered tool for RVK classification. In this post, I will explain what the RVK is, how AI could facilitate its use in libraries, and the challenges and solutions encountered along the way.

The complete source code for the RVK Classification Tool is available on GitHub.

Understanding the RVK

For those unfamiliar with it, the RVK (Regensburger Verbundklassifikation) is a German library classification system widely used in academic libraries, particularly in Germany and Austria. Whereas many classification systems focus primarily on describing subject matter, the RVK was developed at the University of Regensburg Library around 1964 as a pragmatic system geared towards shelf marks.

It is designed to organise library holdings so that related materials are physically shelved together. This makes it easier for researchers and students to browse and retrieve materials. The University of Regensburg Library has continuously maintained and restructured the RVK, releasing it as Open Data under the Creative Commons CC0 licence in December 2017.

The RVK’s sophisticated, hierarchical, modular structure makes it particularly complex, which in turn makes it challenging for AI to identify the exact RVK notation needed for a shelf mark. Its contents (keywords and the corresponding classifications) are organised into ‘main groups’ (Hauptgruppen), such as ‘N’ for History and ‘Q’ for Economics, which are then broken down into increasingly specific sub-classes, such as ‘NW’ for Economic and Social History.

Geographical distinctions are the most common reason why the same subject area recurs in many different places within the classification hierarchy. This creates two fundamental challenges for automated classification. The first is the sheer volume of possibilities. The second is the risk of matching keywords without proper contextual understanding. For instance, the term ‘Migrationspolitik’ (migration policy) appears 179 times in the RVK, including in the following contexts:

  • MA – ML: Political Science > MG – MI: Political Systems of Individual Countries > MI: Africa, Latin America, Australia > MI 70000 – MI 92999: Latin America > MI 81000 – MI 92999: South America > MI 82000 – MI 82999: Argentina > MI 82900 – MI 82950: Specific Political Subject Areas > MI 82925 Migration Policy
  • MA – ML: Political Science > MG – MI: Political Systems of Individual Countries > MG: Europe, North America > MG 11000 – MG 44999: Western Europe, Central Europe > MG 15000 – MG 31999: Federal Republic of Germany > MG 22000 – MG 22999: Lower Saxony > MG 22900 – MG 22950: Specific Political Subject Areas > MG 22925 Migration Policy

AI in action: Developing an RVK classification tool

My aim was to create a tool to assist with the often complex process of assigning RVK classifications, and thus shelf marks. The core idea was to use AI to identify the key content in a publication’s metadata, supplied in a format such as PICA. The resulting keywords could then be used to search the RVK and determine the most suitable classification.

The project seemed feasible because the RVK provides a freely accessible application programming interface (API). At the same time, content analysis of the metadata could be performed using a paid API from OpenAI. I chose Streamlit, a Python framework that allows rapid development of interactive web applications, for the user interface.

The development process itself was an experiment in employing AI as a programming tool. I used AI models to assist with various coding tasks, ranging from creating initial code structures to debugging and refining algorithms. The initial phase of development was relatively straightforward. The primary challenge, which remains partly unresolved, lies in identifying the most appropriate classification within the RVK. Specifically, the key question was how the tool could determine which of the 179 instances of ‘migration policy’ applies to a given publication.

The tool addresses this challenge through a multi-stage process. First, it identifies promising RVK main groups based on the publication’s subject matter. Then, it searches for specific terms within those areas. Finally, it applies a scoring system that weighs three factors: how well the subject matter matches (45%), how relevant the geographical context is (25%), and how appropriate the RVK main group is (30%). While this systematic approach helps prioritise the most likely classifications, it cannot overcome the fundamental limitation of hierarchical depth.
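The weighting in the final step can be illustrated with a minimal sketch. Only the three percentages come from the description above. The `Candidate` class, the function names and the example scores are assumptions made for illustration, and the tool's actual code may combine the factors differently.

```python
from dataclasses import dataclass

# Weights as stated in the text: subject 45%, geography 25%, main group 30%.
WEIGHTS = {"subject": 0.45, "geography": 0.25, "main_group": 0.30}


@dataclass
class Candidate:
    """A hypothetical candidate notation with three partial scores (0.0 to 1.0)."""
    notation: str
    subject: float
    geography: float
    main_group: float


def combined_score(c: Candidate) -> float:
    """Weighted sum of the three partial scores."""
    return (WEIGHTS["subject"] * c.subject
            + WEIGHTS["geography"] * c.geography
            + WEIGHTS["main_group"] * c.main_group)


def rank(candidates):
    """Return candidates sorted from most to least likely."""
    return sorted(candidates, key=combined_score, reverse=True)


# Invented scores for a book on migration policy in Lower Saxony:
# both notations match the subject equally, only the geography differs.
candidates = [
    Candidate("MI 82925", subject=0.9, geography=0.1, main_group=0.8),
    Candidate("MG 22925", subject=0.9, geography=0.9, main_group=0.8),
]

for c in rank(candidates):
    print(c.notation, round(combined_score(c), 3))
```

With these invented scores, the geographical factor is what separates the two otherwise identical ‘migration policy’ notations, so the Lower Saxony entry ranks first.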

Another significant technical challenge in developing this tool was displaying the full hierarchical path of each RVK classification accurately, which is essential for determining its relevance. In contrast to simpler systems, the RVK is deeply and sometimes intricately nested (including range notations such as MI 70000 – MI 92999), which makes reconstructing the full path of each classification difficult. Extensive development and fine-tuning would be required to enable this feature.
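To give an idea of why range notations complicate path reconstruction, here is a small sketch that checks whether a notation falls inside a range label. It assumes a simplified notation of two capital letters optionally followed by a number. Real RVK notations can be more varied, and the helper names `parse_notation`, `parse_range` and `in_range` are my own illustrations rather than part of the tool or the RVK API.

```python
import re

# Simplifying assumption: two capital letters, optionally followed by digits.
_NOTATION = re.compile(r"^([A-Z]{2})(?:\s+(\d+))?$")
_MAX_NUMBER = 10**9  # stands in for "end of this letter group"


def parse_notation(text, default_number):
    """Turn 'MI 82925' into ('MI', 82925); 'MI' gets the default number."""
    match = _NOTATION.match(text.strip())
    if match is None:
        raise ValueError(f"not a notation this sketch understands: {text!r}")
    letters, digits = match.groups()
    return (letters, int(digits) if digits else default_number)


def parse_range(text):
    """Turn 'MI 70000 – MI 92999', 'MA – ML' or 'MI' into (low, high) tuples."""
    parts = re.split(r"\s*[–-]\s*", text.strip())
    if len(parts) == 1:
        parts = parts * 2  # a single label covers its own letter group
    if len(parts) != 2:
        raise ValueError(f"not a range this sketch understands: {text!r}")
    return parse_notation(parts[0], 0), parse_notation(parts[1], _MAX_NUMBER)


def in_range(notation, range_text):
    """True if the notation lies within the range (tuple comparison)."""
    low, high = parse_range(range_text)
    return low <= parse_notation(notation, 0) <= high


labels = ["MA – ML", "MG – MI", "MI", "MI 70000 – MI 92999", "MG 11000 – MG 44999"]
path = [label for label in labels if in_range("MI 82925", label)]
print(path)
```

In this example the first four labels contain MI 82925 and the Western Europe range does not, which matches the Argentina path shown earlier.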

To meet these challenges as well as possible, the tool employs a bidirectional navigation strategy. Rather than searching only downward from promising starting points, it explores the RVK structure in both directions:

  • Upward exploration: Examines parent and grandparent nodes to identify broader subject classifications that might be relevant.
  • Downward refinement: Investigates child and grandchild nodes for more specific and precise classifications.
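The two directions can be sketched on a toy hierarchy. The tree below is hard-coded from the Argentina path quoted earlier, whereas the real tool would have to fetch parent and child nodes from the RVK API. The function names and the limit of two levels per direction are assumptions for illustration.

```python
from collections import deque

# Toy hierarchy as a child -> parent mapping, taken from the example path above.
PARENT_OF = {
    "MG – MI": "MA – ML",
    "MI": "MG – MI",
    "MI 70000 – MI 92999": "MI",
    "MI 81000 – MI 92999": "MI 70000 – MI 92999",
    "MI 82000 – MI 82999": "MI 81000 – MI 92999",
    "MI 82900 – MI 82950": "MI 82000 – MI 82999",
    "MI 82925": "MI 82900 – MI 82950",
}


def children_index(parent_of):
    """Invert the child -> parent mapping into parent -> [children]."""
    index = {}
    for child, parent in parent_of.items():
        index.setdefault(parent, []).append(child)
    return index


def explore(start, parent_of, max_levels=2):
    """Collect nodes up to max_levels above and below the start node.

    Returns {node: distance}; negative distances are ancestors,
    positive distances are descendants.
    """
    found = {}

    # Upward exploration: parent, grandparent, ...
    node = start
    for level in range(1, max_levels + 1):
        node = parent_of.get(node)
        if node is None:
            break
        found[node] = -level

    # Downward refinement: breadth-first over children, grandchildren, ...
    children = children_index(parent_of)
    queue = deque([(start, 0)])
    while queue:
        node, level = queue.popleft()
        if level == max_levels:
            continue
        for child in children.get(node, []):
            found[child] = level + 1
            queue.append((child, level + 1))
    return found


neighbourhood = explore("MI 81000 – MI 92999", PARENT_OF)
for node, distance in sorted(neighbourhood.items(), key=lambda item: item[1]):
    print(distance, node)
```

Starting from the South America range, two levels in each direction reach ‘MI’ above and ‘Specific Political Subject Areas’ below, but not the leaf MI 82925, which sits three levels down.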

This bidirectional approach aims for broad coverage of the RVK hierarchy while keeping processing times practical. Two other key components of the tool’s approach are:

  • Hierarchy-Aware Search: Prioritises RVK main groups and systematically explores child nodes to find specific relevant notations.
  • Regional Context Awareness: Maps local geographical terms to country-level classifications using an extensive built-in database. Since the RVK primarily organises geographical distinctions at the national level, the tool must recognise that ‘Chemnitz’ refers to Germany, or ‘New York’ to the United States, guiding the search toward the appropriate national classifications rather than getting lost in local specifics.
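The regional mapping can be pictured as a simple lookup. The sketch below uses a four-entry dictionary as a stand-in for the much larger built-in database. The entries, the name `countries_for` and the case-insensitive matching are assumptions for illustration only.

```python
# Tiny stand-in for the tool's built-in place database (illustrative entries only).
PLACE_TO_COUNTRY = {
    "chemnitz": "Germany",
    "lower saxony": "Germany",
    "new york": "United States",
    "buenos aires": "Argentina",
}


def countries_for(keywords):
    """Map local place names among the keywords to countries, without duplicates."""
    countries = []
    for keyword in keywords:
        country = PLACE_TO_COUNTRY.get(keyword.strip().lower())
        if country is not None and country not in countries:
            countries.append(country)
    return countries


print(countries_for(["Chemnitz", "Migrationspolitik", "New York"]))
```

Keywords that are not place names, such as ‘Migrationspolitik’, simply produce no country and are left for the subject search.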

Strengths and limitations of the AI tool

As with any experimental tool, this RVK classifier has its strengths and, importantly, its limitations. Understanding these aspects is crucial for appreciating the areas in which AI can currently contribute, as well as those in which human expertise remains indispensable.

As mentioned above, AI is highly effective at recognising the content of publications and texts, even from metadata alone, and at generating relevant keywords to describe them. In conjunction with the RVK API, it generally performs well in assigning publications to RVK main groups (i.e. the upper hierarchical level).

The difficulty lies in determining the correct deeper hierarchical level, and in particular the precise geographical assignment. Is this book about migration policy in Argentina or in Lower Saxony, Germany?

The deep branching of the RVK hierarchy, as illustrated above, poses a significant computational challenge. To address this, the tool employs a bidirectional search that explores two levels upwards and two levels downwards from promising starting points. This takes a considerable amount of computing time, and yet it may still not cover enough hierarchy levels to find the correct classification.

One attempted solution to this problem was to assign a higher weighting to matches containing the primary keyword identified by the AI and the appropriate local reference. However, this led to suggestions such as ‘BE 8507’ (Buddhism in China) for a book about Christianity in Northeast China.

Looking forward

This experiment showcases the potential of AI to assist with traditional library tasks while highlighting the unique challenges posed by complex systems like the RVK. Though the tool can reliably identify broad subject areas, the RVK’s intricate, nested logic requires human oversight and refinement to ensure accurate classification.

Regarding the RVK Classification Tool, the most significant challenge lies in searching the RVK hierarchy structure more quickly and efficiently. This would contribute not only to making the tool genuinely usable in library work, but also to minimising the enormous resources required for AI deployment.

What are your thoughts on integrating AI into library workflows? Have you encountered any challenges with complex classification systems, and which other areas do you think would be suitable for AI experimentation or practical application?

