Searching through museum collections

What’s the connection between Buzz Aldrin and the Rijksmuseum? Surprisingly, it’s the moon that links the museum and the famous astronaut. Following their historic lunar mission in 1969, astronauts Edwin "Buzz" Aldrin, Michael Collins and Neil Armstrong visited the Netherlands. To commemorate the occasion, the American ambassador presented former Dutch Prime Minister Willem Drees with a unique gift: an actual moon rock.

This lunar souvenir now resides in the Rijksmuseum's collection, standing out as an unusual artifact. However, the object's story took an unexpected turn in 2006 when researchers made a surprising discovery. The supposed moon rock turned out to be a piece of petrified wood. This revelation added a new layer of intrigue to the item, transforming it from a symbol of space exploration into an accidental testament to the curiosities of collection cataloguing.

As with the moon stone, the Rijksmuseum has many fascinating objects with great stories behind them. To make these stories more accessible and appealing, we’ve been working on a new way to explore them. The core of this concept is the way in which objects and cataloguing terms are connected. While searching for Willem Drees or the material wood, a visitor might notice the odd recommendation to also search for the place moon, and thus end up looking at this infamous piece of petrified wood.

Accomplishing this isn’t necessarily straightforward. The previous collection system we built for the Rijksmuseum relied on record-based storage in which the text values were searchable but not interlinked. This meant that you could search for and find the moon stone, but generating recommendations wasn’t easily feasible. Furthermore, this architecture didn’t lend itself to exploring the way in which the data is structured. We know the moon is a place, but what kinds of relationships exist between the objects and this place? To overcome these challenges we’ve created a search engine in which relations are at the core of the system. Luckily for us, the Rijksmuseum does most of the heavy lifting required for such a system. Through their own data pipeline the record-based collection data is transformed into linked data. For us this new data model is the starting point of this collection website project.

In this blog post I share some insights and lessons from this project, which we as the engineering team at Q42 gained over a longer period in 2024 and 2025.

From collection system to website

While many know about the Rijksmuseum's art collection, fewer people are aware of the other collections maintained by the museum. The two collections in scope for this project besides the art collection are the library and the archive. The library contains several hundred thousand articles, books, journals and more, while the archive similarly contains a substantial number of items. For the archive the quantity is not so easily expressed, since in some cases an item is a report and in other cases a box of old photos. Each of these collections has its own collection system to maintain the data surrounding it. The integration layer is the Rijksmuseum’s way of shaping the different collection models into a unified one. This model is then able to express relations shared across domains. An example would be Lucas van Leyden as both the topic of an exhibition catalogue and a renowned painter.

Searching and browsing - The interface

So how do we harness the promised utility of these new models and the focus on relationships? Our goal is twofold: empower the user who already knows what they want to search for and inspire the one who doesn’t know yet. We believe graph structures are especially powerful for the latter, whilst more conventional data models and technologies suit the former. We addressed this twofold goal with a twofold approach. On one end we use Elasticsearch for full-text and faceted search, while on the other we use the graph database Neo4j to provide recommendations, switch between domains, switch between relations and generate content pages.

A first (big) search step

For the user this means a couple of things have changed compared to the previous collection website. The search bar has become more elaborate and is now the central place to manage and guide a search journey. This means that both search options and suggestions can be found in the very same bar.

When the very first textual search is initiated, a selection of matching nodes is provided. These can be places, materials, periods, subjects, makers and many more types of terms for the user to filter the collection by. When typing ‘London’ in the search bar one might see suggestions such as London Museum as maker, London the place or even the London Zoo as place.
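The suggestion step described above can be pictured with a small sketch. This is only an illustration, not the website's implementation: the `Node` class, the sample identifiers and the ranking rule (exact matches first, then shorter labels) are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Node:
    node_id: str    # hypothetical identifier
    label: str
    node_type: str  # e.g. "place" or "maker"


# Sample nodes, loosely based on the London example in the text.
NODES = [
    Node("n1", "London", "place"),
    Node("n2", "London Museum", "maker"),
    Node("n3", "London Zoo", "place"),
    Node("n4", "Amsterdam", "place"),
]


def suggest(query, nodes=NODES, limit=10):
    """Return nodes whose label contains the query, ignoring case."""
    needle = query.strip().casefold()
    if not needle:
        return []
    matches = [n for n in nodes if needle in n.label.casefold()]
    # Assumed ranking: exact label match first, then shorter labels.
    matches.sort(key=lambda n: (n.label.casefold() != needle, len(n.label), n.label))
    return matches[:limit]


for node in suggest("london"):
    print(f"{node.label} ({node.node_type})")
```

Note that the suggestions carry a node type but no relation yet; how the chosen node relates to the objects is decided in a later step.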

These options are all identified terms used by collection specialists to express what is known about the collection objects. What’s interesting is that in this case the user hasn’t specified in which way the node should be related to the objects. This means that, once the user selects London the place, the shown collection objects can be either made in London or contain some sort of depiction of London. Furthermore the user can find different tabs right above the search bar which are gateways to different collection perspectives.
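To make the idea of an unspecified relation concrete, here is a minimal sketch. The edge list, the relation names `made_in` and `depicts`, and the function `objects_for` are hypothetical and only serve to show the difference between leaving the relation open and narrowing it to one type.

```python
# Hypothetical edges: (object_id, relation, node_id).
EDGES = [
    ("obj-1", "made_in", "london"),
    ("obj-2", "depicts", "london"),
    ("obj-3", "made_in", "amsterdam"),
    ("obj-4", "depicts", "london"),
]


def objects_for(node_id, relation=None, edges=EDGES):
    """Return the ids of objects linked to node_id.

    relation=None leaves the relation open: any relation type counts.
    Passing a relation narrows the result to that single type.
    """
    return sorted({
        obj
        for obj, rel, node in edges
        if node == node_id and (relation is None or rel == relation)
    })


print(objects_for("london"))             # made in London or depicting London
print(objects_for("london", "made_in"))  # only objects made in London
```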

Shown above are the objects from the collection related to London as place but as it turns out there are also six library objects that have some form of relation to the same London. The last tab shows us that there are also 18018 visitor stories that contain collection objects with a relation to the place London.

The very first tab provides the visitor with a unique perspective of the art collection for the place London. These node pages as we call them all start off with a component where the most viewed related objects are shown.

Afterwards the visitor will encounter a component that collects commonly used nodes in combination with London for which the range of production years is also included. For London it turns out that the place node is often paired with the subject node ‘Church’.
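A rough sketch of how such a "commonly paired nodes" component could be computed is shown below. The real site derives this from the graph; here the objects, their years and the function `paired_nodes` are invented for illustration, and a plain in-memory loop stands in for the graph query.

```python
from collections import defaultdict

# Invented sample objects: the nodes attached to each and a production year.
OBJECTS = [
    {"id": "a", "nodes": {"London", "Church"}, "year": 1650},
    {"id": "b", "nodes": {"London", "Church", "Street"}, "year": 1940},
    {"id": "c", "nodes": {"London", "Bridge"}, "year": 1801},
    {"id": "d", "nodes": {"Amsterdam", "Church"}, "year": 1700},
]


def paired_nodes(selected, objects=OBJECTS):
    """Count nodes that co-occur with `selected` and track their year range."""
    stats = defaultdict(lambda: {"count": 0, "first": None, "last": None})
    for obj in objects:
        if selected not in obj["nodes"]:
            continue
        for other in obj["nodes"] - {selected}:
            entry = stats[other]
            entry["count"] += 1
            year = obj["year"]
            entry["first"] = year if entry["first"] is None else min(entry["first"], year)
            entry["last"] = year if entry["last"] is None else max(entry["last"], year)
    # Most frequently paired first; ties broken alphabetically.
    return sorted(stats.items(), key=lambda item: (-item[1]["count"], item[0]))


for name, info in paired_nodes("London"):
    print(name, info["count"], f'{info["first"]}-{info["last"]}')
```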

As can be seen on the node page, there are multiple depictions of St Paul’s Cathedral through the centuries. The available objects apparently range from the mid 17th century to the mid 20th. Scrolling down further, the visitor is treated to more of these commonly paired nodes such as street, bridge and statues, all uniquely expressing the many fascinating angles from which London can be viewed through the ages.

Textual search, relations and recommendations

In the London scenario the user followed the suggested route of selecting one of the available terms used by the Rijksmuseum. Through the autocomplete and the node pages these terms are strongly promoted, but there are plenty of cases where these terms aren’t sufficiently applied or available. Take for example the highly specific term ‘roemer’. For some this is the last name of a Dutch politician, but for the Rijksmuseum it is a very relevant type of glass. Searching for this term produces a match, which results in plenty of actual roemer glasses.

However, when pressing enter the visitor performs a full-text search, which quickly reveals a different perspective. As it turns out, the Rijksmuseum collection contains many still life paintings in which these roemers are prominently depicted. This is a case where the structured terms are well applied from a collection perspective but possibly fall short from a user’s perspective.

So far we have a couple of ways in which a user can search and browse the collection website using both nodes and text queries. There are some extensions yet to be explored. One is the way in which an open node can be filtered by selecting a specific relationship. To explain this concept we’ll use the term ‘figurehead’ as an example. A figurehead is a decorative sculpture placed at the bow of a ship.

In the image above you can see many models of such figureheads and one instance of a 19th century life-sized one (the fancy knight with his sash). The search and browse bar at this point provides us with two types of suggestions. The first is to select the type of relation the node should have towards collection objects. In this case the suggestion is to narrow the node figurehead to the ‘as subject’ relationship. This might appear odd at first glance, but the cataloguers aren’t at fault. Technically speaking the models shown here are just that: models of figureheads and not figureheads themselves. As it turns out there are only two actual figureheads and these are tagged as such.

Besides these suggestions to specify the relationship there are also suggestions to narrow down your search. Using the previous example we can see Johan van Oldenbarneveldt, a famous statesman during the Eighty Years’ War between the Seventeen Provinces and Spanish Habsburg rule, being suggested. Adding this recommendation as a filter leaves two objects, both of which show Johan van Oldenbarneveldt as the central element of a model of a figurehead.

Even more search options

Since the options for searching are plentiful, an attempt has been made to capture their richness in an overview. This overview contains the option to switch between relationships for any selected node and also provides a full view of all possible options for narrowing down your search. Using the example of figurehead, you can see the options to close the relation at the top, and Johan van Oldenbarneveldt can be found under People / Organisations.

The nodes are grouped by specific collections of node types, for which a dropdown is made available in case there are multiple possible relationships. Below, the dropdown is shown for the Maker grouping, which is particularly useful for showing how complex these options can be. It turns out that a maker can be connected to a collection object in many ways. “Made by” indicates a confident attribution of the (partial) production of an object. “As subject” represents the relationship in which an object depicts a specific maker. To give an increasingly mind-boggling example: Rembrandt van Rijn painted multiple self portraits, so there are multiple objects of which he is both the maker and the subject.

Searching and browsing - The data and technology

So how did we shape up the systems that realise this UI? In a nutshell we rely on two products to serve the object records, search results and browse related features such as node pages and recommendations.

Elasticsearch contains indexes with many properties per collection object. This product has been used extensively by the Rijksmuseum for many years and was chosen for its rich feature set for textual search and management. It’s currently used to offer faceted search and full-text search, and to serve the documents that contain the data seen when looking at a collection object page.

Right beside Elasticsearch stands Neo4j, which contains all terms from Elasticsearch that have both a label and an identifier. Neo4j is a labeled property graph database (a graph data structure where each node and edge can have labels identifying their types and can also store properties as key-value pairs), which allows us to model the collection data in a very compact manner. This dense modelling of the data is important when one wants to dynamically aggregate related nodes, as is done for the recommendations, extensive search and node page generation. Navigating nodes in a graph through multiple consecutive hops (moving from one node to the next) is computationally expensive, especially when densely connected nodes are involved. Densely connected nodes are ones that have a lot of relationships; the material paper is an example, with 600k+ relationships.
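As an illustration of the labeled property graph idea and of what "densely connected" means, the sketch below models nodes and relationships as plain Python objects and counts relationships per node. It does not use Neo4j itself; the class names, the `MADE_OF` relationship type and the sample data are assumptions for the example.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class GraphNode:
    node_id: str
    labels: set                      # node types, e.g. {"Material"}
    properties: dict = field(default_factory=dict)


@dataclass
class Relationship:
    start: str                       # node_id of the source node
    rel_type: str                    # e.g. "MADE_OF" (hypothetical)
    end: str                         # node_id of the target node
    properties: dict = field(default_factory=dict)


# Invented sample data.
nodes = [
    GraphNode("paper", {"Material"}, {"label_en": "paper"}),
    GraphNode("wood", {"Material"}, {"label_en": "wood"}),
    GraphNode("print-1", {"Object"}),
    GraphNode("print-2", {"Object"}),
    GraphNode("drawing-1", {"Object"}),
]
relationships = [
    Relationship("print-1", "MADE_OF", "paper"),
    Relationship("print-2", "MADE_OF", "paper"),
    Relationship("drawing-1", "MADE_OF", "paper"),
    Relationship("drawing-1", "MADE_OF", "wood"),
]


def degrees(rels):
    """Count how many relationships touch each node."""
    counts = Counter()
    for rel in rels:
        counts[rel.start] += 1
        counts[rel.end] += 1
    return counts


def dense_nodes(rels, threshold):
    """Return ids of nodes with at least `threshold` relationships."""
    return sorted(n for n, d in degrees(rels).items() if d >= threshold)


print(dense_nodes(relationships, 3))
```

Every hop through a node like paper fans out to all of its relationships, which is why such nodes dominate the cost of multi-hop traversals.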

There is one more technical component to the landscape that warrants some attention. As mentioned before, the Rijksmuseum owns and maintains an integration layer that provides us with linked data. This data is hosted in yet another graph database which contains the complete richness of the collection data for each domain. We query and process this data in two distinct ways (Elasticsearch and Neo4j) to get to the data models that power the website interface.

SPARQL queries, graphs and relational databases

In the Rijksmuseum access model an attempt is made to be as explicit as possible, which results in a complex data model. In the case of Herman Saftleven’s Pear Cactus in Bloom, this means that it takes five hops to get from the art object node to Herman’s full name. What this five-hop pattern doesn’t capture is that different types of makers can be involved in the production of an object in many different ways. There’s also the language and formatting of the artist’s name to consider, and there’s more still to consider when determining who the maker of an artwork is. It’s technically complex to map a user query directly onto such an access model, and it’s also expensive in terms of performance to navigate deep and diverse patterns in a graph.

To navigate the linked data access model we use the query language SPARQL. It allows us to extract and transform data as we go. Since we impose a simplification upon a complex data model, a good deal of the related complexity shifts to the SPARQL queries. To manage this complexity we created a dedicated Git repository for the many different queries. At the time of writing (late 2024) we’re looking at 238 queries representing roughly 92 unique properties. Each property has at least two and at most three queries: one to determine the number of occurrences, one for the graph and one for Elasticsearch.

Getting the data into the labeled property graph is, perhaps unsurprisingly, fairly straightforward. We use SPARQL construct queries to generate triples representing our desired graph data model and use those to import the data into Neo4j. Such a triple could look as follows: <https://id.rijksmuseum.nl/200109794> <https://q42.nl/ontology/hasPrimaryMaker> <https://id.rijksmuseum.nl/2102256>. Here the first URL is an identifier for Van Gogh’s self portrait. The second URL is a data modelling intermediary that expresses the relationship and doesn’t actually lead anywhere. The last IRI is the identifier for Vincent van Gogh. Feel free to follow the first and third link and see how you end up at the object page and node page on the website.
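To show how a triple like the one above maps onto a graph relationship, here is a minimal sketch that parses a single IRI-only triple and derives a relationship type from the predicate. It is an illustration only: the actual import uses a C# library, the regular expression handles only the simple case of three IRIs (no literals or blank nodes), and taking the last path segment as the relationship type is an assumption.

```python
import re

# Matches a triple made of three IRIs, with an optional trailing dot.
TRIPLE = re.compile(r"^<([^>]+)>\s+<([^>]+)>\s+<([^>]+)>\s*\.?\s*$")


def parse_triple(line):
    """Split an IRI-only triple into (subject, predicate, object)."""
    match = TRIPLE.match(line.strip())
    if match is None:
        raise ValueError(f"not an IRI-only triple: {line!r}")
    return match.groups()


def to_relationship(triple):
    """Turn a triple into a start/type/end record for a property graph."""
    subject, predicate, obj = triple
    # Assumption: the last path segment or fragment names the relationship.
    rel_type = re.split(r"[/#]", predicate)[-1]
    return {"start": subject, "type": rel_type, "end": obj}


line = (
    "<https://id.rijksmuseum.nl/200109794> "
    "<https://q42.nl/ontology/hasPrimaryMaker> "
    "<https://id.rijksmuseum.nl/2102256> ."
)
print(to_relationship(parse_triple(line)))
```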

This figure shows the way in which a collection object is connected to (left) a maker and (right) the object title. The nodes in bold and prefixed by ‘E’ are the types of the nodes, and the relationships are prefixed by ‘P’.

Neo4j, as a graph database, is normally able to conveniently ingest such a graph data format. In our specific case it was slightly less easy since we wanted to make use of Neo4j’s hosted solution called Aura. This solution doesn’t support the dedicated linked data plugin Neosemantics which is why we had to create a C# version of another Neo4j library designed to handle such data.

This figure shows a Neo4j view of the compacted data model with the same data points as in the previous figure. On the left the Night Watch collection object is visible with a relationship to the maker Rembrandt van Rijn. The right side panel exposes the properties attached to this node, which allow us to manage several IDs, localized values and search-normalized values.

On the Elasticsearch side of things we faced a bit more of a challenge. Relationality isn’t the most straightforward concept to implement in a document store, and there is no one-stop-shop solution for getting data out of the graph and into Elasticsearch. We opted to query the individual properties through select queries, which results in tables that can then be stored in an MSSQL database. The next step is to construct objects out of all properties on a record-by-record basis. This sets us up perfectly for the enrichment step, in which we include information from sources not disclosed by the integration layer. Currently the enrichment step adds image information, the popularity of artworks amongst visitors of the collection website and more. The data pipeline towards Elasticsearch concludes with the indexing step, which manages, amongst other things, the way in which properties are searchable.
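The steps of this pipeline (per-property tables, record-by-record assembly, enrichment) can be sketched in a few lines. This is a simplified stand-in that keeps everything in memory instead of MSSQL; the table contents, the `MULTI_VALUED` set and the `popularity` field are assumptions for the example.

```python
from collections import defaultdict

# One table per property, as a select query might return it: (record_id, value).
TABLES = {
    "title": [("obj-1", "The Night Watch"), ("obj-2", "Self-portrait")],
    "maker": [("obj-1", "Rembrandt van Rijn"), ("obj-2", "Vincent van Gogh")],
}

# Assumption: some properties can hold several values per record.
MULTI_VALUED = {"maker"}


def build_documents(tables):
    """Assemble one document per record from the per-property tables."""
    docs = defaultdict(dict)
    for prop, rows in tables.items():
        for record_id, value in rows:
            if prop in MULTI_VALUED:
                docs[record_id].setdefault(prop, []).append(value)
            else:
                docs[record_id][prop] = value
    return dict(docs)


def enrich(docs, popularity):
    """Add data from outside the integration layer (here: view counts)."""
    return {
        record_id: {**doc, "popularity": popularity.get(record_id, 0)}
        for record_id, doc in docs.items()
    }


documents = enrich(build_documents(TABLES), {"obj-1": 42})
for record_id, doc in documents.items():
    print(record_id, doc)
```

The resulting dictionaries are the shape one would hand to an indexing step; how each field is mapped and analysed in Elasticsearch is left out here.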

Lessons learned

The collection data is rich and complex. On one end it provides endless possibilities to explore and provide insight into the Rijksmuseum collections but on the other end this same endlessness can be a detriment to the comprehension of the visitor of the collection. In some cases we saw collection data being incomplete or inconsistent which increases the complexity for the user. This forces us to carefully consider to what extent we expose a visitor to this complexity and how far we go in promoting it when navigating the collections.

At the same time we have a dependency on the efforts of the museum. The integration layer was the critical infrastructure requirement for such a concept. The continuous work of the information specialists in cataloguing the collection data represents the other side of the effort required to get this far. These investments are considerable and achievable only through a well funded visionary approach to online visitor engagement.

What lies ahead

In terms of data, the new collection website ironically creates a clear view of the parts of the collection data that are currently lacking. The accumulated insight into the current state of affairs will guide the ongoing work on improving the collection data, which would make this collection website concept more successful.

Similarly, an effort is being made to disclose the archive collection. This collection would become accessible right next to the art collection, the library and the visitor stories. While this might again expose complex data that is of little interest to the everyday visitor, it does offer researchers unique insights into the collection.

While this was a lengthy story, it is by no means finished. In a follow-up blog post we’d like to explain the effort that was made to allow the Rijksmuseum to step in and control the collection website. The core principle behind this is to create a split between the language and concepts used by the information specialist and the language and concepts targeted at a larger audience.

One year later

The core of this blog post was written shortly after we went live in November of 2024. While the incorporation of the archive and improvements to the collection data are still underway, there has already been quite some time for reflection and further improvements to the current state of the collection website.

One of the most important changes is a new view on balancing full-text search and faceted search. In the first iteration of the website the focus was on maximising the use of facets. While this can still lead to strong search journeys, we’re also seeing that some facets are better suited to this journey than others. To address this, we made a subtle change to how autocomplete works. Now, when one searches for “secondary education”, the usual facet suggestions are replaced by full-text searches (indicated by the magnifying glass). Interestingly, these full-text searches are based on existing facets but aren’t used as such. For now this behaviour applies only to subjects. The reason is that these subjects are attached to relevant objects in some cases but not in others, which causes the number of results to differ between selecting the "secondary education" facet and using it as a full-text search. On the one hand the taxonomy used for describing subjects is very relevant and powerful, but on the other hand it is not applied consistently, which leads to undesirable results. The user is still free to swap the text query for the faceted variant at a second stage.
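The difference in result counts between the facet and the full-text variant can be illustrated with a toy example. The objects, their descriptions and both search functions below are invented; the full-text function is a naive substring match that also looks at subject labels, which is only an assumption about what a real full-text index would cover.

```python
# Invented objects: some are tagged with the subject, others only mention it.
OBJECTS = [
    {"id": "o1", "subjects": {"secondary education"}, "text": "Photo of a classroom"},
    {"id": "o2", "subjects": set(), "text": "Report on secondary education reform"},
    {"id": "o3", "subjects": {"secondary education"}, "text": "School building, exterior"},
]


def facet_search(subject, objects=OBJECTS):
    """Only objects that a cataloguer tagged with the subject."""
    return [o["id"] for o in objects if subject in o["subjects"]]


def full_text_search(query, objects=OBJECTS):
    """Objects whose description or subject labels contain the query."""
    needle = query.casefold()
    hits = []
    for o in objects:
        searchable = " ".join([o["text"], *sorted(o["subjects"])]).casefold()
        if needle in searchable:
            hits.append(o["id"])
    return hits


print(facet_search("secondary education"))      # tagged objects only
print(full_text_search("secondary education"))  # also finds the untagged report
```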

A technical challenge that’s being addressed is that it’s currently not possible to get the facets that are available after performing a full-text search. We do support a loosely related flow where we suggest facets that might be an interesting substitute for your text query. This, however, isn’t what we’re looking for in, for example, the extensive search view of the collection website. Here we want to be able to perform a full-text search query after which all facets still linked to the results are neatly organised and searchable in their own right. To accomplish this we have to shift the extensive search query from Neo4j to Elasticsearch. This is more complex than it appears at first glance. After all, many interfaces support such logic and Elasticsearch has had faceted search support for a long time. There are three points, however, that make this less straightforward:

Nederland is not the same as Nederland. The first is the country of the Netherlands and logically has many connections. The second is a small place with the exact same name. Making this distinction remains a powerful functionality of the system.
We want to be able to work with closed or open relationships. This means a facet can be an aggregation of different properties or it can be a single one.
For extensive search, we want to group properties in very specific ways, informed by usability definitions, so that users can more easily navigate the complex data.

The default faceted search provided by Elasticsearch falls short in our scenario, but using multiple aggregations should do the trick. The problem with the multiple aggregations approach is that the performance cost is too high in cases where highly connected facets are involved. In the end we decided to remodel our Elasticsearch indices with exactly this scenario in mind. The core of this remodelling is to build two special properties. The first concatenates identifier, relation type and relation value into a unique keyword. This forms an array of all associated properties and their different points of interest for our search queries in one place. Similarly, a second property is introduced that concatenates identifier, node type, value and relation. The first constructed property is used for open relation scenarios and the second for closed relation scenarios.
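The composite keyword idea can be sketched as follows. This is a guess at the general shape rather than the actual index mapping: the separator, the field names `relation_keys` and `node_keys`, and the sample identifiers are assumptions, and a `Counter` stands in for an Elasticsearch aggregation over a keyword field.

```python
from collections import Counter

SEP = "|"  # assumed separator; the real one is not stated in the text


def composite_keys(relations):
    """Build both concatenated keyword arrays for one document.

    Each relation is a dict with: id, node_type, relation, value.
    """
    first = [SEP.join((r["id"], r["relation"], r["value"])) for r in relations]
    second = [
        SEP.join((r["id"], r["node_type"], r["value"], r["relation"]))
        for r in relations
    ]
    return first, second


def facet_counts(documents, field):
    """Count documents per distinct keyword in `field`."""
    counts = Counter()
    for doc in documents:
        counts.update(set(doc[field]))
    return counts


# Two different places that share the label "Nederland" (hypothetical ids).
country = {"id": "place-1", "node_type": "place", "relation": "made_in", "value": "Nederland"}
village = {"id": "place-2", "node_type": "place", "relation": "made_in", "value": "Nederland"}

documents = []
for relations in ([country], [village], [country]):
    first, second = composite_keys(relations)
    documents.append({"relation_keys": first, "node_keys": second})

counts = facet_counts(documents, "relation_keys")
for key, count in sorted(counts.items()):
    print(key, count)
```

Because the identifier is part of each keyword, the two places named Nederland stay separate facets even though their labels are identical.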

In the end, this has been one of the biggest shifts. Taking a step back, one can see that the online collection will in the future rely more on Elasticsearch than on Neo4j. The graph remains relevant for relation-heavy tasks such as the generation of node pages and for discovery purposes such as the recommendations. Elasticsearch, however, is right now still the most potent technology for executing high-performance complex search queries.

Another shift is to no longer automatically take the visitor to a selected node page when available. The node pages remain a fascinating addition but aren’t the best place for a visitor to land when expecting to start a conventional search journey. This continued alignment with the expectations of the many different types of visitors will always be critical.

There are more great changes that have already been implemented or are planned, but this blog post has to end somewhere. Make sure to check out for yourself what the collection website has to offer now and in the future!

Do you also like these kinds of technical challenges? Check our job vacancies (in Dutch) at werkenbij.q42.nl!