Matthia Sabatelli is a Ph.D. candidate in Machine Learning at the Department of Electrical Engineering and Computer Science of the University of Liège, where he is supervised by Dr. Pierre Geurts. His main research interests revolve around the transferability and scalability of deep neural networks, whose generalization properties he studies through the lens of different machine learning paradigms, ranging from computer vision to reinforcement learning.
Nikolay Banar is a Ph.D. candidate at the University of Antwerp (Belgium). His scientific interests lie at the intersection of machine learning and the humanities.
Marie Cocriamont obtained her Master's degree in Musicology at the University of Ghent in 2016, specializing in the comparison of didactic methods in Classical Arabic music. In 2019 she started working as a scientific assistant at the Royal Museums of Art and History, where she mainly works as an annotator for the research project INSIGHT (Intelligent Neural Systems as InteGrated Heritage Tools).
Eva Coudyzer obtained a Master's degree in Art History and Archaeology in 2004 at the Vrije Universiteit Brussel. She worked in documentation centers and collection management services in several cultural organizations in Belgium. In 2009 she started working as a scientific assistant at the Royal Museums of Art and History, specializing in collection management systems. She was coordinator of and partner in several national and international digitization projects, with a main focus on linking and publishing collections using standardized controlled vocabularies. She currently works as a scientific assistant at the information center of the Royal Institute for Cultural Heritage, where she participates in the development of the collection management system and the valorization of the collection in digitization projects.
Karine Lasaracina holds master's degrees in art history and journalism. She joined the RMFAB in 1999 and is now head of the Digital Museum unit. From the very beginning of her career, she has been interested in the digital management of heritage data. A current focus is the development of digital applications that support enriched visitor experiences in the museum through various innovative technological solutions, for example virtual reality tools, multimedia narratives and virtual exhibitions. Promoter of various ongoing research projects, she also works on Data Interoperability, Open Science, the development of Artificial Intelligence in the service of museums, and innovation in the field of images of artworks (reproduction, storage, preservation and sharing).
Walter Daelemans is professor of Computational Linguistics at the University of Antwerp and research director of the CLiPS (Computational Linguistics, Psycholinguistics and Sociolinguistics) research centre. His expertise is in Natural Language Processing and Machine Learning and applications in automatic text analysis and computational stylometry.
Pierre Geurts is professor in computer science at the University of Liège. His research interests concern the design and the empirical and theoretical analysis of machine learning algorithms, with an emphasis on their scalability, interpretability, and usability. He develops real-world applications of these algorithms in various domains, including computational and systems biology, computer vision, and digital humanities.
Mike Kestemont is research professor in Digital Text Analysis at the University of Antwerp (Belgium). His expertise lies in the application of computational methods to the Humanities, in particular premodern literature. With F. Karsdorp and A. Riddell, he has co-authored the monograph Humanities Data Analysis: Case Studies with Python (Princeton University Press, 2021).
In this paper, we present MINERVA, the first benchmark dataset for the detection of musical instruments in non-photorealistic, unrestricted image collections from the realm of the visual arts. This effort is situated against the scholarly background of music iconography, an interdisciplinary field at the intersection of musicology and art history. We benchmark a number of state-of-the-art systems for image classification and object detection. Our results demonstrate the feasibility of the task but also highlight the significant challenges which this artistic material poses to computer vision. We apply the system to an out-of-sample collection and offer an interpretive discussion of the detected false positives. The error analysis yields a number of unexpected insights into the contextual cues that trigger the detector. The iconography surrounding children and musical instruments, for instance, shares some core properties, such as an intimacy in body language.
The Digital Humanities constitute an intersectional community of praxis, in which the application of computing technologies across various subdisciplines in the Humanities is explored.
In the Digital Humanities too, the potential of computer vision is nowadays increasingly
recognized. A programmatic duet of two recent articles on "distant viewing" in the field's
flagship journal bears witness to this growing interest.
The structure of this paper is as follows. First, we motivate and contextualize our case study of musical instruments from within the scholarly framework of music iconography and computer vision, but also from the more pragmatic context of the research project from which this focus has emerged. We go on to describe the construction and characteristics of an annotated benchmark dataset, the MINERVA dataset, which will be released together with this paper and through which we hope to stimulate further research in this area. Using this benchmark data, we stress-test the available technology for the identification and detection of objects in images and discuss the current limitations of these systems. To illustrate the broader relevance of our approach, we apply the trained benchmark system 'in the wild', to unseen and out-of-sample heritage data, followed by a quantitative and qualitative evaluation of the results. Finally, we identify what seem to be the most relevant directions for future research.
The present paper must be understood against the wider scholarly background of music
iconography, a Humanities field of inquiry with a rich, interdisciplinary history in its
own right.
Music iconography deliberately adopts a "methodological plurality".
Music iconography has an important tradition of focused studies targeting the deep,
interpretive analysis of individual artworks or small collections of them. Such
hermeneutic case studies have the advantage of depth, but understandably lack a more
panoramic perspective on the phenomena of interest and, for instance, diachronic or
synchronic trends and shifts therein. The large-scale, "serial" study of musical
instruments as depicted across the visual arts remains a desideratum in the field and
has the potential to bring a macroscopic perspective to historical developments. In the present paper, we explore the feasibility of applying methods from present-day computer vision in an attempt to scale up current approaches. The primary motivation for this endeavour is that digital music iconography – or "Distant" music iconography, by analogy with similar developments in literary studies – can provide exactly such a panoramic perspective.
This scholarly initiative is embedded in the collaborative research project INSIGHT
(Intelligent Neural Systems as InteGrated Heritage Tools), which aims to stimulate the
application of Artificial Intelligence to the rapidly expanding digital collections of a
selection of federal museum clusters in Belgium.
The methodology for the present paper largely derives from machine learning and more
specifically computer vision, a field concerned with computational algorithms that can
mimic the perceptual abilities of humans and their capacity to construct high-level
interpretations from raw visual stimuli.
One major hurdle is that computer vision nowadays strongly gravitates towards so-called
photo-realistic material, i.e. digitized or born-digital versions of photographs that do
not actively attempt to distort the reality they depict. The best example in this
respect is the influential ImageNet dataset.
It is a well-known limitation that convolutional neural networks require large amounts
of manually annotated example data (or training data) in order to perform well. To
address this issue, the community has released several public datasets over the years.
Computer vision researchers interested in the artistic domain have attempted to
alleviate the relative dearth of training data by either releasing domain-specific
datasets or by transferring models pretrained on photographic material to the artistic domain.
With this work, we take one step forward in addressing these limitations. Firstly, the MINERVA dataset that we present below specifically tackles the problem of object detection within the broader heritage domain of the visual arts, introducing a novel benchmark for researchers working at the intersection of computer vision and art history. Secondly, we present a number of baseline results on the newly introduced dataset. The results are reported for a representative set of common architectures, which were pretrained on photo-realistic images. This allows us to investigate to what extent these methods can be reused when tested on artistic images.
Previous studies have demonstrated the feasibility of "pretraining": with this
approach, networks are first trained on (large) photorealistic collections (i.e. the
source domain) and then applied downstream (or further fine-tuned) on an out-of-sample
target domain, that has much less annotated data available. While generally useful, this
approach is still confronted with the problem that the annotation labels or categories
attested in the source domain are often of little interest within the target domain
(i.e. art history, in the present case). The popular Pascal-VOC dataset, for instance, covers only twenty everyday object categories, none of which is a musical instrument.
Popular object detectors such as YOLO inherit this limitation from the photorealistic datasets on which they are trained.
All examples in Figure 1 come from a pretrained YOLO-V3
model.
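To make the mismatch concrete, the following minimal sketch applies an off-the-shelf detector, pretrained on photographic data, to a digitized artwork. Torchvision's pretrained Faster R-CNN serves here as a readily available stand-in for the YOLO-V3 model discussed above, and the file name is a hypothetical placeholder.

```python
# Minimal sketch: run a detector pretrained on photographic data (COCO) over
# an artwork image. Faster R-CNN stands in for YOLO-V3; "artwork.jpg" is a
# hypothetical placeholder.
import torch
from PIL import Image
from torchvision.models.detection import fasterrcnn_resnet50_fpn
from torchvision.transforms.functional import to_tensor

model = fasterrcnn_resnet50_fpn(weights="DEFAULT").eval()
image = to_tensor(Image.open("artwork.jpg").convert("RGB"))

with torch.no_grad():
    (prediction,) = model([image])

# The detector only knows COCO's everyday categories (person, sheep, tie...),
# none of which is a musical instrument.
for box, label, score in zip(prediction["boxes"], prediction["labels"],
                             prediction["scores"]):
    if score >= 0.5:
        print(int(label), float(score), [round(v, 1) for v in box.tolist()])
```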
In this section, we describe MINERVA, the annotated object detection dataset presented in this work. This novel benchmark dataset will be released
jointly with this paper.
The base data for our annotation effort was assembled in a series of 'concentric' collection campaigns, where we started from smaller, but high-quality datasets and gradually expanded into larger, albeit less well curated data sources.
Our collection efforts were inclusive, and the resulting dataset should be considered as "unrestricted", covering a variety of periods, genres and materials (although it was not feasible to include more precise metadata about these aspects in the dataset). Note that, exactly because of this highly mixed data origin, the distribution in MINERVA does not give a faithful representation of any kind of historic reality: music iconography gives a highly colored perspective on "popular" instruments in art history and some instruments may not often have been depicted, even though they were popular at the time. Likewise, other instruments are likely to be over-represented in iconography.
To increase the interoperability of the dataset, individual instruments have been
unambiguously identified using their MIMO codes. The MIMO (Musical Instrument Museums
Online) initiative is an international consortium, well known for its online database of
musical instruments, aggregating data and metadata from multiple heritage institutions.
Using the conventional method of rectangular bounding boxes, we have manually annotated 16,142 musical instruments (belonging to 172 unique instrument types) in a collection of 11,765 images, within the open-source Cytomine software environment.
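For illustration, each annotation can be thought of as a record tying an image identifier and a MIMO code to a rectangular box. The sketch below uses hypothetical field names and a placeholder code; it does not reflect the released file format.

```python
# Illustrative representation of a single MINERVA annotation; field names and
# the code value are placeholders, not the released format.
from dataclasses import dataclass

@dataclass
class InstrumentAnnotation:
    image_id: str    # identifier of the digitized artwork
    mimo_code: str   # unambiguous code from the MIMO vocabulary
    label: str       # human-readable instrument name
    x_min: float     # rectangular bounding box, in pixel coordinates
    y_min: float
    x_max: float
    y_max: float

ann = InstrumentAnnotation("img_00042", "MIMO-0000", "lute",
                           120.0, 85.5, 310.0, 402.0)  # placeholder values
```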
The dataset contains artistic objects from diverse periods and of various types, ranging from paintings, sculptures and drawings to decorative arts, manuscript illuminations and stained-glass windows. The images thus involve a daunting diversity of media, techniques and modes. Whereas in some cases the images were straightforward to annotate (e.g. an image representing a bell in full frame), several obstacles recurred. These obstacles can be linked to three parameters:
An important share of the annotations which we collected were singletons, i.e. instruments that were encountered only once or twice. Although we release the full dataset, we shall from now on only consider instruments that occur at least three times, which allows for a conventional machine learning setup (with non-overlapping train, validation and test sets that each include at least one instance of each label). Whereas the full MIMO vocabulary covers over 2,000 terms for individual instruments, only a fraction of these are attested in the 4,183 images which we use below (overview in Table 1). Note that this table shows a considerable drop from the original number of annotated images, because we only included images that (a) actually contain an instrument and (b) depict instruments that occur at least thrice.
In total, 93 different instrument categories appear at least thrice in the dataset. A visualization of the heavily skewed distribution of the different instruments can be seen in Figure 4, where each instrument is represented together with its corresponding MIMO code (between parentheses). This distribution exposes two core aspects of this dataset (but also of music iconography in general): (i) its strong Western-European bias, which has been historically acknowledged and which scholars are actively, if slowly, trying to correct nowadays; (ii) the 'heavy-tail' distribution associated with cultural data in general, i.e. only a fraction of the instruments, such as the lute, harp and violin, are depicted with a high frequency, while the rest occur much more sparsely.
The label imbalance described in the previous paragraph is a significant issue for machine learning methods. We therefore experiment with the data in five versions (available from the repository) that correspond to object detection tasks of varying complexity. We start by exploring whether it is possible to simply detect the presence of an instrument in the different artworks, without also predicting the class of the detected instrument. We refer to this benchmark as single-instrument object detection. We then move to three more challenging tasks in which we also aim at correctly classifying the content of the detected bounding boxes. We include data for this detection task for the top-5, the top-10 and the top-20 most frequently occurring instruments, a customary practice in the field. Finally, we repeat this task for all images, but with the "hypernym" labels of the instrument categories (see Figure 5).
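The derivation of these versions can be sketched as follows, assuming the annotations are available as (image identifier, label) pairs; this is an illustrative reconstruction, not the exact preprocessing used.

```python
# Sketch of deriving the benchmark versions: drop instruments attested fewer
# than three times, optionally keep only the k most frequent categories.
from collections import Counter

def task_subset(annotations, k=None, min_count=3):
    counts = Counter(label for _, label in annotations)
    kept = [lab for lab, n in counts.most_common() if n >= min_count]
    if k is not None:          # e.g. k = 5, 10 or 20 for the top-k tasks
        kept = kept[:k]
    kept = set(kept)
    return [(img, lab) for img, lab in annotations if lab in kept]

# top5 = task_subset(pairs, k=5); full = task_subset(pairs)
```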
Each version of the dataset comes with its own training, development and testing splits, where we guarantee that each of the instrument classes in the task is represented in each of the splits. Additionally, the splits are stratified, so that the class distribution is approximately the same in each split. The number of images per split in each version is summarized in Table 2.
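Such a stratified three-way split could be implemented along the following lines; a sketch using scikit-learn, under the simplifying assumption of one instrument label per image.

```python
# Sketch of a stratified train/dev/test split at the image level; the class
# minimum of three instances is what makes stratification feasible.
from sklearn.model_selection import train_test_split

def stratified_splits(image_ids, labels, seed=42):
    train_ids, rest_ids, _, rest_y = train_test_split(
        image_ids, labels, test_size=0.3, stratify=labels, random_state=seed)
    dev_ids, test_ids = train_test_split(
        rest_ids, test_size=0.5, stratify=rest_y, random_state=seed)
    return train_ids, dev_ids, test_ids
```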
The hypernym version of the dataset is not reported in this table as it shares the same
images and splits as the single-instrument version (they both contain all instruments).
We used standard, publicly available implementations of all models discussed below.
In the first benchmark experiment, we start by investigating whether convolutional neural networks are able to correctly classify the different instruments that are present in the dataset. That means that we focus on the image classification task and postpone the task of object detection to the next section. To this end, we have extracted the various patches delineated by the bounding boxes in the detection dataset as stand-alone instances. Note, however, that patches from the same image always ended up in the same split, to avoid information leakage across the splits. Example patches are shown in Figure 6.
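The patch extraction itself is straightforward; a sketch reusing the hypothetical annotation record introduced above.

```python
# Sketch: crop each annotated bounding box into a stand-alone patch. Patches
# keep their source image_id, so that splitting can happen per image and no
# patch of one artwork leaks into another split.
from PIL import Image

def extract_patches(annotations, image_dir):
    patches = []
    for ann in annotations:
        img = Image.open(f"{image_dir}/{ann.image_id}.jpg").convert("RGB")
        box = tuple(int(v) for v in (ann.x_min, ann.y_min, ann.x_max, ann.y_max))
        patches.append((ann.image_id, ann.label, img.crop(box)))
    return patches
```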
Next, we tackled this task as a standard machine-learning classification problem for
which we applied a representative selection of established neural network architectures.
All of these networks were pretrained on the Rijksmuseum dataset.
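The generic pattern behind such experiments is to load a pretrained backbone and replace its classification head with one sized to the MINERVA label set. In the sketch below, torchvision's ImageNet weights stand in for the Rijksmuseum pretraining used in the paper.

```python
# Fine-tuning sketch: pretrained backbone, new classification head.
import torch.nn as nn
from torchvision.models import resnet50

def build_classifier(num_classes):
    model = resnet50(weights="DEFAULT")                       # pretrained backbone
    model.fc = nn.Linear(model.fc.in_features, num_classes)   # new head
    return model

model = build_classifier(num_classes=20)  # e.g. the top-20 instrument task
```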
In Table 2 and Table 3 we report the results in terms of accuracy and F1-score for the MINERVA test sets. For the individual instruments, we do so for four versions of the dataset of increasing complexity: the top-5 instruments, the top-10 instruments, the top-20 instruments and the entire dataset. Analogously, we report the scores for a classification experiment where the classifier is trained on the instrument hypernyms as class labels.
For the second benchmark experiment, we report the results that we have obtained on four of the five detection benchmarks introduced in the previous section. The way the
different instruments are distributed in their respective test sets is visually
represented in the first image of each row of Figure 7.
For our experiments, we use the popular YOLO-V3 architecture.
To assess the performance of the neural network, we follow the same evaluation protocol
that characterizes object detection problems in CV.
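The core quantity in this protocol is the overlap between predicted and ground-truth boxes, measured as intersection over union (IoU); a detection is conventionally counted as correct when the IoU exceeds a fixed threshold such as 0.5. A minimal reference implementation:

```python
# Intersection over union between two boxes, each given as
# (x_min, y_min, x_max, y_max).
def iou(box_a, box_b):
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0

assert iou((0, 0, 2, 2), (1, 1, 3, 3)) == 1 / 7  # quick sanity check
```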
As an additional stress-test, we have applied a trained object detector to two external datasets, in order to assess how valid and performant our approach is when applied "in the wild". We have considered two out-of-sample datasets:
Note that the two external datasets differ in crucial aspects. The first can be considered "out-of-sample" but "in-collection", in the sense that its images derive from the same digital collections as many of the images represented in MINERVA. Additionally, we can expect extremely low detection rates for this dataset, because the presence of musical instruments will already have been flagged in a large majority of cases by the museum's staff. Thus, the application of the detector to this collection should be viewed as a rather conservative stress test or sanity check, mainly looking for images that might have been missed by annotators in the past. The IconArt dataset is "out-of-sample" and "out-of-collection", in the sense that its images derive from a variety of other sources. It is therefore fully unrestricted, and this test can be considered a curiosity-driven validation of the method "in the wild". Importantly, IconArt was not collected with specific attention to musical instruments, so here too we can anticipate a rather low detection rate (since many works of art simply do not feature any instruments). For all these reasons, we only evaluate the results on these external datasets in terms of precision (recall being much less meaningful in this context).
Following these differences, we have applied the single-instrument detector to the first dataset and the hypernym detector to IconArt. With an eye to the feasibility of manual inspection, we have limited the number of returned instances by only allowing detections with a confidence score ≥ 0.20 (a rather generous threshold). The results have then been evaluated in terms of precision, i.e. the share of returned image regions that actually represent musical instruments. The results are presented in Table 6. Figure 10 showcases a number of cherry-picked successful detections from the out-of-collection IconArt images.
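Concretely, the filtering and precision computation amount to the following sketch, where the is_instrument field stands for the manual verdict on each returned region (a hypothetical name).

```python
# Sketch of the out-of-sample evaluation: keep detections at or above the
# 0.20 confidence threshold and compute precision over the manual judgements.
def precision_at_threshold(detections, threshold=0.20):
    kept = [d for d in detections if d["score"] >= threshold]
    if not kept:
        return 0.0
    return sum(d["is_instrument"] for d in kept) / len(kept)

# precision_at_threshold([{"score": 0.35, "is_instrument": True}, ...])
```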
First and foremost, we can observe that the scores obtained across all benchmarks are generally much lower than those reported for other datasets in computer vision (outside of the strictly artistic domain). This drop in performance was to be expected and can be attributed both to the smaller size of the training data and to the higher variance in the depiction of musical instruments (across periods, materials, modes and artists). Secondly, one can observe large fluctuations in the identifiability and detectability of individual instrument categories across both tasks. Not all of these fluctuations are easy to account for.
We first consider the classification results. The confusion matrix reported in Table 4 clearly shows that the classes representing the four most frequent instruments (harp, lute, violin, and portative organ) can be learned rather successfully, but that performance rapidly breaks down for instrument categories at lower frequency ranks. Thus, while the scores for the top-5 experiments are relatively satisfying, especially in terms of accuracy, performance degrades quickly as the label set grows.
The skewness of the class distribution in MINERVA is representative of the long-tail
distribution that we commonly encounter in cultural data. This imbalance is somewhat
alleviated in the hypernym setup, where the labels are of course much better distributed
over a much smaller number of classes.
Similar trends can be observed for the musical instrument detection task. First of all,
we should emphasize the encouraging scores for the "single-instrument" detection task
that simply aims to detect musical instruments (no matter their type). Here, a
relatively high precision score is obtained.
When making our way down the frequency list in Table 5, we
again observe how the results break down dramatically for less common instrument
categories. The fact that an over-represented category like the harp can be detected reasonably well again underlines how decisive the amount of training data per category is.
The results from the previous section raise the question of which visual properties the neural networks exploit when identifying instruments. Importantly, the characteristic features exploited by a machine learning algorithm need not coincide with the properties judged most relevant by human experts, and the comparison of both types of relevance judgements is worthwhile. In this section, we therefore perform model criticism or "network introspection" on the basis of the so-called "saliency maps" that can be extracted from a trained model.
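As a minimal sketch (assuming a PyTorch image classifier), a vanilla gradient saliency map can be obtained by backpropagating the top class score to the input pixels.

```python
# Vanilla gradient saliency: how strongly does each input pixel influence the
# network's top class score?
import torch

def saliency_map(model, image):          # image: tensor of shape (3, H, W)
    model.eval()
    x = image.unsqueeze(0).requires_grad_(True)
    scores = model(x)                     # shape (1, num_classes)
    scores.max().backward()               # gradient of the top class score
    # collapse the colour channels: one saliency value per pixel
    return x.grad.abs().squeeze(0).max(dim=0).values
```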
The maps in Figure 9 vividly illustrate that the network focuses on two broad types of regions: properties of the instruments themselves (which was expected), but also the immediate context of the instruments, and more specifically the way they are operated, handled or presented by people, typically musicians. The characteristics of the salient regions in the examples in Figure 9 could be described as:
These characteristics strongly suggest that the way an instrument is handled (i.e. its immediate iconographic neighborhood) is potentially as important as the shape of the actual instrument, an insight that we will expand on below.
In this section, we offer a qualitative discussion of the false positives from the out-of-sample tests reported in the previous section, i.e. instances where the detectors erroneously believed they had detected an instrument. This eagerness is a known problem of object detectors: a system that is trained to recognize "sheep" will be inclined to see "sheep" everywhere. Anecdotally, people have noted how misleading contextual cues can indeed be a confounding factor in image analysis. One blog post, for instance, noted how a major image labeling service tagged photographs of green fields with the "sheep" label, although no sheep whatsoever were present in the images.
The above categorization illustrates that the false positives are rather insightful, mainly because the absence of an instrument highlights the contextual clues that are at work. Of particular relevance is the observation that the iconography surrounding children closely resembles that of instruments. This seems related to the intimate and caring body language of both the caretakers and the musicians in such compositions. The immediate iconographic neighborhood of children clearly reminds the detector of the delicacy and reverence with which instruments are portrayed and presented in historical artworks. This delicacy and intimacy in body language can be specifically related to the foregrounding of fingers, the prominent portrayal of which invariably triggers the detector, even in the absence of children. Some of these phenomena invite closer inspection by domain experts in music iconography and suggest that serial or panoramic analyses are a worthwhile endeavour in this field, also from the point of view of more hermeneutically oriented scholars.
In this paper, we have introduced MINERVA, to our knowledge the first sizable benchmark dataset for the identification and detection of individual musical instruments in unrestricted, digitized images from the realm of the visual arts. Our benchmark experiments have highlighted the feasibility of a number of tasks but also, and perhaps primarily, the significant challenges that state-of-the-art machine learning systems are still confronted with on this data, such as the "long-tail" of the instruments' distribution and the staggering variance in depiction across the images in the dataset. We therefore hope that this work will inspire new (and much-needed) research in this area. At the end of this paper, we wish to formulate some advice and concerns in this respect.
One evident direction for future research is more advanced transfer learning, where algorithms make more efficient use of the wealth of photorealistic data that is provided, for instance, by MIMO.
One crucial final remark is that AI has an amply attested tendency not only to be
sensitive to biases in the input data but also to amplify them.
We wish to thank Remy Vandaele for his help with Cytomine and for the fruitful discussions related to computer vision and object detection. Special thanks go out to our annotators and other (former) colleagues in the museums involved: Cedric Feys, Odile Keromnes, Lies Van De Cappelle and Els Angenon. Our gratitude also goes out to Rodolphe Bailly for his support and advice regarding MIMO. Finally, we wish to credit our former project member Dr. Ellen van Keer with the original idea of applying object detection to musical instruments. This project is generously funded by the Belgian Federal Research Agency BELSPO under the BRAIN-be program.