Natural history offers an interesting mix of traditional and modern ways of organizing data, information, and knowledge. Within the MITCH project we develop knowledge enrichment methods for museum collection data, enhancing information access for researchers in taxonomy and biodiversity. Our material (metadata of collection objects, as well as textual database records) is largely composed of natural language text, which is generally noisier and more ambiguous than numeric data. We present three case studies in text mining, drawing on supervised and unsupervised machine learning methods: named entity recognition in digitized field trip logbooks, automated discovery of metadata from textual databases, and content mapping in scientific publications.
After graduating from Pecs University (Hungary) in Language and literature studies, Piroska Lendvai obtained a PhD in 2004 from Tilburg University (Netherlands), where she applied machine learning techniques to natural language dialogues in order to extract pragmatic-semantic information from spoken user input. She then joined the Dutch national IMIX project, which aimed at developing a spoken dialogue system for IE and QA in the medical domain. At present she is a postdoctoral researcher in the MITCH project, developing text mining methods in the cultural heritage domain. She was co-chair and organiser of the workshop on Language Technology and Resources for Cultural Heritage, Social Sciences, Humanities, and Education in Athens in March 2009.