Information Extraction (IE) is becoming increasingly important for the semantic analysis of free-text documents stored in large document repositories, such as the Web. Once free text has been analysed to recognise concepts and their interrelations in events and facts of interest, the resulting structured information becomes a valuable knowledge resource. This resource can be used in other information management technologies, such as document summarisation, ontology development, semantic document indexing and question answering, or can be further exploited by data mining and reasoning technologies.
A key element for the extraction of information from a natural language document is a set of shallow text analysis rules, which are typically based on pre-defined linguistic patterns. One current IE research objective is the automatic or semi-automatic acquisition of these rules. Current approaches to this problem typically rely on training text data or existing knowledge resources, such as domain ontologies.
Within this research framework, we propose a knowledge-poor methodology for rule pattern acquisition. Our proposed NeT method for knowledge acquisition in IE aims to facilitate the development and customisation of IE systems. It is a data-centric approach which requires neither manually annotated documents nor pre-existing domain knowledge resources. The NeT method is based on the hypothesis that terms (the linguistic representation of concepts in a specialised domain) and Named Entities (e.g., the names of persons, organisations and dates of importance in the text) can together be considered as the basic semantic entities of textual information and can therefore be used as a basis for the conceptual representation of domain-specific texts. The extraction patterns discovered by this approach involve significant associations of these semantic entities with verbs, and they can subsequently be translated into the grammar formalism of choice.
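The idea of a significant association between a semantic entity and a verb can be illustrated with a minimal sketch. The abstract does not say which association measure NeT uses, so pointwise mutual information (PMI) is assumed here purely for illustration. The toy input, the entity labels, the function name `verb_entity_pmi` and the `min_count` threshold are all hypothetical, and the sentences are assumed to have already been reduced to a main verb plus the terms and Named Entities found around it.

```python
import math
from collections import Counter

# Hypothetical, pre-analysed input: one (verb lemma, semantic entities) pair
# per sentence. Labels such as "ORGANISATION" or "TERM:share" are invented.
SENTENCES = [
    ("acquire", ["ORGANISATION", "ORGANISATION"]),
    ("acquire", ["ORGANISATION", "TERM:share"]),
    ("appoint", ["ORGANISATION", "PERSON"]),
    ("appoint", ["PERSON", "DATE"]),
    ("rise", ["TERM:share", "DATE"]),
    ("rise", ["TERM:share"]),
]


def verb_entity_pmi(sentences, min_count=2):
    """Score verb/entity pairs by sentence-level PMI (an assumed measure)."""
    verb_counts = Counter()
    entity_counts = Counter()
    pair_counts = Counter()
    total = len(sentences)

    for verb, entities in sentences:
        verb_counts[verb] += 1
        # Count each entity type once per sentence.
        for entity in set(entities):
            entity_counts[entity] += 1
            pair_counts[(verb, entity)] += 1

    scores = {}
    for (verb, entity), joint in pair_counts.items():
        if joint < min_count:
            continue  # skip pairs seen too rarely to be informative
        expected = verb_counts[verb] * entity_counts[entity] / total
        scores[(verb, entity)] = math.log2(joint / expected)
    return scores


if __name__ == "__main__":
    ranked = sorted(verb_entity_pmi(SENTENCES).items(), key=lambda kv: -kv[1])
    for (verb, entity), score in ranked:
        print(f"{verb:8s} {entity:14s} PMI={score:.3f}")
```

Pairs that survive the threshold with a positive score, such as a verb that co-occurs with a particular entity type more often than chance would predict, are the kind of candidate that could then be rewritten as a rule in a chosen grammar formalism.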
The proposed NeT method has been implemented in a demonstrator application that combines existing tools (ENGCG, BSEE, C/NC value) with custom-developed ones. The method has been evaluated against manually annotated data, with very promising results.
Kalliopi Zervanou is an Associate Researcher at the Technical University of Crete (TUC), Dept. of Electronics & Computer Engineering. She received her Bachelor's degree in French Literature & Linguistics from Aristoteles University of Thessaloniki, and an MSc in Machine Translation and a PhD in Information Extraction from UMIST and the University of Manchester. She has worked as a Researcher at the Dept. of Computation, UMIST, on the CONCERTO (ESPRIT n.29159: CONCEptual indexing, querying and ReTrieval Of digital documents) and PARMENIDES (IST-2001-39023: Ontology driven Temporal Text mining on organisational data for extracting temporal valid knowledge) projects. She joined the Technical University of Crete, Dept. of Electronics & Computer Engineering, in 2005, where she worked as one of the principal investigators for the Information Extraction and Ontology Development components of the TOWL Project (Time-determined ontology based information system for real time stock market analysis). Her research interests include information extraction, knowledge acquisition and representation techniques, development of linguistic resources, terminology extraction, automatic summarisation and machine translation.