FAST CAT


Description and Distinctive Features

FAST CAT is a web-based system for collaborative data digitization and curation in descriptive and empirical sciences such as History. It was designed to support researchers in faithfully cataloguing their data sources, so that the transcriptions can serve as a primary source for research and remain valid in the long term. It supports both online and offline data entry, with the option of automatically synchronizing offline work to the online database.

In FAST CAT, data from different information sources can be transcribed as ‘records’ belonging to specific ‘templates’, where a ‘template’ represents the structure of a single data source. A record organizes the data and metadata in tables (similar to spreadsheets), offering functionalities such as nested tables and the selection of terms from a vocabulary.

The curation of the transcribed data can be performed through FAST CAT TEAM, a special environment of FAST CAT that allows the collaborative management of entities and vocabularies. An important characteristic is that this curation activity does not alter the data in the records as transcribed from the original sources, which is essential for verification and long-term validity.

The innovative features of FAST CAT are the following:

  • Support for both online and offline data entry with automated synchronization
  • Support for nested tables in the data entry forms
  • Embedded processes for entity correction and enrichment (such as adding location coordinates), instance matching (for entities such as persons) and vocabulary management (for setting preferred and broader terms)
  • Provenance-aware data curation and enrichment (the original transcribed data are never altered)
  • Configurable: it can easily be used for digitizing and curating data sources of any type



System Characteristics

Data Transcription

The first step, before the digitization process starts, is the creation of the ‘templates’, each one representing a distinct data source. This is performed in a pre-processing step, in close collaboration between the actual users (e.g., historians) and data engineers. This collaboration is necessary for designing the structure of the data entry forms in a template so that users can digitize the archival data accurately and quickly.

Configuring a template means defining the structure of the tables that together constitute it. Each table consists of a set of columns, and each column accepts values of a particular type: i) entity (the value is the name or an attribute of an entity, e.g., of a person), ii) vocabulary term (the value is a term from a controlled vocabulary), iii) literal (the value is a literal, e.g., free text, a number, or a date), or iv) nested table (the value is another table). In addition, a set of columns can be configured to accept multiple values in a single table row. Once the templates have been configured, users can start the digitization process by creating ‘records’, each one corresponding to a particular template.
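The template structure described above can be pictured with a small data model. The following is only a minimal sketch in Python, not FAST CAT's actual configuration format: the class names, the `multi_valued` flag, the `column_paths` helper and the column names ‘Date’, ‘Port’ and ‘Event’ are all hypothetical. Only the table names ‘Voyage Calendar’ and ‘Analytic Calendar’ come from the SeaLiT example.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List, Optional


class ColumnType(Enum):
    """The four value types a column can accept."""
    ENTITY = "entity"
    VOCABULARY_TERM = "vocabulary term"
    LITERAL = "literal"
    NESTED_TABLE = "nested table"


@dataclass
class Column:
    name: str
    type: ColumnType
    multi_valued: bool = False          # hypothetical flag: several values per row
    nested: Optional["Table"] = None    # set only for NESTED_TABLE columns

    def __post_init__(self):
        # A nested table definition must be present exactly when the type requires it.
        if (self.type is ColumnType.NESTED_TABLE) != (self.nested is not None):
            raise ValueError("'nested' must be given exactly for NESTED_TABLE columns")


@dataclass
class Table:
    name: str
    columns: List[Column] = field(default_factory=list)


def column_paths(table: Table, prefix: str = "") -> List[str]:
    """List every column of a table, descending into nested tables."""
    paths = []
    for col in table.columns:
        paths.append(prefix + col.name)
        if col.nested is not None:
            paths.extend(column_paths(col.nested, prefix=f"{prefix}{col.name}/"))
    return paths


# Illustrative configuration; the column names are invented for this sketch.
analytic_calendar = Table("Analytic Calendar", [
    Column("Event", ColumnType.VOCABULARY_TERM),
])
voyage_calendar = Table("Voyage Calendar", [
    Column("Date", ColumnType.LITERAL),
    Column("Port", ColumnType.ENTITY),
    Column("Analytic Calendar", ColumnType.NESTED_TABLE, nested=analytic_calendar),
])

print(column_paths(voyage_calendar))
```

Modelling a nested table as a column whose value is itself a table keeps the definition recursive, so nesting of any depth needs no special handling.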

The screenshot below shows the home page of FAST CAT, which presents a table containing all the available templates (real example from the SeaLiT project; more below).

The screenshot below shows an example of a record. The record belongs to a template named ‘Logbook’ (real example from the SeaLiT project).

The screenshot below shows the table ‘Voyage Calendar’ together with its nested table ‘Analytic Calendar’ (real example from the SeaLiT project).

Manual available here.


Data Curation

After completing the digitization of the data sources, users can start curating the transcribed data in order to integrate them into a common form from which further research and quantitative analysis can be carried out correctly and efficiently. This involves several steps, including:

  • correcting entity names, such as names of persons or locations
  • adding missing entity information or enriching entities with additional data, e.g., adding coordinates to locations to enable map visualizations
  • maintaining vocabularies of terms for certain types of data that appear in the transcribed data
  • dealing with varying entity identity assumptions, a problem known as instance matching

FAST CAT offers a special environment, called FAST CAT TEAM, through which users can collaboratively perform the above curation steps without changing the original transcribed data. The screenshot below shows the home page of FAST CAT TEAM (real example from the SeaLiT project; more below).
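One way to picture curation that leaves the transcribed data untouched is to keep corrections and enrichments in a separate layer that is applied only when the data are read. The sketch below assumes this overlay design purely for illustration; it does not describe how FAST CAT TEAM is implemented, and every class, method and value in it is hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Tuple


@dataclass(frozen=True)
class TranscribedValue:
    """A value exactly as transcribed from the source; frozen, so it cannot be changed."""
    record_id: str
    column: str
    value: str


class CurationLayer:
    """Corrections and enrichments stored separately from the records (assumed design)."""

    def __init__(self) -> None:
        # (column, transcribed value) -> curated value
        self._corrections: Dict[Tuple[str, str], str] = {}
        # curated location name -> (latitude, longitude)
        self._coordinates: Dict[str, Tuple[float, float]] = {}

    def correct(self, column: str, transcribed: str, curated: str) -> None:
        self._corrections[(column, transcribed)] = curated

    def set_coordinates(self, location: str, lat: float, lon: float) -> None:
        self._coordinates[location] = (lat, lon)

    def curated_value(self, tv: TranscribedValue) -> str:
        # Fall back to the transcribed value when no correction exists.
        return self._corrections.get((tv.column, tv.value), tv.value)

    def coordinates(self, tv: TranscribedValue) -> Optional[Tuple[float, float]]:
        return self._coordinates.get(self.curated_value(tv))


# Invented example: a location spelled "Genoua" in the source.
tv = TranscribedValue(record_id="rec-1", column="location", value="Genoua")
layer = CurationLayer()
layer.correct("location", "Genoua", "Genova")
layer.set_coordinates("Genova", 44.41, 8.93)  # approximate, for illustration only

print(tv.value, "->", layer.curated_value(tv), layer.coordinates(tv))
```

Because the transcribed value is never overwritten, the original reading stays available for verification alongside the curated one.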

The home page presents a table containing all the publicly shared FAST CAT records whose data can be curated. The curation steps are organized into two categories, accessible through the left menu: i) management of vocabularies, and ii) management of entity instances (in this example: Legal Entities, Locations, Persons, and Ships).

When the user clicks on the ‘Vocabularies’ menu item, a table is displayed with all vocabularies whose terms appear in the transcribed data. The user can select a vocabulary from the table and edit it. The figure below shows an example where the user has chosen to edit a vocabulary named ‘Ship Type’.
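Vocabulary management involves setting preferred and broader terms. The following sketch shows one possible way to represent such relations, assuming that each term has at most one preferred form and at most one broader term. The `Vocabulary` class and the three example terms are invented for illustration; they are not taken from the actual ‘Ship Type’ vocabulary.

```python
from typing import Dict, List


class Vocabulary:
    """A controlled vocabulary with preferred-term and broader-term relations (hypothetical)."""

    def __init__(self, name: str) -> None:
        self.name = name
        self._preferred: Dict[str, str] = {}  # variant term -> preferred term
        self._broader: Dict[str, str] = {}    # term -> broader term

    def set_preferred(self, term: str, preferred_term: str) -> None:
        self._preferred[term] = preferred_term

    def set_broader(self, term: str, broader_term: str) -> None:
        self._broader[term] = broader_term

    def resolve(self, term: str) -> str:
        """Return the preferred form of a term, or the term itself if none is set."""
        return self._preferred.get(term, term)

    def broader_chain(self, term: str) -> List[str]:
        """Return the broader terms of a term, from nearest to most general."""
        chain: List[str] = []
        current = self.resolve(term)
        seen = {current}
        while current in self._broader:
            current = self.resolve(self._broader[current])
            if current in seen:  # guard against accidental cycles
                break
            seen.add(current)
            chain.append(current)
        return chain


# Invented terms, for illustration only.
ship_type = Vocabulary("Ship Type")
ship_type.set_preferred("Brigantino", "Brigantine")
ship_type.set_broader("Brigantine", "Sailing ship")

print(ship_type.resolve("Brigantino"), ship_type.broader_chain("Brigantino"))
```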

To curate the entity instances that appear in the transcribed data, the user can click on the corresponding menu item and inspect a table with all the instances of the selected entity type. The table displays different information, and offers different curation options, for each entity type. The figures below show examples of how the user can curate entities of type ‘Person’ and ‘Location’ (real examples from the SeaLiT project).
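Instance matching, i.e., deciding that several transcribed entity instances refer to the same real-world entity, can be pictured as grouping instances into equivalence classes. The sketch below uses a union-find structure to record "same as" decisions made by curators. This is an assumed technique chosen for illustration, not FAST CAT's documented method, and the identifiers and person names are invented.

```python
from typing import Dict, List, Set


class InstanceMatcher:
    """Groups entity instances that curators have declared to be the same (union-find)."""

    def __init__(self) -> None:
        self._parent: Dict[str, str] = {}

    def _find(self, x: str) -> str:
        # Register unseen instances as their own group, then walk up to the root.
        self._parent.setdefault(x, x)
        while self._parent[x] != x:
            self._parent[x] = self._parent[self._parent[x]]  # path halving
            x = self._parent[x]
        return x

    def mark_same(self, a: str, b: str) -> None:
        """Record that instances a and b denote the same entity."""
        root_a, root_b = self._find(a), self._find(b)
        if root_a != root_b:
            self._parent[root_b] = root_a

    def same(self, a: str, b: str) -> bool:
        return self._find(a) == self._find(b)

    def groups(self) -> List[Set[str]]:
        by_root: Dict[str, Set[str]] = {}
        for instance in list(self._parent):
            by_root.setdefault(self._find(instance), set()).add(instance)
        return list(by_root.values())


# Invented person instances from three hypothetical records.
matcher = InstanceMatcher()
matcher.mark_same("rec-1:Ioannis Papadopoulos", "rec-7:I. Papadopoulos")
matcher.mark_same("rec-7:I. Papadopoulos", "rec-9:Giannis Papadopoulos")

print(matcher.same("rec-1:Ioannis Papadopoulos", "rec-9:Giannis Papadopoulos"))
```

Since the matches live in a separate structure, the names in the records remain exactly as transcribed.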

Manual available here.



Use Case: the SeaLiT project

FAST CAT is currently used by around 30 historians in the context of the SeaLiT project, a European (ERC) project in Maritime History. SeaLiT studies the transition from sail to steam navigation and its effects on seafaring populations in the Mediterranean and the Black Sea between the 1850s and the 1920s. The project investigates the maritime labor market, the evolving relations among ship-owners, captains, crews and local societies, and the development of new business strategies, trade routes and navigation patterns during this transitional period.

The data sources used in SeaLiT range from handwritten ship logbooks, crew lists, payrolls and student registers to civil registers, business records, account books and consulate reports, gathered from different authorities and written in different languages, including Spanish, Italian, French, Russian and Greek.

The number of configured templates is currently 20, representing 20 different types of data sources, and the total number of records is around 500. The transcribed data are written in 5 languages: Greek, French, Spanish (Castilian), Italian, and Russian.

Indicative examples of records from several templates:



Contact Persons