Importand Dates

Abstract submission:
» March 14, 2003 Submission deadline:
» March 21, 2003 Notification:
» April 25, 2003 Camera-Ready Due:
» May 16, 2003 Workshop:
» June 12-13, 2003

 
Sponsored by:
OGI School of Science & Engineering, Oregon Health & Science University
Computer Science Department, University of Crete
 
 
 
 
 
 
 
 
 
 
 
 
Accepted papers

14 Stream-based XML processing with type filtering
George Russell, Mathias Neumuller, and Richard Connor
Stream-based query methods are event-based and interfaces such as SAX are notoriously difficult to program.

A compromise is to introduce a two-phase evaluation strategy, where the query is described as a primary filter whose output is made incrementally available to a secondary query. Without losing critical properties, this allows the first phase to be expressed in some non-Turing complete algebra, more in tune with human intuition, and the second part to be expressed in a high-level algebra.

This framework has been previously investigated with a first phase implemented using a regular expression language such as XPath. We investigate the tradeoffs of using a type algebra to describe the set of data required for the second phase.

17 On the updatability of XML views
Vanessa Braganholo, Susan Davidson, and Carlos Heuser
XML has become an important medium for data exchange, and is also used as an interface to -- i.e. a view of -- a relational database. While previous work has considered XML views for the purpose of querying relational databases (e.g. Silkroute), in this paper we consider the problem of updating a relational database through an XML view. Using the nested relational algebra as the formalism for an XML view, we study the problem of when such views are updatable. Our results rely on the fact that in most reasonable XML views of relational databases, the unnest operator does not occur. Furthermore, the nest operator is invertible. We therefore show that for an important class of XML views it is possible to consider them as if they were flat relational views.
27 Exploiting Structure, Annotation, and Ontological Knowledge for Automatic Classification of XML Data
Martin Theobald, Ralf Schenkel, and Gerhard Weikum
This paper investigates how to automatically classify non-schematic XML data into a user-defined topic directory. The main focus is on constructing appropriate feature spaces on which a classifier operates. In addition to the usual text-based term frequency vectors, we study XML twigs and tag paths as extended features that can be combined with text term occurrences in XML elements. Moreover, we show how to leverage ontological background information, more specifically, the WordNet thesaurus, for the construction of more expressive feature spaces. For efficiency our implementation computes features incrementally and caches ontology entries. Our experiments demonstrate the improved accuracy of automatic classification based on the enhanced feature spaces.
28 Tree Automata to Verify XML Key Constraints
Béatrice Bouchou, Mírian Halfeld-Ferrari-Alves, and Martin Musicante
We address the problem of checking key constraints in XML. Key constraints have been recently considered in the literature and some of their aspects are adopted in XMLSchema. However, only few works have appeared concerning the verification of such constraints.

Unranked deterministic bottom-up tree automata can be used to validate XML documents against a schema. These automata work over (unranked) trees used to represent XML documents.

In this paper we show how key constraints can be integrated in such automaton by extending the automaton to carry up values from the leaves to the root, during its run. In fact the tree automaton becomes a tree transducer. Under these conditions, the key verification is done in asymptotic linear time on the size of the document.

31 A Framework for Estimating XML Query Cardinality
Carlo Sartiani
Given the widespread diffusion of XML data, tools for querying and processing XML data are increasingly available. In this context, the estimation of the cardinality of queries on XML data is becoming more and more important. The information provided by a query result estimator can be used as input to the query optimizer, and as an early feedback to user queries.

Existing estimation models for XML queries focus on particular aspects of XML querying, such as the estimation of path and twig expression cardinality, and they do not deal with the problem of predicting the cardinality of general XQuery queries.

This paper presents a framework for estimating XML query cardinality. The framework provides facilities for estimating result size of FLWR queries, hence allowing the model designer to concentrate her efforts on the development of adequate and accurate, while concise, statistic summaries for XML data. The framework can also be used for extending existing models to a wider class of XML queries.

37 Automatic annotation of data extracted from large web sites
Luigi Arlotta, Valter Crescenzi, Giansalvatore Mecca, and Paolo Merialdo
Data extraction from web pages is performed by software modules called wrappers. Recently, some systems for the automatic generation of wrappers have been proposed in the literature. These systems are based on unsupervised inference techniques: taking as input a small set of sample pages, they can produce a common wrapper to extract relevant data. However, due to the automatic nature of the approach, the data extracted by these wrappers have anonymous names. In the framework of our ongoing project RoadRunner, we have have developed a prototype, called Labeller, that automatically annotates data extracted by automatically generated wrappers. Although Labeller has been developed as a companion system to our wrapper generator, its underlying approach has a general validity and therefore it can be applied together with other wrapper generator systems. We have experimented the prototype over several real-life Web sites, and we have obtained encouraging results.
44 A Comparison of Peer-to-Peer Search Methods
Dimitrios Tsoumakos, and Nick Roussopoulos
Peer-to-Peer (P2P) networks have become a major research topic over the last few years. Object location is a major part in the operation of these distributed systems. In this work, we present an overview of several search methods for unstructured P2P networks. These systems resemble popular file-sharing applications through which enormous amounts of data are daily exchanged. We analyze the performance of the algorithms relative to their success rates, bandwidth consumption and adaptation to changing topologies. Simulation results are used to empirically evaluate their behavior in direct comparison.
47 XML Interoperability
Laks V. S. Lakshmanan, and Fereidoon Sadri
We study the problem of interoperability among XML data sources. Our approach is distinguished from previous works on interoperability and data integration in not requiring either source-to-source or source-to-global mappings. Instead, it is based on enriching local declarations to an extent that enables derivation of mappings needed for interoperability. Such an approach is ultimately dependent on standards and global information, a direction taken by the semantic web initiative. We present a simple semantic model for sources, discuss local declarations and tools that assist in generating them and formulation of global queries, and address some interesting issues in query processing and optimization.
49 Semantic Email: Adding Lightweight Data Manipulation Capabilities to the Email Habitat
Oren Etzioni, Alon Halevy, Henry Levy, and Luke McDowell
The Semantic Web envisions a portion of the World Wide Web in which the underlying data is machine-understandable and applications can exploit this data for improved querying, aggregation, and interaction. This paper investigates whether the same vision can be carried over to the realm of email, the adjacent information space in which we spent significant amounts of time.

We introduce the notion of semantic email, in which email messages consists of a database query or update coupled with corresponding explanatory text. Thus, semantic email messages are machine- and human-understandable. Semantic email opens the door to a wide range of automated, email-mediated applications. In this paper we describe a class of semantic email processes that involve a set of email messages. For example, sending an email to a program committee, asking who will attend the PC dinner, collecting the responses and tallying them up. We describe a formal model for defining semantic email processes. In the model, the email process is modeled as a set of updates to a data set, on which we specify certain constraints. We then describe a set of inference problems that arise in this context, and provide initial results on some of them. In particular, we show that it is possible to automatically infer which email responses are acceptable w.r.t. a set of ultimately desired constraints. Finally, we describe a first implementation of semantic email, and outline the research challenges in this realm.

52 Index Structures for Querying the Deep Web
Jian Qiu, Feng Shao, Misha Zatsman, and Jayavel Shanmugasundaram
We propose and evaluate various index structures for efficiently querying data stored in Internet-attached databases, also referred to as the "deep web". There are two fundamental differences between index structures for the deep web and index structures used by current web search engines for querying static web pages (also referred to as the "static web" or the "surface web"). First, index structures for the deep web have to deal with structured data because the underlying database data is typically richly structured and typed; this is in contrast to the data available in the static web, which is mostly unstructured HTML data. Second, index structures for the deep web must deal with data volumes that are many orders of magnitude larger than that for the static web. To address the above issues, we devise new deep web index structures that (a) exploit the structure of the underlying data to support structured queries such as equality and range queries, and (b) can be compressed by one or two orders of magnitude so that they scale to the large data volumes present in the deep web. We also present preliminary experimental results that quantify the benefits of our proposed index structures, and illustrate different performance-space tradeoffs.
54 Modeling Query-Based Access to Text Databases
Eugene Agichtein, Panagiotis Ipeirotis, and Luis Gravano
Searchable text databases abound on the web. Applications that require access to such databases often resort to querying to extract relevant documents because of two main reasons. First, some text databases on the web are not ``crawlable,'' and hence the only way to retrieve their documents is via querying. Second, applications often require only a small fraction of a database's contents, so retrieving relevant documents via querying is an attractive choice from an efficiency viewpoint, even for crawlable databases. Often an application's query-based strategy starts with a small number of user-provided queries. Then, new queries are extracted --in an application-dependent way-- from the documents in the initial query results, and the process iterates. The success of this common type of strategy relies on retrieved documents ``contributing'' new queries. If new documents fail to produce new queries, then the process might stall before all relevant documents are retrieved. In this paper, we develop a graph-based ``reachability'' metric that allows to characterize when an application's query-based strategy will successfully ``reach'' all documents that the application needs. We complement our metric with an efficient sampling-based technique, which accurately estimates the reachability associated with a text database and an application's query-based strategy. We report preliminary experiments backing the usefulness of our metric and the accuracy of the associated estimation technique over real text databases and for two applications.
56 ODISSEA: A Peer-to-Peer Architecture for Scalable Web Search and Information Retrieval
Torsten Suel, Chandan Mathur, Jo-wen Wu, Jiangong Zhang, Alex Delis, and Mehdi Kharrazi
We consider the problem of building a P2P-based search engine for massive document collections. We describe a prototype system called ODISSEA (Open DIStributed Search Engine Architecture) that is currently under development in our group. ODISSEA provides a highly distributed global indexing and query execution service that can be used for content residing inside or outside of a P2P network. ODISSEA is different from most other approaches to P2P search in that it assumes a two-tier search engine architecture and a global index structure distributed over the nodes of the system.

We give an overview of the proposed system and discuss the basic design choices. Our main focus is on efficient query execution, and we discuss how recent work on top-k queries in the database community can be applied in a highly distributed environment. We also give some preliminary simulation results on a real search engine log and a terabyte-size web page collection that indicate good scalability for our approach.

59 On Distributing XML Repositories
Jan-Marco Bremer, and Michael Gertz
XML is increasingly used not only for data exchange but to represent arbitrary data sources as virtual XML repositories. Often, related repositories are distributed over the Web. However, design and query models for distributed XML repositories have not yet been studied in detail.

In this paper, we introduce a distribution approach for XML repositories. We outline a fragmentation method and present an allocation model for distributed XML fragments. The allocation model is based on path identifiers, which allow for an efficient, local evaluation of the most common types of global queries without having to access any other distribution site.

61 Path-expression Queries over Multiversion XML Documents
Zografoula Vagena, and Vassilis J. Tsotras
There has been considerable research recently involving the processing of path-expression queries on XML documents. Typically, such queries are implemented as path (holistic) joins, using a numbering scheme that maintains the structural relationships among the elements of an XML document. Nevertheless, existing numbering schemes are not easily updatable, an issue that appears in the case of a dynamically evolving document. In this paper we address path-expression queries in a multiversion XML document. This is a novel problem; previous work has considered the storage and manipulation of multiversion XML documents,focusing on the compact representation of the XML archive as well as the fast reconstruction of a particular document version. We extend state of the art pattern matching techniques for static documents so as to suport updates. We first present an easily updatable numbering scheme that efficiently captures identifies structural relationships. We then propose a variation of Pathstack, an optimal pattern matching algorithm, addressing the characteristics of our environment. Finally, through a thorough experimental evaluation we investigate two storage techniques in terms of space utilization and query efficiency.
64 Efficient Dissemination of Aggregate Data over the Wireless Web
Mohamed Sharaf, Yannis Sismanis, Alexandros Labrinidis, Panos Chrysanthis, and Nick Roussopoulos
The proliferation of wireless technologies along with the large volume of data available online are forcing us to rethink existing data dissemination techniques for data over the Web, and in particular for aggregate data. In addition to scalability and response time, data delivery to mobile clients with wireless Web connectivity must also consider energy consumption. In this work, we present a hybrid scheduling algorithm (h-ESBS) for broadcast-based data delivery of aggregate data over the wireless Web. Our algorithm efficiently "packs" aggregate data for broadcast delivery and utilizes view subsumption at the mobile client, which allow for faster response times and lower energy consumption.
65 Building Data Integration Systems: A Mass Collaboration Approach
Robert McCann, AnHai Doan, Vanitha Varadaran, Alex Kramnik, and Chengxiang Zhai
Building data integration systems today is largely done by hand, in a very labor intensive and error prone process. In this paper, we describe a conceptually new solution to this problem: that of mass collaboration. The basic idea is to think about a data integration system as having a finite set of parameters whose values must be set. To build such a system, the system administrators can construct and deploy a system ``shell'', then ask the users to help the system ``automatically converge'' to the correct parameter values. This way, the enourmous burden of system developments is lifted from the administrators and spread ``thinly'' over a multitude of users. We discuss the challenges to this approach and propose solutions. Next, we describe our current effort in applying this approach to the problem of schema matching in the context of data integration. Finally, we describe experimental results that show the promise of the approach.
73 The Query Set Specification Language (QSSL)
Michalis Petropoulos, Alin Deutsch, and Yannis Papakonstantinou
Applications require access to multiple information sources and the data of other applications. WSDL-based web services are becoming a popular way of making information sources available on the web and, hence, to applications that need to consume them – often via data integration systems. We argue that the function signature paradigm that is used today by web services cannot capture the query capabilities provided by structurally rich and functionally powerful information sources. We propose the Query Set Specification Language (QSSL) that allows the concise description of sets of parameterized queries, and describe the goals, the usage scenarios and the requirements for such a language. A QSSL specification is embedded in a WSDL specification to form a specialized type of web services, called Data Services. Data Services connect the calls that the source accepts with the underlying schema, providing a set of opportunities and corresponding challenges that we outline.