| |
| |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
Importand Dates
Abstract submission:
» March 14, 2003 Submission
deadline:
» March 21, 2003 Notification:
» April 25, 2003
Camera-Ready Due:
» May 16, 2003
Workshop:
» June 12-13, 2003
|
|
|
|
| Sponsored by: |
 |
 |
|
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
|
|

| |
|
Accepted papers
| 14 |
Stream-based XML processing with type filtering
George Russell, Mathias Neumuller, and Richard Connor
|
|
Stream-based query methods are event-based and interfaces such as SAX
are notoriously difficult to program.
A compromise is to introduce a two-phase evaluation strategy, where
the query is described as a primary filter whose output is made
incrementally available to a secondary query. Without losing critical
properties, this allows the first phase to be expressed in some
non-Turing complete algebra, more in tune with human intuition, and
the second part to be expressed in a high-level algebra.
This framework has been previously investigated with a first phase
implemented using a regular expression language such as XPath. We
investigate the tradeoffs of using a type algebra to describe the set
of data required for the second phase.
|
| 17 |
On the updatability of XML views
Vanessa Braganholo, Susan Davidson, and Carlos Heuser
|
|
XML has become an important medium for data exchange, and is also used
as an interface to -- i.e. a view of -- a relational database. While
previous work has considered XML views for the purpose of querying
relational databases (e.g. Silkroute), in this paper we consider the
problem of updating a relational database through an XML view. Using
the nested relational algebra as the formalism for an XML view, we
study the problem of when such views are updatable. Our results rely
on the fact that in most reasonable XML views of relational databases,
the unnest operator does not occur. Furthermore, the nest operator is
invertible. We therefore show that for an important class of XML views
it is possible to consider them as if they were flat relational views.
|
| 27 |
Exploiting Structure, Annotation, and Ontological Knowledge for Automatic Classification of XML Data
Martin Theobald, Ralf Schenkel, and Gerhard Weikum
|
|
This paper investigates how to automatically classify non-schematic
XML data into a user-defined topic directory. The main focus is on
constructing appropriate feature spaces on which a classifier
operates. In addition to the usual text-based term frequency vectors,
we study XML twigs and tag paths as extended features that can be
combined with text term occurrences in XML elements. Moreover, we
show how to leverage ontological background information, more
specifically, the WordNet thesaurus, for the construction of more
expressive feature spaces. For efficiency our implementation computes
features incrementally and caches ontology entries. Our experiments
demonstrate the improved accuracy of automatic classification based on
the enhanced feature spaces.
|
| 28 |
Tree Automata to Verify XML Key Constraints
Béatrice Bouchou, Mírian Halfeld-Ferrari-Alves, and Martin Musicante
|
|
We address the problem of checking key constraints in XML. Key
constraints have been recently considered in the literature and some
of their aspects are adopted in XMLSchema. However, only few works
have appeared concerning the verification of such constraints.
Unranked deterministic bottom-up tree automata can be used to validate
XML documents against a schema. These automata work over (unranked)
trees used to represent XML documents.
In this paper we show how key constraints can be integrated in such
automaton by extending the automaton to carry up values from the
leaves to the root, during its run. In fact the tree automaton
becomes a tree transducer. Under these conditions, the key
verification is done in asymptotic linear time on the size of the
document.
|
| 31 |
A Framework for Estimating XML Query Cardinality
Carlo Sartiani
|
|
Given the widespread diffusion of XML data, tools for querying and
processing XML data are increasingly available. In this context, the
estimation of the cardinality of queries on XML data is becoming more
and more important. The information provided by a query result
estimator can be used as input to the query optimizer, and as an early
feedback to user queries.
Existing estimation models for XML queries focus on particular aspects
of XML querying, such as the estimation of path and twig expression
cardinality, and they do not deal with the problem of predicting the
cardinality of general XQuery queries.
This paper presents a framework for estimating XML query
cardinality. The framework provides facilities for estimating result
size of FLWR queries, hence allowing the model designer to concentrate
her efforts on the development of adequate and accurate, while
concise, statistic summaries for XML data. The framework can also be
used for extending existing models to a wider class of XML queries.
|
| 37 |
Automatic annotation of data extracted from large web sites
Luigi Arlotta,
Valter Crescenzi,
Giansalvatore Mecca, and
Paolo Merialdo
|
|
Data extraction from web pages is performed by software modules called
wrappers. Recently, some systems for the automatic generation of
wrappers have been proposed in the literature. These systems are based
on unsupervised inference techniques: taking as input a small set of
sample pages, they can produce a common wrapper to extract relevant
data. However, due to the automatic nature of the approach, the data
extracted by these wrappers have anonymous names. In the framework of
our ongoing project RoadRunner, we have have developed a prototype,
called Labeller, that automatically annotates data extracted by
automatically generated wrappers. Although Labeller has been developed
as a companion system to our wrapper generator, its underlying
approach has a general validity and therefore it can be applied
together with other wrapper generator systems. We have experimented
the prototype over several real-life Web sites, and we have obtained
encouraging results.
|
| 44 |
A Comparison of Peer-to-Peer Search Methods
Dimitrios Tsoumakos, and Nick Roussopoulos
|
|
Peer-to-Peer (P2P) networks have become a major research topic over
the last few years. Object location is a major part in the operation
of these distributed systems. In this work, we present an overview of
several search methods for unstructured P2P networks. These systems
resemble popular file-sharing applications through which enormous
amounts of data are daily exchanged. We analyze the performance of the
algorithms relative to their success rates, bandwidth consumption and
adaptation to changing topologies. Simulation results are used to
empirically evaluate their behavior in direct comparison.
|
| 47 |
XML Interoperability
Laks V. S. Lakshmanan, and Fereidoon Sadri
|
|
We study the problem of interoperability among XML data sources. Our
approach is distinguished from previous works on interoperability and
data integration in not requiring either source-to-source or
source-to-global mappings. Instead, it is based on enriching local
declarations to an extent that enables derivation of mappings needed
for interoperability. Such an approach is ultimately dependent on
standards and global information, a direction taken by the semantic
web initiative. We present a simple semantic model for sources,
discuss local declarations and tools that assist in generating them
and formulation of global queries, and address some interesting issues
in query processing and optimization.
|
| 49 |
Semantic Email: Adding Lightweight Data Manipulation Capabilities to the Email Habitat
Oren Etzioni, Alon Halevy, Henry Levy, and Luke McDowell
|
|
The Semantic Web envisions a portion of the World Wide Web in which
the underlying data is machine-understandable and applications can
exploit this data for improved querying, aggregation, and
interaction. This paper investigates whether the same vision can be
carried over to the realm of email, the adjacent information space in
which we spent significant amounts of time.
We introduce the notion of semantic email, in which email messages
consists of a database query or update coupled with corresponding
explanatory text. Thus, semantic email messages are machine- and
human-understandable. Semantic email opens the door to a wide range of
automated, email-mediated applications. In this paper we describe a
class of semantic email processes that involve a set of email
messages. For example, sending an email to a program committee, asking
who will attend the PC dinner, collecting the responses and tallying
them up. We describe a formal model for defining semantic email
processes. In the model, the email process is modeled as a set of
updates to a data set, on which we specify certain constraints. We
then describe a set of inference problems that arise in this context,
and provide initial results on some of them. In particular, we show
that it is possible to automatically infer which email responses are
acceptable w.r.t. a set of ultimately desired constraints. Finally, we
describe a first implementation of semantic email, and outline the
research challenges in this realm.
|
| 52 |
Index Structures for Querying the Deep Web
Jian Qiu, Feng Shao, Misha Zatsman, and Jayavel Shanmugasundaram
|
|
We propose and evaluate various index structures for efficiently
querying data stored in Internet-attached databases, also referred to as the
"deep web". There are two fundamental differences between index structures for
the deep web and index structures used by current web search engines for
querying static web pages (also referred to as the "static web" or the
"surface web"). First, index structures for the deep web have to deal with
structured data because the underlying database data is typically richly
structured and typed; this is in contrast to the data available in the static
web, which is mostly unstructured HTML data. Second, index structures for the
deep web must deal with data volumes that are many orders of magnitude larger
than that for the static web. To address the above issues, we devise new deep
web index structures that (a) exploit the structure of the underlying data to
support structured queries such as equality and range queries, and (b)
can be compressed by one or two orders of magnitude so that they scale to the
large data volumes present in the deep web. We also present preliminary
experimental results that quantify the benefits of our proposed index
structures, and illustrate different performance-space tradeoffs.
|
| 54 |
Modeling Query-Based Access to Text Databases
Eugene Agichtein, Panagiotis Ipeirotis, and Luis Gravano
|
|
Searchable text databases abound on the web. Applications that require
access to such databases often resort to querying to extract relevant
documents because of two main reasons. First, some text databases on
the web are not ``crawlable,'' and hence the only way to retrieve
their documents is via querying. Second, applications often require
only a small fraction of a database's contents, so retrieving relevant
documents via querying is an attractive choice from an efficiency
viewpoint, even for crawlable databases. Often an application's
query-based strategy starts with a small number of user-provided
queries. Then, new queries are extracted --in an application-dependent
way-- from the documents in the initial query results, and the process
iterates. The success of this common type of strategy relies on
retrieved documents ``contributing'' new queries. If new documents
fail to produce new queries, then the process might stall before all
relevant documents are retrieved. In this paper, we develop a
graph-based ``reachability'' metric that allows to characterize when
an application's query-based strategy will successfully ``reach'' all
documents that the application needs. We complement our metric with an
efficient sampling-based technique, which accurately estimates the
reachability associated with a text database and an application's
query-based strategy. We report preliminary experiments backing the
usefulness of our metric and the accuracy of the associated estimation
technique over real text databases and for two applications.
|
| 56 |
ODISSEA: A Peer-to-Peer Architecture for Scalable Web Search and Information Retrieval
Torsten Suel, Chandan Mathur, Jo-wen Wu, Jiangong Zhang, Alex Delis, and Mehdi Kharrazi
|
|
We consider the problem of building a P2P-based search engine for massive
document collections. We describe a prototype system called ODISSEA
(Open DIStributed Search Engine Architecture) that is currently under
development in our group. ODISSEA provides a highly distributed global indexing
and query execution service that can be used for content residing inside
or outside of a P2P network. ODISSEA is different from most other approaches
to P2P search in that it assumes a two-tier search engine architecture and
a global index structure distributed over the nodes of the system.
We give an overview of the proposed system and discuss the basic design
choices. Our main focus is on efficient query execution, and we discuss
how recent work on top-k queries in the database community can be applied
in a highly distributed environment. We also give some preliminary simulation
results on a real search engine log and a terabyte-size web page collection
that indicate good scalability for our approach.
|
| 59 |
On Distributing XML Repositories
Jan-Marco Bremer, and Michael Gertz
|
|
XML is increasingly used not only for data exchange but to represent
arbitrary data sources as virtual XML repositories. Often, related
repositories are distributed over the Web. However, design and query
models for distributed XML repositories have not yet been studied in
detail.
In this paper, we introduce a distribution approach for XML
repositories. We outline a fragmentation method and present an
allocation model for distributed XML fragments. The allocation model
is based on path identifiers, which allow for an efficient, local
evaluation of the most common types of global queries without having
to access any other distribution site.
|
| 61 |
Path-expression Queries over Multiversion XML Documents
Zografoula Vagena, and Vassilis J. Tsotras
|
|
There has been considerable research recently involving the processing
of path-expression queries on XML documents. Typically, such queries
are implemented as path (holistic) joins, using a numbering scheme
that maintains the structural relationships among the elements of an
XML document. Nevertheless, existing numbering schemes are not easily
updatable, an issue that appears in the case of a dynamically evolving
document. In this paper we address path-expression queries in a
multiversion XML document. This is a novel problem; previous work has
considered the storage and manipulation of multiversion XML
documents,focusing on the compact representation of the XML archive as
well as the fast reconstruction of a particular document version. We
extend state of the art pattern matching techniques for static
documents so as to suport updates. We first present an easily
updatable numbering scheme that efficiently captures identifies
structural relationships. We then propose a variation of Pathstack, an
optimal pattern matching algorithm, addressing the characteristics of
our environment. Finally, through a thorough experimental evaluation
we investigate two storage techniques in terms of space utilization
and query efficiency. |
| 64 |
Efficient Dissemination of Aggregate Data over the Wireless Web
Mohamed Sharaf, Yannis Sismanis, Alexandros Labrinidis, Panos Chrysanthis, and Nick Roussopoulos
|
|
The proliferation of wireless technologies along with the large volume
of data available online are forcing us to rethink existing data
dissemination techniques for data over the Web, and in particular for
aggregate data. In addition to scalability and response time, data
delivery to mobile clients with wireless Web connectivity must also
consider energy consumption. In this work, we present a hybrid
scheduling algorithm (h-ESBS) for broadcast-based data delivery of
aggregate data over the wireless Web. Our algorithm efficiently
"packs" aggregate data for broadcast delivery and utilizes view
subsumption at the mobile client, which allow for faster response
times and lower energy consumption.
|
| 65 |
Building Data Integration Systems: A Mass Collaboration Approach
Robert McCann, AnHai Doan, Vanitha Varadaran, Alex Kramnik, and Chengxiang Zhai
|
|
Building data integration systems today is largely done by hand, in a
very labor intensive and error prone process. In this paper, we
describe a conceptually new solution to this problem: that of mass
collaboration. The basic idea is to think about a data integration
system as having a finite set of parameters whose values must be
set. To build such a system, the system administrators can construct
and deploy a system ``shell'', then ask the users to help the system
``automatically converge'' to the correct parameter values. This way,
the enourmous burden of system developments is lifted from the
administrators and spread ``thinly'' over a multitude of users. We
discuss the challenges to this approach and propose solutions. Next,
we describe our current effort in applying this approach to the
problem of schema matching in the context of data integration.
Finally, we describe experimental results that show the promise of the
approach.
|
| 73 |
The Query Set Specification Language (QSSL)
Michalis Petropoulos, Alin Deutsch, and Yannis Papakonstantinou
|
|
Applications require access to multiple information sources and the
data of other applications. WSDL-based web services are becoming a
popular way of making information sources available on the web and,
hence, to applications that need to consume them – often via
data integration systems. We argue that the function signature
paradigm that is used today by web services cannot capture the query
capabilities provided by structurally rich and functionally powerful
information sources. We propose the Query Set Specification Language
(QSSL) that allows the concise description of sets of parameterized
queries, and describe the goals, the usage scenarios and the
requirements for such a language. A QSSL specification is embedded in
a WSDL specification to form a specialized type of web services,
called Data Services. Data Services connect the calls that the source
accepts with the underlying schema, providing a set of opportunities
and corresponding challenges that we outline.
|
|
|