HSP is the first heuristics-based SPARQL planner capable of exploring the syntactic variations of triple patterns in a SPARQL query in order to choose near-optimal execution plans without any cost model.
In particular, HSP tries to produce plans that maximize the number of merge joins over the variables shared among triple patterns, and it relies on various heuristics to decide which variables will be used in selections and joins, as well as which underlying access paths will be exploited.
In HSP, the problem of query planning is reduced to a maximum independent set computation over a SPARQL query graph representation, where nodes are query variables and edges are the triple patterns connecting them. The qualifying independent sets are then translated into blocks of merge joins, connected when needed by hash joins.
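To make this reduction concrete, the following minimal Python sketch (not the HSP implementation: the tuple representation of triple patterns, the greedy maximal-independent-set routine, and the example pattern are our own assumptions) builds the variable graph of a basic graph pattern, picks an independent set of variables, and groups the triple patterns that can be merge-joined on each selected variable; the resulting blocks would be stitched together with hash joins.

```python
from itertools import combinations

def variables(tp):
    """Return the variables of a triple pattern, e.g. ('?x', 'rdf:type', '?y')."""
    return {t for t in tp if t.startswith('?')}

def variable_graph(bgp):
    """Nodes are query variables; an edge links two variables that co-occur in a pattern."""
    nodes = set().union(*(variables(tp) for tp in bgp))
    edges = set()
    for tp in bgp:
        for u, v in combinations(sorted(variables(tp)), 2):
            edges.add((u, v))
    return nodes, edges

def greedy_independent_set(nodes, edges):
    """Greedy maximal independent set; HSP explores the qualifying sets more thoroughly."""
    adjacency = {n: set() for n in nodes}
    for u, v in edges:
        adjacency[u].add(v)
        adjacency[v].add(u)
    chosen = set()
    for n in sorted(nodes, key=lambda n: len(adjacency[n])):  # low-degree nodes first
        if adjacency[n].isdisjoint(chosen):
            chosen.add(n)
    return chosen

def merge_join_blocks(bgp, independent_vars):
    """Group the triple patterns by the independent variable they share."""
    return {v: [tp for tp in bgp if v in variables(tp)] for v in independent_vars}

bgp = [("?x", "rdf:type", "ub:Professor"),
       ("?x", "ub:worksFor", "?y"),
       ("?y", "rdf:type", "ub:University")]
nodes, edges = variable_graph(bgp)
blocks = merge_join_blocks(bgp, greedy_independent_set(nodes, edges))
for var, tps in blocks.items():
    print(var, "->", tps)
```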
HSP plans are evaluated on top of the MonetDB column-based DBMS. Particular attention was paid to the efficient mapping of HSP logical plans onto the underlying MonetDB query execution engine (using the MAL physical algebra). The main challenge stems from the decomposed storage of rows into columns, which incurs subtle tuple reconstruction operations after the execution of the several MAL operators required for every HSP logical operator. In addition, to respect the maximal number of merge joins determined by HSP, we have employed bushy rather than left-deep query plans (as obtained by the standard SQL optimizer of MonetDB).
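As a rough illustration of the two plan shapes (a toy representation of our own, not MAL code), the sketch below combines merge-join blocks either into a left-deep spine of hash joins or pairwise into a bushy tree.

```python
# Toy plan shapes: each block stands for a merge-join block produced by HSP.

def left_deep(blocks):
    # Chain all blocks along a single left spine.
    plan = blocks[0]
    for b in blocks[1:]:
        plan = ("HashJoin", plan, b)
    return plan

def bushy(blocks):
    # Pair blocks level by level, producing a balanced (bushy) join tree.
    while len(blocks) > 1:
        paired = [("HashJoin", blocks[i], blocks[i + 1])
                  for i in range(0, len(blocks) - 1, 2)]
        if len(blocks) % 2:            # carry an unpaired block upwards
            paired.append(blocks[-1])
        blocks = paired
    return blocks[0]

blocks = ["MJ(?x)", "MJ(?y)", "MJ(?z)", "MJ(?w)"]
print(left_deep(blocks))  # (((?x ⋈ ?y) ⋈ ?z) ⋈ ?w) shape
print(bushy(blocks))      # ((?x ⋈ ?y) ⋈ (?z ⋈ ?w)) shape
```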
Experiments were conducted using synthetically generated RDF datasets according to the SP2Bench benchmark [SP2B], as well as widely used real RDF datasets such as Yet Another Great Ontology (YAGO). In particular, we compare the quality and execution time of the plans produced by our heuristics-based SPARQL planner, HSP, with those of the cost-based dynamic programming (CDP) algorithm of RDF-3X. For all workload queries, HSP produces plans with the same number of merge and hash joins as those produced by RDF-3X. Their differences lie in the ordered variables employed as well as in the execution order of joins, which essentially affect the size of intermediate results. With the exception of queries that are not substantially different in their syntax, HSP plans executed on MonetDB outperform those of CDP executed on RDF-3X by up to three orders of magnitude.
A SPARQL query typically consists of numerous costly joins. The first and foremost goal is to maximize the number of merge joins in the query plan. A merge join in this context is most commonly a sort-merge join, or any other join that takes advantage of an existing index. An equally important goal is to minimize intermediate results in order to reduce the memory footprint during query execution. This is achieved by choosing the most selective triple patterns to evaluate first. Traditionally, deciding which triple patterns are more selective relies on statistics. In HSP we have compiled a set of heuristics, based on the syntactic form of triple patterns, to determine the more selective ones.
Given the position and the number of variables in a triple pattern, we derive the following order, starting from the most selective pattern, i.e., the one that is likely to produce the fewest intermediate results, and ending with the least selective one.
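As a sketch of how such an ordering can be applied, the ranking table below is an illustrative assumption that encodes the intuition that fewer variables, and a bound object in particular, mean higher selectivity; it is not quoted from HSP.

```python
# Assumed ranking (most selective first), keyed by which positions are variables.
ASSUMED_ORDER = [
    (False, False, False),   # (s, p, o)
    (False, True,  False),   # (s, ?p, o)
    (True,  False, False),   # (?s, p, o)
    (False, False, True),    # (s, p, ?o)
    (True,  True,  False),   # (?s, ?p, o)
    (False, True,  True),    # (s, ?p, ?o)
    (True,  False, True),    # (?s, p, ?o)
    (True,  True,  True),    # (?s, ?p, ?o)
]

def is_var(term):
    return term.startswith('?')

def selectivity_rank(tp):
    """Smaller rank = likely more selective = evaluated earlier."""
    return ASSUMED_ORDER.index(tuple(is_var(t) for t in tp))

bgp = [("?s", "rdf:type", "?o"),
       ("?s", "foaf:name", '"Alice"'),
       ("?s", "?p", "ex:Article1")]
for tp in sorted(bgp, key=selectivity_rank):
    print(tp)
```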
The different positions in which the same variable appears in a set of triple patterns capture the number of different joins this variable participates in. A variable that always appears in the same position in all triple patterns, for example as subject, entails many self joins with low selectivity. On the other hand, if it appears both as object and as property, chances are the join result will be smaller. The ordering for this heuristic stems from our observations while studying RDF data graphs: they tend to be sparse with a small diameter, while there are hub nodes, usually subjects. As a result, query graph patterns that form linear paths are more selective.
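A minimal sketch of this idea, assuming triple patterns are plain (subject, predicate, object) tuples, scores a join variable by the distinct positions it occupies across the patterns that mention it.

```python
# A variable stuck in one position (e.g. always subject) suggests
# low-selectivity star/self joins; a variable that switches position
# suggests a more selective, path-shaped join.

POSITIONS = ("subject", "predicate", "object")

def positions_of(var, bgp):
    return {POSITIONS[i] for tp in bgp for i, t in enumerate(tp) if t == var}

bgp = [("?x", "ub:advisor", "?y"),
       ("?y", "ub:worksFor", "?z")]
print(positions_of("?y", bgp))   # {'subject', 'object'} -> path-shaped join
print(positions_of("?x", bgp))   # {'subject'} only
```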
Triple patterns with the largest number of literals and URIs – or, symmetrically, the fewest variables – are more selective. This heuristic is similar to the bound-is-easier heuristic of relational query processing, according to which the more bound components a triple pattern has, the more selective it is.
The object of a triple pattern may be a literal or a URI. In that case, a literal is more selective than a URI. This holds for RDF data because, in many cases, when a URI is used as an object it is shared by many triples.
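Together with the bound-is-easier heuristic above, this tie-break can be sketched as follows; the representation of literals as quoted strings and the scoring function are our own assumptions.

```python
# Among patterns with equally many bound components, prefer a literal object
# over a URI object, since a URI object is often shared by many triples.

def is_var(t):
    return t.startswith('?')

def bound_count(tp):
    return sum(not is_var(t) for t in tp)

def object_kind(tp):
    o = tp[2]
    if is_var(o):
        return 2                            # unbound object: least preferred
    return 0 if o.startswith('"') else 1    # literal before URI

def pattern_key(tp):
    # More bound components first, then literal-object before URI-object.
    return (-bound_count(tp), object_kind(tp))

bgp = [("?p", "foaf:mbox", "ex:mbox1"),
       ("?p", "foaf:name", '"Alice"')]
print(sorted(bgp, key=pattern_key)[0])   # the literal-object pattern wins
```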
This heuristic allows us to consider the triple patterns that contain projection variables as late as possible. When the compared sets of triple patterns have the same set of projection variables, we prefer the set with the maximum number of variables that are not projection variables, i.e., unused variables. This heuristic is directly associated.
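A possible sketch of this preference (the key function and data representation are our own assumptions, not the HSP code) compares candidate sets of triple patterns by how many projection variables they touch and, on a tie, by how many non-projection variables they contain.

```python
def vars_of(tps):
    return {t for tp in tps for t in tp if t.startswith('?')}

def preference_key(tps, projection):
    v = vars_of(tps)
    proj_hits = len(v & set(projection))
    non_proj = len(v - set(projection))
    # Fewer projection variables first; on a tie, more non-projection variables.
    return (proj_hits, -non_proj)

projection = ["?name"]
candidate_a = [("?p", "foaf:name", "?name")]
candidate_b = [("?p", "rdf:type", "?t"), ("?p", "ub:memberOf", "?g")]
best = min([candidate_a, candidate_b], key=lambda c: preference_key(c, projection))
print(best)   # candidate_b: no projection variables, more unused variables
```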
You can download HSP from here.