Joins


On this page

    Join operations form a critical part most query engines. The order in which operations are joined, and the algorithms that are used to execute those joins, determine in large part the overall efficiency of query executions. The acts of determining this order and the selection of the join algorithms are parts of query planning.

    Adaptive query planning

    While most query engines perform query planning before query execution, Comunica does (part of) its query planning during query execution, which makes it an adaptive query engine. This is because Comunica aims to query over remote data sources, which makes it difficult to determine the optimal query plan ahead of query execution. Instead, the choices for query planning are taken as soon as they are required and the relevant information about the sources is available.

    What is a join

    SPARQL queries typically consist of many joins. For example, the following SPARQL query requires two triple patterns to be joined:

    SELECT * WHERE {
      ?s <ex:p1> ?link.
      ?link <ex:p2> ?o.
    }
    

    A query engine can represent this as two join entries that each can produce bindings:

    • Join entry 1 with bindings for variables ?s and ?link
    • Join entry 2 with bindings for variables ?link and ?o

    The join of these two entries will result in a new intermediary operation that produces bindings for the variables ?s, ?link, and ?o. The bindings in this intermediary operation will contain all existing combinations of these variables based on the two underlying join entries.

    For example, we assume the following bindings for the two join entries:

    join entry 1:
      { s: "ex:s1"; link: "ex:link1" }
      { s: "ex:s2"; link: "ex:link2" }
      { s: "ex:s3"; link: "ex:link3" }
    
    join entry 2:
      { link: "ex:link1", o: "ex:o1" }
      { link: "ex:link1", o: "ex:o2" }
      { link: "ex:link3", o: "ex:o3" }
    

    If we determine the possible combinations of these join entries following the inner join semantics, then we will obtain the following bindings:

    joined bindings:
      { s: "ex:s1"; link: "ex:link1"; o: "ex:o1" }
      { s: "ex:s1"; link: "ex:link1"; o: "ex:o2" }
      { s: "ex:s3"; link: "ex:link3"; o: "ex:o3" }
    

    Note that the second binding of the first join entry does not appear in the final results, because the value for ?link ("ex:link2") does not exist in the second join entry's bindings.

    Logical and physical joins

    A logical join type indicates the semantics of a join operation, and are under control of the query writer. The example above explains how the so-called inner join works, which is the most common logical join within SPARQL queries.

    There are however also two other logical join types that can occur within SPARQL queries:

    • Optional join (or left join): a join with two entries where all bindings from the left entry are matched with the bindings from the right entry. If no matching bindings are found in the right entry, undefined values are used for those.
    • Minus join (or anti join): a join with two entries where all bindings from the left entry are returned that have no corresponding bindings in the right entry.

    Each logical join can be implemented via different physical join algorithms. The selection of these algorithms is usually done internally within query engines during query planning, and is therefore not under control of the query writer.

    For example, two popular algorithms for the inner join are the nested-loop-join and hash-join algorithms, where the former is based on a nested for-loop, and the latter makes use of a hash-dictionary to achieve a lower computational complexity.

    Join actors

    The @comunica/bus-rdf-join bus in Comunica accepts join actions, where each action determine the entries that require joining, and the logical join that is to be used. For example, this bus will be invoked for the inner-join type when more than one operation (e.g. triple pattern) occurs in the query.

    Currently, the following join actors are available in Comunica:

    Selecting physical joins

    Actor selection in Comunica is done using mediators. Learn more about mediators in the core architecture.

    The Join Coefficients Mediator is a mediator that will select the "optimal" join actor based on their join coefficients (cost estimates). Each join actor can calculate their join coefficients based on metadata that is provided by data sources.

    The available join coefficients that are calculated by each join actor are:

    • iterations: An estimation of how many iterations over items are executed. This is used to determine the CPU cost.
    • persistedItems: An estimation of how many items are stored in memory. This is used to determine the memory cost.
    • blockingItems: An estimation of how many items block the stream. This is used to determine the time the stream is not progressing anymore.
    • requestTime: An estimation of the time to request items from sources. This is used to determine the I/O cost.

    The Join Coefficients Mediator can be configured with weights to calculate an overall cost based on these join coefficients, after which the actor with the lowest overall cost will be allowed to execute the action.

    If you want to inspect or debug the chosen physical joins, you can use the explain functionality, or make use of the logger.

    Physical join selection example

    We assume two join entries with the following cardinalities (a.k.a., estimated number of bindings):

    • Join entry 1: 10
    • Join entry 2: 1.000

    Assuming the availability of the nested-loop-join and hash-join actors, these will calculate the join coefficients as follows:

    • Nested-loop-join
      • iterations = 10 * 1.000 = 10.000
      • persistedItems = 0
      • blockingItems = 0
    • Hash-join
      • iterations = 10 + 1.000 = 1.010
      • persistedItems = 10
      • blockingItems = 10

    The requestTime join coefficient is omitted out for simplicity.

    If the Join Coefficients Mediator gives equal weights to all join coefficients, then it can come up with the following overall costs, which would make hash-join the selected physical actor:

    • Nested-loop-join: 10.000 + 0 + 0 = 10.000
    • Hash-join: 1.010 + 10 + 10 = 1.030

    However, if the Join Coefficients Mediator would be configured to give a much higher weight (10.000) to the number of blocking items (e.g. when early results are prioritized), then the overall costs would become, which would make nested-loop join the selected physical actor:

    • Nested-loop-join: 10.000 + 0 * 1.000 + 0 = 10.000
    • Hash-join: 1.010 + 10 * 10.000 + 10 = 11.020