Release 3.0: 🔥 Blazingly fast federation over heterogeneous sources

Tuesday, March 19, 2024



    More than 2 years ago, we released Comunica version 2.0, which featured many internal and external API changes that significantly simplified its usage. Today, we release version 3.0, which focuses more on internal changes, with limited changes to the external API. Most of the changes relate to the handling of data sources during query planning, which allows more efficient query plans to be produced when querying over federations of heterogeneous sources. This means that for people using Comunica, the number of breaking changes in this update is very limited. Things will simply be faster in general, and some small convenience features have been added, such as results being async iterable. For developers extending Comunica with custom actors, there will be some larger breaking changes.

    🔁 Async iterable results

    Since recent JavaScript versions, it has been possible to use a new for-await syntax over async iterables. Comunica has been using the AsyncIterator library since its initial release. This requires users to consume results as streams using on-data listeners, as follows:

    const bindingsStream = await myEngine.queryBindings(`
      SELECT ?s ?p ?o WHERE {
        ?s ?p <http://dbpedia.org/resource/Belgium>.
        ?s ?p ?o
      } LIMIT 100`, {
      sources: [ 'http://fragments.dbpedia.org/2015/en' ],
    });
    
    bindingsStream.on('data', (bindings) => {
        console.log(bindings.toString());
    });
    

    As of Comunica 3.x, results can now also be consumed via the async iterable interface, as follows:

    for await (const bindings of bindingsStream) {
        console.log(bindings.toString());
    }
    

    In performance-critical cases, we still recommend the on-data listener approach. But in most cases, the async iterable interface will provide sufficient levels of performance.
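
    One practical benefit of the async iterable interface is that results can be gathered with ordinary control flow, including early exit and try/catch. The following is a minimal sketch of that pattern. It uses a hypothetical stand-in generator (fakeBindingsStream) instead of a real Comunica stream so that it runs without any dependencies, and the collect helper is an illustrative name, not part of the Comunica API.

```javascript
// Stand-in for a bindings stream (hypothetical): any async iterable can be
// consumed with for-await, so an async generator is enough for this sketch.
async function* fakeBindingsStream() {
  yield 'bindings-1';
  yield 'bindings-2';
  yield 'bindings-3';
}

// Collect at most `limit` items from an async iterable into an array.
// `collect` is an illustrative helper, not a Comunica function.
async function collect(stream, limit = Infinity) {
  const results = [];
  if (limit <= 0) {
    return results;
  }
  try {
    for await (const item of stream) {
      results.push(item);
      if (results.length >= limit) {
        break; // Stop consuming once enough results have been gathered.
      }
    }
  } catch (error) {
    // Errors raised while iterating surface here as ordinary exceptions.
    throw new Error(`Result iteration failed: ${error.message}`);
  }
  return results;
}
```

    Under the assumption that the bindingsStream from the earlier example behaves like any other async iterable, it could be passed to such a helper in place of the stand-in generator.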

    🙋 Performance improvements for end-users

    In Comunica version 2.x, federated queries (i.e. queries across multiple sources) would essentially be split at triple pattern level, each triple pattern would be sent to each source, and results would be combined together locally. While this way of working is semantically correct, it is not always the most performant, especially when working with sources such as SPARQL endpoints that can accept way more than just triple patterns.

    In Comunica version 3.x, the internal architecture has been refactored so that query planning no longer happens only at triple pattern level: any kind of query operation can now be sent to any kind of source that supports it. While this new architecture will enable better query optimizations to be implemented in the future, we already implemented some optimizations in this release. First, if Comunica detects that multiple operations exclusively apply to one source, then these operations will be grouped and sent in bulk to this source (@comunica/actor-optimize-query-operation-group-sources). Roughly, this corresponds to the FedX optimization techniques, but extended to apply to heterogeneous sources instead of only SPARQL endpoints. Second, if a join is done between two sources, where one of these sources accepts bindings to be pushed down into the source (such as SPARQL endpoints and brTPF interfaces), the bound-join technique (FedX) is applied (@comunica/actor-rdf-join-inner-multi-bind-source). Third, if sources accept FILTER operations, then these FILTER operations can be pushed down into the sources that accept them (@comunica/actor-optimize-query-operation-filter-pushdown). Fourth, if some operations will not produce any results based on prior COUNT or ASK queries, then these empty source-specific operations will be pruned away (@comunica/actor-optimize-query-operation-prune-empty-source-operations).
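
    To make the first and fourth optimizations more concrete, here is a toy sketch of the underlying idea. It is not how the Comunica actors are implemented: the input shape (candidates, hasResults), the function name planPatterns, and the source names A and B are all hypothetical, and the sketch ignores join structure and variable connectivity, which a real planner has to take into account.

```javascript
// Toy planner input (hypothetical shape): each pattern lists candidate sources
// and whether a prior ASK/COUNT check found any matches in that source.
function planPatterns(patterns) {
  const groups = new Map(); // source name -> ids of patterns exclusive to it
  const federated = [];     // patterns that still span several sources
  const empty = [];         // patterns that no source can answer

  for (const pattern of patterns) {
    // Prune source-specific operations that are known to be empty.
    const sources = pattern.candidates
      .filter((candidate) => candidate.hasResults)
      .map((candidate) => candidate.source);

    if (sources.length === 0) {
      empty.push(pattern.id);
    } else if (sources.length === 1) {
      // Exclusive to one source: group so it can be sent there in bulk.
      const group = groups.get(sources[0]) ?? [];
      group.push(pattern.id);
      groups.set(sources[0], group);
    } else {
      federated.push({ id: pattern.id, sources });
    }
  }
  return { groups, federated, empty };
}

// Made-up example input, for illustration only.
const toyPlan = planPatterns([
  { id: 'p1', candidates: [{ source: 'A', hasResults: true }, { source: 'B', hasResults: false }] },
  { id: 'p2', candidates: [{ source: 'A', hasResults: true }, { source: 'B', hasResults: false }] },
  { id: 'p3', candidates: [{ source: 'A', hasResults: true }, { source: 'B', hasResults: true }] },
  { id: 'p4', candidates: [{ source: 'A', hasResults: false }, { source: 'B', hasResults: false }] },
]);

console.log(toyPlan.groups.get('A')); // [ 'p1', 'p2' ]
console.log(toyPlan.federated);       // [ { id: 'p3', sources: [ 'A', 'B' ] } ]
console.log(toyPlan.empty);           // [ 'p4' ]
```

    In this sketch, p1 and p2 end up in a single group for source A, p3 remains split across both sources, and p4 is recorded as empty.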

    End-users of Comunica will see a significant performance improvement when federating across multiple sources, especially if some of those sources would be SPARQL endpoints. Below, you can find some high-level performance comparisons of queries in Comunica 2.x vs 3.x.

    | Query | Comunica 2.x | Comunica 3.x |
    | --- | --- | --- |
    | Books by San Franciscans in Harvard Library (DBpedia TPF) | 5774.32 ms (669 requests) | 4923.86 ms (334 requests) |
    | Books by San Franciscans in Harvard Library (DBpedia SPARQL) | Timeout | 8469.86 ms (632 requests) |
    | Compounds in Lindas and Rhea | Timeout | 424.57 ms (41 requests) |

    Inspecting source selection results

    If you are interested in understanding how Comunica will split up queries across multiple sources, you can make use of the logical explain mode.

    For example, if we want to execute the following query across three sources (https://dbpedia.org/sparql (SPARQL), http://data.linkeddatafragments.org/viaf (TPF), http://data.linkeddatafragments.org/harvard (TPF)), the logical explain mode will show us how this query is split up and assigned to each source.

    Query:

    SELECT ?person ?name ?book ?title {
      ?person dbpedia-owl:birthPlace [ rdfs:label "San Francisco"@en ].
      ?viafID schema:sameAs ?person;
                   schema:name ?name.
      ?book dc:contributor [ foaf:name ?name ];
                  dc:title ?title.
    } LIMIT 100
    

    Explain:

    comunica-sparql \
        https://dbpedia.org/sparql http://data.linkeddatafragments.org/viaf http://data.linkeddatafragments.org/harvard \
        -f query.sparql --explain logical
    {
      "type": "slice",
      "input": {
        "type": "project",
        "input": {
          "type": "join",
          "input": [
            {
              "type": "join",
              "input": [
                {
                  "type": "union",
                  "input": [
                    {
                      "termType": "Quad",
                      "value": "",
                      "subject": {
                        "termType": "Variable",
                        "value": "viafID"
                      },
                      "predicate": {
                        "termType": "NamedNode",
                        "value": "http://schema.org/sameAs"
                      },
                      "object": {
                        "termType": "Variable",
                        "value": "person"
                      },
                      "graph": {
                        "termType": "DefaultGraph",
                        "value": ""
                      },
                      "type": "pattern",
                      "metadata": {
                        "scopedSource": "QuerySourceHypermedia(https://dbpedia.org/sparql)(SkolemID:0)"
                      }
                    },
                    {
                      "termType": "Quad",
                      "value": "",
                      "subject": {
                        "termType": "Variable",
                        "value": "viafID"
                      },
                      "predicate": {
                        "termType": "NamedNode",
                        "value": "http://schema.org/sameAs"
                      },
                      "object": {
                        "termType": "Variable",
                        "value": "person"
                      },
                      "graph": {
                        "termType": "DefaultGraph",
                        "value": ""
                      },
                      "type": "pattern",
                      "metadata": {
                        "scopedSource": "QuerySourceHypermedia(http://data.linkeddatafragments.org/viaf)(SkolemID:1)"
                      }
                    }
                  ]
                },
                {
                  "type": "union",
                  "input": [
                    {
                      "termType": "Quad",
                      "value": "",
                      "subject": {
                        "termType": "Variable",
                        "value": "g_1"
                      },
                      "predicate": {
                        "termType": "NamedNode",
                        "value": "http://xmlns.com/foaf/0.1/name"
                      },
                      "object": {
                        "termType": "Variable",
                        "value": "name"
                      },
                      "graph": {
                        "termType": "DefaultGraph",
                        "value": ""
                      },
                      "type": "pattern",
                      "metadata": {
                        "scopedSource": "QuerySourceHypermedia(https://dbpedia.org/sparql)(SkolemID:0)"
                      }
                    },
                    {
                      "termType": "Quad",
                      "value": "",
                      "subject": {
                        "termType": "Variable",
                        "value": "g_1"
                      },
                      "predicate": {
                        "termType": "NamedNode",
                        "value": "http://xmlns.com/foaf/0.1/name"
                      },
                      "object": {
                        "termType": "Variable",
                        "value": "name"
                      },
                      "graph": {
                        "termType": "DefaultGraph",
                        "value": ""
                      },
                      "type": "pattern",
                      "metadata": {
                        "scopedSource": "QuerySourceHypermedia(http://data.linkeddatafragments.org/harvard)(SkolemID:2)"
                      }
                    }
                  ]
                },
                {
                  "type": "union",
                  "input": [
                    {
                      "termType": "Quad",
                      "value": "",
                      "subject": {
                        "termType": "Variable",
                        "value": "book"
                      },
                      "predicate": {
                        "termType": "NamedNode",
                        "value": "http://purl.org/dc/terms/title"
                      },
                      "object": {
                        "termType": "Variable",
                        "value": "title"
                      },
                      "graph": {
                        "termType": "DefaultGraph",
                        "value": ""
                      },
                      "type": "pattern",
                      "metadata": {
                        "scopedSource": "QuerySourceHypermedia(https://dbpedia.org/sparql)(SkolemID:0)"
                      }
                    },
                    {
                      "termType": "Quad",
                      "value": "",
                      "subject": {
                        "termType": "Variable",
                        "value": "book"
                      },
                      "predicate": {
                        "termType": "NamedNode",
                        "value": "http://purl.org/dc/terms/title"
                      },
                      "object": {
                        "termType": "Variable",
                        "value": "title"
                      },
                      "graph": {
                        "termType": "DefaultGraph",
                        "value": ""
                      },
                      "type": "pattern",
                      "metadata": {
                        "scopedSource": "QuerySourceHypermedia(http://data.linkeddatafragments.org/harvard)(SkolemID:2)"
                      }
                    }
                  ]
                }
              ]
            },
            {
              "type": "join",
              "input": [
                {
                  "termType": "Quad",
                  "value": "",
                  "subject": {
                    "termType": "Variable",
                    "value": "person"
                  },
                  "predicate": {
                    "termType": "NamedNode",
                    "value": "http://dbpedia.org/ontology/birthPlace"
                  },
                  "object": {
                    "termType": "Variable",
                    "value": "g_0"
                  },
                  "graph": {
                    "termType": "DefaultGraph",
                    "value": ""
                  },
                  "type": "pattern"
                },
                {
                  "termType": "Quad",
                  "value": "",
                  "subject": {
                    "termType": "Variable",
                    "value": "g_0"
                  },
                  "predicate": {
                    "termType": "NamedNode",
                    "value": "http://www.w3.org/2000/01/rdf-schema#label"
                  },
                  "object": {
                    "termType": "Literal",
                    "value": "San Francisco",
                    "language": "en",
                    "datatype": {
                      "termType": "NamedNode",
                      "value": "http://www.w3.org/1999/02/22-rdf-syntax-ns#langString"
                    }
                  },
                  "graph": {
                    "termType": "DefaultGraph",
                    "value": ""
                  },
                  "type": "pattern"
                }
              ],
              "metadata": {
                "scopedSource": "QuerySourceHypermedia(https://dbpedia.org/sparql)(SkolemID:0)"
              }
            },
            {
              "type": "join",
              "input": [
                {
                  "termType": "Quad",
                  "value": "",
                  "subject": {
                    "termType": "Variable",
                    "value": "viafID"
                  },
                  "predicate": {
                    "termType": "NamedNode",
                    "value": "http://schema.org/name"
                  },
                  "object": {
                    "termType": "Variable",
                    "value": "name"
                  },
                  "graph": {
                    "termType": "DefaultGraph",
                    "value": ""
                  },
                  "type": "pattern",
                  "metadata": {
                    "scopedSource": "QuerySourceHypermedia(http://data.linkeddatafragments.org/viaf)(SkolemID:1)"
                  }
                }
              ]
            },
            {
              "type": "join",
              "input": [
                {
                  "termType": "Quad",
                  "value": "",
                  "subject": {
                    "termType": "Variable",
                    "value": "book"
                  },
                  "predicate": {
                    "termType": "NamedNode",
                    "value": "http://purl.org/dc/terms/contributor"
                  },
                  "object": {
                    "termType": "Variable",
                    "value": "g_1"
                  },
                  "graph": {
                    "termType": "DefaultGraph",
                    "value": ""
                  },
                  "type": "pattern",
                  "metadata": {
                    "scopedSource": "QuerySourceHypermedia(http://data.linkeddatafragments.org/harvard)(SkolemID:2)"
                  }
                }
              ]
            }
          ]
        },
        "variables": [
          {
            "termType": "Variable",
            "value": "person"
          },
          {
            "termType": "Variable",
            "value": "name"
          },
          {
            "termType": "Variable",
            "value": "book"
          },
          {
            "termType": "Variable",
            "value": "title"
          }
        ]
      },
      "start": 0,
      "length": 100
    }
    

    The scopedSource annotations show which sources apply to which operations. The above shows that most of the query will be split at triple pattern level across the different sources, except for the patterns ?person dbpedia-owl:birthPlace [ rdfs:label "San Francisco"@en ]., which have been identified as exclusively applying to https://dbpedia.org/sparql and can therefore be sent as-is to the SPARQL endpoint.
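
    For larger plans, reading these annotations by eye becomes tedious. The following minimal sketch walks a plan and lists every operation that carries a scopedSource annotation. It assumes only the structure visible in the output above (a type field, an input that is either one operation or an array, and an optional metadata.scopedSource), so operations that nest their children under other keys would need extra handling. The function name and the trimmed sample plan are illustrative.

```javascript
// Return the child operations of a plan node, based on its `input` field.
function inputsOf(operation) {
  if (Array.isArray(operation.input)) {
    return operation.input;
  }
  return operation.input ? [operation.input] : [];
}

// Recursively collect { type, source } for every annotated operation,
// in the order the operations appear in the plan (parents before children).
function collectScopedSources(operation, found = []) {
  if (operation.metadata && operation.metadata.scopedSource) {
    found.push({ type: operation.type, source: operation.metadata.scopedSource });
  }
  for (const child of inputsOf(operation)) {
    collectScopedSources(child, found);
  }
  return found;
}

// Heavily trimmed fragment in the style of the explain output above.
const samplePlan = {
  type: 'join',
  input: [
    {
      type: 'union',
      input: [
        { type: 'pattern', metadata: { scopedSource: 'QuerySourceHypermedia(https://dbpedia.org/sparql)(SkolemID:0)' } },
        { type: 'pattern', metadata: { scopedSource: 'QuerySourceHypermedia(http://data.linkeddatafragments.org/viaf)(SkolemID:1)' } },
      ],
    },
    {
      type: 'join',
      input: [{ type: 'pattern' }, { type: 'pattern' }],
      metadata: { scopedSource: 'QuerySourceHypermedia(https://dbpedia.org/sparql)(SkolemID:0)' },
    },
  ],
};

for (const { type, source } of collectScopedSources(samplePlan)) {
  console.log(`${type} -> ${source}`);
}
```

    For the sample, this lists two pattern operations (one per source) followed by the join that is assigned as a whole to https://dbpedia.org/sparql.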

    Hereafter, this post will discuss the internal changes in more detail for developers that want to update their implementations to this new architecture.

    🔍 Query Source Identify bus

    @comunica/bus-query-source-identify is a new bus that roughly replaces the @comunica/bus-rdf-resolve-quad-pattern and @comunica/bus-rdf-resolve-quad-pattern-hypermedia buses. The main difference is that @comunica/bus-query-source-identify runs before query execution within the @comunica/bus-context-preprocess bus, while the old buses ran during query execution. Running things before query execution opens up more optimization opportunities, which made actors such as @comunica/actor-optimize-query-operation-filter-pushdown and @comunica/actor-optimize-query-operation-prune-empty-source-operations possible.

    If you had an actor on the @comunica/bus-rdf-resolve-quad-pattern or @comunica/bus-rdf-resolve-quad-pattern-hypermedia bus, it can now be moved to the @comunica/bus-query-source-identify or @comunica/bus-query-source-identify-hypermedia bus. The main API changes here are that sources now need to implement the IQuerySource interface, that they need to announce the shape of query operations they support (instead of only quad patterns), and that these operations need to be executable within the source.
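
    The idea of a source announcing the operations it supports can be illustrated with a deliberately simplified model. The sketch below is not the IQuerySource interface and does not use Comunica's actual shape format: the ToySource class, its accepts method, and the lists of operation types are hypothetical and only serve to show how a planner could decide whether a whole operation can be handed to a source.

```javascript
// Hypothetical, simplified model; this is NOT the actual IQuerySource interface.
class ToySource {
  constructor(name, supportedTypes) {
    this.name = name;
    this.supportedTypes = new Set(supportedTypes);
  }

  // True if this source supports the operation type and all nested inputs.
  accepts(operation) {
    if (!this.supportedTypes.has(operation.type)) {
      return false;
    }
    let inputs = [];
    if (Array.isArray(operation.input)) {
      inputs = operation.input;
    } else if (operation.input) {
      inputs = [operation.input];
    }
    return inputs.every((child) => this.accepts(child));
  }
}

// Assumed capabilities, chosen for illustration only.
const tpfLike = new ToySource('tpf-like', ['pattern']);
const sparqlLike = new ToySource('sparql-like', ['pattern', 'join', 'union', 'filter']);

const joinOfPatterns = {
  type: 'join',
  input: [{ type: 'pattern' }, { type: 'pattern' }],
};

console.log(tpfLike.accepts({ type: 'pattern' })); // true
console.log(tpfLike.accepts(joinOfPatterns));      // false
console.log(sparqlLike.accepts(joinOfPatterns));   // true
```

    In this toy model, the join can be sent as a whole to the SPARQL-like source, while for the TPF-like source it would have to be split into separate patterns and joined locally.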

    🚌 Query Process bus

    @comunica/bus-query-process is a new bus that contains all logic for fully processing a query, which usually involves steps such as parsing, optimizing, and evaluating, each of which can be delegated to other buses. All of this logic was previously contained within @comunica/actor-init-query, together with a lot of other boilerplate logic, which made things very difficult for developers who wanted to modify a small part of the query process. With this new bus, developers can more easily plug in custom query process actors, such as adaptive query planners.

    Full changelog

    While this blog post explained the primary changes in Comunica 3.x, there are actually many more smaller changes internally that will make your lives easier. If you want to learn more about these changes, check out the full changelog.