Statistics Tracking System



    Comunica includes a dynamic runtime statistics tracking system that monitors key performance metrics throughout query execution.
    These statistics cover aspects such as the number of intermediate results, discovered links, and dereferenced links. Note that this behavior is disabled by default; enabling it requires configuration steps described in the following sections.

    As the query executes, Comunica emits these metrics in real time, enabling other (future) Comunica actors to adaptively optimize query processing.
    After query completion, these statistics can be analyzed to gain insights into execution behavior and performance characteristics. The available types of statistics are described in more detail below.

    Comunica automatically tracks link discovery and link dereference events when StatisticLinkDiscovery and StatisticLinkDereference instances are added to the context under the appropriate key.
    Each statistic instance exposes this key through its key attribute.

    When querying from a JavaScript application, statistics tracking can be enabled as follows:

    import { QueryEngine } from '@comunica/query-sparql';
    import { StatisticLinkDereference } from "@comunica/statistic-link-dereference";
    import { ILink, QueryStringContext } from '@comunica/types';
    
    const engine = new QueryEngine();
    const query = `SELECT * WHERE {
      ?s ?p ?o.
    }
    LIMIT 1000`;
    
    const context: QueryStringContext = { sources: ["https://fragments.dbpedia.org/2016-04/en"] };
    const statisticLinkDereference = new StatisticLinkDereference();
    context[statisticLinkDereference.key.name] = statisticLinkDereference;
    
    const bindingsStream = await engine.queryBindings(query, context);
    await bindingsStream.toArray();
    
    

    These statistic trackers act as event emitters, enabling other processes to listen and react to emitted events.

    StatisticLinkDiscovery

    This tracker emits an arc representing a link discovery, where the parent is the URI of the document from which a new child URI is discovered. This is particularly useful in link traversal scenarios, where many such links are followed during query execution.

    For each discovered link, the tracker emits metadata for both the parent and child. Since the same URI can be discovered multiple times from different sources, the metadata is structured as a list of records. Each metadata record includes:

    • discoveredTimestamp: The timestamp when the URI was discovered.

    • discoverOrder: A numeric value indicating the discovery order.

    • (Optionally) additional metadata associated with the link, as generated by the engine.
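    As an illustration of how these arcs might be consumed, the following minimal sketch groups discovered child URIs by parent document and looks up the earliest discovery order among a URI's metadata records. The DiscoveryArc and DiscoveryRecord shapes are hypothetical stand-ins written for this example; apart from discoveredTimestamp and discoverOrder, the field names are assumptions and may not match what the tracker actually emits.

```typescript
// Hypothetical stand-ins for the emitted data. Only discoveredTimestamp and
// discoverOrder come from the description above; all other names are assumptions.
interface DiscoveryRecord {
  discoveredTimestamp: number;
  discoverOrder: number;
  [extra: string]: unknown;
}

interface DiscoveryArc {
  parent: string;
  child: string;
  metadataParent: DiscoveryRecord[];
  metadataChild: DiscoveryRecord[];
}

// Group discovered child URIs by the parent document they were found in.
function buildLinkGraph(arcs: DiscoveryArc[]): Map<string, Set<string>> {
  const graph = new Map<string, Set<string>>();
  for (const arc of arcs) {
    const children = graph.get(arc.parent) ?? new Set<string>();
    children.add(arc.child);
    graph.set(arc.parent, children);
  }
  return graph;
}

// A URI can be discovered several times, so take the smallest discoverOrder
// across its metadata records. Returns undefined when there are no records.
function firstDiscoverOrder(records: DiscoveryRecord[]): number | undefined {
  if (records.length === 0) {
    return undefined;
  }
  return Math.min(...records.map(record => record.discoverOrder));
}
```

    In a link traversal setting, a listener registered on the tracker could append each emitted arc to an array and pass that array to buildLinkGraph once the query has finished.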

    StatisticLinkDereference

    This tracker emits an event whenever a link is dereferenced by the engine. The emitted data follows the ILink interface and includes:

    • url: The dereferenced URL.

    • context and transform: Link information provided via the ILink interface.

    • metadata: Metadata, which contains:

      • type: The type of source class used by Comunica.

      • dereferencedTimestamp: The timestamp when the dereference occurred.

      • dereferenceOrder: A numeric value indicating the dereference order, analogous to discoverOrder in StatisticLinkDiscovery.

      • (Optionally) additional metadata associated with the link, as generated by the engine.

    Example Use Case

    These trackers can be used, for instance, to monitor which pages of a triple pattern fragment source are dereferenced and at what timestamps.

    
    const dereferencedUris: string[] = [];
    const timestamps: number[] = [];
    
    statisticLinkDereference.on((data: ILink) => {
        dereferencedUris.push(data.url);
        timestamps.push(data.metadata!.dereferencedTimestamp);
    });
    

    Other Comunica actors can access the statistic tracker objects by retrieving them from the context.
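
    Once dereference events have been collected as in the example use case, they can be post-processed after the query completes. The sketch below orders events by dereferenceOrder and computes the span between the first and last timestamp. DereferenceEvent is a hypothetical stand-in for the parts of ILink used here, and summarizeDereferences is an illustrative helper, not part of Comunica. The unit of the elapsed value is whatever unit the emitted timestamps use.

```typescript
// Stand-in for the parts of ILink described above; the real interface is
// defined in @comunica/types and may carry additional fields.
interface DereferenceMetadata {
  dereferencedTimestamp: number;
  dereferenceOrder: number;
  type?: string;
}

interface DereferenceEvent {
  url: string;
  metadata?: DereferenceMetadata;
}

interface DereferenceSummary {
  orderedUrls: string[];
  elapsed: number;
}

type TrackedEvent = DereferenceEvent & { metadata: DereferenceMetadata };

// Illustrative helper: order events by dereferenceOrder and measure the
// time between the earliest and latest dereference.
function summarizeDereferences(events: DereferenceEvent[]): DereferenceSummary {
  // Skip events without metadata instead of asserting that it is present.
  const tracked = events.filter(
    (event): event is TrackedEvent => event.metadata !== undefined,
  );
  const sorted = [...tracked].sort(
    (a, b) => a.metadata.dereferenceOrder - b.metadata.dereferenceOrder,
  );
  const timestamps = sorted.map(event => event.metadata.dereferencedTimestamp);
  const elapsed = timestamps.length === 0
    ? 0
    : Math.max(...timestamps) - Math.min(...timestamps);
  return { orderedUrls: sorted.map(event => event.url), elapsed };
}
```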

    Command Line

    Comunica does not natively support statistic tracking from the command line.
    However, support could be added with an actor that subscribes to bus-context-preprocess and adds the statistic trackers to the context.
    Such an actor would be similar in implementation to @comunica/actor-context-preprocess-set-defaults.

    Intermediate Result Events

    Tracking intermediate results follows a similar approach to the other statistics, but it requires a custom configuration with additional setup steps.
    Intermediate result events are captured by wrapping the query-operation and rdf-join streams with a callback.
    This wrapping is performed by actors subscribing to the @comunica/bus-iterator-transform bus, which allows multiple wraps of a single stream.
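
    The idea of wrapping a stream with a callback, and of stacking several wraps on one stream, can be sketched independently of Comunica. The example below is not Comunica's implementation: wrapWithCallback and wrapAll are hypothetical helpers, and synchronous iterables are used for brevity, whereas the engine's streams are asynchronous.

```typescript
type Callback<T> = (item: T) => void;

// Wrap a stream so the callback observes every item as it passes through.
// The items themselves are forwarded unchanged, and nothing runs until the
// wrapped stream is consumed.
function* wrapWithCallback<T>(source: Iterable<T>, callback: Callback<T>): Generator<T> {
  for (const item of source) {
    callback(item);
    yield item;
  }
}

// Apply several wrappers in sequence, so that a single stream can be
// observed by more than one callback.
function wrapAll<T>(source: Iterable<T>, callbacks: Callback<T>[]): Iterable<T> {
  return callbacks.reduce<Iterable<T>>(
    (stream, callback) => wrapWithCallback(stream, callback),
    source,
  );
}
```

    Because each wrapper only observes and forwards items, the consumer of the stream receives the same results whether or not tracking is enabled.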

    To track intermediate results produced by join actors, include the following actor:

    {
      "@context": [
        "https://linkedsoftwaredependencies.org/bundles/npm/@comunica/runner/^4.0.0/components/context.jsonld",
        "https://linkedsoftwaredependencies.org/bundles/npm/@comunica/actor-rdf-join-wrap-stream/^4.0.0/components/context.jsonld"
      ],
      "@id": "urn:comunica:default:Runner",
      "@type": "Runner",
      "actors": [
        {
          "@id": "urn:comunica:default:rdf-join/actors#wrap-stream",
          "@type": "ActorRdfJoinWrapStream",
          "mediatorJoinSelectivity": { "@id": "urn:comunica:default:rdf-join-selectivity/mediators#main" },
          "mediatorJoin": { "@id": "urn:comunica:default:rdf-join/mediators#main" },
          "mediatorIteratorTransform": { "@id": "urn:comunica:default:iterator-transform/mediators#main" }
        }
      ]
    }
    

    To track intermediate results produced by query operation actors, include the following actor:

    {
      "@context": [
        "https://linkedsoftwaredependencies.org/bundles/npm/@comunica/runner/^4.0.0/components/context.jsonld",
        "https://linkedsoftwaredependencies.org/bundles/npm/@comunica/actor-query-operation-wrap-stream/^4.0.0/components/context.jsonld"
      ],
      "@id": "urn:comunica:default:Runner",
      "@type": "Runner",
      "actors": [
        {
          "@id": "urn:comunica:default:query-operation/actors#wrap-stream",
          "@type": "ActorQueryOperationWrapStream",
          "mediatorQueryOperation": { "@id": "urn:comunica:default:query-operation/mediators#main" },
          "mediatorIteratorTransform": { "@id": "urn:comunica:default:iterator-transform/mediators#main" }
        }
      ]
    }
    

    To define the wrapper applied to each intermediate result stream, include the following actor and mediator:

    {
      "@context": [
        "https://linkedsoftwaredependencies.org/bundles/npm/@comunica/runner/^4.0.0/components/context.jsonld",
        "https://linkedsoftwaredependencies.org/bundles/npm/@comunica/actor-iterator-transform-record-intermediate-results/^4.0.0/components/context.jsonld"
      ],
      "@id": "urn:comunica:default:Runner",
      "@type": "Runner",
      "actors": [
        {
          "@id": "urn:comunica:default:iterator-transform/actors#record-intermediate-results",
          "@type": "ActorIteratorTransformRecordIntermediateResults",
          "wraps": ["inner", "minus", "optional", "project"]
        }
      ]
    }
    

    Mediator:

    {
      "@context": [
        "https://linkedsoftwaredependencies.org/bundles/npm/@comunica/bus-iterator-transform/^4.0.0/components/context.jsonld",
        "https://linkedsoftwaredependencies.org/bundles/npm/@comunica/mediator-combine-pipeline/^4.0.0/components/context.jsonld"
      ],
      "@id": "urn:comunica:default:iterator-transform/mediators#main",
      "@type": "MediatorCombinePipeline",
      "bus": { "@id": "ActorIteratorTransform:_default_bus" },
      "filterFailures": true
    }
    

    These actor and mediator configurations can be found in the engines/config-query-sparql/config folder at the following locations:

    • rdf-join/actors-wrap-stream.json
    • query-operation/actors/query/wrap-stream.json
    • iterator-transform/actors.json
    • iterator-transform/mediators.json

    The wraps field in the ActorIteratorTransformRecordIntermediateResults configuration specifies which query operations or joins will be wrapped for intermediate result tracking.

    In the example configuration, intermediate results are recorded for inner, minus, and optional joins, as well as the final project result.

    These intermediate results are produced and emitted by StatisticIntermediateResults, which must be included in the query context using the same approach as StatisticLinkDiscovery and StatisticLinkDereference. An example is provided below.

    import { StatisticIntermediateResults } from "@comunica/statistic-intermediate-results";
    import { PartialResult, QueryStringContext } from '@comunica/types';
    import { QueryEngine } from '@comunica/query-sparql';
    
    const engine = new QueryEngine();
    const query = `
    PREFIX dbpprop: <http://dbpedia.org/property/>
    
    SELECT * WHERE {
      ?s dbpprop:format ?o0;
        dbpprop:isCitedBy ?o1;
        dbpprop:title ?o2
    }
    LIMIT 10`;
    
    let intermediateResultsProduced = 0;
    const statisticIntermediateResults = new StatisticIntermediateResults();
    statisticIntermediateResults.on((data: PartialResult) => {
        intermediateResultsProduced++;
        // Optionally you can determine the type of data by checking the
        // data.type field (which is either "bindings" or "quads")
        if (data.type === "bindings") {
            // consume intermediate bindings
        } else {
            // consume intermediate quads
        }
    });
    const context: QueryStringContext = { sources: ["https://fragments.dbpedia.org/2016-04/en"] };
    context[statisticIntermediateResults.key.name] = statisticIntermediateResults;
    
    const bindingsStream = await engine.queryBindings(query, context);
    const results = await bindingsStream.toArray();
    // Log the number of intermediate results.
    console.log(intermediateResultsProduced);
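
    Instead of a single counter, intermediate results could also be tallied per result type. The following is a minimal sketch under that assumption: PartialResultLike is a stand-in for the part of PartialResult used in the example above (the type field), and IntermediateResultTally is a hypothetical helper class, not part of Comunica.

```typescript
// Stand-in for the part of PartialResult used here: only the type field,
// which is either "bindings" or "quads".
interface PartialResultLike {
  type: 'bindings' | 'quads';
}

// Hypothetical helper that counts intermediate results per result type.
class IntermediateResultTally {
  private readonly counts = { bindings: 0, quads: 0 };

  // Declared as an arrow function so it can be handed over as a listener
  // without losing its binding to this instance.
  public readonly record = (result: PartialResultLike): void => {
    this.counts[result.type]++;
  };

  // Number of intermediate results seen for one type.
  public count(type: 'bindings' | 'quads'): number {
    return this.counts[type];
  }

  // Number of intermediate results seen across both types.
  public get total(): number {
    return this.counts.bindings + this.counts.quads;
  }
}
```

    Assuming the tracker's on method accepts a listener as in the example above, the tally's record function could be registered in place of the inline callback, and tally.total read after the query completes.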