Data$ourcery!
Making Custom Datomic Datalog Datasources
Disclaimers
What follows is not meant for those new to Datomic. This is a very deep dive into Datomic on-prem internals–even deeper than usual! I don’t have access to any source code or any insider knowledge and none of the interfaces discussed here are public, so expect inaccuracies!
Caveat Lector out of the way, let’s get started.
What are Datasources
Datomic datalog’s :where
clause has “data-pattern” sub-clauses. For example:
[:find ?foo
:where
[(+ 1 1) ?foo] ;; This is a function-expression clause
[?foo :attribute ?bar] ;; This is a data-pattern clause--the one we care about.
[(< ?bar 1)] ;; This is a predicate-expression clause
]
Data pattern clauses match tuples in a “datasource”, which we can also call a relation. Syntactically, datasources are datalog symbols that start with $
.
[:find ?foo
:in $ $h ;; two datasources $ and $h
:where
[?foo :attribute ?bar] ;; Implicitly datasource $
[$h ?bar :other-attribute ?baz] ;; Explicitly datasource $h
]
Usually a datasource is a Datomic database, but that’s not the only thing it can be!
My aim is to show you what “makes” a datasource, so you can understand the performance of datalog queries better and potentially make your own datasources.
(Spoiler alert: a datasource is a protocol.)
The ExtRel
Protocol
A datasource is anything that has a useful implementation of the datomic.datalog/ExtRel
protocol. (I’m not sure what this protocol name abbreviates. Perhaps “existential relation”?)
datomic.datalog/ExtRel
;; This is the map that defprotocol creates:
=>
{:on datomic.datalog.ExtRel,
:on-interface datomic.datalog.ExtRel,
;; Note the method signature
:sigs {:extrel {:name extrel,
:arglists ([src consts starts whiles]),
:doc nil}},
:var #'datomic.datalog/ExtRel,
:method-map {:extrel :extrel},
:method-builders {#'datomic.datalog/extrel #object[datomic.datalog$fn__17615 0x576cf258 "datomic.datalog$fn__17615@576cf258"]},
;; Note there are four implementations
:impls {nil {:extrel #object[datomic.datalog$fn__17637 0x3c5ce098 "datomic.datalog$fn__17637@3c5ce098"]},
java.lang.Object {:extrel #object[datomic.datalog$fn__17639 0x66250ab1 "datomic.datalog$fn__17639@66250ab1"]},
datomic.db.Db {:extrel #object[datomic.datalog$fn__17641 0x2effa778 "datomic.datalog$fn__17641@2effa778"]},
java.util.Map {:extrel #object[datomic.datalog$fn__17643 0x1e317ecd "datomic.datalog$fn__17643@1e317ecd"]},
java.util.Collection {:extrel #object[datomic.datalog$fn__17645 0x45b215b3 "datomic.datalog$fn__17645@45b215b3"]}}}
From the protocol map we know that its definition looked something like this:
(defprotocol ExtRel
(extrel [src consts starts whiles]))
And we know a few implementing objects to investigate.
Built-in Datasources
Collections
The nil
and Object
implementations are just to throw error messages:
(d/q '[:find ?x ?y :in $ :where [?x ?y]]
nil)
Execution error (Exceptions$IllegalArgumentExceptionInfo) at datomic.error/arg (error.clj:79).
:db.error/invalid-data-source Nil or missing data source. Did you forget to pass a database argument?
(d/q '[:find ?x ?y :in $ :where [?x ?y]]
(Object.))
Execution error (Exceptions$IllegalArgumentExceptionInfo) at datomic.error/arg (error.clj:79).
:db.error/invalid-data-source class java.lang.Object is not a valid data source type.
Although nil
and Object
technically implement ExtRel
, these implementations are not “useful”, so I don’t want to call these “datasources”.
By contrast, java.util.Collection
allows you to use any collection of tuples as a datasource:
;; Using a vector of tuples
(d/q '[:find ?e ?attr ?v
:in $
:where
[(ground 1) ?e]
[?e ?attr ?v]]
[[1 :int 2] [1 :int 3] [2 :int 4]])
=> #{[1 :int 3] [1 :int 2]}
Note that this implementation has very few constraints:
- The outer collection can be anything that implements just
size
anditerator
. This includes Clojure’s persistent vectors and sets. - The tuples can be anything that supports indexed access via
nth
. - Your tuples can be any length you want, but they should ideally be the same length to be a true relation. (Also you might get
IndexOutOfRange
exceptions.)
There’s a special case for java.util.Map
that just treats a map as a collection of two-element tuples. It seems to only need an entrySet
method, and it probably just delegates the result to the j.u.Collection
implementation.
ExtRel
Parameters
Let’s instrument the extrel
method to see what it calls. This function will take a datasource and return a wrapped datasource with a trace atom that records every call to its extrel
method.
(defn extrel-trace [base-ds]
(let [trace (atom [])
ds (reify
ExtRel
(extrel [_ consts starts whiles]
(let [args [consts starts whiles]
ret (datomic.datalog/extrel base-ds consts starts whiles)]
(swap! trace conj {:args args :ret ret})
ret)))]
[ds trace]))
And let’s try it on a simple query:
(let [[ds t] (extrel-trace [["e1" :int 1 "extra"]
["e1" :int 2 "extra"]
["e1" :int 3 "extra"]
["e1" :int 4 "extra"]
["e1" :int 5 "extra"]
["e2" :int 2 "extra"]])]
[(d/q '[:find ?v ?extra
:in $
:where
[(ground ["e1" "e2"]) [?e ...]]
[$ ?e :int ?v ?extra]
[(= "extra" ?extra)]
[(< ?v 4)]
[(> ?v 1)]]
ds)
@t])
=>
[#{[2 "extra"] [3 "extra"]}
[{:args [[nil :int nil nil]
[nil nil 1 "extra"]
[nil
nil
#object[datomic.datalog$ranges$fn__18296$fn__18300 0x35ac0c98 "datomic.datalog$ranges$fn__18296$fn__18300@35ac0c98"]
#object[datomic.datalog$ranges$fn__18296$fn__18300 0x20fe78e3 "datomic.datalog$ranges$fn__18296$fn__18300@20fe78e3"]]],
:ret (["e1" :int 1 "extra"]
["e1" :int 2 "extra"]
["e1" :int 3 "extra"]
["e1" :int 4 "extra"]
["e1" :int 5 "extra"]
["e2" :int 2 "extra"])}]]
This query had exactly one data-pattern clause in it: [?e :int ?v ?extra]
. The extrel
method invocation corresponds to this clause. Even though there are two possible values of ?e
, extrel
was only called once. This is because extrel
’s responsibility isn’t to join against the results of previous binding clauses, but only provide relations which can be determined from a static examination of the query and its :in
arguments. (We’ll illustrate this point better later.)
This call illustrates the structure of the consts
starts
and whiles
parameters. All three parameters will have the same length as the data-pattern clause, less the optional src-var
(in this case $
). The call may include information from other surrounding clauses.
consts
contains all constant values, or nil
if the value in that slot is not constant. Note that only :int
is constant. ?e
is not considered constant even though it has a ground
value and could be known statically. Instead, the result of the ground
will be joined against the result of this clause later–remember, extrel
is not about joining.
Subrange optimizations
starts
and while
is an optimization available for datasources which are able to return a subset of values. Through knowledge of the primitive predicates <
and =
used with static arguments datalog was able to determine that ?v
must be >= 1 and ?extra
must start with the string "extra"
. Therefore the corresponding slots of starts
have non-nil values where a containing start value is known: [nil nil 1 "extra"]
.
while
is the same information, but for the end of the range. Each item in the corresponding slot is a predicate which returns false if the value in that slot of a candidate tuple is outside the end range.
;; *1 is the previous result
(let [[_ _ while-v while-extra] (-> *1 peek first :args peek)]
(mapv (juxt identity while-v) (range 6)))
=> [[0 true] [1 true] [2 true] [3 true] [4 false] [5 false]]
A datasource can use starts
and whiles
information–especially if it’s available in a sorted order–to return a subset of its relations which could never possibly join with anything else in the query. All a sorted datasource has to do is start seeking at the starts
slot of its choice and take-while
the corresponding while
slot predicate.
However, applying starts
and whiles
is (as far as I can tell) completely optional for the correctness of the query. If a datasource understands them it can leverage them to reduce the size of relations it returns and thus the number of items involved in subsequent joins, but it is only required to return items which satisfy consts
.
You’ll notice in the trace above that the extrel
method of j.u.Collection
included ["e1" :int 5 "extra"]
even though this doesn’t satisfy whiles
. From what I can tell, the j.u.Collections
implementation only filters all items by consts
and doesn’t use starts
or whiles
.
Datomic Database Datasources
However, a Datomic database does provide sorted items, and can leverage starts
and whiles
to reduce its result-set. Based on the non-nil consts
, starts
, and whiles
slots it can choose an appropriate index to seek. For example, if the attribute is known and its values are indexed and the value start or “while” is known it can do a sub-seek of :avet
.
Let’s trace the extrel
call of an actual Datomic datasource. We’ll use a freshly-created dev connection instead of an in-memory connection so that io-stats will tell us what indexes we are using.
(let [[ds t] (extrel-trace db)]
(-> (d/query {:query '[:find ?v
:where
[?e :db/ident ?v]
[(<= :db/a ?v)]
[(< ?v :db/b)]]
:args [ds]
:query-stats true
:io-context :user/query})
(assoc :extrel-trace @t)))
=>
{:ret #{[:db/add]},
:io-stats {:io-context :user/query,
:api :query,
:api-ms 5.18,
:reads {:avet 1, :dev 1, :ocache 1, :dev-ms 1.57, :avet-load 1}},
:query-stats {:query [:find ?v :where [?e :db/ident ?v] [(<= :db/a ?v)] [(< ?v :db/b)]],
:phases [{:sched (([?e :db/ident ?v] [(<= :db/a ?v)] [(< ?v :db/b)])),
:clauses [{:clause [?e :db/ident ?v],
:rows-in 0,
:rows-out 1,
:binds-in (),
:binds-out [?v],
:preds ([(<= :db/a ?v)] [(< ?v :db/b)]),
:expansion 1,
:warnings {:unbound-vars #{?v ?e}}}]}]},
:extrel-trace [{:args [[nil :db/ident nil]
[nil nil :db/a]
[nil
nil
#object[datomic.datalog$ranges$fn__18296$fn__18300 0x3461f77c "datomic.datalog$ranges$fn__18296$fn__18300@3461f77c"]]],
:ret #object[datomic.datalog.DbRel 0x2e373360 "datomic.datalog.DbRel@2e373360"]}]}
There are three things to note here:
- Alias resolution is the responsibility of the datasource.
- Index choice is partially the responsibility of
extrel
. - The return value of
extrel
is not necessarily a concrete collection but anything that implementsdatomic.datalog/IJoin
.
First, the :db/ident
attribute keyword constant was supplied to the query. Datoms don’t have attribute idents (keywords) in them; rather they have the attribute’s entity id. The Datomic database datasource must translate this to an entity id number. This means if the datasource has any aliasing mechanism that allows queries to refer to values in relations by anything other than their raw value, it’s the responsibility of the datasource to normalize those aliases into their canonical form.
Second, notice from the :io-stats
information that the query used the :avet
index for its reads. This index choice is also the responsibility of the datasource. In this case, it used the pattern of consts
, start
, and while
to choose the :avet
index.
If we don’t supply something that the datalog engine can recognise as a start
or while
parameter, the datasource may choose a different index:
;; Using a custom predicate to hide the subrange selection from datalog
(defn db-starts-with-a [x]
(and (= "db" (namespace x))
(.startsWith (name x) "a")))
(let [[ds t] (extrel-trace db)]
(-> (d/query {:query '[:find ?v
:where
[?e :db/ident ?v]
[(user/db-starts-with-a ?v)]]
:args [ds]
:query-stats true
:io-context ::query})
(assoc :extrel-trace @t)))
=>
{:ret #{[:db/add]},
:io-stats {:io-context :user/query,
:api :query,
:api-ms 5.56,
:reads {:aevt 2, :dev 2, :aevt-load 2, :ocache 2, :dev-ms 2.28}},
:query-stats {:query [:find ?v :where [?e :db/ident ?v] [(user/db-starts-with-a ?v)]],
:phases [{:sched (([?e :db/ident ?v] [(user/db-starts-with-a ?v)])),
:clauses [{:clause [?e :db/ident ?v],
:rows-in 0,
:rows-out 1,
:binds-in (),
:binds-out [?v],
:preds ([(user/db-starts-with-a ?v)]),
:expansion 1,
:warnings {:unbound-vars #{?v ?e}}}]}]},
:extrel-trace [{:args [[nil :db/ident nil] [nil nil nil] [nil nil nil]],
:ret #object[datomic.datalog.DbRel 0x99687c4 "datomic.datalog.DbRel@99687c4"]}]}
Notice in this case the io-stats reports reading two :aevt
index segments instead of one :avet
segment; but the query stats look mostly the same except for the :preds
clause. In this case the extrel
returned something which would seek all idents instead of a subset of them, so more (potential) IO was performed.
Why didn’t :query-stats
show this difference? It still reports “rows-out” as 1. This is because of the third thing to notice, which is that the extrel
call didn’t return a collection but something called a DbRel
. What is this?
The IJoin
Protocol
Datasources actually have a pair of protocols which are needed to evaluate a query. The first one is ExtRel
, which we have just covered in detail. But the second one is what extrel returns. Although the simple builtin extrel
implementations simply return a collection, what extrel is actually expected to return is something which implements datomic.datalog/IJoin
.
datomic.datalog/IJoin
=>
{:on datomic.datalog.IJoin,
:on-interface datomic.datalog.IJoin,
:sigs {:join-project {:name join-project, :arglists ([xs ys join-map project-map-x project-map-y predctor]), :doc nil},
:join-project-with {:name join-project-with,
:arglists ([ys xs join-map project-map-x project-map-y predctor]),
:doc nil}},
:var #'datomic.datalog/IJoin,
:method-map {:join-project-with :join-project-with, :join-project :join-project},
:method-builders {#'datomic.datalog/join-project #object[datomic.datalog$fn__17492 0x1a9d7f3c "datomic.datalog$fn__17492@1a9d7f3c"],
#'datomic.datalog/join-project-with #object[datomic.datalog$fn__17513 0x27dd69e9 "datomic.datalog$fn__17513@27dd69e9"]},
:impls {java.util.Collection {:join-project #object[datomic.datalog$fn__17586 0x1d8e070c "datomic.datalog$fn__17586@1d8e070c"],
:join-project-with #object[datomic.datalog$fn__17588 0x32170047 "datomic.datalog$fn__17588@32170047"]},
java.lang.Object {:join-project #object[datomic.datalog$fn__17590 0x4302eda6 "datomic.datalog$fn__17590@4302eda6"],
:join-project-with #object[datomic.datalog$fn__17592 0xfdc460e "datomic.datalog$fn__17592@fdc460e"]},
datomic.datalog.DbRel {:join-project #object[datomic.datalog$fn__17661 0x6287869e "datomic.datalog$fn__17661@6287869e"],
:join-project-with #object[datomic.datalog$fn__17663 0x6898a7b1 "datomic.datalog$fn__17663@6898a7b1"]}}}
Note that j.u.Collection
implements IJoin
, which is why you can return a normal collection from extrel
.
From this protocol map, we know the protocol definition looked something like this:
(defprotocol IJoin
(join-project [xs ys join-map project-map-x project-map-y predctor])
(join-project-with [xs ys join-map project-map-x project-map-y predctor]))
I must confess I have no idea what join-project
is for. I’ve never observed it invoked.
However, join-project-with
is the method that performs a join, projection, and filtering from two IJoin
-ables xs
and ys
. (Read “project” as a verb, not a noun.)
Here’s an instrumented example of the join-project-with
call. The code below reifies an ExtRel
datasource which returns a reified IJoin
.
(def trace (atom []))
(d/query {:query '[:find ?e ?a ?v
:in $
:where
[?e :int ?i]
[(< 1 ?i)]
[(<= ?i 2)]
[?e :str ?str]
[(clojure.string/starts-with? ?str "foo")]
[?e ?a ?v]
]
:args
(let [ds [["e1" :int 1]
["e1" :int 2]
["e1" :str "foo"]
["e1" :str "bar"]
["e2" :int 1]
["e2" :str "baz"]]]
[(reify ExtRel
(extrel [_ consts starts whiles]
(let [xs (datomic.datalog/extrel ds consts starts whiles)]
(swap! trace conj {:fn 'extrel :args [consts starts whiles] :ret xs})
(reify
IJoin
(join-project-with [_ ys join-map project-map-x project-map-y predctor]
(let [r (datomic.datalog/join-project-with
xs
ys
join-map
project-map-x
project-map-y
predctor)]
(swap! trace conj
{:fn 'join-project-with
:args [xs ys join-map project-map-x project-map-y predctor]
:ret r})
r))))))])})
Now lets look at the trace which I have annotated inline:
@trace
[
;; This is for the clauses [?e :int ?i] [(< 1 ?i)] [(<= ?i 2)]
{:fn extrel,
:args [[nil :int nil]
[nil nil 1]
[nil
nil
#object[datomic.datalog$ranges$fn__18296$fn__18300 0x5474264c "datomic.datalog$ranges$fn__18296$fn__18300@5474264c"]]],
:ret (["e1" :int 1] ["e1" :int 2] ["e2" :int 1])}
;; Now we join the result against an empty initial result set
{:fn join-project-with,
:args [(["e1" :int 1] ["e1" :int 2] ["e2" :int 1]) ;; previous extrel
#{[]} ;; initial result set
{} ;; no joins
;; Projection of xs:
;; Put slot 2 in xs into slot 0 in the result
;; Put slot 0 in xs into slot 1 in the result
{2 0, 0 1}
;; Projection of ys: keep nothing
{}
;; This is a predicate constructor.
;; When called, it will return predicates which should be called
;; to filter results.
;; This is why `extrel` doesn't need to honor `starts` and `whiles`--
;; this is what *really* does the filtering.
#object[datomic.datalog$push_preds$fn__18015$fn__18027 0x33a1ac5e "datomic.datalog$push_preds$fn__18015$fn__18027@33a1ac5e"]],
:ret #{[2 "e1"]}}
;; Now we get the extrel for [?e :str ?str]
;; Note the `starts-with?` predicate is not included.
{:fn extrel,
:args [[nil :str nil] [nil nil nil] [nil nil nil]],
:ret (["e1" :str "foo"] ["e1" :str "bar"] ["e2" :str "baz"])}
;; Now join this extrel with the result of the previous IJoin
{:fn join-project-with,
:args [(["e1" :str "foo"] ["e1" :str "bar"] ["e2" :str "baz"])
;; This is the result of the previous IJoin
;; It is *always* a set.
;; Note the tuple slots correspond to the projection maps.
#{[2 "e1"]}
;; Join slot 0 in xs to slot 1 in ys
;; Here, it means only include tuples with "e1"
{0 1}
;; Project xs 2 to 0, 0 to 1
{2 0, 0 1}
;; Project ys 1 to 1
;; Since 1 in the result is the join target of xs and ys,
;; this is ok--they will never conflict.
{1 1}
;; This has the `starts-with?` predicate in it.
#object[datomic.datalog$push_preds$fn__18015$fn__18027 0x2590becd "datomic.datalog$push_preds$fn__18015$fn__18027@2590becd"]],
:ret #{["foo" "e1"]}}
;; This final extrel is for the clause [?e ?a ?v]
;; On datomic databases, this would eventually throw because it is a full scan.
;; That behavior is from the datasource implementation, not the datalog!
{:fn extrel,
:args [[nil nil nil] [nil nil nil] [nil nil nil]],
:ret [["e1" :int 1] ["e1" :int 2] ["e1" :str "foo"] ["e1" :str "bar"] ["e2" :int 1] ["e2" :str "baz"]]}
;; The final projection is to extract what the `:find` clause wants.
{:fn join-project-with,
:args [[["e1" :int 1] ["e1" :int 2] ["e1" :str "foo"] ["e1" :str "bar"] ["e2" :int 1] ["e2" :str "baz"]]
#{["foo" "e1"]}
{0 1}
{0 0, 1 1, 2 2}
{1 0}
#object[datomic.datalog$truep 0x4922463b "datomic.datalog$truep@4922463b"]],
:ret #{["e1" :str "bar"] ["e1" :int 1] ["e1" :int 2] ["e1" :str "foo"]}}]
If you compare this trace with the :query-stats
output of the query, you’ll notice that its data roughly corresponds to join-project-with
invocations (including row counts) more than the extrel
invocations. In general, :io-stats
tells you more about the relations from extrel
and :query-stats
about join-project-with
calls.
Realizing Relations in IJoin
The existing of IJoin
as a protocol allows ExtRel
to defer realization of the relation until it receives more information from join parameters. In this case, DbRel
isn’t actually reading any datoms–this happens during its IJoin
, where it can make better index choices.
This ExtRel
vs IJoin
split also explains a lot of seemingly inconsistent behavior in datalog queries around lookup-ref resolution. The inconsistency often comes down to whether the lookup could be resolved at extrel
time or at join-project-with
time.
Take the following query as an example.
(let [[ds t] (extrel-trace db)]
(-> (d/query {:query '[:find ?v
:in $ [?a ?v]
:where
[:db.part/db ?a ?v]]
:args [ds [:db.install/attribute :db/doc]]
:io-context :user/query
:query-stats true})
(assoc :extrel-trace @t)))
=>
{:ret #{[:db/doc]},
:io-stats {:io-context :user/query,
:api :query,
:api-ms 4.26,
:reads {:aevt 1, :dev 1, :aevt-load 1, :ocache 1, :dev-ms 1.11}},
:query-stats {:query [:find ?v :in $ [?a ?v] :where [:db.part/db ?a ?v]],
:phases [{:sched (([(ground $__in__2) [?a ?v]] [:db.part/db ?a ?v])),
:clauses [{:clause [(ground $__in__2) [?a ?v]],
:rows-in 0,
:rows-out 1,
:binds-in (),
:binds-out [?a ?v],
:expansion 1}
{:clause [:db.part/db ?a ?v],
:rows-in 1,
:rows-out 1,
:binds-in [?a ?v],
:binds-out [?v]}]}]},
:extrel-trace [{:args [[:db.part/db :db.install/attribute :db/doc] [nil nil nil] [nil nil nil]],
:ret #object[datomic.datalog.DbRel 0x750e1765 "datomic.datalog.DbRel@750e1765"]}]}
In this query the [?a ?v]
is a single tuple and not a relation, so the value of ?a
and ?v
are effectively constant. Thus datalog knows it can supply them as consts
to extrel
, which can interpret them as lookups. And you can see in the extrel-trace
that :db.install/attribute
and :db/doc
were both provided, so extrel
knew the attribute was a ref and the value should be resolved to a ref.
But the following seemingly semantically identical query gives a different result:
(let [[ds t] (extrel-trace db)]
(-> (d/query {:query '[:find ?v
:in $ [[?a ?v]]
:where
[:db.part/db ?a ?v]]
:args [ds [[:db.install/attribute :db/doc]]]
:io-context :user/query
:query-stats true})
(assoc :extrel-trace @t)))
=>
{:ret #{},
:io-stats {:io-context :user/query,
:api :query,
:api-ms 4.55,
:reads {:aevt 6, :dev 4, :aevt-load 6, :ocache 6, :dev-ms 3.81}},
:query-stats {:query [:find ?v :in $ [[?a ?v]] :where [:db.part/db ?a ?v]],
:phases [{:sched (([(ground $__in__2) [[?a ?v]]] [:db.part/db ?a ?v])),
:clauses [{:clause [(ground $__in__2) [[?a ?v]]],
:rows-in 0,
:rows-out 1,
:binds-in (),
:binds-out [?a ?v],
:expansion 1}
{:clause [:db.part/db ?a ?v],
:rows-in 1,
:rows-out 0,
:binds-in [?a ?v],
:binds-out [?v]}]}]},
:extrel-trace [{:args [[:db.part/db nil nil] [nil nil nil] [nil nil nil]],
:ret #object[datomic.datalog.DbRel 0x329effe3 "datomic.datalog.DbRel@329effe3"]}]}
In this case, the ?a
and ?v
were part of a relation and so were not provided to the extrel
of the data-pattern clause for the Datomic datasource. For this to work correctly, the alias resolution would have to happen in the join-project-with
, where it is more difficult to determine if a value should be resolved as a lookup ref.
Note that the :query-stats
for both queries have nearly identical row-counts because they would be the same from the perspective of the join-project-with
clauses. Only the difference in :io-context
reveals that the extrel
call for the second query was clearly seeking more datoms.
Implementing your own IJoin
-able is more difficult than your own ExtRel
because you need to rely on even more interfaces to actually perform the projection, joining, and filtering required. Probably anything that is reduce-able and whose elements are indexed will work, but I don’t know for sure and I won’t explore it here.
However, ExtRel
is pretty easy to fulfil!
Custom Datasources
So far we have just looked at “built-in” datasources: collections and Datomic databases. But because its behavior is governed by the ExtRel
protocol we can create our own datasources by implementing this protocol.
We just need to follow these rules:
ExtRel
takes consts, starts, and take-while predicates and returns a relation implementingIJoin
which only supplies tuples matchingconsts
. Implementations may usestarts
andwhiles
to return a subset of things matchingconsts
.IJoin
performs projection, unification, and filtering against anotherIJoin
and returns the result as anotherIJoin
-able, typically a set of tuples. TheIJoin
object may close over information from theExtRel
that supplied it to defer decisions to join-time if it wants.
Let’s look at some simple examples of useful custom datasources.
Transaction Data with Ident Syntax for Attributes
Tx-data from a transaction is a set of datoms.
(:tx-data (d/with db [{:db/id "new" :db/doc "My new entity"}]))
=>
[#datom[13194139534312 50 #inst"2024-06-18T21:07:05.141-00:00" 13194139534312 true]
#datom[17592186045417 62 "My new entity" 13194139534312 true]]
Conceptually, this is the same as a Datomic database. However, as we have seen, the extrel
of a Datomic database does ident resolution of attributes that normal collections don’t, so this query doesn’t work:
(let [{:keys [tx-data db-after]} (d/with db [{:db/id "new" :db/doc "My new entity"}])
query '[:find ?e :where [?e :db/doc "My new entity"]]]
[
(d/q query db-after)
(d/q query tx-data)
])
=> [#{[17592186045417]} ;; Works as expected with a normal db
#{}] ;; Fails with tx-data
But, what if it could work? All we need is an extrel
that resolves attributes in its consts
to their entity id:
(defn tx-data-extrel [db tx-data]
(reify ExtRel
(extrel [_ consts _starts _whiles]
(let [resolve-ref #(d/entid db %)
[c-e c-araw c-vraw c-tx c-op] consts
c-e (resolve-ref c-e)
attr (when (some? c-araw)
(or (d/attribute db c-araw)
(throw (ex-info "Unknown attribute reference" {:attr c-araw}))))
c-a (:id attr)
ref-v? (= :db.type/ref (:value-type attr))
c-v (if ref-v?
(when (some? c-vraw)
(or (d/entid db c-vraw)
(throw (ex-info "Could not resolve entity reference" {:ref c-vraw}))))
c-vraw)
c-tx (resolve-ref c-tx)
xfs (cond-> []
(some? c-e) (conj (filter #(== ^long c-e ^long (:e %))))
(some? c-a) (conj (filter #(== ^long c-a ^long (:a %))))
(some? c-v) (conj (filter #(= c-v (:v %))))
(some? c-tx) (conj (filter #(== ^long c-tx ^long (:tx %))))
(some? c-op) (conj (filter #(= c-op (:op %)))))]
(into [] (apply comp xfs) tx-data)))))
Now if we use this to wrap the tx-data, we can query using attribute idents:
(let [{:keys [tx-data db-after]} (d/with db [{:db/id "new" :db/doc "My new entity"}])
query '[:find ?e :where [?e :db/doc "My new entity"]]]
(d/q query (tx-data-extrel db-after tx-data)))
=> #{[17592186045417]}
Voilà!
Notice that we return a normal collection as the IJoin
-able, which means that attributes that are supplied as relations still won’t resolve.
For example:
(let [{:keys [tx-data db-after]} (d/with db [{:db/id "new" :db/doc "My new entity"}])
query '[:find ?e
:where
[(ground [:db/doc]) [?a ...]]
;; ?a is not recognized as a scalar constant
[?e ?a "My new entity"]]]
(d/q query (tx-data-extrel db-after tx-data)))
=> #{} ;; Does not match!
Compare with a normal Datomic database, where the DbRel
or something in it is doing some resolution of attribute references:
(let [{:keys [tx-data db-after]} (d/with db [{:db/id "new" :db/doc "My new entity"}])
query '[:find ?e
:where
[(ground [:db/doc]) [?a ...]]
[?e ?a "My new entity"]]]
(d/q query db-after))
=> #{[17592186045417]} ;; Works!
I’ll leave that enhancement as an exercise for the reader!
Subset Relations from Sorted Sets
Perhaps you have a datasource which is large and sorted. You could just use it as a normal j.u.Collection
, but as we saw, the built-in implementation can’t take advantage of sorted-ness.
But what if you could use starts
and whiles
information to return a smaller relation and join across fewer rows?
Perhaps like this:
(defn sorted-set-extrel [s width]
;; width is how many slots per element tuple, which is needed because the
;; default comparator compares length before element content.
;; Assumes set `s` is sorted by its elements and does not contain nil anywhere.
;; Does not assume set has a nil-safe custom comparator, but it's a good enhancement!
(reify ExtRel
(extrel [_ consts starts whiles]
;; We always have to filter by constants
(let [consts-pred (if-some [preds (not-empty
(into []
(comp
(map-indexed (fn [i v] (when v #(= v (nth % i)))))
(filter some?))
consts))]
(apply every-pred preds)
(constantly true))
;; Lets see if "starts" has non-nil prefixes; if so we can use them!
usable-starts (when-some [prefix (not-empty
(take-while some? starts))]
;; Padding the suffix for the builtin comparator
(vec (concat prefix
(repeat (- width (count prefix)) nil))))
;; Use non-nil prefix predicates from whiles too
usable-whiles (not-empty
(into []
(comp
(take-while some?)
(map-indexed (fn [i f] #(f (nth % i)))))
whiles))]
(filterv consts-pred
(if usable-starts
(cond->> (subseq s >= usable-starts)
usable-whiles (take-while (apply every-pred usable-whiles)))
s))))))
Let’s use a 16000-row set as an example.
(def myset (apply sorted-set
(for [i (range 1000)
j ["a" "b" "c" "d"]
k ["a" "b" "c" "d"]]
[i j k])))
(count myset)
=> 16000
If we query a range of this set as a normal collection, many more items will be returned. We won’t see any difference in row-counts the :query-stats
, but we should see higher :api-ms
in io-stats.
(def query
'[:find ?i ?j ?k
:where
[(ground "b") ?k]
[?i ?j ?k]
[(< 10 ?i)]
[(< ?i 13)]])
Running this query a few times on a normal set, :api-ms
stablizes at about 8 ms on my machine.
(d/query {:query query :args [myset] :query-stats true :io-context ::query})
=>
{:ret #{[11 "d" "b"] [11 "c" "b"] [12 "d" "b"] [11 "b" "b"] [12 "c" "b"] [11 "a" "b"] [12 "b" "b"] [12 "a" "b"]},
:io-stats {:io-context :user/query, :api :query, :api-ms 8.1, :reads {}},
:query-stats {:query [:find ?i ?j ?k :where [(ground "b") ?k] [?i ?j ?k] [(< 10 ?i)] [(< ?i 13)]],
:phases [{:sched (([(ground "b") ?k] [?i ?j ?k] [(< 10 ?i)] [(< ?i 13)])),
:clauses [{:clause [(ground "b") ?k],
:rows-in 0,
:rows-out 1,
:binds-in (),
:binds-out [?k],
:expansion 1}
{:clause [?i ?j ?k],
:rows-in 1,
:rows-out 8,
:binds-in [?k],
:binds-out [?k ?j ?i],
:preds ([(< 10 ?i)] [(< ?i 13)]),
:expansion 7}]}]}}
But if we use the custom ExtRel
, it’s a bit under 1 ms!
(d/query {:query query :args [(sorted-set-extrel myset 3)] :query-stats true :io-context ::query})
=>
{:ret #{[11 "d" "b"] [11 "c" "b"] [12 "d" "b"] [11 "b" "b"] [12 "c" "b"] [12 "b" "b"] [11 "a" "b"] [12 "a" "b"]},
:io-stats {:io-context :user/query, :api :query, :api-ms 0.82, :reads {}},
:query-stats {:query [:find ?i ?j ?k :where [(ground "b") ?k] [?i ?j ?k] [(< 10 ?i)] [(< ?i 13)]],
:phases [{:sched (([(ground "b") ?k] [?i ?j ?k] [(< 10 ?i)] [(< ?i 13)])),
:clauses [{:clause [(ground "b") ?k],
:rows-in 0,
:rows-out 1,
:binds-in (),
:binds-out [?k],
:expansion 1}
{:clause [?i ?j ?k],
:rows-in 1,
:rows-out 8,
:binds-in [?k],
:binds-out [?k ?j ?i],
:preds ([(< 10 ?i)] [(< ?i 13)]),
:expansion 7}]}]}}
Note that the query-stats isn’t any different, only the io-stats!
Time-traveling Datasources
One limitation of Datomic queries is that you cannot re-parameterize a database within a query. So for example, you can’t do something like this:
[:find ?e ?as-of-tx ?v
:in $normal $history
:where
;; At moments when ?e :my/attr changed ...
[$history ?e :my/attr _ ?as-of-tx]
;; ... what was the corresponding value of :my/other-attr ?
[$normal-but-as-of-tx ?e :my/other-attr ?v]]
But what would it take to make this possible? An approach could be to extend the tuple syntax to pretend that the first slot is an as-of parameter, like so:
[$ ?as-of-tx ?e :my/other-attr ?v]
The extrel
implementation could do something like this:
(reify ExtRel
(extrel [_ consts starts whiles]
(datomic.datalog/extrel
(db/as-of normal-db (first consts))
(rest consts)
(rest starts)
(rest whiles))))
But this only works in the simplest possible case where the as-of value (first consts)
is non-nil (i.e. a query-constant). Even in our example query this is not the case. To get the sample query working as expected, we need to return something IJoin
-compatible which interprets the first tuple slot as an as-of parameter during the join-project-with
, uses that to set the as-of of the database, then delegates the join to the DbRel
of that database.
In other words, something like this:
(defn as-of-db [db]
(reify ExtRel
(extrel [_ consts starts whiles]
(let [as-of (first consts)
consts (vec (rest consts))
starts (vec (rest starts))
whiles (vec (rest whiles))]
(if (some? as-of)
;; as-of is query scalar constant
(datomic.datalog/extrel (d/as-of db as-of) consts starts whiles)
;; as-of will come from a later join
(reify IJoin
(join-project-with [_ ys join-map pmx pmy predctor]
;; Our column 0 is the as-of value.
;; What column in ys should join to it?
(let [as-of-column-idx (get join-map 0)
_ (when (nil? as-of-column-idx) (throw (ex-info "as-of must be bound!" {})))
;; Turn the single ys into a bunch of ys grouped by db-as-of
ys-by-as-of (group-by #(nth % as-of-column-idx) ys)
;; We need to rewrite the join and project-x maps to reflect that the as-of column doesn't exist in datoms
join-map' (into {}
(keep (fn [[^long x ^long y]]
[(dec x) y]))
(dissoc join-map 0))
pmx' (into {}
(keep (fn [[^long x ^long y]]
[(dec x) y]))
(dissoc pmx 0))]
(into #{}
(mapcat (fn [[as-of ys']]
;; Use the as-of value to set the database ...
(let [dbrel (datomic.datalog/extrel (d/as-of db as-of) consts starts whiles)]
;; and perform a join as usual!
;; Because the as-of value *must* be supplied by the ys *and* joined,
;; we know that if it is wanted in the output it will be projected via pmy,
;; so we can rely on the normal dbrel to assemble the output correctly for us.
(datomic.datalog/join-project-with dbrel ys' join-map' pmx' pmy predctor))))
ys-by-as-of)))))))))
And a demonstration of use:
(let [{db :db-after} (d/with db [{:db/ident :my/attr
:db/valueType :db.type/string
:db/cardinality :db.cardinality/one}
{:db/ident :my/other-attr
:db/valueType :db.type/string
:db/cardinality :db.cardinality/one}])
{db :db-after} (d/with db [{:db/ident :my/entity
:my/other-attr "oldvalue"}])
{db :db-after} (d/with db [{:db/ident :my/entity
:my/attr "ignored"}])
{db :db-after} (d/with db [{:db/ident :my/entity
:my/other-attr "newvalue"}])]
(d/q
'[:find ?e ?as-of-tx ?v
:in $normal $history
:where
[$history ?e :my/attr _ ?as-of-tx]
[$normal ?as-of-tx ?e :my/other-attr ?v]]
(as-of-db db) (d/history db)))
=> #{[17592186045418 13194139534315 "oldvalue"]}
Pretty cool!
Summary
Datomic’s Datalog engine uses the ExtRel
protocol on datasources to get relations–sets of tuples–which correspond to the data-pattern clauses in the query. The extrel
call includes query-wide constants and sometimes start values and take-while predicates inferred from surrounding predicate-expression clauses. The start and take-while predicates are advisory and the datasource may use them to return a smaller relation.
The returned value must be an IJoin
-able which takes a pair of IJoin
-ables, applies joins, projection, and filtering, and returns the next result-set via the join-project-with
method. An extrel
can defer realization of its relations to join-project-with
, and in fact this is what Datomic databases do via the DbRel
“relation”. The advantage of this is you can make better index choices with more information about the join; the disadvantage is that it’s a lot more complicated!
A datasource is responsible for any aliasing behavior (such as keywords for attribute ids) in its implementations of extrel
and join-project-with
.
:query-stats
gives information mostly about join-project-with
calls; deferred realization of relation members can often hide the true number of rows examined by a datasource. Clues to realized-relation sizes come from :io-stats
.
Finally, we looked at three toy examples of custom datasources:
- The tx-data example interprets aliases for more egonomic query syntax (attribute idents instead of numbers) over normal collections.
- The sorted-set example takes advantage of subrange information from
extrel
to produce smaller relations and faster query runtimes. - Finally, we added extra within-query as-of parameterization of Datomic databases via an additional “virtual” column of the relation. This dipped a bit into
IJoin
and messing with joins and projections, but left the heavy-lifting to the default Datomic database implementation.
I hope you found this intriguing and enlightening!