Clojure analysis and introspection

edit: updated the static analysis part to be more balanced

While writing omni-trace I ran into a common tooling problem:

Which references to my function exist in the codebase?

Or turned on its head:

What functions does a function call?

An IDE would use this information to help you with refactoring, and a linter would warn you when a function is missing.

Static analysis

Traditionally this information is generated by static code analysis, although this is harder the more dynamic a language is. Imagine code like:

(-> (str "i" "nc")
    symbol
    resolve
    (apply [5]))

;; 6

There is no general way to know that the inc function is referenced without running the code.

Problems like these aside, static analyzers like clj-kondo can still give us the needed information most of the time:

(require '[clj-kondo.core :as clj-kondo]
         '[clojure.string :as string])

;; lint everything on the classpath and keep only the var usages
(-> (clj-kondo/run! {:lint (string/split (System/getProperty "java.class.path") #":")
                     :config {:output {:analysis true}}})
    :analysis
    :var-usages)

;; =>
[{:alias ana,
  :arity 1,
  :col 59,
  :end-col 89,
  :end-row 36,
  :filename "src/cyrik/omni_trace/instrument/cljs.cljc",
  :fixed-arities #{1 2},
  :from cyrik.omni-trace.instrument.cljs,
  :from-var ->ns,
  :lang :clj,
  :name find-ns,
  :name-col 60,
  :name-end-col 71,
  :name-end-row 36,
  :name-row 36,
  :row 36,
  :to cljs.analyzer.api}

 ;;...
 ]
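Both questions from the introduction can be answered by filtering these maps. Here is a minimal sketch, assuming the :var-usages shape shown above. The helper names callers-of and callees-of and the sample data are made up for illustration and are not part of clj-kondo:

```clojure
;; Hypothetical sample in the shape of clj-kondo's :var-usages output,
;; trimmed to the keys used here.
(def sample-usages
  [{:from 'my.app :from-var 'start :to 'my.db   :name 'connect}
   {:from 'my.app :from-var 'start :to 'my.http :name 'serve}
   {:from 'my.cli :from-var 'main  :to 'my.db   :name 'connect}])

(defn callers-of
  "Set of [ns var] pairs that reference the var `var-name` in `to-ns`."
  [var-usages to-ns var-name]
  (->> var-usages
       (filter #(and (= to-ns (:to %)) (= var-name (:name %))))
       (map (juxt :from :from-var))
       set))

(defn callees-of
  "Set of [ns var] pairs referenced from inside `from-var` in `from-ns`."
  [var-usages from-ns from-var]
  (->> var-usages
       (filter #(and (= from-ns (:from %)) (= from-var (:from-var %))))
       (map (juxt :to :name))
       set))

;; Which references to my function exist in the codebase?
(callers-of sample-usages 'my.db 'connect)
;; => #{[my.app start] [my.cli main]} (print order may vary)

;; What functions does a function call?
(callees-of sample-usages 'my.app 'start)
;; => #{[my.db connect] [my.http serve]} (print order may vary)
```

Since the analysis is plain data, the same filters work on a cached copy of it.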

This works great but has two new problems: it is slower than the runtime approaches described below, and it needs all of the source code.

The speed comparison is misleading, though: clj-kondo does a lot more than just finding dependencies, and the analysis is a one-time cost if you have some smart caching.

Another benefit is that you don't need a running env to get help from clj-kondo or other static analyzers.

The second problem cannot really be fixed. And if you have multiple tools that need the analysis, like LSP plus something else, there is no easy way to share it, since they usually don't run inside your REPL.

Runtime introspection

Lisps usually come at the same tooling problem from the other side, where the idea is:

I already have a repl with my code, so that should be able to give me all the runtime information I want.

This idea can be seen in orchard, which uses the running JVM to answer questions about the code. Its implementation of fn-deps showcases that beautifully:

(defn fn-deps [val]
  (set (some->> val class .getDeclaredFields
                (keep (fn [^java.lang.reflect.Field f]
                        (or (and (identical? clojure.lang.Var (.getType f))
                                 (java.lang.reflect.Modifier/isPublic (.getModifiers f))
                                 (java.lang.reflect.Modifier/isStatic (.getModifiers f))
                                 (-> f .getName (.startsWith "const__"))
                                 (.get f val))
                            nil))))))

This code was written by Rich Hickey for REBL and generously shared with the community.

The code relies on the fact that the Clojure compiler generates a class for every function and that class has fields with vars pointing to the functions that it's going to call. This is done so that when you redefine a function its call sites don't need to be recompiled, since the var will now point to the new function.
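The indirection through the var can be observed at the REPL. This is a small sketch, assuming direct linking is off (the default for your own code). The names helper, caller and var-fields are made up for the example:

```clojure
(defn helper [] :old)
(defn caller [] (helper))

(caller)
;; => :old

;; Redefine helper only; caller is not recompiled.
(defn helper [] :new)

(caller)
;; => :new

(defn var-fields
  "Vars held in static fields of the class compiled for function f."
  [f]
  (->> (.getDeclaredFields (class f))
       (filter (fn [^java.lang.reflect.Field fld]
                 (and (= clojure.lang.Var (.getType fld))
                      (java.lang.reflect.Modifier/isStatic (.getModifiers fld)))))
       ;; static fields ignore the receiver, so nil is fine here
       (map (fn [^java.lang.reflect.Field fld] (.get fld nil)))))

(var-fields caller)
;; => (#'user/helper) when evaluated in the user namespace
```

The class of caller never changed. It still holds the same var, and the var now points at the new helper.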

If you are interested in seeing the exact bytecode or a Java class version of this, there is a great library and blog post by Alexander Yakushev.

This is very fast, since it only has to do field access, but has a major problem. It does not handle lambdas or inline function calls.

(defn dummy []
  (map #(inc %) (range 10)))

The dummy function will have a reference to map and to range, but there is no reference to the anonymous function inside the dummy class. But of course, the dummy class does have to know about the lambda somewhere, so if you check the generated bytecode you will see a reference to it inside the invokeStatic method call.
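The gap is easy to reproduce. Here is a minimal sketch that uses the same field-scanning idea as fn-deps, under the made-up name direct-var-refs. The lambda calls println instead of inc, because a direct call to inc is inlined by the compiler and would not show up as a var in either class:

```clojure
(defn direct-var-refs
  "Vars stored in static fields of f's own class (same idea as fn-deps)."
  [f]
  (set (keep (fn [^java.lang.reflect.Field fld]
               (when (and (identical? clojure.lang.Var (.getType fld))
                          (java.lang.reflect.Modifier/isStatic (.getModifiers fld)))
                 ;; static field, so the receiver argument is ignored
                 (.get fld nil)))
             (.getDeclaredFields (class f)))))

(defn dummy []
  (map #(println %) (range 10)))

(direct-var-refs dummy)
;; contains #'clojure.core/map and #'clojure.core/range,
;; but not #'clojure.core/println, which lives in the lambda's class

(contains? (direct-var-refs dummy) #'clojure.core/println)
;; => false
```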

A reasonable question would then be: how do I get the bytecode? Sadly there is no direct way to get at the running bytecode, since the default JVM class loader throws it away after loading the class.

So a more roundabout way to get at all the references inside a function is to use the Clojure compiler and a custom class loader. This way it's possible to remember the bytecode.

This was my first attempt to "fix" orchard's fn-deps (modified from a gist):

;; class name -> bytecode, filled by the class loader proxy below
(def classbytes (atom {}))

(defn recompile [ns-sym form]
  ;; install a class loader that records the bytes of every class it defines
  (push-thread-bindings
   {clojure.lang.Compiler/LOADER
    (proxy [clojure.lang.DynamicClassLoader] [@clojure.lang.Compiler/LOADER]
      (defineClass
        ([name bytes src]
         (swap! classbytes assoc name bytes)
         (proxy-super defineClass name bytes src))))})
  (try
    (let [line   @clojure.lang.Compiler/LINE
          column @clojure.lang.Compiler/COLUMN
          ;; prefer the position stored in the form's metadata, if any
          line   (or (:line (meta form)) line)
          column (or (:column (meta form)) column)]
      (push-thread-bindings {clojure.lang.Compiler/LINE line
                             clojure.lang.Compiler/COLUMN column})
      (try
        (let [form (macroexpand form)]
          ;; only handle (def name (fn ...)) forms, i.e. an expanded defn
          (when (and (coll? form) (= 'clojure.core/fn (first (nth form 2 nil))))
            (binding [*ns* (create-ns ns-sym)]
              (clojure.lang.Compiler/analyze
               clojure.lang.Compiler$C/EVAL
               (nth form 2)))))
        (finally
          (pop-thread-bindings))))
    (finally
      (pop-thread-bindings))))


(recompile 'playground.decompile '(defn dummy [a] (map #(println a) (range 10))))

This solution works very well and produces two classes that can be searched for references.

While playing with this solution I realized that the actual bytecode is not needed, since the Clojure compiler also just returns both class names, which can be used inside fn-deps to get all references!

This solution is faster than clj-kondo, even when recompiling my whole code, but still has the problem that you have to have all the source code.

Clojure dynamic classloader

After spending way too much time on JVM class loaders, it hit me that the Clojure class loader has an internal cache of all the classes it has loaded. Since that cache is not public, some reflection is needed to get at it, but the solution is pretty straightforward:

(require '[clojure.repl :as repl]
         'clojure.string)

(defn- as-val
  "Convert thing to a function value."
  [thing]
  (cond
    (var? thing) (var-get thing)
    (symbol? thing) (var-get (find-var thing))
    (fn? thing) thing))


(defn- fn-name [^java.lang.Class f]
  (-> f .getName repl/demunge symbol))


(defn fn-deps-class
  [v]
  (let [^java.lang.Class v (if (class? v)
                             v
                             (eval v))]
    (set (some->> v .getDeclaredFields
                  (keep (fn [^java.lang.reflect.Field f]
                          (or (and (identical? clojure.lang.Var (.getType f))
                                   (java.lang.reflect.Modifier/isPublic (.getModifiers f))
                                   (java.lang.reflect.Modifier/isStatic (.getModifiers f))
                                   (-> f .getName (.startsWith "const__"))
                                   ;; the field is static, so the receiver passed to .get is ignored
                                   (.get f (fn-name v)))
                              nil)))))))


(defn fn-deps [s]
  (when-let [v (as-val s)]
    (let [f-class-name (-> v .getClass .getName)
          ;; takes the second declared field, assumed to be the class cache
          field (->> clojure.lang.DynamicClassLoader .getDeclaredFields second)
          classes (into {} (.get field clojure.lang.DynamicClassLoader))
          ;; keep every cached class whose name contains the function's class name
          filtered-classes (->> classes
                                (filter (fn [[k _v]] (clojure.string/includes? k f-class-name)))
                                (map (fn [[_k v]] (.get v))))
          deps (set (mapcat fn-deps-class filtered-classes))]
      deps)))

This solution relies on another implementation detail of the Clojure compiler: the class of the dummy function will be named my_ns$dummy, and the classes of the anonymous functions inside it will be prefixed with that same name.
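The naming scheme is easy to check at the REPL. This is a small sketch in which outer, outer-class-name and inner-class-name are made-up names for the example:

```clojure
(require '[clojure.string :as str])

(defn outer []
  ;; return the anonymous function so its class can be inspected
  (fn [x] x))

(def outer-class-name (.getName (class outer)))
(def inner-class-name (.getName (class (outer))))

outer-class-name
;; e.g. "user$outer" when evaluated in the user namespace

inner-class-name
;; e.g. "user$outer$fn__1234" (the numeric suffix varies)

(str/starts-with? inner-class-name (str outer-class-name "$"))
;; => true
```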

It all works great and is very fast. There is a new problem though: when you recompile a function, its lambdas stay in the cache. This means you might see references to functions that are not called anymore.

The solution to that is still in the works, but will probably just be a manual cache clear or a second cache.


Published: 2022-01-05
