Suppose we are given data in a semi-structured format as a tree. As an example, the tree can be formed as a valid XML document or as a valid JSON document. You could imagine it being a lisp-like S-expression or a (G)Algebraic Data Type in Haskell or Ocaml.

We are given a large number of "documents" in the tree structure. Our goal is to cluster documents that are similar. By clustering, we mean a way to divide the documents into *j* groups, such that elements in each look like each other.

I am sure there are papers out there that describe approaches but since I am not very known in the area of AI/Clustering/MachineLearning, I want to ask somebody who is what to look for and where to dig.

My current approach is something like this:

I want to convert each document into an N-dimensional vector set up for a K-means clustering.

To do this, I recursively walk the document tree and for each level, I calculate a vector. If I am at a tree vertex, I recur on all sub vertices and then sum their vectors. Also, whenever I recur, a power factor is applied so it does matter less and less the further down the tree I go. The documents final vector is the root of the tree.

Depending on the data at a tree leaf, I apply a function that takes the data into a vector.

But surely, there are better approaches. One weakness of my approach is that it will only similarity-cluster trees which has a top structure much like each other. If the similarity is present but occurs farther down the tree, then my approach probably won't work very well.

I imagine there are solutions in full-text-search as well, but I do want to take advantage of the semi-structure present in the data.

## Distance function

As suggested, one needs to define a distance function between documents. Without this function, we can't apply a clustering algorithm.

In fact, it may be that the question is about that very distance function and examples thereof. I want documents where elements near the root are the same to cluster close to each other. The farther down the tree we go, the less it matters.

## The take-one-step-back viewpoint:

I want to cluster stack traces from programs. These are well-formed tree structures, where the function close to the root is the inner function that fails. I need a decent distance function between stack traces that probably occur because the same event happened in code.