Sparse Grid Density Estimation faces two challenges when applied to extremely
high-dimensional, clustered data. The first is the curse of dimensionality, which makes
the required computational effort grow exponentially with the number of dimensions.
Although Sparse Grids moderate the effect of dimensionality, a computation can still
reach a point at which it becomes infeasible. The second challenge is the location of
some grid points: at the beginning of the computation the grid spreads across the whole
domain, including regions that contain no data at all. With clustered data, all samples
lie around cluster centroids rather than being scattered over the domain, which
accentuates the placement of grid points where they are not really needed. These
superfluous grid points add computational cost without improving accuracy. Although
Adaptive Refinement has shown very good results, the inefficient grid points added at
the beginning are carried through the entire computation.
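For orientation, a standard complexity estimate from the sparse grid literature (not a result of this work): a regular full grid with mesh width $2^{-n}$ in $d$ dimensions requires $\mathcal{O}(2^{nd})$ grid points, whereas a sparse grid needs only $\mathcal{O}(2^{n}\, n^{d-1})$, which delays, but does not remove, the exponential dependence on $d$.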
To reduce these undesired effects, a pipeline is implemented. The pipeline is divided
into two parts: the first is a Locality Sensitive Hashing algorithm that uses Random
Projections to divide the data into subsets corresponding to clusters; the second applies
dimensionality reduction to embed the whole data set and each individual cluster into a
significantly lower-dimensional space, and then computes density estimations using Sparse
Grids. For each part, a dedicated set of metrics is introduced to measure the efficacy
of the implementation. Finally, the Conventional Approach, which uses the whole data
set, is compared to two methodologies that treat each individual cluster in parallel
(Cluster Analysis and Cluster Extraction).
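To make the two-part pipeline concrete, the following Python sketch illustrates the idea under stated assumptions: sign-based Random Projections as the LSH family and PCA as the dimensionality reduction (the concrete choices in this work may differ), with the final Sparse Grid density estimation step only indicated by a comment. All names (lsh_buckets, reduce_dim) and parameter values are illustrative.

    import numpy as np

    def lsh_buckets(X, n_projections=8, seed=0):
        """Bucket samples by the sign pattern of random projections
        (one hash bit per random hyperplane)."""
        rng = np.random.default_rng(seed)
        P = rng.standard_normal((X.shape[1], n_projections))
        bits = X @ P > 0                               # (n_samples, n_projections)
        keys = bits @ (1 << np.arange(n_projections))  # one integer key per sample
        buckets = {}
        for i, k in enumerate(keys):
            buckets.setdefault(int(k), []).append(i)
        return buckets

    def reduce_dim(X, d):
        """Project X onto its top-d principal components (PCA via SVD);
        assumed here as a stand-in for the actual embedding method."""
        Xc = X - X.mean(axis=0)
        _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
        return Xc @ Vt[:d].T

    # Usage sketch on placeholder data standing in for a real data set:
    rng = np.random.default_rng(1)
    X = rng.standard_normal((1000, 50))
    for key, idx in lsh_buckets(X, n_projections=4).items():
        if len(idx) > 10:                  # skip near-empty buckets
            Z = reduce_dim(X[idx], d=3)
            # Z would then feed a Sparse Grid density estimator (not shown).

Samples whose projections share the same sign pattern fall into the same bucket, so nearby points tend to collide, which is what lets the buckets approximate clusters before the per-cluster embedding step.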
Results for a real-world data set and a synthetic one show that the Conventional
Approach outperforms Cluster Analysis in most scenarios. However, Cluster Extraction
exhibits very good results compared to the other two methodologies in 60% to 80% of the
proposed numerical experiments. These promising results are consistent even in scenarios
where the numerical experiments have very similar computational cost.