In this thesis, portions of the datafold framework are optimized with regard to runtime.
Bottlenecks that occur before the eigensolver are located, replaced with improvement candidates, and assessed for performance gains. These include the acceleration of a modified min-max search as well as of various distance calculation methods.
To reduce the dimensionality of a dataset, the manifold learning framework datafold needs to set up an eigenvalue problem. The underlying distance matrix contains the distances from each point in the dataset to every other point within a given radius r, measured in one of a variety of metrics.
Finding a suitable value for this radius r, the so-called cut_off value, is of great importance, as it influences the quality of the results. Determining this cut_off radius has proven to take significantly longer as the number of data points grows. The sparse neighborhood matrix containing the distances undergoes further transformations before it is passed to an eigensolver. Since the eigensolver has previously been optimized for distributed environments such as clusters, this thesis assesses ways of efficiently parallelizing the preceding steps, in order to better utilize the resources of multicore consumer systems and, where possible, clusters.
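To illustrate the kind of data structure involved, the following is a minimal sketch of a sparse, radius-limited distance matrix. It uses scikit-learn's radius_neighbors_graph rather than datafold's own routines, and the cut_off value and hypercube-style sample data are illustrative assumptions, not the framework's defaults.

```python
# Hedged sketch: building a sparse radius-limited distance matrix, roughly the
# structure the pre-eigensolver steps operate on. Values are illustrative.
import numpy as np
from sklearn.neighbors import radius_neighbors_graph

rng = np.random.default_rng(42)
points = rng.uniform(size=(10_000, 3))   # uniform samples from a 3D hypercube

cut_off = 0.1                            # assumed radius r (cut_off)
distance_matrix = radius_neighbors_graph(
    points, radius=cut_off, mode="distance", metric="euclidean"
)

# The result is a scipy.sparse CSR matrix: only point pairs closer than
# cut_off are stored, which keeps memory usage manageable for large datasets.
print(distance_matrix.shape, distance_matrix.nnz)
```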
To this end, the base performance is benchmarked for all identified bottlenecks that are found to suffer from poor scaling; solutions offered by various frameworks are then tested, timed, and ultimately compared against each other:
The optimize_parameters(...) function is fitted with a Dask implementation of a more performant selection algorithm that scales significantly better on parallel systems. Options for optimizing the distance calculation are assessed and tested against the current implementations using the hypercube example. Finally, a new approach is tested in the form of an approximate k-nearest-neighbor (kNN) framework: PyNNDescent.
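As a rough illustration of the parallelization idea, the following sketch evaluates several candidate cut_off values concurrently with dask.delayed. The scoring function score_cut_off is a hypothetical stand-in and not the selection criterion of datafold's optimize_parameters(...).

```python
# Hedged sketch: evaluating candidate cut_off values in parallel with Dask.
# The scoring function is a hypothetical stand-in, not datafold's criterion.
import numpy as np
import dask
from sklearn.neighbors import radius_neighbors_graph

def score_cut_off(points, cut_off):
    """Hypothetical score: average neighborhood size for a given radius."""
    graph = radius_neighbors_graph(points, radius=cut_off, mode="connectivity")
    return graph.nnz / points.shape[0]

rng = np.random.default_rng(0)
points = rng.uniform(size=(5_000, 3))
candidates = np.linspace(0.05, 0.5, num=8)

# Build one delayed task per candidate radius and evaluate them in parallel.
tasks = [dask.delayed(score_cut_off)(points, c) for c in candidates]
scores = dask.compute(*tasks)

for c, s in zip(candidates, scores):
    print(f"cut_off={c:.2f} -> avg. neighbors per point: {s:.1f}")
```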
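For the approximate kNN approach, the sketch below shows a basic PyNNDescent index build and query. The parameter values are illustrative assumptions, not the settings used in the thesis benchmarks.

```python
# Hedged sketch: approximate k-nearest-neighbor queries with PyNNDescent as an
# alternative to exact radius-based distance computations.
import numpy as np
from pynndescent import NNDescent

rng = np.random.default_rng(1)
points = rng.uniform(size=(10_000, 3))

# Build the approximate kNN index; n_neighbors trades index quality for speed.
index = NNDescent(points, metric="euclidean", n_neighbors=30, random_state=1)

# Query the 15 approximate nearest neighbors of some new points.
queries = rng.uniform(size=(100, 3))
neighbor_indices, neighbor_distances = index.query(queries, k=15)

print(neighbor_indices.shape, neighbor_distances.shape)  # (100, 15) each
```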