Experiments and observations
Texts; databases
Other data type:
Network traffic traces
Network traffic collection (PCAP) of three widely used, state-of-the-art Distributed Machine Learning (DML) frameworks (TensorFlow, Horovod, KungFu). The collection contains distributed training runs of four models (MobileNetV2, ResNet50, ResNet101, DenseNet201) under varying framework configurations. The varied parameters are the communication topology and backend, the distributed optimizer, the batch size, and the packet loss in the network.
Method of data assessment:
The traffic was collected in a four-worker testbed. The workers were interconnected by a 10G Ethernet network through a single packet switch. Each worker was equipped with an Nvidia Tesla T4 GPU. Traffic traces were captured directly on the worker nodes. The models were trained for 20 epochs on the CIFAR-10 image dataset.
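As a starting point for working with the traces, the following is a minimal sketch of how one per-worker capture could be summarized using only the Python standard library. It assumes the files are in the classic libpcap format; this record does not say whether classic pcap or pcapng was used, and pcapng is not handled here. The function name `summarize_pcap` and the file name in the closing note are hypothetical and not part of the dataset.

```python
import struct

# Magic numbers of the classic libpcap file format (assumed format, see above).
PCAP_MAGIC_USEC = 0xA1B2C3D4  # timestamps with microsecond resolution
PCAP_MAGIC_NSEC = 0xA1B23C4D  # timestamps with nanosecond resolution


def summarize_pcap(stream):
    """Return (packet_count, wire_bytes, duration_seconds) for a classic pcap stream.

    `stream` is any binary file-like object positioned at the start of the file.
    """
    header = stream.read(24)  # the global header is 24 bytes long
    if len(header) < 24:
        raise ValueError("truncated pcap global header")

    # The magic number reveals both the byte order and the timestamp resolution.
    for endian in ("<", ">"):
        magic = struct.unpack(endian + "I", header[:4])[0]
        if magic in (PCAP_MAGIC_USEC, PCAP_MAGIC_NSEC):
            break
    else:
        raise ValueError("not a classic pcap file (pcapng is not handled here)")
    frac_div = 1_000_000 if magic == PCAP_MAGIC_USEC else 1_000_000_000

    count = 0
    wire_bytes = 0
    first_ts = last_ts = None
    while True:
        record = stream.read(16)  # each per-packet record header is 16 bytes
        if len(record) < 16:
            break  # end of file
        ts_sec, ts_frac, incl_len, orig_len = struct.unpack(endian + "IIII", record)
        payload = stream.read(incl_len)
        if len(payload) < incl_len:
            break  # truncated final record, ignore it
        ts = ts_sec + ts_frac / frac_div
        if first_ts is None:
            first_ts = ts
        last_ts = ts
        count += 1
        wire_bytes += orig_len  # original on-wire length, not the captured length

    duration = (last_ts - first_ts) if count else 0.0
    return count, wire_bytes, duration
```

A call such as `summarize_pcap(open("worker0.pcap", "rb"))` (hypothetical file name) would give the packet count, total on-wire bytes, and capture duration for one worker, from which an average throughput can be derived. Summing `orig_len` rather than `incl_len` keeps the byte count correct even if the capture used a snapshot length that truncated packets.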