When running large-scale tests across multiple VMs, it helps to understand how host (worker node) resources are divided between the various test plan instances.
CPU - all test plan instances have a CPU request set; we currently don't set CPU limits. Kubernetes uses kernel throttling to implement CPU limits: if an application goes above its limit, it gets throttled (i.e. it receives fewer CPU cycles). CPU requests are used during scheduling of pods.
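As a sketch, a pod spec that sets a CPU request but no CPU limit looks like the following (names and values are illustrative, not the exact ones Testground uses):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: testplan-instance        # hypothetical pod name
spec:
  containers:
    - name: testplan
      image: example/testplan:latest   # placeholder image
      resources:
        requests:
          cpu: "500m"            # used by the scheduler for placement
        # no cpu limit is set, so the container is never kernel-throttled
```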
Memory - all test plan instances have a limit set for their memory usage. If a test plan instance tries to use more than that limit, Kubernetes will kill it (an OOM kill). Memory-limit kills are easy to detect - you only need to check whether your pod's last restart status is OOMKilled.
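A minimal sketch of that check in Go, assuming you already fetched the pod's JSON status (e.g. with `kubectl get pod <name> -o json`); the helper name is hypothetical:

```go
package main

import (
	"encoding/json"
	"fmt"
)

// wasOOMKilled reports whether any container in a pod's JSON status was
// last terminated by the OOM killer. The JSON shape follows the Kubernetes
// PodStatus API: status.containerStatuses[].lastState.terminated.reason.
func wasOOMKilled(podJSON []byte) (bool, error) {
	var pod struct {
		Status struct {
			ContainerStatuses []struct {
				LastState struct {
					Terminated *struct {
						Reason string `json:"reason"`
					} `json:"terminated"`
				} `json:"lastState"`
			} `json:"containerStatuses"`
		} `json:"status"`
	}
	if err := json.Unmarshal(podJSON, &pod); err != nil {
		return false, err
	}
	for _, cs := range pod.Status.ContainerStatuses {
		if t := cs.LastState.Terminated; t != nil && t.Reason == "OOMKilled" {
			return true, nil
		}
	}
	return false, nil
}

func main() {
	// Minimal payload in the shape `kubectl get pod -o json` returns.
	sample := []byte(`{"status":{"containerStatuses":[{"lastState":{"terminated":{"reason":"OOMKilled"}}}]}}`)
	killed, err := wasOOMKilled(sample)
	fmt.Println(killed, err) // → true <nil>
}
```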
Disk - all test plan instances are sharing an AWS EFS file system.
Network - all test plan instances share the underlying network as well as the overlay networks (Weave and Flannel), so it is possible for a test plan instance to have an adverse effect on all other instances, and even on the host. Because of this, it is advisable to monitor network performance (RPC latencies and error rates) from the perspective of individual test plans.
Alongside test plan instances on every host, we also have the following components:
prometheus-node-exporter - exporter for machine metrics
testground-sidecar - the daemon that monitors containers on a given host and executes network updates, traffic shaping, etc. for test plan instances
genie-plugin - Kubernetes CNI plugin for choosing the pod network at deployment time
flannel - the control network daemon
kube-proxy - the Kubernetes network proxy
weave - the data network daemon
All of these consume resources, so test plan instances have ~6 vCPUs and ~12 GB RAM available (assuming a c5.2xlarge worker node instance type).
With a c5.4xlarge master node, it usually takes ~5 min for all 10k test plan instances (pods) to be scheduled on worker nodes and to reach the Running state. Depending on various factors (e.g. number of hosts, whether the test plan image is already downloaded, etc.), we see around 30-50 pods scheduled per second for a given test plan run.
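Those two figures are consistent with each other: at 30-50 pods/s, 10k pods take roughly 3.5-5.5 minutes. A quick back-of-the-envelope check:

```go
package main

import "fmt"

func main() {
	const pods = 10_000
	for _, rate := range []float64{30, 50} { // pods scheduled per second
		secs := pods / rate
		fmt.Printf("%.0f pods/s -> %.0f s (~%.1f min)\n", rate, secs, secs/60)
	}
	// → 30 pods/s -> 333 s (~5.6 min)
	// → 50 pods/s -> 200 s (~3.3 min)
}
```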
The sync service is an integral part of Testground, used to communicate out-of-band information between test plan instances (such as endpoint addresses, peer IDs, etc.)
At peak operation we've benchmarked it at ~300k operations per second, which allows 10k test plan instances to publish their own node information and retrieve the peer IDs of all other instances in ~6 min (a total of ~10^8 operations).
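The arithmetic behind those numbers: each of the 10k instances publishes one record and retrieves every instance's record, so the total is 10k + 10k × 10k ≈ 10^8 operations, which at ~300k ops/s takes a little over 5.5 minutes:

```go
package main

import "fmt"

func main() {
	const instances = 10_000
	const opsPerSec = 300_000.0
	// Each instance publishes once and retrieves every instance's record.
	totalOps := instances + instances*instances
	secs := float64(totalOps) / opsPerSec
	fmt.Printf("%d ops -> %.0f s (~%.1f min)\n", totalOps, secs, secs/60)
	// → 100010000 ops -> 333 s (~5.6 min)
}
```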
We use Weave for our data plane and Flannel for the control network. All test plan instances connect to each other over the Weave network.
Limits for connections per host - based on the kernel parameters we set on individual worker nodes, we can sustain up to 100k open connections per host (see github.com/testground/infra/blob/master/k8s/cluster.yaml)
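The authoritative values live in the linked cluster.yaml; the kinds of kernel parameters that typically bound open-connection counts look like the following (illustrative values only, not necessarily the ones Testground sets):

```
fs.file-max = 3178504                        # max open file descriptors system-wide
net.netfilter.nf_conntrack_max = 1048576     # max tracked connections
net.core.somaxconn = 131072                  # listen backlog ceiling
net.ipv4.ip_local_port_range = 10000 65535   # ephemeral ports for outbound dials
```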
Depending on the amount of data sent across the network, TCP dial latencies start at around 50 ms and can climb to tens of seconds when thousands of dials are in flight, so we suggest test plan developers monitor this closely.