Cluster Manager¶
Cluster Manager is a Java application that manages the cluster. Its algorithm decides which worker should be primary and which secondary, and what to do when a worker is down.
Cluster Manager Package¶
The Cluster Manager tarball should consist of four elements:
- load-balancer-cluster-manager-microservice-${VERSION}-fat.jar application,
- start.bash script,
- nodes-settings directory:
  - nodes.json file,
- scripts directory:
  - cluster-standard-message-script.bash script,
  - cluster-error-message-script.bash script.
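The layout listed above can be checked before packaging. The following is a minimal sketch, not part of the Cluster Manager itself: the function name check_package_layout is hypothetical, and the jar is matched with a glob because its name depends on the version.

```bash
#!/usr/bin/env bash
# Hypothetical helper: verify that a directory holds the four package elements.
# $1: directory to check. Returns 0 when complete, 1 when something is missing.
check_package_layout() {
  local dir="$1" missing=0 entry
  for entry in \
      start.bash \
      nodes-settings/nodes.json \
      scripts/cluster-standard-message-script.bash \
      scripts/cluster-error-message-script.bash; do
    if [[ ! -e "${dir}/${entry}" ]]; then
      echo "missing: ${entry}" >&2
      missing=1
    fi
  done
  # The jar name contains the version, so match it with a glob.
  if ! compgen -G "${dir}/load-balancer-cluster-manager-microservice-*-fat.jar" >/dev/null; then
    echo "missing: load-balancer-cluster-manager-microservice-*-fat.jar" >&2
    missing=1
  fi
  return "${missing}"
}
```

Calling check_package_layout . in the package directory prints every missing element to standard error and returns a non-zero status if any is absent.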
Algorithm¶
load-balancer-cluster-manager-microservice-${VERSION}-fat.jar is a Java executable, which contains the cluster manager algorithm.
Nodes Configuration¶
The nodes-settings/nodes.json file contains an array of initial node configurations.
The first node configuration will be taken as a potential primary by the algorithm.
[
{
"address": "10.10.10.3",
"cdtpPort": 8030,
"edgeBalancerPort": 8020,
"edgeBalancerProtocol": "http",
"status": "unavailable"
},
{
"address": "10.10.10.2",
"cdtpPort": 8030,
"edgeBalancerPort": 8020,
"edgeBalancerProtocol": "http",
"status": "unavailable"
}
]
- address - IP address of a worker server,
- cdtpPort - Node Manager's port for sending requests to the worker server via the CDTP protocol (by default 8030),
- edgeBalancerPort - port of the Node Manager's external load balancer at the worker server (by default 8020),
- edgeBalancerProtocol - protocol used to communicate with the Node Manager's external load balancer at the worker server (by default http),
- status - status of the current worker (when initializing the cluster, it should be unavailable).
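For clusters with more than a couple of workers, the initial file can be generated instead of written by hand. The sketch below assumes the default ports, protocol and status described above; the function name render_nodes_json is hypothetical. The first address passed becomes the first entry, which the algorithm treats as the potential primary.

```bash
#!/usr/bin/env bash
# Hypothetical helper: print an initial nodes.json for the given worker addresses.
# Every node gets the default ports and protocol and the status "unavailable".
render_nodes_json() {
  local first=1 address
  printf '[\n'
  for address in "$@"; do
    # Separate entries with a comma, but do not put one before the first entry.
    if [[ ${first} -eq 0 ]]; then
      printf ',\n'
    fi
    first=0
    printf '  {\n'
    printf '    "address": "%s",\n' "${address}"
    printf '    "cdtpPort": 8030,\n'
    printf '    "edgeBalancerPort": 8020,\n'
    printf '    "edgeBalancerProtocol": "http",\n'
    printf '    "status": "unavailable"\n'
    printf '  }'
  done
  printf '\n]\n'
}
```

For example, render_nodes_json 10.10.10.3 10.10.10.2 > nodes-settings/nodes.json produces a file equivalent to the example above.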
Messages Scripts¶
The scripts directory contains two scripts:
- cluster-standard-message-script.bash - handles standard messages from the Cluster Manager's algorithm,
- cluster-error-message-script.bash - handles error messages from the Cluster Manager's algorithm.
Handling messages could be implemented in many ways, for example:
- send the messages to the monitoring application,
- send the messages to external software,
- send the messages as email via SMTP server,
- send the messages as an SMS via SMPP server.
In your scripts, remember to use the following variables:
- ${message} - will be replaced with the standard message content in your standard message script,
- ${error_message} - will be replaced with the error message content in your error message script.
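To see what a message script looks like once a placeholder has been filled in, the replacement can be imitated locally. This is only a sketch for testing: it assumes the placeholder is replaced as plain text, which is an assumption about the Cluster Manager's behaviour, and the function name preview_message_script is hypothetical.

```bash
#!/usr/bin/env bash
# Hypothetical helper for local testing: print a message script with one
# placeholder filled in.
# $1: script file, $2: placeholder name (message or error_message), $3: text.
preview_message_script() {
  local script_file="$1" name="$2" text="$3"
  local placeholder="\${${name}}"   # the literal text, e.g. ${message}
  local content rendered
  content="$(cat "${script_file}")"
  # Quoting the pattern and the replacement makes Bash treat both literally.
  rendered=${content//"${placeholder}"/"${text}"}
  printf '%s\n' "${rendered}"
}
```

For example, preview_message_script scripts/cluster-standard-message-script.bash message 'worker 10.10.10.3 is down' prints the script with the sample text in place of ${message}, without executing it.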
Example Message Scripts¶
The following examples are the Primary Cluster Manager's scripts, which send requests to Onteon Tech's monitoring application.
Standard Message Script¶
#!/bin/bash
set -e -o errexit
curl --location 'http://127.0.0.5:8020/_by_name/cluster-monitoring-core-microservice/v1/addStandardMessage' \
--header 'Content-Type: application/json' \
--data '{
"source": "10.10.10.1-Primary-Cluster-Manager",
"message": "${message}"
}'
Error Message Script¶
#!/bin/bash
set -e -o errexit
curl --location 'http://127.0.0.5:8020/_by_name/cluster-monitoring-core-microservice/v1/addErrorMessage' \
--header 'Content-Type: application/json' \
--data '{
"source": "10.10.10.1-Primary-Cluster-Manager",
"message": "${error_message}"
}'
Start Script¶
The start.bash script is intended for an easy start on Hetman. It should set all of the Java system properties required to run the algorithm properly.
Algorithm System Properties¶
- -Dlog-directory - directory where the algorithm logs will be stored,
- -Dfalcondb-microservice-uri - URI address of FalconDB (value http://localhost:8021/_by_name/falcon-db-core-microservice, should not be changed),
- -Dnodes-json-file-path - absolute path of the nodes configuration file (recommended value ${SCRIPT_DIR}/nodes-settings/nodes.json),
- -Dapplications-directory-path - absolute path of the directory that holds the microservices to be uploaded and started on worker nodes (value /usr/local/bin/onteon-node-manager/store/microservices, should not be changed),
- -Dcluster-standard-message-script-path-to-file - absolute path of the Cluster Manager's standard message script (recommended value ${SCRIPT_DIR}/scripts/cluster-standard-message-script.sh),
- -Dcluster-error-message-script-path-to-file - absolute path of the Cluster Manager's error message script (recommended value ${SCRIPT_DIR}/scripts/cluster-error-message-script.sh),
- -Dvirtual-ip-up-command - command to enable the virtual IP (if you do not use a virtual IP, simply set the value to 'echo up virtual'),
- -Dvirtual-ip-down-command - command to disable the virtual IP (if you do not use a virtual IP, simply set the value to 'echo down virtual'),
- -Dget-keep-alive-url - URL to the technical microservice's get keep alive endpoint (value _by_name/technical-generic-microservice/v1/getKeepAlive, should not be changed),
- -Dset-keep-alive-url - URL to the technical microservice's set keep alive endpoint (value _by_name/technical-generic-microservice/v1/setKeepAlive, should not be changed),
- -Dget-apply-to-be-primary-master-url - URL to the technical microservice's get apply to be primary master endpoint (value _by_name/technical-generic-microservice/v1/getApplyToBePrimaryMasterUrl, should not be changed),
- -Dapply-to-be-primary-master-url - URL to the technical microservice's apply to be primary master endpoint (value _by_name/technical-generic-microservice/v1/applyToBePrimaryMaster, should not be changed),
- -Dreset-apply-to-be-primary-master-url - URL to the technical microservice's reset apply to be primary master endpoint (value _by_name/technical-generic-microservice/v1/resetApplyToBePrimaryMaster, should not be changed),
- -Ddo-command-url - URL to the technical microservice's do command endpoint (value _by_name/technical-generic-microservice/v1/doCommand, should not be changed),
- -Dexpiry-keep-alive-time-in-milliseconds - time after which the primary worker will turn off if there was no communication with this node (recommended value: 1200000),
- -Dawait-primary-node-turn-off-in-milliseconds - defines how long the algorithm will wait after sending the turn off signal to the primary worker, to make sure it is indeed turned off (recommended value: 75000),
- -Dsecondary-cluster-manager-frequency-control-in-milliseconds - defines how often the secondary should check if the primary is alive (default value 10000),
- -Dawait-drbd-start-milliseconds - defines how long the algorithm will wait after starting the DRBD microservice, to make sure it is indeed started (recommended value: 30000),
- -Dawait-drbd-role-switch-milliseconds - defines how long the algorithm will wait after switching the role of the DRBD microservice, to make sure it did change its role (recommended value: 30000),
- -Dis-active-active - defines if the cluster is in active-active mode (if false, then active-passive),
- -Dsecondary-cluster-manager-health-check-url - Secondary Cluster Manager's health check URL (add the correct IP for value http://${secondary-cluster-manager-ip}:8020/805895fcf846ac34e966e97c),
- -Dprimary-cluster-manager-health-check-url - Primary Cluster Manager's health check URL (add the correct IP for value http://${primary-cluster-manager-ip}:8020/805895fcf846ac34e966e97c),
- -Dcluster-error-message-command-frequency-in-minutes - defines how often to send error messages (recommended value '-1', which means an error message is sent when it occurs),
- -Dcluster-standard-message-command-frequency-in-minutes - defines how often to send standard messages (recommended value '-1', which means a standard message is sent when it occurs),
- -Dis-secondary-part-of-cluster - boolean value, which defines if there is a Secondary Cluster Manager in the cluster,
- -Dcluster-message-logger-handler-thread-pool-size - defines how many threads are in the message handler's thread pool (recommended value: 8),
- -Dwait-time-in-milliseconds-for-complete-handle-message - defines how long the algorithm's message thread will wait to make sure that the message script was completed (recommended value: 30000),
- -Dcluster-state-table-name - name of the table in FalconDB that stores the cluster's state (recommended value ha-load-balancer-cluster-state),
- -Devent-log-table-name - name of the table in FalconDB that stores the event log (recommended value ha-load-balancer-cluster-event-log),
- -Dawait-technical-microservice-start-milliseconds - defines how long the algorithm will wait after starting the technical generic microservice, to make sure it is indeed started (recommended value: 70000),
- -Dself-healing-time-running - defines how long the algorithm will wait after an event log entry occurred, so the self healing can run (recommended value: 60000),
- -Dprod-mode - boolean value, which defines if the cluster is in production mode (recommended value true),
- -DisPrimary - boolean value, which defines if the Cluster Manager is Primary.
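Because the start script has to set every property from this section, a forgotten one is easy to miss. The following sketch reports property names that do not appear as -D<name>= in a given start script. The function name missing_properties is hypothetical, and the check is purely textual: it does not validate the values.

```bash
#!/usr/bin/env bash
# Hypothetical lint helper: print every property name that is not set
# as -D<name>= in the given start script.
# $1: start script, remaining arguments: property names without the -D prefix.
missing_properties() {
  local script="$1" name
  shift
  for name in "$@"; do
    # -F matches the text literally; -e keeps the leading dash from being
    # parsed as a grep option.
    if ! grep -qF -e "-D${name}=" -- "${script}"; then
      printf '%s\n' "${name}"
    fi
  done
}
```

For example, missing_properties start.bash log-directory nodes-json-file-path isPrimary prints nothing when all three properties are present.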
Example Start Scripts¶
Primary Cluster Manager¶
#!/usr/bin/env bash
# Copyright (c) 2024, Onteon Tech and/or its affiliates.
# All rights reserved.
# Use is subject to license terms.
set -e
set -u
trap "exit 128" INT
SOURCE="${BASH_SOURCE[0]}"
while [[ -h "${SOURCE}" ]] ; do
DIR="$(cd -P "$(dirname "${SOURCE}")" >/dev/null 2>&1 && pwd)"
SOURCE="$(readlink "${SOURCE}")"
[[ "${SOURCE}" != /* ]] && SOURCE="${DIR}/${SOURCE}"
done
SCRIPT_DIR="$(cd -P "$(dirname "${SOURCE}")" >/dev/null 2>&1 && pwd)"
/usr/local/bin/onteon-node-manager/jdk/bin/java \
  -Dlog-directory="${SCRIPT_DIR}/logs" \
  -Dfalcondb-microservice-uri=http://localhost:8021/_by_name/falcon-db-core-microservice \
  -Dnodes-json-file-path="${SCRIPT_DIR}/nodes-settings/nodes.json" \
  -Dapplications-directory-path=/usr/local/bin/onteon-node-manager/store/microservices \
  -Dcluster-standard-message-script-path-to-file="${SCRIPT_DIR}/scripts/cluster-standard-message-script.sh" \
  -Dcluster-error-message-script-path-to-file="${SCRIPT_DIR}/scripts/cluster-error-message-script.sh" \
  -Dvirtual-ip-up-command='echo up virtual' \
  -Dvirtual-ip-down-command='echo down virtual' \
  -Dget-keep-alive-url=_by_name/technical-generic-microservice/v1/getKeepAlive \
  -Dset-keep-alive-url=_by_name/technical-generic-microservice/v1/setKeepAlive \
  -Dget-apply-to-be-primary-master-url=_by_name/technical-generic-microservice/v1/getApplyToBePrimaryMasterUrl \
  -Dapply-to-be-primary-master-url=_by_name/technical-generic-microservice/v1/applyToBePrimaryMaster \
  -Dreset-apply-to-be-primary-master-url=_by_name/technical-generic-microservice/v1/resetApplyToBePrimaryMaster \
  -Ddo-command-url=_by_name/technical-generic-microservice/v1/doCommand \
  -Dexpiry-keep-alive-time-in-milliseconds=1200000 \
  -Dawait-primary-node-turn-off-in-milliseconds=75000 \
  -Dsecondary-cluster-manager-frequency-control-in-milliseconds=10000 \
  -Dawait-drbd-start-milliseconds=30000 \
  -Dawait-drbd-role-switch-milliseconds=30000 \
  -Dis-active-active=false \
  -Dsecondary-cluster-manager-health-check-url=http://10.10.10.2:8020/805895fcf846ac34e966e97c \
  -Dprimary-cluster-manager-health-check-url=http://10.10.10.1:8020/805895fcf846ac34e966e97c \
  -Dcluster-error-message-command-frequency-in-minutes='-1' \
  -Dcluster-standard-message-command-frequency-in-minutes='-1' \
  -Dis-secondary-part-of-cluster=true \
  -Dcluster-message-logger-handler-thread-pool-size=8 \
  -Dwait-time-in-milliseconds-for-complete-handle-message=30000 \
  -Dcluster-state-table-name=ha-load-balancer-cluster-state \
  -Devent-log-table-name=ha-load-balancer-cluster-event-log \
  -Dawait-technical-microservice-start-milliseconds=70000 \
  -Dself-healing-time-running=60000 \
  -Dprod-mode=true \
  -DisPrimary=true \
  -jar "${SCRIPT_DIR}/load-balancer-cluster-manager-microservice-1.0.0-fat.jar"
Secondary Cluster Manager¶
#!/usr/bin/env bash
# Copyright (c) 2024, Onteon Tech and/or its affiliates.
# All rights reserved.
# Use is subject to license terms.
set -e
set -u
trap "exit 128" INT
SOURCE="${BASH_SOURCE[0]}"
while [[ -h "${SOURCE}" ]] ; do
DIR="$(cd -P "$(dirname "${SOURCE}")" >/dev/null 2>&1 && pwd)"
SOURCE="$(readlink "${SOURCE}")"
[[ "${SOURCE}" != /* ]] && SOURCE="${DIR}/${SOURCE}"
done
SCRIPT_DIR="$(cd -P "$(dirname "${SOURCE}")" >/dev/null 2>&1 && pwd)"
/usr/local/bin/onteon-node-manager/jdk/bin/java \
  -Dlog-directory="${SCRIPT_DIR}/logs" \
  -Dfalcondb-microservice-uri=http://localhost:8021/_by_name/falcon-db-core-microservice \
  -Dnodes-json-file-path="${SCRIPT_DIR}/nodes-settings/nodes.json" \
  -Dapplications-directory-path=/usr/local/bin/onteon-node-manager/store/microservices \
  -Dcluster-standard-message-script-path-to-file="${SCRIPT_DIR}/scripts/cluster-standard-message-script.sh" \
  -Dcluster-error-message-script-path-to-file="${SCRIPT_DIR}/scripts/cluster-error-message-script.sh" \
  -Dvirtual-ip-up-command='echo up virtual' \
  -Dvirtual-ip-down-command='echo down virtual' \
  -Dget-keep-alive-url=_by_name/technical-generic-microservice/v1/getKeepAlive \
  -Dset-keep-alive-url=_by_name/technical-generic-microservice/v1/setKeepAlive \
  -Dapply-to-be-primary-master-url=_by_name/technical-generic-microservice/v1/applyToBePrimaryMaster \
  -Dget-apply-to-be-primary-master-url=_by_name/technical-generic-microservice/v1/getApplyToBePrimaryMasterUrl \
  -Dreset-apply-to-be-primary-master-url=_by_name/technical-generic-microservice/v1/resetApplyToBePrimaryMaster \
  -Ddo-command-url=_by_name/technical-generic-microservice/v1/doCommand \
  -Dexpiry-keep-alive-time-in-milliseconds=300000 \
  -Dawait-primary-node-turn-off-in-milliseconds=75000 \
  -Dsecondary-cluster-manager-frequency-control-in-milliseconds=10000 \
  -Dawait-drbd-start-milliseconds=30000 \
  -Dawait-drbd-role-switch-milliseconds=30000 \
  -Dis-active-active=false \
  -Dsecondary-cluster-manager-health-check-url=http://10.10.10.2:8020/805895fcf846ac34e966e97c \
  -Dprimary-cluster-manager-health-check-url=http://10.10.10.1:8020/805895fcf846ac34e966e97c \
  -Dcluster-error-message-command-frequency-in-minutes='-1' \
  -Dcluster-standard-message-command-frequency-in-minutes='-1' \
  -Dis-secondary-part-of-cluster=true \
  -Dcluster-message-logger-handler-thread-pool-size=8 \
  -Dwait-time-in-milliseconds-for-complete-handle-message=30000 \
  -Dcluster-state-table-name=ha-load-balancer-cluster-state \
  -Devent-log-table-name=ha-load-balancer-cluster-event-log \
  -Dawait-technical-microservice-start-milliseconds=70000 \
  -Dself-healing-time-running=60000 \
  -Dprod-mode=true \
  -DisPrimary=false \
  -jar "${SCRIPT_DIR}/load-balancer-cluster-manager-microservice-1.0.0-fat.jar"
Create Tarball¶
To create the tarballs, use the following commands:
# Primary Cluster Manager
tar czvf load-balancer-primary-cluster-manager.tar.gz \
load-balancer-cluster-manager-microservice-${VERSION}-fat.jar \
start.bash \
nodes-settings \
scripts
# Secondary Cluster Manager
tar czvf load-balancer-secondary-cluster-manager.tar.gz \
load-balancer-cluster-manager-microservice-${VERSION}-fat.jar \
start.bash \
nodes-settings \
scripts
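Since the two commands differ only in the archive name, they can be wrapped in one function. This is a minimal sketch: the function name create_cluster_manager_tarball and its argument order are hypothetical. It assumes the directory passed in holds the start.bash that matches the role, because the Primary and Secondary packages use different start scripts.

```bash
#!/usr/bin/env bash
# Hypothetical wrapper around the tar commands above.
# $1: role (primary or secondary), $2: version, $3: package directory (default: .)
create_cluster_manager_tarball() {
  local role="$1" version="$2" dir="${3:-.}"
  # -C switches to the package directory so the archive holds relative paths.
  tar czf "${dir}/load-balancer-${role}-cluster-manager.tar.gz" \
    -C "${dir}" \
    "load-balancer-cluster-manager-microservice-${version}-fat.jar" \
    start.bash \
    nodes-settings \
    scripts
}
```

Running tar tzf on the resulting archive lists its contents, which is a quick way to confirm that all four elements were included.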
Upload To Hetman¶
Uploading the Cluster Manager's package to Hetman and scheduling it hands the responsibility for running the algorithm to the Hetman application. Hetman runs the algorithm at the interval defined by the job's CRON expression, which keeps the cluster state correct at all times.
Before executing the following steps, make sure to log in to Keycloak, create a Hetman user and log in to Hetman.
The following steps should be executed for both the Primary and Secondary Cluster Managers.
Create Group¶
- Go to the Groups tab.
- Click the ADD GROUP button.
- Fill the group form.
  Required fields:
  - Name

Create Job¶
- Go to the Jobs tab.
- Click the ADD JOB button.
- Fill the job form.
  Required fields:
  - Job name
  - Group - select the group created in the previous step.
  - Cron - CRON expression, which defines how often the Cluster Manager should be executed.
  - Timeout (s) - timeout in seconds (recommended: 1300).

Create And Start Task¶
- Click the + icon next to the newly created job.
- Fill the task form.
  Required fields:
  - Task name
  - Type - type of task (mandatory value: package).
  - Run Command - command to run the task (mandatory value: ${executable-directory-name}/start.bash).
  - Timeout (s) - timeout in seconds (recommended: 1200).
  - File - upload the Cluster Manager package.
- Wait until the upload is finished (estimated upload time: 1.5 - 2 min).
- Make the job active by clicking the Active checkbox.