Cluster Manager

Cluster Manager is a Java application that manages the cluster. Its algorithm decides which worker should be primary and which secondary, and what to do when a worker goes down.

Cluster Manager Package

The Cluster Manager tarball should consist of four elements:

  • load-balancer-cluster-manager-microservice-${VERSION}-fat.jar application,
  • start.bash script,
  • nodes-settings directory:
    • nodes.json file,
  • scripts directory:
    • cluster-standard-message-script.bash script,
    • cluster-error-message-script.bash script.
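
Before packaging, it can help to confirm that a directory actually contains the elements listed above. The following is a minimal sketch, not part of the Cluster Manager distribution; the function name check_package_layout and its arguments are hypothetical.

```bash
#!/usr/bin/env bash

# Hypothetical helper: reports every expected package element that is
# missing from a directory and returns non-zero if anything is missing.
check_package_layout() {
    local package_dir="$1"
    local version="$2"
    local missing=0
    local entry

    for entry in \
        "load-balancer-cluster-manager-microservice-${version}-fat.jar" \
        "start.bash" \
        "nodes-settings/nodes.json" \
        "scripts/cluster-standard-message-script.bash" \
        "scripts/cluster-error-message-script.bash"
    do
        if [ ! -f "${package_dir}/${entry}" ]; then
            echo "missing: ${entry}" >&2
            missing=1
        fi
    done

    return "${missing}"
}
```

For example, check_package_layout ./primary-cluster-manager 1.0.0 would list any missing files on standard error, where ./primary-cluster-manager is an assumed working directory.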

Algorithm

load-balancer-cluster-manager-microservice-${VERSION}-fat.jar is an executable Java archive that contains the Cluster Manager algorithm.

Nodes Configuration

The nodes-settings/nodes.json file contains a JSON array with the initial configuration of the nodes.

The algorithm treats the first node in the array as the potential primary.

[
    {
        "address": "10.10.10.3",
        "cdtpPort": 8030,
        "edgeBalancerPort": 8020,
        "edgeBalancerProtocol": "http",
        "status": "unavailable"
    },
    {
        "address": "10.10.10.2",
        "cdtpPort": 8030,
        "edgeBalancerPort": 8020,
        "edgeBalancerProtocol": "http",
        "status": "unavailable"
    }
]

  • address is the IP address of a worker server,
  • cdtpPort is the Node Manager port on the worker server that accepts requests sent via the CDTP protocol (8030 by default),
  • edgeBalancerPort is the port of the Node Manager's external load balancer on the worker server (8020 by default),
  • edgeBalancerProtocol is the protocol used to communicate with the Node Manager's external load balancer on the worker server (http by default),
  • status is the status of the worker (when initializing the cluster, it should be unavailable).
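
For clusters with more than a couple of workers, the file can be generated rather than written by hand. The sketch below is an illustration only: generate_nodes_json is a hypothetical helper that takes worker addresses in priority order (the first one becomes the potential primary) and fills in the default ports, protocol and initial status described above.

```bash
#!/usr/bin/env bash

# Hypothetical helper: prints a nodes.json document for the given worker
# addresses, using the default values from the field descriptions
# (cdtpPort 8030, edgeBalancerPort 8020, protocol http, status unavailable).
generate_nodes_json() {
    local address
    local first=1

    printf '[\n'
    for address in "$@"; do
        # Separate consecutive node objects with a comma.
        if [ "${first}" -eq 0 ]; then
            printf ',\n'
        fi
        first=0
        printf '    {\n'
        printf '        "address": "%s",\n' "${address}"
        printf '        "cdtpPort": 8030,\n'
        printf '        "edgeBalancerPort": 8020,\n'
        printf '        "edgeBalancerProtocol": "http",\n'
        printf '        "status": "unavailable"\n'
        printf '    }'
    done
    printf '\n]\n'
}
```

Running generate_nodes_json 10.10.10.3 10.10.10.2 > nodes-settings/nodes.json would produce a file equivalent to the example above.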

Messages Scripts

scripts directory contains two scripts:

  • cluster-standard-message-script.bash - which handles standard messages from the Cluster Manager's algorithm,
  • cluster-error-message-script.bash - which handles error messages from the Cluster Manager's algorithm.

Message handling can be implemented in many ways, for example:

  • send the messages to the monitoring application,
  • send the messages to external software,
  • send the messages as email via SMTP server,
  • send the messages as an SMS via SMPP server.

In your scripts, remember to use the following variables:

  • ${message} - will be replaced with standard message content in your standard message script,
  • ${error_message} - will be replaced with error message content in your error message script.
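
To dry-run a message script before uploading it, the replacement can be emulated locally. The source does not specify how the Cluster Manager performs the replacement; the sketch below assumes a plain textual substitution of the placeholder, and render_placeholder is a hypothetical helper name.

```bash
#!/usr/bin/env bash

# Hypothetical helper: replaces every literal occurrence of a placeholder
# (for example the text ${message}) in a template with the given value.
# This only emulates the replacement for local testing; the real mechanism
# inside the Cluster Manager is an assumption.
render_placeholder() {
    local template="$1"
    local placeholder="$2"
    local value="$3"
    local result=''

    while :; do
        case "${template}" in
            *"${placeholder}"*)
                # Keep the text before the placeholder, then append the value.
                result="${result}${template%%"${placeholder}"*}${value}"
                # Continue with the text after the placeholder.
                template="${template#*"${placeholder}"}"
                ;;
            *)
                break
                ;;
        esac
    done

    printf '%s' "${result}${template}"
}
```

For instance, render_placeholder "$(cat scripts/cluster-standard-message-script.bash)" '${message}' 'test message' prints the script as it would look with a sample message filled in, without executing it.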

Example Message Scripts

The following examples are Primary Cluster Manager scripts that send requests to Onteon Tech's monitoring application.

Standard Message Script
#!/bin/bash

set -o errexit

curl --location 'http://127.0.0.5:8020/_by_name/cluster-monitoring-core-microservice/v1/addStandardMessage' \
--header 'Content-Type: application/json' \
--data '{
    "source": "10.10.10.1-Primary-Cluster-Manager",
    "message": "${message}"
}'

Error Message Script
#!/bin/bash

set -o errexit

curl --location 'http://127.0.0.5:8020/_by_name/cluster-monitoring-core-microservice/v1/addErrorMessage' \
--header 'Content-Type: application/json' \
--data '{
    "source": "10.10.10.1-Primary-Cluster-Manager",
    "message": "${error_message}"
}'
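
Both example scripts place the message inside a JSON string literal, so a message containing a double quote or a backslash would produce an invalid payload. Whether the Cluster Manager escapes messages itself is not stated here, so the following is only a sketch of a helper that a custom handler could use; json_escape is a hypothetical name, and the helper covers backslashes and double quotes only, not control characters such as newlines.

```bash
#!/usr/bin/env bash

# Hypothetical helper: escapes backslashes and double quotes so that a
# single-line text can be embedded in a JSON string literal.
# Backslashes are escaped first so that the backslashes added for the
# quotes are not escaped a second time.
json_escape() {
    printf '%s' "$1" | sed -e 's/\\/\\\\/g' -e 's/"/\\"/g'
}
```

A handler could then build its payload with something like printf '{"message": "%s"}' "$(json_escape "${text}")", where text is an assumed variable holding the raw message.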

Start Script

The start.bash script makes it easy to start the Cluster Manager on Hetman. It should set all of the Java system properties that the algorithm needs to run properly.

Algorithm System Properties

  • -Dlog-directory - directory, where the algorithm logs will be stored,
  • -Dfalcondb-microservice-uri - URI address of FalconDB (value http://localhost:8021/_by_name/falcon-db-core-microservice, should not be changed),
  • -Dnodes-json-file-path - absolute path of nodes configuration file (recommended value ${SCRIPT_DIR}/nodes-settings/nodes.json),
  • -Dapplications-directory-path - absolute path of the directory that holds the microservices to be uploaded and started on worker nodes (value /usr/local/bin/onteon-node-manager/store/microservices, should not be changed),
  • -Dcluster-standard-message-script-path-to-file - absolute path of the Cluster Manager's standard message script (recommended value ${SCRIPT_DIR}/scripts/cluster-standard-message-script.bash),
  • -Dcluster-error-message-script-path-to-file - absolute path of the Cluster Manager's error message script (recommended value ${SCRIPT_DIR}/scripts/cluster-error-message-script.bash),
  • -Dvirtual-ip-up-command - command to enable the virtual IP (if you do not use a virtual IP, simply set the value to 'echo up virtual'),
  • -Dvirtual-ip-down-command - command to disable the virtual IP (if you do not use a virtual IP, simply set the value to 'echo down virtual'),
  • -Dget-keep-alive-url - URL to technical microservice's get keep alive endpoint (value _by_name/technical-generic-microservice/v1/getKeepAlive, should not be changed),
  • -Dset-keep-alive-url - URL to technical microservice's set keep alive endpoint (value _by_name/technical-generic-microservice/v1/setKeepAlive, should not be changed),
  • -Dget-apply-to-be-primary-master-url - URL to technical microservice's get apply to be primary master endpoint (value _by_name/technical-generic-microservice/v1/getApplyToBePrimaryMasterUrl, should not be changed),
  • -Dapply-to-be-primary-master-url - URL to technical microservice's apply to be primary master endpoint (value _by_name/technical-generic-microservice/v1/applyToBePrimaryMaster, should not be changed),
  • -Dreset-apply-to-be-primary-master-url - URL to technical microservice's reset apply to be primary master endpoint (value _by_name/technical-generic-microservice/v1/resetApplyToBePrimaryMaster, should not be changed),
  • -Ddo-command-url - URL to technical microservice's do command endpoint (value _by_name/technical-generic-microservice/v1/doCommand, should not be changed),
  • -Dexpiry-keep-alive-time-in-milliseconds - time after which the primary worker turns off if there has been no communication with this node (recommended value: 1200000),
  • -Dawait-primary-node-turn-off-in-milliseconds - defines how long the algorithm waits after sending the turn-off signal to the primary worker, to make sure it is indeed turned off (recommended value: 75000),
  • -Dsecondary-cluster-manager-frequency-control-in-milliseconds - defines how often the secondary checks whether the primary is alive (default value 10000),
  • -Dawait-drbd-start-milliseconds - defines how long the algorithm waits after starting the DRBD microservice, to make sure it is indeed started (recommended value: 30000),
  • -Dawait-drbd-role-switch-milliseconds - defines how long the algorithm waits after switching the role of the DRBD microservice, to make sure it did change its role (recommended value: 30000),
  • -Dis-active-active - defines if the cluster is in active-active mode (if false, then active-passive),
  • -Dsecondary-cluster-manager-health-check-url - Secondary Cluster Manager's health check URL (add the correct IP to the value http://${secondary-cluster-manager-ip}:8020/805895fcf846ac34e966e97c),
  • -Dprimary-cluster-manager-health-check-url - Primary Cluster Manager's health check URL (add the correct IP to the value http://${primary-cluster-manager-ip}:8020/805895fcf846ac34e966e97c),
  • -Dcluster-error-message-command-frequency-in-minutes - defines how often to send error messages (recommended value '-1', which means an error message is sent when it occurs),
  • -Dcluster-standard-message-command-frequency-in-minutes - defines how often to send standard messages (recommended value '-1', which means a standard message is sent when it occurs),
  • -Dis-secondary-part-of-cluster - boolean value, which defines if there is a Secondary Cluster Manager in the cluster,
  • -Dcluster-message-logger-handler-thread-pool-size - defines how many threads are in the message handler's thread pool (recommended value: 8),
  • -Dwait-time-in-milliseconds-for-complete-handle-message - defines how long the algorithm's message thread will wait to make sure that the message script was completed (recommended value: 30000),
  • -Dcluster-state-table-name - name of a table in FalconDB, which stores the cluster's state (recommended value ha-load-balancer-cluster-state),
  • -Devent-log-table-name - name of a table in FalconDB, which stores event log (recommended value ha-load-balancer-cluster-event-log),
  • -Dawait-technical-microservice-start-milliseconds - defines how long the algorithm will wait after starting technical generic microservice, to make sure it is indeed started (recommended value: 70000),
  • -Dself-healing-time-running - defines how long the algorithm waits after an event log entry occurs, so that self-healing can run (recommended value: 60000),
  • -Dprod-mode - boolean value, which defines if the cluster is in the production mode (recommended value true),
  • -DisPrimary - boolean value, which defines if the Cluster Manager is Primary.
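
With this many properties, a single long java command line is hard to review. One option is to emit the options one per line from a small function. The sketch below is not taken from the product: print_java_options and its arguments are hypothetical, and it covers only a subset of the properties listed above.

```bash
#!/usr/bin/env bash

# Hypothetical helper: prints a subset of the Java system properties,
# one option per line, so each value can be reviewed separately.
# Arguments: package directory, true/false for the Primary role,
# Primary Cluster Manager IP, Secondary Cluster Manager IP.
print_java_options() {
    local script_dir="$1"
    local is_primary="$2"
    local primary_ip="$3"
    local secondary_ip="$4"
    # Health check path as given in the property descriptions above.
    local health_check_path='805895fcf846ac34e966e97c'

    printf '%s\n' \
        "-Dlog-directory=${script_dir}/logs" \
        "-Dnodes-json-file-path=${script_dir}/nodes-settings/nodes.json" \
        "-Dprimary-cluster-manager-health-check-url=http://${primary_ip}:8020/${health_check_path}" \
        "-Dsecondary-cluster-manager-health-check-url=http://${secondary_ip}:8020/${health_check_path}" \
        "-Dis-active-active=false" \
        "-Dis-secondary-part-of-cluster=true" \
        "-DisPrimary=${is_primary}"
    # The remaining properties from the list above would be added the same way.
}
```

Calling print_java_options /opt/cluster-manager true 10.10.10.1 10.10.10.2 prints seven options; the directory /opt/cluster-manager is an assumed location. Because only the role flag and the IP addresses are parameters, the same function could serve both the Primary and the Secondary start script.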

Example Start Scripts

Primary Cluster Manager
#!/usr/bin/env bash

# Copyright (c) 2024, Onteon Tech and/or its affiliates.
# All rights reserved.
# Use is subject to license terms.

set -e
set -u
trap "exit 128" INT

SOURCE="${BASH_SOURCE[0]}"

while [[ -h "${SOURCE}" ]] ; do
    DIR="$(cd -P "$(dirname "${SOURCE}")" >/dev/null 2>&1 && pwd)"
    SOURCE="$(readlink "${SOURCE}")"

    [[ "${SOURCE}" != /* ]] && SOURCE="${DIR}/${SOURCE}"
done

SCRIPT_DIR="$(cd -P "$(dirname "${SOURCE}")" >/dev/null 2>&1 && pwd)"

/usr/local/bin/onteon-node-manager/jdk/bin/java \
    "-Dlog-directory=${SCRIPT_DIR}/logs" \
    -Dfalcondb-microservice-uri=http://localhost:8021/_by_name/falcon-db-core-microservice \
    "-Dnodes-json-file-path=${SCRIPT_DIR}/nodes-settings/nodes.json" \
    -Dapplications-directory-path=/usr/local/bin/onteon-node-manager/store/microservices \
    "-Dcluster-standard-message-script-path-to-file=${SCRIPT_DIR}/scripts/cluster-standard-message-script.bash" \
    "-Dcluster-error-message-script-path-to-file=${SCRIPT_DIR}/scripts/cluster-error-message-script.bash" \
    -Dvirtual-ip-up-command='echo up virtual' \
    -Dvirtual-ip-down-command='echo down virtual' \
    -Dget-keep-alive-url=_by_name/technical-generic-microservice/v1/getKeepAlive \
    -Dset-keep-alive-url=_by_name/technical-generic-microservice/v1/setKeepAlive \
    -Dget-apply-to-be-primary-master-url=_by_name/technical-generic-microservice/v1/getApplyToBePrimaryMasterUrl \
    -Dapply-to-be-primary-master-url=_by_name/technical-generic-microservice/v1/applyToBePrimaryMaster \
    -Dreset-apply-to-be-primary-master-url=_by_name/technical-generic-microservice/v1/resetApplyToBePrimaryMaster \
    -Ddo-command-url=_by_name/technical-generic-microservice/v1/doCommand \
    -Dexpiry-keep-alive-time-in-milliseconds=1200000 \
    -Dawait-primary-node-turn-off-in-milliseconds=75000 \
    -Dsecondary-cluster-manager-frequency-control-in-milliseconds=10000 \
    -Dawait-drbd-start-milliseconds=30000 \
    -Dawait-drbd-role-switch-milliseconds=30000 \
    -Dis-active-active=false \
    -Dsecondary-cluster-manager-health-check-url=http://10.10.10.2:8020/805895fcf846ac34e966e97c \
    -Dprimary-cluster-manager-health-check-url=http://10.10.10.1:8020/805895fcf846ac34e966e97c \
    -Dcluster-error-message-command-frequency-in-minutes='-1' \
    -Dcluster-standard-message-command-frequency-in-minutes='-1' \
    -Dis-secondary-part-of-cluster=true \
    -Dcluster-message-logger-handler-thread-pool-size=8 \
    -Dwait-time-in-milliseconds-for-complete-handle-message=30000 \
    -Dcluster-state-table-name=ha-load-balancer-cluster-state \
    -Devent-log-table-name=ha-load-balancer-cluster-event-log \
    -Dawait-technical-microservice-start-milliseconds=70000 \
    -Dself-healing-time-running=60000 \
    -Dprod-mode=true \
    -DisPrimary=true \
    -jar "${SCRIPT_DIR}/load-balancer-cluster-manager-microservice-1.0.0-fat.jar"

Secondary Cluster Manager
#!/usr/bin/env bash

# Copyright (c) 2024, Onteon Tech and/or its affiliates.
# All rights reserved.
# Use is subject to license terms.

set -e
set -u
trap "exit 128" INT

SOURCE="${BASH_SOURCE[0]}"

while [[ -h "${SOURCE}" ]] ; do
    DIR="$(cd -P "$(dirname "${SOURCE}")" >/dev/null 2>&1 && pwd)"
    SOURCE="$(readlink "${SOURCE}")"

    [[ "${SOURCE}" != /* ]] && SOURCE="${DIR}/${SOURCE}"
done

SCRIPT_DIR="$(cd -P "$(dirname "${SOURCE}")" >/dev/null 2>&1 && pwd)"

/usr/local/bin/onteon-node-manager/jdk/bin/java \
    "-Dlog-directory=${SCRIPT_DIR}/logs" \
    -Dfalcondb-microservice-uri=http://localhost:8021/_by_name/falcon-db-core-microservice \
    "-Dnodes-json-file-path=${SCRIPT_DIR}/nodes-settings/nodes.json" \
    -Dapplications-directory-path=/usr/local/bin/onteon-node-manager/store/microservices \
    "-Dcluster-standard-message-script-path-to-file=${SCRIPT_DIR}/scripts/cluster-standard-message-script.bash" \
    "-Dcluster-error-message-script-path-to-file=${SCRIPT_DIR}/scripts/cluster-error-message-script.bash" \
    -Dvirtual-ip-up-command='echo up virtual' \
    -Dvirtual-ip-down-command='echo down virtual' \
    -Dget-keep-alive-url=_by_name/technical-generic-microservice/v1/getKeepAlive \
    -Dset-keep-alive-url=_by_name/technical-generic-microservice/v1/setKeepAlive \
    -Dapply-to-be-primary-master-url=_by_name/technical-generic-microservice/v1/applyToBePrimaryMaster \
    -Dget-apply-to-be-primary-master-url=_by_name/technical-generic-microservice/v1/getApplyToBePrimaryMasterUrl \
    -Dreset-apply-to-be-primary-master-url=_by_name/technical-generic-microservice/v1/resetApplyToBePrimaryMaster \
    -Ddo-command-url=_by_name/technical-generic-microservice/v1/doCommand \
    -Dexpiry-keep-alive-time-in-milliseconds=300000 \
    -Dawait-primary-node-turn-off-in-milliseconds=75000 \
    -Dsecondary-cluster-manager-frequency-control-in-milliseconds=10000 \
    -Dawait-drbd-start-milliseconds=30000 \
    -Dawait-drbd-role-switch-milliseconds=30000 \
    -Dis-active-active=false \
    -Dsecondary-cluster-manager-health-check-url=http://10.10.10.2:8020/805895fcf846ac34e966e97c \
    -Dprimary-cluster-manager-health-check-url=http://10.10.10.1:8020/805895fcf846ac34e966e97c \
    -Dcluster-error-message-command-frequency-in-minutes='-1' \
    -Dcluster-standard-message-command-frequency-in-minutes='-1' \
    -Dis-secondary-part-of-cluster=true \
    -Dcluster-message-logger-handler-thread-pool-size=8 \
    -Dwait-time-in-milliseconds-for-complete-handle-message=30000 \
    -Dcluster-state-table-name=ha-load-balancer-cluster-state \
    -Devent-log-table-name=ha-load-balancer-cluster-event-log \
    -Dawait-technical-microservice-start-milliseconds=70000 \
    -Dself-healing-time-running=60000 \
    -Dprod-mode=true \
    -DisPrimary=false \
    -jar "${SCRIPT_DIR}/load-balancer-cluster-manager-microservice-1.0.0-fat.jar"

Create Tarball

To create the tarballs, use the following commands:

# Primary Cluster Manager
tar czvf load-balancer-primary-cluster-manager.tar.gz \
    load-balancer-cluster-manager-microservice-${VERSION}-fat.jar \
    start.bash \
    nodes-settings \
    scripts

# Secondary Cluster Manager
tar czvf load-balancer-secondary-cluster-manager.tar.gz \
    load-balancer-cluster-manager-microservice-${VERSION}-fat.jar \
    start.bash \
    nodes-settings \
    scripts
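
Both commands above use the same file names, so the Primary and Secondary packages are presumably built from separate working directories, each with its own start.bash. A minimal sketch under that assumption follows; package_cluster_manager and its arguments are hypothetical, while -C is the standard tar option for changing directory before adding files.

```bash
#!/usr/bin/env bash

# Hypothetical helper: builds the tarball for one role from its own
# source directory. Arguments: role (primary or secondary), source
# directory, JAR version, output directory.
package_cluster_manager() {
    local role="$1"
    local source_dir="$2"
    local version="$3"
    local output_dir="$4"

    # -C makes tar read the listed entries relative to the source directory,
    # so the archive contains them at its top level.
    tar czf "${output_dir}/load-balancer-${role}-cluster-manager.tar.gz" \
        -C "${source_dir}" \
        "load-balancer-cluster-manager-microservice-${version}-fat.jar" \
        start.bash \
        nodes-settings \
        scripts
}
```

For example, package_cluster_manager primary ./primary 1.0.0 . followed by package_cluster_manager secondary ./secondary 1.0.0 . would create both archives in the current directory, where ./primary and ./secondary are assumed directory names. The result can be inspected with tar tzf.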

Upload To Hetman

Uploading the Cluster Manager package to Hetman and scheduling it there hands responsibility for running the algorithm over to Hetman. Hetman runs the algorithm at the interval defined by the job's CRON expression, which keeps the cluster state correct at all times.

Before executing the following steps, make sure to log in to Keycloak, create a Hetman user, and log in to Hetman.

The following steps should be executed for both the Primary and Secondary Cluster Managers.

Create Group

  1. Go to the Groups tab.
  2. Click the ADD GROUP button.
  3. Fill in the group form.

Required fields:

  • Name

Create Job

  1. Go to the Jobs tab.
  2. Click the ADD JOB button.
  3. Fill in the job form.

Required fields:

  • Job name
  • Group - select a group created in the previous step.
  • Cron - CRON expression that defines how often the Cluster Manager is executed.
  • Timeout (s) - timeout in seconds (recommended: 1300).

Create And Start Task

  1. Click the + icon next to the newly created job.
  2. Fill in the task form.

    Required fields:

    • Task name
    • Type - type of task (mandatory value: package).
    • Run Command - command to run the task (mandatory value: ${executable-directory-name}/start.bash).
    • Timeout (s) - timeout in seconds (recommended: 1200).
    • File - upload Cluster Manager package.

  3. Wait until the upload is finished (estimated upload time: 1.5 - 2 min).

  4. Make the job active by clicking the Active checkbox.
