Redundancy and Fault Tolerance for DPE

The following are our recommendations. 

DPE Server

  • Use a load balancer in fail-over configuration in front of two DPE Server machines.
  • One DPE Server is active (it receives all requests); the other is a hot standby.
  • On failure, the load balancer switches over transparently for clients.
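The active/standby setup above can be sketched as an HAProxy backend with one primary server and one backup server. This is an illustrative fragment, not a prescribed configuration: the hostnames, ports, and the /health check endpoint are assumptions.

```haproxy
# Hypothetical HAProxy configuration: all traffic goes to dpe1;
# dpe2 only receives requests when dpe1 fails its health check.
frontend dpe_front
    bind *:8080
    default_backend dpe_back

backend dpe_back
    option httpchk GET /health     # assumed health endpoint on DPE Server
    server dpe1 dpe-server-1:8080 check
    server dpe2 dpe-server-2:8080 check backup
```

Because dpe2 is marked "backup", it stays idle while dpe1 is healthy, which matches the hot-standby recommendation above.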

Databases

  • Content databases run on a fail-over database cluster.
  • One central DpeCoreDb runs on a fail-over database cluster.
  • DpeCoreDb is shared between the two DPE Servers. This maximizes the preserved (workflow and job) state on fail-over.
  • Alternatively, each DPE Server instance could have its own independent DpeCoreDb, but fail-over would be less seamless.

PAR Files

  • DigaSystem allows configuring a redundant set of PAR file locations (in the Windows registry). DPE Server also supports this mechanism.
  • Use one central, shared set of PAR files for both DPE Servers.
  • Configure a redundant location for the PAR files.
  • Synchronize the master files to the redundant location periodically, e.g. once per night.
  • Alternatively, store the PAR files on a fault-tolerant, distributed file system, e.g. https://moosefs.com/
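The nightly synchronization can be sketched as follows. This is a minimal illustration, assuming the master and redundant PAR locations are plain directories; the function name and paths are not part of DigaSystem, and on Windows a scheduled robocopy /MIR job achieves the same effect.

```python
import shutil
from pathlib import Path

def sync_par_files(master: str, redundant: str) -> list[str]:
    """Copy every PAR file from the master location to the redundant
    location, overwriting stale copies. Returns the names of the
    files that were actually copied."""
    src, dst = Path(master), Path(redundant)
    dst.mkdir(parents=True, exist_ok=True)
    copied = []
    for par in src.glob("*.par"):
        target = dst / par.name
        # Copy only if the redundant copy is missing or older than the master.
        if not target.exists() or target.stat().st_mtime < par.stat().st_mtime:
            shutil.copy2(par, target)  # copy2 preserves the modification time
            copied.append(par.name)
    return copied
```

Run once per night (e.g. from a scheduled task); because unchanged files are skipped, repeated runs are cheap.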

WorkflowSystem

DPE Processors

  • Processors run in a farm and are redundant by design.
  • Have more than one processor of each type running.
  • Distribute processors of one type over more than one machine.
  • Processors should connect through the load balancer rather than being assigned directly to one of the DPE Servers.

WorkflowWorker

  • Option 1 (Standby): Configure more than one WorkflowWorker executing the same workflow types on a fail-over cluster (only one is active at the same time).
  • Option 2 (Concurrent): Configure more than one WorkflowWorker executing the same workflow types (both are active at the same time).

WorkflowScheduler

  • Option 1 (Standby): Configure more than one WorkflowScheduler for the same task on a fail-over cluster (only one is active at the same time).
  • Option 2 (Concurrent): Configure more than one WorkflowScheduler for the same task (both are active at the same time). 

Example: a workflow that hard-deletes entries at night.

For Option 2 we recommend slightly different schedules, e.g. one scheduler at 2:00 a.m. and the other at 2:30 a.m.
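With Option 2, the two schedulers' offset schedules for the nightly hard-delete might look like this (illustrative cron syntax; the workflow name and the start command are assumptions, and DPE's own scheduler configuration format may differ):

```
# Scheduler instance A: start the hard-delete workflow at 2:00 a.m.
0 2 * * *    start-workflow HardDeleteEntries

# Scheduler instance B: same workflow, offset to 2:30 a.m.
30 2 * * *   start-workflow HardDeleteEntries
```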

WorkflowTableWatcher

  • Option 1 (Standby): Configure more than one WorkflowTableWatcher for the same task on a fail-over cluster (only one is active at the same time).
  • Challenge with Option 1: WorkflowTableWatcher writes a local memory file that "mirrors" the states of entries (Created, Updated, Deleted); on fail-over, the standby instance starts without this state.
  • Option 2 (Concurrent): Configure more than one WorkflowTableWatcher for the same task (all are active at the same time).
  • Challenge with Option 2: a workflow could be executed twice; avoid this by workflow naming.
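One way to read "avoid this by workflow naming" is to derive the workflow instance name deterministically from the watched entry and event, so that a second watcher reacting to the same event produces a duplicate name that a uniqueness check can reject. The following sketch illustrates that idea under this assumption; the function names and the in-memory uniqueness check are hypothetical, not part of the DPE API.

```python
import hashlib

def workflow_instance_name(table: str, entry_id: str, event: str) -> str:
    """Deterministic name for the workflow triggered by one table event.
    Two concurrent WorkflowTableWatchers seeing the same event compute
    the same name, so a uniqueness check on the name prevents the
    workflow from running twice."""
    digest = hashlib.sha256(f"{table}|{entry_id}|{event}".encode()).hexdigest()[:12]
    return f"OnEntry_{event}_{digest}"

# Trivial in-memory stand-in for a server-side uniqueness check.
_started: set[str] = set()

def try_start_workflow(name: str) -> bool:
    """Start the workflow unless one with this name was already started."""
    if name in _started:
        return False  # duplicate: the second watcher loses the race
    _started.add(name)
    return True
```

The same naming scheme applies to WorkflowFolderWatcher, with the file path taking the place of the entry ID.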

WorkflowFolderWatcher

  • Option 1 (Standby): Configure more than one WorkflowFolderWatcher for the same task on a fail-over cluster (only one is active at the same time).
  • Challenge with Option 1: WorkflowFolderWatcher writes local memory files that "mirror" the states of files (Created, Updated, Deleted); on fail-over, the standby instance starts without this state.
  • Option 2 (Concurrent): Configure more than one WorkflowFolderWatcher for the same task (all are active at the same time).
  • Challenge with Option 2: a workflow could be executed twice; avoid this by workflow naming.

Clients

  • All clients of DPE must be able to cope with short-term unavailability of DPE services.
  • A simple way to achieve this is client-side retries.
  • The DPE components already support retries.
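For clients that do not already inherit retries from the DPE components, a minimal client-side retry with exponential backoff might look like the sketch below. The function and the use of ConnectionError are illustrative assumptions, not part of the DPE API.

```python
import time

def call_with_retries(request, attempts=5, base_delay=0.5):
    """Invoke `request` (any zero-argument callable). On failure,
    wait base_delay * 2**n seconds and try again, so that short
    DPE outages (e.g. during fail-over) are bridged transparently."""
    for attempt in range(attempts):
        try:
            return request()
        except ConnectionError:
            if attempt == attempts - 1:
                raise  # outage lasted longer than our retry window
            time.sleep(base_delay * 2 ** attempt)
```

The backoff keeps retrying clients from hammering the load balancer at the exact moment it is switching over to the standby server.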