Redundancy and Fault Tolerance for DPE
The following are our recommendations.
DPE Server
Use a load balancer in fail-over configuration in front of two DPE Server machines.
One DPE Server is active and receives all requests; the other is a hot standby.
On failure, the load balancer switches to the standby transparently for clients.
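For illustration, a fail-over load balancer typically decides which DPE Server receives traffic based on a periodic health probe of each machine. The following minimal Python sketch shows such a probe; the host names and the /health endpoint are assumptions for illustration and not part of the DPE API.

    # Minimal health-check sketch for a fail-over load balancer.
    # Host names and the "/health" endpoint are placeholders; use whatever
    # status endpoint your DPE Server installation actually exposes.
    import urllib.request
    import urllib.error

    DPE_SERVERS = ["http://dpe-server-1:8080", "http://dpe-server-2:8080"]

    def is_healthy(base_url: str, timeout: float = 2.0) -> bool:
        """Return True if the server answers the probe with HTTP 200."""
        try:
            with urllib.request.urlopen(base_url + "/health", timeout=timeout) as resp:
                return resp.status == 200
        except (urllib.error.URLError, OSError):
            return False

    if __name__ == "__main__":
        for server in DPE_SERVERS:
            print(server, "OK" if is_healthy(server) else "UNREACHABLE")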
Databases
Run the content databases on a fail-over database cluster.
Run one central DpeCoreDb on a fail-over database cluster.
The DpeCoreDb is shared between the two DPE Servers. This maximizes the workflow and job state preserved on fail-over.
Alternatively, you could have an independent DpeCoreDb for each DPE Server instance, but fail-over would be less seamless.
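To illustrate what the shared DpeCoreDb means in practice: both DPE Servers use the same connection information, pointing at the virtual (listener) name of the fail-over cluster rather than at an individual database node. The sketch below uses Python/pyodbc purely for illustration; the driver, listener name and authentication are assumptions.

    # Sketch: both DPE Servers point at the cluster's virtual name, not at a node.
    # The listener name "dpe-sql-cluster", the driver and the authentication
    # are assumptions for illustration only.
    import pyodbc  # pip install pyodbc

    CORE_DB_CONNECTION = (
        "DRIVER={ODBC Driver 17 for SQL Server};"
        "SERVER=dpe-sql-cluster;"      # virtual/listener name of the fail-over cluster
        "DATABASE=DpeCoreDb;"
        "Trusted_Connection=yes;"
    )

    def open_core_db():
        # On fail-over the virtual name moves to the surviving node, so the same
        # connection string keeps working for both DPE Servers.
        return pyodbc.connect(CORE_DB_CONNECTION, timeout=5)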
PAR Files
DigaSystem allows you to configure a redundant set of PAR file locations (in the Windows registry). DPE Server also supports this mechanism.
Use one central, shared set of PAR files for both DPE Servers.
Configure a redundant location for PAR files.
Synchronize the master files to the redundant location periodically, e.g. once a night (see the sketch below).
Alternatively, you could store the PAR files on a fault-tolerant, distributed file system, e.g. https://moosefs.com/
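The periodic synchronization mentioned above can be done with any file-mirroring tool or a small script triggered once a night, e.g. by the Windows Task Scheduler. A minimal sketch, assuming both locations are plain file shares (the paths are placeholders):

    # Sketch of a nightly one-way mirror of the master PAR files to the
    # redundant location. The paths are placeholders; schedule this script
    # e.g. via the Windows Task Scheduler once a night.
    import shutil
    from pathlib import Path

    MASTER = Path(r"\\fileserver1\digasystem\par")      # assumed master location
    REDUNDANT = Path(r"\\fileserver2\digasystem\par")   # assumed redundant location

    def mirror_par_files() -> None:
        # copytree with dirs_exist_ok=True (Python 3.8+) overwrites existing
        # files in place; deletions on the master side are not propagated here.
        shutil.copytree(MASTER, REDUNDANT, dirs_exist_ok=True)

    if __name__ == "__main__":
        mirror_par_files()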
WorkflowSystem
DPE Processors
Processors run in a farm and are redundant by design.
Have more than one processor of each type running.
Distribute processors of one type over more than one machine.
Processors should connect via the load balancer and not be assigned to one of the DPE Servers directly.
WorkflowWorker
Option 1 (Standby): Configure more than one WorkflowWorker executing the same workflow types on a fail-over cluster (only one is active at a time).
Option 2 (Concurrent): Configure more than one WorkflowWorker executing the same workflow types (all are active at the same time).
WorkflowScheduler
Option 1 (Standby): Configure more than one WorkflowScheduler for the same task on a fail-over cluster (only one is active at a time).
Option 2 (Concurrent): Configure more than one WorkflowScheduler for the same task (all are active at the same time).
Example: a workflow that hard-deletes entries at night.
For Option 2 we recommend using slightly different schedules, e.g. one instance at 2:00 a.m. and the other at 2:30 a.m.
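The staggering recommended for Option 2 only means that the two scheduler instances fire at slightly different times. The actual schedule is configured in DPE; the following sketch merely illustrates the idea of offset run times (the instance names and times are placeholders):

    # Sketch of offset schedules for two concurrent WorkflowScheduler instances.
    # The real schedule is configured in DPE; this only illustrates staggering
    # the instances so the nightly clean-up does not fire at the same moment.
    from datetime import datetime, time, timedelta

    SCHEDULES = {
        "scheduler-a": time(2, 0),    # runs the hard-delete workflow at 02:00
        "scheduler-b": time(2, 30),   # same workflow, offset by 30 minutes
    }

    def next_run(run_at: time, now: datetime) -> datetime:
        candidate = datetime.combine(now.date(), run_at)
        return candidate if candidate > now else candidate + timedelta(days=1)

    if __name__ == "__main__":
        now = datetime.now()
        for name, at in SCHEDULES.items():
            print(name, "next run:", next_run(at, now))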
WorkflowTableWatcher
Option 1 (Standby): Configure more than one WorkflowTableWatcher for the same task on a fail-over cluster (only one is active at a time).
Challenge with Option 1: WorkflowTableWatcher writes a local memory file that "mirrors" the states of entries (Created, Updated, Deleted), so the standby instance starts without this state after a fail-over.
Option 2 (Concurrent): Configure more than one WorkflowTableWatcher for the same task (all are active at the same time).
Challenge with Option 2: a workflow could be executed twice; avoid this by workflow naming (see the sketch below).
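One way to realize the naming-based deduplication is to derive the workflow name deterministically from the triggering entry, so that both concurrent watchers produce the same name for the same change and a duplicate start can be recognized. This is a sketch of the idea only; the naming scheme and the duplicate check are assumptions, not DPE API:

    # Sketch: derive a deterministic workflow name from the triggering table
    # entry, so two concurrent WorkflowTableWatchers produce the same name for
    # the same change and the duplicate start can be detected and skipped.
    started = set()  # stand-in for "workflows already known to the system"

    def workflow_name(table: str, entry_guid: str, change: str) -> str:
        return f"{table}_{entry_guid}_{change}"

    def start_once(table: str, entry_guid: str, change: str) -> bool:
        name = workflow_name(table, entry_guid, change)
        if name in started:
            return False          # second watcher: same name, do not start again
        started.add(name)
        # ... start the workflow under this name via the DPE API ...
        return True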
WorkflowFolderWatcher
Option 1 (Standby): Configure more than one WorkflowFolderWatcher for the same task on a fail-over cluster (only one is active at a time).
Challenge with Option 1: WorkflowFolderWatcher writes local memory files that "mirror" the states of files (Created, Updated, Deleted), so the standby instance starts without this state after a fail-over.
Option 2 (Concurrent): Configure more than one WorkflowFolderWatcher for the same task (all are active at the same time).
Challenge with Option 2: a workflow could be executed twice; avoid this by workflow naming (see the sketch below).
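The same naming idea works for files: deriving the workflow name from the watched file (and, if needed, its modification time) lets a duplicate start by the second WorkflowFolderWatcher be recognized. Again a sketch under an assumed naming scheme, not DPE API:

    # Sketch: deterministic workflow name derived from the watched file, so both
    # concurrent WorkflowFolderWatchers generate the same name for the same event.
    # The naming scheme is an assumption for illustration only.
    from pathlib import Path

    def workflow_name_for_file(path: Path, change: str) -> str:
        mtime = int(path.stat().st_mtime)      # distinguishes repeated updates
        return f"{change}_{path.name}_{mtime}"

    if __name__ == "__main__":
        p = Path("example.wav")                # placeholder file
        p.touch()
        print(workflow_name_for_file(p, "Created"))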
Clients
All clients of DPE must be able to cope with brief unavailability of DPE Services.
A simple way to support this is client-side retries.
DPE components already support retries.
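For custom clients that do not yet retry, a small retry wrapper with a short back-off is usually enough to bridge the fail-over window. A minimal sketch, assuming an HTTP-based service call; the URL is a placeholder, and the number of attempts and the delay should be tuned to how long your fail-over actually takes:

    # Minimal client-side retry sketch to bridge a short DPE fail-over window.
    # The request itself is a placeholder; tune attempts and delay to the
    # fail-over time of your installation.
    import time
    import urllib.request
    import urllib.error

    def call_with_retry(url: str, attempts: int = 5, delay_s: float = 3.0) -> bytes:
        last_error = None
        for attempt in range(1, attempts + 1):
            try:
                with urllib.request.urlopen(url, timeout=5) as resp:
                    return resp.read()
            except (urllib.error.URLError, OSError) as exc:
                last_error = exc
                if attempt < attempts:
                    time.sleep(delay_s)   # wait for the standby to take over
        raise RuntimeError(f"DPE service still unavailable after {attempts} attempts") from last_error

    if __name__ == "__main__":
        print(len(call_with_retry("http://dpe-loadbalancer:8080/health")))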