Redundancy and Fault Tolerance for DPE
The following are our recommendations.
DPE Server
- Use a load balancer in fail-over configuration in front of two DPE Server machines.
- One DPE Server is active (it receives all requests); the other is a hot standby.
- On failure, the load balancer switches to the standby transparently for clients (see the sketch below).
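As an illustration, here is a minimal sketch of the active/standby routing decision the load balancer makes; the endpoint URLs, health path, and probe interval are assumptions for illustration, not DPE defaults:

```python
import time
import urllib.request

# Hypothetical health endpoints of the two DPE Server machines.
ACTIVE = "http://dpe1.example.local:8080/health"
STANDBY = "http://dpe2.example.local:8080/health"

def is_healthy(url: str, timeout: float = 2.0) -> bool:
    """Probe a health endpoint; any error or non-200 counts as unhealthy."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except OSError:
        return False

def pick_backend() -> str:
    """Route all traffic to the active server; fail over to the hot standby."""
    return ACTIVE if is_healthy(ACTIVE) else STANDBY

while True:
    print("routing to", pick_backend())
    time.sleep(5)  # assumed probe interval
```

In practice this logic lives inside the load balancer itself; the sketch only shows why the switch is transparent to clients: they always talk to one stable address.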
Databases
- Content databases run on a fail-over database cluster.
- One central DpeCoreDb runs on a fail-over database cluster.
- The DpeCoreDb is shared between the two DPE Servers; this maximizes the workflow and job state preserved on fail-over.
- Alternatively, each DPE Server instance could have its own independent DpeCoreDb, but fail-over would be less seamless.
PAR Files
- DigaSystem allows configuring a redundant set of PAR file locations (in the Windows registry). DPE Server also supports this mechanism.
- Use one central, shared set of PAR files for both DPE Servers.
- Configure a redundant location for PAR files.
- Synchronize the master files to the redundant location periodically, e.g. once a night (see the sketch below).
- Alternatively, you could store the PAR files on a fault-tolerant, distributed file system, e.g. https://moosefs.com/
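A minimal sketch of such a nightly synchronization, assuming the master and redundant locations are plain directories (the paths are placeholders; a scheduled robocopy or similar tool works just as well):

```python
import shutil
from pathlib import Path

# Placeholder locations; adjust to your installation.
MASTER = Path(r"\\fileserver1\digas\par")
REDUNDANT = Path(r"\\fileserver2\digas\par")

def sync_par_files(src: Path, dst: Path) -> None:
    """Copy all PAR files from the master to the redundant location.
    Run e.g. once a night via a scheduled task."""
    for par in src.rglob("*.par"):
        target = dst / par.relative_to(src)
        target.parent.mkdir(parents=True, exist_ok=True)
        shutil.copy2(par, target)  # copy2 also preserves modification times

if __name__ == "__main__":
    sync_par_files(MASTER, REDUNDANT)
```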
WorkflowSystem
DPE Processors
- Processors run in a farm and are redundant by design.
- Have more than one processor of each type running.
- Distribute processors of one type over more than one machine.
- Processors should connect through the load balancer and not be assigned to one of the DPE Servers directly.
WorkflowWorker
- Option 1 (Standby): Configure more than one WorkflowWorker executing the same workflow types on a fail-over cluster (only one is active at a time).
- Option 2 (Concurrent): Configure more than one WorkflowWorker executing the same workflow types (all are active at the same time; see the sketch below).
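The concurrent option follows the competing-consumers pattern: a job is claimed exactly once, so two active workers never execute the same job. The following sketch illustrates the idea with an in-process queue; that DPE distributes jobs through a shared store such as the DpeCoreDb is an assumption for illustration:

```python
import queue
import threading

jobs: "queue.Queue[str]" = queue.Queue()  # stands in for the shared job store
for j in ("transcode-1", "transcode-2", "transcode-3"):
    jobs.put(j)

def worker(name: str) -> None:
    """Each active worker claims the next pending job; since a job can be
    claimed only once, concurrent workers never run the same job twice."""
    while True:
        try:
            job = jobs.get_nowait()  # atomic claim
        except queue.Empty:
            return
        print(f"{name} executes {job}")

threads = [threading.Thread(target=worker, args=(f"worker-{i}",)) for i in (1, 2)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```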
WorkflowScheduler
- Option 1 (Standby): Configure more than one WorkflowScheduler for the same task on a fail-over cluster (only one is active at a time).
- Option 2 (Concurrent): Configure more than one WorkflowScheduler for the same task (all are active at the same time).
Example: a workflow that hard-deletes entries at night.
For Option 2 we recommend slightly different schedules, e.g. one scheduler at 2:00 a.m. and the other at 2:30 a.m. (see the sketch below).
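The half-hour offset is safe because a hard-delete job is naturally idempotent: whatever the 2:00 a.m. run already deleted, the 2:30 a.m. run simply no longer finds. A minimal sketch (the entry structure and the 30-day retention are assumptions for illustration):

```python
from datetime import datetime, timedelta

def hard_delete_expired(entries: list[dict], now: datetime) -> list[dict]:
    """Drop entries older than 30 days. Running this twice is harmless:
    the second run finds nothing left to delete (idempotent)."""
    cutoff = now - timedelta(days=30)
    return [e for e in entries if e["deleted_at"] > cutoff]

entries = [{"id": 1, "deleted_at": datetime(2024, 1, 1)}]
entries = hard_delete_expired(entries, datetime.now())  # 2:00 a.m. run: deletes it
entries = hard_delete_expired(entries, datetime.now())  # 2:30 a.m. run: no-op
```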
WorkflowTableWatcher
- Option 1 (Standby): Configure more than one WorkflowTableWatcher for the same task on a fail-over cluster (only one is active at a time).
- Challenge with Option 1: WorkflowTableWatcher writes a local memory file that "mirrors" the states of entries (Created, Updated, Deleted); this file is not shared, so after fail-over the standby starts with a stale or missing mirror and may miss changes or detect them again.
- Option 2 (Concurrent): Configure more than one WorkflowTableWatcher for the same task (all are active at the same time).
- Challenge with Option 2: the workflow could be executed twice; avoid this with workflow naming (see the sketch below).
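Our reading of "avoid this by workflow naming" is to derive the workflow name deterministically from the watched entry and its change type, so that both watchers produce the same name and the second start can be rejected as a duplicate. A minimal sketch (the naming scheme and duplicate check are assumptions; in a real deployment the record of started names would have to live in shared storage such as the DpeCoreDb, not in process memory):

```python
started: set[str] = set()  # stands in for a shared record of started workflows

def workflow_name(table: str, entry_id: str, change: str) -> str:
    """Deterministic name: the same entry change always yields the same name."""
    return f"{table}/{entry_id}/{change}"

def start_workflow_once(table: str, entry_id: str, change: str) -> bool:
    """Start the workflow only if no workflow with this name was started yet."""
    name = workflow_name(table, entry_id, change)
    if name in started:
        return False  # a concurrent watcher already started it
    started.add(name)
    print("starting workflow", name)
    return True

# Two concurrent WorkflowTableWatchers react to the same entry update:
start_workflow_once("Audio", "0815", "Updated")  # True: workflow starts
start_workflow_once("Audio", "0815", "Updated")  # False: duplicate rejected
```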
WorkflowFolderWatcher
- Option 1 (Standby): Configure more than one WorkflowFolderWatcher for the same task on a fail-over cluster (only one is active at a time).
- Challenge with Option 1: WorkflowFolderWatcher writes local memory files that "mirror" the states of files (Created, Updated, Deleted); these files are not shared, so after fail-over the standby starts with stale or missing mirrors and may miss changes or detect them again.
- Option 2 (Concurrent): Configure more than one WorkflowFolderWatcher for the same task (all are active at the same time).
- Challenge with Option 2: the workflow could be executed twice; avoid this with workflow naming, as sketched for WorkflowTableWatcher above.
Clients
- All clients of DPE must be able to cope with short-term unavailability of DPE Services.
- A simple way to achieve this is client-side retries (see the sketch below).
- DPE components already support retries.
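A minimal client-side retry sketch with exponential backoff; the function name and limits are illustrative, not the DPE client API:

```python
import time

def call_with_retries(request, attempts: int = 5, base_delay: float = 0.5):
    """Call `request()` and retry on connection failures with exponential
    backoff, bridging short unavailability such as a DPE Server fail-over."""
    for attempt in range(attempts):
        try:
            return request()
        except ConnectionError:
            if attempt == attempts - 1:
                raise  # service stayed unavailable; give up
            time.sleep(base_delay * 2 ** attempt)  # waits 0.5s, 1s, 2s, 4s
```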