This plan outlines the steps to be taken by ORI's Research Application Development (RAD) team in the event of a hardware and/or software failure related to a RAD custom application. Software failures include the failure of the application and any associated third-party products.
Application Failure
Application failure includes any problem with application code or supporting software that prevents the application from being available and functioning as expected.
In order to mitigate the risk of downtime, all RAD-supported applications are thoroughly tested before being released to production.
If a problem is detected involving the application code developed or supported by RAD, we will make the necessary code changes, test the changes and re-release the application to production. Any component that fails to function correctly would likely be either trivial to fix or beyond the scope of our ability to fix, such as a network failure or a bug in a third-party component. RAD is confident that we could fix most application failures in less than two business days. However, if the problem cannot be resolved by RAD, it is difficult to anticipate how long it might take DHTS, OIT, or vendors to address the problem. Nevertheless, per ISO recommendation, RAD has planned for a maximum downtime of one week.
If warranted by the impact of the problem, RAD will notify the user via application notifications, or notify the business owner(s) of an application, who will in turn notify the general user community.
Third-Party Product Failure
Third-party product failure includes any problem with third-party development tools or software (e.g., libraries, plug-ins) that prevents the application from being available and functioning as expected.
To mitigate the risk of downtime, RAD will, whenever possible, upgrade to the latest version of third-party tools within 6 months of a new release. If an upgrade is not feasible, RAD will thoroughly test the third-party product.
If a problem is detected involving a third-party tool or software, we will troubleshoot the problem and take the appropriate action. In the case of a failure that must be resolved outside of RAD, we will contact the appropriate support resource for assistance. Once the problem is diagnosed, the fix will be tested and applied to the production environment. If the downtime exceeds two days, users will be notified to revert to their business continuity plan.
Application Server Failure (web applications)
Application server failure includes any problem with the application server hardware, operating system, or middleware that prevents the application from being available, accessible, and functioning as expected.
All RAD application servers are supported by DHTS, and each has a failover application server. To mitigate the risk of downtime, the failover application server is kept in sync with the production application server.
If a problem is detected with a RAD application server, we will contact the DHTS server team. The server team will typically resolve server problems in less than an hour. If the problem cannot be resolved quickly, the server team will switch to the failover application server. We will use the failover server until the problem with the production server is resolved. In the event of a disaster that impacts multiple DHTS application servers, we should expect a longer response time, as clinical applications will take precedence over RAD applications.
Client/Browser Failure
Client/browser failure includes any problem with an individual workstation or an individual internet browser that prevents the application from being available and functioning as expected.
In order to mitigate the risk of browser incompatibility, all RAD-supported applications are thoroughly tested using current browsers on all current Windows and Apple platforms before being released to production.
If an individual client machine fails, RAD will work with the customer’s IT support team to resolve the problem. If the problem cannot be resolved, the user will need to acquire a compliant machine. If an entire class of machines fails (e.g., Apple), all affected users will be required to acquire a compliant machine (e.g., Microsoft Windows 7).
If a user cannot run a RAD web application in a certain internet browser, RAD will work with the customer’s IT support team to resolve the problem. RAD will ensure that the user is using a supported version of an internet browser, and has the appropriate browser security settings. If the problem cannot be resolved, the user will need to install a compliant browser or find an alternate machine.
Database Failure
Database failure includes any problem with the database, database server, or database connection that prevents the database from being available, accessible, and functioning as expected.
All RAD databases are supported by DHTS, and each has a failover database with a maximum data loss of 15 minutes. To mitigate the risk of data loss, DHTS regularly backs up all databases. In addition, DHTS requires that databases and database servers be regularly upgraded.
If a problem is detected with a RAD database, we will contact the DHTS database team. The database team will typically resolve database problems in less than an hour. If the problem cannot be resolved quickly, the database team will switch to the failover database. We will use the failover database until the problem with the production database is resolved. In the event of a disaster that impacts multiple DHTS databases, we should expect a longer response time, as clinical databases will take precedence over RAD databases.
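The decision described above (wait for a quick repair, otherwise switch to the failover) can be sketched as a small rule. This is only an illustration under stated assumptions: the plan says problems are typically resolved in under an hour but does not define "quickly", so the one-hour cutoff, the function name, and the action labels below are all hypothetical.

```python
from datetime import timedelta

# Assumption: the plan gives no exact cutoff for "quickly"; one hour is used
# here only because problems are typically resolved in under an hour.
QUICK_FIX_WINDOW = timedelta(hours=1)


def next_action(production_healthy: bool, time_since_failure: timedelta) -> str:
    """Return a hypothetical action label for the current state of an outage."""
    if production_healthy:
        # Production is working (or has been repaired): run on production.
        return "use_production"
    if time_since_failure < QUICK_FIX_WINDOW:
        # Still within the window in which the DHTS team usually fixes the problem.
        return "wait_for_repair"
    # Not resolved quickly: run on the failover until production is repaired.
    return "switch_to_failover"
```

For example, `next_action(False, timedelta(minutes=90))` selects the failover, while the same call with `timedelta(minutes=20)` keeps waiting for the repair.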
Data Feed Failure
RAD sends data to and receives data from many other Duke systems (e.g., Enterprise Directory, RCC, OESO). Data feed failure includes any problem with the process of sending or receiving data that prevents RAD databases from having accurate, current information.
If problems occur with a download, we will work with the appropriate support team to resolve the problem. If coding changes are necessary to fix the problem, we will code and test any required changes. We should be able to resolve any load-related problems within two days.
The following types of problems may be encountered:
- If a scheduled job fails to run, the job will be run manually.
- If a script fails to execute, script commands (each step of each script) will be run manually to perform the load. It should be noted that this would be tedious and time-consuming, and would introduce the risk of human error. We could also rewrite the script and/or use a different platform, if necessary. The likelihood of this failure is very low.
- If a feeder system or process is not working, there is no contingency within our control. We will have to wait until data can be provided.
- If incorrect data is fed, we will replace this data with correct data once the problem is fixed.
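Running each step of a failed script by hand, as described above, is error-prone. One way to reduce that risk is to run the steps one at a time through a small helper that stops at the first failure and records what happened. The following is a minimal sketch, not RAD's actual tooling: the function name and the representation of steps as command-line argument lists are assumptions.

```python
import subprocess
from typing import List, Sequence, Tuple


def run_steps_manually(steps: Sequence[Sequence[str]]) -> Tuple[int, List[str]]:
    """Run each command in order, stopping at the first failure.

    Returns the number of steps that succeeded and a log of what happened,
    so the operator knows exactly where to resume.
    """
    log: List[str] = []
    for index, command in enumerate(steps, start=1):
        # Each step is an argument list, e.g. ["python", "load_step.py"] (hypothetical).
        result = subprocess.run(command, capture_output=True, text=True)
        if result.returncode != 0:
            log.append(f"step {index} FAILED (exit {result.returncode}): {' '.join(command)}")
            # Stop here: later steps may depend on this one.
            return index - 1, log
        log.append(f"step {index} ok: {' '.join(command)}")
    return len(steps), log
```

Stopping at the first failure, rather than continuing, avoids loading data on top of an incomplete earlier step.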
Most problems with RAD applications will be reported using one of the following mechanisms:
- A user contacts the Duke Health Technology Solutions (DHTS) or Office of Information Technology (OIT) service desk and indicates that they are having a problem with a RAD application. The service desk submits a ticket assigned to RAD. ServiceNow sends an e-mail to the RAD group and pages the RAD on-call person. If the ticket is not acknowledged within the specified timeframe, the service owner is contacted.
- A user contacts a member of RAD directly via email or telephone.
- A user submits a non-urgent request via the Contact Us link that is available on the RAD public web site.
- A user contacts a business office, which in turn contacts RAD directly.
- An automatic notification is generated by a RAD application, batch process, or monitoring job.
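The service desk mechanism above includes an escalation rule: the RAD group is e-mailed, the on-call person is paged, and the service owner is contacted if the ticket is not acknowledged in time. A minimal sketch of that rule follows. It does not represent ServiceNow's configuration or API; the function name, the contact labels, and the `ack_window` parameter (the plan does not state the timeframe) are assumptions.

```python
from datetime import datetime, timedelta
from typing import List, Optional


def contacts_for_ticket(
    opened_at: datetime,
    acknowledged_at: Optional[datetime],
    now: datetime,
    ack_window: timedelta,
) -> List[str]:
    """List who has been contacted about a ticket as of `now`."""
    # The RAD group and the on-call person are always notified first.
    contacted = ["RAD group e-mail", "RAD on-call page"]
    deadline = opened_at + ack_window
    # Escalate only if the deadline has passed without an acknowledgement
    # arriving on or before it.
    missed = acknowledged_at is None or acknowledged_at > deadline
    if now > deadline and missed:
        contacted.append("service owner")
    return contacted
```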
Once RAD has assumed responsibility for resolving a reported problem, a member of RAD will contact and coordinate with the server team, database team, user, and Service Desk, as necessary, to resolve the problem.
External Business Continuity Plans
It is expected that each area within Duke will have its own Business Continuity Plan (BCP) that clearly documents how it conducts business in the event of a RAD application downtime. This plan should describe the procedures that would be used to continue to do business in the event that the application is inaccessible for an extended period of time. Its purpose is to minimize the disruption of normal business functions.
The following offices are responsible for creating a BCP for their specific area.
- Grants.Duke, SPS, Sponsored Effort, Subrecipients – ORA, ORS
- COI Form, COI Admin, FCOI, Sponsored Travel – RIO, ORS-CCP, DECO
- Research Compliance Tracker, RCC Training Tracker, Research Explorer, Faculty XML – informational only, no BCP required