Plan Objective
This plan outlines the steps to be taken by the OASIS Application Development team in the event of a hardware and/or software failure related to an OASIS custom application. Software failures include the failure of the application and any associated third-party products.
Application Failure
Application failure includes any problem with application code or supporting software that prevents the application from being available and functioning as expected.
In order to mitigate the risk of downtime, all OASIS custom applications are thoroughly tested before being released to production.
If a problem is detected involving the application code developed or supported by OASIS, we will make the necessary code changes, test the changes, and re-release the application to production. In our more than 25 years of experience supporting custom applications, failures have tended to stem from one of the following:
- a bug in production code
- a problem with the middleware components
- a problem with the infrastructure components
In the event of a bug in production code, the appropriate application development team would decide whether to develop and deploy a fix, invoke a workaround, or roll back to the previous version of production code. In this scenario, OASIS is confident that we can fix most application failures within 2 hours to 2 business days.
In the event of a problem with the middleware components, OASIS would work directly with the vendor of the component to resolve the problem.
In the event of a problem with the infrastructure components, OASIS would have to rely on DHTS or OIT to resolve the problem, as this type of issue is beyond the scope of our ability to fix. It is difficult to anticipate how long it might take to fully address the problem. Nevertheless, per ISO recommendation, OASIS and the business owners should plan for a maximum downtime of one week.
If warranted by the impact of the problem, OASIS will notify the business owner(s) of the application failure and the anticipated downtime. The business owner(s) will in turn notify the general user community.
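The three failure categories above, with the party that resolves each and the downtime to plan for, can be summarized as a small lookup. This is an illustrative sketch only: the key names, the `ResponsePlan` type, and the `plan_for` helper are hypothetical, and the middleware entry reflects that this plan states no timeframe for vendor-resolved problems.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ResponsePlan:
    """Who resolves a failure category and what downtime to plan for."""
    resolver: str
    planning_window: str


# Hypothetical keys; the values restate what this plan says for each category.
RESPONSE_PLANS = {
    "production_code_bug": ResponsePlan(
        resolver="OASIS application development team",
        planning_window="2 hours to 2 business days",
    ),
    "middleware": ResponsePlan(
        resolver="OASIS working with the component vendor",
        planning_window="not stated in this plan",
    ),
    "infrastructure": ResponsePlan(
        resolver="DHTS or OIT",
        planning_window="up to one week (ISO recommendation)",
    ),
}


def plan_for(category: str) -> ResponsePlan:
    """Return the response plan for a failure category, or raise ValueError."""
    try:
        return RESPONSE_PLANS[category]
    except KeyError:
        raise ValueError(f"unknown failure category: {category!r}") from None
```

For example, `plan_for("infrastructure").resolver` returns `"DHTS or OIT"`.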
Third-Party Product Failure
Third-party product failure includes any problem with third-party development tools or software (e.g., libraries, plug-ins) that prevents the application from being available and functioning as expected.
To mitigate the risk of downtime, OASIS will, whenever possible, upgrade to the latest version of third-party tools within 6 months of a new release. If an upgrade is not feasible, OASIS will thoroughly test the third-party product and actively seek alternative solutions.
If a problem is detected involving a third-party tool or software, we will troubleshoot the problem and take the appropriate action. In the case of a failure that must be resolved outside of OASIS, we will contact the appropriate support resource for assistance. Once the problem is diagnosed, a fix will be developed, tested, and applied to the production environment. If the downtime exceeds two days, users will be notified to revert to their business continuity plan.
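The two-day notification threshold above can be expressed as a simple check. This is a minimal sketch under stated assumptions: the plan does not say whether "two days" means calendar or business days, so calendar days are assumed here, and the function and constant names are hypothetical.

```python
from datetime import datetime, timedelta

# Assumption: "two days" is read as 48 elapsed hours (calendar time).
BCP_THRESHOLD = timedelta(days=2)


def should_invoke_bcp(outage_start: datetime, now: datetime) -> bool:
    """Return True once the outage has lasted longer than the threshold."""
    return now - outage_start > BCP_THRESHOLD
```

An outage of exactly two days does not yet exceed the threshold, so the check returns `False` until the threshold is passed.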
Application Server Failure (web applications)
Application server failure includes any problem with the application server hardware, operating system, or middleware that prevents the application from being available, accessible, and functioning as expected.
All OASIS application servers are supported by DHTS and have a failover application server. To mitigate the risk of downtime, the failover application server is kept in sync with the production application server.
If a problem is detected with an OASIS application server, we will contact the DHTS server team. The server team will typically resolve server problems in less than an hour. If the problem cannot be resolved quickly, the server team will switch to the failover application server. OASIS will use the failover server until the problem with the production server is resolved. In the event of a disaster that impacts multiple DHTS application servers, we should expect a longer response time, as clinical applications will take precedence over non-clinical applications.
Client/Browser Failure
Client/browser failure includes any problem with an individual workstation/laptop or an individual internet browser that prevents the application from being available and functioning as expected.
In order to mitigate the risk of browser incompatibility, all OASIS-supported applications are thoroughly tested using current browsers on all current Windows and Apple platforms before being released to production.
If an individual client machine fails, OASIS will work with the customer’s IT support team to resolve the problem. If the problem cannot be resolved, the user will need to acquire a compliant machine. If an entire class of machines fails (e.g., Apple), then all affected users will be required to acquire a compliant machine (e.g., Microsoft Windows 7).
If a user cannot run an OASIS application in a certain internet browser, OASIS will work with the customer’s IT support team to resolve the problem. OASIS will ensure that the user is using a supported version of an internet browser, and has the appropriate browser security settings. If the problem cannot be resolved, the user will need to install a compliant browser or find an alternate machine.
Database Failure
Database failure includes any problem with the database, database server, or database connection that prevents the database from being available, accessible, and functioning as expected.
All OASIS databases are supported by DHTS and have a failover database with a maximum 15-minute data loss. To mitigate the risk of data loss, DHTS regularly backs up all databases. In addition, DHTS requires that databases and database servers be regularly upgraded.
If a problem is detected with an OASIS database, we will contact the DHTS database team. The database team will typically resolve database problems in less than an hour. If the problem cannot be resolved quickly, the database team will switch to the failover database. We will use the failover database until the problem with the production database is resolved. In the event of a disaster that impacts multiple DHTS databases, we should expect a longer response time, as clinical databases will take precedence over non-clinical databases.
Data Feed Failure
OASIS sends data to and receives data from many other Duke systems (e.g., Enterprise Directory, RCC, OESO). Data feed failure includes any problem with the process of sending or receiving data that prevents OASIS databases from having accurate, current information.
If problems occur with a download, we will work with the appropriate support team to resolve the problem. If coding changes are necessary to fix the problem, we will code and test any required changes. We should be able to resolve any load-related problems within two days.
The following types of problems may be encountered:
- If a scheduled job fails to run, the job will be run manually.
- If a script fails to execute, script commands (each step of each script) will be run manually to perform the load. This approach is tedious and time-consuming, and it introduces the risk of human error. We could also rewrite the script and/or use a different platform, if necessary. The likelihood of this failure is very low.
- If a feeder system or process is not working, there is no contingency within our control. We will have to wait until data can be provided.
- If incorrect data is fed, we will replace this data with correct data once the problem is fixed.
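The manual fallback for a failed script (running each step individually) can be sketched as a small step runner that stops at the first failure and reports where to resume. This is an illustration only, not the actual OASIS load tooling: the `run_steps` helper and the step names in the usage note are hypothetical.

```python
from typing import Callable, List, Tuple

# A step is a (name, action) pair; the action raises an exception on failure.
Step = Tuple[str, Callable[[], None]]


def run_steps(steps: List[Step], start_at: int = 0) -> int:
    """Run steps in order beginning at start_at.

    Returns the index of the first step that failed, or len(steps) if every
    step from start_at onward succeeded. The returned index can be passed
    back as start_at to resume once the failed step has been corrected.
    """
    for index in range(start_at, len(steps)):
        name, action = steps[index]
        try:
            action()
        except Exception as exc:
            print(f"step {index} ({name}) failed: {exc}")
            return index
        print(f"step {index} ({name}) succeeded")
    return len(steps)
```

With hypothetical steps such as `[("extract", extract), ("transform", transform), ("load", load)]`, a failure in `transform` returns `1`, and calling `run_steps(steps, start_at=1)` after the fix resumes without repeating the extract. Logging each step as it runs gives a record that helps limit the human error the manual procedure introduces.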
Problem Reporting
Most problems with OASIS applications will be reported using one of the following mechanisms:
- A user contacts the Duke Health Technology Solutions (DHTS) or Office of Information Technology (OIT) service desk and indicates that they are having a problem with an OASIS application. The service desk submits a ticket assigned to RAD. ServiceNow sends an e-mail to the RAD group and pages the RAD on-call person. If the ticket is not acknowledged within the specified timeframe, the service owner is contacted.
- A user contacts a member of OASIS directly via email or telephone.
- A user submits a non-urgent request via the Contact Us link that is available on the OASIS public web site.
- A user contacts a business office, which in turn contacts OASIS directly.
- An automatic notification is generated by an OASIS application, batch process, or monitoring job.
Once OASIS has assumed responsibility for resolving a reported problem, a member of OASIS will contact and coordinate with the server team, database team, user, and Service Desk, as necessary, to resolve the problem.
External Business Continuity Plans
It is expected that each area within Duke will have its own Business Continuity Plan (BCP) that clearly documents how the area does business in the event of an OASIS application downtime. This plan should describe the procedures that would be used to continue to do business in the event that the application is inaccessible for an extended period of time. Its purpose is to minimize the disruption of normal business functions.
The following offices are responsible for creating a Business Continuity Plan for their specific area:
- Sponsored Research Apps (e.g., Grants.Duke, SPS) – Office of Research Administration (ORA), Office of Research Support (ORS)
- Scientific Integrity Apps (e.g., Conflict of Interest, Sponsored Travel, Training Trackers) – Duke Office of Scientific Integrity
- myRESEARCHhome – informational only, no BCP required