Thursday, August 29, 2013

Use Corrective Actions to Automate Metric Alert Resolutions

In this demo I show how to configure corrective actions to automate the resolutions of metric alerts.

Corrective actions are processes that can be put in place at the metric level to execute when a threshold is reached.

In our QA database the testers have a bad habit of executing large DML processes with out notifying the DBA team. There processes can quickly generate 30G of archivelog in a 30 minute period which can easily fill up the archivelog area if the correct action is not taken to resolve the issue in a timely matter.

To automate the correct action to take when the archivelog area gets to a certain percent of usage we decided to create a corrective action. The corrective action will execute an RMAN archivelog backup and delete when the archive area used % reaches warning threshold 55%.

Environment

Database- 11.2.0.3 DBTEST1
Enterprise Manager- 12.1.0.2
Archive Area- +FRA01

Steps

Create Corrective Action in Library

1. Log into OEM click Enterprise> Monitoring> Corrective Actions




























2. Use the drop down and select RMAN script then click Go.





















3. Give a name and if you like a description then click Parameters.





















4. Type in the RMAN script box  "backup archivelog all not backed up delete all input;" If you already have a custom RMAN script you can give the full path to the script. Click Credentials.






















5. Set the database and host credentials then click Access.





















6. Set the access level for each account needed to view or make changes to the corrective action. Then click save to library.





















7. The corrective action now has been successfully created in the corrective action library.





















Add Corrective Action to Metric

8. Go to the database homepage of the database you want to apply the corrective action. Then click Oracle Database>Monitoring>Metric and Collection Settings


































9. Find the Archive Area Used % metric and click the pencil icon to the right of the metric.





















10. Set the warning and critical threshold if not already to a reasonable number for you environment. Then click Edit.

For my example I set the warning threshold to 55.





















11. Click the Add button under corrective actions for warning threshold.





















12. Use the drop down and select "from Library" click Continue





















13. Select the name of the corrective action you are going to use in my case I selected the one we created in step 7. Click Continue





















14. Give a name and description for the corrective action then click Parameters





















15. We will keep the existing parameter as set in the library but if you wanted you can make changes as needed. Click Credentials





















16. Set the database and host credentials as needed or leave set as from the library. Click Continue





















17. You will now see that the metric has a corrective action set for warning threshold. Click Continue





















18. You can see the corrective action now has warning only meaning the corrective action will only execute when the warning threshold is hit. Click Continue





















19. Click OK to save the changes.






















Review Metric

20. Click Oracle Database>Monitoring>All Metric


































21. Expand the Archive Area an select Archive Area Used (%). We can see that currently the severity is green and the current % usage for archive area is 53%.






















Check Current Usage

22. We can see that currently the FRA01 disk group is 54% used.















Load Kicks Off

23. I kick off my simple insert from select in a loop and begin to generate redo / archivelog. This begins to use more space in the archive area which is in the FRA01 diskgroup.

















Check Archive Area Usage

24. We check the FRA01 diskgroup and we can now see that 57% is used.















Check Metric Alert

25. Our metric is now in a warning state since it has past the warning threshold of 55%.  Click the eye glasses under details.






















26. We can see that the corrective action has executed at 8:47.






















Check RMAN Process
27. At the OS level I grep for the RMAN process and we can see that RMAN process started at 8:49.


















28. In the alert log we can see that at 8:50 RMAN issues the alter system archive log.
















Check Archive Area Usage After Execution

29. We can see that the FRA01 diskgroup usage is back down to 51%.















Check Metric After Execution

30. We can see that our Archive Area Used (%) metric is now green and below the 55% warning threshold.






















Conclusion 

Using corrective actions allows DBAs to be more proactive in implementing solutions. Corrective actions are also available in OEM version 10g and 11g.

Note:
You may ask why would QA be in archivelog mode due to go-live time line we can not take the chance of not being able to recover the database quickly to get the QA testers back in the system to continue testing. In this case we have setup the database in archivelog mode and execute nightly full backups.

3 comments:

  1. Thank you for this nice step by step

    ReplyDelete
  2. Hi,

    When We try to create corrective action to restart mrp if apply lag is observed corrective action is triggered but gets failed.

    SQL*Plus: Release 11.1.0.7.0 - Production on Wed Feb 10 03:46:28 2016

    Copyright (c) 1982, 2008, Oracle. All rights reserved.

    SQL> SQL> Connected to an idle instance.
    SQL> SQL> SQL> SQL> SQL> SQL> alter database recover managed standby database disconnect from session
    *
    ERROR at line 1:
    ORA-01034: ORACLE not available
    Process ID: 0
    Session ID: 0 Serial number: 0

    ReplyDelete