了解你所不知道的SMON功能(一):清理臨時段
SMON(system monitor process)系統監控后臺進程,有時候也被叫做system cleanup process,這么叫的原因是它負責完成很多清理(cleanup)任務。但凡學習過Oracle基礎知識的技術人員都會或多或少對該background process的功能有所了解。
曾幾何時對SMON功能的了解程度可以作為評判一位DBA理論知識的重要因素,至今仍有很多公司在DBA面試中會問到SMON有哪些功能這樣的問題。首先這是一道開放式的題目,并不會奢求面試者能夠打全(答全幾乎是不可能的,即便是在你閱讀本篇文章之后),答出多少可以作為知識廣度的評判依據(如果面試人特意為這題準備過,那么也很好,說明他已經能系統地考慮問題了),接著還可以就具體的某一個功能說開去,來了解面試者的知識深度,當然這扯遠了。
我們所熟知的SMON是個兢兢業業的家伙,它負責完成一些列系統級別的任務。與PMON(Process Monitor)后臺進程不同的是,SMON負責完成更多和整體系統相關的工作,這導致它會去做一些不知名的”累活”,當系統頻繁產生這些”垃圾任務”,則SMON可能忙不過來。因此在10g中SMON變得有一點懶惰了,如果它在短期內接收到過多的工作通知(SMON: system monitor process posted),那么它可能選擇消極怠工以便讓自己不要過于繁忙(SMON: Posted too frequently, trans recovery disabled),之后會詳細介紹。
SMON的主要作用包括:
1.清理臨時段(SMON cleanup temporary segments)
觸發場景
很多人錯誤地理解了這里所說的臨時段temporary segments,認為temporary segments是指temporary tablespace臨時表空間上的排序臨時段(sort segment)。事實上這里的臨時段主要指的是永久表空間(permanent tablespace)上的臨時段,當然臨時表空間上的temporary segments也是由SMON來清理(cleanup)的,但這種清理僅發生在數據庫實例啟動時(instance startup)。
永久表空間上同樣存在臨時段,譬如當我們在某個永久表空間上使用create table/index等DDL命令創建某個表/索引時,服務進程一開始會在指定的永久表空間上分配足夠多的區間(Extents),這些區間在命令結束之前都是臨時的(Temporary Extents),直到表/索引完全建成才將該temporary segment轉換為permanent segment。另外當使用drop命令刪除某個段時,也會先將該段率先轉換為temporary segment,之后再來清理該temporary segment(DROP object converts the segment to temporary and then cleans up the temporary segment)。 常規情況下清理工作遵循誰創建temporary segment,誰負責清理的原則。換句話說,因服務進程rebuild index所產生的temporary segment在rebuild完成后應由服務進程自行負責清理。一旦服務進程在成功清理temporary segment之前就意外終止了,亦或者服務進程在工作過程中遇到了某些ORA-錯誤導致語句失敗,那么SMON都會被要求(posted)負責完成temporary segment的清理工作。
對于永久表空間上的temporary segment,SMON會三分鐘清理一次(前提是接到post),如果SMON過于繁忙那么可能temporary segment長期不被清理。temporary segment長期不被清理可能造成一個典型的問題是:在rebuild index online失敗后,后續執行的rebuild index命令要求之前產生的temporary segment已被cleanup,如果cleanup沒有完成那么就需要一直等下去。在10gR2中我們可以使用dbms_repair.online_index_clean來手動清理online index rebuild的遺留問題:
The dbms_repair.online_index_clean function has been created to cleanup online index rebuilds. Use the dbms_repair.online_index_clean function to resolve the issue. Please note if you are unable to run the dbms_repair.online_index_clean function it is due to the fact that you have not installed the patch for Bug 3805539 or are not running on a release that includes this fix. The fix for this bug is a new function in the dbms_repair package called dbms_repair.online_index_clean, which has been created to cleanup online index [[sub]partition] [re]builds. New functionality is not allowed in patchsets; therefore, this is not available in a patchset but is available in 10gR2. Check your patch list to verify the database is patched for Bug 3805539 using the following command and patch for the bug if it is not listed: opatch lsinventory -detail Cleanup after a failed online index [re]build can be slow to occurpreventing subsequent such operations until the cleanup has occured.
接著我們通過實踐來看一下smon是如何清理永久表空間上的temporary segment的:
設置10500事件以跟蹤smon進程,這個診斷事件后面會介紹 SQL> alter system set events '10500 trace name context forever,level 10'; System altered. 在第一個會話中執行create table命令,這將產生一定量的Temorary Extents SQL> create table smon as select * from ymon; 在另一個會話中執行對DBA_EXTENTS視圖的查詢,可以發現產生了多少臨時區間 SQL> SELECT COUNT(*) FROM DBA_EXTENTS WHERE SEGMENT_TYPE='TEMPORARY'; COUNT(*) ---------- 117 終止以上create table的session,等待一段時間后觀察smon后臺進程的trc可以發現以下信息: *** 2011-06-07 21:18:39.817 SMON: system monitor process posted msgflag:0x0200 (-/-/-/-/TMPSDROP/-/-) *** 2011-06-07 21:18:39.818 SMON: Posted, but not for trans recovery, so skip it. *** 2011-06-07 21:18:39.818 SMON: clean up temp segments in slave SQL> SELECT COUNT(*) FROM DBA_EXTENTS WHERE SEGMENT_TYPE='TEMPORARY'; COUNT(*) ---------- 0 可以看到smon通過slave進程完成了對temporary segment的清理
與永久表空間上的臨時段不同,出于性能的考慮臨時表空間上的Extents并不在操作(operations)完成后立即被釋放和歸還。相反,這些Temporary Extents會被標記為可用,以便用于下一次的排序操作。SMON仍會清理這些Temporary segments,但這種清理僅發生在實例啟動時(instance startup):
For performance issues, extents in TEMPORARY tablespaces are not released ordeallocated once the operation is complete.Instead, the extent is simply marked as available for the next sort operation. SMON cleans up the segments at startup. A sort segment is created by the first statement that used a TEMPORARY tablespacefor sorting, after startup. A sort segment created in a TEMPOARY tablespace is only released at shutdown. The large number of EXTENTS is caused when the STORAGE clause has been incorrectly calculated.
現象
可以通過以下查詢了解數據庫中Temporary Extent的總數,在一定時間內比較其總數,若有所減少那么說明SMON正在清理Temporary segment
SELECT COUNT(*) FROM DBA_EXTENTS WHERE SEGMENT_TYPE='TEMPORARY';
也可以通過v$sysstat視圖中的”SMON posted for dropping temp segment”事件統計信息來了解SMON收到清理要求的情況:
SQL> select name,value from v$sysstat where name like '%SMON%'; NAME VALUE ---------------------------------------------------------------- ---------- total number of times SMON posted 8 SMON posted for undo segment recovery 0 SMON posted for txn recovery for other instances 0 SMON posted for instance recovery 0 SMON posted for undo segment shrink 0 SMON posted for dropping temp segment 1
另外在清理過程中SMON會長期持有Space Transacton(ST)隊列鎖,其他會話可能因為得不到ST鎖而等待超時出現ORA-01575錯誤:
01575, 00000, "timeout waiting for space management resource" // *Cause: failed to acquire necessary resource to do space management. // *Action: Retry the operation.
如何禁止SMON清理臨時段
可以通過設置診斷事件event=’10061 trace name context forever, level 10′禁用SMON清理臨時段(disable SMON from cleaning temp segments)。
alter system set events '10061 trace name context forever, level 10';
了解你所不知道的SMON功能(二):合并空閑區間
SMON的作用還包括合并空閑區間(coalesces free extent)
觸發場景
早期Oracle采用DMT字典管理表空間,不同于今時今日的LMT本地管理方式,DMT下通過對FET$和UET$2張字典基表的遞歸操作來管理區間。SMON每5分鐘(SMON wakes itself every 5 minutes and checks for tablespaces with default pctincrease != 0)會自發地去檢查哪些默認存儲參數pctincrease不等于0的字典管理表空間,注意這種清理工作是針對DMT的,而LMT則無需合并。SMON對這些DMT表空間上的連續相鄰的空閑Extents實施coalesce操作以合并成一個更大的空閑Extent,這同時也意味著SMON需要維護FET$字典基表。
現象
以下查詢可以檢查數據庫中空閑Extents的總數,如果這個總數在持續減少那么說明SMON正在coalesce free space:
SELECT COUNT(*) FROM DBA_FREE_SPACE;在合并區間時SMON需要排他地(exclusive)持有ST(Space Transaction)隊列鎖, 其他會話可能因為得不到ST鎖而等待超時出現ORA-01575錯誤。同時SMON可能在繁瑣的coalesce操作中消耗100%的CPU。
如何禁止SMON合并空閑區間
可以通過設置診斷事件event=’10269 trace name context forever, level 10′來禁用SMON合并空閑區間(Don’t do coalesces of free space in SMON)
10269, 00000, "Don't do coalesces of free space in SMON" // *Cause: setting this event prevents SMON from doing free space coalesces alter system set events '10269 trace name context forever, level 10';
了解你所不知道的SMON功能(三):清理obj$基表
SMON的作用還包括清理obj$數據字典基表(cleanup obj$)
OBJ$字典基表是Oracle Bootstarp啟動自舉的重要對象之一:
SQL> set linesize 80 ; SQL> select sql_text from bootstrap$ where sql_text like 'CREATE TABLE OBJ$%'; SQL_TEXT -------------------------------------------------------------------------------- CREATE TABLE OBJ$ ("OBJ#" NUMBER NOT NULL,"DATAOBJ#" NUMBER,"OWNER#" NUMBER NOT N ULL,"NAME" VARCHAR2(30) NOT NULL,"NAMESPACE" NUMBER NOT NULL,"SUBNAME" VARCHAR2( 30),"TYPE#" NUMBER NOT NULL,"CTIME" DATE NOT NULL,"MTIME" DATE NOT NULL,"STIME" DATE NOT NULL,"STATUS" NUMBER NOT NULL,"REMOTEOWNER" VARCHAR2(30),"LINKNAME" VAR CHAR2(128),"FLAGS" NUMBER,"OID$" RAW(16),"SPARE1" NUMBER,"SPARE2" NUMBER,"SPARE3 " NUMBER,"SPARE4" VARCHAR2(1000),"SPARE5" VARCHAR2(1000),"SPARE6" DATE) PCTFREE 10 PCTUSED 40 INITRANS 1 MAXTRANS 255 STORAGE ( INITIAL 16K NEXT 1024K MINEXTEN TS 1 MAXEXTENTS 2147483645 PCTINCREASE 0 OBJNO 18 EXTENTS (FILE 1 BLOCK 121))
觸發場景
OBJ$基表是一張低級數據字典表,該表幾乎對庫中的每個對象(表、索引、包、視圖等)都包含有一行記錄。很多情況下,這些條目所代表的對象是不存在的對象(non-existent),引起這種現象的一種可能的原因是對象本身已經被從數據庫中刪除了,但是對象條目仍被保留下來以滿足消極依賴機制(negative dependency)。因為這些條目的存在會導致OBJ$表不斷膨脹,這時就需要由SMON進程來刪除這些不再需要的行。SMON會在實例啟動(after startup of DB is started cleanup function again)時以及啟動后的每12個小時執行一次清理任務(the cleanup is scheduled to run after startup and then every 12 hours)。
我們可以通過以下演示來了解SMON清理obj$的過程:
SQL> BEGIN 2 FOR i IN 1 .. 5000 LOOP 3 execute immediate ('create synonym gustav' || i || ' for 4 perfstat.sometable'); 5 execute immediate ('drop synonym gustav' || i ); 6 END LOOP; 7 END; 8 / PL/SQL procedure successfully completed. SQL> startup force; ORACLE instance started. Total System Global Area 1065353216 bytes Fixed Size 2089336 bytes Variable Size 486542984 bytes Database Buffers 570425344 bytes Redo Buffers 6295552 bytes Database mounted. Database opened. SQL> select count(*) from user$ u, obj$ o 2 where u.user# (+)=o.owner# and o.type#=10 and not exists 3 (select p_obj# from dependency$ where p_obj# = o.obj#); COUNT(*) ---------- 5000 SQL> / COUNT(*) ---------- 5000 SQL> / COUNT(*) ---------- 4951 SQL> oradebug setospid 18457; Oracle pid: 8, Unix process pid: 18457, image: oracle@rh2.oracle.com (SMON) SQL> oradebug event 10046 trace name context forever ,level 1; Statement processed. SQL> oradebug tracefile_name; /s01/admin/G10R2/bdump/g10r2_smon_18457.trc select o.owner#, o.obj#, decode(o.linkname, null, decode(u.name, null, 'SYS', u.name), o.remoteowner), o.name, o.linkname, o.namespace, o.subname from user$ u, obj$ o where u.use r#(+) = o.owner# and o.type# = :1 and not exists (select p_obj# from dependency$ where p_obj# = o.obj#) order by o.obj# for update select null from obj$ where obj# = :1 and type# = :2 and obj# not in (select p_obj# from dependency$ where p_obj# = obj$.obj#) delete from obj$ where obj# = :1 /* 刪除過程其實較為復雜,可能要刪除多個字典基表上的記錄 */現象
我們可以通過以下查詢來了解obj$基表中NON-EXISTENT對象的條目總數(type#=10),若這個總數在不斷減少說明smon正在執行清理工作
select trunc(mtime), substr(name, 1, 3) name, count(*) from obj$ where type# = 10 and not exists (select * from dependency$ where obj# = p_obj#) group by trunc(mtime), substr(name, 1, 3); select count(*) from user$ u, obj$ o where u.user#(+) = o.owner# and o.type# = 10 and not exists (select p_obj# from dependency$ where p_obj# = o.obj#);如何禁止SMON清理obj$基表
我們可以通過設置診斷事件event=’10052 trace name context forever’來禁止SMON清理obj$基表,當我們需要避免SMON因cleanup obj$的相關代碼而意外終止或spin從而開展進一步的診斷時可以設置該診斷事件。在Oracle并行服務器或RAC環境中,也可以設置該事件來保證只有特定的某個節點來執行清理工作。
10052, 00000, "don't clean up obj$" alter system set events '10052 trace name context forever, level 65535'; Problem Description: We are receiving the below warning during db startup: WARNING: kqlclo() has detected the following : Non-existent object 37336 NOT deleted because an object of the same name exists already. Object name: PUBLIC.USER$ This is caused by the SMON trying to cleanup the SYS.OJB$. SMON cleans all dropped objects which have a SYS.OBJ$.TYPE#=10. This can happen very often when you create an object that have the same name as a public synonym. When SMON is trying to remove non-existent objects and fails because there are duplicates, multiple nonexistent objects with same name. This query will returned many objects with same name under SYS schema: select o.name,u.user# from user$ u, obj$ o where u.user# (+)=o.owner# and o.type#=10 and not exists (select p_obj# from dependency$ where p_obj# = o.obj#); To cleanup this message: Take a full backup of the database - this is crucial. If anything goes wrong during this procedure, your only option would be to restore from backup, so make sure you have a good backup before proceeding. We suggest a COLD backup. If you plan to use a HOT backup, you will have to restore point in time if any problem happens Normally DML against dictionary objects is unsupported, but in this case we know exactly what the type of corruption, also you are instructing to do this under guidance from Support. Data dictionary patching must be done by an experienced DBA. This solution is unsupported. It means that if there were problems after applying this solution, a database backup must be restored. 1. Set event 10052 at parameter file to disable cleanup of OBJ$ by SMON EVENT="10052 trace name context forever, level 65535" 2. Startup database in restricted mode 3. Delete from OBJ$, COMMIT SQL> delete from obj$ where (name,owner#) in ( select o.name,u.user# from user$ u, obj$ o where u.user# (+)=o.owner# and o.type#=10 and not exists (select p_obj# from dependency$ where p_obj# = o.obj#) ); SQL> commit; SQL> Shutdown abort. 4. remove event 10052 from init.ora 5. Restart the database and monitor for the message in the ALERT LOG file
了解你所不知道的SMON功能(四):維護col_usage$字典基表
SMON的作用還包括維護col_usage$列監控統計信息基表。
最早在9i中引入了col_usage$字典基表,其目的在于監控column在SQL語句作為predicate的情況,col_usage$的出現完善了CBO中柱狀圖自動收集的機制。
create table col_usage$ ( obj# number, /* object number */ intcol# number, /* internal column number */ equality_preds number, /* equality predicates */ equijoin_preds number, /* equijoin predicates */ nonequijoin_preds number, /* nonequijoin predicates */ range_preds number, /* range predicates */ like_preds number, /* (not) like predicates */ null_preds number, /* (not) null predicates */ timestamp date /* timestamp of last time this row was changed */ ) storage (initial 200K next 100k maxextents unlimited pctincrease 0) / create unique index i_col_usage$ on col_usage$(obj#,intcol#) storage (maxextents unlimited) /在10g中我們默認使用’FOR ALL COLUMNS SIZE AUTO’的柱狀圖收集模式,而在9i中默認是’SIZE 1′即默認不收集柱狀圖,這導致許多9i中正常運行的應用程序在10g中CBO執行計劃異常,詳見 <dbms_stats收集模式在9i和10g上的區別> ;?!疭IZE AUTO’意為由Oracle自動決定是否收集柱狀圖及柱狀圖的桶數,Oracle自行判斷的依據就來源于col_usage$字典基表,若表上的某一列曾在硬解析(hard parse)過的SQL語句中充當過predicate(通俗的說就是where后的condition)的話,我們認為此列上有收集柱狀圖的必要,那么col_usage$上就會被加入該列曾充當predicate的記錄。當DBMS_STATS.GATHER_TABLE_STATS存儲過程以’SIZE AUTO’模式執行時,收集進程會檢查col_usage$基表以判斷哪些列之前曾充當過predicate,若充當過則說明該列有收集柱狀圖的價值。
SMON會每15分鐘將shared pool中的predicate columns的數據刷新到col_usage$基表中(until periodically about every 15 minutes SMON flush the data into the data dictionary),另外當instance shutdown時SMON會掃描col_usage$并找出已被drop表的相關predicate columns記錄,并刪除這部分”orphaned”孤兒記錄。
我們來具體了解col_usage$的填充過程:
SQL> select * from v$version; BANNER ---------------------------------------------------------------- Oracle Database 10g Enterprise Edition Release 10.2.0.4.0 - 64bi PL/SQL Release 10.2.0.4.0 - Production CORE 10.2.0.4.0 Production TNS for Linux: Version 10.2.0.4.0 - Production NLSRTL Version 10.2.0.4.0 - Production SQL> select * from global_name; GLOBAL_NAME -------------------------------------------------------------------------------- www.oracledatabase12g.com SQL> create table maclean (t1 int); Table created. SQL> select object_id from dba_objects where object_name='MACLEAN'; OBJECT_ID ---------- 1323013 SQL> select * from maclean where t1=1; no rows selected SQL> set linesize 200 pagesize 2000; 注意col_usage$的數據同*_tab_modifications類似, 從查詢到數據刷新到col_usage$存在一段時間的延遲, 所以我們立即查詢col_usage$將得不到任何記錄, 可以手動執行DBMS_STATS.FLUSH_DATABASE_MONITORING_INFO將緩存中的信息刷新到字典上 SQL> select * from col_usage$ where obj#=1323013; no rows selected SQL> oradebug setmypid; Statement processed. 針對FLUSH_DATABASE_MONITORING_INFO填充操作做10046 level 12 trace SQL> oradebug event 10046 trace name context forever,level 12; Statement processed. SQL> exec DBMS_STATS.FLUSH_DATABASE_MONITORING_INFO; PL/SQL procedure successfully completed. SQL> select * from col_usage$ where obj#=1323013; OBJ# INTCOL# EQUALITY_PREDS EQUIJOIN_PREDS NONEQUIJOIN_PREDS RANGE_PREDS LIKE_PREDS NULL_PREDS TIMESTAMP ---------- ---------- -------------- -------------- ----------------- ----------- ---------- ---------- --------- 1323013 1 1 0 0 0 0 0 19-AUG-11 =============10046 trace content==================== lock table sys.col_usage$ in exclusive mode nowait 在測試中可以發現10.2.0.4上DBMS_STATS.FLUSH_DATABASE_MONITORING_INFO存儲過程會優先使用 lock in exclusive mode nowait來鎖住col_usage$基表, 如果lock失敗則會反復嘗試1100次, 若仍不能鎖住col_usage$表則放棄更新col_usage$上的數據,避免造成鎖等待和死鎖。 Cksxm.c Monitor Modification Hash Table Base modification hash table entry modification hash table chunk monitoring column usage element ksxmlock_1 lock table sys.col_usage$ in exclusive mode lock table sys.col_usage$ in exclusive mode nowait update sys.col_usage$ set equality_preds = equality_preds + decode(bitand(:flag, 1), 0, 0, 1), equijoin_preds = equijoin_preds + decode(bitand(:flag, 2), 0, 0, 1), nonequijoin_preds = nonequijoin_preds + decode(bitand(:flag, 4), 0, 0, 1), range_preds = range_preds + decode(bitand(:flag, 8), 0, 0, 1), like_preds = like_preds + decode(bitand(:flag, 16), 0, 0, 1), null_preds = null_preds + decode(bitand(:flag, 32), 0, 0, 1), timestamp = :time where obj# = :ob jn and intcol# = :coln insert into sys.col_usage$ (obj#, intcol#, equality_preds, equijoin_preds, nonequijoin_preds, range_preds, like_preds, null_preds, timestamp) values (:objn, :coln, decode(bitand(:flag, 1), 0, 0, 1), decode(bitand(:flag, 2), 0, 0, 1), decode(bitand(:flag, 4), 0, 0, 1), decode(bitand(:flag, 8), 0, 0, 1), decode(bitand(:flag, 16), 0, 0, 1), decode(bitand(:flag, 32), 0, 0, 1), :time)使用dbms_stats的’SIZE AUTO’模式收集表上的統計信息會首先參考col_usage$中的predicate columns記錄:
SQL> begin 2 3 dbms_stats.gather_table_stats(ownname => 'SYS', 4 tabname => 'MACLEAN', 5 method_opt => 'FOR ALL COLUMNS SIZE AUTO'); 6 end; 7 / PL/SQL procedure successfully completed. ============10046 level 12 trace content====================== SELECT /*+ ordered use_nl(o c cu h) index(u i_user1) index(o i_obj2) index(ci_obj#) index(cu i_col_usage$) index(h i_hh_obj#_intcol#) */ C.NAME COL_NAME, C.TYPE# COL_TYPE, C.CHARSETFORM COL_CSF, C.DEFAULT$ COL_DEF, C.NULL$ COL_NULL, C.PROPERTY COL_PROP, C.COL # COL_UNUM, C.INTCOL# COL_INUM, C.OBJ# COL_OBJ, C.SCALE COL_SCALE, H.BUCKET_CNT H_BCNT, (T.ROWCNT - H.NULL_CNT) / GREATEST(H.DISTCNT, 1) H_PFREQ, C.LENGTH COL_LEN, CU.TIMES TAMP CU_TIME, CU.EQUALITY_PREDS CU_EP, CU.EQUIJOIN_PREDS CU_EJP, CU.RANGE_PREDS CU_RP, CU.LIKE_PREDS CU_LP, CU.NONEQUIJOIN_PREDS CU_NEJP, CU.NULL_PREDS NP FROM SYS.USE R$ U, SYS.OBJ$ O, SYS.TAB$ T, SYS.COL$ C, SYS.COL_USAGE$ CU, SYS.HIST_HEAD$ H WHERE :B3 = '0' AND U.NAME = :B2 AND O.OWNER# = U.USER# AND O.TYPE# = 2 AND O.NAME = :B1 AND O.OBJ# = T.OBJ# AND O.OBJ# = C.OBJ# AND C.OBJ# = CU.OBJ#(+) AND C.INTCOL# = CU.INTCOL#(+) AND C.OBJ# = H.OBJ#(+) AND C.INTCOL# = H.INTCOL#(+) UNION ALL SELECT /*+ ordered use_nl(c) */ C.KQFCONAM COL_NAME, C.KQFCODTY COL_TYPE, DECODE(C.KQFCODTY, 1, 1, 0) COL_CSF, NULL COL_DEF, 0 COL_NULL, 0 COL_PROP, C.KQFCOCNO COL_UNUM, C.KQFCOC NO COL_INUM, O.KQFTAOBJ COL_OBJ, DECODE(C.KQFCODTY, 2, -127, 0) COL_SCALE, H.BUCKET_CNT H_BCNT, (ST.ROWCNT - NULL_CNT) / GREATEST(H.DISTCNT, 1) H_PFREQ, DECODE(C.KQFCODTY, 2, 22, C.KQFCOSIZ) COL_LEN, CU.TIMESTAMP CU_TIME, CU.EQUALITY_PREDS CU_EP, CU.EQUIJOIN_PREDS CU_EJP, CU.RANGE_PREDS CU_RP, CU.LIKE_PREDS CU_LP, CU.NONEQUIJOIN_PREDS CU _NEJP, CU.NULL_PREDS NP FROM SYS.X$KQFTA O, SYS.TAB_STATS$ ST, SYS.X$KQFCO C, SYS.COL_USAGE$ CU, SYS.HIST_HEAD$ H WHERE :B3 != '0' AND :B2 = 'SYS' AND O.KQFTANAM = :B1 AND O.KQFTAOBJ = ST.OBJ#(+) AND O.KQFTAOBJ = C.KQFCOTOB AND C.KQFCOTOB = CU.OBJ#(+) AND C.KQFCOCNO = CU.INTCOL#(+) AND C.KQFCOTOB = H.OBJ#(+) AND C.KQFCOCNO = H.INTCO L#(+)現象
根據Metalink Note<Database Shutdown Immediate Takes Forever, Can Only Do Shutdown Abort [ID 332177.1]>:
Database Shutdown Immediate Takes Forever, Can Only Do Shutdown Abort [ID 332177.1] Applies to: Oracle Server - Enterprise Edition - Version: 9.2.0.4.0 This problem can occur on any platform. Symptoms The database is not shutting down for a considerable time when you issue the command : shutdown immediate To shut it down in a reasonable time you have to issue the command shutdown abort To collect some diagnostics before issuing the shutdown immediate command set a trace event as follows: Connect as SYS (/ as sysdba) SQL> alter session set events '10046 trace name context forever,level 12'; SQL> shutdown immediate; In the resultant trace file (within the udump directory) you see something similar to the following :- PARSING IN CURSOR #n delete from sys.col_usage$ c where not exists (select 1 from sys.obj$ o where o.obj# = c.obj# ) ...followed by loads of..... WAIT #2: nam='db file sequential read' ela= 23424 p1=1 p2=4073 p3=1 .... WAIT #2: nam='db file scattered read' ela= 1558 p1=1 p2=44161 p3=8 etc Then eventually WAIT #2: nam='log file sync' ela= 32535 p1=4111 p2=0 p3=0 ...some other SQL....then back to WAIT #2: nam='db file sequential read' ela= 205 p1=1 p2=107925 p3=1 WAIT #2: nam='db file sequential read' ela= 1212 p1=1 p2=107926 p3=1 WAIT #2: nam='db file sequential read' ela= 212 p1=1 p2=107927 p3=1 WAIT #2: nam='db file scattered read' ela= 1861 p1=1 p2=102625 p3=8 etc.... To verify which objects are involved here you can use a couple of the P1 & P2 values from above :- a) a sequential read SELECT owner,segment_name,segment_type FROM dba_extents WHERE file_id=1 AND 107927 BETWEEN block_id AND block_id + blocks b) a scattered read SELECT owner,segment_name,segment_type FROM dba_extents WHERE file_id=1 AND 102625 BETWEEN block_id AND block_id + blocks The output confirms that the objects are SYS.I_COL_USAGE$ (INDEX) and SYS.COL_USAGE$ (TABLE) Finally, issue select count(*) from sys.col_usage$; Cause If the number of entries in sys.col_usage$ is large then you are very probably hitting the issue raised in Bug: 3540022 9.2.0.4.0 RDBMS Base Bug 3221945 Abstract: CLEAN-UP OF ENTRIES IN COL_USAGE$ Base Bug 3221945 9.2.0.3 RDBMS Abstract: ORA-1631 ON COL_USAGE$ Closed as "Not a Bug" However, when a table is dropped, the column usage statistics are not dropped. They are left as they are. When the database is shutdown (in normal mode), then these "orphaned" column usage entries are deleted. The code which does this gets called only during normal shutdown. Unless and until the database is shutdown, the col_usage$ table will continue to grow. Solution To implement the workaround, please execute the following steps: 1. Periodically (eg once a day) run exec DBMS_STATS.FLUSH_DATABASE_MONITORING_INFO; DBMS_STATS.FLUSH_DATABASE_MONITORING_INFO will clean out redundant col_usage$ entries, and when you come to shutdown the database you should not have a huge number of entries left to clean up.該文檔指出了在shutdown instance時SMON會著手清理col_usage$中已被drop表的相關predicate columns的”orphaned”記錄,如果在本次實例的生命周期中曾生成大量最后被drop的中間表,那么col_usage$中已經堆積了眾多的”orphaned”記錄,SMON為了完成cleanup工作需要花費大量時間導致shutdown變慢。這個文檔還指出定期執行DBMS_STATS.FLUSH_DATABASE_MONITORING_INFO也可以清理col_usage$中的冗余記錄。
我們來觀察一下SMON的清理工作:
begin for i in 1 .. 5000 loop execute immediate 'create table maclean1' || i ||' tablespace fragment as select 1 t1 from dual'; execute immediate 'select * from maclean1' || i || ' where t1=1'; end loop; DBMS_STATS.FLUSH_DATABASE_MONITORING_INFO; for i in 1 .. 5000 loop execute immediate 'drop table maclean1' || i; end loop; end; / SQL> purge dba_recyclebin; DBA Recyclebin purged. 我們可以通過以下查詢了解col_usage$上的orphaned記錄總數,這也將是在instance shutdown時 SMON所需要清理的數目 select count(*) from sys.col_usage$ c where not exists (select /*+ unnest */ 1 from sys.obj$ o where o.obj# = c.obj#); COUNT(*) ---------- 10224 針對SMON做10046 level 12 trace SQL> oradebug setospid 30225; Oracle pid: 8, Unix process pid: 30225, image: oracle@rh2.oracle.com (SMON) SQL> oradebug event 10046 trace name context forever,level 12; Statement processed. SQL> shutdown immediate; =================10046 trace content================== lock table sys.col_usage$ in exclusive mode nowait delete from sys.col_usage$ where obj#= :1 and intcol#= :2 delete from sys.col_usage$ c where not exists (select /*+ unnest */ 1 from sys.obj$ o where o.obj# = c.obj#)如何禁止SMON維護col_usage$字典基表
1.設置隱藏參數_column_tracking_level(column usage tracking),該參數默認為1即啟用column使用情況跟蹤。設置該參數為0,將禁用column tracking,該參數可以在session和system級別動態修改:
SQL> col name for a25 SQL> col DESCRIB for a25 SQL> SELECT x.ksppinm NAME, y.ksppstvl VALUE, x.ksppdesc describ 2 FROM SYS.x$ksppi x, SYS.x$ksppcv y 3 WHERE x.inst_id = USERENV ('Instance') 4 AND y.inst_id = USERENV ('Instance') 5 AND x.indx = y.indx 6 AND x.ksppinm LIKE '%_column_tracking_level%'; NAME VALUE DESCRIB ------------------------- ---------- ------------------------- _column_tracking_level 1 column usage tracking SQL> alter session set "_column_tracking_level"=0 ; Session altered. SQL> alter system set "_column_tracking_level"=0 scope=both; System altered.2.關閉DML monitoring,可以通過設置隱藏參數_dml_monitoring_enabled(enable modification monitoring)為false實現,disable dml monitoring對CBO的影響較大,所以我們一般推薦上一種方式:
SQL> SELECT monitoring, count(*) from DBA_TABLES group by monitoring; MON COUNT(*) --- ---------- NO 79 YES 2206 SQL> alter system set "_dml_monitoring_enabled"=false; System altered. SQL> SELECT monitoring, count(*) from DBA_TABLES group by monitoring; MON COUNT(*) --- ---------- NO 2285 實際上dba_tables的monitoring列來源于內部參數_dml_monitoring_enabled SQL> set long 99999 SQL> select text from dba_views where view_name='DBA_TABLES'; TEXT -------------------------------------------------------------------------------- select u.name, o.name, decode(bitand(t.property,2151678048), 0, ts.name, null), decode(bitand(t.property, 1024), 0, null, co.name), decode((bitand(t.property, 512)+bitand(t.flags, 536870912)), 0, null, co.name), decode(bitand(t.trigflag, 1073741824), 1073741824, 'UNUSABLE', 'VALID'), decode(bitand(t.property, 32+64), 0, mod(t.pctfree$, 100), 64, 0, null), decode(bitand(ts.flags, 32), 32, to_number(NULL), decode(bitand(t.property, 32+64), 0, t.pctused$, 64, 0, null)), decode(bitand(t.property, 32), 0, t.initrans, null), decode(bitand(t.property, 32), 0, t.maxtrans, null), s.iniexts * ts.blocksize, decode(bitand(ts.flags, 3), 1, to_number(NULL), s.extsize * ts.blocksize), s.minexts, s.maxexts, decode(bitand(ts.flags, 3), 1, to_number(NULL), s.extpct), decode(bitand(ts.flags, 32), 32, to_number(NULL), decode(bitand(o.flags, 2), 2, 1, decode(s.lists, 0, 1, s.lists))), decode(bitand(ts.flags, 32), 32, to_number(NULL), decode(bitand(o.flags, 2), 2, 1, decode(s.groups, 0, 1, s.groups))), decode(bitand(t.property, 32+64), 0, decode(bitand(t.flags, 32), 0, 'YES', 'NO'), null), decode(bitand(t.flags,1), 0, 'Y', 1, 'N', '?'), t.rowcnt, decode(bitand(t.property, 64), 0, t.blkcnt, null), decode(bitand(t.property, 64), 0, t.empcnt, null), t.avgspc, t.chncnt, t.avgrln, t.avgspc_flb, decode(bitand(t.property, 64), 0, t.flbcnt, null), lpad(decode(t.degree, 32767, 'DEFAULT', nvl(t.degree,1)),10), lpad(decode(t.instances, 32767, 'DEFAULT', nvl(t.instances,1)),10), lpad(decode(bitand(t.flags, 8), 8, 'Y', 'N'),5), decode(bitand(t.flags, 6), 0, 'ENABLED', 'DISABLED'), t.samplesize, t.analyzetime, decode(bitand(t.property, 32), 32, 'YES', 'NO'), decode(bitand(t.property, 64), 64, 'IOT', decode(bitand(t.property, 512), 512, 'IOT_OVERFLOW', decode(bitand(t.flags, 536870912), 536870912, 'IOT_MAPPING', null ))), decode(bitand(o.flags, 2), 0, 'N', 2, 'Y', 'N'), decode(bitand(o.flags, 16), 0, 'N', 16, 'Y', 'N'), decode(bitand(t.property, 8192), 8192, 'YES', decode(bitand(t.property, 1), 0, 'NO', 'YES')), decode(bitand(o.flags, 2), 2, 'DEFAULT', decode(s.cachehint, 0, 'DEFAULT', 1, 'KEEP', 2, 'RECYCLE', NULL)), decode(bitand(t.flags, 131072), 131072, 'ENABLED', 'DISABLED'), decode(bitand(t.flags, 512), 0, 'NO', 'YES'), decode(bitand(t.flags, 256), 0, 'NO', 'YES'), decode(bitand(o.flags, 2), 0, NULL, decode(bitand(t.property, 8388608), 8388608, 'SYS$SESSION', 'SYS$TRANSACTION')), decode(bitand(t.flags, 1024), 1024, 'ENABLED', 'DISABLED'), decode(bitand(o.flags, 2), 2, 'NO', decode(bitand(t.property, 2147483648), 2147483648, 'NO', decode(ksppcv.ksppstvl, 'TRUE', 'YES', 'NO'))), decode(bitand(t.property, 1024), 0, null, cu.name), decode(bitand(t.flags, 8388608), 8388608, 'ENABLED', 'DISABLED'), decode(bitand(t.property, 32), 32, null, decode(bitand(s.spare1, 2048), 2048, 'ENABLED', 'DISABLED')), decode(bitand(o.flags, 128), 128, 'YES', 'NO') from sys.user$ u, sys.ts$ ts, sys.seg$ s, sys.obj$ co, sys.tab$ t, sys.obj$ o, sys.obj$ cx, sys.user$ cu, x$ksppcv ksppcv, x$ksppi ksppi where o.owner# = u.user# and o.obj# = t.obj# and bitand(t.property, 1) = 0 and bitand(o.flags, 128) = 0 and t.bobj# = co.obj# (+) and t.ts# = ts.ts# and t.file# = s.file# (+) and t.block# = s.block# (+) and t.ts# = s.ts# (+) and t.dataobj# = cx.obj# (+) and cx.owner# = cu.user# (+) and ksppi.indx = ksppcv.indx and ksppi.ksppinm = '_dml_monitoring_enabled'
了解你所不知道的SMON功能(五):Recover Dead transaction
SMON的作用還包括清理死事務:Recover Dead transaction。當服務進程在提交事務(commit)前就意外終止的話會形成死事務(dead transaction),PMON進程負責輪詢Oracle進程,找出這類意外終止的死進程(dead process),通知SMON將與該dead process相關的dead transaction回滾清理,并且PMON還負責恢復dead process原本持有的鎖和latch。
我們來具體了解dead transaction的恢復過程:
SQL> select * from v$version; BANNER ---------------------------------------------------------------- Oracle Database 10g Enterprise Edition Release 10.2.0.4.0 - 64bi PL/SQL Release 10.2.0.4.0 - Production CORE 10.2.0.4.0 Production TNS for Linux: Version 10.2.0.4.0 - Production NLSRTL Version 10.2.0.4.0 - Production SQL> select * from global_name; GLOBAL_NAME -------------------------------------------------------------------------------- www.oracledatabase12g.com SQL>alter system set fast_start_parallel_rollback=false; System altered. 設置10500,10046事件以跟蹤SMON進程的行為 SQL> alter system set events '10500 trace name context forever,level 8'; System altered. SQL> oradebug setospid 4424 Oracle pid: 8, Unix process pid: 4424, image: oracle@rh2.oracle.com (SMON) SQL> oradebug event 10046 trace name context forever,level 8; Statement processed. 在一個新的terminal中執行大批量的刪除語句,在執行一段時間后使用操作系統命令將執行該刪除操作的 服務進程kill掉,模擬一個大的dead transaction的場景 SQL> delete large_rb; delete large_rb [oracle@rh2 bdump]$ kill -9 4535 等待幾秒后pmon進程會找出dead process: [claim lock for dead process][lp 0x7000003c70ceff0][p 0x7000003ca63dad8.1290666][hist x9a514951] 在x$ktube內部視圖中出現ktuxecfl(Transaction flags)標記為DEAD的記錄: SQL> select sum(distinct(ktuxesiz)) from x$ktuxe where ktuxecfl = 'DEAD'; SUM(DISTINCT(KTUXESIZ)) ----------------------- 29386 SQL> / SUM(DISTINCT(KTUXESIZ)) ----------------------- 28816 以上KTUXESIZ代表事務所使用的undo塊總數(number of undo blocks used by the transaction) ==================smon trace content================== SMON: system monitor process posted WAIT #0: nam='log file switch completion' ela= 0 p1=0 p2=0 p3=0 obj#=1 tim=1278243332801935 WAIT #0: nam='log file switch completion' ela= 0 p1=0 p2=0 p3=0 obj#=1 tim=1278243332815568 WAIT #0: nam='latch: row cache objects' ela= 95 address=2979418792 number=200 tries=1 obj#=1 tim=1278243333332734 WAIT #0: nam='latch: row cache objects' ela= 83 address=2979418792 number=200 tries=1 obj#=1 tim=1278243333356173 WAIT #0: nam='latch: undo global data' ela= 104 address=3066991984 number=187 tries=1 obj#=1 tim=1278243347987705 WAIT #0: nam='latch: object queue header operation' ela= 89 address=3094817048 number=131 tries=0 obj#=1 tim=1278243362468042 WAIT #0: nam='log file switch (checkpoint incomplete)' ela= 0 p1=0 p2=0 p3=0 obj#=1 tim=1278243419588202 Dead transaction 0x00c2.008.0000006d recovered by SMON ===================== PARSING IN CURSOR #3 len=358 dep=1 uid=0 oct=3 lid=0 tim=1278243423594568 hv=3186851936 ad='ae82c1b8' select smontabv.cnt, smontab.time_mp, smontab.scn, smontab.num_mappings, smontab.tim_scn_map, smontab.orig_thread from smon_scn_time smontab, (select max(scn) scnmax, count(*) + sum(NVL2(TIM_SCN_MAP, NUM_MAPPINGS, 0)) cnt from smon_scn_time where thread = 0) smontabv where smontab.scn = smontabv.scnmax and thread = 0 END OF STMT PARSE #3:c=0,e=1354526,p=0,cr=0,cu=0,mis=1,r=0,dep=1,og=4,tim=1278243423594556 EXEC #3:c=0,e=106,p=0,cr=0,cu=0,mis=0,r=0,dep=1,og=4,tim=1278243423603269 FETCH #3:c=0,e=47065,p=0,cr=319,cu=0,mis=0,r=1,dep=1,og=4,tim=1278243423650375 *** 2011-06-24 21:19:25.899 WAIT #0: nam='smon timer' ela= 299999999 sleep time=300 failed=0 p3=0 obj#=1 tim=1278243716699171 kglScanDependencyHandles4Unpin(): cumscan=3 cumupin=4 time=776 upinned=0以上SMON回滾清理Dead transaction的過程從”system monitor process posted”開始到”Dead transaction 0x00c2.008.0000006d recovered by SMON”結束。另外可以看到在恢復過程中SMON先后請求了’latch: row cache objects’、’latch: undo global data’、’latch: object queue header operation’三種不同類型的latch。
現象
fast_start_parallel_rollback參數決定了SMON在回滾事務時使用的并行度,若將該參數設置為false那么并行回滾將被禁用,若設置為Low(默認值)那么會以2*CPU_COUNT數目的并行度回滾,當設置為High則4*CPU_COUNT數目的回滾進程將參與進來。當我們通過以下查詢發現系統中存在大的dead tranacation需要回滾時我們可以通過設置fast_start_parallel_rollback為HIGH來加速恢復:
select sum(distinct(ktuxesiz)) from x$ktuxe where ktuxecfl = 'DEAD'; ==============parallel transaction recovery=============== *** 2011-06-24 20:31:01.765 SMON: system monitor process posted msgflag:0x0000 (-/-/-/-/-/-/-) *** 2011-06-24 20:31:01.765 SMON: process sort segment requests begin *** 2011-06-24 20:31:01.765 SMON: process sort segment requests end *** 2011-06-24 20:31:01.765 SMON: parallel transaction recovery begin WAIT #0: nam='DFS lock handle' ela= 504 type|mode=1413545989 id1=3 id2=11 obj#=2 tim=1308918661765715 WAIT #0: nam='DFS lock handle' ela= 346 type|mode=1413545989 id1=3 id2=12 obj#=2 tim=1308918661766135 WAIT #0: nam='DFS lock handle' ela= 565 type|mode=1413545989 id1=3 id2=13 obj#=2 tim=1308918661766758 WAIT #0: nam='DFS lock handle' ela= 409 type|mode=1413545989 id1=3 id2=14 obj#=2 tim=1308918661767221 WAIT #0: nam='DFS lock handle' ela= 332 type|mode=1413545989 id1=3 id2=15 obj#=2 tim=1308918661767746 WAIT #0: nam='DFS lock handle' ela= 316 type|mode=1413545989 id1=3 id2=16 obj#=2 tim=1308918661768146 WAIT #0: nam='DFS lock handle' ela= 349 type|mode=1413545989 id1=3 id2=17 obj#=2 tim=1308918661768549 WAIT #0: nam='DFS lock handle' ela= 258 type|mode=1413545989 id1=3 id2=18 obj#=2 tim=1308918661768858 WAIT #0: nam='DFS lock handle' ela= 310 type|mode=1413545989 id1=3 id2=19 obj#=2 tim=1308918661769224 WAIT #0: nam='DFS lock handle' ela= 281 type|mode=1413545989 id1=3 id2=20 obj#=2 tim=1308918661769555 *** 2011-06-24 20:31:01.769 SMON: parallel transaction recovery end但是在real world的實踐中可以發現當fast_start_parallel_rollback= Low/High,即啟用并行回滾時常有并行進程因為各種資源互相阻塞導致回滾工作停滯的例子,當遭遇到這種問題時將fast_start_parallel_rollback設置為FALSE一般可以保證恢復工作以串行形式在較長時間內完成。
如何禁止SMON Recover Dead transaction
可以設置10513事件來臨時禁止SMON恢復死事務,這在我們做某些異?;謴偷臅r候顯得異常有效,當然不建議在一個正常的生產環境中設置這個事件:
SQL> alter system set events '10513 trace name context forever, level 2'; System altered. 10531 -- event disables transaction recovery which was initiated by SMON SQL> select ktuxeusn, 2 to_char(sysdate, 'DD-MON-YYYY HH24:MI:SS') "Time", 3 ktuxesiz, 4 ktuxesta 5 from x$ktuxe 6 where ktuxecfl = 'DEAD'; KTUXEUSN Time KTUXESIZ KTUXESTA ---------- -------------------------- ---------- ---------------- 17 24-JUN-2011 22:03:10 0 INACTIVE 66 24-JUN-2011 22:03:10 0 INACTIVE 105 24-JUN-2011 22:03:10 0 INACTIVE 193 24-JUN-2011 22:03:10 33361 ACTIVE 194 24-JUN-2011 22:03:10 0 INACTIVE 194 24-JUN-2011 22:03:10 0 INACTIVE 197 24-JUN-2011 22:03:10 20171 ACTIVE 7 rows selected. SQL> / KTUXEUSN Time KTUXESIZ KTUXESTA ---------- -------------------------- ---------- ---------------- 17 24-JUN-2011 22:03:10 0 INACTIVE 66 24-JUN-2011 22:03:10 0 INACTIVE 105 24-JUN-2011 22:03:10 0 INACTIVE 193 24-JUN-2011 22:03:10 33361 ACTIVE 194 24-JUN-2011 22:03:10 0 INACTIVE 194 24-JUN-2011 22:03:10 0 INACTIVE 197 24-JUN-2011 22:03:10 20171 ACTIVE 7 rows selected. ================smon disabled trans recover trace================== SMON: system monitor process posted *** 2011-06-24 22:02:57.980 SMON: Event 10513 is level 2, trans recovery disabled.
了解你所不知道的SMON功能(六):Instance Recovery
SMON的作用還包括RAC環境中的Instance Recovery,注意雖然Instance Recovery可以翻做實例恢復,但實際上和我們口頭所說的實例恢復是不同的。我們口頭語言所說的實例恢復很大程度上是指Crash Recovery崩潰恢復,Instance Recovery與Crash Recovery是存在區別的:針對單實例(single instance)或者RAC中所有節點全部崩潰后的恢復,我們稱之為Crash Recovery。而對于RAC中的某一個節點失敗,存活節點(surviving instance)試圖對失敗節點線程上redo做應用的情況,我們稱之為Instance Recovery。對于Crash Recovery更多的內容可見 <還原真實的cache recovery> 一文。
現象
Instance Recovery期間分別存在cache recovery和ges/gcs remaster2個recovery stage,注意這2個舞臺的恢復是同時進行的。cache recovery的主角是存活節點上的SMON進程,SMON負責分發redo給slave進程。而實施ges/gcs remaster的是RAC專有進程LMON。
整個Reconfiuration的過程如下圖: 注意以上Crash Detected時數據庫進入部分可用(Partial Availability)狀態,從Freeze Lockdb開始None Availability,到IR applies redo即前滾時轉換為Partial Availability,待前滾完成后會實施回滾,但是此時數據庫已經進入完全可用(Full Availability)狀態了,如下圖:
The graphic illustrates the degree of database availability during each step of Oracle instance recovery:
A. Real Application Clusters is running on multiple nodes.
B. Node failure is detected.
C. The enqueue part of the GRD is reconfigured; resource management is redistributed to the surviving nodes. This operation occurs relatively quickly.
D. The cache part of the GRD is reconfigured and SMON reads the redo log of the failed instance to identify the database blocks that it needs to recover.
E. SMON issues the GRD requests to obtain all the database blocks it needs for recovery. After the requests are complete, all other blocks are accessible.
F. The Oracle server performs roll forward recovery. Redo logs of the failed threads are applied to the database, and blocks are available right after their recovery is completed.
G. The Oracle server performs rollback recovery. Undo blocks are applied to the database for all uncommitted transactions.
H. Instance recovery is complete and all data is accessible.
Note: The dashed line represents the blocks identified in step 2 in the previous slide. Also, the dotted steps represent the ones identified in the previous slide.
我們來實際觀察一下Instance Recovery的過程:
INST 1: SQL> select * from v$version; BANNER -------------------------------------------------------------------------------- Oracle Database 11g Enterprise Edition Release 11.2.0.2.0 - 64bit Production PL/SQL Release 11.2.0.2.0 - Production CORE 11.2.0.2.0 Production TNS for Linux: Version 11.2.0.2.0 - Production NLSRTL Version 11.2.0.2.0 - Production SQL> select * from global_name; GLOBAL_NAME -------------------------------------------------------------------------------- www.oracledatabase12g.com SQL> alter system set event='10426 trace name context forever,level 12' scope=spfile; -- 10426 event Reconfiguration trace event System altered. SQL> startup force; ORACLE instance started. INST 2: SQL> shutdown abort ORACLE instance shut down. ============================================================= ========================alert.log============================ Reconfiguration started (old inc 4, new inc 6) List of instances: 1 (myinst: 1) Global Resource Directory frozen * dead instance detected - domain 0 invalid = TRUE Communication channels reestablished Master broadcasted resource hash value bitmaps Non-local Process blocks cleaned out LMS 0: 0 GCS shadows cancelled, 0 closed, 0 Xw survived Set master node info Submitted all remote-enqueue requests Dwn-cvts replayed, VALBLKs dubious All grantable enqueues granted Post SMON to start 1st pass IR Instance recovery: looking for dead threads Beginning instance recovery of 1 threads parallel recovery started with 2 processes --2 recovery slave Submitted all GCS remote-cache requests Post SMON to start 1st pass IR Fix write in gcs resources Reconfiguration complete Started redo scan Completed redo scan read 88 KB redo, 82 data blocks need recovery Started redo application at Thread 2: logseq 374, block 2, scn 54624376 Recovery of Online Redo Log: Thread 2 Group 4 Seq 374 Reading mem 0 Mem# 0: +DATA/prod/onlinelog/group_4.271.747100549 Mem# 1: +DATA/prod/onlinelog/group_4.272.747100553 Completed redo application of 0.07MB Completed instance recovery at Thread 2: logseq 374, block 178, scn 54646382 73 data blocks read, 83 data blocks written, 88 redo k-bytes read Thread 2 advanced to log sequence 375 (thread recovery) Redo thread 2 internally disabled at seq 375 (SMON) ARC3: Creating local archive destination LOG_ARCHIVE_DEST_1: '/s01/arch/2_374_747100216.dbf' (thread 2 sequence 374) (PROD1) Setting Resource Manager plan SCHEDULER[0x310B]:DEFAULT_MAINTENANCE_PLAN via scheduler window Setting Resource Manager plan DEFAULT_MAINTENANCE_PLAN via parameter ARC3: Closing local archive destination LOG_ARCHIVE_DEST_1: '/s01/arch/2_374_747100216.dbf' (PROD1) 2011-06-27 22:19:29.280000 +08:00 Archived Log entry 792 added for thread 2 sequence 374 ID 0x9790ab2 dest 1: ARC0: Creating local archive destination LOG_ARCHIVE_DEST_1: '/s01/arch/2_375_747100216.dbf' (thread 2 sequence 375) (PROD1) 2011-06-27 22:19:30.336000 +08:00 ARC0: Archiving disabled thread 2 sequence 375 ARC0: Closing local archive destination LOG_ARCHIVE_DEST_1: '/s01/arch/2_375_747100216.dbf' (PROD1) Archived Log entry 793 added for thread 2 sequence 375 ID 0x9790ab2 dest 1: minact-scn: Master considers inst:2 dead ================================================================== ===========================smon trace begin======================= *** 2011-06-27 22:19:28.279 2011-06-27 22:19:28.279849 : Start recovery for domain=0, valid=0, flags=0x0 Successfully allocated 2 recovery slaves Using 67 overflow buffers per recovery slave Thread 2 checkpoint: logseq 374, block 2, scn 54624376 cache-low rba: logseq 374, block 2 on-disk rba: logseq 374, block 178, scn 54626382 start recovery at logseq 374, block 2, scn 54624376 Instance recovery not required for thread 1 *** 2011-06-27 22:19:28.487 Started writing zeroblks thread 2 seq 374 blocks 178-185 *** 2011-06-27 22:19:28.487 Completed writing zeroblks thread 2 seq 374 ==== Redo read statistics for thread 2 ==== Total physical reads (from disk and memory): 4096Kb -- Redo read_disk statistics -- Read rate (ASYNC): 88Kb in 0.18s => 0.48 Mb/sec Longest record: 8Kb, moves: 0/186 (0%) Longest LWN: 33Kb, moves: 0/47 (0%), moved: 0Mb Last redo scn: 0x0000.0341884d (54626381) ---------------------------------------------- ----- Recovery Hash Table Statistics --------- Hash table buckets = 262144 Longest hash chain = 1 Average hash chain = 82/82 = 1.0 Max compares per lookup = 1 Avg compares per lookup = 248/330 = 0.8 ---------------------------------------------- *** 2011-06-27 22:19:28.489 KCRA: start recovery claims for 82 data blocks *** 2011-06-27 22:19:28.526 KCRA: blocks processed = 82/82, claimed = 81, eliminated = 1 2011-06-27 22:19:28.526088 : Validate domain 0 **************** BEGIN RECOVERY HA STATS **************** I'm the recovery instance smon posted (1278500359646), recovery started 0.027 secs,(1278500359673) domain validated 0.242 secs (1278500359888) claims opened 70, claims converted 11, claims preread 0 **************** END RECOVERY HA STATS ***************** 2011-06-27 22:19:28.526668 : Validated domain 0, flags = 0x0 *** 2011-06-27 22:19:28.556 Recovery of Online Redo Log: Thread 2 Group 4 Seq 374 Reading mem 0 *** 2011-06-27 22:19:28.560 Completed redo application of 0.07MB *** 2011-06-27 22:19:28.569 Completed recovery checkpoint ----- Recovery Hash Table Statistics --------- Hash table buckets = 262144 Longest hash chain = 1 Average hash chain = 82/82 = 1.0 Max compares per lookup = 1 Avg compares per lookup = 330/330 = 1.0 ---------------------------------------------- *** 2011-06-27 22:19:28.572 5401 krsg.c Acquiring RECOVERY INFO PING latch from [krsg.c:5401] IX0 *** 2011-06-27 22:19:28.572 5401 krsg.c Successfully acquired RECOVERY INFO PING latch IX+ *** 2011-06-27 22:19:28.572 5406 krsg.c Freeing RECOVERY INFO PING latch from [krsg.c:5406] IX0 *** 2011-06-27 22:19:28.572 5406 krsg.c Successfully freed RECOVERY INFO PING latch IX- krss_sched_work: Prod archiver request from process SMON (function:0x2000) krss_find_arc: Evaluating ARC3 to receive message (flags 0x0) krss_find_arc: Evaluating ARC0 to receive message (flags 0x0) krss_find_arc: Evaluating ARC1 to receive message (flags 0xc) krss_find_arc: Evaluating ARC2 to receive message (flags 0x2) krss_find_arc: Selecting ARC2 to receive REC PING message *** 2011-06-27 22:19:28.572 3093 krsv.c krsv_send_msg: Sending message to process ARC2 *** 2011-06-27 22:19:28.572 1819 krss.c krss_send_arc: Sent message to ARC2 (message:0x2000) Recovery sets nab of thread 2 seq 374 to 178 with 8 zeroblks Retrieving log 4 pre-aal: xlno:4 flno:0 arf:0 arb:2 arh:2 art:4 Updating log 3 thread 2 sequence 375 Previous log 3 thread 2 sequence 0 Updating log 4 thread 2 sequence 374 Previous log 4 thread 2 sequence 374 post-aal: xlno:4 flno:0 arf:3 arb:2 arh:2 art:3 krss_sched_work: Prod archiver request from process SMON (function:0x1) krss_find_arc: Evaluating ARC3 to receive message (flags 0x0) krss_find_arc: Selecting ARC3 to receive message *** 2011-06-27 22:19:28.589 3093 krsv.c krsv_send_msg: Sending message to process ARC3 *** 2011-06-27 22:19:28.589 1819 krss.c krss_send_arc: Sent message to ARC3 (message:0x1) Retrieving log 2 Kicking thread 1 to switch logfile Retrieving log 4 Retrieving log 3 krss_sched_work: Prod archiver request from process SMON (function:0x1) krss_find_arc: Evaluating ARC0 to receive message (flags 0x0) krss_find_arc: Selecting ARC0 to receive message *** 2011-06-27 22:19:28.599 3093 krsv.c krsv_send_msg: Sending message to process ARC0 *** 2011-06-27 22:19:28.599 1819 krss.c krss_send_arc: Sent message to ARC0 (message:0x1) *** 2011-06-27 22:19:28.599 838 krsv.c krsv_dpga: Waiting for pending I/O to complete *** 2011-06-27 22:19:29.304 krss_sched_work: Prod archiver request from process SMON (function:0x1) krss_find_arc: Evaluating ARC1 to receive message (flags 0xc) krss_find_arc: Selecting ARC1 to receive message *** 2011-06-27 22:19:29.304 3093 krsv.c krsv_send_msg: Sending message to process ARC1 *** 2011-06-27 22:19:29.304 1819 krss.c krss_send_arc: Sent message to ARC1 (message:0x1) SMON[INST-TXN-RECO]:about to recover undo segment 11 status:3 inst:2 SMON[INST-TXN-RECO]: mark undo segment 11 as available status:2 ret:0 SMON[INST-TXN-RECO]:about to recover undo segment 12 status:3 inst:2 SMON[INST-TXN-RECO]: mark undo segment 12 as available status:2 ret:0 SMON[INST-TXN-RECO]:about to recover undo segment 13 status:3 inst:2 SMON[INST-TXN-RECO]: mark undo segment 13 as available status:2 ret:0 SMON[INST-TXN-RECO]:about to recover undo segment 14 status:3 inst:2 SMON[INST-TXN-RECO]: mark undo segment 14 as available status:2 ret:0 SMON[INST-TXN-RECO]:about to recover undo segment 15 status:3 inst:2 SMON[INST-TXN-RECO]: mark undo segment 15 as available status:2 ret:0 SMON[INST-TXN-RECO]:about to recover undo segment 16 status:3 inst:2 SMON[INST-TXN-RECO]: mark undo segment 16 as available status:2 ret:0 SMON[INST-TXN-RECO]:about to recover undo segment 17 status:3 inst:2 SMON[INST-TXN-RECO]: mark undo segment 17 as available status:2 ret:0 SMON[INST-TXN-RECO]:about to recover undo segment 18 status:3 inst:2 SMON[INST-TXN-RECO]: mark undo segment 18 as available status:2 ret:0 SMON[INST-TXN-RECO]:about to recover undo segment 19 status:3 inst:2 SMON[INST-TXN-RECO]: mark undo segment 19 as available status:2 ret:0 SMON[INST-TXN-RECO]:about to recover undo segment 20 status:3 inst:2 SMON[INST-TXN-RECO]: mark undo segment 20 as available status:2 ret:0 *** 2011-06-27 22:19:43.299 * kju_tsn_aff_drm_pending TRACEUD: called with tsn x2, dissolve 0 * kju_tsn_aff_drm_pending TRACEUD: tsn_pkey = x2.1 * >> RM REQ QS ---: single window RM request queue is empty multi-window RM request queue is empty * Global DRM state ---: There is no dynamic remastering RM lock state = 0 pkey 2.1 undo 1 stat 0 masters[32768, 1->1] reminc 4 RM# 1 flg x0 type x0 afftime x36e6e3a8 nreplays by lms 0 = 0 * kju_tsn_aff_drm_pending TRACEUD: matching request not found on swin queue * kju_tsn_aff_drm_pending TRACEUD: pp found, stat x0 * kju_tsn_aff_drm_pending TRACEUD: 2 return true *** 2011-06-27 22:22:18.333 * kju_tsn_aff_drm_pending TRACEUD: called with tsn x2, dissolve 0 * kju_tsn_aff_drm_pending TRACEUD: tsn_pkey = x2.1 * >> RM REQ QS ---: single window RM request queue is empty multi-window RM request queue is empty * Global DRM state ---: There is no dynamic remastering RM lock state = 0 pkey 2.1 undo 1 stat 0 masters[32768, 1->1] reminc 4 RM# 1 flg x0 type x0 afftime x36e6e3a8 nreplays by lms 0 = 0 * kju_tsn_aff_drm_pending TRACEUD: matching request not found on swin queue * kju_tsn_aff_drm_pending TRACEUD: pp found, stat x0 * kju_tsn_aff_drm_pending TRACEUD: 2 return true *** 2011-06-27 22:24:53.365 * kju_tsn_aff_drm_pending TRACEUD: called with tsn x2, dissolve 0 * kju_tsn_aff_drm_pending TRACEUD: tsn_pkey = x2.1 * >> RM REQ QS ---: single window RM request queue is empty multi-window RM request queue is empty * Global DRM state ---: There is no dynamic remastering RM lock state = 0 pkey 2.1 undo 1 stat 0 masters[32768, 1->1] reminc 4 RM# 1 flg x0 type x0 afftime x36e6e3a8 nreplays by lms 0 = 0 * kju_tsn_aff_drm_pending TRACEUD: matching request not found on swin queue * kju_tsn_aff_drm_pending TRACEUD: pp found, stat x0 * kju_tsn_aff_drm_pending TRACEUD: 2 return true ======================================================================== ==============================lmon trace begin========================== *** 2011-06-27 22:19:27.748 kjxgmpoll reconfig instance map: 1 *** 2011-06-27 22:19:27.748 kjxgmrcfg: Reconfiguration started, type 1 CGS/IMR TIMEOUTS: CSS recovery timeout = 31 sec (Total CSS waittime = 65) IMR Reconfig timeout = 75 sec CGS rcfg timeout = 85 sec kjxgmcs: Setting state to 4 0. *** 2011-06-27 22:19:27.759 Name Service frozen kjxgmcs: Setting state to 4 1. kjxgrdecidever: No old version members in the cluster kjxgrssvote: reconfig bitmap chksum 0x2137452d cnt 1 master 1 ret 0 kjxgrpropmsg: SSMEMI: inst 1 - no disk vote kjxgrpropmsg: SSVOTE: Master indicates no Disk Voting 2011-06-27 22:19:27.760783 : kjxgrDiskVote: nonblocking method is chosen kjxggpoll: change poll time to 50 ms 2011-06-27 22:19:27.918847 : kjxgrDiskVote: Obtained RR update lock for sequence 5, RR seq 4 2011-06-27 22:19:28.023160 : kjxgrDiskVote: derive membership from CSS (no disk votes) 2011-06-27 22:19:28.023240 : proposed membership: 1 *** 2011-06-27 22:19:28.081 2011-06-27 22:19:28.081952 : kjxgrDiskVote: new membership is updated by inst 1, seq 6 2011-06-27 22:19:28.082073 : kjxgrDiskVote: bitmap: 1 CGS/IMR TIMEOUTS: CSS recovery timeout = 31 sec (Total CSS waittime = 65) IMR Reconfig timeout = 75 sec CGS rcfg timeout = 85 sec kjxgmmeminfo: can not invalidate inst 2 kjxgmps: proposing substate 2 kjxgmcs: Setting state to 6 2. kjfmSendAbortInstMsg: send an abort message to instance 2 kjfmuin: inst bitmap 1 kjfmmhi: received msg from inst 1 (inc 2) Performed the unique instance identification check kjxgmps: proposing substate 3 kjxgmcs: Setting state to 6 3. Name Service recovery started Deleted all dead-instance name entries kjxgmps: proposing substate 4 kjxgmcs: Setting state to 6 4. Multicasted all local name entries for publish Replayed all pending requests kjxgmps: proposing substate 5 kjxgmcs: Setting state to 6 5. Name Service normal Name Service recovery done *** 2011-06-27 22:19:28.191 kjxgmps: proposing substate 6 kjxgmcs: Setting state to 6 6. kjxgmcs: total reconfig time 0.432 seconds (from 2895072218 to 2895072650) kjxggpoll: change poll time to 600 ms kjfmact: call ksimdic on instance (2) 2011-06-27 22:19:28.211846 : ********* kjfcrfg() called, BEGIN LMON RCFG ********* 2011-06-27 22:19:28.211906 : * Begin lmon rcfg step KJGA_RCFG_BEGIN * kjfcrfg: Resource broadcasting disabled * kjfcrfg: kjfcqiora returned success kjfcrfg: DRM window size = 4096->4096 (min lognb = 15) 2011-06-27 22:19:28.211954 : Reconfiguration started (old inc 4, new inc 6) TIMEOUTS: Local health check timeout: 70 sec Rcfg process freeze timeout: 70 sec Remote health check timeout: 140 sec Defer Queue timeout: 163 secs CGS rcfg timeout: 85 sec Synchronization timeout: 248 sec DLM rcfg timeout: 744 sec List of instances: 1 (myinst: 1) Undo tsn affinity 1 OMF 0 2011-06-27 22:19:28.212394 : * Begin lmon rcfg step KJGA_RCFG_FREEZE *** 2011-06-27 22:19:28.233 * published: inc 6, isnested 0, rora req 0, rora start 0, rora invalid 0, (roram 32767), isrcvinst 1, (rcvinst 1), isdbopen 1, drh 0, (myinst 1) thread 1, isdbmounted 1, sid hash x0 * kjfcrfg: published bigns successfully * Force-published at step 3 2011-06-27 22:19:28.233575 : Global Resource Directory frozen * roram 32767, rcvinst 1 * kjfc_thread_qry: instance 1 flag 3 thread 1 sid 0 * kjfcrfg: queried bigns successfully inst 1 * kjfcrfg: single_instance_kjga = TRUE asby init, 0/1/x2 asby returns, 0/1/x2/false * Domain maps before reconfiguration: * DOMAIN 0 (valid 1): 1 2 * End of domain mappings * dead instance detected - domain 0 invalid = TRUE * Domain maps after recomputation: * DOMAIN 0 (valid 0): 1 * End of domain mappings 2011-06-27 22:19:28.235110 : * Begin lmon rcfg step KJGA_RCFG_COMM 2011-06-27 22:19:28.235242 : GSIPC:KSXPCB: msg 0xd8b84550 status 34, type 2, dest 2, rcvr 0 2011-06-27 22:19:28.235339 : GSIPC:KSXPCB: msg 0xd8b80180 status 34, type 2, dest 2, rcvr 1 Active Sendback Threshold = 50 % Communication channels reestablished 2011-06-27 22:19:28.240076 : * Begin lmon rcfg step KJGA_RCFG_EXCHANGE 2011-06-27 22:19:28.240192 : * Begin lmon rcfg step KJGA_RCFG_ENQCLEANUP Master broadcasted resource hash value bitmaps 2011-06-27 22:19:28.251474 : Non-local Process blocks cleaned out 2011-06-27 22:19:28.251822 : * Begin lmon rcfg step KJGA_RCFG_CLEANUP 2011-06-27 22:19:28.265220 : * Begin lmon rcfg step KJGA_RCFG_TIMERQ 2011-06-27 22:19:28.265308 : * Begin lmon rcfg step KJGA_RCFG_DDQ 2011-06-27 22:19:28.265393 : * Begin lmon rcfg step KJGA_RCFG_SETMASTER 2011-06-27 22:19:28.271551 : Set master node info 2011-06-27 22:19:28.271931 : * Begin lmon rcfg step KJGA_RCFG_ENQREPLAY 2011-06-27 22:19:28.275490 : Submitted all remote-enqueue requests 2011-06-27 22:19:28.275596 : * Begin lmon rcfg step KJGA_RCFG_ENQDUBIOUS Dwn-cvts replayed, VALBLKs dubious 2011-06-27 22:19:28.277223 : * Begin lmon rcfg step KJGA_RCFG_ENQGRANT All grantable enqueues granted 2011-06-27 22:19:28.277992 : * Begin lmon rcfg step KJGA_RCFG_PCMREPLAY 2011-06-27 22:19:28.279234 : 2011-06-27 22:19:28.279255 : Post SMON to start 1st pass IR --SMON posted by LMON 2011-06-27 22:19:28.307890 : Submitted all GCS cache requests --IR acquires all gcs resource needed for recovery 2011-06-27 22:19:28.308038 : * Begin lmon rcfg step KJGA_RCFG_FIXWRITES Post SMON to start 1st pass IR Fix write in gcs resources 2011-06-27 22:19:28.313508 : * Begin lmon rcfg step KJGA_RCFG_END 2011-06-27 22:19:28.313720 : 2011-06-27 22:19:28.313733 : Reconfiguration complete * domain 0 valid?: 0 * kjfcrfg: ask RMS0 to do pnp work **************** BEGIN DLM RCFG HA STATS **************** Total dlm rcfg time (inc 6): 0.100 secs (1278500359581, 1278500359681) Begin step .........: 0.001 secs (1278500359581, 1278500359582) Freeze step ........: 0.020 secs (1278500359582, 1278500359602) Remap step .........: 0.002 secs (1278500359602, 1278500359604) Comm step ..........: 0.005 secs (1278500359604, 1278500359609) Sync 1 step ........: 0.000 secs (0, 0) Exchange step ......: 0.000 secs (1278500359609, 1278500359609) Sync 2 step ........: 0.000 secs (0, 0) Enqueue cleanup step: 0.011 secs (1278500359609, 1278500359620) Sync pcm1 step .....: 0.000 secs (0, 0) Cleanup step .......: 0.013 secs (1278500359620, 1278500359633) Timerq step ........: 0.000 secs (1278500359633, 1278500359633) Ddq step ...........: 0.000 secs (1278500359633, 1278500359633) Set master step ....: 0.006 secs (1278500359633, 1278500359639) Sync 3 step ........: 0.000 secs (0, 0) Enqueue replay step : 0.004 secs (1278500359639, 1278500359643) Sync 4 step ........: 0.000 secs (0, 0) Enqueue dubious step: 0.001 secs (1278500359643, 1278500359644) Sync 5 step ........: 0.000 secs (0, 0) Enqueue grant step .: 0.001 secs (1278500359644, 1278500359645) Sync 6 step ........: 0.000 secs (0, 0) PCM replay step ....: 0.030 secs (1278500359645, 1278500359675) Sync 7 step ........: 0.000 secs (0, 0) Fixwrt replay step .: 0.003 secs (1278500359675, 1278500359678) Sync 8 step ........: 0.000 secs (0, 0) End step ...........: 0.001 secs (1278500359680, 1278500359681) Number of replayed enqueues sent / received .......: 0 / 0 Number of replayed fusion locks sent / received ...: 0 / 0 Number of enqueues mastered before / after rcfg ...: 2217 / 2941 Number of fusion locks mastered before / after rcfg: 3120 / 5747 **************** END DLM RCFG HA STATS ***************** *** 2011-06-27 22:19:36.589 kjxgfipccb: msg 0x0x7ff526139320, mbo 0x0x7ff526139310, type 19, ack 0, ref 0, stat 34 ===================================================================== ============================lms trace begin========================== *** 2011-06-27 22:38:54.663 2011-06-27 22:38:54.663764 : 0 GCS shadows cancelled, 0 closed, 0 Xw survived 2011-06-27 22:38:54.673539 : 5230 GCS resources traversed, 0 cancelled 2011-06-27 22:38:54.707671 : 9322 GCS shadows traversed, 0 replayed, 0 duplicates, 5183 not replayed, dissolve 0 timeout 0 RCFG(10) lms 0 finished replaying gcs resources 2011-06-27 22:38:54.709132 : 0 write requests issued in 384 GCS resources --check past image 0 PIs marked suspect, 0 flush PI msgs 2011-06-27 22:38:54.709520 : 0 write requests issued in 273 GCS resources 1 PIs marked suspect, 0 flush PI msgs 2011-06-27 22:38:54.709842 : 0 write requests issued in 281 GCS resources 0 PIs marked suspect, 0 flush PI msgs 2011-06-27 22:38:54.710159 : 0 write requests issued in 233 GCS resources 0 PIs marked suspect, 0 flush PI msgs 2011-06-27 22:38:54.710531 : 0 write requests issued in 350 GCS resources lms 0 finished fixing gcs write protocolInstance Recovery和普通的Crash Recovery最大的區別在于實例恢復過程中的GRD Frozen和對GES/GCS資源的Remaster,這部分工作主要由LMON進程完成,可以從以上trace中發現一些KJGA_RCFG_*形式的Reconfiguration步驟,它們的含義:
Reconfiguration Steps: 1. KJGA_RCFG_BEGIN LMON continuously polling for reconfiguration event. Once cgs reports a change in cluster membership, LMON starts reconfiguration 2. KJGA_RCFG_FREEZE All processes acknowledges to the reconfiguration freeze before LMON continue 3. KJGA_RCFG_REMAP Updates new instance map (kjfchsu), re-distributes resource mastership. Invalidate recovery domains if reconfiguration is caused by instance death. 4. KJGA_RCFG_COMM Reinitialize communication channel 5. KJGA_RCFG_EXCHANGE Exchange of master information of gcs, ges and file affinity master 6. KJGA_RCFG_ENQCLEANUP Delete remote dead gcs/ges locks. Cancel converting gcs requests. 7. KJGA_RCFG_CLEANUP Cleanup/remove ges resources 8. KJGA_RCFG_TIMERQ Restore relative timeout for enqueue locks on timeout queue 9. KJGA_RCFG_DDQ Clean out enqueue locks on deadlock queue 10. KJGA_RCFG_SETMASTER Update master info for each enqueue resources that need to be remastered. 11. KJGA_RCFG_REPLAY Replay enqueue locks 12. KJGA_RCFG_ENQDUBIOUS Invalidates ges resources without established value 13. KJGA_RCFG_ENQGRANT Grants all grantable ges lock requests 14. KJGA_RCFG_REPLAY2 Enqueue reconfiguration complete. Post SMON to start instance recovery. Starts replaying gcs resources. 15. KJGA_RCFG_FIXWRITES2 Fix write state of gcs resources 16. KJGA_RCFG_END Unfreeze lock databaseInstance Recovery相關的診斷事件
我們無法禁止Instance Recovery的發生,事實上一旦出現Instance Crash那么Instance Recovery就是必須的。
與Instance Recovery相關的診斷事件主要有10426和29717等:10426 – Reconfiguration trace event 10425 – Enqueue operations 10432 – Fusion activity 10429 – IPC tracing oerr ora 10425 10425, 00000, "enable global enqueue operations event trace" // *Document: NO // *Cause: // *Action: Dump trace for global enqueue operations. oerr ora 10426 10426, 00000, "enable ges/gcs reconfiguration event trace" // *Document: NO // *Cause: // *Action: Dump trace for ges/gcs reconfiguration. oerr ora 10430 10430, 00000, "enable ges/gcs dynamic remastering event trace" // *Document: NO // *Cause: // *Action: Dump trace for ges/gcs dynamic remastering. oerr ora 10401 10401, 00000, "turn on IPC (ksxp) debugging" // *Cause: // *Action: Enables debugging code for IPC service layer (ksxp) oerr ora 10708 10708, 00000, "print out trace information from the RAC buffer cache" // *Cause: N/A // *Action: THIS IS NOT A USER ERROR NUMBER/MESSAGE. THIS DOES NOT NEED TO BE // TRANSLATED OR DOCUMENTED. IT IS USED ONLY FOR DEBUGGING. oerr ora 29717 29717, 00000, "enable global resource directory freeze/unfreeze event trace" // *Document: NO // *Cause: // *Action: Dump trace for global resource directory freeze/unfreeze. diag RAC INSTANCE SHUTDOWN LMON LMON will dump more informations to trace during reconfig and freeze. event="10426 trace name context forever, level 8" event="29717 trace name context forever, level 5" or event="10426 trace name context forever, level 12" event="10430 trace name context forever, level 12" event="10401 trace name context forever, level 8" event="10046 trace name context forever, level 8" event="10708 trace name context forever, level 15" event="29717 trace name context forever, level 5" see 29717 grd frozen trace alter system set event='29717 trace name context forever, level 5' scope=spfile; ========================================================================= ============================lmon trace begin============================= ********* kjfcrfg() called, BEGIN LMON RCFG ********* 2011-06-27 23:13:16.693089 : * Begin lmon rcfg step KJGA_RCFG_BEGIN * kjfcrfg: Resource broadcasting disabled * kjfcrfg: kjfcqiora returned success kjfcrfg: DRM window size = 4096->4096 (min lognb = 15) 2011-06-27 23:13:16.693219 : Reconfiguration started (old inc 4, new inc 6) TIMEOUTS: Local health check timeout: 70 sec Rcfg process freeze timeout: 70 sec Remote health check timeout: 140 sec Defer Queue timeout: 163 secs CGS rcfg timeout: 85 sec Synchronization timeout: 248 sec DLM rcfg timeout: 744 sec List of instances: 1 (myinst: 1) Undo tsn affinity 1 OMF 0 [FDB][start] 2011-06-27 23:13:16.701320 : * Begin lmon rcfg step KJGA_RCFG_FREEZE [FACK][18711 not frozen] --fack means acknowledge in advance [FACK][18713 not frozen] [FACK][18719 not frozen] [FACK][18721 not frozen] [FACK][18723 not frozen] [FACK][18729 not frozen] [FACK][18739 not frozen] [FACK][18743 not frozen] [FACK][18745 not frozen] [FACK][18747 not frozen] [FACK][18749 not frozen] [FACK][18751 not frozen] [FACK][18753 not frozen] [FACK][18755 not frozen] [FACK][18757 not frozen] [FACK][18759 not frozen] [FACK][18763 not frozen] [FACK][18765 not frozen] [FACK][18767 not frozen] [FACK][18769 not frozen] [FACK][18771 not frozen] [FACK][18775 not frozen] [FACK][18777 not frozen] [FACK][18816 not frozen] [FACK][18812 not frozen] [FACK][18818 not frozen] [FACK][18820 not frozen] [FACK][18824 not frozen] [FACK][18826 not frozen] [FACK][18830 not frozen] [FACK][18835 not frozen] [FACK][18842 not frozen] [FACK][18860 not frozen] [FACK][18865 not frozen] [FACK][18881 not frozen] [FACK][18883 not frozen] [FACK][18909 not frozen] *** 2011-06-27 23:13:16.724 * published: inc 6, isnested 0, rora req 0, rora start 0, rora invalid 0, (roram 32767), isrcvinst 0, (rcvinst 32767), isdbopen 1, drh 0, (myinst 1) thread 1, isdbmounted 1, sid hash x0 * kjfcrfg: published bigns successfully * Force-published at step 3 2011-06-27 23:13:16.724764 : Global Resource Directory frozen * kjfc_qry_bigns: noone has the rcvinst established yet, set it to the highest open instance = 1 * roram 32767, rcvinst 1 * kjfc_thread_qry: instance 1 flag 3 thread 1 sid 0 * kjfcrfg: queried bigns successfully ===================================================================== ==========================lmd0 trace begin=========================== *** 2011-06-27 23:13:16.700 [FFCLI][frozen] [FUFCLI][normal]
更多文章、技術交流、商務合作、聯系博主
微信掃碼或搜索:z360901061

微信掃一掃加我為好友
QQ號聯系: 360901061
您的支持是博主寫作最大的動力,如果您喜歡我的文章,感覺我的文章對您有幫助,請用微信掃描下面二維碼支持博主2元、5元、10元、20元等您想捐的金額吧,狠狠點擊下面給點支持吧,站長非常感激您!手機微信長按不能支付解決辦法:請將微信支付二維碼保存到相冊,切換到微信,然后點擊微信右上角掃一掃功能,選擇支付二維碼完成支付。
【本文對您有幫助就好】元
