Simultaneous restart of both nodes in a 2-node RAC · 2013. 3. 25. · 1 • CPU scheduling related...


Simultaneous restart of both nodes in a 2-node RAC

• Both RAC DB servers restart at the same time because of a CPU scheduling bug

Symptom

• In a 2-node RAC, a split-brain condition occurred; Node1 was rebooted and the CRS stack on Node2 was restarted, causing a service outage.

Definition

- oracssdmonitor_root.log (Node 1)

2012-04-03 22:43:27.910: [ USRTHRD][1548] (:CLSN00111:)clsnproc_needreboot: Impending reboot at 90% of limit 28257; disk timeout 28257, network timeout 25355, last heartbeat from CSSD at epoch seconds 1333507382.449, 25459 milliseconds ago based on invariant clock 2027712154; now polling at 100 ms

- When cssd hangs, cssdagent restarts the node after (MISSCOUNT-REBOOTTIME/2) seconds
- CSSD State
. cTODC (epoch seconds) : 1333507382.449 → Tue, 3 Apr 2012 22:43:02
. cITC (invariant clock) : 2027712154
. NTO (network timeout) : 25355
. DTO (disk timeout) : 28257
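The figures in the clsnproc_needreboot message can be checked by hand. The sketch below is only an illustration: it assumes the server's local time is UTC-4 (inferred from the log timestamp, not stated in the source) and that "90% of limit" means 0.9 times the 28257 ms limit. All names are hypothetical.

```python
from datetime import datetime, timedelta, timezone

# Values copied from the clsnproc_needreboot log line.
LIMIT_MS = 28257                      # "limit" / disk timeout (DTO)
LAST_HEARTBEAT_EPOCH = 1333507382.449  # cTODC
SINCE_HEARTBEAT_MS = 25459            # "25459 milliseconds ago"

# Assumption: local time on the server is UTC-4 (inferred from the log).
LOCAL_TZ = timezone(timedelta(hours=-4))


def reboot_warning_threshold_ms(limit_ms: int, fraction: float = 0.9) -> float:
    """Heartbeat silence after which 'Impending reboot' is logged."""
    return limit_ms * fraction


def epoch_to_local(epoch_seconds: float) -> str:
    """Format epoch seconds as local wall-clock time (whole seconds)."""
    moment = datetime.fromtimestamp(int(epoch_seconds), tz=LOCAL_TZ)
    return moment.strftime("%a, %d %b %Y %H:%M:%S")


threshold = reboot_warning_threshold_ms(LIMIT_MS)
print(epoch_to_local(LAST_HEARTBEAT_EPOCH))
print(f"threshold {threshold:.1f} ms, elapsed {SINCE_HEARTBEAT_MS} ms, "
      f"past threshold: {SINCE_HEARTBEAT_MS >= threshold}")
```

Adding 25459 ms to the last heartbeat (22:43:02.449) gives 22:43:27.908, which lines up with the 22:43:27.910 timestamp of the log entry.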

- ocssd.log (Node 2)

2012-04-03 22:43:29.451: [ CSSD][5671]clssnmCheckSplit: Node 1, sedadb01, is alive, DHB (1333507409, 2027739010) more than disk timeout of 27000 after the last NHB (1333507379, 2027709150)
2012-04-03 22:43:29.451: [ CSSD][1]clssgmQueueGrockEvent: groupName(crs_version) count(3) master(0) event(2), incarn 3, mbrc 3, to member 2, events 0x0, state 0x0
2012-04-03 22:43:29.451: [ CSSD][5671]clssnmCheckDskInfo: My cohort: 2
2012-04-03 22:43:29.451: [ CSSD][5671]clssnmCheckDskInfo: Surviving cohort: 1
2012-04-03 22:43:29.451: [ CSSD][1]clssgmQueueGrockEvent: groupName(CRF-) count(2) master(2) event(2), incarn 626, mbrc 2, to member 1, events 0x38, state 0x0
2012-04-03 22:43:29.451: [ CSSD][5671](:CSSNM00008:)clssnmCheckDskInfo: Aborting local node to avoid splitbrain. Cohort of 1 nodes with leader 2, sedadb02, is smaller than cohort of 1 nodes led by node 1, sedadb01, based on map type 2

Cause

• Some threads of the CSSD daemon ran at too low a priority to be given CPU time, so the daemon hung
• The CSSD daemon on Node 2 judged the situation to be a split-brain and restarted itself (rebootless restart)
→ In a 2-node CRS cluster, Node 2 is the node that restarts on split-brain
• The CSSD agent/monitor on Node 1 detected the CSSD daemon hang and rebooted Node 1
→ cssdagent restarts the node after (MISSCOUNT-REBOOTTIME/2) seconds
• Bug 13940331: VALUE FOR SETTING THREAD SCHEDULING IS INCORRECT IN SLTSTSPAWN
→ Bug in which CSSD threads inherit the default priority (60) instead of the real-time priority (0)

# ps -mp 11272300 -o THREAD

USER PID PPID TID ST CP PRI SC WCHAN F TT BND COMMAND

oracle 11272300 11796656 - A 1 0 32 * 10240103 - - /oracle/GRID/11203/bin/ocssd.bin

- - - 36372583 S 0 60 1 f1000f0a10022b40 8410400 - - -

- - - 36438127 S 0 60 1 f1000f0a10022c40 8410400 - - -

- - - 36569209 S 0 60 1 f1000f0a10022e40 8410400 - - -

- - - 37552261 S 0 60 1 f1000f0a10023d40 8410400 - - -

Solution

• Apply DB patch 13940331
• Affected: AIX RAC 10.2.0.5, 11.2.0.3
• CSSD thread priority after applying the patch

# ps -mp 3015026 -o THREAD

USER PID PPID TID S CP PRI SC WCHAN F TT BND COMMAND

grid 3015026 3605094 - A 3 0 32 * 10240103 - - /u01/app/11.2.0/grid/

- - - 26280097 S 0 0 1 f1000f0a10019140 8410400 - - -

- - - 26935467 S 1 0 1 - 418400 - - -

- - - 27394273 S 0 0 1 f1000f0a1001a240 8410400 - - -

- - - 32243907 Z 0 0 1 - c00001 - - -

- - - 39583857 S 0 0 1 f1000f0a10025c40 8410400 - - -

- - - 43515929 Z 0 0 1 - c00001 - - -

- - - 60752031 S 0 0 1 - 418400 - - -

- - - 73138383 S 0 0 1 f1000f0a10045c40 8410400 - - -

[Figure: split-brain and node eviction in a 3-node cluster (CSS on Node1, Node2 and Node3, with a Voting File). Node1 and Node2: "I see 1&2", "I do not see 3", "We should evict 3!". Node3: "I see 3", "I've been evicted! I'd better stop". Labels: Split-brain (RMN), STONITH.]

Notes

• Bug 13940331: VALUE FOR SETTING THREAD SCHEDULING IS INCORRECT IN SLTSTSPAWN
→ Bug in which CSSD threads inherit the default priority (60) instead of the real-time priority (0)
• Split-brain
- A state in which nodes whose cluster heartbeat network has been cut are all still alive
- If I/O is issued against shared data during a split-brain, the data is corrupted
- Node eviction is required to prevent split-brain

- Surviving cohort

. Cohort with the most nodes

. Cohort with lowest node number not in other cohort
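The two surviving-cohort rules can be written as a small selection function. This is a minimal sketch of the rules as listed here, not CSS's actual implementation; it assumes cohorts are disjoint, non-empty sets of node numbers, and the function name is hypothetical.

```python
from typing import Iterable, Set


def surviving_cohort(cohorts: Iterable[Set[int]]) -> Set[int]:
    """Pick the cohort that survives a split-brain.

    Rule 1: the cohort with the most nodes wins.
    Rule 2: on a tie, the cohort holding the lowest node number wins.
    """
    candidates = [set(c) for c in cohorts]
    if not candidates or any(not c for c in candidates):
        raise ValueError("need at least one non-empty cohort")
    # Sort key: larger cohorts first, then the smaller minimum node number.
    return min(candidates, key=lambda c: (-len(c), min(c)))


print(surviving_cohort([{1}, {2}]))      # 2-node case from this incident
print(surviving_cohort([{1, 2}, {3}]))   # 3-node eviction example
```

In the 2-node case both cohorts have one node, so the tie-break applies and the cohort containing node 1 survives, which matches the "Aborting local node" message logged on node 2.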

HP async device minor number causing DB corruption

• A non-recommended minor number on the HP async device corrupted datafiles and redo log records

Symptom

• In a 5-node RAC, the two nodes whose async device was misconfigured hit I/O errors, after which DB files were corrupted


- LVOL errors in the syslog of two specific nodes

Jan 3 08:18:05 euospdb4 vmunix: LVM: WARNING: VG 128 0x027000: LV 2: Some I/O requests to this LV are waiting
Jan 3 08:18:05 euospdb4 vmunix: indefinitely for an unavailable PV. These requests will be queued until
Jan 3 08:18:05 euospdb4 vmunix: the PV becomes available (or a timeout is specified for the LV).
Jan 3 08:18:26 euospdb4 vmunix: LVM: WARNING: VG 128 0x027000: LV 3: Some I/O requests to this LV are waiting
Jan 3 08:18:26 euospdb4 vmunix: indefinitely for an unavailable PV. These requests will be queued until
Jan 3 08:18:26 euospdb4 vmunix: LVM: VG 128 0x027000: PVLink 1 0x00004e Failed! The PV is not accessible.
Jan 3 08:18:32 euospdb4 vmunix: LVM: VG 128 0x027000: PVLink 1 0x00004e Recovered.

- Datafile corruption error

KCF: write/open error block=0x70b4f online=1
file=658 /dev/vg27/rlvol_20g_1
error=65535 txt: ''
Automatic datafile offline due to write error on
file 658: /dev/vg27/rlvol_20g_1
Tue Jan 3 09:57:24 2012

- Redo record corruption error while recovering that datafile

Errors in file /oracle10g/admin/EUOSPPRD1/bdump/euospprd1_p006_7151.trc:
ORA-00600: internal error code, arguments: [3020], [658], [244497], [2], [11808], [404358], [352], []
ORA-10567: Redo is inconsistent with data block (file# 658, block# 244497)
ORA-10564: tablespace AST_DATA01
ORA-01110: data file 658: '/dev/vg27/rlvol_20g_1'
ORA-10561: block type 'TRANSACTION MANAGED DATA BLOCK', data object# 39278

Cause

• On 2 of the 5 DB servers the async device was configured with a value that Oracle does not recommend (minor number 7), and I/O errors occurred on those 2 servers

***Oracle Confidential - Internal Use Only***
and recommended using 0x000007 but <> was posted against the note, stating the following:
Please update step 7 in the note for async I/O minor numbers the following minor numbers are not recommended/supported.
1,2 5 and 7
………………..
we neither recommend nor require flags 0x1 (immediate reporting) or 0x2 (CPU cache flags), so those and combinations with them are not mentioned in 11.1 11.2 docs.
5 and 7 should not be used either

Solution

• In environments without ASM mirroring, set the async device minor number to 0 (default)
• In ASM Normal/High Redundancy environments, set the async device minor number to 4
• Setting minor number 7 can lead to data corruption

[Figure: 5-node Database (DB1 to DB5) with the /dev/async minor number of each node, in the order shown: 0x000000, 0x000007, 0x000007, 0x000000, 0x000000. The two nodes with 0x000007 are marked "IO Fail".]


Notes

• HP-UX async device minor number
□ 0x000000 default
□ 0x000001 enable immediate reporting
- On a write, I/O completion is returned as soon as the data reaches the disk controller cache
- Has no effect when files are opened with O_SYNC, as Oracle does (Oracle by default opens all database files in O_SYNC mode and hence this bit is irrelevant and we don't recommend our customers to set this bit in minor number selection)
□ 0x000002 CPU cache will not be flushed after a read request
- Improves performance by not flushing the CPU cache
- Data can be lost on a CPU fault or server crash (The benefit of setting this bit is minimal compared to the inconsistencies it can create. So this bit is removed from the recommendation as well)
□ 0x000004 disc device timeouts will complete with an error code instead of retrying forever
- Enables the disk timeout feature
- The timeout value is adjusted with asyncdsk_io_timeout
- Recommended only with ASM or LVM mirroring
- In non-ASM environments data can be lost when a timeout occurs (In non-ASM env, it is not recommended to set this bit as it can result in complete loss of data in the event of an I/O failure. It is best to leave it to the OS kernel to retry the I/O and get it completed rather than getting a failure status on the I/O request soon)
□ 0x000005 is a combination of 1 and 4
□ 0x000007 is a combination of 1, 2 and 4
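Because the minor number is a bit mask, a configured value can be decoded and compared with the recommendation above. The sketch below is illustrative only: the flag descriptions paraphrase this list, the function names are hypothetical, and the recommendation rule covers just the two cases stated in the Solution (0 without ASM mirroring, 4 with ASM Normal/High Redundancy).

```python
# Bit meanings paraphrased from the list above.
ASYNC_FLAGS = {
    0x1: "immediate reporting",
    0x2: "no CPU cache flush after read",
    0x4: "disk timeout completes with an error",
}


def decode_minor(minor: int) -> list:
    """Return the descriptions of the flag bits set in a minor number."""
    return [name for bit, name in sorted(ASYNC_FLAGS.items()) if minor & bit]


def is_recommended(minor: int, asm_mirrored: bool) -> bool:
    """True if the minor number matches the recommended value."""
    return minor == (0x4 if asm_mirrored else 0x0)


for value in (0x0, 0x4, 0x7):
    print(hex(value), decode_minor(value),
          "ok without ASM mirroring:", is_recommended(value, False))
```

Decoding 0x7 shows all three bits set, including the two (0x1 and 0x2) that are no longer recommended in any environment.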

Oracle VIP goes offline with NICs made redundant by Solaris IPMP

• With NICs made redundant through Solaris IPMP, the Oracle VIP was misconfigured: on a NIC failure IPMP failed over normally, but the Oracle VIP did not

Symptom

• When the primary NIC failed, IPMP failed over to the secondary NIC normally, but the Oracle VIP went offline and the service became unavailable


[Figure: Database nodes DB1 and DB2 with ce0, ce1 and the VIP, before and after a NIC failure. Before: ce0 Pri., ce1 Sta. After the NIC Fail: ce0 Fail, ce1 Pri.]

Cause

• Of the NICs in the IPMP group, only the primary NIC (ce0) was registered with the Oracle VIP; the standby NIC (ce1) was not
• When the primary NIC (ce0) failed, IPMP failed over to the standby NIC (ce1) normally, but the VIP failover failed because ce1 was not registered with the Oracle VIP

Solution

• Register every NIC in the same IPMP group (primary and standby) with the Oracle VIP

srvctl modify nodeapps -n <nodename> -A 90.224.207.150/255.255.255.0/ce0\|ce1

• With redundancy schemes that pair physical NICs (Solaris IPMP, HP ServiceGuard, etc.) rather than presenting a virtual NIC (HP APA, IBM EtherChannel, etc.), all physical NICs must be registered with the Oracle VIP

Notes

• Configuring Solaris IP Multipathing (IPMP) for the CRS 10g VIP [ID 283107.1]
• How to Configure Solaris Link-based IPMP for Oracle VIP [ID 730732.1]
• NIC Failover in IPMP Setup Causes the VIP to go and Remain Offline - CRS-5008 [ID 1121816.1]
• Configuring the HP-UX Operating System for the Oracle 10g and Oracle 11g VIP [ID 296874.1]

Primary instance crash caused by the standby DB snapshot controlfile

• In a Data Guard environment, the snapshot controlfiles of the primary and standby DBs were placed on the same storage, and running RMAN on both at the same time crashed the instance

Symptom

• While an RMAN DB backup was running on the standby DB, an RMAN archive log delete on the primary DB crashed the primary DB


- alert.log (Primary DB)

Wed Nov 21 13:47:02 2012
LNS: Standby redo logfile selected for thread 1 sequence 47456 for destination LOG_ARCHIVE_DEST_2
Wed Nov 21 13:47:13 2012
Archived Log entry 130221 added for thread 1 sequence 47455 ID 0x68fa48a7 dest 1:
Wed Nov 21 13:47:59 2012 <################ CRASH
USER (ospid: 121909): terminating the instance
Wed Nov 21 13:48:00 2012
System state dump requested by (instance=1, osid=121909), summary=[abnormal instance termination].
System State dumped to trace file /GSCMLOG/diag/rdbms/gscmxd/GSCMXD1/trace/GSCMXD1_diag_10142.trc
Wed Nov 21 13:48:00 2012
ORA-1092 : opitsk aborting process
Wed Nov 21 13:48:00 2012
License high water mark = 586
Instance terminated by USER, pid = 121909
USER (ospid: 123281): terminating the instance
Instance terminated by USER, pid = 123281
Wed Nov 21 13:48:07 2012
Starting ORACLE instance (normal)

[Figure: Primary Database (DB1, DB2) and Standby Database (DB1, DB2), each with its own controlfile (Primary Controlfile, Standby Controlfile), sharing one Snapshot Controlfile on NFS: /zfs/controlfile/snapcf_DB.f. Labels: "RMAN : DB Backup", "RMAN : Delete Archive Log", "Instance Crash".]

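Cause

• The snapshot controlfiles of the primary and standby DBs were placed on NAS under the same name, so each DB could overwrite the other's file

Solution

• Keep the snapshot controlfiles of the primary and standby DBs separate
→ Place them on each DB's own shared file system (ASM or CFS) instead of NFS
• The path and name of the snapshot controlfile must be set identically on the primary and standby DBs (Bug 13829543)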

*** MODULE NAME:(rman@**** (TNS V1-V3)) 2012-11-21 13:47:59.850 <##############

*** ACTION NAME:(0000005 STARTED68) 2012-11-21 13:47:59.850

Error: kccpb_sanity_check_2

Control file sequence number mismatch!

fhcsq: 798031382 bhcsq: 798309313 cfn 8

kjzduptcctx: Notifying DIAG for crash event

----- Abridged Call Stack Trace -----

ksedsts()+461<-kjzdssdmp()+267<-kjzduptcctx()+232<-kjzdicrshnfy()+53<-ksuitm()+1332<-kccpb_sanity_check()+341<-kccbmp_get()+309<-kccsed_rbl()+111<-kcc_begin_alt()+674<-kccbckrm()+6568<-kccbck()+37<-kccmus()+1210<-krbicmus()+1074<-pevm_icd_call_common()+867

<-pfrinstr_ICAL()+168<-pfrrun_no_tool()+63<-pfrrun()+627<-plsql_run()+649<-pricar()+1003<-pricbr()+572

----- End of Abridged Call Stack Trace -----

- RMAN trace


Notes

• RMAN Backup Fails For Snapshot Controfile File In Data Guard Environment [ID 1482014.1]
• Bug 13829543 : RMAN CHANGES THE SETTING OF SNAPSHOT CONTROLFILE NAME IN DG ENVIRONMENT
• Bug 13084763 : STANDBY DB CRASHED WHEN RMAN BACKUP RUN IN PRIMARY DB
• When RMAN performs a backup or resync, it first updates the snapshot controlfile
• Because RMAN ran on the primary DB while an RMAN backup was in progress on the standby DB, the Seq# in the snapshot controlfile was higher than the Seq# of the primary DB's current controlfile, and the RMAN process crashed the primary DB instance

-- Primary DB (NAS)
CONFIGURE SNAPSHOT CONTROLFILE NAME TO '/zfs/controlfile/snapcf_DB.f';
-- Standby DB (NAS)
CONFIGURE SNAPSHOT CONTROLFILE NAME TO '/zfs/controlfile/snapcf_DB.f';

-- Primary DB (Primary ASM)
CONFIGURE SNAPSHOT CONTROLFILE NAME TO '+DBFS_DG/DB/CONTROLFILE/snapcf_DB.f';
-- Standby DB (Standby ASM)
CONFIGURE SNAPSHOT CONTROLFILE NAME TO '+DBFS_DG/DB/CONTROLFILE/snapcf_DB.f';

Recovery procedure for logical redo corruption

• How to recover when logical data corruption occurs

Symptom

• ORA-00600 [kdddgb1] and ORA-00600 [kcbzpbuf_1] errors are raised, the DB crashes, and the DB cannot be opened


- alert.log

Thu Apr 26 16:35:16 2012
Errors in file /oracle/admin/EDSWP/udump/edswp_ora_10886.trc:
ORA-00600: internal error code, arguments: [kdddgb1], [73], [], [], [], [], [], []
Thu Apr 26 16:35:21 2012
Hex dump of (file 53, block 460) in trace file /oracle/admin/EDSWP/bdump/edswp_dbw1_29653.trc
Corrupt block relative dba: 0x0d4001cc (file 53, block 460)
Bad header found during preparing block for write
Data in bad block:
 type: 73 format: 4 rdba: 0x340e4b39
 last change scn: 0x0b59.c3d2030c seq: 0x2 flg: 0x40
 spare1: 0x30 spare2: 0x36 spare3: 0x584f
 consistency value in tail: 0x030c4902
 check value in block header: 0x0
 block checksum disabled
Thu Apr 26 16:36:31 2012
Errors in file /oracle/admin/EDSWP/bdump/edswp_dbw1_29653.trc:
ORA-00600: internal error code, arguments: [kcbzpbuf_1], [4], [1], [], [], [], [], []
Thu Apr 26 16:36:33 2012
Errors in file /oracle/admin/EDSWP/bdump/edswp_dbw1_29653.trc:
ORA-00600: internal error code, arguments: [kcbzpbuf_1], [4], [1], [], [], [], [], []
Thu Apr 26 16:36:33 2012
DBW1: terminating instance due to error 471
Thu Apr 26 16:36:33 2012
Errors in file /oracle/admin/EDSWP/bdump/edswp_pmon_29633.trc:
ORA-00471: DBWR process terminated with error
Instance terminated by DBW1, pid = 29653

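• 16:35 ~ 16:42 – Ran dbv to check datafile consistency; no corruption detected
• 16:42 – Attempted a DB restart; the DB crashed with the same error
• 16:44 ~ 17:12 – Mounted the DB to determine the scope of the error and work out a recovery approach
• 17:12:49 – Took the affected datafile offline and confirmed that the DB started normally
• 17:21 ~ 17:24 – Took the affected tablespace offline, started the DB and listener, and partially resumed service
• 17:24 ~ 18:00 – Restored the tape backup
• 18:00 ~ 20:25 – First recovery attempt (using BBED to skip the corrupt block during recovery): failed
• 20:00 ~ 22:40 – Second recovery attempt (incomplete recovery on a new server): succeeded
• 22:40 ~ 23:30 – Imported the recovered data into the production DB
• 23:30 ~ 24:00 – Verified the recovered data; service back to normal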

Cause

• BUG 7662491 - INSTANCE CRASH / ORA-600 [KDDUMMY_BLKCHK] HIT DURING RECOVERY
• This bug can raise the following ORA-600 errors (in this case ORA-600 [kcbzpbuf_1] was raised in DBWR)
ORA-600 [kghstack_free1]
ORA-600 [kghstack_free2]
ORA-600 [kcbzpbuf_1]
ORA-600 [kcbbvr_verify_disk_blk_1]
ORA-600 [kdourp_inorder2]
ORA-600 [kcbnlc_2]
ORA-7445 [_memmove]
ORA-7445 [ksdfsql]
• During recovery, ORA-600 [kcbzpbuf_1] is raised again at the same log sequence
• With Bug 7662491, a statement that updates multiple records writes incorrect information into the redo record, which corrupts the block
• The redo dump from the time of the problem shows OP:11.19, which indicates a multi-record update, occurring repeatedly

Solution

• Because the redo log itself is corrupted, neither block recovery nor media recovery can repair the damage, so the DB is opened in one of the following two ways
1) Take only the datafile containing the corrupted block offline, then open
2) Disable the block checking parameters, then open the DB
- db_block_checksum/db_block_checking=false
SQL>Startup mount ;
SQL>Alter system set db_block_checksum=false ;
SQL>Alter system set db_block_checking=false ;
SQL>Alter database open ;
- Rebuild the tables/indexes (rebuild or exp/imp)
. Move the table in same tablespace
SQL>Alter table <owner>.<tablename> move ;
. Rebuild all indexes in the table
SQL>Alter index <owner>.<indexname> rebuild online ;
. Export/import the table.
- Run verify on tablespace and then rebuild the bitmap for the tablespace
SQL>Alter session set tracefile_identifier='corrupt';
SQL>exec dbms_space_admin.tablespace_verify('<tablespacename>');
SQL>Alter session set tracefile_identifier='rebuildcorrupt';
SQL>exec dbms_space_admin.tablespace_rebuild_bitmaps('<tablespacename>');
- Set db_block_checksum/checking parameter back to Original values
SQL>Alter system set db_block_checking=<original value> ;
SQL>Alter system set db_block_checksum=<original value> ;
SQL>Alter session set tracefile_identifier='latest';
SQL>exec Dbms_space_admin.tablespace_verify('<tablespacename>');


Notes

• BUG 7662491 - INSTANCE CRASH / ORA-600 [KDDUMMY_BLKCHK] HIT DURING RECOVERY
• Bug 7662491 - Array Update can corrupt a row. ORA-600 [kghstack_free1] ORA-600 [kddummy_blkchk][6110/6129] [ID 861965.1]
• How to Resolve ORA-00600[kddummy_blkchk] [ID 1342443.1]
• List of bugs related to ORA-00600[kddummy_blkchk]

Bug no | Bug Abstract | Version Confirmed as affected | Version fixed
Bug:8198906 | OERI [kddummy_blkchk] / OERI [5467] for an aborted transaction | 10.2.0.4, 10.2.0.3, 9.2.0.8, 9.2.0.6 | 11.2.0.1, 10.2.0.5, 10.2.0.4 patch 22 for windows
Bug:7662491 | Array Update can corrupt a row. ora-600 [kghstack_free1] ORA-600 [kddummy_blkchk] | 10.2.0.3, 10.2.0.4, 11.1.0.7 | 10.2.0.4.2, 10.2.0.5, 11.1.0.7.4, 11.2.0.1
Bug:7411865 | OERI:13030 / ORA-1407 / block corruption from UPDATE .. RETURNING DML with trigger | 11.1.0.7, 11.1.0.6, 10.2.0.4 | 10.2.0.4.2, 10.2.0.5, 11.1.0.7.1, 11.2.0.1
Bug:8951812 | Corrupt index by rebuild online. Possible OERI [kddummy_blkchk] by SMON | 11.2.0.1, 11.1.0.7, 11.1.0.6 | 11.2.0.2, 12.1.0.0
Bug:5386204 | Block corruption / OERI[kddummy_blkchk] after direct load | 9.2.0.8, 10.2.0.1, 10.2.0.2, 10.2.0.3, 10.2.0.4 | 10.2.0.5, 11.1.0.6, 10.2.0.4.1 (PSU). Other patches in Windows: 9.2.0.8 Patch 15, 10.2.0.2 Patch 15, 10.2.0.3 Patch 5, 10.2.0.4 Patch 2
Bug:8277580 | Corruption on compressed tables during Recovery and Quick Multi Delete (QMD). | 11.1.0.7 | 11.1.0.7.2, 11.2.0.1, 11.2.0.2, 12.1.0.0
Bug:7041254 | ORA-19661 during RMAN restore check logical of compressed backup / IOT dummy key | 11.1.0.7, 11.1.0.6 | 11.1.0.7.5, 11.2.0.1, 11.1.0.7 Patch 19 on Windows Platforms
Bug:9231605 | Block corruption with missing row on a compressed table after DELETE ORA-600 [kddummy_blkchk] | 11.2.0.1, 11.1.0.6, 11.1.0.7 | 11.1.0.7.4, 11.2.0.1.3, 11.2.0.1.BP02, 11.2.0.2, 12.1.0.0
Bug:9019113 | ORA-600 [17182] for OLTP compress table in Compression redo ora-0600 [kddummy_blkchk] | 11.2.0.1, 11.1.0.7 | 11.2.0.1.BP02, 11.2.0.2, 12.1.0.0
Bug:4493447 | ORA-600 [kddummy_blkchk] [file#] [block#] [6145] on rollback of array update | 10.2.0.1, 10.2.0.2, 10.2.0.3, 10.2.0.4 | 11.1.0.6, ... Platform
Bug:8720802 | Add check for row piece pointing to itself (db_block_checking,dbv,rman,analyze) | 10.2.0.2, 10.2.0.4 | 10.2.0.5, 11.2.0.1.BP07, 11.2.0.2, 12.1.0.0, ... Platforms
Bug:7715244 | Corruption on compressed tables. Error codes 6103 / 6110 | 11.1.0.7 | 11.1.0.7.2, 11.2.0.1, ... Platforms
Bug:4000840 | Update of a row with more than 255 columns can cause block corruption | 10.1.0.3 | 9.2.0.7, 10.1.0.4, 10.2.0.1
Bug:7331181 | ORA-1555 or OERI [kddummy_blkchk] [file#] [block#] [6126] | 11.1.0.6, 11.1.0.7 | 11.2.0.1

Listener connections fail after failover to the standby server

Symptom

• A DB that accepts connections normally on the primary server rejects connections with ORA-12514 and ORA-12505 after a Service Guard failover to the standby server
• On the standby server, the dynamic service is not registered with the listener


- Listener Status (standby server)

- tnsnames.ora

RWP =
  (DESCRIPTION =
    (ADDRESS_LIST =
      (ADDRESS = (PROTOCOL = TCP)(HOST = wmsdb01-pkg)(PORT = 1521))
    )
    (CONNECT_DATA =
      (SERVER = DEDICATED)
      (SERVICE_NAME = wmos)
    )
  )

Cause

• When the listener port is 1521 and the listener binds to the hostname, PMON registers the dynamic service with the listener by default. Here the listener was bound to the VIP rather than to the hostname, so PMON could not register the dynamic service
• On the primary server the VIP entry in /etc/hosts also carries the hostname, so PMON registers the dynamic service there

- /etc/hosts on the primary server
106.10.1.51 wmsdb01-pkg wmsdb01    (wmsdb01 is the hostname)
- /etc/hosts on the standby server
106.10.1.51 wmsdb01-pkg

- listener.ora

SID_LIST_LISTENER =
  (SID_LIST =
    (SID_DESC =
      (SID_NAME = PLSExtProc)
      (ORACLE_HOME = /oracle/RWP/102_64)
      (PROGRAM = extproc)
    )
    (SID_DESC =
      (SID_NAME = RWP)
      (ORACLE_HOME = /oracle/RWP/102_64)
    )
  )
LISTENER =
  (DESCRIPTION_LIST =
    (DESCRIPTION =
      (ADDRESS = (PROTOCOL = TCP)(HOST = wmsdb01-pkg)(PORT = 1521))
      (ADDRESS = (PROTOCOL = IPC)(KEY = EXTPROC0))
    )
  )

- Listener Status (primary server)

wmsdb01:/oracle/RWP]lsnrctl status
LSNRCTL for HPUX: Version 10.2.0.3.0 - Production on 20-FEB-2012 19:40:33
Copyright (c) 1991, 2006, Oracle. All rights reserved.
Connecting to (DESCRIPTION=(ADDRESS=(PROTOCOL=TCP)(HOST=wmsdb01-pkg)(PORT=1521)))
STATUS of the LISTENER
------------------------
Alias LISTENER
Version TNSLSNR for HPUX: Version 10.2.0.3.0 - Production
Start Date 20-FEB-2012 19:34:41
Uptime 0 days 0 hr. 5 min. 51 sec
Trace Level off
Security ON: Local OS Authentication
SNMP OFF
Listener Parameter File /oracle/RWP/102_64/network/admin/listener.ora
Listener Log File /oracle/RWP/102_64/network/log/listener.log
Listening Endpoints Summary...
  (DESCRIPTION=(ADDRESS=(PROTOCOL=tcp)(HOST=wmsdb01)(PORT=1521)))
  (DESCRIPTION=(ADDRESS=(PROTOCOL=ipc)(KEY=EXTPROC0)))
Services Summary...
Service "PLSExtProc" has 1 instance(s).
  Instance "PLSExtProc", status UNKNOWN, has 1 handler(s) for this service...
Service "RWP" has 1 instance(s).
  Instance "RWP", status UNKNOWN, has 1 handler(s) for this service...
Service "wmos" has 1 instance(s).
  Instance "wmos", status READY, has 1 handler(s) for this service...
Service "wmos_XPT" has 1 instance(s).
  Instance "wmos", status READY, has 1 handler(s) for this service...
The command completed successfully


Solution

• Set local_listener="(address=(protocol=tcp)(host=wmsdb01-pkg)(port=1521))"; when the VIP is specified this way, PMON registers the dynamic service against that VIP
• Connect through the static service (SID_LIST)
• Add the hostname to the VIP entry in /etc/hosts on the standby server

- /etc/hosts on the primary server
106.10.1.51 wmsdb01-pkg wmsdb01    (wmsdb01 is the primary server's hostname)
- /etc/hosts on the standby server
106.10.1.51 wmsdb01-pkg wmsap01    (wmsap01 is the standby server's hostname)

Notes

• Database Will Not Register With Listener configured on IP instead of Hostname ORA-12514 [ID 365314.1]
• How The Listener Binds On TCP Protocol Addresses [ID 421305.1]

Database on the other node closes when MRP goes down in an ADG configuration

• In an ADG built on RAC, when MRP terminates abnormally, the instances on the other nodes are closed

Symptom

• ADG Node1 (the redo apply instance) rebooted abnormally
• ADG Node2 changed from the OPEN state to the MOUNT state


[Figure: Primary Database (DB1, DB2) and Standby Database (DB1, DB2). One standby node is marked "Reboot", the other "database close".]

Cause

• When the apply instance crashes or the MRP process fails, ADG instances that are open read-only are switched to the MOUNT state without a DB restart (alter database close)
• This is internal logic that keeps sessions connected to the ADG instance from querying inconsistent data (In an Active Data Guard RAC standby, if the redo apply instance crashes, all other instances of that standby that were open read-only will be closed and returned to the MOUNT state. This will disconnect all readers of the Active Data Guard standby. This is done to prevent any possibility of queries seeing inconsistent data.)

Notes

• alter database close
· Internally, the instance is moved to the MOUNT state with alter database close instead of a shutdown, so no restart is involved
· alter database close cannot run while connected sessions exist, so the sessions are killed internally before it runs
· An instance that was mounted through alter database close cannot be opened again (opening raises ORA-16196)
· A shutdown is required before the mounted DB can be opened
• _abort_on_mrp_crash (planned for 11.2.0.4)
· With the hidden parameter _abort_on_mrp_crash=true, shutdown abort is performed instead of alter database close. It is set because alter database close has to kill the sessions first, and the close can take a long time when killing sessions is slow
• Bug 13147164 - Enhancement to allow failed MRP to abort all instances [ID 13147164.8]


10g single-instance DB sessions holding the OCR file open

Symptom

• With a 10g single-instance DB and an 11g RAC DB on the same node, the OCR volume cannot be deactivated even after 11g CRS is shut down

[Figure: 11g RAC Database (11g RAC DB1, 11g RAC DB2), the OCR, and the 10g Single DB.]

# fuser -fu /dev/svg103/rpsfa_ocrfile01
/dev/svg103/rpsfa_ocrfile01: 12887o(oracon10) 12896o(oracon10) 12905o(oracon10) 12927o(oracon10)
 6316o(oracon10) 12861o(oracon10) 12859o(oracon10) 22193o(oracon10) 24148o(oracon10) 2597o(oracon10)
 24451o(oracon10) 24449o(oracon10) 12900o(oracon10) 28232o(oracon10) 21918o(oracon10) 21895o(oracon10)
 12879o(oracon10) 21880o(oracon10) 28250o(oracon10) 14241o(oracon10) 16569o(oracon10) 28302o(oracon10)
 3742o(oracon10) 21905o(oracon10) 14218o(oracon10) 2607o(oracon10) 24895o(oracon10) 28203o(oracon10)
 12947o(oracon10) 15363o(oracon10) 4620o(oraint10) 28244o(oracon10) 4584o(oraint10) 2613o(oracon10)
 24891o(oracon10) 28219o(oracon10) 12873o(oracon10) 21058o(oracon10) 28240o(oracon10) 28221o(oracon10)

• The 10g listener and the 10g DB sessions connected through that listener hold the OCR file open

Cause

• Depending on whether 11g CRS was up or down when the 10g listener started, the listener behaves as follows
1) CRS down: the 10g listener reads the OCR file directly to check for CRS resources (OCR opened)
2) CRS up: the 10g listener checks for CRS resources through the crsd process (OCR not opened)

Solution

• Disable the listener's CRS notification feature
- Add CRS_NOTIFICATION_listener_name=off to the 10g listener.ora, then start the listener


Instance crash caused by the CF enqueue

Symptom

• While the CKPT background process was holding the CF enqueue, the ARC0 background process waited 900 seconds to acquire it; the CKPT process was then killed and the DB crashed

ORA-00494: enqueue [CF] held for too long (more than 900 seconds) by 'inst 1, osid 1986'
Mon Jan 7 18:42:02 2013
Killing enqueue blocker (pid=1986) on resource CF-00000000-00000000
Mon Jan 7 18:44:48 2013
 by killing session 549.1

Analysis

• CKPT wait event analysis
• CKPT is waiting on the control file parallel write wait event, but the writes are progressing (the block being written keeps changing)

- arc0.trc

waiting for 'control file parallel write' blocking sess=0x0 seq=28110 wait_time=0 seconds since wait started=0
    files=3, block#=1, requests=3
Dumping Session Wait History
 for 'control file parallel write' count=1 wait_time=486553999
    files=3, block#=10, requests=3
 for 'control file parallel write' count=1 wait_time=362512884
    files=3, block#=12, requests=3
 for 'control file parallel write' count=1 wait_time=97149715
    files=3, block#=1fa, requests=3
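The three wait_time values in the wait history can be added up to see how they relate to the 900-second limit. This is a rough sketch that assumes the trace reports wait_time in microseconds; the source does not state the unit, and the names are hypothetical.

```python
# wait_time values from the session wait history in arc0.trc.
# Assumption: the unit is microseconds.
WAIT_TIMES_US = [486553999, 362512884, 97149715]
CF_ENQUEUE_TIMEOUT_S = 900  # "more than 900 seconds" in ORA-00494


def total_wait_seconds(wait_times_us):
    """Sum the individual waits and convert microseconds to seconds."""
    return sum(wait_times_us) / 1_000_000


total = total_wait_seconds(WAIT_TIMES_US)
print(f"{total:.1f} s over {len(WAIT_TIMES_US)} writes; "
      f"exceeds timeout: {total > CF_ENQUEUE_TIMEOUT_S}")
```

Under that assumption, three control file writes took roughly 946 seconds in total, which fits the picture of writes that do complete but far too slowly.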

• CKPT short stack analysis
• The last call is not an Oracle call but the Solaris kernel call kaio() (kernel AIO)

fffffd7ffc781c3a kaio (2, fffffd7fffdfdfc0, 0, 0, 1c00d2df, ffffffff)
0000000003f29e14 skgfospo () + 164
0000000003f27b57 skgfrwat () + 67
0000000001057d41 ksfdwtio () + 181
0000000001054005 ksfdbio () + 795
0000000002d2f3f1 kccwbp () + 361
0000000002d33c9c kccfhb () + ec
0000000002d2b504 kcccmt () + 84
0000000002d2ba72 kccecx () + 42
0000000002cf2aba kcvcca () + 1da
0000000000f82681 ksbcti () + 2f1
0000000000f82d2b ksbabs () + 31b
0000000000f89117 ksbrdp () + 387
00000000034941c6 opirip () + 2a6
00000000019c15ba opidrv () + 3ba
000000000244a65b sou2o () + 5b
0000000000e714ec opimai_real () + 11c
0000000000e71324 main () + 64
0000000000e7116c ???????? ()

Cause

• Not an Oracle bug: Solaris kernel async I/O was running slowly, the _controlfile_enqueue_timeout value of 900 seconds was reached, and the holder (the CKPT process) was killed

Solution

• Analyze I/O at the OS and storage level
• Adjust the _controlfile_enqueue_timeout setting

Notes

• ORA-00494 Or ORA-600 [2103] During High Load After 10.2.0.4 Upgrade [ID 779552.1]
• Related parameters
1) _kill_controlfile_enqueue_blocker = { TRUE | FALSE }
TRUE. Default value. Enables this mechanism and kills blocker process in CF enqueue.
FALSE. Disables this mechanism and no blocker process in CF enqueue will be killed.
2) _kill_enqueue_blocker = { 0 | 1 | 2 | 3 }
0. Disables this mechanism and no foreground or background blocker process in enqueue will be killed.
1. Enables this mechanism and only kills foreground blocker process in enqueue while background process is not affected.
2. Enables this mechanism and only kills background blocker process in enqueue.
3. Default value. Enables this mechanism and kills blocker processes in enqueue.
3) _controlfile_enqueue_timeout = { INTEGER }
900. Default value.
1800. Optimum value to prevent enqueue timeout.


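Symptom

• Within the same cluster environment, a 3-node RAC ledger DB and a batch DB server built from the ledger DB's BCV are in operation
• Every day at 9 PM an operator manually runs scripts to bring up the batch DB from the ledger DB's BCV
• The procedure for bringing up the batch DB is as follows
1) Perform the BCV replication (Begin/End backup omitted, log switch after the backup omitted)
2) Enable the BCV volumes on the batch DB server
3) Rename the data files in the batch DB
4) Open the batch DB

BCV

• While bringing up the batch DB, the operator mistakenly closed the window running the BCV volume enable script
• The data file rename was then run while some BCV volumes were still disabled
• As a result, the batch DB was opened (9:10 PM) with the data files on the disabled volumes not renamed
• At 10:04 PM a log switch occurred on the ledger DB, data file verification failure errors were raised, and the three instances of the ledger DB went down one after another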

[Figure: system layout. Labels: ledger DB (mirrored; DB1, DB2, DB3), batch DB (BCV), DR DB (DR volume), real-time replication (SRDF/Async, SRDF/Sync). Servers: [HP SD] 28CPU / 56GB x 3EA, [HP SD] 12CPU / 60GB, [HP SD] 24CPU / 48GB. Storage: <DMX2000> for each. Software: Oracle 10.1.0.4 3-node RAC, Oracle 10.1.0.4 Single, Oracle 10.1.0.4 Single.]

Errors in file /ora_resource_dbpa2/dump/bdump/dbpa2_ckpt_18083.trc:
ORA-01171: datafile 322 going offline due to error advancing checkpoint
ORA-01122: database file 322 failed verification check
ORA-01110: data file 322: '/dev/vx/rdsk/LPDBDATADG/10gvol_009'
ORA-01207: file is more recent than Control File - old controlfile

ALTER DATABASE RENAME FILE '/dev/vx/rdsk/LPDBDATADG/10gvol_149' TO '/dev/vx/rdsk/BCV_LBDBDG/10gvol_149'
ORA-1511 signalled during: ALTER DATABASE RENAME FILE '/dev/vx/rdsk/LPDBDATAD...

Cause

• Why the ledger DB went down
At 21:05 the batch DB was started while some of its data files were still the ledger DB's own data files
At 22:04 a log switch occurred on the ledger DB, which triggered a full checkpoint
During a full checkpoint the checkpoint information in the control file is checked against the checkpoint information in the data file headers. The check failed for the data files shared with the batch DB, those data files were taken offline, and the DB crashed
• Why the recovery stopped
When the 21:30 redo was applied to a SYSAUX data file, the block SCN in the backup data file and the block SCN in the change vector in the redo log were from different points in time, which raised ORA-00600 [3020]
The block SCNs in the redo log and in the backup data file disagreed because the ledger DB had generated redo by referencing data block SCNs written by the batch DB

• To find the cause of the failure, the data file header information of the ledger DB has to be checked
• Check the DBID and checkpoint information in V$DATAFILE_HEADER or in a data file header dump
• In this case the batch DB is a BCV of the ledger DB, so the DBID is the same
• The batch DB kept running while the ledger DB was down, so the checkpoint information in the shared data files continued to be updated after the ledger DB went down, which confirms the overlap
• The engineer confirmed that the failure was caused by data files shared between the batch DB and the production DB
• The engineer decided to re-create the control file and perform incomplete recovery up to 22:00. 22:00 was chosen because the change jobs start after 22:00
• After a full restore, incomplete recovery was started; while the 21:30 redo was being applied, ORA-00600 [3020] was raised and the recovery stopped. The block was a bitmap block in the SYSAUX tablespace

Errors in file /ora_resource_dbpa3/dump/bdump/dbpa3_p010_9068.trc:
ORA-00600: internal error code, arguments: [3020], [86], [27145], [1], [84713], [333139], [24], []
ORA-10567: Redo is inconsistent with data block (file# 86, block# 27145)
ORA-10564: tablespace SYSAUX
ORA-01110: data file 86: '/dev/vx/rdsk/LPDBDATADG/5gvol_046'
ORA-10560: block type 'FIRST LEVEL BITMAP BLOCK'

SQL> select checkpoint_time,count(*) from v$datafile_header group by checkpoint_time;

CHECKPOINT_TIME     COUNT(*)
----------------- ----------
10/08/23 22:04:18        222
10/08/23 23:30:38        417

• Improve the backup procedure
The BCV backup script was run without BEGIN/END mode, and the log switch after the backup was omitted
Online backups must be run in BEGIN/END mode, and a log switch must be performed right after the backup
Had a log switch or checkpoint been performed after the backup, the batch DB startup would have failed at the data file consistency check and the failure would not have occurred
• Choosing the recovery point
If data files have been overwritten, a stuck recovery can occur during recovery
For a full DB recovery or recovery of key tablespaces, perform incomplete recovery to a point before the data files were overwritten


Notes

• Stuck Recovery (Stuck recovery of database ORA-00600[3020] [ID 283269.1])
. Oracle recovery stops with the error ORA-600 [3020]
. During recovery Oracle checks the consistency of each data block to which redo is applied. The error is raised when the SCN of the data block does not match the data block SCN (change vector SCN) in the archive log or redo log being applied (ORA-00600 [3020])
. Can be caused by an Oracle bug or by a system I/O problem (OS or disk error)
. Can also occur when blocks have been overwritten with data file blocks from another DB or from another point in time
. Recovery methods
1) Ordinary data tablespace (e.g. USERS)
- Recovery can continue by forcibly marking the block as corrupt
SQL> RECOVER DATABASE ALLOW N CORRUPTION; (mark as corrupt, then continue recovery)
- Note that the user data in that block is lost
2) SYSTEM or UNDO tablespace
- If the block is a critical block, the DB may fail to open
- In that case, do a full restore, recover up to just before the log that contains the corruption, and open with the RESETLOGS option (incomplete recovery)
3) SYSAUX tablespace
- Disable AWR and re-create the objects (How to recreate tables from SYSAUX Tablespace [ID 333665.1])
. Run a trial recovery beforehand to test whether the recovery is likely to get stuck (Trial Recovery [ID 283262.1])
SQL> RECOVER DATABASE TEST ALLOW N CORRUPTION;
SQL> RECOVER DATABASE TEST
SQL> RECOVER DATABASE USING BACKUP CONTROLFILE UNTIL CANCEL TEST
SQL> RECOVER TABLESPACE TEST
SQL> RECOVER DATABASE UNTIL CANCEL TEST
※ Trial recovery tests the recovery in memory only, without applying changes to the actual backup files, so block corruption can be detected in advance.

• When an instance starts up, it performs the following two checks
① Whether the checkpoint counter in the control file matches the checkpoint counter in each data file header
② Whether the checkpoint SCN (start SCN) in every data file header matches the stop SCN in the control file
. If both match, Oracle opens the DB without recovery
. If the first check fails (checkpoint counters of the data file header and the control file differ), media recovery must be performed
. If the second check fails (the checkpoint SCN in the data file header differs from the stop SCN in the control file), instance recovery is performed
. Instance recovery automatically applies all redo of the thread generated since that thread's last checkpoint (up to the end of thread in the current redo log)
→ During this step, redo is skipped for blocks whose SCN is higher than the SCN of the redo record
. When instance recovery ends, every kind of fuzzy bit is cleared from the data files and a redo record marking the end of instance recovery (end-crash-recovery redo) is generated
→ Media recovery uses this redo record to clear the online and hot backup fuzzy bits of the data file
. After instance recovery completes, a log switch occurs automatically and the checkpoint counter is incremented by 1
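The two startup checks can be summarized as a small decision function. This is a simplified model for illustration, not Oracle's internal logic: the class and field names are hypothetical, one control file record is assumed per data file, and a missing stop SCN (None) is an assumed way to represent a file that was not closed cleanly.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class DataFileHeader:
    checkpoint_counter: int
    checkpoint_scn: int            # "start SCN" in the text


@dataclass
class ControlFileRecord:
    checkpoint_counter: int
    stop_scn: Optional[int]        # None: no stop SCN recorded (assumption)


def required_recovery(headers: List[DataFileHeader],
                      records: List[ControlFileRecord]) -> str:
    """Return which recovery the two startup checks call for."""
    if len(headers) != len(records):
        raise ValueError("one control file record is expected per data file")
    pairs = list(zip(headers, records))
    # Check 1: checkpoint counters must match, otherwise media recovery.
    if any(h.checkpoint_counter != r.checkpoint_counter for h, r in pairs):
        return "media recovery"
    # Check 2: header checkpoint SCN must equal the stop SCN,
    # otherwise instance recovery.
    if any(r.stop_scn is None or h.checkpoint_scn != r.stop_scn
           for h, r in pairs):
        return "instance recovery"
    return "open without recovery"


clean = required_recovery([DataFileHeader(10, 500)], [ControlFileRecord(10, 500)])
crashed = required_recovery([DataFileHeader(10, 500)], [ControlFileRecord(10, None)])
restored = required_recovery([DataFileHeader(8, 400)], [ControlFileRecord(10, 500)])
print(clean, crashed, restored, sep=" | ")
```

The counter check runs first in this sketch, so a restored (older) data file is reported as needing media recovery even when its SCN also differs.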
