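Environment: a few days ago I set up a two-node, single-instance Linux + Oracle 11.2.0.3 + Data Guard environment running in maximum availability mode.

Symptom: today I found that the standby was no longer synchronizing. The alert log alert_orcl.log (in the trace directory) shows errors indicating that the MRP process cannot start: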

MRP0: Background Media Recovery terminated with error 328

ORA-00328: 8386238 , 8972415

ORA-00334: '/opt/oracle/fast_recovery_area/DG_BEI/archivelog/2014_09_28/o1_mf_1_830_b2h9mjmn_.arc'

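The message text is missing from the ORA- lines above. ORA-00328 normally reads "archived log ends at change 8386238, need later change 8972415", and ORA-00334 names the archived log concerned. In other words, recovery needs redo beyond the point where this copy of sequence 830 ends.

Before these errors, alert_orcl.log also reported a network disconnect from the primary: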

RFS[4]: Assigned to RFS process 20669

RFS[4]: Possible network disconnect with primary database
Sun Sep 28 13:23:48 2014

 

The full alert log output around "MRP0: Background Media Recovery terminated with error 328" is as follows:

Mon Sep 29 14:32:28 2014

alter database recover managed standby database using current logfile disconnect from session nodelay
Attempt to start background Managed Standby Recovery process (orcl)
Mon Sep 29 14:32:28 2014
MRP0 started with pid=27, OS id=23710
MRP0: Background Managed Standby Recovery process started (orcl)
 started logmerger process
Mon Sep 29 14:32:33 2014
Managed Standby Recovery starting Real Time Apply
Parallel Media Recovery started with 2 slaves
Waiting for all non-current ORLs to be archived...
All non-current ORLs have been archived.
Clearing online redo logfile 1 /opt/oracle/oradata/orcl/redo01.log
Clearing online log 1 of thread 1 sequence number 835
Clearing online redo logfile 1 complete
Clearing online redo logfile 2 /opt/oracle/oradata/orcl/redo02.log
Clearing online log 2 of thread 1 sequence number 834
Clearing online redo logfile 2 complete
Clearing online redo logfile 3 /opt/oracle/oradata/orcl/redo03.log
Clearing online log 3 of thread 1 sequence number 835
Clearing online redo logfile 3 complete
Media Recovery Log /opt/oracle/fast_recovery_area/DG_BEI/archivelog/2014_09_28/o1_mf_1_830_b2h9mjmn_.arc
Errors with log /opt/oracle/fast_recovery_area/DG_BEI/archivelog/2014_09_28/o1_mf_1_830_b2h9mjmn_.arc
MRP0: Background Media Recovery terminated with error 328
Errors in file /opt/oracle/diag/rdbms/dg_bei/orcl/trace/orcl_pr00_23712.trc:
ORA-00328: 8386238 , 8972415
ORA-00334: : '/opt/oracle/fast_recovery_area/DG_BEI/archivelog/2014_09_28/o1_mf_1_830_b2h9mjmn_.arc'
Managed Standby Recovery not using Real Time Apply
Recovery interrupted!
Completed: alter database recover managed standby database using current logfile disconnect from session nodelay
MRP0: Background Media Recovery process shutdown (orcl)
Mon Sep 29 14:34:06 2014
RFS[1]: Assigned to RFS process 23726
RFS[1]: Opened log for thread 1 sequence 836 dbid 1356850190 branch 829069458
Archived Log entry 77 added for thread 1 sequence 836 rlc 829069458 ID 0x50df5e0e dest 2:
Mon Sep 29 14:34:08 2014
Primary database is in MAXIMUM AVAILABILITY mode
Changing standby controlfile to RESYNCHRONIZATION level
Standby controlfile consistent with primary
RFS[2]: Assigned to RFS process 23728
RFS[2]: Selected log 4 for thread 1 sequence 838 dbid 1356850190 branch 829069458
Mon Sep 29 14:34:08 2014
RFS[3]: Assigned to RFS process 23730
RFS[3]: Selected log 5 for thread 1 sequence 837 dbid 1356850190 branch 829069458
Mon Sep 29 14:34:08 2014
Archived Log entry 78 added for thread 1 sequence 837 ID 0x50df5e0e dest 1:
Changing standby controlfile to MAXIMUM AVAILABILITY level
RFS[2]: Selected log 5 for thread 1 sequence 839 dbid 1356850190 branch 829069458
Mon Sep 29 14:34:11 2014
Archived Log entry 79 added for thread 1 sequence 838 ID 0x50df5e0e dest 1:
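
In a long alert log, the lines that matter here (ORA- errors, the MRP termination and the RFS network disconnect) are easy to miss. The following is a minimal sketch of a scanner for them, not part of the original troubleshooting. It assumes timestamp lines have the form shown above (for example "Mon Sep 29 14:32:28 2014"); the function name `scan_alert_log` is hypothetical.

```python
import re

# Alert log timestamp lines look like "Mon Sep 29 14:32:28 2014".
TIMESTAMP = re.compile(
    r"^[A-Z][a-z]{2} [A-Z][a-z]{2} [ \d]\d \d{2}:\d{2}:\d{2} \d{4}$"
)
# Lines worth flagging: ORA- errors, MRP termination, RFS network disconnects.
INTERESTING = re.compile(
    r"ORA-\d{5}|terminated with error|Possible network disconnect"
)


def scan_alert_log(lines):
    """Return (timestamp, line) pairs for the interesting lines.

    The timestamp is the most recent timestamp line seen before the match,
    or None if the excerpt starts without one.
    """
    events = []
    current = None
    for raw in lines:
        line = raw.strip()
        if TIMESTAMP.match(line):
            current = line
        elif INTERESTING.search(line):
            events.append((current, line))
    return events


if __name__ == "__main__":
    # Path is an assumption; adjust to the real alert log location.
    with open("alert_orcl.log", encoding="utf-8", errors="replace") as handle:
        for stamp, text in scan_alert_log(handle):
            print(stamp, "|", text)
```

Run against the excerpt above, this would flag the two "terminated with error 328" lines together with the ORA-00328 and ORA-00334 lines, each tagged with the preceding timestamp.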

 

Resolution:

1. On the standby, check the recovery-related processes. The MRP process is indeed missing.

select process,status,sequence# from v$managed_standby;

 

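2. On the standby, check the archived log view.

SQL> select name,sequence#,applied from v$archived_log;

The odd thing: the archived log named in the alert log error is shown as applied (highlighted in the red box in the screenshot).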

 

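3. Compare the 2014_09_28 archived logs on the primary and the standby. Both the number of files and their sizes differ: the standby has more log files than the primary, but they are smaller, as shown in the screenshot. (My guess is that the network dropped just as the logs were being shipped, which caused the size mismatch.)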
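
The comparison in step 3 was done by eye. A rough way to automate it is sketched below, under several assumptions: both archive directories are reachable from one machine (for example after copying the primary's directory to a scratch location), and the files follow the OMF naming pattern seen above, `o1_mf_<thread>_<sequence>_<suffix>_.arc`. Files are matched by thread and sequence number rather than by name, and only sizes are compared, so this is a heuristic and not proof that a log is intact. The function names are hypothetical.

```python
import re
from collections import defaultdict
from pathlib import Path

# OMF archived log names, e.g. o1_mf_1_830_b2h9mjmn_.arc
OMF_ARCHIVE = re.compile(r"^o1_mf_(\d+)_(\d+)_\w+_\.arc$")


def index_by_sequence(directory):
    """Map (thread, sequence) -> list of (file name, size in bytes)."""
    index = defaultdict(list)
    for path in Path(directory).iterdir():
        match = OMF_ARCHIVE.match(path.name)
        if path.is_file() and match:
            key = (int(match.group(1)), int(match.group(2)))
            index[key].append((path.name, path.stat().st_size))
    return index


def compare_archive_dirs(primary_dir, standby_dir):
    """Return sequences whose file count or sizes differ between the sides."""
    primary = index_by_sequence(primary_dir)
    standby = index_by_sequence(standby_dir)
    suspects = {}
    for key in sorted(primary.keys() | standby.keys()):
        p_files = sorted(primary.get(key, []))
        s_files = sorted(standby.get(key, []))
        p_sizes = sorted(size for _, size in p_files)
        s_sizes = sorted(size for _, size in s_files)
        if p_sizes != s_sizes:
            suspects[key] = {"primary": p_files, "standby": s_files}
    return suspects
```

Calling `compare_archive_dirs` on the two 2014_09_28 directories would list each thread and sequence where the standby holds extra, missing, or differently sized files, which is the pattern described in step 3.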

 

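4. After some Googling, the cause appears to be a failed archived log transfer. The fix is to copy the logs from the primary to the standby and then run recovery again.

4.1 To be safe, first back up the standby's 2014_09_28 archived logs, then delete the 2014_09_28 archive directory on the standby.

4.2 Use scp to copy the 2014_09_28 archive directory from the primary to the standby.

4.3 Run recovery on the standby.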

SQL> alter database recover managed standby database using current logfile disconnect;

The alert log now shows normal output.

 

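5. On the standby, check the archived logs again.

SQL> select name,sequence#,applied from v$archived_log;

Logs are now applied normally, although one log (highlighted in the red box in the screenshot) has not been applied. My guess is that this file was not among the logs copied from the primary (only 5 files were copied, whereas the standby originally had 8). I checked the data and it is correct, so the problem can be considered solved.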
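
The guess about the unapplied log could be checked rather than assumed. The sketch below takes rows shaped like the output of the query in step 5 (name, sequence#, applied) and reports every entry whose APPLIED value is not YES, together with whether the file it points to still exists on disk. How the rows are fetched is left out, and the function name and the sample row in the usage note are hypothetical.

```python
import os


def unapplied_entries(rows, exists=os.path.exists):
    """List v$archived_log-style rows that are not marked as applied.

    rows:   iterable of (name, sequence, applied) tuples.
    exists: callable used to test for the file; replaceable for testing.
    """
    result = []
    for name, sequence, applied in rows:
        if (applied or "").upper() != "YES":
            result.append(
                {
                    "sequence": sequence,
                    "name": name,
                    "applied": applied,
                    # A NULL or empty name means there is no file to look for.
                    "file_present": bool(name) and exists(name),
                }
            )
    return result
```

For example, `unapplied_entries([("/some/path/old_copy.arc", 830, "NO")])` would return one entry with `file_present` set to False if that path no longer exists, which would be consistent with the entry pointing at one of the removed standby files.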