Troubleshooting a TSM server path going offline due to tape errors




Failure phenomenon


The on-duty backup operator at a customer site reported that one drive on the TSM server was working abnormally: only one drive path could be used for backups, and some scheduled backups failed because no path was available.


The engineer took the call and started on the problem immediately. After logging in, he observed the following symptoms:


• The path from the server to DRIVE01 is offline:


tsm: tsm-server>q path

 

Source Name     Source Type     Destination     Destination     On-Line
                                Name            Type
-----------     -----------     -----------     -----------     -------
tsm-server      SERVER          ts3200lib       LIBRARY         Yes
tsm-server      SERVER          DRIVE01         DRIVE           No
tsm-server      SERVER          DRIVE02         DRIVE           Yes
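The same check can be scripted when several servers have to be watched. The following is a minimal sketch, assuming the q path output has been captured as plain text in the column layout shown above; the function name offline_paths and the SAMPLE constant are hypothetical and not part of TSM.

```python
def offline_paths(q_path_output: str) -> list[tuple[str, str]]:
    """Return (source, destination) pairs whose On-Line column is 'No'.

    Assumes every data row has exactly five whitespace-separated fields:
    source name, source type, destination name, destination type, on-line.
    """
    offline = []
    for line in q_path_output.splitlines():
        fields = line.split()
        # Skip blank lines, prompts, header rows and the dashed separator.
        if len(fields) != 5 or fields[4] not in ("Yes", "No"):
            continue
        if fields[4] == "No":
            offline.append((fields[0], fields[2]))
    return offline


# Sample text copied from the q path output in this case.
SAMPLE = """\
Source Name     Source Type     Destination     Destination     On-Line
                                Name            Type
-----------     -----------     -----------     -----------     -------
tsm-server      SERVER          ts3200lib       LIBRARY         Yes
tsm-server      SERVER          DRIVE01         DRIVE           No
tsm-server      SERVER          DRIVE02         DRIVE           Yes
"""

print(offline_paths(SAMPLE))  # [('tsm-server', 'DRIVE01')]
```

Names that contain spaces would break the five-field assumption, so this only suits simple layouts like the one in this case.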


• The TS3200 tape library management interface shows that the library status is normal and in the "Ready" state:

image001.jpg

• The activity log contains a large number of I/O errors:


Q act search=error begind=-1 endd=+1

10/02/20 13:19:32 ANR8310E An I/O error occurred while accessing library ts3200lib. (SESSION: 4566100)
10/02/20 13:20:04 ANR8310E An I/O error occurred while accessing library ts3200lib. (SESSION: 4566100)
10/02/20 13:20:35 ANR8310E An I/O error occurred while accessing library ts3200lib. (SESSION: 4566100)
10/02/20 13:21:07 ANR8310E An I/O error occurred while accessing library ts3200lib. (SESSION: 4566100)
10/02/20 13:21:39 ANR8310E An I/O error occurred while accessing library ts3200lib. (SESSION: 4566100)
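When the activity log is long, counting how often each message ID appears makes a repeating error such as ANR8310E stand out quickly. The sketch below is one possible way to do that, assuming the log has been saved as plain text; the names count_messages and SAMPLE_LOG are hypothetical. It relies only on the fact that TSM server messages begin with "ANR", followed by four digits and a type letter.

```python
import re
from collections import Counter

# TSM server message IDs: "ANR", four digits, then one type letter
# (for example E for error).
MESSAGE_ID = re.compile(r"\bANR\d{4}[A-Z]\b")


def count_messages(actlog_text: str) -> Counter:
    """Count occurrences of each ANR message ID in activity log text."""
    return Counter(MESSAGE_ID.findall(actlog_text))


# Three entries copied from the activity log in this case; an entry may
# wrap onto a second line, which does not affect the count.
SAMPLE_LOG = """\
10/02/20 13:19:32 ANR8310E An I/O error occurred while accessing library
ts3200lib. (SESSION: 4566100)
10/02/20 13:20:04 ANR8310E An I/O error occurred while accessing library
ts3200lib. (SESSION: 4566100)
10/02/20 13:20:35 ANR8310E An I/O error occurred while accessing library
ts3200lib. (SESSION: 4566100)
"""

for message_id, count in count_messages(SAMPLE_LOG).most_common():
    print(message_id, count)  # ANR8310E 3
```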




Failure Analysis



In most cases a TSM server path goes offline because of read/write errors caused by a drive failure.

In past experience, when a drive fails, a red Error alarm like the one in the following figure appears in the TS3200 web management interface.

No such alarm appeared in this case, which suggests that the drives in the library had not failed. So why did the path go offline?


image002.jpg


1. Preliminary judgment




The first attempt was to reboot the tape library and bring the offline path back online:


• update drive ts3200lib DRIVE01 online=yes: this command executed successfully, indicating that the drive status was normal;


• A backup was started again and q mount showed a tape mounted in the drive, but after a while the path went offline again.


Based on these symptoms, the preliminary judgment was that the drive had been damaged because it was not cleaned in time. Until then, drives had only been cleaned after a cleaning alarm appeared, which can easily damage a drive. The conclusion at this stage: DRIVE01 needed to be replaced.


2. Discover new clues


While waiting for the replacement drive, the engineer kept checking the backups. Running update drive ts3200lib DRIVE01 online=yes succeeded and the path came back online, but as soon as tape CI0898L5 was mounted in the drive, the path went offline again.

The activity log showed that the backup of this node had failed with a large number of I/O errors. Checking the corresponding tape showed that it was unavailable and in an error state:

tsm: TSM>QUERY VOLUME CI0898L5 f=d


Volume Name: CI0898L5

Storage Pool Name: ORA_POOL
Estimated Capacity: 3.0 T
Access: Unavailable
In Error State?: Yes
Number of Writable Sides: 1
Number of Times Mounted: 751
Write Pass Number: 1
Number of Write Errors: 1
Number of Read Errors: 0



A SQL query against the TSM server database (DB2) for tapes with read/write errors found 2 other tapes showing the same symptom:


Tsm: TSM6>SELECT volumes.volume_name, volumes.stgpool_name, volumes.pct_utilized, volumes.status, volumes.write_errors, volumes.read_errors FROM volumes, libvolumes WHERE volumes.volume_name=libvolumes.volume_name AND ( volumes.write_errors>0 OR volumes.read_errors>0 )

VOLUME_NAME: C04780L5
STGPOOL_NAME: ORA_POOL
PCT_UTILIZED: 17.7
STATUS: FILLING
WRITE_ERRORS: 1
READ_ERRORS: 0

VOLUME_NAME: NIN039L5
STGPOOL_NAME: ORA_POOL
PCT_UTILIZED: 100.0
STATUS: FILLING
WRITE_ERRORS: 1
READ_ERRORS: 0


Whenever one of these 3 unavailable tapes is mounted in DRIVE01, the TSM server determines that the tape in the drive is unusable, which takes the path offline.
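If the SELECT output is saved to a file, the suspect volumes can be pulled out of it automatically. The sketch below is only an illustration, assuming the output keeps the "KEY: value" layout shown above, with each record starting at VOLUME_NAME; the names parse_stanzas and SAMPLE_OUTPUT are hypothetical.

```python
def parse_stanzas(text: str) -> list[dict[str, str]]:
    """Group 'KEY: value' lines into one dict per volume.

    A new record starts each time the VOLUME_NAME key appears.
    """
    records: list[dict[str, str]] = []
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if not sep:
            continue  # blank line or a line without a colon
        key, value = key.strip(), value.strip()
        if key == "VOLUME_NAME":
            records.append({})
        if records:  # ignore anything before the first record, e.g. a prompt
            records[-1][key] = value
    return records


# Sample copied from the query output in this case.
SAMPLE_OUTPUT = """\
VOLUME_NAME: C04780L5
STGPOOL_NAME: ORA_POOL
PCT_UTILIZED: 17.7
STATUS: FILLING
WRITE_ERRORS: 1
READ_ERRORS: 0

VOLUME_NAME: NIN039L5
STGPOOL_NAME: ORA_POOL
PCT_UTILIZED: 100.0
STATUS: FILLING
WRITE_ERRORS: 1
READ_ERRORS: 0
"""

suspect = [
    record["VOLUME_NAME"]
    for record in parse_stanzas(SAMPLE_OUTPUT)
    if int(record["WRITE_ERRORS"]) > 0 or int(record["READ_ERRORS"]) > 0
]
print(suspect)  # ['C04780L5', 'NIN039L5']
```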



3. Analyze the cause

When a tape library has been in service for a long time, when the drive heads are not cleaned in time, or when a cleaning tape is used beyond its service life, tape read/write errors become likely. TSM then marks the affected tape as "Unavailable" with In Error State=Yes.


Given the symptoms above, and given that the backup process reuses tapes cyclically (a tape is reclaimed and reused once its data passes the retention period and is deleted), the cause of this failure was determined to be tape read/write errors that took the path offline.




Troubleshooting



1. Mark the 3 unavailable tapes as "readonly" so that TSM does not write to them:


UPDATE VOLUME CI0898L5 access=readonly

UPDATE VOLUME NIN039L5 access=readonly

UPDATE VOLUME C04780L5 access=readonly


2. Check out the 3 unusable tapes;


3. Check in 3 new tapes;


4. Since ORA_POOL has no copy storage pool, the only option is to start a manual full backup immediately to make sure the latest backup version exists;


5. Delete the path and define it again:


define path tsm-server DRIVE01 SRCTYPE=SERVER DESTTYPE=DRIVE LIBRARY=ts3200lib device=/dev/rmt1 online=yes
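Step 5 involves two administrative commands, a DELETE PATH followed by a DEFINE PATH with the same source, destination and library. The sketch below only builds the two command strings so they can be reviewed before being run. It assumes the source and drive names from the q path output earlier in this case (tsm-server and DRIVE01); the helper name rebuild_path_commands is hypothetical.

```python
def rebuild_path_commands(source: str, drive: str, library: str,
                          device: str) -> list[str]:
    """Return the DELETE PATH and DEFINE PATH commands for one drive path."""
    # Both commands identify the path by the same source, destination,
    # source type, destination type and library.
    common = f"{source} {drive} SRCTYPE=SERVER DESTTYPE=DRIVE LIBRARY={library}"
    return [
        f"DELETE PATH {common}",
        f"DEFINE PATH {common} DEVICE={device} ONLINE=YES",
    ]


# Values taken from this case; adjust them for another environment.
for command in rebuild_path_commands("tsm-server", "DRIVE01",
                                     "ts3200lib", "/dev/rmt1"):
    print(command)
```

Generating both commands from one set of values avoids the delete and the define drifting apart through a typing mistake in a drive or library name.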


At this point, the backup is back to normal and the problem is solved.




Summary of experience



1. Most TSM server path-offline failures are caused by drive failures; for LAN-free nodes, configuration errors can also be the cause. A small share, however, is caused by tape read/write errors, which deserve particular attention for tapes that have been in cyclic use for a long time;


2. It is recommended to enable automatic drive cleaning and to check the remaining uses of the cleaning tape regularly;


3. It is recommended to check the execution status of schedules promptly;


4. It is recommended to check the status of the tapes in the tape library frequently with SQL commands; tapes reported with WRITE_ERRORS>1 should be considered for replacement.
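The regular tape check in point 4 can be run unattended through the dsmadmc administrative client. The sketch below is a rough outline under several assumptions: dsmadmc is installed and on the PATH, the administrator ID and password shown are placeholders, and the output is requested with the -dataonly=yes and -commadelimited options. The names QUERY, dsmadmc_args and parse_rows are hypothetical, and the query is a shortened form of the SELECT used in this case.

```python
import csv
import io

# Shortened form of the SELECT used earlier in this case.
QUERY = (
    "SELECT volumes.volume_name, volumes.stgpool_name, "
    "volumes.write_errors, volumes.read_errors "
    "FROM volumes, libvolumes "
    "WHERE volumes.volume_name=libvolumes.volume_name "
    "AND (volumes.write_errors>0 OR volumes.read_errors>0)"
)


def dsmadmc_args(admin_id: str, password: str, query: str = QUERY) -> list[str]:
    """Build the argument list for a non-interactive dsmadmc call."""
    return ["dsmadmc", f"-id={admin_id}", f"-password={password}",
            "-dataonly=yes", "-commadelimited", query]


def parse_rows(output: str) -> list[dict]:
    """Parse comma-delimited rows: name, pool, write errors, read errors."""
    rows = []
    for record in csv.reader(io.StringIO(output)):
        # Keep only well-formed data rows with numeric error counters.
        if len(record) != 4 or not (record[2].isdigit() and record[3].isdigit()):
            continue
        rows.append({"volume": record[0], "pool": record[1],
                     "write_errors": int(record[2]),
                     "read_errors": int(record[3])})
    return rows


# On a real system the output would come from something like:
#   subprocess.run(dsmadmc_args("admin", "secret"),
#                  capture_output=True, text=True).stdout
# Here a sample built from the values in this case is used instead.
SAMPLE_CSV = "C04780L5,ORA_POOL,1,0\nNIN039L5,ORA_POOL,1,0\n"

for row in parse_rows(SAMPLE_CSV):
    print(row["volume"], row["write_errors"], row["read_errors"])
```

A password passed on the command line is visible to other users on the same host, so a scheduled job would need a safer way to supply credentials.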


