Failure symptoms
A customer's TSM backup operator reported that one drive on the TSM server was behaving abnormally: only one channel (drive) could be used for backups,
and some scheduled backups failed because no channel was available.
An engineer took the call and began troubleshooting immediately.
After logging in, he observed the following symptoms:
- The path from the server to DRIVE01 is offline:
tsm: tsm-server>q path
Source Name   Source Type   Destination   Destination   On-Line
                            Name          Type
-----------   -----------   -----------   -----------   -------
tsm-server    SERVER        ts3200lib     LIBRARY       Yes
tsm-server    SERVER        DRIVE01       DRIVE         No
tsm-server    SERVER        DRIVE02       DRIVE         Yes
- The TS3200 tape library management interface shows that
the tape library status is normal and in the "Ready" state.
- The activity log contains a large number of I/O errors:
Q act search=error begind=-1 endd=+1
10/02/20 13:19:32 ANR8310E An I/O error occurred while accessing library
ts3200lib. (SESSION: 4566100)
10/02/20 13:20:04 ANR8310E An I/O error occurred while accessing library
ts3200lib. (SESSION: 4566100)
10/02/20 13:20:35 ANR8310E An I/O error occurred while accessing library
ts3200lib. (SESSION: 4566100)
10/02/20 13:21:07 ANR8310E An I/O error occurred while accessing library
ts3200lib. (SESSION: 4566100)
10/02/20 13:21:39 ANR8310E An I/O error occurred while accessing library
ts3200lib. (SESSION: 4566100)
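When the activity log is long, it can help to count how often each message ID occurs before reading individual entries. The following is a minimal sketch, assuming the output of the query above has been saved as plain text; the function name and the sample variable are hypothetical and not part of TSM.

```python
import re
from collections import Counter

# Matches server message IDs such as ANR8310E
# (prefix ANR, four digits, one severity letter).
MESSAGE_ID = re.compile(r"\bANR\d{4}[A-Z]\b")


def count_message_ids(actlog_text):
    """Count how often each ANR message ID appears in activity log text."""
    return Counter(MESSAGE_ID.findall(actlog_text))


# Sample lines copied from the activity log excerpt in this case.
sample = """\
10/02/20 13:19:32 ANR8310E An I/O error occurred while accessing library
ts3200lib. (SESSION: 4566100)
10/02/20 13:20:04 ANR8310E An I/O error occurred while accessing library
ts3200lib. (SESSION: 4566100)
"""

for message_id, count in count_message_ids(sample).most_common():
    print(message_id, count)
```

A message ID that dominates the count, as ANR8310E does here, is a reasonable starting point for the analysis.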
Failure Analysis
Most path-offline conditions on a TSM server are caused by read/write errors resulting from drive failure.
In past experience, when a drive fails,
a red Error alarm similar to the one in the following figure appears in the TS3200 web management interface.
No such alarm appeared in this case, which suggests that the drive in the library had not failed. So why did the path go offline?
1. Preliminary judgment
Reboot the tape library and bring the offline path back online to try to resolve the problem:
- update drive ts3200lib DRIVE01 online=yes: this command succeeds, indicating that the drive status is normal;
- Start the backup again. q mount shows the tape mounted in the drive, but after a while the path goes offline again.
From these symptoms, the preliminary judgment was that the drive had been damaged because it was not cleaned in time.
Previously, the drive was cleaned only after a cleaning alarm had appeared, a practice that can easily damage a drive.
The conclusion at this stage: DRIVE01 needs to be replaced.
2. Discover new clues
While waiting for the replacement drive, the engineer checked the backups again. Running update drive ts3200lib DRIVE01 online=yes succeeded
and the path came online, but as soon as tape CI0898L5 was mounted in the drive,
the path went offline immediately. The activity log showed that the backup of this node had failed
with a large number of I/O errors. Checking the corresponding tape showed that it was unavailable and in an error state
(In Error State):
tsm: TSM>QUERY VOLUME CI0898L5 f=d
Volume Name: CI0898L5
Storage Pool Name: ORA_POOL
Estimated Capacity: 3.0 T
Number of Writable Sides: 1
Number of Times Mounted: 751
Number of Write Errors: 1
A SQL query against the server database (DB2) for tapes with read/write errors found 2 other tapes with similar symptoms:
Tsm: TSM6>SELECT volumes.volume_name, volumes.stgpool_name, volumes.pct_utilized, volumes.status, volumes.write_errors, volumes.read_errors FROM volumes, libvolumes WHERE volumes.volume_name=libvolumes.volume_name AND ( volumes.write_errors>0 OR volumes.read_errors>0 )
VOLUME_NAME: C04780L5
STGPOOL_NAME: ORA_POOL
PCT_UTILIZED: 17.7
STATUS: FILLING
WRITE_ERRORS: 1
READ_ERRORS: 0
VOLUME_NAME: NIN039L5
STGPOOL_NAME: ORA_POOL
PCT_UTILIZED: 100.0
STATUS: FILLING
WRITE_ERRORS: 1
READ_ERRORS: 0
Whenever one of these 3 unavailable tapes is mounted in DRIVE01, the TSM server determines that the tape in the drive is unavailable,
which takes the path offline.
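The "NAME: value" output of the SELECT above can also be checked by a script instead of by eye. The following is a minimal sketch, assuming the output has been captured as text and that every record starts with a VOLUME_NAME line; the function names are hypothetical helpers, not TSM commands.

```python
def parse_stanzas(output):
    """Split 'NAME: value' output into one dict per volume.

    Assumption: each record begins with a VOLUME_NAME line.
    """
    records = []
    current = {}
    for line in output.splitlines():
        if ":" not in line:
            continue  # skip blank lines and anything that is not a field
        key, _, value = line.partition(":")
        key = key.strip()
        if key == "VOLUME_NAME" and current:
            records.append(current)
            current = {}
        current[key] = value.strip()
    if current:
        records.append(current)
    return records


def volumes_with_errors(records):
    """Return the names of volumes with any read or write errors."""
    return [
        r["VOLUME_NAME"]
        for r in records
        if int(r.get("WRITE_ERRORS", "0")) > 0
        or int(r.get("READ_ERRORS", "0")) > 0
    ]


# Sample copied from the query output in this case.
sample_output = """\
VOLUME_NAME: C04780L5
STGPOOL_NAME: ORA_POOL
PCT_UTILIZED: 17.7
STATUS: FILLING
WRITE_ERRORS: 1
READ_ERRORS: 0
VOLUME_NAME: NIN039L5
STGPOOL_NAME: ORA_POOL
PCT_UTILIZED: 100.0
STATUS: FILLING
WRITE_ERRORS: 1
READ_ERRORS: 0
"""

print(volumes_with_errors(parse_stanzas(sample_output)))
```

The filter repeats the WHERE condition of the query, so it still works if the query is later broadened to return all volumes.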
3. Analyze the cause
When a tape library has been in service for a long time, the drive heads are not cleaned in time, or a cleaning tape is used too many times,
tape read/write errors are bound to occur, and the affected tapes are marked "Unavailable" with In Error State=yes.
Given the symptoms in this case, and the fact that the backup process reuses tapes cyclically
(that is, a tape is reclaimed once its data has passed the retention period and been deleted),
the cause of this failure was determined to be tape read/write errors, which in turn took the path offline.
Troubleshooting
1. Mark the 3 unavailable tapes as "readonly" so that TSM no longer writes to them:
UPDATE VOLUME CI0898L5 access=readonly
UPDATE VOLUME NIN039L5 access=readonly
UPDATE VOLUME C04780L5 access=readonly
2. Check the 3 unusable tapes out of the library;
3. Check in 3 new tapes;
4. Because ORA_POOL has no copy storage pool, the only option is to start a manual full backup immediately to make sure
the latest backup version exists;
5. Delete the path and redefine it:
define path tsm DRIVE1 SRCTYPE=SERVER DESTTYPE=DRIVE LIBRARY=ts3200lib device=/dev/rmt1 online=yes
At this point, the backup is back to normal and the problem is solved.
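When more than a handful of tapes are affected, the UPDATE VOLUME commands from step 1 can be generated from a list of volume names rather than typed by hand. The following is a minimal sketch; the function name is hypothetical, and the name check assumes plain alphanumeric tape labels such as CI0898L5. The generated lines still have to be run in the administrative client by an operator.

```python
import re

# Assumption: volume names are plain alphanumeric tape labels, e.g. CI0898L5.
VOLUME_NAME = re.compile(r"[A-Z0-9]{1,32}")


def readonly_commands(volume_names):
    """Return one UPDATE VOLUME command per volume, rejecting odd names."""
    commands = []
    for name in volume_names:
        name = name.strip().upper()
        if not VOLUME_NAME.fullmatch(name):
            raise ValueError(f"unexpected volume name: {name!r}")
        commands.append(f"UPDATE VOLUME {name} access=readonly")
    return commands


# The three tapes identified in this case.
for command in readonly_commands(["CI0898L5", "NIN039L5", "C04780L5"]):
    print(command)
```

Rejecting unexpected names up front keeps a stray character in the input list from ending up inside a server command.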
Lessons learned
1. Most TSM server path-offline conditions are caused by drive failure; for LAN-free nodes, configuration problems can also cause the error.
A small share, however, is caused by tape read/write errors,
which deserve particular attention for tapes that have been in cyclic use for a long time;
2. Use automatic drive cleaning where possible, and regularly check the remaining uses of the cleaning tape;
3. Check the execution status of schedules promptly;
4. Check the status of the tapes in the tape library frequently with SQL commands;
tapes that report WRITE_ERRORS of 1 or more, like the three in this case,
should be considered for replacement with new ones.