Saturday, May 16, 2009

Rebuilding the device path: useful when creating an alternate boot disk

Sometimes we need to rebuild the device path for an alternate boot disk even though the boot disk cloning itself succeeded. If the cloned boot disk won't start, some device paths may be missing.

The following steps rebuild the device path.
( Assumed: main boot disk = c0t0d0s0; cloned boot disk = c1t0d0s0 )
Start from the ok> prompt:
ok> boot cdrom -s (or: boot net -s)
# mount /dev/dsk/c1t0d0s0 /mnt
# cp /mnt/etc/path_to_inst /mnt/etc/path_to_inst.orig
# devfsadm -r /mnt -p /mnt/etc/path_to_inst
# cd /devices
# find . -print | cpio -pduVm /mnt/devices
==> note: the argument to find is a dot (.), meaning the current directory
==> output: ............................ (cpio -V prints one dot per file copied)
# disks -r /mnt
==> Done

If the server is going to boot from c1t0d0s0, then /etc/vfstab on that disk needs to be modified as well (e.g. all system partitions / filesystems should refer to c1t0d0).
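
The vfstab change can be scripted. The following is only a sketch, assuming the clone uses the same slice numbers as the original boot disk and is still mounted on /mnt; the function name retarget_vfstab is made up for this example. It keeps a .orig copy and rewrites both the /dev/dsk and /dev/rdsk entries.

```shell
# Sketch only: point every slice of the old disk at the new disk in a vfstab.
# Assumption: slice numbers on the clone match those of the original disk.
retarget_vfstab() {
    # $1 = old disk (e.g. c0t0d0), $2 = new disk (e.g. c1t0d0), $3 = vfstab file
    old=$1 new=$2 file=$3
    # keep a backup, as was done for path_to_inst
    cp "$file" "$file.orig" || return 1
    # "dsk/" matches both /dev/dsk/... and /dev/rdsk/... entries
    sed "s|dsk/${old}s|dsk/${new}s|g" "$file.orig" > "$file"
}
```

Usage, while the clone is mounted: retarget_vfstab c0t0d0 c1t0d0 /mnt/etc/vfstab. Check the result by eye before rebooting.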

Friday, May 15, 2009

How to locate a missing disk among several JBODs.... dd

IHAC (I have a customer) who dropped an email asking whether one of the disks had gone missing from the server configuration. The email came with the "format" command output:

Searching for disks...done
AVAILABLE DISK SELECTIONS:
0. c0t0d0
/pci@1c,600000/scsi@1/sd@0,0
1. c0t1d0
/pci@1c,600000/scsi@1/sd@1,0
2. c0t2d0
/pci@1c,600000/scsi@1/sd@2,0
3. c0t3d0
/pci@1c,600000/scsi@1/sd@3,0
4. c0t4d0
/pci@1c,600000/scsi@1/sd@4,0
5. c0t5d0
/pci@1c,600000/scsi@1/sd@5,0
6. c1t0d0
/pci@1c,600000/scsi@1,1/sd@0,0
7. c1t1d0
/pci@1c,600000/scsi@1,1/sd@1,0
8. c1t2d0
/pci@1c,600000/scsi@1,1/sd@2,0
9. c1t3d0
/pci@1c,600000/scsi@1,1/sd@3,0
10. c1t4d0
/pci@1c,600000/scsi@1,1/sd@4,0
11. c1t5d0
/pci@1c,600000/scsi@1,1/sd@5,0
12. c3t0d0
/pci@1f,700000/scsi@2/sd@0,0
13. c3t1d0
/pci@1f,700000/scsi@2/sd@1,0
14. c4t0d0
/pci@1d,700000/scsi@1/sd@0,0
15. c4t1d0
/pci@1d,700000/scsi@1/sd@1,0
16. c4t2d0
/pci@1d,700000/scsi@1/sd@2,0
17. c4t3d0
/pci@1d,700000/scsi@1/sd@3,0
18. c4t4d0
/pci@1d,700000/scsi@1/sd@4,0
19. c4t5d0
/pci@1d,700000/scsi@1/sd@5,0
20. c5t0d0
/pci@1d,700000/scsi@1,1/sd@0,0
21. c5t1d0
/pci@1d,700000/scsi@1,1/sd@1,0
22. c5t2d0
/pci@1d,700000/scsi@1,1/sd@2,0
23. c5t3d0
/pci@1d,700000/scsi@1,1/sd@3,0
24. c5t4d0
/pci@1d,700000/scsi@1,1/sd@4,0
25. c5t5d0
/pci@1d,700000/scsi@1,1/sd@5,0
26. c7t0d0
/pci@1d,700000/scsi@2/sd@0,0
27. c7t1d0
/pci@1d,700000/scsi@2/sd@1,0
28. c7t2d0
/pci@1d,700000/scsi@2/sd@2,0
29. c7t3d0
/pci@1d,700000/scsi@2/sd@3,0
30. c7t4d0
/pci@1d,700000/scsi@2/sd@4,0
31. c7t5d0
/pci@1d,700000/scsi@2/sd@5,0
32. c7t8d0
/pci@1d,700000/scsi@2/sd@8,0
33. c7t9d0
/pci@1d,700000/scsi@2/sd@9,0
34. c7t10d0
/pci@1d,700000/scsi@2/sd@a,0
35. c7t11d0
/pci@1d,700000/scsi@2/sd@b,0
36. c7t13d0
/pci@1d,700000/scsi@2/sd@d,0
37. c8t0d0
/pci@1d,700000/scsi@2,1/sd@0,0
38. c8t1d0
/pci@1d,700000/scsi@2,1/sd@1,0
39. c8t2d0
/pci@1d,700000/scsi@2,1/sd@2,0
40. c8t3d0
/pci@1d,700000/scsi@2,1/sd@3,0
41. c8t4d0
/pci@1d,700000/scsi@2,1/sd@4,0
42. c8t5d0
/pci@1d,700000/scsi@2,1/sd@5,0
43. c8t8d0
/pci@1d,700000/scsi@2,1/sd@8,0
44. c8t9d0
/pci@1d,700000/scsi@2,1/sd@9,0
45. c8t10d0
/pci@1d,700000/scsi@2,1/sd@a,0
46. c8t11d0
/pci@1d,700000/scsi@2,1/sd@b,0
47. c8t12d0
/pci@1d,700000/scsi@2,1/sd@c,0
48. c8t13d0
/pci@1d,700000/scsi@2,1/sd@d,0


Phew.. Which one is missing?
Reading through the listing, it is clear that disk c7t12d0 is missing: controller c7 jumps from t11 straight to t13, while c8 lists t12.

35. c7t11d0
/pci@1d,700000/scsi@2/sd@b,0

==>> c7t12d0
==>> /pci@1d,700000/scsi@2/sd@c,0
36. c7t13d0
/pci@1d,700000/scsi@2/sd@d,0
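
Spotting the gap by eye gets tedious with this many disks. Below is a rough sketch of how the comparison could be automated. It assumes two controllers that are expected to show the same set of targets (here c7 and c8), that disk lines look like "35. c7t11d0", and that every disk is LUN 0. The function name missing_targets and the AWK variable are made up for this example. On Solaris, set AWK=nawk, since the old /usr/bin/awk does not support -v.

```shell
# Sketch: read "format" output on stdin and print the targets that
# controller $2 has but controller $1 lacks.
missing_targets() {
    "${AWK:-awk}" -v a="$1" -v b="$2" '
        # disk lines: field 1 is the index ("35."), field 2 the name ("c7t11d0")
        $1 ~ /^[0-9]+\.$/ && $2 ~ /^c[0-9]+t[0-9]+d[0-9]+$/ {
            split($2, p, /[ctd]/)        # p[2]=controller, p[3]=target
            seen["c" p[2], p[3]] = 1
            if (p[3] + 0 > max) max = p[3] + 0
        }
        END {
            for (t = 0; t <= max; t++)
                if ((b, t) in seen && !((a, t) in seen))
                    print a "t" t "d0"   # assumption: LUN is always d0
        }'
}
```

Usage: save the format output to a file, then run missing_targets c7 c8 < format.out. A target missing on both controllers will not be reported, so this is a helper, not a replacement for reading the listing.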


Next question: where is the physical disk? In which tray? In which slot? We need to fix it without server downtime!

root@sun1 # ./decode
Enter path? /pci@1d,700000/scsi@2/sd@c,0
Possible machines:
0. Sun Fire V440
1. Sun Fire V250
Selection
[default: V440]: 0
Sun Fire V440
System Board
PCI Slot 2 3.3V 33/66Mhz 32/64Bit PCI
Node: scsi@2
Desc: Dual Differential Ultra/Wide SCSI Host Adapter
Desc: Dual Single-Ended Ultra/Wide SCSI Host Adapter
Desc: PCI-X - Dual Ultra-320 SCSI/RAID
Desc: QLogic QLA-22xx PCI Fibre Channel Adaptor
Port: SCSI Channel A
Info: VHDC168 Connector - SE/LVD Ultra-320 SCSI
Node:
Desc: Storage Array
Node: sd@c,0
Desc: SCSI Emulated FCAL Disk Device
Location: LUN 0


Now we have the missing disk's information in hand:
- device name: c7t12d0 (device path: /pci@1d,700000/scsi@2/sd@c,0)
- SCSI connection: PCI Slot 2 <=> Port: SCSI Channel A <=> Storage Array (Node: sd@c,0)
- adjacent disks: c7t11d0 and c7t13d0
root@sun1 # iostat -En
<..truncated..>
c7t11d0 Soft Errors: 0 Hard Errors: 1 Transport Errors: 0
Vendor: FUJITSU Product: MAX3073NCSUN72G Revision: 1503 Serial No: 0714F03WY1
c7t13d0 Soft Errors: 0 Hard Errors: 1 Transport Errors: 0
Vendor: SEAGATE Product: ST373455LSUN72G Revision: 0491 Serial No: 0717R0WPXR

<..truncated..>

However, even with all this data, we still couldn't find the missing disk. There were 2 servers (in cluster mode) connected to 4 disk trays (SE3310) through 12 rigid SCSI cables. Worse, none of the SCSI cables had a label or cable tag. So we located the disk by generating I/O activity on disks c7t11d0 and c7t13d0, since we knew the missing disk was sitting between those two.

In one terminal window, run:
root@sun1 # dd if=/dev/rdsk/c7t11d0s2 of=/dev/null
and in another terminal window, run the same command concurrently against the other disk:
root@sun1 # dd if=/dev/rdsk/c7t13d0s2 of=/dev/null

Now we look for the disk drives with heavy I/O activity, shown by a steadily blinking LED. Finally we found c7t12d0: the drive without any activity, sitting in the slot between c7t11d0 and c7t13d0. Then we proceeded with the normal disk replacement procedure.
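
The two dd runs can also be started from a single terminal. This is just a sketch: the function name blink_disks is made up, and the block count is an arbitrary value to tune so the LEDs blink long enough for a walk to the rack. Unlike the plain dd above, it reads a bounded amount so it always finishes on its own.

```shell
# Sketch: read from several devices at once so their activity LEDs light up.
# $1 = number of 1 MB blocks to read per device
# remaining arguments = raw device paths, e.g. /dev/rdsk/c7t11d0s2
blink_disks() {
    count=$1; shift
    for dev in "$@"; do
        echo "reading $dev"
        # read-only: data goes to /dev/null, dd statistics are discarded
        dd if="$dev" of=/dev/null bs=1024k count="$count" 2>/dev/null &
    done
    wait    # block until every background dd has finished
}
```

Usage: blink_disks 20000 /dev/rdsk/c7t11d0s2 /dev/rdsk/c7t13d0s2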


Wednesday, May 13, 2009

Hot issue today: the RTC.... Real Time Clock

A Sun SPARC server had its system time jump back to 00:00 January 1st 1970 UTC. This caused a serious problem: all newly created files got the incorrect time, and database records ended up with incorrect timestamps too.

Fortunately there is a Sun Alert (Doc ID: 253828) describing the bug (Bug ID 6724580), including a workaround and a permanent resolution.

1. Workaround:
- if the system has hit this issue, the time/date can be corrected using the "date" command (format: mmddHHMMyyyy)
root@sun1: # date 051113002009
Mon May 11 13:00 WIT 2009

- to make the workaround persistent across system reboots, add the following lines to the "/etc/system" file:
set tod_broken=1
set dosynctodr=0

Note:
These two lines can be safely removed once the fix patch has been applied. The most efficient way is to install the patch and then remove these two lines from "/etc/system" before performing the system reboot associated with the patch installation.
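
If the workaround has to go onto several servers, the two lines can be appended by a small helper. The following is a sketch, with the made-up function name add_system_setting. It assumes existing entries are written exactly as "set name=value" with no spaces around the equals sign, and it skips a variable that is already set so that repeated runs do not add duplicates.

```shell
# Sketch: append "set name=value" to an /etc/system-style file,
# unless a line setting that variable is already there.
add_system_setting() {
    # $1 = file, $2 = variable name, $3 = value
    if grep "^set $2=" "$1" >/dev/null 2>&1; then
        return 0            # already present, leave the file untouched
    fi
    echo "set $2=$3" >> "$1"
}
```

Usage (after backing up /etc/system):
add_system_setting /etc/system tod_broken 1
add_system_setting /etc/system dosynctodr 0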

2. Resolution:
Apply the "todm5819p_rmc" patch release for your Solaris version:

SPARC Platform:
* Solaris 8 with patch 117350-62 or later
* Solaris 9 with patch 139384-01 or later
* Solaris 10 with patch 139514-01 or later