SES 5.5 How to remove/replace an OSD
This document (000019687) is provided subject to the disclaimer at the end of this document.
Environment
SUSE Enterprise Storage 5.5
Situation
Customer needs to redeploy OSDs with a different configuration.
Customer needs to remove failed HDD, SSD, or NVMe device(s).
Resolution
Choose between remove.osd and replace.osd:
https://documentation.suse.com/ses/5.5/single-html/ses-admin/#salt-removing-osd
https://documentation.suse.com/ses/5.5/single-html/ses-admin/#ds-osd-replace
This document covers remove.osd, not replace.osd; however, the replace.osd procedure is very similar.
There are three reasons to remove an OSD from the cluster:
- The OSD device failed. In this case there would be one osd marked down.
- The journaling device failed. In this case all associated OSDs would be marked down.
- Lastly, to reconfigure OSDs with different journaling partition sizes.
The first two reasons require replacing hardware, followed by configuration changes to address the issue. The last is only a reconfiguration; no hardware is replaced.
If an OSD node needs to be shut down, consider setting the "noout" flag before shutting down the node.
"ceph osd set noout"
Removing an osd can be intimidating.
Customer should validate and revalidate each step to ensure that the correct OSD, device, and partitions are being removed.
Removing the incorrect OSD, device, or partitions can harm the cluster and has the potential for data loss.
Please use caution!
Kernel names (/dev/sd?) are not persistent. If the OSD node is rebooted or restarted, the kernel names may have changed.
Take precautions and revalidate kernel names if the node is rebooted.
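One way to re-check which kernel name a persistent "/dev/disk/by-id/" link currently points to (the by-id name here is taken from the Step #3 example below):
readlink -f /dev/disk/by-id/scsi-350000399a8c8f172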
This document is an example and does NOT cover all conditions.
It is the customer's responsibility to validate/verify each step.
SUSE is not responsible for data loss.
As a rule, OSDs in the same failure domain can be removed at the same time. OSDs in different failure domains should only be removed serially, ensuring the cluster is healthy between removals.
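A minimal sketch of waiting for the cluster to return to health between removals (assumes HEALTH_OK is the target state; adjust the interval as needed):
while ! ceph health | grep -q HEALTH_OK; do sleep 30; done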
Removing an osd from a cluster:
1 -On the Admin node, identify the osd and its node/host name.
"ceph osd tree"
Example: osd.63 on OSD node ceph01
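If the osd is down due to a failure, the tree can be filtered to down OSDs only, which is easier to scan on large clusters (this state filter should be available in the Luminous release SES 5.5 is based on):
ceph osd tree down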
Preparatory information:
2 -On the OSD node, identify the devices the osd is using:
"ssh $OSD-Node"
"ceph-disk list"
Example for osd.63:
/dev/sdl :
/dev/sdl1 ceph data, active, cluster ceph, osd.63, block /dev/sdl2, block.db /dev/sdm4, block.wal /dev/sdm3
/dev/sdl2 ceph block, for /dev/sdl1
"osd.63 is using device /dev/sdl for data, and /dev/sdm4 & /dev/sdm3 partitions for journaling devices.
3 -On the OSD node, record the "/dev/disk/by-id/" labels that /dev/sdl is using.
Example with information provided above:
cd /dev/disk/by-id/
ll
--cut here--
lrwxrwxrwx 1 root root 9 2020-08-06 12:20 scsi-350000399a8c8f172 -> ../../sdl
lrwxrwxrwx 1 root root 10 2020-08-06 12:20 scsi-350000399a8c8f172-part1 -> ../../sdl1
lrwxrwxrwx 1 root root 10 2020-08-06 12:42 scsi-350000399a8c8f172-part2 -> ../../sdl2
---and---
lrwxrwxrwx 1 root root 9 2020-08-06 12:20 scsi-SATA_TOSHIBA_MG07ACA1_9980A0C2F9SG -> ../../sdl
lrwxrwxrwx 1 root root 10 2020-08-06 12:20 scsi-SATA_TOSHIBA_MG07ACA1_9980A0C2F9SG-part1 -> ../../sdl1
lrwxrwxrwx 1 root root 10 2020-08-06 12:42 scsi-SATA_TOSHIBA_MG07ACA1_9980A0C2F9SG-part2 -> ../../sdl2
This information will be used when editing the yml file.
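To narrow the listing to only the device in question, the output can be filtered, for example:
ls -l /dev/disk/by-id/ | grep -w sdl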
4 -On the OSD node, record journaling partitions.
Example from the information provided above:
"parted /dev/vdb print free"
5 -On the OSD node, record the Serial Number of the hard drive so that it can be identified physically when the drive is removed from the OSD node.
Examples:
hdparm -I /dev/sdl | egrep -i 'Model\ Number|Serial\ Number'
smartctl --xall /dev/sdl | egrep -i 'Model\ Family|Device\ Model|Serial\ Number'
6 -On the Admin node, record the policy.cfg storage information. The default profile is "profile-default".
Example
On admin node:
cd /srv/pillar/ceph/proposals/
cat policy.cfg | grep profile
profile-custom-hdd/cluster/ceph0[012345].ses5.com.sls
profile-custom-hdd/stack/default/ceph/minions/ceph0[012345].ses5.com.yml
#profile-custom-ssd/cluster/ceph0[012345].ses5.com.sls
#profile-custom-ssd/stack/default/ceph/minions/ceph0[012345].ses5.com.yml
7 -Make a backup of the yml file for the host the osd is located on.
Example:
cp profile-custom-hdd/stack/default/ceph/minions/$OSD-Node-Name.ses5.com.yml \
profile-custom-hdd/stack/default/ceph/minions/$OSD-Node-Name.ses5.com.yml.bck
Where $OSD-Node-Name is the OSD host name; in this example, "ceph01".
Sometimes DeepSea will remove the device entry from the yml file, which can be desired or undesired; having a backup makes the process simpler.
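The backup also makes it easy to see later what DeepSea changed, for example:
diff profile-custom-hdd/stack/default/ceph/minions/ceph01.ses5.com.yml.bck \
profile-custom-hdd/stack/default/ceph/minions/ceph01.ses5.com.yml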
8 -As a precaution, drain the osd:
ceph osd reweight $OSD_ID 0
Allow the cluster to get healthy.
Monitor with "ceph -s" and "ceph osd df tree". PGs will be migrated away from the osd (in this case osd.63).
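The drain is finished when the PGS column for the osd in "ceph osd df tree" reaches 0. A quick check, using osd.63 from the example above:
ceph osd df tree | grep 'osd.63'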
9 -Remove the osd:
Generally, the work above is done as a precaution. Now it is time to allow DeepSea to do its job.
https://documentation.suse.com/ses/5.5/single-html/ses-admin/#salt-removing-osd
Run the following command on the admin node:
salt-run disengage.safety
salt-run remove.osd OSD_ID
or
salt-run disengage.safety; salt-run remove.osd OSD_ID
(Where OSD_ID is the number only.)
Example, from the information provided above, the command is as follows:
salt-run remove.osd 63
"salt-run remove.osd" can be run multiple times if there is a failure.
If the command is successful, the osd will NOT be listed with the following command:
Example, from the information provided above, the command is as follows:
ceph osd tree | grep osd.63
Sometimes force is required to remove an osd:
https://documentation.suse.com/ses/5.5/single-html/ses-admin/#osd-forced-removal
salt target osd.remove OSD_ID force=True
Example, from the information provided above, the command is as follows:
salt 'ceph01*' osd.remove 63 force=True
In extreme circumstances it may be necessary to remove the osd with:
"ceph osd purge"
Example from the information above, Step #1:
ceph osd purge 63
After "salt-run remove.osd OSD_ID" is run, it is good practice to verify the partitions have also been deleted.
On the OSD node run:
ceph-disk list
"ceph-disk list" will not associate device with any osd. Information provided above, "/dev/sdl" will not be associated with osd.63.
Partitions /dev/sdm4 & /dev/sdm3 will not be associated with any osd as well.
Validate. Example from the information above, Step #2:
lsblk
and
parted /dev/sdl print free
parted /dev/sdm print free
Note whether the journaling partitions /dev/sdm4 and /dev/sdm3 were deleted.
If DeepSea did not remove the journaling partitions, they must be removed manually.
Example from the information above, Step #2:
To remove partitions /dev/sdm4 and /dev/sdm3, run the following commands:
parted -s /dev/sdm rm 4
parted -s /dev/sdm rm 3
Caution! Deleting the wrong partitions can cause the cluster harm and data loss.
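If the kernel still shows the deleted partitions, it may help to have the partition table re-read; "partprobe" is one common way to do this (an assumption, not part of the documented procedure):
partprobe /dev/sdm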
The mount point for the osd should no longer exist:
Example from the information above, Step #2:
mount | grep "ceph-63"
If the mount point still exists, use umount.
Example from the information above:
umount /var/lib/ceph/osd/ceph-63
The osd daemon should no longer be running.
Example from the information above, Step #2:
systemctl status ceph-osd@63.service
If the osd daemon is still running, stop and disable it:
Example from the information above:
systemctl stop ceph-osd@63.service
systemctl disable ceph-osd@63.service
The auth key for the osd should also have been removed:
Example from the information above, Step #2:
ceph auth get osd.63
If the auth key for the osd is still in the keyring, remove it:
Example from the information above:
ceph auth rm osd.63
10 -After the osd has been removed from the cluster, it is safe to remove the hard drive from the system.
Verify the cluster gets healthy. "ceph -s"
Identify the device to be removed with "ledctl".
If the drive in question is dead, then this step may not work.
On the OSD node with drive in question:
Install "ledmon"
zypper in ledmon
To turn drive light on:
ledctl locate=/dev/sd?
To turn drive light off:
ledctl locate_off=/dev/sd?
For devices on HPE SmartArray controllers, use "hpssacli".
For devices on LSI MegaRAID controllers, use "storcli".
Some hardware does not provide a means to view drive lights. In this case, manually check each drive for the correct serial number.
Validate the correct hdd drive was removed with the serial number recorded in Step #5.
Examples:
# hdparm -I /dev/sdl | egrep -i 'Model\ Number|Serial\ Number'
# smartctl --xall /dev/sdl | egrep -i 'Model\ Family|Device\ Model|Serial\ Number'
If the wrong device is removed, place the device back into the same drive bay it was removed from.
It is best to shut down the OSD node when removing the device.
Adding an osd to the cluster:
https://documentation.suse.com/ses/5.5/single-html/ses-admin/#salt-node-add-disk
Requirements:
- osd disks must not have partition tables or partitions.
- Journaling devices must have enough free space to create new journaling partitions.
- The yml file needs to be correct for the desired osd disks and journaling devices.
Before installing the new hdd device in the OSD node, write down the Serial Number recorded on the hdd device label.
A -On the OSD node, install the new drive.
It may be necessary to shut down the OSD node to do this task properly.
If an OSD node needs to be shut down, consider setting the "noout" flag before shutting down the node.
"ceph osd set noout"
Discover the kernel name assigned to the new device ("/dev/sd???"):
ceph-disk list
lsblk
Typically, the new device will be recognizable by its lack of a partition table or partitions.
Also validate the correct hdd drive was installed by locating the device with the correct serial number.
Examples:
hdparm -I /dev/sdl | egrep -i 'Model\ Number|Serial\ Number'
smartctl --xall /dev/sdl | egrep -i 'Model\ Family|Device\ Model|Serial\ Number'
Make note of the "/dev/disk/by-id/???"
cd /dev/disk/by-id/
ll
See Step #3 above.
B -On the Admin node, edit the yml file for this node:
If using multiple journaling devices, ensure that each journaling device has an equal number of OSDs assigned to it.
See Step #6:
Example:
"ceph0.ses5.com.yml"
Locate the entry for the removed osd "scsi-SATA_TOSHIBA_MG07ACA1_9980A0C2F9SG":
/dev/disk/by-id/scsi-SATA_TOSHIBA_MG07ACA1_9980A0C2F9SG:
db: /dev/disk/by-id/scsi-SATA_MTFDDAK480TDN_191821E49174
db_size: 81920m
format: bluestore
wal: /dev/disk/by-id/scsi-SATA_MTFDDAK480TDN_191821E49174
wal_size: 2048m
and replace the entry with the new /dev/disk/by-id/?? entry:
/dev/disk/by-id/??:
db: /dev/disk/by-id/scsi-SATA_MTFDDAK480TDN_191821E49174
db_size: 81920m
format: bluestore
wal: /dev/disk/by-id/scsi-SATA_MTFDDAK480TDN_191821E49174
wal_size: 2048m
If DeepSea removed the entry from the yml file, it is possible to add the entry back manually.
Note: the yml file is space sensitive. Make sure the yml file has correct syntax. See Step #7.
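A quick way to verify the edited file is still valid YAML is a one-line check. This sketch assumes Python with the yaml module is available on the admin node (it normally is, since Salt depends on it); run it from the directory containing the file:
python -c 'import yaml; yaml.safe_load(open("ceph01.ses5.com.yml"))' && echo "syntax OK"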
If the drive is new from the factory, it should have no partition table. If the replacement drive is a repurposed drive, it may have a partition table; the partition table and partitions need to be removed as per the documentation. See:
https://documentation.suse.com/ses/5.5/single-html/ses-deployment/#ceph-install-stack
Step: 12e
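As an illustration only (the linked documentation is authoritative), clearing an old partition table from a repurposed disk is commonly done with one of the following, here assuming the disk is /dev/sdl. Wiping the wrong device destroys data, so revalidate the kernel name first:
wipefs --all /dev/sdl
or
sgdisk --zap-all /dev/sdl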
If this is *not* a new node, but the admin wants to proceed as if it were, then remove destroyedOSDs.yml on the target OSD node:
mv /etc/ceph/destroyedOSDs.yml /etc/ceph/destroyedOSDs.yml.old
or
rm /etc/ceph/destroyedOSDs.yml
If everything was done correctly, run:
salt-run state.orch ceph.stage.1
salt-run state.orch ceph.stage.2
To summarize the steps that will be taken when the actual replacement is deployed, you can run the following command:
salt-run advise.osds
Example:
salt-run advise.osds
These devices will be deployed
data1.ceph:
/dev/disk/by-id/cciss-3600508b1001c7c24c537bdec8f3a698f:
Run 'salt-run state.orch ceph.stage.3'
Note: stage.2 should see the new device. If not, something is wrong; review the steps above.
If all is good, run stage.3 to deploy the osd.
salt-run state.orch ceph.stage.3
If flags were set, remove them.
Use "ceph -s" to see whether flags are set.
ceph osd unset noout
Repeat the steps for each osd that needs to be replaced.
Additional steps for debugging the osd deployment process:
If the osd did not deploy in stage.3, below are additional steps to help troubleshoot where the issue may be. Start with "Adding an osd to the cluster". Instead of running stages 1-2, run stage.1, then validate with the steps below.
Does the pillar reflect the correct devices?
salt 'MinionName*' pillar.get ceph
Are the grains correct?
salt 'MinionName*' grains.get ceph
MinionName # cat /etc/salt/grains
To update those grains to the current devices, either of these will work:
salt 'MinionName*' state.apply ceph.osd.grains
or
salt 'MinionName*' osd.retain
After that, osd.report should report correctly:
salt 'minion' osd.report
Example:
master:~ # salt 'MinionName*' osd.report
MinionName.gtslab.prv.suse.com:
No OSD configured for
/dev/disk/by-id/scsi-SQEMU_QEMU_HARDDISK_08080808
Does the following file exist on the minion:
/etc/ceph/destroyedOSDs.yml
If so, was the customer intending to destroy an osd and deploy a new one in its place?
Try deploying the osd from the minion.
On the minion run:
salt-call -l debug osd.deploy > /tmp/osd.deploy.log
If you think everything is as it should be, then run:
salt 'MinionName*' state.apply ceph.osd
or
salt 'MinionName*' osd.deploy
or
salt-run state.orch ceph.stage.3
Example:
master:~ # salt 'MinionName*' osd.deploy
MinionName.example.com:
None
Cause
Status
Disclaimer
This Support Knowledgebase provides a valuable tool for SUSE customers and parties interested in our products and solutions to acquire information, ideas and learn from one another. Materials are provided for informational, personal or non-commercial use within your organization and are presented "AS IS" WITHOUT WARRANTY OF ANY KIND.
- Document ID: 000019687
- Creation Date: 17-Aug-2020
- Modified Date: 19-Aug-2020
- SUSE Enterprise Storage
For questions or concerns with the SUSE Knowledgebase please contact: tidfeedback[at]suse.com