Data distribution not equal across OSDs

This document (7018732) is provided subject to the disclaimer at the end of this document.

Environment

SUSE Enterprise Storage 4
SUSE Enterprise Storage 5

Situation

The output of "ceph osd df" or "ceph osd df tree" shows that one or more OSDs (Object Storage Daemons) are utilized significantly more than the rest.
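
To get a quick impression of the spread, the "ceph osd df" output can be sorted by its utilization column. The following is a minimal sketch; the column number used for sorting is an assumption and may differ between SES 4 and SES 5 output formats:

# ceph osd df | sort -rnk7 | head

The summary lines printed by "ceph osd df" also report the MIN/MAX variance and the standard deviation, which give a quick indication of how uneven the distribution is.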

Resolution

Consider running "ceph osd reweight-by-utilization".

When running the above command, the threshold value defaults to 120, i.e. the weight is adjusted downward on OSDs whose utilization exceeds 120% of the average cluster utilization. After running the command, verify the OSD usage again, as it may be necessary to adjust the threshold further, e.g. by specifying:

ceph osd reweight-by-utilization 115
 
If data distribution is still not ideal, step the threshold value downward in increments of 5.
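
As an illustration of the iterative approach above, a single pass could look like the following sketch (the threshold of 115 is only an example value, not a general recommendation):

# ceph osd test-reweight-by-utilization 115
# ceph osd reweight-by-utilization 115
# ceph osd df tree

The first command is a dry run that only reports the proposed weight changes, the second applies them, and the third verifies the resulting OSD utilization.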

NOTE: Before executing the above, make sure to read the additional information section below.

Cause

During normal use of a SUSE Enterprise Storage (SES) cluster, data utilization can become higher on some OSDs than on others.

Additional Information

Note that the "reweight-by-utilization" command uses the following defaults:

  oload 120
  max_change 0.05
  max_change_osds 5

When running the command it is possible to change the default values, for example:

# ceph osd reweight-by-utilization 110 0.05 8

The above will target OSDs whose utilization exceeds 110% of the average, limit the weight change per OSD to 0.05, and adjust a maximum of eight (8) OSDs in a single run. To first verify which changes would be made, without actually applying any of them, use:

# ceph osd test-reweight-by-utilization 110 0.05 8
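
Once a reweight has been applied, the cluster will start moving Placement Groups to match the new weights. As a rough sketch, the progress of this rebalancing can be followed with, for example:

# watch ceph -s
# ceph osd df tree

The cluster should eventually return to HEALTH_OK with all PGs reported as active+clean, at which point the new utilization can be evaluated.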

OSD utilization can be affected by various factors, for example:

- Cluster health
- Number of configured pools
- Number of Placement Groups (PGs) configured per pool
- CRUSH Map configuration and configured rule sets.
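
The following commands can be used to review some of these factors (a minimal sketch; the exact output differs between SES 4 and SES 5):

# ceph health detail
# ceph osd pool ls detail
# ceph osd crush rule dump
# ceph osd tree

"ceph osd pool ls detail" lists the pools together with their placement group counts, "ceph osd crush rule dump" prints the configured CRUSH rules, and "ceph osd tree" shows the CRUSH hierarchy with the weight of each OSD.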

Before making any changes to a production system, verify that the output in question, in this case the OSD utilization, is understood and that the cluster is at least reported as being in a healthy state. This can be checked using, for example, "ceph health" and "ceph -s".

Below is some example output for the above commands showing a healthy cluster:

:~ # ceph health detail
HEALTH_OK


:~ # ceph -s
    cluster 70e9c50b-e375-37c7-a35b-1af02442b751
     health HEALTH_OK
     monmap e1: 3 mons at {ses-srv-1=192.168.178.71:6789/0,ses-srv-2=192.168.178.72:6789/0,ses-srv-3=192.168.178.73:6789/0}
            election epoch 50, quorum 0,1,2 ses-srv-1,ses-srv-2,ses-srv-3
      fsmap e44: 1/1/1 up {0=ses-srv-4=up:active}
     osdmap e109: 9 osds: 9 up, 9 in
            flags sortbitwise,require_jewel_osds
      pgmap v81521: 732 pgs, 11 pools, 1186 MB data, 515 objects
            3936 MB used, 4085 GB / 4088 GB avail
                 732 active+clean


The following article may also be of interest: Predicting which Ceph OSD will fill up first.

Disclaimer

This Support Knowledgebase provides a valuable tool for SUSE customers and parties interested in our products and solutions to acquire information, ideas and learn from one another. Materials are provided for informational, personal or non-commercial use within your organization and are presented "AS IS" WITHOUT WARRANTY OF ANY KIND.

  • Document ID: 7018732
  • Creation Date: 24-Mar-2017
  • Modified Date: 03-Mar-2020
  • SUSE Enterprise Storage
