HP Superdome X with a high number of LUNs fails to dump and reports an OOM message
This document (7017393) is provided subject to the disclaimer at the end of this document.
Environment
The affected system is an 8-blade HP Superdome X with a maximum I/O card configuration and a very large LUN configuration (113 multipath LUNs, with a total of 370 LUN paths).
Situation
Attempts to take a crash dump of the system fail in most cases; only a few succeed.
When a dump attempt fails, the dmesg output shows messages similar to the following:
makedumpfile Completed.
-------------------------------------------------------------------------------
Saving dump using makedumpfile
-------------------------------------------------------------------------------
[ 270.020790] alua: release port group 1
[ 270.024945] sd 10:0:1:21: alua: Detached
[ 270.420280] alua: release port group 1
[ 270.424446] sd 10:0:1:25: alua: Detached
[ 270.429811] makedumpfile invoked oom-killer: gfp_mask=0x280da, order=0, oom_adj=0, oom_score_adj=0
[ 270.439674] makedumpfile cpuset=/ mems_allowed=0
[ 270.444765] Pid: 12312, comm: makedumpfile Not tainted 3.0.101-57-default #1
[ 270.452525] Call Trace:
[ 270.455261]  [<ffffffff81004b95>] dump_trace+0x75/0x300
[ 270.461047]  [<ffffffff81464233>] dump_stack+0x69/0x6f
[ 270.466741]  [<ffffffff810fe49e>] dump_header+0x8e/0x110
[ 270.472616]  [<ffffffff810fe856>] oom_kill_process+0xa6/0x350
[ 270.478969]  [<ffffffff810fedb7>] out_of_memory+0x2b7/0x310
[ 270.485133]  [<ffffffff811047e5>] __alloc_pages_slowpath+0x7b5/0x7f0
[ 270.492157]  [<ffffffff81104a09>] __alloc_pages_nodemask+0x1e9/0x200
[ 270.499184]  [<ffffffff811407e0>] alloc_pages_vma+0xd0/0x1c0
[ 270.505444]  [<ffffffff8111f24b>] do_anonymous_page+0x13b/0x300
[ 270.511995]  [<ffffffff8146ae3d>] do_page_fault+0x1fd/0x4c0
[ 270.518158]  [<ffffffff81467a45>] page_fault+0x25/0x30
[ 270.523858]  [<00007f21d707283d>] 0x7f21d707283c
[ 270.528944] Mem-Info:
[ 270.531455] Node 0 DMA per-cpu:
[ 270.534956] CPU 0: hi: 0, btch: 1 usd: 0
[ 270.540236] CPU 1: hi: 0, btch: 1 usd: 0
[ 270.545514] CPU 2: hi: 0, btch: 1 usd: 0
[ 270.550793] CPU 3: hi: 0, btch: 1 usd: 0
[ 270.556075] Node 0 DMA32 per-cpu:
[ 270.559763] CPU 0: hi: 186, btch: 31 usd: 142
[ 270.565042] CPU 1: hi: 186, btch: 31 usd: 161
[ 270.570319] CPU 2: hi: 186, btch: 31 usd: 98
[ 270.575599] CPU 3: hi: 186, btch: 31 usd: 60
[ 270.580882] active_anon:13242 inactive_anon:243 isolated_anon:0
[ 270.580883] active_file:17 inactive_file:0 isolated_file:0
[ 270.580883] unevictable:15006 dirty:0 writeback:0 unstable:0
[ 270.580884] free:9754 slab_reclaimable:3253 slab_unreclaimable:71872
[ 270.580885] mapped:1536 shmem:869 pagetables:164 bounce:0
[ 270.612993] Node 0 DMA free:484kB min:12kB low:12kB high:16kB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:260kB mlocked:0kB dirty:0kB writeback:0kB mapped:0kB shmem:0kB slab_reclaimable:0kB slab_unreclaimable:0kB kernel_stack:0kB pagetables:0kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? yes
[ 270.652731] lowmem_reserve[]: 0 755 755 755
[ 270.657498] Node 0 DMA32 free:38532kB min:38412kB low:48012kB high:57616kB active_anon:52968kB inactive_anon:972kB active_file:68kB inactive_file:0kB unevictable:60024kB isolated(anon):0kB isolated(file):0kB present:773568kB mlocked:15676kB dirty:0kB writeback:0kB mapped:6144kB shmem:3476kB slab_reclaimable:13012kB slab_unreclaimable:287488kB kernel_stack:6600kB pagetables:656kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:19 all_unreclaimable? no
[ 270.702099] lowmem_reserve[]: 0 0 0 0
[ 270.706301] Node 0 DMA: 1*4kB 0*8kB 0*16kB 1*32kB 1*64kB 1*128kB 1*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 484kB
[ 270.718196] Node 0 DMA32: 1625*4kB 710*8kB 288*16kB 123*32kB 80*64kB 57*128kB 13*256kB 0*512kB 0*1024kB 1*2048kB 0*4096kB = 38516kB
[ 270.731611] 9210 total pagecache pages
[ 270.735744] 0 pages in swap cache
[ 270.739402] Swap cache stats: add 0, delete 0, find 0/0
[ 270.745159] Free swap = 0kB
[ 270.748338] Total swap = 0kB
[ 270.751516] 193457 pages RAM
[ 270.754695] [ pid ] uid tgid total_vm rss cpu oom_adj oom_score_adj name
[ 270.762841] [ 256] 0 256 44991 5133 3 -17 -1000 multipathd
[ 270.771554] [ 259] 0 259 3453 1025 0 -17 -1000 udevd
[ 270.779792] [ 265] 0 265 2614 197 0 -17 -1000 udevd
[ 270.788027] [ 268] 0 268 2769 344 0 -17 -1000 udevd
[ 270.796266] [ 604] 0 604 4658 2163 0 0 0 blogd
[ 270.804521] [12309] 0 12309 15060 826 1 0 0 kdumptool
[ 270.813136] [12312] 0 12312 18089 11304 3 0 0 makedumpfile
[ 270.822038] Out of memory: Kill process 12312 (makedumpfile) score 33 or sacrifice child
[ 270.830941] Killed process 12312 (makedumpfile) total-vm:72356kB, anon-rss:44356kB, file-rss:860kB
Resolution
The following changes have been tested successfully in the environment described above; an illustrative configuration sketch follows the list:
- Increase the crashkernel size to 832M.
- Add udev.children-max=2 to the kdump kernel command line.
- Add a multipath.conf (if one does not exist yet) that blacklists more than half of the LUNs;
  in the tested case this reduced the configuration from 113 LUNs with 370 paths to 40 LUNs with 88 paths.
- Rebuild the kdump initrd.
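As an illustration only, the changes above could be applied roughly as follows on a SUSE Linux Enterprise Server 11 system (the kernel 3.0.101-57-default seen in the trace). The file locations, the rebuild command and the WWIDs used in the blacklist are assumptions for this sketch and will differ per installation; verify them against the product documentation for the installed service pack before applying them:

    # 1. Reserve a larger crash kernel area on the system kernel command line
    #    (e.g. in /boot/grub/menu.lst on SLES 11):
    crashkernel=832M

    # 2. Limit the number of parallel udev workers inside the kdump
    #    environment by appending the option to the kdump kernel command
    #    line (in /etc/sysconfig/kdump):
    KDUMP_COMMANDLINE_APPEND="udev.children-max=2"

    # 3. Blacklist LUNs that are not required, in /etc/multipath.conf
    #    (the WWIDs below are placeholders, not real values):
    blacklist {
            wwid "36005076801234567890123456789abcd"
            wwid "36005076801234567890123456789abce"
    }

    # 4. Rebuild the kdump initrd so the changes take effect, for example
    #    by restarting the kdump service on SLES 11:
    service boot.kdump restart

After a reboot with the new reservation, the effective crash kernel size can be checked, for example by looking at the kernel command line (grep crashkernel /proc/cmdline) and at /sys/kernel/kexec_crash_size, which reports the reserved area in bytes.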
It is nevertheless recommended to contact SUSE Technical Support when facing such an issue.
Cause
Disclaimer
This Support Knowledgebase provides a valuable tool for SUSE customers and parties interested in our products and solutions to acquire information, ideas and learn from one another. Materials are provided for informational, personal or non-commercial use within your organization and are presented "AS IS" WITHOUT WARRANTY OF ANY KIND.
- Document ID: 7017393
- Creation Date: 21-Mar-2016
- Modified Date: 12-Oct-2022
- SUSE Linux Enterprise Server
For questions or concerns with the SUSE Knowledgebase please contact: tidfeedback[at]suse.com