Reported by: https://mails.dpdk.org/archives/dev/2018-September/111522.html

We've recently encountered a weird issue with Ubuntu 18.04 on a Skylake server. I can always reproduce this crash and I could narrow it down. I guess it could be a GCC issue.

[1] How to reproduce
- ConnectX-4 Lx/ConnectX-5 with mlx5 PMD in DPDK 18.02/18.05/18.08
- Ubuntu 18.04 on Intel Skylake server
- gcc (Ubuntu 7.3.0-16ubuntu3) 7.3.0
- Testpmd crashes when it starts to forward traffic. Easy to reproduce.
- Only happens on the Skylake server.

[2] Failure point

The attached patch gives insight into why it crashes. The following is the result of the patch and the GDB commands. In summary, rte_memcpy() doesn't work as expected. In __mempool_generic_put(), there's an rte_memcpy() to move the array of objects to the lcore cache. If I run memcmp() right after rte_memcpy(dst, src, n), data in dst differs from data in src. And it looks like some of the data got shifted by a few bytes, as you can see below.

[GDB command]
$dst = 0x7ffff4e09ea8
$src = 0x7fffce3fb970
$n = 256
x/32gx 0x7ffff4e09ea8
x/32gx 0x7fffce3fb970

testpmd: /home/mlnxtest/dpdk/build/include/rte_mempool.h:1140: __mempool_generic_put: Assertion `0' failed.

Thread 4 "lcore-slave-1" received signal SIGABRT, Aborted.
[Switching to Thread 0x7fffce3ff700 (LWP 69913)]
(gdb) x/32gx 0x7ffff4e09ea8
0x7ffff4e09ea8:	0x00007fffaac38ec0	0x00007fffaac38500
0x7ffff4e09eb8:	0x00007fffaac37b40	0x00007fffaac37180
0x7ffff4e09ec8:	0x850000007fffaac3	0x7b4000007fffaac3
0x7ffff4e09ed8:	0x00007fffaac35440	0x00007fffaac34a80
0x7ffff4e09ee8:	0xaac3850000007fff	0xaac37b4000007fff
0x7ffff4e09ef8:	0x00007fffaac32d40	0x00007fffaac32380
0x7ffff4e09f08:	0x7fffaac385000000	0x7fffaac37b400000
0x7ffff4e09f18:	0x00007fffaac30640	0x00007fffaac2fc80
0x7ffff4e09f28:	0x00007fffaac2f2c0	0x00007fffaac2e900
0x7ffff4e09f38:	0x00007fffaac2df40	0x00007fffaac2d580
0x7ffff4e09f48:	0x00007fffaac2cbc0	0x00007fffaac2c200
0x7ffff4e09f58:	0x00007fffaac2b840	0x00007fffaac2ae80
0x7ffff4e09f68:	0x00007fffaac2a4c0	0x00007fffaac29b00
0x7ffff4e09f78:	0x00007fffaac29140	0x00007fffaac28780
0x7ffff4e09f88:	0x00007fffaac27dc0	0x00007fffaac27400
0x7ffff4e09f98:	0x00007fffaac26a40	0x00007fffaac26080
(gdb) x/32gx 0x7fffce3fb970
0x7fffce3fb970:	0x00007fffaac38ec0	0x00007fffaac38500
0x7fffce3fb980:	0x00007fffaac37b40	0x00007fffaac37180
0x7fffce3fb990:	0x00007fffaac367c0	0x00007fffaac35e00
0x7fffce3fb9a0:	0x00007fffaac35440	0x00007fffaac34a80
0x7fffce3fb9b0:	0x00007fffaac340c0	0x00007fffaac33700
0x7fffce3fb9c0:	0x00007fffaac32d40	0x00007fffaac32380
0x7fffce3fb9d0:	0x00007fffaac319c0	0x00007fffaac31000
0x7fffce3fb9e0:	0x00007fffaac30640	0x00007fffaac2fc80
0x7fffce3fb9f0:	0x00007fffaac2f2c0	0x00007fffaac2e900
0x7fffce3fba00:	0x00007fffaac2df40	0x00007fffaac2d580
0x7fffce3fba10:	0x00007fffaac2cbc0	0x00007fffaac2c200
0x7fffce3fba20:	0x00007fffaac2b840	0x00007fffaac2ae80
0x7fffce3fba30:	0x00007fffaac2a4c0	0x00007fffaac29b00
0x7fffce3fba40:	0x00007fffaac29140	0x00007fffaac28780
0x7fffce3fba50:	0x00007fffaac27dc0	0x00007fffaac27400
0x7fffce3fba60:	0x00007fffaac26a40	0x00007fffaac26080

AFAIK, AVX512F support is disabled by default in DPDK as it is still experimental (CONFIG_RTE_ENABLE_AVX512=n).
But with gcc optimization, the AVX2 version of rte_memcpy() seems to be optimized with 512b instructions. If I disable it by adding EXTRA_CFLAGS="-mno-avx512f", then it works fine and doesn't crash.

Do you have any idea regarding this issue, or are you already aware of it?

Thanks,
Yongseok

$ git diff
diff --git a/config/common_base b/config/common_base
index ad03cf433..f512b5a88 100644
--- a/config/common_base
+++ b/config/common_base
@@ -275,8 +275,8 @@ CONFIG_RTE_LIBRTE_MLX4_TX_MP_CACHE=8
 #
 # Compile burst-oriented Mellanox ConnectX-4 & ConnectX-5 (MLX5) PMD
 #
-CONFIG_RTE_LIBRTE_MLX5_PMD=n
-CONFIG_RTE_LIBRTE_MLX5_DEBUG=n
+CONFIG_RTE_LIBRTE_MLX5_PMD=y
+CONFIG_RTE_LIBRTE_MLX5_DEBUG=y
 CONFIG_RTE_LIBRTE_MLX5_DLOPEN_DEPS=n
 CONFIG_RTE_LIBRTE_MLX5_TX_MP_CACHE=8
@@ -597,7 +597,7 @@ CONFIG_RTE_RING_USE_C11_MEM_MODEL=n
 #
 CONFIG_RTE_LIBRTE_MEMPOOL=y
 CONFIG_RTE_MEMPOOL_CACHE_MAX_SIZE=512
-CONFIG_RTE_LIBRTE_MEMPOOL_DEBUG=n
+CONFIG_RTE_LIBRTE_MEMPOOL_DEBUG=y
 #
 # Compile Mempool drivers
diff --git a/lib/librte_mempool/rte_mempool.h b/lib/librte_mempool/rte_mempool.h
index 8b1b7f7ed..9f48028d9 100644
--- a/lib/librte_mempool/rte_mempool.h
+++ b/lib/librte_mempool/rte_mempool.h
@@ -39,6 +39,7 @@
 #include <errno.h>
 #include <inttypes.h>
 #include <sys/queue.h>
+#include <assert.h>
 #include <rte_config.h>
 #include <rte_spinlock.h>
@@ -1123,6 +1124,22 @@ __mempool_generic_put(struct rte_mempool *mp, void * const *obj_table,
 	/* Add elements back into the cache */
 	rte_memcpy(&cache_objs[0], obj_table, sizeof(void *) * n);
+	if (memcmp(&cache_objs[0], obj_table, sizeof(void *) * n)) {
+		printf("[GDB command] \n"
+		       "$dst = %p\n"
+		       "$src = %p\n"
+		       "$n = %ld\n"
+		       "x/%ldgx %p\n"
+		       "x/%ldgx %p\n",
+		       (void *)&cache_objs[0],
+		       (const void *)obj_table,
+		       sizeof(void *) * n,
+		       sizeof(void *) * n / 8, (void *)&cache_objs[0],
+		       sizeof(void *) * n / 8, (const void *)obj_table);
+		assert(0);
+	}
+
 	cache->len += n;
 	if (cache->len >= cache->flushthresh) {
Also reported to Ubuntu. http://bugs.launchpad.net/bugs/1799397
Submitted a patch with the workaround: http://patches.dpdk.org/patch/47266/
Wasn't aware of the issue, thanks for reporting. The patch you have sent is disabling avx512f support globally, ignoring CONFIG_RTE_ENABLE_AVX512. We will check for an option to remove avx512f support locally for rte_memcpy.
Hi Yongseok,

I can NOT reproduce this issue. On a Skylake box with Ubuntu 18.04 and the same compiler [1], I am running a test with two i40e ports and not observing any crash/problem. Also I added a unit test for rte_memcpy(), which seems to work fine.

Can you please help me reproduce the issue?
- Is there any case where this issue is observed other than mlx5 data forwarding?
- Do you observe the issue immediately, or is a longevity test required?
- Is there a way to find out what changes in the final binary with or without the "-mno-avx512f" flag?

[1] gcc (Ubuntu 7.3.0-16ubuntu3) 7.3.0
Created attachment 15 [details] rte_memcpy unit test
I don't remember all the details exactly as I reported it to the dev list two months ago [1]. But...

> I can NOT reproduce this issue, on a skylake box with Ubuntu 18.04 OS, same
> compiler [1], I am running test with two i40e ports and not observing any
> crash/problem.

Did you make sure rte_mempool_put_bulk() is used? I remember the regular Tx burst of i40e uses rte_mempool_put() and vector Tx uses the bulk one. Copying one pointer at a time wouldn't be enough to reproduce it.

> - Is there any case this issue observed other than mlx5 data forwarding?

I haven't used any other PMD than mlx5.

> - Do you observer the issue immediately or does a longevity test required?

Almost immediately, after a few packets passed.

> - Is there a way to find out what changes in final binary with or without
> "-mno-avx512f" flag?

I disassembled the two binaries in gdb and I could see avx512f instructions even though CONFIG_RTE_ENABLE_AVX512 is off. That's why I tried adding "-mno-avx512f" as a workaround.

[1] https://mails.dpdk.org/archives/dev/2018-September/111522.html

Thanks,
Yongseok
(In reply to Yongseok Koh from comment #6)
> Did you make sure rte_mempool_put_bulk() is used? I remember regular tx
> burst of i40e does use rte_mempool_put() and vec Tx uses the bulky one.
>
> Copying one pointer at a time wouldn't enough to reproduce??

I am testing on Skylake and by default the vector driver is used; I confirmed rte_mempool_put_bulk() is used.

> I had disassembled the two binaries on gdb then I could see avx512f
> instructions even though CONFIG_RTE_ENABLE_AVX512 is off. That's why I tried
> to add "-mno-avx512f" as a workaround.

I confirm this: gcc is using avx512 instructions for rte_mempool_put_bulk(), but this is not causing any problem in my test, and the QA team didn't observe this issue either.
What would you think of disabling avx512 only for the mlx5 driver, by adding the following to the driver makefile:

+CFLAGS += -mno-avx512f

I agree that the root cause not being 100% clarified is a concern, but taking into account that this issue is not reproduced by the Intel test team, and only reproduced with the mlx driver, we can reduce the scope to the mlx5 driver.
> What would you think disabling avx512 only for mlx5 driver, by adding the
> following to driver:
> +CFLAGS += -mno-avx512f
> I agree root cause is not %100 clarified is a concern, but taking account that
> this issue is not reproduced by Intel test team, and only reproduced by mlx
> driver, we can reduce the scope to mlx5 driver.

There's no reason to object to your suggestion. This has been reported by a few Mellanox customers; as long as I can relieve their concern, it is okay either way. I still think it is obvious that the crash could happen regardless of the PMD being used (my test proves it), but if Intel thinks it is okay, I have no objection. Will submit a revert patch and another patch to apply it for the mlx PMDs.
Please hold on the patch, let's get more input before changing it. Thanks.
Our test team reported a performance drop targeting this issue:

"New high issue: performance single core drop 1 mpps compared to rc1 for 64 byte packets, git bisect from 8d07c82b239f (mk: disable gcc AVX512F support)."
What about an alternative way: replace rte_memcpy with memcpy, since the rte_memcpy code is causing the crash.

Can you please test with the attached 0001-replace-rte_memcpy-which-cause-crash-with-avx512-ins.patch:
1) If the crash has been fixed
2) And confirm this doesn't cause any performance degradation
Created attachment 16 [details] 0001-replace-rte_memcpy-which-cause-crash-with-avx512-ins.patch
Ferruh,

IIRC, memcpy isn't an inline function but a library call. As you know, rte_mempool_put_bulk() is usually called in a datapath, and the successive calls are all inlined, up to __mempool_generic_put(). And rte_memcpy() also has __rte_always_inline; it seems intentional. I think it would impact performance if a function call happens in the datapath. What do you think?

Thanks,
Yongseok
As a final solution, you are right, it may have performance implications. But if you can at least test it, we can have more confidence about whether the issue is limited to rte_memcpy() or not.
Result of the test replacing the use of rte_memcpy by memcpy: it works, confirming the issue is with rte_memcpy.

Result of the unit test: no issue.

RTE>>rte_memcpy_autotest
.................................................................
Test OK
More tests:

CONFIG_RTE_ENABLE_AVX=y CONFIG_RTE_ENABLE_AVX512=y => crash
CONFIG_RTE_ENABLE_AVX=y CONFIG_RTE_ENABLE_AVX512=n => crash
CONFIG_RTE_ENABLE_AVX=n CONFIG_RTE_ENABLE_AVX512=y => crash
CONFIG_RTE_ENABLE_AVX=n CONFIG_RTE_ENABLE_AVX512=n => OK
No crash seen with code generated by clang-6 or gcc-6, probably because they do not generate AVX512 instructions. Crash is confirmed with gcc-7 and gcc-8 when using AVX2 version of rte_memcpy.
Based on the fact that CONFIG_RTE_ENABLE_AVX=n & CONFIG_RTE_ENABLE_AVX512=n works fine, I did a patch to always force the scalar version of rte_memcpy(), narrowing the gcc scope to gcc >= 6.0.

Can you please test with the attached patch 0001-force-using-scalar-rte_memcpy-implementation.patch, for both the crash and the performance effect?

Thanks.
Created attachment 17 [details] 0001-force-using-scalar-rte_memcpy-implementation.patch
More info about the crash:

Command line:
testpmd -w 0002:00:02.0 --vdev='net_vdev_netvsc0,iface=eth1' -- --forward-mode=txonly --stats-period 1

Testpmd output:
RX-packets: 1   RX-missed: 0  RX-bytes: 90
TX-packets: 96  TX-errors: 0  TX-bytes: 6144

GDB output:
Thread 4 "lcore-slave-1" received signal SIGSEGV, Segmentation fault.
pkt_burst_transmit (fs=<optimized out>) at app/test-pmd/txonly.c:206
206		rte_pktmbuf_reset_headroom(pkt);

Registers:
rax            0x49                73
rbx            0x10045ed40         4299550016
rcx            0x1e0000040004      32985349095428
rdx            0x48                72
rsi            0x55555602c130      93825003602224
rdi            0x7ffff512b4e2      140737305031906
rbp            0x7ffff512c530      0x7ffff512c530
rsp            0x7ffff512b460      0x7ffff512b460
r8             0x0                 0
r9             0xa0e               2574
r10            0xfa                250
r11            0x0                 0
r12            0x10045bc40         4299537472
r13            0x109c431400000     292213221228544
r14            0x10045ed50         4299550032
r15            0x100443ac0         4299438784
rip            0x555555671c84      0x555555671c84 <pkt_burst_transmit+580>
eflags         0x13206             [ PF IF RF ]
cs             0x33                51
ss             0x2b                43
ds             0x0                 0
es             0x0                 0
fs             0x0                 0
gs             0x0                 0
k0             0x77777777          2004318071
k1             0x7777777777777777  8608480567731124087
k2             0x7777              30583
k3             0x5000500050005     1407396358717445
k4             0xfffa              65530
k5             0x0                 0
k6             0x0                 0
k7             0x0                 0
The instruction at pkt_burst_transmit+580 is a comparison:

	m->data_off = (uint16_t)RTE_MIN((uint16_t)RTE_PKTMBUF_HEADROOM,
  11dc84:	66 41 81 7d 36 80 00	cmpw   $0x80,0x36(%r13)
Updated the unit test as follows [1] to be sure both the rte_memcpy_aligned() and rte_memcpy_generic() paths are exercised by rte_memcpy(), and confirmed I am still not observing the crash.

[1]
+static void
+copy_data(unsigned int n)
+{
+	rte_memcpy(dst, src, n);
+
+	for (unsigned int i = 1; i < n; i++) {
+		rte_memcpy(dst + i, src + i, n - i);
+		verify_data(n);
+	}
+}
The full statement of the crash is in rte_pktmbuf_reset_headroom():

	m->data_off = (uint16_t)RTE_MIN((uint16_t)RTE_PKTMBUF_HEADROOM,
			(uint16_t)m->buf_len);

Looks like the mbuf was corrupted when calling rte_memcpy. The call stack down to rte_memcpy is:

mlx5_tx_complete
  rte_mempool_put_bulk
    rte_mempool_generic_put
      __mempool_generic_put
        rte_memcpy
          rte_memcpy_generic

All functions of this call stack are inlined, so the compiler optimizations are probably different than in other call contexts, like the unit test above.
Thomas, can you please test the following, with the default config (CONFIG_RTE_ENABLE_AVX512=n)?

[yskoh@scfae-sc-2 dpdk.org]$ git diff
diff --git a/lib/librte_eal/common/include/arch/x86/rte_memcpy.h b/lib/librte_eal/common/include/arch/x86/rte_memcpy.h
index 7b758094df..71409b4b05 100644
--- a/lib/librte_eal/common/include/arch/x86/rte_memcpy.h
+++ b/lib/librte_eal/common/include/arch/x86/rte_memcpy.h
@@ -363,7 +363,7 @@ rte_mov128blocks(uint8_t *dst, const uint8_t *src, size_t n)
 	}
 }

-static inline void *
+static inline void * __attribute__((target("no-avx512f")))
 rte_memcpy_generic(void *dst, const void *src, size_t n)
 {
 	uintptr_t dstu = (uintptr_t)dst;
Restricting AVX2 flavour of rte_memcpy to not use AVX512 instructions works: static inline void * __attribute__((target("no-avx512f"))) rte_memcpy_generic(void *dst, const void *src, size_t n) Thanks Yongseok, it is a nice workaround. We need to check it is well supported with old compilers we want to support in DPDK.
(In reply to Thomas Monjalon from comment #26)
> Restricting AVX2 flavour of rte_memcpy to not use AVX512 instructions works:
>
> static inline void * __attribute__((target("no-avx512f")))
> rte_memcpy_generic(void *dst, const void *src, size_t n)

When the scalar rte_memcpy_generic() is forced, the compiler again generates avx512 instructions, but I am not observing the crash.

With the above suggestion, the avx2 version of rte_memcpy_generic() will be compiled without using avx512 instructions.

Which one will be better? Won't using avx512 give better performance?
(In reply to Thomas Monjalon from comment #18)
> No crash seen with code generated by clang-6 or gcc-6,
> probably because they do not generate AVX512 instructions.
>
> Crash is confirmed with gcc-7 and gcc-8 when using AVX2 version of
> rte_memcpy.

Which gcc-6 version did you use? I could see that avx512f has been supported since gcc 6.4 on gcc.gnu.org.
(In reply to Ferruh YIGIT from comment #27)
> Which one will be better, won't using avx512 give better performance?

Need to measure it. But my gut (:-) tells me the avx2 version of rte_memcpy built with -mavx2 would be better?
09/11/2018 19:17, bugzilla@dpdk.org:
> https://bugs.dpdk.org/show_bug.cgi?id=97
>
> --- Comment #28 from Yongseok Koh (yskoh@mellanox.com) ---
> which gcc-6 version did you use?
> I could see avx512f is supported since gcc 6.4 in gcc.gnu.org.

It is gcc-6 (Ubuntu 6.4.0-17ubuntu1) 6.4.0 20180424
(In reply to Yongseok Koh from comment #29)
> Need to measure it.
> But my gut (:-) tells avx2 version of rte_memcpy with mavx2 would be better?

Agreed, it needs to be measured, but most probably there will be only a negligible difference.

Thomas pointed out that CONFIG_RTE_ENABLE_AVX512=y also crashes; this update won't cover that case.
Created attachment 18 [details] v2-0001-eal-x86-force-using-SSE-memcpy-implementation.patch v2: Changes `scalar` which was wrong to `SSE`
I confirm it is a compiler bug producing wrong code for rte_mov128().

Running it step by step, it is clear some bytes are copied wrong in rte_mov128(), but not in rte_mov64(). Both functions use rte_mov32(). As a result, some mbuf pointers are corrupted in the mempool.

Begin and end of the debug session below:

% make -j EXTRA_CFLAGS='-ggdb -O1'
% sudo gdb --ex='b lib/librte_mempool/rte_mempool.h:1244' --ex=r --args build/app/testpmd -w 0002:00:02.0 --vdev='net_vdev_netvsc0,iface=eth1' -- --forward-mode=txonly --stats-period 1

1244		rte_memcpy(&cache_objs[0], obj_table, sizeof(void *) * n);
(gdb) si
0x000055555565b42d	1244		rte_memcpy(&cache_objs[0], obj_table, sizeof(void *) * n);
(gdb) si
rte_memcpy (n=256, src=0x7ffff512c090, dst=0x10045f028)
(gdb) x/32gx src
0x7ffff512c090:	0x0000000109c43e00	0x0000000109c434c0
0x7ffff512c0a0:	0x0000000109c42b80	0x0000000109c42240
0x7ffff512c0b0:	0x0000000109c41900	0x0000000109c40fc0
0x7ffff512c0c0:	0x0000000109c40680	0x0000000109c3fd40
0x7ffff512c0d0:	0x0000000109c3f400	0x0000000109c3eac0
0x7ffff512c0e0:	0x0000000109c3e180	0x0000000109c3d840
0x7ffff512c0f0:	0x0000000109c3cf00	0x0000000109c3c5c0
0x7ffff512c100:	0x0000000109c3bc80	0x0000000109c3b340
0x7ffff512c110:	0x0000000109c3aa00	0x0000000109c3a0c0
0x7ffff512c120:	0x0000000109c39780	0x0000000109c38e40
0x7ffff512c130:	0x0000000109c38500	0x0000000109c37bc0
0x7ffff512c140:	0x0000000109c37280	0x0000000109c36940
0x7ffff512c150:	0x0000000109c36000	0x0000000109c356c0
0x7ffff512c160:	0x0000000109c34d80	0x0000000109c34440
0x7ffff512c170:	0x0000000109c33b00	0x0000000109c331c0
0x7ffff512c180:	0x0000000109c32880	0x0000000109c31f40
(gdb) ni
...
1246		cache->len += n;
(gdb) x/32gx 0x10045f028
0x10045f028:	0x0000000109c43e00	0x0000000109c434c0
0x10045f038:	0x0000000109c42b80	0x0000000109c42240
0x10045f048:	0x34c00000000109c4	0x2b800000000109c4
0x10045f058:	0x0000000109c40680	0x0000000109c3fd40
0x10045f068:	0x09c434c000000001	0x09c42b8000000001
0x10045f078:	0x0000000109c3e180	0x0000000109c3d840
0x10045f088:	0x000109c434c00000	0x000109c42b800000
0x10045f098:	0x0000000109c3bc80	0x0000000109c3b340
0x10045f0a8:	0x0000000109c3aa00	0x0000000109c3a0c0
0x10045f0b8:	0x0000000109c39780	0x0000000109c38e40
0x10045f0c8:	0x0000000109c38500	0x0000000109c37bc0
0x10045f0d8:	0x0000000109c37280	0x0000000109c36940
0x10045f0e8:	0x0000000109c36000	0x0000000109c356c0
0x10045f0f8:	0x0000000109c34d80	0x0000000109c34440
0x10045f108:	0x0000000109c33b00	0x0000000109c331c0
0x10045f118:	0x0000000109c32880	0x0000000109c31f40
(gdb) c

Thread 4 "lcore-slave-1" received signal SIGSEGV, Segmentation fault.
pkt_burst_transmit (fs=0x100442440) at app/test-pmd/txonly.c:206
206		rte_pktmbuf_reset_headroom(pkt);
In the AVX2 version of rte_memcpy(), rte_mov128() is:

	rte_mov32((uint8_t *)dst + 0 * 32, (const uint8_t *)src + 0 * 32);
	rte_mov32((uint8_t *)dst + 1 * 32, (const uint8_t *)src + 1 * 32);
	rte_mov32((uint8_t *)dst + 2 * 32, (const uint8_t *)src + 2 * 32);
	rte_mov32((uint8_t *)dst + 3 * 32, (const uint8_t *)src + 3 * 32);

Below are the two versions of:

	n -= 128;
	rte_mov128((uint8_t *)dst, (const uint8_t *)src);

AVX512F enabled (native compilation):

	lea    rdx,[rax-0x80]
	mov    rax,QWORD PTR [rbp-0x90]
	vmovdqu8 xmm0,XMMWORD PTR [rax*8+0x0]
	vinserti128 ymm0,ymm0,XMMWORD PTR [rax*8+0x10],0x1
	vmovups XMMWORD PTR [rsi],xmm0
	vextracti128 XMMWORD PTR [rsi+0x10],ymm0,0x1
	vmovdqu8 xmm0,XMMWORD PTR [rax*8+0x2]
	vinserti128 ymm0,ymm0,XMMWORD PTR [rax*8+0x30],0x1
	vmovups XMMWORD PTR [rsi+0x20],xmm0
	vextracti128 XMMWORD PTR [rsi+0x30],ymm0,0x1
	vmovdqu8 xmm0,XMMWORD PTR [rax*8+0x4]
	vinserti128 ymm0,ymm0,XMMWORD PTR [rax*8+0x50],0x1
	vmovups XMMWORD PTR [rsi+0x40],xmm0
	vextracti128 XMMWORD PTR [rsi+0x50],ymm0,0x1
	vmovdqu8 xmm0,XMMWORD PTR [rax*8+0x6]
	vinserti128 ymm0,ymm0,XMMWORD PTR [rax*8+0x70],0x1
	vmovups XMMWORD PTR [rsi+0x60],xmm0
	vextracti128 XMMWORD PTR [rsi+0x70],ymm0,0x1

AVX512F disabled for the rte_mov128 function:

	lea    r14,[rax-0x80]
	mov    rsi,QWORD PTR [rbp-0x80]
	mov    rdi,r12
	call   <rte_mov128>

	vmovdqu xmm0,XMMWORD PTR [rsi]
	vinserti128 ymm0,ymm0,XMMWORD PTR [rsi+0x10],0x1
	vmovups XMMWORD PTR [rdi],xmm0
	vextracti128 XMMWORD PTR [rdi+0x10],ymm0,0x1
	vmovdqu xmm0,XMMWORD PTR [rsi+0x20]
	vinserti128 ymm0,ymm0,XMMWORD PTR [rsi+0x30],0x1
	vmovups XMMWORD PTR [rdi+0x20],xmm0
	vextracti128 XMMWORD PTR [rdi+0x30],ymm0,0x1
	vmovdqu xmm0,XMMWORD PTR [rsi+0x40]
	vinserti128 ymm0,ymm0,XMMWORD PTR [rsi+0x50],0x1
	vmovups XMMWORD PTR [rdi+0x40],xmm0
	vextracti128 XMMWORD PTR [rdi+0x50],ymm0,0x1
	vmovdqu xmm0,XMMWORD PTR [rsi+0x60]
	vinserti128 ymm0,ymm0,XMMWORD PTR [rsi+0x70],0x1
	vmovups XMMWORD PTR [rdi+0x60],xmm0
	vextracti128 XMMWORD PTR [rdi+0x70],ymm0,0x1
	ret

This second version is using AVX2 intrinsics with
this patch:

-static inline void
+static inline void __attribute__((target("no-avx512f")))
 rte_mov128(uint8_t *dst, const uint8_t *src)

The first version is crashing. The second version works.
The bad version of rte_mov128() is using the AVX512 instruction vmovdqu8, while the good version is using the AVX instruction vmovdqu.

The AVX512-enabled rte_memcpy() version corrupts the copied data (mbuf pointers) from:

0x0000000109c43e00 0x0000000109c434c0 0x0000000109c42b80 0x0000000109c42240
0x0000000109c41900 0x0000000109c40fc0 0x0000000109c40680 0x0000000109c3fd40
0x0000000109c3f400 0x0000000109c3eac0 0x0000000109c3e180 0x0000000109c3d840
0x0000000109c3cf00 0x0000000109c3c5c0 0x0000000109c3bc80 0x0000000109c3b340

to:

0x0000000109c43e00 0x0000000109c434c0 0x0000000109c42b80 0x0000000109c42240
0x34c00000000109c4 0x2b800000000109c4 0x0000000109c40680 0x0000000109c3fd40
0x09c434c000000001 0x09c42b8000000001 0x0000000109c3e180 0x0000000109c3d840
0x000109c434c00000 0x000109c42b800000 0x0000000109c3bc80 0x0000000109c3b340

I am not an expert in x86 instructions; is it a GCC bug or a CPU bug?

CPU info:
vendor_id	: GenuineIntel
cpu family	: 6
model		: 85
model name	: Intel(R) Xeon(R) Platinum 8168 CPU @ 2.70GHz
flags		: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl xtopology cpuid pni pclmulqdq vmx ssse3 fma cx16 pcid sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch invpcid_single pti tpr_shadow vnmi ept vpid fsgsbase bmi1 hle avx2 smep bmi2 erms invpcid rtm mpx avx512f avx512dq rdseed adx smap clflushopt avx512cd avx512bw avx512vl xsaveopt xsavec xsaves
bugs		: cpu_meltdown spectre_v1 spectre_v2 spec_store_bypass l1tf
I generated another working code, without the function call and closer to the failing one, by moving the AVX512 disabling from rte_mov128 to the top caller function (mlx5_tx_complete):

-void
+void __attribute__((target("no-avx512f")))
 mlx5_tx_complete(struct mlx5_txq_data *txq)

Resulting code:

	lea    rdx,[rax-0x80]
	mov    rax,QWORD PTR [rbp-0x90]
	vmovdqu xmm0,XMMWORD PTR [rax*8+0x0]
	vinserti128 ymm0,ymm0,XMMWORD PTR [rax*8+0x10],0x1
	vmovups XMMWORD PTR [rsi],xmm0
	vextracti128 XMMWORD PTR [rsi+0x10],ymm0,0x1
	vmovdqu xmm0,XMMWORD PTR [rax*8+0x20]
	vinserti128 ymm0,ymm0,XMMWORD PTR [rax*8+0x30],0x1
	vmovups XMMWORD PTR [rsi+0x20],xmm0
	vextracti128 XMMWORD PTR [rsi+0x30],ymm0,0x1
	vmovdqu xmm0,XMMWORD PTR [rax*8+0x40]
	vinserti128 ymm0,ymm0,XMMWORD PTR [rax*8+0x50],0x1
	vmovups XMMWORD PTR [rsi+0x40],xmm0
	vextracti128 XMMWORD PTR [rsi+0x50],ymm0,0x1
	vmovdqu xmm0,XMMWORD PTR [rax*8+0x60]
	vinserti128 ymm0,ymm0,XMMWORD PTR [rax*8+0x70],0x1
	vmovups XMMWORD PTR [rsi+0x60],xmm0
	vextracti128 XMMWORD PTR [rsi+0x70],ymm0,0x1

The only differences between this working code and the failing one are:
- vmovdqu instead of vmovdqu8
- code/registers context
(In reply to Ferruh YIGIT from comment #32)
> Created attachment 18 [details]
> v2-0001-eal-x86-force-using-SSE-memcpy-implementation.patch

So, in terms of performance, which one is better: rte_memcpy_avx2 built with avx2, or rte_memcpy_sse built with avx512f?

In the case of CONFIG_RTE_ENABLE_AVX512=y, rte_memcpy_avx2 built with avx2 can also be used. I wanted to volunteer for performance testing, but I couldn't find any available setup having that gcc version. Sorry.

And because of the CONFIG_RTE_ENABLE_AVX512=y case, we have to make the change for the meson build too, right?
For Ferruh's patch:

+# Workaround for Bug #97, force to use SSE rte_memcpy()
+# see lib/librte_eal/common/include/arch/x86/rte_memcpy.h
+ifeq ($(CONFIG_RTE_TOOLCHAIN_GCC),y)
+ifeq ($(shell test $(GCC_VERSION) -ge 60 && echo 1), 1)

IIRC, Thomas mentioned gcc 6 is okay. Shouldn't it be '-gt'? Or '-ge 70' or '-ge 73'?
Diff of assembly code:

--- bad-avx512-enabled
+++ good-avx512-disabled
-	vmovdqu8 xmm0,XMMWORD PTR [rax*8+0x0]
+	vmovdqu xmm0,XMMWORD PTR [rax*8+0x0]
 	vinserti128 ymm0,ymm0,XMMWORD PTR [rax*8+0x10],0x1
 	vmovups XMMWORD PTR [rsi],xmm0
 	vextracti128 XMMWORD PTR [rsi+0x10],ymm0,0x1
-	vmovdqu8 xmm0,XMMWORD PTR [rax*8+0x2]
+	vmovdqu xmm0,XMMWORD PTR [rax*8+0x20]
 	vinserti128 ymm0,ymm0,XMMWORD PTR [rax*8+0x30],0x1
 	vmovups XMMWORD PTR [rsi+0x20],xmm0
 	vextracti128 XMMWORD PTR [rsi+0x30],ymm0,0x1
-	vmovdqu8 xmm0,XMMWORD PTR [rax*8+0x4]
+	vmovdqu xmm0,XMMWORD PTR [rax*8+0x40]
 	vinserti128 ymm0,ymm0,XMMWORD PTR [rax*8+0x50],0x1
 	vmovups XMMWORD PTR [rsi+0x40],xmm0
 	vextracti128 XMMWORD PTR [rsi+0x50],ymm0,0x1
-	vmovdqu8 xmm0,XMMWORD PTR [rax*8+0x6]
+	vmovdqu xmm0,XMMWORD PTR [rax*8+0x60]
 	vinserti128 ymm0,ymm0,XMMWORD PTR [rax*8+0x70],0x1
 	vmovups XMMWORD PTR [rsi+0x60],xmm0
 	vextracti128 XMMWORD PTR [rsi+0x70],ymm0,0x1

As noticed by Konstantin, the offset in vmovdqu8 is wrong: 0x2, 0x4, 0x6 instead of 0x20, 0x40, 0x60.
(In reply to Yongseok Koh from comment #38)
> IIRC, Thomas mentioned gcc 6 is okay.
> Shouldn't it be '-gt'? Or, -ge 70 or 73?

Yes, it should be, but I put -ge intentionally, to be safe with all gcc 6 versions.
(In reply to Thomas Monjalon from comment #26)
> Restricting AVX2 flavour of rte_memcpy to not use AVX512 instructions works:
>
> static inline void * __attribute__((target("no-avx512f")))
> rte_memcpy_generic(void *dst, const void *src, size_t n)

I am getting a build error with this [1], aren't you getting it too?

[1]
In file included from .../lib/librte_eal/common/rte_malloc.c:12:
In file included from /usr/lib/gcc/x86_64-redhat-linux/8/include/xmmintrin.h:1252,
                 from /usr/lib/gcc/x86_64-redhat-linux/8/include/x86intrin.h:33,
                 from .../build/include/rte_vect.h:28,
                 from .../build/include/rte_memcpy.h:17,
                 from .../lib/librte_eal/common/rte_malloc.c:12:
.../build/include/rte_memcpy.h: In function ‘rte_mov16’:
/usr/lib/gcc/x86_64-redhat-linux/8/include/emmintrin.h:718:1: error: inlining failed in call to always_inline ‘_mm_storeu_si128’: target specific option mismatch
 _mm_storeu_si128 (__m128i_u *__P, __m128i __B)
 ^~~~~~~~~~~~~~~~
(In reply to Ferruh YIGIT from comment #41)
> I am getting build error with this [1], aren't you getting these?
>
> /usr/lib/gcc/x86_64-redhat-linux/8/include/emmintrin.h:718:1: error:
> inlining failed in call to always_inline ‘_mm_storeu_si128’: target specific
> option mismatch

You cannot change the optimization of an inline function. You can try this:

--- a/drivers/net/mlx5/mlx5_rxtx.c
+++ b/drivers/net/mlx5/mlx5_rxtx.c
-uint16_t
+uint16_t __attribute__((target("no-avx512f")))
 mlx5_tx_burst(void *dpdk_txq, struct rte_mbuf **pkts, uint16_t pkts_n)

Then you can check the generated code:

gdb -batch -ex 'file build/app/testpmd' -ex 'set disassembly-flavor intel' -ex 'disassemble/rs mlx5_tx_burst' | grep 'vmovdqu.\? .*\*8+0x[2-6]'
In order to check which functions are impacted by this bug, we can check for a vmovdqu instruction with an offset of 0x2:

bin=build/app/testpmd
regex='vmovdqu.? .*0x2\]$'
for addr in $(objdump -dM intel $bin | sed -rn "s,^ *([a-f0-9]*):.*$regex,\1,p") ; do
	objdump -d --stop-address 0x$addr $bin | sed -rn 's,.*<(.*)>:$,\1,p' | tail -n1
done

mlx5_tx_descriptor_status
mlx5_tx_burst
mlx5_tx_burst_mpw
mlx5_tx_burst_mpw_inline
mlx5_tx_burst_empw
mlx5_tx_burst_raw_vec
mlx5_tx_burst_vec

Either we consider that the AVX512 optimization bug has no other known effect and we disable it for the above known functions, or we stay on the safe side and disable all AVX512 optimizations, as done currently.
Trying to make comparison of the inlined code easier with a marker:

--- a/lib/librte_eal/common/include/arch/x86/rte_memcpy.h
+++ b/lib/librte_eal/common/include/arch/x86/rte_memcpy.h
@@ -330,13 +330,19 @@ rte_mov64(uint8_t *dst, const uint8_t *src)
  * Copy 128 bytes from one location to another,
  * locations should not overlap.
  */
+#include <rte_atomic.h>
+static volatile int dpdk_bug97_marker __attribute__((used));
 static inline void
 rte_mov128(uint8_t *dst, const uint8_t *src)
 {
+	dpdk_bug97_marker = 0xdbdb97be; /* sequence begins */
+	rte_mb();
 	rte_mov32((uint8_t *)dst + 0 * 32, (const uint8_t *)src + 0 * 32);
 	rte_mov32((uint8_t *)dst + 1 * 32, (const uint8_t *)src + 1 * 32);
 	rte_mov32((uint8_t *)dst + 2 * 32, (const uint8_t *)src + 2 * 32);
 	rte_mov32((uint8_t *)dst + 3 * 32, (const uint8_t *)src + 3 * 32);
+	rte_mb();
+	dpdk_bug97_marker = 0xdbdb97ed; /* sequence ends */
 }

The disassembled sequence can be found with this kind of sed command:

gdb -batch -ex 'file build/app/testpmd' -ex 'set disassembly-flavor intel' -ex 'disassemble/rs mlx5_tx_burst' | sed '/mov.*0xdbdb97be/,/0xdbdb97ed/!d' | sed '/0xdbdb97ed/s,$,\n---,'

I did not test whether the inserted instructions (mov and mfence) change the code of the sequence.
After talking with Ferruh offline: Does mlx5 PMD in 18.08 compile without that problem with gcc 7+? If so, then in theory it should be possible to figure out which particular code change triggers the problem (git bisect or so), and hopefully change it to overcome the issue?
After Konstantin's comment, I compiled v18.08 and ran the above script on it [1]. As the output shows, the instructions we thought problematic exist in v18.08 as well. Assuming no crash was observed in v18.08, our assumption about the problematic instructions may be wrong. Can you please confirm that v18.08 on Skylake has no crash problem? And I agree with Konstantin that finding the git commit where the crash was first observed can be helpful.

[1]
root@ubuntu:~/development/dpdk-next-net# gdb -batch -ex 'file build/app/testpmd' -ex 'set disassembly-flavor intel' -ex 'disassemble/rs mlx5_tx_burst' | grep 'vmovdqu.\? .*\*8+0x[2-6]'
   0x0000000000451567 <+7303>:	62 f1 7f 08 6f 04 c5 02 00 00 00	vmovdqu8 xmm0,XMMWORD PTR [rax*8+0x2]
   0x0000000000451589 <+7337>:	62 f1 7f 08 6f 04 c5 04 00 00 00	vmovdqu8 xmm0,XMMWORD PTR [rax*8+0x4]
   0x00000000004515ab <+7371>:	62 f1 7f 08 6f 04 c5 06 00 00 00	vmovdqu8 xmm0,XMMWORD PTR [rax*8+0x6]
The same crash has been reported on older versions. If you look at the initial description, it has been reproduced on 18.02/18.05/18.08. I really think it is a waste of time trying to find what triggers it in mlx5. It is clearly a GCC bug.

Even in mlx5_tx_burst, there are different inlined versions of rte_mov128:

gdb -batch -ex 'file build/app/testpmd' -ex 'set disassembly-flavor intel' -ex 'disassemble/rs mlx5_tx_burst' | sed -rn 's,.*0x00.*:[[:space:]]*([0-9a-f][0-9a-f][[:space:]])*,,p' | sed '/0xdbdb97be/,/0xdbdb97ed/!d' | sed '/0xdbdb97ed/s,$,\n---,'

mov    DWORD PTR [rip+0x937b56],0xdbdb97be        # 0xde0db0 <dpdk_bug97_marker>
lea    r13,[rdx+0x80]
mov    QWORD PTR [rbp-0xe0],r8
lea    r8,[rcx+0x80]
mfence
vmovdqu8 xmm0,XMMWORD PTR [rdx]
vinserti128 ymm0,ymm0,XMMWORD PTR [rdx+0x10],0x1
vmovups XMMWORD PTR [rcx],xmm0
vextracti128 XMMWORD PTR [rcx+0x10],ymm0,0x1
vmovdqu8 xmm0,XMMWORD PTR [rdx+0x20]
vinserti128 ymm0,ymm0,XMMWORD PTR [rdx+0x30],0x1
vmovups XMMWORD PTR [rcx+0x20],xmm0
vextracti128 XMMWORD PTR [rcx+0x30],ymm0,0x1
vmovdqu8 xmm0,XMMWORD PTR [rdx+0x40]
vinserti128 ymm0,ymm0,XMMWORD PTR [rdx+0x50],0x1
vmovups XMMWORD PTR [rcx+0x40],xmm0
vextracti128 XMMWORD PTR [rcx+0x50],ymm0,0x1
vmovdqu8 xmm0,XMMWORD PTR [rdx+0x60]
vinserti128 ymm0,ymm0,XMMWORD PTR [rdx+0x70],0x1
vmovups XMMWORD PTR [rcx+0x60],xmm0
vextracti128 XMMWORD PTR [rcx+0x70],ymm0,0x1
mfence
mov    DWORD PTR [rip+0x937acb],0xdbdb97ed        # 0xde0db0 <dpdk_bug97_marker>
---
mov    DWORD PTR [rip+0x937912],0xdbdb97be        # 0xde0db0 <dpdk_bug97_marker>
lea    rsi,[rax-0x80]
lea    r12,[rdx+0x80]
mfence
vmovdqu8 xmm0,XMMWORD PTR [rdx]
mov    rdi,QWORD PTR [rbp-0x60]
vinserti128 ymm0,ymm0,XMMWORD PTR [rdx+0x10],0x1
lea    rcx,[rdi+0xb0]
vmovups XMMWORD PTR [rdi+0x30],xmm0
vextracti128 XMMWORD PTR [rdi+0x40],ymm0,0x1
vmovdqu8 xmm0,XMMWORD PTR [rdx+0x20]
vinserti128 ymm0,ymm0,XMMWORD PTR [rdx+0x30],0x1
vmovups XMMWORD PTR [rdi+0x50],xmm0
vextracti128 XMMWORD PTR [rdi+0x60],ymm0,0x1
vmovdqu8 xmm0,XMMWORD PTR [rdx+0x40]
vinserti128 ymm0,ymm0,XMMWORD PTR [rdx+0x50],0x1
vmovups XMMWORD PTR [rdi+0x70],xmm0
vextracti128 XMMWORD PTR [rdi+0x80],ymm0,0x1
vmovdqu8 xmm0,XMMWORD PTR [rdx+0x60]
vinserti128 ymm0,ymm0,XMMWORD PTR [rdx+0x70],0x1
vmovups XMMWORD PTR [rdi+0x90],xmm0
vextracti128 XMMWORD PTR [rdi+0xa0],ymm0,0x1
mfence
mov    DWORD PTR [rip+0x93787c],0xdbdb97ed        # 0xde0db0 <dpdk_bug97_marker>
---
mov    DWORD PTR [rip+0x936fd2],0xdbdb97be        # 0xde0db0 <dpdk_bug97_marker>
lea    rcx,[rax-0x80]
sub    rdx,0xffffffffffffff80
mfence
mov    rax,QWORD PTR [rbp-0xa0]
vmovdqu8 xmm0,XMMWORD PTR [rax*8+0x0]
vinserti128 ymm0,ymm0,XMMWORD PTR [rax*8+0x10],0x1
vmovups XMMWORD PTR [rdx-0x80],xmm0
vextracti128 XMMWORD PTR [rdx-0x70],ymm0,0x1
vmovdqu8 xmm0,XMMWORD PTR [rax*8+0x2]
vinserti128 ymm0,ymm0,XMMWORD PTR [rax*8+0x30],0x1
vmovups XMMWORD PTR [rdx-0x60],xmm0
vextracti128 XMMWORD PTR [rdx-0x50],ymm0,0x1
vmovdqu8 xmm0,XMMWORD PTR [rax*8+0x4]
vinserti128 ymm0,ymm0,XMMWORD PTR [rax*8+0x50],0x1
vmovups XMMWORD PTR [rdx-0x40],xmm0
vextracti128 XMMWORD PTR [rdx-0x30],ymm0,0x1
vmovdqu8 xmm0,XMMWORD PTR [rax*8+0x6]
vinserti128 ymm0,ymm0,XMMWORD PTR [rax*8+0x70],0x1
lea    rax,[r9+0x80]
vmovups XMMWORD PTR [rdx-0x20],xmm0
vextracti128 XMMWORD PTR [rdx-0x10],ymm0,0x1
mfence
mov    DWORD PTR [rip+0x936f24],0xdbdb97ed        # 0xde0db0 <dpdk_bug97_marker>
---
mov    DWORD PTR [rip+0x936e2f],0xdbdb97be        # 0xde0db0 <dpdk_bug97_marker>
lea    rdi,[r9+0x80]
add    rax,0xffffffffffffff80
mfence
vmovdqu8 xmm0,XMMWORD PTR [r9]
sub    rdx,0xffffffffffffff80
vinserti128 ymm0,ymm0,XMMWORD PTR [r9+0x10],0x1
vmovups XMMWORD PTR [rdx-0x80],xmm0
vextracti128 XMMWORD PTR [rdx-0x70],ymm0,0x1
vmovdqu8 xmm0,XMMWORD PTR [r9+0x20]
vinserti128 ymm0,ymm0,XMMWORD PTR [r9+0x30],0x1
vmovups XMMWORD PTR [rdx-0x60],xmm0
vextracti128 XMMWORD PTR [rdx-0x50],ymm0,0x1
vmovdqu8 xmm0,XMMWORD PTR [r9+0x40]
vinserti128 ymm0,ymm0,XMMWORD PTR [r9+0x50],0x1
vmovups XMMWORD PTR [rdx-0x40],xmm0
vextracti128 XMMWORD PTR [rdx-0x30],ymm0,0x1
vmovdqu8 xmm0,XMMWORD PTR [r9+0x60]
vinserti128 ymm0,ymm0,XMMWORD PTR [r9+0x70],0x1
vmovups XMMWORD PTR [rdx-0x20],xmm0
vextracti128 XMMWORD PTR [rdx-0x10],ymm0,0x1
mfence
mov    DWORD PTR [rip+0x936da9],0xdbdb97ed        # 0xde0db0 <dpdk_bug97_marker>

Only the third inline sequence shows the offset bug.
Bug report open at GCC: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88096
FYI, we are also affected by this (CONFIG_RTE_ENABLE_AVX=n, only after upgrading to DPDK 18.04). What we noticed is a segfault in mlx5_rx_replenish_bulk_mbuf. Is there a better workaround than disabling all AVX instructions with CONFIG_RTE_ENABLE_AVX=n? I guess that shuts down other optimizations too...
Actually, this may be another similar issue with SSE... mlx5_rx_replenish_bulk_mbuf is actually still failing with AVX=n, when using mlx5 with secondary processes. This only happens on our Skylake platform with Ubuntu 18.04 (GCC 7); our code worked fine for months on Haswell, as well as on the Skylake with Ubuntu 16.04 (which ships with GCC 5). When upgrading to 18.04 (GCC 7) it started to crash as soon as packets flowed on the secondary process (which reads from the device). Just using GCC 5 solved the problem.
It seems actually fixed in 18.11 (even using GCC 7). Sorry for bothering... Thanks Thomas for helping us offline.
Hi Thomas, Yongseok, the gcc defect suggests that this could be a binutils issue; can you please share the binutils version that was used?
Hi Ferruh, We have it with binutils 2.30-21ubuntu1~18.04. Btw, the reason our issue was fixed with 18.11 but not with AVX=n and AVX512=n is that we compile our whole application with DPDK's flags, which since 18.11 include -mno-avx512f. In 18.08, simply using AVX=n and AVX512=n was not sufficient because link-time optimizations or header inclusions would produce crashing code. But if someone statically links with -ldpdk without integrating all of DPDK's CFLAGS, it may still happen. I think the Snort/Suricata DAQ for DPDK actually does that. Tom
Hi Tom, I am aware of the workaround in v18.11. My comment was based on an update on the gcc defect we submitted, which claims this is an issue in binutils 2.30. Please check the gcc defect for details; this one was reported as a duplicate of the one we submitted: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=86735 binutils 2.30 also seems to have been updated; once the update is picked up by distros it can fix the issue. Meanwhile, perhaps we can narrow down the workaround in 18.11 by checking the binutils version. Just a reminder: I would like to narrow it down because the workaround in v18.11 (-mno-avx512f) is reported to cause a performance drop on the Skylake platform.
(In reply to Ferruh YIGIT from comment #52)
> Hi Thomas, Yongseok,
>
> gcc defect suggests that this can be binutils issue, can you please share
> the binutils version has been used?

Ferruh, I don't have the repro setup anymore. Thomas might have one and could confirm the version. But as we didn't upgrade binutils since we installed the Ubuntu release, it would be the one shipped in the Ubuntu release.

Thanks,
Yongseok
(In reply to Yongseok Koh from comment #55)
> (In reply to Ferruh YIGIT from comment #52)
> > Hi Thomas, Yongseok,
> >
> > gcc defect suggests that this can be binutils issue, can you please share
> > the binutils version has been used?
>
> Ferruh,
>
> I don't have the repro setup anymore.
> Thomas might have one and could confirm the version.
> But, as we didn't upgrade binutils since we installed the Ubuntu release, so
> it would be the one shipped in the Ubuntu release.

I've confirmed that Ubuntu 18.04.1 has
  gcc 7.3.0-27ubuntu1~18.04
  binutils 2.30-21ubuntu1~18.04

Thanks,
Yongseok
FYI, I recompiled with the latest binutils from git, disabling the workaround -mno-avx512*, and it works now.
Thanks for confirming, Tom. I will soon send a patch to check the binutils version and limit the -mno-avx512f argument to binutils 2.30.
AVX512 is disabled in DPDK if an affected version of binutils is used. The bug was fixed in 17.11, 18.11 and later.
Description: The vhost crypto library code contains a post message handler (vhost_crypto_msg_post_handler) which calls vhost_crypto_create_sess(), which in turn calls transform_cipher_param() depending on the operation type. It is transform_cipher_param() that handles the payload data. The payload contains a cipher key length and a static VHOST_USER_CRYPTO_MAX_CIPHER_KEY_LENGTH (64) byte key buffer. When transform_cipher_param() handles the payload data, it does not check that the buffer length does not exceed VHOST_USER_CRYPTO_MAX_CIPHER_KEY_LENGTH. This missing check can cause out of bound reads, which could trigger a crash or a potential information leak.

Also, the vhost crypto library code contains a post message handler (vhost_crypto_msg_post_handler) which calls vhost_crypto_create_sess(), which in turn calls transform_chain_param() depending on the operation type. It is transform_chain_param() that handles the payload data. The payload contains a cipher key length and a static VHOST_USER_CRYPTO_MAX_CIPHER_KEY_LENGTH (64) byte key buffer; it also contains a digest length, a static authentication key buffer (size: VHOST_USER_CRYPTO_MAX_HMAC_KEY_LENGTH (512)) and an authentication key buffer length. None of these length values are validated, which can lead to reading out of bounds.