Bug 97 - rte_memcpy() moves data incorrectly on Ubuntu 18.04 on Intel Skylake
Summary: rte_memcpy() moves data incorrectly on Ubuntu 18.04 on Intel Skylake
Status: RESOLVED FIXED
Alias: None
Product: DPDK
Classification: Unclassified
Component: core
Version: 18.08
Hardware: x86 Linux
Importance: Normal critical
Target Milestone: ---
Assignee: dev
URL: https://mails.dpdk.org/archives/dev/2...
Depends on:
Blocks:
 
Reported: 2018-10-23 19:48 CEST by Yongseok Koh
Modified: 2021-09-10 22:01 CEST (History)
7 users



Attachments
rte_memcpy unit test (3.14 KB, patch)
2018-11-06 20:45 CET, Ferruh YIGIT
Details | Diff
0001-replace-rte_memcpy-which-cause-crash-with-avx512-ins.patch (1.31 KB, patch)
2018-11-08 15:41 CET, Ferruh YIGIT
Details | Diff
0001-force-using-scalar-rte_memcpy-implementation.patch (2.32 KB, patch)
2018-11-09 18:03 CET, Ferruh YIGIT
Details | Diff
v2-0001-eal-x86-force-using-SSE-memcpy-implementation.patch (2.71 KB, patch)
2018-11-09 20:32 CET, Ferruh YIGIT
Details | Diff

Description Yongseok Koh 2018-10-23 19:48:09 CEST
Reported by:
        https://mails.dpdk.org/archives/dev/2018-September/111522.html

We've recently encountered a weird issue with Ubuntu 18.04 on a Skylake
server. I can always reproduce this crash and was able to narrow it down. I
suspect it could be a GCC issue.


[1] How to reproduce
- ConnectX-4Lx/ConnectX-5 with mlx5 PMD in DPDK 18.02/18.05/18.08
- Ubuntu 18.04 on Intel Skylake server
- gcc (Ubuntu 7.3.0-16ubuntu3) 7.3.0
- Testpmd crashes when it starts to forward traffic. Easy to reproduce.
- Only happens on the Skylake server.


[2] Failure point

The attached patch gives insight into why it crashes. The following is the
result of the patch and the GDB commands.

In summary, rte_memcpy() doesn't work as expected. In __mempool_generic_put(),
rte_memcpy() is used to move the array of objects to the lcore cache. If I run
memcmp() right after rte_memcpy(dst, src, n), the data in dst differs from the
data in src, and some of the data appears to have been shifted by a few bytes,
as you can see below.

	[GDB command]
	$dst = 0x7ffff4e09ea8
	$src = 0x7fffce3fb970
	$n = 256
	x/32gx 0x7ffff4e09ea8
	x/32gx 0x7fffce3fb970
	testpmd: /home/mlnxtest/dpdk/build/include/rte_mempool.h:1140: __mempool_generic_put: Assertion `0' failed.

	Thread 4 "lcore-slave-1" received signal SIGABRT, Aborted.
	[Switching to Thread 0x7fffce3ff700 (LWP 69913)]
	(gdb) x/32gx 0x7ffff4e09ea8
	0x7ffff4e09ea8: 0x00007fffaac38ec0      0x00007fffaac38500
	0x7ffff4e09eb8: 0x00007fffaac37b40      0x00007fffaac37180
	0x7ffff4e09ec8: 0x850000007fffaac3      0x7b4000007fffaac3
	0x7ffff4e09ed8: 0x00007fffaac35440      0x00007fffaac34a80
	0x7ffff4e09ee8: 0xaac3850000007fff      0xaac37b4000007fff
	0x7ffff4e09ef8: 0x00007fffaac32d40      0x00007fffaac32380
	0x7ffff4e09f08: 0x7fffaac385000000      0x7fffaac37b400000
	0x7ffff4e09f18: 0x00007fffaac30640      0x00007fffaac2fc80
	0x7ffff4e09f28: 0x00007fffaac2f2c0      0x00007fffaac2e900
	0x7ffff4e09f38: 0x00007fffaac2df40      0x00007fffaac2d580
	0x7ffff4e09f48: 0x00007fffaac2cbc0      0x00007fffaac2c200
	0x7ffff4e09f58: 0x00007fffaac2b840      0x00007fffaac2ae80
	0x7ffff4e09f68: 0x00007fffaac2a4c0      0x00007fffaac29b00
	0x7ffff4e09f78: 0x00007fffaac29140      0x00007fffaac28780
	0x7ffff4e09f88: 0x00007fffaac27dc0      0x00007fffaac27400
	0x7ffff4e09f98: 0x00007fffaac26a40      0x00007fffaac26080
	(gdb) x/32gx 0x7fffce3fb970
	0x7fffce3fb970: 0x00007fffaac38ec0      0x00007fffaac38500
	0x7fffce3fb980: 0x00007fffaac37b40      0x00007fffaac37180
	0x7fffce3fb990: 0x00007fffaac367c0      0x00007fffaac35e00
	0x7fffce3fb9a0: 0x00007fffaac35440      0x00007fffaac34a80
	0x7fffce3fb9b0: 0x00007fffaac340c0      0x00007fffaac33700
	0x7fffce3fb9c0: 0x00007fffaac32d40      0x00007fffaac32380
	0x7fffce3fb9d0: 0x00007fffaac319c0      0x00007fffaac31000
	0x7fffce3fb9e0: 0x00007fffaac30640      0x00007fffaac2fc80
	0x7fffce3fb9f0: 0x00007fffaac2f2c0      0x00007fffaac2e900
	0x7fffce3fba00: 0x00007fffaac2df40      0x00007fffaac2d580
	0x7fffce3fba10: 0x00007fffaac2cbc0      0x00007fffaac2c200
	0x7fffce3fba20: 0x00007fffaac2b840      0x00007fffaac2ae80
	0x7fffce3fba30: 0x00007fffaac2a4c0      0x00007fffaac29b00
	0x7fffce3fba40: 0x00007fffaac29140      0x00007fffaac28780
	0x7fffce3fba50: 0x00007fffaac27dc0      0x00007fffaac27400
	0x7fffce3fba60: 0x00007fffaac26a40      0x00007fffaac26080


AFAIK, AVX512F support is disabled by default in DPDK as it is still
experimental (CONFIG_RTE_ENABLE_AVX512=n). But with GCC optimization, the AVX2
version of rte_memcpy() seems to get compiled with 512-bit instructions. If I
disable them by adding EXTRA_CFLAGS="-mno-avx512f", then it works fine and
doesn't crash.

Do you have any idea regarding this issue or are you already aware of it?


Thanks,
Yongseok


$ git diff
diff --git a/config/common_base b/config/common_base
index ad03cf433..f512b5a88 100644
--- a/config/common_base
+++ b/config/common_base
@@ -275,8 +275,8 @@ CONFIG_RTE_LIBRTE_MLX4_TX_MP_CACHE=8
 #
 # Compile burst-oriented Mellanox ConnectX-4 & ConnectX-5 (MLX5) PMD
 #
-CONFIG_RTE_LIBRTE_MLX5_PMD=n
-CONFIG_RTE_LIBRTE_MLX5_DEBUG=n
+CONFIG_RTE_LIBRTE_MLX5_PMD=y
+CONFIG_RTE_LIBRTE_MLX5_DEBUG=y
 CONFIG_RTE_LIBRTE_MLX5_DLOPEN_DEPS=n
 CONFIG_RTE_LIBRTE_MLX5_TX_MP_CACHE=8

@@ -597,7 +597,7 @@ CONFIG_RTE_RING_USE_C11_MEM_MODEL=n
 #
 CONFIG_RTE_LIBRTE_MEMPOOL=y
 CONFIG_RTE_MEMPOOL_CACHE_MAX_SIZE=512
-CONFIG_RTE_LIBRTE_MEMPOOL_DEBUG=n
+CONFIG_RTE_LIBRTE_MEMPOOL_DEBUG=y

 #
 # Compile Mempool drivers
diff --git a/lib/librte_mempool/rte_mempool.h b/lib/librte_mempool/rte_mempool.h
index 8b1b7f7ed..9f48028d9 100644
--- a/lib/librte_mempool/rte_mempool.h
+++ b/lib/librte_mempool/rte_mempool.h
@@ -39,6 +39,7 @@
 #include <errno.h>
 #include <inttypes.h>
 #include <sys/queue.h>
+#include <assert.h>

 #include <rte_config.h>
 #include <rte_spinlock.h>
@@ -1123,6 +1124,22 @@ __mempool_generic_put(struct rte_mempool *mp, void * const *obj_table,
        /* Add elements back into the cache */
        rte_memcpy(&cache_objs[0], obj_table, sizeof(void *) * n);

+       if(memcmp(&cache_objs[0], obj_table, sizeof(void *) * n)) {
+               printf("[GDB command] \n"
+                      "$dst = %p\n"
+                      "$src = %p\n"
+                      "$n = %ld\n"
+                      "x/%ldgx %p\n"
+                      "x/%ldgx %p\n",
+                      (void *)&cache_objs[0],
+                      (const void *)obj_table,
+                      sizeof(void *) * n,
+                      sizeof(void *) * n / 8, (void *)&cache_objs[0],
+                      sizeof(void *) * n / 8, (const void *)obj_table
+                      );
+               assert(0);
+       }
+
        cache->len += n;

        if (cache->len >= cache->flushthresh) {
Comment 1 Yongseok Koh 2018-10-23 19:50:53 CEST
Also reported to Ubuntu.
http://bugs.launchpad.net/bugs/1799397
Comment 2 Yongseok Koh 2018-10-23 23:30:16 CEST
Submitted a patch of the workaround
http://patches.dpdk.org/patch/47266/
Comment 3 Ferruh YIGIT 2018-11-02 12:46:27 CET
Wasn't aware of the issue, thanks for reporting.

The patch you have sent is disabling avx512f support globally, ignoring CONFIG_RTE_ENABLE_AVX512.

We will check for an option to remove avx512f support locally for rte_memcpy.
Comment 4 Ferruh YIGIT 2018-11-06 20:44:00 CET
Hi Yongseok,

I can NOT reproduce this issue. On a Skylake box with Ubuntu 18.04 and the same compiler [1], I am running tests with two i40e ports and not observing any crash/problem.

I also added a unit test for rte_memcpy(), which seems to work fine.

Can you please help us reproduce the issue?

- Has this issue been observed in any case other than mlx5 data forwarding?
- Do you observe the issue immediately, or is a longevity test required?
- Is there a way to find out what changes in the final binary with or without the "-mno-avx512f" flag?


[1]
gcc (Ubuntu 7.3.0-16ubuntu3) 7.3.0
Comment 5 Ferruh YIGIT 2018-11-06 20:45:33 CET
Created attachment 15 [details]
rte_memcpy unit test
Comment 6 Yongseok Koh 2018-11-06 22:09:13 CET
I don't remember all the details exactly as I reported it to dev list two months ago [1]. But...

> I can NOT reproduce this issue, on a skylake box with Ubuntu 18.04 OS, same
> compiler [1], I am running test with two i40e ports and not observing any
> crash/problem.

Did you make sure rte_mempool_put_bulk() is used? I remember the regular Tx burst of i40e uses rte_mempool_put() and the vector Tx uses the bulk one.

Copying one pointer at a time wouldn't be enough to reproduce it, would it?


> - Is there any case this issue observed other than mlx5 data forwarding?

I haven't used any PMD other than mlx5.


> - Do you observe the issue immediately, or is a longevity test required?

Almost immediately after a few packets passed.


> - Is there a way to find out what changes in final binary with or without
> "-mno-avx512f" flag?

I disassembled the two binaries in gdb and could see avx512f instructions even though CONFIG_RTE_ENABLE_AVX512 is off. That's why I tried adding "-mno-avx512f" as a workaround.


[1] https://mails.dpdk.org/archives/dev/2018-September/111522.html

Thanks,
Yongseok
Comment 7 Ferruh YIGIT 2018-11-07 18:48:58 CET
(In reply to Yongseok Koh from comment #6)
> I don't remember all the details exactly as I reported it to dev list two
> months ago [1]. But...
> 
> > I can NOT reproduce this issue, on a skylake box with Ubuntu 18.04 OS, same
> > compiler [1], I am running test with two i40e ports and not observing any
> > crash/problem.
> 
> Did you make sure rte_mempool_put_bulk() is used? I remember the regular Tx
> burst of i40e uses rte_mempool_put() and the vector Tx uses the bulk one.
> 
> Copying one pointer at a time wouldn't be enough to reproduce it, would it?

I am testing on Skylake, where the vector driver is used by default, and I confirmed rte_mempool_put_bulk() is used.

> 
> > - Is there any case this issue observed other than mlx5 data forwarding?
> 
> I hadn't used any other PMD than mlx5.
> 
> 
> > - Do you observe the issue immediately, or is a longevity test required?
> 
> Almost immediately after a few packets passed.
> 
> 
> > - Is there a way to find out what changes in final binary with or without
> > "-mno-avx512f" flag?
> 
> I had disassembled the two binaries on gdb then I could see avx512f
> instructions even though CONFIG_RTE_ENABLE_AVX512 is off. That's why I tried
> to add "-mno-avx512f" as a workaround.

I confirm this: gcc is using avx512 instructions for rte_mempool_put_bulk(), but this is not causing any problem in my test. The same goes for the QA team; they didn't observe this issue.

> 
> 
> [1] https://mails.dpdk.org/archives/dev/2018-September/111522.html
> 
> Thanks,
> Yongseok
Comment 8 Ferruh YIGIT 2018-11-07 18:56:26 CET
What would you think about disabling avx512 only for the mlx5 driver, by adding the following to the driver makefile:
+CFLAGS += -mno-avx512f

I agree that the root cause not being 100% clarified is a concern, but taking into account that this issue is not reproduced by the Intel test team, and is only reproduced with the mlx driver, we can reduce the scope to the mlx5 driver.
Comment 9 Yongseok Koh 2018-11-07 19:43:19 CET
>What would you think disabling avx512 only for mlx5 driver, by adding the
>following to driver:
>+CFLAGS += -mno-avx512f

>I agree that the root cause not being 100% clarified is a concern, but taking
>into account that this issue is not reproduced by the Intel test team, and only
>reproduced with the mlx driver, we can reduce the scope to the mlx5 driver.

There's no reason to object to your suggestion. This has been reported by a few Mellanox customers; as long as I can relieve their concerns, it is okay.

I still think it is obvious that the crash could happen regardless of the PMD being used (my test proves it), but if Intel thinks it is okay, I have no objection. I will submit a revert patch and another patch to apply the flag for the mlx PMDs.
Comment 10 Ferruh YIGIT 2018-11-07 19:47:53 CET
Please hold off on the patch; let's get more input before changing it. Thanks.
Comment 11 Ferruh YIGIT 2018-11-08 10:21:32 CET
Our test team reported a performance drop related to this issue:

"
•	New high issue: single-core performance drops 1 Mpps compared to rc1 for 64-byte packets; git bisect points to 8d07c82b239f (mk: disable gcc AVX512F support).
"
Comment 12 Ferruh YIGIT 2018-11-08 15:40:04 CET
What about an alternative approach: replace rte_memcpy with memcpy, since the rte_memcpy code is causing the crash?

Can you please test with the attached 0001-replace-rte_memcpy-which-cause-crash-with-avx512-ins.patch to check:
1) Whether the crash is fixed
2) That it doesn't cause any performance degradation
Comment 13 Ferruh YIGIT 2018-11-08 15:41:49 CET
Created attachment 16 [details]
0001-replace-rte_memcpy-which-cause-crash-with-avx512-ins.patch
Comment 14 Yongseok Koh 2018-11-08 21:05:54 CET
Ferruh,

IIRC, memcpy isn't an inline function but a library call. As you know, rte_mempool_put_bulk() is usually called in the datapath and the successive calls are all inlined, down to __mempool_generic_put(). rte_memcpy() also has __rte_always_inline, which seems intentional. I think it would impact performance if a branch happens in the datapath. What do you think?

Thanks,
Yongseok
Comment 15 Ferruh YIGIT 2018-11-09 12:33:16 CET
As a final solution, you are right that it may have performance implications.

But if you can at least test it, we can have more confidence about whether the issue is limited to rte_memcpy().
Comment 16 Thomas Monjalon 2018-11-09 14:20:46 CET
Result of the test replacing rte_memcpy with memcpy: it works, confirming the issue is with rte_memcpy.

Result of the unit test: no issue.

RTE>>rte_memcpy_autotest
.................................................................
Test OK
Comment 17 Thomas Monjalon 2018-11-09 14:55:45 CET
More tests:

CONFIG_RTE_ENABLE_AVX=y
CONFIG_RTE_ENABLE_AVX512=y
=> crash

CONFIG_RTE_ENABLE_AVX=y
CONFIG_RTE_ENABLE_AVX512=n
=> crash

CONFIG_RTE_ENABLE_AVX=n
CONFIG_RTE_ENABLE_AVX512=y
=> crash

CONFIG_RTE_ENABLE_AVX=n
CONFIG_RTE_ENABLE_AVX512=n
=> OK
Comment 18 Thomas Monjalon 2018-11-09 17:11:05 CET
No crash seen with code generated by clang-6 or gcc-6,
probably because they do not generate AVX512 instructions.

Crash is confirmed with gcc-7 and gcc-8 when using AVX2 version of rte_memcpy.
Comment 19 Ferruh YIGIT 2018-11-09 18:02:53 CET
Based on the fact that CONFIG_RTE_ENABLE_AVX=n & CONFIG_RTE_ENABLE_AVX512=n works fine, I made a patch to always force the scalar version of rte_memcpy().

It also narrows the gcc scope to gcc >= 6.0.

Can you please test with the attached patch 0001-force-using-scalar-rte_memcpy-implementation.patch,
checking both the crash and the performance effect?

Thanks.
Comment 20 Ferruh YIGIT 2018-11-09 18:03:37 CET
Created attachment 17 [details]
0001-force-using-scalar-rte_memcpy-implementation.patch
Comment 21 Thomas Monjalon 2018-11-09 18:18:07 CET
More info about the crash:

Command line:
  testpmd -w 0002:00:02.0 --vdev='net_vdev_netvsc0,iface=eth1' -- --forward-mode=txonly --stats-period 1

Testpmd output:
  RX-packets: 1          RX-missed: 0          RX-bytes:  90
  TX-packets: 96         TX-errors: 0          TX-bytes:  6144

GDB output:
  Thread 4 "lcore-slave-1" received signal SIGSEGV, Segmentation fault.
  pkt_burst_transmit (fs=<optimized out>) at app/test-pmd/txonly.c:206
  206                     rte_pktmbuf_reset_headroom(pkt);

Registers:
rax            0x49     73
rbx            0x10045ed40      4299550016
rcx            0x1e0000040004   32985349095428
rdx            0x48     72
rsi            0x55555602c130   93825003602224
rdi            0x7ffff512b4e2   140737305031906
rbp            0x7ffff512c530   0x7ffff512c530
rsp            0x7ffff512b460   0x7ffff512b460
r8             0x0      0
r9             0xa0e    2574
r10            0xfa     250
r11            0x0      0
r12            0x10045bc40      4299537472
r13            0x109c431400000  292213221228544
r14            0x10045ed50      4299550032
r15            0x100443ac0      4299438784
rip            0x555555671c84   0x555555671c84 <pkt_burst_transmit+580>
eflags         0x13206  [ PF IF RF ]
cs             0x33     51
ss             0x2b     43
ds             0x0      0
es             0x0      0
fs             0x0      0
gs             0x0      0
k0             0x77777777       2004318071
k1             0x7777777777777777       8608480567731124087
k2             0x7777   30583
k3             0x5000500050005  1407396358717445
k4             0xfffa   65530
k5             0x0      0
k6             0x0      0
k7             0x0      0
Comment 22 Thomas Monjalon 2018-11-09 18:22:41 CET
instruction at pkt_burst_transmit+580 is a comparison:

        m->data_off = (uint16_t)RTE_MIN((uint16_t)RTE_PKTMBUF_HEADROOM,
  11dc84:       66 41 81 7d 36 80 00    cmpw   $0x80,0x36(%r13)
Comment 23 Ferruh YIGIT 2018-11-09 18:40:56 CET
Updated the unit test as follows [1] to make sure both the rte_memcpy_aligned() and rte_memcpy_generic() paths are exercised by rte_memcpy(), and confirmed I am still not observing the crash.

[1]
+static void
+copy_data(unsigned int n)
+{
+       rte_memcpy(dst, src, n);
+
+       for (unsigned int i = 1; i < n; i++) {
+               rte_memcpy(dst + i, src + i, n - i);
+               verify_data(n);
+       }
+}
Comment 24 Thomas Monjalon 2018-11-09 18:51:37 CET
The full statement of the crash is in rte_pktmbuf_reset_headroom():

  m->data_off = (uint16_t)RTE_MIN((uint16_t)RTE_PKTMBUF_HEADROOM, (uint16_t)m->buf_len);

Looks like the mbuf was corrupted when calling rte_memcpy.
The call stack of rte_memcpy is:

mlx5_tx_complete
  rte_mempool_put_bulk
    rte_mempool_generic_put
      __mempool_generic_put
        rte_memcpy
          rte_memcpy_generic

All functions of this call stack are inlined.
So the compiler optimizations are probably different than in other call contexts, like the unit test above.
Comment 25 Yongseok Koh 2018-11-09 18:55:46 CET
Thomas, can you please test the following, with the default config (CONFIG_RTE_ENABLE_AVX512=n)?

[yskoh@scfae-sc-2 dpdk.org]$ git diff
diff --git a/lib/librte_eal/common/include/arch/x86/rte_memcpy.h b/lib/librte_eal/common/include/arch/x86/rte_memcpy.h
index 7b758094df..71409b4b05 100644
--- a/lib/librte_eal/common/include/arch/x86/rte_memcpy.h
+++ b/lib/librte_eal/common/include/arch/x86/rte_memcpy.h
@@ -363,7 +363,7 @@ rte_mov128blocks(uint8_t *dst, const uint8_t *src, size_t n)
        }
 }

-static inline void *
+static inline void * __attribute__((target("no-avx512f")))
 rte_memcpy_generic(void *dst, const void *src, size_t n)
 {
        uintptr_t dstu = (uintptr_t)dst;
Comment 26 Thomas Monjalon 2018-11-09 19:01:22 CET
Restricting AVX2 flavour of rte_memcpy to not use AVX512 instructions works:

  static inline void * __attribute__((target("no-avx512f")))
  rte_memcpy_generic(void *dst, const void *src, size_t n)

Thanks Yongseok, it is a nice workaround.
We need to check that it is well supported by the old compilers we want to support in DPDK.
Comment 27 Ferruh YIGIT 2018-11-09 19:03:36 CET
(In reply to Thomas Monjalon from comment #26)
> Restricting AVX2 flavour of rte_memcpy to not use AVX512 instructions works:
> 
>   static inline void * __attribute__((target("no-avx512f")))
>   rte_memcpy_generic(void *dst, const void *src, size_t n)
> 
> Thanks Yongseok, it is a nice workaround.
> We need to check it is well supported with old compilers we want to support
> in DPDK.

When the scalar rte_memcpy_generic() is forced, the compiler again generates avx512 instructions, and I am not observing the crash.

With the above suggestion, the avx2 version of rte_memcpy_generic() will be compiled without using avx512 instructions.

Which one will be better? Won't using avx512 give better performance?
Comment 28 Yongseok Koh 2018-11-09 19:17:14 CET
(In reply to Thomas Monjalon from comment #18)
> No crash seen with code generated by clang-6 or gcc-6,
> probably because they do not generate AVX512 instructions.
> 
> Crash is confirmed with gcc-7 and gcc-8 when using AVX2 version of
> rte_memcpy.

Which gcc-6 version did you use?
From gcc.gnu.org, I could see that avx512f has been supported since gcc 6.4.
Comment 29 Yongseok Koh 2018-11-09 19:20:26 CET
(In reply to Ferruh YIGIT from comment #27)
> When scalar rte_memcpy_generic() forced, compiler again generates avx512
> instructions and not observing crash.
> 
> With the above suggestion, avx2 version of rte_memcpy_generic() will be
> compiled without using avx512 instructions.
> 
> Which one will be better, won't using avx512 give better performance?

We need to measure it.
But my gut (:-) tells me the avx2 version of rte_memcpy built with -mavx2 would be better?
Comment 30 Thomas Monjalon 2018-11-09 19:56:13 CET
> (In reply to Thomas Monjalon from comment #18)
> > No crash seen with code generated by clang-6 or gcc-6,
> > probably because they do not generate AVX512 instructions.
> > 
> > Crash is confirmed with gcc-7 and gcc-8 when using AVX2 version of
> > rte_memcpy.
> 
> which gcc-6 version did you use?
> I could see avx512f is supported since gcc 6.4 in gcc.gnu.org.

It is gcc-6 (Ubuntu 6.4.0-17ubuntu1) 6.4.0 20180424
Comment 31 Ferruh YIGIT 2018-11-09 20:01:04 CET
(In reply to Yongseok Koh from comment #29)
> (In reply to Ferruh YIGIT from comment #27)
> > When scalar rte_memcpy_generic() forced, compiler again generates avx512
> > instructions and not observing crash.
> > 
> > With the above suggestion, avx2 version of rte_memcpy_generic() will be
> > compiled without using avx512 instructions.
> > 
> > Which one will be better, won't using avx512 give better performance?
> 
> Need to measure it.
> But my gut (:-) tells avx2 version of rte_memcpy with mavx2 would be better?

Agreed, it needs to be measured.
But most probably the difference will be negligible.

Thomas pointed out that CONFIG_RTE_ENABLE_AVX512=y also crashes; this update won't cover that case.
Comment 32 Ferruh YIGIT 2018-11-09 20:32:41 CET
Created attachment 18 [details]
v2-0001-eal-x86-force-using-SSE-memcpy-implementation.patch

v2:

Changes `scalar`, which was wrong, to `SSE`
Comment 33 Thomas Monjalon 2018-11-10 01:09:48 CET
I confirm it is a compiler bug producing wrong code for rte_mov128().
Running it step by step, it is clear that some bytes are copied incorrectly in rte_mov128(),
but not in rte_mov64(). Both functions use rte_mov32().
As a result, some mbuf pointers are corrupted in the mempool.

Begin and end of debug session below:

% make -j EXTRA_CFLAGS='-ggdb -O1'
% sudo gdb --ex='b lib/librte_mempool/rte_mempool.h:1244' --ex=r --args build/app/testpmd  -w 0002:00:02.0 --vdev='net_vdev_netvsc0,iface=eth1' -- --forward-mode=txonly --stats-period 1
1244            rte_memcpy(&cache_objs[0], obj_table, sizeof(void *) * n);
(gdb) si
0x000055555565b42d      1244            rte_memcpy(&cache_objs[0], obj_table, sizeof(void *) * n);
(gdb) si
rte_memcpy (n=256, src=0x7ffff512c090, dst=0x10045f028)
(gdb) x/32gx src
0x7ffff512c090: 0x0000000109c43e00      0x0000000109c434c0
0x7ffff512c0a0: 0x0000000109c42b80      0x0000000109c42240
0x7ffff512c0b0: 0x0000000109c41900      0x0000000109c40fc0
0x7ffff512c0c0: 0x0000000109c40680      0x0000000109c3fd40
0x7ffff512c0d0: 0x0000000109c3f400      0x0000000109c3eac0
0x7ffff512c0e0: 0x0000000109c3e180      0x0000000109c3d840
0x7ffff512c0f0: 0x0000000109c3cf00      0x0000000109c3c5c0
0x7ffff512c100: 0x0000000109c3bc80      0x0000000109c3b340
0x7ffff512c110: 0x0000000109c3aa00      0x0000000109c3a0c0
0x7ffff512c120: 0x0000000109c39780      0x0000000109c38e40
0x7ffff512c130: 0x0000000109c38500      0x0000000109c37bc0
0x7ffff512c140: 0x0000000109c37280      0x0000000109c36940
0x7ffff512c150: 0x0000000109c36000      0x0000000109c356c0
0x7ffff512c160: 0x0000000109c34d80      0x0000000109c34440
0x7ffff512c170: 0x0000000109c33b00      0x0000000109c331c0
0x7ffff512c180: 0x0000000109c32880      0x0000000109c31f40
(gdb) ni
...
1246            cache->len += n;
(gdb) x/32gx 0x10045f028
0x10045f028:    0x0000000109c43e00      0x0000000109c434c0
0x10045f038:    0x0000000109c42b80      0x0000000109c42240
0x10045f048:    0x34c00000000109c4      0x2b800000000109c4
0x10045f058:    0x0000000109c40680      0x0000000109c3fd40
0x10045f068:    0x09c434c000000001      0x09c42b8000000001
0x10045f078:    0x0000000109c3e180      0x0000000109c3d840
0x10045f088:    0x000109c434c00000      0x000109c42b800000
0x10045f098:    0x0000000109c3bc80      0x0000000109c3b340
0x10045f0a8:    0x0000000109c3aa00      0x0000000109c3a0c0
0x10045f0b8:    0x0000000109c39780      0x0000000109c38e40
0x10045f0c8:    0x0000000109c38500      0x0000000109c37bc0
0x10045f0d8:    0x0000000109c37280      0x0000000109c36940
0x10045f0e8:    0x0000000109c36000      0x0000000109c356c0
0x10045f0f8:    0x0000000109c34d80      0x0000000109c34440
0x10045f108:    0x0000000109c33b00      0x0000000109c331c0
0x10045f118:    0x0000000109c32880      0x0000000109c31f40
(gdb) c
Thread 4 "lcore-slave-1" received signal SIGSEGV, Segmentation fault.
pkt_burst_transmit (fs=0x100442440) at app/test-pmd/txonly.c:206
206                     rte_pktmbuf_reset_headroom(pkt);
Comment 34 Thomas Monjalon 2018-11-10 02:13:11 CET
In AVX2 version of rte_memcpy(), rte_mov128() is:

    rte_mov32((uint8_t *)dst + 0 * 32, (const uint8_t *)src + 0 * 32);
    rte_mov32((uint8_t *)dst + 1 * 32, (const uint8_t *)src + 1 * 32);
    rte_mov32((uint8_t *)dst + 2 * 32, (const uint8_t *)src + 2 * 32);
    rte_mov32((uint8_t *)dst + 3 * 32, (const uint8_t *)src + 3 * 32);

Below are the two versions of:

    n -= 128;
    rte_mov128((uint8_t *)dst, (const uint8_t *)src);

AVX512F enabled (native compilation):

    lea    rdx,[rax-0x80]
    mov    rax,QWORD PTR [rbp-0x90]
    vmovdqu8 xmm0,XMMWORD PTR [rax*8+0x0]
    vinserti128 ymm0,ymm0,XMMWORD PTR [rax*8+0x10],0x1
    vmovups XMMWORD PTR [rsi],xmm0
    vextracti128 XMMWORD PTR [rsi+0x10],ymm0,0x1
    vmovdqu8 xmm0,XMMWORD PTR [rax*8+0x2]
    vinserti128 ymm0,ymm0,XMMWORD PTR [rax*8+0x30],0x1
    vmovups XMMWORD PTR [rsi+0x20],xmm0
    vextracti128 XMMWORD PTR [rsi+0x30],ymm0,0x1
    vmovdqu8 xmm0,XMMWORD PTR [rax*8+0x4]
    vinserti128 ymm0,ymm0,XMMWORD PTR [rax*8+0x50],0x1
    vmovups XMMWORD PTR [rsi+0x40],xmm0
    vextracti128 XMMWORD PTR [rsi+0x50],ymm0,0x1
    vmovdqu8 xmm0,XMMWORD PTR [rax*8+0x6]
    vinserti128 ymm0,ymm0,XMMWORD PTR [rax*8+0x70],0x1
    vmovups XMMWORD PTR [rsi+0x60],xmm0
    vextracti128 XMMWORD PTR [rsi+0x70],ymm0,0x1

AVX512F disabled for rte_mov128 function:

    lea    r14,[rax-0x80]
    mov    rsi,QWORD PTR [rbp-0x80]
    mov    rdi,r12
    call   <rte_mov128>
    vmovdqu xmm0,XMMWORD PTR [rsi]
    vinserti128 ymm0,ymm0,XMMWORD PTR [rsi+0x10],0x1
    vmovups XMMWORD PTR [rdi],xmm0
    vextracti128 XMMWORD PTR [rdi+0x10],ymm0,0x1
    vmovdqu xmm0,XMMWORD PTR [rsi+0x20]
    vinserti128 ymm0,ymm0,XMMWORD PTR [rsi+0x30],0x1
    vmovups XMMWORD PTR [rdi+0x20],xmm0
    vextracti128 XMMWORD PTR [rdi+0x30],ymm0,0x1
    vmovdqu xmm0,XMMWORD PTR [rsi+0x40]
    vinserti128 ymm0,ymm0,XMMWORD PTR [rsi+0x50],0x1
    vmovups XMMWORD PTR [rdi+0x40],xmm0
    vextracti128 XMMWORD PTR [rdi+0x50],ymm0,0x1
    vmovdqu xmm0,XMMWORD PTR [rsi+0x60]
    vinserti128 ymm0,ymm0,XMMWORD PTR [rsi+0x70],0x1
    vmovups XMMWORD PTR [rdi+0x60],xmm0
    vextracti128 XMMWORD PTR [rdi+0x70],ymm0,0x1
    ret    

This second version is using AVX2 intrinsics with this patch:

    -static inline void
    +static inline void __attribute__((target("no-avx512f")))
     rte_mov128(uint8_t *dst, const uint8_t *src)

First version is crashing.
Second version works.
Comment 35 Thomas Monjalon 2018-11-10 02:51:28 CET
The bad version of rte_mov128() is using the AVX512 instruction
    vmovdqu8
while the good version is using the AVX instruction
    vmovdqu

The AVX512-enabled rte_memcpy() version corrupts the copied data (mbuf pointers) from:
    0x0000000109c43e00    0x0000000109c434c0
    0x0000000109c42b80    0x0000000109c42240
    0x0000000109c41900    0x0000000109c40fc0
    0x0000000109c40680    0x0000000109c3fd40
    0x0000000109c3f400    0x0000000109c3eac0
    0x0000000109c3e180    0x0000000109c3d840
    0x0000000109c3cf00    0x0000000109c3c5c0
    0x0000000109c3bc80    0x0000000109c3b340
to:
    0x0000000109c43e00    0x0000000109c434c0
    0x0000000109c42b80    0x0000000109c42240
    0x34c00000000109c4    0x2b800000000109c4
    0x0000000109c40680    0x0000000109c3fd40
    0x09c434c000000001    0x09c42b8000000001
    0x0000000109c3e180    0x0000000109c3d840
    0x000109c434c00000    0x000109c42b800000
    0x0000000109c3bc80    0x0000000109c3b340


I am not an expert in x86 instructions;
is it a GCC bug or a CPU bug?


CPU info:

vendor_id       : GenuineIntel
cpu family      : 6
model           : 85
model name      : Intel(R) Xeon(R) Platinum 8168 CPU @ 2.70GHz
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl xtopology cpuid pni pclmulqdq vmx ssse3 fma cx16 pcid sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch invpcid_single pti tpr_shadow vnmi ept vpid fsgsbase bmi1 hle avx2 smep bmi2 erms invpcid rtm mpx avx512f avx512dq rdseed adx smap clflushopt avx512cd avx512bw avx512vl xsaveopt xsavec xsaves
bugs            : cpu_meltdown spectre_v1 spectre_v2 spec_store_bypass l1tf
Comment 36 Thomas Monjalon 2018-11-10 13:14:50 CET
I generated another working version of the code, without a function call and closer to the failing one, by moving the AVX512 disabling from rte_mov128 to the top caller function (mlx5_tx_complete):

    -void
    +void __attribute__((target("no-avx512f")))
     mlx5_tx_complete(struct mlx5_txq_data *txq)

Resulting code:

    lea    rdx,[rax-0x80]
    mov    rax,QWORD PTR [rbp-0x90]
    vmovdqu xmm0,XMMWORD PTR [rax*8+0x0]
    vinserti128 ymm0,ymm0,XMMWORD PTR [rax*8+0x10],0x1
    vmovups XMMWORD PTR [rsi],xmm0
    vextracti128 XMMWORD PTR [rsi+0x10],ymm0,0x1
    vmovdqu xmm0,XMMWORD PTR [rax*8+0x20]
    vinserti128 ymm0,ymm0,XMMWORD PTR [rax*8+0x30],0x1
    vmovups XMMWORD PTR [rsi+0x20],xmm0
    vextracti128 XMMWORD PTR [rsi+0x30],ymm0,0x1
    vmovdqu xmm0,XMMWORD PTR [rax*8+0x40]
    vinserti128 ymm0,ymm0,XMMWORD PTR [rax*8+0x50],0x1
    vmovups XMMWORD PTR [rsi+0x40],xmm0
    vextracti128 XMMWORD PTR [rsi+0x50],ymm0,0x1
    vmovdqu xmm0,XMMWORD PTR [rax*8+0x60]
    vinserti128 ymm0,ymm0,XMMWORD PTR [rax*8+0x70],0x1
    vmovups XMMWORD PTR [rsi+0x60],xmm0
    vextracti128 XMMWORD PTR [rsi+0x70],ymm0,0x1

The only differences with the working code are:
    - vmovdqu instead of vmovdqu8
    - code/registers context
Comment 37 Yongseok Koh 2018-11-11 12:23:26 CET
(In reply to Ferruh YIGIT from comment #32)
> Created attachment 18 [details]
> v2-0001-eal-x86-force-using-SSE-memcpy-implementation.patch
> 
> v2:
> 
> Changes `scalar` which was wrong to `SSE`

So, in terms of performance, which one is better:
rte_memcpy_avx2 with avx2, or rte_memcpy_sse with avx512f?

In the case of CONFIG_RTE_ENABLE_AVX512=y, rte_memcpy_avx2 with avx2 can also be used.

I wanted to volunteer for performance testing, but I couldn't find any available setup with the affected gcc version. Sorry.

And because of the CONFIG_RTE_ENABLE_AVX512=y case, we also have to make the change for the meson build, right?
Comment 38 Yongseok Koh 2018-11-11 12:30:09 CET
For Ferruh's patch,

+# Workaround for Bug #97, force to use SSE rte_memcpy()
+# see lib/librte_eal/common/include/arch/x86/rte_memcpy.h
+ifeq ($(CONFIG_RTE_TOOLCHAIN_GCC),y)
+ifeq ($(shell test $(GCC_VERSION) -ge 60 && echo 1), 1)

IIRC, Thomas mentioned gcc 6 is okay.
Shouldn't it be '-gt'? Or, -ge 70 or 73?
Comment 39 Thomas Monjalon 2018-11-11 19:19:00 CET
Diff of assembly code:

--- bad-avx512-enabled
+++ good-avx512-disabled
-    vmovdqu8 xmm0,XMMWORD PTR [rax*8+0x0]
+    vmovdqu xmm0,XMMWORD PTR [rax*8+0x0]
     vinserti128 ymm0,ymm0,XMMWORD PTR [rax*8+0x10],0x1
     vmovups XMMWORD PTR [rsi],xmm0
     vextracti128 XMMWORD PTR [rsi+0x10],ymm0,0x1
-    vmovdqu8 xmm0,XMMWORD PTR [rax*8+0x2]
+    vmovdqu xmm0,XMMWORD PTR [rax*8+0x20]
     vinserti128 ymm0,ymm0,XMMWORD PTR [rax*8+0x30],0x1
     vmovups XMMWORD PTR [rsi+0x20],xmm0
     vextracti128 XMMWORD PTR [rsi+0x30],ymm0,0x1
-    vmovdqu8 xmm0,XMMWORD PTR [rax*8+0x4]
+    vmovdqu xmm0,XMMWORD PTR [rax*8+0x40]
     vinserti128 ymm0,ymm0,XMMWORD PTR [rax*8+0x50],0x1
     vmovups XMMWORD PTR [rsi+0x40],xmm0
     vextracti128 XMMWORD PTR [rsi+0x50],ymm0,0x1
-    vmovdqu8 xmm0,XMMWORD PTR [rax*8+0x6]
+    vmovdqu xmm0,XMMWORD PTR [rax*8+0x60]
     vinserti128 ymm0,ymm0,XMMWORD PTR [rax*8+0x70],0x1
     vmovups XMMWORD PTR [rsi+0x60],xmm0
     vextracti128 XMMWORD PTR [rsi+0x70],ymm0,0x1

As noticed by Konstantin, the offset in vmovdqu8 is wrong:
0x2, 0x4, 0x6 instead of 0x20, 0x40, 0x60.
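The corrupted offsets match what EVEX disp8*N compressed-displacement scaling would produce: for a 16-byte (xmm) memory operand N = 16, so byte offsets 0x20, 0x40, 0x60 compress to exactly 0x2, 0x4, 0x6, yet the encodings in comment 46 show the value emitted in a 4-byte displacement field (`02 00 00 00`), i.e. scaled as if it were a disp8 but stored as a disp32. A quick arithmetic check (my reading of the bytes; the GCC/binutils reports referenced later in this thread have the authoritative analysis):

```shell
# EVEX disp8*N: with a 16-byte (xmm) memory operand, N = 16, so a
# compressed 8-bit displacement d encodes a real byte offset of d * 16.
# The bad code emits the compressed value in a 32-bit field instead.
for off in 0x20 0x40 0x60; do
    printf 'real offset %s -> compressed disp8 0x%x\n' "$off" $(( off / 16 ))
done
```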
Comment 40 Ferruh YIGIT 2018-11-12 11:06:55 CET
(In reply to Yongseok Koh from comment #38)
> For Ferruh's patch,
> 
> +# Workaround for Bug #97, force to use SSE rte_memcpy()
> +# see lib/librte_eal/common/include/arch/x86/rte_memcpy.h
> +ifeq ($(CONFIG_RTE_TOOLCHAIN_GCC),y)
> +ifeq ($(shell test $(GCC_VERSION) -ge 60 && echo 1), 1)
> 
> IIRC, Thomas mentioned gcc 6 is okay.
> Shouldn't it be '-gt'? Or, -ge 70 or 73?

Yes, it should be, but I put -ge intentionally, to be safe with gcc 6 versions.
Comment 41 Ferruh YIGIT 2018-11-12 12:17:17 CET
(In reply to Thomas Monjalon from comment #26)
> Restricting AVX2 flavour of rte_memcpy to not use AVX512 instructions works:
> 
>   static inline void * __attribute__((target("no-avx512f")))
>   rte_memcpy_generic(void *dst, const void *src, size_t n)
> 
> Thanks Yongseok, it is a nice workaround.
> We need to check it is well supported with old compilers we want to support
> in DPDK.

I am getting build error with this [1], aren't you getting these?


[1]
In file included from .../lib/librte_eal/common/rte_malloc.c:12:                                                                  
In file included from /usr/lib/gcc/x86_64-redhat-linux/8/include/xmmintrin.h:1252,
                 from /usr/lib/gcc/x86_64-redhat-linux/8/include/x86intrin.h:33,
                 from .../build/include/rte_vect.h:28,
                 from .../build/include/rte_memcpy.h:17,
                 from .../lib/librte_eal/common/rte_malloc.c:12:
.../build/include/rte_memcpy.h: In function ‘rte_mov16’:
/usr/lib/gcc/x86_64-redhat-linux/8/include/emmintrin.h:718:1: error: inlining failed in call to always_inline ‘_mm_storeu_si128’: target specific option mismatch
 _mm_storeu_si128 (__m128i_u *__P, __m128i __B)
 ^~~~~~~~~~~~~~~~
Comment 42 Thomas Monjalon 2018-11-12 18:38:08 CET
(In reply to Ferruh YIGIT from comment #41)
> (In reply to Thomas Monjalon from comment #26)
> > Restricting AVX2 flavour of rte_memcpy to not use AVX512 instructions
> works:
> > 
> >   static inline void * __attribute__((target("no-avx512f")))
> >   rte_memcpy_generic(void *dst, const void *src, size_t n)
> > 
> > Thanks Yongseok, it is a nice workaround.
> > We need to check it is well supported with old compilers we want to support
> > in DPDK.
> 
> I am getting build error with this [1], aren't you getting these?
> 
> /usr/lib/gcc/x86_64-redhat-linux/8/include/emmintrin.h:718:1: error:
> inlining failed in call to always_inline ‘_mm_storeu_si128’: target specific
> option mismatch
>  _mm_storeu_si128 (__m128i_u *__P, __m128i __B)
>  ^~~~~~~~~~~~~~~~

You cannot change optimization of an inline function.
You can try this:

--- a/drivers/net/mlx5/mlx5_rxtx.c
+++ b/drivers/net/mlx5/mlx5_rxtx.c
-uint16_t
+uint16_t __attribute__((target("no-avx512f")))
 mlx5_tx_burst(void *dpdk_txq, struct rte_mbuf **pkts, uint16_t pkts_n)

Then you can check the generated code:

gdb -batch -ex 'file build/app/testpmd' -ex 'set disassembly-flavor intel' -ex 'disassemble/rs mlx5_tx_burst' | grep 'vmovdqu.\? .*\*8+0x[2-6]'
Comment 43 Thomas Monjalon 2018-11-12 21:43:00 CET
In order to check which functions are impacted by this bug, we can check for a vmovdqu instruction with an offset 0x2:

bin=build/app/testpmd ; regex='vmovdqu.? .*0x2\]$' ; for addr in $(objdump -dM intel $bin | sed -rn "s,^ *([a-f0-9]*):.*$regex,\1,p") ; do objdump -d --stop-address 0x$addr $bin | sed -rn 's,.*<(.*)>:$,\1,p' | tail -n1 ; done

mlx5_tx_descriptor_status
mlx5_tx_burst
mlx5_tx_burst_mpw
mlx5_tx_burst_mpw_inline
mlx5_tx_burst_empw
mlx5_tx_burst_raw_vec
mlx5_tx_burst_vec

Either we consider that the AVX512 optimization bug has no other known effect and disable it only for the known functions above, or we stay on the safe side and disable all AVX512 optimizations, as is done currently.
Comment 44 Thomas Monjalon 2018-11-15 16:53:24 CET
Trying to make comparison of inlined code easier with a marker:

--- a/lib/librte_eal/common/include/arch/x86/rte_memcpy.h
+++ b/lib/librte_eal/common/include/arch/x86/rte_memcpy.h
@@ -330,13 +330,19 @@ rte_mov64(uint8_t *dst, const uint8_t *src)
  * Copy 128 bytes from one location to another,
  * locations should not overlap.
  */
+#include <rte_atomic.h>
+static volatile int dpdk_bug97_marker __attribute__((used));
 static inline void
 rte_mov128(uint8_t *dst, const uint8_t *src)
 {
+       dpdk_bug97_marker = 0xdbdb97be; /* sequence begins */
+       rte_mb();
        rte_mov32((uint8_t *)dst + 0 * 32, (const uint8_t *)src + 0 * 32);
        rte_mov32((uint8_t *)dst + 1 * 32, (const uint8_t *)src + 1 * 32);
        rte_mov32((uint8_t *)dst + 2 * 32, (const uint8_t *)src + 2 * 32);
        rte_mov32((uint8_t *)dst + 3 * 32, (const uint8_t *)src + 3 * 32);
+       rte_mb();
+       dpdk_bug97_marker = 0xdbdb97ed; /* sequence ends */
 }


The disassembled sequence can be found with this kind of sed command:

gdb -batch -ex 'file build/app/testpmd' -ex 'set disassembly-flavor intel' -ex 'disassemble/rs mlx5_tx_burst' | sed '/mov.*0xdbdb97be/,/0xdbdb97ed/!d' | sed '/0xdbdb97ed/s,$,\n---,'


I did not test whether the inserted instructions (mov and mfence) are changing the code of the sequence.
Comment 45 Konstantin Ananyev 2018-11-16 12:57:13 CET
After talking with Ferruh offline:
Does mlx5 PMD in 18.08 compile without that problem with gcc 7+?
If so, then in theory it should be possible to figure out 
which particular code change triggers the problem (git bisect or so),
and hopefully change it to overcome the issue?
Comment 46 Ferruh YIGIT 2018-11-16 13:04:36 CET
After Konstantin's comment, I compiled v18.08 and ran the above script on it [1]; as the output shows, the instruction we thought problematic exists in v18.08 as well.

Assuming no crash was observed in v18.08, our assumption about the problematic instructions may be wrong.

Can you please confirm that v18.08 on Skylake has no crash problem?

And I agree with Konstantin that finding the git commit where the crash was first observed would be helpful.



[1]
root@ubuntu:~/development/dpdk-next-net# gdb -batch -ex 'file build/app/testpmd' -ex 'set disassembly-flavor intel' -ex 'disassemble/rs mlx5_tx_burst' | grep 'vmovdqu.\? .*\*8+0x[2-6]'
   0x0000000000451567 <+7303>:  62 f1 7f 08 6f 04 c5 02 00 00 00        vmovdqu8 xmm0,XMMWORD PTR [rax*8+0x2]
   0x0000000000451589 <+7337>:  62 f1 7f 08 6f 04 c5 04 00 00 00        vmovdqu8 xmm0,XMMWORD PTR [rax*8+0x4]
   0x00000000004515ab <+7371>:  62 f1 7f 08 6f 04 c5 06 00 00 00        vmovdqu8 xmm0,XMMWORD PTR [rax*8+0x6]
Comment 47 Thomas Monjalon 2018-11-16 15:37:37 CET
The same crash has been reported on older versions.
If you look at the initial description, it has been reproduced on 18.02/18.05/18.08.

I really think it is a waste of time trying to find what triggers it in mlx5.
It is clearly a GCC bug.

Even in mlx5_tx_burst, there are different inlined versions of rte_mov128:

gdb -batch -ex 'file build/app/testpmd' -ex 'set disassembly-flavor intel' -ex 'disassemble/rs mlx5_tx_burst' |
sed -rn 's,.*0x00.*:[[:space:]]*([0-9a-f][0-9a-f][[:space:]])*,,p' | sed '/0xdbdb97be/,/0xdbdb97ed/!d' | sed '/0xdbdb97ed/s,$,\n---,'

mov    DWORD PTR [rip+0x937b56],0xdbdb97be        # 0xde0db0 <dpdk_bug97_marker>
lea    r13,[rdx+0x80]
mov    QWORD PTR [rbp-0xe0],r8
lea    r8,[rcx+0x80]
mfence 
vmovdqu8 xmm0,XMMWORD PTR [rdx]
vinserti128 ymm0,ymm0,XMMWORD PTR [rdx+0x10],0x1
vmovups XMMWORD PTR [rcx],xmm0
vextracti128 XMMWORD PTR [rcx+0x10],ymm0,0x1
vmovdqu8 xmm0,XMMWORD PTR [rdx+0x20]
vinserti128 ymm0,ymm0,XMMWORD PTR [rdx+0x30],0x1
vmovups XMMWORD PTR [rcx+0x20],xmm0
vextracti128 XMMWORD PTR [rcx+0x30],ymm0,0x1
vmovdqu8 xmm0,XMMWORD PTR [rdx+0x40]
vinserti128 ymm0,ymm0,XMMWORD PTR [rdx+0x50],0x1
vmovups XMMWORD PTR [rcx+0x40],xmm0
vextracti128 XMMWORD PTR [rcx+0x50],ymm0,0x1
vmovdqu8 xmm0,XMMWORD PTR [rdx+0x60]
vinserti128 ymm0,ymm0,XMMWORD PTR [rdx+0x70],0x1
vmovups XMMWORD PTR [rcx+0x60],xmm0
vextracti128 XMMWORD PTR [rcx+0x70],ymm0,0x1
mfence 
mov    DWORD PTR [rip+0x937acb],0xdbdb97ed        # 0xde0db0 <dpdk_bug97_marker>
---
mov    DWORD PTR [rip+0x937912],0xdbdb97be        # 0xde0db0 <dpdk_bug97_marker>
lea    rsi,[rax-0x80]
lea    r12,[rdx+0x80]
mfence 
vmovdqu8 xmm0,XMMWORD PTR [rdx]
mov    rdi,QWORD PTR [rbp-0x60]
vinserti128 ymm0,ymm0,XMMWORD PTR [rdx+0x10],0x1
lea    rcx,[rdi+0xb0]
vmovups XMMWORD PTR [rdi+0x30],xmm0
vextracti128 XMMWORD PTR [rdi+0x40],ymm0,0x1
vmovdqu8 xmm0,XMMWORD PTR [rdx+0x20]
vinserti128 ymm0,ymm0,XMMWORD PTR [rdx+0x30],0x1
vmovups XMMWORD PTR [rdi+0x50],xmm0
vextracti128 XMMWORD PTR [rdi+0x60],ymm0,0x1
vmovdqu8 xmm0,XMMWORD PTR [rdx+0x40]
vinserti128 ymm0,ymm0,XMMWORD PTR [rdx+0x50],0x1
vmovups XMMWORD PTR [rdi+0x70],xmm0
vextracti128 XMMWORD PTR [rdi+0x80],ymm0,0x1
vmovdqu8 xmm0,XMMWORD PTR [rdx+0x60]
vinserti128 ymm0,ymm0,XMMWORD PTR [rdx+0x70],0x1
vmovups XMMWORD PTR [rdi+0x90],xmm0
vextracti128 XMMWORD PTR [rdi+0xa0],ymm0,0x1
mfence 
mov    DWORD PTR [rip+0x93787c],0xdbdb97ed        # 0xde0db0 <dpdk_bug97_marker>
---
mov    DWORD PTR [rip+0x936fd2],0xdbdb97be        # 0xde0db0 <dpdk_bug97_marker>
lea    rcx,[rax-0x80]
sub    rdx,0xffffffffffffff80
mfence 
mov    rax,QWORD PTR [rbp-0xa0]
vmovdqu8 xmm0,XMMWORD PTR [rax*8+0x0]
vinserti128 ymm0,ymm0,XMMWORD PTR [rax*8+0x10],0x1
vmovups XMMWORD PTR [rdx-0x80],xmm0
vextracti128 XMMWORD PTR [rdx-0x70],ymm0,0x1
vmovdqu8 xmm0,XMMWORD PTR [rax*8+0x2]
vinserti128 ymm0,ymm0,XMMWORD PTR [rax*8+0x30],0x1
vmovups XMMWORD PTR [rdx-0x60],xmm0
vextracti128 XMMWORD PTR [rdx-0x50],ymm0,0x1
vmovdqu8 xmm0,XMMWORD PTR [rax*8+0x4]
vinserti128 ymm0,ymm0,XMMWORD PTR [rax*8+0x50],0x1
vmovups XMMWORD PTR [rdx-0x40],xmm0
vextracti128 XMMWORD PTR [rdx-0x30],ymm0,0x1
vmovdqu8 xmm0,XMMWORD PTR [rax*8+0x6]
vinserti128 ymm0,ymm0,XMMWORD PTR [rax*8+0x70],0x1
lea    rax,[r9+0x80]
vmovups XMMWORD PTR [rdx-0x20],xmm0
vextracti128 XMMWORD PTR [rdx-0x10],ymm0,0x1
mfence 
mov    DWORD PTR [rip+0x936f24],0xdbdb97ed        # 0xde0db0 <dpdk_bug97_marker>
---
mov    DWORD PTR [rip+0x936e2f],0xdbdb97be        # 0xde0db0 <dpdk_bug97_marker>
lea    rdi,[r9+0x80]
add    rax,0xffffffffffffff80
mfence 
vmovdqu8 xmm0,XMMWORD PTR [r9]
sub    rdx,0xffffffffffffff80
vinserti128 ymm0,ymm0,XMMWORD PTR [r9+0x10],0x1
vmovups XMMWORD PTR [rdx-0x80],xmm0
vextracti128 XMMWORD PTR [rdx-0x70],ymm0,0x1
vmovdqu8 xmm0,XMMWORD PTR [r9+0x20]
vinserti128 ymm0,ymm0,XMMWORD PTR [r9+0x30],0x1
vmovups XMMWORD PTR [rdx-0x60],xmm0
vextracti128 XMMWORD PTR [rdx-0x50],ymm0,0x1
vmovdqu8 xmm0,XMMWORD PTR [r9+0x40]
vinserti128 ymm0,ymm0,XMMWORD PTR [r9+0x50],0x1
vmovups XMMWORD PTR [rdx-0x40],xmm0
vextracti128 XMMWORD PTR [rdx-0x30],ymm0,0x1
vmovdqu8 xmm0,XMMWORD PTR [r9+0x60]
vinserti128 ymm0,ymm0,XMMWORD PTR [r9+0x70],0x1
vmovups XMMWORD PTR [rdx-0x20],xmm0
vextracti128 XMMWORD PTR [rdx-0x10],ymm0,0x1
mfence 
mov    DWORD PTR [rip+0x936da9],0xdbdb97ed        # 0xde0db0 <dpdk_bug97_marker>

Only the third inline sequence shows the offset bug.
Comment 48 Thomas Monjalon 2018-11-21 01:39:25 CET
Bug report open at GCC:
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88096
Comment 49 Tom Barbette 2018-11-21 14:40:33 CET
FYI, we are affected by this too (CONFIG_RTE_ENABLE_AVX=n, only after upgrading to DPDK 18.04). What we noticed is a segfault in mlx5_rx_replenish_bulk_mbuf.

Is there a better workaround than disabling all AVX instructions with CONFIG_RTE_ENABLE_AVX=n? I guess that shuts down other optimizations too...
Comment 50 Tom Barbette 2018-11-26 15:04:35 CET
Actually, this may be another, similar issue with SSE...
mlx5_rx_replenish_bulk_mbuf is actually still failing with AVX=n, when using mlx5 with secondary processes.
This only happens on our Skylake platform with Ubuntu 18.04 (GCC 7). Our code worked fine for months on Haswell, as well as on the Skylake machine with Ubuntu 16.04 (which ships with GCC 5). After upgrading to 18.04 (GCC 7) it started to crash as soon as packets flowed on the secondary process (which reads from the device). Simply using GCC 5 solved the problem.
Comment 51 Tom Barbette 2018-11-27 00:08:49 CET
It seems actually fixed in 18.11 (even using GCC 7). Sorry for bothering... Thanks Thomas for helping us offline.
Comment 52 Ferruh YIGIT 2018-12-10 14:57:10 CET
Hi Thomas, Yongseok,

gcc defect suggests that this can be binutils issue, can you please share the binutils version has been used?
Comment 53 Tom Barbette 2018-12-10 15:40:54 CET
Hi Ferruh,
We have it with binutils 2.30-21ubuntu1~18.04.

Btw, the reason our issue was fixed with 18.11 but not with AVX=n and AVX512=n is that we compile our whole application with DPDK's flags, which since 18.11 include -mno-avx512f.
In 18.08, simply using AVX=n and AVX512=n was not sufficient, because link-time optimizations or header inclusions would still produce crashing code.

But if someone statically links with -ldpdk without integrating all of DPDK's CFLAGS, it may still happen. I think the Snort/Suricata DAQ for DPDK actually does that.

Tom
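Applications that link against DPDK without replicating its CFLAGS can inherit them automatically; with meson-built DPDK (18.11 and later) this can be done via pkg-config (a sketch; whether your installation ships a `libdpdk.pc` depends on how DPDK was built and installed, and `myapp.c` is a placeholder):

```shell
# Build an external app with DPDK's own CFLAGS (including the
# -mno-avx512f workaround when present) rather than hand-rolled flags.
cc $(pkg-config --cflags libdpdk) -o myapp myapp.c $(pkg-config --libs libdpdk)
```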
Comment 54 Ferruh YIGIT 2018-12-10 15:56:58 CET
Hi Tom,

I am aware of the workaround on v18.11.

My comment was based on an update to the gcc defect we submitted, which claims this is an issue in binutils 2.30.

Please check the gcc defect for details; the one we submitted was marked as a duplicate of this report:
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=86735

binutils 2.30 also seems to have been fixed upstream; once distros pick up the fix, that can resolve the issue. Meanwhile, perhaps we can narrow down the workaround in 18.11 by checking the binutils version.
Just a reminder: I would like to narrow it down because the workaround in v18.11 (-mno-avx512f) is reported to cause a performance drop on the Skylake platform.
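A narrowed workaround could key off the assembler version instead of applying -mno-avx512f unconditionally; a sketch (only binutils 2.30 is confirmed affected in this thread, so treating it as the single affected minor version is an assumption, and `binutils_affected` is a hypothetical helper):

```shell
# Sketch: gate the -mno-avx512f workaround on the binutils version.
# Assumption: only binutils 2.30 is affected (the version confirmed here).
binutils_affected() {
    # pull "major.minor" out of a string such as
    # "GNU ld (GNU Binutils for Ubuntu) 2.30"
    ver=$(echo "$1" | grep -o '[0-9][0-9]*\.[0-9][0-9]*' | head -n 1)
    [ -n "$ver" ] || return 1
    [ "${ver%%.*}" -eq 2 ] && [ "${ver#*.}" -eq 30 ]
}

# in a real build this would be: binutils_affected "$(ld --version | head -n 1)"
if binutils_affected "GNU ld (GNU Binutils for Ubuntu) 2.30"; then
    WORKAROUND_CFLAGS="-mno-avx512f"
fi
echo "workaround flags: ${WORKAROUND_CFLAGS:-none}"
```

This keeps AVX512 enabled on toolchains that do not exhibit the bad encoding.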
Comment 55 Yongseok Koh 2018-12-11 05:13:13 CET
(In reply to Ferruh YIGIT from comment #52)
> Hi Thomas, Yongseok,
> 
> gcc defect suggests that this can be binutils issue, can you please share
> the binutils version has been used?

Ferruh,

I don't have the repro setup anymore.
Thomas might have one and could confirm the version.
But since we didn't upgrade binutils after installing the Ubuntu release, it would be the version shipped with that release.

Thanks,
Yongseok
Comment 56 Yongseok Koh 2018-12-11 05:24:34 CET
(In reply to Yongseok Koh from comment #55)
> (In reply to Ferruh YIGIT from comment #52)
> > Hi Thomas, Yongseok,
> > 
> > gcc defect suggests that this can be binutils issue, can you please share
> > the binutils version has been used?
> 
> Ferruh,
> 
> I don't have the repro setup anymore.
> Thomas might have one and could confirm the version.
> But, as we didn't upgrade binutils since we installed the Ubuntu release, so
> it would be the one shipped in the Ubuntu release.
> 
I've confirmed that Ubuntu 18.04.1 has

gcc 7.3.0-27ubuntu1~18.04
binutils 2.30-21ubuntu1~18.04

Thanks,
Yongseok
Comment 57 Tom Barbette 2018-12-18 09:16:59 CET
FYI, I recompiled with the latest binutils from git, disabling the workaround (-mno-avx512*), and it works now.
Comment 58 Ferruh YIGIT 2018-12-18 10:12:58 CET
Thanks for confirming, Tom. I will soon send a patch to check the binutils version and limit the -mno-avx512f argument to binutils 2.30.
Comment 59 Thomas Monjalon 2019-07-11 09:51:53 CEST
AVX512 is disabled in DPDK if an affected version of binutils is used.
The bug was fixed in 17.11, 18.11 and later releases.
Comment 61 Mehmet gelisin 2021-09-10 22:01:04 CEST
Description:
  The vhost crypto library code contains a post message handler
(vhost_crypto_msg_post_handler) which calls vhost_crypto_create_sess()
which in turn calls transform_cipher_param() depending on the operation
type. It is transform_cipher_param() that handles the payload data. The
payload contains a cipher key length and a static
VHOST_USER_CRYPTO_MAX_CIPHER_KEY_LENGTH (64) byte key buffer. When
transform_cipher_param() handles the payload data it does not check to
see if the buffer length doesn't exceed
VHOST_USER_CRYPTO_MAX_CIPHER_KEY_LENGTH. This missing check can cause
out of bound reads which could trigger a crash or a potential
information leak. Also, the vhost crypto library code contains a post
message handler (vhost_crypto_msg_post_handler) which calls
vhost_crypto_create_sess() which in turn calls transform_chain_param()
depending on the operation type. It is transform_chain_param() that
handles the payload data. The payload contains a cipher key length and a
static VHOST_USER_CRYPTO_MAX_CIPHER_KEY_LENGTH (64) byte key buffer, it
also contains a digest length and a static authentication key buffer
(size: VHOST_USER_CRYPTO_MAX_HMAC_KEY_LENGTH (512)) and authentication
key buffer length. None of these length values are validated. Which can
lead to reading out of bound.