Changelog for Linux 5.10.135

 
ARM: 9216/1: Fix MAX_DMA_ADDRESS overflow [+ + +]
Author: Florian Fainelli <f.fainelli@gmail.com>
Date:   Tue Jul 19 17:33:21 2022 +0100

    ARM: 9216/1: Fix MAX_DMA_ADDRESS overflow
    
    [ Upstream commit fb0fd3469ead5b937293c213daa1f589b4b7ce46 ]
    
    Commit 26f09e9b3a06 ("mm/memblock: add memblock memory allocation apis")
    added a check to determine whether arm_dma_zone_size is exceeding the
    amount of kernel virtual address space available between the upper 4GB
    virtual address limit and PAGE_OFFSET in order to provide a suitable
    definition of MAX_DMA_ADDRESS that should fit within the 32-bit virtual
    address space. The quantity used for comparison was off by a missing
    trailing 0, causing MAX_DMA_ADDRESS to overflow a 32-bit quantity.
    
    This was caught thanks to CONFIG_DEBUG_VIRTUAL on the bcm2711 platform
    where we define a dma_zone_size of 1GB and we have a PAGE_OFFSET value
    of 0xc000_0000 (CONFIG_VMSPLIT_3G) leading to MAX_DMA_ADDRESS being
    0x1_0000_0000 which overflows the unsigned long type used throughout
    __pa() and then __virt_addr_valid(). Because the virtual address passed
    to __virt_addr_valid() would now be 0, the function would loudly warn
    and flood the kernel log, thus making the platform unable to boot
    properly.
    
    Fixes: 26f09e9b3a06 ("mm/memblock: add memblock memory allocation apis")
    Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
    Reviewed-by: Linus Walleij <linus.walleij@linaro.org>
    Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
    Signed-off-by: Sasha Levin <sashal@kernel.org>
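
Editor's sketch: the overflow described in this commit can be reproduced with plain integer arithmetic. This is a minimal model, not the kernel macro itself. `sketch_max_dma_address()` and its `limit` parameter are hypothetical names, and the 32-bit `unsigned long` of ARM is modelled with `uint32_t`. The PAGE_OFFSET and zone-size values are the ones quoted in the commit message.

```c
#include <stdint.h>

/* PAGE_OFFSET as quoted in the commit message (CONFIG_VMSPLIT_3G). */
#define SKETCH_PAGE_OFFSET UINT64_C(0xc0000000)

/*
 * Hypothetical model of the MAX_DMA_ADDRESS selection. `limit` is the
 * constant the zone size is compared against: 0x10000000 models the
 * value with the missing trailing 0, 0x100000000 the corrected 4GB limit.
 * The result is truncated to 32 bits, like unsigned long on 32-bit ARM.
 */
static uint32_t sketch_max_dma_address(uint64_t dma_zone_size, uint64_t limit)
{
    if (dma_zone_size && dma_zone_size < limit - SKETCH_PAGE_OFFSET)
        return (uint32_t)(SKETCH_PAGE_OFFSET + dma_zone_size);

    /* Fall back to the top of the 32-bit address space. */
    return UINT32_C(0xffffffff);
}
```

With a 1GB zone, the short limit lets the sum 0x1_0000_0000 through and it truncates to 0, while the 4GB limit rejects the zone size and falls back to 0xffffffff.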

ARM: crypto: comment out gcc warning that breaks clang builds [+ + +]
Author: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Date:   Sun Jul 31 12:05:51 2022 +0200

    ARM: crypto: comment out gcc warning that breaks clang builds
    
    The gcc build warning prevents all clang-built kernels from working
    properly, so comment it out to fix the build.
    
    This is a -stable kernel only patch for now; it will be resolved
    differently in mainline releases in the future.
    
    Cc: "Jason A. Donenfeld" <Jason@zx2c4.com>
    Cc: "Justin M. Forbes" <jforbes@fedoraproject.org>
    Cc: Ard Biesheuvel <ardb@kernel.org>
    Acked-by: Arnd Bergmann <arnd@arndb.de>
    Cc: Nicolas Pitre <nico@linaro.org>
    Cc: Nathan Chancellor <nathan@kernel.org>
    Cc: Nick Desaulniers <ndesaulniers@google.com>
    Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

 
Bluetooth: L2CAP: Fix use-after-free caused by l2cap_chan_put [+ + +]
Author: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
Date:   Thu Jul 21 09:10:50 2022 -0700

    Bluetooth: L2CAP: Fix use-after-free caused by l2cap_chan_put
    
    commit d0be8347c623e0ac4202a1d4e0373882821f56b0 upstream.
    
    This fixes the following trace, which is caused by hci_rx_work starting up
    *after* the final channel reference has been put() during sock_close() but
    *before* the references to the channel have been destroyed. The code now
    relies on kref_get_unless_zero/l2cap_chan_hold_unless_zero to avoid
    referencing a channel that is about to be destroyed.
    
      refcount_t: increment on 0; use-after-free.
      BUG: KASAN: use-after-free in refcount_dec_and_test+0x20/0xd0
      Read of size 4 at addr ffffffc114f5bf18 by task kworker/u17:14/705
    
      CPU: 4 PID: 705 Comm: kworker/u17:14 Tainted: G S      W
      4.14.234-00003-g1fb6d0bd49a4-dirty #28
      Hardware name: Qualcomm Technologies, Inc. SM8150 V2 PM8150
      Google Inc. MSM sm8150 Flame DVT (DT)
      Workqueue: hci0 hci_rx_work
      Call trace:
       dump_backtrace+0x0/0x378
       show_stack+0x20/0x2c
       dump_stack+0x124/0x148
       print_address_description+0x80/0x2e8
       __kasan_report+0x168/0x188
       kasan_report+0x10/0x18
       __asan_load4+0x84/0x8c
       refcount_dec_and_test+0x20/0xd0
       l2cap_chan_put+0x48/0x12c
       l2cap_recv_frame+0x4770/0x6550
       l2cap_recv_acldata+0x44c/0x7a4
       hci_acldata_packet+0x100/0x188
       hci_rx_work+0x178/0x23c
       process_one_work+0x35c/0x95c
       worker_thread+0x4cc/0x960
       kthread+0x1a8/0x1c4
       ret_from_fork+0x10/0x18
    
    Cc: stable@kernel.org
    Reported-by: Lee Jones <lee.jones@linaro.org>
    Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
    Tested-by: Lee Jones <lee.jones@linaro.org>
    Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
    Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
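
Editor's sketch: the "hold unless zero" idea this commit relies on can be illustrated in user space. This is not the kernel's kref implementation. `struct sketch_chan` and `sketch_chan_hold_unless_zero()` are hypothetical names, and C11 atomics stand in for the kernel's refcount primitives.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for a reference-counted channel object. */
struct sketch_chan {
    atomic_uint refcnt;
};

/*
 * Take a reference only if the count has not already dropped to zero.
 * Returns false when the object is on its way to being destroyed, in
 * which case the caller must not touch it.
 */
static bool sketch_chan_hold_unless_zero(struct sketch_chan *chan)
{
    unsigned int old = atomic_load(&chan->refcnt);

    while (old != 0) {
        /* On failure, `old` is refreshed with the current count. */
        if (atomic_compare_exchange_weak(&chan->refcnt, &old, old + 1))
            return true;
    }
    return false;
}
```

A lookup path that uses such a helper simply skips a channel for which it returns false, instead of incrementing a counter that has already reached zero.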
 
bpf: Add PROG_TEST_RUN support for sk_lookup programs [+ + +]
Author: Lorenz Bauer <lmb@cloudflare.com>
Date:   Mon Aug 1 15:29:15 2022 +0800

    bpf: Add PROG_TEST_RUN support for sk_lookup programs
    
    commit 7c32e8f8bc33a5f4b113a630857e46634e3e143b upstream.
    
    Allow passing sk_lookup programs to PROG_TEST_RUN. User space
    provides the full bpf_sk_lookup struct as context. Since the
    context includes a socket pointer that can't be exposed
    to user space, we define that PROG_TEST_RUN returns the cookie
    of the selected socket or zero in place of the socket pointer.
    
    We don't support testing programs that select a reuseport socket,
    since this would mean running another (unrelated) BPF program
    from the sk_lookup test handler.
    
    Signed-off-by: Lorenz Bauer <lmb@cloudflare.com>
    Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    Link: https://lore.kernel.org/bpf/20210303101816.36774-3-lmb@cloudflare.com
    Signed-off-by: Tianchen Ding <dtcccc@linux.alibaba.com>
    Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

bpf: Consolidate shared test timing code [+ + +]
Author: Lorenz Bauer <lmb@cloudflare.com>
Date:   Mon Aug 1 15:29:14 2022 +0800

    bpf: Consolidate shared test timing code
    
    commit 607b9cc92bd7208338d714a22b8082fe83bcb177 upstream.
    
    Share the timing / signal interruption logic between different
    implementations of PROG_TEST_RUN. There is a change in behaviour
    as well. We check the loop exit condition before checking for
    pending signals. This resolves an edge case where a signal
    arrives during the last iteration. Instead of aborting with
    EINTR we return the successful result to user space.
    
    Signed-off-by: Lorenz Bauer <lmb@cloudflare.com>
    Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    Acked-by: Andrii Nakryiko <andrii@kernel.org>
    Link: https://lore.kernel.org/bpf/20210303101816.36774-2-lmb@cloudflare.com
    [dtcccc: fix conflicts in bpf_test_run()]
    Signed-off-by: Tianchen Ding <dtcccc@linux.alibaba.com>
    Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
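
Editor's sketch: the behaviour change in this commit comes down to the order of two checks at the end of each iteration. The model below is hypothetical and is not the kernel's bpf_test_run() code. `sketch_test_run()`, `signal_after`, and `exit_check_first` are illustrative names, and a pending signal is simulated by an iteration number.

```c
#include <errno.h>
#include <stdbool.h>

/*
 * Run `repeat` iterations. A signal is simulated as becoming pending once
 * `signal_after` iterations have completed (0 means no signal at all).
 * `exit_check_first` selects the order of the two checks.
 */
static int sketch_test_run(unsigned int repeat, unsigned int signal_after,
                           bool exit_check_first)
{
    for (unsigned int done = 1; ; done++) {
        bool finished = done >= repeat;
        bool signalled = signal_after != 0 && done >= signal_after;

        if (exit_check_first) {
            if (finished)
                return 0;        /* the result is complete, report it */
            if (signalled)
                return -EINTR;
        } else {
            if (signalled)
                return -EINTR;   /* a finished run is discarded */
            if (finished)
                return 0;
        }
    }
}
```

When the simulated signal arrives during the last iteration, the exit-first ordering returns success and the signal-first ordering returns -EINTR. A signal that arrives earlier aborts the run in both orderings.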

 
docs/kernel-parameters: Update descriptions for "mitigations=" param with retbleed [+ + +]
Author: Eiichi Tsukata <eiichi.tsukata@nutanix.com>
Date:   Thu Jul 28 04:39:07 2022 +0000

    docs/kernel-parameters: Update descriptions for "mitigations=" param with retbleed
    
    commit ea304a8b89fd0d6cf94ee30cb139dc23d9f1a62f upstream.
    
    Updates descriptions for "mitigations=off" and "mitigations=auto,nosmt"
    with the respective retbleed= settings.
    
    Signed-off-by: Eiichi Tsukata <eiichi.tsukata@nutanix.com>
    Signed-off-by: Borislav Petkov <bp@suse.de>
    Cc: corbet@lwn.net
    Link: https://lore.kernel.org/r/20220728043907.165688-1-eiichi.tsukata@nutanix.com
    Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

 
Documentation: fix sctp_wmem in ip-sysctl.rst [+ + +]
Author: Xin Long <lucien.xin@gmail.com>
Date:   Thu Jul 21 10:35:46 2022 -0400

    Documentation: fix sctp_wmem in ip-sysctl.rst
    
    [ Upstream commit aa709da0e032cee7c202047ecd75f437bb0126ed ]
    
    Since commit 1033990ac5b2 ("sctp: implement memory accounting on tx path"),
    SCTP has supported memory accounting on tx path where 'sctp_wmem' is used
    by sk_wmem_schedule(). So we should fix the description for this option in
    ip-sysctl.rst accordingly.
    
    v1->v2:
      - Improve the description as Marcelo suggested.
    
    Fixes: 1033990ac5b2 ("sctp: implement memory accounting on tx path")
    Signed-off-by: Xin Long <lucien.xin@gmail.com>
    Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>
    Signed-off-by: Sasha Levin <sashal@kernel.org>

 
EDAC/ghes: Set the DIMM label unconditionally [+ + +]
Author: Toshi Kani <toshi.kani@hpe.com>
Date:   Thu Jul 21 12:05:03 2022 -0600

    EDAC/ghes: Set the DIMM label unconditionally
    
    commit 5e2805d5379619c4a2e3ae4994e73b36439f4bad upstream.
    
    The commit
    
      cb51a371d08e ("EDAC/ghes: Setup DIMM label from DMI and use it in error reports")
    
    enforced that both the bank and device strings passed to
    dimm_setup_label() are not NULL.
    
    However, there are BIOSes, for example on a
    
      HPE ProLiant DL360 Gen10/ProLiant DL360 Gen10, BIOS U32 03/15/2019
    
    which don't populate both strings:
    
      Handle 0x0020, DMI type 17, 84 bytes
      Memory Device
              Array Handle: 0x0013
              Error Information Handle: Not Provided
              Total Width: 72 bits
              Data Width: 64 bits
              Size: 32 GB
              Form Factor: DIMM
              Set: None
              Locator: PROC 1 DIMM 1        <===== device
              Bank Locator: Not Specified   <===== bank
    
    This results in a buffer overflow because ghes_edac_register() calls
    strlen() on an uninitialized label, which had non-zero values left over
    from krealloc_array():
    
      detected buffer overflow in __fortify_strlen
       ------------[ cut here ]------------
       kernel BUG at lib/string_helpers.c:983!
       invalid opcode: 0000 [#1] PREEMPT SMP NOPTI
       CPU: 1 PID: 1 Comm: swapper/0 Tainted: G          I       5.18.6-200.fc36.x86_64 #1
       Hardware name: HPE ProLiant DL360 Gen10/ProLiant DL360 Gen10, BIOS U32 03/15/2019
       RIP: 0010:fortify_panic
       ...
       Call Trace:
        <TASK>
        ghes_edac_register.cold
        ghes_probe
        platform_probe
        really_probe
        __driver_probe_device
        driver_probe_device
        __driver_attach
        ? __device_attach_driver
        bus_for_each_dev
        bus_add_driver
        driver_register
        acpi_ghes_init
        acpi_init
        ? acpi_sleep_proc_init
        do_one_initcall
    
    The label contains garbage because the commit in Fixes reallocs the
    DIMMs array while scanning the system but doesn't clear the newly
    allocated memory.
    
    Change dimm_setup_label() to always initialize the label to fix the
    issue. Set it to the empty string in case BIOS does not provide both
    bank and device so that ghes_edac_register() can keep the default label
    given by edac_mc_alloc_dimms().
    
      [ bp: Rewrite commit message. ]
    
    Fixes: b9cae27728d1f ("EDAC/ghes: Scan the system once on driver init")
    Co-developed-by: Robert Richter <rric@kernel.org>
    Signed-off-by: Robert Richter <rric@kernel.org>
    Signed-off-by: Toshi Kani <toshi.kani@hpe.com>
    Signed-off-by: Borislav Petkov <bp@suse.de>
    Tested-by: Robert Elliott <elliott@hpe.com>
    Cc: <stable@vger.kernel.org>
    Link: https://lore.kernel.org/r/20220719220124.760359-1-toshi.kani@hpe.com
    Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
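
Editor's sketch: the fix amounts to writing the label on every path, so that a later strlen() never sees leftover bytes. The function below is a user-space illustration under assumptions, not the driver code. `sketch_setup_label()` is a hypothetical name, and the way a partially populated bank/device pair is formatted is an assumption of this sketch.

```c
#include <stddef.h>
#include <stdio.h>

/*
 * Always write a NUL-terminated label. If neither string is provided the
 * label becomes the empty string, which a caller can treat as "keep the
 * default label".
 */
static void sketch_setup_label(char *label, size_t size,
                               const char *bank, const char *device)
{
    int has_bank = bank && *bank;
    int has_device = device && *device;

    /* snprintf() NUL-terminates whenever size > 0. */
    snprintf(label, size, "%s%s%s",
             has_bank ? bank : "",
             (has_bank && has_device) ? " " : "",
             has_device ? device : "");
}
```

Calling it with two NULL strings on a buffer full of stale bytes leaves an empty, terminated label rather than garbage.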

 
i40e: Fix interface init with MSI interrupts (no MSI-X) [+ + +]
Author: Michal Maloszewski <michal.maloszewski@intel.com>
Date:   Fri Jul 22 10:54:01 2022 -0700

    i40e: Fix interface init with MSI interrupts (no MSI-X)
    
    [ Upstream commit 5fcbb711024aac6d4db385623e6f2fdf019f7782 ]
    
    Fix the inability to bring an interface up on a setup with
    only MSI interrupts enabled (no MSI-X).
    The solution is to set the default number of QPs to 1. This is enough,
    since without MSI-X support the driver enables only a basic feature set.
    
    Fixes: bc6d33c8d93f ("i40e: Fix the number of queues available to be mapped for use")
    Signed-off-by: Dawid Lukwinski <dawid.lukwinski@intel.com>
    Signed-off-by: Michal Maloszewski <michal.maloszewski@intel.com>
    Tested-by: Dave Switzer <david.switzer@intel.com>
    Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
    Link: https://lore.kernel.org/r/20220722175401.112572-1-anthony.l.nguyen@intel.com
    Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    Signed-off-by: Sasha Levin <sashal@kernel.org>

 
ice: check (DD | EOF) bits on Rx descriptor rather than (EOP | RS) [+ + +]
Author: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
Date:   Thu Jul 7 12:20:42 2022 +0200

    ice: check (DD | EOF) bits on Rx descriptor rather than (EOP | RS)
    
    commit 283d736ff7c7e96ac5b32c6c0de40372f8eb171e upstream.
    
    The Tx side sets the EOP and RS bits on descriptors to indicate that a
    particular descriptor is the last one and needs to generate an irq when
    it is sent. These bits should not be checked on the completion path,
    regardless of whether it is Tx or Rx. The DD bit serves this purpose:
    it indicates that a particular descriptor is either for Rx or was
    successfully Txed. EOF is also set, as the loopback test does not xmit
    fragmented frames.
    
    Check the (DD | EOF) bits in ice_lbtest_receive_frames() instead of the
    EOP and RS pair.
    
    Fixes: 0e674aeb0b77 ("ice: Add handler for ethtool selftest")
    Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
    Tested-by: George Kuruvinakunnel <george.kuruvinakunnel@intel.com>
    Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
    Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

ice: do not setup vlan for loopback VSI [+ + +]
Author: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
Date:   Thu Jul 7 12:20:43 2022 +0200

    ice: do not setup vlan for loopback VSI
    
    commit cc019545a238518fa9da1e2a889f6e1bb1005a63 upstream.
    
    Currently the loopback test is failing due to the error returned from
    ice_vsi_vlan_setup(). Skip calling it when preparing the loopback VSI.
    
    Fixes: 0e674aeb0b77 ("ice: Add handler for ethtool selftest")
    Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
    Tested-by: George Kuruvinakunnel <george.kuruvinakunnel@intel.com>
    Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
    Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

 
igmp: Fix data-races around sysctl_igmp_qrv. [+ + +]
Author: Kuniyuki Iwashima <kuniyu@amazon.com>
Date:   Fri Jul 15 10:17:44 2022 -0700

    igmp: Fix data-races around sysctl_igmp_qrv.
    
    [ Upstream commit 8ebcc62c738f68688ee7c6fec2efe5bc6d3d7e60 ]
    
    While reading sysctl_igmp_qrv, it can be changed concurrently.
    Thus, we need to add READ_ONCE() to its readers.
    
    This test can be packed into a helper, so such changes will be in the
    follow-up series after net is merged into net-next.
    
      qrv ?: READ_ONCE(net->ipv4.sysctl_igmp_qrv);
    
    Fixes: a9fe8e29945d ("ipv4: implement igmp_qrv sysctl to tune igmp robustness variable")
    Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>
    Signed-off-by: Sasha Levin <sashal@kernel.org>
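
Editor's sketch: the `qrv ?: READ_ONCE(...)` expression quoted in this commit can be written as a small helper. The version below is a user-space approximation. `sketch_read_once_int()` and `sketch_effective_qrv()` are hypothetical names, and a single volatile load only approximates the kernel's READ_ONCE() macro without reproducing all of its guarantees.

```c
/* Hypothetical stand-in for READ_ONCE() on an int: one volatile load. */
static int sketch_read_once_int(const int *p)
{
    return *(const volatile int *)p;
}

/*
 * Use the per-interface value when it is non-zero, otherwise read the
 * sysctl exactly once. This is the standard C spelling of the GNU
 * `qrv ?: ...` shorthand.
 */
static int sketch_effective_qrv(int qrv, const int *sysctl_igmp_qrv)
{
    return qrv ? qrv : sketch_read_once_int(sysctl_igmp_qrv);
}
```

Reading the shared value once into the result means a concurrent write cannot produce two different values within the same expression.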

 
ipv6/addrconf: fix a null-ptr-deref bug for ip6_ptr [+ + +]
Author: Ziyang Xuan <william.xuanziyang@huawei.com>
Date:   Thu Jul 28 09:33:07 2022 +0800

    ipv6/addrconf: fix a null-ptr-deref bug for ip6_ptr
    
    commit 85f0173df35e5462d89947135a6a5599c6c3ef6f upstream.
    
    Changing a net device's MTU to a value smaller than IPV6_MIN_MTU, or
    unregistering the device, while a route is being matched may
    occasionally trigger a null-ptr-deref bug on ip6_ptr, as follows.
    
    =========================================================
    BUG: KASAN: null-ptr-deref in find_match.part.0+0x70/0x134
    Read of size 4 at addr 0000000000000308 by task ping6/263
    
    CPU: 2 PID: 263 Comm: ping6 Not tainted 5.19.0-rc7+ #14
    Call trace:
     dump_backtrace+0x1a8/0x230
     show_stack+0x20/0x70
     dump_stack_lvl+0x68/0x84
     print_report+0xc4/0x120
     kasan_report+0x84/0x120
     __asan_load4+0x94/0xd0
     find_match.part.0+0x70/0x134
     __find_rr_leaf+0x408/0x470
     fib6_table_lookup+0x264/0x540
     ip6_pol_route+0xf4/0x260
     ip6_pol_route_output+0x58/0x70
     fib6_rule_lookup+0x1a8/0x330
     ip6_route_output_flags_noref+0xd8/0x1a0
     ip6_route_output_flags+0x58/0x160
     ip6_dst_lookup_tail+0x5b4/0x85c
     ip6_dst_lookup_flow+0x98/0x120
     rawv6_sendmsg+0x49c/0xc70
     inet_sendmsg+0x68/0x94
    
    Reproducer:
    First, prepare the conditions:
    $ ip netns add ns1
    $ ip netns add ns2
    $ ip link add veth1 type veth peer name veth2
    $ ip link set veth1 netns ns1
    $ ip link set veth2 netns ns2
    $ ip netns exec ns1 ip -6 addr add 2001:0db8:0:f101::1/64 dev veth1
    $ ip netns exec ns2 ip -6 addr add 2001:0db8:0:f101::2/64 dev veth2
    $ ip netns exec ns1 ifconfig veth1 up
    $ ip netns exec ns2 ifconfig veth2 up
    $ ip netns exec ns1 ip -6 route add 2000::/64 dev veth1 metric 1
    $ ip netns exec ns2 ip -6 route add 2001::/64 dev veth2 metric 1
    
    Second, run the following two loops in two separate ssh sessions:
    $ ip netns exec ns1 sh
    $ while true; do ip -6 addr add 2001:0db8:0:f101::1/64 dev veth1; ip -6 route add 2000::/64 dev veth1 metric 1; ping6 2000::2; done
    
    $ ip netns exec ns1 sh
    $ while true; do ip link set veth1 mtu 1000; ip link set veth1 mtu 1500; sleep 5; done
    
    This happens because addrconf_ifdown() first sets ip6_ptr to NULL, and
    ip6_ignore_linkdown() then accesses ip6_ptr directly without a NULL check.
    
            cpu0                    cpu1
    fib6_table_lookup
    __find_rr_leaf
                            addrconf_notify [ NETDEV_CHANGEMTU ]
                            addrconf_ifdown
                            RCU_INIT_POINTER(dev->ip6_ptr, NULL)
    find_match
    ip6_ignore_linkdown
    
    Fix the null-ptr-deref bug by adding a NULL check for ip6_ptr before it
    is used in ip6_ignore_linkdown().
    
    Fixes: dcd1f572954f ("net/ipv6: Remove fib6_idev")
    Signed-off-by: Ziyang Xuan <william.xuanziyang@huawei.com>
    Reviewed-by: David Ahern <dsahern@kernel.org>
    Link: https://lore.kernel.org/r/20220728013307.656257-1-william.xuanziyang@huawei.com
    Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
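
Editor's sketch: the shape of the NULL check described in this commit, using simplified stand-in types. The `sketch_*` structures and function are hypothetical. Returning true when the per-device IPv6 state is gone is an assumption of this sketch, and the kernel would use an RCU accessor where this sketch uses a plain load.

```c
#include <stdbool.h>
#include <stddef.h>

/* Simplified, hypothetical stand-ins for the kernel structures. */
struct sketch_inet6_dev {
    int ignore_routes_with_linkdown;
};

struct sketch_net_device {
    struct sketch_inet6_dev *ip6_ptr; /* may be cleared concurrently */
};

static bool sketch_ignore_linkdown(const struct sketch_net_device *dev)
{
    /* Copy the pointer so the check and the use refer to the same value. */
    const struct sketch_inet6_dev *idev = dev->ip6_ptr;

    if (idev == NULL)
        return true; /* assumption: a vanished idev means "ignore" */

    return idev->ignore_routes_with_linkdown != 0;
}
```

The important part is that the dereference happens only after the pointer has been checked, which is the step the original code skipped.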

 
Linux: Linux 5.10.135 [+ + +]
Author: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Date:   Wed Aug 3 12:00:52 2022 +0200

    Linux 5.10.135
    
    Link: https://lore.kernel.org/r/20220801114133.641770326@linuxfoundation.org
    Tested-by: Jon Hunter <jonathanh@nvidia.com>
    Tested-by: Florian Fainelli <f.fainelli@gmail.com>
    Tested-by: Linux Kernel Functional Testing <lkft@linaro.org>
    Tested-by: Shuah Khan <skhan@linuxfoundation.org>
    Tested-by: Guenter Roeck <linux@roeck-us.net>
    Tested-by: Rudi Heitbaum <rudi@heitbaum.com>
    Tested-by: Pavel Machek (CIP) <pavel@denx.de>
    Tested-by: Sudip Mukherjee <sudip.mukherjee@codethink.co.uk>
    Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

 
macsec: always read MACSEC_SA_ATTR_PN as a u64 [+ + +]
Author: Sabrina Dubroca <sd@queasysnail.net>
Date:   Fri Jul 22 11:16:30 2022 +0200

    macsec: always read MACSEC_SA_ATTR_PN as a u64
    
    [ Upstream commit c630d1fe6219769049c87d1a6a0e9a6de55328a1 ]
    
    Currently, MACSEC_SA_ATTR_PN is handled inconsistently, sometimes as a
    u32, sometimes forced into a u64 without checking the actual length of
    the attribute. Instead, we can use nla_get_u64 everywhere, which will
    read up to 64 bits into a u64, capped by the actual length of the
    attribute coming from userspace.
    
    This fixes several issues:
     - the check in validate_add_rxsa doesn't work with 32-bit attributes
     - the checks in validate_add_txsa and validate_upd_sa incorrectly
       reject X << 32 (with X != 0)
    
    Fixes: 48ef50fa866a ("macsec: Netlink support of XPN cipher suites (IEEE 802.1AEbw)")
    Signed-off-by: Sabrina Dubroca <sd@queasysnail.net>
    Signed-off-by: David S. Miller <davem@davemloft.net>
    Signed-off-by: Sasha Levin <sashal@kernel.org>
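
Editor's sketch: "read up to 64 bits, capped by the actual length" can be pictured as a bounded copy into a zero-initialized u64. The helper below is a user-space illustration and is not the kernel's netlink API. `sketch_get_u64_capped()` is a hypothetical name.

```c
#include <stdint.h>
#include <string.h>

/*
 * Copy at most 8 bytes of an attribute payload into a zeroed 64-bit
 * value. Shorter payloads leave the remaining bytes at zero, and longer
 * payloads are truncated to 8 bytes.
 */
static uint64_t sketch_get_u64_capped(const void *payload, size_t len)
{
    uint64_t value = 0;

    memcpy(&value, payload, len < sizeof(value) ? len : sizeof(value));
    return value;
}
```

On a little-endian host, a 4-byte payload read this way yields the same numeric value as the original 32-bit number, so one code path can serve both attribute sizes.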

macsec: fix error message in macsec_add_rxsa and _txsa [+ + +]
Author: Sabrina Dubroca <sd@queasysnail.net>
Date:   Fri Jul 22 11:16:28 2022 +0200

    macsec: fix error message in macsec_add_rxsa and _txsa
    
    [ Upstream commit 3240eac4ff20e51b87600dbd586ed814daf313db ]
    
    The expected length is MACSEC_SALT_LEN, not MACSEC_SA_ATTR_SALT.
    
    Fixes: 48ef50fa866a ("macsec: Netlink support of XPN cipher suites (IEEE 802.1AEbw)")
    Signed-off-by: Sabrina Dubroca <sd@queasysnail.net>
    Signed-off-by: David S. Miller <davem@davemloft.net>
    Signed-off-by: Sasha Levin <sashal@kernel.org>

macsec: fix NULL deref in macsec_add_rxsa [+ + +]
Author: Sabrina Dubroca <sd@queasysnail.net>
Date:   Fri Jul 22 11:16:27 2022 +0200

    macsec: fix NULL deref in macsec_add_rxsa
    
    [ Upstream commit f46040eeaf2e523a4096199fd93a11e794818009 ]
    
    Commit 48ef50fa866a added a test on tb_sa[MACSEC_SA_ATTR_PN], but
    nothing guarantees that it's not NULL at this point. The same code was
    added to macsec_add_txsa, but there it's not a problem because
    validate_add_txsa checks that the MACSEC_SA_ATTR_PN attribute is
    present.
    
    Note: it's not possible to reproduce with iproute, because iproute
    doesn't allow creating an SA without specifying the PN.
    
    Fixes: 48ef50fa866a ("macsec: Netlink support of XPN cipher suites (IEEE 802.1AEbw)")
    Link: https://bugzilla.kernel.org/show_bug.cgi?id=208315
    Reported-by: Frantisek Sumsal <fsumsal@redhat.com>
    Signed-off-by: Sabrina Dubroca <sd@queasysnail.net>
    Signed-off-by: David S. Miller <davem@davemloft.net>
    Signed-off-by: Sasha Levin <sashal@kernel.org>

macsec: limit replay window size with XPN [+ + +]
Author: Sabrina Dubroca <sd@queasysnail.net>
Date:   Fri Jul 22 11:16:29 2022 +0200

    macsec: limit replay window size with XPN
    
    [ Upstream commit b07a0e2044057f201d694ab474f5c42a02b6465b ]
    
    IEEE 802.1AEbw-2013 (section 10.7.8) specifies that the maximum value
    of the replay window is 2^30-1, to help with recovery of the upper
    bits of the PN.
    
    To avoid leaving the existing macsec device in an inconsistent state
    if this test fails during changelink, reuse the cleanup mechanism
    introduced for HW offload. This wasn't needed until now because
    macsec_changelink_common could not fail during changelink, as
    modifying the cipher suite was not allowed.
    
    Finally, this must happen after handling IFLA_MACSEC_CIPHER_SUITE so
    that secy->xpn is set.
    
    Fixes: 48ef50fa866a ("macsec: Netlink support of XPN cipher suites (IEEE 802.1AEbw)")
    Signed-off-by: Sabrina Dubroca <sd@queasysnail.net>
    Signed-off-by: David S. Miller <davem@davemloft.net>
    Signed-off-by: Sasha Levin <sashal@kernel.org>
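
Editor's sketch: the limit quoted from IEEE 802.1AEbw-2013 translates into a simple range check. The names below are hypothetical, and returning -EINVAL for an oversized window is an assumption of this sketch rather than something stated in the commit message.

```c
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

/* Maximum replay window with XPN, per the commit message: 2^30 - 1. */
#define SKETCH_XPN_MAX_REPLAY_WINDOW ((UINT32_C(1) << 30) - 1)

/* Reject a replay window that is too large for an XPN cipher suite. */
static int sketch_check_replay_window(bool xpn, uint32_t replay_window)
{
    if (xpn && replay_window > SKETCH_XPN_MAX_REPLAY_WINDOW)
        return -EINVAL;
    return 0;
}
```

Because the check depends on whether XPN is in use, it only makes sense after the cipher suite has been processed, which matches the ordering note in the commit message.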

 
mt7601u: add USB device ID for some versions of XiaoDu WiFi Dongle. [+ + +]
Author: Wei Mingzhi <whistler@member.fsf.org>
Date:   Sat Jun 19 00:08:40 2021 +0800

    mt7601u: add USB device ID for some versions of XiaoDu WiFi Dongle.
    
    commit 829eea7c94e0bac804e65975639a2f2e5f147033 upstream.
    
    USB device ID of some versions of XiaoDu WiFi Dongle is 2955:1003
    instead of 2955:1001. Both are the same mt7601u hardware.
    
    Signed-off-by: Wei Mingzhi <whistler@member.fsf.org>
    Acked-by: Jakub Kicinski <kubakici@wp.pl>
    Signed-off-by: Kalle Valo <kvalo@codeaurora.org>
    Link: https://lore.kernel.org/r/20210618160840.305024-1-whistler@member.fsf.org
    Cc: Yan Xinyu <sdlyyxy@bupt.edu.cn>
    Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

 
net/tls: Remove the context from the list in tls_device_down [+ + +]
Author: Maxim Mikityanskiy <maximmi@nvidia.com>
Date:   Thu Jul 21 12:11:27 2022 +0300

    net/tls: Remove the context from the list in tls_device_down
    
    commit f6336724a4d4220c89a4ec38bca84b03b178b1a3 upstream.
    
    tls_device_down takes a reference on all contexts it's going to move to
    the degraded state (software fallback). If sk_destruct runs afterwards,
    it can reduce the reference counter back to 1 and return early without
    destroying the context. Then tls_device_down will release the reference
    it took and call tls_device_free_ctx. However, the context will still
    stay in tls_device_down_list forever. The list will then contain an item
    whose memory has been released, making memory corruption possible.
    
    Fix the above bug by properly removing the context from all lists before
    any call to tls_device_free_ctx.
    
    Fixes: 3740651bf7e2 ("tls: Fix context leak on tls_device_down")
    Signed-off-by: Maxim Mikityanskiy <maximmi@nvidia.com>
    Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>
    Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

 
net: macsec: fix potential resource leak in macsec_add_rxsa() and macsec_add_txsa() [+ + +]
Author: Jianglei Nie <niejianglei2021@163.com>
Date:   Fri Jul 22 17:29:02 2022 +0800

    net: macsec: fix potential resource leak in macsec_add_rxsa() and macsec_add_txsa()
    
    [ Upstream commit c7b205fbbf3cffa374721bb7623f7aa8c46074f1 ]
    
    init_rx_sa() allocates resources for rx_sa->stats and rx_sa->key.tfm
    with alloc_percpu() and macsec_alloc_tfm(). When an error occurs after
    init_rx_sa() is called in macsec_add_rxsa(), the function releases
    rx_sa with kfree() without releasing rx_sa->stats and
    rx_sa->key.tfm, which leads to a resource leak.
    
    We should call macsec_rxsa_put() instead of kfree() to decrease the ref
    count of rx_sa and release the relevant resource if the refcount is 0.
    The same bug exists in macsec_add_txsa() for tx_sa as well. This patch
    fixes the above two bugs.
    
    Fixes: 3cf3227a21d1 ("net: macsec: hardware offloading infrastructure")
    Signed-off-by: Jianglei Nie <niejianglei2021@163.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>
    Signed-off-by: Sasha Levin <sashal@kernel.org>

net: ping6: Fix memleak in ipv6_renew_options(). [+ + +]
Author: Kuniyuki Iwashima <kuniyu@amazon.com>
Date:   Wed Jul 27 18:22:20 2022 -0700

    net: ping6: Fix memleak in ipv6_renew_options().
    
    commit e27326009a3d247b831eda38878c777f6f4eb3d1 upstream.
    
    When we close ping6 sockets, some resources are left unfreed because
    pingv6_prot is missing sk->sk_prot->destroy().  As reported by
    syzbot [0], just three syscalls leak 96 bytes and easily cause OOM.
    
        struct ipv6_sr_hdr *hdr;
        char data[24] = {0};
        int fd;
    
        hdr = (struct ipv6_sr_hdr *)data;
        hdr->hdrlen = 2;
        hdr->type = IPV6_SRCRT_TYPE_4;
    
        fd = socket(AF_INET6, SOCK_DGRAM, NEXTHDR_ICMP);
        setsockopt(fd, IPPROTO_IPV6, IPV6_RTHDR, data, 24);
        close(fd);
    
    To fix memory leaks, let's add a destroy function.
    
    Note the socket() syscall checks if the GID is within the range of
    net.ipv4.ping_group_range.  The default value is [1, 0] so that no
    GID meets the condition (1 <= GID <= 0).  Thus, the local DoS does
    not succeed until we change the default value.  However, at least
    Ubuntu/Fedora/RHEL loosen it.
    
        $ cat /usr/lib/sysctl.d/50-default.conf
        ...
        -net.ipv4.ping_group_range = 0 2147483647
    
    Also, other leak paths could be reported via the options below, some
    of which require CAP_NET_RAW.
    
      setsockopt
          IPV6_ADDRFORM (inet6_sk(sk)->pktoptions)
          IPV6_RECVPATHMTU (inet6_sk(sk)->rxpmtu)
          IPV6_HOPOPTS (inet6_sk(sk)->opt)
          IPV6_RTHDRDSTOPTS (inet6_sk(sk)->opt)
          IPV6_RTHDR (inet6_sk(sk)->opt)
          IPV6_DSTOPTS (inet6_sk(sk)->opt)
          IPV6_2292PKTOPTIONS (inet6_sk(sk)->opt)
    
      getsockopt
          IPV6_FLOWLABEL_MGR (inet6_sk(sk)->ipv6_fl_list)
    
    For the record, the splat below is different from syzbot's.
    
      unreferenced object 0xffff888006270c60 (size 96):
        comm "repro2", pid 231, jiffies 4294696626 (age 13.118s)
        hex dump (first 32 bytes):
          01 00 00 00 44 00 00 00 00 00 00 00 00 00 00 00  ....D...........
          00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
        backtrace:
          [<00000000f6bc7ea9>] sock_kmalloc (net/core/sock.c:2564 net/core/sock.c:2554)
          [<000000006d699550>] do_ipv6_setsockopt.constprop.0 (net/ipv6/ipv6_sockglue.c:715)
          [<00000000c3c3b1f5>] ipv6_setsockopt (net/ipv6/ipv6_sockglue.c:1024)
          [<000000007096a025>] __sys_setsockopt (net/socket.c:2254)
          [<000000003a8ff47b>] __x64_sys_setsockopt (net/socket.c:2265 net/socket.c:2262 net/socket.c:2262)
          [<000000007c409dcb>] do_syscall_64 (arch/x86/entry/common.c:50 arch/x86/entry/common.c:80)
          [<00000000e939c4a9>] entry_SYSCALL_64_after_hwframe (arch/x86/entry/entry_64.S:120)
    
    [0]: https://syzkaller.appspot.com/bug?extid=a8430774139ec3ab7176
    
    Fixes: 6d0bfe226116 ("net: ipv6: Add IPv6 support to the ping socket.")
    Reported-by: syzbot+a8430774139ec3ab7176@syzkaller.appspotmail.com
    Reported-by: Ayushman Dutta <ayudutta@amazon.com>
    Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
    Reviewed-by: David Ahern <dsahern@kernel.org>
    Reviewed-by: Eric Dumazet <edumazet@google.com>
    Link: https://lore.kernel.org/r/20220728012220.46918-1-kuniyu@amazon.com
    Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

net: sungem_phy: Add of_node_put() for reference returned by of_get_parent() [+ + +]
Author: Liang He <windhl@126.com>
Date:   Wed Jul 20 21:10:03 2022 +0800

    net: sungem_phy: Add of_node_put() for reference returned by of_get_parent()
    
    [ Upstream commit ebbbe23fdf6070e31509638df3321688358cc211 ]
    
    In bcm5421_init(), we should call of_node_put() for the reference
    returned by of_get_parent() which has increased the refcount.
    
    Fixes: 3c326fe9cb7a ("[PATCH] ppc64: Add new PHY to sungem")
    Signed-off-by: Liang He <windhl@126.com>
    Link: https://lore.kernel.org/r/20220720131003.1287426-1-windhl@126.com
    Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    Signed-off-by: Sasha Levin <sashal@kernel.org>

 
netfilter: nf_queue: do not allow packet truncation below transport header offset [+ + +]
Author: Florian Westphal <fw@strlen.de>
Date:   Tue Jul 26 12:42:06 2022 +0200

    netfilter: nf_queue: do not allow packet truncation below transport header offset
    
    [ Upstream commit 99a63d36cb3ed5ca3aa6fcb64cffbeaf3b0fb164 ]
    
    Domingo Dirutigliano and Nicola Guerrera report a kernel panic when
    sending an nf_queue verdict with a 1-byte nfta_payload attribute.
    
    The IP/IPv6 stack pulls the IP(v6) header from the packet after the
    input hook.
    
    If the user truncates the packet below the header size, this skb_pull()
    will result in a malformed skb (skb->len < 0).
    
    Fixes: 7af4cc3fa158 ("[NETFILTER]: Add "nfnetlink_queue" netfilter queue handler over nfnetlink")
    Reported-by: Domingo Dirutigliano <pwnzer0tt1@proton.me>
    Signed-off-by: Florian Westphal <fw@strlen.de>
    Reviewed-by: Pablo Neira Ayuso <pablo@netfilter.org>
    Signed-off-by: Sasha Levin <sashal@kernel.org>
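
Editor's sketch: the rule in the subject line is a lower bound on the length user space may truncate a packet to. The function below is a hypothetical illustration. `sketch_check_verdict_len()` and its parameters are not kernel names, and the -EINVAL return value is an assumption of this sketch.

```c
#include <errno.h>
#include <stddef.h>

/*
 * Refuse a replacement payload that would end before the transport
 * header starts, since the stack later pulls the network header.
 */
static int sketch_check_verdict_len(size_t new_len, size_t transport_offset)
{
    if (new_len < transport_offset)
        return -EINVAL;
    return 0;
}
```

With a 20-byte network header, a 1-byte payload like the one in the report is rejected, while any payload of 20 bytes or more passes this particular check.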

 
nouveau/svm: Fix to migrate all requested pages [+ + +]
Author: Alistair Popple <apopple@nvidia.com>
Date:   Wed Jul 20 16:27:45 2022 +1000

    nouveau/svm: Fix to migrate all requested pages
    
    commit 66cee9097e2b74ff3c8cc040ce5717c521a0c3fa upstream.
    
    Users may request that pages from an OpenCL SVM allocation be migrated
    to the GPU with clEnqueueSVMMigrateMem(). In Nouveau this will call into
    nouveau_dmem_migrate_vma() to do the migration. If the total range to be
    migrated exceeds SG_MAX_SINGLE_ALLOC the pages will be migrated in
    chunks of size SG_MAX_SINGLE_ALLOC. However a typo in updating the
    starting address means that only the first chunk will get migrated.
    
    Fix the calculation so that the entire range will get migrated if
    possible.
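
    The chunked walk over the range can be sketched as follows. This is
    a generic Python illustration, not the nouveau code; chunk_ranges()
    and its parameters are hypothetical names.

```python
def chunk_ranges(start, end, max_bytes):
    """Yield (chunk_start, chunk_end) pairs covering [start, end)."""
    addr = start
    while addr < end:
        # Compute the end from the current chunk start, not from the
        # start of the whole range, and clamp it to the range end.
        chunk_end = min(addr + max_bytes, end)
        yield (addr, chunk_end)
        addr = chunk_end  # the next chunk begins where this one stopped


chunks = list(chunk_ranges(0x1000, 0x6000, 0x2000))
```

    Deriving each chunk end from the running start address is what lets
    every chunk after the first cover new ground.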
    
    Signed-off-by: Alistair Popple <apopple@nvidia.com>
    Fixes: e3d8b0890469 ("drm/nouveau/svm: map pages after migration")
    Reviewed-by: Ralph Campbell <rcampbell@nvidia.com>
    Reviewed-by: Lyude Paul <lyude@redhat.com>
    Signed-off-by: Lyude Paul <lyude@redhat.com>
    Link: https://patchwork.freedesktop.org/patch/msgid/20220720062745.960701-1-apopple@nvidia.com
    Cc: <stable@vger.kernel.org> # v5.8+
    Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

 
ntfs: fix use-after-free in ntfs_ucsncmp() [+ + +]
Author: ChenXiaoSong <chenxiaosong2@huawei.com>
Date:   Thu Jul 7 18:53:29 2022 +0800

    ntfs: fix use-after-free in ntfs_ucsncmp()
    
    commit 38c9c22a85aeed28d0831f230136e9cf6fa2ed44 upstream.
    
    Syzkaller reported a use-after-free bug as follows:
    
    ==================================================================
    BUG: KASAN: use-after-free in ntfs_ucsncmp+0x123/0x130
    Read of size 2 at addr ffff8880751acee8 by task a.out/879
    
    CPU: 7 PID: 879 Comm: a.out Not tainted 5.19.0-rc4-next-20220630-00001-gcc5218c8bd2c-dirty #7
    Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.0-0-gd239552ce722-prebuilt.qemu.org 04/01/2014
    Call Trace:
     <TASK>
     dump_stack_lvl+0x1c0/0x2b0
     print_address_description.constprop.0.cold+0xd4/0x484
     print_report.cold+0x55/0x232
     kasan_report+0xbf/0xf0
     ntfs_ucsncmp+0x123/0x130
     ntfs_are_names_equal.cold+0x2b/0x41
     ntfs_attr_find+0x43b/0xb90
     ntfs_attr_lookup+0x16d/0x1e0
     ntfs_read_locked_attr_inode+0x4aa/0x2360
     ntfs_attr_iget+0x1af/0x220
     ntfs_read_locked_inode+0x246c/0x5120
     ntfs_iget+0x132/0x180
     load_system_files+0x1cc6/0x3480
     ntfs_fill_super+0xa66/0x1cf0
     mount_bdev+0x38d/0x460
     legacy_get_tree+0x10d/0x220
     vfs_get_tree+0x93/0x300
     do_new_mount+0x2da/0x6d0
     path_mount+0x496/0x19d0
     __x64_sys_mount+0x284/0x300
     do_syscall_64+0x3b/0xc0
     entry_SYSCALL_64_after_hwframe+0x46/0xb0
    RIP: 0033:0x7f3f2118d9ea
    Code: 48 8b 0d a9 f4 0b 00 f7 d8 64 89 01 48 83 c8 ff c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 49 89 ca b8 a5 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 76 f4 0b 00 f7 d8 64 89 01 48
    RSP: 002b:00007ffc269deac8 EFLAGS: 00000202 ORIG_RAX: 00000000000000a5
    RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f3f2118d9ea
    RDX: 0000000020000000 RSI: 0000000020000100 RDI: 00007ffc269dec00
    RBP: 00007ffc269dec80 R08: 00007ffc269deb00 R09: 00007ffc269dec44
    R10: 0000000000000000 R11: 0000000000000202 R12: 000055f81ab1d220
    R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
     </TASK>
    
    The buggy address belongs to the physical page:
    page:0000000085430378 refcount:1 mapcount:1 mapping:0000000000000000 index:0x555c6a81d pfn:0x751ac
    memcg:ffff888101f7e180
    anon flags: 0xfffffc00a0014(uptodate|lru|mappedtodisk|swapbacked|node=0|zone=1|lastcpupid=0x1fffff)
    raw: 000fffffc00a0014 ffffea0001bf2988 ffffea0001de2448 ffff88801712e201
    raw: 0000000555c6a81d 0000000000000000 0000000100000000 ffff888101f7e180
    page dumped because: kasan: bad access detected
    
    Memory state around the buggy address:
     ffff8880751acd80: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
     ffff8880751ace00: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
    >ffff8880751ace80: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
                                                              ^
     ffff8880751acf00: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
     ffff8880751acf80: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
    ==================================================================
    
    The reason is that struct ATTR_RECORD->name_offset is 6485, so the end
    address of the name string is out of bounds.
    
    Fix this by adding a sanity check on the end address of the attribute
    name string.
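
    The kind of bounds check described above, as a small Python sketch.
    The 1024-byte record size and the attribute offset are assumptions
    for the example; only the name_offset value of 6485 comes from the
    report.

```python
NTFSCHAR_SIZE = 2  # names are 16-bit characters ("Read of size 2" above)


def attr_name_in_bounds(attr_offset, name_offset, name_length, record_size):
    """True if the whole attribute name lies inside the record buffer."""
    name_end = attr_offset + name_offset + name_length * NTFSCHAR_SIZE
    return name_end <= record_size


RECORD_SIZE = 1024   # assumed record size
ATTR_OFFSET = 0x38   # assumed offset of the attribute in the record

corrupt_ok = attr_name_in_bounds(ATTR_OFFSET, 6485, 4, RECORD_SIZE)
sane_ok = attr_name_in_bounds(ATTR_OFFSET, 0x40, 4, RECORD_SIZE)
```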
    
    [akpm@linux-foundation.org: coding-style cleanups]
    [chenxiaosong2@huawei.com: cleanup suggested by Hawkins Jiawei]
      Link: https://lkml.kernel.org/r/20220709064511.3304299-1-chenxiaosong2@huawei.com
    Link: https://lkml.kernel.org/r/20220707105329.4020708-1-chenxiaosong2@huawei.com
    Signed-off-by: ChenXiaoSong <chenxiaosong2@huawei.com>
    Signed-off-by: Hawkins Jiawei <yin31149@gmail.com>
    Cc: Anton Altaparmakov <anton@tuxera.com>
    Cc: ChenXiaoSong <chenxiaosong2@huawei.com>
    Cc: Yongqiang Liu <liuyongqiang13@huawei.com>
    Cc: Zhang Yi <yi.zhang@huawei.com>
    Cc: Zhang Xiaoxu <zhangxiaoxu5@huawei.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

 
page_alloc: fix invalid watermark check on a negative value [+ + +]
Author: Jaewon Kim <jaewon31.kim@samsung.com>
Date:   Mon Jul 25 18:52:12 2022 +0900

    page_alloc: fix invalid watermark check on a negative value
    
    commit 9282012fc0aa248b77a69f5eb802b67c5a16bb13 upstream.
    
    There was a report that a task was waiting in throttle_direct_reclaim
    while the pgscan_direct_throttle counter in vmstat kept increasing.
    
    This is a bug where zone_watermark_fast returns true even when the free
    page count is very low. Commit f27ce0e14088 ("page_alloc: consider
    highatomic reserve in watermark fast") changed the watermark fast path
    to consider the highatomic reserve. But it did not handle the negative
    value case, which can happen when the reserved_highatomic pageblock is
    bigger than the actual free page count.
    
    If the watermark is considered ok for the negative value, allocating
    contexts for order-0 will consume all free pages without direct reclaim,
    and eventually free pages may become depleted except for the highatomic
    free pages.
    
    Then allocating contexts may fall into throttle_direct_reclaim. This
    symptom may easily happen in a system where wmark min is low and other
    reclaimers like kswapd do not free pages quickly.
    
    Handle the negative case by using MIN.
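
    A worked illustration of why clamping with a minimum matters. This
    Python sketch assumes a 64-bit unsigned long and assumes that the
    unclamped difference ends up in an unsigned comparison, which is one
    way a negative value can pass a "greater than" check in C. The
    function names and page counts are hypothetical.

```python
ULONG_MASK = (1 << 64) - 1  # assumes a 64-bit unsigned long


def as_unsigned_long(value):
    """Reinterpret a Python int the way a 64-bit unsigned long would."""
    return value & ULONG_MASK


def watermark_ok_unclamped(free, reserved, mark):
    usable = free - reserved                 # may go negative
    return as_unsigned_long(usable) > mark   # negative looks enormous


def watermark_ok_clamped(free, reserved, mark):
    usable = free - min(free, reserved)      # never drops below zero
    return usable > mark


# 100 free pages, 512 reserved for highatomic, watermark of 1000 pages.
before = watermark_ok_unclamped(100, 512, 1000)
after = watermark_ok_clamped(100, 512, 1000)
```

    With the clamp, the usable count bottoms out at zero, so a nearly
    empty zone no longer passes the check.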
    
    Link: https://lkml.kernel.org/r/20220725095212.25388-1-jaewon31.kim@samsung.com
    Fixes: f27ce0e14088 ("page_alloc: consider highatomic reserve in watermark fast")
    Signed-off-by: Jaewon Kim <jaewon31.kim@samsung.com>
    Reported-by: GyeongHwan Hong <gh21.hong@samsung.com>
    Acked-by: Mel Gorman <mgorman@techsingularity.net>
    Cc: Minchan Kim <minchan@kernel.org>
    Cc: Baoquan He <bhe@redhat.com>
    Cc: Vlastimil Babka <vbabka@suse.cz>
    Cc: Johannes Weiner <hannes@cmpxchg.org>
    Cc: Michal Hocko <mhocko@kernel.org>
    Cc: Yong-Taek Lee <ytk.lee@samsung.com>
    Cc: <stable@vger.kernel.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

 
perf symbol: Correct address for bss symbols [+ + +]
Author: Leo Yan <leo.yan@linaro.org>
Date:   Sun Jul 24 14:00:12 2022 +0800

    perf symbol: Correct address for bss symbols
    
    [ Upstream commit 2d86612aacb7805f72873691a2644d7279ed0630 ]
    
    When using 'perf mem' and 'perf c2c', the tool reports the wrong offset
    for global data symbols.  This is a common issue on both x86 and Arm64
    platforms.
    
    As an example, below is the disassembly of a test program's .bss
    section, dumped with objdump:
    
      ...
    
      Disassembly of section .bss:
    
      0000000000004040 <completed.0>:
            ...
    
      0000000000004080 <buf1>:
            ...
    
      00000000000040c0 <buf2>:
            ...
    
      0000000000004100 <thread>:
            ...
    
    First we used 'perf mem record' to run the test program and then used
    'perf --debug verbose=4 mem report' to observe the symbol info for the
    'buf1' and 'buf2' structures.
    
      # ./perf mem record -e ldlat-loads,ldlat-stores -- false_sharing.exe 8
      # ./perf --debug verbose=4 mem report
        ...
        dso__load_sym_internal: adjusting symbol: st_value: 0x40c0 sh_addr: 0x4040 sh_offset: 0x3028
        symbol__new: buf2 0x30a8-0x30e8
        ...
        dso__load_sym_internal: adjusting symbol: st_value: 0x4080 sh_addr: 0x4040 sh_offset: 0x3028
        symbol__new: buf1 0x3068-0x30a8
        ...
    
    The perf tool relies on libelf to parse symbols.  In executable and
    shared object files, 'st_value' holds a virtual address; 'sh_addr' is
    the address at which the section's first byte should reside in memory,
    and 'sh_offset' is the byte offset from the beginning of the file to
    the first byte in the section.  The perf tool uses the formula below to
    convert a symbol's memory address to a file address:
    
      file_address = st_value - sh_addr + sh_offset
                        ^
                        ` Memory address
    
    We can see the final adjusted address ranges for buf1 and buf2 are
    [0x3068-0x30a8) and [0x30a8-0x30e8) respectively.  This is apparently
    incorrect: in the code, the structures for 'buf1' and 'buf2' specify a
    compiler attribute with 64-byte alignment.
    
    The problem lies in 'sh_offset': libelf returns it as 0x3028, which is
    not 64-byte aligned.  Combined with the disassembly, it's likely that
    libelf doesn't respect the alignment of the .bss section and therefore
    doesn't return an aligned value for 'sh_offset'.
    
    As suggested by Fangrui Song, an ELF file contains a program header
    with PT_LOAD segments, and the fields p_vaddr and p_offset in PT_LOAD
    segments contain the execution info.  A better choice for converting
    a memory address to a file address is using the formula:
    
      file_address = st_value - p_vaddr + p_offset
    
    This patch introduces elf_read_program_header() which returns the
    program header based on the passed 'st_value', then it uses the formula
    above to calculate the symbol file address; the debugging log is
    updated accordingly.
    
    After applying the change:
    
      # ./perf --debug verbose=4 mem report
        ...
        dso__load_sym_internal: adjusting symbol: st_value: 0x40c0 p_vaddr: 0x3d28 p_offset: 0x2d28
        symbol__new: buf2 0x30c0-0x3100
        ...
        dso__load_sym_internal: adjusting symbol: st_value: 0x4080 p_vaddr: 0x3d28 p_offset: 0x2d28
        symbol__new: buf1 0x3080-0x30c0
        ...
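
    The two formulas can be checked against the numbers in the debug
    output above. The Python helper below is only an arithmetic sketch;
    file_address() is a hypothetical name, not a perf function.

```python
def file_address(st_value, base_vaddr, base_offset):
    """Convert a symbol's memory address to a file address."""
    return st_value - base_vaddr + base_offset


# Section based: sh_addr = 0x4040, sh_offset = 0x3028
old_buf1 = file_address(0x4080, 0x4040, 0x3028)
old_buf2 = file_address(0x40C0, 0x4040, 0x3028)

# Segment based: p_vaddr = 0x3d28, p_offset = 0x2d28
new_buf1 = file_address(0x4080, 0x3D28, 0x2D28)
new_buf2 = file_address(0x40C0, 0x3D28, 0x2D28)

for name, addr in [("old buf1", old_buf1), ("old buf2", old_buf2),
                   ("new buf1", new_buf1), ("new buf2", new_buf2)]:
    print(f"{name}: {addr:#x} 64-byte aligned: {addr % 64 == 0}")
```

    Only the segment-based results keep the 64-byte alignment that the
    structures were declared with.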
    
    Fixes: f17e04afaff84b5c ("perf report: Fix ELF symbol parsing")
    Reported-by: Chang Rui <changruinj@gmail.com>
    Suggested-by: Fangrui Song <maskray@google.com>
    Signed-off-by: Leo Yan <leo.yan@linaro.org>
    Acked-by: Namhyung Kim <namhyung@kernel.org>
    Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
    Cc: Ian Rogers <irogers@google.com>
    Cc: Ingo Molnar <mingo@redhat.com>
    Cc: Jiri Olsa <jolsa@kernel.org>
    Cc: Mark Rutland <mark.rutland@arm.com>
    Cc: Peter Zijlstra <peterz@infradead.org>
    Link: https://lore.kernel.org/r/20220724060013.171050-2-leo.yan@linaro.org
    Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
    Signed-off-by: Sasha Levin <sashal@kernel.org>

 
Revert "ocfs2: mount shared volume without ha stack" [+ + +]
Author: Junxiao Bi <ocfs2-devel@oss.oracle.com>
Date:   Fri Jun 3 15:28:01 2022 -0700

    Revert "ocfs2: mount shared volume without ha stack"
    
    commit c80af0c250c8f8a3c978aa5aafbe9c39b336b813 upstream.
    
    This reverts commit 912f655d78c5d4ad05eac287f23a435924df7144.
    
    This commit introduced a regression that can cause a mount hang.  The
    changes in __ocfs2_find_empty_slot allow any node with a non-zero node
    number to grab the slot that was already taken by node 0, so node 1
    will access the same journal as node 0.  When it tries to grab the
    journal cluster lock, it will hang because the lock was already
    acquired by node 0.  It's very easy to reproduce this: in one cluster,
    mount node 0 first, then node 1, and you will see the following call
    trace from node 1.
    
    [13148.735424] INFO: task mount.ocfs2:53045 blocked for more than 122 seconds.
    [13148.739691]       Not tainted 5.15.0-2148.0.4.el8uek.mountracev2.x86_64 #2
    [13148.742560] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
    [13148.745846] task:mount.ocfs2     state:D stack:    0 pid:53045 ppid: 53044 flags:0x00004000
    [13148.749354] Call Trace:
    [13148.750718]  <TASK>
    [13148.752019]  ? usleep_range+0x90/0x89
    [13148.753882]  __schedule+0x210/0x567
    [13148.755684]  schedule+0x44/0xa8
    [13148.757270]  schedule_timeout+0x106/0x13c
    [13148.759273]  ? __prepare_to_swait+0x53/0x78
    [13148.761218]  __wait_for_common+0xae/0x163
    [13148.763144]  __ocfs2_cluster_lock.constprop.0+0x1d6/0x870 [ocfs2]
    [13148.765780]  ? ocfs2_inode_lock_full_nested+0x18d/0x398 [ocfs2]
    [13148.768312]  ocfs2_inode_lock_full_nested+0x18d/0x398 [ocfs2]
    [13148.770968]  ocfs2_journal_init+0x91/0x340 [ocfs2]
    [13148.773202]  ocfs2_check_volume+0x39/0x461 [ocfs2]
    [13148.775401]  ? iput+0x69/0xba
    [13148.777047]  ocfs2_mount_volume.isra.0.cold+0x40/0x1f5 [ocfs2]
    [13148.779646]  ocfs2_fill_super+0x54b/0x853 [ocfs2]
    [13148.781756]  mount_bdev+0x190/0x1b7
    [13148.783443]  ? ocfs2_remount+0x440/0x440 [ocfs2]
    [13148.785634]  legacy_get_tree+0x27/0x48
    [13148.787466]  vfs_get_tree+0x25/0xd0
    [13148.789270]  do_new_mount+0x18c/0x2d9
    [13148.791046]  __x64_sys_mount+0x10e/0x142
    [13148.792911]  do_syscall_64+0x3b/0x89
    [13148.794667]  entry_SYSCALL_64_after_hwframe+0x170/0x0
    [13148.797051] RIP: 0033:0x7f2309f6e26e
    [13148.798784] RSP: 002b:00007ffdcee7d408 EFLAGS: 00000246 ORIG_RAX: 00000000000000a5
    [13148.801974] RAX: ffffffffffffffda RBX: 00007ffdcee7d4a0 RCX: 00007f2309f6e26e
    [13148.804815] RDX: 0000559aa762a8ae RSI: 0000559aa939d340 RDI: 0000559aa93a22b0
    [13148.807719] RBP: 00007ffdcee7d5b0 R08: 0000559aa93a2290 R09: 00007f230a0b4820
    [13148.810659] R10: 0000000000000000 R11: 0000000000000246 R12: 00007ffdcee7d420
    [13148.813609] R13: 0000000000000000 R14: 0000559aa939f000 R15: 0000000000000000
    [13148.816564]  </TASK>
    
    To fix it, we could just fix __ocfs2_find_empty_slot.  But the original
    commit introduced the feature to mount ocfs2 locally even if it is
    cluster based, which is very dangerous: it can easily cause serious
    data corruption, as there is no way to stop other nodes mounting the fs
    and corrupting it.  Setting up ha or another cluster-aware stack is
    just the cost that we have to take for avoiding corruption, otherwise
    we have to do it in kernel.
    
    Link: https://lkml.kernel.org/r/20220603222801.42488-1-junxiao.bi@oracle.com
    Fixes: 912f655d78c5 ("ocfs2: mount shared volume without ha stack")
    Signed-off-by: Junxiao Bi <junxiao.bi@oracle.com>
    Acked-by: Joseph Qi <joseph.qi@linux.alibaba.com>
    Cc: Mark Fasheh <mark@fasheh.com>
    Cc: Joel Becker <jlbec@evilplan.org>
    Cc: Changwei Ge <gechangwei@live.cn>
    Cc: Gang He <ghe@suse.com>
    Cc: Jun Piao <piaojun@huawei.com>
    Cc: <heming.zhao@suse.com>
    Cc: <stable@vger.kernel.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

 
Revert "tcp: change pingpong threshold to 3" [+ + +]
Author: Wei Wang <weiwan@google.com>
Date:   Thu Jul 21 20:44:04 2022 +0000

    Revert "tcp: change pingpong threshold to 3"
    
    commit 4d8f24eeedc58d5f87b650ddda73c16e8ba56559 upstream.
    
    This reverts commit 4a41f453bedfd5e9cd040bad509d9da49feb3e2c.
    
    This to-be-reverted commit was meant to apply a stricter rule for the
    stack to enter pingpong mode. However, the condition used to check for
    an interactive session, "before(tp->lsndtime, icsk->icsk_ack.lrcvtime)",
    is jiffy based and might be too coarse, which delays the stack entering
    pingpong mode.
    We revert this patch so that we no longer use the above condition to
    determine an interactive session, and also reduce the pingpong
    threshold to 1.
    
    Fixes: 4a41f453bedf ("tcp: change pingpong threshold to 3")
    Reported-by: LemmyHuang <hlm3280@163.com>
    Suggested-by: Neal Cardwell <ncardwell@google.com>
    Signed-off-by: Wei Wang <weiwan@google.com>
    Acked-by: Neal Cardwell <ncardwell@google.com>
    Reviewed-by: Eric Dumazet <edumazet@google.com>
    Link: https://lore.kernel.org/r/20220721204404.388396-1-weiwan@google.com
    Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

 
s390/archrandom: prevent CPACF trng invocations in interrupt context [+ + +]
Author: Harald Freudenberger <freude@linux.ibm.com>
Date:   Wed Jul 13 15:17:21 2022 +0200

    s390/archrandom: prevent CPACF trng invocations in interrupt context
    
    commit 918e75f77af7d2e049bb70469ec0a2c12782d96a upstream.
    
    This patch slightly reworks the s390 arch_get_random_seed_{int,long}
    implementation: Make sure the CPACF trng instruction is never
    called in any interrupt context. This is done by adding an
    additional condition in_task().
    
    Justification:
    
    There are some constraints to satisfy for the invocation of the
    arch_get_random_seed_{int,long}() functions:
    - They should provide good random data during kernel initialization.
    - They should not be called in interrupt context as the TRNG
      instruction is relatively heavy weight and may, for example,
      cause some network loads to time out and buck.
    
    However, it was not clear what kind of interrupt context is exactly
    encountered during kernel init or network traffic eventually calling
    arch_get_random_seed_long().
    
    After some days of investigations it is clear that the s390
    start_kernel function is not running in any interrupt context and
    so the trng is called:
    
    Jul 11 18:33:39 t35lp54 kernel:  [<00000001064e90ca>] arch_get_random_seed_long.part.0+0x32/0x70
    Jul 11 18:33:39 t35lp54 kernel:  [<000000010715f246>] random_init+0xf6/0x238
    Jul 11 18:33:39 t35lp54 kernel:  [<000000010712545c>] start_kernel+0x4a4/0x628
    Jul 11 18:33:39 t35lp54 kernel:  [<000000010590402a>] startup_continue+0x2a/0x40
    
    The condition in_task() is true and the CPACF trng provides random data
    during kernel startup.
    
    The network traffic, however, is more difficult. A typical call stack
    looks like this:
    
    Jul 06 17:37:07 t35lp54 kernel:  [<000000008b5600fc>] extract_entropy.constprop.0+0x23c/0x240
    Jul 06 17:37:07 t35lp54 kernel:  [<000000008b560136>] crng_reseed+0x36/0xd8
    Jul 06 17:37:07 t35lp54 kernel:  [<000000008b5604b8>] crng_make_state+0x78/0x340
    Jul 06 17:37:07 t35lp54 kernel:  [<000000008b5607e0>] _get_random_bytes+0x60/0xf8
    Jul 06 17:37:07 t35lp54 kernel:  [<000000008b56108a>] get_random_u32+0xda/0x248
    Jul 06 17:37:07 t35lp54 kernel:  [<000000008aefe7a8>] kfence_guarded_alloc+0x48/0x4b8
    Jul 06 17:37:07 t35lp54 kernel:  [<000000008aeff35e>] __kfence_alloc+0x18e/0x1b8
    Jul 06 17:37:07 t35lp54 kernel:  [<000000008aef7f10>] __kmalloc_node_track_caller+0x368/0x4d8
    Jul 06 17:37:07 t35lp54 kernel:  [<000000008b611eac>] kmalloc_reserve+0x44/0xa0
    Jul 06 17:37:07 t35lp54 kernel:  [<000000008b611f98>] __alloc_skb+0x90/0x178
    Jul 06 17:37:07 t35lp54 kernel:  [<000000008b6120dc>] __napi_alloc_skb+0x5c/0x118
    Jul 06 17:37:07 t35lp54 kernel:  [<000000008b8f06b4>] qeth_extract_skb+0x13c/0x680
    Jul 06 17:37:07 t35lp54 kernel:  [<000000008b8f6526>] qeth_poll+0x256/0x3f8
    Jul 06 17:37:07 t35lp54 kernel:  [<000000008b63d76e>] __napi_poll.constprop.0+0x46/0x2f8
    Jul 06 17:37:07 t35lp54 kernel:  [<000000008b63dbec>] net_rx_action+0x1cc/0x408
    Jul 06 17:37:07 t35lp54 kernel:  [<000000008b937302>] __do_softirq+0x132/0x6b0
    Jul 06 17:37:07 t35lp54 kernel:  [<000000008abf46ce>] __irq_exit_rcu+0x13e/0x170
    Jul 06 17:37:07 t35lp54 kernel:  [<000000008abf531a>] irq_exit_rcu+0x22/0x50
    Jul 06 17:37:07 t35lp54 kernel:  [<000000008b922506>] do_io_irq+0xe6/0x198
    Jul 06 17:37:07 t35lp54 kernel:  [<000000008b935826>] io_int_handler+0xd6/0x110
    Jul 06 17:37:07 t35lp54 kernel:  [<000000008b9358a6>] psw_idle_exit+0x0/0xa
    Jul 06 17:37:07 t35lp54 kernel: ([<000000008ab9c59a>] arch_cpu_idle+0x52/0xe0)
    Jul 06 17:37:07 t35lp54 kernel:  [<000000008b933cfe>] default_idle_call+0x6e/0xd0
    Jul 06 17:37:07 t35lp54 kernel:  [<000000008ac59f4e>] do_idle+0xf6/0x1b0
    Jul 06 17:37:07 t35lp54 kernel:  [<000000008ac5a28e>] cpu_startup_entry+0x36/0x40
    Jul 06 17:37:07 t35lp54 kernel:  [<000000008abb0d90>] smp_start_secondary+0x148/0x158
    Jul 06 17:37:07 t35lp54 kernel:  [<000000008b935b9e>] restart_int_handler+0x6e/0x90
    
    which confirms that the call is in softirq context. So in_task() covers exactly
    the cases where we want to have CPACF trng called: not in nmi, not in hard irq,
    not in soft irq but in normal task context and during kernel init.
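
    The gating described above can be modelled in a few lines. This is
    a Python sketch under stated assumptions, not the s390 code: the
    Context enum, get_random_seed() and the trng_read callback are
    hypothetical, and "no seed" is modelled as None.

```python
from enum import Enum


class Context(Enum):
    TASK = "task"        # normal task context, including kernel init
    SOFTIRQ = "softirq"
    HARDIRQ = "hardirq"
    NMI = "nmi"


def in_task(ctx):
    """True only when not in NMI, hard irq or soft irq context."""
    return ctx is Context.TASK


def get_random_seed(ctx, trng_available, trng_read):
    """Use the heavyweight TRNG only from task context."""
    if trng_available and in_task(ctx):
        return trng_read()
    return None  # the caller falls back to other entropy sources


boot_seed = get_random_seed(Context.TASK, True, lambda: 0x1234)
napi_seed = get_random_seed(Context.SOFTIRQ, True, lambda: 0x1234)
```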
    
    Signed-off-by: Harald Freudenberger <freude@linux.ibm.com>
    Acked-by: Jason A. Donenfeld <Jason@zx2c4.com>
    Reviewed-by: Juergen Christ <jchrist@linux.ibm.com>
    Link: https://lore.kernel.org/r/20220713131721.257907-1-freude@linux.ibm.com
    Fixes: e4f74400308c ("s390/archrandom: simplify back to earlier design and initialize earlier")
    [agordeev@linux.ibm.com changed desc, added Fixes and Link, removed -stable]
    Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
    Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

 
scsi: ufs: host: Hold reference returned by of_parse_phandle() [+ + +]
Author: Liang He <windhl@126.com>
Date:   Tue Jul 19 15:15:29 2022 +0800

    scsi: ufs: host: Hold reference returned by of_parse_phandle()
    
    commit a3435afba87dc6cd83f5595e7607f3c40f93ef01 upstream.
    
    In ufshcd_populate_vreg(), we should hold the reference returned by
    of_parse_phandle() and then use it to call of_node_put() for refcount
    balance.
    
    Link: https://lore.kernel.org/r/20220719071529.1081166-1-windhl@126.com
    Fixes: aa4976130934 ("ufs: Add regulator enable support")
    Reviewed-by: Bart Van Assche <bvanassche@acm.org>
    Signed-off-by: Liang He <windhl@126.com>
    Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
    Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

 
sctp: fix sleep in atomic context bug in timer handlers [+ + +]
Author: Duoming Zhou <duoming@zju.edu.cn>
Date:   Sat Jul 23 09:58:09 2022 +0800

    sctp: fix sleep in atomic context bug in timer handlers
    
    [ Upstream commit b89fc26f741d9f9efb51cba3e9b241cf1380ec5a ]
    
    There are sleep-in-atomic-context bugs in the timer handlers of sctp,
    such as sctp_generate_t3_rtx_event(), sctp_generate_probe_event(),
    sctp_generate_t1_init_event(), sctp_generate_timeout_event() and so on.
    
    The root cause is that sctp_sched_prio_init_sid(), which takes a
    GFP_KERNEL parameter and may therefore sleep, can be called by
    different timer handlers that run in interrupt context.
    
    One of the call paths that could trigger the bug is shown below:
    
          (interrupt context)
    sctp_generate_probe_event
      sctp_do_sm
        sctp_side_effects
          sctp_cmd_interpreter
            sctp_outq_teardown
              sctp_outq_init
                sctp_sched_set_sched
                  n->init_sid(..,GFP_KERNEL)
                    sctp_sched_prio_init_sid //may sleep
    
    This patch changes the gfp_t parameter of init_sid in
    sctp_sched_set_sched() from GFP_KERNEL to GFP_ATOMIC in order to
    prevent sleep in atomic context bugs.
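
    A toy model of the rule being enforced. The Python below is an
    illustration only: alloc() and init_sid() are hypothetical helpers,
    and the in_atomic flag stands in for "called from a timer handler".

```python
GFP_KERNEL = "GFP_KERNEL"  # allocation is allowed to sleep
GFP_ATOMIC = "GFP_ATOMIC"  # allocation must not sleep


class SleepInAtomic(RuntimeError):
    """Raised when a sleeping allocation is attempted in atomic context."""


def alloc(size, gfp, in_atomic):
    """Model allocator that refuses to sleep in atomic context."""
    if gfp == GFP_KERNEL and in_atomic:
        raise SleepInAtomic("sleeping allocation in atomic context")
    return bytearray(size)


def init_sid(gfp, in_atomic):
    """Stand-in for a per-stream init that allocates scheduler state."""
    return alloc(64, gfp, in_atomic)


# A timer handler runs in atomic context, so only GFP_ATOMIC is safe.
state = init_sid(GFP_ATOMIC, in_atomic=True)
```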
    
    Fixes: 5bbbbe32a431 ("sctp: introduce stream scheduler foundations")
    Signed-off-by: Duoming Zhou <duoming@zju.edu.cn>
    Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
    Link: https://lore.kernel.org/r/20220723015809.11553-1-duoming@zju.edu.cn
    Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    Signed-off-by: Sasha Levin <sashal@kernel.org>

sctp: leave the err path free in sctp_stream_init to sctp_stream_free [+ + +]
Author: Xin Long <lucien.xin@gmail.com>
Date:   Mon Jul 25 18:11:06 2022 -0400

    sctp: leave the err path free in sctp_stream_init to sctp_stream_free
    
    [ Upstream commit 181d8d2066c000ba0a0e6940a7ad80f1a0e68e9d ]
    
    A NULL pointer dereference was reported by Wei Chen:
    
      BUG: kernel NULL pointer dereference, address: 0000000000000000
      RIP: 0010:__list_del_entry_valid+0x26/0x80
      Call Trace:
       <TASK>
       sctp_sched_dequeue_common+0x1c/0x90
       sctp_sched_prio_dequeue+0x67/0x80
       __sctp_outq_teardown+0x299/0x380
       sctp_outq_free+0x15/0x20
       sctp_association_free+0xc3/0x440
       sctp_do_sm+0x1ca7/0x2210
       sctp_assoc_bh_rcv+0x1f6/0x340
    
    This happens when calling sctp_sendmsg without connecting to the server
    first. In this case, a data chunk is already queued in the client-side
    send queue when the INIT_ACK from the server is processed in
    sctp_process_init(), which calls sctp_stream_init() to allocate
    stream_in. If the stream_in allocation fails, all stream_out will be
    freed in sctp_stream_init()'s err path. Then, when the asoc is freed,
    it will crash while dequeuing this data chunk because stream_out is
    missing.
    
    As we can't free stream_out before dequeuing all data from the send
    queue, this patch fixes it by moving the err path stream_out/in freeing
    in sctp_stream_init() to sctp_stream_free(), which is eventually called
    when freeing the asoc in sctp_association_free(). This fix also makes
    the code in sctp_process_init() clearer.
    
    Note that in sctp_association_init() when it fails in sctp_stream_init(),
    sctp_association_free() will not be called, and in that case it should
    go to 'stream_free' err path to free stream instead of 'fail_init'.
    
    Fixes: 5bbbbe32a431 ("sctp: introduce stream scheduler foundations")
    Reported-by: Wei Chen <harperchen1110@gmail.com>
    Signed-off-by: Xin Long <lucien.xin@gmail.com>
    Link: https://lore.kernel.org/r/831a3dc100c4908ff76e5bcc363be97f2778bc0b.1658787066.git.lucien.xin@gmail.com
    Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    Signed-off-by: Sasha Levin <sashal@kernel.org>

 
selftests: bpf: Don't run sk_lookup in verifier tests [+ + +]
Author: Lorenz Bauer <lmb@cloudflare.com>
Date:   Mon Aug 1 15:29:16 2022 +0800

    selftests: bpf: Don't run sk_lookup in verifier tests
    
    commit b4f894633fa14d7d46ba7676f950b90a401504bb upstream.
    
    sk_lookup doesn't allow setting data_in for bpf_prog_run. This doesn't
    play well with the verifier tests, since they always set a 64 byte
    input buffer. Allow not running verifier tests by setting
    bpf_test.runs to a negative value and don't run the ctx access case
    for sk_lookup. We have dedicated ctx access tests so skipping here
    doesn't reduce coverage.
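
    The run-count convention can be sketched as below. This Python
    helper is hypothetical, and treating a value of zero as "one default
    run" is an assumption made for the illustration, not something the
    commit message states.

```python
def planned_runs(runs):
    """Return how many times a verifier test program gets executed."""
    if runs < 0:
        return 0      # load and verify only, never execute
    if runs == 0:
        return 1      # assumed default: a single run
    return runs


sk_lookup_runs = planned_runs(-1)
default_runs = planned_runs(0)
```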
    
    Signed-off-by: Lorenz Bauer <lmb@cloudflare.com>
    Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    Link: https://lore.kernel.org/bpf/20210303101816.36774-6-lmb@cloudflare.com
    Signed-off-by: Tianchen Ding <dtcccc@linux.alibaba.com>
    Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

 
sfc: disable softirqs for ptp TX [+ + +]
Author: Alejandro Lucero <alejandro.lucero-palau@amd.com>
Date:   Tue Jul 26 08:45:04 2022 +0200

    sfc: disable softirqs for ptp TX
    
    [ Upstream commit 67c3b611d92fc238c43734878bc3e232ab570c79 ]
    
    Sending a PTP packet can use the normal TX driver datapath while being
    invoked from the driver's ptp worker. The kernel generic TX code
    disables softirqs and preemption before calling the driver-specific TX
    code, but the ptp worker does not. Although current ptp driver
    functionality does not require it, there are several reasons for doing
    so:
    
       1) The invoked code is always executed with softirqs disabled for non
          PTP packets.
       2) Better if a ptp packet transmission is not interrupted by softirq
          handling which could lead to high latencies.
       3) netdev_xmit_more used by the TX code requires preemption to be
          disabled.
    
    Indeed a solution for dealing with kernel preemption state based on static
    kernel configuration is not possible since the introduction of dynamic
    preemption level configuration at boot time using the static calls
    functionality.
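
    The idea of wrapping the worker's transmit call can be modelled in
    Python. Everything below is a hypothetical stand-in: the Cpu class,
    bh_disabled(), driver_xmit() and ptp_worker_xmit() only illustrate
    the bracketing and are not driver or kernel interfaces.

```python
from contextlib import contextmanager


class Cpu:
    """Tracks how deeply softirq processing is currently disabled."""

    def __init__(self):
        self.bh_disable_depth = 0


@contextmanager
def bh_disabled(cpu):
    """Disable softirqs for the duration of the block, then restore."""
    cpu.bh_disable_depth += 1
    try:
        yield
    finally:
        cpu.bh_disable_depth -= 1


def driver_xmit(cpu, packet):
    """TX path that expects its caller to have disabled softirqs."""
    if cpu.bh_disable_depth == 0:
        raise RuntimeError("TX path entered with softirqs enabled")
    return f"sent {packet}"


def ptp_worker_xmit(cpu, packet):
    """Worker-side send that gives the TX path the state it expects."""
    with bh_disabled(cpu):
        return driver_xmit(cpu, packet)


cpu = Cpu()
result = ptp_worker_xmit(cpu, "ptp-sync")
```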
    
    Fixes: f79c957a0b537 ("drivers: net: sfc: use netdev_xmit_more helper")
    Signed-off-by: Alejandro Lucero <alejandro.lucero-palau@amd.com>
    Link: https://lore.kernel.org/r/20220726064504.49613-1-alejandro.lucero-palau@amd.com
    Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    Signed-off-by: Sasha Levin <sashal@kernel.org>

 
tcp: Fix a data-race around sysctl_tcp_adv_win_scale. [+ + +]
Author: Kuniyuki Iwashima <kuniyu@amazon.com>
Date:   Wed Jul 20 09:50:14 2022 -0700

    tcp: Fix a data-race around sysctl_tcp_adv_win_scale.
    
    commit 36eeee75ef0157e42fb6593dcc65daab289b559e upstream.
    
    While reading sysctl_tcp_adv_win_scale, it can be changed concurrently.
    Thus, we need to add READ_ONCE() to its reader.
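
    Python has no READ_ONCE(), so the sketch below only models the
    "sample the value once, then use the local copy" idea behind it.
    ChangingSysctl and both reader functions are hypothetical; the
    shift-based arithmetic is a simplified stand-in for a reader that
    uses the sysctl value more than once.

```python
class ChangingSysctl:
    """A value that a concurrent writer changes between reads."""

    def __init__(self, values):
        self._values = list(values)
        self.reads = 0

    def read(self):
        value = self._values[min(self.reads, len(self._values) - 1)]
        self.reads += 1
        return value


def win_from_space_two_reads(sysctl, space):
    """Reads the sysctl again after testing it, so the test can go stale."""
    if sysctl.read() <= 0:
        return space >> (-sysctl.read())
    return space - (space >> sysctl.read())


def win_from_space_one_read(sysctl, space):
    """Samples the sysctl once and only uses the local copy afterwards."""
    scale = sysctl.read()
    if scale <= 0:
        return space >> (-scale)
    return space - (space >> scale)


# The writer flips the value from 1 to -2 between two reads.
window = win_from_space_one_read(ChangingSysctl([1, -2]), 1024)
```

    In the two-read variant the sign test and the shift can see
    different values, which here surfaces as a negative shift count.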
    
    Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2")
    Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>
    Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

tcp: Fix a data-race around sysctl_tcp_app_win. [+ + +]
Author: Kuniyuki Iwashima <kuniyu@amazon.com>
Date:   Wed Jul 20 09:50:13 2022 -0700

    tcp: Fix a data-race around sysctl_tcp_app_win.
    
    commit 02ca527ac5581cf56749db9fd03d854e842253dd upstream.
    
    While reading sysctl_tcp_app_win, it can be changed concurrently.
    Thus, we need to add READ_ONCE() to its reader.
    
    Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2")
    Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>
    Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

tcp: Fix a data-race around sysctl_tcp_autocorking. [+ + +]
Author: Kuniyuki Iwashima <kuniyu@amazon.com>
Date:   Wed Jul 20 09:50:25 2022 -0700

    tcp: Fix a data-race around sysctl_tcp_autocorking.
    
    [ Upstream commit 85225e6f0a76e6745bc841c9f25169c509b573d8 ]
    
    While reading sysctl_tcp_autocorking, it can be changed concurrently.
    Thus, we need to add READ_ONCE() to its reader.
    
    Fixes: f54b311142a9 ("tcp: auto corking")
    Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>
    Signed-off-by: Sasha Levin <sashal@kernel.org>

tcp: Fix a data-race around sysctl_tcp_challenge_ack_limit. [+ + +]
Author: Kuniyuki Iwashima <kuniyu@amazon.com>
Date:   Wed Jul 20 09:50:21 2022 -0700

    tcp: Fix a data-race around sysctl_tcp_challenge_ack_limit.
    
    commit db3815a2fa691da145cfbe834584f31ad75df9ff upstream.
    
    While reading sysctl_tcp_challenge_ack_limit, it can be changed
    concurrently.  Thus, we need to add READ_ONCE() to its reader.
    
    Fixes: 282f23c6ee34 ("tcp: implement RFC 5961 3.2")
    Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>
    Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

tcp: Fix a data-race around sysctl_tcp_comp_sack_delay_ns. [+ + +]
Author: Kuniyuki Iwashima <kuniyu@amazon.com>
Date:   Fri Jul 22 11:22:01 2022 -0700

    tcp: Fix a data-race around sysctl_tcp_comp_sack_delay_ns.
    
    [ Upstream commit 4866b2b0f7672b6d760c4b8ece6fb56f965dcc8a ]
    
    While reading sysctl_tcp_comp_sack_delay_ns, it can be changed
    concurrently.  Thus, we need to add READ_ONCE() to its reader.
    
    Fixes: 6d82aa242092 ("tcp: add tcp_comp_sack_delay_ns sysctl")
    Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>
    Signed-off-by: Sasha Levin <sashal@kernel.org>

tcp: Fix a data-race around sysctl_tcp_comp_sack_nr. [+ + +]
Author: Kuniyuki Iwashima <kuniyu@amazon.com>
Date:   Fri Jul 22 11:22:03 2022 -0700

    tcp: Fix a data-race around sysctl_tcp_comp_sack_nr.
    
    [ Upstream commit 79f55473bfc8ac51bd6572929a679eeb4da22251 ]
    
    While reading sysctl_tcp_comp_sack_nr, it can be changed concurrently.
    Thus, we need to add READ_ONCE() to its reader.
    
    Fixes: 9c21d2fc41c0 ("tcp: add tcp_comp_sack_nr sysctl")
    Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>
    Signed-off-by: Sasha Levin <sashal@kernel.org>

tcp: Fix a data-race around sysctl_tcp_comp_sack_slack_ns. [+ + +]
Author: Kuniyuki Iwashima <kuniyu@amazon.com>
Date:   Fri Jul 22 11:22:02 2022 -0700

    tcp: Fix a data-race around sysctl_tcp_comp_sack_slack_ns.
    
    [ Upstream commit 22396941a7f343d704738360f9ef0e6576489d43 ]
    
    While reading sysctl_tcp_comp_sack_slack_ns, it can be changed
    concurrently.  Thus, we need to add READ_ONCE() to its reader.
    
    Fixes: a70437cc09a1 ("tcp: add hrtimer slack to sack compression")
    Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>
    Signed-off-by: Sasha Levin <sashal@kernel.org>

tcp: Fix a data-race around sysctl_tcp_frto. [+ + +]
Author: Kuniyuki Iwashima <kuniyu@amazon.com>
Date:   Wed Jul 20 09:50:15 2022 -0700

    tcp: Fix a data-race around sysctl_tcp_frto.
    
    commit 706c6202a3589f290e1ef9be0584a8f4a3cc0507 upstream.
    
    While reading sysctl_tcp_frto, it can be changed concurrently.
    Thus, we need to add READ_ONCE() to its reader.
    
    Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2")
    Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>
    Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

tcp: Fix a data-race around sysctl_tcp_invalid_ratelimit. [+ + +]
Author: Kuniyuki Iwashima <kuniyu@amazon.com>
Date:   Wed Jul 20 09:50:26 2022 -0700

    tcp: Fix a data-race around sysctl_tcp_invalid_ratelimit.
    
    [ Upstream commit 2afdbe7b8de84c28e219073a6661080e1b3ded48 ]
    
    While reading sysctl_tcp_invalid_ratelimit, it can be changed
    concurrently.  Thus, we need to add READ_ONCE() to its reader.
    
    Fixes: 032ee4236954 ("tcp: helpers to mitigate ACK loops by rate-limiting out-of-window dupacks")
    Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>
    Signed-off-by: Sasha Levin <sashal@kernel.org>

tcp: Fix a data-race around sysctl_tcp_limit_output_bytes. [+ + +]
Author: Kuniyuki Iwashima <kuniyu@amazon.com>
Date:   Wed Jul 20 09:50:20 2022 -0700

    tcp: Fix a data-race around sysctl_tcp_limit_output_bytes.
    
    commit 9fb90193fbd66b4c5409ef729fd081861f8b6351 upstream.
    
    While reading sysctl_tcp_limit_output_bytes, it can be changed
    concurrently.  Thus, we need to add READ_ONCE() to its reader.
    
    Fixes: 46d3ceabd8d9 ("tcp: TCP Small Queues")
    Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>
    Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

tcp: Fix a data-race around sysctl_tcp_min_rtt_wlen. [+ + +]
Author: Kuniyuki Iwashima <kuniyu@amazon.com>
Date:   Wed Jul 20 09:50:24 2022 -0700

    tcp: Fix a data-race around sysctl_tcp_min_rtt_wlen.
    
    [ Upstream commit 1330ffacd05fc9ac4159d19286ce119e22450ed2 ]
    
    While reading sysctl_tcp_min_rtt_wlen, it can be changed concurrently.
    Thus, we need to add READ_ONCE() to its reader.
    
    Fixes: f672258391b4 ("tcp: track min RTT using windowed min-filter")
    Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>
    Signed-off-by: Sasha Levin <sashal@kernel.org>

tcp: Fix a data-race around sysctl_tcp_min_tso_segs. [+ + +]
Author: Kuniyuki Iwashima <kuniyu@amazon.com>
Date:   Wed Jul 20 09:50:22 2022 -0700

    tcp: Fix a data-race around sysctl_tcp_min_tso_segs.
    
    [ Upstream commit e0bb4ab9dfddd872622239f49fb2bd403b70853b ]
    
    While reading sysctl_tcp_min_tso_segs, it can be changed concurrently.
    Thus, we need to add READ_ONCE() to its reader.
    
    Fixes: 95bd09eb2750 ("tcp: TSO packets automatic sizing")
    Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>
    Signed-off-by: Sasha Levin <sashal@kernel.org>

tcp: Fix a data-race around sysctl_tcp_nometrics_save. [+ + +]
Author: Kuniyuki Iwashima <kuniyu@amazon.com>
Date:   Wed Jul 20 09:50:16 2022 -0700

    tcp: Fix a data-race around sysctl_tcp_nometrics_save.
    
    commit 8499a2454d9e8a55ce616ede9f9580f36fd5b0f3 upstream.
    
    While reading sysctl_tcp_nometrics_save, it can be changed concurrently.
    Thus, we need to add READ_ONCE() to its reader.
    
    Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2")
    Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>
    Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

tcp: Fix data-races around sysctl_tcp_dsack. [+ + +]
Author: Kuniyuki Iwashima <kuniyu@amazon.com>
Date:   Wed Jul 20 09:50:12 2022 -0700

    tcp: Fix data-races around sysctl_tcp_dsack.
    
    commit 58ebb1c8b35a8ef38cd6927431e0fa7b173a632d upstream.
    
    While reading sysctl_tcp_dsack, it can be changed concurrently.
    Thus, we need to add READ_ONCE() to its readers.
    
    Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2")
    Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>
    Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

tcp: Fix data-races around sysctl_tcp_moderate_rcvbuf. [+ + +]
Author: Kuniyuki Iwashima <kuniyu@amazon.com>
Date:   Wed Jul 20 09:50:18 2022 -0700

    tcp: Fix data-races around sysctl_tcp_moderate_rcvbuf.
    
    commit 780476488844e070580bfc9e3bc7832ec1cea883 upstream.
    
    While reading sysctl_tcp_moderate_rcvbuf, it can be changed
    concurrently.  Thus, we need to add READ_ONCE() to its readers.
    
    Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2")
    Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>
    Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

tcp: Fix data-races around sysctl_tcp_no_ssthresh_metrics_save. [+ + +]
Author: Kuniyuki Iwashima <kuniyu@amazon.com>
Date:   Wed Jul 20 09:50:17 2022 -0700

    tcp: Fix data-races around sysctl_tcp_no_ssthresh_metrics_save.
    
    commit ab1ba21b523ab496b1a4a8e396333b24b0a18f9a upstream.
    
    While reading sysctl_tcp_no_ssthresh_metrics_save, it can be changed
    concurrently.  Thus, we need to add READ_ONCE() to its readers.
    
    Fixes: 65e6d90168f3 ("net-tcp: Disable TCP ssthresh metrics cache by default")
    Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>
    Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

tcp: Fix data-races around sysctl_tcp_reflect_tos. [+ + +]
Author: Kuniyuki Iwashima <kuniyu@amazon.com>
Date:   Fri Jul 22 11:22:04 2022 -0700

    tcp: Fix data-races around sysctl_tcp_reflect_tos.
    
    [ Upstream commit 870e3a634b6a6cb1543b359007aca73fe6a03ac5 ]
    
    While reading sysctl_tcp_reflect_tos, it can be changed concurrently.
    Thus, we need to add READ_ONCE() to its readers.
    
    Fixes: ac8f1710c12b ("tcp: reflect tos value received in SYN to the socket")
    Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
    Acked-by: Wei Wang <weiwan@google.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>
    Signed-off-by: Sasha Levin <sashal@kernel.org>

 
virtio-net: fix the race between refill work and close [+ + +]
Author: Jason Wang <jasowang@redhat.com>
Date:   Mon Jul 25 15:21:59 2022 +0800

    virtio-net: fix the race between refill work and close
    
    [ Upstream commit 5a159128faff151b7fe5f4eb0f310b1e0a2d56bf ]
    
    We try using cancel_delayed_work_sync() to prevent the work from
    enabling NAPI. This is insufficient since we don't disable the source
    of the refill work scheduling. This means a NAPI poll callback that
    runs after cancel_delayed_work_sync() can schedule the refill work,
    which can then re-enable NAPI and lead to a use-after-free [1].
    
    Since the work can enable NAPI, we can't simply disable NAPI before
    calling cancel_delayed_work_sync(). So fix this by introducing a
    dedicated boolean to control whether or not the work could be
    scheduled from NAPI.
    
    [1]
    ==================================================================
    BUG: KASAN: use-after-free in refill_work+0x43/0xd4
    Read of size 2 at addr ffff88810562c92e by task kworker/2:1/42
    
    CPU: 2 PID: 42 Comm: kworker/2:1 Not tainted 5.19.0-rc1+ #480
    Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.16.0-0-gd239552ce722-prebuilt.qemu.org 04/01/2014
    Workqueue: events refill_work
    Call Trace:
     <TASK>
     dump_stack_lvl+0x34/0x44
     print_report.cold+0xbb/0x6ac
     ? _printk+0xad/0xde
     ? refill_work+0x43/0xd4
     kasan_report+0xa8/0x130
     ? refill_work+0x43/0xd4
     refill_work+0x43/0xd4
     process_one_work+0x43d/0x780
     worker_thread+0x2a0/0x6f0
     ? process_one_work+0x780/0x780
     kthread+0x167/0x1a0
     ? kthread_exit+0x50/0x50
     ret_from_fork+0x22/0x30
     </TASK>
    ...
    
    Fixes: b2baed69e605c ("virtio_net: set/cancel work on ndo_open/ndo_stop")
    Signed-off-by: Jason Wang <jasowang@redhat.com>
    Acked-by: Michael S. Tsirkin <mst@redhat.com>
    Reviewed-by: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>
    Signed-off-by: Sasha Levin <sashal@kernel.org>
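
The "dedicated boolean" approach described above can be modelled roughly as
below. This is a single-threaded sketch under stated assumptions: struct
refill_gate and its helpers are hypothetical names, not the driver's API, and
the locking that concurrent code would need around the flag is omitted.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of the state that gates refill scheduling. */
struct refill_gate {
	bool enabled;    /* may the poll path schedule refill work? */
	int  scheduled;  /* how many times refill work was queued */
};

static inline void refill_gate_enable(struct refill_gate *g)
{
	g->enabled = true;
}

/* Close the gate first; only then is cancelling the work sufficient,
 * because nothing can queue it again afterwards. */
static inline void refill_gate_disable(struct refill_gate *g)
{
	g->enabled = false;
}

/* Called from the (simulated) poll path; returns true if work was queued. */
static inline bool refill_gate_try_schedule(struct refill_gate *g)
{
	assert(g != 0);
	if (!g->enabled)
		return false;
	g->scheduled++;
	return true;
}
```

In this model the close path would call refill_gate_disable() before
cancelling the work, so a late poll callback finds the gate shut and cannot
reschedule it.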

 
watch_queue: Fix missing locking in add_watch_to_object() [+ + +]
Author: Linus Torvalds <torvalds@linux-foundation.org>
Date:   Thu Jul 28 10:31:12 2022 +0100

    watch_queue: Fix missing locking in add_watch_to_object()
    
    commit e64ab2dbd882933b65cd82ff6235d705ad65dbb6 upstream.
    
    If a watch is being added to a queue, it needs to guard against
    interference from addition of a new watch, manual removal of a watch and
    removal of a watch due to some other queue being destroyed.
    
    KEYCTL_WATCH_KEY guards against this for the same {key,queue} pair by
    holding the key->sem writelocked and by holding refs on both the key and
    the queue - but that doesn't prevent interaction from other {key,queue}
    pairs.
    
    While add_watch_to_object() does take the spinlock on the event queue,
    it doesn't take the lock on the source's watch list.  The assumption was
    that the caller would prevent that (say by taking key->sem) - but that
    doesn't prevent interference from the destruction of another queue.
    
    Fix this by locking the watcher list in add_watch_to_object().
    
    Fixes: c73be61cede5 ("pipe: Add general notification queue support")
    Reported-by: syzbot+03d7b43290037d1f87ca@syzkaller.appspotmail.com
    Signed-off-by: David Howells <dhowells@redhat.com>
    cc: keyrings@vger.kernel.org
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

watch_queue: Fix missing rcu annotation [+ + +]
Author: David Howells <dhowells@redhat.com>
Date:   Thu Jul 28 10:31:06 2022 +0100

    watch_queue: Fix missing rcu annotation
    
    commit e0339f036ef4beb9b20f0b6532a1e0ece7f594c6 upstream.
    
    Since __post_watch_notification() walks wlist->watchers with only the
    RCU read lock held, we need to use RCU methods to add to the list (we
    already use RCU methods to remove from the list).
    
    Fix add_watch_to_object() to use hlist_add_head_rcu() instead of
    hlist_add_head() for that list.
    
    Fixes: c73be61cede5 ("pipe: Add general notification queue support")
    Signed-off-by: David Howells <dhowells@redhat.com>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
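
Why the list insertion needs an RCU-aware primitive can be illustrated with
C11 atomics. This is a userspace sketch, not the kernel's hlist code: the
types and function names are hypothetical, writers are assumed to be
serialized by an external lock, and acquire loads stand in for the kernel's
rcu_dereference().

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

struct node {
	int value;
	struct node *_Atomic next;
};

struct list {
	struct node *_Atomic head;
};

static inline void list_init(struct list *l)
{
	atomic_init(&l->head, NULL);
}

/* Fully initialise the node, then publish it with a release store so a
 * lockless reader can never observe a half-initialised node. */
static inline void list_add_head_publish(struct list *l, struct node *n, int value)
{
	assert(n != NULL);
	n->value = value;
	atomic_init(&n->next,
		    atomic_load_explicit(&l->head, memory_order_relaxed));
	atomic_store_explicit(&l->head, n, memory_order_release);
}

/* Lockless reader: the acquire loads pair with the release store above. */
static inline int list_sum(struct list *l)
{
	int sum = 0;

	for (struct node *n = atomic_load_explicit(&l->head, memory_order_acquire);
	     n != NULL;
	     n = atomic_load_explicit(&n->next, memory_order_acquire))
		sum += n->value;
	return sum;
}
```

A plain (non-release) store to the head would allow the compiler or CPU to
make the new head visible before the node's fields, which is the class of
problem the switch to hlist_add_head_rcu() addresses.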

 
x86/bugs: Do not enable IBPB at firmware entry when IBPB is not available [+ + +]
Author: Thadeu Lima de Souza Cascardo <cascardo@canonical.com>
Date:   Thu Jul 28 09:26:02 2022 -0300

    x86/bugs: Do not enable IBPB at firmware entry when IBPB is not available
    
    commit 571c30b1a88465a1c85a6f7762609939b9085a15 upstream.
    
    Some cloud hypervisors do not provide IBPB on very recent processors,
    including AMD processors affected by Retbleed.
    
    Using IBPB before firmware calls on such systems would cause a GPF at boot
    like the one below. Do not enable such calls when IBPB support is not
    present.
    
      EFI Variables Facility v0.08 2004-May-17
      general protection fault, maybe for address 0x1: 0000 [#1] PREEMPT SMP NOPTI
      CPU: 0 PID: 24 Comm: kworker/u2:1 Not tainted 5.19.0-rc8+ #7
      Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 0.0.0 02/06/2015
      Workqueue: efi_rts_wq efi_call_rts
      RIP: 0010:efi_call_rts
      Code: e8 37 33 58 ff 41 bf 48 00 00 00 49 89 c0 44 89 f9 48 83 c8 01 4c 89 c2 48 c1 ea 20 66 90 b9 49 00 00 00 b8 01 00 00 00 31 d2 <0f> 30 e8 7b 9f 5d ff e8 f6 f8 ff ff 4c 89 f1 4c 89 ea 4c 89 e6 48
      RSP: 0018:ffffb373800d7e38 EFLAGS: 00010246
      RAX: 0000000000000001 RBX: 0000000000000006 RCX: 0000000000000049
      RDX: 0000000000000000 RSI: ffff94fbc19d8fe0 RDI: ffff94fbc1b2b300
      RBP: ffffb373800d7e70 R08: 0000000000000000 R09: 0000000000000000
      R10: 000000000000000b R11: 000000000000000b R12: ffffb3738001fd78
      R13: ffff94fbc2fcfc00 R14: ffffb3738001fd80 R15: 0000000000000048
      FS:  0000000000000000(0000) GS:ffff94fc3da00000(0000) knlGS:0000000000000000
      CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      CR2: ffff94fc30201000 CR3: 000000006f610000 CR4: 00000000000406f0
      Call Trace:
       <TASK>
       ? __wake_up
       process_one_work
       worker_thread
       ? rescuer_thread
       kthread
       ? kthread_complete_and_exit
       ret_from_fork
       </TASK>
      Modules linked in:
    
    Fixes: 28a99e95f55c ("x86/amd: Use IBPB for firmware calls")
    Reported-by: Dimitri John Ledkov <dimitri.ledkov@canonical.com>
    Signed-off-by: Thadeu Lima de Souza Cascardo <cascardo@canonical.com>
    Signed-off-by: Borislav Petkov <bp@suse.de>
    Cc: <stable@vger.kernel.org>
    Link: https://lore.kernel.org/r/20220728122602.2500509-1-cascardo@canonical.com
    Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
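
The shape of the fix, enabling a mitigation only when the capability it
depends on is actually present, can be sketched as below. The feature bits
and the function name are hypothetical and do not correspond to the kernel's
X86_FEATURE_* definitions.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical feature bits, for illustration only. */
enum {
	FEAT_IBPB        = 1 << 0,  /* the barrier is available */
	FEAT_USE_IBPB_FW = 1 << 1   /* issue the barrier before firmware calls */
};

/* Request the firmware-entry barrier only if the barrier itself exists;
 * otherwise leave the feature word untouched. */
static inline unsigned int select_fw_mitigation(unsigned int features,
						bool wants_ibpb_fw)
{
	assert((features & FEAT_USE_IBPB_FW) == 0);
	if (wants_ibpb_fw && (features & FEAT_IBPB))
		features |= FEAT_USE_IBPB_FW;
	return features;
}
```

On a guest where the hypervisor hides the capability, the request is simply
ignored instead of leading to a fault when the barrier is later issued.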

 
xfs: Enforce attr3 buffer recovery order [+ + +]
Author: Dave Chinner <dchinner@redhat.com>
Date:   Fri Jul 29 18:16:09 2022 +0200

    xfs: Enforce attr3 buffer recovery order
    
    commit d8f4c2d0398fa1d92cacf854daf80d21a46bfefc upstream.
    
    From the department of "WTAF? How did we miss that!?"...
    
    When we are recovering a buffer, the first thing we do is check the
    buffer magic number and extract the LSN from the buffer. If the LSN
    is older than the current LSN, we replay the modification to it. If
    the metadata on disk is newer than the transaction in the log, we
    skip it. This is a fundamental v5 filesystem metadata recovery
    behaviour.
    
    generic/482 failed with an attribute writeback failure during log
    recovery. The write verifier caught the corruption before it got
    written to disk, and the attr buffer dump looked like:
    
    XFS (dm-3): Metadata corruption detected at xfs_attr3_leaf_verify+0x275/0x2e0, xfs_attr3_leaf block 0x19be8
    XFS (dm-3): Unmount and run xfs_repair
    XFS (dm-3): First 128 bytes of corrupted metadata buffer:
    00000000: 00 00 00 00 00 00 00 00 3b ee 00 00 4d 2a 01 e1  ........;...M*..
    00000010: 00 00 00 00 00 01 9b e8 00 00 00 01 00 00 05 38  ...............8
                                      ^^^^^^^^^^^^^^^^^^^^^^^
    00000020: df 39 5e 51 58 ac 44 b6 8d c5 e7 10 44 09 bc 17  .9^QX.D.....D...
    00000030: 00 00 00 00 00 02 00 83 00 03 00 cc 0f 24 01 00  .............$..
    00000040: 00 68 0e bc 0f c8 00 10 00 00 00 00 00 00 00 00  .h..............
    00000050: 00 00 3c 31 0f 24 01 00 00 00 3c 32 0f 88 01 00  ..<1.$....<2....
    00000060: 00 00 3c 33 0f d8 01 00 00 00 00 00 00 00 00 00  ..<3............
    00000070: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
    .....
    
    The highlighted bytes are the LSN that was replayed into the
    buffer: 0x100000538. This is cycle 1, block 0x538. Prior to replay,
    that block on disk looks like this:
    
    $ sudo xfs_db -c "fsb 0x417d" -c "type attr3" -c p /dev/mapper/thin-vol
    hdr.info.hdr.forw = 0
    hdr.info.hdr.back = 0
    hdr.info.hdr.magic = 0x3bee
    hdr.info.crc = 0xb5af0bc6 (correct)
    hdr.info.bno = 105448
    hdr.info.lsn = 0x100000900
                   ^^^^^^^^^^^
    hdr.info.uuid = df395e51-58ac-44b6-8dc5-e7104409bc17
    hdr.info.owner = 131203
    hdr.count = 2
    hdr.usedbytes = 120
    hdr.firstused = 3796
    hdr.holes = 1
    hdr.freemap[0-2] = [base,size]
    
    Note the LSN stamped into the buffer on disk: 1/0x900. The version
    on disk is much newer than the log transaction that was being
    replayed. That's a bug, and should -never- happen.
    
    So I immediately went to look at xlog_recover_get_buf_lsn() to check
    that we handled the LSN correctly. I was wondering if there was a
    similar "two commits with the same start LSN skips the second
    replay" problem with buffers. I didn't get that far, because I found
    a much more basic, rudimentary bug: xlog_recover_get_buf_lsn()
    doesn't recognise buffers with XFS_ATTR3_LEAF_MAGIC set in them!!!
    
    IOWs, attr3 leaf buffers fall through the magic number checks
    unrecognised, so trigger the "recover immediately" behaviour instead
    of undergoing an LSN check. IOWs, we incorrectly replay ATTR3 leaf
    buffers and that causes silent on disk corruption of inode attribute
    forks and potentially other things....
    
    Git history shows this is *another* zero day bug, this time
    introduced in commit 50d5c8d8e938 ("xfs: check LSN ordering for v5
    superblocks during recovery") which failed to handle the attr3 leaf
    buffers in recovery. And we've failed to handle them ever since...
    
    Signed-off-by: Dave Chinner <dchinner@redhat.com>
    Reviewed-by: Christoph Hellwig <hch@lst.de>
    Reviewed-by: Darrick J. Wong <djwong@kernel.org>
    Signed-off-by: Darrick J. Wong <djwong@kernel.org>
    Signed-off-by: Amir Goldstein <amir73il@gmail.com>
    Acked-by: Darrick J. Wong <djwong@kernel.org>
    Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
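
The LSN arithmetic and the replay decision described above can be sketched as
follows. The helper names are hypothetical and the decision function is a
simplified model of the rule stated in the commit message, not the code in
xlog_recover_get_buf_lsn().

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* An LSN packs the log cycle in the upper 32 bits and the block number in
 * the lower 32 bits, so 0x100000538 is cycle 1, block 0x538. */
static inline uint32_t lsn_cycle(uint64_t lsn) { return (uint32_t)(lsn >> 32); }
static inline uint32_t lsn_block(uint64_t lsn) { return (uint32_t)(lsn & 0xffffffffu); }

/* Simplified model: replay only if the on-disk copy is older than the log
 * item. If the buffer type is not recognised, no LSN can be extracted and
 * the buffer is replayed unconditionally, which is the path attr3 leaf
 * buffers wrongly took before this fix. */
static inline bool should_replay(bool magic_recognised, uint64_t disk_lsn,
				 uint64_t log_lsn)
{
	assert(log_lsn != 0);
	if (!magic_recognised)
		return true;
	return disk_lsn < log_lsn;
}
```

With the values from the report (on-disk LSN 0x100000900, log item LSN
0x100000538), a recognised buffer is skipped, while an unrecognised one is
replayed over newer on-disk metadata.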

xfs: fix log intent recovery ENOSPC shutdowns when inactivating inodes [+ + +]
Author: Darrick J. Wong <djwong@kernel.org>
Date:   Fri Jul 29 18:16:04 2022 +0200

    xfs: fix log intent recovery ENOSPC shutdowns when inactivating inodes
    
    commit 81ed94751b1513fcc5978dcc06eb1f5b4e55a785 upstream.
    
    During regular operation, the xfs_inactive operations create
    transactions with zero block reservation because in general we're
    freeing space, not asking for more.  The per-AG space reservations
    created at mount time enable us to handle expansions of the refcount
    btree without needing to reserve blocks to the transaction.
    
    Unfortunately, log recovery doesn't create the per-AG space reservations
    when intent items are being recovered.  This isn't an issue for intent
    item recovery itself because they explicitly request blocks, but any
    inode inactivation that can happen during log recovery uses the same
    xfs_inactive paths as regular runtime.  If a refcount btree expansion
    happens, the transaction will fail due to blk_res_used > blk_res, and we
    shut down the filesystem unnecessarily.
    
    Fix this problem by making per-AG reservations temporarily so that we
    can handle the inactivations, and releasing them at the end.  This
    brings the recovery environment closer to the runtime environment.
    
    Signed-off-by: Darrick J. Wong <djwong@kernel.org>
    Reviewed-by: Christoph Hellwig <hch@lst.de>
    Signed-off-by: Amir Goldstein <amir73il@gmail.com>
    Acked-by: Darrick J. Wong <djwong@kernel.org>
    Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

xfs: force the log offline when log intent item recovery fails [+ + +]
Author: Darrick J. Wong <djwong@kernel.org>
Date:   Fri Jul 29 18:16:05 2022 +0200

    xfs: force the log offline when log intent item recovery fails
    
    commit 4e6b8270c820c8c57a73f869799a0af2b56eff3e upstream.
    
    If any part of log intent item recovery fails, we should shut down the
    log immediately to stop the log from writing a clean unmount record to
    disk, because the metadata is not consistent.  The inability to cancel a
    dirty transaction catches most of these cases, but there are a few
    things that have slipped through the cracks, such as ENOSPC from a
    transaction allocation, or runtime errors that result in cancellation of
    a non-dirty transaction.
    
    This solves some weird behaviors reported by customers where a system
    goes down, the first mount fails, the second succeeds, but then the fs
    goes down later because of inconsistent metadata.
    
    Signed-off-by: Darrick J. Wong <djwong@kernel.org>
    Reviewed-by: Christoph Hellwig <hch@lst.de>
    Signed-off-by: Amir Goldstein <amir73il@gmail.com>
    Acked-by: Darrick J. Wong <djwong@kernel.org>
    Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

xfs: hold buffer across unpin and potential shutdown processing [+ + +]
Author: Brian Foster <bfoster@redhat.com>
Date:   Fri Jul 29 18:16:06 2022 +0200

    xfs: hold buffer across unpin and potential shutdown processing
    
    commit 84d8949e770745b16a7e8a68dcb1d0f3687bdee9 upstream.
    
    The special processing used to simulate a buffer I/O failure on fs
    shutdown has a difficult-to-reproduce race that can result in a
    use-after-free of the associated buffer. Consider a buffer that has been
    committed to the on-disk log and thus is AIL resident. The buffer
    lands on the writeback delwri queue, but is subsequently locked,
    committed and pinned by another transaction before being submitted
    for I/O. At this point, the buffer is stuck on the delwri queue as it
    cannot be submitted for I/O until it is unpinned. A log checkpoint
    I/O failure occurs sometime later, which aborts the bli. The unpin
    handler is called with the aborted log item, drops the bli reference
    count, the pin count, and falls into the I/O failure simulation
    path.
    
    The potential problem here is that once the pin count falls to zero
    in ->iop_unpin(), xfsaild is free to retry delwri submission of the
    buffer at any time, before the unpin handler even completes. If
    delwri queue submission wins the race to the buffer lock, it
    observes the shutdown state and simulates the I/O failure itself.
    This releases both the bli and delwri queue holds and frees the
    buffer while xfs_buf_item_unpin() sits on xfs_buf_lock() waiting to
    run through the same failure sequence. This problem is rare and
    requires many iterations of fstest generic/019 (which simulates disk
    I/O failures) to reproduce.
    
    To avoid this problem, grab a hold on the buffer before the log item
    is unpinned if the associated item has been aborted and will require
    a simulated I/O failure. The hold is already required for the
    simulated I/O failure, so the ordering simply guarantees the unpin
    handler access to the buffer before it is unpinned and thus
    processed by the AIL. This particular ordering is required so long
    as the AIL does not acquire a reference on the bli, which is the
    long term solution to this problem.
    
    Signed-off-by: Brian Foster <bfoster@redhat.com>
    Reviewed-by: Darrick J. Wong <djwong@kernel.org>
    Signed-off-by: Darrick J. Wong <djwong@kernel.org>
    Signed-off-by: Amir Goldstein <amir73il@gmail.com>
    Acked-by: Darrick J. Wong <djwong@kernel.org>
    Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
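
The ordering argument above (take the hold before dropping the pin) can be
modelled with a toy reference count. This is a deliberately simplified,
single-threaded sketch: struct buf, the helpers and the "racing" call are
hypothetical and only imitate the worst-case interleaving described in the
commit message.

```c
#include <assert.h>
#include <stdbool.h>

struct buf {
	int  hold;   /* reference count */
	int  pin;    /* pin count */
	bool freed;  /* stands in for the memory having been freed */
};

static inline void buf_hold(struct buf *bp)
{
	assert(!bp->freed);
	bp->hold++;
}

static inline void buf_rele(struct buf *bp)
{
	assert(!bp->freed && bp->hold > 0);
	if (--bp->hold == 0)
		bp->freed = true;
}

/* Models the delwri submission winning the race: it drops both the log item
 * hold and the delwri queue hold. */
static inline void racing_submit(struct buf *bp)
{
	buf_rele(bp);
	buf_rele(bp);
}

/* Returns true if the buffer is still alive when the unpin handler needs it. */
static inline bool unpin_aborted(struct buf *bp, bool hold_first)
{
	if (hold_first)
		buf_hold(bp);   /* the fix: take the hold before unpinning */
	bp->pin--;
	racing_submit(bp);      /* worst case: the other side runs right here */
	if (bp->freed)
		return false;   /* real code would now touch freed memory */
	if (hold_first)
		buf_rele(bp);   /* drop the extra hold after failure handling */
	return true;
}
```

Starting from two holds and one pin, the handler loses the buffer when it
unpins first and keeps it when it takes its own hold first.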

xfs: logging the on disk inode LSN can make it go backwards [+ + +]
Author: Dave Chinner <dchinner@redhat.com>
Date:   Fri Jul 29 18:16:08 2022 +0200

    xfs: logging the on disk inode LSN can make it go backwards
    
    commit 32baa63d82ee3f5ab3bd51bae6bf7d1c15aed8c7 upstream.
    
    When we log an inode, we format the "log inode" core and set an LSN
    in that inode core. We do that via xfs_inode_item_format_core(),
    which calls:
    
            xfs_inode_to_log_dinode(ip, dic, ip->i_itemp->ili_item.li_lsn);
    
    to format the log inode. It writes the LSN from the inode item into
    the log inode, and if recovery decides the inode item needs to be
    replayed, it recovers the log inode LSN field and writes it into the
    on disk inode LSN field.
    
    Now this might seem like a reasonable thing to do, but it is wrong
    on multiple levels. Firstly, if the item is not yet in the AIL,
    item->li_lsn is zero. i.e. the first time the inode is logged and
    formatted, the LSN we write into the log inode will be zero. If we
    only log it once, recovery will run and can write this zero LSN into
    the inode.
    
    This means that the next time the inode is logged and log recovery
    runs, it will *always* replay changes to the inode regardless of
    whether the inode is newer on disk than the version in the log and
    that violates the entire purpose of recording the LSN in the inode
    at writeback time (i.e. to stop it going backwards in time on disk
    during recovery).
    
    Secondly, if we commit the CIL to the journal so the inode item
    moves to the AIL, and then relog the inode, the LSN that gets
    stamped into the log inode will be the LSN of the inode's current
    location in the AIL, not its age on disk. And it's not the LSN that
    will be associated with the current change. That means when log
    recovery replays this inode item, the LSN that ends up on disk is
    the LSN for the previous changes in the log, not the current
    changes being replayed. IOWs, after recovery the LSN on disk is not
    in sync with the LSN of the modifications that were replayed into
    the inode. This, again, violates the recovery ordering semantics
    that on-disk writeback LSNs provide.
    
    Hence the inode LSN in the log dinode is -always- invalid.
    
    Thirdly, recovery actually has the LSN of the log transaction it is
    replaying right at hand - it uses it to determine if it should
    replay the inode by comparing it to the on-disk inode's LSN. But it
    doesn't use that LSN to stamp the LSN into the inode which will be
    written back when the transaction is fully replayed. It uses the one
    in the log dinode, which we know is always going to be incorrect.
    
    Looking back at the change history, the inode logging was broken by
    commit 93f958f9c41f ("xfs: cull unnecessary icdinode fields") way
    back in 2016 by a stupid idiot who thought he knew how this code
    worked. i.e. me. That commit replaced an in memory di_lsn field that
    was updated only at inode writeback time from the inode item.li_lsn
    value - and hence always contained the same LSN that appeared in the
    on-disk inode - with a read of the inode item LSN at inode format
    time. Clearly these are not the same thing.
    
    Before 93f958f9c41f, the log recovery behaviour was irrelevant,
    because the LSN in the log inode always matched the on-disk LSN at
    the time the inode was logged, hence recovery of the transaction
    would never make the on-disk LSN in the inode go backwards or get
    out of sync.
    
    A symptom of the problem is this, caught from a failure of
    generic/482. Before log recovery, the inode has been allocated but
    never used:
    
    xfs_db> inode 393388
    xfs_db> p
    core.magic = 0x494e
    core.mode = 0
    ....
    v3.crc = 0x99126961 (correct)
    v3.change_count = 0
    v3.lsn = 0
    v3.flags2 = 0
    v3.cowextsize = 0
    v3.crtime.sec = Thu Jan  1 10:00:00 1970
    v3.crtime.nsec = 0
    
    After log recovery:
    
    xfs_db> p
    core.magic = 0x494e
    core.mode = 020444
    ....
    v3.crc = 0x23e68f23 (correct)
    v3.change_count = 2
    v3.lsn = 0
    v3.flags2 = 0
    v3.cowextsize = 0
    v3.crtime.sec = Thu Jul 22 17:03:03 2021
    v3.crtime.nsec = 751000000
    ...
    
    You can see that the LSN of the on-disk inode is 0, even though it
    clearly has been written to disk. I point out this inode, because
    the generic/482 failure occurred because several adjacent inodes in
    this specific inode cluster were not replayed correctly and still
    appeared to be zero on disk when all the other metadata (inobt,
    finobt, directories, etc) indicated they should be allocated and
    written back.
    
    The fix for this is two-fold. The first is that we need to either
    revert the LSN changes in 93f958f9c41f or stop logging the inode LSN
    altogether. If we do the former, log recovery does not need to
    change but we add 8 bytes of memory per inode to store what is
    largely a write-only inode field. If we do the latter, log recovery
    needs to stamp the on-disk inode in the same manner that inode
    writeback does.
    
    I prefer the latter, because we shouldn't really be trying to log
    and replay changes to the on disk LSN as the on-disk value is the
    canonical source of the on-disk version of the inode. It also
    matches the way we recover buffer items - we create a buf_log_item
    that carries the current recovery transaction LSN that gets stamped
    into the buffer by the write verifier when it gets written back
    when the transaction is fully recovered.
    
    However, this might break log recovery on older kernels even more,
    so I'm going to simply ignore the logged value in recovery and stamp
    the on-disk inode with the LSN of the transaction being recovered
    that will trigger writeback on transaction recovery completion. This
    will ensure that the on-disk inode LSN always reflects the LSN of
    the last change that was written to disk, regardless of whether it
    comes from log recovery or runtime writeback.
    
    Fixes: 93f958f9c41f ("xfs: cull unnecessary icdinode fields")
    Signed-off-by: Dave Chinner <dchinner@redhat.com>
    Reviewed-by: Darrick J. Wong <djwong@kernel.org>
    Signed-off-by: Darrick J. Wong <djwong@kernel.org>
    Signed-off-by: Amir Goldstein <amir73il@gmail.com>
    Acked-by: Darrick J. Wong <djwong@kernel.org>
    Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
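
The recovery behaviour chosen above (ignore the logged LSN and stamp the
on-disk inode with the LSN of the transaction being recovered) can be
sketched as below. The structures and the function are hypothetical, and the
skip comparison is an assumption made for illustration; the exact test in the
kernel may differ.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

struct disk_inode {
	uint64_t lsn;           /* LSN of the last change written to disk */
	int      change_count;
};

struct log_inode {
	uint64_t logged_lsn;    /* value carried in the log; not trusted */
	int      change_count;
};

/* Replays the logged inode unless the on-disk copy is newer than the
 * transaction. Returns true if the inode was replayed. */
static inline bool recover_inode(struct disk_inode *d,
				 const struct log_inode *li,
				 uint64_t trans_lsn)
{
	assert(trans_lsn != 0);
	/* An on-disk LSN of zero carries no ordering information, so the
	 * inode is always replayed in that case. */
	if (d->lsn != 0 && d->lsn > trans_lsn)
		return false;
	d->change_count = li->change_count;
	d->lsn = trans_lsn;     /* stamp the transaction LSN, not li->logged_lsn */
	return true;
}
```

Because the stamped value is the transaction LSN, a later attempt to replay
an older transaction is skipped instead of moving the inode backwards.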

xfs: prevent UAF in xfs_log_item_in_current_chkpt [+ + +]
Author: Darrick J. Wong <djwong@kernel.org>
Date:   Fri Jul 29 18:16:03 2022 +0200

    xfs: prevent UAF in xfs_log_item_in_current_chkpt
    
    commit f8d92a66e810acbef6ddbc0bd0cbd9b117ce8acd upstream.
    
    While I was running with KASAN and lockdep enabled, I stumbled upon a
    KASAN report about a UAF to a freed CIL checkpoint.  Looking at the
    comment for xfs_log_item_in_current_chkpt, it seems pretty obvious to me
    that the original patch to xfs_defer_finish_noroll should have done
    something to lock the CIL to prevent it from switching the CIL contexts
    while the predicate runs.
    
    For upper level code that needs to know if a given log item is new
    enough not to need relogging, add a new wrapper that takes the CIL
    context lock long enough to sample the current CIL context.  This is
    kind of racy in that the CIL can switch the contexts immediately after
    sampling, but that's ok because the consequence is that the defer ops
    code is a little slow to relog items.
    
     ==================================================================
     BUG: KASAN: use-after-free in xfs_log_item_in_current_chkpt+0x139/0x160 [xfs]
     Read of size 8 at addr ffff88804ea5f608 by task fsstress/527999
    
     CPU: 1 PID: 527999 Comm: fsstress Tainted: G      D      5.16.0-rc4-xfsx #rc4
     Call Trace:
      <TASK>
      dump_stack_lvl+0x45/0x59
      print_address_description.constprop.0+0x1f/0x140
      kasan_report.cold+0x83/0xdf
      xfs_log_item_in_current_chkpt+0x139/0x160
      xfs_defer_finish_noroll+0x3bb/0x1e30
      __xfs_trans_commit+0x6c8/0xcf0
      xfs_reflink_remap_extent+0x66f/0x10e0
      xfs_reflink_remap_blocks+0x2dd/0xa90
      xfs_file_remap_range+0x27b/0xc30
      vfs_dedupe_file_range_one+0x368/0x420
      vfs_dedupe_file_range+0x37c/0x5d0
      do_vfs_ioctl+0x308/0x1260
      __x64_sys_ioctl+0xa1/0x170
      do_syscall_64+0x35/0x80
      entry_SYSCALL_64_after_hwframe+0x44/0xae
     RIP: 0033:0x7f2c71a2950b
     Code: 0f 1e fa 48 8b 05 85 39 0d 00 64 c7 00 26 00 00 00 48 c7 c0 ff ff
    ff ff c3 66 0f 1f 44 00 00 f3 0f 1e fa b8 10 00 00 00 0f 05 <48> 3d 01
    f0 ff ff 73 01 c3 48 8b 0d 55 39 0d 00 f7 d8 64 89 01 48
     RSP: 002b:00007ffe8c0e03c8 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
     RAX: ffffffffffffffda RBX: 00005600862a8740 RCX: 00007f2c71a2950b
     RDX: 00005600862a7be0 RSI: 00000000c0189436 RDI: 0000000000000004
     RBP: 000000000000000b R08: 0000000000000027 R09: 0000000000000003
     R10: 0000000000000000 R11: 0000000000000246 R12: 000000000000005a
     R13: 00005600862804a8 R14: 0000000000016000 R15: 00005600862a8a20
      </TASK>
    
     Allocated by task 464064:
      kasan_save_stack+0x1e/0x50
      __kasan_kmalloc+0x81/0xa0
      kmem_alloc+0xcd/0x2c0 [xfs]
      xlog_cil_ctx_alloc+0x17/0x1e0 [xfs]
      xlog_cil_push_work+0x141/0x13d0 [xfs]
      process_one_work+0x7f6/0x1380
      worker_thread+0x59d/0x1040
      kthread+0x3b0/0x490
      ret_from_fork+0x1f/0x30
    
     Freed by task 51:
      kasan_save_stack+0x1e/0x50
      kasan_set_track+0x21/0x30
      kasan_set_free_info+0x20/0x30
      __kasan_slab_free+0xed/0x130
      slab_free_freelist_hook+0x7f/0x160
      kfree+0xde/0x340
      xlog_cil_committed+0xbfd/0xfe0 [xfs]
      xlog_cil_process_committed+0x103/0x1c0 [xfs]
      xlog_state_do_callback+0x45d/0xbd0 [xfs]
      xlog_ioend_work+0x116/0x1c0 [xfs]
      process_one_work+0x7f6/0x1380
      worker_thread+0x59d/0x1040
      kthread+0x3b0/0x490
      ret_from_fork+0x1f/0x30
    
     Last potentially related work creation:
      kasan_save_stack+0x1e/0x50
      __kasan_record_aux_stack+0xb7/0xc0
      insert_work+0x48/0x2e0
      __queue_work+0x4e7/0xda0
      queue_work_on+0x69/0x80
      xlog_cil_push_now.isra.0+0x16b/0x210 [xfs]
      xlog_cil_force_seq+0x1b7/0x850 [xfs]
      xfs_log_force_seq+0x1c7/0x670 [xfs]
      xfs_file_fsync+0x7c1/0xa60 [xfs]
      __x64_sys_fsync+0x52/0x80
      do_syscall_64+0x35/0x80
      entry_SYSCALL_64_after_hwframe+0x44/0xae
    
     The buggy address belongs to the object at ffff88804ea5f600
      which belongs to the cache kmalloc-256 of size 256
     The buggy address is located 8 bytes inside of
      256-byte region [ffff88804ea5f600, ffff88804ea5f700)
     The buggy address belongs to the page:
     page:ffffea00013a9780 refcount:1 mapcount:0 mapping:0000000000000000 index:0xffff88804ea5ea00 pfn:0x4ea5e
     head:ffffea00013a9780 order:1 compound_mapcount:0
     flags: 0x4fff80000010200(slab|head|node=1|zone=1|lastcpupid=0xfff)
     raw: 04fff80000010200 ffffea0001245908 ffffea00011bd388 ffff888004c42b40
     raw: ffff88804ea5ea00 0000000000100009 00000001ffffffff 0000000000000000
     page dumped because: kasan: bad access detected
    
     Memory state around the buggy address:
      ffff88804ea5f500: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
      ffff88804ea5f580: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
     >ffff88804ea5f600: fa fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
                           ^
      ffff88804ea5f680: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
      ffff88804ea5f700: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
     ==================================================================
    
    Fixes: 4e919af7827a ("xfs: periodically relog deferred intent items")
    Signed-off-by: Darrick J. Wong <djwong@kernel.org>
    Reviewed-by: Dave Chinner <dchinner@redhat.com>
    Signed-off-by: Amir Goldstein <amir73il@gmail.com>
    Acked-by: Darrick J. Wong <djwong@kernel.org>
    Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
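
    Illustration (not part of the commit): a user-space sketch of the
    wrapper idea described in the message, assuming a pthread mutex as a
    stand-in for the CIL context lock. All names here (struct cil,
    struct log_item, cil_sample_current_seq, and so on) are hypothetical;
    the real patch may differ in detail. The point is that the predicate
    compares a sequence number copied under the lock and never
    dereferences a checkpoint context that may already have been freed.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the CIL; not the kernel structure. */
struct cil {
	pthread_mutex_t ctx_lock;	/* stands in for the CIL context lock */
	uint64_t current_seq;		/* sequence of the checkpoint being built */
};

/* Hypothetical stand-in for a log item. */
struct log_item {
	bool in_cil;	/* item is attached to some checkpoint */
	uint64_t seq;	/* sequence of the checkpoint it was first committed to */
};

static void cil_init(struct cil *cil, uint64_t first_seq)
{
	pthread_mutex_init(&cil->ctx_lock, NULL);
	cil->current_seq = first_seq;
}

/* Hold the lock just long enough to copy the current sequence. */
static uint64_t cil_sample_current_seq(struct cil *cil)
{
	uint64_t seq;

	pthread_mutex_lock(&cil->ctx_lock);
	seq = cil->current_seq;
	pthread_mutex_unlock(&cil->ctx_lock);
	return seq;
}

/* A checkpoint push switches contexts under the same lock. */
static void cil_switch_context(struct cil *cil)
{
	pthread_mutex_lock(&cil->ctx_lock);
	cil->current_seq++;
	pthread_mutex_unlock(&cil->ctx_lock);
}

/*
 * True if the item was committed to the checkpoint that was current at
 * sampling time. The answer may be stale immediately afterwards; the
 * caller then merely relogs the item a little later than ideal.
 */
static bool log_item_in_current_chkpt(struct cil *cil,
				      const struct log_item *lip)
{
	assert(cil != NULL && lip != NULL);
	if (!lip->in_cil)
		return false;
	return lip->seq == cil_sample_current_seq(cil);
}
```

    The race noted in the message is still present in this sketch: the
    context can switch right after sampling. It is harmless because only a
    copied integer is compared.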

xfs: refactor xfs_file_fsync [+ + +]
Author: Christoph Hellwig <hch@lst.de>
Date:   Fri Jul 29 18:16:01 2022 +0200

    xfs: refactor xfs_file_fsync
    
    commit f22c7f87777361f94aa17f746fbadfa499248dc8 upstream.
    
    [backported for dependency]
    
    Factor out the log syncing logic into two helpers to make the code easier
    to read and more maintainable.
    
    Signed-off-by: Christoph Hellwig <hch@lst.de>
    Reviewed-by: Brian Foster <bfoster@redhat.com>
    Reviewed-by: Darrick J. Wong <djwong@kernel.org>
    Signed-off-by: Darrick J. Wong <djwong@kernel.org>
    Reviewed-by: Dave Chinner <dchinner@redhat.com>
    Signed-off-by: Amir Goldstein <amir73il@gmail.com>
    Acked-by: Darrick J. Wong <djwong@kernel.org>
    Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

xfs: remove dead stale buf unpin handling code [+ + +]
Author: Brian Foster <bfoster@redhat.com>
Date:   Fri Jul 29 18:16:07 2022 +0200

    xfs: remove dead stale buf unpin handling code
    
    commit e53d3aa0b605c49d780e1b2fd0b49dba4154f32b upstream.
    
    This code goes back to a time when transaction commits wrote
    directly to iclogs. The associated log items were pinned, written to
    the log, and then "uncommitted" if some part of the log write had
    failed. This uncommit sequence called an ->iop_unpin_remove()
    handler that was eventually folded into ->iop_unpin() via the remove
    parameter. The log subsystem has since changed significantly in that
    transactions commit to the CIL instead of directly to iclogs, though
    log items must still be aborted in the event of an eventual log I/O
    error. However, the context for a log item abort is now asynchronous
    from transaction commit, which means the committing transaction has
    been freed by this point in time and the transaction uncommit
    sequence of events is no longer relevant.
    
    Further, since stale buffers remain locked at transaction commit
    through unpin, we can be certain that the buffer is not associated
    with any transaction when the unpin callback executes. Remove this
    unused hunk of code and replace it with an assertion that the buffer
    is disassociated from transaction context.
    
    Signed-off-by: Brian Foster <bfoster@redhat.com>
    Reviewed-by: Darrick J. Wong <djwong@kernel.org>
    Signed-off-by: Darrick J. Wong <djwong@kernel.org>
    Signed-off-by: Amir Goldstein <amir73il@gmail.com>
    Acked-by: Darrick J. Wong <djwong@kernel.org>
    Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

xfs: xfs_log_force_lsn isn't passed a LSN [+ + +]
Author: Dave Chinner <dchinner@redhat.com>
Date:   Fri Jul 29 18:16:02 2022 +0200

    xfs: xfs_log_force_lsn isn't passed a LSN
    
    commit 5f9b4b0de8dc2fb8eb655463b438001c111570fe upstream.
    
    [backported from CIL scalability series for dependency]
    
    In doing an investigation into AIL push stalls, I was looking at the
    log force code to see if an async CIL push could be done instead.
    This led me to xfs_log_force_lsn() and looking at how it works.
    
    xfs_log_force_lsn() is only called from inode synchronisation
    contexts such as fsync(), and it takes the ip->i_itemp->ili_last_lsn
    value as the LSN to sync the log to. This gets passed to
    xlog_cil_force_lsn() via xfs_log_force_lsn() to flush the CIL to the
    journal, and then used by xfs_log_force_lsn() to flush the iclogs to
    the journal.
    
    The problem is that ip->i_itemp->ili_last_lsn does not store a
    log sequence number. What it stores is passed to it from the
    ->iop_committing method, which is called by xfs_log_commit_cil().
    The value this passes to the iop_committing method is the CIL
    context sequence number that the item was committed to.
    
    As it turns out, xlog_cil_force_lsn() converts the sequence to an
    actual commit LSN for the related context and returns that to
    xfs_log_force_lsn(). xfs_log_force_lsn() overwrites its "lsn"
    variable that contained a sequence with an actual LSN and then uses
    that to sync the iclogs.
    
    This caused me some confusion for a while, even though I originally
    wrote all this code a decade ago. ->iop_committing is only used by
    a couple of log item types, and only inode items use the sequence
    number it is passed.
    
    Let's clean up the API, CIL structures and inode log item to call it
    a sequence number, and make it clear that the high level code is
    using CIL sequence numbers and not on-disk LSNs for integrity
    synchronisation purposes.
    
    Signed-off-by: Dave Chinner <dchinner@redhat.com>
    Reviewed-by: Brian Foster <bfoster@redhat.com>
    Reviewed-by: Darrick J. Wong <djwong@kernel.org>
    Reviewed-by: Allison Henderson <allison.henderson@oracle.com>
    Signed-off-by: Darrick J. Wong <djwong@kernel.org>
    Signed-off-by: Amir Goldstein <amir73il@gmail.com>
    Acked-by: Darrick J. Wong <djwong@kernel.org>
    Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
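
    Illustration (not part of the commit): the confusion described above
    comes from one variable holding first a CIL sequence number and then a
    commit LSN. A minimal user-space sketch, assuming hypothetical types
    and helpers (cil_seq_t, lsn_t, cil_force_seq, log_force_seq), shows
    how giving the two quantities distinct types keeps the conversion
    explicit. This is one way to express the idea, not the kernel's
    implementation.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Distinct wrapper types so a sequence cannot be passed where an LSN is expected. */
typedef struct { uint64_t v; } cil_seq_t;	/* CIL context sequence number */
typedef struct { uint64_t v; } lsn_t;		/* on-disk log sequence number */

/* Hypothetical record of a committed checkpoint. */
struct cil_ctx {
	cil_seq_t seq;		/* checkpoint sequence */
	lsn_t commit_lsn;	/* LSN at which the checkpoint was committed */
};

/*
 * Convert a checkpoint sequence into the commit LSN of that checkpoint.
 * Returns an LSN of 0 when the sequence is unknown.
 */
static lsn_t cil_force_seq(const struct cil_ctx *ctxs, size_t n, cil_seq_t seq)
{
	for (size_t i = 0; i < n; i++) {
		if (ctxs[i].seq.v == seq.v)
			return ctxs[i].commit_lsn;
	}
	return (lsn_t){ 0 };
}

/*
 * High-level force: callers hand in a sequence number, and the LSN only
 * appears as the result of the conversion. A real implementation would
 * flush the in-core logs up to the returned LSN at this point.
 */
static lsn_t log_force_seq(const struct cil_ctx *ctxs, size_t n, cil_seq_t seq)
{
	lsn_t lsn = cil_force_seq(ctxs, n, seq);

	return lsn;
}
```

    Because cil_seq_t and lsn_t are separate struct types, passing one
    where the other is expected fails to compile, which makes the
    sequence-to-LSN step visible at every call site.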