1. 09 Oct, 2021 40 commits
    • drivers: net: Drop WireGuard · 35ac2e95
      Maitreya29 authored and arnavpuranik committed

      Signed-off-by: arnavpuranik <puranikarnav@gmail.com>
    • net: adapt bpf_xdp_copy · 84105dbd
      Maitreya29 authored and arnavpuranik committed
    • Revert "net/compat: Add missing sock updates for SCM_RIGHTS" · 27c32884
      Maitreya29 authored and arnavpuranik committed
      This reverts commit 34c21662.
    • defconfig: drop KLapse · fe3e8cd2
      arnavpuranik authored

      Signed-off-by: arnavpuranik <puranikarnav@gmail.com>
    • Revert "Introducing KLapse - A kernel level livedisplay module v4.0" · 146d5c0c
      arnavpuranik authored
      This reverts commit 54d5dc99.
    • defconfig: Enable necessary configs for the BPF backport · 807f03dd
      Sourajit Karmakar authored and arnavpuranik committed

      Signed-off-by: arnavpuranik <puranikarnav@gmail.com>
    • bpf: fix bpf_jit_limit knob for PAGE_SIZE >= 64K · f52dc435
      Daniel Borkmann authored and arnavpuranik committed
      
      
      [ Upstream commit fdadd04931c2d7cd294dc5b2b342863f94be53a3 ]
      
      Michael and Sandipan report:
      
        Commit ede95a63b5 introduced a bpf_jit_limit tuneable to limit BPF
        JIT allocations. At compile time it defaults to PAGE_SIZE * 40000,
        and is adjusted again at init time if MODULES_VADDR is defined.
      
        For ppc64 kernels, MODULES_VADDR isn't defined, so we're stuck with
        the compile-time default at boot-time, which is 0x9c400000 when
        using 64K page size. This overflows the signed 32-bit bpf_jit_limit
        value:
      
        root@ubuntu:/tmp# cat /proc/sys/net/core/bpf_jit_limit
        -1673527296
      
        and can cause various unexpected failures throughout the network
        stack. In one case `strace dhclient eth0` reported:
      
        setsockopt(5, SOL_SOCKET, SO_ATTACH_FILTER, {len=11, filter=0x105dd27f8},
                   16) = -1 ENOTSUPP (Unknown error 524)
      
        and similar failures can be seen with tools like tcpdump. This doesn't
        always reproduce however, and I'm not sure why. The more consistent
        failure I've seen is an Ubuntu 18.04 KVM guest booted on a POWER9
        host would time out on systemd/netplan configuring a virtio-net NIC
        with no noticeable errors in the logs.
      
      Given this and also given that in near future some architectures like
      arm64 will have a custom area for BPF JIT image allocations we should
      get rid of the BPF_JIT_LIMIT_DEFAULT fallback / default entirely. For
      4.21, we have an overridable bpf_jit_alloc_exec(), bpf_jit_free_exec()
      so therefore add another overridable bpf_jit_alloc_exec_limit() helper
      function which returns the possible size of the memory area for deriving
      the default heuristic in bpf_jit_charge_init().
      
      Like bpf_jit_alloc_exec() and bpf_jit_free_exec(), the new
      bpf_jit_alloc_exec_limit() assumes that module_alloc() is the default
      JIT memory provider, and therefore in case archs implement their custom
      module_alloc() we use MODULES_{END,_VADDR} for limits and otherwise for
      vmalloc_exec() cases like on ppc64 we use VMALLOC_{END,_START}.
      
      Additionally, for archs supporting large page sizes, we should change
      the sysctl to be handled as long to not run into sysctl restrictions
      in future.
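
The overflow in the report above can be reproduced with plain integer arithmetic. The following is a minimal user-space sketch, not kernel code: `jit_limit_default()` and `as_signed_32()` are hypothetical helpers introduced for illustration, and the factor 40000 is the compile-time default quoted in the report.

```c
#include <stdint.h>

/* Hypothetical helper: the old compile-time default, PAGE_SIZE * 40000. */
static uint64_t jit_limit_default(uint64_t page_size)
{
    return page_size * 40000ULL;
}

/*
 * Hypothetical helper: interpret the low 32 bits of a value as a signed
 * two's-complement number, which is what a signed 32-bit sysctl reports.
 * The wrap is computed arithmetically to avoid implementation-defined
 * integer conversions.
 */
static int32_t as_signed_32(uint64_t value)
{
    uint32_t low = (uint32_t)(value & 0xffffffffULL);

    if (low > (uint32_t)INT32_MAX)
        return (int32_t)((int64_t)low - 4294967296LL);
    return (int32_t)low;
}
```

With a 64K page size, `jit_limit_default(65536)` is 0x9c400000, and `as_signed_32()` of that value is -1673527296, matching the sysctl output quoted above. With 4K pages the product still fits in a signed 32-bit value, so the problem does not show up there.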
      
      Fixes: ede95a63b5e8 ("bpf: add bpf_jit_limit knob to restrict unpriv allocations")
      Reported-by: Sandipan Das <sandipan@linux.ibm.com>
      Reported-by: Michael Roth <mdroth@linux.vnet.ibm.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Tested-by: Michael Roth <mdroth@linux.vnet.ibm.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
    • Revert "cgroup: Disable IRQs while holding css_set_lock" · 4e8fdbbd
      Anay Wadhera authored and arnavpuranik committed
      This reverts commit ac7b270e91c7b0d1b1c5544532852b55177004f1.
    • cgroup: Add generic cgroup subsystem permission checks · afd1b80d
      Colin Cross authored and arnavpuranik committed
      
      
      Rather than using explicit euid == 0 checks when trying to move
      tasks into a cgroup via CFS, move permission checks into each
      specific cgroup subsystem. If a subsystem does not specify an
      'allow_attach' handler, then we fall back to doing our checks
      the old way.
      
      Use the 'allow_attach' handler for the 'cpu' cgroup to allow
      non-root processes to add arbitrary processes to a 'cpu' cgroup
      if it has the CAP_SYS_NICE capability set.
      
      This version of the patch adds an 'allow_attach' handler instead
      of reusing the 'can_attach' handler.  If the 'can_attach' handler
      is reused, a new cgroup that implements 'can_attach' but not
      the permission checks could end up with no permission checks
      at all.
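
The permission flow described above can be modelled in user space roughly as follows. This is a simplified sketch under stated assumptions, not the patch itself: `task_model`, `subsys_model`, `cpu_allow_attach_model()` and `may_attach()` are hypothetical names, and the ordering (root is always allowed, otherwise the optional handler decides) is an assumption about how the fallback combines with the new hook.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified stand-ins for the kernel structures. */
struct task_model {
    unsigned int euid;   /* effective uid of the caller */
    bool cap_sys_nice;   /* whether the caller holds CAP_SYS_NICE */
};

struct subsys_model {
    /* Optional permission hook; returns 0 to allow the attach. */
    int (*allow_attach)(const struct task_model *caller);
};

/* Example hook in the spirit of the 'cpu' cgroup: CAP_SYS_NICE suffices. */
static int cpu_allow_attach_model(const struct task_model *caller)
{
    return caller->cap_sys_nice ? 0 : -1;
}

/* Returns true if the caller may move tasks into this subsystem's cgroup. */
static bool may_attach(const struct subsys_model *ss,
                       const struct task_model *caller)
{
    if (caller->euid == 0)
        return true;                        /* old root check still passes */
    if (ss->allow_attach)
        return ss->allow_attach(caller) == 0;
    return false;                           /* no handler: old behaviour */
}
```

Keeping the hook separate from 'can_attach' means a subsystem without a handler simply keeps the root-only rule, so a missing permission check can never silently turn into "allow everything".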
      
      Change-Id: Icfa950aa9321d1ceba362061d32dc7dfa2c64f0c
      Original-Author: San Mehat <san@google.com>
      Signed-off-by: Colin Cross <ccross@android.com>
    • cgroup: refactor allow_attach function into common code · 33903441
      Rom Lemarchand authored and arnavpuranik committed
      
      
      Move cpu_cgroup_allow_attach to a common subsys_cgroup_allow_attach.
      This allows any process with CAP_SYS_NICE to move tasks across cgroups if
      they use this function as their allow_attach handler.
      
      Bug: 18260435
      Change-Id: I6bb4933d07e889d0dc39e33b4e71320c34a2c90f
      Signed-off-by: Rom Lemarchand <romlem@android.com>
    • bpf: Fix buggy rsh min/max bounds tracking · 0c47819f
      Daniel Borkmann authored and arnavpuranik committed
      
      
      [ no upstream commit ]
      
      Fix incorrect bounds tracking for RSH opcode. Commit f23cc643f9ba ("bpf: fix
      range arithmetic for bpf map access") had a wrong assumption about min/max
      bounds. The new dst_reg->min_value needs to be derived by right shifting the
      max_val bounds, not min_val, and likewise new dst_reg->max_value needs to be
      derived by right shifting the min_val bounds, not max_val. Later stable kernels
      than 4.9 are not affected since bounds tracking was overall reworked and they
      already track this similarly as in the fix.
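
The corrected rule can be illustrated with a small user-space sketch. It assumes unsigned 64-bit bounds and shift amounts below 64; `bounds_model` and `rsh_bounds()` are hypothetical names and this is not the verifier code.

```c
#include <stdint.h>

/* Hypothetical model of a tracked value range [min, max]. */
struct bounds_model {
    uint64_t min;
    uint64_t max;
};

/*
 * Bounds of (dst >> shift) when dst lies in [dst.min, dst.max] and the shift
 * amount lies in [shift.min, shift.max]. The smallest result comes from the
 * largest shift, and the largest result from the smallest shift.
 * Assumes shift.max < 64.
 */
static struct bounds_model rsh_bounds(struct bounds_model dst,
                                      struct bounds_model shift)
{
    struct bounds_model out;

    out.min = dst.min >> shift.max;
    out.max = dst.max >> shift.min;
    return out;
}
```

For example, a value in [16, 64] shifted right by an amount in [1, 3] ends up in [2, 32]. Pairing min with min and max with max would instead give [8, 8], which misses values the shift can actually produce.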
      
      Fixes: f23cc643f9ba ("bpf: fix range arithmetic for bpf map access")
      Reported-by: Ryota Shiga (Flatt Security)
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Reviewed-by: John Fastabend <john.fastabend@gmail.com>
      Cc: Josef Bacik <jbacik@fb.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • cgroup: add tracepoints for basic operations · 66606133
      Tejun Heo authored and arnavpuranik committed
      
      
      Debugging what goes wrong with cgroup setup can get hairy.  Add
      tracepoints for cgroup hierarchy mount, cgroup creation/destruction
      and task migration operations for better visibility.
      Signed-off-by: Tejun Heo <tj@kernel.org>
    • cgroup: Disable IRQs while holding css_set_lock · c3e860af
      Daniel Bristot de Oliveira authored and arnavpuranik committed
      
      
      While testing the deadline scheduler + cgroup setup I hit this
      warning.
      
      [  132.612935] ------------[ cut here ]------------
      [  132.612951] WARNING: CPU: 5 PID: 0 at kernel/softirq.c:150 __local_bh_enable_ip+0x6b/0x80
      [  132.612952] Modules linked in: (a ton of modules...)
      [  132.612981] CPU: 5 PID: 0 Comm: swapper/5 Not tainted 4.7.0-rc2 #2
      [  132.612981] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.8.2-20150714_191134- 04/01/2014
      [  132.612982]  0000000000000086 45c8bb5effdd088b ffff88013fd43da0 ffffffff813d229e
      [  132.612984]  0000000000000000 0000000000000000 ffff88013fd43de0 ffffffff810a652b
      [  132.612985]  00000096811387b5 0000000000000200 ffff8800bab29d80 ffff880034c54c00
      [  132.612986] Call Trace:
      [  132.612987]  <IRQ>  [<ffffffff813d229e>] dump_stack+0x63/0x85
      [  132.612994]  [<ffffffff810a652b>] __warn+0xcb/0xf0
      [  132.612997]  [<ffffffff810e76a0>] ? push_dl_task.part.32+0x170/0x170
      [  132.612999]  [<ffffffff810a665d>] warn_slowpath_null+0x1d/0x20
      [  132.613000]  [<ffffffff810aba5b>] __local_bh_enable_ip+0x6b/0x80
      [  132.613008]  [<ffffffff817d6c8a>] _raw_write_unlock_bh+0x1a/0x20
      [  132.613010]  [<ffffffff817d6c9e>] _raw_spin_unlock_bh+0xe/0x10
      [  132.613015]  [<ffffffff811388ac>] put_css_set+0x5c/0x60
      [  132.613016]  [<ffffffff8113dc7f>] cgroup_free+0x7f/0xa0
      [  132.613017]  [<ffffffff810a3912>] __put_task_struct+0x42/0x140
      [  132.613018]  [<ffffffff810e776a>] dl_task_timer+0xca/0x250
      [  132.613027]  [<ffffffff810e76a0>] ? push_dl_task.part.32+0x170/0x170
      [  132.613030]  [<ffffffff8111371e>] __hrtimer_run_queues+0xee/0x270
      [  132.613031]  [<ffffffff81113ec8>] hrtimer_interrupt+0xa8/0x190
      [  132.613034]  [<ffffffff81051a58>] local_apic_timer_interrupt+0x38/0x60
      [  132.613035]  [<ffffffff817d9b0d>] smp_apic_timer_interrupt+0x3d/0x50
      [  132.613037]  [<ffffffff817d7c5c>] apic_timer_interrupt+0x8c/0xa0
      [  132.613038]  <EOI>  [<ffffffff81063466>] ? native_safe_halt+0x6/0x10
      [  132.613043]  [<ffffffff81037a4e>] default_idle+0x1e/0xd0
      [  132.613044]  [<ffffffff810381cf>] arch_cpu_idle+0xf/0x20
      [  132.613046]  [<ffffffff810e8fda>] default_idle_call+0x2a/0x40
      [  132.613047]  [<ffffffff810e92d7>] cpu_startup_entry+0x2e7/0x340
      [  132.613048]  [<ffffffff81050235>] start_secondary+0x155/0x190
      [  132.613049] ---[ end trace f91934d162ce9977 ]---
      
      The warning comes from spin_(lock|unlock)_bh(&css_set_lock) being called
      in interrupt context. Convert spin_lock_bh to spin_lock_irq(save) to
      avoid this problem, and other problems of sharing a spinlock with an
      interrupt.
      
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Li Zefan <lizefan@huawei.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Juri Lelli <juri.lelli@arm.com>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: cgroups@vger.kernel.org
      Cc: stable@vger.kernel.org # 4.5+
      Cc: linux-kernel@vger.kernel.org
      Reviewed-by: Rik van Riel <riel@redhat.com>
      Reviewed-by: "Luis Claudio R. Goncalves" <lgoncalv@redhat.com>
      Signed-off-by: Daniel Bristot de Oliveira <bristot@redhat.com>
      Acked-by: Zefan Li <lizefan@huawei.com>
      Signed-off-by: Tejun Heo <tj@kernel.org>
    • FROMLIST: kernel: cgroup: add poll file operation · e49b20df
      Johannes Weiner authored and arnavpuranik committed
      Cgroup has a standardized poll/notification mechanism for waking all
      pollers on all fds when a filesystem node changes.  To allow polling for
      custom events, add a .poll callback that can override the default.
      
      This is in preparation for pollable cgroup pressure files which have
      per-fd trigger configurations.
      
      Link: http://lkml.kernel.org/r/20190124211518.244221-3-surenb@google.com
      
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: Suren Baghdasaryan <surenb@google.com>
      Cc: Dennis Zhou <dennis@kernel.org>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Li Zefan <lizefan@huawei.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Tejun Heo <tj@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>

      (in linux-next: https://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git/commit/?id=c88177361203be291a49956b6c9d5ec164ea24b2)
      
      Conflicts:
              include/linux/cgroup-defs.h
              kernel/cgroup.c
      
      1. made changes in kernel/cgroup.c instead of kernel/cgroup/cgroup.c
      2. replaced __poll_t with unsigned int
      
      Bug: 111308141
      Test: modified lmkd to use PSI and tested using lmkd_unit_test
      
      Change-Id: Ie3d914197d1f150e1d83c6206865566a7cbff1b4
      Signed-off-by: Suren Baghdasaryan <surenb@google.com>
    • UPSTREAM: cgroup add cftype->open/release() callbacks · c44bf9e1
      Tejun Heo authored and arnavpuranik committed
      
      
      Pipe the newly added kernfs->open/release() callbacks through cftype.
      While at it, as cleanup operations now can be performed from
      ->release() instead of ->seq_stop(), make the latter optional.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Zefan Li <lizefan@huawei.com>
      
      (cherry picked from commit e90cbebc3fa5caea4c8bfeb0d0157a0cee53efc7)
      
      Bug: 111308141
      Test: modified lmkd to use PSI and tested using lmkd_unit_test
      
      Change-Id: Iff9794cbbc2c7067c24cb2f767bbdeffa26b5180
      Signed-off-by: Suren Baghdasaryan <surenb@google.com>
    • netprio_cgroup: Fix unlimited memory leak of v2 cgroups · 830b0d72
      Zefan Li authored and arnavpuranik committed
      
      
      [ Upstream commit 090e28b229af92dc5b40786ca673999d59e73056 ]
      
      If systemd is configured to use hybrid mode which enables the use of
      both cgroup v1 and v2, systemd will create new cgroup on both the default
      root (v2) and netprio_cgroup hierarchy (v1) for a new session and attach
      task to the two cgroups. If the task does some network thing then the v2
      cgroup can never be freed after the session exited.
      
      One of our machines ran into OOM due to this memory leak.
      
      In the scenario described above when sk_alloc() is called
      cgroup_sk_alloc() thought it's in v2 mode, so it stores
      the cgroup pointer in sk->sk_cgrp_data and increments
      the cgroup refcnt, but then sock_update_netprioidx()
      thought it's in v1 mode, so it stores netprioidx value
      in sk->sk_cgrp_data, so the cgroup refcnt will never be freed.
      
      Currently we do the mode switch when someone writes to the ifpriomap
      cgroup control file. The easiest fix is to also do the switch when
      a task is attached to a new cgroup.
      
      Fixes: bd1060a1d671 ("sock, cgroup: add sock->sk_cgroup")
      Reported-by: Yang Yingliang <yangyingliang@huawei.com>
      Tested-by: Yang Yingliang <yangyingliang@huawei.com>
      Signed-off-by: Zefan Li <lizefan@huawei.com>
      Acked-by: Tejun Heo <tj@kernel.org>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • cgroup: memcg: net: do not associate sock with unrelated cgroup · 0b254907
      Shakeel Butt authored and arnavpuranik committed
      
      
      [ Upstream commit e876ecc67db80dfdb8e237f71e5b43bb88ae549c ]
      
      We are testing network memory accounting in our setup and noticed
      inconsistent network memory usage and often unrelated cgroups network
      usage correlates with testing workload. On further inspection, it
      seems like mem_cgroup_sk_alloc() and cgroup_sk_alloc() are broken in
      irq context, especially for cgroup v1.
      
      mem_cgroup_sk_alloc() and cgroup_sk_alloc() can be called in irq context
      and kind of assume that this can only happen from sk_clone_lock()
      and the source sock object has already associated cgroup. However in
      cgroup v1, where network memory accounting is opt-in, the source sock
      can be unassociated with any cgroup and the new cloned sock can get
      associated with unrelated interrupted cgroup.
      
      Cgroup v2 can also suffer if the source sock object was created by
      process in the root cgroup or if sk_alloc() is called in irq context.
      The fix is to just do nothing in interrupt.
      
      WARNING: Please note that about half of the TCP sockets are allocated
      from the IRQ context, so, memory used by such sockets will not be
      accounted by the memcg.
      
      The stack trace of mem_cgroup_sk_alloc() from IRQ-context:
      
      CPU: 70 PID: 12720 Comm: ssh Tainted:  5.6.0-smp-DEV #1
      Hardware name: ...
      Call Trace:
       <IRQ>
       dump_stack+0x57/0x75
       mem_cgroup_sk_alloc+0xe9/0xf0
       sk_clone_lock+0x2a7/0x420
       inet_csk_clone_lock+0x1b/0x110
       tcp_create_openreq_child+0x23/0x3b0
       tcp_v6_syn_recv_sock+0x88/0x730
       tcp_check_req+0x429/0x560
       tcp_v6_rcv+0x72d/0xa40
       ip6_protocol_deliver_rcu+0xc9/0x400
       ip6_input+0x44/0xd0
       ? ip6_protocol_deliver_rcu+0x400/0x400
       ip6_rcv_finish+0x71/0x80
       ipv6_rcv+0x5b/0xe0
       ? ip6_sublist_rcv+0x2e0/0x2e0
       process_backlog+0x108/0x1e0
       net_rx_action+0x26b/0x460
       __do_softirq+0x104/0x2a6
       do_softirq_own_stack+0x2a/0x40
       </IRQ>
       do_softirq.part.19+0x40/0x50
       __local_bh_enable_ip+0x51/0x60
       ip6_finish_output2+0x23d/0x520
       ? ip6table_mangle_hook+0x55/0x160
       __ip6_finish_output+0xa1/0x100
       ip6_finish_output+0x30/0xd0
       ip6_output+0x73/0x120
       ? __ip6_finish_output+0x100/0x100
       ip6_xmit+0x2e3/0x600
       ? ipv6_anycast_cleanup+0x50/0x50
       ? inet6_csk_route_socket+0x136/0x1e0
       ? skb_free_head+0x1e/0x30
       inet6_csk_xmit+0x95/0xf0
       __tcp_transmit_skb+0x5b4/0xb20
       __tcp_send_ack.part.60+0xa3/0x110
       tcp_send_ack+0x1d/0x20
       tcp_rcv_state_process+0xe64/0xe80
       ? tcp_v6_connect+0x5d1/0x5f0
       tcp_v6_do_rcv+0x1b1/0x3f0
       ? tcp_v6_do_rcv+0x1b1/0x3f0
       __release_sock+0x7f/0xd0
       release_sock+0x30/0xa0
       __inet_stream_connect+0x1c3/0x3b0
       ? prepare_to_wait+0xb0/0xb0
       inet_stream_connect+0x3b/0x60
       __sys_connect+0x101/0x120
       ? __sys_getsockopt+0x11b/0x140
       __x64_sys_connect+0x1a/0x20
       do_syscall_64+0x51/0x200
       entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
      Fixes: 2d7580738345 ("mm: memcontrol: consolidate cgroup socket tracking")
      Fixes: d979a39d7242 ("cgroup: duplicate cgroup reference when cloning sockets")
      Signed-off-by: Shakeel Butt <shakeelb@google.com>
      Reviewed-by: Roman Gushchin <guro@fb.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • cgroup: add missing skcd->no_refcnt check in cgroup_sk_clone() · cf01e9d4
      Yang Yingliang authored and arnavpuranik committed
      
      
      Add skcd->no_refcnt check which is missed when backporting
      ad0f75e5f57c ("cgroup: fix cgroup_sk_alloc() for sk_clone_lock()").
      
      This patch is needed in stable-4.9, stable-4.14 and stable-4.19.
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
    • cgroup: fix cgroup_sk_alloc() for sk_clone_lock() · 5ce6eb46
      Cong Wang authored and arnavpuranik committed
      
      
      [ Upstream commit ad0f75e5f57ccbceec13274e1e242f2b5a6397ed ]
      
      When we clone a socket in sk_clone_lock(), its sk_cgrp_data is
      copied, so the cgroup refcnt must be taken too. And, unlike the
      sk_alloc() path, sock_update_netprioidx() is not called here.
      Therefore, it is safe and necessary to grab the cgroup refcnt
      even when cgroup_sk_alloc is disabled.
      
      sk_clone_lock() is in BH context anyway, the in_interrupt()
      would terminate this function if called there. And for sk_alloc()
      skcd->val is always zero. So it's safe to factor out the code
      to make it more readable.
      
      The global variable 'cgroup_sk_alloc_disabled' is used to determine
      whether to take these reference counts. It is impossible to make
      the reference counting correct unless we save this bit of information
      in skcd->val. So, add a new bit there to record whether the socket
      has already taken the reference counts. This obviously relies on
      kmalloc() to align cgroup pointers to at least 4 bytes,
      ARCH_KMALLOC_MINALIGN is certainly larger than that.
      
      This bug seems to be introduced since the beginning, commit
      d979a39d7242 ("cgroup: duplicate cgroup reference when cloning sockets")
      tried to fix it but not completely. It seems not easy to trigger until
      the recent commit 090e28b229af
      ("netprio_cgroup: Fix unlimited memory leak of v2 cgroups") was merged.
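
The idea of recording the flag inside skcd->val relies on the low bits of an aligned pointer being zero. Below is a minimal user-space sketch of that pointer-tagging technique, assuming at least 4-byte alignment. The names `SKCD_NO_REFCNT_MODEL`, `skcd_pack()`, `skcd_ptr()` and `skcd_no_refcnt()` are hypothetical, and the bit position is chosen for illustration rather than taken from the kernel layout.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical flag layout in the two low bits of an aligned pointer. */
#define SKCD_FLAG_MASK_MODEL  ((uintptr_t)0x3)
#define SKCD_NO_REFCNT_MODEL  ((uintptr_t)0x2)

/* Store a pointer plus a "no reference taken" flag in one word. */
static uintptr_t skcd_pack(void *cgrp, bool no_refcnt)
{
    uintptr_t val = (uintptr_t)cgrp;

    /* Only valid if the pointer is at least 4-byte aligned. */
    assert((val & SKCD_FLAG_MASK_MODEL) == 0);
    return val | (no_refcnt ? SKCD_NO_REFCNT_MODEL : 0);
}

/* Recover the original pointer by masking off the flag bits. */
static void *skcd_ptr(uintptr_t val)
{
    return (void *)(val & ~SKCD_FLAG_MASK_MODEL);
}

/* Tell whether the word was stored without taking a reference. */
static bool skcd_no_refcnt(uintptr_t val)
{
    return (val & SKCD_NO_REFCNT_MODEL) != 0;
}
```

Because the flag travels with the copied word, a clone can decide whether it has to take or drop a reference without consulting the global setting at a later point in time.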
      
      Fixes: bd1060a1d671 ("sock, cgroup: add sock->sk_cgroup")
      Reported-by: Cameron Berkenpas <cam@neo-zeon.de>
      Reported-by: Peter Geis <pgwipeout@gmail.com>
      Reported-by: Lu Fengqi <lufq.fnst@cn.fujitsu.com>
      Reported-by: Daniël Sonck <dsonck92@gmail.com>
      Reported-by: Zhang Qiang <qiang.zhang@windriver.com>
      Tested-by: Cameron Berkenpas <cam@neo-zeon.de>
      Tested-by: Peter Geis <pgwipeout@gmail.com>
      Tested-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
      Cc: Daniel Borkmann <daniel@iogearbox.net>
      Cc: Zefan Li <lizefan@huawei.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Roman Gushchin <guro@fb.com>
      Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • BACKPORT: UPSTREAM: Add a eBPF helper function to retrieve socket uid · 3a4a58cf
      Chenbo Feng authored and arnavpuranik committed
      
      
      Cherry-pick from commit 6acc5c2910689fc6ee181bf63085c5efff6a42bd
      
      Returns the owner uid of the socket inside a sk_buff. This is useful to
      perform per-UID accounting of network traffic or per-UID packet
      filtering. The socket needs to be a fullsock, otherwise overflowuid is
      returned.
      Signed-off-by: Chenbo Feng <fengc@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Bug: 30950746
      Change-Id: Idc00947ccfdd4e9f2214ffc4178d701cd9ead0ac
    • BACKPORT: UPSTREAM: Add a helper function to get socket cookie in eBPF · ebd9a767
      Chenbo Feng authored and arnavpuranik committed
      
      
      Cherrypick from commit: 91b8270f2a4d1d9b268de90451cdca63a70052d6
      
      Retrieve the socket cookie generated by sock_gen_cookie() from a sk_buff
      with a known socket. Generates a new cookie if one was not yet set. If
      the socket pointer inside sk_buff is NULL, 0 is returned. The helper
      function could be useful in monitoring per socket networking traffic
      statistics and provide a unique socket identifier per namespace.
      Acked-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Acked-by: Willem de Bruijn <willemb@google.com>
      Signed-off-by: Chenbo Feng <fengc@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Bug: 30950746
      Change-Id: I95918dcc3ceffb3061495a859d28aee88e3cde3c
    • ANDROID: Fix missing uapi headers · 8ac37566
      Chenbo Feng authored and arnavpuranik committed
      
      
      Update the missing bpf helper function names in bpf_func_id to keep the
      uapi header consistent with the upstream uapi header, because we need the
      newly added bpf helper functions get_socket_cookie and get_socket_uid.
      The patches related to those headers are not backported since they are
      not related, and backporting them would bring in extra conflicts.
      Signed-off-by: Chenbo Feng <fengc@google.com>
      Bug: 30950746
      Change-Id: I2b5fd03799ac5f2e3243ab11a1bccb932f06c312
    • bpf: add helper to invalidate hash · 970c9450
      Daniel Borkmann authored and arnavpuranik committed
      
      
      Add a small helper that complements 36bbef52c7eb ("bpf: direct packet
      write and access for helpers for clsact progs") for invalidating the
      current skb->hash after mangling on headers via direct packet write.
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: take compile fix from 4.9 · f15571f2
      Anay Wadhera authored and arnavpuranik committed
    • cgroup: replace out_idr_free with actual code · 95e3e935
      Anay Wadhera authored and arnavpuranik committed
    • ip_tunnel: add support for setting flow label via collect metadata · b6eddf44
      Daniel Borkmann authored and arnavpuranik committed
      
      
      This patch extends udp_tunnel6_xmit_skb() to pass in the IPv6 flow label
      from call sites. Currently, there's no such option and it's always set to
      zero when writing ip6_flow_hdr(). Add a label member to ip_tunnel_key, so
      that flow-based tunnels via collect metadata frontends can make use of it.
      vxlan and geneve will be converted to add flow label support separately.
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Jamal Hadi Salim's avatar
    • kernfs: define kernfs_node_dentry · d67756da
      Aditya Kali authored and arnavpuranik committed
      
      
      Add a new kernfs API to look up the dentry for a particular
      kernfs path.
      Signed-off-by: Aditya Kali <adityakali@google.com>
      Signed-off-by: Serge E. Hallyn <serge.hallyn@canonical.com>
      Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: Tejun Heo <tj@kernel.org>
    • BACKPORT: cgroup: misc changes · 3980fb90
      Tejun Heo authored and arnavpuranik committed
      
      
      Misc trivial changes to prepare for future changes.  No functional
      difference.
      
      * Expose cgroup_get(), cgroup_tryget() and cgroup_parent().
      
      * Implement task_dfl_cgroup() which dereferences css_set->dfl_cgrp.
      
      * Rename cgroup_stats_show() to cgroup_stat_show() for consistency
        with the file name.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      
      (cherry picked from commit 3e48930cc74f0c212ee1838f89ad0ca7fcf2fea1)
      
      Conflicts:
              kernel/cgroup/cgroup.c
      
      (1. manual merge because kernel/cgroup/cgroup.c is under kernel/cgroup.c
      2. cgroup_stats_show change is skipped because the function does not exist)
      
      Bug: 111308141
      Test: modified lmkd to use PSI and tested using lmkd_unit_test
      Signed-off-by: Suren Baghdasaryan <surenb@google.com>
      Change-Id: I756ee3dcf0d0f3da69cd1b58e644271625053538
    • objtool, modules: Discard objtool annotation sections for modules · b521eb67
      Josh Poimboeuf authored and arnavpuranik committed
      
      
      commit e390f9a9689a42f477a6073e2e7df530a4c1b740 upstream.
      
      The '__unreachable' and '__func_stack_frame_non_standard' sections are
      only used at compile time.  They're discarded for vmlinux but they
      should also be discarded for modules.
      
      Since this is a recurring pattern, prefix the section names with
      ".discard.".  It's a nice convention and vmlinux.lds.h already discards
      such sections.
      
      Also remove the 'a' (allocatable) flag from the __unreachable section
      since it doesn't make sense for a discarded section.
      Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com>
      Cc: Jessica Yu <jeyu@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Fixes: d1091c7fa3d5 ("objtool: Improve detection of BUG() and other dead ends")
      Link: http://lkml.kernel.org/r/20170301180444.lhd53c5tibc4ns77@treble
      
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      [dwmw2: Remove the unreachable part in backporting since it's not here yet]
      Signed-off-by: David Woodhouse <dwmw@amazon.co.ku>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • objtool: Add STACK_FRAME_NON_STANDARD() macro · 2db960f0
      Josh Poimboeuf authored and arnavpuranik committed
      
      
      Add a new macro, STACK_FRAME_NON_STANDARD(), which is used to denote a
      function which does something unusual related to its stack frame.  Use
      of the macro prevents objtool from emitting a false positive warning.
      Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
      Cc: Bernd Petrovitsch <bernd@petrovitsch.priv.at>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Chris J Arges <chris.j.arges@canonical.com>
      Cc: Jiri Slaby <jslaby@suse.cz>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Michal Marek <mmarek@suse.cz>
      Cc: Namhyung Kim <namhyung@gmail.com>
      Cc: Pedro Alves <palves@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: live-patching@vger.kernel.org
      Link: http://lkml.kernel.org/r/34487a17b23dba43c50941599d47054a9584b219.1456719558.git.jpoimboe@redhat.com
      
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      2db960f0
    • Anay Wadhera's avatar
      remove leftovers from 6ea07b4590d3174a53303 · 0c5fcb45
      Anay Wadhera authored and arnavpuranik committed
      0c5fcb45
    • Eric W. Biederman's avatar
      fs: Add user namespace member to struct super_block · d550e752
      Eric W. Biederman authored and arnavpuranik committed
      
      
      Start marking filesystems with a user namespace owner, s_user_ns.  In
      this change this is only used for permission checks of who may mount a
      filesystem.  Ultimately s_user_ns will be used for translating ids and
      checking capabilities for filesystems mounted from user namespaces.
      
      The default policy for setting s_user_ns is implemented in sget(),
      which arranges for s_user_ns to be set to current_user_ns() and to
      ensure that the mounter of the filesystem has CAP_SYS_ADMIN in that
      user_ns.
      
      The guts of sget are split out into another function sget_userns().
      The function sget_userns calls alloc_super with the specified user
      namespace or it verifies the existing superblock that was found
      has the expected user namespace, and fails with EBUSY when it is not.
      This failure prevents users with the wrong privileges from mounting a
      filesystem.
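
The lookup-or-allocate behaviour described above can be sketched in plain userspace C. This is only an illustration under stated assumptions: `struct user_ns`, `struct super`, the integer `key` and `sget_userns_sketch()` are hypothetical stand-ins, not the kernel's types or the real sget_userns() signature, and locking is left out entirely.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-ins for the kernel types; not the real definitions. */
struct user_ns { int id; };

struct super {
	struct user_ns *s_user_ns;  /* owner recorded when the superblock is allocated */
	int key;                    /* stands in for the test() match done by sget */
	struct super *next;
};

static struct super *supers;    /* toy list of live superblocks */

/*
 * Reuse a matching superblock only if its owner is the expected user
 * namespace; otherwise report -EBUSY.  If nothing matches, allocate a new
 * superblock owned by 'ns'.  Returns 0 and stores the result in *out.
 */
int sget_userns_sketch(int key, struct user_ns *ns, struct super **out)
{
	struct super *s;

	assert(out != NULL);

	for (s = supers; s; s = s->next) {
		if (s->key != key)
			continue;
		if (s->s_user_ns != ns)
			return -EBUSY;  /* existing superblock, wrong owner */
		*out = s;
		return 0;
	}

	s = calloc(1, sizeof(*s));
	if (!s)
		return -ENOMEM;
	s->key = key;
	s->s_user_ns = ns;
	s->next = supers;
	supers = s;
	*out = s;
	return 0;
}
```

In this sketch a second call with the same key and the same namespace returns the existing superblock, while a call with the same key and a different namespace fails with -EBUSY and leaves *out untouched.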
      
      The reason for the split of sget_userns from sget is that in some
      cases such as mount_ns and kernfs_mount_ns a different policy for
      permission checking of mounts and setting s_user_ns is necessary, and
      the existence of sget_userns() allows those policies to be
      implemented.
      
      The helper mount_ns is expected to be used for filesystems such as
      proc and mqueuefs which present per namespace information.  The
      function mount_ns is modified to call sget_userns instead of sget to
      ensure the user namespace owner of the namespace whose information is
      presented by the filesystem is used on the superblock.
      
      For sysfs and cgroup the appropriate permission checks are already in
      place, and kernfs_mount_ns is modified to call sget_userns so that
      the init_user_ns is the only user namespace used.
      
      For the cgroup filesystem cgroup namespace mounts are bind mounts of a
      subset of the full cgroup filesystem and as such s_user_ns must be the
      same for all of them as there is only a single superblock.
      
      Mounts of sysfs that vary based on the network namespace could in principle
      change s_user_ns but it keeps the analysis and implementation of kernfs
      simpler if that is not supported, and at present there appear to be no
      benefits from supporting a different s_user_ns on any sysfs mount.
      
      Getting the details of setting s_user_ns correct has been
      a long process.  Thanks to Pavel Tikhomirov who spotted a leak
      in sget_userns.  Thanks to Seth Forshee who has kept the work alive.
      
      Thanks-to: Seth Forshee <seth.forshee@canonical.com>
      Thanks-to: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>
      Acked-by: Seth Forshee <seth.forshee@canonical.com>
      Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
      d550e752
    • Eric W. Biederman's avatar
      vfs: Pass data, ns, and ns->userns to mount_ns · ca8a1ddc
      Eric W. Biederman authored and arnavpuranik committed
      
      
      Today what is normally called data (the mount options) is not passed
      to fill_super through mount_ns.
      
      Pass the mount options and the namespace separately to mount_ns so
      that filesystems such as proc that have mount options, can use
      mount_ns.
      
      Pass the user namespace to mount_ns so that the standard permission
      check that verifies the mounter has permissions over the namespace can
      be performed in mount_ns instead of in each filesystem's .mount method,
      thus removing the duplication between mqueuefs and proc in terms of
      permission checks.  The extra permission check does not currently
      affect the rpc_pipefs filesystem and the nfsd filesystem as those
      filesystems do not currently allow unprivileged mounts.  Without
      unprivileged mounts it is guaranteed that the caller has already passed
      capable(CAP_SYS_ADMIN), which guarantees the extra permission check will
      pass.
      
      Update rpc_pipefs and the nfsd filesystem to ensure that the network
      namespace reference is always taken in fill_super and always put in kill_sb
      so that the logic is simpler and so that errors originating inside of
      fill_super do not cause a network namespace leak.

      Acked-by: Seth Forshee <seth.forshee@canonical.com>
      Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
      ca8a1ddc
    • Tejun Heo's avatar
      kernfs: make kernfs_path*() behave in the style of strlcpy() · b44c18c3
      Tejun Heo authored and arnavpuranik committed
      
      
      kernfs_path*() functions always return the length of the full path but
      the path content is undefined if the length is larger than the
      provided buffer.  This makes its behavior different from strlcpy() and
      requires error handling in all its users even when they don't care
      about truncation.  In addition, the implementation can actually be
      simplified by making it behave properly in strlcpy() style.
      
      * Update kernfs_path_from_node_locked() to always fill up the buffer
        with path.  If the buffer is not large enough, the output is
        truncated and terminated.
      
      * kernfs_path() no longer needs error handling.  Make it a simple
        inline wrapper around kernfs_path_from_node().
      
      * sysfs_warn_dup()'s use of kernfs_path() doesn't need error handling.
        Updated accordingly.
      
      * cgroup_path()'s use of kernfs_path() updated to retain the old
        behavior.
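
For readers unfamiliar with the strlcpy() convention the message refers to, here is a minimal userspace sketch of it. `path_copy_sketch()` is a hypothetical name used only for illustration and is not a kernfs function; it shows the contract (always return the full length, truncate and terminate when the buffer is too small), not the kernfs implementation.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * strlcpy()-style copy: always returns strlen(src).  When size > 0 the
 * destination receives as much of src as fits and is always
 * NUL-terminated.  A return value >= size means the output was truncated.
 */
size_t path_copy_sketch(char *dst, const char *src, size_t size)
{
	size_t len;

	assert(src != NULL);
	len = strlen(src);

	if (size > 0) {
		/* Leave room for the terminator when the source does not fit. */
		size_t n = len < size ? len : size - 1;

		memcpy(dst, src, n);
		dst[n] = '\0';
	}
	return len;
}
```

A caller that does not care about truncation can ignore the return value and still use the buffer safely; a caller that does care compares the return value against the buffer size.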

      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Acked-by: Serge Hallyn <serge.hallyn@ubuntu.com>
      b44c18c3
    • Serge Hallyn's avatar
      kernfs_path_from_node_locked: don't overwrite nlen · 029f0ddf
      Serge Hallyn authored and arnavpuranik committed
      
      
      We've calculated @len to be the bytes we need for '/..' entries from
      @kn_from to the common ancestor, and calculated @nlen to be the extra
      bytes we need to get from the common ancestor to @kn_to.  We use them
      as such at the end.  But in the loop copying the actual entries, we
      overwrite @nlen.  Use a temporary variable for that instead.
      
      Without this, the return length, when the buffer is large enough, is
      wrong.  (When the buffer is NULL or too small, the returned value is
      correct. The buffer contents are also correct.)
      
      Interestingly, no callers of this function are affected by this as of
      yet.  However the upcoming cgroup_show_path() will be.
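
The pattern behind this fix can be sketched in userspace C. The sketch is deliberately simplified and rests on assumptions: `struct node` and `path_from_root_sketch()` are hypothetical, only the ancestor-to-target half of the path is modelled (no '/..' entries), and a too-small buffer is left untouched rather than truncated. What it does show is the total length being computed once and the copy loop using its own temporary for the per-component length.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical node type: a name plus a parent link; the root has parent NULL. */
struct node {
	const char *name;
	const struct node *parent;
};

/*
 * Build "/a/b/c" for the chain from (but excluding) the root down to 'to'.
 * Returns the full path length.  The total (nlen) is computed once; the
 * copy loop keeps each component's length in a separate variable (len),
 * so nlen is still intact when it is returned.
 */
size_t path_from_root_sketch(const struct node *to, char *buf, size_t buflen)
{
	const struct node *kn;
	size_t nlen = 0, off;

	assert(to != NULL);

	for (kn = to; kn->parent; kn = kn->parent)
		nlen += 1 + strlen(kn->name);    /* '/' plus the name */

	if (!buf || buflen <= nlen)
		return nlen;                     /* caller only learns the size */

	buf[nlen] = '\0';
	off = nlen;
	for (kn = to; kn->parent; kn = kn->parent) {
		size_t len = strlen(kn->name);   /* temporary, not nlen */

		off -= len;
		memcpy(buf + off, kn->name, len);
		buf[--off] = '/';
	}
	return nlen;
}
```

Had the loop stored each component's length in nlen itself, the buffer contents would still be right but the returned length would be that of the last component copied, which is the kind of mismatch the message describes.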

      Signed-off-by: Serge Hallyn <serge.hallyn@ubuntu.com>
      029f0ddf
    • Anay Wadhera's avatar
      arm64: bpf_jit_comp: drop artifact · 6fedbf34
      Anay Wadhera authored and arnavpuranik committed
      6fedbf34
    • Arnd Bergmann's avatar
      UPSTREAM: cgroup: move CONFIG_SOCK_CGROUP_DATA to init/Kconfig · 04f08da8
      Arnd Bergmann authored and arnavpuranik committed
      
      
      We now 'select SOCK_CGROUP_DATA' but Kconfig complains that this is
      not right when CONFIG_NET is disabled and there is no socket interface:
      
      warning: (CGROUP_BPF) selects SOCK_CGROUP_DATA which has unmet direct dependencies (NET)
      
      I don't know what the correct solution for this is, but simply removing
      the dependency on NET from SOCK_CGROUP_DATA by moving it out of the
      'if NET' section avoids the warning and does not produce other build
      errors.
      
      Fixes: 483c4933ea09 ("cgroup: Fix CGROUP_BPF config")
      Signed-off-by: Arnd Bergmann <arnd@arndb.de>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      
      Fixes: Change-Id: Ib41ef78fba02eb9e592558ddbf06f9ec0aa337b6
             ("UPSTREAM: cgroup: Fix CGROUP_BPF config")
      (cherry picked from commit 73b351473547e543e9c8166dd67fd99c64c15b0b)
      Signed-off-by: Amit Pundir <amit.pundir@linaro.org>
      04f08da8
    • Andy Lutomirski's avatar
      UPSTREAM: cgroup: Fix CGROUP_BPF config · 9d64ec6e
      Andy Lutomirski authored and arnavpuranik committed
      
      
      Cherry-pick from commit 483c4933ea09b7aa625b9d64af286fc22ec7e419
      
      CGROUP_BPF depended on SOCK_CGROUP_DATA which can't be manually
      enabled, making it rather challenging to turn CGROUP_BPF on.

      Signed-off-by: Andy Lutomirski <luto@kernel.org>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Bug: 30950746
      Change-Id: Ib41ef78fba02eb9e592558ddbf06f9ec0aa337b6
      9d64ec6e