Age | Commit message (Collapse) | Author |
|
Currently, cg_unified_update() recognizes OpenRC cgroup-v1/v2 hybrid as
systemd-v233 cgroup-v1/v2 hybrid; later systemd code will be in for a shock
when /sys/fs/cgroup/systemd doesn't exist!
|
|
The "inherit" mode inspects /proc/self/mountinfo to do its best to
replicate the cgroup setup of the outer host. It is used by default unless
a different specific cgroup setup is to be used; either because the user
requested it (via $UNIFIED_CGROUP_HIERARCHY), or because
pick_cgroup_version() sniffed that the container has a version of systemd
that doesn't support the outer host's setup.
This means that nspawn can now be used when
outer_cgver=CGROUP_UNIFIED_UNKNOWN; AKA when running on a non-systemd host.
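As a rough illustration of the mountinfo inspection that "inherit" mode relies on, here is a minimal sketch that classifies a single /proc/self/mountinfo line by filesystem type. The MntKind enum and classify_mountinfo_line() are hypothetical names made up for this example, not code from the patchset; the only thing taken as given is the mountinfo format documented in proc(5), where the filesystem type is the first field after the " - " separator.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical classification; not a type from the patchset. */
typedef enum { MNT_OTHER, MNT_CGROUP_V1, MNT_CGROUP_V2 } MntKind;

/* Classify one /proc/self/mountinfo line by its filesystem type, which is
 * the first field after the " - " separator (see proc(5)). */
static MntKind classify_mountinfo_line(const char *line) {
        const char *sep = strstr(line, " - ");
        char fstype[32];

        if (!sep || sscanf(sep + 3, "%31s", fstype) != 1)
                return MNT_OTHER;
        if (strcmp(fstype, "cgroup2") == 0)
                return MNT_CGROUP_V2;
        if (strcmp(fstype, "cgroup") == 0)
                return MNT_CGROUP_V1;
        return MNT_OTHER;
}
```

A caller would presumably read the file line by line and record the mount point (the fifth field) of every line that is not MNT_OTHER, so that the same layout can be recreated inside the container.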
|
|
This is large and needs to be split into multiple commits.
- improve error messages
- work correctly when outer_cgver==FULL && inner_cgver==NONE
|
|
|
|
The current situation:
If !use_cgns, then we mount the systemd hierarchy RW, bind the
container's subcgroup, then remount the hierarchy RO. This gives the
container RW access to its subcgroup, but makes the rest of the hierarchy
RO.
Except that the remount happens inside the container's final mount
namespace; which means that the container could just remount it RW.
I know that all systemd-nspawn features are not security features, and
provide protection against accidents only. But we can do better!
We can't just move the remount operation to where we mount cgroups in the
namespace helper, because we don't know what the container's subcgroup is
yet: the inner child (the container's PID 1) hasn't yet been created; let
alone moved into its final cgroup by machined. So instead, we wait until
after we've raw_clone()ed the inner child, check what its cgroup is, and
*then* mount the cgroups. The mounts will propagate from the helper's
mount namespace to the child mount namespace.
This solution presents several challenges of its own:
1. The outer child will have chroot()ed by the time it goes to look at the
inner child's cgroup. So how is it going to look at
/proc/${inner_child}/cgroup if it doesn't have access to /proc? We'll
have the parent open the file, then pass the file descriptor to the
outer child over a socket.
2. The parent will have to work with the fact that the outer and inner
child coexist; previously, the outer child exited as soon as the inner
child was clone()ed, and the parent got away with pretending that only one
existed at a time. We'll have to add a "barrier" to stop the outer
child from exiting at a point where the resulting SIGCHLD could cause a
problem for the parent. The obvious answer might be to add a second
literal barrier (as in barrier.c), but the socket mentioned above
serves the purpose when !use_cgns, so go ahead and re-purpose the
socket to serve as a barrier even when use_cgns.
3. MS_REMOUNT operations don't propagate between mount namespaces. Part
of me thinks "that's a bug/omission in the kernel", but another part of
me is saying "no, that's ridiculous, there has to be a good reason why
MS_REMOUNT operations don't propagate between mount namespaces."
Anyway, this means that the various MS_REMOUNT operations to make
things read-only are being entirely ignored by the container's
namespace.
Actually, because when we remount the tmpfs we don't pass MS_BIND, the
superblock is marked read-only; and since the superblock is a shared
global, it effectively propagated. The mountpoint-specific flags still
say "rw" in the container, but the superblock overrides them. It's a
bit weird, but it works.
So that just leaves the cgroup mounts. Instead of always mounting them
RW, then remounting them RO if necessary, just mount them RO the first
time. This is actually what it used to do a couple of years ago, but
in c053458 it was changed because "Otherwise we'll generate kernel
runtime warnings about non-matching mount options." Maybe that was
true then, but it's not true today; today it does not generate that
warning for a differing MS_RDONLY flag (though it does if the string
options are different).
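Challenges 1 and 2 both hinge on passing an open file descriptor over a socket. A minimal sketch of that mechanism, using SCM_RIGHTS ancillary data on an AF_UNIX socket as described in unix(7) and cmsg(3), might look like the following. The helper names send_fd() and receive_fd() are assumptions for this example and may not match what nspawn actually uses.

```c
#include <assert.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/uio.h>
#include <unistd.h>

/* Send one open file descriptor over a connected AF_UNIX socket.
 * Returns 0 on success, -1 on error. */
static int send_fd(int sock, int fd) {
        char byte = 0; /* at least one byte of real data must accompany the fd */
        struct iovec iov = { .iov_base = &byte, .iov_len = 1 };
        union {
                struct cmsghdr hdr; /* forces correct alignment */
                char buf[CMSG_SPACE(sizeof(int))];
        } control;
        struct msghdr msg = {
                .msg_iov = &iov, .msg_iovlen = 1,
                .msg_control = control.buf, .msg_controllen = sizeof(control.buf),
        };
        struct cmsghdr *cmsg;

        memset(&control, 0, sizeof(control));
        cmsg = CMSG_FIRSTHDR(&msg);
        cmsg->cmsg_level = SOL_SOCKET;
        cmsg->cmsg_type = SCM_RIGHTS;
        cmsg->cmsg_len = CMSG_LEN(sizeof(int));
        memcpy(CMSG_DATA(cmsg), &fd, sizeof(int));

        return sendmsg(sock, &msg, 0) == 1 ? 0 : -1;
}

/* Block until a descriptor arrives; returns the new fd, or -1 on error. */
static int receive_fd(int sock) {
        char byte;
        struct iovec iov = { .iov_base = &byte, .iov_len = 1 };
        union {
                struct cmsghdr hdr;
                char buf[CMSG_SPACE(sizeof(int))];
        } control;
        struct msghdr msg = {
                .msg_iov = &iov, .msg_iovlen = 1,
                .msg_control = control.buf, .msg_controllen = sizeof(control.buf),
        };
        struct cmsghdr *cmsg;
        int fd;

        if (recvmsg(sock, &msg, 0) != 1)
                return -1;
        cmsg = CMSG_FIRSTHDR(&msg);
        if (!cmsg || cmsg->cmsg_level != SOL_SOCKET ||
            cmsg->cmsg_type != SCM_RIGHTS || cmsg->cmsg_len != CMSG_LEN(sizeof(int)))
                return -1;
        memcpy(&fd, CMSG_DATA(cmsg), sizeof(int));
        return fd;
}
```

Because receive_fd() blocks until the peer sends something, the same socket can double as the "barrier" from challenge 2: the receiver cannot proceed (or exit) before the sender has reached the point where it sends.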
|
|
sync_cgroup() can sync name=systemd->unified or unified->name=systemd,
depending on the setup. However, the names of things, comments, and error
messages all assume (and give the false impression) that it only goes
name=systemd->unified.
|
|
The `--help` text lies about what the `-U` flag does, and under-documents
the `--private-users` values. Fix that.
|
|
|
|
the caller cgroup_setup()
|
|
inner
It mounts them in the outer child if !use_cgns, or inner child if use_cgns.
Previously, it decided the mounts at the same place that it actually
mounted them. But, for simplicity and flexibility, always decide them in
the outer child.
Also, drop the now superfluous "dest" argument to cgroup_mount_mounts().
|
|
|
|
This is part 2 of a 2-part commit; it modifies the pre-existing code to use
the cgmount_add(), cgroup_mount_mounts(), and cgroup_free_mounts()
functions added in part 1.
Instead of actually doing anything when making decisions around cgroup, we
use cgroup_decide_mounts_*() to build up a CGMounts structure that is a
list of actions to perform when setting up cgroups in the container. Then we
pass that to cgroup_mount_mounts(), which actually does everything.
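The decide-then-act split described above can be sketched roughly as follows. The struct layout, the CGMountType values, and the callback-based cgmounts_apply() are assumptions made so that the example runs without root; the real cgroup_mount_mounts() performs the mounts itself, and the real CGMounts fields are not shown in this log.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical layout; the real CGMounts fields are not shown in this log. */
typedef enum { CGMOUNT_TMPFS, CGMOUNT_CGROUP1, CGMOUNT_CGROUP2 } CGMountType;

typedef struct CGMount {
        CGMountType type;
        const char *src; /* borrowed; must outlive the list */
        const char *dst; /* borrowed; must outlive the list */
} CGMount;

typedef struct CGMounts {
        CGMount *mounts;
        size_t n;
} CGMounts;

/* Decision phase: append one planned mount; NULL on allocation failure. */
static CGMount *cgmount_add(CGMounts *m, CGMountType type, const char *src, const char *dst) {
        CGMount *grown = realloc(m->mounts, (m->n + 1) * sizeof(CGMount));

        if (!grown)
                return NULL;
        m->mounts = grown;
        grown[m->n] = (CGMount) { .type = type, .src = src, .dst = dst };
        return &grown[m->n++];
}

/* Action phase: hand each planned mount to a callback, in order, stopping at
 * the first error. This stands in for cgroup_mount_mounts(). */
static int cgmounts_apply(const CGMounts *m, int (*action)(const CGMount *)) {
        for (size_t i = 0; i < m->n; i++) {
                int r = action(&m->mounts[i]);
                if (r < 0)
                        return r;
        }
        return 0;
}

/* Example action that only reports what would be mounted. */
static int print_mount(const CGMount *c) {
        printf("would mount %s on %s (type %d)\n", c->src, c->dst, (int) c->type);
        return 0;
}

/* Release the list itself; the borrowed strings are left alone. */
static void cgroup_free_mounts(CGMounts *m) {
        free(m->mounts);
        m->mounts = NULL;
        m->n = 0;
}
```

Keeping the list as plain data is what makes it possible to decide in one process and act in another, since the decision code never has to touch the filesystem.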
|
|
This is part 1 of a 2-part commit; it adds the functions, but doesn't
modify anything to use them; that's part 2. I've split it up for clarity
of the diff, and to make rebasing easier.
Add (static) cgmount_add(), to build a list of CGMounts;
cgroup_free_mounts() to eventually free that list; and
cgroup_mount_mounts() to perform the mounts in the list.
The behavior of cgroup_mount_mounts() borrows from/imitates:
- mount_legacy_cgroup_hierarchy(): Most everything, but it shoves the
decision of whether to use cgroup v1 or v2 to be the responsibility of
whoever is building the list.
- mount_legacy_cgns_[un]supported(): The tmpfs logic, the logic on when
to mount a cgroup hierarchy RO or RW, the logic on when to remount
/sys/fs/cgroup RO, what flags to pass to path_is_mountpoint().
- mount_unified_cgroups(): The logic on deciding if a mountpoint is a cgroup
hierarchy, what flags to pass to path_is_mountpoint().
|
|
It's against the systemd style guide, but it really helps me
in maintaining and working on the notsystemd patchset.
|
|
|
|
It was being filtered out in get_v1_hierarchies(), which seems a little
silly because it is a v1 hierarchy. Instead, filter it in
mount_legacy_cgns_supported(); the function that actually cares about
skipping it. This is a little less efficient, since get_v1_hierarchies()
now has to do the allocations to put it in the returned set; but that's
minor and the code is clearer now.
|
|
name=unified isn't a real v1 hierarchy, it's just a made-up name to refer
to the v2 hierarchy with a v1-looking string. It will never show up when
we ask the kernel for a list of v1 hierarchies. It looks like Tejun Heo
just grepped for all uses of "name=systemd" when he introduced name=unified
in 2977724b09eb997fc84a80517447b5d4a70770c7.
|
|
mount_legacy_cgns_supported() is very clearly meant to be a version of
mount_legacy_cgns_unsupported() modified to cope with the fact that it has
already chroot()ed, and thus can't look at the host /sys. So, the loops
and such look similar.
However, to cope with the fact that it can't look at /sys, it deals with
hierarchies in the outermost loop, rather than controllers. Yet, it kept
the list variable named "controllers". That's confusing.
|
|
cgroup_setup()
|
|
Yes, the relevant functions in cgroup-util actually do cache the values
with static variables, so we aren't saving any time by avoiding lookups.
But passing it around as a value makes the flow much nicer, I think; and it
makes it clearer when we can expect it to fail.
This moves the call to cg_*() out of parse_argv(), to main(), right
after it calls parse_argv(); since main() is the function that
ultimately needs the value from cg_version().
|
|
|
|
|
|
It's silly that every time we check arg_use_cgns we also have to check
cg_ns_supported().
So, simplify these checks and force arg_use_cgns = false if the kernel
doesn't support cgroup namespaces (that is, if !cg_ns_supported()).
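The idea can be sketched as resolving the flag once, up front. This is a simplified illustration rather than the actual nspawn code: cgns_supported() below is a stand-in that assumes support can be detected by checking for /proc/self/ns/cgroup with access(2), and resolve_use_cgns() is a hypothetical helper name.

```c
#include <assert.h>
#include <stdbool.h>
#include <unistd.h>

/* Stand-in for cg_ns_supported(); assumes the kernel exposes
 * /proc/self/ns/cgroup exactly when cgroup namespaces are available. */
static bool cgns_supported(void) {
        return access("/proc/self/ns/cgroup", F_OK) == 0;
}

/* Fold the kernel check into the flag once, so that later code can test
 * the flag alone instead of "flag && supported" everywhere. */
static bool resolve_use_cgns(bool requested) {
        return requested && cgns_supported();
}
```

After something like `arg_use_cgns = resolve_use_cgns(arg_use_cgns);` early in startup, every later check reduces to a plain `if (arg_use_cgns)`.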
|
|
The name is a bit of a misnomer, because it doesn't work with a PID, but
with an already-opened /proc/${pid}/cgroup 'FILE*'.
It lets us bypass the controller=NULL and
controller=SYSTEMD_CGROUP_CONTROLLER special cases of cg_pid_get_path(), so
that we can directly check the cgroup v2 hierarchy; even if systemd isn't
using it (useful if we're doing work with a container that does use the v2
hierarchy). It also requires the caller to have already fopen()ed the
cgroup file (useful so that we can read from an fd passed over a socket).
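To illustrate what reading the v2 hierarchy directly from an already-opened cgroup file involves, here is a minimal sketch. The function name cgfile_get_v2_path() is made up for this example and the real function's signature is not shown in this log; the sketch relies only on the documented /proc/<pid>/cgroup format of "hierarchy-ID:controller-list:path" lines, where the v2 hierarchy appears as "0::<path>" (see cgroups(7)).

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Scan an already-opened /proc/<pid>/cgroup stream and copy out the path of
 * the v2 hierarchy, which the kernel reports as the entry "0::<path>".
 * Returns 0 on success, -1 if there is no v2 entry or the buffer is too
 * small. The name is hypothetical. */
static int cgfile_get_v2_path(FILE *f, char *path, size_t len) {
        char line[4096];

        while (fgets(line, sizeof(line), f)) {
                line[strcspn(line, "\n")] = '\0'; /* strip the newline */
                if (strncmp(line, "0::", 3) != 0)
                        continue; /* a v1 hierarchy; skip it */
                if (strlen(line + 3) >= len)
                        return -1;
                strcpy(path, line + 3);
                return 0;
        }
        return -1;
}
```

Since the function only needs a FILE*, the caller is free to obtain it with fdopen() on a descriptor that arrived over a socket, which is the use case the message describes.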
|
|
|
|
This replaces a bunch of confusing if/else trees with
simple-to-reason-about switch blocks that make it absolutely clear what
happens in each case of the enum.
In nspawn, avoid all of the cg_* functions that query properties of the
underlying CGroupUnified, and just check cg_version directly. Even in the
cases where the checks on it can't be done in a switch statement, it's
clearer, because it shows the symmetry between inner_cgver and outer_cgver.
This should not change the behavior at all; except that there are a few
more assert()s on things that shouldn't happen.
|
|
|
|
Conceptually, the addition of bool unified_systemd_v232 split
CGROUP_UNIFIED_SYSTEMD into two separate values. So, split it.
The "tricky" part is when to switch the old CGROUP_UNIFIED_SYSTEMD to
CGROUP_UNIFIED_SYSTEMD232 and when to switch it to
CGROUP_UNIFIED_SYSTEMD233. All ">= CGROUP_UNIFIED_SYSTEMD" checks go to
232, since that preserves the existing behavior.
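A small sketch of why ">=" checks map onto the 232 value: if the two new values sit next to each other in the enum, comparing against the lower one still matches both. The numeric values, the ordering, and the CGROUP_UNIFIED_FULL name below are assumptions for illustration (FULL is inferred from "outer_cgver==FULL" elsewhere in this log); the helper names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

/* Assumed ordering and values; only the names SYSTEMD232/SYSTEMD233,
 * UNKNOWN, NONE and FULL appear in the commit messages. */
typedef enum CGroupUnified {
        CGROUP_UNIFIED_UNKNOWN = -1,
        CGROUP_UNIFIED_NONE = 0,
        CGROUP_UNIFIED_SYSTEMD232 = 1,
        CGROUP_UNIFIED_SYSTEMD233 = 2,
        CGROUP_UNIFIED_FULL = 3,
} CGroupUnified;

/* Old-style range check, rewritten against the lower of the split values. */
static bool systemd_tree_is_v2_range(CGroupUnified v) {
        return v >= CGROUP_UNIFIED_SYSTEMD232;
}

/* The same decision as an explicit switch, one case per enum value. */
static bool systemd_tree_is_v2_switch(CGroupUnified v) {
        switch (v) {
        case CGROUP_UNIFIED_SYSTEMD232:
        case CGROUP_UNIFIED_SYSTEMD233:
        case CGROUP_UNIFIED_FULL:
                return true;
        case CGROUP_UNIFIED_NONE:
        case CGROUP_UNIFIED_UNKNOWN:
                return false;
        }
        return false; /* not reached for valid enum values */
}
```

The switch form is longer, but with compiler warnings for unhandled enum values enabled, adding a new value forces each such decision to be revisited.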
|
|
|
|
|
|
Currently, mount_sysfs() only creates /sys/fs/cgroup if cg_ns_supported().
The comment explains that we need to "Create mountpoint for
cgroups. Otherwise we are not allowed since we remount /sys read-only.";
that is: that we need to do it now, rather than later. However, the
comment doesn't do anything to explain why we only need to do this if
cg_ns_supported(); shouldn't we _always_ need to do it?
The answer is that if !use_cgns, then this was already done by the outer
child, so mount_sysfs() only needs to do it if use_cgns. Now,
mount_sysfs() doesn't know whether use_cgns, but !cg_ns_supported() implies
!use_cgns, so we can "optimize" the case where we _know_ !use_cgns, and deal
with a no-op mkdir_p() in the false-positive where cg_ns_supported() but
!use_cgns.
But is it really much of an optimization? We're potentially spending an
access(2) (cg_ns_supported() could be cached from a previous call) to
potentially save an lstat(2) and mkdir(2); and all of them are on virtual
filesystems, so they should all be pretty cheap.
So, simplify and drop the conditional. It's a dubious optimization that
requires more text to explain than it's worth.
(cherry picked from commit 677a72cd3efdfde9d544b2d1fe62f352d6d8472c)
|
|
Remove "arbitrary named hierarchies" from the list of things that
cg_kernel_controllers() might return, and clarify that "name="
pseudo-controllers are not included in the returned list.
/proc/cgroups does not contain "name=" pseudo-controllers, and
cg_kernel_controllers() makes no effort to enumerate them via a different
mechanism.
(cherry picked from commit f09e86bcaa012d64addd2314fa6054657a02f64c)
|
|
Naming it arg_uid_shift is confusing because of the global arg_uid_shift in
nspawn.c
(cherry picked from commit 93dbdf6cb1466133def725986a4605f8594959ae)
|
|
(cherry picked from commit 0402948206203ccbd6b81b10d4bf8973b87b2c60)
|
|
One of the things that tmpfs_patch_options does is take an (optional) UID,
and insert "uid=${UID},gid=${UID}" into the options string. So we need a
uid_t argument, and a way of telling if we should use it. Fortunately,
that is built in to the uid_t type by having UID_INVALID as a possible
value.
So this is really a feature that requires one argument. Yet, it is somehow
taking 4! That is absurd. Simplify it to only take one argument, and have
that trickle all the way up to mount_all()'s usage.
Now, in many of the uses, the argument becomes
uid_shift == 0 ? UID_INVALID : uid_shift
because it used to treat uid_shift=0 as invalid unless the patch_ids flag
was also set. This keeps the behavior the same. Note that in all cases
where it is invoked, if !use_userns (sometimes called !userns), then
uid_shift is 0; we don't have to add any checks for that.
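The single-argument shape can be sketched like this. It is a simplified stand-in, not the real tmpfs_patch_options(): the function names are hypothetical, the caller supplies the output buffer, and the input options string is assumed to be non-empty. UID_INVALID is defined here the way systemd defines it, as (uid_t) -1.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>
#include <sys/types.h>

#define UID_INVALID ((uid_t) -1)

/* Append "uid=<uid>,gid=<uid>" to a non-empty mount options string, unless
 * uid is UID_INVALID, in which case the options are copied unchanged.
 * Returns 0 on success, -1 if the result does not fit in out. */
static int tmpfs_options_with_uid(const char *options, uid_t uid, char *out, size_t len) {
        int n;

        if (uid == UID_INVALID)
                n = snprintf(out, len, "%s", options);
        else
                n = snprintf(out, len, "%s,uid=%lu,gid=%lu", options,
                             (unsigned long) uid, (unsigned long) uid);
        return (n < 0 || (size_t) n >= len) ? -1 : 0;
}

/* Caller-side mapping that preserves the old behavior: a shift of 0 means
 * "don't add uid=/gid=". */
static uid_t uid_shift_to_arg(uid_t uid_shift) {
        return uid_shift == 0 ? UID_INVALID : uid_shift;
}
```

With this shape the "should we patch at all?" question travels inside the uid_t value itself, so no separate flag arguments are needed.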
That said, I'm pretty sure that "uid=0" and not setting "uid=" are the
same, but Christian Brauner seemed to not think so when implementing the
cgns support. https://github.com/systemd/systemd/pull/3589
(cherry picked from commit 2fa017f16922776ff9751dc22031c7ee49920729)
|
|
One of the things that mkdir_userns{,_p}() does is take an (optional) UID,
and chown the directory to that. So we need a uid_t argument, and a way of
telling if we should use that uid_t argument. Fortunately, that is built
in to the uid_t type by having UID_INVALID as a possible value.
However, currently mkdir_userns() also takes a MountSettingsMask and checks
a couple of bits in it to decide if it should perform the chown.
Drop the mask argument, and instead have the caller pass UID_INVALID if it
shouldn't chown.
(cherry picked from commit 9c0fad5fb5f47da125bb768dbb4cd0e824cccc7c)
|
|
As far as I can tell, no code in this repository actually uses the ID
field, so this is just a man page change.
|
|
|
|
|