Age Commit message Author
2018-08-16Merge tag 'notsystemd/v239.1-3' into notsystemd/masternotsystemd/masterLuke Shumaker
2018-08-16Merge tag 'notsystemd/v239.1' into notsystemd/masterLuke Shumaker
2018-08-16Merge tag 'notsystemd/v234.1' into notsystemd/masterLuke Shumaker
2018-08-16Merge tag 'notsystemd/v233.1' into notsystemd/masterLuke Shumaker
2018-08-16Merge tag 'notsystemd/v232.2' into notsystemd/masterLuke Shumaker
2018-08-16nspawn: detect_inner_cgver_from_image(): Be more detailed in the final log_debugnotsystemd/v239.1-3Luke Shumaker
2018-08-16nspawn: detect_inner_cgver_from_image(): Don't assume old systemdLuke Shumaker
2018-08-16cgroup-util: Get stricter about mode detectionLuke Shumaker
Currently, cg_unified_update() recognizes an OpenRC cgroup-v1/v2 hybrid as a systemd-v233 cgroup-v1/v2 hybrid; later systemd code will be in for a shock when /sys/fs/cgroup/systemd doesn't exist!
2018-08-16cgroup-util,nspawn: Add a special "inherit" cgroup mode for nspawnLuke Shumaker
The "inherit" mode inspects /proc/self/mountinfo to do its best to replicate the cgroup setup of the outer host. It is used by default unless a different specific cgroup setup is to be used; either because the user requested it (via $UNIFIED_CGROUP_HIERARCHY), or because pick_cgroup_version() sniffed that the container has a version of systemd that doesn't support the outer host's setup. This means that nspawn can now be used when outer_cgver=CGROUP_UNIFIED_UNKNOWN; AKA when running on a non-systemd host.
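The mountinfo inspection described above could start from something like the following. This is a minimal sketch, not code from the patchset: the type and function names (`SketchMountKind`, `sketch_classify_mountinfo_line`) are hypothetical, and it only classifies a single line of /proc/self/mountinfo by its filesystem type, which is the first field after the ` - ` separator.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical names for illustration; not taken from the notsystemd patchset. */
typedef enum {
        SKETCH_MOUNT_OTHER,     /* not a cgroup filesystem */
        SKETCH_MOUNT_CGROUP_V1, /* fstype "cgroup" */
        SKETCH_MOUNT_CGROUP_V2, /* fstype "cgroup2" */
} SketchMountKind;

/* Classify one line of /proc/self/mountinfo by its filesystem type.
 * The filesystem type is the first field after the " - " separator. */
static SketchMountKind sketch_classify_mountinfo_line(const char *line) {
        const char *sep, *fstype;
        size_t len;

        assert(line);

        sep = strstr(line, " - ");
        if (!sep)
                return SKETCH_MOUNT_OTHER;

        fstype = sep + 3;
        len = strcspn(fstype, " \n");

        if (len == strlen("cgroup2") && strncmp(fstype, "cgroup2", len) == 0)
                return SKETCH_MOUNT_CGROUP_V2;
        if (len == strlen("cgroup") && strncmp(fstype, "cgroup", len) == 0)
                return SKETCH_MOUNT_CGROUP_V1;

        return SKETCH_MOUNT_OTHER;
}
```

A caller would presumably run this over every line and record the mount point and options of each cgroup entry in order to replicate the outer layout; that part is left out here.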
2018-08-16nspawn: nspawn-cgroup.c: Drastically modify cgroup_setup()Luke Shumaker
This is large and needs to be split into multiple commits.
- improve error messages
- work correctly when outer_cgver==FULL && inner_cgver==NONE
2018-08-16nspawn: Rename chown_cgroup_path → cgdir_chownLuke Shumaker
2018-08-16nspawn: (Re)mount the systemd hierarchy RO in the outer child, not innerLuke Shumaker
The current situation: If !use_cgns, then we mount the systemd hierarchy RW, bind the container's subcgroup, then remount the hierarchy RO. This gives the container RW access to its subcgroup, but makes the rest of the hierarchy RO. Except that the remount happens inside the container's final mount namespace, which means that the container could just remount it RW.

I know that all systemd-nspawn features are not security features, and provide protection against accidents only. But we can do better!

We can't just move the remount operation to where we mount cgroups in the namespace helper, because we don't know what the container's subcgroup is yet: the inner child (the container's PID 1) hasn't yet been created, let alone moved into its final cgroup by machined.

So instead, we wait until after we've raw_clone()ed the inner child, check what its cgroup is, and *then* mount the cgroups. The mounts will propagate from the helper's mount namespace to the child mount namespace.

This solution presents several challenges of its own:

1. The outer child will have chroot()ed by the time it goes to look at the inner child's cgroup. So how is it going to look at /proc/${inner_child}/cgroup if it doesn't have access to /proc? We'll have the parent open the file, then pass the file descriptor to the outer child over a socket.

2. The parent will have to work with the fact that the outer and inner child coexist; previously, the outer child exited as soon as the inner child was clone()ed, and the parent got away with pretending that only one existed at a time. We'll have to add a "barrier" to stop the outer child from exiting at a point where the resulting SIGCHLD could cause a problem for the parent. The obvious answer might be to add a second literal barrier (as in barrier.c), but the socket mentioned above serves the purpose when !use_cgns, so go ahead and re-purpose the socket to serve as a barrier even when use_cgns.

3. MS_REMOUNT operations don't propagate between mount namespaces. Part of me thinks "that's a bug/omission in the kernel", but another part of me is saying "no, that's ridiculous, there has to be a good reason why MS_REMOUNT operations don't propagate between mount namespaces." Anyway, this means that the various MS_REMOUNT operations to make things read-only are being entirely ignored by the container's namespace.

Actually, because when we remount the tmpfs we don't pass MS_BIND, the superblock is marked read-only; and since the superblock is a shared global, it effectively propagated. The mountpoint-specific flags still say "rw" in the container, but the superblock overrides them. It's a bit weird, but it works.

So that just leaves the cgroup mounts. Instead of always mounting them RW, then remounting them RO if necessary, just mount them RO the first time. This is actually what it used to do a couple of years ago, but in c053458 it was changed because "Otherwise we'll generate kernel runtime warnings about non-matching mount options." Maybe that was true then, but it's not true today; today it does not generate that warning for a differing MS_RDONLY flag (though it does if the string options are different).
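The file-descriptor hand-off from challenge 1 relies on the standard SCM_RIGHTS mechanism of Unix-domain sockets. The following is a minimal sketch of that mechanism only; the helper names `sketch_send_fd` and `sketch_recv_fd` are hypothetical and are not the helpers nspawn itself uses. Because the receiver blocks in recvmsg() until the sender has sent, the same exchange can also act as a synchronization point, in the spirit of the barrier in challenge 2.

```c
#include <assert.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/uio.h>

/* Hypothetical helper: send one file descriptor over a Unix-domain socket.
 * Returns 0 on success, -1 on error. */
static int sketch_send_fd(int sock, int fd) {
        union {
                struct cmsghdr hdr; /* forces correct alignment of buf */
                char buf[CMSG_SPACE(sizeof(int))];
        } control;
        char dummy = 0; /* at least one byte of ordinary data must be sent */
        struct iovec iov = { .iov_base = &dummy, .iov_len = 1 };
        struct msghdr msg;
        struct cmsghdr *cmsg;

        assert(sock >= 0);
        assert(fd >= 0);

        memset(&control, 0, sizeof(control));
        memset(&msg, 0, sizeof(msg));
        msg.msg_iov = &iov;
        msg.msg_iovlen = 1;
        msg.msg_control = control.buf;
        msg.msg_controllen = sizeof(control.buf);

        cmsg = CMSG_FIRSTHDR(&msg);
        cmsg->cmsg_level = SOL_SOCKET;
        cmsg->cmsg_type = SCM_RIGHTS;
        cmsg->cmsg_len = CMSG_LEN(sizeof(int));
        memcpy(CMSG_DATA(cmsg), &fd, sizeof(int));

        return sendmsg(sock, &msg, 0) < 0 ? -1 : 0;
}

/* Hypothetical helper: receive one file descriptor. Blocks until the peer
 * sends. Returns the new descriptor, or -1 on error. */
static int sketch_recv_fd(int sock) {
        union {
                struct cmsghdr hdr;
                char buf[CMSG_SPACE(sizeof(int))];
        } control;
        char dummy;
        struct iovec iov = { .iov_base = &dummy, .iov_len = 1 };
        struct msghdr msg;
        struct cmsghdr *cmsg;
        int fd;

        assert(sock >= 0);

        memset(&control, 0, sizeof(control));
        memset(&msg, 0, sizeof(msg));
        msg.msg_iov = &iov;
        msg.msg_iovlen = 1;
        msg.msg_control = control.buf;
        msg.msg_controllen = sizeof(control.buf);

        if (recvmsg(sock, &msg, 0) <= 0)
                return -1;

        cmsg = CMSG_FIRSTHDR(&msg);
        if (!cmsg ||
            cmsg->cmsg_level != SOL_SOCKET ||
            cmsg->cmsg_type != SCM_RIGHTS ||
            cmsg->cmsg_len != CMSG_LEN(sizeof(int)))
                return -1;

        memcpy(&fd, CMSG_DATA(cmsg), sizeof(int));
        return fd;
}
```

In the scenario described above, the parent would open /proc/${inner_child}/cgroup and send that descriptor; the chroot()ed outer child can then read it without having /proc mounted.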
2018-08-16nspawn: Clarify sync_cgroup(); tmp dirname, error messageLuke Shumaker
sync_cgroup() can sync name=systemd->unified or unified->name=systemd, depending on the setup. However, the names of things, comments, and error messages all assume (and give the false impression) that it only goes name=systemd->unified.
2018-08-16nspawn: Improve --help textLuke Shumaker
The `--help` text lies about what the `-U` flag does, and under-documents the `--private-users` values. Fix that.
2018-08-16nspawn: Also allow the user to force hybrid cgroup modeLuke Shumaker
2018-08-16nspawn: Hoist the decision to sync cgroups from the callee sync_cgroup() to the caller cgroup_setup()Luke Shumaker
2018-08-16nspawn: Go ahead and always decide the cgroup mounts in the outer child, not innerLuke Shumaker
It mounts them in the outer child if !use_cgns, or inner child if use_cgns. Previously, it decided the mounts at the same place that it actually mounted them. But, for simplicity and flexibility, always decide them in the outer child. Also, drop the now superfluous "dest" argument to cgroup_mount_mounts().
2018-08-16nspawn: Split off cgroup_decide_mounts() from mount_cgroups()Luke Shumaker
2018-08-16nspawn: Decide all cgroup mounts/symlinks before performing any of themLuke Shumaker
This is part 2 of a 2-part commit; it modifies the pre-existing code to use the cgmount_add(), cgroup_mount_mounts(), and cgroup_free_mounts() functions added in part 1. Instead of actually doing anything when making decisions around cgroups, we use cgroup_decide_mounts_*() to build up a CGMounts structure that is a list of actions to perform when setting up cgroups in the container. Then we pass that to cgroup_mount_mounts(), which actually does everything.
2018-08-16nspawn: Add functions for deciding cgroup mounts before performing themLuke Shumaker
This is part 1 of a 2-part commit; it adds the functions, but doesn't modify anything to use them; that's part 2. I've split it up for clarity of the diff, and to make rebasing easier.

Add (static) cgmount_add(), to build a list of CGMounts; cgroup_free_mounts() to eventually free that list; and cgroup_mount_mounts() to perform the mounts in the list.

The behavior of cgroup_mount_mounts() borrows from/imitates:
- mount_legacy_cgroup_hierarchy(): Most everything, but it shoves the decision of whether to use cgroup v1 or v2 to be the responsibility of whoever is building the list.
- mount_legacy_cgns_[un]supported(): The tmpfs logic, the logic on when to mount a cgroup hierarchy RO or RW, the logic on when to remount /sys/fs/cgroup RO, what flags to pass to path_is_mountpoint().
- mount_unified_cgroups(): The logic for deciding if a mountpoint is a cgroup hierarchy, what flags to pass to path_is_mountpoint().
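The decide-then-perform split described in this two-part commit can be sketched roughly as follows. Everything here is hypothetical: the `Sketch*` types, the field layout, and the callback-based performer are assumptions made for illustration and are not the patchset's actual CGMounts structure or function signatures. The point is only that decisions append entries to a list, and a separate pass walks the list to act on it.

```c
#define _POSIX_C_SOURCE 200809L /* for strdup() */
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical action kinds; the real patchset may use different ones. */
typedef enum {
        SKETCH_CGMOUNT_SYMLINK,
        SKETCH_CGMOUNT_TMPFS,
        SKETCH_CGMOUNT_CGROUP1,
        SKETCH_CGMOUNT_CGROUP2,
} SketchCGMountType;

typedef struct {
        SketchCGMountType type;
        char *src; /* owned by the list */
        char *dst; /* owned by the list */
} SketchCGMount;

typedef struct {
        SketchCGMount *mounts;
        size_t n;
} SketchCGMounts;

/* "Decide" step: append one action. Returns 0 on success, -1 on OOM. */
static int sketch_cgmount_add(SketchCGMounts *list, SketchCGMountType type,
                              const char *src, const char *dst) {
        char *s, *d;
        SketchCGMount *t;

        assert(list);
        assert(src);
        assert(dst);

        s = strdup(src);
        d = strdup(dst);
        t = realloc(list->mounts, (list->n + 1) * sizeof(SketchCGMount));
        if (t)
                list->mounts = t; /* realloc() may have moved the array */
        if (!s || !d || !t) {
                free(s);
                free(d);
                return -1;
        }

        list->mounts[list->n++] = (SketchCGMount) { .type = type, .src = s, .dst = d };
        return 0;
}

typedef int (*SketchPerformFn)(const SketchCGMount *m, void *userdata);

/* "Perform" step: walk the list in order and stop at the first failure. */
static int sketch_cgmounts_perform(const SketchCGMounts *list, SketchPerformFn fn, void *userdata) {
        assert(list);
        assert(fn);

        for (size_t i = 0; i < list->n; i++) {
                int r = fn(&list->mounts[i], userdata);
                if (r < 0)
                        return r;
        }
        return 0;
}

/* Example performer: a dry run that only counts the actions. */
static int sketch_dry_run(const SketchCGMount *m, void *userdata) {
        size_t *count = userdata;

        assert(m);
        (*count)++;
        return 0;
}

static void sketch_cgmounts_free(SketchCGMounts *list) {
        assert(list);

        for (size_t i = 0; i < list->n; i++) {
                free(list->mounts[i].src);
                free(list->mounts[i].dst);
        }
        free(list->mounts);
        list->mounts = NULL;
        list->n = 0;
}
```

One benefit of this shape is that the list can be built in one process context and executed in another, or inspected without touching the filesystem at all, as the dry-run performer shows.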
2018-08-16nspawn: nspawn-cgroup.c: Add dividers/headingsLuke Shumaker
It's against the systemd style guide, but it really helps me in maintaining and working on the notsystemd patchset.
2018-08-16nspawn: Track the inner child and outer child PIDs separatelyLuke Shumaker
2018-08-16nspawn: Change where we filter the name=systemd hierarchyLuke Shumaker
It was being filtered out in get_v1_hierarchies(), which seems a little silly because it is a v1 hierarchy. Instead, filter it in mount_legacy_cgns_supported(); the function that actually cares about skipping it. This is a little less efficient, since get_v1_hierarchies() now has to do the allocations to put it in the returned set; but that's minor and the code is clearer now.
2018-08-16nspawn: get_v1_hierarchies(): Ditch a pointless check for "name=unified"Luke Shumaker
name=unified isn't a real v1 hierarchy, it's just a made-up name to refer to the v2 hierarchy with a v1-looking string. It will never show up when we ask the kernel for a list of v1 hierarchies. It looks like Tejun Heo just grepped for all uses of "name=systemd" when he introduced name=unified in 2977724b09eb997fc84a80517447b5d4a70770c7.
2018-08-16nspawn: mount_legacy_cgns_supported(): Rename variables to not lieLuke Shumaker
mount_legacy_cgns_supported() is very clearly meant to be a version of mount_legacy_cgns_unsupported() modified to cope with the fact that it has already chroot()ed, and thus can't look at the host /sys. So, the loops and such look similar. However, to cope with the fact that it can't look at /sys, it deals with hierarchies in the outermost loop, rather than controllers. Yet, it kept the list variable named "controllers". That's confusing.
2018-08-16nspawn: Merge chown_cgroup(), sync_cgroup(), & create_subcgroup() into one cgroup_setup()Luke Shumaker
2018-08-16nspawn: Detect the outer_cgver once, and pass that aroundLuke Shumaker
Yes, the relevant functions in cgroup-util actually do cache the values with static variables, so we aren't saving any time by avoiding lookups. But passing it around as a value makes the flow much nicer, I think; and it makes it clearer when we can expect it to fail. This moves the call to cg_*() out of parse_argv(), to main(), right after it calls parse_argv(); since main() is the function that ultimately needs the value from cg_version().
2018-08-16nspawn: Parse UNIFIED_CGROUP_HIERARCHY similarly to any other argLuke Shumaker
2018-08-16nspawn: Expand comments in detect_unified_cgroup_hierarchy()Luke Shumaker
2018-08-16nspawn: if !cg_ns_supported() then force arg_use_cgns = falseLuke Shumaker
It's silly that every time we check arg_use_cgns we also have to check cg_ns_supported(). So, simplify these checks and force arg_use_cgns = false if the kernel doesn't support cgroup namespaces (that is, if !cg_ns_supported()).
2018-08-16cgroup-util: Split out cg_pid_get_path_internal()Luke Shumaker
The name is a bit of a misnomer, because it doesn't work with a PID, but with an already-opened /proc/${pid}/cgroup 'FILE*'. It lets us bypass the controller=NULL and controller=SYSTEMD_CGROUP_CONTROLLER special cases of cg_pid_get_path(), so that we can directly check the cgroup v2 hierarchy, even if systemd isn't using it (useful if we're doing work with a container that does use the v2 hierarchy). It also requires the caller to have already fopen()ed the cgroup file (useful so that we can read from an fd passed over a socket).
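To illustrate reading a cgroup path from an already-open stream, here is a minimal sketch. The function name `sketch_cgroup_path_from_file` and its signature are hypothetical, and it is deliberately simpler than cg_pid_get_path(): it matches the controller-list field exactly rather than searching a comma-separated list. Each line of the file has the form `hierarchy-ID:controller-list:path`, and the cgroup v2 entry has an empty controller list (`0::/path`).

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: scan an already-open /proc/<pid>/cgroup stream for
 * the line whose controller-list field equals `controllers` exactly (pass
 * "" to select the cgroup v2 entry) and copy its path into `path`.
 * Returns 0 on success, -1 if no line matches or the buffer is too small. */
static int sketch_cgroup_path_from_file(FILE *f, const char *controllers,
                                        char *path, size_t path_size) {
        char line[4096];

        assert(f);
        assert(controllers);
        assert(path);

        while (fgets(line, sizeof(line), f)) {
                char *first, *second;
                size_t len;

                line[strcspn(line, "\n")] = '\0';

                first = strchr(line, ':');
                if (!first)
                        continue;
                second = strchr(first + 1, ':');
                if (!second)
                        continue;
                *second = '\0'; /* terminate the controller-list field */

                if (strcmp(first + 1, controllers) != 0)
                        continue;

                len = strlen(second + 1);
                if (len >= path_size)
                        return -1;
                memcpy(path, second + 1, len + 1);
                return 0;
        }

        return -1;
}
```

Because the function takes a `FILE*` rather than a PID, the caller is free to obtain the stream however it likes, for example with fdopen() on a descriptor received over a socket.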
2018-08-16nspawn: Allow the container to inherit a 232-style hybrid (#6310)Luke Shumaker
2018-08-16cgroup-util,nspawn: Use switch cases around CGroupUnified when possibleLuke Shumaker
This replaces a bunch of confusing if/else trees with simple-to-reason-about switch blocks that make it absolutely clear what happens in each case of the enum. In nspawn, avoid all of the cg_* functions that query properties of the underlying CGroupUnified, and just check cg_version directly. Even in the cases where the checks on it can't be done in a switch statement, it's clearer, because it shows the symmetry between inner_cgver and outer_cgver. This should not change the behavior at all; except that there are a few more assert()s on things that shouldn't happen.
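As an illustration of the switch-over-enum style, consider the sketch below. The enum `SketchCGVersion` and its value names are hypothetical stand-ins for CGroupUnified (the real enumerator names and values may differ), and both functions are invented for this example. Leaving out a `default:` label lets the compiler warn (with -Wswitch, part of -Wall in GCC and Clang) when a new enum value is added but a switch is not updated.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for CGroupUnified; names and values are assumptions. */
typedef enum {
        SKETCH_CG_UNKNOWN = -1, /* could not be detected */
        SKETCH_CG_NONE,         /* cgroup v1 only */
        SKETCH_CG_SYSTEMD232,   /* hybrid, systemd v232 layout */
        SKETCH_CG_SYSTEMD233,   /* hybrid, systemd v233 layout */
        SKETCH_CG_ALL,          /* cgroup v2 only */
} SketchCGVersion;

/* Every case is spelled out; there is intentionally no default label. */
static const char *sketch_cg_version_to_string(SketchCGVersion v) {
        switch (v) {
        case SKETCH_CG_UNKNOWN:
                return "unknown";
        case SKETCH_CG_NONE:
                return "legacy (v1 only)";
        case SKETCH_CG_SYSTEMD232:
                return "hybrid (systemd v232 layout)";
        case SKETCH_CG_SYSTEMD233:
                return "hybrid (systemd v233 layout)";
        case SKETCH_CG_ALL:
                return "unified (v2 only)";
        }
        return NULL; /* only reached for an out-of-range value */
}

/* Reports whether any cgroup v2 hierarchy is in use.
 * Returns 0 and sets *ret on success, -1 if the version is unknown. */
static int sketch_cg_uses_v2(SketchCGVersion v, bool *ret) {
        assert(ret);

        switch (v) {
        case SKETCH_CG_NONE:
                *ret = false;
                return 0;
        case SKETCH_CG_SYSTEMD232:
        case SKETCH_CG_SYSTEMD233:
        case SKETCH_CG_ALL:
                *ret = true;
                return 0;
        case SKETCH_CG_UNKNOWN:
                return -1;
        }
        return -1;
}
```

Compared with a chain of `>=` comparisons on the enum, each case here states its outcome explicitly, so the unknown state cannot silently fall into one of the branches.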
2018-08-16cgroup-util: Add cg_version() to get the raw CGroupUnified enumLuke Shumaker
2018-08-16cgroup-util: Merge the unified_cache and unified_systemd_v232 cachesLuke Shumaker
Conceptually, the addition of bool unified_systemd_v232 split CGROUP_UNIFIED_SYSTEMD into two separate values. So, split it. The "tricky" part is when to switch the old CGROUP_UNIFIED_SYSTEMD to CGROUP_UNIFIED_SYSTEMD232 and when to switch it to CGROUP_UNIFIED_SYSTEMD233. All ">= CGROUP_UNIFIED_SYSTEMD" checks go to 232, since that preserves the existing behavior.
2018-08-16nspawn: nspawn.c: s/unified_cgroup_hierarchy/inner_cgver/Luke Shumaker
2018-08-16nspawn: nspawn-cgroup.{c,h}: s/unified_requested/inner_cgver/Luke Shumaker
2018-08-16nspawn: mount_sysfs(): Unconditionally mkdir /sys/fs/cgroupLuke Shumaker
Currently, mount_sysfs() only creates /sys/fs/cgroup if cg_ns_supported(). The comment explains that we need to "Create mountpoint for cgroups. Otherwise we are not allowed since we remount /sys read-only."; that is: that we need to do it now, rather than later. However, the comment doesn't do anything to explain why we only need to do this if cg_ns_supported(); shouldn't we _always_ need to do it? The answer is that if !use_cgns, then this was already done by the outer child, so mount_sysfs() only needs to do it if use_cgns. Now, mount_sysfs() doesn't know whether use_cgns, but !cg_ns_supported() implies !use_cgns, so we can "optimize" the case where we _know_ !use_cgns, and deal with a no-op mkdir_p() in the false-positive where cg_ns_supported() but !use_cgns. But is it really much of an optimization? We're potentially spending an access(2) (cg_ns_supported() could be cached from a previous call) to potentially save an lstat(2) and mkdir(2); and all of them are on virtual filesystems, so they should all be pretty cheap. So, simplify and drop the conditional. It's a dubious optimization that requires more text to explain than it's worth. (cherry picked from commit 677a72cd3efdfde9d544b2d1fe62f352d6d8472c)
2018-08-16cgroup-util: cg_kernel_controllers(): Fix comment about including "name="Luke Shumaker
Remove "arbitrary named hierarchies" from the list of things that cg_kernel_controllers() might return, and clarify that "name=" pseudo-controllers are not included in the returned list. /proc/cgroups does not contain "name=" pseudo-controllers, and cg_kernel_controllers() makes no effort to enumerate them via a different mechanism. (cherry picked from commit f09e86bcaa012d64addd2314fa6054657a02f64c)
2018-08-16nspawn: sync_cgroup(): Rename arg_uid_shift -> uid_shiftLuke Shumaker
Naming it arg_uid_shift is confusing because of the global arg_uid_shift in nspawn.c (cherry picked from commit 93dbdf6cb1466133def725986a4605f8594959ae)
2018-08-16nspawn: Move cgroup mount stuff from nspawn-mount.c to nspawn-cgroup.cLuke Shumaker
(cherry picked from commit 0402948206203ccbd6b81b10d4bf8973b87b2c60)
2018-08-16nspawn: Simplify tmpfs_patch_options() usage, and trickle that upLuke Shumaker
One of the things that tmpfs_patch_options does is take an (optional) UID, and insert "uid=${UID},gid=${UID}" into the options string. So we need a uid_t argument, and a way of telling if we should use it. Fortunately, that is built in to the uid_t type by having UID_INVALID as a possible value. So this is really a feature that requires one argument. Yet, it is somehow taking 4! That is absurd. Simplify it to only take one argument, and have that trickle all the way up to mount_all()'s usage. Now, in many of the uses, the argument becomes uid_shift == 0 ? UID_INVALID : uid_shift because it used to treat uid_shift=0 as invalid unless the patch_ids flag was also set. This keeps the behavior the same. Note that in all cases where it is invoked, if !use_userns (sometimes called !userns), then uid_shift is 0; we don't have to add any checks for that. That said, I'm pretty sure that "uid=0" and not setting "uid=" are the same, but Christian Brauner seemed to not think so when implementing the cgns support. https://github.com/systemd/systemd/pull/3589 (cherry picked from commit 2fa017f16922776ff9751dc22031c7ee49920729)
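The single-argument idea can be sketched as follows. This is not the real tmpfs_patch_options(): the function names, the fixed-size output buffer, and the macro `SKETCH_UID_INVALID` (defined here the way UID_INVALID is conventionally defined, as `(uid_t) -1`) are assumptions for illustration. The sentinel value carries the "should we use it?" bit, so no extra flags are needed.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>
#include <sys/types.h>

/* Sentinel meaning "no UID given"; mirrors the usual (uid_t) -1 convention. */
#define SKETCH_UID_INVALID ((uid_t) -1)

/* Hypothetical helper: build a tmpfs options string, appending
 * "uid=<uid>,gid=<uid>" only when a valid UID is given.
 * Returns 0 on success, -1 if `out` is too small. */
static int sketch_tmpfs_options(const char *base, uid_t uid, char *out, size_t out_size) {
        int n;

        assert(base);
        assert(out);

        if (uid == SKETCH_UID_INVALID)
                n = snprintf(out, out_size, "%s", base);
        else
                n = snprintf(out, out_size, "%s,uid=%lu,gid=%lu",
                             base, (unsigned long) uid, (unsigned long) uid);

        return (n < 0 || (size_t) n >= out_size) ? -1 : 0;
}

/* The caller-side mapping quoted in the commit message: a shift of 0 is
 * treated as "do not set uid=/gid=". */
static uid_t sketch_uid_for_shift(uid_t uid_shift) {
        return uid_shift == 0 ? SKETCH_UID_INVALID : uid_shift;
}
```

A caller would combine the two as `sketch_tmpfs_options("mode=755", sketch_uid_for_shift(uid_shift), buf, sizeof(buf))`, which leaves the options untouched when no user namespace shift is in effect.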
2018-08-16nspawn: Simplify mkdir_userns() usage, and trickle that upLuke Shumaker
One of the things that mkdir_userns{,_p}() does is take an (optional) UID, and chown the directory to that. So we need a uid_t argument, and a way of telling if we should use that uid_t argument. Fortunately, that is built in to the uid_t type by having UID_INVALID as a possible value. However, currently mkdir_userns() also takes a MountSettingsMask and checks a couple of bits in it to decide if it should perform the chown. Drop the mask argument, and instead have the caller pass UID_INVALID if it shouldn't chown. (cherry picked from commit 9c0fad5fb5f47da125bb768dbb4cd0e824cccc7c)
2018-08-16Merge tag 'systemd/v239.0-2.parabola7' into systemd/parabolaHEADsystemd/parabolaLuke Shumaker
2018-08-16FSDG: bootctl: Say "Systemd Boot Manager" instead of "Linux Boot Manager"systemd/v239.0-2.parabola7Luke Shumaker
2018-08-16FSDG: man/: Use FSDG operating systems as examplesLuke Shumaker
2018-08-16FSDG: systemd-resolved: Fallback hostname to "gnu-linux" instead of "linux"Luke Shumaker
2018-08-16FSDG: os-release: Default ID to "gnu-linux" instead of "linux"Luke Shumaker
As far as I can tell, no code in this repository actually uses the ID field, so this is just a man page change.
2018-08-16FSDG: os-release: Default NAME to "GNU/Linux" instead of "Linux"Luke Shumaker
2018-08-16FSDG: os-release: Default PRETTY_NAME to "GNU/Linux" instead of "Linux"Luke Shumaker