
Problems with systemd-nspawn in containers with apparmor enabled #1321

Open
mkg20001 opened this issue Oct 19, 2024 · 6 comments
Labels
Incomplete Waiting on more information from reporter

Comments

@mkg20001 (Contributor)

Required information

  • Distribution: ubuntu
  • Distribution version: noble / 24.04
  • The output of "incus info":
config: {}
api_extensions:
- storage_zfs_remove_snapshots
- container_host_shutdown_timeout
- container_stop_priority
- container_syscall_filtering
- auth_pki
- container_last_used_at
- etag
- patch
- usb_devices
- https_allowed_credentials
- image_compression_algorithm
- directory_manipulation
- container_cpu_time
- storage_zfs_use_refquota
- storage_lvm_mount_options
- network
- profile_usedby
- container_push
- container_exec_recording
- certificate_update
- container_exec_signal_handling
- gpu_devices
- container_image_properties
- migration_progress
- id_map
- network_firewall_filtering
- network_routes
- storage
- file_delete
- file_append
- network_dhcp_expiry
- storage_lvm_vg_rename
- storage_lvm_thinpool_rename
- network_vlan
- image_create_aliases
- container_stateless_copy
- container_only_migration
- storage_zfs_clone_copy
- unix_device_rename
- storage_lvm_use_thinpool
- storage_rsync_bwlimit
- network_vxlan_interface
- storage_btrfs_mount_options
- entity_description
- image_force_refresh
- storage_lvm_lv_resizing
- id_map_base
- file_symlinks
- container_push_target
- network_vlan_physical
- storage_images_delete
- container_edit_metadata
- container_snapshot_stateful_migration
- storage_driver_ceph
- storage_ceph_user_name
- resource_limits
- storage_volatile_initial_source
- storage_ceph_force_osd_reuse
- storage_block_filesystem_btrfs
- resources
- kernel_limits
- storage_api_volume_rename
- network_sriov
- console
- restrict_dev_incus
- migration_pre_copy
- infiniband
- dev_incus_events
- proxy
- network_dhcp_gateway
- file_get_symlink
- network_leases
- unix_device_hotplug
- storage_api_local_volume_handling
- operation_description
- clustering
- event_lifecycle
- storage_api_remote_volume_handling
- nvidia_runtime
- container_mount_propagation
- container_backup
- dev_incus_images
- container_local_cross_pool_handling
- proxy_unix
- proxy_udp
- clustering_join
- proxy_tcp_udp_multi_port_handling
- network_state
- proxy_unix_dac_properties
- container_protection_delete
- unix_priv_drop
- pprof_http
- proxy_haproxy_protocol
- network_hwaddr
- proxy_nat
- network_nat_order
- container_full
- backup_compression
- nvidia_runtime_config
- storage_api_volume_snapshots
- storage_unmapped
- projects
- network_vxlan_ttl
- container_incremental_copy
- usb_optional_vendorid
- snapshot_scheduling
- snapshot_schedule_aliases
- container_copy_project
- clustering_server_address
- clustering_image_replication
- container_protection_shift
- snapshot_expiry
- container_backup_override_pool
- snapshot_expiry_creation
- network_leases_location
- resources_cpu_socket
- resources_gpu
- resources_numa
- kernel_features
- id_map_current
- event_location
- storage_api_remote_volume_snapshots
- network_nat_address
- container_nic_routes
- cluster_internal_copy
- seccomp_notify
- lxc_features
- container_nic_ipvlan
- network_vlan_sriov
- storage_cephfs
- container_nic_ipfilter
- resources_v2
- container_exec_user_group_cwd
- container_syscall_intercept
- container_disk_shift
- storage_shifted
- resources_infiniband
- daemon_storage
- instances
- image_types
- resources_disk_sata
- clustering_roles
- images_expiry
- resources_network_firmware
- backup_compression_algorithm
- ceph_data_pool_name
- container_syscall_intercept_mount
- compression_squashfs
- container_raw_mount
- container_nic_routed
- container_syscall_intercept_mount_fuse
- container_disk_ceph
- virtual-machines
- image_profiles
- clustering_architecture
- resources_disk_id
- storage_lvm_stripes
- vm_boot_priority
- unix_hotplug_devices
- api_filtering
- instance_nic_network
- clustering_sizing
- firewall_driver
- projects_limits
- container_syscall_intercept_hugetlbfs
- limits_hugepages
- container_nic_routed_gateway
- projects_restrictions
- custom_volume_snapshot_expiry
- volume_snapshot_scheduling
- trust_ca_certificates
- snapshot_disk_usage
- clustering_edit_roles
- container_nic_routed_host_address
- container_nic_ipvlan_gateway
- resources_usb_pci
- resources_cpu_threads_numa
- resources_cpu_core_die
- api_os
- container_nic_routed_host_table
- container_nic_ipvlan_host_table
- container_nic_ipvlan_mode
- resources_system
- images_push_relay
- network_dns_search
- container_nic_routed_limits
- instance_nic_bridged_vlan
- network_state_bond_bridge
- usedby_consistency
- custom_block_volumes
- clustering_failure_domains
- resources_gpu_mdev
- console_vga_type
- projects_limits_disk
- network_type_macvlan
- network_type_sriov
- container_syscall_intercept_bpf_devices
- network_type_ovn
- projects_networks
- projects_networks_restricted_uplinks
- custom_volume_backup
- backup_override_name
- storage_rsync_compression
- network_type_physical
- network_ovn_external_subnets
- network_ovn_nat
- network_ovn_external_routes_remove
- tpm_device_type
- storage_zfs_clone_copy_rebase
- gpu_mdev
- resources_pci_iommu
- resources_network_usb
- resources_disk_address
- network_physical_ovn_ingress_mode
- network_ovn_dhcp
- network_physical_routes_anycast
- projects_limits_instances
- network_state_vlan
- instance_nic_bridged_port_isolation
- instance_bulk_state_change
- network_gvrp
- instance_pool_move
- gpu_sriov
- pci_device_type
- storage_volume_state
- network_acl
- migration_stateful
- disk_state_quota
- storage_ceph_features
- projects_compression
- projects_images_remote_cache_expiry
- certificate_project
- network_ovn_acl
- projects_images_auto_update
- projects_restricted_cluster_target
- images_default_architecture
- network_ovn_acl_defaults
- gpu_mig
- project_usage
- network_bridge_acl
- warnings
- projects_restricted_backups_and_snapshots
- clustering_join_token
- clustering_description
- server_trusted_proxy
- clustering_update_cert
- storage_api_project
- server_instance_driver_operational
- server_supported_storage_drivers
- event_lifecycle_requestor_address
- resources_gpu_usb
- clustering_evacuation
- network_ovn_nat_address
- network_bgp
- network_forward
- custom_volume_refresh
- network_counters_errors_dropped
- metrics
- image_source_project
- clustering_config
- network_peer
- linux_sysctl
- network_dns
- ovn_nic_acceleration
- certificate_self_renewal
- instance_project_move
- storage_volume_project_move
- cloud_init
- network_dns_nat
- database_leader
- instance_all_projects
- clustering_groups
- ceph_rbd_du
- instance_get_full
- qemu_metrics
- gpu_mig_uuid
- event_project
- clustering_evacuation_live
- instance_allow_inconsistent_copy
- network_state_ovn
- storage_volume_api_filtering
- image_restrictions
- storage_zfs_export
- network_dns_records
- storage_zfs_reserve_space
- network_acl_log
- storage_zfs_blocksize
- metrics_cpu_seconds
- instance_snapshot_never
- certificate_token
- instance_nic_routed_neighbor_probe
- event_hub
- agent_nic_config
- projects_restricted_intercept
- metrics_authentication
- images_target_project
- images_all_projects
- cluster_migration_inconsistent_copy
- cluster_ovn_chassis
- container_syscall_intercept_sched_setscheduler
- storage_lvm_thinpool_metadata_size
- storage_volume_state_total
- instance_file_head
- instances_nic_host_name
- image_copy_profile
- container_syscall_intercept_sysinfo
- clustering_evacuation_mode
- resources_pci_vpd
- qemu_raw_conf
- storage_cephfs_fscache
- network_load_balancer
- vsock_api
- instance_ready_state
- network_bgp_holdtime
- storage_volumes_all_projects
- metrics_memory_oom_total
- storage_buckets
- storage_buckets_create_credentials
- metrics_cpu_effective_total
- projects_networks_restricted_access
- storage_buckets_local
- loki
- acme
- internal_metrics
- cluster_join_token_expiry
- remote_token_expiry
- init_preseed
- storage_volumes_created_at
- cpu_hotplug
- projects_networks_zones
- network_txqueuelen
- cluster_member_state
- instances_placement_scriptlet
- storage_pool_source_wipe
- zfs_block_mode
- instance_generation_id
- disk_io_cache
- amd_sev
- storage_pool_loop_resize
- migration_vm_live
- ovn_nic_nesting
- oidc
- network_ovn_l3only
- ovn_nic_acceleration_vdpa
- cluster_healing
- instances_state_total
- auth_user
- security_csm
- instances_rebuild
- numa_cpu_placement
- custom_volume_iso
- network_allocations
- zfs_delegate
- storage_api_remote_volume_snapshot_copy
- operations_get_query_all_projects
- metadata_configuration
- syslog_socket
- event_lifecycle_name_and_project
- instances_nic_limits_priority
- disk_initial_volume_configuration
- operation_wait
- image_restriction_privileged
- cluster_internal_custom_volume_copy
- disk_io_bus
- storage_cephfs_create_missing
- instance_move_config
- ovn_ssl_config
- certificate_description
- disk_io_bus_virtio_blk
- loki_config_instance
- instance_create_start
- clustering_evacuation_stop_options
- boot_host_shutdown_action
- agent_config_drive
- network_state_ovn_lr
- image_template_permissions
- storage_bucket_backup
- storage_lvm_cluster
- shared_custom_block_volumes
- auth_tls_jwt
- oidc_claim
- device_usb_serial
- numa_cpu_balanced
- image_restriction_nesting
- network_integrations
- instance_memory_swap_bytes
- network_bridge_external_create
- network_zones_all_projects
- storage_zfs_vdev
- container_migration_stateful
- profiles_all_projects
- instances_scriptlet_get_instances
- instances_scriptlet_get_cluster_members
- instances_scriptlet_get_project
- network_acl_stateless
- instance_state_started_at
- networks_all_projects
- network_acls_all_projects
- storage_buckets_all_projects
- resources_load
- instance_access
- project_access
- projects_force_delete
- resources_cpu_flags
- disk_io_bus_cache_filesystem
- instances_lxcfs_per_instance
- disk_volume_subpath
- projects_limits_disk_pool
- network_ovn_isolated
- qemu_raw_qmp
- network_load_balancer_health_check
- oidc_scopes
- network_integrations_peer_name
- qemu_scriptlet
- instance_auto_restart
- storage_lvm_metadatasize
- ovn_nic_promiscuous
- ovn_nic_ip_address_none
api_status: stable
api_version: "1.0"
auth: trusted
public: false
auth_methods:
- tls
auth_user_name: root
auth_user_method: unix
environment:
 addresses: []
 architectures:
 - x86_64
 - i686
 certificate: |
   -----BEGIN CERTIFICATE-----
   MIICBDCCAYqgAwIBAgIRAOTBGTHOAQsWASH92Q6LV6kwCgYIKoZIzj0EAwMwMzEZ
   MBcGA1UEChMQTGludXggQ29udGFpbmVyczEWMBQGA1UEAwwNcm9vdEByZXByb3hy
   bzAeFw0yNDEwMTgxODQ3MzBaFw0zNDEwMTYxODQ3MzBaMDMxGTAXBgNVBAoTEExp
   bnV4IENvbnRhaW5lcnMxFjAUBgNVBAMMDXJvb3RAcmVwcm94cm8wdjAQBgcqhkjO
   PQIBBgUrgQQAIgNiAARTavXyKixomMCPREGf/BerhxZeW1EFD/oNSpjXp+HmdGNq
   6CRx1+b3XPxcEHIXbo+xHFz9XIYKU6N+psvLHHKMVRf1pdbKKBpIUrMJ8E/bmqeC
   JGE0eFEZL9dT+70Rd1ejYjBgMA4GA1UdDwEB/wQEAwIFoDATBgNVHSUEDDAKBggr
   BgEFBQcDATAMBgNVHRMBAf8EAjAAMCsGA1UdEQQkMCKCCHJlcHJveHJvhwR/AAAB
   hxAAAAAAAAAAAAAAAAAAAAABMAoGCCqGSM49BAMDA2gAMGUCMQD5xFpYiE4I5TsT
   Oc0CVjenxUdoqd52m/2i+9SrIaAy4r7WMdO7LWTy5yfHGfxd0xkCMAaoFRb7/UBo
   40x20pwYSV9FMXIAooAcglRKRWedW4QcYD7gHZs+zvaeqZ0e8oiIjQ==
   -----END CERTIFICATE-----
 certificate_fingerprint: 2b7b7aa4a444821d1d4f151524acc41184b70bd966c07872781a23d16f407109
 driver: lxc | qemu
 driver_version: 6.0.2 | 9.0.2
 firewall: nftables
 kernel: Linux
 kernel_architecture: x86_64
 kernel_features:
   idmapped_mounts: "true"
   netnsid_getifaddrs: "true"
   seccomp_listener: "true"
   seccomp_listener_continue: "true"
   uevent_injection: "true"
   unpriv_binfmt: "true"
   unpriv_fscaps: "true"
 kernel_version: 6.8.0-47-generic
 lxc_features:
   cgroup2: "true"
   core_scheduling: "true"
   devpts_fd: "true"
   idmapped_mounts_v2: "true"
   mount_injection_file: "true"
   network_gateway_device_route: "true"
   network_ipvlan: "true"
   network_l2proxy: "true"
   network_phys_macvlan_mtu: "true"
   network_veth_router: "true"
   pidfd: "true"
   seccomp_allow_deny_syntax: "true"
   seccomp_notify: "true"
   seccomp_proxy_send_notify_fd: "true"
 os_name: Ubuntu
 os_version: "24.04"
 project: default
 server: incus
 server_clustered: false
 server_event_mode: full-mesh
 server_name: reproxro
 server_pid: 2003
 server_version: 6.0.2
 storage: dir
 storage_version: "1"
 storage_supported_drivers:
 - name: dir
   version: "1"
   remote: false

Issue description

When using systemd-nspawn in a container that has security.privileged and security.nesting set to true, nspawn still fails with a cryptic error:

Failed to mount sysfs (type sysfs) on /sys/full (MS_RDONLY|MS_NOSUID|MS_NODEV|MS_NOEXEC ""): No such file or directory

This is caused by systemd-nspawn trying to mkdir a directory and failing:

https://github.com/systemd/systemd/blob/main/src/nspawn/nspawn-mount.c#L474-L478

The mkdir itself fails because one of the AppArmor rules here matches /sys/full and blocks it: https://github.com/lxc/incus/blob/main/internal/server/apparmor/instance_lxc.go#L332-L426

Note that at that point /sys is a tmpfs, /sys/full is a directory created on it, and later (I think) /sys/full is mounted over /sys.

I'm not sure whether this is something Incus should fix, or systemd.

Steps to reproduce

(For a full reproduction, use a noble Ubuntu desktop VM with Zabbly Incus, but it should work anywhere with Incus and working AppArmor.)

  1. incus launch images:nixos/unstable repronspawn -c security.privileged=true -c security.nesting=true
  2. incus exec repronspawn bash
  3. systemd-nspawn --keep-unit -M dokuwiki -D /tmp --private-network --network-veth --notify-ready=yes --kill-signal=SIGRTMIN+3 --bind-ro=/nix/store --bind-ro=/nix/var/nix/db --bind-ro=/nix/var/nix/daemon-socket --link-journal=try-guest

This will produce the error:


░ Spawning container dokuwiki on /tmp.
░ Press Ctrl-] three times within 1s to kill container.
Failed to mount sysfs (type sysfs) on /sys/full (MS_RDONLY|MS_NOSUID|MS_NODEV|MS_NOEXEC ""): No such file or directory
Failed to register machine: Failed to determine unit of process 346 : No such device or address

The command above is already just enough of the nspawn invocation to trigger the error: Failed to mount sysfs (type sysfs) on /sys/full (MS_RDONLY|MS_NOSUID|MS_NODEV|MS_NOEXEC ""): No such file or directory

Information to attach

  • Any relevant kernel output (dmesg)
  • Container log (incus info NAME --show-log)
  • Container configuration (incus config show NAME --expanded)
  • Main daemon log (at /var/log/incus/incusd.log)
  • Output of the client with --debug
  • Output of the daemon with --debug (alternatively output of incus monitor --pretty while reproducing the issue)
@mkg20001 (Contributor, Author)

This happens on both LTS 6.0.2 and non-LTS 6.6.

@mkg20001 (Contributor, Author)

Systemd issue: systemd/systemd#34836

@stgraber (Member)

This is probably an issue due to the use of security.privileged=true. Privileged containers are extremely dangerous and have to come with much more restrictive AppArmor rules, which then cause a whole lot of problems.

In general you should avoid those containers like the plague, especially in a world where we also support just running a virtual machine for such cases.

In an unprivileged container, this still confusingly fails with:

[root@nixos:~]# systemd-nspawn --keep-unit -M dokuwiki -D /tmp --private-network --network-veth --notify-ready=yes --kill-signal=SIGRTMIN+3 --bind-ro=/nix/store --bind-ro=/nix/var/nix/db --bind-ro=/nix/var/nix/daemon-socket --link-journal=try-guest
░ Spawning container dokuwiki on /tmp.
░ Press Ctrl-] three times within 1s to kill container.
Failed to mount proc (type proc) on /proc (MS_NOSUID|MS_NODEV|MS_NOEXEC ""): Operation not permitted
Child died too early

This is pretty confusing because, as far as I can tell, it's just the equivalent of:

[root@nixos:~]# unshare -m
[root@nixos:~]# mount -t proc proc /proc -o nosuid,nodev,noexec
[root@nixos:~]# 

There are no AppArmor mount restrictions on unprivileged containers, so it's not AppArmor preventing this action when run by nspawn, but the kernel itself.
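That claim is easy to sanity-check from inside the container. The following uses generic Linux interfaces, not anything Incus-specific; the dmesg pattern is an assumption about AppArmor's usual log format:

```shell
# Show the AppArmor (or other LSM) label confining the current process;
# "unconfined" means no profile is mediating it.
label=$(cat /proc/self/attr/current 2>/dev/null)
echo "${label:-unconfined}"

# On the host, an actual AppArmor denial would also show up in the kernel
# log, e.g.:  dmesg | grep -i 'apparmor.*denied'
```

If the label reads "unconfined" (and dmesg shows no denial), the refusal really is coming from the kernel's own checks rather than from mediation.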

Actually, looking at a full strace of nspawn, it looks like it's running under the equivalent of:

[root@nixos:~]# unshare -muip -f
[root@nixos:~]# mount -t proc proc /proc -o nosuid,nodev,noexec
[root@nixos:~]# 

Yet this still works when done manually, so something else done as part of nspawn's container initialization is causing the kernel to refuse to mount /proc.
It could be related to the rootfs setup (mounts + pivot_root) or to mount propagation.
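Mount propagation in particular can be compared between the manual unshare environment and nspawn's straight from the mount table; a minimal check (generic /proc interface, runs unprivileged):

```shell
# Field 5 of /proc/self/mountinfo is the mount point; the optional fields
# that follow carry propagation state: "shared:N" means shared propagation,
# "master:N" means a slave mount, and the absence of both means private.
awk '$5 == "/proc" {print}' /proc/self/mountinfo
```

Running this both in the manual unshare shell and in whatever namespace nspawn sets up would show whether the propagation flags on /proc differ between the two.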

What's for sure is that I wouldn't expect the systemd folks to care at all about the privileged container use case; it's running under all kinds of weird AppArmor policies and tricks.

The unprivileged container case is a bit cleaner, so if it's possible to track down the exact condition that's causing nspawn to fail to mount /proc, it may be possible to have it tweak that behavior slightly in such an environment.

@stgraber stgraber added the Incomplete Waiting on more information from reporter label Oct 19, 2024
@stgraber (Member)

Marking as incomplete on our side. In an unprivileged environment we don't do any mount mediation or similar overrides, so systemd-nspawn failing there likely means it fails in any unprivileged container, whether created by Incus or something else.

Whether this is something upstream systemd-nspawn is interested in supporting will be up to them. If not, we conveniently also support running NixOS as a VM on Incus, which should work just fine.

@cyphar (Member) commented Oct 22, 2024

@stgraber The procfs mount failing like that is usually due to mnt_too_revealing checks. I would guess it's because of lxcfs, though since nested containers usually work in Incus I'm a little surprised...
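The overmounts behind that check can be listed from inside the container; lxcfs typically covers files such as /proc/cpuinfo and /proc/meminfo, and it is those covering mounts that trip the kernel's mnt_too_revealing() check when a fresh procfs is mounted from a less-privileged mount namespace. A quick way to see them (generic /proc interface, nothing Incus-specific):

```shell
# Print every mount point sitting below /proc; each hit is a covering
# mount (lxcfs files, masked paths, etc.) that can make a freshly mounted
# procfs "too revealing" for a less-privileged mount namespace.
hits=$(awk '$5 ~ /^\/proc\// {print $5}' /proc/self/mountinfo)
echo "${hits:-no overmounts below /proc}"
```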

@stgraber (Member)

Yeah, I did try uncovering /proc; that wasn't it.

And indeed when nesting is enabled we make sure that an uncovered version of /proc and /sys is present in the mount table (hidden in /dev) so that the overmounting check doesn't block us.
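Those uncovered copies can be spotted from inside a nesting-enabled container by scanning the mount table for proc or sysfs instances mounted under /dev. This is a sketch; the exact /dev location follows LXC's usual convention and is an assumption here, not something confirmed in this thread:

```shell
# Walk /proc/self/mountinfo and print any proc or sysfs instance mounted
# somewhere under /dev; in a nesting-enabled container these would be the
# uncovered copies kept around for the kernel's visibility check.
found=$(awk '{
  fstype = ""
  # fields before the "-" separator are per-mount; the field right after
  # the separator is the filesystem type
  for (i = 7; i <= NF; i++) if ($i == "-") { fstype = $(i + 1); break }
  if ($5 ~ /^\/dev\// && (fstype == "proc" || fstype == "sysfs")) print $5
}' /proc/self/mountinfo)
echo "${found:-none found (not a nesting-enabled container)}"
```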
