feat(cgroup): implement cgroup MVP #1826
Vitus213 wants to merge 35 commits into DragonOS-Community:master
Conversation
Vitus213
commented
Mar 23, 2026
- Implement the cgroup v2 filesystem
- Implement the cgroup tree
- Implement per-pid relative views of the cgroup hierarchy
- Ensure processes are constrained by the cgroup pids scope from creation
- derive Rust components from kernel/rust-toolchain.toml
- auto-add required components (cargo/rustc/rust-std) when missing
- emit trace warning when required components are absent in toml
- prioritize ${rust-toolchain}/bin in nix run yolo and devShell shellHook
- document fixed-output sha256 update workflow
- keep fixed-output hash strategy for reproducibility
This avoids cargo/rustc drift caused by PATH/toolchain mixing across Nix entrypoints.
# Conflicts:
#   .agents/skills/bug-hunter/SKILL.md
#   .agents/skills/bug-hunter/scripts/debate_picker.py
#   .agents/skills/bug-hunter/scripts/redact_sensitive.py
#   .agents/skills/bug-hunter/scripts/render_report.py
#   .agents/skills/bug-hunter/scripts/run_pipeline.py
#   .agents/skills/bug-hunter/scripts/semantic_bucket.py
#   .agents/skills/bug-hunter/scripts/shuffle_diff.py
#   .agents/skills/bug-hunter/scripts/update_resolution_history.py
#   .agents/skills/bug-hunter/scripts/weighted_vote.py
#   .agents/skills/bug-hunter/subskills/bug-hunter-stage1-input-randomization/SKILL.md
#   .agents/skills/bug-hunter/subskills/bug-hunter-stage2-parallel-review/SKILL.md
#   .agents/skills/bug-hunter/subskills/bug-hunter-stage3-evidence-fusion/SKILL.md
#   .agents/skills/bug-hunter/subskills/bug-hunter-stage4-consensus-judge/SKILL.md
#   .gitignore
#   kernel/src/filesystem/vfs/mount.rs
#   user/apps/default.nix
- Add agents.md, which enforces responding in Chinese
@codex review
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: b54f738833
```rust
let fs_dir = ensure_dir(&sys, "fs", InodeMode::from_bits_truncate(0o755))?;
let cgroup_dir = ensure_dir(&fs_dir, "cgroup", InodeMode::from_bits_truncate(0o755))?;

let cgroup_fs = Cgroup2Fs::new(cgroup_root().root(), false);
```
Make the default cgroup2 mount respect the current cgroup namespace root
This pins the initial /sys/fs/cgroup mount to the global root cgroup_root().root(), and that mount is subsequently reused by every process. When a process runs unshare(CLONE_NEWCGROUP) but does not remount cgroup2, it can still access and manipulate cgroups outside its namespace root through this global tree (the cgroup.procs write path only performs the extra namespace-boundary check under nsdelegate). This breaks the isolation semantics of cgroup namespaces and allows migration or management beyond the namespace root.
```rust
pub fn subtree_task_count(self: &Arc<Self>) -> usize {
    let mut total = self.tasks.read().len();
    for child in self.children() {
        total = total.saturating_add(child.subtree_task_count());
    }
```
Design document for a cgroup v2 performance benchmark tool that:
- Measures filesystem operations (mkdir/rmdir/read/write)
- Measures process migration overhead
- Measures pids controller fork overhead
- Outputs JSON for DragonOS/Linux comparison

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Problems:
1. Task exit lacked cgroup_accounting_lock protection, making its lock order inconsistent with the cgroup.procs write path
2. set_task_cgroup_node() acquired tasks.write() while holding task_cgroup.write(), violating the lock hierarchy

Fixes:
- Guard the cgroup operations in exit_thread() and release() with cgroup_accounting_lock
- Restructure set_task_cgroup_node() to avoid nested locks: take a read lock to fetch the old node, release it, then perform the migration

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
This branch focuses on the cgroup work; the poll-related changes should be handled separately.
… into feat/cgroup
```rust
    self.pids_events_max.fetch_add(1, Ordering::Relaxed);
}

pub fn subtree_task_count(self: &Arc<Self>) -> usize {
```
[Critical] pids counting via full subtree traversal — O(n) performance issue
subtree_task_count() walks the entire cgroup subtree and holds each node's tasks read lock on every invocation. This is called from:
- pids.current reads
- pids.max checks in cgroup_can_fork_in() (the fork hot path)
- cgroup_migrate_vet_dst_with_src()
In Linux kernel/cgroup/pids.c, an atomic64_t counter per cgroup node provides hierarchical accounting:
- pids_try_charge(): walks the ancestor chain with atomic64_add_return, O(depth)
- pids_current_read(): a direct atomic64_read(&pids->counter), O(1)
- pids_uncharge(): atomic decrement along the ancestor chain on exit
Suggestion: Add counter: AtomicI64 to CgroupNode, charge/uncharge along the ancestor chain on task add/remove.
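The suggested counter scheme can be sketched as follows. This is a minimal user-space Rust sketch of the charge/uncharge idea, not DragonOS code: PidsNode, try_charge, and uncharge are hypothetical names, and a real implementation would hook into add_task/remove_task and pids.max parsing.

```rust
use std::sync::Arc;
use std::sync::atomic::{AtomicI64, Ordering};

// Hypothetical minimal node: just the fields needed for pids accounting.
struct PidsNode {
    counter: AtomicI64,              // tasks in this cgroup and all descendants
    limit: i64,                      // pids.max ("max" would map to i64::MAX)
    parent: Option<Arc<PidsNode>>,
}

// In the spirit of Linux pids_try_charge(): optimistically charge the whole
// ancestor chain, rolling back as soon as one node would exceed its limit.
fn try_charge(node: &Arc<PidsNode>, num: i64) -> Result<(), ()> {
    let mut cur = Some(node.clone());
    let mut charged: Vec<Arc<PidsNode>> = Vec::new();
    while let Some(n) = cur {
        let new = n.counter.fetch_add(num, Ordering::Relaxed) + num;
        if new > n.limit {
            // Revert this node and every node already charged below it.
            n.counter.fetch_sub(num, Ordering::Relaxed);
            for c in &charged {
                c.counter.fetch_sub(num, Ordering::Relaxed);
            }
            return Err(());
        }
        cur = n.parent.clone();
        charged.push(n);
    }
    Ok(())
}

// In the spirit of pids_uncharge(): O(depth) decrement on task exit.
fn uncharge(node: &Arc<PidsNode>, num: i64) {
    let mut cur = Some(node.clone());
    while let Some(n) = cur {
        n.counter.fetch_sub(num, Ordering::Relaxed);
        cur = n.parent.clone();
    }
}

fn main() {
    let root = Arc::new(PidsNode { counter: AtomicI64::new(0), limit: 2, parent: None });
    let child = Arc::new(PidsNode { counter: AtomicI64::new(0), limit: 10, parent: Some(root.clone()) });
    assert!(try_charge(&child, 1).is_ok());
    assert!(try_charge(&child, 1).is_ok());
    assert!(try_charge(&child, 1).is_err());              // root's limit of 2 is hit
    assert_eq!(root.counter.load(Ordering::Relaxed), 2);  // failed charge fully rolled back
    uncharge(&child, 1);
    assert_eq!(child.counter.load(Ordering::Relaxed), 1); // pids.current is now a single load
}
```

With this shape, subtree_task_count() becomes a single atomic load and the fork path pays only O(depth) per charge.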
```rust
pub fn cgroup_migrate_vet_dst(dst: &Arc<CgroupNode>) -> Result<(), SystemError> {
    // v2 no-internal-process minimal constraint:
    // refuse the migration if the target cgroup already has processes and has subtree_control enabled.
    if dst.has_tasks() && !dst.subtree_control().is_empty() {
```
[Critical] no-internal-process constraint check deviates from Linux semantics
Current check: has_tasks() && !subtree_control().is_empty() — requires BOTH conditions.
Linux kernel/cgroup/cgroup.c:2632-2634:

```c
/* apply no-internal-process constraint */
if (dst_cgrp->subtree_control)
	return -EBUSY;
```

Linux only checks whether subtree_control is non-zero, without checking has_tasks(). The meaning is: once a cgroup has subtree_control enabled, no process may be migrated into it (regardless of whether it currently has tasks).
The current implementation allows migrating processes into an empty cgroup that has controllers enabled, violating the cgroup v2 no-internal-process constraint.
Suggested fix:

```rust
if !dst.subtree_control().is_empty() {
    return Err(SystemError::EBUSY);
}
```
```rust
// Must hold cgroup_accounting_lock to avoid deadlocking against cgroup.procs writes
{
    let _cgroup_guard = crate::cgroup::cgroup_accounting_lock().lock();
    pcb.task_cgroup_node().remove_task(raw_pid);
```
[Critical] Double remove_task — task removed from cgroup in both do_exit and release
This line removes the task from its cgroup on exit. However, release() (around line 872) also calls remove_task(pid) for the same task. The same PID gets removed twice.
While HashSet::remove on a missing element is a no-op today, this causes problems:
- If refactored to atomic counters (recommended), double-decrement causes underflow
- Indicates unclear cgroup lifecycle ownership
Linux calls pids_uncharge() only once in the pids_release() callback (via css_release).
Suggestion: Keep remove_task only here in do_exit (matching Linux semantics — zombies don't appear in cgroup.procs), and remove the duplicate call in release().
```rust
}
let moved_tasks = to_move.len();

let _cgroup_guard = cgroup_accounting_lock().lock();
```
[Critical] SpinLock held while taking RwLock write during task migration loop
cgroup_accounting_lock() is a SpinLock. Inside this critical section, the loop calls set_task_cgroup_node() which acquires a RwLock<TaskCgroupRef> write lock.
If RwLock can sleep (e.g., blocking rwlock implementation), this would deadlock under preemption-disabled / IRQ-disabled context imposed by SpinLock.
Additionally, set_task_cgroup_node() internally:
1. Takes a read lock on task_cgroup → gets the old node → releases the read lock
2. Calls old.remove_task() and new.add_task() (with no lock on task_cgroup)
3. Takes a write lock on task_cgroup → updates the reference

Between steps 1 and 3, another CPU could read a stale cgroup reference via task_cgroup_node(). The SpinLock is supposed to prevent this, but it needs to be verified that the RwLock is spin-based and safe to nest under a SpinLock.
Suggestion: Verify RwLock implementation is non-sleeping, or switch cgroup_accounting_lock to a Mutex.
```rust
    .clone();
let src_node = pcb.task_cgroup_node();
let _cgroup_guard = cgroup_accounting_lock().lock();
cgroup_can_fork_in(&charge_node, 1)?;
```
[Medium] TOCTOU gap between cgroup_can_fork_in check and actual add_task
The cgroup_accounting_lock guard is dropped at the end of this block, but add_pcb() (which calls add_task()) runs after the lock is released. Between the check passing and the task actually being added, another fork could also pass the same check, exceeding pids.max.
Linux avoids this by using pids_try_charge() which atomically increments the counter during the check — the counter reflects the charge immediately. If the fork later fails, pids_cancel_fork() does an uncharge to revert.
Suggestion: Either extend the lock scope to cover add_pcb(), or (better) adopt the atomic counter charge/uncharge pattern.
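The "charge is the check" pattern can be sketched with a single counter. This is a minimal sketch, assuming a plain AtomicI64 stands in for the per-cgroup pids counter; try_charge and cancel_charge are hypothetical names echoing Linux's pids_try_charge/pids_cancel_fork:

```rust
use std::sync::atomic::{AtomicI64, Ordering};

// The increment and the limit check form one atomic step, so two concurrent
// forks cannot both observe "current < max" and then jointly exceed pids.max.
fn try_charge(counter: &AtomicI64, max: i64) -> Result<(), ()> {
    let mut cur = counter.load(Ordering::Relaxed);
    loop {
        if cur >= max {
            return Err(()); // pids.max would be exceeded
        }
        match counter.compare_exchange_weak(cur, cur + 1, Ordering::Relaxed, Ordering::Relaxed) {
            Ok(_) => return Ok(()),
            Err(observed) => cur = observed, // raced with another fork; retry
        }
    }
}

// In the spirit of pids_cancel_fork(): revert the charge if the fork fails later.
fn cancel_charge(counter: &AtomicI64) {
    counter.fetch_sub(1, Ordering::Relaxed);
}

fn main() {
    let counter = AtomicI64::new(0);
    assert!(try_charge(&counter, 2).is_ok());
    assert!(try_charge(&counter, 2).is_ok());
    assert!(try_charge(&counter, 2).is_err()); // third concurrent fork is refused
    cancel_charge(&counter);                   // one fork failed elsewhere
    assert!(try_charge(&counter, 2).is_ok());  // its slot is available again
}
```

Because the counter reflects the charge immediately, the window between check and add_task disappears without widening any lock scope.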
```rust
let child = Cgroup2Inode::lookup_child(&this, name)?;

{
    let inner = child.inner.lock();
```
[Medium] Potential ABBA deadlock in rmdir — child lock held while acquiring parent lock
Here child.inner.lock() is acquired first, then this.inner.lock() (parent) is acquired while still holding the child lock. Lock order: child → parent.
However, in lookup_child(), create(), list() and other operations, parent.inner.lock() is acquired first, then child operations are performed. Lock order: parent → child.
This is a classic ABBA deadlock pattern. While rmdir checks that the child has no tasks/children (reducing concurrent access likelihood), it's still possible with concurrent operations on different children of the same parent.
Suggestion: Restructure rmdir to extract needed info from child.inner first, release child lock, then acquire parent lock.
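The restructuring can be sketched like this. A minimal user-space sketch with std Mutex; Node, Inner, and rmdir are hypothetical stand-ins for Cgroup2Inode and its locking, not the real API:

```rust
use std::sync::{Arc, Mutex};

struct Inner {
    children: Vec<String>,
    tasks: usize,
}

struct Node {
    inner: Mutex<Inner>,
}

fn rmdir(parent: &Arc<Node>, child: &Arc<Node>, name: &str) -> Result<(), &'static str> {
    // 1. Snapshot what we need from the child, then drop its lock immediately,
    //    so we never hold child -> parent (the reverse of lookup/create/list).
    let removable = {
        let c = child.inner.lock().unwrap();
        c.children.is_empty() && c.tasks == 0
    }; // child lock released here
    if !removable {
        return Err("EBUSY");
    }
    // 2. Only now take the parent lock to unlink the entry. A real kernel
    //    implementation would re-validate the child at this point, since it
    //    may have gained tasks or children between steps 1 and 2.
    let mut p = parent.inner.lock().unwrap();
    p.children.retain(|n| n != name);
    Ok(())
}

fn main() {
    let parent = Arc::new(Node { inner: Mutex::new(Inner { children: vec!["a".into()], tasks: 0 }) });
    let busy = Arc::new(Node { inner: Mutex::new(Inner { children: vec![], tasks: 3 }) });
    let empty = Arc::new(Node { inner: Mutex::new(Inner { children: vec![], tasks: 0 }) });
    assert!(rmdir(&parent, &busy, "a").is_err());  // still has tasks
    assert!(rmdir(&parent, &empty, "a").is_ok());
    assert!(parent.inner.lock().unwrap().children.is_empty());
}
```

Keeping every path on the same parent → child order removes the ABBA cycle; the cost is the re-validation noted in step 2.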
```rust
// 2. Resolve the fd; currently only pidfd is supported
let current = ProcessManager::current_pcb();
if !current.cred().has_cap_sys_admin() {
```
[Medium] Blanket CAP_SYS_ADMIN check in setns is overly restrictive
This unconditional CAP_SYS_ADMIN check at the ksys_setns entry point blocks all namespace types. However, in Linux, setns(CLONE_NEWUSER) does not require CAP_SYS_ADMIN — unprivileged users can join a user namespace.
Linux checks permissions per-namespace type in each namespace's install callback (e.g., cgroupns_install checks ns_capable(CAP_SYS_ADMIN), but userns_install has different rules).
Suggestion: Remove this top-level blanket check. Instead, rely on the per-namespace permission checks that already exist in the individual branches below (e.g., can_setns_cgroup, can_setns_target_userns).
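The per-namespace dispatch can be sketched as follows. This is a hypothetical sketch: NsType, Cred, and can_setns are illustrative names (loosely mirroring the can_setns_* helpers the review mentions), not the DragonOS types.

```rust
// Route to a per-type permission check instead of one blanket gate.
#[derive(Clone, Copy)]
enum NsType {
    Cgroup,
    User,
}

struct Cred {
    cap_sys_admin: bool,
}

const EPERM: i32 = 1;

fn can_setns(ns: NsType, cred: &Cred) -> Result<(), i32> {
    match ns {
        // Like cgroupns_install: joining a cgroup namespace needs CAP_SYS_ADMIN.
        NsType::Cgroup => {
            if cred.cap_sys_admin { Ok(()) } else { Err(EPERM) }
        }
        // Like userns_install: no CAP_SYS_ADMIN required; the real kernel
        // applies its own separate rules for user namespaces.
        NsType::User => Ok(()),
    }
}

fn main() {
    let unprivileged = Cred { cap_sys_admin: false };
    let admin = Cred { cap_sys_admin: true };
    // An unprivileged task may join a user namespace...
    assert!(can_setns(NsType::User, &unprivileged).is_ok());
    // ...but a cgroup namespace still demands CAP_SYS_ADMIN.
    assert_eq!(can_setns(NsType::Cgroup, &unprivileged), Err(EPERM));
    assert!(can_setns(NsType::Cgroup, &admin).is_ok());
}
```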
```rust
/// Uses depth-limited recursion to prevent stack overflow.
/// Maximum recursion depth is set to 128 levels, which is far beyond any
/// reasonable cgroup hierarchy depth.
fn is_populated(cgroup: &Arc<CgroupNode>) -> bool {
```
[Medium] is_populated performs recursive subtree traversal on every cgroup.events read
Every read of cgroup.events triggers a full recursive walk of the cgroup subtree to determine the populated flag. This is O(n) where n is the number of descendant cgroups.
Linux maintains nr_populated_csets and nr_populated_children counters on each cgroup, propagating populated state changes up the ancestor chain when tasks are added/removed or child cgroups are created/destroyed. Reading populated is O(1).
Suggestion: Add a populated_count: AtomicUsize to CgroupNode, increment when a task joins an empty cgroup or a new non-empty child is created, decrement on the reverse. Read it directly in cgroup.events.
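The propagation the suggestion describes can be sketched like this. A minimal single-threaded Rust sketch, assuming updates happen under the existing accounting lock; Node, populated_count, and propagate_populated are hypothetical names:

```rust
use std::sync::Arc;
use std::sync::atomic::{AtomicUsize, Ordering};

// Node with an O(1) populated counter, in the spirit of Linux's
// nr_populated_children propagation.
struct Node {
    populated_count: AtomicUsize, // own tasks present + populated children
    parent: Option<Arc<Node>>,
}

// Call when this cgroup flips empty <-> non-empty. Propagation stops at the
// first ancestor whose own populated flag does not change.
fn propagate_populated(node: &Arc<Node>, became_populated: bool) {
    let mut cur = Some(node.clone());
    while let Some(n) = cur {
        let before = n.populated_count.load(Ordering::Relaxed);
        if became_populated {
            n.populated_count.fetch_add(1, Ordering::Relaxed);
            if before != 0 {
                break; // ancestor was already populated; its flag is unchanged
            }
        } else {
            n.populated_count.fetch_sub(1, Ordering::Relaxed);
            if before != 1 {
                break; // ancestor stays populated; its flag is unchanged
            }
        }
        cur = n.parent.clone();
    }
}

// The cgroup.events "populated" read becomes a single atomic load.
fn is_populated(node: &Node) -> bool {
    node.populated_count.load(Ordering::Relaxed) > 0
}

fn main() {
    let root = Arc::new(Node { populated_count: AtomicUsize::new(0), parent: None });
    let child = Arc::new(Node { populated_count: AtomicUsize::new(0), parent: Some(root.clone()) });
    propagate_populated(&child, true);   // first task joins the child
    assert!(is_populated(&child));
    assert!(is_populated(&root));        // state propagated upward
    propagate_populated(&child, false);  // last task leaves
    assert!(!is_populated(&child));
    assert!(!is_populated(&root));
}
```

Early termination at an unchanged ancestor keeps updates cheap even in wide hierarchies, while reads never traverse the tree.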
```rust
// Must hold cgroup_accounting_lock to avoid deadlocking against cgroup.procs writes
{
    let _cgroup_guard = crate::cgroup::cgroup_accounting_lock().lock();
    pcb.task_cgroup_node().remove_task(pid);
```
[Critical] Duplicate remove_task — see comment on line 734
This is the second remove_task call for the same PID. The first one already happens in do_exit (line 734). This should be removed to avoid double-removal issues, especially when migrating to atomic counters.