feat(cgroup): implement cgroup MVP #1826
Vitus213 wants to merge 35 commits into DragonOS-Community:master
Conversation
Vitus213
commented
Mar 23, 2026
- Implement the cgroup v2 filesystem
- Implement the cgroup tree
- Implement per-pid relative views of the cgroup hierarchy
- Ensure processes are constrained by the cgroup pids scope from creation
- derive Rust components from kernel/rust-toolchain.toml
- auto-add required components (cargo/rustc/rust-std) when missing
- emit trace warning when required components are absent in toml
- prioritize ${rust-toolchain}/bin in nix run yolo and devShell shellHook
- document fixed-output sha256 update workflow
- keep fixed-output hash strategy for reproducibility
This avoids cargo/rustc drift caused by PATH/toolchain mixing across Nix entrypoints.
# Conflicts:
#   .agents/skills/bug-hunter/SKILL.md
#   .agents/skills/bug-hunter/scripts/debate_picker.py
#   .agents/skills/bug-hunter/scripts/redact_sensitive.py
#   .agents/skills/bug-hunter/scripts/render_report.py
#   .agents/skills/bug-hunter/scripts/run_pipeline.py
#   .agents/skills/bug-hunter/scripts/semantic_bucket.py
#   .agents/skills/bug-hunter/scripts/shuffle_diff.py
#   .agents/skills/bug-hunter/scripts/update_resolution_history.py
#   .agents/skills/bug-hunter/scripts/weighted_vote.py
#   .agents/skills/bug-hunter/subskills/bug-hunter-stage1-input-randomization/SKILL.md
#   .agents/skills/bug-hunter/subskills/bug-hunter-stage2-parallel-review/SKILL.md
#   .agents/skills/bug-hunter/subskills/bug-hunter-stage3-evidence-fusion/SKILL.md
#   .agents/skills/bug-hunter/subskills/bug-hunter-stage4-consensus-judge/SKILL.md
#   .gitignore
#   kernel/src/filesystem/vfs/mount.rs
#   user/apps/default.nix
- Add agents.md, which enforces responding in Chinese
@codex review
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: b54f738833
```rust
let fs_dir = ensure_dir(&sys, "fs", InodeMode::from_bits_truncate(0o755))?;
let cgroup_dir = ensure_dir(&fs_dir, "cgroup", InodeMode::from_bits_truncate(0o755))?;

let cgroup_fs = Cgroup2Fs::new(cgroup_root().root(), false);
```
Make the default cgroup2 mount respect the current cgroup namespace root
This pins the initial /sys/fs/cgroup mount to the global root cgroup_root().root(), and that mount is subsequently reused by every process. When a process runs unshare(CLONE_NEWCGROUP) but does not remount cgroup2, it can still access and manipulate cgroups outside its namespace root through this global tree (the cgroup.procs write path only performs the extra namespace-boundary check under nsdelegate). This breaks the isolation semantics of cgroup namespaces and allows migration or management beyond the namespace root.
```rust
pub fn subtree_task_count(self: &Arc<Self>) -> usize {
    let mut total = self.tasks.read().len();
    for child in self.children() {
        total = total.saturating_add(child.subtree_task_count());
    }
```
Design document for a cgroup v2 performance benchmark tool that:
- Measures filesystem operations (mkdir/rmdir/read/write)
- Measures process migration overhead
- Measures pids controller fork overhead
- Outputs JSON for DragonOS/Linux comparison

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Problems:
1. Task exit lacked cgroup_accounting_lock protection, making its lock order inconsistent with the cgroup.procs write path
2. set_task_cgroup_node() acquired tasks.write() while holding task_cgroup.write(), violating the lock hierarchy

Fixes:
- Guard the cgroup operations in exit_thread() and release() with cgroup_accounting_lock
- Restructure set_task_cgroup_node() to avoid nested locks: take a read lock to fetch the old node, release it, then perform the migration

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
This branch focuses on the cgroup work; the poll-related changes should be handled separately.
… into feat/cgroup
```rust
    self.pids_events_max.fetch_add(1, Ordering::Relaxed);
}

pub fn subtree_task_count(self: &Arc<Self>) -> usize {
```
[Critical] pids counting via full subtree traversal — O(n) performance issue
subtree_task_count() walks the entire cgroup subtree and holds each node's tasks read lock on every invocation. This is called from:
- pids.current reads
- pids.max checks in cgroup_can_fork_in() (the fork hot path)
- cgroup_migrate_vet_dst_with_src()
In Linux kernel/cgroup/pids.c, an atomic64_t counter per cgroup node provides hierarchical accounting:
- pids_try_charge(): walks the ancestor chain with atomic64_add_return, O(depth)
- pids_current_read(): a direct atomic64_read(&pids->counter), O(1)
- pids_uncharge(): atomic decrement along the ancestor chain on exit
Suggestion: Add counter: AtomicI64 to CgroupNode, charge/uncharge along the ancestor chain on task add/remove.
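The suggested counter scheme can be sketched as follows. This is a minimal user-space Rust sketch of the charge/uncharge idea, not DragonOS code: PidsNode, try_charge, and uncharge are hypothetical names, and a real implementation would hook into add_task/remove_task and pids.max parsing.

```rust
use std::sync::Arc;
use std::sync::atomic::{AtomicI64, Ordering};

// Hypothetical minimal node: just the fields needed for pids accounting.
struct PidsNode {
    counter: AtomicI64,              // tasks in this cgroup and all descendants
    limit: i64,                      // pids.max ("max" would map to i64::MAX)
    parent: Option<Arc<PidsNode>>,
}

// In the spirit of Linux pids_try_charge(): optimistically charge the whole
// ancestor chain, rolling back as soon as one node would exceed its limit.
fn try_charge(node: &Arc<PidsNode>, num: i64) -> Result<(), ()> {
    let mut cur = Some(node.clone());
    let mut charged: Vec<Arc<PidsNode>> = Vec::new();
    while let Some(n) = cur {
        let new = n.counter.fetch_add(num, Ordering::Relaxed) + num;
        if new > n.limit {
            // Revert this node and every node already charged below it.
            n.counter.fetch_sub(num, Ordering::Relaxed);
            for c in &charged {
                c.counter.fetch_sub(num, Ordering::Relaxed);
            }
            return Err(());
        }
        cur = n.parent.clone();
        charged.push(n);
    }
    Ok(())
}

// In the spirit of pids_uncharge(): O(depth) decrement on task exit.
fn uncharge(node: &Arc<PidsNode>, num: i64) {
    let mut cur = Some(node.clone());
    while let Some(n) = cur {
        n.counter.fetch_sub(num, Ordering::Relaxed);
        cur = n.parent.clone();
    }
}

fn main() {
    let root = Arc::new(PidsNode { counter: AtomicI64::new(0), limit: 2, parent: None });
    let child = Arc::new(PidsNode { counter: AtomicI64::new(0), limit: 10, parent: Some(root.clone()) });
    assert!(try_charge(&child, 1).is_ok());
    assert!(try_charge(&child, 1).is_ok());
    assert!(try_charge(&child, 1).is_err());              // root's limit of 2 is hit
    assert_eq!(root.counter.load(Ordering::Relaxed), 2);  // failed charge fully rolled back
    uncharge(&child, 1);
    assert_eq!(child.counter.load(Ordering::Relaxed), 1); // pids.current is now a single load
}
```

With this shape, subtree_task_count() becomes a single atomic load and the fork path pays only O(depth) per charge.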
```rust
pub fn cgroup_migrate_vet_dst(dst: &Arc<CgroupNode>) -> Result<(), SystemError> {
    // v2 no-internal-process minimal constraint:
    // refuse the migration if the target cgroup already has processes and has subtree_control enabled.
    if dst.has_tasks() && !dst.subtree_control().is_empty() {
```
[Critical] no-internal-process constraint check deviates from Linux semantics
Current check: has_tasks() && !subtree_control().is_empty() — requires BOTH conditions.
Linux kernel/cgroup/cgroup.c:2632-2634:

```c
/* apply no-internal-process constraint */
if (dst_cgrp->subtree_control)
	return -EBUSY;
```

Linux only checks whether subtree_control is non-zero, without checking has_tasks(). The meaning is: once a cgroup has subtree_control enabled, no process may be migrated into it (regardless of whether it currently has tasks).
The current implementation allows migrating processes into an empty cgroup that has controllers enabled, violating the cgroup v2 no-internal-process constraint.
Suggested fix:

```rust
if !dst.subtree_control().is_empty() {
    return Err(SystemError::EBUSY);
}
```
```rust
// Must hold cgroup_accounting_lock to avoid deadlocking against cgroup.procs writes
{
    let _cgroup_guard = crate::cgroup::cgroup_accounting_lock().lock();
    pcb.task_cgroup_node().remove_task(raw_pid);
```
[Critical] Double remove_task — task removed from cgroup in both do_exit and release
This line removes the task from its cgroup on exit. However, release() (around line 872) also calls remove_task(pid) for the same task. The same PID gets removed twice.
While HashSet::remove on a missing element is a no-op today, this causes problems:
- If refactored to atomic counters (recommended), double-decrement causes underflow
- Indicates unclear cgroup lifecycle ownership
Linux calls pids_uncharge() only once in the pids_release() callback (via css_release).
Suggestion: Keep remove_task only here in do_exit (matching Linux semantics — zombies don't appear in cgroup.procs), and remove the duplicate call in release().
```rust
}
let moved_tasks = to_move.len();

let _cgroup_guard = cgroup_accounting_lock().lock();
```
[Critical] SpinLock held while taking RwLock write during task migration loop
cgroup_accounting_lock() is a SpinLock. Inside this critical section, the loop calls set_task_cgroup_node() which acquires a RwLock<TaskCgroupRef> write lock.
If RwLock can sleep (e.g., blocking rwlock implementation), this would deadlock under preemption-disabled / IRQ-disabled context imposed by SpinLock.
Additionally, set_task_cgroup_node() internally:
1. Takes a read lock on task_cgroup → gets the old node → releases the read lock
2. Calls old.remove_task() and new.add_task() (with no lock on task_cgroup)
3. Takes a write lock on task_cgroup → updates the reference

Between steps 1 and 3, another CPU could read a stale cgroup reference via task_cgroup_node(). The SpinLock is supposed to prevent this, but it needs to be verified that the RwLock is spin-based and safe to nest under a SpinLock.
Suggestion: Verify RwLock implementation is non-sleeping, or switch cgroup_accounting_lock to a Mutex.
```rust
    .clone();
let src_node = pcb.task_cgroup_node();
let _cgroup_guard = cgroup_accounting_lock().lock();
cgroup_can_fork_in(&charge_node, 1)?;
```
[Medium] TOCTOU gap between cgroup_can_fork_in check and actual add_task
The cgroup_accounting_lock guard is dropped at the end of this block, but add_pcb() (which calls add_task()) runs after the lock is released. Between the check passing and the task actually being added, another fork could also pass the same check, exceeding pids.max.
Linux avoids this by using pids_try_charge() which atomically increments the counter during the check — the counter reflects the charge immediately. If the fork later fails, pids_cancel_fork() does an uncharge to revert.
Suggestion: Either extend the lock scope to cover add_pcb(), or (better) adopt the atomic counter charge/uncharge pattern.
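The "charge is the check" pattern can be sketched with a single counter. This is a minimal sketch, assuming a plain AtomicI64 stands in for the per-cgroup pids counter; try_charge and cancel_charge are hypothetical names echoing Linux's pids_try_charge/pids_cancel_fork:

```rust
use std::sync::atomic::{AtomicI64, Ordering};

// The increment and the limit check form one atomic step, so two concurrent
// forks cannot both observe "current < max" and then jointly exceed pids.max.
fn try_charge(counter: &AtomicI64, max: i64) -> Result<(), ()> {
    let mut cur = counter.load(Ordering::Relaxed);
    loop {
        if cur >= max {
            return Err(()); // pids.max would be exceeded
        }
        match counter.compare_exchange_weak(cur, cur + 1, Ordering::Relaxed, Ordering::Relaxed) {
            Ok(_) => return Ok(()),
            Err(observed) => cur = observed, // raced with another fork; retry
        }
    }
}

// In the spirit of pids_cancel_fork(): revert the charge if the fork fails later.
fn cancel_charge(counter: &AtomicI64) {
    counter.fetch_sub(1, Ordering::Relaxed);
}

fn main() {
    let counter = AtomicI64::new(0);
    assert!(try_charge(&counter, 2).is_ok());
    assert!(try_charge(&counter, 2).is_ok());
    assert!(try_charge(&counter, 2).is_err()); // third concurrent fork is refused
    cancel_charge(&counter);                   // one fork failed elsewhere
    assert!(try_charge(&counter, 2).is_ok());  // its slot is available again
}
```

Because the counter reflects the charge immediately, the window between check and add_task disappears without widening any lock scope.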
```rust
let child = Cgroup2Inode::lookup_child(&this, name)?;

{
    let inner = child.inner.lock();
```
[Medium] Potential ABBA deadlock in rmdir — child lock held while acquiring parent lock
Here child.inner.lock() is acquired first, then this.inner.lock() (parent) is acquired while still holding the child lock. Lock order: child → parent.
However, in lookup_child(), create(), list() and other operations, parent.inner.lock() is acquired first, then child operations are performed. Lock order: parent → child.
This is a classic ABBA deadlock pattern. While rmdir checks that the child has no tasks/children (reducing concurrent access likelihood), it's still possible with concurrent operations on different children of the same parent.
Suggestion: Restructure rmdir to extract needed info from child.inner first, release child lock, then acquire parent lock.
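The restructuring can be sketched like this. A minimal user-space sketch with std Mutex; Node, Inner, and rmdir are hypothetical stand-ins for Cgroup2Inode and its locking, not the real API:

```rust
use std::sync::{Arc, Mutex};

struct Inner {
    children: Vec<String>,
    tasks: usize,
}

struct Node {
    inner: Mutex<Inner>,
}

fn rmdir(parent: &Arc<Node>, child: &Arc<Node>, name: &str) -> Result<(), &'static str> {
    // 1. Snapshot what we need from the child, then drop its lock immediately,
    //    so we never hold child -> parent (the reverse of lookup/create/list).
    let removable = {
        let c = child.inner.lock().unwrap();
        c.children.is_empty() && c.tasks == 0
    }; // child lock released here
    if !removable {
        return Err("EBUSY");
    }
    // 2. Only now take the parent lock to unlink the entry. A real kernel
    //    implementation would re-validate the child at this point, since it
    //    may have gained tasks or children between steps 1 and 2.
    let mut p = parent.inner.lock().unwrap();
    p.children.retain(|n| n != name);
    Ok(())
}

fn main() {
    let parent = Arc::new(Node { inner: Mutex::new(Inner { children: vec!["a".into()], tasks: 0 }) });
    let busy = Arc::new(Node { inner: Mutex::new(Inner { children: vec![], tasks: 3 }) });
    let empty = Arc::new(Node { inner: Mutex::new(Inner { children: vec![], tasks: 0 }) });
    assert!(rmdir(&parent, &busy, "a").is_err());  // still has tasks
    assert!(rmdir(&parent, &empty, "a").is_ok());
    assert!(parent.inner.lock().unwrap().children.is_empty());
}
```

Keeping every path on the same parent → child order removes the ABBA cycle; the cost is the re-validation noted in step 2.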
```rust
// 2. Resolve the fd; currently only pidfd is supported
let current = ProcessManager::current_pcb();
if !current.cred().has_cap_sys_admin() {
```
[Medium] Blanket CAP_SYS_ADMIN check in setns is overly restrictive
This unconditional CAP_SYS_ADMIN check at the ksys_setns entry point blocks all namespace types. However, in Linux, setns(CLONE_NEWUSER) does not require CAP_SYS_ADMIN — unprivileged users can join a user namespace.
Linux checks permissions per-namespace type in each namespace's install callback (e.g., cgroupns_install checks ns_capable(CAP_SYS_ADMIN), but userns_install has different rules).
Suggestion: Remove this top-level blanket check. Instead, rely on the per-namespace permission checks that already exist in the individual branches below (e.g., can_setns_cgroup, can_setns_target_userns).
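The per-namespace dispatch can be sketched as follows. This is a hypothetical sketch: NsType, Cred, and can_setns are illustrative names (loosely mirroring the can_setns_* helpers the review mentions), not the DragonOS types.

```rust
// Route to a per-type permission check instead of one blanket gate.
#[derive(Clone, Copy)]
enum NsType {
    Cgroup,
    User,
}

struct Cred {
    cap_sys_admin: bool,
}

const EPERM: i32 = 1;

fn can_setns(ns: NsType, cred: &Cred) -> Result<(), i32> {
    match ns {
        // Like cgroupns_install: joining a cgroup namespace needs CAP_SYS_ADMIN.
        NsType::Cgroup => {
            if cred.cap_sys_admin { Ok(()) } else { Err(EPERM) }
        }
        // Like userns_install: no CAP_SYS_ADMIN required; the real kernel
        // applies its own separate rules for user namespaces.
        NsType::User => Ok(()),
    }
}

fn main() {
    let unprivileged = Cred { cap_sys_admin: false };
    let admin = Cred { cap_sys_admin: true };
    // An unprivileged task may join a user namespace...
    assert!(can_setns(NsType::User, &unprivileged).is_ok());
    // ...but a cgroup namespace still demands CAP_SYS_ADMIN.
    assert_eq!(can_setns(NsType::Cgroup, &unprivileged), Err(EPERM));
    assert!(can_setns(NsType::Cgroup, &admin).is_ok());
}
```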
```rust
/// Uses depth-limited recursion to prevent stack overflow.
/// Maximum recursion depth is set to 128 levels, which is far beyond any
/// reasonable cgroup hierarchy depth.
fn is_populated(cgroup: &Arc<CgroupNode>) -> bool {
```
[Medium] is_populated performs recursive subtree traversal on every cgroup.events read
Every read of cgroup.events triggers a full recursive walk of the cgroup subtree to determine the populated flag. This is O(n) where n is the number of descendant cgroups.
Linux maintains nr_populated_csets and nr_populated_children counters on each cgroup, propagating populated state changes up the ancestor chain when tasks are added/removed or child cgroups are created/destroyed. Reading populated is O(1).
Suggestion: Add a populated_count: AtomicUsize to CgroupNode, increment when a task joins an empty cgroup or a new non-empty child is created, decrement on the reverse. Read it directly in cgroup.events.
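The propagation the suggestion describes can be sketched like this. A minimal single-threaded Rust sketch, assuming updates happen under the existing accounting lock; Node, populated_count, and propagate_populated are hypothetical names:

```rust
use std::sync::Arc;
use std::sync::atomic::{AtomicUsize, Ordering};

// Node with an O(1) populated counter, in the spirit of Linux's
// nr_populated_children propagation.
struct Node {
    populated_count: AtomicUsize, // own tasks present + populated children
    parent: Option<Arc<Node>>,
}

// Call when this cgroup flips empty <-> non-empty. Propagation stops at the
// first ancestor whose own populated flag does not change.
fn propagate_populated(node: &Arc<Node>, became_populated: bool) {
    let mut cur = Some(node.clone());
    while let Some(n) = cur {
        let before = n.populated_count.load(Ordering::Relaxed);
        if became_populated {
            n.populated_count.fetch_add(1, Ordering::Relaxed);
            if before != 0 {
                break; // ancestor was already populated; its flag is unchanged
            }
        } else {
            n.populated_count.fetch_sub(1, Ordering::Relaxed);
            if before != 1 {
                break; // ancestor stays populated; its flag is unchanged
            }
        }
        cur = n.parent.clone();
    }
}

// The cgroup.events "populated" read becomes a single atomic load.
fn is_populated(node: &Node) -> bool {
    node.populated_count.load(Ordering::Relaxed) > 0
}

fn main() {
    let root = Arc::new(Node { populated_count: AtomicUsize::new(0), parent: None });
    let child = Arc::new(Node { populated_count: AtomicUsize::new(0), parent: Some(root.clone()) });
    propagate_populated(&child, true);   // first task joins the child
    assert!(is_populated(&child));
    assert!(is_populated(&root));        // state propagated upward
    propagate_populated(&child, false);  // last task leaves
    assert!(!is_populated(&child));
    assert!(!is_populated(&root));
}
```

Early termination at an unchanged ancestor keeps updates cheap even in wide hierarchies, while reads never traverse the tree.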
```rust
// Must hold cgroup_accounting_lock to avoid deadlocking against cgroup.procs writes
{
    let _cgroup_guard = crate::cgroup::cgroup_accounting_lock().lock();
    pcb.task_cgroup_node().remove_task(pid);
```
[Critical] Duplicate remove_task — see comment on line 734
This is the second remove_task call for the same PID. The first one already happens in do_exit (line 734). This should be removed to avoid double-removal issues, especially when migrating to atomic counters.