feat(cgroup): implement cgroup MVP #1826

Open
Vitus213 wants to merge 35 commits into DragonOS-Community:master from Vitus213:feat/cgroup

Conversation

@Vitus213 Contributor

  • Implement the cgroup v2 filesystem
  • Implement the cgroup tree
  • Implement per-PID relative views of the cgroup hierarchy
  • Ensure processes are constrained by the cgroup pids scope from creation

- derive Rust components from kernel/rust-toolchain.toml
- auto-add required components (cargo/rustc/rust-std) when missing
- emit trace warning when required components are absent in toml
- prioritize ${rust-toolchain}/bin in nix run yolo and devShell shellHook
- document fixed-output sha256 update workflow
- keep fixed-output hash strategy for reproducibility

This avoids cargo/rustc drift caused by PATH/toolchain mixing across Nix entrypoints.
# Conflicts:
#	.agents/skills/bug-hunter/SKILL.md
#	.agents/skills/bug-hunter/scripts/debate_picker.py
#	.agents/skills/bug-hunter/scripts/redact_sensitive.py
#	.agents/skills/bug-hunter/scripts/render_report.py
#	.agents/skills/bug-hunter/scripts/run_pipeline.py
#	.agents/skills/bug-hunter/scripts/semantic_bucket.py
#	.agents/skills/bug-hunter/scripts/shuffle_diff.py
#	.agents/skills/bug-hunter/scripts/update_resolution_history.py
#	.agents/skills/bug-hunter/scripts/weighted_vote.py
#	.agents/skills/bug-hunter/subskills/bug-hunter-stage1-input-randomization/SKILL.md
#	.agents/skills/bug-hunter/subskills/bug-hunter-stage2-parallel-review/SKILL.md
#	.agents/skills/bug-hunter/subskills/bug-hunter-stage3-evidence-fusion/SKILL.md
#	.agents/skills/bug-hunter/subskills/bug-hunter-stage4-consensus-judge/SKILL.md
#	.gitignore
#	kernel/src/filesystem/vfs/mount.rs
#	user/apps/default.nix
@github-actions github-actions bot added the enhancement New feature or request label Mar 23, 2026
- Add agents.md, requiring responses in Chinese
@fslongjin Member

@codex review

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: b54f738833

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

let fs_dir = ensure_dir(&sys, "fs", InodeMode::from_bits_truncate(0o755))?;
let cgroup_dir = ensure_dir(&fs_dir, "cgroup", InodeMode::from_bits_truncate(0o755))?;

let cgroup_fs = Cgroup2Fs::new(cgroup_root().root(), false);

P1 Make the default cgroup2 mount respect the current cgroup namespace root

This pins the initial /sys/fs/cgroup mount to the global root cgroup_root().root(), and that mount is then reused by every process. When a process calls unshare(CLONE_NEWCGROUP) but does not remount cgroup2, it can still reach and manipulate cgroups outside its namespace root through this global tree (the cgroup.procs write path only performs the extra namespace-boundary check under nsdelegate). This breaks cgroup namespace isolation and allows migration or management across the namespace root.

Contributor Author

This is consistent with Linux forward compatibility.

Comment on lines +123 to +127
pub fn subtree_task_count(self: &Arc<Self>) -> usize {
    let mut total = self.tasks.read().len();
    for child in self.children() {
        total = total.saturating_add(child.subtree_task_count());
    }

P1 Avoid unbounded recursion when counting subtree tasks

subtree_task_count() recursively walks the cgroup subtree without any bound, and this commit does not limit hierarchy depth. A user can trigger this path by creating deeply nested cgroup directories (for example via pids.max checks, migration validation, or pids.current reads), causing excessive kernel stack usage or even a stack overflow on deep trees. This directly affects stability; switch to an iterative traversal with an explicit stack, or enforce a uniform depth limit with safeguards.

Contributor Author

Fixed.

Vitus213 and others added 8 commits March 28, 2026 01:56
Design document for cgroup v2 performance benchmark tool that:
- Measures filesystem operations (mkdir/rmdir/read/write)
- Measures process migration overhead
- Measures pids controller fork overhead
- Outputs JSON format for DragonOS/Linux comparison

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Problems:
1. Process exit lacked cgroup_accounting_lock protection, giving a lock order inconsistent with the cgroup.procs write path
2. set_task_cgroup_node() acquired tasks.write() while holding task_cgroup.write(), violating the lock hierarchy

Fixes:
- Protect the cgroup operations in exit_thread() and release() with cgroup_accounting_lock
- Restructure set_task_cgroup_node() to avoid nested locks: fetch the old node under a read lock, release it, then perform the migration

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
This branch focuses on cgroup issues; poll-related changes should be handled separately.
    self.pids_events_max.fetch_add(1, Ordering::Relaxed);
}

pub fn subtree_task_count(self: &Arc<Self>) -> usize {
Member

[Critical] pids counting via full subtree traversal — O(n) performance issue

subtree_task_count() walks the entire cgroup subtree and holds each node's tasks read lock on every invocation. This is called from:

  • pids.current reads
  • pids.max checks in cgroup_can_fork_in() (fork hot path)
  • cgroup_migrate_vet_dst_with_src()

In Linux kernel/cgroup/pids.c, an atomic64_t counter per cgroup node provides hierarchical accounting:

  • pids_try_charge(): walks ancestor chain with atomic64_add_return, O(depth)
  • pids_current_read(): direct atomic64_read(&pids->counter), O(1)
  • pids_uncharge(): atomic decrement along ancestor chain on exit

Suggestion: Add counter: AtomicI64 to CgroupNode, charge/uncharge along the ancestor chain on task add/remove.
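The counter pattern suggested above can be sketched with std atomics (all type and method names here are illustrative, not DragonOS's actual API): each node holds an AtomicI64, try_charge walks only the ancestor chain and reverts on over-limit, and pids.current becomes a single O(1) load.

```rust
use std::sync::atomic::{AtomicI64, Ordering};
use std::sync::Arc;

// Hypothetical node shape: one atomic counter per cgroup.
struct PidsNode {
    counter: AtomicI64,
    limit: i64, // pids.max; i64::MAX stands in for "max"
    parent: Option<Arc<PidsNode>>,
}

impl PidsNode {
    fn new(limit: i64, parent: Option<Arc<PidsNode>>) -> Arc<Self> {
        Arc::new(PidsNode { counter: AtomicI64::new(0), limit, parent })
    }

    // O(depth): walk ancestors, atomically add; revert everything on over-limit.
    fn try_charge(self: &Arc<Self>, num: i64) -> Result<(), ()> {
        let mut node = Some(self.clone());
        let mut charged: Vec<Arc<PidsNode>> = Vec::new();
        while let Some(n) = node {
            let new = n.counter.fetch_add(num, Ordering::Relaxed) + num;
            if new > n.limit {
                // over the limit: undo this node and all nodes already charged
                n.counter.fetch_sub(num, Ordering::Relaxed);
                for c in &charged {
                    c.counter.fetch_sub(num, Ordering::Relaxed);
                }
                return Err(());
            }
            node = n.parent.clone();
            charged.push(n);
        }
        Ok(())
    }

    // Mirror of try_charge on task exit: decrement along the ancestor chain.
    fn uncharge(self: &Arc<Self>, num: i64) {
        let mut node = Some(self.clone());
        while let Some(n) = node {
            n.counter.fetch_sub(num, Ordering::Relaxed);
            node = n.parent.clone();
        }
    }

    // pids.current becomes a single atomic read, O(1).
    fn current(&self) -> i64 {
        self.counter.load(Ordering::Relaxed)
    }
}

fn main() {
    let root = PidsNode::new(i64::MAX, None);
    let child = PidsNode::new(2, Some(root.clone()));
    assert!(child.try_charge(1).is_ok());
    assert!(child.try_charge(1).is_ok());
    assert!(child.try_charge(1).is_err()); // pids.max = 2 reached
    assert_eq!(child.current(), 2);
    assert_eq!(root.current(), 2);
    child.uncharge(1);
    assert_eq!(root.current(), 1);
}
```

Note the revert path: a partially applied hierarchical charge must be rolled back on every ancestor already touched, or the counters drift.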

pub fn cgroup_migrate_vet_dst(dst: &Arc<CgroupNode>) -> Result<(), SystemError> {
    // v2 no-internal-process minimal constraint:
    // reject migration if the target cgroup already has processes and subtree_control is enabled.
    if dst.has_tasks() && !dst.subtree_control().is_empty() {
Member

[Critical] no-internal-process constraint check deviates from Linux semantics

Current check: has_tasks() && !subtree_control().is_empty() — requires BOTH conditions.

Linux kernel/cgroup/cgroup.c:2632-2634:

/* apply no-internal-process constraint */
if (dst_cgrp->subtree_control)
    return -EBUSY;

Linux only checks whether subtree_control is non-zero, without checking has_tasks(). The meaning is: once a cgroup has subtree_control enabled, no process migration into it is allowed (regardless of whether it currently has tasks).

The current implementation allows migrating processes into an empty cgroup that has controllers enabled, violating the cgroup v2 no-internal-process constraint.

Suggested fix:

if !dst.subtree_control().is_empty() {
    return Err(SystemError::EBUSY);
}

// Must hold cgroup_accounting_lock to avoid deadlocking with cgroup.procs writes
{
    let _cgroup_guard = crate::cgroup::cgroup_accounting_lock().lock();
    pcb.task_cgroup_node().remove_task(raw_pid);
Member

[Critical] Double remove_task — task removed from cgroup in both do_exit and release

This line removes the task from its cgroup on exit. However, release() (around line 872) also calls remove_task(pid) for the same task. The same PID gets removed twice.

While HashSet::remove on a missing element is a no-op today, this causes problems:

  1. If refactored to atomic counters (recommended), double-decrement causes underflow
  2. Indicates unclear cgroup lifecycle ownership

Linux calls pids_uncharge() only once in the pids_release() callback (via css_release).

Suggestion: Keep remove_task only here in do_exit (matching Linux semantics — zombies don't appear in cgroup.procs), and remove the duplicate call in release().

}
let moved_tasks = to_move.len();

let _cgroup_guard = cgroup_accounting_lock().lock();
Member

[Critical] SpinLock held while taking RwLock write during task migration loop

cgroup_accounting_lock() is a SpinLock. Inside this critical section, the loop calls set_task_cgroup_node() which acquires a RwLock<TaskCgroupRef> write lock.

If RwLock can sleep (e.g., blocking rwlock implementation), this would deadlock under preemption-disabled / IRQ-disabled context imposed by SpinLock.

Additionally, set_task_cgroup_node() internally:

  1. Takes read lock on task_cgroup → gets old node → releases read lock
  2. Calls old.remove_task() and new.add_task() (no lock on task_cgroup)
  3. Takes write lock on task_cgroup → updates reference

Between steps 1 and 3, another CPU could read a stale cgroup reference via task_cgroup_node(). The SpinLock is supposed to prevent this, but need to verify RwLock is spin-based and safe to nest under SpinLock.

Suggestion: Verify RwLock implementation is non-sleeping, or switch cgroup_accounting_lock to a Mutex.
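One way to make the intended invariant concrete, with std locks standing in for the kernel's primitives (all names illustrative): a single accounting mutex serializes every migration, and the task's cgroup reference is read and written only inside that critical section, so no stale-reference window remains between steps 1 and 3 regardless of whether the RwLock spins or sleeps.

```rust
use std::collections::HashSet;
use std::sync::{Arc, Mutex, RwLock};

// Illustrative shapes; the point is the ordering discipline:
// accounting lock first, then per-node task sets, then the task's
// cgroup reference — never the reverse.
struct Node {
    tasks: RwLock<HashSet<u32>>,
}

struct Task {
    cgroup: RwLock<Arc<Node>>,
}

fn migrate(accounting: &Mutex<()>, task: &Task, pid: u32, dst: &Arc<Node>) {
    let _g = accounting.lock().unwrap(); // serializes all migrations
    let old = task.cgroup.read().unwrap().clone(); // read guard dropped here
    old.tasks.write().unwrap().remove(&pid);
    dst.tasks.write().unwrap().insert(pid);
    *task.cgroup.write().unwrap() = dst.clone(); // reference updated last
}

fn main() {
    let accounting = Mutex::new(());
    let a = Arc::new(Node { tasks: RwLock::new(HashSet::from([7])) });
    let b = Arc::new(Node { tasks: RwLock::new(HashSet::new()) });
    let t = Task { cgroup: RwLock::new(a.clone()) };
    migrate(&accounting, &t, 7, &b);
    assert!(!a.tasks.read().unwrap().contains(&7));
    assert!(b.tasks.read().unwrap().contains(&7));
    assert!(Arc::ptr_eq(&*t.cgroup.read().unwrap(), &b));
}
```

In a kernel, the inner RwLocks must be non-sleeping if the outer lock disables preemption; the sketch only shows the ordering, not that property.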

.clone();
let src_node = pcb.task_cgroup_node();
let _cgroup_guard = cgroup_accounting_lock().lock();
cgroup_can_fork_in(&charge_node, 1)?;
Member

[Medium] TOCTOU gap between cgroup_can_fork_in check and actual add_task

The cgroup_accounting_lock guard is dropped at the end of this block, but add_pcb() (which calls add_task()) runs after the lock is released. Between the check passing and the task actually being added, another fork could also pass the same check, exceeding pids.max.

Linux avoids this by using pids_try_charge() which atomically increments the counter during the check — the counter reflects the charge immediately. If the fork later fails, pids_cancel_fork() does an uncharge to revert.

Suggestion: Either extend the lock scope to cover add_pcb(), or (better) adopt the atomic counter charge/uncharge pattern.
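The charge-at-check pattern can be illustrated with a single-node sketch (names illustrative, not the PR's actual types): the limit check and the reservation are one atomic fetch_add, so two concurrent forks cannot both pass the same check, and a failed fork hands its charge back.

```rust
use std::sync::atomic::{AtomicI64, Ordering};

// Single-node sketch of charge-at-check: the check itself reserves
// the slot, closing the window between check and add_task.
struct Pids {
    counter: AtomicI64,
    max: i64,
}

impl Pids {
    // Returns true and keeps the charge if under the limit.
    fn can_fork(&self) -> bool {
        if self.counter.fetch_add(1, Ordering::Relaxed) + 1 > self.max {
            // over the limit: give the reservation back
            self.counter.fetch_sub(1, Ordering::Relaxed);
            false
        } else {
            true
        }
    }

    // The fork failed after the check passed: revert the charge.
    fn cancel_fork(&self) {
        self.counter.fetch_sub(1, Ordering::Relaxed);
    }
}

fn main() {
    let p = Pids { counter: AtomicI64::new(0), max: 1 };
    assert!(p.can_fork());  // reserves the only slot
    assert!(!p.can_fork()); // a concurrent fork sees the reservation
    p.cancel_fork();        // the first fork failed elsewhere: uncharge
    assert!(p.can_fork());  // the slot is free again
}
```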

let child = Cgroup2Inode::lookup_child(&this, name)?;

{
    let inner = child.inner.lock();
Member

[Medium] Potential ABBA deadlock in rmdir — child lock held while acquiring parent lock

Here child.inner.lock() is acquired first, then this.inner.lock() (parent) is acquired while still holding the child lock. Lock order: child → parent.

However, in lookup_child(), create(), list() and other operations, parent.inner.lock() is acquired first, then child operations are performed. Lock order: parent → child.

This is a classic ABBA deadlock pattern. While rmdir checks that the child has no tasks/children (reducing concurrent access likelihood), it's still possible with concurrent operations on different children of the same parent.

Suggestion: Restructure rmdir to extract needed info from child.inner first, release child lock, then acquire parent lock.
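The suggested restructuring can be sketched as follows (illustrative types, std Mutex standing in for the inode lock): snapshot what rmdir needs under the child lock, drop it, and only then take the parent lock, so every path holds at most one lock or keeps the single parent → child order.

```rust
use std::sync::{Arc, Mutex};

// Hypothetical inode shape for the sketch.
struct Inner {
    children: Vec<String>,
    tasks: usize,
}

struct Dir {
    inner: Mutex<Inner>,
}

fn rmdir(parent: &Arc<Dir>, child: &Arc<Dir>, name: &str) -> Result<(), ()> {
    // 1. Snapshot the child state, then release its lock.
    let removable = {
        let c = child.inner.lock().unwrap();
        c.children.is_empty() && c.tasks == 0
    }; // child lock dropped here
    if !removable {
        return Err(());
    }
    // 2. Only now take the parent lock; no child lock is held.
    parent.inner.lock().unwrap().children.retain(|n| n.as_str() != name);
    Ok(())
}

fn main() {
    let parent = Arc::new(Dir {
        inner: Mutex::new(Inner { children: vec!["a".to_string()], tasks: 0 }),
    });
    let child = Arc::new(Dir {
        inner: Mutex::new(Inner { children: Vec::new(), tasks: 0 }),
    });
    assert!(rmdir(&parent, &child, "a").is_ok());
    assert!(parent.inner.lock().unwrap().children.is_empty());
}
```

A real kernel would still need to revalidate (or mark the child dying) under the parent lock, since the child could gain tasks between the snapshot and the unlink; the sketch only shows the lock-ordering fix.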


// 2. Resolve the fd; currently only pidfd is supported
let current = ProcessManager::current_pcb();
if !current.cred().has_cap_sys_admin() {
Member

[Medium] Blanket CAP_SYS_ADMIN check in setns is overly restrictive

This unconditional CAP_SYS_ADMIN check at the ksys_setns entry point blocks all namespace types. However, in Linux, setns(CLONE_NEWUSER) does not require CAP_SYS_ADMIN — unprivileged users can join a user namespace.

Linux checks permissions per-namespace type in each namespace's install callback (e.g., cgroupns_install checks ns_capable(CAP_SYS_ADMIN), but userns_install has different rules).

Suggestion: Remove this top-level blanket check. Instead, rely on the per-namespace permission checks that already exist in the individual branches below (e.g., can_setns_cgroup, can_setns_target_userns).
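A heavily simplified sketch of the per-type dispatch (all names hypothetical, not DragonOS's actual types): the entry point matches on the namespace type and defers to each type's own rule instead of one blanket capability gate. The real userns rule evaluates CAP_SYS_ADMIN in the target namespace's owner chain, which is reduced to a precomputed boolean here.

```rust
// Namespace types covered by the sketch.
#[derive(Clone, Copy)]
enum NsType {
    Cgroup,
    User,
}

struct Cred {
    // CAP_SYS_ADMIN as evaluated against the owning user namespace
    cap_sys_admin: bool,
    // CAP_SYS_ADMIN as evaluated in the *target* user namespace —
    // a host-unprivileged process may well hold this
    cap_sys_admin_in_target_userns: bool,
}

fn can_setns(ns: NsType, cred: &Cred) -> bool {
    match ns {
        // cgroupns_install-style rule: admin over the owning userns
        NsType::Cgroup => cred.cap_sys_admin,
        // userns_install-style rule: privilege is judged against the
        // target namespace, not globally
        NsType::User => cred.cap_sys_admin_in_target_userns,
    }
}

fn main() {
    let unpriv = Cred { cap_sys_admin: false, cap_sys_admin_in_target_userns: true };
    let root = Cred { cap_sys_admin: true, cap_sys_admin_in_target_userns: true };
    assert!(!can_setns(NsType::Cgroup, &unpriv));
    assert!(can_setns(NsType::Cgroup, &root));
    assert!(can_setns(NsType::User, &unpriv)); // no blanket gate needed
}
```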


/// Uses depth-limited recursion to prevent stack overflow.
/// Maximum recursion depth is set to 128 levels, which is far beyond any
/// reasonable cgroup hierarchy depth.
fn is_populated(cgroup: &Arc<CgroupNode>) -> bool {
Member

[Medium] is_populated performs recursive subtree traversal on every cgroup.events read

Every read of cgroup.events triggers a full recursive walk of the cgroup subtree to determine the populated flag. This is O(n) where n is the number of descendant cgroups.

Linux maintains nr_populated_csets and nr_populated_children counters on each cgroup, propagating populated state changes up the ancestor chain when tasks are added/removed or child cgroups are created/destroyed. Reading populated is O(1).

Suggestion: Add a populated_count: AtomicUsize to CgroupNode, increment when a task joins an empty cgroup or a new non-empty child is created, decrement on the reverse. Read it directly in cgroup.events.
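The propagation scheme suggested above can be sketched like this (illustrative names, std atomics): each node counts "populated sources" (its own tasks collapsed to 0/1, plus populated children), and a change propagates to the parent only when the count flips between zero and non-zero, so reads are a single load.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Arc;

// Hypothetical node: populated = local-tasks(0/1) + populated children.
struct Node {
    populated: AtomicUsize,
    parent: Option<Arc<Node>>,
}

impl Node {
    fn new(parent: Option<Arc<Node>>) -> Arc<Self> {
        Arc::new(Node { populated: AtomicUsize::new(0), parent })
    }

    // cgroup.events "populated" read: O(1), no traversal.
    fn is_populated(&self) -> bool {
        self.populated.load(Ordering::Relaxed) != 0
    }

    // Call when this node gains its first task, or a child flips to populated.
    fn inc_populated(self: &Arc<Self>) {
        if self.populated.fetch_add(1, Ordering::Relaxed) == 0 {
            if let Some(p) = &self.parent {
                p.inc_populated(); // flipped 0 → 1: notify the parent
            }
        }
    }

    // Reverse: last task left, or a child flipped to empty.
    fn dec_populated(self: &Arc<Self>) {
        if self.populated.fetch_sub(1, Ordering::Relaxed) == 1 {
            if let Some(p) = &self.parent {
                p.dec_populated(); // flipped 1 → 0
            }
        }
    }
}

fn main() {
    let root = Node::new(None);
    let mid = Node::new(Some(root.clone()));
    let leaf = Node::new(Some(mid.clone()));
    assert!(!root.is_populated());
    leaf.inc_populated(); // a task joins the leaf
    assert!(leaf.is_populated() && mid.is_populated() && root.is_populated());
    leaf.dec_populated(); // the task exits
    assert!(!root.is_populated());
}
```

Propagation is O(depth) on writes and only on flips, which is the trade Linux makes so that every read of the populated flag stays O(1).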

// Must hold cgroup_accounting_lock to avoid deadlocking with cgroup.procs writes
{
    let _cgroup_guard = crate::cgroup::cgroup_accounting_lock().lock();
    pcb.task_cgroup_node().remove_task(pid);
Member

[Critical] Duplicate remove_task — see comment on line 734

This is the second remove_task call for the same PID. The first one already happens in do_exit (line 734). This should be removed to avoid double-removal issues, especially when migrating to atomic counters.
