Agent Skills Guard

This security guide explains how we review AI agent skills before we issue trust recommendations. The goal is practical: help teams move faster without accepting hidden operational risk.

How the grading model works

Grade labels summarize risk density, not feature richness. We score implementation behavior, permission requirements, and operational blast radius. A skill with fewer features can rank safer than a feature-rich one if it keeps strict boundaries and predictable failure modes.

Grade A (score 90-100): Safe and Mature
Grade B (score 70-89): Low Risk
Grade C (score 50-69): Needs Guardrails
Grade F (fail / critical risk): Unsafe by Default
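
The grade bands can be sketched as a small lookup. This is a minimal illustration, not the scoring engine itself: the function name and the has_critical_finding flag are hypothetical, and two behaviors are assumptions because the guide does not state them: that any score below 50 is an F, and that a critical finding forces an F regardless of score.

```python
def grade_for_score(score: int, has_critical_finding: bool = False) -> str:
    """Map a 0-100 review score to a grade letter."""
    if not 0 <= score <= 100:
        raise ValueError(f"score out of range: {score}")
    # Assumption: a critical finding fails the review no matter the score.
    if has_critical_finding:
        return "F"
    if score >= 90:
        return "A"  # Safe and Mature
    if score >= 70:
        return "B"  # Low Risk
    if score >= 50:
        return "C"  # Needs Guardrails
    # Assumption: everything below the C band is treated as a fail.
    return "F"  # Unsafe by Default
```

Under these assumptions, a skill scoring 95 with one critical finding still grades F, which matches the idea that the label summarizes risk density rather than feature richness.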

Network diagnostics

How to handle proxy and unknown-domain signals

Search-signal triage

Some security searches originate from log fragments rather than ordinary product intent. When a query contains a proxy hostname, a local inspection domain, or an unfamiliar outbound destination, the right response is to triage it inside this security guide rather than create a standalone keyword page that would distract from the directory's topic.

Unknown proxy host

Signal: A log, trace, or package output shows a hostname that looks like a man-in-the-middle (MITM) proxy, such as default.mitmproxy.hub.ace-research.openai.org.

Review response: Treat it as an environment-specific routing signal, not as a public integration target. Verify the runtime owner, proxy policy, and whether the request carried sensitive headers.

Unexpected outbound request

Signal: A skill contacts a domain that is not documented in its README, install command, or permission notes.

Review response: Block promotion until the destination is explained, allowlisted, and tested with redacted inputs. Silent telemetry should lower the trust score.

Credential-bearing network path

Signal: A request includes API keys, cookies, bearer tokens, repository contents, or local file snippets.

Review response: Fail the review unless the transfer is essential, explicit, and user-approved. Logs should mask tokens before they leave the local machine.
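
Masking before a log line leaves the machine might look like the sketch below. The header set is an assumption chosen for illustration (x-api-key in particular is a common convention, not a standard), and redact_headers is a hypothetical helper rather than part of any skill runtime.

```python
# Assumption: these header names carry credentials in the environment under review.
SENSITIVE_HEADERS = {
    "authorization",
    "proxy-authorization",
    "cookie",
    "set-cookie",
    "x-api-key",
}


def redact_headers(headers: dict[str, str]) -> dict[str, str]:
    """Return a copy of the headers that is safe to write to a log."""
    return {
        name: "[REDACTED]" if name.lower() in SENSITIVE_HEADERS else value
        for name, value in headers.items()
    }
```

The comparison is case-insensitive because HTTP header names are. A reviewer would call this on the request headers before passing them to any logger or trace exporter.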

MCP package risk update

Treat install-command searches as security events

Package-intent signal

Users who search exact MCP package names are often trying to paste commands directly into Claude Code, Cursor, VS Code, or another client. A security guide should intercept that moment and convert it into a source, permission, and action-boundary review.

Similar package name

Signal: The install command looks like a known MCP package but the namespace, maintainer, or registry source is different.

Review response: Pause installation and verify the package from official docs, registry metadata, and repository ownership before it reaches a client config.
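
A lookalike check can catch some of these cases before a command is pasted. The sketch below uses the standard library's difflib; the package names are placeholders, and both the 0.85 threshold and the idea of keeping a local list of verified names are assumptions, not a documented registry feature.

```python
from difflib import SequenceMatcher

# Hypothetical list of package names the team has already verified.
KNOWN_PACKAGES = {"@example-org/mcp-files", "@example-org/mcp-search"}


def lookalike_of(candidate, known=KNOWN_PACKAGES, threshold=0.85):
    """Return the verified name that `candidate` resembles, or None.

    An exact match is not a lookalike, so it also returns None.
    """
    if candidate in known:
        return None
    for name in sorted(known):
        similarity = SequenceMatcher(None, candidate.lower(), name.lower()).ratio()
        if similarity >= threshold:
            return name
    return None
```

A non-None result means "pause and verify", not "malicious". String similarity cannot confirm maintainer or registry ownership, so it supplements the manual checks above rather than replacing them.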

Filesystem boundary

Signal: A server requests local read/write access, broad path scopes, or the ability to search outside the working project.

Review response: Limit allowed directories, start read-only where possible, and require review before the server mutates files or touches private folders.
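
Limiting allowed directories usually comes down to resolving a requested path and checking that it stays under an approved root. A minimal sketch, assuming Python 3.9 or later (for Path.is_relative_to) and a hypothetical helper name:

```python
from pathlib import Path


def is_within_allowed(target: str, allowed_roots: list[str]) -> bool:
    """True if `target` resolves to a location under one of the allowed roots."""
    # resolve() collapses ".." segments and follows symlinks where they exist,
    # so a path cannot escape the root by traversal.
    resolved = Path(target).resolve()
    return any(
        resolved.is_relative_to(Path(root).resolve()) for root in allowed_roots
    )
```

Comparing resolved paths rather than raw string prefixes matters: a prefix test would accept /srv/project-backup when only /srv/project is allowed.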

Network-capable tool

Signal: A server can fetch URLs, call APIs, upload data, or combine browser/session state with private prompts.

Review response: Document outbound destinations and block use with secrets, customer data, or unpublished content until the path is explicitly approved.

Destructive action surface

Signal: A server can delete branches, run SQL writes, change access controls, publish deploys, or trigger external workflows.

Review response: Keep those actions human-approved one at a time, and require preview evidence before production expansion.

🛑

1. Remote Code Execution (RCE)

Skills that execute arbitrary code from untrusted or weakly validated input.

Critical Risk

• No eval-like execution paths from user-controlled content
• No dynamic module loading from untrusted parameters
• No unsafe deserialization of untrusted payloads
• No runtime code compilation from raw strings
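
As an illustration of the first and third checks, untrusted input can be parsed as data only. The sketch assumes the payload is meant to be a JSON object; parse_untrusted_payload is a hypothetical name.

```python
import json


def parse_untrusted_payload(raw: str) -> dict:
    """Parse untrusted text as data; it is never evaluated or unpickled."""
    # json.loads only builds plain values (dict, list, str, numbers, bool, None).
    # Malformed input raises json.JSONDecodeError, a subclass of ValueError.
    data = json.loads(raw)
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object at the top level")
    return data
```

A string such as __import__("os") is rejected as malformed JSON here, whereas passing it to eval would execute it.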

💥

2. Command Injection

Shell execution patterns where user input can alter command intent.

Critical Risk

• Avoid shell=true patterns for untrusted arguments
• Pass command arguments as arrays instead of interpolated strings
• Reject metacharacters in risky command contexts
• Do not build command pipelines from untrusted variables
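
The difference between an argument array and an interpolated string is easiest to see in a small example. This sketch uses Python's subprocess module and launches the current interpreter as the child process only to stay portable; run_with_args is a hypothetical name.

```python
import subprocess
import sys


def run_with_args(user_value: str) -> str:
    """Pass untrusted input as a single argv entry; no shell parses it."""
    result = subprocess.run(
        # A list of arguments with the default shell=False: metacharacters in
        # user_value reach the child process as literal text.
        [sys.executable, "-c", "import sys; print(sys.argv[1])", user_value],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.rstrip("\n")
```

Calling run_with_args("report.txt; rm -rf /") returns the same string unchanged, because the semicolon is never interpreted as a command separator.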

🧨

3. Destructive Operations

Actions that can irreversibly delete data or damage runtime state.

High Risk

• No recursive destructive deletes without hard guardrails
• No writes to critical system startup/config locations
• No raw disk-level operations in general-purpose skills
• Require explicit confirmation for broad file mutations

📤

4. Network Exfiltration

Unapproved data transfer from local context to external endpoints.

High Risk

• No outbound file uploads to arbitrary hosts
• No silent telemetry or hidden tracking paths
• No transfer of environment or credential files
• Prefer explicit allowlists for network destinations
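
An explicit allowlist can be enforced with a host check in front of every outbound call. The sketch below is an assumption-laden example: api.example.com is a placeholder destination, and requiring HTTPS is a policy choice added here, not something the checklist states.

```python
from urllib.parse import urlsplit

# Hypothetical allowlist; a real one would come from the skill's documented destinations.
ALLOWED_HOSTS = {"api.example.com"}


def is_allowed_destination(url: str, allowed=ALLOWED_HOSTS) -> bool:
    """True only for HTTPS URLs whose exact hostname is on the allowlist."""
    parts = urlsplit(url)
    # parts.hostname is lowercased and excludes any userinfo or port, so
    # "https://api.example.com@evil.test/" is judged by "evil.test".
    return parts.scheme == "https" and parts.hostname in allowed
```

Matching the exact hostname, rather than checking whether the URL string contains an allowed name, rejects lookalikes such as api.example.com.evil.test.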

🔐

5. Sensitive File Access

Reads from locations that commonly contain credentials or private data.

Medium Risk

• No broad reads of SSH, cloud, or auth config directories
• No access to browser storage/history databases by default
• No reads of system account or protected host files
• Use narrow path scopes and document required file access
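
A reviewer can flag candidate reads with a simple deny-list over path components. The directory and file names below are common POSIX examples chosen for illustration, not an exhaustive or official list, and touches_sensitive_location is a hypothetical helper.

```python
from pathlib import PurePosixPath

# Assumption: a small sample of credential-bearing locations on POSIX systems.
SENSITIVE_DIR_NAMES = {".ssh", ".aws", ".gnupg", ".kube"}
SENSITIVE_FILES = {"/etc/shadow", "/etc/sudoers"}


def touches_sensitive_location(path: str) -> bool:
    """True if the path names a protected file or passes through a sensitive directory."""
    candidate = PurePosixPath(path)
    if str(candidate) in SENSITIVE_FILES:
        return True
    return any(part in SENSITIVE_DIR_NAMES for part in candidate.parts)
```

This is a purely lexical check: it does not resolve symlinks or ".." segments, so it works best on paths that have already been resolved.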

🧯

6. Secrets Leakage

Hardcoded secrets or logging/output paths that expose sensitive tokens.

Medium Risk

• No hardcoded API tokens in source or config examples
• No raw env dumps in logs or diagnostic output
• Mask credentials in error messages and traces
• No committed secret-bearing local env files
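
Masking credentials in free-text error messages can be approximated with a couple of regular expressions. The patterns are assumptions about how secrets tend to appear (a Bearer token, or a key=value pair with a secret-looking key); real traces may need more patterns, and mask_secrets is a hypothetical name.

```python
import re

# Assumption: secrets show up either after "Bearer " or as the value of a
# key whose name suggests a credential.
_PATTERNS = [
    (re.compile(r"(?i)(bearer\s+)[A-Za-z0-9._~+/=-]+"), r"\1[REDACTED]"),
    (
        re.compile(r"(?i)\b((?:api[_-]?key|token|secret|password)\s*[=:]\s*)\S+"),
        r"\1[REDACTED]",
    ),
]


def mask_secrets(message: str) -> str:
    """Replace token-like values in a log or error message with a placeholder."""
    for pattern, replacement in _PATTERNS:
        message = pattern.sub(replacement, message)
    return message
```

Pattern-based masking is a backstop, not a guarantee. The stronger control is to avoid placing secrets into messages in the first place.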

🕳️

7. Persistence and Backdoors

Unauthorized mechanisms that survive session boundaries.

High Risk

• No hidden startup task registration by default
• No stealth background services without explicit user action
• No alias/function hijacking in shell profiles
• Document all persistence behavior and disable paths

⛔

8. Privilege Escalation

Attempts to obtain elevated privileges or bypass permission boundaries.

Critical Risk

• No privilege escalation patterns in standard skill flows
• No permission broadening commands as fallback behavior
• No bypass attempts for host security controls
• No unsafe permission presets for convenience

Operational review workflow

Strong security review is process-driven. We use a repeatable flow: static pattern checks, behavior simulation, permission scope validation, and rollout-readiness review. A fixed flow lets teams compare skills consistently instead of relying on subjective impressions.

1. Identify dangerous execution primitives and boundary violations.
2. Test failure behavior under controlled invalid inputs.
3. Validate documented permissions against observed behavior.
4. Assign rollout constraints and rollback ownership before promotion.
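
The first step, static pattern checks, can be sketched for a skill written in Python using the standard ast module. The set of flagged calls is an illustrative assumption, not the full rule set used in review, and find_dangerous_calls is a hypothetical name.

```python
import ast

# Assumption: a small sample of execution primitives worth flagging.
DANGEROUS_BUILTINS = {"eval", "exec", "compile", "__import__"}


def find_dangerous_calls(source: str) -> list[tuple[int, str]]:
    """Return (line number, call name) for each flagged call in the source."""
    findings = []
    for node in ast.walk(ast.parse(source)):
        if not isinstance(node, ast.Call):
            continue
        func = node.func
        if isinstance(func, ast.Name) and func.id in DANGEROUS_BUILTINS:
            findings.append((node.lineno, func.id))
        elif (
            isinstance(func, ast.Attribute)
            and func.attr == "system"
            and isinstance(func.value, ast.Name)
            and func.value.id == "os"
        ):
            findings.append((node.lineno, "os.system"))
    return sorted(findings)
```

Because this inspects the syntax tree without running the code, it is safe to apply to untrusted skills. It will miss aliased or dynamically constructed calls, which is why the later behavior and permission steps still matter.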

Teams that skip one of these steps usually pay later with brittle automation, unclear ownership, or emergency rollbacks during peak workload windows.

Worked example: reducing rollout risk in one week

Imagine your team wants to adopt a skill that automates a data pull followed by a content update. On day one, define one measurable outcome and one explicit stop condition. On day two, run the skill in preview-only mode against a narrow dataset. On day three, classify errors by root cause instead of filing them under generic log messages. On day four, patch guardrails for the failure classes that repeat. On day five, run one controlled replay and compare the results against baseline metrics.

If throughput improves but error severity rises, promotion should be blocked. If throughput improves and error severity drops or remains stable, promote gradually with bounded scope. This is how teams avoid false confidence from raw speed gains.

• Do not expand scope until failure classes are understood.
• Require evidence links for every go/no-go decision.
• Keep rollback commands documented before first production run.
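
The promotion rule from the worked example can be written down as a small decision function. The metric names and the numeric severity scale are assumptions; the guide also does not say what to do when throughput fails to improve, so this sketch returns "hold" in that case and blocks on any rise in severity.

```python
def promotion_decision(
    baseline_throughput: float,
    replay_throughput: float,
    baseline_severity: float,
    replay_severity: float,
) -> str:
    """Compare a controlled replay against baseline and return a decision."""
    # Rising error severity blocks promotion. The guide states this for the
    # case where throughput improved; blocking in all cases is an assumption.
    if replay_severity > baseline_severity:
        return "block"
    # Faster, with severity stable or lower: promote with bounded scope.
    if replay_throughput > baseline_throughput:
        return "promote-gradually"
    # Assumption: no speed gain means there is no reason to promote yet.
    return "hold"
```

For example, a replay that raises throughput from 100 to 140 while severity climbs from 2 to 3 returns "block", which is the false-confidence case the example warns about.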

Frequently Asked Questions

Can a skill still be risky if its score looks high?

Yes. Scoring is a risk signal, not an absolute safety guarantee. Always review permission scope and deployment context.

What should teams do before promoting a newly installed skill?

Run preview-first validation with fixed pass/fail checks, log failure evidence, and confirm rollback ownership.

How often should security review results be refreshed?

Refresh after upstream updates and run a scheduled monthly drift review for production-critical skills.

Why are command and network controls emphasized so strongly?

They create the largest blast radius when misused, especially in automation chains with access to local files and external APIs.

Why do logs sometimes show default.mitmproxy.hub.ace-research.openai.org?

A MITM-looking hostname in logs usually points to environment-specific proxy routing or network inspection. Do not create an integration around it. Instead, verify the runtime owner, confirm whether the skill intentionally made the request, and ensure no secrets or private files were sent through that path.

What is the fastest way to reduce real-world rollout risk?

Adopt fewer skills at once, enforce explicit ownership, and require evidence-backed acceptance before production expansion.

Use security scoring as a decision aid, not a shortcut

The safest adoption pattern is still preview-first execution with clear ownership. Use this guide to structure decisions, reduce ambiguity, and keep production rollouts predictable.
