Grading explained
Findings collapse into a single letter (A → F) and a 0–100 score. The formula is deterministic, transparent, and tuned to be high-precision: a single CRIT is enough to drop you out of safe-to-install territory.
The formula
The counts of CRIT, HIGH, and MED findings are checked against the rows below in order; the first matching row wins:
- crit ≥ 2 → F · 0
- crit = 1 → D · 30
- high ≥ 3 or med ≥ 6 → C · 55
- high ≥ 1 or med ≥ 3 → B · 75
- else → A · 95
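The rows above can be read as a short chain of checks. The following is a minimal sketch in Python, not the scanner's actual implementation; the names `Grade` and `grade` are hypothetical and chosen only for illustration.

```python
from typing import NamedTuple


class Grade(NamedTuple):
    """Hypothetical container for the letter and the 0-100 score."""
    letter: str
    score: int


def grade(crit: int, high: int, med: int) -> Grade:
    """Apply the rows in order; the first matching row wins."""
    if crit >= 2:
        return Grade("F", 0)
    if crit == 1:
        return Grade("D", 30)
    if high >= 3 or med >= 6:
        return Grade("C", 55)
    if high >= 1 or med >= 3:
        return Grade("B", 75)
    return Grade("A", 95)


# Three MED findings and nothing else match the fourth row.
print(grade(crit=0, high=0, med=3))  # Grade(letter='B', score=75)
```

Because the CRIT rows come first, HIGH and MED counts cannot change the result once a CRIT is present: under this sketch, `grade(crit=1, high=5, med=10)` is still D · 30.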
Verdict mapping
- A · 95 — Safe to install. No findings of concern.
- B · 75 — Generally safe. Minor hygiene issues. Worth a glance.
- C · 55 — Proceed with caution. Pile-up of high-severity hygiene + possibly some declaration issues.
- D · 30 — Do not install. At least one critical finding (credential leak, prompt injection, URL exfiltration).
- F · 0 — Do not install. Multiple criticals. Likely malicious.
Worked examples
- 0 findings → A · 95
- 1 MED (e.g. no-manifest only) → A · 95
- 3 MED (e.g. no-manifest + 2 undeclared-egress) → B · 75
- 1 HIGH (e.g. chmod 777) → B · 75
- 3 HIGH (e.g. chmod 777 + filesystem-overreach + obfuscation) → C · 55
- 1 CRIT (e.g. $GITHUB_TOKEN harvest) → D · 30
- 2+ CRIT (e.g. env-var-harvest + instruction-injection) → F · 0
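The worked examples can be checked mechanically by tallying severities from a list of findings and walking the rows in order. This is a sketch under assumptions: the severity labels are represented as plain strings, and `RULES`, `grade_findings`, and `EXAMPLES` are hypothetical names, not part of the scanner.

```python
from collections import Counter

# Rows in evaluation order: (predicate over severity counts, letter, score).
# Counter returns 0 for severities that never appear.
RULES = [
    (lambda c: c["CRIT"] >= 2, "F", 0),
    (lambda c: c["CRIT"] == 1, "D", 30),
    (lambda c: c["HIGH"] >= 3 or c["MED"] >= 6, "C", 55),
    (lambda c: c["HIGH"] >= 1 or c["MED"] >= 3, "B", 75),
]


def grade_findings(severities):
    """Count findings by severity and return the first matching row."""
    counts = Counter(severities)
    for matches, letter, score in RULES:
        if matches(counts):
            return letter, score
    return "A", 95  # the "else" row


# The worked examples above, expressed as severity lists.
EXAMPLES = [
    ([], ("A", 95)),
    (["MED"], ("A", 95)),
    (["MED", "MED", "MED"], ("B", 75)),
    (["HIGH"], ("B", 75)),
    (["HIGH", "HIGH", "HIGH"], ("C", 55)),
    (["CRIT"], ("D", 30)),
    (["CRIT", "CRIT"], ("F", 0)),
]

for severities, expected in EXAMPLES:
    assert grade_findings(severities) == expected
```

Keeping the rows as data rather than nested conditionals makes it easier to diff the table when the grading-rules version changes.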
Why these numbers
The thresholds are picked so that the typical "well-meaning but careless" skill (no manifest, a couple of undeclared egresses) lands at B, and the "obvious malice" skill (env-var harvest + instruction injection) lands at F. The middle bands (C and D) are narrow on purpose — most scans should be A, B, or F. Anything in between deserves human review, which is why the Expert Review Network is coming soon.
Will this change?
Likely yes. LLM-based semantic probes are coming soon and will produce a richer signal than regex; the sandbox will then add behavioral findings; reviewer-curated weights come later. Whenever the formula changes, the result page banner will note the grading-rules version.