Where catalog data comes from

SkillOx's public catalog blends four sources: GitHub Code Search, the ClawHub public REST API, the Skills.sh sitemap, and creator submissions. Each source is documented here with its mechanism, its sourcing, and the relevant rate limits — so you can audit how we got every row.

GitHub — Code Search

Mechanism

We query the GitHub Code Search API for filename:SKILL.md, partitioned by file size into six bins: <500, 500..1000, 1000..2000, 2000..5000, 5000..10000, >10000. The size partition is necessary because the API caps results at 1,000 per query; issuing six independent queries, one per size bin, works around the cap.
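As a rough illustration of the partitioning, the sketch below builds one query string per bin. It assumes the bins are expressed with GitHub's size: search qualifier; the helper name build_queries is hypothetical and not taken from the crawler.

```python
# Size bins from the text, written in GitHub search "size:" qualifier syntax.
SIZE_BINS = ["<500", "500..1000", "1000..2000", "2000..5000", "5000..10000", ">10000"]


def build_queries(filename: str = "SKILL.md") -> list[str]:
    """Return one Code Search query string per size bin (hypothetical helper)."""
    return [f"filename:{filename} size:{size_bin}" for size_bin in SIZE_BINS]


for query in build_queries():
    print(query)
```

If range endpoints are treated as inclusive, a file sitting exactly on a boundary could show up in two bins, so deduplicating hits by repo and path is a cheap safeguard.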

For each hit we fetch the SKILL.md content from its public raw URL and the repo's metadata (stars, license, archived flag, default branch, topics) from the /repos/{owner}/{repo} endpoint. Repo metadata is cached per repo, so multiple SKILL.md files in the same repo share one fetch.
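The per-repo cache can be pictured with a minimal sketch like the one below. The class name, the injected fetch callable, and the sample data are all assumptions made for illustration; a real fetch would call /repos/{owner}/{repo}.

```python
from typing import Callable

RepoMeta = dict  # e.g. stars, license, archived flag, default branch, topics


class RepoMetaCache:
    """Memoize repo metadata so each repo costs at most one fetch (hypothetical)."""

    def __init__(self, fetch: Callable[[str], RepoMeta]) -> None:
        self._fetch = fetch
        self._cache: dict[str, RepoMeta] = {}
        self.fetch_count = 0

    def get(self, full_name: str) -> RepoMeta:
        key = full_name.lower()  # GitHub owner/repo names are case-insensitive
        if key not in self._cache:
            self._cache[key] = self._fetch(full_name)
            self.fetch_count += 1
        return self._cache[key]


def fake_fetch(full_name: str) -> RepoMeta:
    # Stand-in for the real metadata request; returns made-up data.
    return {"full_name": full_name, "stars": 0}


cache = RepoMetaCache(fake_fetch)
for repo in ["acme/tools", "acme/tools", "other/repo"]:
    cache.get(repo)
print(cache.fetch_count)  # 2: three hits, two unique repos
```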

Sourcing

Everything we index is public — Code Search only returns public repos, and the raw SKILL.md fetches use the same anonymous endpoint your browser would. We attach an authenticated GitHub token to lift the 60-req/hr unauthenticated ceiling to 5,000/hr; the token carries no repo scopes. Every Report Card links back to the original repo, and maintainers can ask us to remove their content.

ClawHub — public REST API

Mechanism

ClawHub exposes a public REST API at https://clawhub.ai/api/v1. We paginate GET /api/v1/skills?limit=200&sort=updated&cursor=… until nextCursor is null. For each skill the canonical content URL is constructed as /api/v1/skills/{slug}/file?path=SKILL.md&version={version} — exactly the URL the public ClawHub web app uses to render the same file.

Rate-limited at 3,000 requests per minute per their docs; we self-cap at 1,200 to leave headroom. No auth required.
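The pagination loop and content URL might look roughly like this. The endpoint paths, query parameters, and nextCursor come from the description above; the "items" response field, the get_json callable, and the function names are assumptions, since the response shape is not documented here.

```python
from typing import Callable, Iterator, Optional
from urllib.parse import quote, urlencode

BASE = "https://clawhub.ai/api/v1"


def list_url(cursor: Optional[str]) -> str:
    """URL for one page of the skills listing."""
    params = {"limit": 200, "sort": "updated"}
    if cursor:
        params["cursor"] = cursor
    return f"{BASE}/skills?{urlencode(params)}"


def content_url(slug: str, version: str) -> str:
    """Canonical SKILL.md URL for a given skill slug and version."""
    query = urlencode({"path": "SKILL.md", "version": version})
    return f"{BASE}/skills/{quote(slug, safe='')}/file?{query}"


def iter_skills(get_json: Callable[[str], dict]) -> Iterator[dict]:
    """Yield skills page by page until nextCursor is null."""
    cursor = None
    while True:
        page = get_json(list_url(cursor))
        yield from page.get("items", [])  # "items" is an assumed field name
        cursor = page.get("nextCursor")
        if cursor is None:
            break


# Two fake pages standing in for real HTTP responses.
fake_pages = {
    list_url(None): {"items": [{"slug": "a"}], "nextCursor": "c1"},
    list_url("c1"): {"items": [{"slug": "b"}], "nextCursor": None},
}
slugs = [skill["slug"] for skill in iter_skills(fake_pages.__getitem__)]
print(slugs)
```

A self-cap of 1,200 requests per minute works out to one request every 50 ms on average, which a caller could enforce with a simple sleep between get_json calls.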

Sourcing

We index the publicly available skills exposed by ClawHub's public /api/v1 REST API, using the documented cursor pagination. Every request sends a User-Agent: skillox-crawler header so ClawHub can identify our traffic in their logs. Maintainers can ask us to remove their content.

If a ClawHub-listed skill is taken down (404 on the file endpoint), the next nightly re-crawl marks it as removed in our catalog. We don't mirror file content — every Report Card links back to the original ClawHub URL.

Skills.sh — sitemap discovery

Mechanism

Skills.sh doesn't expose a REST API but does publish a standard sitemap at https://www.skills.sh/sitemap.xml. The index points at two sub-sitemaps that together list ~20,000 skill URLs in the canonical /{owner}/{repo}/{skill} shape.
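To make the discovery step concrete, here is a small sketch that pulls skill references out of one sub-sitemap. It assumes the file follows the standard sitemaps.org urlset schema; the sample XML and the skill_refs name are invented for illustration.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlparse

# Namespace used by the standard sitemap protocol.
NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def skill_refs(sitemap_xml: str) -> list[tuple[str, str, str]]:
    """Return (owner, repo, skill) for every URL shaped /{owner}/{repo}/{skill}."""
    root = ET.fromstring(sitemap_xml)
    refs = []
    for loc in root.findall("sm:url/sm:loc", NS):
        path = urlparse((loc.text or "").strip()).path
        parts = [p for p in path.split("/") if p]
        if len(parts) == 3:  # skip pages that are not skill URLs
            refs.append((parts[0], parts[1], parts[2]))
    return refs


SAMPLE = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://www.skills.sh/acme/tools/pdf-reader</loc></url>
  <url><loc>https://www.skills.sh/about</loc></url>
</urlset>"""

print(skill_refs(SAMPLE))
```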

For each URL we resolve the actual SKILL.md location: one cached GitHub /repos call per repo for the default branch + metadata, then one cached /git/trees call to find the exact in-repo path (which varies — some use skills/{name}/, some agent-skills/{name}/, some deep plugin-style paths). Both calls are cached per-repo so the API budget scales with unique repos, not URL count.
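The path lookup could be sketched as below, working on the entries of a recursive /git/trees response (each entry has a path and a type). It assumes the skill's directory is named after the last segment of the Skills.sh URL, which may not hold for every repo; the function name and the shallowest-match tie-break are illustrative choices, not the crawler's confirmed behavior.

```python
from typing import Optional


def find_skill_path(tree: list[dict], skill: str) -> Optional[str]:
    """Find the in-repo path of {skill}/SKILL.md among git tree entries."""
    candidates = [
        entry["path"]
        for entry in tree
        if entry.get("type") == "blob"
        and entry["path"].split("/")[-2:] == [skill, "SKILL.md"]
    ]
    # Prefer the shallowest match when several directories share the name.
    return min(candidates, key=lambda p: p.count("/"), default=None)


sample_tree = [
    {"path": "README.md", "type": "blob"},
    {"path": "skills/pdf-reader", "type": "tree"},
    {"path": "skills/pdf-reader/SKILL.md", "type": "blob"},
    {"path": "plugins/x/skills/pdf-reader/SKILL.md", "type": "blob"},
]
print(find_skill_path(sample_tree, "pdf-reader"))
```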

Sourcing

We discover URLs from the public sitemap Skills.sh publishes. The underlying SKILL.md content lives on GitHub (public repos) — same sourcing as the GitHub section for the actual content fetches. We link back to the original source, and maintainers can ask us to remove their content.

On rename / transfer (GitHub returns 301 on those repos), we follow the redirect and re-attribute the entry to the canonical owner/repo name so star counts + metadata match the real repo.
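A minimal sketch of the re-attribution step, assuming the HTTP client has already followed the 301 and the metadata response carries GitHub's full_name field. The row shape (repo, scans, previous_repos) is hypothetical and stands in for whatever the catalog actually stores.

```python
def reattribute(row: dict, meta: dict) -> dict:
    """Point a catalog row at the canonical owner/repo, keeping its history."""
    returned = meta.get("full_name", row["repo"])
    if returned.lower() == row["repo"].lower():
        return row  # same repo; nothing to change
    updated = dict(row)  # shallow copy: scan history is carried over untouched
    updated["repo"] = returned
    updated["previous_repos"] = row.get("previous_repos", []) + [row["repo"]]
    return updated


row = {"repo": "old-owner/tools", "scans": ["2024-01-01"]}
meta = {"full_name": "new-owner/tools", "stargazers_count": 12}
print(reattribute(row, meta))
```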

Creator submissions

Mechanism

Signed-in creators can submit a SKILL.md URL via /creators/{slug}/submit. The form runs the same scanner the public endpoint runs, persists the result to the catalog under the creator's slug, and offers a publish / withdraw toggle.

Sourcing

Submitted directly by the creator, who is signed in via GitHub OAuth and chose to list this skill publicly. They control removal and grade overrides via their dashboard.

Current totals

Live counts are visible on the home page's catalog stat strip and on /catalog (filterable by source via the sidebar). For a programmatic view, the public GET /catalog endpoint returns totals and accepts a source= filter.

Removal + re-attribution

Three ways a skill leaves the catalog:

Creator removes their own listing — soft-delete (removed = true) so the crawler skips it on re-discovery. Reversible.
Upstream takedown — when the source URL starts 404-ing for >48h across re-crawl attempts, we mark the row as removed and stop scoring it. The Report Card stays archive-viewable but no longer counts toward catalog totals.
Repo rename — handled automatically. The fetchRepoMeta path detects when GitHub returns a different full_name than we asked for (the 301-then-redirect case) and re-attributes the row to the new owner/repo without losing the scan history.
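The upstream-takedown rule in the list above can be sketched as a small predicate. It assumes the crawler records when the current unbroken run of 404s began and clears that timestamp on any successful fetch; the names and that bookkeeping are assumptions, not a description of the actual implementation.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

TAKEDOWN_WINDOW = timedelta(hours=48)


def should_mark_removed(
    first_404_at: Optional[datetime], last_status: int, now: datetime
) -> bool:
    """True once the source URL has been returning 404 for more than 48 hours."""
    if last_status != 404 or first_404_at is None:
        return False
    return now - first_404_at > TAKEDOWN_WINDOW


now = datetime(2024, 1, 10, tzinfo=timezone.utc)
print(should_mark_removed(now - timedelta(hours=72), 404, now))  # True
print(should_mark_removed(now - timedelta(hours=12), 404, now))  # False
```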

If you don't want your skill listed

Two options, both fast:

Claim + remove via the creator portal — gives you a permanent owner relationship so the catalog shows your removal request even if the skill is re-discovered later.
Email takedowns@skillox.io if claim verification isn't possible (skill is on a different GitHub username than yours, etc.). 24-hour turnaround.

Know of a source we should support? Open an issue at git.skillox.io. PRs welcome.