Indexing

Indexing is the process of crawling your sources, extracting content, and organizing it for fast search.

What happens during indexing

Each source URL is crawled (web pages are fetched, GitHub repos are accessed via API)
Content is extracted and cleaned (HTML stripped, code blocks preserved)
Text is split into searchable chunks
Each chunk is indexed for fast retrieval
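
The chunking step can be pictured with a minimal sketch. The actual chunking strategy and chunk size are not documented here, so the paragraph-based packing, the `split_into_chunks` name, and the `max_chars` default below are all illustrative assumptions.

```python
def split_into_chunks(text: str, max_chars: int = 200) -> list[str]:
    """Greedily pack blank-line-separated paragraphs into chunks.

    Hypothetical helper: each chunk holds as many whole paragraphs as fit
    in max_chars. A single paragraph longer than max_chars is kept whole
    rather than cut mid-sentence.
    """
    chunks: list[str] = []
    current = ""
    for paragraph in (p.strip() for p in text.split("\n\n")):
        if not paragraph:
            continue  # skip empty paragraphs produced by extra blank lines
        candidate = f"{current}\n\n{paragraph}" if current else paragraph
        if len(candidate) <= max_chars or not current:
            current = candidate
        else:
            chunks.append(current)
            current = paragraph
    if current:
        chunks.append(current)
    return chunks
```

Keeping paragraphs intact is one plausible design choice: it avoids splitting a code block or sentence across two chunks, at the cost of uneven chunk sizes.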

Indexing status

Each source shows one of these statuses:

Status	Meaning
Pending	Source has been added but crawling hasn't started yet
Indexing	Content is being crawled and processed
Complete	All content has been indexed and is searchable
Error	Something went wrong — see the inline error reason below the source

Error reasons

When a source ends up in the Error state, the dashboard surfaces a machine-readable reason:

Reason	Cause	Action
`auth_required`	Private GitHub repo, but we don't have a valid token for your account	Click Link GitHub next to the source
`not_found`	Repo wasn't found with the linked GitHub account (renamed, deleted, or access revoked)	Verify the URL or re-grant org access
`rate_limited`	GitHub returned 403 because the API rate limit was hit	Click Refresh in a minute
`network`	Crawl timed out before completing	Click Refresh
`unknown`	Generic crawler failure	Click Refresh
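
Because the reasons are machine-readable, a script or internal tool could map them to the suggested actions. The lookup below is a sketch only: the action strings mirror the table above, the `suggested_action` helper is hypothetical, and how a client obtains the reason string is not covered here.

```python
# Reason codes and actions taken from the table above.
SUGGESTED_ACTIONS = {
    "auth_required": "Click Link GitHub next to the source",
    "not_found": "Verify the URL or re-grant org access",
    "rate_limited": "Click Refresh in a minute",
    "network": "Click Refresh",
    "unknown": "Click Refresh",
}


def suggested_action(reason: str) -> str:
    """Return the suggested action for a reason code.

    Unrecognized codes are treated like "unknown" (an assumption made
    for this sketch, not documented behavior).
    """
    return SUGGESTED_ACTIONS.get(reason, SUGGESTED_ACTIONS["unknown"])
```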

Typical indexing times

1-3 sources: Under 1 minute
Large docs sites (50+ pages): 2-5 minutes
GitHub repos: Under 30 seconds

Using your server during indexing

Your MCP server is usable immediately after deployment, even while sources are still indexing. Until indexing completes, it falls back to fetching sources live, so queries still work but are slower.

Once indexing completes, searches use the pre-built index for much faster results.
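
Conceptually, the choice between the two paths looks like the sketch below. The `answer_query` function and its callable parameters are hypothetical stand-ins, and routing every non-Complete status (including Error) to live fetching is an assumption; only the fallback while a source is still indexing is described above.

```python
from typing import Callable


def answer_query(
    query: str,
    source_status: str,
    search_index: Callable[[str], list[str]],
    live_fetch: Callable[[str], list[str]],
) -> list[str]:
    """Route a query to the pre-built index or to live fetching."""
    # A fully indexed source is served from the pre-built index.
    if source_status == "Complete":
        return search_index(query)
    # Assumption: any other status falls back to the slower live fetch.
    return live_fetch(query)
```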

Refreshing sources

You can refresh sources in two ways:

Manual refresh: Click the refresh button next to any source on the dashboard
Auto-refresh: All sources are automatically re-indexed daily at 7:00 AM Central US time

Refreshing re-crawls the source from scratch to pick up any content changes.
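
To work out when the next auto-refresh falls in another time zone, a small calculation like the following may help. It assumes "Central US time" corresponds to the IANA zone `America/Chicago` and that the schedule follows daylight saving time; `next_auto_refresh` is a hypothetical helper, not part of the product.

```python
from datetime import datetime, time, timedelta
from zoneinfo import ZoneInfo

# Assumption: "Central US time" means the America/Chicago zone.
CENTRAL = ZoneInfo("America/Chicago")
REFRESH_TIME = time(7, 0)


def next_auto_refresh(now: datetime) -> datetime:
    """Return the next 7:00 AM Central after `now` (must be timezone-aware)."""
    local_now = now.astimezone(CENTRAL)
    candidate = datetime.combine(local_now.date(), REFRESH_TIME, tzinfo=CENTRAL)
    if candidate <= local_now:
        # Today's refresh has already run, so use tomorrow's.
        next_day = local_now.date() + timedelta(days=1)
        candidate = datetime.combine(next_day, REFRESH_TIME, tzinfo=CENTRAL)
    return candidate
```

Calling `next_auto_refresh(datetime.now(ZoneInfo("Europe/Berlin"))).astimezone(ZoneInfo("Europe/Berlin"))` would express the result in Berlin time, for example.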

Page limits

Website sources are limited to 1,000 pages per source
If a site exceeds this limit, consider connecting the GitHub repo instead for full coverage
GitHub repos have no page limit

How GitHub authentication flows into the crawler

For GitHub sources, the crawler looks up the access token of the server owner's linked GitHub account (the NextAuth Account.access_token field) and uses it to call:

GET /repos/:owner/:name is used as a permissions probe. A 401, 403, or 404 response marks the source as crawlStatus = error, with crawlError = auth_required (no token) or not_found (token present but no access). This lets the dashboard render an actionable Link GitHub button instead of silently producing an empty source.
GET /repos/:owner/:name/readme fetches the README for the indexed README chunk.
GET /repos/:owner/:name/git/trees/HEAD?recursive=1 enumerates documentation files (.md, .mdx, .txt, .rst, .adoc) under 100 KB.
GET /repos/:owner/:name/contents/{path} fetches each file's content, in batches of 5.
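
The decision logic in the calls above can be sketched without any network access. This is an illustration under stated assumptions, not the crawler's actual code: the function names are hypothetical, "100 KB" is read as 100 * 1024 bytes, extension matching is assumed to be case-insensitive, and rate-limit handling is left out. The `path`, `type`, and `size` keys follow the entries returned by GitHub's Git Trees API.

```python
from typing import Optional

DOC_EXTENSIONS = (".md", ".mdx", ".txt", ".rst", ".adoc")
MAX_FILE_BYTES = 100 * 1024  # assumption: "100 KB" means 100 * 1024 bytes
BATCH_SIZE = 5


def classify_probe(status_code: int, has_token: bool) -> Optional[str]:
    """Map the permissions-probe status to a crawlError, or None if it passed."""
    if status_code in (401, 403, 404):
        return "not_found" if has_token else "auth_required"
    return None


def select_doc_paths(tree: list[dict]) -> list[str]:
    """Keep blob entries with a documentation extension and size under the cap."""
    return [
        entry["path"]
        for entry in tree
        if entry.get("type") == "blob"
        and entry["path"].lower().endswith(DOC_EXTENSIONS)
        and entry.get("size", 0) < MAX_FILE_BYTES
    ]


def batches(paths: list[str], size: int = BATCH_SIZE) -> list[list[str]]:
    """Split paths into consecutive groups of at most `size` items."""
    return [paths[i:i + size] for i in range(0, len(paths), size)]
```

Each group returned by `batches` would then correspond to one round of contents requests.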

The OAuth scopes requested are read:user, user:email, and repo. The repo scope grants read access to private repositories owned by, or shared with, the user.
