Indexing
Indexing is the process of crawling your sources, extracting content, and organizing it for fast search.
What happens during indexing
- Each source URL is crawled (web pages are fetched, GitHub repos are accessed via API)
- Content is extracted and cleaned (HTML stripped, code blocks preserved)
- Text is split into searchable chunks
- Each chunk is indexed for fast retrieval
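The last two steps above can be pictured with a small sketch. It is only an illustration: the paragraph-based splitting, the `max_chars` value, and the keyword-style inverted index are all assumptions, and the function names are hypothetical. The actual chunking and index format used by the service are not documented here.

```python
import re
from collections import defaultdict


def split_into_chunks(text: str, max_chars: int = 1000) -> list[str]:
    """Greedily pack blank-line-separated paragraphs into chunks.

    Each chunk is at most max_chars long, except when a single paragraph
    is itself longer than max_chars; that paragraph becomes its own chunk.
    """
    chunks: list[str] = []
    current = ""
    for para in text.split("\n\n"):
        para = para.strip()
        if not para:
            continue
        candidate = f"{current}\n\n{para}" if current else para
        if len(candidate) <= max_chars or not current:
            current = candidate
        else:
            chunks.append(current)
            current = para
    if current:
        chunks.append(current)
    return chunks


def build_inverted_index(chunks: list[str]) -> dict[str, set[int]]:
    """Map each lowercase token to the set of chunk positions containing it."""
    index: dict[str, set[int]] = defaultdict(set)
    for position, chunk in enumerate(chunks):
        for token in re.findall(r"[a-z0-9]+", chunk.lower()):
            index[token].add(position)
    return index
```

Splitting on paragraph boundaries keeps related sentences together, which tends to give more useful search hits than cutting at a fixed character offset.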
Indexing status
Each source shows one of these statuses:
| Status | Meaning |
|---|---|
| Pending | Source has been added but crawling hasn't started yet |
| Indexing | Content is being crawled and processed |
| Complete | All content has been indexed and is searchable |
| Error | Something went wrong — see the inline error reason below the source |
Error reasons
When a source ends up in the Error state, the dashboard surfaces a machine-readable reason:
| Reason | Cause | Action |
|---|---|---|
| `auth_required` | Private GitHub repo, but we don't have a valid token for your account | Click Link GitHub next to the source |
| `not_found` | Repo wasn't found with the linked GitHub account (renamed, deleted, or revoked access) | Verify the URL / re-grant org access |
| `rate_limited` | GitHub returned 403 because the API rate limit was hit | Click Refresh in a minute |
| `network` | Crawl timed out before completing | Click Refresh |
| `unknown` | Generic crawler failure | Click Refresh |
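As a rough illustration of how these reasons might be derived from a failed GitHub request, consider the sketch below. The function name, its parameters, and the use of a remaining-quota count of zero to recognize rate limiting are assumptions made for this example; the crawler's real decision logic is not shown in this document.

```python
from typing import Optional

# Suggested actions, copied from the table above.
ACTIONS = {
    "auth_required": "Click Link GitHub next to the source",
    "not_found": "Verify the URL / re-grant org access",
    "rate_limited": "Click Refresh in a minute",
    "network": "Click Refresh",
    "unknown": "Click Refresh",
}


def classify_crawl_error(
    status: Optional[int],
    has_token: bool,
    rate_limit_remaining: Optional[int] = None,
    timed_out: bool = False,
) -> str:
    """Map a failed request to one of the documented error reasons.

    status is None when no HTTP response was received at all.
    """
    if timed_out:
        return "network"
    # Assumption: a 403 with no remaining quota means the rate limit was hit.
    if status == 403 and rate_limit_remaining == 0:
        return "rate_limited"
    if status in (401, 403, 404):
        # With a token, the repo is missing or inaccessible; without one,
        # the user still needs to link a GitHub account.
        return "not_found" if has_token else "auth_required"
    return "unknown"
```

For example, `ACTIONS[classify_crawl_error(404, has_token=False)]` yields the Link GitHub prompt.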
Typical indexing times
- 1-3 sources: Under 1 minute
- Large docs sites (50+ pages): 2-5 minutes
- GitHub repos: Under 30 seconds
Using your server during indexing
Your MCP server is usable immediately after deployment, even while sources are still indexing. During this period, it falls back to live fetching: queries still work, but they are slightly slower.
Once indexing completes, searches use the pre-built index for much faster results.
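The fallback behavior can be summarized in a few lines. This is a conceptual sketch only: `answer_query`, `search_index`, and `live_fetch` are hypothetical names, and treating every status other than Complete as a reason to fall back is an assumption.

```python
from typing import Callable


def answer_query(
    query: str,
    source_status: str,
    search_index: Callable[[str], list[str]],
    live_fetch: Callable[[str], list[str]],
) -> list[str]:
    """Use the pre-built index when it is ready, otherwise fetch live."""
    if source_status == "Complete":
        return search_index(query)
    # Assumption: any other status (Pending, Indexing, Error) falls back
    # to the slower live-fetch path.
    return live_fetch(query)
```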
Refreshing sources
You can refresh sources in two ways:
- Manual refresh: Click the refresh button next to any source on the dashboard
- Auto-refresh: All sources are automatically re-indexed daily at 7:00 AM US Central Time
Refreshing re-crawls the source from scratch to pick up any content changes.
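If you need to know when the next automatic refresh will run in your own time zone, a calculation along these lines may help. It assumes that "Central US time" corresponds to the IANA zone `America/Chicago` and that the schedule follows daylight saving time; neither detail is confirmed by this page, and `next_auto_refresh` is a hypothetical helper.

```python
from datetime import datetime, time, timedelta, timezone
from zoneinfo import ZoneInfo

# Assumption: the daily schedule is defined in this zone.
CENTRAL = ZoneInfo("America/Chicago")
REFRESH_AT = time(7, 0)


def next_auto_refresh(now: datetime) -> datetime:
    """Return the next 7:00 AM Central after `now` (a timezone-aware datetime)."""
    local_now = now.astimezone(CENTRAL)
    candidate = datetime.combine(local_now.date(), REFRESH_AT, tzinfo=CENTRAL)
    if candidate <= local_now:
        next_day = local_now.date() + timedelta(days=1)
        candidate = datetime.combine(next_day, REFRESH_AT, tzinfo=CENTRAL)
    return candidate


# Usage: express the next refresh in UTC.
upcoming_utc = next_auto_refresh(datetime.now(timezone.utc)).astimezone(timezone.utc)
```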
Page limits
- Website sources are limited to 1,000 pages per source
- If a site exceeds this limit, consider connecting the GitHub repo instead for full coverage
- GitHub repos have no page limit
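To see what the 1,000-page cap means in practice, here is a minimal sketch of a capped crawl. The breadth-first order, the `get_links` callback, and the `truncated` flag are assumptions for illustration; the page does not describe which pages are kept when a site exceeds the limit.

```python
from collections import deque
from typing import Callable, Iterable

MAX_PAGES_PER_SOURCE = 1000  # limit stated above


def crawl_with_limit(
    start_url: str,
    get_links: Callable[[str], Iterable[str]],
    max_pages: int = MAX_PAGES_PER_SOURCE,
) -> tuple[list[str], bool]:
    """Visit pages breadth-first until max_pages have been crawled.

    Returns the crawled URLs and a flag that is True when discovered
    pages were left unvisited because of the limit.
    """
    seen = {start_url}
    queue = deque([start_url])
    crawled: list[str] = []
    while queue and len(crawled) < max_pages:
        url = queue.popleft()
        crawled.append(url)
        for link in get_links(url):
            if link not in seen:
                seen.add(link)
                queue.append(link)
    truncated = bool(queue)
    return crawled, truncated
```

A `truncated` result of `True` is the situation in which connecting the GitHub repo instead would give fuller coverage.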
How GitHub authentication flows into the crawler
For GitHub sources, the crawler looks up the server owner's linked GitHub `Account.access_token` (NextAuth). It uses that token to call:
- `GET /repos/:owner/:name`: used as a permissions probe. A 401, 403, or 404 here marks the source as `crawlStatus = error` with `crawlError = auth_required` (no token) or `not_found` (token present but no access). This is what lets the dashboard render an actionable Link GitHub button instead of silently producing an empty source.
- `GET /repos/:owner/:name/readme`: fetched for the indexed README chunk.
- `GET /repos/:owner/:name/git/trees/HEAD?recursive=1`: used to enumerate Markdown and other text documentation files (`.md`, `.mdx`, `.txt`, `.rst`, `.adoc`) under 100 KB.
- `GET /repos/:owner/:name/contents/{path}`: fetched per file, in batches of 5.
The OAuth scope requested is `read:user user:email repo`. The `repo` scope grants read access to private repositories owned by, or shared with, the user.
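The file enumeration and batching steps described above could look roughly like the following. The entry fields (`path`, `type`, `size`) follow the shape of GitHub's Git trees response; the helper names are hypothetical, and reading "100 KB" as 102,400 bytes is an assumption.

```python
DOC_EXTENSIONS = (".md", ".mdx", ".txt", ".rst", ".adoc")
MAX_FILE_BYTES = 100 * 1024  # assumption: "100 KB" taken as 102,400 bytes
BATCH_SIZE = 5


def select_doc_paths(tree: list[dict]) -> list[str]:
    """Pick documentation files under the size limit from a recursive tree listing."""
    return [
        entry["path"]
        for entry in tree
        if entry.get("type") == "blob"  # skip directories ("tree" entries)
        and entry["path"].lower().endswith(DOC_EXTENSIONS)
        and entry.get("size", 0) < MAX_FILE_BYTES
    ]


def batched(items: list, size: int = BATCH_SIZE) -> list[list]:
    """Split items into consecutive groups of at most `size` elements."""
    return [items[i:i + size] for i in range(0, len(items), size)]
```

Each batch would then be passed to the per-file contents request, so at most five files are in flight at a time.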