Appa Tools documentation for MCP Studio, including setup, guides, concepts, and API-related reference content.

Indexing

Indexing is the process of crawling your sources, extracting content, and organizing it for fast search.

What happens during indexing

  1. Each source URL is crawled (web pages are fetched, GitHub repos are accessed via API)
  2. Content is extracted and cleaned (HTML stripped, code blocks preserved)
  3. Text is split into searchable chunks
  4. Each chunk is indexed for fast retrieval
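
Steps 3 and 4 above can be illustrated with a small sketch. The actual chunking strategy and index structure are not documented here, so everything below is an assumption made for illustration: the function names (split_into_chunks, build_index, search), the paragraph-based splitting, the max_chars parameter, and the inverted index are all hypothetical. Crawling and HTML cleaning (steps 1 and 2) are left out.

```python
import re
from collections import defaultdict


def split_into_chunks(text: str, max_chars: int = 200) -> list[str]:
    """Greedily pack paragraphs into chunks of at most max_chars characters.

    A single paragraph longer than max_chars is kept whole as its own chunk.
    """
    chunks: list[str] = []
    current = ""
    for paragraph in re.split(r"\n\s*\n", text.strip()):
        paragraph = paragraph.strip()
        if not paragraph:
            continue
        candidate = f"{current}\n\n{paragraph}" if current else paragraph
        if len(candidate) <= max_chars or not current:
            current = candidate
        else:
            chunks.append(current)
            current = paragraph
    if current:
        chunks.append(current)
    return chunks


def build_index(chunks: list[str]) -> dict[str, set[int]]:
    """Build an inverted index: lowercase token -> ids of chunks containing it."""
    index: dict[str, set[int]] = defaultdict(set)
    for chunk_id, chunk in enumerate(chunks):
        for token in re.findall(r"[a-z0-9]+", chunk.lower()):
            index[token].add(chunk_id)
    return dict(index)


def search(index: dict[str, set[int]], query: str) -> set[int]:
    """Return ids of chunks that contain every token of the query."""
    tokens = re.findall(r"[a-z0-9]+", query.lower())
    if not tokens:
        return set()
    result = set(index.get(tokens[0], set()))
    for token in tokens[1:]:
        result &= index.get(token, set())
    return result
```

The point of the sketch is the shape of the work: once chunks exist and a lookup structure has been built, a query only touches that structure instead of re-reading the source.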

Indexing status

Each source shows one of these statuses:

Status   | Meaning
Pending  | Source has been added but crawling hasn't started yet
Indexing | Content is being crawled and processed
Complete | All content has been indexed and is searchable
Error    | Something went wrong; see the inline error reason below the source

Error reasons

When a source ends up in the Error state, the dashboard surfaces a machine-readable reason:

Reason        | Cause                                                                                    | Action
auth_required | Private GitHub repo, but we don't have a valid token for your account                    | Click Link GitHub next to the source
not_found     | Repo wasn't found with the linked GitHub account (renamed, deleted, or revoked access)   | Verify the URL / re-grant org access
rate_limited  | GitHub returned 403 because the API rate limit was hit                                   | Click Refresh in a minute
network       | Crawl timed out before completing                                                        | Click Refresh
unknown       | Generic crawler failure                                                                  | Click Refresh
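
Because the reasons are machine-readable, a client can map them to the suggested actions programmatically. The sketch below is a minimal illustration. How the reasons are exposed outside the dashboard is not documented here, so the CrawlError enum and the suggested_action helper are hypothetical names, and treating unrecognized values as unknown is an assumption.

```python
from enum import Enum


class CrawlError(str, Enum):
    """Error reasons listed in the table above."""
    AUTH_REQUIRED = "auth_required"
    NOT_FOUND = "not_found"
    RATE_LIMITED = "rate_limited"
    NETWORK = "network"
    UNKNOWN = "unknown"


# Suggested actions, copied from the table above.
SUGGESTED_ACTION = {
    CrawlError.AUTH_REQUIRED: "Click Link GitHub next to the source",
    CrawlError.NOT_FOUND: "Verify the URL / re-grant org access",
    CrawlError.RATE_LIMITED: "Click Refresh in a minute",
    CrawlError.NETWORK: "Click Refresh",
    CrawlError.UNKNOWN: "Click Refresh",
}


def suggested_action(reason: str) -> str:
    """Return the suggested action for a reason string.

    Assumption: any value not in the table is handled like "unknown".
    """
    try:
        key = CrawlError(reason)
    except ValueError:
        key = CrawlError.UNKNOWN
    return SUGGESTED_ACTION[key]
```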

Typical indexing times

  • 1-3 sources: Under 1 minute
  • Large docs sites (50+ pages): 2-5 minutes
  • GitHub repos: Under 30 seconds

Using your server during indexing

Your MCP server is usable immediately after deployment, even while sources are still indexing. During this period it falls back to live fetching: queries still work, they're just slightly slower.

Once indexing completes, searches use the pre-built index for much faster results.
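
The fallback behavior can be pictured as a simple routing decision. This is a sketch only: answer_query, search_index, and live_fetch are hypothetical names, the lowercase status strings are assumed, and treating Pending and Error sources the same way as Indexing ones is an assumption the text above does not confirm.

```python
from typing import Callable, List


def answer_query(
    query: str,
    source_status: str,
    search_index: Callable[[str], List[str]],
    live_fetch: Callable[[str], List[str]],
) -> List[str]:
    """Route a query to the pre-built index or to live fetching."""
    # Only a completed index is used; every other status falls back
    # to live fetching (assumption for "pending" and "error").
    if source_status == "complete":
        return search_index(query)
    return live_fetch(query)
```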

Refreshing sources

You can refresh sources in two ways:

  • Manual refresh: Click the refresh button next to any source on the dashboard
  • Auto-refresh: All sources are automatically re-indexed daily at 7:00 AM US Central Time

Refreshing re-crawls the source from scratch to pick up any content changes.
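
To work out when the next auto-refresh falls in your own time zone, a small calculation like the one below may help. It assumes that US Central Time corresponds to the IANA zone America/Chicago and that the schedule follows daylight saving time; neither detail is stated above. The next_auto_refresh function is a hypothetical helper, not part of any API.

```python
from datetime import datetime, time, timedelta, timezone
from zoneinfo import ZoneInfo

# Assumption: "Central" maps to the IANA zone America/Chicago.
CENTRAL = ZoneInfo("America/Chicago")
REFRESH_TIME = time(7, 0)


def next_auto_refresh(now: datetime) -> datetime:
    """Return the next 7:00 AM Central strictly after `now`.

    `now` must be timezone-aware.
    """
    local_now = now.astimezone(CENTRAL)
    candidate = datetime.combine(local_now.date(), REFRESH_TIME, tzinfo=CENTRAL)
    if candidate <= local_now:
        # Today's refresh has already happened; use tomorrow's.
        next_day = local_now.date() + timedelta(days=1)
        candidate = datetime.combine(next_day, REFRESH_TIME, tzinfo=CENTRAL)
    return candidate


# Example: convert the next refresh to UTC for display.
upcoming_utc = next_auto_refresh(datetime.now(timezone.utc)).astimezone(timezone.utc)
```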

Page limits

  • Website sources are limited to 1,000 pages per source
  • If a site exceeds this limit, consider connecting the GitHub repo instead for full coverage
  • GitHub repos have no page limit
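
A capped crawl can be sketched as follows. The crawl order and the way the limit is applied are not documented, so the breadth-first traversal, the crawl function, and the get_links callback are assumptions made for illustration; only the 1,000-page figure comes from the text above.

```python
from collections import deque
from typing import Callable, Iterable, List, Tuple

MAX_PAGES_PER_SOURCE = 1000  # limit stated above


def crawl(
    start_url: str,
    get_links: Callable[[str], Iterable[str]],
    max_pages: int = MAX_PAGES_PER_SOURCE,
) -> Tuple[List[str], bool]:
    """Visit pages breadth-first until the page limit is reached.

    Returns the visited URLs and a flag that is True when discovered
    pages were left unvisited because of the limit.
    """
    seen = {start_url}
    queue = deque([start_url])
    visited: List[str] = []
    while queue and len(visited) < max_pages:
        url = queue.popleft()
        visited.append(url)
        for link in get_links(url):
            if link not in seen:
                seen.add(link)
                queue.append(link)
    truncated = bool(queue)
    return visited, truncated
```

With a breadth-first order, a truncated crawl keeps the pages closest to the start URL, which is one reason a very large site may be better served by connecting its GitHub repo.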

How GitHub authentication flows into the crawler

For GitHub sources, the crawler looks up the access token of the server owner's linked GitHub account (the NextAuth Account.access_token field). It uses that token to call:

  1. GET /repos/:owner/:name: used as a permissions probe. A 401, 403, or 404 response sets crawlStatus = error, with crawlError = auth_required when no token is linked, or not_found when a token is present but does not grant access. This is what lets the dashboard render an actionable Link GitHub button instead of silently producing an empty source.
  2. GET /repos/:owner/:name/readme: fetches the README, which is indexed as its own chunk.
  3. GET /repos/:owner/:name/git/trees/HEAD?recursive=1: used to enumerate documentation files (.md, .mdx, .txt, .rst, .adoc) under 100 KB.
  4. GET /repos/:owner/:name/contents/{path}: called once per file, in batches of 5.
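
The decision logic in these steps can be sketched without making any network calls. The helper names (classify_probe, select_doc_files, batched) are hypothetical, as are the choices to match extensions case-insensitively and to read 100 KB as 100 × 1024 bytes. The tree entries follow the shape returned by GitHub's Git trees endpoint (path, type, size). How a rate-limit 403 is told apart from a permissions 403 is not described above, so the sketch does not attempt it.

```python
from typing import Dict, Iterator, List, Optional

DOC_EXTENSIONS = (".md", ".mdx", ".txt", ".rst", ".adoc")
MAX_FILE_BYTES = 100 * 1024  # assumption: 1 KB = 1024 bytes
BATCH_SIZE = 5


def classify_probe(status_code: int, has_token: bool) -> Optional[str]:
    """Map the permissions-probe response to a crawlError, or None if OK."""
    if status_code in (401, 403, 404):
        return "not_found" if has_token else "auth_required"
    return None


def select_doc_files(tree: List[Dict]) -> List[str]:
    """Keep paths of blobs with a documentation extension under the size cap."""
    return [
        entry["path"]
        for entry in tree
        if entry.get("type") == "blob"
        # Case-insensitive match is an assumption.
        and entry["path"].lower().endswith(DOC_EXTENSIONS)
        and entry.get("size", 0) < MAX_FILE_BYTES
    ]


def batched(paths: List[str], size: int = BATCH_SIZE) -> Iterator[List[str]]:
    """Yield consecutive groups of at most `size` paths."""
    for start in range(0, len(paths), size):
        yield paths[start:start + size]
```

Each batch produced by batched would then correspond to one round of contents requests, one request per path.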

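The OAuth scopes requested are read:user, user:email, and repo. The repo scope is what allows the crawler to read private repositories owned by, or shared with, the user. GitHub defines repo as full repository access, although the calls listed above are all read-only GET requests.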