Guide · 6 min read · Mar 1, 2026

Knowledge Base File Types — Everything Appalix Can Ingest

Appalix can train your AI bot on a wide variety of documents, spreadsheets, presentations, and cloud sources. This guide covers every supported format, how each one is processed, and tips for getting the best results.

Supported file types & sources

The following formats can be added directly in the Sources section of your Appalix dashboard as a file upload, a URL, or pasted text.

🌐 Website URL (any public URL)

Appalix fetches the page, strips scripts and navigation, and indexes the readable body text. JavaScript-rendered pages are handled via a secondary reader.

Tip: For multi-page sites, use a Sitemap source instead so every page is indexed in a single step.
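
To make the "strips scripts and navigation" step concrete, here is a minimal sketch using only Python's standard library. It is not Appalix's actual reader: the list of skipped tags and the function name extract_readable_text are assumptions chosen for illustration, and the sketch does not cover JavaScript-rendered pages.

```python
from html.parser import HTMLParser

# Tags whose contents are dropped entirely (an assumed list for illustration).
SKIPPED_TAGS = {"script", "style", "nav", "header", "footer", "noscript"}


class ReadableTextParser(HTMLParser):
    """Collects text that is not nested inside any skipped tag."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self._parts = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and self._skip_depth == 0:
            self._parts.append(text)

    def text(self):
        return "\n".join(self._parts)


def extract_readable_text(html):
    """Return the visible body text of an HTML page, one text node per line."""
    parser = ReadableTextParser()
    parser.feed(html)
    parser.close()
    return parser.text()
```

If your bot's answers include menu labels or cookie-banner text, the page's navigation is probably not marked up with tags a reader like this can recognize; a Plain Text source with the cleaned-up copy is a reliable fallback.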

🗺️ Sitemap (sitemap.xml)

All <loc> URLs in the sitemap are fetched and indexed together (up to 50 pages). Ideal for documentation sites, blogs, and product knowledge bases.

Tip: Most CMS platforms generate a sitemap at /sitemap.xml automatically.
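
If you want to preview which pages a sitemap source would pick up, the sketch below lists the <loc> URLs in a sitemap and applies the 50-page cap. It is an illustration rather than Appalix's own code: the function name sitemap_urls is hypothetical, and taking the first 50 URLs in file order is an assumption, since this guide does not say how pages are chosen when a sitemap is larger.

```python
import xml.etree.ElementTree as ET

MAX_PAGES = 50  # page cap stated in this guide


def sitemap_urls(xml_text, limit=MAX_PAGES):
    """Return up to `limit` <loc> URLs from a sitemap document, in file order."""
    root = ET.fromstring(xml_text)
    urls = []
    for element in root.iter():
        # Tags look like "{namespace}loc"; compare the local name only.
        local_name = element.tag.rsplit("}", 1)[-1]
        if local_name == "loc" and element.text and element.text.strip():
            urls.append(element.text.strip())
            if len(urls) >= limit:
                break  # assumption: keep the first `limit` URLs in file order
    return urls
```

Running this against your own sitemap.xml before adding the source shows quickly whether the pages you care about fall inside the first 50 entries.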

📄 PDF (.pdf)

Text is extracted directly from the PDF using Claude's document API, which understands columns, tables, and multi-page layouts without losing structure.

Tip: Scanned PDFs (image-only) are also supported — Claude reads them visually.

📝 Word Document (.doc, .docx)

Raw text is extracted from the Word XML structure using the mammoth library, preserving headings and paragraphs.

Tip: Tables in Word docs are extracted as plain text rows.

📊 Excel Spreadsheet (.xls, .xlsx)

Every sheet in the workbook is converted to CSV text and labeled with its sheet name, so your bot can reference data by column header and sheet.

Tip: Keep column headers in row 1 — they become part of each cell's context when answering questions.
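
The sheet-to-CSV conversion can be pictured with the short sketch below. It assumes the workbook has already been read into memory as a mapping of sheet name to rows (reading .xlsx itself needs a third-party library and is left out), and the "Sheet: ..." label format and the function name workbook_to_text are hypothetical, not Appalix's actual output.

```python
import csv
import io


def workbook_to_text(sheets):
    """Render {sheet_name: rows} as CSV text, one labeled section per sheet."""
    sections = []
    for sheet_name, rows in sheets.items():
        buffer = io.StringIO()
        csv.writer(buffer, lineterminator="\n").writerows(rows)
        # The "Sheet: ..." label format is an assumption for illustration.
        sections.append(f"Sheet: {sheet_name}\n{buffer.getvalue().rstrip()}")
    return "\n\n".join(sections)
```

Because the header row travels with the sheet's text, descriptive column names such as "Monthly price (USD)" give the bot far more to work with than "Col B".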

📑 PowerPoint Presentation (.ppt, .pptx)

All text nodes are extracted from each slide's XML, prefixed with a slide number heading. Speaker notes are not currently indexed.

Tip: Presentation files with heavy imagery and minimal text may produce sparse results — add a text source with speaker notes or a summary for best coverage.
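
For the curious, a .pptx file is a ZIP archive in which each slide is an XML part under ppt/slides/, with visible text stored in DrawingML <a:t> elements. The sketch below shows one way to pull that text out with the standard library. It is an approximation, not Appalix's handler: it covers .pptx only (legacy .ppt is a different binary format), the "Slide N" heading format is hypothetical, and using the number in the file name as the slide order is an assumption, since the true order is defined elsewhere in the package.

```python
import io
import re
import zipfile
import xml.etree.ElementTree as ET

# DrawingML namespace used by <a:t> text runs inside slide XML.
DRAWINGML_NS = "{http://schemas.openxmlformats.org/drawingml/2006/main}"
SLIDE_PATH = re.compile(r"ppt/slides/slide(\d+)\.xml")


def pptx_to_text(pptx_bytes):
    """Return the text of every slide, each prefixed with a slide number heading."""
    sections = []
    with zipfile.ZipFile(io.BytesIO(pptx_bytes)) as archive:
        slides = []
        for name in archive.namelist():
            match = SLIDE_PATH.fullmatch(name)
            if match:  # notes parts live under ppt/notesSlides/ and never match
                slides.append((int(match.group(1)), name))
        # Assumption: the number in the file name is used as the slide order.
        for number, name in sorted(slides):
            root = ET.fromstring(archive.read(name))
            runs = [node.text for node in root.iter(DRAWINGML_NS + "t") if node.text]
            sections.append(f"Slide {number}\n" + "\n".join(runs))
    return "\n\n".join(sections)
```

Text that exists only inside images on a slide has no <a:t> element at all, which is why image-heavy decks come out sparse.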

📋 CSV File (.csv)

The file is read as plain UTF-8 text and chunked. Headers in row 1 are preserved, so the bot can correlate column values when answering questions.

Tip: Pair a CSV knowledge source with a system prompt instruction like "When asked about pricing, consult the pricing table." for more precise answers.

🖼️ Image (.jpg, .jpeg, .png, .webp, .gif)

Images are passed to Claude's vision API, which transcribes all visible text and briefly describes diagrams, charts, and non-text visuals.

Tip: Great for scanned handwritten notes, whiteboards, and infographics.

🗜️ ZIP Archive (.zip)

Appalix extracts the archive and indexes all readable text files inside: .txt, .md, .csv, .json, .xml, .html, and .htm. Binary files within the ZIP are skipped.

Tip: Use ZIP to bulk-upload multiple plain-text files or Markdown documentation sets in one go.
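
To check in advance which files in an archive would be indexed, the sketch below applies the extension list from this section to a ZIP held in memory. The function name readable_zip_entries is hypothetical, and decoding every entry as UTF-8 (replacing undecodable bytes) is an assumption rather than documented behavior.

```python
import io
import zipfile
from pathlib import PurePosixPath

# Extensions listed in this guide as readable inside a ZIP.
TEXT_EXTENSIONS = {".txt", ".md", ".csv", ".json", ".xml", ".html", ".htm"}


def readable_zip_entries(zip_bytes):
    """Return {path: text} for entries whose extension is in TEXT_EXTENSIONS."""
    documents = {}
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        for info in archive.infolist():
            if info.is_dir():
                continue
            suffix = PurePosixPath(info.filename).suffix.lower()
            if suffix not in TEXT_EXTENSIONS:
                continue  # skipped: e.g. .docx, .xlsx, .pdf, images
            # Assumption: entries are UTF-8; undecodable bytes are replaced.
            raw = archive.read(info)
            documents[info.filename] = raw.decode("utf-8", errors="replace")
    return documents
```

Anything missing from the returned mapping, such as a nested PDF or Word file, needs to be uploaded as its own source.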

✏️ Plain Text (paste directly)

Type or paste any raw text directly into Appalix — no file needed. Ideal for FAQs, product descriptions, policies, or anything you want to write inline.

Tip: Plain text sources are the fastest to create and re-sync.

Cloud source connectors

On the Pro plan and above, you can connect external services directly. Appalix fetches content from these sources using an API token you provide — no files need to be downloaded manually.

Google Drive (Pro+)

Google Docs, Sheets, and Slides are exported as plain text. Binary files in the drive are downloaded directly.

Dropbox (Pro+)

Files and shared links are downloaded and processed using the same type handlers as uploaded files.

OneDrive (Pro+)

Files are downloaded directly from your OneDrive through the Microsoft Graph API. Both shared links and item IDs are supported.

SharePoint (Pro+)

Index intranet content, policy documents, and internal wikis from your SharePoint site using a Microsoft Graph token.

Notion (Pro+)

Page blocks are fetched via the Notion API using your Internal Integration Token. Nested blocks are flattened to plain text.
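
"Flattened to plain text" means the nesting of a Notion page (toggles, indented bullets, and so on) is walked depth-first and reduced to a flat sequence of lines. The sketch below illustrates the idea on block dictionaries loosely modeled on Notion's block objects. It makes no API calls, and the "children" key is hypothetical: it assumes child blocks were already fetched and attached to their parent.

```python
def flatten_blocks(blocks):
    """Depth-first walk that returns one plain-text line per non-empty block."""
    lines = []
    for block in blocks:
        # A block keeps its content under a key named after its type.
        payload = block.get(block.get("type", ""), {})
        text = "".join(
            part.get("plain_text", "") for part in payload.get("rich_text", [])
        )
        if text.strip():
            lines.append(text.strip())
        # "children" is a hypothetical key: assumed to be filled in beforehand.
        lines.extend(flatten_blocks(block.get("children", [])))
    return lines
```

One consequence of flattening is that indentation carries no meaning afterwards, so a nested bullet should make sense when read without its parent.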

GitBook
GitBookPro+

All pages in your GitBook space are fetched through the GitBook Content API, using a personal token, and then indexed.

How ingestion works

When you add or re-sync a source, Appalix runs it through the following pipeline:

  1. Content extraction: the file or URL is processed using the appropriate handler for its type (see the supported file types and sources above).
  2. Chunking: extracted text is split into 1,500-character segments, with a 200-character overlap between consecutive segments to preserve context at chunk boundaries.
  3. Embedding: each chunk is converted to a 1,536-dimension vector using OpenAI's text-embedding-3-small model.
  4. Storage: vectors are stored in a pgvector table scoped to your workspace, enabling millisecond-fast similarity search.
  5. Retrieval: at chat time, the user's question is embedded and the top matching chunks are injected into the AI's context window before it replies.
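
The chunking and retrieval steps above can be sketched in a few lines of Python. Treat this as a simplified model under stated assumptions, not Appalix's implementation: it cuts on raw character offsets, uses cosine similarity as the ranking measure, and ranks chunks with a brute-force loop where the real pipeline queries pgvector. The function names and the default k=3 are hypothetical, and the embedding call is left out entirely.

```python
import math

CHUNK_SIZE = 1500  # characters per chunk, from step 2
OVERLAP = 200      # characters shared by consecutive chunks, from step 2


def chunk_text(text, size=CHUNK_SIZE, overlap=OVERLAP):
    """Split text into windows of `size` characters that overlap by `overlap`."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be non-negative and smaller than size")
    step = size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + size])
        if start + size >= len(text):
            break  # this window already reaches the end of the text
    return chunks


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_matches(query_vector, indexed_chunks, k=3):
    """Return the k chunk texts whose vectors are most similar to the query.

    `indexed_chunks` is a list of (text, vector) pairs; k=3 is an assumed default.
    """
    ranked = sorted(
        indexed_chunks,
        key=lambda item: cosine_similarity(query_vector, item[1]),
        reverse=True,
    )
    return [text for text, _ in ranked[:k]]
```

With these settings each new chunk starts 1,300 characters after the previous one, so a 3,000-character document yields three chunks (1,500, 1,500, and 400 characters), and the last 200 characters of one chunk reappear at the start of the next.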

Limits & best practices

  • Maximum file size: 50 MB per upload.
  • Sitemap pages: up to 50 URLs per sitemap source.
  • Excel workbooks: all sheets are indexed. Very large workbooks (> 10,000 rows) may produce a high chunk count — consider filtering to relevant sheets first.
  • ZIP files: only text-based files inside the archive are extracted (.txt, .md, .csv, .json, .xml, .html, .htm). Word, Excel, and PDF files nested inside a ZIP are not currently processed; upload those separately.
  • Re-sync anytime: clicking Re-sync on a source deletes all existing chunks and re-ingests from scratch. Do this after updating a document.
  • Source scope: all sources in a workspace are shared across all bots in that workspace. A bot only queries its knowledge base when Knowledge Base is enabled on the bot's settings page.

Frequently asked questions

Can I upload multiple files at once?

Not yet — each source is added individually. To bulk-upload, put all your text files into a single ZIP archive and upload that.

Are scanned PDFs supported?

Yes. Claude's vision capabilities read scanned and image-only PDFs. Quality depends on scan resolution — 300 DPI or above gives the best results.

How long does ingestion take?

Most sources process in under 30 seconds. Large sitemaps or multi-sheet Excel files may take 1–2 minutes. The source status changes from Pending → Processing → Ready in real time.

My source shows "failed" — what should I check?

Open the source row in your dashboard to see the error message. Common causes: the URL is behind a login wall, the file is corrupt or password-protected, or an API token has expired. Fix the issue and click Re-sync.

Does Appalix store my documents?

Uploaded files are stored in Supabase Storage within your workspace bucket and are never shared with other customers. Only the extracted text chunks (not the original file) are used for AI retrieval.
