The Data Ingestion stage is where raw content enters your RAG pipeline. This process ensures your data is clean, formatted, and prepared for chunking and embedding.

Data Ingestion Methods

LarkupRAG supports three ingestion methods:
Paste text snippets or write content directly into the UI.
Upload PDFs, markdown files, and standard text documents directly from your computer.
Extract content directly from URLs using Firecrawl, or perform programmatic Google searches via Serper.dev.
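
To make the programmatic search option concrete, here is a minimal sketch of the kind of request a Serper.dev search involves. It is not LarkupRAG's own code: the function name `serper_search_request` is hypothetical, and the endpoint and `X-API-KEY` header reflect Serper.dev's public API as commonly documented. The function only builds the request, so it can be inspected without network access.

```python
import json
import os
from typing import Mapping, Tuple

# Serper.dev's search endpoint (assumption: check the current Serper docs).
SERPER_ENDPOINT = "https://google.serper.dev/search"


def serper_search_request(
    query: str, env: Mapping[str, str] = os.environ
) -> Tuple[str, dict, str]:
    """Return (endpoint, headers, json_body) for a Serper.dev search.

    Hypothetical helper for illustration; reads SERPER_API_KEY from `env`.
    """
    api_key = env.get("SERPER_API_KEY")
    if not api_key:
        raise ValueError("SERPER_API_KEY is not set")
    headers = {"X-API-KEY": api_key, "Content-Type": "application/json"}
    body = json.dumps({"q": query})
    return SERPER_ENDPOINT, headers, body
```

The returned triple can be handed to any HTTP client as a POST request.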
Local Web Scraping

You can run web scraping locally and extract clean content directly from target websites.

Setting up Firecrawl Locally

You can run Firecrawl locally to avoid the hosted API's limits. To do so, point your configuration at your local Firecrawl instance instead of the hosted API. If you use a hosted instance, or want to unlock premium web scraping features, set the following in your .env file or directly in the UI:
FIRECRAWL_API_KEY=your-firecrawl-api-key
SERPER_API_KEY=your-serper-api-key
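
The choice between a local and a hosted Firecrawl instance can be sketched as follows. This is an illustration under stated assumptions, not LarkupRAG's implementation: `FIRECRAWL_API_URL` is an assumed variable name for the local instance's base URL, `http://localhost:3002` is a typical self-hosted address rather than a guaranteed one, and the `/v1/scrape` path and Bearer header follow Firecrawl's public API as commonly documented. The helper `firecrawl_scrape_request` is hypothetical.

```python
import json
import os
from typing import Mapping, Tuple

HOSTED_FIRECRAWL = "https://api.firecrawl.dev"  # hosted API base URL


def firecrawl_scrape_request(
    target_url: str, env: Mapping[str, str] = os.environ
) -> Tuple[str, dict, str]:
    """Return (endpoint, headers, json_body) for a Firecrawl scrape call.

    FIRECRAWL_API_URL (assumed name) overrides the hosted base URL, e.g.
    http://localhost:3002 for a local instance. FIRECRAWL_API_KEY is the
    key from the .env file.
    """
    base = env.get("FIRECRAWL_API_URL", HOSTED_FIRECRAWL).rstrip("/")
    headers = {"Content-Type": "application/json"}
    api_key = env.get("FIRECRAWL_API_KEY")
    if api_key:
        headers["Authorization"] = f"Bearer {api_key}"
    elif base == HOSTED_FIRECRAWL:
        # Assumption: only the hosted API strictly requires a key.
        raise ValueError("FIRECRAWL_API_KEY is required for the hosted API")
    body = json.dumps({"url": target_url, "formats": ["markdown"]})
    return f"{base}/v1/scrape", headers, body
```

With `FIRECRAWL_API_URL` set, the request targets the local instance and the key becomes optional. Without it, the hosted API is used and a missing key fails early instead of at request time.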

Custom Proxy Configuration

If your organization requires routing traffic through a proxy (e.g., for secure scraping or bypassing strict rate limits), set the proxy variables in your .env file:
# Add your custom proxy configuration here
HTTP_PROXY=http://proxy.example.com:8080
HTTPS_PROXY=https://proxy.example.com:8080
LarkupRAG will automatically pick up these environment variables and route all outgoing scraping requests through your configured proxy.
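
As a rough illustration of what picking up these variables can look like, the sketch below turns the two settings into the scheme-to-proxy mapping that most Python HTTP clients accept. It is an assumption about the general technique, not a description of LarkupRAG's internals, and `proxies_from_env` is a hypothetical name.

```python
import os
from typing import Dict, Mapping


def proxies_from_env(env: Mapping[str, str] = os.environ) -> Dict[str, str]:
    """Map HTTP_PROXY / HTTPS_PROXY to a {scheme: proxy_url} dict."""
    proxies: Dict[str, str] = {}
    for scheme in ("http", "https"):
        # Accept upper- and lower-case names, as many tools do.
        value = env.get(f"{scheme.upper()}_PROXY") or env.get(f"{scheme}_proxy")
        if value:
            proxies[scheme] = value
    return proxies
```

A dict of this shape can be passed to `urllib.request.ProxyHandler` or to the `proxies` argument of the `requests` library. The standard library's `urllib.request.getproxies()` performs a similar lookup on the real environment.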

Data Ingestion Demo

Watch the video below to see a complete walkthrough of scraping web pages locally and configuring proxies for your ETL jobs.

Tracking ETL Jobs

All ingestion requests (such as scraping a large website or uploading multiple files) are managed as background jobs. You can track each job's status from the Data tab, which gives you full visibility into your corpus before you move to the Indexing stage.
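
The idea of gating indexing on job status can be modeled with a small sketch. Everything here is hypothetical: the status names, the `IngestionJob` fields, and both helper functions are illustrative assumptions, not LarkupRAG's actual job schema or API.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Dict, List


class JobStatus(Enum):
    # Hypothetical status values, for illustration only.
    QUEUED = "queued"
    RUNNING = "running"
    COMPLETED = "completed"
    FAILED = "failed"


@dataclass
class IngestionJob:
    job_id: str
    source: str  # e.g. a URL or an uploaded file name
    status: JobStatus


def summarize(jobs: List[IngestionJob]) -> Dict[str, int]:
    """Count jobs per status, including statuses with zero jobs."""
    counts = {status.value: 0 for status in JobStatus}
    for job in jobs:
        counts[job.status.value] += 1
    return counts


def ready_for_indexing(jobs: List[IngestionJob]) -> bool:
    """True only if there is at least one job and every job completed."""
    return bool(jobs) and all(j.status is JobStatus.COMPLETED for j in jobs)
```

Treating an empty job list as not ready is a deliberate choice in this sketch: it avoids starting indexing on an empty corpus.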