Data Sources
This guide covers all supported data sources in Hyperparam, how to connect them, and best practices for each. Hyperparam streams data directly from where your agent and chat logs already live (object storage, lakehouse tables, or local files), with no warehouse round-trip required.
Overview
Supported Sources
| Source | Access Method | Authentication | Streaming |
|---|---|---|---|
| Local Files | Drag & Drop | None | No |
| Direct URLs | Paste/Click | None/Token | Yes |
| Hugging Face | Search/Browse | None/Token | Yes |
| AWS S3 | S3 URLs | Public/Signed | Yes |
| Google Cloud Storage | Bucket name + prefix | Public bucket / Electron | Yes |
| Azure Blob | Account + container | Public container / Electron | Yes |
Local Files
Drag and Drop
The simplest method for small files:
Supported formats:
- .parquet (recommended)
- .txt (limited support)
- .csv (coming soon)
- .json (coming soon)
Behavior Differences
| User State | Action | Result |
|---|---|---|
| Signed Out | Drop file | Process locally only |
| Signed In | Drop file | Upload to storage |
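Before dropping a large file, it can be worth a quick local sanity check that the file really is Parquet: every Parquet file begins and ends with the 4-byte magic `PAR1`. A minimal stdlib-only sketch (the file path is illustrative):

```python
from pathlib import Path

def looks_like_parquet(path):
    """Cheap sanity check: Parquet files begin and end with the magic bytes b'PAR1'."""
    data = Path(path).read_bytes()
    # A valid file needs the header magic, footer magic, and a footer in between.
    return len(data) > 8 and data[:4] == b"PAR1" and data[-4:] == b"PAR1"
```

This does not validate the footer metadata, but it catches the common case of a renamed CSV or a truncated download before you spend time uploading it.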
Direct URLs
Public URLs
Any publicly accessible URL:
```
https://example.com/data.parquet
https://cdn.example.com/dataset.parquet
http://public-bucket.s3.amazonaws.com/file.parquet
```
URL Requirements
- Must be direct file link
- No authentication required
- Supports range requests
- CORS headers properly set
Testing URL Accessibility
```shell
# Test if URL supports range requests
curl -I -H "Range: bytes=0-1000" https://example.com/data.parquet

# Look for:
# Accept-Ranges: bytes
# Content-Range: bytes 0-1000/...
```
Hugging Face Datasets
Discovery via Chat
The most powerful method is to describe what you need in chat:
```
"Find conversation datasets with quality metrics"
"Show me code generation datasets over 1M examples"
"Search for multilingual instruction datasets"
```
Direct URLs
Format:
```
https://huggingface.co/datasets/{org}/{dataset}/resolve/main/{file}
```
Example:
```
https://huggingface.co/datasets/wikipedia/resolve/main/data/train-00000-of-00001.parquet
```
Hugging Face Features
- Automatic dataset discovery
- Preview before loading
- Dataset cards and metadata
- Version control
- Community ratings
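The resolve-URL pattern above can be assembled programmatically. A minimal sketch (the org and dataset names are placeholders, not real datasets):

```python
def hf_resolve_url(org, dataset, filename, revision="main"):
    """Build a Hugging Face dataset resolve URL of the form
    https://huggingface.co/datasets/{org}/{dataset}/resolve/{revision}/{file}."""
    return f"https://huggingface.co/datasets/{org}/{dataset}/resolve/{revision}/{filename}"

url = hf_resolve_url("acme", "convos", "data/train-00000-of-00001.parquet")
```

Pinning `revision` to a commit hash instead of `main` gives you a stable URL even if the dataset is updated later.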
Authentication (Optional)
For private datasets:
```javascript
// Coming soon: HF token support
const url = "https://huggingface.co/datasets/private/dataset";
const token = "hf_xxxxxxxxxxxx";
```
AWS S3
Public S3 Buckets
Direct access to public data:
```
https://s3.amazonaws.com/bucket-name/path/to/file.parquet
https://bucket-name.s3.amazonaws.com/path/to/file.parquet
https://s3.region.amazonaws.com/bucket-name/file.parquet
```
S3 Signed URLs
For private buckets:
```python
import boto3
from botocore.exceptions import NoCredentialsError

def create_presigned_url(bucket, key, expiration=3600):
    s3_client = boto3.client('s3')
    try:
        response = s3_client.generate_presigned_url(
            'get_object',
            Params={'Bucket': bucket, 'Key': key},
            ExpiresIn=expiration
        )
        return response
    except NoCredentialsError:
        return None

# Generate URL valid for 1 hour
url = create_presigned_url('my-bucket', 'data/file.parquet')
```
S3 Best Practices
- Region Selection: Use closest region
- Bucket Settings: Enable transfer acceleration
- CORS Configuration:
```json
{
  "CORSRules": [{
    "AllowedOrigins": ["https://hyperparam.app"],
    "AllowedMethods": ["GET", "HEAD"],
    "AllowedHeaders": ["*"],
    "ExposeHeaders": ["Content-Range", "Accept-Ranges"]
  }]
}
```
Google Cloud Storage
Adding a GCS bucket
Open Add Source → Google Cloud Storage and fill in:
- Bucket name — the GCS bucket name (e.g. my-data-bucket).
- Path prefix (optional) — a virtual folder to scope browsing (e.g. data/parquet/). Leading slashes are stripped and a trailing slash is added automatically.
- Display name (optional) — label shown in the sidebar; defaults to the bucket name.
Hyperparam browses the bucket by calling the Cloud Storage JSON API (storage.googleapis.com/storage/v1/b/<bucket>/o) directly from the browser (or via the Electron shell). There is no server-side proxy.
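The listing call described above corresponds to the Cloud Storage JSON API objects.list endpoint. A hedged stdlib-only sketch that just constructs the request URL (bucket and prefix values are illustrative):

```python
from urllib.parse import quote, urlencode

def gcs_list_url(bucket, prefix=""):
    """Build a Cloud Storage JSON API objects.list URL:
    https://storage.googleapis.com/storage/v1/b/<bucket>/o?prefix=<prefix>"""
    base = f"https://storage.googleapis.com/storage/v1/b/{quote(bucket, safe='')}/o"
    # The prefix is passed as a query parameter, URL-encoded.
    return f"{base}?{urlencode({'prefix': prefix})}" if prefix else base
```

For a public bucket, an unauthenticated GET to this URL returns a JSON body whose `items` array lists the objects under the prefix.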
Supported URL formats
Pasted GCS object URLs are routed through the GCS listing path in all three forms:
- Virtual-hosted: https://{bucket}.storage.googleapis.com/{object}
- Path-style: https://storage.googleapis.com/{bucket}/{object}
- Authenticated UI domain: https://storage.cloud.google.com/{bucket}/{object} (reachable when the object is public)
Authentication
The in-browser GCS source only works with buckets that allow public read access: either uniform bucket-level access with allUsers granted Storage Object Viewer, or per-object public ACLs. Private buckets require the Electron desktop app, which can access authenticated URLs through its networking shell. Signed URLs are not yet wired up in the Add Source flow.
CORS requirements
Because the browser talks to storage.googleapis.com directly, the bucket needs CORS rules that allow GET and HEAD from the Hyperparam origin and expose Content-Range. See CORS Configuration for the exact gsutil cors set command and an example cors.json.
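As a sketch, the bucket's CORS policy would look along these lines (the origin value is illustrative; see CORS Configuration for the authoritative cors.json and the gsutil command to apply it):

```json
[
  {
    "origin": ["https://hyperparam.app"],
    "method": ["GET", "HEAD"],
    "responseHeader": ["Content-Range", "Accept-Ranges", "Content-Type"],
    "maxAgeSeconds": 3600
  }
]
```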
Azure Blob Storage
Adding an Azure container
Open Add Source → Azure and fill in:
- Storage account — the account name (e.g. mystorageaccount); the container is assumed to live at https://&lt;account&gt;.blob.core.windows.net.
- Container — the blob container name.
- Path prefix (optional) — a virtual folder to scope browsing (e.g. data/parquet/). Leading slashes are stripped and a trailing slash is added automatically.
- Display name (optional) — label shown in the sidebar; defaults to the container name.
Hyperparam browses the container by calling the Azure Blob REST List Blobs endpoint directly from the browser (or via the Electron shell). There is no server-side proxy.
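The List Blobs request described above can be sketched the same way; this only builds the request URL (account and container names are illustrative):

```python
from urllib.parse import urlencode

def azure_list_blobs_url(account, container, prefix=""):
    """Build an Azure Blob Storage List Blobs URL:
    https://<account>.blob.core.windows.net/<container>?restype=container&comp=list[&prefix=...]"""
    params = {"restype": "container", "comp": "list"}
    if prefix:
        params["prefix"] = prefix
    return f"https://{account}.blob.core.windows.net/{container}?{urlencode(params)}"
```

A GET to this URL returns an XML `EnumerationResults` body listing the blobs; as noted below, this list call is exactly what blob-only anonymous access does not permit.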
Authentication
The in-browser Azure source only works when Hyperparam can list the container contents. In practice that means container-level anonymous access or a URL/auth flow that grants the list permission. Blob-only anonymous access is not enough for browsing: a direct blob URL can download successfully while the List Blobs container request still fails. Private containers require the Electron desktop app, which can access Azure-authenticated URLs through its networking shell. Signed URLs (SAS) are not yet wired up in the Add Source flow.
CORS requirements
Because the browser talks to *.blob.core.windows.net directly, the storage account needs CORS rules that allow GET and HEAD from the Hyperparam origin and expose Content-Range. See CORS Configuration for the exact az storage cors add command and Portal steps.
Private Data Sources
Upload Strategy
For sensitive data:
- Never use public URLs
- Generate time-limited signed URLs
- Restrict CORS to Hyperparam origin
- Monitor access logs
- Rotate credentials regularly
Troubleshooting
Common Issues
| Issue | Cause | Solution |
|---|---|---|
| "Access Denied" | Private bucket | Use signed URL |
| "CORS Error" | Missing headers | Configure CORS |
| "Slow Loading" | Far region | Use closer source |
| "Range Not Supported" | Old server | Download fully |
Testing Connectivity
```javascript
// Browser console test
fetch('https://your-data-url.com/file.parquet', {
  method: 'HEAD',
  headers: {
    'Range': 'bytes=0-1000'
  }
}).then(response => {
  console.log('Status:', response.status);
  console.log('Headers:', response.headers);
});
```
Summary
Key points:
- Multiple sources supported with streaming
- URLs preferred over local files for large data
- Cloud storage works seamlessly
- Security via signed URLs
- Performance varies by source location
- CORS configuration required for some sources
Choose the right source for your use case and optimize for streaming performance!