Data Sources

This guide covers all supported data sources in Hyperparam, how to connect them, and best practices for each. Hyperparam streams data directly from where your agent and chat logs already live (object storage, lakehouse tables, or local files), with no warehouse round-trip required.

Overview

Supported Sources

Source         Access Method           Authentication                Streaming
Local Files    Drag & Drop             None                          No
Direct URLs    Paste/Click             None/Token                    Yes
Hugging Face   Search/Browse           None/Token                    Yes
AWS S3         S3 URLs                 Public/Signed                 Yes
Google Cloud   Google Cloud Storage    Public bucket / Electron      Yes
Azure Blob     Azure                   Public container / Electron   Yes

Local Files

Drag and Drop

The simplest way to load small files is to drag and drop them into the Hyperparam window.

Supported formats:
- .parquet (recommended)
- .txt (limited support)
- .csv (coming soon)
- .json (coming soon)

Behavior Differences

User State    Action       Result
Signed Out    Drop file    Processed locally only
Signed In     Drop file    Uploaded to storage

Direct URLs

Public URLs

Any publicly accessible URL:

https://example.com/data.parquet
https://cdn.example.com/dataset.parquet
http://public-bucket.s3.amazonaws.com/file.parquet

URL Requirements

  • Must be a direct link to the file
  • No authentication required
  • Server must support HTTP range requests
  • CORS headers properly configured

Testing URL Accessibility

# Test if URL supports range requests
curl -I -H "Range: bytes=0-1000" https://example.com/data.parquet

# Look for:
# Accept-Ranges: bytes
# Content-Range: bytes 0-1000/...

Hugging Face Datasets

Discovery via Chat

The most powerful way to find datasets is to describe what you need in chat:

"Find conversation datasets with quality metrics"
"Show me code generation datasets over 1M examples"
"Search for multilingual instruction datasets"

Direct URLs

Format:

https://huggingface.co/datasets/{org}/{dataset}/resolve/main/{file}

Example:
https://huggingface.co/datasets/wikipedia/resolve/main/data/train-00000-of-00001.parquet
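For scripted access, the resolve URL can be assembled from the dataset id and file path. A minimal sketch (the helper name is our own, not part of any Hyperparam or Hugging Face API):

```python
def hf_resolve_url(repo_id: str, path: str, revision: str = "main") -> str:
    """Build a Hugging Face dataset 'resolve' URL for a file in a dataset repo."""
    return f"https://huggingface.co/datasets/{repo_id}/resolve/{revision}/{path}"

# Reproduces the example URL above
url = hf_resolve_url("wikipedia", "data/train-00000-of-00001.parquet")
```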

Hugging Face Features

  • Automatic dataset discovery
  • Preview before loading
  • Dataset cards and metadata
  • Version control
  • Community ratings

Authentication (Optional)

For private datasets:

// Coming soon: HF token support
const url = "https://huggingface.co/datasets/private/dataset";
const token = "hf_xxxxxxxxxxxx";

AWS S3

Public S3 Buckets

Direct access to public data:

https://s3.amazonaws.com/bucket-name/path/to/file.parquet
https://bucket-name.s3.amazonaws.com/path/to/file.parquet
https://s3.region.amazonaws.com/bucket-name/file.parquet

S3 Signed URLs

For private buckets:

import boto3
from botocore.exceptions import NoCredentialsError

def create_presigned_url(bucket, key, expiration=3600):
    """Generate a time-limited presigned GET URL for a private S3 object."""
    s3_client = boto3.client('s3')
    try:
        response = s3_client.generate_presigned_url(
            'get_object',
            Params={'Bucket': bucket, 'Key': key},
            ExpiresIn=expiration
        )
        return response
    except NoCredentialsError:
        return None

# Generate URL valid for 1 hour
url = create_presigned_url('my-bucket', 'data/file.parquet')

S3 Best Practices

  1. Region Selection: Choose a bucket region close to where you load data
  2. Bucket Settings: Enable S3 Transfer Acceleration for long-distance access
  3. CORS Configuration:
{
    "CORSRules": [{
        "AllowedOrigins": ["https://hyperparam.app"],
        "AllowedMethods": ["GET", "HEAD"],
        "AllowedHeaders": ["*"],
        "ExposeHeaders": ["Content-Range", "Accept-Ranges"]
    }]
}
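The rules above can also be applied programmatically with boto3's put_bucket_cors. A sketch, assuming AWS credentials with s3:PutBucketCORS permission on the bucket (the bucket name is hypothetical):

```python
# Same rules as the JSON above
CORS_CONFIG = {
    "CORSRules": [{
        "AllowedOrigins": ["https://hyperparam.app"],
        "AllowedMethods": ["GET", "HEAD"],
        "AllowedHeaders": ["*"],
        "ExposeHeaders": ["Content-Range", "Accept-Ranges"],
    }]
}

def apply_cors(bucket: str) -> None:
    """Apply the CORS rules above to an S3 bucket (needs s3:PutBucketCORS)."""
    import boto3  # imported lazily so the config can be inspected without AWS deps
    s3 = boto3.client("s3")
    s3.put_bucket_cors(Bucket=bucket, CORSConfiguration=CORS_CONFIG)

# apply_cors("my-bucket")  # requires credentials; left commented out
```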

Google Cloud Storage

Adding a GCS bucket

Open Add Source → Google Cloud Storage and fill in:

  • Bucket name — the GCS bucket name (e.g. my-data-bucket).
  • Path prefix (optional) — a virtual folder to scope browsing (e.g. data/parquet/). Leading slashes are stripped and a trailing slash is added automatically.
  • Display name (optional) — label shown in the sidebar; defaults to the bucket name.
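The path-prefix normalization described above can be sketched as follows (our own helper mirroring the documented behavior, not Hyperparam's actual code):

```python
def normalize_prefix(prefix: str) -> str:
    """Strip leading slashes and ensure a trailing slash, as the Add Source form does."""
    prefix = prefix.strip().lstrip("/")
    if prefix and not prefix.endswith("/"):
        prefix += "/"
    return prefix
```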

Hyperparam browses the bucket by calling the Cloud Storage JSON API (storage.googleapis.com/storage/v1/b/<bucket>/o) directly from the browser (or via the Electron shell). There is no server-side proxy.
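A direct call to that endpoint looks roughly like this. A sketch using only the standard library; it works only for publicly listable buckets, and the bucket name used in the comment is hypothetical:

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

def gcs_list_request(bucket: str, prefix: str = "") -> str:
    """Build the Cloud Storage JSON API URL for listing objects in a bucket."""
    params = {"delimiter": "/"}  # delimiter gives a folder-like listing
    if prefix:
        params["prefix"] = prefix
    return f"https://storage.googleapis.com/storage/v1/b/{bucket}/o?{urlencode(params)}"

def list_objects(bucket: str, prefix: str = "") -> list[str]:
    """List object names under a prefix in a publicly listable bucket."""
    with urlopen(gcs_list_request(bucket, prefix)) as resp:
        data = json.load(resp)
    return [item["name"] for item in data.get("items", [])]

# list_objects("my-data-bucket", "data/parquet/")  # requires a public bucket
```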

Supported URL formats

Pasted GCS object URLs are routed through the GCS listing path in all three forms:

  • Virtual-hosted: https://{bucket}.storage.googleapis.com/{object}
  • Path-style: https://storage.googleapis.com/{bucket}/{object}
  • Authenticated UI domain: https://storage.cloud.google.com/{bucket}/{object} (reachable when the object is public)
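All three forms reduce to the same (bucket, object) pair. A sketch of the normalization (our own helper, not Hyperparam's routing code):

```python
from urllib.parse import urlparse

def parse_gcs_url(url: str) -> tuple[str, str]:
    """Extract (bucket, object) from any of the three GCS URL forms above."""
    u = urlparse(url)
    path = u.path.lstrip("/")
    # Virtual-hosted: the bucket is the leading host label
    if u.netloc.endswith(".storage.googleapis.com"):
        return u.netloc[: -len(".storage.googleapis.com")], path
    # Path-style and storage.cloud.google.com: the bucket is the first path segment
    if u.netloc in ("storage.googleapis.com", "storage.cloud.google.com"):
        bucket, _, obj = path.partition("/")
        return bucket, obj
    raise ValueError(f"not a recognized GCS URL: {url}")
```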

Authentication

The in-browser GCS source only works with buckets that allow public read access: either uniform bucket-level access with allUsers granted Storage Object Viewer, or per-object public ACLs. Private buckets require the Electron desktop app, which can access authenticated URLs through its networking shell. Signed URLs are not yet wired up in the Add Source flow.

CORS requirements

Because the browser talks to storage.googleapis.com directly, the bucket needs CORS rules that allow GET and HEAD from the Hyperparam origin and expose Content-Range. See CORS Configuration for the exact gsutil cors set command and an example cors.json.

Azure Blob Storage

Adding an Azure container

Open Add Source → Azure and fill in:

  • Storage account — the account name (e.g. mystorageaccount); the container is assumed to live at https://<account>.blob.core.windows.net.
  • Container — the blob container name.
  • Path prefix (optional) — a virtual folder to scope browsing (e.g. data/parquet/). Leading slashes are stripped and a trailing slash is added automatically.
  • Display name (optional) — label shown in the sidebar; defaults to the container name.

Hyperparam browses the container by calling the Azure Blob REST List Blobs endpoint directly from the browser (or via the Electron shell). There is no server-side proxy.
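The List Blobs call is a plain HTTPS GET with query parameters. A sketch against an anonymously listable container, using only the standard library (the account and container names in the comment are hypothetical):

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode
from urllib.request import urlopen

def azure_list_url(account: str, container: str, prefix: str = "") -> str:
    """Build the Azure Blob 'List Blobs' URL for a container."""
    params = {"restype": "container", "comp": "list"}
    if prefix:
        params["prefix"] = prefix
    return f"https://{account}.blob.core.windows.net/{container}?{urlencode(params)}"

def list_blobs(account: str, container: str, prefix: str = "") -> list[str]:
    """List blob names; works only when the container allows anonymous list access."""
    with urlopen(azure_list_url(account, container, prefix)) as resp:
        root = ET.fromstring(resp.read())  # List Blobs returns XML
    return [name.text for name in root.iter("Name")]

# list_blobs("mystorageaccount", "data", "data/parquet/")  # requires anonymous list access
```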

Authentication

The in-browser Azure source only works when Hyperparam can list the container contents. In practice that means container-level anonymous access or a URL/auth flow that grants the list permission. Blob-only anonymous access is not enough for browsing: a direct blob URL can download successfully while the List Blobs container request still fails. Private containers require the Electron desktop app, which can access Azure-authenticated URLs through its networking shell. Signed URLs (SAS) are not yet wired up in the Add Source flow.

CORS requirements

Because the browser talks to *.blob.core.windows.net directly, the storage account needs CORS rules that allow GET and HEAD from the Hyperparam origin and expose Content-Range. See CORS Configuration for the exact az storage cors add command and Portal steps.

Private Data Sources

Upload Strategy

For sensitive data:

  1. Never use public URLs
  2. Generate time-limited signed URLs
  3. Restrict CORS to Hyperparam origin
  4. Monitor access logs
  5. Rotate credentials regularly

Troubleshooting

Common Issues

Issue                     Cause                         Solution
"Access Denied"           Private bucket                Use a signed URL
"CORS Error"              Missing CORS headers          Configure CORS on the bucket
"Slow Loading"            Distant region                Use a closer source
"Range Not Supported"     Server lacks range support    Download the file fully

Testing Connectivity

// Browser console test
// Note: some servers only honor Range on GET, not HEAD
fetch('https://your-data-url.com/file.parquet', {
    method: 'HEAD',
    headers: {
        'Range': 'bytes=0-1000'
    }
}).then(response => {
    console.log('Status:', response.status);
    console.log('Accept-Ranges:', response.headers.get('Accept-Ranges'));
    console.log('Content-Range:', response.headers.get('Content-Range'));
});

Summary

Key points:

  • Multiple sources supported with streaming
  • URLs preferred over local files for large data
  • Cloud storage works seamlessly
  • Security via signed URLs
  • Performance varies by source location
  • CORS configuration required for some sources

Choose the right source for your use case and optimize for streaming performance!
