Data Sources
This guide covers all supported data sources in Hyperparam, how to connect them, and best practices for each. Hyperparam streams data directly from where your agent and chat logs already live (object storage, lakehouse tables, or local files), with no warehouse round-trip required.
Overview
Supported Sources
| Source | Access Method | Authentication | Streaming |
|---|---|---|---|
| Local Files | Drag & Drop | None | No |
| Direct URLs | Paste/Click | None/Token | Yes |
| Hugging Face | Search/Browse | None/Token | Yes |
| AWS S3 | S3 URLs | Public/Signed | Yes |
| Google Cloud Storage | Bucket name + prefix | Public bucket / Electron | Yes |
| Azure Blob | Account + container | Public container / Electron | Yes |
Local Files
Drag and Drop
The simplest method for small files:
Supported formats:
- .parquet (recommended)
- .txt (limited support)
- .csv (coming soon)
- .json (coming soon)
Behavior Differences
| User State | Action | Result |
|---|---|---|
| Signed Out | Drop file | Process locally only |
| Signed In | Drop file | Upload to storage |
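Before dropping a large file, it can be worth a quick local sanity check that the file really is Parquet: every Parquet file begins and ends with the 4-byte magic `PAR1`. A minimal stdlib-only sketch (the file path is illustrative):

```python
from pathlib import Path

def looks_like_parquet(path):
    """Cheap sanity check: Parquet files begin and end with the magic bytes b'PAR1'."""
    data = Path(path).read_bytes()
    # A valid file needs the header magic, footer magic, and a footer in between.
    return len(data) > 8 and data[:4] == b"PAR1" and data[-4:] == b"PAR1"
```

This does not validate the footer metadata, but it catches the common case of a renamed CSV or a truncated download before you spend time uploading it.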
Direct URLs
Public URLs
Any publicly accessible URL:
```
https://example.com/data.parquet
https://cdn.example.com/dataset.parquet
http://public-bucket.s3.amazonaws.com/file.parquet
```
URL Requirements
- Must be direct file link
- No authentication required
- Supports range requests
- CORS headers properly set
Testing URL Accessibility
```shell
# Test if URL supports range requests
curl -I -H "Range: bytes=0-1000" https://example.com/data.parquet

# Look for:
# Accept-Ranges: bytes
# Content-Range: bytes 0-1000/...
```
Hugging Face Datasets
Discovery via Chat
The most powerful method is to describe what you need in chat:
```
"Find conversation datasets with quality metrics"
"Show me code generation datasets over 1M examples"
"Search for multilingual instruction datasets"
```
Direct URLs
Format:
```
https://huggingface.co/datasets/{org}/{dataset}/resolve/main/{file}
```
Example:
```
https://huggingface.co/datasets/wikipedia/resolve/main/data/train-00000-of-00001.parquet
```
Hugging Face Features
- Automatic dataset discovery
- Preview before loading
- Dataset cards and metadata
- Version control
- Community ratings
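The resolve-URL pattern above can be assembled programmatically. A minimal sketch (the org and dataset names are placeholders, not real datasets):

```python
def hf_resolve_url(org, dataset, filename, revision="main"):
    """Build a Hugging Face dataset resolve URL of the form
    https://huggingface.co/datasets/{org}/{dataset}/resolve/{revision}/{file}."""
    return f"https://huggingface.co/datasets/{org}/{dataset}/resolve/{revision}/{filename}"

url = hf_resolve_url("acme", "convos", "data/train-00000-of-00001.parquet")
```

Pinning `revision` to a commit hash instead of `main` gives you a stable URL even if the dataset is updated later.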
Authentication (Optional)
For private datasets:
```javascript
// Coming soon: HF token support
const url = "https://huggingface.co/datasets/private/dataset";
const token = "hf_xxxxxxxxxxxx";
```
AWS S3
Public S3 Buckets
Direct access to public data:
```
https://s3.amazonaws.com/bucket-name/path/to/file.parquet
https://bucket-name.s3.amazonaws.com/path/to/file.parquet
https://s3.region.amazonaws.com/bucket-name/file.parquet
```
S3 Signed URLs
For private buckets:
```python
import boto3
from botocore.exceptions import NoCredentialsError

def create_presigned_url(bucket, key, expiration=3600):
    s3_client = boto3.client('s3')
    try:
        response = s3_client.generate_presigned_url(
            'get_object',
            Params={'Bucket': bucket, 'Key': key},
            ExpiresIn=expiration
        )
        return response
    except NoCredentialsError:
        return None

# Generate URL valid for 1 hour
url = create_presigned_url('my-bucket', 'data/file.parquet')
```
S3 Best Practices
- Region Selection: Use closest region
- Bucket Settings: Enable transfer acceleration
- CORS Configuration:
```json
{
  "CORSRules": [{
    "AllowedOrigins": ["https://hyperparam.app"],
    "AllowedMethods": ["GET", "HEAD"],
    "AllowedHeaders": ["*"],
    "ExposeHeaders": ["Content-Range", "Accept-Ranges"]
  }]
}
```
Google Cloud Storage
Adding a GCS bucket
Open Add Source → Google Cloud Storage and fill in:
- Bucket name — the GCS bucket name (e.g. my-data-bucket).
- Path prefix (optional) — a virtual folder to scope browsing (e.g. data/parquet/). Leading slashes are stripped and a trailing slash is added automatically.
- Display name (optional) — label shown in the sidebar; defaults to the bucket name.
Hyperparam browses the bucket by calling the Cloud Storage JSON API (storage.googleapis.com/storage/v1/b/<bucket>/o) directly from the browser (or via the Electron shell). There is no server-side proxy.
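The listing call described above corresponds to the Cloud Storage JSON API objects.list endpoint. A hedged stdlib-only sketch that just constructs the request URL (bucket and prefix values are illustrative):

```python
from urllib.parse import quote, urlencode

def gcs_list_url(bucket, prefix=""):
    """Build a Cloud Storage JSON API objects.list URL:
    https://storage.googleapis.com/storage/v1/b/<bucket>/o?prefix=<prefix>"""
    base = f"https://storage.googleapis.com/storage/v1/b/{quote(bucket, safe='')}/o"
    # The prefix is passed as a query parameter, URL-encoded.
    return f"{base}?{urlencode({'prefix': prefix})}" if prefix else base
```

For a public bucket, an unauthenticated GET to this URL returns a JSON body whose `items` array lists the objects under the prefix.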
Supported URL formats
Pasted GCS object URLs are routed through the GCS listing path in all three forms:
- Virtual-hosted: https://{bucket}.storage.googleapis.com/{object}
- Path-style: https://storage.googleapis.com/{bucket}/{object}
- Authenticated UI domain: https://storage.cloud.google.com/{bucket}/{object} (reachable when the object is public)
Authentication
The in-browser GCS source only works with buckets that allow public read access: either uniform bucket-level access with allUsers granted Storage Object Viewer, or per-object public ACLs. Private buckets require the Electron desktop app, which can access authenticated URLs through its networking shell. Signed URLs are not yet wired up in the Add Source flow.
CORS requirements
Because the browser talks to storage.googleapis.com directly, the bucket needs CORS rules that allow GET and HEAD from the Hyperparam origin and expose Content-Range. See CORS Configuration for the exact gsutil cors set command and an example cors.json.
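As a sketch, the bucket's CORS policy would look along these lines (the origin value is illustrative; see CORS Configuration for the authoritative cors.json and the gsutil command to apply it):

```json
[
  {
    "origin": ["https://hyperparam.app"],
    "method": ["GET", "HEAD"],
    "responseHeader": ["Content-Range", "Accept-Ranges", "Content-Type"],
    "maxAgeSeconds": 3600
  }
]
```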
Azure Blob Storage
Adding an Azure container
Open Add Source → Azure and fill in:
- Storage account — the account name (e.g. mystorageaccount); the container is assumed to live at https://&lt;account&gt;.blob.core.windows.net.
- Container — the blob container name.
- Path prefix (optional) — a virtual folder to scope browsing (e.g. data/parquet/). Leading slashes are stripped and a trailing slash is added automatically.
- Display name (optional) — label shown in the sidebar; defaults to the container name.
Hyperparam browses the container by calling the Azure Blob REST List Blobs endpoint directly from the browser (or via the Electron shell). There is no server-side proxy.
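The List Blobs request described above can be sketched the same way; this only builds the request URL (account and container names are illustrative):

```python
from urllib.parse import urlencode

def azure_list_blobs_url(account, container, prefix=""):
    """Build an Azure Blob Storage List Blobs URL:
    https://<account>.blob.core.windows.net/<container>?restype=container&comp=list[&prefix=...]"""
    params = {"restype": "container", "comp": "list"}
    if prefix:
        params["prefix"] = prefix
    return f"https://{account}.blob.core.windows.net/{container}?{urlencode(params)}"
```

A GET to this URL returns an XML `EnumerationResults` body listing the blobs; as noted below, this list call is exactly what blob-only anonymous access does not permit.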
Authentication
The in-browser Azure source only works when Hyperparam can list the container contents. In practice that means container-level anonymous access or a URL/auth flow that grants the list permission. Blob-only anonymous access is not enough for browsing: a direct blob URL can download successfully while the List Blobs container request still fails. Private containers require the Electron desktop app, which can access Azure-authenticated URLs through its networking shell. Signed URLs (SAS) are not yet wired up in the Add Source flow.
CORS requirements
Because the browser talks to *.blob.core.windows.net directly, the storage account needs CORS rules that allow GET and HEAD from the Hyperparam origin and expose Content-Range. See CORS Configuration for the exact az storage cors add command and Portal steps.
Private Data Sources
Upload Strategy
For sensitive data:
- Never use public URLs
- Generate time-limited signed URLs
- Restrict CORS to Hyperparam origin
- Monitor access logs
- Rotate credentials regularly
Troubleshooting
Common Issues
| Issue | Cause | Solution |
|---|---|---|
| "Access Denied" | Private bucket | Use signed URL |
| "CORS Error" | Missing headers | Configure CORS |
| "Slow Loading" | Far region | Use closer source |
| "Range Not Supported" | Old server | Download fully |
Testing Connectivity
```javascript
// Browser console test
fetch('https://your-data-url.com/file.parquet', {
  method: 'HEAD',
  headers: {
    'Range': 'bytes=0-1000'
  }
}).then(response => {
  console.log('Status:', response.status);
  console.log('Headers:', response.headers);
});
```
Summary
Key points:
- Multiple sources supported with streaming
- URLs preferred over local files for large data
- Cloud storage works seamlessly
- Security via signed URLs
- Performance varies by source location
- CORS configuration required for some sources
Choose the right source for your use case and optimize for streaming performance!