Caching for Fast Remote Access
DataFolio's caching system dramatically speeds up access to cloud-stored data by maintaining a local copy. This guide shows you how to use caching effectively for remote bundles.
Why Use Caching?
The Problem
Working with cloud-stored bundles can be slow due to network latency:
# Remote bundle on S3
folio = DataFolio('s3://my-bucket/experiments/model_v1')
# Every access downloads from S3 (slow!)
df1 = folio.get_table('results') # 30 seconds
df2 = folio.get_table('features') # 25 seconds
model = folio.get_model('classifier') # 15 seconds
# Total: 70 seconds just for data access
The Solution: Local Caching
Enable caching to download once, read many times:
# Enable caching
folio = DataFolio('s3://my-bucket/experiments/model_v1',
cache_enabled=True)
# First access: Downloads and caches (30s)
df1 = folio.get_table('results') # 30 seconds
# Second access: Reads from cache (fast!)
df1_again = folio.get_table('results') # 0.1 seconds
# 300x faster! 🚀
Quick Start
Basic Usage
from datafolio import DataFolio
# Enable caching with defaults
folio = DataFolio('s3://my-bucket/data',
cache_enabled=True) # That's it!
# Work normally - caching is automatic
df = folio.get_table('my_data') # Cached on first access
model = folio.get_model('my_model') # Cached on first access
# Check cache statistics
status = folio.cache_status()
print(f"Cache hits: {status['cache_hits']}")
print(f"Cache misses: {status['cache_misses']}")
print(f"Total size: {status['total_size_bytes'] / 1e9:.2f} GB")
Custom Cache Directory
# Store cache on a fast SSD
folio = DataFolio('gs://my-bucket/data',
cache_enabled=True,
cache_dir='/fast/ssd/cache')
# Or use a specific directory for this bundle
folio = DataFolio('s3://bucket/experiment1',
cache_enabled=True,
cache_dir='/data/cache/experiment1')
How Caching Works
Cache Storage Structure
DataFolio creates a cache directory with this structure:
~/.datafolio_cache/ # Default cache directory
├── bundles/
│ └── <bundle-id>/ # One directory per bundle
│ ├── tables/
│ │ └── results.parquet # Cached table files
│ ├── models/
│ │ └── classifier.joblib # Cached model files
│ └── ...
└── .locks/ # Lock files for thread safety
With explicit cache_dir:
/my/cache/ # Your specified directory
├── tables/
│ └── results.parquet
├── models/
│ └── classifier.joblib
└── ...
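Either way, you can inspect the cache yourself. The sketch below lists the cached files using the directory reported by cache_status(); apart from that key it is plain standard-library code, and the bundle URI is a placeholder:
from pathlib import Path
from datafolio import DataFolio
folio = DataFolio('s3://my-bucket/experiments/model_v1', cache_enabled=True)
cache_dir = Path(folio.cache_status()['cache_dir'])
# Walk the cache tree and report each cached file with its size
for path in sorted(cache_dir.rglob('*')):
    if path.is_file():
        size_mb = path.stat().st_size / 1e6
        print(f"{path.relative_to(cache_dir)}  ({size_mb:.1f} MB)")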
Cache Behavior
- First access: Downloads from remote → saves to cache → returns data
- Subsequent accesses: Reads from cache (no network access)
- Cache invalidation: Automatic when remote file changes
folio = DataFolio('s3://bucket/data', cache_enabled=True)
# Access 1: Cache miss (downloads from S3)
df = folio.get_table('data') # Downloads, caches, returns
# Status: 1 miss, 0 hits
# Access 2: Cache hit (reads from local cache)
df = folio.get_table('data') # Reads from cache
# Status: 1 miss, 1 hit
# Access 3: Cache hit
df = folio.get_table('data') # Reads from cache
# Status: 1 miss, 2 hits
Checksum-Based Invalidation
DataFolio uses checksums to detect changes:
folio = DataFolio('s3://bucket/data', cache_enabled=True)
# First access: Downloads and caches
df1 = folio.get_table('results') # Cache miss
# Meanwhile, someone updates the remote file...
# (e.g., another process overwrites results.parquet)
# Next access: Detects change, re-downloads
df2 = folio.get_table('results') # Cache miss (checksum changed)
# Automatic cache invalidation! ✅
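Conceptually, invalidation boils down to comparing a recorded checksum against the cached file. The sketch below illustrates the idea with SHA-256 and a hypothetical remote_checksum value; it is not DataFolio's internal code:
import hashlib
from pathlib import Path
def file_checksum(path: Path) -> str:
    # Hash the cached file in chunks to avoid loading it fully into memory
    digest = hashlib.sha256()
    with path.open('rb') as f:
        for chunk in iter(lambda: f.read(1 << 20), b''):
            digest.update(chunk)
    return digest.hexdigest()
def is_cache_valid(cached_file: Path, remote_checksum: str) -> bool:
    # A cached file is reusable only if it exists and its checksum
    # still matches the checksum recorded for the remote object
    return cached_file.exists() and file_checksum(cached_file) == remote_checksum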
Configuration Options
Cache Directory
# Default: ~/.datafolio_cache/bundles/<bundle-id>
folio = DataFolio('s3://bucket/data', cache_enabled=True)
# Custom: Exact directory for this bundle
folio = DataFolio('s3://bucket/data',
cache_enabled=True,
cache_dir='/fast/ssd/my_cache')
# Environment variable
import os
os.environ['DATAFOLIO_CACHE_DIR'] = '/shared/cache'
folio = DataFolio('s3://bucket/data', cache_enabled=True)
Cache Enabled/Disabled
# Disable caching (default)
folio = DataFolio('s3://bucket/data')
# or explicitly:
folio = DataFolio('s3://bucket/data', cache_enabled=False)
# Enable caching
folio = DataFolio('s3://bucket/data', cache_enabled=True)
# Environment variable
os.environ['DATAFOLIO_CACHE_ENABLED'] = 'true'
folio = DataFolio('s3://bucket/data') # Caching enabled
Multiple Bundles, Shared Cache
# Default cache_dir: Each bundle gets its own subdirectory
# Default cache base: ~/.datafolio_cache
folio1 = DataFolio('s3://bucket/exp1', cache_enabled=True)
# Cache: ~/.datafolio_cache/bundles/<exp1-id>/
folio2 = DataFolio('s3://bucket/exp2', cache_enabled=True)
# Cache: ~/.datafolio_cache/bundles/<exp2-id>/
# Custom: Separate caches for different projects
folio_a = DataFolio('s3://bucket/project_a',
cache_enabled=True,
cache_dir='/cache/project_a')
folio_b = DataFolio('s3://bucket/project_b',
cache_enabled=True,
cache_dir='/cache/project_b')
Cache Management
Check Cache Status
folio = DataFolio('s3://bucket/data', cache_enabled=True)
# Work with data
df = folio.get_table('table1')
df = folio.get_table('table2')
model = folio.get_model('model1')
# Check cache statistics
status = folio.cache_status()
print(f"Cache enabled: {status['cache_enabled']}")
print(f"Cache directory: {status['cache_dir']}")
print(f"Total files: {status['total_files']}")
print(f"Total size: {status['total_size_bytes'] / 1e9:.2f} GB")
print(f"Cache hits: {status['cache_hits']}")
print(f"Cache misses: {status['cache_misses']}")
print(f"Hit rate: {status['cache_hits'] / (status['cache_hits'] + status['cache_misses']) * 100:.1f}%")
Example output:
Cache enabled: True
Cache directory: /Users/me/.datafolio_cache/bundles/abc123
Total files: 15
Total size: 2.3 GB
Cache hits: 45
Cache misses: 15
Hit rate: 75.0%
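If you report the hit rate regularly, guard against the case where nothing has been accessed yet (hits + misses == 0). A small helper like this (not part of DataFolio) avoids a divide-by-zero:
def hit_rate(status: dict) -> float:
    # Return the cache hit rate as a percentage; 0.0 if there were no accesses
    total = status['cache_hits'] + status['cache_misses']
    return status['cache_hits'] / total * 100 if total else 0.0
print(f"Hit rate: {hit_rate(folio.cache_status()):.1f}%")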
Clear Cache
# Clear entire cache for this bundle
folio.clear_cache()
# Verify cache is empty
status = folio.cache_status()
assert status['total_files'] == 0
assert status['total_size_bytes'] == 0
Invalidate Specific Item
# Force re-download on next access
folio.invalidate_cache('my_table')
# Next access will download fresh copy
df = folio.get_table('my_table') # Cache miss (downloads)
Refresh Cache
# Re-download all cached items
folio.refresh_cache()
# Useful when you know the remote data has changed
# but the checksums haven't been updated yet
Performance Examples
Example 1: Iterative Data Analysis
# Working with cloud data in a Jupyter notebook
folio = DataFolio('s3://analytics/user_study',
cache_enabled=True)
# Analysis iteration 1: Download data (slow)
df = folio.get_table('user_data') # 45 seconds
df.head()
# Oops, need to filter differently
df_filtered = folio.get_table('user_data') # 0.1 seconds (cached!)
df_filtered[df_filtered['age'] > 18]
# Try different aggregation
df_agg = folio.get_table('user_data') # 0.1 seconds (cached!)
df_agg.groupby('country').mean()
# Total time: 45s instead of 135s (3 accesses)
# Cache saved 90 seconds!
Example 2: Model Training Pipeline
# Training pipeline with multiple runs
folio = DataFolio('s3://ml-data/experiment',
cache_enabled=True,
cache_dir='/fast/nvme/cache')
# First training run
df_train = folio.get_table('training_data') # 2 minutes (5GB download)
df_test = folio.get_table('test_data') # 1 minute (2GB download)
model = train_model(df_train, df_test)
# Total: 3 minutes data loading
# Second training run (different hyperparameters)
df_train = folio.get_table('training_data') # 0.5 seconds (cached!)
df_test = folio.get_table('test_data') # 0.2 seconds (cached!)
model = train_model(df_train, df_test)
# Total: 0.7 seconds data loading
# 10 training runs: 3 min + (9 × 0.7s) = ~3.1 minutes
# Without caching: 10 × 3 min = 30 minutes
# Cache saved 27 minutes! 🚀
Example 3: Team Collaboration
# Shared cache on network drive
folio = DataFolio('s3://team-bucket/shared-analysis',
cache_enabled=True,
cache_dir='/nfs/team-cache/analysis')
# Alice downloads data first
df = folio.get_table('large_dataset') # 10 minutes (downloads)
# Bob uses same cache later
folio_bob = DataFolio('s3://team-bucket/shared-analysis',
cache_enabled=True,
cache_dir='/nfs/team-cache/analysis')
df_bob = folio_bob.get_table('large_dataset') # 5 seconds (from cache!)
# Charlie also benefits
folio_charlie = DataFolio('s3://team-bucket/shared-analysis',
cache_enabled=True,
cache_dir='/nfs/team-cache/analysis')
df_charlie = folio_charlie.get_table('large_dataset') # 5 seconds!
# Team total: 10 min + 5s + 5s vs 30 min without cache
Use Cases
Local Development with Cloud Data
# Develop locally with production data from S3
folio = DataFolio('s3://production/analytics',
cache_enabled=True)
# First run: Downloads data
df = folio.get_table('transactions') # Slow
# Develop your analysis...
# Restart notebook, re-run cells: Reads from cache (fast!)
df = folio.get_table('transactions') # Fast
# Much better development experience!
CI/CD Pipelines
# Cache data across CI runs
import os
# In CI environment
cache_dir = os.environ.get('CI_CACHE_DIR', '/tmp/cache')
folio = DataFolio('s3://ml-models/production',
cache_enabled=True,
cache_dir=cache_dir)
# First CI run: Downloads models
model = folio.get_model('production_model') # Downloads
run_tests(model)
# Subsequent CI runs: Uses cache
model = folio.get_model('production_model') # From cache
run_tests(model) # Much faster CI!
Offline Work
# Pre-populate cache while online
folio = DataFolio('s3://bucket/data',
cache_enabled=True,
cache_dir='/local/cache')
# Download everything while connected
df1 = folio.get_table('data1')
df2 = folio.get_table('data2')
model = folio.get_model('model')
# Now work offline (airplane, no internet)
# All data available from cache!
df1 = folio.get_table('data1') # Works offline! ✈️
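A simple way to pre-populate the cache is to touch every item you will need before going offline. The item names below are placeholders for your own bundle contents:
# Warm the cache for everything you'll need offline
tables_needed = ['data1', 'data2']   # placeholder names
models_needed = ['model']            # placeholder names
for name in tables_needed:
    folio.get_table(name)    # downloads and caches
for name in models_needed:
    folio.get_model(name)    # downloads and caches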
Best Practices
1. Enable Caching for Remote Bundles
# Good: Enable caching for cloud bundles
folio = DataFolio('s3://bucket/data', cache_enabled=True)
# Unnecessary: Don't cache local bundles
folio = DataFolio('/local/path/data') # Already local
2. Use Fast Storage for Cache
# Best: NVMe SSD
folio = DataFolio('s3://bucket/data',
cache_enabled=True,
cache_dir='/nvme/cache')
# Good: Regular SSD
cache_dir='/ssd/cache'
# Slow: HDD (still faster than network!)
cache_dir='/hdd/cache'
3. Monitor Cache Size
# Check cache size periodically
status = folio.cache_status()
cache_gb = status['total_size_bytes'] / 1e9
if cache_gb > 100: # If cache > 100GB
print(f"Cache is large: {cache_gb:.1f} GB")
print("Consider clearing old data")
# folio.clear_cache()
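If you want to automate this, a small helper (not part of DataFolio) can clear the cache once it grows past a limit you choose:
def enforce_cache_limit(folio, limit_gb: float = 100.0) -> None:
    # Clear this bundle's cache if it has grown beyond limit_gb
    size_gb = folio.cache_status()['total_size_bytes'] / 1e9
    if size_gb > limit_gb:
        print(f"Cache is {size_gb:.1f} GB (> {limit_gb} GB), clearing...")
        folio.clear_cache()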
4. Clear Cache When Switching Contexts
# Finished with this analysis
folio = DataFolio('s3://bucket/old_analysis', cache_enabled=True)
# ... work done ...
folio.clear_cache() # Free up disk space
# Start new analysis
folio = DataFolio('s3://bucket/new_analysis', cache_enabled=True)
5. Use Separate Caches for Different Projects
# Project A
folio_a = DataFolio('s3://bucket/project_a',
cache_enabled=True,
cache_dir='/cache/project_a')
# Project B
folio_b = DataFolio('s3://bucket/project_b',
cache_enabled=True,
cache_dir='/cache/project_b')
# Easy to manage and clear separately
Interaction with Parquet Filtering
Caching affects Parquet filtering performance:
Without Caching (Predicate Pushdown)
# No caching: Filter on S3, download only matching rows
folio = DataFolio('s3://bucket/data') # cache_enabled=False
df = folio.get_table('huge_table',
filters=[('country', '==', 'US')],
engine='pyarrow')
# Downloads: ~1GB (filtered data only)
# Time: 30 seconds
With Caching (Download Then Filter)
# With caching: Download full file, then filter locally
folio = DataFolio('s3://bucket/data', cache_enabled=True)
# First access: Downloads full file
df = folio.get_table('huge_table',
filters=[('country', '==', 'US')],
engine='pyarrow')
# Downloads: 100GB (full file)
# Time: 10 minutes
# But subsequent accesses are very fast!
# Second access: Filters cached file
df_ca = folio.get_table('huge_table',
filters=[('country', '==', 'CA')],
engine='pyarrow')
# Downloads: 0GB (from cache)
# Time: 5 seconds
Guideline: For one-time queries on large files, disable caching to use predicate pushdown. For repeated queries, enable caching.
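If you open the same bundles from many scripts, you can encode this guideline in a tiny wrapper; this is a sketch, not a DataFolio feature:
from datafolio import DataFolio
def open_bundle(uri: str, repeated_access: bool) -> DataFolio:
    # Enable caching only when the data will be read more than once;
    # one-off filtered reads are faster with predicate pushdown (no cache)
    return DataFolio(uri, cache_enabled=repeated_access)
folio = open_bundle('s3://bucket/data', repeated_access=True)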
See Parquet Optimization Guide for more details.
Troubleshooting
Cache Not Working
Problem: Data still downloads every time
Check:
status = folio.cache_status()
print(f"Cache enabled: {status['cache_enabled']}")
print(f"Cache dir: {status['cache_dir']}")
print(f"Cache hits: {status['cache_hits']}")
Solutions:
- Verify cache_enabled=True in constructor
- Check the cache directory is writable
- Check that disk space is available (a quick check for both appears below)
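To check the last two points programmatically (plain standard-library code, using the cache directory reported by cache_status()):
import os
import shutil
cache_dir = folio.cache_status()['cache_dir']
# Is the cache directory writable?
print("Writable:", os.access(cache_dir, os.W_OK))
# How much free disk space is left on that filesystem?
free_gb = shutil.disk_usage(cache_dir).free / 1e9
print(f"Free space: {free_gb:.1f} GB")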
Stale Cache Data
Problem: Cache contains old data
Solution:
# Option 1: Invalidate specific item
folio.invalidate_cache('my_table')
# Option 2: Refresh all cache
folio.refresh_cache()
# Option 3: Clear and rebuild
folio.clear_cache()
df = folio.get_table('my_table') # Fresh download
Cache Too Large
Problem: Cache consuming too much disk space
Solutions:
# Check current size
status = folio.cache_status()
print(f"Cache size: {status['total_size_bytes'] / 1e9:.2f} GB")
# Clear cache
folio.clear_cache()
# Or: Use smaller cache_dir with limited space
folio = DataFolio('s3://bucket/data',
cache_enabled=True,
cache_dir='/limited/disk/cache') # E.g., 50GB partition
Permission Errors
Problem: Can't write to cache directory
Solution:
# Use a directory you have write access to
folio = DataFolio('s3://bucket/data',
cache_enabled=True,
cache_dir='/home/me/datafolio_cache') # Your home directory
# Or create directory first
import os
os.makedirs('/tmp/my_cache', exist_ok=True)
folio = DataFolio('s3://bucket/data',
cache_enabled=True,
cache_dir='/tmp/my_cache')
Cache Hits Not Improving Performance
Problem: Cache hits still slow
Possible causes:
1. Slow cache storage: HDD instead of SSD
# Move cache to faster storage
folio = DataFolio('s3://bucket/data',
cache_enabled=True,
cache_dir='/fast/ssd/cache') # Use SSD
2. Large files: Even local reads take time
# Use column selection to read less data
df = folio.get_table('huge_table', columns=['id', 'value']) # Smaller, faster
3. CPU bottleneck: Parquet decompression
# Check if CPU is the bottleneck
import time
start = time.time()
df = folio.get_table('compressed_data')
print(f"Time: {time.time() - start:.2f}s")
# If slow despite a cache hit, decompression/CPU may be the bottleneck
Advanced: Cache Internals
Cache Key Generation
DataFolio generates cache keys based on:
- Item name
- Item type (table, model, etc.)
- Checksum (if available)
# Internally, cache key might be:
# tables/my_data.parquet → cached as <cache_dir>/tables/my_data.parquet
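The mapping from item to cache path can be pictured like this; it is an illustration of the scheme described above, not DataFolio's actual code:
from pathlib import Path
def cache_path(cache_dir: str, item_type: str, item_name: str, extension: str) -> Path:
    # e.g. cache_path('/my/cache', 'tables', 'my_data', 'parquet')
    #   -> /my/cache/tables/my_data.parquet
    return Path(cache_dir) / item_type / f"{item_name}.{extension}"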
Thread Safety
Cache operations are thread-safe:
from concurrent.futures import ThreadPoolExecutor
folio = DataFolio('s3://bucket/data', cache_enabled=True)
def load_table(name):
return folio.get_table(name)
# Multiple threads can safely access cache
with ThreadPoolExecutor(max_workers=4) as executor:
futures = [executor.submit(load_table, f'table_{i}') for i in range(10)]
results = [f.result() for f in futures]
# No race conditions! ✅
Cache Locking
DataFolio uses file locks to prevent corruption:
~/.datafolio_cache/
└── .locks/
└── <bundle-id>.lock # Lock file for this bundle
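Conceptually, the lock file serializes writers so two processes never write the same cached file at once. The POSIX-only sketch below illustrates the idea (hypothetical lock path) and is not DataFolio's internal implementation:
import fcntl
# Acquire an exclusive lock on the bundle's lock file before touching the cache
with open('/tmp/example-bundle.lock', 'w') as lock_file:
    fcntl.flock(lock_file, fcntl.LOCK_EX)   # blocks until the lock is free
    try:
        pass  # ... download and write cached files here ...
    finally:
        fcntl.flock(lock_file, fcntl.LOCK_UN)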
Performance Benchmarks
| Scenario | No Cache | With Cache (First) | With Cache (Subsequent) | Speedup |
|---|---|---|---|---|
| 100MB S3 file | 5s | 5s | 0.1s | 50x |
| 1GB S3 file | 45s | 45s | 0.5s | 90x |
| 10GB S3 file | 8m | 8m | 5s | 96x |
| 100GB S3 file | 80m | 80m | 45s | 107x |
| 10 small files | 30s | 30s | 1s | 30x |
Note: Speedup is for cache hits. First access has no speedup (must download).
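Numbers like these depend heavily on your network and disk, so it is worth measuring on your own bundle. A minimal timing harness (placeholder bucket and table names):
import time
from datafolio import DataFolio
folio = DataFolio('s3://my-bucket/data', cache_enabled=True)  # placeholder URI
for label in ('first access (download + cache)', 'second access (cache hit)'):
    start = time.time()
    folio.get_table('my_table')  # placeholder table name
    print(f"{label}: {time.time() - start:.2f}s")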
Summary
| Feature | Description | Command |
|---|---|---|
| Enable caching | Cache remote data locally | cache_enabled=True |
| Set cache dir | Specify cache location | cache_dir='/path' |
| Check status | View cache statistics | folio.cache_status() |
| Clear cache | Delete all cached files | folio.clear_cache() |
| Invalidate item | Force re-download | folio.invalidate_cache('name') |
| Refresh cache | Re-download everything | folio.refresh_cache() |
When to use caching:
- ✅ Remote bundles (S3, GCS, etc.)
- ✅ Repeated data access
- ✅ Iterative development
- ✅ Team collaboration with shared cache
- ✅ Offline work preparation
When NOT to use caching:
- ❌ Local bundles (already fast)
- ❌ One-time data access
- ❌ Limited disk space
- ❌ Very large files with selective filters (use predicate pushdown)
Learn More
- Getting Started Guide - Basic DataFolio usage
- Parquet Optimization Guide - Filtering and column selection
- Snapshots Guide - Version your cached data
- DataFolio API Reference - Cache management methods