# DataFolio

A lightweight, filesystem-based data versioning and experiment tracking library for Python.
DataFolio helps you organize, version, and track your data science experiments by storing datasets, models, and artifacts in a simple, transparent directory structure. Everything is saved as plain files (Parquet, JSON, pickle) that you can inspect, version with git, or back up to any storage system.
## Why DataFolio?
Ever trained a model with great results, then lost it while experimenting? Or struggled to remember which dataset produced which model? Or needed to reproduce results from months ago?
DataFolio solves these problems with a simple, filesystem-based approach: no servers, no databases, just files you can inspect and version control.
## Quick Example: The Story of a Good Model
```python
from datafolio import DataFolio
from sklearn.ensemble import RandomForestClassifier

# You've been working on a classification problem
folio = DataFolio('experiments/fraud_detection')

# Process your data
folio.add_table('training_data', processed_df,
                description='Cleaned transaction data with engineered features')

# Train a model - it gets 89% accuracy!
model = RandomForestClassifier(n_estimators=100, max_depth=10)
model.fit(X_train, y_train)

# Save it with metadata
folio.add_model('classifier', model,
                description='Random forest classifier',
                inputs=['training_data'])
folio.metadata['accuracy'] = 0.89
folio.metadata['status'] = 'promising'

# View everything
folio.describe()

# Create a snapshot before experimenting - it's free insurance!
folio.create_snapshot('v1-baseline',
                      description='89% accuracy baseline model',
                      tags=['baseline', 'validated'])

# Now experiment freely - try a neural network
new_model = train_experimental_model()
folio.add_model('classifier', new_model, overwrite=True)
folio.metadata['accuracy'] = 0.85  # Worse!

# No problem - load the good version back
baseline = DataFolio.load_snapshot('experiments/fraud_detection', 'v1-baseline')
good_model = baseline.get_model('classifier')  # Your 89% model is safe!

# Deploy the good one to production
deploy_to_production(good_model)
```
This is the core DataFolio workflow: track your data, models, and results; snapshot before experimenting; never lose good work.
## Key Features
- **Universal Data Management** - A single `add_data()` method handles DataFrames, numpy arrays, dicts, lists, and scalars
- **Model Support** - Save and load scikit-learn models with full metadata
- **Snapshots** - Create immutable checkpoints of experiments with copy-on-write versioning (no data duplication!)
- **Data Lineage** - Track inputs and dependencies between datasets and models
- **Autocomplete Access** - IDE-friendly `folio.data.item_name.content` syntax with full autocomplete
- **Multi-Instance Sync** - Multiple notebooks/processes can safely access the same bundle
- **Cloud Storage** - Works with local paths, S3, GCS, Azure, and more
- **Caching** - Smart caching for remote data reduces download times
- **Git-Friendly** - All data stored as standard file formats in a simple directory structure
- **CLI Tools** - Command-line interface for snapshot management and bundle operations
## Installation
```bash
pip install datafolio
```
## Learn More
New to DataFolio? Start with the Getting Started Guide for a comprehensive tutorial.
Specific topics:

- Snapshots Guide - Version control for experiments
- DataFolio API Reference - All methods and properties
- CLI Reference - Command-line tools
- Complete API - Full API documentation
## Common Use Cases
### Experiment Tracking
```python
# Track everything about your experiment
folio = DataFolio('experiments/model_v2')
folio.metadata['experiment'] = 'hyperparameter_tuning'
folio.metadata['date'] = '2025-01-20'

# Save data, models, and results
folio.add_table('features', feature_df)
folio.add_model('model', trained_model)
folio.add_json('metrics', {'accuracy': 0.92, 'f1': 0.89})

# Create snapshots at milestones
folio.create_snapshot('v2.0-production', tags=['production'])
```
### Reproducible Research
```python
# Paper submission: snapshot your exact results
folio.create_snapshot('neurips-2025-submission',
                      description='Results in paper Table 3',
                      tags=['paper', 'published'])

# Six months later: reviewers ask for clarification
paper_version = DataFolio.load_snapshot('research/exp', 'neurips-2025-submission')
exact_model = paper_version.get_model('classifier')
exact_data = paper_version.get_table('test_data')
```
### Team Collaboration
```python
# Use cloud storage for team access
folio = DataFolio('s3://team-bucket/shared-experiment',
                  cache_enabled=True)  # Cache for faster local access

# Everyone sees the same data
df = folio.get_table('results')
model = folio.get_model('classifier')

# Compare different team members' approaches
baseline = DataFolio.load_snapshot('s3://team-bucket/shared', 'alice-baseline')
variant = DataFolio.load_snapshot('s3://team-bucket/shared', 'bob-neural-net')
```
## What Makes DataFolio Different?
| | DataFolio | MLflow | Weights & Biases |
|---|---|---|---|
| Setup | Zero - just a directory | Requires server | Requires account |
| Storage | Files on disk/cloud | Database + artifacts | Cloud service |
| Inspection | Direct file access | Via API | Via web UI |
| Versioning | Snapshots (copy-on-write) | Runs (separate copies) | Versions (cloud) |
| Sharing | Copy directory/git | Share server access | Share workspace |
| Cost | Free | Free (self-hosted) | Free tier + paid |
DataFolio is perfect when you want:

- Full control over your data
- Simple filesystem-based storage
- Git-friendly versioning
- No external dependencies
- Cloud storage without cloud services
## Directory Structure
DataFolio creates an intuitive, inspectable directory structure:
```
experiments/my_experiment/
├── items.json              # Manifest of all items
├── metadata.json           # Bundle metadata
├── snapshots.json          # Snapshot registry
│
├── tables/
│   └── features.parquet    # DataFrames as Parquet
│
├── models/
│   └── classifier.joblib   # Scikit-learn models
│
├── numpy/
│   └── embeddings.npy      # Numpy arrays
│
└── artifacts/
    ├── config.json         # JSON data
    ├── plot.png            # Images
    └── report.pdf          # Any file type
```
All files use standard formats you can open with any tool!
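Because everything is plain files, a bundle can be inspected with nothing but the standard library. A minimal sketch (the directory names follow the tree above, but the exact manifest schema shown here is an assumption, not DataFolio's documented format):

```python
import json
from pathlib import Path

# Build a tiny mock bundle on disk as a stand-in for a real DataFolio directory
bundle = Path('experiments/my_experiment')
(bundle / 'artifacts').mkdir(parents=True, exist_ok=True)

# Illustrative manifest: item name -> type and relative path
items = {'config': {'type': 'json', 'path': 'artifacts/config.json'}}
(bundle / 'items.json').write_text(json.dumps(items, indent=2))
(bundle / 'artifacts' / 'config.json').write_text(json.dumps({'lr': 0.01}))

# Inspect with any tool - here, plain json + pathlib
manifest = json.loads((bundle / 'items.json').read_text())
for name, info in manifest.items():
    print(f"{name}: {info['type']} -> {info['path']}")

config = json.loads((bundle / 'artifacts' / 'config.json').read_text())
print(config['lr'])
```

The same files are equally readable with `cat`, `jq`, or a text editor, which is what makes the layout git- and backup-friendly.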
## Quick CLI Reference
```bash
# Initialize a new bundle
datafolio init my_experiment

# Describe bundle contents
datafolio describe

# Create a snapshot
datafolio snapshot create v1.0 -d "Baseline model" --tags baseline,production

# List snapshots
datafolio snapshot list

# Compare two snapshots
datafolio snapshot compare v1.0 v2.0

# Show current status vs last snapshot
datafolio snapshot status
```
See the CLI Reference for complete documentation.
## Best Practices
- **Use descriptive names** - `'training_features'`, not `'data1'`
- **Track lineage** - Always specify the `inputs` parameter
- **Add descriptions** - Help future you understand your work
- **Snapshot before major changes** - It's free insurance
- **Use tags** - Organize snapshots with `baseline`, `production`, `paper`
- **Leverage autocomplete** - Use `folio.data.item_name.content`
- **Clean up regularly** - Delete temporary items with `folio.delete()`
- **Version control** - Commit bundles to git for team collaboration
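What tracking lineage buys you: when every item records its inputs, the full dependency chain of any model can be recovered later. A minimal sketch of the idea using plain dicts (illustrative only, not DataFolio's internals; the item names are hypothetical):

```python
# Each item records the names of the items it was built from,
# mirroring the inputs parameter used when adding items.
lineage = {
    'raw_transactions': [],
    'training_data': ['raw_transactions'],
    'classifier': ['training_data'],
}

def ancestors(item, graph):
    """Return every upstream dependency of `item`, nearest first."""
    seen, queue, order = set(), list(graph[item]), []
    while queue:
        dep = queue.pop(0)
        if dep not in seen:
            seen.add(dep)
            order.append(dep)
            queue.extend(graph[dep])
    return order

print(ancestors('classifier', lineage))  # ['training_data', 'raw_transactions']
```

With this record in place, "which dataset produced which model?" becomes a lookup instead of an archaeology project.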
## Get Started
Ready to organize your experiments? Check out the Getting Started Guide for a step-by-step tutorial.
## Development
See CLAUDE.md for development guidelines.
## License
MIT License - see LICENSE file for details.