Getting Started with DataFolio

DataFolio is a lightweight, filesystem-based experiment tracking library that helps you organize data science experiments by storing datasets, models, and artifacts in a simple, transparent directory structure.

Why DataFolio?

Traditional experiment tracking solutions can be heavyweight, require servers, or lock you into specific platforms. DataFolio takes a different approach:

Filesystem-based - Everything is just files on disk
Transparent - Standard formats (Parquet, JSON, pickle) you can inspect
Portable - Works locally, in notebooks, on clusters, or in the cloud
Git-friendly - Version control your entire experiment
Simple - Intuitive Python API with minimal boilerplate

Installation

pip install datafolio

This installs both the Python library and the datafolio CLI tool.

Quick Start

Your First Bundle

A "bundle" is DataFolio's way of organizing an experiment. Think of it as a project folder:

from datafolio import DataFolio
import pandas as pd

# Create a new bundle
folio = DataFolio('experiments/my_first_experiment')

# Add some data
df = pd.DataFrame({
    'feature_1': [1, 2, 3],
    'feature_2': [4, 5, 6],
    'target': [0, 1, 0]
})

folio.add_data('training_data', df)

# View what's in the bundle
folio.describe()

Output:

DataFolio: experiments/my_first_experiment
==========================================

Tables (1):
  • training_data
    ↳ shape: [3, 3]

What Just Happened?

DataFolio created a directory structure:

experiments/my_first_experiment/
├── metadata.json          # Bundle metadata
├── items.json            # Manifest of all items
└── tables/
    └── training_data.parquet  # Your DataFrame

Everything is saved automatically. No need to call save() or commit().

Core Concepts

Bundles

A bundle is a self-contained experiment with:

Data items (tables, arrays, JSON, models, files)
Metadata (custom key-value pairs about the experiment)
Lineage (relationships between data items)

# Create or open a bundle
folio = DataFolio('path/to/bundle')

# Add custom metadata
folio.metadata['experiment_name'] = 'baseline_v1'
folio.metadata['date'] = '2025-01-20'
folio.metadata['tags'] = ['classification', 'baseline']

# Metadata is automatically saved

Data Items

DataFolio supports multiple data types, each optimized for its use case:

Type	Examples	Storage Format
Tables	pandas / Polars DataFrames	Parquet
Numpy Arrays	Embeddings, tensors	`.npy`
JSON	Configs, metrics, lists	`.json`
Models	sklearn	`.joblib`, `.skops`
Artifacts	Images, PDFs, any file	Original format
References	External data (S3, etc.)	Metadata only

The Universal `add_data()` Method

For simplicity, use add_data(), which automatically detects the type:

# Automatically handles different types
folio.add_data('df', dataframe)           # Table
folio.add_data('embeddings', np_array)    # Numpy
folio.add_data('config', {'lr': 0.01})    # JSON
folio.add_data('model', sklearn_model)    # Model
folio.add_data('score', 0.95)             # JSON (scalar)

Or use type-specific methods for more control:

folio.add_table('df', dataframe, description='Training data')
folio.add_numpy('embeddings', array, description='Word embeddings')
folio.add_json('config', config_dict, description='Model config')
folio.add_model('clf', model, description='Random forest')
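
To make the auto-detection idea concrete, here is a minimal sketch of how a dispatcher in the spirit of add_data() could pick a storage type. It is an illustration only and not DataFolio's actual implementation: the helper name detect_item_type and its rules are assumptions made for this example.

```python
import json


def detect_item_type(obj):
    """Guess a storage type for obj (hypothetical helper, not DataFolio's API)."""
    module = type(obj).__module__.split(".")[0]
    name = type(obj).__name__

    # DataFrames from pandas or Polars would be stored as tables.
    if module in ("pandas", "polars") and name == "DataFrame":
        return "table"
    # NumPy arrays would be stored as .npy files.
    if module == "numpy" and name == "ndarray":
        return "numpy"
    # Anything exposing fit() looks like a scikit-learn style estimator.
    if callable(getattr(obj, "fit", None)):
        return "model"
    # Whatever serializes cleanly is treated as JSON (dicts, lists, scalars).
    try:
        json.dumps(obj)
    except (TypeError, ValueError):
        raise TypeError(f"Cannot infer a storage type for {name}") from None
    return "json"


config_type = detect_item_type({"lr": 0.01})
score_type = detect_item_type(0.95)
```

When the guess would be ambiguous, the type-specific methods above remove the guesswork.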

Working with Data

Adding Data

import pandas as pd
import numpy as np

# Tables (pandas or Polars DataFrames)
df = pd.DataFrame({'x': [1, 2, 3], 'y': [4, 5, 6]})
folio.add_table('data', df,
    description='Experimental data',
    inputs=['raw_data'])  # Optional: track lineage

# Numpy arrays
embeddings = np.random.randn(100, 128)
folio.add_numpy('embeddings', embeddings,
    description='Model embeddings')

# JSON data (configs, metrics, lists)
config = {'learning_rate': 0.01, 'batch_size': 32}
folio.add_json('config', config,
    description='Training configuration')

# Scalars are stored as JSON
folio.add_json('accuracy', 0.95)

# Files/artifacts
folio.add_artifact('plot.png', 'path/to/plot.png',
    description='Training curve')

Retrieving Data

# Get by type-specific method
df = folio.get_table('data')
arr = folio.get_numpy('embeddings')
config = folio.get_json('config')

# Or use universal get_data()
df = folio.get_data('data')        # Returns DataFrame
arr = folio.get_data('embeddings') # Returns numpy array
config = folio.get_data('config')  # Returns dict

Autocomplete-Friendly Access

For a better developer experience, use the folio.data accessor:

# Attribute-style access (great for autocomplete!)
df = folio.data.training_data.content
config = folio.data.config.content
model = folio.data.classifier.content

# Access metadata
desc = folio.data.training_data.description
inputs = folio.data.training_data.inputs
item_type = folio.data.training_data.type

# In Jupyter/IPython, use TAB completion
folio.data.<TAB>  # Shows all available items

Overwriting Data

# Add initial data
folio.add_data('model', model_v1)

# Overwrite with new version
folio.add_data('model', model_v2, overwrite=True)

# Without overwrite=True, you'll get an error
folio.add_data('model', model_v3)  # Error: item exists!

Deleting Data

# Delete single item
folio.delete('old_model')

# Delete multiple items
folio.delete(['temp1', 'temp2', 'debug_data'])

# DataFolio warns if deleted items have dependents
folio.delete('train_data')  # Warns if other items depend on it

# Or pass warn_dependents=False to skip the warning
folio.delete('train_data', warn_dependents=False)

Working with Models

Scikit-learn Models

from sklearn.ensemble import RandomForestClassifier

# Train model
clf = RandomForestClassifier(n_estimators=100, max_depth=10)
clf.fit(X_train, y_train)

# Save model
folio.add_model('classifier', clf,
    description='Random forest classifier',
    hyperparameters={'n_estimators': 100, 'max_depth': 10},
    inputs=['training_data'])

# Load model
loaded_clf = folio.get_model('classifier')
predictions = loaded_clf.predict(X_test)

Custom Models with Skops

DataFolio supports custom sklearn-compatible models using skops. This is particularly useful for pipelines with custom transformers that need to be portable across environments.

When to use skops format (custom=True):

Pipelines with custom transformers that need to work across different machines
Models that need to be deployed without access to the original class definitions
Better security for model deployment (skops provides secure serialization)

Key requirement: Custom transformers must inherit from sklearn base classes:

from sklearn.base import BaseEstimator, TransformerMixin
import numpy as np

# ✅ CORRECT: Inherits from sklearn mixins
class PercentileClipper(BaseEstimator, TransformerMixin):
    """Custom transformer that clips values to percentile bounds."""

    def __init__(self, lower=1, upper=99):
        self.lower = lower
        self.upper = upper

    def fit(self, X, y=None):
        self.lower_bound_ = np.percentile(X, self.lower, axis=0)
        self.upper_bound_ = np.percentile(X, self.upper, axis=0)
        return self

    def transform(self, X):
        return np.clip(X, self.lower_bound_, self.upper_bound_)

# ❌ WRONG: Plain class without sklearn mixins
class BadTransformer:  # Won't work with skops!
    def fit(self, X, y=None):
        return self
    def transform(self, X):
        return X

Why inherit from BaseEstimator and TransformerMixin?

BaseEstimator: Provides get_params() and set_params() methods required by sklearn
TransformerMixin: Provides fit_transform() method automatically
Ensures compatibility with sklearn's Pipeline and other utilities
Required for skops to properly serialize and deserialize your custom class
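
The inherited methods can be seen in a short, self-contained sketch. RangeClipper is a toy class invented for this illustration (it is not part of DataFolio or scikit-learn); only get_params(), set_params(), fit_transform(), and clone() are real scikit-learn features.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin, clone


class RangeClipper(BaseEstimator, TransformerMixin):
    """Toy transformer (hypothetical) that clips values to a fixed range."""

    def __init__(self, low=0.0, high=1.0):
        self.low = low
        self.high = high

    def fit(self, X, y=None):
        # Nothing to learn here; record the input width as a fitted attribute.
        self.n_features_in_ = np.asarray(X).shape[1]
        return self

    def transform(self, X):
        return np.clip(X, self.low, self.high)


clipper = RangeClipper(low=-1.0, high=1.0)

# get_params() and set_params() come from BaseEstimator.
params = clipper.get_params()
clipper.set_params(high=2.0)

# fit_transform() comes from TransformerMixin.
X = np.array([[-5.0, 0.5], [3.0, 1.5]])
clipped = clipper.fit_transform(X)

# clone() relies on get_params() to build an unfitted copy.
fresh = clone(clipper)
```

A plain class such as BadTransformer has none of these methods, so utilities that call get_params() or clone() on it would fail.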

Using custom transformers in pipelines:

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

# Create pipeline with custom transformer
pipeline = Pipeline([
    ('clipper', PercentileClipper(lower=5, upper=95)),
    ('scaler', StandardScaler()),
    ('clf', LogisticRegression())
])

# Fit pipeline
X_train = np.random.randn(100, 5)
y_train = np.random.randint(0, 2, 100)
pipeline.fit(X_train, y_train)

# Save with skops format (custom=True)
folio.add_sklearn('custom_pipeline', pipeline,
    custom=True,  # Uses skops for portability
    description='Pipeline with custom percentile clipper')

# Load in a different environment (doesn't need PercentileClipper class!)
folio2 = DataFolio('path/to/bundle')
loaded_pipeline = folio2.get_sklearn('custom_pipeline')
predictions = loaded_pipeline.predict(X_test)

Comparison of serialization formats:

Format	When to Use	Pros	Cons
joblib (default)	Standard sklearn models, XGBoost, LightGBM	Fast, widely supported	Requires class definitions on load
skops (`custom=True`)	Custom transformers, deployment	Portable, more secure	Slightly slower

# Joblib format (default)
folio.add_sklearn('model', pipeline)  # Uses joblib

# Skops format (portable)
folio.add_sklearn('model', pipeline, custom=True)  # Uses skops

# Both work through generic add_model() too
folio.add_model('model', pipeline, custom=True)

Best practices for custom transformers:

Always inherit from sklearn base classes:

from sklearn.base import BaseEstimator, TransformerMixin

class MyTransformer(BaseEstimator, TransformerMixin):
    ...

Store fitted parameters with trailing underscore:

def fit(self, X, y=None):
    self.mean_ = np.mean(X)  # Fitted params end with _
    return self

Initialize all parameters in __init__:

def __init__(self, threshold=0.5):
    self.threshold = threshold  # Store all params

Always return self from fit():

def fit(self, X, y=None):
    # ... fitting logic ...
    return self  # Required for sklearn API

Data Lineage

Track dependencies between your data items to understand your workflow:

# Reference external data
folio.reference_table('raw_data',
    reference='s3://bucket/raw_data.parquet',
    description='Original raw data from database')

# Add processed data with lineage
folio.add_table('cleaned_data', cleaned_df,
    description='Cleaned and preprocessed',
    inputs=['raw_data'])  # Depends on raw_data

# Add features
folio.add_table('features', feature_df,
    description='Engineered features',
    inputs=['cleaned_data'])  # Depends on cleaned_data

# Add model
folio.add_model('classifier', model,
    description='Trained classifier',
    inputs=['features'])  # Depends on features

# View the lineage chain
folio.describe()

Output shows the dependency chain:

Tables (3):
  • raw_data (reference): Original raw data from database
    ↳ path: s3://bucket/raw_data.parquet
  • cleaned_data: Cleaned and preprocessed
    ↳ inputs: raw_data
    ↳ shape: [10000, 25]
  • features: Engineered features
    ↳ inputs: cleaned_data
    ↳ shape: [10000, 50]

Models (1):
  • classifier: Trained classifier
    ↳ inputs: features

Why Track Lineage?

Understand workflows - See how data flows through your pipeline
Debug issues - Trace problems back to their source
Reproduce results - Know exactly which data created which results
Cleanup safely - DataFolio warns when deleting items with dependents
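
To illustrate why recorded inputs make cleanup safer, the sketch below walks an inputs mapping to find everything downstream of an item. It is a plain-Python illustration under stated assumptions, not DataFolio's implementation: find_dependents and the lineage dictionary are hypothetical names that mirror the example chain above.

```python
def find_dependents(inputs_by_item, target):
    """Return every item that depends on target, directly or transitively."""
    dependents = []
    frontier = [target]
    while frontier:
        current = frontier.pop()
        for item, inputs in inputs_by_item.items():
            # An item is downstream if it lists the current item as an input.
            if current in inputs and item not in dependents:
                dependents.append(item)
                frontier.append(item)
    return sorted(dependents)


# Each item maps to the inputs it declared when it was added.
lineage = {
    "raw_data": [],
    "cleaned_data": ["raw_data"],
    "features": ["cleaned_data"],
    "classifier": ["features"],
}

downstream = find_dependents(lineage, "cleaned_data")
```

Here downstream contains features and classifier, which is the kind of information a warning on delete could be based on.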

External References

For large datasets stored elsewhere (S3, network drives, etc.), use references instead of copying:

# Reference data without copying
folio.reference_table('huge_dataset',
    reference='s3://my-bucket/data/train.parquet',
    description='10GB training dataset')

# Reference with additional metadata
folio.reference_table('cloud_data',
    reference='gs://bucket/data.csv',
    description='Data in Google Cloud Storage',
    num_rows=1_000_000,
    num_cols=500)

# Later, access the path
path = folio.data.huge_dataset.path  # 's3://my-bucket/data/train.parquet'

# Load with pandas/pyarrow
import pandas as pd
df = pd.read_parquet(path)  # Reads directly from S3

Bundle Metadata

Store experiment-level information in the bundle metadata:

# Add custom metadata
folio.metadata['experiment_name'] = 'baseline_v1'
folio.metadata['researcher'] = 'Alice'
folio.metadata['date_started'] = '2025-01-20'
folio.metadata['hypothesis'] = 'Random forest will outperform logistic regression'
folio.metadata['tags'] = ['classification', 'baseline', 'production']
folio.metadata['notes'] = 'First experiment with cleaned dataset'

# Metadata is automatically saved

# Access metadata
print(folio.metadata['experiment_name'])

# View all metadata
folio.describe()  # Shows metadata section

The describe() method automatically formats and displays your custom metadata.

Describing Your Bundle

Get a comprehensive overview of your bundle:

# Print to console (default)
folio.describe()

# Get as string
summary = folio.describe(return_string=True)
print(summary)

# Show empty sections
folio.describe(show_empty=True)

# Limit metadata fields shown
folio.describe(max_metadata_fields=5)

Example output:

DataFolio: experiments/classifier_v1
====================================

Tables (2):
  • raw_data (reference): Original raw data
    ↳ path: s3://bucket/raw.parquet
  • features: Engineered features
    ↳ inputs: cleaned_data
    ↳ shape: [10000, 50]

Numpy Arrays (1):
  • embeddings: Model embeddings
    ↳ shape: [100, 128], dtype: float64
    ↳ inputs: features

Models (1):
  • classifier: Random forest classifier
    ↳ inputs: features
    ↳ hyperparameters: {'n_estimators': 100, 'max_depth': 10}

Metadata (6):
  • experiment_name: baseline_v1
  • researcher: Alice
  • tags: ['classification', 'baseline'] (list, 2 items)
  • hypothesis: Random forest will outperform logistic... (truncated)
  ... and 2 more fields

Multi-Instance Access

Multiple notebooks or processes can safely access the same bundle:

# Notebook 1: Create bundle
folio1 = DataFolio('experiments/shared')
folio1.add_data('results', df1)

# Notebook 2: Open same bundle
folio2 = DataFolio('experiments/shared')
folio2.describe()  # Shows 'results'

# Notebook 1: Add more data
folio1.add_data('analysis', df2)

# Notebook 2: Automatically sees new data!
folio2.describe()  # Now shows both 'results' and 'analysis'
data = folio2.get_data('analysis')  # Works immediately ✅

All read operations automatically refresh from disk, so you always see the latest state.
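
The refresh-on-read idea can be sketched with the standard library alone. The ManifestView class and add_item helper below are hypothetical and only illustrate the pattern of re-reading a manifest file on each access; they are not DataFolio's code, and the sketch does not deal with two processes writing at the same moment.

```python
import json
import tempfile
from pathlib import Path


class ManifestView:
    """Re-reads a JSON manifest on every access (hypothetical illustration)."""

    def __init__(self, path):
        self.path = Path(path)

    def items(self):
        # No caching: each call reflects whatever is on disk right now.
        if not self.path.exists():
            return {}
        return json.loads(self.path.read_text())


def add_item(path, name, info):
    """Add one entry to the manifest file (hypothetical writer)."""
    path = Path(path)
    current = json.loads(path.read_text()) if path.exists() else {}
    current[name] = info
    path.write_text(json.dumps(current))


with tempfile.TemporaryDirectory() as tmp:
    manifest = Path(tmp) / "items.json"
    reader = ManifestView(manifest)                    # plays "Notebook 2"
    add_item(manifest, "results", {"type": "table"})   # plays "Notebook 1"
    first_view = sorted(reader.items())
    add_item(manifest, "analysis", {"type": "table"})
    second_view = sorted(reader.items())
```

The reader object is created once, yet its second call sees the item added after it was constructed.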

Complete Workflow Example

Here's a complete example from data loading to model deployment:

from datafolio import DataFolio
import pandas as pd
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, f1_score

# 1. Initialize bundle
folio = DataFolio('experiments/fraud_detection_v1')
folio.metadata['experiment_name'] = 'fraud_detection_baseline'
folio.metadata['date'] = '2025-01-20'
folio.metadata['tags'] = ['classification', 'fraud', 'baseline']

# 2. Reference external raw data
folio.reference_table('raw_data',
    reference='s3://data-lake/fraud/raw_2024.parquet',
    description='Raw transaction data from 2024',
    num_rows=1_000_000,
    num_cols=25)

# 3. Load and clean data
raw_df = pd.read_parquet('s3://data-lake/fraud/raw_2024.parquet')
cleaned_df = clean_data(raw_df)  # Your cleaning function

folio.add_table('cleaned_data', cleaned_df,
    description='Cleaned transaction data',
    inputs=['raw_data'])

# 4. Engineer features
features_df = engineer_features(cleaned_df)  # Your feature engineering

folio.add_table('features', features_df,
    description='Engineered features for classification',
    inputs=['cleaned_data'])

# 5. Train/test split
X = features_df.drop('is_fraud', axis=1)
y = features_df['is_fraud']
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# 6. Train model
clf = RandomForestClassifier(n_estimators=100, max_depth=10, random_state=42)
clf.fit(X_train, y_train)

# 7. Evaluate
y_pred = clf.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)

# 8. Save model and results
folio.add_model('classifier', clf,
    description='Random forest fraud classifier',
    hyperparameters={
        'n_estimators': 100,
        'max_depth': 10,
        'random_state': 42
    },
    inputs=['features'])

folio.add_json('metrics', {
    'accuracy': float(accuracy),
    'f1_score': float(f1),
    'train_samples': len(X_train),
    'test_samples': len(X_test)
})

folio.add_json('feature_importance', {
    feature: float(importance)
    for feature, importance in zip(X.columns, clf.feature_importances_)
})

# 9. Update metadata with results
folio.metadata['accuracy'] = float(accuracy)
folio.metadata['f1_score'] = float(f1)
folio.metadata['status'] = 'completed'

# 10. View summary
folio.describe()

# 11. Later: Load and use in production
production_folio = DataFolio('experiments/fraud_detection_v1')
model = production_folio.get_model('classifier')
metrics = production_folio.get_json('metrics')

print(f"Deploying model with accuracy: {metrics['accuracy']}")
predictions = model.predict(new_transactions)

Directory Structure

DataFolio creates an intuitive directory structure:

experiments/my_experiment/
├── metadata.json              # Bundle metadata
├── items.json                # Manifest of all items
├── snapshots.json            # Snapshot registry (if using snapshots)
│
├── tables/
│   └── features.parquet      # DataFrames
│
├── models/
│   └── classifier.joblib     # Scikit-learn models
│
├── numpy/
│   └── embeddings.npy       # Numpy arrays
│
└── artifacts/
    ├── config.json          # JSON data
    ├── plot.png            # Images
    └── report.pdf          # Any file type

All files use standard formats:

Parquet for DataFrames (efficient, columnar)
JSON for configs and metrics (human-readable)
Joblib/Skops for scikit-learn models
Numpy .npy for arrays

You can inspect any file directly without DataFolio!
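
As a small illustration of inspecting a bundle with no library at all, the sketch below lists the files in each subdirectory using only pathlib. The helper name list_bundle_files is an assumption, and the demo builds a stand-in bundle from empty placeholder files that follow the layout shown above.

```python
import tempfile
from pathlib import Path


def list_bundle_files(bundle_dir):
    """Map each subdirectory of a bundle to the file names it contains."""
    bundle = Path(bundle_dir)
    listing = {}
    for path in sorted(bundle.rglob("*")):
        if path.is_file():
            # Top-level files are grouped under "."
            folder = path.parent.relative_to(bundle).as_posix()
            listing.setdefault(folder, []).append(path.name)
    return listing


# Build a stand-in bundle with empty placeholder files for the demo.
with tempfile.TemporaryDirectory() as tmp:
    bundle = Path(tmp) / "my_experiment"
    for rel in ["metadata.json", "items.json", "tables/features.parquet",
                "models/classifier.joblib", "numpy/embeddings.npy"]:
        target = bundle / rel
        target.parent.mkdir(parents=True, exist_ok=True)
        target.touch()
    listing = list_bundle_files(bundle)
```

On a real bundle, the same function can be pointed at the bundle directory instead of the temporary one.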

Tips and Tricks

1. Use Descriptive Names

# Good
folio.add_data('training_features_v2', df)
folio.add_model('random_forest_baseline', model)

# Bad
folio.add_data('data1', df)
folio.add_model('model', model)

2. Add Descriptions

# Always add descriptions
folio.add_table('features', df,
    description='Engineered features with PCA and polynomial terms')

# Future you will thank present you

3. Track Lineage

# Always specify inputs
folio.add_table('features', feature_df,
    inputs=['cleaned_data'])

# This helps you understand the data flow

4. Use Custom Metadata

# Store experiment context
folio.metadata['experiment_type'] = 'hyperparameter_tuning'
folio.metadata['best_params'] = {'n_estimators': 100, 'max_depth': 10}
folio.metadata['notes'] = 'Best results from grid search over 50 configs'

5. Clean Up Regularly

# Delete temporary data
folio.delete(['debug_data', 'temp_results', 'old_model_v1'])

# Check before deleting
folio.describe()  # Review what you have

6. Use References for Large Data

# Don't copy huge datasets
folio.reference_table('training_data',
    reference='s3://bucket/huge_data.parquet')

# Load directly from source when needed
df = pd.read_parquet(folio.data.training_data.path)

7. Leverage Autocomplete

# This is more discoverable
config = folio.data.config.content
model = folio.data.classifier.content

# Than this
config = folio.get_data('config')
model = folio.get_data('classifier')

8. Commit to Git

# Your bundle is git-friendly
cd experiments/my_experiment
git add .
git commit -m "Baseline model - 89% accuracy"
git push

9. Use Snapshots for Versions

# Create snapshots at milestones
folio.create_snapshot('v1.0-baseline',
    description='Initial baseline model')

# Experiment freely
folio.add_model('classifier', new_model, overwrite=True)

# Return to baseline anytime
baseline = DataFolio.load_snapshot('experiments/exp', 'v1.0-baseline')

See the Snapshots Guide for more details.

10. Use the CLI

# Describe bundle from terminal
datafolio describe

# List snapshots
datafolio snapshot list

# Compare versions
datafolio snapshot compare v1.0 v2.0

Common Patterns

Experiment Template

def run_experiment(name, config):
    # Initialize
    folio = DataFolio(f'experiments/{name}')
    folio.metadata.update(config)
    folio.metadata['status'] = 'running'

    # Load data
    data = load_data(config['data_source'])
    folio.add_data('data', data)

    # Train
    model = train_model(data, config)
    folio.add_model('model', model)

    # Evaluate
    metrics = evaluate_model(model, data)
    folio.add_json('metrics', metrics)
    folio.metadata.update(metrics)
    folio.metadata['status'] = 'completed'

    return folio

# Run experiments
exp1 = run_experiment('baseline', {'lr': 0.01, 'data_source': 'train.csv'})
exp2 = run_experiment('tuned', {'lr': 0.001, 'data_source': 'train.csv'})

A/B Test Comparison

# Load two experiments
baseline = DataFolio('experiments/baseline')
variant = DataFolio('experiments/variant_a')

# Compare
print(f"Baseline accuracy: {baseline.metadata['accuracy']}")
print(f"Variant accuracy: {variant.metadata['accuracy']}")

# Deploy winner
if variant.metadata['accuracy'] > baseline.metadata['accuracy']:
    model = variant.get_model('classifier')
else:
    model = baseline.get_model('classifier')

Hyperparameter Grid Search

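from itertools import product

from datafolio import DataFolio
from sklearn.ensemble import RandomForestClassifier

# Assumes X_train, y_train, X_test, y_test are already defined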

# Grid
params = {
    'n_estimators': [50, 100, 200],
    'max_depth': [5, 10, 20]
}

results = []

# Try each combination
for n_est, depth in product(params['n_estimators'], params['max_depth']):
    # Create bundle
    folio = DataFolio(f'experiments/grid_search/n{n_est}_d{depth}')

    # Train
    model = RandomForestClassifier(n_estimators=n_est, max_depth=depth)
    model.fit(X_train, y_train)

    # Evaluate
    acc = model.score(X_test, y_test)

    # Save
    folio.add_model('model', model)
    folio.metadata['n_estimators'] = n_est
    folio.metadata['max_depth'] = depth
    folio.metadata['accuracy'] = acc

    results.append((n_est, depth, acc))

# Find best
best = max(results, key=lambda x: x[2])
print(f"Best: n_estimators={best[0]}, max_depth={best[1]}, acc={best[2]}")
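
Since each grid-search run lives in its own directory, results could also be collected later straight from disk. The sketch below assumes, without confirmation from the DataFolio docs, that metadata.json is a flat JSON object holding the custom keys. The rank_bundles helper is hypothetical, and the accuracy values are made-up placeholders used only to exercise the code.

```python
import json
import tempfile
from pathlib import Path


def rank_bundles(root, metric="accuracy"):
    """Return (bundle_name, metric_value) pairs, best first."""
    scores = []
    for meta_path in Path(root).glob("*/metadata.json"):
        # Assumption: metadata.json is a flat JSON object of custom keys.
        metadata = json.loads(meta_path.read_text())
        if metric in metadata:
            scores.append((meta_path.parent.name, metadata[metric]))
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


# Stand-in bundles with placeholder accuracy values, for demonstration only.
with tempfile.TemporaryDirectory() as tmp:
    for name, acc in [("n50_d5", 0.81), ("n100_d10", 0.88), ("n200_d20", 0.86)]:
        bundle = Path(tmp) / name
        bundle.mkdir()
        (bundle / "metadata.json").write_text(json.dumps({"accuracy": acc}))
    ranking = rank_bundles(tmp)
```

If the real file layout differs, reading folio.metadata through the library as in the loop above remains the dependable route.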

Next Steps

Learn about snapshots - See the Snapshots Guide for versioning experiments
API Reference - Check the API docs for all methods
Examples - Browse the main documentation for more examples
CLI Tools - Use datafolio --help to explore the command-line interface

Common Questions

Q: How is this different from MLflow/Weights & Biases?

A: DataFolio is filesystem-based and self-contained. No servers, no databases, no accounts. Everything is just files you can inspect, version with git, and move around.

Q: Can I use this in production?

A: Yes! DataFolio bundles are self-contained and can be deployed anywhere. Load a bundle, get your model, and run inference.

Q: Does it work with cloud storage?

A: Yes! DataFolio supports any storage backend via cloud-files (S3, GCS, Azure, etc.). Just use cloud paths:

folio = DataFolio('s3://my-bucket/experiments/exp1')

Q: How do I share bundles with colleagues?

A: Just share the directory! Everything is self-contained. You can:

Commit to git
Copy to shared storage
Zip and email
Mount network drives

Q: What about versioning?

A: Use Snapshots! They let you create immutable checkpoints without duplicating data.

Q: Can I use this with Jupyter notebooks?

A: Absolutely! DataFolio works great in notebooks. Multiple notebooks can even access the same bundle simultaneously.

Need Help?

Documentation: Check the full docs
Issues: Report bugs on GitHub
Examples: See the repository for example notebooks