# Working with Models

This guide covers storing and loading machine learning models in DataFolio, including scikit-learn models and custom transformers.
## Quick Start

```python
from datafolio import DataFolio
from sklearn.ensemble import RandomForestClassifier

# Train a model
clf = RandomForestClassifier(n_estimators=100)
clf.fit(X_train, y_train)

# Save it
folio = DataFolio('experiments/my_experiment')
folio.add_sklearn('classifier', clf,
                  description='Random forest baseline',
                  inputs=['training_data'])

# Load it later
loaded_clf = folio.get_sklearn('classifier')
predictions = loaded_clf.predict(X_test)
```
## Scikit-learn Models

### Standard Models

DataFolio automatically handles all standard scikit-learn models and many popular ML libraries:
```python
from sklearn.ensemble import RandomForestClassifier, GradientBoostingRegressor
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
import xgboost as xgb
import lightgbm as lgb

# All of these work automatically
folio.add_sklearn('rf', RandomForestClassifier())
folio.add_sklearn('gbr', GradientBoostingRegressor())
folio.add_sklearn('lr', LogisticRegression())
folio.add_sklearn('svm', SVC())
folio.add_sklearn('xgb', xgb.XGBClassifier())
folio.add_sklearn('lgb', lgb.LGBMRegressor())
```
### Custom Transformers with Skops

When you create custom transformers for sklearn pipelines, pass the `custom=True` flag to enable skops serialization. This makes your pipelines portable across different environments.
#### Why Use Skops?

Use skops (`custom=True`) when:

- Your pipeline contains custom transformers (not from sklearn/standard libraries)
- You need to deploy models to environments without access to your class definitions
- You want more secure model serialization for production
- You're sharing models with collaborators who may not have your codebase

Use joblib (default) when:

- All components are from standard libraries (sklearn, XGBoost, LightGBM, etc.)
- You're working within a single environment/codebase
- You prioritize speed over portability
### Creating Custom Transformers

The key requirement: custom transformers MUST inherit from sklearn's base classes.
```python
from sklearn.base import BaseEstimator, TransformerMixin
import numpy as np

# ✅ CORRECT - Inherits from sklearn base classes
class PercentileClipper(BaseEstimator, TransformerMixin):
    """Custom transformer that clips values to percentile bounds."""

    def __init__(self, lower=1, upper=99):
        """Initialize with percentile bounds.

        Args:
            lower: Lower percentile bound (default: 1)
            upper: Upper percentile bound (default: 99)
        """
        self.lower = lower
        self.upper = upper

    def fit(self, X, y=None):
        """Fit by computing percentile bounds.

        Args:
            X: Training data
            y: Target values (ignored)

        Returns:
            self
        """
        self.lower_bound_ = np.percentile(X, self.lower, axis=0)
        self.upper_bound_ = np.percentile(X, self.upper, axis=0)
        return self

    def transform(self, X):
        """Transform by clipping to percentile bounds.

        Args:
            X: Data to transform

        Returns:
            Clipped data
        """
        return np.clip(X, self.lower_bound_, self.upper_bound_)


# ❌ WRONG - Plain class without sklearn mixins
class BadTransformer:
    """This won't work with skops!"""

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return X
```
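As a quick standalone sanity check of the clipping logic itself (pure NumPy, no DataFolio needed): for the values 1..100, NumPy's default linear interpolation puts the 5th and 95th percentiles at 5.95 and 95.05, so clipping pulls the extremes in to those bounds.

```python
import numpy as np

# What PercentileClipper computes under the hood, on a known input
X = np.arange(1, 101, dtype=float).reshape(-1, 1)  # values 1..100
lo = np.percentile(X, 5, axis=0)   # lower bound (5th percentile)
hi = np.percentile(X, 95, axis=0)  # upper bound (95th percentile)
clipped = np.clip(X, lo, hi)

print(clipped.min(), clipped.max())  # 5.95 95.05
```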
#### Why Inherit from BaseEstimator and TransformerMixin?

These mixins provide essential sklearn functionality:

**BaseEstimator provides:**

- `get_params()` - Required by sklearn for introspection
- `set_params()` - Required by sklearn for hyperparameter tuning
- Ensures your transformer works with GridSearchCV, RandomizedSearchCV, etc.

**TransformerMixin provides:**

- `fit_transform()` - Convenience method that calls `fit()` then `transform()`
- Ensures your transformer works seamlessly in pipelines
**Required for skops:**

- Skops needs these methods to properly serialize and deserialize your custom classes
- Without them, skops cannot reconstruct your transformer when loading
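To see concretely what the mixins buy you, here is a minimal standalone check. The `Doubler` class is a throwaway example (not part of DataFolio); note that it never defines `get_params`, `set_params`, or `fit_transform` itself:

```python
from sklearn.base import BaseEstimator, TransformerMixin
import numpy as np

class Doubler(BaseEstimator, TransformerMixin):
    """Throwaway transformer: multiplies inputs by a factor."""
    def __init__(self, factor=2):
        self.factor = factor
    def fit(self, X, y=None):
        return self
    def transform(self, X):
        return np.asarray(X) * self.factor

t = Doubler(factor=3)
print(t.get_params())               # inherited from BaseEstimator
t.set_params(factor=4)              # also from BaseEstimator
print(t.fit_transform([[1], [2]]))  # inherited from TransformerMixin
```

`get_params()` works because `BaseEstimator` inspects the `__init__` signature, which is why every constructor argument must be stored on `self` under the same name.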
### Using Custom Transformers in Pipelines

```python
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

# Create pipeline with custom transformer
pipeline = Pipeline([
    ('clipper', PercentileClipper(lower=5, upper=95)),
    ('scaler', StandardScaler()),
    ('clf', LogisticRegression())
])

# Fit pipeline
X_train = np.random.randn(100, 5)
y_train = np.random.randint(0, 2, 100)
pipeline.fit(X_train, y_train)

# Save with skops format
folio.add_sklearn('custom_pipeline', pipeline,
                  custom=True,  # ← Important! Enables skops
                  description='Pipeline with custom percentile clipper',
                  inputs=['training_data'])

# Load in a different environment
# No need for the PercentileClipper class definition!
folio2 = DataFolio('experiments/my_experiment')
loaded_pipeline = folio2.get_sklearn('custom_pipeline')
predictions = loaded_pipeline.predict(X_test)
```
### Best Practices for Custom Transformers

**1. Always inherit from sklearn base classes**

```python
from sklearn.base import BaseEstimator, TransformerMixin

class MyTransformer(BaseEstimator, TransformerMixin):
    # Your implementation
    pass
```

**2. Store all parameters in `__init__`**

```python
def __init__(self, threshold=0.5, method='mean'):
    # Store ALL parameters - required for get_params()
    self.threshold = threshold
    self.method = method
```

**3. Store fitted parameters with a trailing underscore**

```python
def fit(self, X, y=None):
    # Fitted parameters end with an underscore (sklearn convention)
    self.mean_ = np.mean(X)
    self.std_ = np.std(X)
    return self
```

**4. Always return `self` from `fit()`**

```python
def fit(self, X, y=None):
    # Do fitting...
    return self  # ← Required for sklearn API
```

**5. Make `transform()` stateless**

```python
def transform(self, X):
    # Only use fitted parameters (those ending with _)
    # Don't modify instance state here
    return (X - self.mean_) / self.std_
```
## Serialization Format Comparison

| Feature | Joblib (default) | Skops (`custom=True`) |
|---|---|---|
| Speed | Faster | Slightly slower |
| Portability | Requires class definitions | Self-contained |
| Use case | Standard libraries only | Custom transformers |
| Security | Less secure | More secure |
| Deployment | Need codebase | Standalone |
```python
# Joblib format (default)
folio.add_sklearn('model', pipeline)
# → Saves as .joblib
# → Fast but requires class definitions

# Skops format (portable)
folio.add_sklearn('model', pipeline, custom=True)
# → Saves as .skops
# → Self-contained, works without class definitions
```
## Complete Example: Custom Transformer Pipeline

```python
from datafolio import DataFolio
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
import numpy as np

# Define custom transformer
class OutlierClipper(BaseEstimator, TransformerMixin):
    """Clips outliers based on the IQR method."""

    def __init__(self, iqr_multiplier=1.5):
        self.iqr_multiplier = iqr_multiplier

    def fit(self, X, y=None):
        q1 = np.percentile(X, 25, axis=0)
        q3 = np.percentile(X, 75, axis=0)
        iqr = q3 - q1
        self.lower_bound_ = q1 - self.iqr_multiplier * iqr
        self.upper_bound_ = q3 + self.iqr_multiplier * iqr
        return self

    def transform(self, X):
        return np.clip(X, self.lower_bound_, self.upper_bound_)

# Create and train pipeline
pipeline = Pipeline([
    ('outlier_clipper', OutlierClipper(iqr_multiplier=1.5)),
    ('scaler', StandardScaler()),
    ('classifier', LogisticRegression(random_state=42))
])

# Generate training data
X_train = np.random.randn(200, 10)
y_train = np.random.randint(0, 2, 200)

# Fit pipeline
pipeline.fit(X_train, y_train)

# Save with skops
folio = DataFolio('experiments/outlier_detection')
folio.add_sklearn('pipeline', pipeline,
                  custom=True,  # Enable skops for custom transformer
                  description='Logistic regression with IQR-based outlier clipping',
                  inputs=['training_data'],
                  hyperparameters={
                      'iqr_multiplier': 1.5,
                      'random_state': 42
                  })

# Later: load and use (even without the OutlierClipper class!)
folio2 = DataFolio('experiments/outlier_detection')
loaded_pipeline = folio2.get_sklearn('pipeline')

# Make predictions
X_test = np.random.randn(50, 10)
predictions = loaded_pipeline.predict(X_test)
probabilities = loaded_pipeline.predict_proba(X_test)

print(f"Predictions: {predictions[:5]}")
print(f"Probabilities: {probabilities[:5]}")
```
## Model Metadata

Add rich metadata to track model provenance:

```python
folio.add_sklearn('classifier', model,
                  description='Random forest with balanced class weights',
                  inputs=['processed_features', 'labels'],
                  hyperparameters={
                      'n_estimators': 100,
                      'max_depth': 10,
                      'class_weight': 'balanced',
                      'random_state': 42
                  },
                  code='''
clf = RandomForestClassifier(
    n_estimators=100,
    max_depth=10,
    class_weight='balanced',
    random_state=42
)
clf.fit(X_train, y_train)
''')
```
## Loading Models

All models can be loaded with type-specific or generic methods:

```python
# Type-specific method
clf = folio.get_sklearn('classifier')

# Generic method (delegates to get_sklearn)
clf = folio.get_model('classifier')

# Data accessor (autocomplete-friendly)
clf = folio.data.classifier.content
```
## Common Patterns

### A/B Testing Models

```python
# Train baseline
baseline = RandomForestClassifier(n_estimators=50)
baseline.fit(X_train, y_train)

# Train variant
variant = RandomForestClassifier(n_estimators=200)
variant.fit(X_train, y_train)

# Save both
folio.add_sklearn('baseline', baseline)
folio.add_sklearn('variant', variant)

# Compare
baseline_score = folio.data.baseline.content.score(X_test, y_test)
variant_score = folio.data.variant.content.score(X_test, y_test)

# Deploy winner
if variant_score > baseline_score:
    production_model = folio.get_sklearn('variant')
else:
    production_model = folio.get_sklearn('baseline')
```
### Pipeline Versioning

```python
# Version 1: Simple pipeline
v1_pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('clf', LogisticRegression())
])
v1_pipeline.fit(X_train, y_train)
folio.add_sklearn('pipeline_v1', v1_pipeline)

# Version 2: Added custom preprocessing
v2_pipeline = Pipeline([
    ('clipper', PercentileClipper()),  # Custom transformer
    ('scaler', StandardScaler()),
    ('clf', LogisticRegression())
])
v2_pipeline.fit(X_train, y_train)
folio.add_sklearn('pipeline_v2', v2_pipeline, custom=True)  # Needs skops!

# Compare versions
v1_score = folio.data.pipeline_v1.content.score(X_test, y_test)
v2_score = folio.data.pipeline_v2.content.score(X_test, y_test)
```
### Hyperparameter Tuning Archive

```python
from sklearn.model_selection import ParameterGrid

# Define grid
param_grid = {
    'n_estimators': [50, 100, 200],
    'max_depth': [5, 10, 20]
}

# Try all combinations
for params in ParameterGrid(param_grid):
    model = RandomForestClassifier(**params)
    model.fit(X_train, y_train)
    score = model.score(X_test, y_test)

    # Save each model
    name = f"rf_n{params['n_estimators']}_d{params['max_depth']}"
    folio.add_sklearn(name, model,
                      hyperparameters=params,
                      description=f"RF with {params['n_estimators']} trees, depth {params['max_depth']}")

    # Track score in metadata
    folio._items[name]['test_score'] = score

# Find best model
best_name = max(folio.models,
                key=lambda name: folio._items[name].get('test_score', 0))
best_model = folio.get_sklearn(best_name)

print(f"Best model: {best_name}")
print(f"Score: {folio._items[best_name]['test_score']}")
```
## FAQ

**Q: When should I use `custom=True`?**

A: Use it when your pipeline contains custom transformers (classes you wrote). Standard sklearn/XGBoost/LightGBM models don't need it.

**Q: Can I mix joblib and skops models in the same bundle?**

A: Yes! DataFolio automatically detects the format when loading. You can have some models saved with joblib and others with skops.

**Q: Do I need to install skops?**

A: Only if you use `custom=True`. For standard models (joblib format), skops is not required.

**Q: Can I convert a joblib model to skops?**

A: Yes, just load and re-save with `custom=True`:

```python
model = folio.get_sklearn('old_model')
folio.add_sklearn('new_model', model, custom=True)
```
**Q: What if I don't inherit from BaseEstimator/TransformerMixin?**

A: Skops serialization will fail. Always inherit from these classes for custom transformers.
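A quick way to see why: a plain class simply lacks the estimator API that skops (and sklearn tooling) introspects. This throwaway check uses only the standard library:

```python
class PlainTransformer:
    """No sklearn base classes - looks like a transformer, but isn't an estimator."""
    def fit(self, X, y=None):
        return self
    def transform(self, X):
        return X

t = PlainTransformer()
# No get_params()/set_params() - serialization tooling relies on these
print(hasattr(t, 'get_params'))  # False
```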
**Q: How do I know which format a model uses?**

A: Check the metadata:

```python
print(folio._items['model_name']['serialization_format'])  # 'joblib' or 'skops'
print(folio._items['model_name']['filename'])              # ends in .joblib or .skops
```
## Next Steps

- Getting Started Guide - Complete tutorial
- API Reference - Method documentation
- Snapshots - Version control for models
- GitHub Examples - More examples