
Advanced Guides

Welcome to the DataFolio advanced guides! These tutorials cover specialized topics and advanced features.

New to DataFolio? Start with the Getting Started tutorial first, then come back here for advanced topics.

Available Guides

Working with Models

Save and load ML models with custom transformers

Complete guide to working with machine learning models in DataFolio:

- Scikit-learn models (standard and custom)
- Custom transformers with sklearn mixins
- Joblib vs. skops serialization formats
- When and how to use custom=True for portability
- PyTorch models overview
- Model metadata and lineage tracking
- Common patterns (A/B testing, hyperparameter tuning)
- Best practices and FAQ

Who should read this: Anyone working with sklearn pipelines, custom transformers, or deploying models across environments.

Time to complete: 20-25 minutes
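The custom-transformer pattern the guide builds on can be sketched with plain scikit-learn and joblib. This is generic library usage, not DataFolio's own API; the guide covers how custom=True handles the portability caveat noted in the comments.

```python
# A custom transformer built on sklearn's mixins, serialized with joblib.
# Plain scikit-learn/joblib usage, independent of DataFolio.
import joblib
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

class Log1pScaler(BaseEstimator, TransformerMixin):
    """Apply log(1 + x) to every feature."""
    def fit(self, X, y=None):
        return self  # stateless: nothing to learn

    def transform(self, X):
        return np.log1p(X)

rng = np.random.default_rng(0)
X = np.abs(rng.normal(size=(100, 3)))
y = (X.sum(axis=1) > 2.0).astype(int)

pipe = Pipeline([("scale", Log1pScaler()), ("clf", LogisticRegression())])
pipe.fit(X, y)

joblib.dump(pipe, "pipeline.joblib")
# Caveat: joblib pickles the class by reference, so Log1pScaler must be
# importable wherever the file is loaded -- the portability problem the
# guide's serialization-format comparison addresses.
restored = joblib.load("pipeline.joblib")
```

The round-tripped pipeline predicts identically to the original; the fragile part is the environment the load happens in, which is why the joblib vs. skops discussion matters.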


Snapshots

Version control for your experiments

Deep dive into DataFolio's snapshot system:

- Why use snapshots (with real-world scenarios)
- Creating and loading snapshots
- Copy-on-write versioning (efficient storage)
- Comparing and managing snapshots
- Snapshot workflows (paper submissions, A/B testing, hyperparameter tuning)
- Git integration and credential protection
- CLI tools for snapshot management
- Best practices and troubleshooting

Who should read this: Anyone who wants to version experiments, maintain reproducibility, or experiment safely without losing good results.

Time to complete: 15-20 minutes
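Copy-on-write is the idea behind the "efficient storage" claim: a snapshot records references to existing data rather than duplicating it, and only items that change afterward get new copies. A toy illustration of that idea (not DataFolio's actual implementation):

```python
class ToyStore:
    """Toy copy-on-write store: snapshots share unchanged entries."""
    def __init__(self):
        self.live = {}        # item name -> data blob
        self.snapshots = {}   # snapshot tag -> {item name -> blob}

    def put(self, name, blob):
        self.live[name] = blob

    def snapshot(self, tag):
        # Cheap: copies the *references*, not the blobs themselves.
        self.snapshots[tag] = dict(self.live)

    def restore(self, tag):
        self.live = dict(self.snapshots[tag])

store = ToyStore()
store.put("results", [0.91, 0.93])
store.snapshot("baseline")       # near-free: no data duplicated
store.put("results", [0.42])     # only now does new data get stored
store.restore("baseline")        # back to the good run
```

The "experiment safely without losing good results" workflow falls out directly: snapshotting before a risky change costs almost nothing, so there is no reason not to.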


Caching

Speed up remote data access

Local caching for cloud-stored bundles:

- Why and when to use caching
- Enabling and configuring caching
- Cache management (status, clearing, invalidation)
- Performance examples and benchmarks
- Team collaboration with shared caches
- Offline work preparation
- Best practices and troubleshooting
- Interaction with Parquet filtering

Who should read this: Anyone working with remote bundles (S3, GCS, etc.) or wanting faster repeated data access.

Time to complete: 15-20 minutes
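The core mechanic is simple: key each remote object to a local path, fetch on first access, and serve the local copy afterward. A hand-rolled sketch of that idea — DataFolio's cache layers status, clearing, and invalidation on top, and `fetch` here is a placeholder for the real download:

```python
import hashlib
from pathlib import Path

CACHE_DIR = Path("bundle-cache")

def cached_fetch(remote_path: str, fetch) -> Path:
    """Return a local copy of remote_path, downloading only on first access."""
    CACHE_DIR.mkdir(exist_ok=True)
    key = hashlib.sha256(remote_path.encode()).hexdigest()
    local = CACHE_DIR / key
    if not local.exists():          # cache miss: pay the transfer cost once
        fetch(remote_path, local)
    return local                    # cache hit: no remote round-trip

# Usage with a stub "download" that records how often it runs:
calls = []
def fake_download(src, dst):
    calls.append(src)
    Path(dst).write_bytes(b"parquet bytes")

p1 = cached_fetch("s3://bucket/data.parquet", fake_download)
p2 = cached_fetch("s3://bucket/data.parquet", fake_download)  # served locally
```

The second call never touches the network, which is the entire performance story for repeated reads of remote bundles.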


Parquet Optimization

Work efficiently with large datasets

Advanced techniques for working with large Parquet files:

- Column selection (column pruning): read only what you need
- Row filtering (predicate pushdown): filter before loading
- Memory optimization strategies
- Working with datasets larger than memory
- Cloud storage optimization and caching
- Integration with PyArrow and DuckDB
- Real-world performance examples
- Best practices and troubleshooting

Who should read this: Anyone working with large datasets (>1GB), cloud storage, or wanting to optimize performance.

Time to complete: 20-25 minutes


Learning Path

For Beginners:

1. Start with Getting Started
2. Then read Snapshots to learn about versioning

For Specific Use Cases:

- Experiment tracking: Getting Started + CLI Reference
- Reproducible research: Snapshots
- Curating results for publication: archive() / copy(follow_lineage=True) in the API Reference
- Team collaboration: Getting Started (Multi-Instance Access section)
- Sharing files with non-datafolio users: get_item_path() / describe(show_paths=True) in the API Reference
- Model deployment: Working with Models
- Custom sklearn pipelines: Working with Models
- Cloud storage: Caching + Parquet Optimization
- Large datasets: Parquet Optimization
- Fast remote access: Caching


Additional Resources

Need Help?