
Advanced Guides

Welcome to the DataFolio advanced guides! These tutorials cover specialized topics and advanced features.

New to DataFolio? Start with the Getting Started tutorial first, then come back here for advanced topics.

Available Guides

Working with Models

Save and load ML models with custom transformers

Complete guide to working with machine learning models in DataFolio:

- Scikit-learn models (standard and custom)
- Custom transformers with sklearn mixins
- Joblib vs. skops serialization formats
- When and how to use custom=True for portability
- PyTorch models overview
- Model metadata and lineage tracking
- Common patterns (A/B testing, hyperparameter tuning)
- Best practices and FAQ

Who should read this: Anyone working with sklearn pipelines, custom transformers, or deploying models across environments.

Time to complete: 20-25 minutes
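The custom-transformer pattern the guide builds on can be sketched with plain scikit-learn and joblib. This is generic library usage, not DataFolio's own API; the guide covers how custom=True handles the portability caveat noted in the comments.

```python
# A custom transformer built on sklearn's mixins, serialized with joblib.
# Plain scikit-learn/joblib usage, independent of DataFolio.
import joblib
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

class Log1pScaler(BaseEstimator, TransformerMixin):
    """Apply log(1 + x) to every feature."""
    def fit(self, X, y=None):
        return self  # stateless: nothing to learn

    def transform(self, X):
        return np.log1p(X)

rng = np.random.default_rng(0)
X = np.abs(rng.normal(size=(100, 3)))
y = (X.sum(axis=1) > 2.0).astype(int)

pipe = Pipeline([("scale", Log1pScaler()), ("clf", LogisticRegression())])
pipe.fit(X, y)

joblib.dump(pipe, "pipeline.joblib")
# Caveat: joblib pickles the class by reference, so Log1pScaler must be
# importable wherever the file is loaded -- the portability problem the
# guide's serialization-format comparison addresses.
restored = joblib.load("pipeline.joblib")
```

The round-tripped pipeline predicts identically to the original; the fragile part is the environment the load happens in, which is why the joblib vs. skops discussion matters.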


Snapshots

Version control for your experiments

Deep dive into DataFolio's snapshot system:

- Why use snapshots (with real-world scenarios)
- Creating and loading snapshots
- Copy-on-write versioning (efficient storage)
- Comparing and managing snapshots
- Snapshot workflows (paper submissions, A/B testing, hyperparameter tuning)
- Git integration and credential protection
- CLI tools for snapshot management
- Best practices and troubleshooting

Who should read this: Anyone who wants to version experiments, maintain reproducibility, or experiment safely without losing good results.

Time to complete: 15-20 minutes
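Copy-on-write is the idea behind the "efficient storage" claim: a snapshot records references to existing data rather than duplicating it, and only items that change afterward get new copies. A toy illustration of that idea (not DataFolio's actual implementation):

```python
class ToyStore:
    """Toy copy-on-write store: snapshots share unchanged entries."""
    def __init__(self):
        self.live = {}        # item name -> data blob
        self.snapshots = {}   # snapshot tag -> {item name -> blob}

    def put(self, name, blob):
        self.live[name] = blob

    def snapshot(self, tag):
        # Cheap: copies the *references*, not the blobs themselves.
        self.snapshots[tag] = dict(self.live)

    def restore(self, tag):
        self.live = dict(self.snapshots[tag])

store = ToyStore()
store.put("results", [0.91, 0.93])
store.snapshot("baseline")       # near-free: no data duplicated
store.put("results", [0.42])     # only now does new data get stored
store.restore("baseline")        # back to the good run
```

The "experiment safely without losing good results" workflow falls out directly: snapshotting before a risky change costs almost nothing, so there is no reason not to.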


Caching

Speed up remote data access

Local caching for cloud-stored bundles:

- Why and when to use caching
- Enabling and configuring caching
- Cache management (status, clearing, invalidation)
- Performance examples and benchmarks
- Team collaboration with shared caches
- Offline work preparation
- Best practices and troubleshooting
- Interaction with Parquet filtering

Who should read this: Anyone working with remote bundles (S3, GCS, etc.) or wanting faster repeated data access.

Time to complete: 15-20 minutes
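The core mechanic is simple: key each remote object to a local path, fetch on first access, and serve the local copy afterward. A hand-rolled sketch of that idea — DataFolio's cache layers status, clearing, and invalidation on top, and `fetch` here is a placeholder for the real download:

```python
import hashlib
from pathlib import Path

CACHE_DIR = Path("bundle-cache")

def cached_fetch(remote_path: str, fetch) -> Path:
    """Return a local copy of remote_path, downloading only on first access."""
    CACHE_DIR.mkdir(exist_ok=True)
    key = hashlib.sha256(remote_path.encode()).hexdigest()
    local = CACHE_DIR / key
    if not local.exists():          # cache miss: pay the transfer cost once
        fetch(remote_path, local)
    return local                    # cache hit: no remote round-trip

# Usage with a stub "download" that records how often it runs:
calls = []
def fake_download(src, dst):
    calls.append(src)
    Path(dst).write_bytes(b"parquet bytes")

p1 = cached_fetch("s3://bucket/data.parquet", fake_download)
p2 = cached_fetch("s3://bucket/data.parquet", fake_download)  # served locally
```

The second call never touches the network, which is the entire performance story for repeated reads of remote bundles.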


Parquet Optimization

Work efficiently with large datasets

Advanced techniques for working with large Parquet files:

- Column selection (column pruning): read only what you need
- Row filtering (predicate pushdown): filter before loading
- Memory optimization strategies
- Working with datasets larger than memory
- Cloud storage optimization and caching
- Integration with PyArrow and DuckDB
- Real-world performance examples
- Best practices and troubleshooting

Who should read this: Anyone working with large datasets (>1GB), cloud storage, or wanting to optimize performance.

Time to complete: 20-25 minutes


Learning Path

For Beginners:

1. Start with Getting Started
2. Then read Snapshots to learn about versioning

For Specific Use Cases:

- Experiment tracking: Getting Started + CLI Reference
- Reproducible research: Snapshots
- Curating results for publication: archive() / copy(follow_lineage=True) in the API Reference
- Team collaboration: Getting Started (Multi-Instance Access section)
- Sharing files with non-datafolio users: get_item_path() / describe(show_paths=True) in the API Reference
- Model deployment: Working with Models
- Custom sklearn pipelines: Working with Models
- Cloud storage: Caching + Parquet Optimization
- Large datasets: Parquet Optimization
- Fast remote access: Caching


Additional Resources

Need Help?