DataFolio Class - Complete API Reference
This page provides a comprehensive reference of all methods available on the DataFolio class, organized by functionality.
Creating a DataFolio
datafolio.DataFolio.__init__(path, metadata=None, random_suffix=False, read_only=False, cache_enabled=False, cache_dir=None, cache_ttl=None, use_https=False)
Initialize a new DataFolio or open an existing one.
If the directory doesn't exist, creates a new bundle. If it exists, opens the existing bundle and reads manifests.
Parameters:
- path (Union[str, Path]) – Full path to bundle directory (local or cloud)
- metadata (Optional[Dict[str, Any]], default: None) – Optional dictionary of analysis metadata (for new bundles)
- random_suffix (bool, default: False) – If True, append a random suffix to the bundle name
- read_only (bool, default: False) – If True, prevent all write operations
- cache_enabled (bool, default: False) – If True, enable local caching for remote data
- cache_dir (Optional[Union[str, Path]], default: None) – Optional cache directory (default: ~/.datafolio_cache)
- cache_ttl (Optional[int], default: None) – Optional TTL override in seconds (default: 1800 = 30 minutes)
- use_https (bool, default: False) – If True, use HTTPS URLs for CloudFiles (for read-only access to public buckets)
Examples:
Create new bundle with exact name:
>>> folio = DataFolio('experiments/protein-analysis')
# Creates: experiments/protein-analysis/
Create new bundle with random suffix:
>>> folio = DataFolio(
... 'experiments/protein-analysis',
... random_suffix=True
... )
# Creates: experiments/protein-analysis-blue-happy-falcon/
Open existing bundle:
>>> folio = DataFolio('experiments/protein-analysis')
With metadata:
>>> folio = DataFolio(
... 'experiments/my-exp',
... metadata={'date': '2024-01-15', 'scientist': 'Dr. Smith'}
... )
Open existing bundle as read-only (for safe inspection):
>>> folio = DataFolio('experiments/production-model', read_only=True)
>>> model = folio.get_model('classifier') # OK
>>> folio.add_table('new', df) # Error: read-only
Enable caching for cloud bundles (faster repeated access):
>>> folio = DataFolio('gs://bucket/experiment', cache_enabled=True)
>>> df = folio.get_table('data') # Downloads and caches
>>> df = folio.get_table('data') # Loads from cache (instant)
Custom cache configuration:
>>> folio = DataFolio(
... 'gs://bucket/experiment',
... cache_enabled=True,
... cache_dir='/mnt/shared/cache',
... cache_ttl=3600 # 1 hour
... )
Adding Data
Methods for adding different types of data to a DataFolio.
Tables (DataFrames)
datafolio.DataFolio.add_table(name, data, description=None, overwrite=False, inputs=None, models=None, code=None)
Add a table to be included in the bundle.
Writes immediately to tables/ directory and updates items.json.
Parameters:
- name (str) – Unique name for this table
- data (Any) – pandas or Polars DataFrame to include
- description (Optional[str], default: None) – Optional description
- overwrite (bool, default: False) – If True, allow overwriting an existing table
- inputs (Optional[list[str]], default: None) – Optional list of table names used to create this table
- models (Optional[list[str]], default: None) – Optional list of model names used to create this table
- code (Optional[str], default: None) – Optional code snippet that created this table
Returns:
- Self – Self for method chaining
Raises:
- ValueError – If name already exists and overwrite=False
- TypeError – If data is not a DataFrame
Examples:
>>> import pandas as pd
>>> folio = DataFolio('experiments/test')
>>> df = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]})
>>> folio.add_table('summary', df)
>>> # With lineage
>>> pred_df = pd.DataFrame({'pred': [0, 1, 0]})
>>> folio.add_table('predictions', pred_df,
... inputs=['test_data'],
... models=['classifier'],
... code='pred = model.predict(X_test)')
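Since every add_* method returns Self, additions can be chained fluently. A minimal sketch of the return-self pattern (a toy stand-in for illustration, not the actual DataFolio implementation):

```python
class MiniFolio:
    """Toy illustration of the return-self chaining pattern used by DataFolio."""

    def __init__(self):
        self.items = {}

    def add_table(self, name, data):
        self.items[name] = ('table', data)
        return self  # returning self is what enables fluent chaining

    def add_json(self, name, data):
        self.items[name] = ('json', data)
        return self

folio = MiniFolio()
# Both items are added in a single fluent statement
folio.add_table('summary', [[1, 4], [2, 5]]).add_json('config', {'lr': 0.01})
print(sorted(folio.items))  # ['config', 'summary']
```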
datafolio.DataFolio.reference_table(name, path, table_format='parquet', num_rows=None, version=None, description=None, inputs=None, code=None)
Add a reference to an external table (not copied to bundle).
Writes immediately to items.json.
Parameters:
- name (str) – Unique name for this table
- path (Union[str, Path]) – Path to the table (local or cloud)
- table_format (str, default: 'parquet') – Format of the table ('parquet', 'delta', 'csv')
- num_rows (Optional[int], default: None) – Optional number of rows
- version (Optional[int], default: None) – Optional version number (for Delta tables)
- description (Optional[str], default: None) – Optional description
- inputs (Optional[list[str]], default: None) – Optional list of items this was derived from
- code (Optional[str], default: None) – Optional code snippet that created this
Returns:
- Self – Self for method chaining
Raises:
- ValueError – If name already exists or format is invalid
Examples:
>>> folio = DataFolio('experiments/test')
>>> folio.reference_table(
... 'raw_data',
... path='s3://bucket/data.parquet',
... table_format='parquet',
... num_rows=1_000_000
... )
Arrays
datafolio.DataFolio.add_numpy(name, array, description=None, overwrite=False, inputs=None, code=None)
Add a numpy array to the bundle.
Saves array to artifacts/ directory as .npy file and updates items.json.
Parameters:
- name (str) – Unique name for this array
- array (Any) – numpy array to save
- description (Optional[str], default: None) – Optional description
- overwrite (bool, default: False) – If True, allow overwriting an existing array
- inputs (Optional[list[str]], default: None) – Optional list of items this was derived from
- code (Optional[str], default: None) – Optional code snippet that created this array
Returns:
- Self – Self for method chaining
Raises:
- ValueError – If name already exists and overwrite=False
- ImportError – If numpy is not installed
- TypeError – If data is not a numpy array
Examples:
>>> import numpy as np
>>> folio = DataFolio('experiments/test')
>>> embeddings = np.random.randn(100, 128)
>>> folio.add_numpy('embeddings', embeddings, description='Model embeddings')
>>> # With lineage
>>> predictions = np.array([0, 1, 0, 1])
>>> folio.add_numpy('predictions', predictions,
... inputs=['test_data'],
... code='predictions = model.predict(X)')
JSON Data
datafolio.DataFolio.add_json(name, data, description=None, overwrite=False, inputs=None, code=None)
Add JSON-serializable data to the bundle.
Saves data to artifacts/ directory as .json file and updates items.json. Supports dicts, lists, scalars, and other JSON-serializable types.
Parameters:
- name (str) – Unique name for this data
- data (Union[dict, list, int, float, str, bool, None]) – JSON-serializable data (dict, list, scalar, etc.)
- description (Optional[str], default: None) – Optional description
- overwrite (bool, default: False) – If True, allow overwriting existing data
- inputs (Optional[list[str]], default: None) – Optional list of items this was derived from
- code (Optional[str], default: None) – Optional code snippet that created this data
Returns:
- Self – Self for method chaining
Raises:
- ValueError – If name already exists and overwrite=False
- TypeError – If data cannot be serialized to JSON
Examples:
>>> folio = DataFolio('experiments/test')
>>> config = {'learning_rate': 0.01, 'batch_size': 32}
>>> folio.add_json('config', config, description='Model config')
>>> # With list data
>>> class_names = ['cat', 'dog', 'bird']
>>> folio.add_json('classes', class_names)
>>> # With scalar
>>> folio.add_json('best_accuracy', 0.95)
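What counts as JSON-serializable follows the standard json module rules. A quick stdlib sketch, independent of datafolio, of which values succeed and which raise the TypeError documented above:

```python
import json

# dicts, lists, strings, numbers, booleans, and None all serialize cleanly
for value in [{'lr': 0.01}, ['cat', 'dog'], 0.95, 'label', True, None]:
    json.dumps(value)  # no error raised

# types such as set (or bytes) are not JSON-serializable and raise TypeError,
# which is the error add_json() surfaces for unserializable data
try:
    json.dumps({'bad': {1, 2, 3}})
    serializable = True
except TypeError:
    serializable = False
print(serializable)  # False
```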
Timestamps
datafolio.DataFolio.add_timestamp(name, timestamp, description=None, overwrite=False, inputs=None, code=None)
Add a timestamp to the bundle.
Saves timestamp to artifacts/ directory as .json file and updates items.json. Accepts timezone-aware datetime objects or Unix timestamps (int/float). All timestamps are stored in UTC as ISO 8601 strings.
Parameters:
- name (str) – Unique name for this timestamp
- timestamp (Union[datetime, int, float]) – Timezone-aware datetime object or Unix timestamp (int/float); naive datetimes raise ValueError
- description (Optional[str], default: None) – Optional description
- overwrite (bool, default: False) – If True, allow overwriting an existing timestamp
- inputs (Optional[list[str]], default: None) – Optional list of items this was derived from
- code (Optional[str], default: None) – Optional code snippet that created this timestamp
Returns:
- Self – Self for method chaining
Raises:
- ValueError – If name already exists and overwrite=False, or if the datetime is naive
- TypeError – If timestamp is not a datetime or numeric type
Examples:
>>> from datetime import datetime, timezone
>>> folio = DataFolio('experiments/test')
>>>
>>> # Add timezone-aware datetime
>>> event_time = datetime(2024, 1, 15, 10, 30, 0, tzinfo=timezone.utc)
>>> folio.add_timestamp('event_time', event_time, description='Event occurred')
>>>
>>> # Add Unix timestamp
>>> folio.add_timestamp('start_time', 1705318200, description='Start time')
>>>
>>> # With lineage
>>> from datetime import datetime, timezone
>>> import pytz
>>> eastern = pytz.timezone('US/Eastern')
>>> local_time = eastern.localize(datetime(2024, 1, 15, 10, 30, 0))
>>> folio.add_timestamp('local_event', local_time,
... inputs=['event_log'],
... code='timestamp = event_log.iloc[0]["timestamp"]')
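The UTC/ISO 8601 normalization described above can be reproduced with the standard library alone. A sketch, independent of datafolio, of how an aware datetime and a Unix timestamp map to the same stored form:

```python
from datetime import datetime, timezone

# A timezone-aware datetime normalizes to UTC and serializes as ISO 8601
event = datetime(2024, 1, 15, 10, 30, 0, tzinfo=timezone.utc)
stored = event.astimezone(timezone.utc).isoformat()
print(stored)  # 2024-01-15T10:30:00+00:00

# A Unix timestamp (seconds since the epoch) round-trips through the same form
unix = event.timestamp()
restored = datetime.fromtimestamp(unix, tz=timezone.utc)
print(restored == event)  # True

# Naive datetimes carry no UTC offset, which is why add_timestamp() rejects them
naive = datetime(2024, 1, 15, 10, 30, 0)
print(naive.tzinfo is None)  # True
```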
Generic Data
datafolio.DataFolio.add_data(name, data=None, reference=None, description=None, **kwargs)
Generic data addition with automatic type detection.
Convenience method that dispatches to the appropriate specific method based on data type. For fine-grained control, use the specific methods: add_table(), add_numpy(), add_json(), or reference_table().
Parameters:
- name (str) – Unique name for this data
- data (Any, default: None) – Data to save (DataFrame, numpy array, dict, list, scalar)
- reference (Optional[Union[str, Path]], default: None) – If provided, creates a reference to external data instead
- description (Optional[str], default: None) – Optional description
- **kwargs – Additional arguments passed to the specific method
Returns:
- Self – Self for method chaining
Raises:
- ValueError – If neither data nor reference is provided, or both are provided
- TypeError – If data type is not supported
Examples:
DataFrame (saves as parquet):
>>> folio.add_data('results', df)
Numpy array (saves as .npy):
>>> folio.add_data('embeddings', np.array([1, 2, 3]))
JSON data (saves as .json):
>>> folio.add_data('config', {'lr': 0.01})
>>> folio.add_data('classes', ['cat', 'dog'])
>>> folio.add_data('accuracy', 0.95)
External reference:
>>> folio.add_data('raw', reference='s3://bucket/data.parquet')
Adding Models
Methods for saving machine learning models.
Scikit-learn Models
datafolio.DataFolio.add_sklearn(name, model, description=None, overwrite=False, inputs=None, hyperparameters=None, code=None, custom=False)
Add a scikit-learn style model to the bundle.
Writes immediately to models/ directory and updates items.json.
Parameters:
- name (str) – Unique name for this model
- model (Any) – Trained model to include
- description (Optional[str], default: None) – Optional description
- overwrite (bool, default: False) – If True, allow overwriting an existing model
- inputs (Optional[list[str]], default: None) – Optional list of table names used for training
- hyperparameters (Optional[Dict[str, Any]], default: None) – Optional dict of hyperparameters
- code (Optional[str], default: None) – Optional code snippet that trained this model
- custom (bool, default: False) – If True, use skops format for portable pipelines with custom transformers; if False, use joblib format
Returns:
- Self – Self for method chaining
Raises:
- ValueError – If name already exists and overwrite=False
Examples:
>>> from sklearn.ensemble import RandomForestClassifier
>>> folio = DataFolio('experiments/test')
>>> model = RandomForestClassifier(n_estimators=100, max_depth=10)
>>> # ... train model ...
>>> folio.add_sklearn('classifier', model,
... description='Random forest classifier',
... inputs=['training_data', 'validation_data'],
... hyperparameters={'n_estimators': 100, 'max_depth': 10},
... code='model.fit(X_train, y_train)')
>>>
>>> # Portable pipeline with custom transformer (skops)
>>> folio.add_sklearn('pipeline', custom_pipeline, custom=True)
datafolio.DataFolio.add_model(name, model, description=None, overwrite=False, custom=False, **kwargs)
Add a scikit-learn style model to the bundle.
This is a convenience method that delegates to add_sklearn().
Parameters:
- name (str) – Unique name for this model
- model (Any) – Trained sklearn-style model
- description (Optional[str], default: None) – Optional description
- overwrite (bool, default: False) – If True, allow overwriting an existing model
- custom (bool, default: False) – If True, use skops format for portability (required for custom transformers)
- **kwargs – Additional arguments passed to add_sklearn() (e.g., hyperparameters, inputs, code)
Returns:
- Self – Self for method chaining
Raises:
- ValueError – If name already exists and overwrite=False
Examples:
>>> from sklearn.ensemble import RandomForestClassifier
>>> model = RandomForestClassifier()
>>> folio.add_model('clf', model, hyperparameters={'n_estimators': 100})
With custom transformer (portable):
>>> folio.add_model('pipeline', custom_pipeline, custom=True)
Adding Artifacts
Methods for adding arbitrary files and artifacts.
datafolio.DataFolio.add_artifact(name, path, category=None, description=None, overwrite=False)
Add an artifact file to the bundle.
Copies the file immediately to the artifacts/ directory and updates items.json.
Parameters:
- name (str) – Unique name for this artifact
- path (Union[str, Path]) – Path to the file to include
- category (Optional[str], default: None) – Optional category ('plots', 'configs', etc.)
- description (Optional[str], default: None) – Optional description
- overwrite (bool, default: False) – If True, allow overwriting an existing artifact
Returns:
- Self – Self for method chaining
Raises:
- ValueError – If name already exists and overwrite=False
- FileNotFoundError – If the file doesn't exist
Examples:
>>> folio = DataFolio('experiments/test')
>>> folio.add_artifact('loss_curve', 'plots/training_loss.png', category='plots')
>>> # Update with overwrite
>>> folio.add_artifact('loss_curve', 'plots/updated_loss.png', category='plots', overwrite=True)
Retrieving Data
Methods for loading data from a DataFolio.
Tables (DataFrames)
datafolio.DataFolio.get_table(name, **kwargs)
Get a table by name (works for both included and referenced).
For included tables, reads from bundle directory. For referenced tables, reads from the specified external path. If caching is enabled (cache_enabled=True), cloud-based tables are cached locally for faster repeated access.
Supports all pandas.read_parquet() arguments for filtering and optimization:
- columns: List of column names to read (column pruning)
- filters: Row filtering predicates (row filtering)
- engine: Parquet engine ('pyarrow' or 'fastparquet')
Parameters:
- name (str) – Name of the table
- **kwargs – Additional arguments passed to pd.read_parquet() (e.g., columns, filters, engine)
Returns:
- Any – pandas DataFrame
Raises:
- KeyError – If table name doesn't exist
- ImportError – If reading from cloud requires missing dependencies
- FileNotFoundError – If the referenced file doesn't exist
Examples:
Basic usage:
>>> folio = DataFolio('experiments/test')
>>> import pandas as pd
>>> df = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]})
>>> folio.add_table('test', df)
>>> retrieved = folio.get_table('test')
>>> assert len(retrieved) == 3
Column selection (read only specific columns):
>>> df_subset = folio.get_table('test', columns=['a'])
>>> assert list(df_subset.columns) == ['a']
Row filtering (requires pyarrow engine):
>>> df_filtered = folio.get_table('test',
... filters=[('a', '>', 1)],
... engine='pyarrow')
>>> assert len(df_filtered) == 2
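The filters argument takes (column, operator, value) predicate tuples, as defined by pyarrow's parquet reader. A small pure-Python sketch of how such predicates select rows (an illustration of the AND semantics of a flat predicate list, not pyarrow's actual implementation):

```python
import operator

# Map the string operators used in filter tuples to Python comparisons
OPS = {'>': operator.gt, '>=': operator.ge, '<': operator.lt,
       '<=': operator.le, '==': operator.eq, '!=': operator.ne}

def apply_filters(rows, filters):
    """Keep rows satisfying every (column, op, value) predicate (AND semantics)."""
    return [r for r in rows
            if all(OPS[op](r[col], val) for col, op, val in filters)]

rows = [{'a': 1, 'b': 4}, {'a': 2, 'b': 5}, {'a': 3, 'b': 6}]
print(apply_filters(rows, [('a', '>', 1)]))  # rows where a > 1
```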
datafolio.DataFolio.get_table_path(name)
Get the path to a table file, whether included in the bundle or referenced externally.
For included tables, returns the full path to the parquet file inside the bundle. For referenced tables, returns the external path recorded at reference time.
Parameters:
- name (str) – Name of the table
Returns:
- str – Path to the table file
Raises:
- KeyError – If table name doesn't exist
- ValueError – If the named item is not a table
Examples:
>>> folio = DataFolio('experiments/my-run')
>>> folio.add_table('results', df)
>>> path = folio.get_table_path('results')
>>> print(path)
'experiments/my-run/tables/results.parquet'
>>> folio.reference_table('raw', path='s3://data-lake/raw.parquet')
>>> folio.get_table_path('raw')
's3://data-lake/raw.parquet'
Arrays
datafolio.DataFolio.get_numpy(name)
Get a numpy array by name.
If caching is enabled, cloud-based arrays are cached locally for faster repeated access.
Parameters:
- name (str) – Name of the array
Returns:
- Any – numpy array
Raises:
- KeyError – If array name doesn't exist
- ValueError – If the named item is not a numpy array
- ImportError – If numpy is not installed
Examples:
>>> folio = DataFolio('experiments/test')
>>> embeddings = folio.get_numpy('embeddings')
>>> print(embeddings.shape)
datafolio.DataFolio.get_numpy_path(name)
Get the path to a numpy array file stored in the bundle.
Parameters:
- name (str) – Name of the array
Returns:
- str – Path to the .npy file
Raises:
- KeyError – If array name doesn't exist
- ValueError – If the named item is not a numpy array
Examples:
>>> folio = DataFolio('experiments/my-run')
>>> folio.add_numpy('embeddings', arr)
>>> path = folio.get_numpy_path('embeddings')
>>> print(path)
'experiments/my-run/artifacts/embeddings.npy'
JSON Data
datafolio.DataFolio.get_json(name)
Get JSON data by name.
If caching is enabled, cloud-based JSON data is cached locally for faster repeated access.
Parameters:
- name (str) – Name of the JSON data
Returns:
- Any – Deserialized JSON data (dict, list, scalar, etc.)
Raises:
- KeyError – If data name doesn't exist
- ValueError – If the named item is not JSON data
Examples:
>>> folio = DataFolio('experiments/test')
>>> config = folio.get_json('config')
>>> print(config['learning_rate'])
datafolio.DataFolio.get_json_path(name)
Get the path to a JSON data file stored in the bundle.
Parameters:
- name (str) – Name of the JSON data
Returns:
- str – Path to the .json file
Raises:
- KeyError – If data name doesn't exist
- ValueError – If the named item is not JSON data
Examples:
>>> folio = DataFolio('experiments/my-run')
>>> folio.add_json('config', {'lr': 0.01})
>>> path = folio.get_json_path('config')
>>> print(path)
'experiments/my-run/artifacts/config.json'
Timestamps
datafolio.DataFolio.get_timestamp(name, as_unix=False)
Get a timestamp by name.
If caching is enabled, cloud-based timestamps are cached locally for faster repeated access.
Parameters:
- name (str) – Name of the timestamp
- as_unix (bool, default: False) – If True, return a Unix timestamp (float); if False, return a datetime
Returns:
- Union[datetime, float] – Timezone-aware datetime (default), or Unix timestamp (float) if as_unix=True
Raises:
- KeyError – If timestamp name doesn't exist
- ValueError – If the named item is not a timestamp
Examples:
>>> folio = DataFolio('experiments/test')
>>>
>>> # Get as datetime (default)
>>> event_time = folio.get_timestamp('event_time')
>>> print(event_time.isoformat())
'2024-01-15T10:30:00+00:00'
>>>
>>> # Get as Unix timestamp
>>> unix_time = folio.get_timestamp('event_time', as_unix=True)
>>> print(unix_time)
1705314600.0
datafolio.DataFolio.get_timestamp_path(name)
Get the path to a timestamp file stored in the bundle.
Parameters:
- name (str) – Name of the timestamp
Returns:
- str – Path to the timestamp file
Raises:
- KeyError – If timestamp name doesn't exist
- ValueError – If the named item is not a timestamp
Examples:
>>> folio = DataFolio('experiments/my-run')
>>> folio.add_timestamp('event_time', dt)
>>> path = folio.get_timestamp_path('event_time')
>>> print(path)
'experiments/my-run/artifacts/event_time.json'
Generic Data
datafolio.DataFolio.get_data(name)
Generic data getter that returns any data type.
Automatically detects the item type and calls the appropriate getter. For fine-grained control, use the specific methods: get_table(), get_numpy(), or get_json().
Parameters:
- name (str) – Name of the data item
Returns:
- Any – The data (DataFrame, numpy array, dict, list, or scalar)
Raises:
- KeyError – If item name doesn't exist
- ValueError – If the item is not a data type (e.g., is a model or artifact)
Examples:
>>> folio.add_data('results', df)
>>> folio.add_data('embeddings', np_array)
>>> folio.add_data('config', {'lr': 0.01})
>>> # Later, retrieve without knowing the type
>>> results = folio.get_data('results') # Returns DataFrame
>>> embeddings = folio.get_data('embeddings') # Returns numpy array
>>> config = folio.get_data('config') # Returns dict
datafolio.DataFolio.get_data_path(name)
Get the path to any stored item, delegating to the appropriate type-specific method.
Automatically detects the item type and calls the appropriate path getter:
- Tables (included or referenced): delegates to get_table_path()
- Artifacts: delegates to get_artifact_path()
- All other bundled items (numpy arrays, JSON, timestamps): returns the bundle file path
Parameters:
- name (str) – Name of the item
Returns:
- str – Path to the item file
Raises:
- KeyError – If item name doesn't exist
Examples:
>>> folio = DataFolio('experiments', prefix='test')
>>> folio.add_table('results', df)
>>> folio.get_data_path('results') # returns path to parquet file
>>> folio.reference_table('data', path='s3://bucket/file.parquet')
>>> folio.get_data_path('data') # returns 's3://bucket/file.parquet'
datafolio.DataFolio.get_item_path(name)
Get the path to any item stored in the folio.
For items stored within the bundle (included tables, models, artifacts, arrays, JSON data, timestamps), returns the full path to the data file. For referenced tables, returns the external path recorded at reference time.
This is especially useful for cloud-hosted folios where collaborators can directly access or download underlying files without using datafolio.
Parameters:
- name (str) – Name of the item
Returns:
- str – Full path to the item's data file. For cloud folios this will be a cloud URI (e.g. s3://bucket/.../results.parquet); for local folios this will be an absolute file-system path.
Raises:
- KeyError – If item name doesn't exist
- ValueError – If the item has no associated file path
Examples:
>>> folio = DataFolio('s3://bucket/experiments/my-run')
>>> path = folio.get_item_path('results')
>>> print(path)
's3://bucket/experiments/my-run/tables/results.parquet'
>>> path = folio.get_item_path('classifier')
>>> print(path)
's3://bucket/experiments/my-run/models/classifier.joblib'
>>> # For referenced tables the external path is returned
>>> folio.reference_table('raw', path='s3://data-lake/raw.parquet')
>>> folio.get_item_path('raw')
's3://data-lake/raw.parquet'
Retrieving Models
Methods for loading machine learning models.
Scikit-learn Models
datafolio.DataFolio.get_sklearn(name)
Get a scikit-learn style model by name.
If caching is enabled (cache_enabled=True), cloud-based models are cached locally for faster repeated access.
Parameters:
- name (str) – Name of the model
Returns:
- Any – The model object
Raises:
- KeyError – If model name doesn't exist
- ValueError – If the named item is not a sklearn model
Examples:
>>> folio = DataFolio('experiments/test')
>>> model = folio.get_sklearn('classifier')
datafolio.DataFolio.get_model(name, **kwargs)
Get a scikit-learn style model by name.
This is a convenience method that delegates to get_sklearn().
If caching is enabled (cache_enabled=True), cloud-based models are cached locally for faster repeated access.
Parameters:
- name (str) – Name of the model
- **kwargs – Additional arguments (currently unused, kept for backward compatibility)
Returns:
- Any – The model object
Raises:
- KeyError – If model name doesn't exist
- ValueError – If the named item is not a model
Examples:
>>> folio = DataFolio('experiments/test')
>>> model = folio.get_model('classifier')
datafolio.DataFolio.get_model_path(name)
Get the path to a model file stored in the bundle.
Parameters:
- name (str) – Name of the model
Returns:
- str – Path to the model file
Raises:
- KeyError – If model name doesn't exist
- ValueError – If the named item is not a model
Examples:
>>> folio = DataFolio('experiments/my-run')
>>> folio.add_sklearn('classifier', model)
>>> path = folio.get_model_path('classifier')
>>> print(path)
'experiments/my-run/models/classifier.joblib'
Retrieving Artifacts
datafolio.DataFolio.get_artifact_path(name)
Get the path to an artifact file.
Parameters:
- name (str) – Name of the artifact
Returns:
- str – Path to the artifact file
Raises:
- KeyError – If artifact name doesn't exist
- ValueError – If the named item is not an artifact
Examples:
>>> folio = DataFolio('experiments/test-blue-happy-falcon')
>>> path = folio.get_artifact_path('plot')
Inspecting Items
Methods for getting information about items.
datafolio.DataFolio.list_contents(include_archived=False)
List all contents in the DataFolio.
Parameters:
- include_archived (bool, default: False) – If True, include archived (hidden) items in the results; defaults to False so archived items are hidden from normal views
Returns:
- Dict[str, list[str]] – Dictionary with keys 'referenced_tables', 'included_tables', 'numpy_arrays', 'json_data', 'timestamps', 'models', and 'artifacts', each containing a list of names
Examples:
>>> folio = DataFolio('experiments/test')
>>> folio.reference_table('data1', path='s3://bucket/data.parquet')
>>> folio.add_numpy('embeddings', np.array([1, 2, 3]))
>>> folio.list_contents()
{'referenced_tables': ['data1'], 'included_tables': [], 'numpy_arrays': ['embeddings'],
'json_data': [], 'timestamps': [], 'models': [], 'artifacts': []}
datafolio.DataFolio.get_table_info(name)
Get metadata about a table (referenced or included).
Returns the manifest entry containing information like:
- For referenced tables: path, table_format, is_directory, num_rows, version, description
- For included tables: filename, table_format, is_directory, num_rows, num_cols, columns, dtypes, description
Parameters:
- name (str) – Name of the table
Returns:
- Union[TableReference, IncludedTable] – Dictionary with table metadata
Raises:
- KeyError – If table name doesn't exist
Examples:
>>> folio = DataFolio('experiments/test')
>>> folio.reference_table('data', path='s3://bucket/data.parquet', num_rows=1000000)
>>> info = folio.get_table_info('data')
>>> info['num_rows']
1000000
>>> info['table_format']
'parquet'
datafolio.DataFolio.get_model_info(name)
Get metadata about a model.
Returns the manifest entry containing information such as filename, item_type, and description.
Parameters:
- name (str) – Name of the model
Returns:
- IncludedItem – Dictionary with model metadata
Raises:
- KeyError – If model name doesn't exist
- ValueError – If the named item is not a model
Examples:
>>> folio = DataFolio('experiments/test')
>>> folio.add_model('classifier', model, description='Random forest classifier')
>>> info = folio.get_model_info('classifier')
>>> info['description']
'Random forest classifier'
datafolio.DataFolio.get_artifact_info(name)
Get metadata about an artifact.
Returns the manifest entry containing information such as filename, item_type, category, and description.
Parameters:
-
name(str) –Name of the artifact
Returns:
-
IncludedItem–Dictionary with artifact metadata
Raises:
-
KeyError–If artifact name doesn't exist
-
ValueError–If named item is not an artifact
Examples:
>>> folio = DataFolio('experiments/test')
>>> folio.add_artifact('plot', 'plot.png', category='plots', description='Loss curve')
>>> info = folio.get_artifact_info('plot')
>>> info['category']
'plots'
>>> info['description']
'Loss curve'
datafolio.DataFolio.describe(pattern=None, return_string=False, show_empty=False, max_metadata_fields=10, snapshot=None, include_archived=False, show_paths=False)
Generate a human-readable description of all items in the bundle.
Includes lineage information showing inputs and dependencies.
Parameters:
- pattern (Optional[str], default: None) – Optional glob pattern to filter items by name (e.g. 'examples/*', '*/weights'). Uses fnmatch rules: '*' matches any characters, including '/'.
- return_string (bool, default: False) – If True, return the description as a string instead of printing it
- show_empty (bool, default: False) – If True, show empty sections
- max_metadata_fields (int, default: 10) – Maximum number of metadata fields to show
- snapshot (Optional[str], default: None) – Optional snapshot name to describe instead of the full bundle
- include_archived (bool, default: False) – If True, show archived (hidden) items
- show_paths (bool, default: False) – If True, show the file path for each item; especially useful for cloud-hosted folios where paths can be shared with collaborators who don't use datafolio
Returns:
- Optional[str] – The formatted description if return_string=True, otherwise None
Examples:
>>> folio.describe() # Show full bundle
>>> folio.describe('examples/*') # Show only items under 'examples/'
>>> folio.describe(snapshot='v1.0') # Show specific snapshot
>>> folio.describe(include_archived=True) # Show archived items too
>>> folio.describe(show_paths=True) # Show file paths for sharing
See DisplayFormatter.describe() for full documentation.
Managing Items
Deleting Items
datafolio.DataFolio.delete(name, warn_dependents=True)
Delete one or more items from the DataFolio.
Removes items from the manifest and deletes associated files. Does not enforce lineage - can delete items that other items depend on.
Parameters:
- name (Union[str, list[str]]) – Name(s) of item(s) to delete (string or list of strings)
- warn_dependents (bool, default: True) – If True, print a warning if deleted items have dependents
Returns:
- Self – Self for method chaining
Raises:
- KeyError – If any item name doesn't exist
Examples:
Delete single item:
>>> folio = DataFolio('experiments/test')
>>> folio.delete('old_model')
Delete multiple items:
>>> folio.delete(['temp_data', 'debug_plot', 'old_model'])
Delete without warnings:
>>> folio.delete('item', warn_dependents=False)
Archiving Items
datafolio.DataFolio.archive(name)
Mark item(s) as archived (hidden from default views, not deleted).
Archived items remain on disk and are still accessible via get_data() / get_table() etc., but are excluded from list_contents(), describe(), and copy() by default. Pass include_archived=True to those methods to reveal them again, or call unarchive() to restore them permanently.
Accepts a single name, a list of names, or a glob pattern (fnmatch rules, e.g. 'intermediate/*').
Parameters:
- name (Union[str, list[str]]) – Name(s) of item(s) to archive, or a glob pattern
Returns:
- Self – Self for method chaining
Raises:
- KeyError – If a specific name (non-glob) is not found
Examples:
Archive a single item:
>>> folio.archive('debug_output')
Archive multiple items:
>>> folio.archive(['debug_output', 'temp_features'])
Archive by glob pattern:
>>> folio.archive('intermediate/*')
datafolio.DataFolio.unarchive(name)
Restore archived item(s) to active status.
Removes the archived flag so the items appear again in list_contents(), describe(), and copy() by default.
Accepts a single name, a list of names, or a glob pattern (fnmatch rules).
Parameters:
- name (Union[str, list[str]]) – Name(s) of item(s) to unarchive, or a glob pattern
Returns:
- Self – Self for method chaining
Raises:
- KeyError – If a specific name (non-glob) is not found
Examples:
Unarchive a single item:
>>> folio.unarchive('debug_output')
Unarchive multiple items:
>>> folio.unarchive(['debug_output', 'temp_features'])
Unarchive by glob pattern:
>>> folio.unarchive('intermediate/*')
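The glob patterns accepted by archive() and unarchive() follow stdlib fnmatch rules, under which '*' matches any characters, including '/'. A quick sketch, independent of datafolio, with hypothetical item names:

```python
from fnmatch import fnmatch

names = ['intermediate/temp1', 'intermediate/step2/cache', 'final_model']

# fnmatch's '*' does not stop at '/', so nested names match the pattern too
archived = [n for n in names if fnmatch(n, 'intermediate/*')]
print(archived)  # ['intermediate/temp1', 'intermediate/step2/cache']
```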
Copying Bundles
datafolio.DataFolio.copy(path, name=None, metadata_updates=None, include_items=None, exclude_items=None, random_suffix=False, follow_lineage=False, include_archived=False)
Create a copy of this bundle at a new location.
Useful for creating derived experiments or checkpoints.
Parameters:
- path (Union[str, Path]) – Destination path for the new bundle, used as the exact bundle location (e.g., 'gs://bucket/experiments/my-copy')
- name (Optional[str], default: None) – If provided, appended to path as a subdirectory (e.g., path='experiments', name='exp-v2' → 'experiments/exp-v2'); if None, path is used as-is
- metadata_updates (Optional[Dict[str, Any]], default: None) – Metadata fields to update/add in the copy
- include_items (Optional[list[str]], default: None) – If specified, only copy these items (by name)
- exclude_items (Optional[list[str]], default: None) – Items to exclude from the copy (by name)
- random_suffix (bool, default: False) – If True, append a random suffix to the new bundle name
- follow_lineage (bool, default: False) – If True and include_items is provided, automatically include all transitive upstream dependencies of the named items; items referenced in lineage that are not present in this folio (e.g. external tables) are silently skipped
- include_archived (bool, default: False) – If True, archived items are included in the copy; defaults to False so archived items are excluded
Returns:
- DataFolio – New DataFolio instance
Raises:
- ValueError – If include_items and exclude_items are both specified
Examples:
>>> # Copy to exact destination path
>>> folio2 = folio.copy('gs://bucket/experiments/my-copy')
>>> # Copy to base directory with explicit name subdirectory
>>> folio2 = folio.copy('experiments', name='exp-v2')
>>> # Copy with random suffix
>>> folio2 = folio.copy('experiments/exp-v2', random_suffix=True)
>>> # Copy with metadata updates to track parent
>>> folio2 = folio.copy(
... 'experiments/exp-v2',
... metadata_updates={
... 'parent_bundle': folio._bundle_dir,
... 'changes': 'Increased max_depth to 15'
... }
... )
>>> # Copy only specific items (e.g., for derived experiment)
>>> folio2 = folio.copy(
... 'experiments/exp-v2-tuned',
... include_items=['training_data', 'validation_data'],
... metadata_updates={'status': 'in_progress'}
... )
>>> # Copy only final outputs, auto-resolving all upstream deps
>>> folio2 = folio.copy(
... 'results',
... include_items=['final_model', 'test_results'],
... follow_lineage=True,
... )
>>> # Include archived items in the copy
>>> folio2 = folio.copy('archive_backup', include_archived=True)
Validation
datafolio.DataFolio.validate()
Validate existence and integrity of all items.
Checks that:
1. Included items exist in the bundle
2. Referenced items exist at their external path
3. Checksums match (for included single files)
Returns:
-
Dict[str, bool]–Mapping of item name to validation status
Examples:
>>> status = folio.validate()
>>> if not all(status.values()):
... print("Bundle corrupted!")
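Because the result maps item names to booleans, a per-item failure report is a short comprehension. A sketch with a hypothetical status dict standing in for folio.validate():

```python
# Stand-in for: status = folio.validate()
status = {'training_data': True, 'classifier': True, 'old_results': False}

failed = sorted(name for name, ok in status.items() if not ok)
if failed:
    print(f"Invalid items: {', '.join(failed)}")  # Invalid items: old_results
```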
datafolio.DataFolio.is_valid()
Check if the entire bundle is valid.
Convenience method that runs validate() and returns True only if all items pass validation.
Returns:
-
bool–True if all items are valid, False otherwise
Examples:
>>> if not folio.is_valid():
... print("Bundle corrupted!")
Lineage and Dependencies
Methods for working with lineage tracking.
datafolio.DataFolio.get_inputs(item_name)
Get list of items that were inputs to this item.
Parameters:
-
item_name(str) –Name of the item
Returns:
-
list[str]–Names of input items
Raises:
-
KeyError–If item doesn't exist
Examples:
>>> folio = DataFolio('experiments/test')
>>> # After adding items with lineage...
>>> inputs = folio.get_inputs('predictions')
>>> # Returns: ['test_data', 'classifier']
datafolio.DataFolio.get_dependents(item_name)
Get list of items that depend on this item.
Parameters:
-
item_name(str) –Name of the item
Returns:
-
list[str]–Names of dependent items
Raises:
-
KeyError–If item doesn't exist
Examples:
>>> folio = DataFolio('experiments/test')
>>> # After adding items with lineage...
>>> dependents = folio.get_dependents('classifier')
>>> # Returns items that used 'classifier' as input
datafolio.DataFolio.get_lineage_graph()
Get full dependency graph for all items in bundle.
Returns:
-
Dict[str, list[str]]–Mapping of each item name to its list of input item names
Examples:
>>> folio = DataFolio('experiments/test')
>>> graph = folio.get_lineage_graph()
>>> # Returns: {'predictions': ['test_data', 'classifier'], ...}
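The returned graph maps each item to its direct inputs, so the transitive upstream closure (the same set copy(follow_lineage=True) resolves) can be computed with a short traversal. A sketch over a hypothetical graph dict shaped like the example above:

```python
# Stand-in for: graph = folio.get_lineage_graph()
graph = {
    'predictions': ['test_data', 'classifier'],
    'classifier': ['training_data'],
    'training_data': [],
    'test_data': [],
}

def upstream(graph, item):
    """All transitive inputs of item (iterative depth-first, cycle-safe)."""
    seen, stack = set(), list(graph.get(item, []))
    while stack:
        dep = stack.pop()
        if dep not in seen:
            seen.add(dep)
            stack.extend(graph.get(dep, []))
    return seen

print(sorted(upstream(graph, 'predictions')))  # ['classifier', 'test_data', 'training_data']
```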
Snapshots
Methods for working with snapshots (read-only copies).
datafolio.DataFolio.create_snapshot(name, description=None, tags=None, capture_git=True, capture_environment=False, capture_execution=False)
Create a named snapshot of the current bundle state.
A snapshot captures:
- Current versions of all items (via item_versions dict)
- Current metadata state (via metadata_snapshot dict)
- Git repository state (commit, branch, dirty status) [optional]
- Python environment (version, packages) [optional, off by default]
- Execution context (entry point, working directory) [optional, off by default]
After creating a snapshot, all current items are marked as being in that snapshot. Future overwrites will trigger copy-on-write to preserve the snapshot state.
SECURITY NOTE: Environment variables (API keys, tokens, etc.) are NEVER captured. The capture_environment flag only captures Python version, platform, and package versions from uv.lock or requirements.txt.
Parameters:
-
name(str) –Snapshot name (filesystem-safe, no @ symbol)
-
description(Optional[str], default:None) –Optional human-readable description
-
tags(Optional[list[str]], default:None) –Optional list of tags for organization
-
capture_git(bool, default:True) –Whether to capture git state (default: True)
-
capture_environment(bool, default:False) –Whether to capture Python environment info like version and packages (default: False for security)
-
capture_execution(bool, default:False) –Whether to capture execution context like entry point and working directory (default: False for security)
Returns:
-
Self–Self for method chaining
Raises:
-
ValueError–If snapshot name is invalid or already exists
Examples:
>>> folio = DataFolio('experiments/my-exp')
>>> folio.add_table('results', df)
>>> folio.create_snapshot('v1.0-baseline', description='Initial results')
>>>
>>> # Later, overwriting will preserve the snapshot
>>> folio.add_table('results', new_df, overwrite=True) # Creates v2
datafolio.DataFolio.list_snapshots()
List all snapshots in the bundle.
Returns:
-
list[Dict[str, Any]]–Snapshot metadata dicts (each includes at least name and description)
datafolio.DataFolio.delete_snapshot(name, cleanup_orphans=False)
Delete a snapshot.
Removes the snapshot from the registry and updates items' in_snapshots lists. Optionally cleans up orphaned item versions that are no longer referenced.
Parameters:
-
name(str) –Snapshot name to delete
-
cleanup_orphans(bool, default:False) –If True, delete item versions no longer in any snapshot
Returns:
-
Self–Self for method chaining
Raises:
-
KeyError–If snapshot doesn't exist
Examples:
>>> folio.delete_snapshot('experimental-v5')
>>> folio.delete_snapshot('old-snapshot', cleanup_orphans=True)
datafolio.DataFolio.load_snapshot(bundle_dir, snapshot)
classmethod
Load a DataFolio in snapshot state.
Creates a DataFolio instance configured to access items and metadata as they existed at snapshot time. Snapshots are always read-only to preserve snapshot immutability.
Parameters:
-
bundle_dir(Union[str, Path]) –Path to the bundle directory
-
snapshot(str) –Snapshot name to load
Returns:
-
DataFolio–Read-only DataFolio instance in snapshot state
Raises:
-
KeyError–If snapshot doesn't exist
Examples:
Load snapshot for inspection:
>>> paper = DataFolio.load_snapshot('research/exp', 'paper-v1')
>>> model = paper.get_model('classifier')
>>> print(paper.metadata['accuracy'])
>>> paper.add_table('new', df) # Error: snapshots are always read-only
Compare multiple snapshots:
>>> v1 = DataFolio.load_snapshot('path', 'v1.0')
>>> v2 = DataFolio.load_snapshot('path', 'v2.0')
>>> print(f"v1: {v1.metadata['accuracy']}, v2: {v2.metadata['accuracy']}")
datafolio.DataFolio.get_snapshot(snapshot)
Get a snapshot from this folio as a new DataFolio instance.
Convenience method for loading a snapshot when you already have a folio. Equivalent to DataFolio.load_snapshot(self._bundle_dir, snapshot). Snapshots are always read-only to preserve immutability.
Parameters:
-
snapshot(str) –Snapshot name to load
Returns:
-
DataFolio–Read-only DataFolio instance in snapshot state
Raises:
-
KeyError–If snapshot doesn't exist
Examples:
>>> folio = DataFolio('experiments/classifier')
>>> baseline = folio.get_snapshot('v1.0-baseline')
>>> assert baseline.metadata['accuracy'] == 0.89
>>> assert baseline.read_only # Snapshots are always read-only
>>>
>>> # Compare current state to snapshot
>>> current_acc = folio.metadata['accuracy']
>>> baseline_acc = baseline.metadata['accuracy']
>>> print(f"Improvement: {current_acc - baseline_acc:.2%}")
datafolio.DataFolio.get_snapshot_info(snapshot)
Get detailed information about a snapshot.
Returns the full snapshot metadata including item versions, metadata state, git info, environment info, and execution context.
Parameters:
-
snapshot(str) –Snapshot name
Returns:
-
Dict[str, Any]–Full snapshot metadata dictionary
Raises:
-
KeyError–If snapshot doesn't exist
Examples:
>>> info = folio.get_snapshot_info('v1.0')
>>> print(info['description'])
'Baseline model'
>>> print(info['git']['commit'])
'a3f2b8c'
>>> print(info['metadata_snapshot']['accuracy'])
0.89
datafolio.DataFolio.compare_snapshots(snapshot1, snapshot2)
Compare two snapshots.
Returns a dictionary showing differences between the two snapshots, including:
- added_items: Items in snapshot2 but not snapshot1
- removed_items: Items in snapshot1 but not snapshot2
- modified_items: Items in both but with different versions
- shared_items: Items in both with same version
- metadata_changes: Metadata fields that changed (old_value, new_value)
Parameters:
-
snapshot1(str) –First snapshot name
-
snapshot2(str) –Second snapshot name
Returns:
-
Dict[str, Any]–Dictionary of differences as described above
Raises:
-
KeyError–If either snapshot doesn't exist
Examples:
>>> diff = folio.compare_snapshots('v1.0', 'v2.0')
>>> print(diff['modified_items'])
['classifier', 'config']
>>> print(diff['metadata_changes']['accuracy'])
(0.89, 0.91)
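The diff dictionary lends itself to a compact change report. A sketch over a hypothetical diff dict with the keys documented above, standing in for the real return value:

```python
# Stand-in for: diff = folio.compare_snapshots('v1.0', 'v2.0')
diff = {
    'added_items': ['new_feature'],
    'removed_items': [],
    'modified_items': ['classifier', 'config'],
    'shared_items': ['training_data'],
    'metadata_changes': {'accuracy': (0.89, 0.91)},
}

lines = []
for key in ('added_items', 'removed_items', 'modified_items'):
    if diff[key]:  # skip empty categories
        lines.append(f"{key}: {', '.join(diff[key])}")
for field, (old, new) in diff['metadata_changes'].items():
    lines.append(f"{field}: {old} -> {new}")
print('\n'.join(lines))
```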
datafolio.DataFolio.diff_from_snapshot(snapshot=None)
Compare current state to a snapshot.
This is useful for seeing what has changed since a snapshot was created, similar to 'git status' showing changes since last commit.
Parameters:
-
snapshot(Optional[str], default:None) –Snapshot name to compare to. If None, uses most recent snapshot.
Returns:
-
Dict[str, Any]–Dictionary with comparison results including:
- snapshot_name: The snapshot being compared to
- added_items: Items in current state but not in snapshot
- removed_items: Items in snapshot but not in current state
- modified_items: Items in both but with different checksums/versions
- unchanged_items: Items in both with same checksum/version
- metadata_changes: Metadata fields that changed
Raises:
-
KeyError–If snapshot doesn't exist
-
ValueError–If no snapshots exist and snapshot=None
Examples:
>>> # Compare to last snapshot
>>> diff = folio.diff_from_snapshot()
>>> print(f"Modified: {diff['modified_items']}")
['classifier', 'config']
>>> # Compare to specific snapshot
>>> diff = folio.diff_from_snapshot('v1.0')
>>> print(f"Added since v1.0: {diff['added_items']}")
['new_feature']
datafolio.DataFolio.restore_snapshot(snapshot, confirm=False)
Restore working state to snapshot (DESTRUCTIVE).
This operation:
- Replaces current metadata with snapshot metadata
- Sets current item versions to match snapshot
- Removes items added after snapshot
- Does NOT delete the snapshot itself
WARNING: This is a destructive operation that overwrites current state.
Parameters:
-
snapshot(str) –Snapshot name to restore
-
confirm(bool, default:False) –Must be True to proceed (safety check)
Returns:
-
Self–Self for method chaining
Raises:
-
ValueError–If confirm=False
-
KeyError–If snapshot doesn't exist
Examples:
>>> folio.restore_snapshot('v1.0', confirm=True)
>>> # Working state now matches v1.0 snapshot
datafolio.DataFolio.export_snapshot(snapshot, target_path, *, include_snapshot_metadata=True)
Export a snapshot to a clean, standalone bundle.
Creates a new DataFolio bundle containing only the items and metadata from the specified snapshot. This is useful for:
- Sharing a specific snapshot with collaborators
- Creating a clean bundle for deployment
- Starting fresh without version history
Parameters:
-
snapshot(str) –Name of snapshot to export
-
target_path(Union[str, Path]) –Path for new bundle (must not exist)
-
include_snapshot_metadata(bool, default:True) –If True, adds snapshot info to new bundle's metadata under '_source_snapshot' key (default: True)
Returns:
-
DataFolio–New DataFolio instance at target_path
Raises:
-
KeyError–If snapshot doesn't exist
-
ValueError–If target_path already exists
Examples:
Export a baseline snapshot for sharing:
>>> folio = DataFolio('experiments/classifier')
>>> baseline = folio.export_snapshot('v1.0-baseline', 'shared/baseline')
>>> # New bundle contains only v1.0-baseline state, no history
Export for deployment:
>>> production = folio.export_snapshot('production-v2', 'deploy/v2')
>>> # Clean bundle ready for deployment
Export without metadata reference:
>>> clean = folio.export_snapshot(
... 'v1.0',
... 'clean-export',
... include_snapshot_metadata=False
... )
Caching
Methods for managing the local cache (for remote bundles).
datafolio.DataFolio.cache_status(item_name=None)
Get cache status for an item or entire bundle.
Parameters:
-
item_name(Optional[str], default:None) –Name of item to check. If None, returns overall cache stats.
Returns:
-
Optional[Dict[str, Any]]–Dict with cache status information, or None if caching not enabled or item not found.
-
Optional[Dict[str, Any]]–For specific items:
- cached: Whether item is cached
- cache_path: Path to cached file
- size_bytes: Size of cached file
- cached_at: Timestamp when cached
- last_accessed: Last access timestamp
- access_count: Number of times accessed
- ttl_remaining: Seconds until cache expires (None if no TTL)
-
Optional[Dict[str, Any]]–For bundle-level (item_name=None):
- bundle_path: Original bundle path
- cache_dir: Cache directory path
- ttl_seconds: TTL in seconds
- cache_hits: Number of cache hits
- cache_misses: Number of cache misses
- cache_hit_rate: Hit rate (0.0-1.0)
Examples:
Check if a specific item is cached:
>>> status = folio.cache_status('my_table')
>>> if status and status['cached']:
... print(f"Cache expires in {status['ttl_remaining']} seconds")
Get overall cache statistics:
>>> stats = folio.cache_status()
>>> print(f"Cache hit rate: {stats['cache_hit_rate']:.1%}")
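The bundle-level stats make it easy to spot a cold or thrashing cache. A sketch deriving the hit rate from a hypothetical stats dict (cache_hit_rate is already provided in the real return value; deriving it here just shows the relationship):

```python
# Stand-in for: stats = folio.cache_status()
stats = {'cache_hits': 18, 'cache_misses': 2}

total = stats['cache_hits'] + stats['cache_misses']
hit_rate = stats['cache_hits'] / total if total else 0.0  # guard against an empty cache
print(f"{stats['cache_hits']}/{total} hits ({hit_rate:.1%})")  # 18/20 hits (90.0%)
```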
datafolio.DataFolio.clear_cache(item_name=None)
Clear cached files for an item, or the entire cache if item_name is None.
Parameters:
-
item_name(Optional[str], default:None) –Name of item to clear. If None, clears the whole cache.
datafolio.DataFolio.invalidate_cache(item_name)
Invalidate cache for an item without deleting the file.
This marks the cached item as invalid, forcing a re-fetch on next access, but keeps the file on disk (useful for stale cache fallback).
Parameters:
-
item_name(str) –Name of item to invalidate
Examples:
Force re-download on next access:
>>> folio.invalidate_cache('my_table')
>>> table = folio.get_table('my_table') # Will re-download
datafolio.DataFolio.refresh_cache(item_name)
Refresh cache for an item by re-downloading from remote.
This is equivalent to invalidating and then fetching the item.
Parameters:
-
item_name(str) –Name of item to refresh
Raises:
-
ValueError–If item doesn't exist in bundle
-
RuntimeError–If caching is not enabled
Examples:
>>> folio.refresh_cache('my_table')
Bundle Management
Methods for managing the DataFolio bundle itself.
datafolio.DataFolio.refresh()
Explicitly refresh manifests from disk/cloud.
This reloads items.json and metadata.json from the bundle directory, syncing the in-memory state with any external updates.
Useful when working with multiple DataFolio instances pointing to the same bundle, or when the bundle is updated by another process.
Returns:
-
Self–Self for method chaining
Examples:
Explicit refresh after external update:
>>> folio1 = DataFolio('experiments/shared')
>>> folio2 = DataFolio('experiments/shared')
>>> folio1.add_table('results', df)
>>> folio2.refresh() # Manually sync
>>> assert 'results' in folio2.list_contents()['included_tables']
Auto-refresh (happens automatically):
>>> folio1.add_table('results', df)
>>> # folio2 auto-refreshes on next read operation
>>> assert 'results' in folio2.list_contents()['included_tables']
Properties
Useful properties for accessing bundle information and items.
Core Properties
path
The path to the DataFolio bundle.
print(folio.path) # e.g., 'gs://my-bucket/my-bundle' or '/local/path/bundle'
metadata
Bundle-level metadata dictionary.
print(folio.metadata) # e.g., {'project': 'analysis', 'version': '1.0'}
items
Dictionary of all items in the bundle with their metadata.
print(folio.items) # e.g., {'table1': {...}, 'model1': {...}}
Item Lists
tables
List of all table names in the bundle.
print(folio.tables) # e.g., ['results', 'metadata', 'analysis']
models
List of all model names in the bundle.
print(folio.models) # e.g., ['classifier', 'regressor']
artifacts
List of all artifact names in the bundle.
print(folio.artifacts) # e.g., ['config.yaml', 'results.png']
Data Accessor
data
Accessor for convenient data retrieval with autocomplete support.
df = folio.data.my_table # Equivalent to folio.get_table('my_table')
model = folio.data.my_model # Equivalent to folio.get_model('my_model')
Status Properties
read_only
Whether the bundle is in read-only mode.
print(folio.read_only) # True or False
in_snapshot_mode
Whether the bundle was loaded from a snapshot.
print(folio.in_snapshot_mode) # True or False
loaded_snapshot
Name of the snapshot this bundle was loaded from (if any).
print(folio.loaded_snapshot) # e.g., 'v1.0' or None
Method Categories Summary
| Category | Methods |
|---|---|
| Adding Data | add_table(), add_numpy(), add_json(), add_timestamp(), add_data(), reference_table() |
| Adding Models | add_sklearn(), add_model() |
| Adding Artifacts | add_artifact() |
| Retrieving Data | get_table(), get_table_path(), get_numpy(), get_numpy_path(), get_json(), get_json_path(), get_timestamp(), get_timestamp_path(), get_data(), get_data_path(), get_item_path() |
| Retrieving Models | get_sklearn(), get_model(), get_model_path() |
| Retrieving Artifacts | get_artifact_path() |
| Inspecting Items | list_contents(), get_table_info(), get_model_info(), get_artifact_info(), describe() |
| Managing Items | delete(), copy(), validate(), is_valid() |
| Lineage | get_inputs(), get_dependents(), get_lineage_graph() |
| Snapshots | create_snapshot(), list_snapshots(), delete_snapshot(), load_snapshot(), get_snapshot(), get_snapshot_info(), compare_snapshots(), diff_from_snapshot(), restore_snapshot(), export_snapshot() |
| Caching | cache_status(), clear_cache(), invalidate_cache(), refresh_cache() |
| Bundle Management | refresh() |
Quick Examples
Basic Usage
import datafolio
import pandas as pd
# Create a new DataFolio
folio = datafolio.DataFolio('my_analysis')
# Add data
df = pd.DataFrame({'x': [1, 2, 3], 'y': [4, 5, 6]})
folio.add_table('results', df, description='Experimental results')
# Retrieve data
df_loaded = folio.get_table('results')
# List contents
print(folio.list_contents())
print(folio.tables) # Property access
# Use data accessor with autocomplete
df_via_accessor = folio.data.results
With Caching
# Enable caching for remote bundles
folio = datafolio.DataFolio(
'gs://my-bucket/my-bundle',
cache_enabled=True,
cache_dir='/tmp/my-cache'
)
# First access downloads and caches
df = folio.get_table('large_table') # Downloads from cloud
# Second access uses cache (much faster!)
df = folio.get_table('large_table') # Reads from local cache
# Check cache statistics
status = folio.cache_status()
print(f"Cache hits: {status['cache_hits']}")
print(f"Cache misses: {status['cache_misses']}")
With Snapshots
# Create a read-only snapshot
folio.create_snapshot('v1.0', description='Release 1.0')
# Load a snapshot (always read-only)
folio_snapshot = folio.get_snapshot('v1.0')
# List all snapshots
snapshots = folio.list_snapshots()
for snap in snapshots:
print(f"{snap['name']}: {snap['description']}")
Lineage Tracking
# Add data with lineage
folio.add_table('raw_data', raw_df)
folio.add_table('processed_data', processed_df, inputs=['raw_data'])
folio.add_model('trained_model', model, inputs=['processed_data'])
# Query lineage
inputs = folio.get_inputs('trained_model') # ['processed_data']
dependents = folio.get_dependents('raw_data') # ['processed_data']
# Get full lineage graph
graph = folio.get_lineage_graph()
print(graph) # Shows dependency relationships
Sharing Paths with Collaborators
# For a cloud-hosted folio, get the direct path to any item
folio = datafolio.DataFolio('s3://my-bucket/experiments/run-42')
# Type-specific path methods (recommended):
path = folio.get_table_path('results')
# → 's3://my-bucket/experiments/run-42/tables/results.parquet'
path = folio.get_model_path('classifier')
# → 's3://my-bucket/experiments/run-42/models/classifier.joblib'
# Generic path getter (dispatches to the appropriate method automatically):
path = folio.get_data_path('results') # same as get_table_path for tables
path = folio.get_item_path('results') # lower-level, skips type-specific logic
# Share with a colleague who doesn't use datafolio:
# import pandas as pd; pd.read_parquet('s3://my-bucket/.../results.parquet')
# Or browse all paths at once with describe()
folio.describe(show_paths=True)
# Tables (2):
# • raw_data (reference): Input dataset
# ↳ path: s3://data-lake/raw.parquet
# • results: Model results
# ↳ path: s3://my-bucket/experiments/run-42/tables/results.parquet
# Models (1):
# • classifier: Trained model
# ↳ path: s3://my-bucket/experiments/run-42/models/classifier.joblib