DataFolio Class - Complete API Reference
This page provides a comprehensive reference of all methods available on the DataFolio class, organized by functionality.
Creating a DataFolio
datafolio.DataFolio.__init__(path, metadata=None, random_suffix=False, read_only=False, cache_enabled=False, cache_dir=None, cache_ttl=None, use_https=False)
Initialize a new DataFolio or open an existing one.
If the directory doesn't exist, creates a new bundle. If it exists, opens the existing bundle and reads manifests.
Parameters:
- path (Union[str, Path]) – Full path to bundle directory (local or cloud)
- metadata (Optional[Dict[str, Any]], default: None) – Optional dictionary of analysis metadata (for new bundles)
- random_suffix (bool, default: False) – If True, append a random suffix to the bundle name
- read_only (bool, default: False) – If True, prevent all write operations
- cache_enabled (bool, default: False) – If True, enable local caching for remote data
- cache_dir (Optional[Union[str, Path]], default: None) – Optional cache directory (default: ~/.datafolio_cache)
- cache_ttl (Optional[int], default: None) – Optional TTL override in seconds (default: 1800 = 30 minutes)
- use_https (bool, default: False) – If True, use HTTPS URLs for CloudFiles (for read-only access to public buckets)
Examples:
Create new bundle with exact name:
>>> folio = DataFolio('experiments/protein-analysis')
# Creates: experiments/protein-analysis/
Create new bundle with random suffix:
>>> folio = DataFolio(
... 'experiments/protein-analysis',
... random_suffix=True
... )
# Creates: experiments/protein-analysis-blue-happy-falcon/
Open existing bundle:
>>> folio = DataFolio('experiments/protein-analysis')
With metadata:
>>> folio = DataFolio(
... 'experiments/my-exp',
... metadata={'date': '2024-01-15', 'scientist': 'Dr. Smith'}
... )
Open existing bundle as read-only (for safe inspection):
>>> folio = DataFolio('experiments/production-model', read_only=True)
>>> model = folio.get_model('classifier') # OK
>>> folio.add_table('new', df) # Error: read-only
Enable caching for cloud bundles (faster repeated access):
>>> folio = DataFolio('gs://bucket/experiment', cache_enabled=True)
>>> df = folio.get_table('data') # Downloads and caches
>>> df = folio.get_table('data') # Loads from cache (instant)
Custom cache configuration:
>>> folio = DataFolio(
... 'gs://bucket/experiment',
... cache_enabled=True,
... cache_dir='/mnt/shared/cache',
... cache_ttl=3600 # 1 hour
... )
Adding Data
Methods for adding different types of data to a DataFolio.
Tables (DataFrames)
datafolio.DataFolio.add_table(name, data, description=None, overwrite=False, inputs=None, models=None, code=None)
Add a table to be included in the bundle.
Writes immediately to tables/ directory and updates items.json.
Parameters:
- name (str) – Unique name for this table
- data (Any) – pandas or Polars DataFrame to include
- description (Optional[str], default: None) – Optional description
- overwrite (bool, default: False) – If True, allow overwriting an existing table
- inputs (Optional[list[str]], default: None) – Optional list of table names used to create this table
- models (Optional[list[str]], default: None) – Optional list of model names used to create this table
- code (Optional[str], default: None) – Optional code snippet that created this table
Returns:
- Self – Self for method chaining
Raises:
- ValueError – If name already exists and overwrite=False
- TypeError – If data is not a DataFrame
Examples:
>>> import pandas as pd
>>> folio = DataFolio('experiments/test')
>>> df = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]})
>>> folio.add_table('summary', df)
>>> # With lineage
>>> pred_df = pd.DataFrame({'pred': [0, 1, 0]})
>>> folio.add_table('predictions', pred_df,
... inputs=['test_data'],
... models=['classifier'],
... code='pred = model.predict(X_test)')
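Since every add_* method returns Self, additions can be chained fluently. A minimal sketch of the return-self pattern (a toy stand-in for illustration, not the actual DataFolio implementation):

```python
class MiniFolio:
    """Toy illustration of the return-self chaining pattern used by DataFolio."""

    def __init__(self):
        self.items = {}

    def add_table(self, name, data):
        self.items[name] = ('table', data)
        return self  # returning self is what enables fluent chaining

    def add_json(self, name, data):
        self.items[name] = ('json', data)
        return self

folio = MiniFolio()
# Both items are added in a single fluent statement
folio.add_table('summary', [[1, 4], [2, 5]]).add_json('config', {'lr': 0.01})
print(sorted(folio.items))  # ['config', 'summary']
```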
datafolio.DataFolio.reference_table(name, path, table_format='parquet', num_rows=None, version=None, description=None, inputs=None, code=None)
Add a reference to an external table (not copied to bundle).
Writes immediately to items.json.
Parameters:
- name (str) – Unique name for this table
- path (Union[str, Path]) – Path to the table (local or cloud)
- table_format (str, default: 'parquet') – Format of the table ('parquet', 'delta', 'csv')
- num_rows (Optional[int], default: None) – Optional number of rows
- version (Optional[int], default: None) – Optional version number (for Delta tables)
- description (Optional[str], default: None) – Optional description
- inputs (Optional[list[str]], default: None) – Optional list of items this was derived from
- code (Optional[str], default: None) – Optional code snippet that created this
Returns:
- Self – Self for method chaining
Raises:
- ValueError – If name already exists or format is invalid
Examples:
>>> folio = DataFolio('experiments/test')
>>> folio.reference_table(
... 'raw_data',
... path='s3://bucket/data.parquet',
... table_format='parquet',
... num_rows=1_000_000
... )
Arrays
datafolio.DataFolio.add_numpy(name, array, description=None, overwrite=False, inputs=None, code=None)
Add a numpy array to the bundle.
Saves array to artifacts/ directory as .npy file and updates items.json.
Parameters:
- name (str) – Unique name for this array
- array (Any) – numpy array to save
- description (Optional[str], default: None) – Optional description
- overwrite (bool, default: False) – If True, allow overwriting an existing array
- inputs (Optional[list[str]], default: None) – Optional list of items this was derived from
- code (Optional[str], default: None) – Optional code snippet that created this array
Returns:
- Self – Self for method chaining
Raises:
- ValueError – If name already exists and overwrite=False
- ImportError – If numpy is not installed
- TypeError – If data is not a numpy array
Examples:
>>> import numpy as np
>>> folio = DataFolio('experiments/test')
>>> embeddings = np.random.randn(100, 128)
>>> folio.add_numpy('embeddings', embeddings, description='Model embeddings')
>>> # With lineage
>>> predictions = np.array([0, 1, 0, 1])
>>> folio.add_numpy('predictions', predictions,
... inputs=['test_data'],
... code='predictions = model.predict(X)')
JSON Data
datafolio.DataFolio.add_json(name, data, description=None, overwrite=False, inputs=None, code=None)
Add JSON-serializable data to the bundle.
Saves data to artifacts/ directory as .json file and updates items.json. Supports dicts, lists, scalars, and other JSON-serializable types.
Parameters:
- name (str) – Unique name for this data
- data (Union[dict, list, int, float, str, bool, None]) – JSON-serializable data (dict, list, scalar, etc.)
- description (Optional[str], default: None) – Optional description
- overwrite (bool, default: False) – If True, allow overwriting existing data
- inputs (Optional[list[str]], default: None) – Optional list of items this was derived from
- code (Optional[str], default: None) – Optional code snippet that created this data
Returns:
- Self – Self for method chaining
Raises:
- ValueError – If name already exists and overwrite=False
- TypeError – If data cannot be serialized to JSON
Examples:
>>> folio = DataFolio('experiments/test')
>>> config = {'learning_rate': 0.01, 'batch_size': 32}
>>> folio.add_json('config', config, description='Model config')
>>> # With list data
>>> class_names = ['cat', 'dog', 'bird']
>>> folio.add_json('classes', class_names)
>>> # With scalar
>>> folio.add_json('best_accuracy', 0.95)
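What counts as JSON-serializable follows the standard json module rules. A quick stdlib sketch, independent of datafolio, of which values succeed and which raise the TypeError documented above:

```python
import json

# dicts, lists, strings, numbers, booleans, and None all serialize cleanly
for value in [{'lr': 0.01}, ['cat', 'dog'], 0.95, 'label', True, None]:
    json.dumps(value)  # no error raised

# types such as set (or bytes) are not JSON-serializable and raise TypeError,
# which is the error add_json() surfaces for unserializable data
try:
    json.dumps({'bad': {1, 2, 3}})
    serializable = True
except TypeError:
    serializable = False
print(serializable)  # False
```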
Timestamps
datafolio.DataFolio.add_timestamp(name, timestamp, description=None, overwrite=False, inputs=None, code=None)
Add a timestamp to the bundle.
Saves timestamp to artifacts/ directory as .json file and updates items.json. Accepts timezone-aware datetime objects or Unix timestamps (int/float). All timestamps are stored in UTC as ISO 8601 strings.
Parameters:
- name (str) – Unique name for this timestamp
- timestamp (Union[datetime, int, float]) – Timezone-aware datetime object or Unix timestamp (int/float); naive datetimes raise ValueError
- description (Optional[str], default: None) – Optional description
- overwrite (bool, default: False) – If True, allow overwriting an existing timestamp
- inputs (Optional[list[str]], default: None) – Optional list of items this was derived from
- code (Optional[str], default: None) – Optional code snippet that created this timestamp
Returns:
- Self – Self for method chaining
Raises:
- ValueError – If name already exists and overwrite=False, or if the datetime is naive
- TypeError – If timestamp is not a datetime or numeric type
Examples:
>>> from datetime import datetime, timezone
>>> folio = DataFolio('experiments/test')
>>>
>>> # Add timezone-aware datetime
>>> event_time = datetime(2024, 1, 15, 10, 30, 0, tzinfo=timezone.utc)
>>> folio.add_timestamp('event_time', event_time, description='Event occurred')
>>>
>>> # Add Unix timestamp
>>> folio.add_timestamp('start_time', 1705318200, description='Start time')
>>>
>>> # With lineage
>>> from datetime import datetime, timezone
>>> import pytz
>>> eastern = pytz.timezone('US/Eastern')
>>> local_time = eastern.localize(datetime(2024, 1, 15, 10, 30, 0))
>>> folio.add_timestamp('local_event', local_time,
... inputs=['event_log'],
... code='timestamp = event_log.iloc[0]["timestamp"]')
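The UTC/ISO 8601 normalization described above can be reproduced with the standard library alone. A sketch, independent of datafolio, of how an aware datetime and a Unix timestamp map to the same stored form:

```python
from datetime import datetime, timezone

# A timezone-aware datetime normalizes to UTC and serializes as ISO 8601
event = datetime(2024, 1, 15, 10, 30, 0, tzinfo=timezone.utc)
stored = event.astimezone(timezone.utc).isoformat()
print(stored)  # 2024-01-15T10:30:00+00:00

# A Unix timestamp (seconds since the epoch) round-trips through the same form
unix = event.timestamp()
restored = datetime.fromtimestamp(unix, tz=timezone.utc)
print(restored == event)  # True

# Naive datetimes carry no UTC offset, which is why add_timestamp() rejects them
naive = datetime(2024, 1, 15, 10, 30, 0)
print(naive.tzinfo is None)  # True
```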
Generic Data
datafolio.DataFolio.add_data(name, data=None, reference=None, description=None, **kwargs)
Generic data addition with automatic type detection.
Convenience method that dispatches to the appropriate specific method based on data type. For fine-grained control, use the specific methods: add_table(), add_numpy(), add_json(), or reference_table().
Parameters:
- name (str) – Unique name for this data
- data (Any, default: None) – Data to save (DataFrame, numpy array, dict, list, scalar)
- reference (Optional[Union[str, Path]], default: None) – If provided, creates a reference to external data instead
- description (Optional[str], default: None) – Optional description
- **kwargs – Additional arguments passed to the specific method
Returns:
- Self – Self for method chaining
Raises:
- ValueError – If neither data nor reference is provided, or both are provided
- TypeError – If data type is not supported
Examples:
DataFrame (saves as parquet):
>>> folio.add_data('results', df)
Numpy array (saves as .npy):
>>> folio.add_data('embeddings', np.array([1, 2, 3]))
JSON data (saves as .json):
>>> folio.add_data('config', {'lr': 0.01})
>>> folio.add_data('classes', ['cat', 'dog'])
>>> folio.add_data('accuracy', 0.95)
External reference:
>>> folio.add_data('raw', reference='s3://bucket/data.parquet')
Adding Models
Methods for saving machine learning models.
Scikit-learn Models
datafolio.DataFolio.add_sklearn(name, model, description=None, overwrite=False, inputs=None, hyperparameters=None, code=None, custom=False)
Add a scikit-learn style model to the bundle.
Writes immediately to models/ directory and updates items.json.
Parameters:
- name (str) – Unique name for this model
- model (Any) – Trained model to include
- description (Optional[str], default: None) – Optional description
- overwrite (bool, default: False) – If True, allow overwriting an existing model
- inputs (Optional[list[str]], default: None) – Optional list of table names used for training
- hyperparameters (Optional[Dict[str, Any]], default: None) – Optional dict of hyperparameters
- code (Optional[str], default: None) – Optional code snippet that trained this model
- custom (bool, default: False) – If True, use skops format for portable pipelines with custom transformers; if False, use joblib format
Returns:
- Self – Self for method chaining
Raises:
- ValueError – If name already exists and overwrite=False
Examples:
>>> from sklearn.ensemble import RandomForestClassifier
>>> folio = DataFolio('experiments/test')
>>> model = RandomForestClassifier(n_estimators=100, max_depth=10)
>>> # ... train model ...
>>> folio.add_sklearn('classifier', model,
... description='Random forest classifier',
... inputs=['training_data', 'validation_data'],
... hyperparameters={'n_estimators': 100, 'max_depth': 10},
... code='model.fit(X_train, y_train)')
>>>
>>> # Portable pipeline with custom transformer (skops)
>>> folio.add_sklearn('pipeline', custom_pipeline, custom=True)
datafolio.DataFolio.add_model(name, model, description=None, overwrite=False, custom=False, **kwargs)
Add a scikit-learn style model to the bundle.
This is a convenience method that delegates to add_sklearn().
Parameters:
- name (str) – Unique name for this model
- model (Any) – Trained sklearn-style model
- description (Optional[str], default: None) – Optional description
- overwrite (bool, default: False) – If True, allow overwriting an existing model
- custom (bool, default: False) – If True, use skops format for portability (required for custom transformers)
- **kwargs – Additional arguments passed to add_sklearn() (e.g., hyperparameters, inputs, code)
Returns:
- Self – Self for method chaining
Raises:
- ValueError – If name already exists and overwrite=False
Examples:
>>> from sklearn.ensemble import RandomForestClassifier
>>> model = RandomForestClassifier()
>>> folio.add_model('clf', model, hyperparameters={'n_estimators': 100})
With custom transformer (portable):
>>> folio.add_model('pipeline', custom_pipeline, custom=True)
Adding Artifacts
Methods for adding arbitrary files and artifacts.
datafolio.DataFolio.add_artifact(name, path, category=None, description=None, overwrite=False)
Add an artifact file to the bundle.
Copies the file immediately to the artifacts/ directory and updates items.json.
Parameters:
- name (str) – Unique name for this artifact
- path (Union[str, Path]) – Path to the file to include
- category (Optional[str], default: None) – Optional category ('plots', 'configs', etc.)
- description (Optional[str], default: None) – Optional description
- overwrite (bool, default: False) – If True, allow overwriting an existing artifact
Returns:
- Self – Self for method chaining
Raises:
- ValueError – If name already exists and overwrite=False
- FileNotFoundError – If the file doesn't exist
Examples:
>>> folio = DataFolio('experiments/test')
>>> folio.add_artifact('loss_curve', 'plots/training_loss.png', category='plots')
>>> # Update with overwrite
>>> folio.add_artifact('loss_curve', 'plots/updated_loss.png', category='plots', overwrite=True)
Retrieving Data
Methods for loading data from a DataFolio.
Tables (DataFrames)
datafolio.DataFolio.get_table(name, **kwargs)
Get a table by name (works for both included and referenced).
For included tables, reads from bundle directory. For referenced tables, reads from the specified external path. If caching is enabled (cache_enabled=True), cloud-based tables are cached locally for faster repeated access.
Supports all pandas.read_parquet() arguments for filtering and optimization:
- columns: List of column names to read (column pruning)
- filters: Row filtering predicates (row filtering)
- engine: Parquet engine ('pyarrow' or 'fastparquet')
Parameters:
- name (str) – Name of the table
- **kwargs – Additional arguments passed to pd.read_parquet() (e.g., columns, filters, engine)
Returns:
- Any – pandas DataFrame
Raises:
- KeyError – If table name doesn't exist
- ImportError – If reading from cloud requires missing dependencies
- FileNotFoundError – If the referenced file doesn't exist
Examples:
Basic usage:
>>> folio = DataFolio('experiments/test')
>>> import pandas as pd
>>> df = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]})
>>> folio.add_table('test', df)
>>> retrieved = folio.get_table('test')
>>> assert len(retrieved) == 3
Column selection (read only specific columns):
>>> df_subset = folio.get_table('test', columns=['a'])
>>> assert list(df_subset.columns) == ['a']
Row filtering (requires pyarrow engine):
>>> df_filtered = folio.get_table('test',
... filters=[('a', '>', 1)],
... engine='pyarrow')
>>> assert len(df_filtered) == 2
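The filters argument takes (column, operator, value) predicate tuples, as defined by pyarrow's parquet reader. A small pure-Python sketch of how such predicates select rows (an illustration of the AND semantics of a flat predicate list, not pyarrow's actual implementation):

```python
import operator

# Map the string operators used in filter tuples to Python comparisons
OPS = {'>': operator.gt, '>=': operator.ge, '<': operator.lt,
       '<=': operator.le, '==': operator.eq, '!=': operator.ne}

def apply_filters(rows, filters):
    """Keep rows satisfying every (column, op, value) predicate (AND semantics)."""
    return [r for r in rows
            if all(OPS[op](r[col], val) for col, op, val in filters)]

rows = [{'a': 1, 'b': 4}, {'a': 2, 'b': 5}, {'a': 3, 'b': 6}]
print(apply_filters(rows, [('a', '>', 1)]))  # rows where a > 1
```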
datafolio.DataFolio.get_table_path(name)
Get the path to a table file, whether included in the bundle or referenced externally.
For included tables, returns the full path to the parquet file inside the bundle. For referenced tables, returns the external path recorded at reference time.
Parameters:
- name (str) – Name of the table
Returns:
- str – Path to the table file
Raises:
- KeyError – If table name doesn't exist
- ValueError – If the named item is not a table
Examples:
>>> folio = DataFolio('experiments/my-run')
>>> folio.add_table('results', df)
>>> path = folio.get_table_path('results')
>>> print(path)
'experiments/my-run/tables/results.parquet'
>>> folio.reference_table('raw', path='s3://data-lake/raw.parquet')
>>> folio.get_table_path('raw')
's3://data-lake/raw.parquet'
Arrays
datafolio.DataFolio.get_numpy(name)
Get a numpy array by name.
If caching is enabled, cloud-based arrays are cached locally for faster repeated access.
Parameters:
- name (str) – Name of the array
Returns:
- Any – numpy array
Raises:
- KeyError – If array name doesn't exist
- ValueError – If the named item is not a numpy array
- ImportError – If numpy is not installed
Examples:
>>> folio = DataFolio('experiments/test')
>>> embeddings = folio.get_numpy('embeddings')
>>> print(embeddings.shape)
datafolio.DataFolio.get_numpy_path(name)
Get the path to a numpy array file stored in the bundle.
Parameters:
- name (str) – Name of the array
Returns:
- str – Path to the .npy file
Raises:
- KeyError – If array name doesn't exist
- ValueError – If the named item is not a numpy array
Examples:
>>> folio = DataFolio('experiments/my-run')
>>> folio.add_numpy('embeddings', arr)
>>> path = folio.get_numpy_path('embeddings')
>>> print(path)
'experiments/my-run/artifacts/embeddings.npy'
JSON Data
datafolio.DataFolio.get_json(name)
Get JSON data by name.
If caching is enabled, cloud-based JSON data is cached locally for faster repeated access.
Parameters:
- name (str) – Name of the JSON data
Returns:
- Any – Deserialized JSON data (dict, list, scalar, etc.)
Raises:
- KeyError – If data name doesn't exist
- ValueError – If the named item is not JSON data
Examples:
>>> folio = DataFolio('experiments/test')
>>> config = folio.get_json('config')
>>> print(config['learning_rate'])
datafolio.DataFolio.get_json_path(name)
Get the path to a JSON data file stored in the bundle.
Parameters:
- name (str) – Name of the JSON data
Returns:
- str – Path to the .json file
Raises:
- KeyError – If data name doesn't exist
- ValueError – If the named item is not JSON data
Examples:
>>> folio = DataFolio('experiments/my-run')
>>> folio.add_json('config', {'lr': 0.01})
>>> path = folio.get_json_path('config')
>>> print(path)
'experiments/my-run/artifacts/config.json'
Timestamps
datafolio.DataFolio.get_timestamp(name, as_unix=False)
Get a timestamp by name.
If caching is enabled, cloud-based timestamps are cached locally for faster repeated access.
Parameters:
- name (str) – Name of the timestamp
- as_unix (bool, default: False) – If True, return a Unix timestamp (float); if False, return a datetime
Returns:
- Union[datetime, float] – Timezone-aware datetime (default), or Unix timestamp (float) if as_unix=True
Raises:
- KeyError – If timestamp name doesn't exist
- ValueError – If the named item is not a timestamp
Examples:
>>> folio = DataFolio('experiments/test')
>>>
>>> # Get as datetime (default)
>>> event_time = folio.get_timestamp('event_time')
>>> print(event_time.isoformat())
'2024-01-15T10:30:00+00:00'
>>>
>>> # Get as Unix timestamp
>>> unix_time = folio.get_timestamp('event_time', as_unix=True)
>>> print(unix_time)
1705314600.0
datafolio.DataFolio.get_timestamp_path(name)
Get the path to a timestamp file stored in the bundle.
Parameters:
- name (str) – Name of the timestamp
Returns:
- str – Path to the timestamp file
Raises:
- KeyError – If timestamp name doesn't exist
- ValueError – If the named item is not a timestamp
Examples:
>>> folio = DataFolio('experiments/my-run')
>>> folio.add_timestamp('event_time', dt)
>>> path = folio.get_timestamp_path('event_time')
>>> print(path)
'experiments/my-run/artifacts/event_time.json'
Generic Data
datafolio.DataFolio.get_data(name)
Generic data getter that returns any data type.
Automatically detects the item type and calls the appropriate getter. For fine-grained control, use the specific methods: get_table(), get_numpy(), or get_json().
Parameters:
- name (str) – Name of the data item
Returns:
- Any – The data (DataFrame, numpy array, dict, list, or scalar)
Raises:
- KeyError – If item name doesn't exist
- ValueError – If the item is not a data type (e.g., is a model or artifact)
Examples:
>>> folio.add_data('results', df)
>>> folio.add_data('embeddings', np_array)
>>> folio.add_data('config', {'lr': 0.01})
>>> # Later, retrieve without knowing the type
>>> results = folio.get_data('results') # Returns DataFrame
>>> embeddings = folio.get_data('embeddings') # Returns numpy array
>>> config = folio.get_data('config') # Returns dict
datafolio.DataFolio.get_data_path(name)
Get the path to any stored item, delegating to the appropriate type-specific method.
Automatically detects the item type and calls the appropriate path getter:
- Tables (included or referenced): delegates to get_table_path()
- Artifacts: delegates to get_artifact_path()
- All other bundled items (numpy arrays, JSON, timestamps): returns the bundle file path
Parameters:
- name (str) – Name of the item
Returns:
- str – Path to the item file
Raises:
- KeyError – If item name doesn't exist
Examples:
>>> folio = DataFolio('experiments', prefix='test')
>>> folio.add_table('results', df)
>>> folio.get_data_path('results') # returns path to parquet file
>>> folio.reference_table('data', path='s3://bucket/file.parquet')
>>> folio.get_data_path('data') # returns 's3://bucket/file.parquet'
datafolio.DataFolio.get_item_path(name)
Get the path to any item stored in the folio.
For items stored within the bundle (included tables, models, artifacts, arrays, JSON data, timestamps), returns the full path to the data file. For referenced tables, returns the external path recorded at reference time.
This is especially useful for cloud-hosted folios where collaborators can directly access or download underlying files without using datafolio.
Parameters:
- name (str) – Name of the item
Returns:
- str – Full path to the item's data file. For cloud folios this will be a cloud URI (e.g. s3://bucket/.../results.parquet); for local folios this will be an absolute file-system path.
Raises:
- KeyError – If item name doesn't exist
- ValueError – If the item has no associated file path
Examples:
>>> folio = DataFolio('s3://bucket/experiments/my-run')
>>> path = folio.get_item_path('results')
>>> print(path)
's3://bucket/experiments/my-run/tables/results.parquet'
>>> path = folio.get_item_path('classifier')
>>> print(path)
's3://bucket/experiments/my-run/models/classifier.joblib'
>>> # For referenced tables the external path is returned
>>> folio.reference_table('raw', path='s3://data-lake/raw.parquet')
>>> folio.get_item_path('raw')
's3://data-lake/raw.parquet'
Retrieving Models
Methods for loading machine learning models.
Scikit-learn Models
datafolio.DataFolio.get_sklearn(name)
Get a scikit-learn style model by name.
If caching is enabled (cache_enabled=True), cloud-based models are cached locally for faster repeated access.
Parameters:
- name (str) – Name of the model
Returns:
- Any – The model object
Raises:
- KeyError – If model name doesn't exist
- ValueError – If the named item is not a sklearn model
Examples:
>>> folio = DataFolio('experiments/test')
>>> model = folio.get_sklearn('classifier')
datafolio.DataFolio.get_model(name, **kwargs)
Get a scikit-learn style model by name.
This is a convenience method that delegates to get_sklearn().
If caching is enabled (cache_enabled=True), cloud-based models are cached locally for faster repeated access.
Parameters:
- name (str) – Name of the model
- **kwargs – Additional arguments (currently unused, kept for backward compatibility)
Returns:
- Any – The model object
Raises:
- KeyError – If model name doesn't exist
- ValueError – If the named item is not a model
Examples:
>>> folio = DataFolio('experiments/test')
>>> model = folio.get_model('classifier')
datafolio.DataFolio.get_model_path(name)
Get the path to a model file stored in the bundle.
Parameters:
- name (str) – Name of the model
Returns:
- str – Path to the model file
Raises:
- KeyError – If model name doesn't exist
- ValueError – If the named item is not a model
Examples:
>>> folio = DataFolio('experiments/my-run')
>>> folio.add_sklearn('classifier', model)
>>> path = folio.get_model_path('classifier')
>>> print(path)
'experiments/my-run/models/classifier.joblib'
Retrieving Artifacts
datafolio.DataFolio.get_artifact_path(name)
Get the path to an artifact file.
Parameters:
- name (str) – Name of the artifact
Returns:
- str – Path to the artifact file
Raises:
- KeyError – If artifact name doesn't exist
- ValueError – If the named item is not an artifact
Examples:
>>> folio = DataFolio('experiments/test-blue-happy-falcon')
>>> path = folio.get_artifact_path('plot')
Inspecting Items
Methods for getting information about items.
datafolio.DataFolio.list_contents(include_archived=False)
List all contents in the DataFolio.
Parameters:
- include_archived (bool, default: False) – If True, include archived (hidden) items in the results; defaults to False so archived items are hidden from normal views
Returns:
- Dict[str, list[str]] – Dictionary with keys 'referenced_tables', 'included_tables', 'numpy_arrays', 'json_data', 'timestamps', 'models', and 'artifacts', each containing a list of names
Examples:
>>> folio = DataFolio('experiments/test')
>>> folio.reference_table('data1', path='s3://bucket/data.parquet')
>>> folio.add_numpy('embeddings', np.array([1, 2, 3]))
>>> folio.list_contents()
{'referenced_tables': ['data1'], 'included_tables': [], 'numpy_arrays': ['embeddings'],
'json_data': [], 'timestamps': [], 'models': [], 'artifacts': []}
datafolio.DataFolio.get_table_info(name)
Get metadata about a table (referenced or included).
Returns the manifest entry containing information like:
- For referenced tables: path, table_format, is_directory, num_rows, version, description
- For included tables: filename, table_format, is_directory, num_rows, num_cols, columns, dtypes, description
Parameters:
- name (str) – Name of the table
Returns:
- Union[TableReference, IncludedTable] – Dictionary with table metadata
Raises:
- KeyError – If table name doesn't exist
Examples:
>>> folio = DataFolio('experiments/test')
>>> folio.reference_table('data', path='s3://bucket/data.parquet', num_rows=1000000)
>>> info = folio.get_table_info('data')
>>> info['num_rows']
1000000
>>> info['table_format']
'parquet'
datafolio.DataFolio.get_model_info(name)
Get metadata about a model.
Returns the manifest entry containing information such as filename, item_type, and description.
Parameters:
- name (str) – Name of the model
Returns:
- IncludedItem – Dictionary with model metadata
Raises:
- KeyError – If model name doesn't exist
- ValueError – If the named item is not a model
Examples:
>>> folio = DataFolio('experiments/test')
>>> folio.add_model('classifier', model, description='Random forest classifier')
>>> info = folio.get_model_info('classifier')
>>> info['description']
'Random forest classifier'
datafolio.DataFolio.get_artifact_info(name)
Get metadata about an artifact.
Returns the manifest entry containing information such as filename, item_type, category, and description.
Parameters:
-
name(str) –Name of the artifact
Returns:
-
IncludedItem–Dictionary with artifact metadata
Raises:
-
KeyError–If artifact name doesn't exist
-
ValueError–If named item is not an artifact
Examples:
>>> folio = DataFolio('experiments/test')
>>> folio.add_artifact('plot', 'plot.png', category='plots', description='Loss curve')
>>> info = folio.get_artifact_info('plot')
>>> info['category']
'plots'
>>> info['description']
'Loss curve'
datafolio.DataFolio.describe(pattern=None, return_string=False, show_empty=False, max_metadata_fields=10, snapshot=None, include_archived=False, show_paths=False)
Generate a human-readable description of all items in the bundle.
Includes lineage information showing inputs and dependencies.
Parameters:
- pattern (Optional[str], default: None) – Optional glob pattern to filter items by name (e.g. 'examples/*', '*/weights'). Uses fnmatch rules: '*' matches any characters, including '/'.
- return_string (bool, default: False) – If True, return the description as a string instead of printing it
- show_empty (bool, default: False) – If True, show empty sections
- max_metadata_fields (int, default: 10) – Maximum number of metadata fields to show
- snapshot (Optional[str], default: None) – Optional snapshot name to describe instead of the full bundle
- include_archived (bool, default: False) – If True, show archived (hidden) items
- show_paths (bool, default: False) – If True, show the file path for each item; especially useful for cloud-hosted folios where paths can be shared with collaborators who don't use datafolio
Returns:
- Optional[str] – The formatted description if return_string=True, otherwise None
Examples:
>>> folio.describe() # Show full bundle
>>> folio.describe('examples/*') # Show only items under 'examples/'
>>> folio.describe(snapshot='v1.0') # Show specific snapshot
>>> folio.describe(include_archived=True) # Show archived items too
>>> folio.describe(show_paths=True) # Show file paths for sharing
See DisplayFormatter.describe() for full documentation.
Managing Items
Deleting Items
datafolio.DataFolio.delete(name, warn_dependents=True)
Delete one or more items from the DataFolio.
Removes items from the manifest and deletes associated files. Does not enforce lineage - can delete items that other items depend on.
Parameters:
- name (Union[str, list[str]]) – Name(s) of item(s) to delete (string or list of strings)
- warn_dependents (bool, default: True) – If True, print a warning if deleted items have dependents
Returns:
- Self – Self for method chaining
Raises:
- KeyError – If any item name doesn't exist
Examples:
Delete single item:
>>> folio = DataFolio('experiments/test')
>>> folio.delete('old_model')
Delete multiple items:
>>> folio.delete(['temp_data', 'debug_plot', 'old_model'])
Delete without warnings:
>>> folio.delete('item', warn_dependents=False)
Archiving Items
datafolio.DataFolio.archive(name)
Mark item(s) as archived (hidden from default views, not deleted).
Archived items remain on disk and are still accessible via get_data() / get_table() etc., but are excluded from list_contents(), describe(), and copy() by default. Pass include_archived=True to those methods to reveal them again, or call unarchive() to restore them permanently.
Accepts a single name, a list of names, or a glob pattern (fnmatch rules, e.g. 'intermediate/*').
Parameters:
- name (Union[str, list[str]]) – Name(s) of item(s) to archive, or a glob pattern
Returns:
- Self – Self for method chaining
Raises:
- KeyError – If a specific name (non-glob) is not found
Examples:
Archive a single item:
>>> folio.archive('debug_output')
Archive multiple items:
>>> folio.archive(['debug_output', 'temp_features'])
Archive by glob pattern:
>>> folio.archive('intermediate/*')
datafolio.DataFolio.unarchive(name)
Restore archived item(s) to active status.
Removes the archived flag so the items appear again in list_contents(), describe(), and copy() by default.
Accepts a single name, a list of names, or a glob pattern (fnmatch rules).
Parameters:
- name (Union[str, list[str]]) – Name(s) of item(s) to unarchive, or a glob pattern
Returns:
- Self – Self for method chaining
Raises:
- KeyError – If a specific name (non-glob) is not found
Examples:
Unarchive a single item:
>>> folio.unarchive('debug_output')
Unarchive multiple items:
>>> folio.unarchive(['debug_output', 'temp_features'])
Unarchive by glob pattern:
>>> folio.unarchive('intermediate/*')
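The glob patterns accepted by archive() and unarchive() follow stdlib fnmatch rules, under which '*' matches any characters, including '/'. A quick sketch, independent of datafolio, with hypothetical item names:

```python
from fnmatch import fnmatch

names = ['intermediate/temp1', 'intermediate/step2/cache', 'final_model']

# fnmatch's '*' does not stop at '/', so nested names match the pattern too
archived = [n for n in names if fnmatch(n, 'intermediate/*')]
print(archived)  # ['intermediate/temp1', 'intermediate/step2/cache']
```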
Copying Bundles
datafolio.DataFolio.copy(path, name=None, metadata_updates=None, include_items=None, exclude_items=None, random_suffix=False, follow_lineage=False, include_archived=False)
Create a copy of this bundle at a new location.
Useful for creating derived experiments or checkpoints.
Parameters:
- path (Union[str, Path]) – Destination path for the new bundle, used as the exact bundle location (e.g., 'gs://bucket/experiments/my-copy')
- name (Optional[str], default: None) – If provided, appended to path as a subdirectory (e.g., path='experiments', name='exp-v2' → 'experiments/exp-v2'); if None, path is used as-is
- metadata_updates (Optional[Dict[str, Any]], default: None) – Metadata fields to update/add in the copy
- include_items (Optional[list[str]], default: None) – If specified, only copy these items (by name)
- exclude_items (Optional[list[str]], default: None) – Items to exclude from the copy (by name)
- random_suffix (bool, default: False) – If True, append a random suffix to the new bundle name
- follow_lineage (bool, default: False) – If True and include_items is provided, automatically include all transitive upstream dependencies of the named items; items referenced in lineage that are not present in this folio (e.g. external tables) are silently skipped
- include_archived (bool, default: False) – If True, archived items are included in the copy; defaults to False so archived items are excluded
Returns:
- DataFolio – New DataFolio instance
Raises:
- ValueError – If include_items and exclude_items are both specified
Examples:
>>> # Copy to exact destination path
>>> folio2 = folio.copy('gs://bucket/experiments/my-copy')
>>> # Copy to base directory with explicit name subdirectory
>>> folio2 = folio.copy('experiments', name='exp-v2')
>>> # Copy with random suffix
>>> folio2 = folio.copy('experiments/exp-v2', random_suffix=True)
>>> # Copy with metadata updates to track parent
>>> folio2 = folio.copy(
... 'experiments/exp-v2',
... metadata_updates={
... 'parent_bundle': folio._bundle_dir,
... 'changes': 'Increased max_depth to 15'
... }
... )
>>> # Copy only specific items (e.g., for derived experiment)
>>> folio2 = folio.copy(
... 'experiments/exp-v2-tuned',
... include_items=['training_data', 'validation_data'],
... metadata_updates={'status': 'in_progress'}
... )
>>> # Copy only final outputs, auto-resolving all upstream deps
>>> folio2 = folio.copy(
... 'results',
... include_items=['final_model', 'test_results'],
... follow_lineage=True,
... )
>>> # Include archived items in the copy
>>> folio2 = folio.copy('archive_backup', include_archived=True)
Validation
datafolio.DataFolio.validate()
Validate existence and integrity of all items.
Checks that:
1. Included items exist in the bundle
2. Referenced items exist at their external path
3. Checksums match (for included single files)
Returns:
-
Dict[str, bool]–Mapping of item name to validation status
Examples:
>>> status = folio.validate()
>>> if not all(status.values()):
... print("Bundle corrupted!")
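Because the result maps item names to booleans, a per-item failure report is a short comprehension. A sketch with a hypothetical status dict standing in for folio.validate():

```python
# Stand-in for: status = folio.validate()
status = {'training_data': True, 'classifier': True, 'old_results': False}

failed = sorted(name for name, ok in status.items() if not ok)
if failed:
    print(f"Invalid items: {', '.join(failed)}")  # Invalid items: old_results
```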
datafolio.DataFolio.is_valid()
Check if the entire bundle is valid.
Convenience method that runs validate() and returns True only if all items pass validation.
Returns:
-
bool–True if all items are valid, False otherwise
Examples:
>>> if not folio.is_valid():
... print("Bundle corrupted!")
Lineage and Dependencies
Methods for working with lineage tracking.
datafolio.DataFolio.get_inputs(item_name)
Get list of items that were inputs to this item.
Parameters:
-
item_name(str) –Name of the item
Returns:
-
list[str]–Names of input items
Raises:
-
KeyError–If item doesn't exist
Examples:
>>> folio = DataFolio('experiments/test')
>>> # After adding items with lineage...
>>> inputs = folio.get_inputs('predictions')
>>> # Returns: ['test_data', 'classifier']
datafolio.DataFolio.get_dependents(item_name)
Get list of items that depend on this item.
Parameters:
-
item_name(str) –Name of the item
Returns:
-
list[str]–Names of dependent items
Raises:
-
KeyError–If item doesn't exist
Examples:
>>> folio = DataFolio('experiments/test')
>>> # After adding items with lineage...
>>> dependents = folio.get_dependents('classifier')
>>> # Returns items that used 'classifier' as input
datafolio.DataFolio.get_lineage_graph()
Get full dependency graph for all items in bundle.
Returns:
-
Dict[str, list[str]]–Mapping of each item name to its list of input item names
Examples:
>>> folio = DataFolio('experiments/test')
>>> graph = folio.get_lineage_graph()
>>> # Returns: {'predictions': ['test_data', 'classifier'], ...}
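The returned graph maps each item to its direct inputs, so the transitive upstream closure (the same set copy(follow_lineage=True) resolves) can be computed with a short traversal. A sketch over a hypothetical graph dict shaped like the example above:

```python
# Stand-in for: graph = folio.get_lineage_graph()
graph = {
    'predictions': ['test_data', 'classifier'],
    'classifier': ['training_data'],
    'training_data': [],
    'test_data': [],
}

def upstream(graph, item):
    """All transitive inputs of item (iterative depth-first, cycle-safe)."""
    seen, stack = set(), list(graph.get(item, []))
    while stack:
        dep = stack.pop()
        if dep not in seen:
            seen.add(dep)
            stack.extend(graph.get(dep, []))
    return seen

print(sorted(upstream(graph, 'predictions')))  # ['classifier', 'test_data', 'training_data']
```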
Snapshots
Methods for working with snapshots (read-only copies).
datafolio.DataFolio.create_snapshot(name, description=None, tags=None, capture_git=True, capture_environment=False, capture_execution=False)
Create a named snapshot of the current bundle state.
A snapshot captures:
- Current versions of all items (via item_versions dict)
- Current metadata state (via metadata_snapshot dict)
- Git repository state (commit, branch, dirty status) [optional]
- Python environment (version, packages) [optional, off by default]
- Execution context (entry point, working directory) [optional, off by default]
After creating a snapshot, all current items are marked as being in that snapshot. Future overwrites will trigger copy-on-write to preserve the snapshot state.
SECURITY NOTE: Environment variables (API keys, tokens, etc.) are NEVER captured. The capture_environment flag only captures Python version, platform, and package versions from uv.lock or requirements.txt.
Parameters:
-
name(str) –Snapshot name (filesystem-safe, no @ symbol)
-
description(Optional[str], default:None) –Optional human-readable description
-
tags(Optional[list[str]], default:None) –Optional list of tags for organization
-
capture_git(bool, default:True) –Whether to capture git state (default: True)
-
capture_environment(bool, default:False) –Whether to capture Python environment info like version and packages (default: False for security)
-
capture_execution(bool, default:False) –Whether to capture execution context like entry point and working directory (default: False for security)
Returns:
-
Self–Self for method chaining
Raises:
-
ValueError–If snapshot name is invalid or already exists
Examples:
>>> folio = DataFolio('experiments/my-exp')
>>> folio.add_table('results', df)
>>> folio.create_snapshot('v1.0-baseline', description='Initial results')
>>>
>>> # Later, overwriting will preserve the snapshot
>>> folio.add_table('results', new_df, overwrite=True) # Creates v2
datafolio.DataFolio.list_snapshots()
List all snapshots in the bundle.
Returns:
-
list[Dict[str, Any]]–Snapshot metadata dicts (each includes at least name and description)
datafolio.DataFolio.delete_snapshot(name, cleanup_orphans=False)
Delete a snapshot.
Removes the snapshot from the registry and updates items' in_snapshots lists. Optionally cleans up orphaned item versions that are no longer referenced.
Parameters:
-
name(str) –Snapshot name to delete
-
cleanup_orphans(bool, default:False) –If True, delete item versions no longer in any snapshot
Returns:
-
Self–Self for method chaining
Raises:
-
KeyError–If snapshot doesn't exist
Examples:
>>> folio.delete_snapshot('experimental-v5')
>>> folio.delete_snapshot('old-snapshot', cleanup_orphans=True)
datafolio.DataFolio.load_snapshot(bundle_dir, snapshot)
classmethod
Load a DataFolio in snapshot state.
Creates a DataFolio instance configured to access items and metadata as they existed at snapshot time. Snapshots are always read-only to preserve snapshot immutability.
Parameters:
-
bundle_dir(Union[str, Path]) –Path to the bundle directory
-
snapshot(str) –Snapshot name to load
Returns:
-
DataFolio–Read-only DataFolio instance in snapshot state
Raises:
-
KeyError–If snapshot doesn't exist
Examples:
Load snapshot for inspection:
>>> paper = DataFolio.load_snapshot('research/exp', 'paper-v1')
>>> model = paper.get_model('classifier')
>>> print(paper.metadata['accuracy'])
>>> paper.add_table('new', df) # Error: snapshots are always read-only
Compare multiple snapshots:
>>> v1 = DataFolio.load_snapshot('path', 'v1.0')
>>> v2 = DataFolio.load_snapshot('path', 'v2.0')
>>> print(f"v1: {v1.metadata['accuracy']}, v2: {v2.metadata['accuracy']}")
datafolio.DataFolio.get_snapshot(snapshot)
Get a snapshot from this folio as a new DataFolio instance.
Convenience method for loading a snapshot when you already have a folio. Equivalent to DataFolio.load_snapshot(self._bundle_dir, snapshot). Snapshots are always read-only to preserve immutability.
Parameters:
-
snapshot(str) –Snapshot name to load
Returns:
-
DataFolio–Read-only DataFolio instance in snapshot state
Raises:
-
KeyError–If snapshot doesn't exist
Examples:
>>> folio = DataFolio('experiments/classifier')
>>> baseline = folio.get_snapshot('v1.0-baseline')
>>> assert baseline.metadata['accuracy'] == 0.89
>>> assert baseline.read_only # Snapshots are always read-only
>>>
>>> # Compare current state to snapshot
>>> current_acc = folio.metadata['accuracy']
>>> baseline_acc = baseline.metadata['accuracy']
>>> print(f"Improvement: {current_acc - baseline_acc:.2%}")
datafolio.DataFolio.get_snapshot_info(snapshot)
Get detailed information about a snapshot.
Returns the full snapshot metadata including item versions, metadata state, git info, environment info, and execution context.
Parameters:
-
snapshot(str) –Snapshot name
Returns:
-
Dict[str, Any]–Full snapshot metadata dictionary
Raises:
-
KeyError–If snapshot doesn't exist
Examples:
>>> info = folio.get_snapshot_info('v1.0')
>>> print(info['description'])
'Baseline model'
>>> print(info['git']['commit'])
'a3f2b8c'
>>> print(info['metadata_snapshot']['accuracy'])
0.89
datafolio.DataFolio.compare_snapshots(snapshot1, snapshot2)
Compare two snapshots.
Returns a dictionary showing differences between the two snapshots, including:
- added_items: Items in snapshot2 but not snapshot1
- removed_items: Items in snapshot1 but not snapshot2
- modified_items: Items in both but with different versions
- shared_items: Items in both with same version
- metadata_changes: Metadata fields that changed (old_value, new_value)
Parameters:
-
snapshot1(str) –First snapshot name
-
snapshot2(str) –Second snapshot name
Returns:
-
Dict[str, Any]–Dictionary of differences as described above
Raises:
-
KeyError–If either snapshot doesn't exist
Examples:
>>> diff = folio.compare_snapshots('v1.0', 'v2.0')
>>> print(diff['modified_items'])
['classifier', 'config']
>>> print(diff['metadata_changes']['accuracy'])
(0.89, 0.91)
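The diff dictionary lends itself to a compact change report. A sketch over a hypothetical diff dict with the keys documented above, standing in for the real return value:

```python
# Stand-in for: diff = folio.compare_snapshots('v1.0', 'v2.0')
diff = {
    'added_items': ['new_feature'],
    'removed_items': [],
    'modified_items': ['classifier', 'config'],
    'shared_items': ['training_data'],
    'metadata_changes': {'accuracy': (0.89, 0.91)},
}

lines = []
for key in ('added_items', 'removed_items', 'modified_items'):
    if diff[key]:  # skip empty categories
        lines.append(f"{key}: {', '.join(diff[key])}")
for field, (old, new) in diff['metadata_changes'].items():
    lines.append(f"{field}: {old} -> {new}")
print('\n'.join(lines))
```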
datafolio.DataFolio.diff_from_snapshot(snapshot=None)
Compare current state to a snapshot.
This is useful for seeing what has changed since a snapshot was created, similar to 'git status' showing changes since last commit.
Parameters:
-
snapshot(Optional[str], default:None) –Snapshot name to compare to. If None, uses most recent snapshot.
Returns:
-
Dict[str, Any]–Dictionary with comparison results including:
- snapshot_name: The snapshot being compared to
- added_items: Items in current state but not in snapshot
- removed_items: Items in snapshot but not in current state
- modified_items: Items in both but with different checksums/versions
- unchanged_items: Items in both with same checksum/version
- metadata_changes: Metadata fields that changed
Raises:
-
KeyError–If snapshot doesn't exist
-
ValueError–If no snapshots exist and snapshot=None
Examples:
>>> # Compare to last snapshot
>>> diff = folio.diff_from_snapshot()
>>> print(f"Modified: {diff['modified_items']}")
['classifier', 'config']
>>> # Compare to specific snapshot
>>> diff = folio.diff_from_snapshot('v1.0')
>>> print(f"Added since v1.0: {diff['added_items']}")
['new_feature']
datafolio.DataFolio.restore_snapshot(snapshot, confirm=False)
Restore working state to snapshot (DESTRUCTIVE).
This operation:
- Replaces current metadata with snapshot metadata
- Sets current item versions to match snapshot
- Removes items added after snapshot
- Does NOT delete the snapshot itself
WARNING: This is a destructive operation that overwrites current state.
Parameters:
-
snapshot(str) –Snapshot name to restore
-
confirm(bool, default:False) –Must be True to proceed (safety check)
Returns:
-
Self–Self for method chaining
Raises:
-
ValueError–If confirm=False
-
KeyError–If snapshot doesn't exist
Examples:
>>> folio.restore_snapshot('v1.0', confirm=True)
>>> # Working state now matches v1.0 snapshot
datafolio.DataFolio.export_snapshot(snapshot, target_path, *, include_snapshot_metadata=True)
Export a snapshot to a clean, standalone bundle.
Creates a new DataFolio bundle containing only the items and metadata from the specified snapshot. This is useful for:
- Sharing a specific snapshot with collaborators
- Creating a clean bundle for deployment
- Starting fresh without version history
Parameters:
-
snapshot(str) –Name of snapshot to export
-
target_path(Union[str, Path]) –Path for new bundle (must not exist)
-
include_snapshot_metadata(bool, default:True) –If True, adds snapshot info to new bundle's metadata under '_source_snapshot' key (default: True)
Returns:
-
DataFolio–New DataFolio instance at target_path
Raises:
-
KeyError–If snapshot doesn't exist
-
ValueError–If target_path already exists
Examples:
Export a baseline snapshot for sharing:
>>> folio = DataFolio('experiments/classifier')
>>> baseline = folio.export_snapshot('v1.0-baseline', 'shared/baseline')
>>> # New bundle contains only v1.0-baseline state, no history
Export for deployment:
>>> production = folio.export_snapshot('production-v2', 'deploy/v2')
>>> # Clean bundle ready for deployment
Export without metadata reference:
>>> clean = folio.export_snapshot(
... 'v1.0',
... 'clean-export',
... include_snapshot_metadata=False
... )
Caching
Methods for managing the local cache (for remote bundles).
datafolio.DataFolio.cache_status(item_name=None)
Get cache status for an item or entire bundle.
Parameters:
-
item_name(Optional[str], default:None) –Name of item to check. If None, returns overall cache stats.
Returns:
-
Optional[Dict[str, Any]]–Dict with cache status information, or None if caching not enabled or item not found.
-
Optional[Dict[str, Any]]–For specific items:
- cached: Whether item is cached
- cache_path: Path to cached file
- size_bytes: Size of cached file
- cached_at: Timestamp when cached
- last_accessed: Last access timestamp
- access_count: Number of times accessed
- ttl_remaining: Seconds until cache expires (None if no TTL)
-
Optional[Dict[str, Any]]–For bundle-level (item_name=None):
- bundle_path: Original bundle path
- cache_dir: Cache directory path
- ttl_seconds: TTL in seconds
- cache_hits: Number of cache hits
- cache_misses: Number of cache misses
- cache_hit_rate: Hit rate (0.0-1.0)
Examples:
Check if a specific item is cached:
>>> status = folio.cache_status('my_table')
>>> if status and status['cached']:
... print(f"Cache expires in {status['ttl_remaining']} seconds")
Get overall cache statistics:
>>> stats = folio.cache_status()
>>> print(f"Cache hit rate: {stats['cache_hit_rate']:.1%}")
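The bundle-level stats make it easy to spot a cold or thrashing cache. A sketch deriving the hit rate from a hypothetical stats dict (cache_hit_rate is already provided in the real return value; deriving it here just shows the relationship):

```python
# Stand-in for: stats = folio.cache_status()
stats = {'cache_hits': 18, 'cache_misses': 2}

total = stats['cache_hits'] + stats['cache_misses']
hit_rate = stats['cache_hits'] / total if total else 0.0  # guard against an empty cache
print(f"{stats['cache_hits']}/{total} hits ({hit_rate:.1%})")  # 18/20 hits (90.0%)
```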
datafolio.DataFolio.clear_cache(item_name=None)
Clear cached files for an item, or the entire cache if item_name is None.
Parameters:
-
item_name(Optional[str], default:None) –Name of item to clear. If None, clears the whole cache.
datafolio.DataFolio.invalidate_cache(item_name)
Invalidate cache for an item without deleting the file.
This marks the cached item as invalid, forcing a re-fetch on next access, but keeps the file on disk (useful for stale cache fallback).
Parameters:
-
item_name(str) –Name of item to invalidate
Examples:
Force re-download on next access:
>>> folio.invalidate_cache('my_table')
>>> table = folio.get_table('my_table') # Will re-download
datafolio.DataFolio.refresh_cache(item_name)
Refresh cache for an item by re-downloading from remote.
This is equivalent to invalidating and then fetching the item.
Parameters:
-
item_name(str) –Name of item to refresh
Raises:
-
ValueError–If item doesn't exist in bundle
-
RuntimeError–If caching is not enabled
Examples:
>>> folio.refresh_cache('my_table')
Bundle Management
Methods for managing the DataFolio bundle itself.
datafolio.DataFolio.refresh()
Explicitly refresh manifests from disk/cloud.
This reloads items.json and metadata.json from the bundle directory, syncing the in-memory state with any external updates.
Useful when working with multiple DataFolio instances pointing to the same bundle, or when the bundle is updated by another process.
Returns:
-
Self–Self for method chaining
Examples:
Explicit refresh after external update:
>>> folio1 = DataFolio('experiments/shared')
>>> folio2 = DataFolio('experiments/shared')
>>> folio1.add_table('results', df)
>>> folio2.refresh() # Manually sync
>>> assert 'results' in folio2.list_contents()['included_tables']
Auto-refresh (happens automatically):
>>> folio1.add_table('results', df)
>>> # folio2 auto-refreshes on next read operation
>>> assert 'results' in folio2.list_contents()['included_tables']
Properties
Useful properties for accessing bundle information and items.
Core Properties
path
The path to the DataFolio bundle.
print(folio.path) # e.g., 'gs://my-bucket/my-bundle' or '/local/path/bundle'
metadata
Bundle-level metadata dictionary.
print(folio.metadata) # e.g., {'project': 'analysis', 'version': '1.0'}
items
Dictionary of all items in the bundle with their metadata.
print(folio.items) # e.g., {'table1': {...}, 'model1': {...}}
Item Lists
tables
List of all table names in the bundle.
print(folio.tables) # e.g., ['results', 'metadata', 'analysis']
models
List of all model names in the bundle.
print(folio.models) # e.g., ['classifier', 'regressor']
artifacts
List of all artifact names in the bundle.
print(folio.artifacts) # e.g., ['config.yaml', 'results.png']
Data Accessor
data
Accessor for convenient data retrieval with autocomplete support.
df = folio.data.my_table # Equivalent to folio.get_table('my_table')
model = folio.data.my_model # Equivalent to folio.get_model('my_model')
Status Properties
read_only
Whether the bundle is in read-only mode.
print(folio.read_only) # True or False
in_snapshot_mode
Whether the bundle was loaded from a snapshot.
print(folio.in_snapshot_mode) # True or False
loaded_snapshot
Name of the snapshot this bundle was loaded from (if any).
print(folio.loaded_snapshot) # e.g., 'v1.0' or None
Method Categories Summary
| Category | Methods |
|---|---|
| Adding Data | add_table(), add_numpy(), add_json(), add_timestamp(), add_data(), reference_table() |
| Adding Models | add_sklearn(), add_model() |
| Adding Artifacts | add_artifact() |
| Retrieving Data | get_table(), get_table_path(), get_numpy(), get_numpy_path(), get_json(), get_json_path(), get_timestamp(), get_timestamp_path(), get_data(), get_data_path(), get_item_path() |
| Retrieving Models | get_sklearn(), get_model(), get_model_path() |
| Retrieving Artifacts | get_artifact_path() |
| Inspecting Items | list_contents(), get_table_info(), get_model_info(), get_artifact_info(), describe() |
| Managing Items | delete(), copy(), validate(), is_valid() |
| Lineage | get_inputs(), get_dependents(), get_lineage_graph() |
| Snapshots | create_snapshot(), list_snapshots(), delete_snapshot(), load_snapshot(), get_snapshot(), get_snapshot_info(), compare_snapshots(), diff_from_snapshot(), restore_snapshot(), export_snapshot() |
| Caching | cache_status(), clear_cache(), invalidate_cache(), refresh_cache() |
| Bundle Management | refresh() |
Quick Examples
Basic Usage
import datafolio
import pandas as pd
# Create a new DataFolio
folio = datafolio.DataFolio('my_analysis')
# Add data
df = pd.DataFrame({'x': [1, 2, 3], 'y': [4, 5, 6]})
folio.add_table('results', df, description='Experimental results')
# Retrieve data
df_loaded = folio.get_table('results')
# List contents
print(folio.list_contents())
print(folio.tables) # Property access
# Use data accessor with autocomplete
df_via_accessor = folio.data.results
With Caching
# Enable caching for remote bundles
folio = datafolio.DataFolio(
'gs://my-bucket/my-bundle',
cache_enabled=True,
cache_dir='/tmp/my-cache'
)
# First access downloads and caches
df = folio.get_table('large_table') # Downloads from cloud
# Second access uses cache (much faster!)
df = folio.get_table('large_table') # Reads from local cache
# Check cache statistics
status = folio.cache_status()
print(f"Cache hits: {status['cache_hits']}")
print(f"Cache misses: {status['cache_misses']}")
With Snapshots
# Create a read-only snapshot
folio.create_snapshot('v1.0', description='Release 1.0')
# Load a snapshot (always read-only)
folio_snapshot = folio.get_snapshot('v1.0')
# List all snapshots
snapshots = folio.list_snapshots()
for snap in snapshots:
print(f"{snap['name']}: {snap['description']}")
Lineage Tracking
# Add data with lineage
folio.add_table('raw_data', raw_df)
folio.add_table('processed_data', processed_df, inputs=['raw_data'])
folio.add_model('trained_model', model, inputs=['processed_data'])
# Query lineage
inputs = folio.get_inputs('trained_model') # ['processed_data']
dependents = folio.get_dependents('raw_data') # ['processed_data']
# Get full lineage graph
graph = folio.get_lineage_graph()
print(graph) # Shows dependency relationships
Sharing Paths with Collaborators
# For a cloud-hosted folio, get the direct path to any item
folio = datafolio.DataFolio('s3://my-bucket/experiments/run-42')
# Type-specific path methods (recommended):
path = folio.get_table_path('results')
# → 's3://my-bucket/experiments/run-42/tables/results.parquet'
path = folio.get_model_path('classifier')
# → 's3://my-bucket/experiments/run-42/models/classifier.joblib'
# Generic path getter (dispatches to the appropriate method automatically):
path = folio.get_data_path('results') # same as get_table_path for tables
path = folio.get_item_path('results') # lower-level, skips type-specific logic
# Share with a colleague who doesn't use datafolio:
# import pandas as pd; pd.read_parquet('s3://my-bucket/.../results.parquet')
# Or browse all paths at once with describe()
folio.describe(show_paths=True)
# Tables (2):
# • raw_data (reference): Input dataset
# ↳ path: s3://data-lake/raw.parquet
# • results: Model results
# ↳ path: s3://my-bucket/experiments/run-42/tables/results.parquet
# Models (1):
# • classifier: Trained model
# ↳ path: s3://my-bucket/experiments/run-42/models/classifier.joblib