DataFolio Class - Complete API Reference

This page provides a comprehensive reference of all methods available on the DataFolio class, organized by functionality.

Creating a DataFolio

datafolio.DataFolio.__init__(path, metadata=None, random_suffix=False, read_only=False, cache_enabled=False, cache_dir=None, cache_ttl=None, use_https=False)

Initialize a new DataFolio or open an existing one.

If the directory doesn't exist, creates a new bundle. If it exists, opens the existing bundle and reads manifests.

Parameters:

  • path (Union[str, Path]) –

    Full path to bundle directory (local or cloud)

  • metadata (Optional[Dict[str, Any]], default: None ) –

    Optional dictionary of analysis metadata (for new bundles)

  • random_suffix (bool, default: False ) –

    If True, append random suffix to bundle name (default: False)

  • read_only (bool, default: False ) –

    If True, prevent all write operations (default: False)

  • cache_enabled (bool, default: False ) –

    If True, enable local caching for remote data (default: False)

  • cache_dir (Optional[Union[str, Path]], default: None ) –

    Optional cache directory (default: ~/.datafolio_cache)

  • cache_ttl (Optional[int], default: None ) –

    Optional TTL override in seconds (default: 1800 = 30 minutes)

  • use_https (bool, default: False ) –

    If True, use HTTPS URLs for CloudFiles (for read-only access to public buckets) (default: False)

Examples:

Create new bundle with exact name:

>>> folio = DataFolio('experiments/protein-analysis')
# Creates: experiments/protein-analysis/

Create new bundle with random suffix:

>>> folio = DataFolio(
...     'experiments/protein-analysis',
...     random_suffix=True
... )
# Creates: experiments/protein-analysis-blue-happy-falcon/

Open existing bundle:

>>> folio = DataFolio('experiments/protein-analysis')

With metadata:

>>> folio = DataFolio(
...     'experiments/my-exp',
...     metadata={'date': '2024-01-15', 'scientist': 'Dr. Smith'}
... )

Open existing bundle as read-only (for safe inspection):

>>> folio = DataFolio('experiments/production-model', read_only=True)
>>> model = folio.get_model('classifier')  # OK
>>> folio.add_table('new', df)  # Error: read-only

Enable caching for cloud bundles (faster repeated access):

>>> folio = DataFolio('gs://bucket/experiment', cache_enabled=True)
>>> df = folio.get_table('data')  # Downloads and caches
>>> df = folio.get_table('data')  # Loads from cache (instant)

Custom cache configuration:

>>> folio = DataFolio(
...     'gs://bucket/experiment',
...     cache_enabled=True,
...     cache_dir='/mnt/shared/cache',
...     cache_ttl=3600  # 1 hour
... )

Adding Data

Methods for adding different types of data to a DataFolio.

Tables (DataFrames)

datafolio.DataFolio.add_table(name, data, description=None, overwrite=False, inputs=None, models=None, code=None)

Add a table to be included in the bundle.

Writes immediately to tables/ directory and updates items.json.

Parameters:

  • name (str) –

    Unique name for this table

  • data (Any) –

    pandas or Polars DataFrame to include

  • description (Optional[str], default: None ) –

    Optional description

  • overwrite (bool, default: False ) –

    If True, allow overwriting existing table (default: False)

  • inputs (Optional[list[str]], default: None ) –

    Optional list of table names used to create this table

  • models (Optional[list[str]], default: None ) –

    Optional list of model names used to create this table

  • code (Optional[str], default: None ) –

    Optional code snippet that created this table

Returns:

  • Self

    Self for method chaining

Raises:

  • ValueError

    If name already exists and overwrite=False

  • TypeError

    If data is not a DataFrame

Examples:

>>> import pandas as pd
>>> folio = DataFolio('experiments/test')
>>> df = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]})
>>> folio.add_table('summary', df)
>>> # With lineage
>>> pred_df = pd.DataFrame({'pred': [0, 1, 0]})
>>> folio.add_table('predictions', pred_df,
...     inputs=['test_data'],
...     models=['classifier'],
...     code='pred = model.predict(X_test)')
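
Because add_table() returns self, several additions can be chained in one expression. A minimal sketch, assuming train_df and test_df are DataFrames defined elsewhere:

>>> (folio
...     .add_table('train', train_df, description='Training split')
...     .add_table('test', test_df, description='Test split'))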

datafolio.DataFolio.reference_table(name, path, table_format='parquet', num_rows=None, version=None, description=None, inputs=None, code=None)

Add a reference to an external table (not copied to bundle).

Writes immediately to items.json.

Parameters:

  • name (str) –

    Unique name for this table

  • path (Union[str, Path]) –

    Path to the table (local or cloud)

  • table_format (str, default: 'parquet' ) –

    Format of the table ('parquet', 'delta', 'csv')

  • num_rows (Optional[int], default: None ) –

    Optional number of rows

  • version (Optional[int], default: None ) –

    Optional version number (for Delta tables)

  • description (Optional[str], default: None ) –

    Optional description

  • inputs (Optional[list[str]], default: None ) –

    Optional list of items this was derived from

  • code (Optional[str], default: None ) –

    Optional code snippet that created this

Returns:

  • Self

    Self for method chaining

Raises:

  • ValueError

    If name already exists or format is invalid

Examples:

>>> folio = DataFolio('experiments/test')
>>> folio.reference_table(
...     'raw_data',
...     path='s3://bucket/data.parquet',
...     table_format='parquet',
...     num_rows=1_000_000
... )

Arrays

datafolio.DataFolio.add_numpy(name, array, description=None, overwrite=False, inputs=None, code=None)

Add a numpy array to the bundle.

Saves array to artifacts/ directory as .npy file and updates items.json.

Parameters:

  • name (str) –

    Unique name for this array

  • array (Any) –

    numpy array to save

  • description (Optional[str], default: None ) –

    Optional description

  • overwrite (bool, default: False ) –

    If True, allow overwriting existing array (default: False)

  • inputs (Optional[list[str]], default: None ) –

    Optional list of items this was derived from

  • code (Optional[str], default: None ) –

    Optional code snippet that created this array

Returns:

  • Self

    Self for method chaining

Examples:

>>> import numpy as np
>>> folio = DataFolio('experiments/test')
>>> embeddings = np.random.randn(100, 128)
>>> folio.add_numpy('embeddings', embeddings, description='Model embeddings')
>>> # With lineage
>>> predictions = np.array([0, 1, 0, 1])
>>> folio.add_numpy('predictions', predictions,
...     inputs=['test_data'],
...     code='predictions = model.predict(X)')

JSON Data

datafolio.DataFolio.add_json(name, data, description=None, overwrite=False, inputs=None, code=None)

Add JSON-serializable data to the bundle.

Saves data to artifacts/ directory as .json file and updates items.json. Supports dicts, lists, scalars, and other JSON-serializable types.

Parameters:

  • name (str) –

    Unique name for this data

  • data (Union[dict, list, int, float, str, bool, None]) –

    JSON-serializable data (dict, list, scalar, etc.)

  • description (Optional[str], default: None ) –

    Optional description

  • overwrite (bool, default: False ) –

    If True, allow overwriting existing data (default: False)

  • inputs (Optional[list[str]], default: None ) –

    Optional list of items this was derived from

  • code (Optional[str], default: None ) –

    Optional code snippet that created this data

Returns:

  • Self

    Self for method chaining

Raises:

  • ValueError

    If name already exists and overwrite=False, or data not JSON-serializable

  • TypeError

    If data cannot be serialized to JSON

Examples:

>>> folio = DataFolio('experiments/test')
>>> config = {'learning_rate': 0.01, 'batch_size': 32}
>>> folio.add_json('config', config, description='Model config')
>>> # With list data
>>> class_names = ['cat', 'dog', 'bird']
>>> folio.add_json('classes', class_names)
>>> # With scalar
>>> folio.add_json('best_accuracy', 0.95)

Timestamps

datafolio.DataFolio.add_timestamp(name, timestamp, description=None, overwrite=False, inputs=None, code=None)

Add a timestamp to the bundle.

Saves timestamp to artifacts/ directory as .json file and updates items.json. Accepts timezone-aware datetime objects or Unix timestamps (int/float). All timestamps are stored in UTC as ISO 8601 strings.

Parameters:

  • name (str) –

    Unique name for this timestamp

  • timestamp (Union[datetime, int, float]) –

    Timezone-aware datetime object or Unix timestamp (int/float). Naive datetimes will raise ValueError.

  • description (Optional[str], default: None ) –

    Optional description

  • overwrite (bool, default: False ) –

    If True, allow overwriting existing timestamp (default: False)

  • inputs (Optional[list[str]], default: None ) –

    Optional list of items this was derived from

  • code (Optional[str], default: None ) –

    Optional code snippet that created this timestamp

Returns:

  • Self

    Self for method chaining

Raises:

  • ValueError

    If name already exists and overwrite=False, or if datetime is naive

  • TypeError

    If timestamp is not a datetime or numeric type

Examples:

>>> from datetime import datetime, timezone
>>> folio = DataFolio('experiments/test')
>>>
>>> # Add timezone-aware datetime
>>> event_time = datetime(2024, 1, 15, 10, 30, 0, tzinfo=timezone.utc)
>>> folio.add_timestamp('event_time', event_time, description='Event occurred')
>>>
>>> # Add Unix timestamp
>>> folio.add_timestamp('start_time', 1705318200, description='Start time')
>>>
>>> # With lineage
>>> from datetime import datetime, timezone
>>> import pytz
>>> eastern = pytz.timezone('US/Eastern')
>>> local_time = eastern.localize(datetime(2024, 1, 15, 10, 30, 0))
>>> folio.add_timestamp('local_event', local_time,
...     inputs=['event_log'],
...     code='timestamp = event_log.iloc[0]["timestamp"]')

Generic Data

datafolio.DataFolio.add_data(name, data=None, reference=None, description=None, **kwargs)

Generic data addition with automatic type detection.

Convenience method that dispatches to the appropriate specific method based on data type. For fine-grained control, use the specific methods: add_table(), add_numpy(), add_json(), or reference_table().

Parameters:

  • name (str) –

    Unique name for this data

  • data (Any, default: None ) –

    Data to save (DataFrame, numpy array, dict, list, scalar)

  • reference (Optional[Union[str, Path]], default: None ) –

    If provided, creates a reference to external data instead

  • description (Optional[str], default: None ) –

    Optional description

  • **kwargs

    Additional arguments passed to the specific method

Returns:

  • Self

    Self for method chaining

Raises:

  • ValueError

    If neither data nor reference is provided, or both are provided

  • TypeError

    If data type is not supported

Examples:

DataFrame (saves as parquet):

>>> folio.add_data('results', df)

Numpy array (saves as .npy):

>>> folio.add_data('embeddings', np.array([1, 2, 3]))

JSON data (saves as .json):

>>> folio.add_data('config', {'lr': 0.01})
>>> folio.add_data('classes', ['cat', 'dog'])
>>> folio.add_data('accuracy', 0.95)

External reference:

>>> folio.add_data('raw', reference='s3://bucket/data.parquet')

Adding Models

Methods for saving machine learning models.

Scikit-learn Models

datafolio.DataFolio.add_sklearn(name, model, description=None, overwrite=False, inputs=None, hyperparameters=None, code=None, custom=False)

Add a scikit-learn style model to the bundle.

Writes immediately to models/ directory and updates items.json.

Parameters:

  • name (str) –

    Unique name for this model

  • model (Any) –

    Trained model to include

  • description (Optional[str], default: None ) –

    Optional description

  • overwrite (bool, default: False ) –

    If True, allow overwriting existing model (default: False)

  • inputs (Optional[list[str]], default: None ) –

    Optional list of table names used for training

  • hyperparameters (Optional[Dict[str, Any]], default: None ) –

    Optional dict of hyperparameters

  • code (Optional[str], default: None ) –

    Optional code snippet that trained this model

  • custom (bool, default: False ) –

    If True, use skops format for portable pipelines with custom transformers. If False (default), use joblib format.

Returns:

  • Self

    Self for method chaining

Raises:

  • ValueError

    If name already exists and overwrite=False

Examples:

>>> from sklearn.ensemble import RandomForestClassifier
>>> folio = DataFolio('experiments/test')
>>> model = RandomForestClassifier(n_estimators=100, max_depth=10)
>>> # ... train model ...
>>> folio.add_sklearn('classifier', model,
...     description='Random forest classifier',
...     inputs=['training_data', 'validation_data'],
...     hyperparameters={'n_estimators': 100, 'max_depth': 10},
...     code='model.fit(X_train, y_train)')
>>>
>>> # Portable pipeline with custom transformer (skops)
>>> folio.add_sklearn('pipeline', custom_pipeline, custom=True)

datafolio.DataFolio.add_model(name, model, description=None, overwrite=False, custom=False, **kwargs)

Add a scikit-learn style model to the bundle.

This is a convenience method that delegates to add_sklearn().

Parameters:

  • name (str) –

    Unique name for this model

  • model (Any) –

    Trained sklearn-style model

  • description (Optional[str], default: None ) –

    Optional description

  • overwrite (bool, default: False ) –

    If True, allow overwriting existing model (default: False)

  • custom (bool, default: False ) –

    If True, use skops format for portability (required for custom transformers)

  • **kwargs

    Additional arguments passed to add_sklearn() (e.g., hyperparameters, inputs, code)

Returns:

  • Self

    Self for method chaining

Raises:

  • ValueError

    If name already exists and overwrite=False

Examples:

>>> from sklearn.ensemble import RandomForestClassifier
>>> model = RandomForestClassifier()
>>> folio.add_model('clf', model, hyperparameters={'n_estimators': 100})

With custom transformer (portable):

>>> folio.add_model('pipeline', custom_pipeline, custom=True)

Adding Artifacts

Methods for adding arbitrary files and artifacts.

datafolio.DataFolio.add_artifact(name, path, category=None, description=None, overwrite=False)

Add an artifact file to the bundle.

Copies file immediately to artifacts/ directory and updates items.json.

Parameters:

  • name (str) –

    Unique name for this artifact

  • path (Union[str, Path]) –

    Path to the file to include

  • category (Optional[str], default: None ) –

    Optional category ('plots', 'configs', etc.)

  • description (Optional[str], default: None ) –

    Optional description

  • overwrite (bool, default: False ) –

    If True, allow overwriting existing artifact (default: False)

Returns:

  • Self

    Self for method chaining

Examples:

>>> folio = DataFolio('experiments/test')
>>> folio.add_artifact('loss_curve', 'plots/training_loss.png', category='plots')
>>> # Update with overwrite
>>> folio.add_artifact('loss_curve', 'plots/updated_loss.png', category='plots', overwrite=True)

Retrieving Data

Methods for loading data from a DataFolio.

Tables (DataFrames)

datafolio.DataFolio.get_table(name, **kwargs)

Get a table by name (works for both included and referenced).

For included tables, reads from bundle directory. For referenced tables, reads from the specified external path. If caching is enabled (cache_enabled=True), cloud-based tables are cached locally for faster repeated access.

Supports all pandas.read_parquet() arguments for filtering and optimization:

  - columns: List of column names to read (column pruning)
  - filters: Row filtering predicates (row filtering)
  - engine: Parquet engine ('pyarrow' or 'fastparquet')

Parameters:

  • name (str) –

    Name of the table

  • **kwargs

    Additional arguments passed to pd.read_parquet() (e.g., columns, filters, engine)

Returns:

  • Any

    pandas DataFrame

Examples:

Basic usage:

>>> folio = DataFolio('experiments/test')
>>> import pandas as pd
>>> df = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]})
>>> folio.add_table('test', df)
>>> retrieved = folio.get_table('test')
>>> assert len(retrieved) == 3

Column selection (read only specific columns):

>>> df_subset = folio.get_table('test', columns=['a'])
>>> assert list(df_subset.columns) == ['a']

Row filtering (requires pyarrow engine):

>>> df_filtered = folio.get_table('test',
...     filters=[('a', '>', 1)],
...     engine='pyarrow')
>>> assert len(df_filtered) == 2
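
Column pruning and row filters can be combined to minimize I/O on large or remote tables; both keywords are forwarded to pd.read_parquet(). A sketch:

>>> df_small = folio.get_table('test',
...     columns=['a'],
...     filters=[('a', '>', 1)],
...     engine='pyarrow')
>>> assert list(df_small.columns) == ['a'] and len(df_small) == 2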

datafolio.DataFolio.get_table_path(name)

Get the path to a table file, whether included in the bundle or referenced externally.

For included tables, returns the full path to the parquet file inside the bundle. For referenced tables, returns the external path recorded at reference time.

Parameters:

  • name (str) –

    Name of the table

Returns:

  • str

    Path to the table file

Examples:

>>> folio = DataFolio('experiments/my-run')
>>> folio.add_table('results', df)
>>> path = folio.get_table_path('results')
>>> print(path)
'experiments/my-run/tables/results.parquet'
>>> folio.reference_table('raw', path='s3://data-lake/raw.parquet')
>>> folio.get_table_path('raw')
's3://data-lake/raw.parquet'

Arrays

datafolio.DataFolio.get_numpy(name)

Get a numpy array by name.

If caching is enabled, cloud-based arrays are cached locally for faster repeated access.

Parameters:

  • name (str) –

    Name of the array

Returns:

  • Any

    numpy array

Examples:

>>> folio = DataFolio('experiments/test')
>>> embeddings = folio.get_numpy('embeddings')
>>> print(embeddings.shape)

datafolio.DataFolio.get_numpy_path(name)

Get the path to a numpy array file stored in the bundle.

Parameters:

  • name (str) –

    Name of the array

Returns:

  • str

    Path to the .npy file

Raises:

  • KeyError

    If array name doesn't exist

  • ValueError

    If named item is not a numpy array

Examples:

>>> folio = DataFolio('experiments/my-run')
>>> folio.add_numpy('embeddings', arr)
>>> path = folio.get_numpy_path('embeddings')
>>> print(path)
'experiments/my-run/artifacts/embeddings.npy'

JSON Data

datafolio.DataFolio.get_json(name)

Get JSON data by name.

If caching is enabled, cloud-based JSON data is cached locally for faster repeated access.

Parameters:

  • name (str) –

    Name of the JSON data

Returns:

  • Any

    Deserialized JSON data (dict, list, scalar, etc.)

Examples:

>>> folio = DataFolio('experiments/test')
>>> config = folio.get_json('config')
>>> print(config['learning_rate'])

datafolio.DataFolio.get_json_path(name)

Get the path to a JSON data file stored in the bundle.

Parameters:

  • name (str) –

    Name of the JSON data

Returns:

  • str

    Path to the .json file

Examples:

>>> folio = DataFolio('experiments/my-run')
>>> folio.add_json('config', {'lr': 0.01})
>>> path = folio.get_json_path('config')
>>> print(path)
'experiments/my-run/artifacts/config.json'

Timestamps

datafolio.DataFolio.get_timestamp(name, as_unix=False)

Get a timestamp by name.

If caching is enabled, cloud-based timestamps are cached locally for faster repeated access.

Parameters:

  • name (str) –

    Name of the timestamp

  • as_unix (bool, default: False ) –

    If True, return Unix timestamp (float); if False, return datetime (default)

Returns:

  • Union[datetime, float]

    UTC-aware datetime object (default) or Unix timestamp (if as_unix=True)

Raises:

  • KeyError

    If timestamp name doesn't exist

  • ValueError

    If named item is not a timestamp

Examples:

>>> folio = DataFolio('experiments/test')
>>>
>>> # Get as datetime (default)
>>> event_time = folio.get_timestamp('event_time')
>>> print(event_time.isoformat())
'2024-01-15T10:30:00+00:00'
>>>
>>> # Get as Unix timestamp
>>> unix_time = folio.get_timestamp('event_time', as_unix=True)
>>> print(unix_time)
1705318200.0

datafolio.DataFolio.get_timestamp_path(name)

Get the path to a timestamp file stored in the bundle.

Parameters:

  • name (str) –

    Name of the timestamp

Returns:

  • str

    Path to the timestamp file

Raises:

  • KeyError

    If timestamp name doesn't exist

  • ValueError

    If named item is not a timestamp

Examples:

>>> folio = DataFolio('experiments/my-run')
>>> folio.add_timestamp('event_time', dt)
>>> path = folio.get_timestamp_path('event_time')
>>> print(path)
'experiments/my-run/artifacts/event_time.json'

Generic Data

datafolio.DataFolio.get_data(name)

Generic data getter that returns any data type.

Automatically detects the item type and calls the appropriate getter. For fine-grained control, use the specific methods: get_table(), get_numpy(), or get_json().

Parameters:

  • name (str) –

    Name of the data item

Returns:

  • Any

    The data (DataFrame, numpy array, dict, list, or scalar)

Raises:

  • KeyError

    If item name doesn't exist

  • ValueError

    If item is not a data type (e.g., is a model or artifact)

Examples:

>>> folio.add_data('results', df)
>>> folio.add_data('embeddings', np_array)
>>> folio.add_data('config', {'lr': 0.01})
>>> # Later, retrieve without knowing the type
>>> results = folio.get_data('results')  # Returns DataFrame
>>> embeddings = folio.get_data('embeddings')  # Returns numpy array
>>> config = folio.get_data('config')  # Returns dict

datafolio.DataFolio.get_data_path(name)

Get the path to any stored item, delegating to the appropriate type-specific method.

Automatically detects the item type and calls the appropriate path getter:

  - Tables (included or referenced): delegates to get_table_path()
  - Artifacts: delegates to get_artifact_path()
  - All other bundled items (numpy arrays, JSON, timestamps): returns the bundle file path

Parameters:

  • name (str) –

    Name of the item

Returns:

  • str

    Path to the item file

Raises:

  • KeyError

    If item name doesn't exist

Examples:

>>> folio = DataFolio('experiments/test')
>>> folio.add_table('results', df)
>>> folio.get_data_path('results')  # returns path to parquet file
>>> folio.reference_table('data', path='s3://bucket/file.parquet')
>>> folio.get_data_path('data')  # returns 's3://bucket/file.parquet'

datafolio.DataFolio.get_item_path(name)

Get the path to any item stored in the folio.

For items stored within the bundle (included tables, models, artifacts, arrays, JSON data, timestamps), returns the full path to the data file. For referenced tables, returns the external path recorded at reference time.

This is especially useful for cloud-hosted folios where collaborators can directly access or download underlying files without using datafolio.

Parameters:

  • name (str) –

    Name of the item

Returns:

  • str

    Full path to the item's data file. For cloud folios this will be a cloud URI (e.g. s3://bucket/.../results.parquet); for local folios this will be an absolute file-system path.

Raises:

  • KeyError

    If item name doesn't exist

  • ValueError

    If item has no associated file path

Examples:

>>> folio = DataFolio('s3://bucket/experiments/my-run')
>>> path = folio.get_item_path('results')
>>> print(path)
's3://bucket/experiments/my-run/tables/results.parquet'
>>> path = folio.get_item_path('classifier')
>>> print(path)
's3://bucket/experiments/my-run/models/classifier.joblib'
>>> # For referenced tables the external path is returned
>>> folio.reference_table('raw', path='s3://data-lake/raw.parquet')
>>> folio.get_item_path('raw')
's3://data-lake/raw.parquet'

Retrieving Models

Methods for loading machine learning models.

Scikit-learn Models

datafolio.DataFolio.get_sklearn(name)

Get a scikit-learn style model by name.

If caching is enabled (cache_enabled=True), cloud-based models are cached locally for faster repeated access.

Parameters:

  • name (str) –

    Name of the model

Returns:

  • Any

    The model object

Raises:

  • KeyError

    If model name doesn't exist

  • ValueError

    If named item is not a sklearn model

Examples:

>>> folio = DataFolio('experiments/test')
>>> model = folio.get_sklearn('classifier')

datafolio.DataFolio.get_model(name, **kwargs)

Get a scikit-learn style model by name.

This is a convenience method that delegates to get_sklearn().

If caching is enabled (cache_enabled=True), cloud-based models are cached locally for faster repeated access.

Parameters:

  • name (str) –

    Name of the model

  • **kwargs

    Additional arguments (currently unused, kept for backward compatibility)

Returns:

  • Any

    The model object

Examples:

>>> folio = DataFolio('experiments/test')
>>> model = folio.get_model('classifier')

datafolio.DataFolio.get_model_path(name)

Get the path to a model file stored in the bundle.

Parameters:

  • name (str) –

    Name of the model

Returns:

  • str

    Path to the model file

Examples:

>>> folio = DataFolio('experiments/my-run')
>>> folio.add_sklearn('classifier', model)
>>> path = folio.get_model_path('classifier')
>>> print(path)
'experiments/my-run/models/classifier.joblib'

Retrieving Artifacts

datafolio.DataFolio.get_artifact_path(name)

Get the path to an artifact file.

Parameters:

  • name (str) –

    Name of the artifact

Returns:

  • str

    Path to the artifact file

Raises:

  • KeyError

    If artifact name doesn't exist

  • ValueError

    If named item is not an artifact

Examples:

>>> folio = DataFolio('experiments/test-blue-happy-falcon')
>>> path = folio.get_artifact_path('plot')

Inspecting Items

Methods for getting information about items.

datafolio.DataFolio.list_contents(include_archived=False)

List all contents in the DataFolio.

Parameters:

  • include_archived (bool, default: False ) –

    If True, include archived (hidden) items in the results. Defaults to False so archived items are hidden from normal views.

Returns:

  • Dict[str, list[str]]

    Dictionary with keys 'referenced_tables', 'included_tables', 'numpy_arrays', 'json_data', 'timestamps', 'models', and 'artifacts', each containing a list of names

Examples:

>>> folio = DataFolio('experiments/test')
>>> folio.reference_table('data1', path='s3://bucket/data.parquet')
>>> folio.add_numpy('embeddings', np.array([1, 2, 3]))
>>> folio.list_contents()
{'referenced_tables': ['data1'], 'included_tables': [], 'numpy_arrays': ['embeddings'],
 'json_data': [], 'timestamps': [], 'models': [], 'artifacts': []}
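
Since the result is a plain dict of lists, a quick inventory is one loop away. A sketch:

>>> for category, names in folio.list_contents().items():
...     if names:
...         print(f"{category}: {names}")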

datafolio.DataFolio.get_table_info(name)

Get metadata about a table (referenced or included).

Returns the manifest entry containing information like:

  - For referenced tables: path, table_format, is_directory, num_rows, version, description
  - For included tables: filename, table_format, is_directory, num_rows, num_cols, columns, dtypes, description

Parameters:

  • name (str) –

    Name of the table

Returns:

  • Union[TableReference, IncludedTable]

    Dictionary with table metadata

Raises:

  • KeyError

    If table name doesn't exist

Examples:

>>> folio = DataFolio('experiments/test')
>>> folio.reference_table('data', path='s3://bucket/data.parquet', num_rows=1000000)
>>> info = folio.get_table_info('data')
>>> info['num_rows']
1000000
>>> info['table_format']
'parquet'

datafolio.DataFolio.get_model_info(name)

Get metadata about a model.

Returns the manifest entry containing information like filename, item_type, and description.

Parameters:

  • name (str) –

    Name of the model

Returns:

  • IncludedItem

    Dictionary with model metadata

Examples:

>>> folio = DataFolio('experiments/test')
>>> folio.add_model('classifier', model, description='Random forest classifier')
>>> info = folio.get_model_info('classifier')
>>> info['description']
'Random forest classifier'

datafolio.DataFolio.get_artifact_info(name)

Get metadata about an artifact.

Returns the manifest entry containing information like filename, item_type, category, and description.

Parameters:

  • name (str) –

    Name of the artifact

Returns:

  • IncludedItem

    Dictionary with artifact metadata

Raises:

  • KeyError

    If artifact name doesn't exist

  • ValueError

    If named item is not an artifact

Examples:

>>> folio = DataFolio('experiments/test')
>>> folio.add_artifact('plot', 'plot.png', category='plots', description='Loss curve')
>>> info = folio.get_artifact_info('plot')
>>> info['category']
'plots'
>>> info['description']
'Loss curve'

datafolio.DataFolio.describe(pattern=None, return_string=False, show_empty=False, max_metadata_fields=10, snapshot=None, include_archived=False, show_paths=False)

Generate a human-readable description of all items in the bundle.

Includes lineage information showing inputs and dependencies.

Parameters:

  • pattern (Optional[str], default: None ) –

    Optional glob pattern to filter items by name (e.g. 'examples/*', '*/weights'). Uses fnmatch rules; '*' matches any characters including '/'.

  • return_string (bool, default: False ) –

    If True, return as string instead of printing

  • show_empty (bool, default: False ) –

    If True, show empty sections

  • max_metadata_fields (int, default: 10 ) –

    Maximum metadata fields to show

  • snapshot (Optional[str], default: None ) –

    Optional snapshot name to describe instead of the full bundle

  • include_archived (bool, default: False ) –

    If True, show archived (hidden) items. Defaults to False.

  • show_paths (bool, default: False ) –

    If True, show the file path for each item. Especially useful for cloud-hosted folios where paths can be shared with collaborators who don't use datafolio.

Returns:

  • Optional[str]

    None if return_string=False, otherwise the description string

Examples:

>>> folio.describe()  # Show full bundle
>>> folio.describe('examples/*')  # Show only items under 'examples/'
>>> folio.describe(snapshot='v1.0')  # Show specific snapshot
>>> folio.describe(include_archived=True)  # Show archived items too
>>> folio.describe(show_paths=True)  # Show file paths for sharing

See DisplayFormatter.describe() for full documentation.


Managing Items

Deleting Items

datafolio.DataFolio.delete(name, warn_dependents=True)

Delete one or more items from the DataFolio.

Removes items from the manifest and deletes associated files. Does not enforce lineage; items that other items depend on can still be deleted.

Parameters:

  • name (Union[str, list[str]]) –

    Name(s) of item(s) to delete (string or list of strings)

  • warn_dependents (bool, default: True ) –

    If True, print warning if deleted items have dependents

Returns:

  • Self

    Self for method chaining

Raises:

  • KeyError

    If any item name doesn't exist

Examples:

Delete single item:

>>> folio = DataFolio('experiments/test')
>>> folio.delete('old_model')

Delete multiple items:

>>> folio.delete(['temp_data', 'debug_plot', 'old_model'])

Delete without warnings:

>>> folio.delete('item', warn_dependents=False)

Archiving Items

datafolio.DataFolio.archive(name)

Mark item(s) as archived (hidden from default views, not deleted).

Archived items remain on disk and are still accessible via get_data() / get_table() etc., but are excluded from list_contents(), describe(), and copy() by default. Pass include_archived=True to those methods to reveal them again, or call unarchive() to restore them permanently.

Accepts a single name, a list of names, or a glob pattern (fnmatch rules, e.g. 'intermediate/*').

Parameters:

  • name (Union[str, list[str]]) –

    Item name, list of names, or glob pattern to archive.

Returns:

  • Self

    Self for method chaining

Raises:

  • KeyError

    If a specific name (non-glob) is not found

Examples:

Archive a single item:

>>> folio.archive('debug_output')

Archive multiple items:

>>> folio.archive(['debug_output', 'temp_features'])

Archive by glob pattern:

>>> folio.archive('intermediate/*')
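
Archived items stay readable; only the default views hide them. A round-trip sketch, assuming 'debug_output' was added earlier via add_json():

>>> folio.archive('debug_output')
>>> 'debug_output' in folio.list_contents()['json_data']  # False: hidden
>>> folio.get_json('debug_output')  # still accessible
>>> folio.unarchive('debug_output')
>>> 'debug_output' in folio.list_contents()['json_data']  # True again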

datafolio.DataFolio.unarchive(name)

Restore archived item(s) to active status.

Removes the archived flag so the items appear again in list_contents(), describe(), and copy() by default.

Accepts a single name, a list of names, or a glob pattern (fnmatch rules).

Parameters:

  • name (Union[str, list[str]]) –

    Item name, list of names, or glob pattern to unarchive.

Returns:

  • Self

    Self for method chaining

Raises:

  • KeyError

    If a specific name (non-glob) is not found

Examples:

Unarchive a single item:

>>> folio.unarchive('debug_output')

Unarchive multiple items:

>>> folio.unarchive(['debug_output', 'temp_features'])

Unarchive by glob pattern:

>>> folio.unarchive('intermediate/*')

Copying Bundles

datafolio.DataFolio.copy(path, name=None, metadata_updates=None, include_items=None, exclude_items=None, random_suffix=False, follow_lineage=False, include_archived=False)

Create a copy of this bundle at a new location.

Useful for creating derived experiments or checkpoints.

Parameters:

  • path (Union[str, Path]) –

    Destination path for the new bundle. Used as the exact bundle location (e.g., 'gs://bucket/experiments/my-copy').

  • name (Optional[str], default: None ) –

    If provided, appended to path as a subdirectory (e.g., path='experiments', name='exp-v2' → 'experiments/exp-v2'). If None, path is used as-is.

  • metadata_updates (Optional[Dict[str, Any]], default: None ) –

    Metadata fields to update/add in the copy

  • include_items (Optional[list[str]], default: None ) –

    If specified, only copy these items (by name)

  • exclude_items (Optional[list[str]], default: None ) –

    Items to exclude from copy (by name)

  • random_suffix (bool, default: False ) –

    If True, append random suffix to new bundle name (default: False)

  • follow_lineage (bool, default: False ) –

    If True and include_items is provided, automatically include all transitive upstream dependencies of the named items. Items referenced in lineage that are not present in this folio (e.g. external tables) are silently skipped.

  • include_archived (bool, default: False ) –

    If True, archived items are included in the copy. Defaults to False so archived items are excluded.

Returns:

  • DataFolio

    New DataFolio instance

Raises:

  • ValueError

    If include_items and exclude_items are both specified

Examples:

>>> # Copy to exact destination path
>>> folio2 = folio.copy('gs://bucket/experiments/my-copy')
>>> # Copy to base directory with explicit name subdirectory
>>> folio2 = folio.copy('experiments', name='exp-v2')
>>> # Copy with random suffix
>>> folio2 = folio.copy('experiments/exp-v2', random_suffix=True)
>>> # Copy with metadata updates to track parent
>>> folio2 = folio.copy(
...     'experiments/exp-v2',
...     metadata_updates={
...         'parent_bundle': folio._bundle_dir,
...         'changes': 'Increased max_depth to 15'
...     }
... )
>>> # Copy only specific items (e.g., for derived experiment)
>>> folio2 = folio.copy(
...     'experiments/exp-v2-tuned',
...     include_items=['training_data', 'validation_data'],
...     metadata_updates={'status': 'in_progress'}
... )
>>> # Copy only final outputs, auto-resolving all upstream deps
>>> folio2 = folio.copy(
...     'results',
...     include_items=['final_model', 'test_results'],
...     follow_lineage=True,
... )
>>> # Include archived items in the copy
>>> folio2 = folio.copy('archive_backup', include_archived=True)

Validation

datafolio.DataFolio.validate()

Validate existence and integrity of all items.

Checks that:

  1. Included items exist in the bundle
  2. Referenced items exist at their external path
  3. Checksums match (for included single files)

Returns:

  • Dict[str, bool]

    Dict mapping item names to validation status (True if valid)

Examples:

>>> status = folio.validate()
>>> if not all(status.values()):
...     print("Bundle corrupted!")

datafolio.DataFolio.is_valid()

Check if the entire bundle is valid.

Convenience method that runs validate() and returns True only if all items pass validation.

Returns:

  • bool

    True if all items are valid, False otherwise

Examples:

>>> if not folio.is_valid():
...     print("Bundle corrupted!")

Lineage and Dependencies

Methods for working with lineage tracking.

datafolio.DataFolio.get_inputs(item_name)

Get list of items that were inputs to this item.

Parameters:

  • item_name (str) –

    Name of the item

Returns:

  • list[str]

    List of item names that were inputs

Examples:

>>> folio = DataFolio('experiments/test')
>>> # After adding items with lineage...
>>> inputs = folio.get_inputs('predictions')
>>> # Returns: ['test_data', 'classifier']

datafolio.DataFolio.get_dependents(item_name)

Get list of items that depend on this item.

Parameters:

  • item_name (str) –

    Name of the item

Returns:

  • list[str]

    List of item names that use this as input

Examples:

>>> folio = DataFolio('experiments/test')
>>> # After adding items with lineage...
>>> dependents = folio.get_dependents('classifier')
>>> # Returns items that used 'classifier' as input

datafolio.DataFolio.get_lineage_graph()

Get full dependency graph for all items in bundle.

Returns:

  • Dict[str, list[str]]

    Dictionary mapping item names to their input item names

Examples:

>>> folio = DataFolio('experiments/test')
>>> graph = folio.get_lineage_graph()
>>> # Returns: {'predictions': ['test_data', 'classifier'], ...}
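
Because the graph is a plain dict mapping each item to its inputs, common questions reduce to dict operations. For example, collecting every item that lists no inputs (a sketch; assumes items without lineage appear with an empty list):

>>> graph = folio.get_lineage_graph()
>>> roots = [name for name, inputs in graph.items() if not inputs]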

Snapshots

Methods for working with snapshots (read-only copies).

datafolio.DataFolio.create_snapshot(name, description=None, tags=None, capture_git=True, capture_environment=False, capture_execution=False)

Create a named snapshot of the current bundle state.

A snapshot captures:

  - Current versions of all items (via item_versions dict)
  - Current metadata state (via metadata_snapshot dict)
  - Git repository state (commit, branch, dirty status) [optional]
  - Python environment (version, packages) [optional, off by default]
  - Execution context (entry point, working directory) [optional, off by default]

After creating a snapshot, all current items are marked as being in that snapshot. Future overwrites will trigger copy-on-write to preserve the snapshot state.

SECURITY NOTE: Environment variables (API keys, tokens, etc.) are NEVER captured. The capture_environment flag only captures Python version, platform, and package versions from uv.lock or requirements.txt.

Parameters:

  • name (str) –

    Snapshot name (filesystem-safe, no @ symbol)

  • description (Optional[str], default: None ) –

    Optional human-readable description

  • tags (Optional[list[str]], default: None ) –

    Optional list of tags for organization

  • capture_git (bool, default: True ) –

    Whether to capture git state (default: True)

  • capture_environment (bool, default: False ) –

    Whether to capture Python environment info like version and packages (default: False for security)

  • capture_execution (bool, default: False ) –

    Whether to capture execution context like entry point and working directory (default: False for security)

Returns:

  • Self

    Self for method chaining

Raises:

  • ValueError

    If snapshot name is invalid or already exists

Examples:

>>> folio = DataFolio('experiments/my-exp')
>>> folio.add_table('results', df)
>>> folio.create_snapshot('v1.0-baseline', description='Initial results')
>>>
>>> # Later, overwriting will preserve the snapshot
>>> folio.add_table('results', new_df, overwrite=True)  # Creates v2
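
A sketch of the copy-on-write behavior described above: after the overwrite, loading the snapshot (via get_snapshot(), documented below) still returns the original table:

>>> v1 = folio.get_snapshot('v1.0-baseline')
>>> original = v1.get_table('results')  # pre-overwrite version, preserved
>>> latest = folio.get_table('results')  # current (overwritten) version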

datafolio.DataFolio.list_snapshots()

List all snapshots with their metadata.

Returns:

  • list[Dict[str, Any]]

    List of snapshot metadata dicts with name, timestamp, description, tags

Examples:

>>> snapshots = folio.list_snapshots()
>>> for snap in snapshots:
...     print(f"{snap['name']}: {snap['description']}")

datafolio.DataFolio.delete_snapshot(name, cleanup_orphans=False)

Delete a snapshot.

Removes the snapshot from the registry and updates items' in_snapshots lists. Optionally cleans up orphaned item versions that are no longer referenced.

Parameters:

  • name (str) –

    Snapshot name to delete

  • cleanup_orphans (bool, default: False ) –

    If True, delete item versions no longer in any snapshot

Returns:

  • Self

    Self for method chaining

Raises:

  • KeyError

    If snapshot doesn't exist

Examples:

>>> folio.delete_snapshot('experimental-v5')
>>> folio.delete_snapshot('old-snapshot', cleanup_orphans=True)

datafolio.DataFolio.load_snapshot(bundle_dir, snapshot) classmethod

Load a DataFolio in snapshot state.

Creates a DataFolio instance configured to access items and metadata as they existed at snapshot time. Snapshots are always read-only to preserve snapshot immutability.

Parameters:

  • bundle_dir (Union[str, Path]) –

    Path to bundle directory

  • snapshot (str) –

    Snapshot name to load

Returns:

  • DataFolio

    Read-only DataFolio instance in snapshot state

Raises:

  • KeyError

    If snapshot doesn't exist

Examples:

Load snapshot for inspection:

>>> paper = DataFolio.load_snapshot('research/exp', 'paper-v1')
>>> model = paper.get_model('classifier')
>>> print(paper.metadata['accuracy'])
>>> paper.add_table('new', df)  # Error: snapshots are always read-only

Compare multiple snapshots:

>>> v1 = DataFolio.load_snapshot('path', 'v1.0')
>>> v2 = DataFolio.load_snapshot('path', 'v2.0')
>>> print(f"v1: {v1.metadata['accuracy']}, v2: {v2.metadata['accuracy']}")

datafolio.DataFolio.get_snapshot(snapshot)

Get a snapshot from this folio as a new DataFolio instance.

Convenience method for loading a snapshot when you already have a folio. Equivalent to DataFolio.load_snapshot(self._bundle_dir, snapshot). Snapshots are always read-only to preserve immutability.

Parameters:

  • snapshot (str) –

    Snapshot name to load

Returns:

  • DataFolio

    Read-only DataFolio instance in snapshot state

Raises:

  • KeyError

    If snapshot doesn't exist

Examples:

>>> folio = DataFolio('experiments/classifier')
>>> baseline = folio.get_snapshot('v1.0-baseline')
>>> assert baseline.metadata['accuracy'] == 0.89
>>> assert baseline.read_only  # Snapshots are always read-only
>>>
>>> # Compare current state to snapshot
>>> current_acc = folio.metadata['accuracy']
>>> baseline_acc = baseline.metadata['accuracy']
>>> print(f"Improvement: {current_acc - baseline_acc:.2%}")

datafolio.DataFolio.get_snapshot_info(snapshot)

Get detailed information about a snapshot.

Returns the full snapshot metadata including item versions, metadata state, git info, environment info, and execution context.

Parameters:

  • snapshot (str) –

    Snapshot name

Returns:

  • Dict[str, Any]

    Dictionary containing all snapshot metadata

Raises:

  • KeyError

    If snapshot doesn't exist

Examples:

>>> info = folio.get_snapshot_info('v1.0')
>>> print(info['description'])
'Baseline model'
>>> print(info['git']['commit'])
'a3f2b8c'
>>> print(info['metadata_snapshot']['accuracy'])
0.89

datafolio.DataFolio.compare_snapshots(snapshot1, snapshot2)

Compare two snapshots.

Returns a dictionary showing differences between the two snapshots, including:

  - added_items: Items in snapshot2 but not snapshot1
  - removed_items: Items in snapshot1 but not snapshot2
  - modified_items: Items in both but with different versions
  - shared_items: Items in both with same version
  - metadata_changes: Metadata fields that changed (old_value, new_value)

Parameters:

  • snapshot1 (str) –

    First snapshot name

  • snapshot2 (str) –

    Second snapshot name

Returns:

  • Dict[str, Any]

    Dictionary with comparison results

Raises:

  • KeyError

    If either snapshot doesn't exist

Examples:

>>> diff = folio.compare_snapshots('v1.0', 'v2.0')
>>> print(diff['modified_items'])
['classifier', 'config']
>>> print(diff['metadata_changes']['accuracy'])
(0.89, 0.91)

datafolio.DataFolio.diff_from_snapshot(snapshot=None)

Compare current state to a snapshot.

This is useful for seeing what has changed since a snapshot was created, similar to 'git status' showing changes since last commit.

Parameters:

  • snapshot (Optional[str], default: None ) –

    Snapshot name to compare to. If None, uses most recent snapshot.

Returns:

  • Dict[str, Any]

    Dictionary with comparison results including:

      - snapshot_name: The snapshot being compared to
      - added_items: Items in current state but not in snapshot
      - removed_items: Items in snapshot but not in current state
      - modified_items: Items in both but with different checksums/versions
      - unchanged_items: Items in both with same checksum/version
      - metadata_changes: Metadata fields that changed

Raises:

  • KeyError

    If snapshot doesn't exist

  • ValueError

    If no snapshots exist and snapshot=None

Examples:

>>> # Compare to last snapshot
>>> diff = folio.diff_from_snapshot()
>>> print(f"Modified: {diff['modified_items']}")
['classifier', 'config']
>>> # Compare to specific snapshot
>>> diff = folio.diff_from_snapshot('v1.0')
>>> print(f"Added since v1.0: {diff['added_items']}")
['new_feature']
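
Like 'git status', the diff can gate follow-up work, for instance snapshotting only when something actually changed. A sketch (the snapshot name is illustrative):

>>> diff = folio.diff_from_snapshot()
>>> changed = (diff['added_items'] or diff['removed_items']
...            or diff['modified_items'])
>>> if changed:
...     folio.create_snapshot('checkpoint-2')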

datafolio.DataFolio.restore_snapshot(snapshot, confirm=False)

Restore working state to snapshot (DESTRUCTIVE).

This operation:

  - Replaces current metadata with snapshot metadata
  - Sets current item versions to match snapshot
  - Removes items added after snapshot
  - Does NOT delete the snapshot itself

WARNING: This is a destructive operation that overwrites current state.

Parameters:

  • snapshot (str) –

    Snapshot name to restore

  • confirm (bool, default: False ) –

    Must be True to proceed (safety check)

Returns:

  • Self

    Self for method chaining

Examples:

>>> folio.restore_snapshot('v1.0', confirm=True)
>>> # Working state now matches v1.0 snapshot

datafolio.DataFolio.export_snapshot(snapshot, target_path, *, include_snapshot_metadata=True)

Export a snapshot to a clean, standalone bundle.

Creates a new DataFolio bundle containing only the items and metadata from the specified snapshot. This is useful for:

  - Sharing a specific snapshot with collaborators
  - Creating a clean bundle for deployment
  - Starting fresh without version history

Parameters:

  • snapshot (str) –

    Name of snapshot to export

  • target_path (Union[str, Path]) –

    Path for new bundle (must not exist)

  • include_snapshot_metadata (bool, default: True ) –

    If True, adds snapshot info to new bundle's metadata under '_source_snapshot' key (default: True)

Returns:

  • DataFolio

    New DataFolio instance at target_path

Examples:

Export a baseline snapshot for sharing:

>>> folio = DataFolio('experiments/classifier')
>>> baseline = folio.export_snapshot('v1.0-baseline', 'shared/baseline')
>>> # New bundle contains only v1.0-baseline state, no history

Export for deployment:

>>> production = folio.export_snapshot('production-v2', 'deploy/v2')
>>> # Clean bundle ready for deployment

Export without metadata reference:

>>> clean = folio.export_snapshot(
...     'v1.0',
...     'clean-export',
...     include_snapshot_metadata=False
... )

Caching

Methods for managing the local cache (for remote bundles).

datafolio.DataFolio.cache_status(item_name=None)

Get cache status for an item or entire bundle.

Parameters:

  • item_name (Optional[str], default: None ) –

    Name of item to check. If None, returns overall cache stats.

Returns:

  • Optional[Dict[str, Any]]

    Dict with cache status information, or None if caching is not enabled or the item is not found.

    For specific items:

      - cached: Whether item is cached
      - cache_path: Path to cached file
      - size_bytes: Size of cached file
      - cached_at: Timestamp when cached
      - last_accessed: Last access timestamp
      - access_count: Number of times accessed
      - ttl_remaining: Seconds until cache expires (None if no TTL)

    For bundle-level (item_name=None):

      - bundle_path: Original bundle path
      - cache_dir: Cache directory path
      - ttl_seconds: TTL in seconds
      - cache_hits: Number of cache hits
      - cache_misses: Number of cache misses
      - cache_hit_rate: Hit rate (0.0-1.0)

Examples:

Check if a specific item is cached:

>>> status = folio.cache_status('my_table')
>>> if status and status['cached']:
...     print(f"Cache expires in {status['ttl_remaining']} seconds")

Get overall cache statistics:

>>> stats = folio.cache_status()
>>> print(f"Cache hit rate: {stats['cache_hit_rate']:.1%}")

datafolio.DataFolio.clear_cache(item_name=None)

Clear cached items.

Parameters:

  • item_name (Optional[str], default: None ) –

    Name of specific item to clear. If None, clears all cached items for this bundle.

Examples:

Clear a specific item:

>>> folio.clear_cache('my_table')

Clear entire cache:

>>> folio.clear_cache()

datafolio.DataFolio.invalidate_cache(item_name)

Invalidate cache for an item without deleting the file.

This marks the cached item as invalid, forcing a re-fetch on next access, but keeps the file on disk (useful for stale cache fallback).

Parameters:

  • item_name (str) –

    Name of item to invalidate

Examples:

Force re-download on next access:

>>> folio.invalidate_cache('my_table')
>>> table = folio.get_table('my_table')  # Will re-download

datafolio.DataFolio.refresh_cache(item_name)

Refresh cache for an item by re-downloading from remote.

This is equivalent to invalidating and then fetching the item.

Parameters:

  • item_name (str) –

    Name of item to refresh

Examples:

>>> folio.refresh_cache('my_table')

Bundle Management

Methods for managing the DataFolio bundle itself.

datafolio.DataFolio.refresh()

Explicitly refresh manifests from disk/cloud.

This reloads items.json and metadata.json from the bundle directory, syncing the in-memory state with any external updates.

Useful when working with multiple DataFolio instances pointing to the same bundle, or when the bundle is updated by another process.

Returns:

  • Self

    Self for method chaining

Examples:

Explicit refresh after external update:

>>> folio1 = DataFolio('experiments/shared')
>>> folio2 = DataFolio('experiments/shared')
>>> folio1.add_table('results', df)
>>> folio2.refresh()  # Manually sync
>>> assert 'results' in folio2.list_contents()['included_tables']

Auto-refresh (happens automatically):

>>> folio1.add_table('results', df)
>>> # folio2 auto-refreshes on next read operation
>>> assert 'results' in folio2.list_contents()['included_tables']

Properties

Useful properties for accessing bundle information and items.

Core Properties

path

The path to the DataFolio bundle.

print(folio.path)  # e.g., 'gs://my-bucket/my-bundle' or '/local/path/bundle'

metadata

Bundle-level metadata dictionary.

print(folio.metadata)  # e.g., {'project': 'analysis', 'version': '1.0'}

items

Dictionary of all items in the bundle with their metadata.

print(folio.items)  # e.g., {'table1': {...}, 'model1': {...}}

Item Lists

tables

List of all table names in the bundle.

print(folio.tables)  # e.g., ['results', 'metadata', 'analysis']

models

List of all model names in the bundle.

print(folio.models)  # e.g., ['classifier', 'regressor']

artifacts

List of all artifact names in the bundle.

print(folio.artifacts)  # e.g., ['config.yaml', 'results.png']

Data Accessor

data

Accessor for convenient data retrieval with autocomplete support.

df = folio.data.my_table  # Equivalent to folio.get_table('my_table')
model = folio.data.my_model  # Equivalent to folio.get_model('my_model')

Status Properties

read_only

Whether the bundle is in read-only mode.

print(folio.read_only)  # True or False

in_snapshot_mode

Whether the bundle was loaded from a snapshot.

print(folio.in_snapshot_mode)  # True or False

loaded_snapshot

Name of the snapshot this bundle was loaded from (if any).

print(folio.loaded_snapshot)  # e.g., 'v1.0' or None

Method Categories Summary

  • Adding Data: add_table(), add_numpy(), add_json(), add_timestamp(), add_data(), reference_table()
  • Adding Models: add_sklearn(), add_model()
  • Adding Artifacts: add_artifact()
  • Retrieving Data: get_table(), get_table_path(), get_numpy(), get_numpy_path(), get_json(), get_json_path(), get_timestamp(), get_timestamp_path(), get_data(), get_data_path(), get_item_path()
  • Retrieving Models: get_sklearn(), get_model(), get_model_path()
  • Retrieving Artifacts: get_artifact_path()
  • Inspecting Items: list_contents(), get_table_info(), get_model_info(), get_artifact_info(), describe()
  • Managing Items: delete(), archive(), unarchive(), copy(), validate(), is_valid()
  • Lineage: get_inputs(), get_dependents(), get_lineage_graph()
  • Snapshots: create_snapshot(), list_snapshots(), delete_snapshot(), load_snapshot(), get_snapshot(), get_snapshot_info(), compare_snapshots(), diff_from_snapshot(), restore_snapshot(), export_snapshot()
  • Caching: cache_status(), clear_cache(), invalidate_cache(), refresh_cache()
  • Bundle Management: refresh()

Quick Examples

Basic Usage

import datafolio
import pandas as pd

# Create a new DataFolio
folio = datafolio.DataFolio('my_analysis')

# Add data
df = pd.DataFrame({'x': [1, 2, 3], 'y': [4, 5, 6]})
folio.add_table('results', df, description='Experimental results')

# Retrieve data
df_loaded = folio.get_table('results')

# List contents
print(folio.list_contents())
print(folio.tables)  # Property access

# Use data accessor with autocomplete
df_via_accessor = folio.data.results

With Caching

# Enable caching for remote bundles
folio = datafolio.DataFolio(
    'gs://my-bucket/my-bundle',
    cache_enabled=True,
    cache_dir='/tmp/my-cache'
)

# First access downloads and caches
df = folio.get_table('large_table')  # Downloads from cloud

# Second access uses cache (much faster!)
df = folio.get_table('large_table')  # Reads from local cache

# Check cache statistics
status = folio.cache_status()
print(f"Cache hits: {status['cache_hits']}")
print(f"Cache misses: {status['cache_misses']}")

With Snapshots

# Create a named snapshot
folio.create_snapshot('v1.0', description='Release 1.0')

# Load a snapshot (always read-only; takes the bundle path and snapshot name)
folio_snapshot = datafolio.DataFolio.load_snapshot('my_analysis', 'v1.0')

# List all snapshots
snapshots = folio.list_snapshots()
for snap in snapshots:
    print(f"{snap['name']}: {snap['description']}")

Lineage Tracking

# Add data with lineage
folio.add_table('raw_data', raw_df)
folio.add_table('processed_data', processed_df, inputs=['raw_data'])
folio.add_model('trained_model', model, inputs=['processed_data'])

# Query lineage
inputs = folio.get_inputs('trained_model')  # ['processed_data']
dependents = folio.get_dependents('raw_data')  # ['processed_data']

# Get full lineage graph
graph = folio.get_lineage_graph()
print(graph)  # Shows dependency relationships

Sharing Paths with Collaborators

# For a cloud-hosted folio, get the direct path to any item
folio = datafolio.DataFolio('s3://my-bucket/experiments/run-42')

# Type-specific path methods (recommended):
path = folio.get_table_path('results')
# → 's3://my-bucket/experiments/run-42/tables/results.parquet'

path = folio.get_model_path('classifier')
# → 's3://my-bucket/experiments/run-42/models/classifier.joblib'

# Generic path getter (dispatches to the appropriate method automatically):
path = folio.get_data_path('results')   # same as get_table_path for tables
path = folio.get_item_path('results')   # lower-level, skips type-specific logic

# Share with a colleague who doesn't use datafolio:
# import pandas as pd; pd.read_parquet('s3://my-bucket/.../results.parquet')

# Or browse all paths at once with describe()
folio.describe(show_paths=True)
# Tables (2):
#   • raw_data (reference): Input dataset
#     ↳ path: s3://data-lake/raw.parquet
#   • results: Model results
#     ↳ path: s3://my-bucket/experiments/run-42/tables/results.parquet
# Models (1):
#   • classifier: Trained model
#     ↳ path: s3://my-bucket/experiments/run-42/models/classifier.joblib