DataFolio CLI Reference

DataFolio provides a command-line interface for managing bundles and snapshots without writing Python code.

Installation

The CLI is automatically available after installing datafolio:

pip install datafolio

Global Options

All commands support these global options:

datafolio [OPTIONS] COMMAND [ARGS]...

| Option | Description |
|--------|-------------|
| -f, --folio PATH | Path to DataFolio bundle (default: DATAFOLIO_PATH env var if set, otherwise the current directory) |
| --version | Show version and exit |
| --help | Show help message |

Setting Default Folio Path

You can set a default folio path using the environment variable:

export DATAFOLIO_PATH=/path/to/my/bundle
datafolio describe  # Uses DATAFOLIO_PATH

Or specify it explicitly:

datafolio -f /path/to/my/bundle describe
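
The lookup described above can be sketched in Python. This is only an illustration: the function name `resolve_folio_path` is hypothetical, and the precedence order (explicit `-f` value, then DATAFOLIO_PATH, then the current directory) is an assumption inferred from the examples, not taken from the DataFolio source.

```python
import os
from pathlib import Path
from typing import Mapping, Optional


def resolve_folio_path(explicit: Optional[str] = None,
                       env: Optional[Mapping[str, str]] = None) -> Path:
    """Return the bundle path a command would operate on.

    Assumed precedence (hypothetical): explicit -f value first,
    then the DATAFOLIO_PATH variable, then the current directory.
    """
    env = os.environ if env is None else env
    if explicit:
        return Path(explicit)
    from_env = env.get("DATAFOLIO_PATH")
    if from_env:
        return Path(from_env)
    return Path.cwd()
```

Passing the environment as a parameter keeps the helper easy to test without modifying the real process environment.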

Commands

init - Initialize a Bundle

Create a new DataFolio bundle.

datafolio init [PATH]

Arguments:

- PATH (optional): Directory to create bundle in (default: current directory)

Examples:

# Create bundle in current directory
datafolio init

# Create bundle in specific directory
datafolio init my_analysis

# Create with custom path
datafolio init /data/experiments/exp_001

Output:

✓ Initialized new DataFolio at: /data/experiments/exp_001

describe - Show Bundle Information

Display detailed information about a bundle including items, metadata, and lineage.

datafolio describe [OPTIONS]

Options:

| Option | Description |
|--------|-------------|
| --json | Output in JSON format |
| --verbose, -v | Show detailed item information |

Examples:

# Basic description
datafolio describe

# Detailed output
datafolio describe --verbose

# JSON output for scripting
datafolio describe --json > bundle_info.json

# Describe specific bundle
datafolio -f /path/to/bundle describe

Output:

DataFolio: my_analysis
Path: /data/experiments/exp_001

Bundle Metadata:
  project: analysis
  created: 2024-01-15

Items (5):
  Tables (3):
    - raw_data (100 rows, 5 cols)
    - processed_data (100 rows, 8 cols)
    - results (50 rows, 3 cols)

  Models (1):
    - classifier (sklearn_model)

  Artifacts (1):
    - config.yaml

validate - Validate Bundle

Check if a directory is a valid DataFolio bundle.

datafolio validate [PATH]

Arguments:

- PATH (optional): Directory to validate (default: current directory)

Examples:

# Validate current directory
datafolio validate

# Validate specific path
datafolio validate /data/experiments/exp_001

Output:

✅ Valid bundle:

✓ Valid DataFolio bundle
  - items.json: valid
  - 5 items found
  - No issues detected

❌ Invalid bundle:

✗ Not a valid DataFolio bundle
  - Missing items.json
  - Directory structure incomplete

Exit Codes:

- 0: Valid bundle
- 1: Invalid bundle
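
Because validate reports its result through the exit code, it can be called from Python as well as from a shell. The sketch below is a minimal example, assuming the `datafolio` executable is on PATH; the helper name `is_valid_bundle` and the injectable `run` parameter are hypothetical conveniences. Any non-zero exit code is treated as invalid.

```python
import subprocess
from typing import Callable


def is_valid_bundle(
    path: str,
    run: Callable[..., subprocess.CompletedProcess] = subprocess.run,
) -> bool:
    """Run `datafolio validate PATH` and report whether it exited with 0."""
    result = run(["datafolio", "validate", path],
                 capture_output=True, text=True)
    # Exit code 0 means a valid bundle; anything else is treated as invalid.
    return result.returncode == 0
```

In normal use, call `is_valid_bundle("/data/experiments/exp_001")`; the `run` argument exists only so the subprocess call can be replaced in tests.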


Snapshot Commands

Manage snapshots (read-only copies) of your bundle state.

snapshot create - Create Snapshot

Create a new snapshot of the current bundle state.

datafolio snapshot create NAME [OPTIONS]

Arguments:

- NAME: Unique name for the snapshot (e.g., 'v1.0', 'baseline', '2024-01-15')

Options:

| Option | Description |
|--------|-------------|
| -d, --description TEXT | Description of this snapshot |
| --tags TEXT | Comma-separated tags |
| --metadata KEY=VALUE | Additional metadata (can be used multiple times) |

Examples:

# Simple snapshot
datafolio snapshot create v1.0

# With description
datafolio snapshot create baseline -d "Initial baseline results"

# With tags
datafolio snapshot create exp_001 --tags "experiment,baseline,validated"

# With custom metadata
datafolio snapshot create v2.0 \
  -d "Improved model" \
  --metadata accuracy=0.95 \
  --metadata model=transformer

Output:

✓ Created snapshot 'v1.0'
  Items: 5 tables, 1 model, 1 artifact
  Time: 2024-01-15 14:30:00 UTC
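
When snapshots are created from a Python script, the repeated --metadata flags are easiest to generate from a dictionary. The following sketch only builds the argument list using the options documented above; the function name `snapshot_create_args` is hypothetical and not part of DataFolio.

```python
from typing import Iterable, List, Mapping, Optional


def snapshot_create_args(name: str,
                         description: Optional[str] = None,
                         tags: Optional[Iterable[str]] = None,
                         metadata: Optional[Mapping[str, object]] = None,
                         folio: Optional[str] = None) -> List[str]:
    """Build the argv for `datafolio snapshot create`."""
    args = ["datafolio"]
    if folio:
        args += ["-f", folio]            # global option goes before the command
    args += ["snapshot", "create", name]
    if description:
        args += ["-d", description]
    if tags:
        args += ["--tags", ",".join(tags)]   # tags are comma-separated
    for key, value in (metadata or {}).items():
        args += ["--metadata", f"{key}={value}"]  # one flag per entry
    return args
```

The resulting list can be passed to `subprocess.run(args, check=True)`, which avoids shell quoting problems in descriptions that contain spaces.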

snapshot list - List Snapshots

List all snapshots in the bundle.

datafolio snapshot list [OPTIONS]

Options:

| Option | Description |
|--------|-------------|
| --json | Output in JSON format |
| --verbose, -v | Show detailed information |

Examples:

# List all snapshots
datafolio snapshot list

# Detailed listing
datafolio snapshot list --verbose

# JSON output
datafolio snapshot list --json

Output:

Snapshots (3):

  v1.0
    Created: 2024-01-15 14:30:00
    Items: 7
    Description: Initial baseline

  v1.1
    Created: 2024-01-16 10:15:00
    Items: 8
    Description: Added validation data

  v2.0
    Created: 2024-01-17 15:45:00
    Items: 9
    Description: Improved model (accuracy=0.95)

snapshot show - Show Snapshot Details

Display detailed information about a specific snapshot.

datafolio snapshot show NAME

Arguments:

- NAME: Snapshot name

Examples:

datafolio snapshot show v1.0

Output:

Snapshot: v1.0
Created: 2024-01-15 14:30:00 UTC
Description: Initial baseline results

Items (7):
  Tables (5):
    - raw_data (v1)
    - processed_data (v1)
    - train_data (v1)
    - test_data (v1)
    - results (v1)

  Models (1):
    - classifier (v1)

  Artifacts (1):
    - config.yaml (v1)

Metadata:
  accuracy: 0.92
  model_type: random_forest

snapshot compare - Compare Snapshots

Compare two snapshots to see what changed.

datafolio snapshot compare SNAPSHOT1 SNAPSHOT2

Arguments:

- SNAPSHOT1: First snapshot name
- SNAPSHOT2: Second snapshot name

Examples:

datafolio snapshot compare v1.0 v2.0

Output:

Comparing v1.0 → v2.0

Added (2):
  + new_features (table)
  + updated_model (model)

Modified (1):
  ~ results (table): rows changed 50 → 75

Removed (0):

Summary:
  2 additions, 1 modification, 0 deletions
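
Conceptually, a comparison like this is a set difference over the items in each snapshot. The sketch below illustrates the idea only and is not DataFolio's implementation: it assumes each snapshot can be reduced to a hypothetical mapping from item name to a version identifier.

```python
from typing import Dict, List, Mapping


def compare_manifests(old: Mapping[str, str],
                      new: Mapping[str, str]) -> Dict[str, List[str]]:
    """Classify items as added, modified, or removed between two snapshots.

    `old` and `new` map item name -> version identifier (hypothetical format).
    """
    old_names, new_names = set(old), set(new)
    added = sorted(new_names - old_names)
    removed = sorted(old_names - new_names)
    # An item present in both snapshots is modified if its version differs.
    modified = sorted(n for n in old_names & new_names if old[n] != new[n])
    return {"added": added, "modified": modified, "removed": removed}
```

With manifests matching the sample output above, `added` would hold new_features and updated_model, `modified` would hold results, and `removed` would be empty.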

snapshot diff - Diff Against Snapshot

Show changes between current state and a snapshot.

datafolio snapshot diff [SNAPSHOT]

Arguments:

- SNAPSHOT (optional): Snapshot name (default: latest snapshot)

Examples:

# Compare with latest snapshot
datafolio snapshot diff

# Compare with specific snapshot
datafolio snapshot diff v1.0

Output:

Changes since v1.0:

Modified (2):
  ~ results (table): updated
  ~ classifier (model): updated

Added (1):
  + validation_results (table)

Current state has 3 changes from snapshot v1.0

snapshot status - Show Bundle Status

Show current bundle state compared to the last snapshot.

datafolio snapshot status

Examples:

datafolio snapshot status

Output:

Current Status:

Last snapshot: v2.0 (2024-01-17 15:45:00)

Changes since v2.0:
  Modified: 1 item
  Added: 0 items
  Deleted: 0 items

Modified items:
  ~ results (table): 75 → 100 rows

💡 Tip: Create a new snapshot to save current state
      datafolio snapshot create v2.1

snapshot delete - Delete Snapshot

Delete a snapshot from the bundle.

datafolio snapshot delete NAME [OPTIONS]

Arguments:

- NAME: Snapshot name to delete

Options:

| Option | Description |
|--------|-------------|
| --force | Skip confirmation prompt |
| --cleanup-orphans | Also remove orphaned item versions |

Examples:

# Delete with confirmation
datafolio snapshot delete old_experiment

# Force delete without confirmation
datafolio snapshot delete old_experiment --force

# Delete and cleanup orphaned versions
datafolio snapshot delete old_experiment --cleanup-orphans

Output:

⚠ Warning: This will permanently delete snapshot 'old_experiment'
Continue? [y/N]: y
✓ Deleted snapshot 'old_experiment'

snapshot gc - Garbage Collection

Clean up orphaned item versions that are no longer referenced by any snapshot.

datafolio snapshot gc [OPTIONS]

Options:

| Option | Description |
|--------|-------------|
| --dry-run | Show what would be deleted without actually deleting |
| --verbose, -v | Show detailed information |

Examples:

# Dry run to see what would be deleted
datafolio snapshot gc --dry-run

# Actually perform cleanup
datafolio snapshot gc

# Verbose output
datafolio snapshot gc --verbose

Output:

Scanning for orphaned versions...

Would delete (3):
  - results.v1.parquet (orphaned since v1.0 deleted)
  - old_model.v2.pkl (no longer referenced)
  - temp_data.v1.parquet (orphaned)

Total space to free: 45.2 MB

Run without --dry-run to perform cleanup
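
The notion of an orphaned version can be illustrated with a small set computation: a stored version is orphaned when no snapshot refers to it. This is a conceptual sketch under assumed data shapes (a list of stored file names and a hypothetical mapping from snapshot name to the files it references), not the tool's actual algorithm.

```python
from typing import Iterable, Mapping, Set


def find_orphans(stored_versions: Iterable[str],
                 snapshots: Mapping[str, Iterable[str]]) -> Set[str]:
    """Return stored versions that no snapshot references."""
    referenced: Set[str] = set()
    for files in snapshots.values():
        referenced.update(files)
    # Whatever is stored but never referenced is a cleanup candidate.
    return set(stored_versions) - referenced
```

A dry run corresponds to printing this set; the real cleanup would delete each entry in it.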

snapshot reproduce - Show Reproduction Instructions

Generate instructions for reproducing a snapshot.

datafolio snapshot reproduce NAME

Arguments:

- NAME: Snapshot name

Examples:

datafolio snapshot reproduce v1.0

Output:

Reproduction Instructions for Snapshot: v1.0

To reproduce this exact state:

1. Load the snapshot:
   ```python
   import datafolio
   folio = datafolio.DataFolio.load_snapshot('v1.0')
   ```

2. Items in this snapshot:
   - raw_data (table)
   - processed_data (table)
   - classifier (model)
   - config.yaml (artifact)

3. Dependencies:
   raw_data → processed_data → classifier

4. Metadata:
   - Created: 2024-01-15 14:30:00 UTC
   - Python: 3.10.2
   - datafolio: 0.2.0

5. To export this snapshot:
   ```python
   folio.export_snapshot('v1.0', '/path/to/export')
   ```

Usage Examples

Common Workflows

1. Create and Manage a Bundle

# Initialize new bundle
datafolio init my_analysis

# Work with Python to add data...
# (see Python API documentation)

# Create snapshot when ready
datafolio -f my_analysis snapshot create baseline -d "Initial results"

# View bundle info
datafolio -f my_analysis describe

2. Track Progress with Snapshots

# After initial analysis
datafolio snapshot create v1.0 -d "Initial model"

# Continue working...

# Create another snapshot
datafolio snapshot create v1.1 -d "Improved preprocessing"

# Compare versions
datafolio snapshot compare v1.0 v1.1

# Check what changed since last snapshot
datafolio snapshot diff

3. Validate and Inspect Bundles

# Validate bundle structure
datafolio validate /path/to/bundle

# View detailed description
datafolio -f /path/to/bundle describe --verbose

# List all snapshots
datafolio -f /path/to/bundle snapshot list

4. Cleanup Old Snapshots

# List all snapshots
datafolio snapshot list

# Delete old experiments
datafolio snapshot delete old_experiment

# Clean up orphaned versions
datafolio snapshot gc

Integration with Python API

The CLI complements the Python API. A typical workflow:

# Python: Create and populate bundle
import datafolio
import pandas as pd

folio = datafolio.DataFolio('my_analysis')
folio.add_table('results', df)        # df: a pandas DataFrame defined earlier
folio.add_model('classifier', model)  # model: a fitted model defined earlier

# CLI: Create snapshot
datafolio -f my_analysis snapshot create v1.0 -d "Initial results"

# CLI: Validate
datafolio -f my_analysis validate

# CLI: View status
datafolio -f my_analysis describe

# Python: Load snapshot later
folio = datafolio.DataFolio.load_snapshot('v1.0')
results = folio.get_table('results')

Environment Variables

| Variable | Description |
|----------|-------------|
| DATAFOLIO_PATH | Default path for folio operations |
| DATAFOLIO_CACHE_ENABLED | Enable caching ('true'/'false') |
| DATAFOLIO_CACHE_DIR | Cache directory path |
| DATAFOLIO_CACHE_TTL | Cache TTL in seconds |

Example:

export DATAFOLIO_PATH=/data/experiments/current
export DATAFOLIO_CACHE_ENABLED=true
export DATAFOLIO_CACHE_DIR=/tmp/datafolio_cache

# Now CLI commands use these defaults
datafolio describe
datafolio snapshot list

Scripting with the CLI

The CLI is designed for use in scripts and automation:

Bash Script Example

#!/bin/bash

# Validate bundle
if ! datafolio validate /data/bundle; then
    echo "Invalid bundle!"
    exit 1
fi

# Create dated snapshot
DATE=$(date +%Y-%m-%d)
datafolio -f /data/bundle snapshot create "daily_$DATE" \
    -d "Daily backup" \
    --tags "automated,backup"

# Cleanup old snapshots (keep last 7 days)
# ... (custom logic to delete old snapshots)

echo "Backup complete: daily_$DATE"
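
One possible way to handle the retention step is to select expired snapshots by parsing the date out of the `daily_YYYY-MM-DD` names used in the script. The sketch below is an assumption-laden illustration: the function name `expired_daily_snapshots` is hypothetical, and it treats a snapshot as expired when its date is more than `keep_days` days before today.

```python
from datetime import date, timedelta
from typing import Iterable, List


def expired_daily_snapshots(names: Iterable[str],
                            today: date,
                            keep_days: int = 7,
                            prefix: str = "daily_") -> List[str]:
    """Return daily snapshot names dated more than keep_days before today."""
    cutoff = today - timedelta(days=keep_days)
    expired = []
    for name in names:
        if not name.startswith(prefix):
            continue  # not an automated daily snapshot; leave it alone
        try:
            taken = date.fromisoformat(name[len(prefix):])
        except ValueError:
            continue  # suffix is not a YYYY-MM-DD date; skip it
        if taken < cutoff:
            expired.append(name)
    return expired
```

Each returned name could then be passed to `datafolio snapshot delete NAME --force`, followed by `datafolio snapshot gc` to reclaim space.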

JSON Output for Processing

# Get bundle info as JSON
INFO=$(datafolio describe --json)

# Extract item count using jq
TABLE_COUNT=$(echo "$INFO" | jq '.tables | length')
echo "Bundle has $TABLE_COUNT tables"

# List snapshots as JSON
SNAPSHOTS=$(datafolio snapshot list --json)
# .[0] assumes the JSON array lists the newest snapshot first; adjust the index if it does not
LATEST=$(echo "$SNAPSHOTS" | jq -r '.[0].name')
echo "Latest snapshot: $LATEST"
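
The same processing can be done in Python with the standard json module. This sketch mirrors the jq expressions above and shares their assumptions about the JSON layout: a top-level "tables" key in the describe output and an array of objects with a "name" field in the snapshot list output. The helper names are hypothetical.

```python
import json
from typing import List


def count_tables(describe_json: str) -> int:
    """Count entries under the assumed top-level "tables" key."""
    info = json.loads(describe_json)
    return len(info.get("tables", []))


def snapshot_names(list_json: str) -> List[str]:
    """Extract the assumed "name" field from each snapshot entry."""
    return [entry["name"] for entry in json.loads(list_json)]
```

The input strings would come from capturing the output of `datafolio describe --json` and `datafolio snapshot list --json`, for example with `subprocess.run(..., capture_output=True, text=True).stdout`.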

Exit Codes

| Code | Meaning |
|------|---------|
| 0 | Success |
| 1 | Error (invalid arguments, bundle not found, operation failed) |
| 2 | Validation failed (for validate command) |

Getting Help

For any command, use --help:

datafolio --help
datafolio snapshot --help
datafolio snapshot create --help

For more detailed documentation, see:

- Python API Reference
- Getting Started Guide
- Snapshots Guide