# Advanced Parquet Operations
DataFolio stores tables as Parquet files, leveraging Parquet's powerful filtering and optimization capabilities. This guide shows you how to work efficiently with large datasets using column selection, row filtering, and other advanced techniques.
## Why Parquet?

Parquet is a columnar storage format designed for analytics workloads:

- **Column pruning** - Read only the columns you need
- **Row filtering** - Filter data at the file level before loading it into memory
- **Compression** - Efficient storage with a choice of compression codecs
- **Metadata** - Schema and per-column statistics stored for query optimization
- **Cloud-optimized** - Works well with S3, GCS, and other cloud storage
## Quick Example: The Problem

```python
# Problem: 100GB table, but you only need 2 columns and 1% of rows
folio = DataFolio('s3://huge-bucket/analysis')

# Bad: Loads the entire 100GB into memory! 💥
df = folio.get_table('transactions')  # OOM error!
df_filtered = df[df['amount'] > 1000][['user_id', 'amount']]

# Good: Loads only ~1GB (the filtered subset) ✅
df_filtered = folio.get_table('transactions',
                              columns=['user_id', 'amount'],
                              filters=[('amount', '>', 1000)],
                              engine='pyarrow')
```

This is the power of column selection and row filtering: you can work with massive datasets efficiently.
## Column Selection (Column Pruning)

Read only the columns you need, reducing both memory usage and load time.

### Basic Column Selection

```python
folio = DataFolio('experiments/analysis')

# Full table has 50 columns
df_full = folio.get_table('user_events')  # Loads all 50 columns

# Read only specific columns
df_subset = folio.get_table('user_events',
                            columns=['user_id', 'timestamp', 'event_type'])  # Loads only 3 columns
```
### Performance Impact

```python
import time

# Example: 10M rows × 50 columns = 4GB
start = time.time()
df_full = folio.get_table('large_table')
full_time = time.time() - start
print(f"Full table: {full_time:.2f}s, {df_full.memory_usage().sum() / 1e9:.2f} GB")
# Full table: 12.3s, 4.2 GB

# Read only 5 columns
start = time.time()
df_subset = folio.get_table('large_table',
                            columns=['col1', 'col2', 'col3', 'col4', 'col5'])
subset_time = time.time() - start
print(f"5 columns: {subset_time:.2f}s, {df_subset.memory_usage().sum() / 1e9:.2f} GB")
# 5 columns: 1.8s, 0.4 GB

# 6.8x faster, 10x less memory!
```
### Use Cases

**Data Exploration**

```python
# Quick peek at the data without loading everything
summary = folio.get_table('user_data',
                          columns=['user_id', 'signup_date', 'country'])
print(summary.head())
print(summary['country'].value_counts())
```

**Feature Engineering**

```python
# Load only the features needed for this model
features = ['age', 'income', 'education', 'job_title']
target = 'churn'

df = folio.get_table('training_data',
                     columns=features + [target])
X = df[features]
y = df[target]
```
## Row Filtering (Predicate Pushdown)

Filter rows at the Parquet file level before loading them into memory. This is far more efficient than loading everything and filtering in pandas.

### Basic Filtering

Row filtering requires the pyarrow engine:

```python
# Filter rows where amount > 1000
df = folio.get_table('transactions',
                     filters=[('amount', '>', 1000)],
                     engine='pyarrow')

# Multiple conditions (AND)
df = folio.get_table('transactions',
                     filters=[
                         ('amount', '>', 1000),
                         ('country', '==', 'US')
                     ],
                     engine='pyarrow')
```
### Filter Syntax

PyArrow supports these filter operators:

```python
# Comparisons
[('age', '>', 18)]            # Greater than
[('age', '>=', 18)]           # Greater than or equal
[('age', '<', 65)]            # Less than
[('age', '<=', 65)]           # Less than or equal
[('status', '==', 'active')]  # Equal
[('status', '!=', 'banned')]  # Not equal

# Set membership
[('country', 'in', ['US', 'CA', 'MX'])]
[('country', 'not in', ['XX', 'YY'])]
```
### Complex Filters

Combine multiple conditions:

```python
# AND conditions (list of tuples)
df = folio.get_table('users',
                     filters=[
                         ('age', '>=', 18),
                         ('age', '<', 65),
                         ('country', 'in', ['US', 'CA', 'UK'])
                     ],
                     engine='pyarrow')

# OR conditions (list of lists)
# Get users who are either (age < 18) OR (country == 'US')
df = folio.get_table('users',
                     filters=[
                         [('age', '<', 18)],
                         [('country', '==', 'US')]
                     ],
                     engine='pyarrow')

# Complex: (age >= 18 AND country == 'US') OR (status == 'premium')
df = folio.get_table('users',
                     filters=[
                         [('age', '>=', 18), ('country', '==', 'US')],
                         [('status', '==', 'premium')]
                     ],
                     engine='pyarrow')
```
### Performance Impact

```python
# Example: 100M rows, filter returns 1M rows

# Bad: Load all 100M rows, then filter (slow!) 💥
df = folio.get_table('events')  # 30s, 8GB memory
df_filtered = df[df['event_type'] == 'purchase']  # Another 5s

# Good: Filter at the file level (fast!) ✅
df_filtered = folio.get_table('events',
                              filters=[('event_type', '==', 'purchase')],
                              engine='pyarrow')  # 2s, 80MB memory

# 17.5x faster, 100x less memory!
```
## Combining Column Selection and Filtering

The real power comes from combining both techniques:

```python
# Dataset: 1B rows × 100 columns = 800GB
# Goal: Get user_id and amount for large transactions in 2024

# Optimal approach: filter rows and select columns in one call
df = folio.get_table('transactions',
                     columns=['user_id', 'amount', 'timestamp'],  # Only 3 columns
                     filters=[
                         ('timestamp', '>=', '2024-01-01'),
                         ('timestamp', '<', '2025-01-01'),
                         ('amount', '>', 10000)
                     ],
                     engine='pyarrow')
# Loads only ~500MB instead of 800GB! 🎉
```
## Real-World Example: Fraud Detection

```python
from datafolio import DataFolio

folio = DataFolio('s3://fraud-detection/production',
                  cache_enabled=True)  # Cache for repeated access

# Full dataset: 10TB of transaction data
# We need: recent high-value transactions from flagged countries
df = folio.get_table('transactions',
                     # Only these columns
                     columns=[
                         'transaction_id',
                         'user_id',
                         'amount',
                         'country',
                         'timestamp',
                         'merchant_id'
                     ],
                     # Filter to recent, high-value, suspicious countries
                     filters=[
                         ('timestamp', '>=', '2024-12-01'),     # Last month
                         ('amount', '>', 5000),                 # Large amounts
                         ('country', 'in', ['XX', 'YY', 'ZZ'])  # Flagged countries
                     ],
                     engine='pyarrow')
# Result: ~2GB loaded instead of 10TB

# Now train a fraud detection model
from sklearn.ensemble import RandomForestClassifier
model = RandomForestClassifier()
# ... feature engineering and training ...

# Save everything
folio.add_table('flagged_transactions', df)
folio.add_model('fraud_detector', model,
                inputs=['transactions'])
folio.metadata['filtered_rows'] = len(df)
folio.metadata['date_range'] = '2024-12'
```
## Working with Large Datasets

### Pattern 1: Iterative Processing

For datasets too large to fit in memory, process them in chunks:

```python
from datetime import datetime, timedelta
import pandas as pd

# Inspect table metadata if the date range is unknown
info = folio.get_table_info('huge_dataset')

# Assume we know the date range: 2020-01-01 to 2024-12-31
results = []
start_date = datetime(2020, 1, 1)
end_date = datetime(2024, 12, 31)

current_date = start_date
while current_date < end_date:
    next_date = current_date + timedelta(days=30)

    # Load one month at a time
    df_month = folio.get_table('huge_dataset',
                               filters=[
                                   ('date', '>=', current_date.strftime('%Y-%m-%d')),
                                   ('date', '<', next_date.strftime('%Y-%m-%d'))
                               ],
                               engine='pyarrow')

    # Process this month
    monthly_stats = df_month.groupby('category').agg({
        'amount': ['sum', 'mean', 'count']
    })
    results.append(monthly_stats)

    current_date = next_date

# Combine results
final_results = pd.concat(results)
folio.add_table('monthly_statistics', final_results)
```
### Pattern 2: Sample First, Analyze Later

```python
# 1. Get a sample to develop your analysis on
df_sample = folio.get_table('massive_dataset',
                            filters=[('random_partition', '==', 0)],  # 1% sample
                            engine='pyarrow')

# 2. Develop your analysis on the sample
def analyze_data(df):
    # Your analysis logic
    return df.groupby('category').agg({'value': 'mean'})

# 3. Test on the sample
result_sample = analyze_data(df_sample)
print("Sample results:", result_sample)

# 4. Run on the full dataset
df_full = folio.get_table('massive_dataset')
result_full = analyze_data(df_full)
folio.add_table('analysis_results', result_full)
```
### Pattern 3: Progressive Loading

Load increasingly detailed data as needed:

```python
# Step 1: Load summary columns for filtering
df_summary = folio.get_table('user_behavior',
                             columns=['user_id', 'total_spend', 'num_purchases'])

# Step 2: Identify users of interest
high_value_users = df_summary[df_summary['total_spend'] > 10000]['user_id']

# Step 3: Load detailed data only for those users
df_detailed = folio.get_table('user_behavior',
                              filters=[('user_id', 'in', high_value_users.tolist())],
                              engine='pyarrow')
# All columns, but only for high-value users
```
## Memory Optimization Tips

### 1. Check Available Columns First

```python
# See what columns are available without loading any data
info = folio.get_table_info('my_table')
print("Available columns:", info.get('columns', []))
print("Total rows:", info.get('num_rows'))
print("Total columns:", info.get('num_cols'))

# Now load only what you need
df = folio.get_table('my_table',
                     columns=['col1', 'col2', 'col3'])
```
### 2. Use Appropriate Data Types

```python
# Load a minimal column set first
df = folio.get_table('user_data',
                     columns=['user_id', 'age', 'country'])

# Then convert to more efficient types
df['user_id'] = df['user_id'].astype('uint32')    # If IDs are small integers
df['age'] = df['age'].astype('uint8')             # Ages fit in uint8
df['country'] = df['country'].astype('category')  # Categorical for repeated strings

# Memory savings can be 50-90%!
```
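You can measure the savings from the conversions above with `memory_usage(deep=True)`. A sketch on a small synthetic frame (the exact ratio depends on your data, especially string repetition):

```python
import pandas as pd

# A synthetic frame standing in for the loaded table
df = pd.DataFrame({'user_id': range(10_000),
                   'age': [30] * 10_000,
                   'country': ['US', 'CA'] * 5_000})
before = df.memory_usage(deep=True).sum()

df['user_id'] = df['user_id'].astype('uint32')
df['age'] = df['age'].astype('uint8')
df['country'] = df['country'].astype('category')
after = df.memory_usage(deep=True).sum()

print(f"{1 - after / before:.0%} saved")  # savings vary with the data
```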
### 3. Read Only What You'll Use

```python
# Bad: Load everything, then filter
df = folio.get_table('events')  # 10GB
active_users = df[df['status'] == 'active']['user_id'].unique()

# Good: Filter first, then load
df_active = folio.get_table('events',
                            columns=['user_id', 'status'],
                            filters=[('status', '==', 'active')],
                            engine='pyarrow')  # 100MB
active_users = df_active['user_id'].unique()
```
## Advanced: PyArrow Tables

For maximum performance, work with PyArrow tables directly:

```python
import pyarrow.parquet as pq

# Get the file path
folio = DataFolio('experiments/data')

# get_table_path() works for both included and referenced tables
file_path = folio.get_table_path('my_table')

# Read as a PyArrow table (very fast; skips the pandas conversion)
table = pq.read_table(file_path,
                      columns=['col1', 'col2'],
                      filters=[('col1', '>', 100)])

# Work with PyArrow (often faster than pandas)
print(f"Rows: {table.num_rows}")
print(f"Schema: {table.schema}")

# Convert to pandas only when needed
df = table.to_pandas()
```
## Integration with DuckDB

For complex queries on large datasets, use DuckDB:

```python
import duckdb

folio = DataFolio('s3://my-bucket/analysis')

# Get the path to the table (works for included and referenced tables)
table_path = folio.get_table_path('large_table')

# Query with DuckDB (very fast, minimal memory)
result = duckdb.query(f"""
    SELECT
        category,
        AVG(amount) AS avg_amount,
        COUNT(*) AS count
    FROM read_parquet('{table_path}')
    WHERE amount > 100
      AND date >= '2024-01-01'
    GROUP BY category
    ORDER BY avg_amount DESC
    LIMIT 100
""").to_df()

# Save results
folio.add_table('category_summary', result,
                description='Top 100 categories by average amount',
                inputs=['large_table'])
```
## Cloud Storage Optimization

### Caching for Repeated Access

```python
# Enable caching for cloud data
folio = DataFolio('s3://my-bucket/experiment',
                  cache_enabled=True,
                  cache_dir='/fast/local/disk')

# First access: downloads from S3 (slow)
df1 = folio.get_table('data',
                      filters=[('country', '==', 'US')],
                      engine='pyarrow')  # 30s

# Second access: reads from the local cache (fast!)
df2 = folio.get_table('data',
                      filters=[('country', '==', 'CA')],
                      engine='pyarrow')  # 0.5s

# Note: different filters still benefit from the cached file
```
### Minimize Data Transfer

```python
# Bad: Transfer the entire file from S3, then filter locally
folio = DataFolio('s3://huge-bucket/data')
df = folio.get_table('transactions')  # Transfers 50GB!
df_us = df[df['country'] == 'US']     # Then filters locally

# Good: Filter on S3, transfer only the results
df_us = folio.get_table('transactions',
                        columns=['id', 'amount', 'date'],
                        filters=[('country', '==', 'US')],
                        engine='pyarrow')  # Transfers only 500MB
```
## Best Practices

### 1. Always Specify Columns

```python
# Default: read all columns
df = folio.get_table('data')  # All columns

# Better: specify exactly what you need
df = folio.get_table('data',
                     columns=['essential_col1', 'essential_col2'])
```

### 2. Use Filters for Time-Series Data

```python
# Analyze only recent data
df = folio.get_table('events',
                     filters=[('timestamp', '>=', '2024-12-01')],
                     engine='pyarrow')
```
### 3. Combine with Good Data Organization

```python
# When saving data, include useful columns for filtering
df['year'] = df['date'].dt.year
df['month'] = df['date'].dt.month
folio.add_table('data_with_partitions', df)

# Later: fast filtering
df_2024 = folio.get_table('data_with_partitions',
                          filters=[('year', '==', 2024)],
                          engine='pyarrow')
```

### 4. Check Your Filters Work

```python
# Verify filter syntax before processing huge files
df_test = folio.get_table('data',
                          filters=[('category', 'in', ['A', 'B'])],
                          engine='pyarrow')

if len(df_test) > 0:
    print("Filter works! Processing full dataset...")
else:
    print("No data matched the filter - check column names/values")
```
## Common Patterns

### Pattern: Top N Analysis

```python
# Find the top users by spend, without loading everyone's details
df_all = folio.get_table('user_spending',
                         columns=['user_id', 'total_spend'])
top_users = df_all.nlargest(1000, 'total_spend')['user_id']

# Load full details only for the top users
df_top = folio.get_table('user_details',
                         filters=[('user_id', 'in', top_users.tolist())],
                         engine='pyarrow')
```
### Pattern: Time-Window Analysis

```python
from datetime import datetime, timedelta

# Analyze the last 7 days
end_date = datetime.now()
start_date = end_date - timedelta(days=7)

df = folio.get_table('events',
                     columns=['user_id', 'event_type', 'timestamp', 'value'],
                     filters=[
                         ('timestamp', '>=', start_date.strftime('%Y-%m-%d')),
                         ('timestamp', '<', end_date.strftime('%Y-%m-%d'))
                     ],
                     engine='pyarrow')

# Analyze this week's data
summary = df.groupby('event_type').agg({'value': 'sum'})
folio.add_table(f'weekly_summary_{start_date:%Y%m%d}', summary)
```
### Pattern: A/B Test Analysis

```python
# Load the treatment and control groups separately
df_treatment = folio.get_table('experiment_data',
                               columns=['user_id', 'conversion', 'revenue'],
                               filters=[('group', '==', 'treatment')],
                               engine='pyarrow')

df_control = folio.get_table('experiment_data',
                             columns=['user_id', 'conversion', 'revenue'],
                             filters=[('group', '==', 'control')],
                             engine='pyarrow')

# Compare conversion rates
print(f"Treatment: {df_treatment['conversion'].mean():.2%}")
print(f"Control: {df_control['conversion'].mean():.2%}")
```
## Troubleshooting

### Error: "filter '...' not supported"

Use `engine='pyarrow'`:

```python
# Wrong
df = folio.get_table('data', filters=[('a', '>', 1)])  # Error!

# Right
df = folio.get_table('data',
                     filters=[('a', '>', 1)],
                     engine='pyarrow')  # Works!
```

### Error: "Column 'x' does not exist"

Check the available columns:

```python
info = folio.get_table_info('data')
print("Available columns:", info.get('columns', []))
```

### Slow Performance Despite Filtering

- Ensure you're using `engine='pyarrow'`
- Check filter selectivity (how many rows actually match)
- Consider breaking the work into smaller chunks
- Enable caching for cloud data
## Summary

| Technique | Use When | Performance Gain |
|---|---|---|
| Column selection | You need only some columns | 5-50x faster, 5-50x less memory |
| Row filtering | You need only some rows | 10-1000x faster, 10-1000x less memory |
| Both combined | Large datasets | 50-10000x faster! |
| Caching | Cloud storage, repeated access | 10-100x faster on subsequent reads |
| PyArrow engine | Any filtering | Required for filters |
| Chunked processing | Data too large for memory | Enables processing of arbitrarily large data |

**Golden Rule**: Always filter and select at the Parquet level; never load everything into memory first!
## Learn More

- Getting Started Guide - Basic DataFolio usage
- Snapshots Guide - Version your filtered datasets
- DataFolio API Reference - All `get_table()` options
- Pandas `read_parquet` docs - Full parameter reference
- PyArrow filtering docs - Advanced filter syntax