Posted on: April 10, 2021 at 01:00 AM

Managing ML Pipelines with Datamon

Introduction

Datamon helps build ML pipelines by adding versioning, auditing, and lineage tracking to cloud storage tools (e.g. Google Cloud Storage, AWS S3). It is not a replacement for these tools, but a way to manage their inputs and outputs.

Datamon provides a git-like interface for managing data: your data buckets are organized into repositories of versioned, tagged bundles of files.
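
As a rough illustration of that repository-and-bundle model, the sketch below chains three commands that appear in this post (`d2 repo create`, `d2 bundle upload`, `d2 bundle list`). The helper names `run` and `datamon_round_trip`, the `DATAMON_DRY_RUN` variable, and the description and message strings are hypothetical and are not part of Datamon. By default the script only prints the commands it would run.

```shell
# Hypothetical wrapper: print commands by default, execute them only
# when DATAMON_DRY_RUN=0. Not a Datamon feature.
DATAMON_DRY_RUN="${DATAMON_DRY_RUN:-1}"

run() {
  if [ "$DATAMON_DRY_RUN" = "1" ]; then
    echo "+ $*"      # dry run: show the command only
  else
    "$@"             # real run: execute it
  fi
}

# Create a repo, upload one folder as a bundle, then list the repo's bundles.
# $1 = repository name, $2 = local folder to upload (both illustrative).
datamon_round_trip() {
  repo=$1
  src=$2
  run d2 repo create --repo "$repo" --description "example repo"
  run d2 bundle upload --path "$src" --repo "$repo" --message "first upload"
  run d2 bundle list --repo "$repo"
}
```

Calling `datamon_round_trip demo-repo ./data` in dry-run mode prints the three `d2` commands in order. Each upload produces a new bundle in the repository, which is the unit that the listing, mounting, and download commands operate on.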

Installation and Setup

Version Information

# Check Datamon version
d2 version
# Output:
# Version: 2.3.0
# BuildDate: 2020-11-10T11:29:17Z
# Commit: 17ce7d6
# Working tree: clean

Configuration

# Set config for different environments
d2 config set --config global-onec-co-datamon-config --context dev
d2 config set --config global-onec-co-datamon-config --context staging
d2 config set --config workshop-config

Context Management

# List all contexts
d2 context list
# Output:
# [dev prod staging]

# Get current context
d2 context
# or
d2 get context

# Get specific context
d2 context get --context dev
# Output:
# Model Version: 0
# Name: dev
# WAL: dev-onec-co-datamon-metadata-wal
# ReadLog: dev-onec-co-datamon-readlog
# Blob: global-onec-co-datamon-blob
# Metadata: dev-onec-co-datamon-metadata
# Version Metadata: dev-onec-co-datamon-vmetadata

# Create new context
d2 context create test

Repository Management

# Get repository details
d2 repo get zenrin-estat-residential
# or
d2 repo get --repo zenrin-estat-residential

# List repositories
d2 repo list | grep zenrin-estat-residential

# Create new repositories
d2 repo create --repo ntd-road-source-dev --description "raw download of ntd road data"
d2 repo list | grep ntd

# Create repository in specific context
d2 repo create --context staging --repo ntd-road-source-staging --description "the original ntd road data"
d2 repo list --context staging | grep ntd
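
Since repositories are created per context, it can be handy to run the same `repo list | grep` check across several contexts at once. The following is a small sketch under that assumption. `repos_matching` and `CONTEXTS` are hypothetical names, and it assumes `d2 repo list --context` accepts `dev` the same way it accepts `staging` above.

```shell
# Hypothetical helper: show repositories matching a pattern in each context.
# CONTEXTS is an illustrative variable, not a Datamon setting.
CONTEXTS="${CONTEXTS:-dev staging}"

repos_matching() {
  pattern=$1
  for ctx in $CONTEXTS; do
    echo "== $ctx =="
    # grep exits non-zero when nothing matches; keep looping regardless
    d2 repo list --context "$ctx" | grep "$pattern" || true
  done
}
```

For example, `repos_matching ntd` prints a header per context followed by any repository lines containing `ntd`, which makes it easy to spot a repository that exists in one context but not the other.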

Bundle Management

Uploading Bundles

# Upload a bundle
d2 bundle upload --path folder-to-upload --repo mkang-test-repo --message "my first upload"

# Upload to specific repository
d2 bundle upload --path ~/occ/prod/01-built-object-service/tmp/ntd/RI --repo ntd-road-source-dev
d2 bundle upload --path ~/occ/prod/01-built-object-service/tmp/ntd/RI --repo ntd-road-source-dev --message "upload RI"

Listing Bundles

# List all bundles
d2 bundle list

# List bundles in a repository
d2 bundle list --repo mkang-test-repo
d2 bundle list --repo ntd-road-source-dev
d2 bundle list --repo zenrin-estat-residential
d2 bundle list --repo resilience-japan-hazard-maps
d2 bundle list --repo Seattle-Sample-Corelogic-Run-Data

# List bundles in different contexts
d2 bundle list --repo ntd-road-source-staging --context staging
d2 bundle list --repo ntd-road-source-dev --context dev

Bundle Operations

# List files in a bundle
d2 bundle list files --repo mkang-test-repo --bundle 1fySBuavEhqWAXnYnZEiDCNm8TC
d2 bundle list files --repo mkang-test-repo --bundle 1fySBuavEhq 2>/dev/null | grep file

# Mount bundle
d2 bundle mount --repo mkang-test-repo --mount ~/mnt --daemonize
d2 bundle mount --repo mkang-test-repo --label 1fySBuavEhqWAXnYnZEiDCNm8TC --mount ~/mnt --daemonize

# Download bundle
d2 bundle download --repo mkang-test-repo --destination .