Getting Started with lakeFS: Docker Compose, CLI, and Python
What is lakeFS?
lakeFS is an open-source data version control system that sits on top of an object store (S3, GCS, Azure Blob) and gives you Git-like semantics — branches, commits, merges, diffs — over your data lake. You point your readers and writers at the lakeFS endpoint (S3-compatible API) instead of the raw bucket, and lakeFS keeps track of versioned references without copying the underlying objects.
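That "branch as part of the path" idea is the core trick: over the S3 gateway, objects are addressed as `s3://<repo>/<ref>/<path>`, so switching a reader between branches, tags, or pinned commits is just a path change. A minimal sketch of the addressing scheme (`lakefs_uri` is a hypothetical helper of mine, not part of any SDK):

```python
# Sketch of lakeFS branch-as-path addressing over the S3-compatible gateway.
# lakefs_uri() is a hypothetical helper, not part of the lakeFS SDK.
def lakefs_uri(repo: str, ref: str, path: str) -> str:
    """Build an S3-gateway URI: the ref (branch, tag, or commit) is part of the key."""
    return f"s3://{repo}/{ref}/{path}"

# The same logical file, seen through two different refs:
prod = lakefs_uri("quickstart", "main", "data/users.csv")
test = lakefs_uri("quickstart", "experiment", "data/users.csv")
print(prod)  # s3://quickstart/main/data/users.csv
print(test)  # s3://quickstart/experiment/data/users.csv
```

With boto3, Spark, or DuckDB you would point the client's S3 endpoint URL at lakeFS (e.g. `http://localhost:8000`) and use these URIs as if they were plain bucket paths.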
This post walks through:
- Running lakeFS locally with Docker Compose against a GCS bucket backend.
- A simple first run to set up the admin user and a repository.
- Using the `lakectl` CLI to branch, commit, and merge.
- Working with lakeFS files locally using `lakectl local` (Git-like clone/pull/commit).
- Reading and writing from Python with the official `lakefs` SDK.
- Running SQL queries on parquet files directly from the web UI's embedded SQL console.
1. Install: Docker Compose with a GCS Bucket Backend
Prerequisites
- Docker and Docker Compose installed
- A GCP project with a GCS bucket you can use as the storage namespace (e.g., `gs://my-lakefs-bucket`)
- A GCP service account with Storage Object Admin on that bucket, and a downloaded JSON key
Create the bucket
```bash
# Pick a unique bucket name
export BUCKET=my-lakefs-bucket
export GCP_PROJECT=my-project
export REGION=us-central1

gsutil mb -p $GCP_PROJECT -l $REGION gs://$BUCKET
```
Create a service account key
```bash
export SA=lakefs-local@$GCP_PROJECT.iam.gserviceaccount.com

gcloud iam service-accounts create lakefs-local \
  --project $GCP_PROJECT \
  --display-name "lakeFS local"

gcloud projects add-iam-policy-binding $GCP_PROJECT \
  --member "serviceAccount:$SA" \
  --role "roles/storage.objectAdmin"

# Required for presigned GCS URLs — lakectl and many Python clients rely
# on these so that uploads/downloads stream directly to/from GCS instead of
# through lakeFS. Grants the SA `signBlob` on itself.
gcloud iam service-accounts add-iam-policy-binding $SA \
  --project $GCP_PROJECT \
  --member "serviceAccount:$SA" \
  --role "roles/iam.serviceAccountTokenCreator"

gcloud iam service-accounts keys create ./gcs-key.json \
  --iam-account "$SA"
```
Keep `gcs-key.json` next to your `docker-compose.yml` (and add it to `.gitignore`).

Why the second role binding? Without `iam.serviceAccountTokenCreator` on the SA itself, lakeFS can't sign GCS URLs and any `lakectl fs upload` (or a Python client with presign on) will fail with `get physical address to upload object … giving up after 5 attempt(s)`. If you'd rather not grant signing permission, set `LAKEFS_BLOCKSTORE_GS_DISABLE_PRE_SIGNED_URL: "true"` in the compose env below — all bytes then flow through the lakeFS server instead of directly to GCS.
docker-compose.yml
lakeFS needs two things to run: a metadata database (Postgres is the easy default) and credentials for the object store. Here’s a minimal setup:
```yaml
version: "3.8"

services:
  postgres:
    image: postgres:16
    environment:
      POSTGRES_USER: lakefs
      POSTGRES_PASSWORD: lakefs
      POSTGRES_DB: lakefs
    volumes:
      - pgdata:/var/lib/postgresql/data
    healthcheck:
      test: ["CMD-SHELL", "pg_isready -U lakefs"]
      interval: 5s
      retries: 10

  lakefs:
    image: treeverse/lakefs:latest
    depends_on:
      postgres:
        condition: service_healthy
    ports:
      - "8000:8000"
    environment:
      LAKEFS_DATABASE_TYPE: postgres
      LAKEFS_DATABASE_POSTGRES_CONNECTION_STRING: "postgres://lakefs:lakefs@postgres:5432/lakefs?sslmode=disable"
      LAKEFS_AUTH_ENCRYPT_SECRET_KEY: "change-me-to-a-long-random-string"
      LAKEFS_BLOCKSTORE_TYPE: gs
      LAKEFS_BLOCKSTORE_GS_CREDENTIALS_FILE: /etc/lakefs/gcs-key.json
      LAKEFS_LOGGING_LEVEL: INFO
    volumes:
      - ./gcs-key.json:/etc/lakefs/gcs-key.json:ro

volumes:
  pgdata:
```
A few notes:

- `LAKEFS_BLOCKSTORE_TYPE: gs` tells lakeFS to use Google Cloud Storage.
- The service account key is mounted read-only into the container.
- `LAKEFS_AUTH_ENCRYPT_SECRET_KEY` must be set to something long and random; it encrypts credentials in the metadata DB. If you rotate it you'll invalidate existing credentials.
- For production you'd put the DB on managed Postgres and put lakeFS behind TLS — this compose file is for local exploration.
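If you want a quick way to produce that secret, Python's standard library is enough (any sufficiently long random string works):

```python
# Generate a random value suitable for LAKEFS_AUTH_ENCRYPT_SECRET_KEY.
import secrets

secret = secrets.token_urlsafe(48)  # 48 random bytes, URL-safe base64 encoded
print(secret)
```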
Bring it up:
```bash
docker compose up -d
docker compose logs -f lakefs
```
When the logs show `listen on [::]:8000`, open http://localhost:8000 in your browser.
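If you're scripting the setup rather than watching logs, you can poll lakeFS's healthcheck endpoint (`GET /api/v1/healthcheck`) until the server answers. A small sketch; the helper name is my own:

```python
# wait_for_lakefs.py — poll the lakeFS healthcheck endpoint until the server
# is ready. wait_for_lakefs() is a hypothetical helper, not a lakeFS tool.
import time
import urllib.error
import urllib.request

def wait_for_lakefs(base_url: str = "http://localhost:8000", timeout: float = 60.0) -> bool:
    """Return True once GET {base_url}/api/v1/healthcheck responds with 2xx."""
    deadline = time.time() + timeout
    url = f"{base_url.rstrip('/')}/api/v1/healthcheck"
    while time.time() < deadline:
        try:
            with urllib.request.urlopen(url, timeout=2) as resp:
                if resp.status < 300:
                    return True
        except (urllib.error.URLError, OSError):
            pass  # server not up yet; retry
        time.sleep(1)
    return False
```

Call `wait_for_lakefs()` right after `docker compose up -d` and proceed only when it returns `True`.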
2. Simple Start
First-run setup
The first time you open the lakeFS UI, it prompts you to create an admin user. Pick a username (e.g., admin), then lakeFS prints an access key ID and secret access key. Copy them immediately — the secret is only shown once.
Save them to ~/.lakectl.yaml so the CLI and Python clients can pick them up:
```yaml
# ~/.lakectl.yaml
credentials:
  access_key_id: AKIA...
  secret_access_key: ...
server:
  endpoint_url: http://localhost:8000
```
Also export them as environment variables for tools that expect AWS-style credentials:
```bash
export AWS_ACCESS_KEY_ID=AKIA...
export AWS_SECRET_ACCESS_KEY=...
export LAKEFS_ENDPOINT=http://localhost:8000
```
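Later Python scripts can pick these up instead of hard-coding secrets. A small sketch matching the exports above (the helper name is my own convention):

```python
# Assemble lakeFS client settings from environment variables.
# lakefs_client_kwargs() is a hypothetical helper, not part of any SDK.
import os

def lakefs_client_kwargs() -> dict:
    """Read lakeFS connection settings from the environment,
    falling back to the local-compose default endpoint."""
    return {
        "host": os.environ.get("LAKEFS_ENDPOINT", "http://localhost:8000"),
        "username": os.environ["AWS_ACCESS_KEY_ID"],      # access key id
        "password": os.environ["AWS_SECRET_ACCESS_KEY"],  # secret access key
    }
```

The resulting dict can be splatted straight into the SDK client shown later: `Client(**lakefs_client_kwargs())`.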
Create your first repository
In the UI: Create Repository → name it quickstart, storage namespace gs://my-lakefs-bucket/quickstart, default branch main.
Equivalently from the CLI (see next section):
```bash
lakectl repo create lakefs://quickstart gs://my-lakefs-bucket/quickstart
```
The storage namespace is the prefix inside the bucket where lakeFS will store data for this repo. One bucket can host many repositories as long as each has its own prefix.
3. Using the CLI (lakectl)
Install lakectl
macOS:

```bash
brew tap treeverse/lakefs
brew install lakefs
```

Linux (download the binary from the releases page):

```bash
curl -L https://github.com/treeverse/lakeFS/releases/latest/download/lakeFS_Linux_x86_64.tar.gz \
  | tar xz
sudo mv lakectl /usr/local/bin/
```

Verify:

```bash
lakectl --version
lakectl repo list
```
Upload, branch, commit, merge
Create a local file and upload it to the main branch (note `printf` rather than `echo`, which doesn't expand `\n` portably):

```bash
printf "id,name\n1,alice\n2,bob\n" > users.csv

lakectl fs upload \
  -s users.csv \
  lakefs://quickstart/main/data/users.csv
```

Commit it:

```bash
lakectl commit lakefs://quickstart/main \
  -m "Add initial users.csv"
```

Create a feature branch off main:

```bash
lakectl branch create \
  lakefs://quickstart/experiment \
  -s lakefs://quickstart/main
```

Modify the file on the new branch:

```bash
printf "id,name\n1,alice\n2,bob\n3,carol\n" > users.csv

lakectl fs upload \
  -s users.csv \
  lakefs://quickstart/experiment/data/users.csv

lakectl commit lakefs://quickstart/experiment \
  -m "Add carol"
```

Diff against main:

```bash
lakectl diff \
  lakefs://quickstart/main \
  lakefs://quickstart/experiment
```

Merge the feature branch back:

```bash
lakectl merge \
  lakefs://quickstart/experiment \
  lakefs://quickstart/main
```

Delete the feature branch once it's merged:

```bash
lakectl branch delete lakefs://quickstart/experiment
```
Deleting a branch only removes the branch reference — the underlying commits, and any tags pointing to them, are preserved. Add `-y` to skip the confirmation prompt (useful in scripts).
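The whole branch → edit → commit → merge loop above is easy to automate. A minimal sketch using `subprocess` (assumes `lakectl` is on your PATH and configured; the wrapper function is my own):

```python
# Thin wrapper for scripting lakectl — a sketch, not an official client.
import subprocess

def lakectl(*args: str, binary: str = "lakectl") -> str:
    """Run a lakectl subcommand and return its stdout, raising on failure."""
    result = subprocess.run(
        [binary, *args], capture_output=True, text=True, check=True
    )
    return result.stdout

# Example flow (mirrors the commands above; requires a running lakeFS):
# lakectl("branch", "create", "lakefs://quickstart/experiment",
#         "-s", "lakefs://quickstart/main")
# lakectl("commit", "lakefs://quickstart/experiment", "-m", "Add carol")
# lakectl("merge", "lakefs://quickstart/experiment", "lakefs://quickstart/main")
# lakectl("branch", "delete", "lakefs://quickstart/experiment", "-y")
```

For anything beyond quick scripts, the Python SDK in section 5 is the better interface; this is handy when you already live in shell-driven pipelines.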
Useful commands to know
```bash
# List repos, branches, objects
lakectl repo list
lakectl branch list lakefs://quickstart
lakectl fs ls lakefs://quickstart/main/

# See commit history
lakectl log lakefs://quickstart/main

# Download a file
lakectl fs download \
  lakefs://quickstart/main/data/users.csv \
  ./users.csv

# Roll a branch back to a previous commit
lakectl branch reset lakefs://quickstart/experiment \
  --commit <commit-id>

# Tag a commit (useful for reproducible dataset versions)
lakectl tag create \
  lakefs://quickstart/v1.0 \
  lakefs://quickstart/main
```
4. Work with lakeFS Files Locally (lakectl local)
To edit or explore lakeFS files with regular local tools — a text editor, grep, pandas, a Jupyter notebook — without writing upload/download code, use `lakectl local`. It links a local directory to a lakeFS path and gives you `clone` / `pull` / `commit` / `status` commands with Git-like ergonomics.
This downloads the files (it’s not a lazy mount), so it’s best for small-to-medium datasets you actually want to work with.
Clone a branch to a local directory
```bash
lakectl local clone \
  lakefs://quickstart/main/data/ \
  ./data
```
This creates `./data/` mirroring `lakefs://quickstart/main/data/` and records the link in `./data/.lakefs_ref.yaml` so subsequent commands know which repo, branch, and path the directory belongs to.
Pull, edit, status, commit
```bash
# Pull any new/changed files from lakeFS
lakectl local pull ./data

# Edit files however you like
printf "id,name\n1,alice\n2,bob\n3,carol\n" > ./data/users.csv

# See what's changed vs lakeFS
lakectl local status ./data

# Commit the local changes back to the branch
lakectl local commit ./data \
  -m "Update users.csv from local edits"
```
Note: unlike Git, `lakectl local commit` is not local-only. It uploads the changed files to lakeFS and creates the commit on the tracked branch in a single step — there's no separate `push`. As soon as the command returns, the changes are live on the server and another machine can `lakectl local pull` them.
Because the link is branch-scoped, you get a natural workflow: create a branch, clone it, make changes, commit, then `lakectl merge` back to main — all without touching upload/download APIs directly.
Switch to a different ref
If you want to work against a different branch, tag, or commit, re-clone into a separate directory or cd elsewhere and lakectl local clone again. One local directory maps to one lakeFS ref.
5. Using Python
The official, most popular lakeFS Python package is simply `lakefs` — maintained by Treeverse (the lakeFS team) and the one the docs recommend. It wraps the REST API with an ergonomic object model (repos, branches, objects) and is the only package you need for reads, writes, and the branch/commit/merge control flow.
```bash
pip install lakefs
```
Runnable example: the snippets below are also bundled as one end-to-end script — `lakefs_quickstart.py`. It reads credentials from env vars so you don't have to paste them in:

```bash
export LAKEFS_HOST="http://localhost:8000"
export LAKEFS_ACCESS_KEY="AKIA..."
export LAKEFS_SECRET_KEY="..."
export LAKEFS_REPO="quickstart"

python lakefs_quickstart.py
```
Connect
```python
import lakefs
from lakefs.client import Client

client = Client(
    host="http://localhost:8000",
    username="AKIA...",  # access key id
    password="...",      # secret access key
)

repo = lakefs.Repository("quickstart", client=client)
```
Read and write objects
```python
main = repo.branch("main")

# Write a file, then commit it on the branch.
# The write alone only stages the object — you have to commit to persist it.
with main.object("data/from_sdk.txt").writer(mode="wb") as f:
    f.write(b"hello from python\n")
main.commit(message="Add data/from_sdk.txt")

# Read a file
with main.object("data/users.csv").reader(mode="r") as f:
    print(f.read())

# List objects under a prefix
for obj in main.objects(prefix="data/"):
    print(obj.path, obj.size_bytes)
```
Branch, commit, merge
```python
# Create a branch off main
exp = repo.branch("python-experiment").create(source_reference="main")

# Write something on the branch
with exp.object("data/from_sdk.txt").writer(mode="wb") as f:
    f.write(b"edited on the experiment branch\n")

# Commit and merge back
exp.commit(message="Update from python SDK")
exp.merge_into("main")
```
You can swap any branch name for a tag or commit ID — `repo.ref("v1.0")` or `repo.ref("<commit-sha>")` — to read a frozen version of the data.
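One practical use of frozen refs: record exactly which ref a pipeline run read from, so the run can be reproduced later. A minimal sketch (the manifest format here is my own invention, not a lakeFS feature):

```python
# Record/replay which lakeFS ref a run consumed — a sketch, not a lakeFS API.
import json

def write_run_manifest(path: str, repo: str, ref: str, objects: list[str]) -> None:
    """Save the exact lakeFS ref (e.g. a commit ID) a run read from."""
    with open(path, "w") as f:
        json.dump({"repo": repo, "ref": ref, "objects": objects}, f, indent=2)

def read_run_manifest(path: str) -> dict:
    with open(path) as f:
        return json.load(f)

# A later rerun can rebuild the same view with repo.ref(manifest["ref"])
# and read byte-identical input data.
```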
Download the full script
All of the above is bundled as a single runnable file — end-to-end (connect → write → read → list → branch → commit → merge → verify):
↓ Download lakefs_quickstart.py
```
$ python lakefs_quickstart.py
writing data/from_sdk.txt on main...
reading data/from_sdk.txt on main...
contents: hello from python
listing data/ on main...
data/from_sdk.txt (18 bytes)
data/users.csv (37 bytes)
creating branch python-experiment-1776452154 off main...
editing data/from_sdk.txt on python-experiment-1776452154...
committing...
merging python-experiment-1776452154 -> main...
reading data/from_sdk.txt on main after merge...
contents: edited on the experiment branch
done.
```
6. Querying Parquet Files with SQL (in the Web UI)
Since lakeFS v0.88.0 the web UI has an embedded SQL console — DuckDB compiled to WebAssembly, running right in your browser. Open any parquet or CSV object in the UI and you get a query pane where you can DESCRIBE the schema, SELECT samples, JOIN across files on the same branch, and even compare the same path across two branches.
No installation, no credentials setup, nothing to configure — it works out of the box. And because DuckDB is running in the browser (not on the lakeFS server), your queries stay client-side; the server only serves the object bytes.
Next Steps
- Hooks: lakeFS supports pre-commit and pre-merge hooks (run a Lua script or an external webhook to enforce schema checks, null-rate thresholds, etc.).
- Garbage collection: when you delete branches or overwrite objects, the underlying GCS data isn't removed until you run `lakectl gc`. Schedule this on a cadence that matches your retention policy.
- Auth: the open-source version only has basic access keys — the admin user you created is fine for local, and you can issue additional access keys per user from the UI. Fine-grained RBAC, SSO, and IAM integrations are not in open source — they were moved to lakeFS Enterprise / Cloud (delivered via a closed-source sidecar called "Fluffy"). Pricing for Enterprise isn't published publicly; you have to contact Treeverse sales. For most solo/small-team self-hosted setups, basic auth behind a VPN or private network is the pragmatic path.
- Spark / DuckDB / Trino: all work via the S3 gateway with the same branch-as-path trick shown above.
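To give a flavor of the hooks mentioned above: actions are YAML files committed under `_lakefs_actions/` in the repository itself. A rough sketch of a pre-merge webhook — the `id` and URL are placeholders, and you should check the lakeFS hooks documentation for the exact schema:

```yaml
# _lakefs_actions/pre_merge_check.yaml — a sketch, not verified against a server
name: pre-merge schema check
on:
  pre-merge:
    branches:
      - main
hooks:
  - id: schema_check
    type: webhook
    properties:
      url: "http://my-validator:8080/check"   # placeholder validator endpoint
```

If the webhook returns a failure, the merge into main is rejected — the data-lake equivalent of a failing CI gate.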