Getting Started with lakeFS: Docker Compose, CLI, and Python
What is lakeFS?
lakeFS is an open-source data version control system that sits on top of an object store (S3, GCS, Azure Blob) and gives you Git-like semantics — branches, commits, merges, diffs — over your data lake. You point your readers and writers at the lakeFS endpoint (S3-compatible API) instead of the raw bucket, and lakeFS keeps track of versioned references without copying the underlying objects.
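That "branch as part of the path" idea is the core trick: over the S3 gateway, objects are addressed as `s3://<repo>/<ref>/<path>`, so switching a reader between branches, tags, or pinned commits is just a path change. A minimal sketch of the addressing scheme (`lakefs_uri` is a hypothetical helper of mine, not part of any SDK):

```python
# Sketch of lakeFS branch-as-path addressing over the S3-compatible gateway.
# lakefs_uri() is a hypothetical helper, not part of the lakeFS SDK.
def lakefs_uri(repo: str, ref: str, path: str) -> str:
    """Build an S3-gateway URI: the ref (branch, tag, or commit) is part of the key."""
    return f"s3://{repo}/{ref}/{path}"

# The same logical file, seen through two different refs:
prod = lakefs_uri("quickstart", "main", "data/users.csv")
test = lakefs_uri("quickstart", "experiment", "data/users.csv")
print(prod)  # s3://quickstart/main/data/users.csv
print(test)  # s3://quickstart/experiment/data/users.csv
```

With boto3, Spark, or DuckDB you would point the client's S3 endpoint URL at lakeFS (e.g. `http://localhost:8000`) and use these URIs as if they were plain bucket paths.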
This post walks through:
- Running lakeFS locally with Docker Compose against a GCS bucket backend.
- A simple first run to set up the admin user and a repository.
- Using the `lakectl` CLI to branch, commit, and merge.
- Working with lakeFS files locally using `lakectl local` (Git-like clone/pull/commit).
- Reading and writing from Python with the official `lakefs` SDK.
- Running SQL queries on parquet files directly from the web UI's embedded SQL console.
1. Install: Docker Compose with a GCS Bucket Backend
Prerequisites
- Docker and Docker Compose installed
- A GCP project with a GCS bucket you can use as the storage namespace (e.g., `gs://my-lakefs-bucket`)
- A GCP service account with Storage Object Admin on that bucket, and a downloaded JSON key
Create the bucket
```bash
# Pick a unique bucket name
export BUCKET=my-lakefs-bucket
export GCP_PROJECT=my-project
export REGION=us-central1

gsutil mb -p $GCP_PROJECT -l $REGION gs://$BUCKET
```
Create a service account key
```bash
export SA=lakefs-local@$GCP_PROJECT.iam.gserviceaccount.com

gcloud iam service-accounts create lakefs-local \
  --project $GCP_PROJECT \
  --display-name "lakeFS local"

gcloud projects add-iam-policy-binding $GCP_PROJECT \
  --member "serviceAccount:$SA" \
  --role "roles/storage.objectAdmin"

# Required for presigned GCS URLs — lakectl and many Python clients rely
# on these so that uploads/downloads stream directly to/from GCS instead of
# through lakeFS. Grants the SA `signBlob` on itself.
gcloud iam service-accounts add-iam-policy-binding $SA \
  --project $GCP_PROJECT \
  --member "serviceAccount:$SA" \
  --role "roles/iam.serviceAccountTokenCreator"

gcloud iam service-accounts keys create ./gcs-key.json \
  --iam-account "$SA"
```
Keep `gcs-key.json` next to your `docker-compose.yml` (and add it to `.gitignore`).

Why the second role binding? Without `iam.serviceAccountTokenCreator` on the SA itself, lakeFS can't sign GCS URLs and any `lakectl fs upload` (or a Python client with presign on) will fail with `get physical address to upload object … giving up after 5 attempt(s)`. If you'd rather not grant signing permission, set `LAKEFS_BLOCKSTORE_GS_DISABLE_PRE_SIGNED_URL: "true"` in the compose env below — all bytes then flow through the lakeFS server instead of directly to GCS.
docker-compose.yml
lakeFS needs two things to run: a metadata database (Postgres is the easy default) and credentials for the object store. Here’s a minimal setup:
```yaml
version: "3.8"

services:
  postgres:
    image: postgres:16
    environment:
      POSTGRES_USER: lakefs
      POSTGRES_PASSWORD: lakefs
      POSTGRES_DB: lakefs
    volumes:
      - pgdata:/var/lib/postgresql/data
    healthcheck:
      test: ["CMD-SHELL", "pg_isready -U lakefs"]
      interval: 5s
      retries: 10

  lakefs:
    image: treeverse/lakefs:latest
    depends_on:
      postgres:
        condition: service_healthy
    ports:
      - "8000:8000"
    environment:
      LAKEFS_DATABASE_TYPE: postgres
      LAKEFS_DATABASE_POSTGRES_CONNECTION_STRING: "postgres://lakefs:lakefs@postgres:5432/lakefs?sslmode=disable"
      LAKEFS_AUTH_ENCRYPT_SECRET_KEY: "change-me-to-a-long-random-string"
      LAKEFS_BLOCKSTORE_TYPE: gs
      LAKEFS_BLOCKSTORE_GS_CREDENTIALS_FILE: /etc/lakefs/gcs-key.json
      LAKEFS_LOGGING_LEVEL: INFO
    volumes:
      - ./gcs-key.json:/etc/lakefs/gcs-key.json:ro

volumes:
  pgdata:
```
A few notes:

- `LAKEFS_BLOCKSTORE_TYPE: gs` tells lakeFS to use Google Cloud Storage.
- The service account key is mounted read-only into the container.
- `LAKEFS_AUTH_ENCRYPT_SECRET_KEY` must be set to something long and random; it encrypts credentials in the metadata DB. If you rotate it you'll invalidate existing credentials.
- For production you'd put the DB on managed Postgres and put lakeFS behind TLS — this compose file is for local exploration.
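If you want a quick way to produce that secret, Python's standard library is enough (any sufficiently long random string works):

```python
# Generate a random value suitable for LAKEFS_AUTH_ENCRYPT_SECRET_KEY.
import secrets

secret = secrets.token_urlsafe(48)  # 48 random bytes, URL-safe base64 encoded
print(secret)
```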
Bring it up:
```bash
docker compose up -d
docker compose logs -f lakefs
```
When the logs show `listen on [::]:8000`, open http://localhost:8000 in your browser.
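If you're scripting the setup rather than watching logs, you can poll lakeFS's healthcheck endpoint (`GET /api/v1/healthcheck`) until the server answers. A small sketch; the helper name is my own:

```python
# wait_for_lakefs.py — poll the lakeFS healthcheck endpoint until the server
# is ready. wait_for_lakefs() is a hypothetical helper, not a lakeFS tool.
import time
import urllib.error
import urllib.request

def wait_for_lakefs(base_url: str = "http://localhost:8000", timeout: float = 60.0) -> bool:
    """Return True once GET {base_url}/api/v1/healthcheck responds with 2xx."""
    deadline = time.time() + timeout
    url = f"{base_url.rstrip('/')}/api/v1/healthcheck"
    while time.time() < deadline:
        try:
            with urllib.request.urlopen(url, timeout=2) as resp:
                if resp.status < 300:
                    return True
        except (urllib.error.URLError, OSError):
            pass  # server not up yet; retry
        time.sleep(1)
    return False
```

Call `wait_for_lakefs()` right after `docker compose up -d` and proceed only when it returns `True`.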
2. Simple Start
First-run setup
The first time you open the lakeFS UI, it prompts you to create an admin user. Pick a username (e.g., admin), then lakeFS prints an access key ID and secret access key. Copy them immediately — the secret is only shown once.
Save them to ~/.lakectl.yaml so the CLI and Python clients can pick them up:
```yaml
# ~/.lakectl.yaml
credentials:
  access_key_id: AKIA...
  secret_access_key: ...
server:
  endpoint_url: http://localhost:8000
```
Also export them as environment variables for tools that expect AWS-style credentials:
```bash
export AWS_ACCESS_KEY_ID=AKIA...
export AWS_SECRET_ACCESS_KEY=...
export LAKEFS_ENDPOINT=http://localhost:8000
```
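Later Python scripts can pick these up instead of hard-coding secrets. A small sketch matching the exports above (the helper name is my own convention):

```python
# Assemble lakeFS client settings from environment variables.
# lakefs_client_kwargs() is a hypothetical helper, not part of any SDK.
import os

def lakefs_client_kwargs() -> dict:
    """Read lakeFS connection settings from the environment,
    falling back to the local-compose default endpoint."""
    return {
        "host": os.environ.get("LAKEFS_ENDPOINT", "http://localhost:8000"),
        "username": os.environ["AWS_ACCESS_KEY_ID"],      # access key id
        "password": os.environ["AWS_SECRET_ACCESS_KEY"],  # secret access key
    }
```

The resulting dict can be splatted straight into the SDK client shown later: `Client(**lakefs_client_kwargs())`.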
Create your first repository
In the UI: Create Repository → name it quickstart, storage namespace gs://my-lakefs-bucket/quickstart, default branch main.
Equivalently from the CLI (see next section):
```bash
lakectl repo create lakefs://quickstart gs://my-lakefs-bucket/quickstart
```
The storage namespace is the prefix inside the bucket where lakeFS will store data for this repo. One bucket can host many repositories as long as each has its own prefix.
3. Using the CLI (lakectl)
Install lakectl
macOS:

```bash
brew tap treeverse/lakefs
brew install lakefs
```

Linux (download the binary from the releases page):

```bash
curl -L https://github.com/treeverse/lakeFS/releases/latest/download/lakeFS_Linux_x86_64.tar.gz \
  | tar xz
sudo mv lakectl /usr/local/bin/
```

Verify:

```bash
lakectl --version
lakectl repo list
```
Upload, branch, commit, merge
Create a local file and upload it to the main branch (note `printf` rather than `echo`, which doesn't expand `\n` portably):

```bash
printf "id,name\n1,alice\n2,bob\n" > users.csv

lakectl fs upload \
  -s users.csv \
  lakefs://quickstart/main/data/users.csv
```

Commit it:

```bash
lakectl commit lakefs://quickstart/main \
  -m "Add initial users.csv"
```

Create a feature branch off main:

```bash
lakectl branch create \
  lakefs://quickstart/experiment \
  -s lakefs://quickstart/main
```

Modify the file on the new branch:

```bash
printf "id,name\n1,alice\n2,bob\n3,carol\n" > users.csv

lakectl fs upload \
  -s users.csv \
  lakefs://quickstart/experiment/data/users.csv

lakectl commit lakefs://quickstart/experiment \
  -m "Add carol"
```

Diff against main:

```bash
lakectl diff \
  lakefs://quickstart/main \
  lakefs://quickstart/experiment
```

Merge the feature branch back:

```bash
lakectl merge \
  lakefs://quickstart/experiment \
  lakefs://quickstart/main
```

Delete the feature branch once it's merged:

```bash
lakectl branch delete lakefs://quickstart/experiment
```
Deleting a branch only removes the branch reference — the underlying commits, and any tags pointing to them, are preserved. Add `-y` to skip the confirmation prompt (useful in scripts).
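The whole branch → edit → commit → merge loop above is easy to automate. A minimal sketch using `subprocess` (assumes `lakectl` is on your PATH and configured; the wrapper function is my own):

```python
# Thin wrapper for scripting lakectl — a sketch, not an official client.
import subprocess

def lakectl(*args: str, binary: str = "lakectl") -> str:
    """Run a lakectl subcommand and return its stdout, raising on failure."""
    result = subprocess.run(
        [binary, *args], capture_output=True, text=True, check=True
    )
    return result.stdout

# Example flow (mirrors the commands above; requires a running lakeFS):
# lakectl("branch", "create", "lakefs://quickstart/experiment",
#         "-s", "lakefs://quickstart/main")
# lakectl("commit", "lakefs://quickstart/experiment", "-m", "Add carol")
# lakectl("merge", "lakefs://quickstart/experiment", "lakefs://quickstart/main")
# lakectl("branch", "delete", "lakefs://quickstart/experiment", "-y")
```

For anything beyond quick scripts, the Python SDK in section 5 is the better interface; this is handy when you already live in shell-driven pipelines.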
Useful commands to know
```bash
# List repos, branches, objects
lakectl repo list
lakectl branch list lakefs://quickstart
lakectl fs ls lakefs://quickstart/main/

# See commit history
lakectl log lakefs://quickstart/main

# Download a file
lakectl fs download \
  lakefs://quickstart/main/data/users.csv \
  ./users.csv

# Roll a branch back to a previous commit
lakectl branch reset lakefs://quickstart/experiment \
  --commit <commit-id>

# Tag a commit (useful for reproducible dataset versions)
lakectl tag create \
  lakefs://quickstart/v1.0 \
  lakefs://quickstart/main
```
4. Work with lakeFS Files Locally (lakectl local)
To edit or explore lakeFS files with regular local tools — a text editor, grep, pandas, a Jupyter notebook — without writing upload/download code, use `lakectl local`. It links a local directory to a lakeFS path and gives you `clone` / `pull` / `commit` / `status` commands with Git-like ergonomics.
This downloads the files (it’s not a lazy mount), so it’s best for small-to-medium datasets you actually want to work with.
Clone a branch to a local directory
```bash
lakectl local clone \
  lakefs://quickstart/main/data/ \
  ./data
```
This creates `./data/` mirroring `lakefs://quickstart/main/data/` and records the link in `./data/.lakefs_ref.yaml` so subsequent commands know which repo, branch, and path the directory belongs to.
Pull, edit, status, commit
```bash
# Pull any new/changed files from lakeFS
lakectl local pull ./data

# Edit files however you like
printf "id,name\n1,alice\n2,bob\n3,carol\n" > ./data/users.csv

# See what's changed vs lakeFS
lakectl local status ./data

# Commit the local changes back to the branch
lakectl local commit ./data \
  -m "Update users.csv from local edits"
```
Note: unlike Git, `lakectl local commit` is not local-only. It uploads the changed files to lakeFS and creates the commit on the tracked branch in a single step — there's no separate `push`. As soon as the command returns, the changes are live on the server and another machine can `lakectl local pull` them.
Because the link is branch-scoped, you get a natural workflow: create a branch, clone it, make changes, commit, then `lakectl merge` back to main — all without touching upload/download APIs directly.
Switch to a different ref
If you want to work against a different branch, tag, or commit, re-clone into a separate directory or cd elsewhere and lakectl local clone again. One local directory maps to one lakeFS ref.
5. Using Python
The official, most popular lakeFS Python package is simply `lakefs` — maintained by Treeverse (the lakeFS team) and the one the docs recommend. It wraps the REST API with an ergonomic object model (repos, branches, objects) and is the only package you need for reads, writes, and the branch/commit/merge control flow.
```bash
pip install lakefs
```
Runnable example: the snippets below are also bundled as one end-to-end script — `lakefs_quickstart.py`. It reads credentials from env vars so you don't have to paste them in:

```bash
export LAKEFS_HOST="http://localhost:8000"
export LAKEFS_ACCESS_KEY="AKIA..."
export LAKEFS_SECRET_KEY="..."
export LAKEFS_REPO="quickstart"

python lakefs_quickstart.py
```
Connect
```python
import lakefs
from lakefs.client import Client

client = Client(
    host="http://localhost:8000",
    username="AKIA...",  # access key id
    password="...",      # secret access key
)

repo = lakefs.Repository("quickstart", client=client)
```
Read and write objects
```python
main = repo.branch("main")

# Write a file, then commit it on the branch.
# The write alone only stages the object — you have to commit to persist it.
with main.object("data/from_sdk.txt").writer(mode="wb") as f:
    f.write(b"hello from python\n")
main.commit(message="Add data/from_sdk.txt")

# Read a file
with main.object("data/users.csv").reader(mode="r") as f:
    print(f.read())

# List objects under a prefix
for obj in main.objects(prefix="data/"):
    print(obj.path, obj.size_bytes)
```
Branch, commit, merge
```python
# Create a branch off main
exp = repo.branch("python-experiment").create(source_reference="main")

# Write something on the branch
with exp.object("data/from_sdk.txt").writer(mode="wb") as f:
    f.write(b"edited on the experiment branch\n")

# Commit and merge back
exp.commit(message="Update from python SDK")
exp.merge_into("main")
```
You can swap any branch name for a tag or commit ID — `repo.ref("v1.0")` or `repo.ref("<commit-sha>")` — to read a frozen version of the data.
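One practical use of frozen refs: record exactly which ref a pipeline run read from, so the run can be reproduced later. A minimal sketch (the manifest format here is my own invention, not a lakeFS feature):

```python
# Record/replay which lakeFS ref a run consumed — a sketch, not a lakeFS API.
import json

def write_run_manifest(path: str, repo: str, ref: str, objects: list[str]) -> None:
    """Save the exact lakeFS ref (e.g. a commit ID) a run read from."""
    with open(path, "w") as f:
        json.dump({"repo": repo, "ref": ref, "objects": objects}, f, indent=2)

def read_run_manifest(path: str) -> dict:
    with open(path) as f:
        return json.load(f)

# A later rerun can rebuild the same view with repo.ref(manifest["ref"])
# and read byte-identical input data.
```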
Download the full script
All of the above is bundled as a single runnable file — end-to-end (connect → write → read → list → branch → commit → merge → verify):
↓ Download lakefs_quickstart.py
```
$ python lakefs_quickstart.py
writing data/from_sdk.txt on main...
reading data/from_sdk.txt on main...
contents: hello from python
listing data/ on main...
data/from_sdk.txt (18 bytes)
data/users.csv (37 bytes)
creating branch python-experiment-1776452154 off main...
editing data/from_sdk.txt on python-experiment-1776452154...
committing...
merging python-experiment-1776452154 -> main...
reading data/from_sdk.txt on main after merge...
contents: edited on the experiment branch
done.
```
6. Querying Parquet Files with SQL (in the Web UI)
Since lakeFS v0.88.0 the web UI has an embedded SQL console — DuckDB compiled to WebAssembly, running right in your browser. Open any parquet or CSV object in the UI and you get a query pane where you can DESCRIBE the schema, SELECT samples, JOIN across files on the same branch, and even compare the same path across two branches.
No installation, no credentials setup, nothing to configure — it works out of the box. And because DuckDB is running in the browser (not on the lakeFS server), your queries stay client-side; the server only serves the object bytes.
Next Steps
- Hooks: lakeFS supports pre-commit and pre-merge hooks (run a Lua script or an external webhook to enforce schema checks, null-rate thresholds, etc.).
- Garbage collection: when you delete branches or overwrite objects, the underlying GCS data isn't removed until you run `lakectl gc`. Schedule this on a cadence that matches your retention policy.
- Auth: the open-source version only has basic access keys — the admin user you created is fine for local, and you can issue additional access keys per user from the UI. Fine-grained RBAC, SSO, and IAM integrations are not in open source — they were moved to lakeFS Enterprise / Cloud (delivered via a closed-source sidecar called "Fluffy"). Pricing for Enterprise isn't published publicly; you have to contact Treeverse sales. For most solo/small-team self-hosted setups, basic auth behind a VPN or private network is the pragmatic path.
- Spark / DuckDB / Trino: all work via the S3 gateway with the same branch-as-path trick shown above.
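To give a flavor of the hooks mentioned above: actions are YAML files committed under `_lakefs_actions/` in the repository itself. A rough sketch of a pre-merge webhook — the `id` and URL are placeholders, and you should check the lakeFS hooks documentation for the exact schema:

```yaml
# _lakefs_actions/pre_merge_check.yaml — a sketch, not verified against a server
name: pre-merge schema check
on:
  pre-merge:
    branches:
      - main
hooks:
  - id: schema_check
    type: webhook
    properties:
      url: "http://my-validator:8080/check"   # placeholder validator endpoint
```

If the webhook returns a failure, the merge into main is rejected — the data-lake equivalent of a failing CI gate.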