DVC vs LakeFS for Data Versioning: When You Need What
Both version data, in very different ways. DVC for ML datasets and experiments; LakeFS for production data lakes. Picking the wrong one means friction.
“Version control for data” sounds like one problem with one solution. It isn’t. DVC and LakeFS both fit that pitch but solve very different problems for very different audiences. DVC is for ML practitioners who want git-like workflows for their training datasets. LakeFS is for data engineering teams who want git-like semantics on top of production data lakes.
We’ve deployed both in client work — DVC for hospital imaging ML experiments, LakeFS for a logistics company’s data platform with strict change-management requirements. Here’s how the decision actually plays out.
The thirty-second framing
- DVC (Data Version Control) extends git to handle large files and data pipelines. You add datasets to DVC, push to a remote storage backend (S3, GCS, Azure Blob), and commit the .dvc pointer file to git. dvc pull retrieves the actual data.
- LakeFS sits in front of your S3 (or other object store) and gives it git-like semantics: branches, commits, merges, time travel. You read/write to LakeFS as if it were S3; LakeFS manages the versioning underneath.
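The pointer-file idea can be sketched in a few lines of Python. This is a toy illustration, not DVC's implementation: the `.pointer.json` suffix, the `store` directory, and the function names are all assumptions made up for the example, and DVC's real pointer files and cache layout differ in detail. The point is only that the large file lives in a content-addressed store while a small text file carrying its hash is what gets committed to git.

```python
import hashlib
import json
import shutil
import tempfile
from pathlib import Path


def file_md5(path: Path) -> str:
    """Hash a file in chunks so large datasets do not need to fit in memory."""
    digest = hashlib.md5()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def track(data_file: Path, store: Path) -> Path:
    """Copy the data into a content-addressed store and write a small pointer file."""
    checksum = file_md5(data_file)
    store.mkdir(parents=True, exist_ok=True)
    shutil.copyfile(data_file, store / checksum)
    # Hypothetical pointer format: only the hash and the file name.
    pointer = data_file.with_name(data_file.name + ".pointer.json")
    pointer.write_text(json.dumps({"md5": checksum, "path": data_file.name}))
    return pointer


def restore(pointer: Path, store: Path) -> Path:
    """Rebuild the data file from its pointer, roughly what a pull does."""
    meta = json.loads(pointer.read_text())
    target = pointer.with_name(meta["path"])
    shutil.copyfile(store / meta["md5"], target)
    return target


# Usage: track a file, delete the local copy, then restore it from the store.
workdir = Path(tempfile.mkdtemp())
dataset = workdir / "images.bin"
dataset.write_bytes(b"pixels" * 1000)
pointer_file = track(dataset, workdir / "store")
dataset.unlink()
restored = restore(pointer_file, workdir / "store")
```

Only the pointer file would go into git here; the `store` directory stands in for the remote storage backend.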
DVC is “git for ML data.” LakeFS is “git for the data lake.”
Different audiences, different workflows, mostly different problems.
What’s actually different
| Dimension | DVC | LakeFS |
|---|---|---|
| Audience | ML practitioners | Data engineering teams |
| Scope | Per-project, per-repo | Whole data lake |
| Storage backend | S3, GCS, Azure, SSH | S3, GCS, Azure |
| Workflow | git + dvc commands | S3 API with versioning extensions |
| Branching | Yes (via git) | Yes (native) |
| Merging | Yes (via git) | Yes (with conflict resolution) |
| Pipeline definition | DVC pipelines (dvc.yaml) | None — data, not pipelines |
| Experiment tracking | Yes (dvc exp) | No |
| Production data infra | Lightweight | Heavier |
| Atomic dataset versions | Yes | Yes |
| Multi-table consistency | Per-project | Cross-table within a branch |
| Operates over | Files (any) | Objects in object store |
| Compute integration | Spark, Airflow, etc. via files | Spark, Trino, Athena, etc. via S3 protocol |
Where DVC wins
ML experiment workflow. DVC pipelines + experiment tracking integrate cleanly with the ML iteration loop. You version data, pipeline steps, model artifacts, and metrics — all in one tool, all tied to git commits.
# dvc.yaml
stages:
  preprocess:
    cmd: python preprocess.py
    deps: [data/raw, preprocess.py]
    outs: [data/processed]
  train:
    cmd: python train.py
    deps: [data/processed, train.py]
    outs: [models/model.pt]
    metrics: [metrics.json]
dvc repro runs only the stages whose inputs changed. dvc exp run runs an experiment and tracks its hyperparameters and metrics. Compare experiments with dvc exp show.
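To make "runs only the stages whose inputs changed" concrete, here is a minimal sketch of the underlying idea: fingerprint a stage's dependencies and skip the stage when the fingerprint matches the one recorded last time. The function names and the in-memory `lock` dict are assumptions for illustration; DVC keeps its own lock file and also handles directories, which this sketch does not.

```python
import hashlib
import tempfile
from pathlib import Path


def fingerprint(paths):
    """Hash the names and contents of all dependency files into one digest."""
    digest = hashlib.sha256()
    for path in sorted(paths):
        digest.update(path.name.encode())
        digest.update(path.read_bytes())
    return digest.hexdigest()


def run_if_changed(stage, deps, lock, action):
    """Run `action` only when the stage's dependencies differ from the recorded state."""
    current = fingerprint(deps)
    if lock.get(stage) == current:
        return False  # inputs unchanged, skip the stage
    action()
    lock[stage] = current  # remember what this run was built from
    return True


# Usage: the second call is skipped because raw.csv has not changed.
workdir = Path(tempfile.mkdtemp())
raw = workdir / "raw.csv"
raw.write_text("a,b\n1,2\n")
lock = {}
first = run_if_changed("preprocess", [raw], lock, lambda: None)
second = run_if_changed("preprocess", [raw], lock, lambda: None)
print(first, second)  # True False
```

Note that this only detects changed inputs. It says nothing about whether the stage itself produces the same output twice.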
No infrastructure to deploy. DVC is a CLI tool. No server, no service to operate. Your team installs it and uses it.
Git-native. Every data version is tied to a git commit. Code reviews see what data changed alongside what code changed.
Lightweight cost. Just the storage backend (S3, etc.). No per-user pricing, no managed service.
Where DVC hurts:
- Doesn’t scale well to multi-team data engineering (it’s per-repo).
- The file-based mental model is a poor fit for tables in a warehouse.
- Pipeline reproducibility relies on stages being deterministic, which they often aren’t.
- Large datasets (TB scale) require careful caching and storage configuration.
Where LakeFS wins
Data lake versioning at scale. When your “data” is petabytes in S3 across many tables, LakeFS gives you git semantics over the whole thing.
Cross-table atomic changes. “I’m updating these 5 Iceberg tables together as part of one logical change” — commit on a branch, merge atomically. Without LakeFS, this is multiple S3 PUTs with no way to roll back if step 3 fails.
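A toy model shows why a branch makes the multi-table update atomic. This is not the LakeFS API: `ToyRepo` and its methods are hypothetical, and the merge is simplified to a fast-forward that replaces the destination outright. The idea it illustrates is that readers of main see either all of the new table versions or none of them, because the switch is a single pointer update.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class ToyRepo:
    """Each branch name points at a {table: version} snapshot."""

    branches: dict[str, dict[str, str]] = field(default_factory=dict)

    def create_branch(self, name: str, source: str) -> None:
        # A new branch starts as a copy of the source branch's snapshot.
        self.branches[name] = dict(self.branches[source])

    def write(self, branch: str, table: str, version: str) -> None:
        # Build a new snapshot instead of mutating the old one in place.
        snapshot = dict(self.branches[branch])
        snapshot[table] = version
        self.branches[branch] = snapshot

    def merge(self, source: str, dest: str) -> None:
        # Simplified fast-forward: one assignment publishes every table at once.
        self.branches[dest] = dict(self.branches[source])


# Usage: update two tables on a branch, then publish them together.
repo = ToyRepo({"main": {"orders": "v1", "customers": "v1"}})
repo.create_branch("etl", "main")
repo.write("etl", "orders", "v2")
main_mid_update = dict(repo.branches["main"])  # main is untouched so far
repo.write("etl", "customers", "v2")
repo.merge("etl", "main")
```

If the job fails halfway through, the branch is simply discarded and main never sees the partial state.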
Branch-per-environment. Production reads from main branch. Pre-prod reads from staging branch. Promote with a merge. Real change management at the data layer.
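Compute-engine agnostic. Spark, Athena, Trino, BigQuery (via external tables), and DuckDB can all read and write LakeFS through its S3-compatible API. Application logic stays the same; typically only the endpoint and the paths change, because the repository takes the place of the bucket and the branch becomes the first path segment.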
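Through the S3-compatible API, the repository stands in for the bucket and the branch is the first segment of the key, so switching environments is a matter of changing one path component. The helper below is a hypothetical convenience function, and the repository and table names are made up for the example.

```python
def lakefs_uri(repo: str, ref: str, path: str, scheme: str = "s3a") -> str:
    """Build an object URI where the repo acts as the bucket and the ref leads the key."""
    return f"{scheme}://{repo}/{ref}/{path.lstrip('/')}"


# Usage: the same table, read from two different branches.
prod_orders = lakefs_uri("analytics", "main", "tables/orders/")
staging_orders = lakefs_uri("analytics", "staging", "tables/orders/")
print(prod_orders)     # s3a://analytics/main/tables/orders/
print(staging_orders)  # s3a://analytics/staging/tables/orders/
```

A job that takes the branch name as a parameter can then run unchanged against staging and production.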
Time travel for compliance. “What did this table look like on April 14?” Answer instantly. For audited industries (banking, healthcare), this matters.
Where LakeFS hurts:
- Real infrastructure: LakeFS server + Postgres for metadata + your object store.
- Adds a layer in front of your S3 — one more thing that can break.
- The mental model “S3 with branches” takes adjustment for teams new to it.
- Smaller community than DVC. Fewer integration tutorials.
A third option: just use Iceberg/Delta with snapshots
If your data is already in Apache Iceberg or Delta Lake format, you have time travel and “atomic write” semantics built in. No additional tool needed.
- Iceberg snapshots: every commit creates a new snapshot. Time-travel queries use a clause such as TIMESTAMP AS OF 'X' (Spark SQL syntax; other engines differ slightly).
- Delta Lake time travel: SELECT * FROM table TIMESTAMP AS OF '2026-01-01'.
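Conceptually, a time-travel query resolves a timestamp to the latest snapshot committed at or before it. The sketch below shows that lookup over a hypothetical snapshot log; the function name and the snapshot IDs are invented for illustration, and real table formats store this history in their own metadata files.

```python
from __future__ import annotations

from bisect import bisect_right
from datetime import datetime


def snapshot_as_of(snapshots: list[tuple[datetime, int]], when: datetime) -> int | None:
    """Return the ID of the latest snapshot committed at or before `when`.

    `snapshots` must be sorted by commit time, oldest first.
    """
    commit_times = [ts for ts, _ in snapshots]
    idx = bisect_right(commit_times, when)
    return snapshots[idx - 1][1] if idx else None


# Usage: three commits, queried at a point between the second and third.
history = [
    (datetime(2026, 1, 1), 101),
    (datetime(2026, 2, 1), 102),
    (datetime(2026, 3, 1), 103),
]
print(snapshot_as_of(history, datetime(2026, 2, 15)))  # 102
```

A timestamp earlier than the first commit returns `None`, which corresponds to the table not existing yet.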
For many production data lakes, the table format’s built-in versioning is enough. LakeFS adds value when you need git-like branching across MULTIPLE tables (not just within one table).
When you need DVC
- You’re doing ML and want to version training datasets alongside training code
- Your training data fits comfortably on disk (or in S3 with reasonable retrieval cost)
- Your team is small enough that git-per-project workflows work
- You want experiment tracking integrated with data versioning (alternative: MLflow / W&B for experiments, raw S3 paths for data)
When you need LakeFS
- Your data engineering team works on shared tables and needs to coordinate changes
- You want branch-based environments at the data layer
- You’re not yet on Iceberg/Delta and adopting LakeFS is easier than migrating table format
- You need cross-table atomic commits
When you need neither
- One team, one model, small dataset → just version with timestamps in S3 paths
- All your data is in Iceberg/Delta → use the table format’s native versioning
- You don’t actually have train/test data version problems in production → don’t pre-optimize
For most projects we deploy, neither DVC nor LakeFS is needed. The discipline of “every dataset has a version and a date in its path,” combined with table format snapshots, covers roughly 80% of the problem in our experience.
Patterns that work without either tool
For teams that don’t want to deploy either:
- Date-partitioned S3 paths: s3://bucket/dataset=customers/version=2026-05-26/.... Immutable, queryable, simple.
- Iceberg or Delta tables for warehouses: built-in time travel, atomic writes, schema evolution.
- dbt snapshots for tracking dimension changes (see our dbt advanced patterns piece).
- MLflow for experiment + model artifact tracking (see our MLflow vs W&B piece).
Combined, these handle most “data versioning” needs without adding DVC or LakeFS.
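The date-partitioned path convention needs very little code. The two helpers below are a sketch with hypothetical names: one builds a versioned prefix in the layout shown above, the other picks the newest version from a list of prefixes, relying on the fact that ISO dates sort correctly as plain strings.

```python
from datetime import date


def versioned_prefix(bucket: str, dataset: str, version: date) -> str:
    """Build an immutable, date-versioned prefix for a dataset."""
    return f"s3://{bucket}/dataset={dataset}/version={version.isoformat()}/"


def latest_version(prefixes: list[str]) -> str:
    """Pick the newest prefix; ISO dates compare correctly as strings."""
    return max(prefixes, key=lambda p: p.split("version=")[1])


# Usage: write each load to a new prefix, read from the newest one.
may = versioned_prefix("bucket", "customers", date(2026, 5, 26))
april = versioned_prefix("bucket", "customers", date(2026, 4, 30))
print(may)                            # s3://bucket/dataset=customers/version=2026-05-26/
print(latest_version([april, may]))   # s3://bucket/dataset=customers/version=2026-05-26/
```

The convention only works if writers never overwrite an existing version prefix, so that rule belongs in the pipeline code or in bucket policy.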
What we deploy by default
For client work:
- Most projects: no DVC, no LakeFS. Iceberg/Delta + dbt snapshots + MLflow + date-partitioned S3 paths handle 80% of needs.
- ML-heavy projects with messy datasets (image classification, NLP with custom corpora): DVC for dataset + experiment versioning. ✓ Used for hospital imaging ML where datasets are non-tabular.
- Large data engineering platforms with strict change management: LakeFS, especially for orgs that explicitly want git-like data branches.
We don’t usually run both. They overlap weirdly in the middle and the cognitive cost of “is this a DVC concern or a LakeFS concern?” outweighs the value.
The pattern of patterns
Data versioning tools solve a real problem for a specific set of teams. They’re not “best practices for everyone” — they’re targeted solutions for targeted needs.
The teams that ship data platforms well aren’t the ones with the fanciest versioning stack. They’re the ones who picked the simplest pattern (date-partitioned paths, table format snapshots) that covers their actual workflow and added complexity only when the pain was real.
Data versioning is a real problem with multiple right answers. If you’re building a data platform and want a sanity check on the versioning approach, our data engineering team has shipped with and without these tools. Tell us about the workload.