Getting Started with dbt Seeds in Databricks: A Beginner-Friendly Guide

Posted by Sumit Kumar

If you’ve started your dbt journey and successfully connected dbt Core with Databricks, congratulations! 🎉

The next feature you should learn is dbt Seeds. Seeds are one of the simplest yet most powerful features in dbt, especially when working with reference data, lookup tables, and demo datasets.

In this blog, we’ll explore what dbt Seeds are, why they are used, and how to implement them in Databricks with a practical example.


What Are dbt Seeds?

A dbt Seed is a CSV file that dbt can load directly into your data warehouse as a table.

Instead of manually creating and maintaining small tables in Databricks, you can store the data as CSV files within your dbt project and let dbt manage them.

Think of Seeds as:

“Version-controlled tables created from CSV files.”

Because the CSV files are stored inside your project, they can be tracked through Git, reviewed during code reviews, and deployed consistently across environments.


Why Use dbt Seeds?

dbt Seeds are ideal for small datasets that do not change frequently.

Some common use cases include:

Reference Data

Examples:

  • Country codes
  • Currency mappings
  • Department lists
  • Product categories

Example:

country_code   country_name
IN             India
US             United States
UK             United Kingdom

Lookup Tables

Many organizations maintain small mapping tables used across multiple transformations.

Example:

department_id   department_name
10              HR
20              Finance
30              IT

Instead of creating this table manually in Databricks, you can simply maintain it as a CSV file.


Development and Testing

Seeds are extremely useful when:

  • Learning dbt
  • Demonstrating concepts
  • Building proof-of-concepts
  • Creating test datasets

This makes Seeds a perfect feature for beginners who want to understand the dbt workflow.


Static Business Rules

Example:

status_code   description
A             Active
I             Inactive

Since such values rarely change, storing them as a Seed is often the easiest approach.


How dbt Seeds Work

The process is simple:

  1. Create a CSV file.
  2. Place it inside the seeds folder.
  3. Run dbt seed.
  4. dbt creates a table in Databricks.

The generated table can then be referenced inside your dbt models with ref(), just like any other model in the project.


Project Structure

A typical dbt project may look like this:

my_dbt_project/
|
├── models/
├── seeds/
├── macros/
├── tests/
└── dbt_project.yml

Create a folder named:

seeds/

if it does not already exist.


Step 1: Create a Seed File

Create a file named:

seeds/employees.csv

Add the following content:

emp_id,emp_name,department,salary
1,Rahul,IT,50000
2,Priya,HR,40000
3,Amit,Finance,60000
4,Neha,IT,70000

This CSV file will become a table in Databricks.


Step 2: Load the Seed into Databricks

Run the following command:

dbt seed

Sample output:

Finished running 1 seed in 4.12 seconds

dbt names the table after the CSV file, so it will create a table called:

employees

inside your target schema.


Step 3: Verify the Data

Open Databricks and execute:

SELECT *
FROM employees;

Output:

emp_id   emp_name   department   salary
1        Rahul      IT           50000
2        Priya      HR           40000
3        Amit       Finance      60000
4        Neha       IT           70000

Congratulations! Your first Seed has been successfully loaded.
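
As an extra check, it can help to confirm the column types dbt inferred from the CSV and that every row arrived. The sketch below assumes the Seed landed in a schema called my_schema, which is only a placeholder for your own target schema:

```sql
-- my_schema is a placeholder; replace it with your target schema.
-- Lists each column with the data type dbt chose when loading the CSV.
DESCRIBE TABLE my_schema.employees;

-- The count should match the four data rows in seeds/employees.csv.
SELECT COUNT(*) AS row_count
FROM my_schema.employees;
```

You would expect emp_id and salary to show up as numeric types and emp_name and department as strings. If a column is inferred differently from what you need, that is usually a sign to set its type explicitly in the Seed configuration.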


Step 4: Use the Seed in a dbt Model

Now let’s create a simple transformation.

Create a file:

models/high_salary_employees.sql

Add the following SQL:

SELECT *
FROM {{ ref('employees') }}
WHERE salary > 50000

Run:

dbt run

dbt builds the high_salary_employees model (a view by default) containing employees whose salary exceeds ₹50,000.

Output:

emp_id   emp_name   department   salary
3        Amit       Finance      60000
4        Neha       IT           70000
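
Seeds work with any SQL you would write against a normal table, not just filters. As a further sketch, here is an aggregation over the same Seed. The file name department_salary_summary.sql is an assumption made for illustration:

```sql
-- models/department_salary_summary.sql (hypothetical file name)
-- Summarises the employees Seed by department.
SELECT
    department,
    COUNT(*)    AS employee_count,
    AVG(salary) AS avg_salary
FROM {{ ref('employees') }}
GROUP BY department
```

With the four sample rows above, this should give IT two employees with an average salary of 60000, and HR and Finance one employee each with averages of 40000 and 60000.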

Why Use ref() Instead of Direct Table Names?

You may wonder why we use:

{{ ref('employees') }}

instead of:

SELECT * FROM employees

The answer is simple: dbt understands dependencies through ref().

Benefits include:

  • Automatic dependency tracking
  • Better lineage visualization
  • Easier environment management
  • Improved maintainability
  • Accurate build ordering

Using ref() is considered a dbt best practice.
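
To see what ref() does in practice, you can look at the compiled SQL that dbt writes under the target/compiled folder after a run. The sketch below shows roughly what to expect. The names my_catalog and my_schema are placeholders, and the exact quoting may differ depending on your adapter version:

```sql
-- What you write in models/high_salary_employees.sql:
--   SELECT * FROM {{ ref('employees') }} WHERE salary > 50000
--
-- Roughly what dbt compiles it to
-- (my_catalog and my_schema are placeholders for your own target):
SELECT *
FROM `my_catalog`.`my_schema`.`employees`
WHERE salary > 50000
```

Because dbt fills in the catalog and schema from the active target, the same model file can run unchanged against development and production.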


Loading a Specific Seed

If your project contains multiple Seed files, you can load only one:

dbt seed --select employees

This is particularly useful in large projects, where reloading every Seed on each run would waste time.


Refreshing Existing Seed Data

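When only the rows in the CSV file change, running dbt seed again is enough to reload the data. If the column structure changes (for example, a column is added, removed, or renamed), use:

dbt seed --full-refresh

This drops and recreates the table so that it matches the new CSV layout.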


Configuring Seed Schemas

You can control where Seed tables are created.

In dbt_project.yml:

seeds:
  my_dbt_project:
    +schema: seed_data

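With dbt's default schema naming, the custom schema is appended to your target schema, so the Seed table is created in:

<target_schema>_seed_data.employees

For example, if your target schema is dev, the table lands in dev_seed_data.employees.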

This helps separate Seed tables from business models.
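
If you would rather have the Seed land in a schema named exactly seed_data, one common option is to override dbt's generate_schema_name macro, which by default prefixes any custom schema with the target schema. A minimal sketch, assuming you save it as macros/generate_schema_name.sql:

```sql
{# macros/generate_schema_name.sql (assumed file name) #}
{# Overrides dbt's built-in macro so that a custom +schema value is #}
{# used as-is instead of being appended to target.schema.           #}
{% macro generate_schema_name(custom_schema_name, node) -%}
    {%- if custom_schema_name is none -%}
        {{ target.schema }}
    {%- else -%}
        {{ custom_schema_name | trim }}
    {%- endif -%}
{%- endmacro %}
```

Keep in mind that this override applies to every model and Seed with a custom schema, and that all developers would then write to the same seed_data schema. The default prefixing exists to keep environments apart, so treat this as a trade-off rather than a recommendation.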


Real-World Example

Imagine a marketing analytics project where campaign data comes from multiple platforms.

You may maintain a channel mapping file like:

channel_id,channel_name
1,Facebook
2,Instagram
3,LinkedIn
4,Twitter

Instead of hardcoding these values in SQL, you can store them as a Seed and join them with campaign performance data.

Benefits include:

  • Easier maintenance
  • Better version control
  • Centralized mappings
  • Reduced SQL complexity
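
A join against the mapping Seed might look like the sketch below. It assumes the CSV above is saved as seeds/channel_mapping.csv and that an upstream model called stg_campaign_performance exists with channel_id, impressions, and clicks columns. All of these names are hypothetical:

```sql
-- models/campaign_performance_by_channel.sql (hypothetical file name)
-- Assumes seeds/channel_mapping.csv and an upstream model named
-- stg_campaign_performance (channel_id, impressions, clicks).
SELECT
    m.channel_name,
    SUM(p.impressions) AS total_impressions,
    SUM(p.clicks)      AS total_clicks
FROM {{ ref('stg_campaign_performance') }} AS p
LEFT JOIN {{ ref('channel_mapping') }} AS m
    ON p.channel_id = m.channel_id
GROUP BY m.channel_name
```

The LEFT JOIN keeps campaign rows whose channel_id is missing from the Seed. They appear with a NULL channel_name, which makes gaps in the mapping file easy to spot.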

Best Practices for dbt Seeds

✅ Use Seeds for small datasets only.

✅ Store reference and lookup data as Seeds.

✅ Always use ref() when referencing Seeds.

✅ Keep Seed files under version control.

✅ Avoid loading large transactional datasets as Seeds.

❌ Do not use Seeds for millions of records.

❌ Do not use Seeds as a replacement for source systems.


Conclusion

dbt Seeds provide a simple and efficient way to manage small, static datasets directly within your dbt project. They are perfect for lookup tables, reference data, testing, and learning dbt concepts.

For beginners working with Databricks, learning Seeds is a great next step after creating your first dbt model. With just a CSV file and a single command, you can create reusable tables that integrate seamlessly into your dbt transformation workflow.

By mastering Seeds early, you’ll build cleaner projects, improve maintainability, and follow dbt best practices from day one.


Key Takeaways

  • dbt Seeds convert CSV files into database tables.
  • Ideal for lookup and reference data.
  • Loaded using the dbt seed command.
  • Can be referenced in models using ref().
  • Fully version-controlled and easy to maintain.
  • Commonly used in production analytics projects.

Happy Learning and Happy Data Transforming with dbt and Databricks! 🚀
