Getting Started with dbt Seeds in Databricks: A Beginner-Friendly Guide
Posted by Sumit Kumar
If you’ve started your dbt journey and successfully connected dbt Core with Databricks, congratulations! 🎉
The next feature you should learn is dbt Seeds. Seeds are one of the simplest yet most powerful features in dbt, especially when working with reference data, lookup tables, and demo datasets.
In this blog, we’ll explore what dbt Seeds are, why they are used, and how to implement them in Databricks with a practical example.
What Are dbt Seeds?
A dbt Seed is a CSV file that dbt can load directly into your data warehouse as a table.
Instead of manually creating and maintaining small tables in Databricks, you can store the data as CSV files within your dbt project and let dbt manage them.
Think of Seeds as:
“Version-controlled tables created from CSV files.”
Because the CSV files are stored inside your project, they can be tracked through Git, reviewed during code reviews, and deployed consistently across environments.
Why Use dbt Seeds?
dbt Seeds are ideal for small datasets that do not change frequently.
Some common use cases include:
Reference Data
Examples:
- Country codes
- Currency mappings
- Department lists
- Product categories
Example:
| country_code | country_name |
|---|---|
| IN | India |
| US | United States |
| GB | United Kingdom |
Lookup Tables
Many organizations maintain small mapping tables used across multiple transformations.
Example:
| department_id | department_name |
|---|---|
| 10 | HR |
| 20 | Finance |
| 30 | IT |
Instead of creating this table manually in Databricks, you can simply maintain it as a CSV file.
Development and Testing
Seeds are extremely useful when:
- Learning dbt
- Demonstrating concepts
- Building proof-of-concepts
- Creating test datasets
This makes Seeds a perfect feature for beginners who want to understand the dbt workflow.
Static Business Rules
Example:
| status_code | description |
|---|---|
| A | Active |
| I | Inactive |
Since such values rarely change, storing them as a Seed is often the easiest approach.
How dbt Seeds Work
The process is simple:
- Create a CSV file.
- Place it inside the seeds folder.
- Run dbt seed.
- dbt creates a table in Databricks.
The generated table can then be referenced inside your dbt models with ref(), just like any other model.
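To make the flow above concrete, here is a minimal Python sketch of the idea behind loading a seed: read the CSV header, create a table with those columns, and insert the rows. It is only an illustration, not dbt's actual implementation. It assumes an in-memory SQLite database as a stand-in for Databricks, the load_seed helper is hypothetical, and unlike dbt it does not infer column types.

```python
import csv
import io
import sqlite3

# Contents of a seed file, inlined so the sketch is self-contained.
SEED_CSV = """emp_id,emp_name,department,salary
1,Rahul,IT,50000
2,Priya,HR,40000
3,Amit,Finance,60000
4,Neha,IT,70000
"""


def load_seed(conn, table_name, csv_text):
    """Create table_name from the CSV header and insert every data row."""
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader)
    rows = [row for row in reader if row]  # skip blank lines
    columns = ", ".join(f'"{name}"' for name in header)
    placeholders = ", ".join("?" for _ in header)
    conn.execute(f'DROP TABLE IF EXISTS "{table_name}"')
    conn.execute(f'CREATE TABLE "{table_name}" ({columns})')
    conn.executemany(f'INSERT INTO "{table_name}" VALUES ({placeholders})', rows)
    return len(rows)


conn = sqlite3.connect(":memory:")  # stand-in for the target schema
loaded = load_seed(conn, "employees", SEED_CSV)
print(f"Loaded {loaded} rows into employees")  # Loaded 4 rows into employees
```

The table name comes from the file name (employees.csv becomes employees), which is why the sketch takes it as a separate argument.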
Project Structure
A typical dbt project may look like this:
my_dbt_project/
|
├── models/
├── seeds/
├── macros/
├── tests/
└── dbt_project.yml
Create a folder named:
seeds/
if it does not already exist.
Step 1: Create a Seed File
Create a file named:
seeds/employees.csv
Add the following content:
emp_id,emp_name,department,salary
1,Rahul,IT,50000
2,Priya,HR,40000
3,Amit,Finance,60000
4,Neha,IT,70000
This CSV file will become a table in Databricks.
Step 2: Load the Seed into Databricks
Run the following command:
dbt seed
Sample output:
Finished running 1 seed in 4.12 seconds
dbt will create a table called:
employees
inside your target schema.
Step 3: Verify the Data
Open Databricks and execute:
SELECT *
FROM employees;
Output:
| emp_id | emp_name | department | salary |
|---|---|---|---|
| 1 | Rahul | IT | 50000 |
| 2 | Priya | HR | 40000 |
| 3 | Amit | Finance | 60000 |
| 4 | Neha | IT | 70000 |
Congratulations! Your first Seed has been successfully loaded.
Step 4: Use the Seed in a dbt Model
Now let’s create a simple transformation.
Create a file:
models/high_salary_employees.sql
Add the following SQL:
SELECT *
FROM {{ ref('employees') }}
WHERE salary > 50000
Run:
dbt run
dbt builds a new model named high_salary_employees (materialized as a view unless configured otherwise) containing employees whose salary exceeds ₹50,000.
Output:
| emp_id | emp_name | department | salary |
|---|---|---|---|
| 3 | Amit | Finance | 60000 |
| 4 | Neha | IT | 70000 |
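If you want to sanity-check the expected result without a warehouse, the same filter can be run locally. The sketch below assumes an in-memory SQLite database in place of Databricks and replaces the ref() call with the plain table name, since Jinja is only resolved by dbt.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE employees "
    "(emp_id INTEGER, emp_name TEXT, department TEXT, salary INTEGER)"
)
conn.executemany(
    "INSERT INTO employees VALUES (?, ?, ?, ?)",
    [
        (1, "Rahul", "IT", 50000),
        (2, "Priya", "HR", 40000),
        (3, "Amit", "Finance", 60000),
        (4, "Neha", "IT", 70000),
    ],
)

# Same filter as the model, with ref('employees') replaced by the table name.
high_salary = conn.execute(
    "SELECT emp_id, emp_name, department, salary "
    "FROM employees WHERE salary > 50000 ORDER BY emp_id"
).fetchall()

for row in high_salary:
    print(row)
# (3, 'Amit', 'Finance', 60000)
# (4, 'Neha', 'IT', 70000)
```

Note that Rahul is excluded because the comparison is strict: a salary of exactly 50000 does not satisfy salary > 50000.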
Why Use ref() Instead of Direct Table Names?
You may wonder why we use:
{{ ref('employees') }}
instead of:
SELECT * FROM employees
The answer is simple: dbt understands dependencies through ref().
Benefits include:
- Automatic dependency tracking
- Better lineage visualization
- Easier environment management
- Improved maintainability
- Accurate build ordering
Using ref() is considered a dbt best practice.
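To see why ref() enables dependency tracking and build ordering, consider this toy Python sketch. It scans SQL text for ref() calls with a regular expression and sorts the nodes so that every dependency is built first. This is a simplification for illustration only and not how dbt parses projects. The high_salary_by_department model and the NODES dictionary are hypothetical.

```python
import re
from graphlib import TopologicalSorter

# Hypothetical project: one seed and two models, keyed by node name.
NODES = {
    "employees": "",  # the seed has no upstream refs
    "high_salary_employees": (
        "SELECT * FROM {{ ref('employees') }} WHERE salary > 50000"
    ),
    "high_salary_by_department": (
        "SELECT department, COUNT(*) AS n "
        "FROM {{ ref('high_salary_employees') }} GROUP BY department"
    ),
}

# Matches {{ ref('name') }} or {{ ref("name") }} and captures the name.
REF_PATTERN = re.compile(r"\{\{\s*ref\(\s*['\"]([^'\"]+)['\"]\s*\)\s*\}\}")


def find_refs(sql):
    """Return the set of node names referenced through ref() in the SQL."""
    return set(REF_PATTERN.findall(sql))


# Map each node to the nodes it depends on, then order dependencies first.
graph = {name: find_refs(sql) for name, sql in NODES.items()}
build_order = list(TopologicalSorter(graph).static_order())
print(build_order)
# ['employees', 'high_salary_employees', 'high_salary_by_department']
```

A hardcoded table name such as FROM employees would be invisible to this kind of scan, which is the practical reason a direct reference loses lineage and ordering.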
Loading a Specific Seed
If your project contains multiple Seed files, you can load only one:
dbt seed --select employees
This is particularly useful in large projects.
Refreshing Existing Seed Data
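When the CSV file changes, rerun dbt seed to reload it. To force dbt to drop and recreate the table, for example after adding, removing, or renaming columns, add the --full-refresh flag:
dbt seed --full-refresh
This recreates the table with the latest data.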
Configuring Seed Schemas
You can control where Seed tables are created.
In dbt_project.yml:
seeds:
  my_dbt_project:
    +schema: seed_data
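With dbt's default schema naming, the custom schema is appended to your target schema rather than replacing it, so dbt will create the Seed table in:
<target_schema>_seed_data.employees
To get a schema named exactly seed_data, you would need to override the generate_schema_name macro in your project.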
This helps separate Seed tables from business models.
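The schema naming rule can be sketched in a few lines of Python. This mirrors the logic of dbt's default generate_schema_name macro as documented: use the target schema alone when no custom schema is set, otherwise join the two with an underscore. The resolve_schema function and the analytics target schema are hypothetical names used only for illustration.

```python
def resolve_schema(target_schema, custom_schema=None):
    """Mimic dbt's default naming: target schema, plus a suffix when +schema is set."""
    if custom_schema is None:
        return target_schema
    return f"{target_schema}_{custom_schema.strip()}"


print(resolve_schema("analytics"))               # analytics
print(resolve_schema("analytics", "seed_data"))  # analytics_seed_data
```

Projects that override generate_schema_name will resolve schemas differently, so check your macros folder before relying on this rule.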
Real-World Example
Imagine a marketing analytics project where campaign data comes from multiple platforms.
You may maintain a channel mapping file like:
channel_id,channel_name
1,Facebook
2,Instagram
3,LinkedIn
4,Twitter
Instead of hardcoding these values in SQL, you can store them as a Seed and join them with campaign performance data.
Benefits include:
- Easier maintenance
- Better version control
- Centralized mappings
- Reduced SQL complexity
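As a rough illustration of that join, the sketch below loads the channel mapping and a few made-up campaign rows into an in-memory SQLite database and aggregates clicks per channel. The campaign_performance table, its columns, and its values are hypothetical sample data, and SQLite stands in for Databricks.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# The channel mapping that would live in a seed file.
conn.execute("CREATE TABLE channels (channel_id INTEGER, channel_name TEXT)")
conn.executemany(
    "INSERT INTO channels VALUES (?, ?)",
    [(1, "Facebook"), (2, "Instagram"), (3, "LinkedIn"), (4, "Twitter")],
)

# Hypothetical campaign rows: (campaign_id, channel_id, clicks).
conn.execute(
    "CREATE TABLE campaign_performance "
    "(campaign_id INTEGER, channel_id INTEGER, clicks INTEGER)"
)
conn.executemany(
    "INSERT INTO campaign_performance VALUES (?, ?, ?)",
    [(101, 1, 250), (102, 3, 90), (103, 1, 40)],
)

# Join on channel_id instead of hardcoding names in a CASE expression.
clicks_by_channel = conn.execute(
    """
    SELECT c.channel_name, SUM(p.clicks) AS total_clicks
    FROM campaign_performance AS p
    JOIN channels AS c ON c.channel_id = p.channel_id
    GROUP BY c.channel_name
    ORDER BY c.channel_name
    """
).fetchall()
print(clicks_by_channel)  # [('Facebook', 290), ('LinkedIn', 90)]
```

In a dbt model, the two table names would be replaced by ref() calls, and adding a new channel would only require a new row in the CSV.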
Best Practices for dbt Seeds
✅ Use Seeds for small datasets only.
✅ Store reference and lookup data as Seeds.
✅ Always use ref() when referencing Seeds.
✅ Keep Seed files under version control.
✅ Avoid loading large transactional datasets as Seeds.
❌ Do not use Seeds for millions of records.
❌ Do not use Seeds as a replacement for source systems.
Conclusion
dbt Seeds provide a simple and efficient way to manage small, static datasets directly within your dbt project. They are perfect for lookup tables, reference data, testing, and learning dbt concepts.
For beginners working with Databricks, learning Seeds is a great next step after creating your first dbt model. With just a CSV file and a single command, you can create reusable tables that integrate seamlessly into your dbt transformation workflow.
By mastering Seeds early, you’ll build cleaner projects, improve maintainability, and follow dbt best practices from day one.
Key Takeaways
- dbt Seeds convert CSV files into database tables.
- Ideal for lookup and reference data.
- Loaded using the dbt seed command.
- Can be referenced in models using ref().
- Fully version-controlled and easy to maintain.
- Commonly used in production analytics projects.
Happy Learning and Happy Data Transforming with dbt and Databricks! 🚀


