Seed Data for PostgreSQL with pg-promise and faker
Introduction
pg-promise is a powerful library for Node.js dedicated to interacting with PostgreSQL databases. This library is built on top of the native pg driver but provides a high-level abstraction layer with a Promise-based architecture, making the source code cleaner, easier to read and more maintainable.
Advantages
- Automatically manages connections and transactions safely
- Features a built-in powerful SQL formatting system to prevent SQL Injection attacks
- Supports handling external SQL files via Query Files and offers exceptionally high performance for bulk data insertion thanks to optimized utilities like
pgp.helpers.insertandpgp.helpers.update.
Prerequisites
- In the previous article, I provided instructions on using Prisma to seed data, but using ORMs always introduces limitations such as lack of support for all data types as well as advanced database features. Since ORMs are designed to work with various database types, they sacrifice specific functionalities to solve general problems
- In this article, I will guide you through using pg-promise and faker to seed data, which offers the following advantages
- It does not use any ORM, so there is no need to synchronize the schema between the ORM and the database before use
- You can apply database-specific features directly before seeding data
- You only need to handle data faking and inserting into the database
Detail
You should install these packages
yarn add pg-promise dotenv
For example, suppose you need to create a database with the following tables
CREATE EXTENSION IF NOT EXISTS btree_gist;
CREATE TABLE warehouse_dock (
id SERIAL PRIMARY KEY,
dock_name VARCHAR(50) NOT NULL,
location VARCHAR(100)
);
CREATE TABLE truck_schedule (
id SERIAL PRIMARY KEY,
truck_plate_number VARCHAR(20) NOT NULL,
warehouse_dock_id INT REFERENCES warehouse_dock(id),
operating_time tstzrange NOT NULL,
activity_type VARCHAR(20) CHECK (activity_type IN ('INBOUND', 'OUTBOUND', 'OUT_FOR_DELIVERY')),
CONSTRAINT check_schedule_overlap EXCLUDE USING gist (
warehouse_dock_id WITH =,
operating_time WITH &&
) WHERE (activity_type IN ('INBOUND', 'OUTBOUND'))
);
CREATE INDEX idx_truck_schedule_dock_time ON truck_schedule USING gist (warehouse_dock_id, operating_time);
- The schema above utilizes the
tstzrangedata type, which is not supported in the current Prisma version. If you use TypeORM, it supports this type but remains limited when creating this specific Constraint combined with a Partial Index in Postgres - Therefore, the solution is to create the database schema first and then seed data independently using seed/index.ts with the following content
import {faker} from '@faker-js/faker'
import pgPromise from 'pg-promise'
const pgp = pgPromise()
const db = pgp(process.env.DATABASE_URL!)
const csDock = new pgp.helpers.ColumnSet(['dock_name', 'location'], {
table: 'warehouse_dock',
})
const csSchedule = new pgp.helpers.ColumnSet
[
'truck_plate_number',
'warehouse_dock_id',
'operating_time',
'activity_type',
],
{table: 'truck_schedule'}
)
async function main() {
try {
console.log('🧹 Clearing old data...')
await db.none(
'TRUNCATE TABLE truck_schedule, warehouse_dock RESTART IDENTITY CASCADE;'
)
console.log('🏭 Generating 100 warehouse docks...')
const dockData = Array.from({length: 100}, () => ({
dock_name: faker.company.name() + ' Dock',
location: faker.location.streetAddress(),
}))
const queryDock = pgp.helpers.insert(dockData, csDock) + ' RETURNING id'
const createdDocks = await db.many(queryDock)
const dockIds = createdDocks.map((dock: {id: number}) => dock.id)
console.log(`✅ Inserted ${dockIds.length} warehouse docks.`)
const TOTAL_SCHEDULES = 1_000_000
const BATCH_SIZE = 50_000
const STEPS = TOTAL_SCHEDULES / BATCH_SIZE
const dockTimelines = new Map<number, Date>()
dockIds.forEach(id => {
dockTimelines.set(id, new Date('2026-01-20T00:00:00.000Z'))
})
console.log(
`📦 Generating ${TOTAL_SCHEDULES.toLocaleString()} truck schedules via ${STEPS} sequential batches...`
)
for (let step = 0; step < STEPS; step++) {
const scheduleData = Array.from({length: BATCH_SIZE}, () => {
const randomDockId = faker.helpers.arrayElement(dockIds)
const currentTimeline = dockTimelines.get(randomDockId)!
const start = new Date(currentTimeline.getTime())
const durationHours = faker.number.int({min: 2, max: 5})
const end = new Date(start.getTime() + durationHours * 60 * 60 * 1000)
const restMinutes = faker.number.int({min: 15, max: 45})
const nextStartTimeline = new Date(
end.getTime() + restMinutes * 60 * 1000
)
boxTimelines.set(randomDockId, nextStartTimeline)
return {
truck_plate_number: faker.vehicle.vrm(),
warehouse_dock_id: randomDockId,
operating_time: `[${start.toISOString()},${end.toISOString()})`,
activity_type: faker.helpers.arrayElement([
'INBOUND',
'OUTBOUND',
'OUT_FOR_DELIVERY',
]),
}
})
const querySchedule =
pgp.helpers.insert(scheduleData, csSchedule) +
' ON CONFLICT ON CONSTRAINT check_schedule_overlap DO NOTHING;'
await db.none(querySchedule)
console.log(
`⚡ Finished Batch ${step + 1}/${STEPS} (${((step + 1) * BATCH_SIZE).toLocaleString()} rows)`
)
}
console.log('🚀 100,000 records seeded successfully without any conflicts!')
} catch (error) {
console.error('❌ Seeding Error:', error)
} finally {
pgp.end()
}
}
main()
- The code snippet above leverages faker to generate mock data and inserts 100 records into the
warehouse_docktable. After retrieving the dockIds, they are used as foreign keys to insert into thetruck_scheduletable - It inserts 1,000,000 records into the
truck_scheduletable with a batch size of 50,000 records per iteration
Create a .env file
DATABASE_URL = postgresql://{USER}:{PASSWORD}@{HOST}:{PORT}/{DATABASE_NAME}?schema={SCHEMA_NAME}
Add the script to package.json
{
"scripts": {
"db:seed": "ts-node -r dotenv/config seed/index.ts"
}
}
Consequently, you only need to execute that script in the future to seed data
yarn db:seed
yarn run v1.22.22
$ ts-node -r dotenv/config seed/index.ts
🧹 Clearing old data...
🏭 Generating 100 warehouse docks...
✅ Inserted 100 warehouse docks.
📦 Generating 1,000,000 truck schedules via 20 sequential batches...
⚡ Finished Batch 1/20 (50,000 rows)
⚡ Finished Batch 2/20 (100,000 rows)
⚡ Finished Batch 3/20 (150,000 rows)
⚡ Finished Batch 4/20 (200,000 rows)
⚡ Finished Batch 5/20 (250,000 rows)
⚡ Finished Batch 6/20 (300,000 rows)
⚡ Finished Batch 7/20 (350,000 rows)
⚡ Finished Batch 8/20 (400,000 rows)
⚡ Finished Batch 9/20 (450,000 rows)
⚡ Finished Batch 10/20 (500,000 rows)
⚡ Finished Batch 11/20 (550,000 rows)
⚡ Finished Batch 12/20 (600,000 rows)
⚡ Finished Batch 13/20 (650,000 rows)
⚡ Finished Batch 14/20 (700,000 rows)
⚡ Finished Batch 15/20 (750,000 rows)
⚡ Finished Batch 16/20 (800,000 rows)
⚡ Finished Batch 17/20 (850,000 rows)
⚡ Finished Batch 18/20 (900,000 rows)
⚡ Finished Batch 19/20 (950,000 rows)
⚡ Finished Batch 20/20 (1,000,000 rows)
🚀 100,000 records seeded successfully without any conflicts!
✨ Done in 26.39s.
Happy coding!
Comments
Post a Comment