Seeding bulk records with Snaplet Seed and PostgreSQL
Introduction
In the previous article, I provided a basic guide on how to use @snaplet/seed to seed data quickly. In this article, I will show you how to generate a large amount of data in a short time, approximately 1,000,000 records, which is useful when you need to perform performance testing. You can apply a similar approach to larger datasets like several million records, using a streaming mechanism to avoid Out of Memory (OOM) errors caused by allocating too much memory at once and overloading the system. I will explain two different approaches here:
- Coding: chunking data and writing each small part to the database
- Database: creating a CSV file and using a command to copy that data directly into PostgreSQL
Prerequisites
You need to set up Prisma, PostgreSQL and @snaplet/seed before continuing, you can check out the previous articles for instructions.
Detail
First, let us create the file prisma/seed/seed.ts
import {faker} from '@faker-js/faker'
import {createSeedClient} from '@snaplet/seed'
const TOTAL_CATEGORIES = 2_000
const TOTAL_PRODUCTS = 1_000_000
const BATCH_SIZE = 50_000
const STEPS = TOTAL_PRODUCTS / BATCH_SIZE
async function main() {
let seed = await createSeedClient()
console.log('🧹 Clearing old data...')
await seed.$resetDatabase()
console.log('⏳ Directly seeding Category data into Database...')
const categoryStore = await seed.category(x =>
x(TOTAL_CATEGORIES, {
name: () => faker.commerce.department(),
})
)
const categoryIds = categoryStore.category.map(cat => cat.id as number)
console.log(`✅ Successfully seeded ${TOTAL_CATEGORIES} categories!`)
console.log(
`⏳ Starting chunked direct seed of ${TOTAL_PRODUCTS.toLocaleString()} Products...`
)
for (let i = 0; i < STEPS; i++) {
await seed.product(x =>
x(BATCH_SIZE, () => ({
name: faker.commerce.productName(),
price: parseFloat(faker.commerce.price({min: 5, max: 2000, dec: 2})),
categoryId: faker.helpers.arrayElement(categoryIds),
}))
)
console.log(
`📦 Successfully seeded Product Batch ${i + 1}/${STEPS} (${((i + 1) * BATCH_SIZE).toLocaleString()} rows)`
)
seed = await createSeedClient()
}
console.log(
'🚀 All data has been successfully seeded directly into PostgreSQL!'
)
}
main().catch(e => {
console.error('❌ Seeding Error:', e)
process.exit(1)
})
- Seeding 1,000,000 Products by splitting them into batches (BATCH_SIZE = 50,000).
- In each loop iteration, Snaplet generates Product data (using a callback to bypass format errors) and saves it directly to the DB.
- After each batch, the seed client is reinitialized (createSeedClient()) to clear the cache, free up RAM memory and prevent Out of Memory errors.
When running, the result will look like this:
$ yarn prisma:seed
yarn run v1.22.22
$ prisma db seed
Loaded Prisma config from prisma.config.ts.
Running seed command `ts-node ./prisma/seed/seed.ts` ...
🧹 Clearing old data...
⏳ Directly seeding Category data into Database...
✅ Successfully seeded 2000 categories!
⏳ Starting chunked direct seed of 1,000,000 Products...
📦 Successfully seeded Product Batch 1/20 (50,000 rows)
📦 Successfully seeded Product Batch 2/20 (100,000 rows)
📦 Successfully seeded Product Batch 3/20 (150,000 rows)
📦 Successfully seeded Product Batch 4/20 (200,000 rows)
📦 Successfully seeded Product Batch 5/20 (250,000 rows)
📦 Successfully seeded Product Batch 6/20 (300,000 rows)
📦 Successfully seeded Product Batch 7/20 (350,000 rows)
📦 Successfully seeded Product Batch 8/20 (400,000 rows)
📦 Successfully seeded Product Batch 9/20 (450,000 rows)
📦 Successfully seeded Product Batch 10/20 (500,000 rows)
📦 Successfully seeded Product Batch 11/20 (550,000 rows)
📦 Successfully seeded Product Batch 12/20 (600,000 rows)
📦 Successfully seeded Product Batch 13/20 (650,000 rows)
📦 Successfully seeded Product Batch 14/20 (700,000 rows)
📦 Successfully seeded Product Batch 15/20 (750,000 rows)
📦 Successfully seeded Product Batch 16/20 (800,000 rows)
📦 Successfully seeded Product Batch 17/20 (850,000 rows)
📦 Successfully seeded Product Batch 18/20 (900,000 rows)
📦 Successfully seeded Product Batch 19/20 (950,000 rows)
📦 Successfully seeded Product Batch 20/20 (1,000,000 rows)
🚀 All data has been successfully seeded directly into PostgreSQL!
🌱 The seed command has been executed.
✨ Done in 23.18s.
Create the file prisma/seed/seed-csv.ts to write the CSV file:
import {faker} from '@faker-js/faker'
import {createSeedClient} from '@snaplet/seed'
import * as format from 'fast-csv'
import * as fs from 'fs'
const TOTAL_CATEGORIES = 2_000
const TOTAL_PRODUCTS = 1_000_000
const BATCH_SIZE = 50_000
const STEPS = TOTAL_PRODUCTS / BATCH_SIZE
async function main() {
const seed = await createSeedClient()
console.log('⏳ Generating Category data...')
const prefix = 'gen/csv/'
const categoryFile = fs.createWriteStream(prefix + 'categories_data.csv')
const categoryStream = format.format({headers: true})
categoryStream.pipe(categoryFile)
const categoryStore = await seed.category(x =>
x(TOTAL_CATEGORIES, {
name: () => faker.commerce.department(),
})
)
const categoryArray = categoryStore.category
categoryArray.forEach(cat => categoryStream.write(cat))
categoryStream.end()
console.log('✅ Successfully wrote categories_data.csv!')
const categoryIds = categoryArray.map(cat => cat.id)
console.log('⏳ Generating 1,000,000 Product rows and streaming to CSV...')
const productFile = fs.createWriteStream(prefix + 'products_data.csv')
const productStream = format.format({headers: true})
productStream.pipe(productFile)
let productIdCounter = 1
for (let i = 0; i < STEPS; i++) {
const batchData = Array.from({length: BATCH_SIZE}, () => {
const priceString = faker.commerce.price({min: 5, max: 2000, dec: 2})
return {
id: productIdCounter++,
name: faker.commerce.productName(),
price: parseFloat(priceString),
categoryId: faker.helpers.arrayElement(categoryIds),
}
})
batchData.forEach(product => {
productStream.write(product)
})
console.log(
`📦 Processed Product Batch ${i + 1}/${STEPS} (${(i + 1) * BATCH_SIZE} rows)`
)
}
productStream.end()
console.log(
'🚀 Both CSV files are ready for PostgreSQL bulk loading (COPY command)!'
)
}
main().catch(e => {
console.error('❌ Seeding Error:', e)
process.exit(1)
})
The fast-csv library in the above code snippet acts as a data formatter to convert JavaScript objects into CSV format in real time. By utilizing format.format({headers: true}) combined with fs.createWriteStream, it creates a Writable Stream. This allows data to be written directly to disk as soon as it is generated (streamed), ensuring the system does not need to keep all 1 million records in RAM memory, thereby optimizing performance and preventing memory overflow (OOM) errors.
The result will generate two CSV files as follows
categories_data.csv
id,name
2051,Industrial
2052,Toys
2053,Grocery
2054,Computers
...
products_data.csv
id,name,price,categoryId
1,Rustic Soft Table,380,2244
2,Handmade Soft Shirt,1581,2915
3,Awesome Soft Sausages,1475,3767
4,Luxurious Bronze Ball,647,3276
5,Handcrafted Granite Fish,1889,3351
...
Next, you need to differentiate between the two use cases of using Copy:
Using the COPY command directly on the DB Server (SQL Command)
- Processing flow: The Postgres Server uses a line-by-line processing mechanism to open a file stream (File Descriptor) and process data as a stream.
- Network: No network bandwidth is consumed. Data goes straight from Disk to the Postgres Server RAM and is then stored in the Table data files.
- If you install PostgreSQL directly on localhost or run it via Docker, you can follow this path to achieve the best performance. However, after deployment, the database is typically located on a remote server or cloud, so you must be able to copy the CSV file there first before executing the command, which means this approach is not always feasible in practice.
Using the command from the PG Client (which is the approach I will use below)
- Processing flow: The PG Client opens the CSV file on the local machine, reads the data and packs that data into packets according to the PostgreSQL Protocol.
- Network: The client sends this data stream over the network port to the Server. The Postgres Server then simply receives the data stream from the network to write to disk.
- Thus, you can apply this method to insert data from a CSV file from anywhere, such as on a local computer or a CI/CD Runner.
Run the copy command for both files, noting that the execution order must follow the relationships between tables:
docker exec -i {container} psql -U {username} -d {database} -c "\copy \"Category\" FROM STDIN WITH (FORMAT csv, HEADER true);" < {file.csv}
docker exec -i postgres psql -U username -d postgres -c "\copy \"Category\" FROM STDIN WITH (FORMAT csv, HEADER true);" < categories_data.csv
COPY 2000
docker exec -i postgres psql -U username -d postgres -c "\copy \"Product\" FROM STDIN WITH (FORMAT csv, HEADER true);" < products_data.csv
COPY 1000000
Result after inserting data
Happy coding!
Comments
Post a Comment