Design a time-series data store for IoT sensor readings at scale.
sql-sys-004
Your answer
Answer as you would in a real interview — explain your thinking, not just the conclusion.
Model answer
For IoT time-series at scale I first consider purpose-built engines like TimescaleDB (a PostgreSQL extension), InfluxDB, or ClickHouse before vanilla PostgreSQL, because they handle the specific access patterns (high-frequency appends, range queries over time, automatic data retention) far more efficiently. If staying in PostgreSQL, the critical design decisions are: partition the table by time range (weekly or monthly) so old partitions can be dropped or compressed without full-table locks; make (sensor_id, recorded_at) the primary key and index key, with sensor_id leading so per-sensor range scans stay index-ordered and recorded_at DESC so the newest readings come first; and consider column-oriented storage or BRIN indexes for broad range scans, since BRIN stays tiny on append-only data whose physical order tracks insertion time. For retention I automate dropping expired partitions with a scheduled job rather than issuing row-by-row DELETEs, which bloat the table and churn the WAL. For aggregation I materialise summaries (hourly/daily rollups) in a separate table and update them asynchronously, because querying raw data at millisecond granularity for dashboards is impractical. I also design the ingestion path to batch writes and use COPY instead of individual INSERTs to maximise throughput.
Code example
-- PostgreSQL native range partitioning for time-series
CREATE TABLE sensor_readings (
    sensor_id   UUID NOT NULL,
    recorded_at TIMESTAMPTZ NOT NULL,
    value       DOUBLE PRECISION,
    unit        VARCHAR(20)
) PARTITION BY RANGE (recorded_at);
-- Create monthly partitions
CREATE TABLE sensor_readings_2025_05
    PARTITION OF sensor_readings
    FOR VALUES FROM ('2025-05-01') TO ('2025-06-01');
CREATE TABLE sensor_readings_2025_06
    PARTITION OF sensor_readings
    FOR VALUES FROM ('2025-06-01') TO ('2025-07-01');
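The retention policy from the answer can be sketched as a scheduled job that detaches and drops expired partitions; the partition name below is hypothetical, following the monthly naming convention used here, and the detach-then-drop split is one common pattern rather than the only option:

```sql
-- Retention sketch (assumed policy: drop partitions older than 12 months).
-- DETACH ... CONCURRENTLY (PostgreSQL 14+) avoids holding an
-- ACCESS EXCLUSIVE lock on the parent while queries are running.
ALTER TABLE sensor_readings
    DETACH PARTITION sensor_readings_2024_05 CONCURRENTLY;
DROP TABLE sensor_readings_2024_05;
```

Detaching first keeps the drop itself a cheap metadata operation on a standalone table.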
-- Index on the parent; PostgreSQL (11+) cascades it to every partition
CREATE INDEX ON sensor_readings (sensor_id, recorded_at DESC);
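The BRIN option mentioned in the answer can complement the B-tree for broad time-range scans; a sketch, noting that its usefulness depends on the table's physical order actually tracking recorded_at:

```sql
-- BRIN stores one summary entry per block range, so it stays tiny on
-- append-only tables and serves wide range scans over recorded_at.
CREATE INDEX ON sensor_readings USING brin (recorded_at);
```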
-- Materialised hourly rollup
CREATE TABLE sensor_hourly_avg (
    sensor_id    UUID NOT NULL,
    hour_bucket  TIMESTAMPTZ NOT NULL, -- truncated to the hour
    avg_value    DOUBLE PRECISION,
    sample_count INT,
    PRIMARY KEY (sensor_id, hour_bucket)
);
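The asynchronous rollup refresh described in the answer can be sketched as an upsert over the most recent bucket; the one-hour lookback window is an assumption, and in practice the job would be driven by a scheduler:

```sql
-- Recompute the latest hourly bucket from raw readings and upsert it,
-- so re-running the job (or late rows inside the window) just overwrites.
INSERT INTO sensor_hourly_avg (sensor_id, hour_bucket, avg_value, sample_count)
SELECT sensor_id,
       date_trunc('hour', recorded_at) AS hour_bucket,
       avg(value),
       count(*)
FROM sensor_readings
WHERE recorded_at >= date_trunc('hour', now() - interval '1 hour')
GROUP BY sensor_id, date_trunc('hour', recorded_at)
ON CONFLICT (sensor_id, hour_bucket)
DO UPDATE SET avg_value    = EXCLUDED.avg_value,
              sample_count = EXCLUDED.sample_count;
```

Recomputing whole buckets rather than incrementing counters keeps the rollup idempotent.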
-- Efficient range query using partition pruning
SELECT sensor_id, recorded_at, value
FROM sensor_readings
WHERE sensor_id = '123e4567-e89b-12d3-a456-426614174000' -- any valid UUID
  AND recorded_at BETWEEN '2025-05-20' AND '2025-05-21'
ORDER BY recorded_at;
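The batched ingestion path from the answer can be sketched with COPY; the file path and CSV shape are assumptions, and in an application the batch would more likely stream through the driver's COPY API than a server-side file:

```sql
-- Bulk-load a batch of readings in a single command; COPY avoids
-- per-statement parse/plan overhead compared with individual INSERTs.
COPY sensor_readings (sensor_id, recorded_at, value, unit)
FROM '/var/tmp/readings_batch.csv'
WITH (FORMAT csv);
```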
Follow-up
How would you handle late-arriving data that comes in hours after the fact? How does TimescaleDB's chunk architecture differ from Postgres native range partitioning?