
PolarDB-X Best Practice Series (5): How to Store IoT Data

This article explores best practices for storing IoT data using PolarDB-X, a distributed relational database.

By Mengshi

Background

IoT (Internet of Things) refers to the connection of various physical devices, sensors, software, and other technologies through the Internet to form an intelligent network. These devices can communicate with each other, exchange data, and collect and analyze information for automation, remote monitoring, and control.

Common IoT smart devices include:

  1. Smart Home: Smart light bulbs, smart thermostats, smart audio, and more.
  2. Smart City: Optimized traffic management, smart lighting, environmental monitoring, and more.
  3. Industrial Internet of Things (IIoT): Sensor monitoring equipment and more.
  4. Health Monitoring: Wearable devices for monitoring health data, such as electronic watches.

The prominent features of IoT scenarios are:

  1. Devices are homogeneous: there are no "cold" or "hot" devices, and every device generates real-time data at a relatively fixed frequency.
  2. The data volume is extremely large, often tens or even hundreds of billions of records.
  3. IoT devices collect many data types, such as temperature, humidity, location, speed, and images; this richness further increases the data volume.

Storing the massive amounts of data collected by IoT devices, and then analyzing and processing them, can bring huge economic and social benefits to many industries.

For example, let's take a look at a commonly used temperature sensor in smart homes:

  1. The sensor will sample the temperature once every minute and upload the temperature data.
  2. 500,000 units of this sensor are sold.
  3. The data is retained for 3 months.

How much data will be generated? Let's do a simple calculation: 500,000 × 60 × 24 × 30 × 3 ≈ 65 billion records, which averages roughly 8,300 records per second.
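The estimate above can be reproduced in a few lines (a back-of-the-envelope sketch using the numbers from the example):

```python
# Back-of-the-envelope sizing for the temperature-sensor example.
devices = 500_000                  # sensors sold
samples_per_day = 60 * 24          # one sample per minute
retention_days = 30 * 3            # data retained for three months

total_records = devices * samples_per_day * retention_days
writes_per_second = devices / 60   # each device writes once per minute

print(f"{total_records:,} records, ~{writes_per_second:,.0f} writes/s")
```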

To store such a vast amount of data, we often use time series databases, mainly considering:

  1. Time series databases have high write performance and can quickly process large volumes of real-time data.
  2. Time series databases optimize how data is stored, typically using purpose-built compression to reduce storage space and, therefore, storage costs.

However, relational databases are also used to store IoT data, for reasons such as:

  1. Relational databases like MySQL and PostgreSQL have a longer history with lower risks in terms of stability.
  2. MySQL and PostgreSQL are familiar to more engineers, with low learning costs, a rich ecosystem of upstream and downstream tools, and low development costs.
  3. Capabilities such as secondary indexes and transactions make relational databases more user-friendly.
  4. Users like relational databases and want to use them!

As long as relational databases are used properly, storing IoT data in relational databases is not necessarily inferior to storing data in time series databases. However, using a single-instance RDBMS to store such a vast amount of data will have problems. Therefore, in many cases, distributed relational databases are considered.

In this article, we will discuss how to use the distributed database PolarDB-X to implement IoT services, achieving high efficiency and low cost while maintaining the usability of relational databases.

How to Design an Appropriate Table Structure

A typical data-collection table for IoT devices looks like this:

create table device_data(
  id int primary key auto_increment,
  gmt_create datetime,
  device_id varchar(128),
  c1 int,
  c2 varchar(128),
  c3 int,
  c4 int
);

In the example, device_id is the ID of the device that collected the data, gmt_create is the collection timestamp, and the c1…c4 fields hold the various collected values. Storing all 65 billion records in a single table, as initially designed, is simply infeasible for the following reasons:

  1. Performance deteriorates because a single node is limited by its CPU, memory, and IOPS resources, and the B+ tree grows ever deeper as data accumulates.
  2. A single node cannot provide 200 TB of storage space.

Therefore, in PolarDB-X, it is easy to think of splitting the table based on the device ID and transforming the above table into a partitioned table. The table structure is as follows:

create table device_data(
    id int primary key auto_increment,
    gmt_create datetime,
    device_id varchar(128),
    c1 int,
    c2 varchar(128),
    c3 int,
    c4 int
) PARTITION BY KEY(`device_id`)
PARTITIONS 512;

The table is split into 512 partitions:

  1. The number of rows in each partition is kept below 50 million, avoiding the query-cost growth that comes with a deeper B+ tree.
  2. With data spread across partitions, multiple data nodes can store the 200 TB, and distributed resources can be fully used to query and update tens of billions of rows.

After the table design was completed, we conducted a stress test. As the data volume increased, query and update performance became increasingly unstable, and the flush rate of the buffer pool gradually increased.

It's easy to understand. Splitting by device ID means that writes and queries will involve all the partitioned tables. Each partitioned table will be loaded into the buffer pool (BP). When a partitioned table is in the BP, performance is better. However, as the data in the partitioned tables grows, it's not guaranteed that all partitioned tables can be fully loaded into the BP. This can lead to the BP being overwhelmed, with frequent swap-in and swap-out operations, resulting in a noticeable decline in performance.

However, we know that there is a significant feature in IoT scenarios:

Data has a strong time attribute, and it is often necessary to query or update the latest data.

Given this feature, the following table-creation partitioning strategy is recommended: a two-level hash + range scheme, in which the level-1 partition hashes on the device ID and the level-2 partition ranges on time, with one sub-partition per day.

CREATE TABLE device_data (
  ..... 
  KEY idx (device_id,create_time)
) 
PARTITION BY KEY(`device_id`)
PARTITIONS 128
SUBPARTITION BY RANGE(TO_DAYS(`create_time`))
(SUBPARTITION `p20240901` VALUES LESS THAN (TO_DAYS('2024-09-02')),
 SUBPARTITION `p20240902` VALUES LESS THAN (TO_DAYS('2024-09-03')),
 SUBPARTITION `p20240903` VALUES LESS THAN (TO_DAYS('2024-09-04')),
 SUBPARTITION `p20240904` VALUES LESS THAN (TO_DAYS('2024-09-05')),
 SUBPARTITION `p20240905` VALUES LESS THAN (TO_DAYS('2024-09-06')),
 SUBPARTITION `p20240906` VALUES LESS THAN (TO_DAYS('2024-09-07')),
 SUBPARTITION `p20240907` VALUES LESS THAN (TO_DAYS('2024-09-08')),
 SUBPARTITION `p20240908` VALUES LESS THAN (TO_DAYS('2024-09-09')),
 SUBPARTITION `p20240909` VALUES LESS THAN (TO_DAYS('2024-09-10')),
....
);
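The VALUES LESS THAN boundaries are TO_DAYS values, which are easy to get wrong by hand: partition pYYYYMMDD should bound at the TO_DAYS value of the following day. A small helper can generate one sub-partition clause per day (a sketch assuming MySQL's TO_DAYS semantics, where TO_DAYS equals Python's date.toordinal() plus 365; daily_subpartitions is a hypothetical helper, not a PolarDB-X tool):

```python
from datetime import date, timedelta

def to_days(d: date) -> int:
    """Mirror MySQL's TO_DAYS(): Python's proleptic-Gregorian ordinal
    (days since 0001-01-01) plus 365, e.g. TO_DAYS('2000-01-01') = 730485."""
    return d.toordinal() + 365

def daily_subpartitions(start: date, days: int) -> list:
    """Generate one RANGE sub-partition clause per day; partition pYYYYMMDD
    holds rows with create_time earlier than the start of the next day."""
    clauses = []
    for i in range(days):
        day = start + timedelta(days=i)
        bound = to_days(day + timedelta(days=1))  # upper bound: next day's TO_DAYS
        clauses.append(f"SUBPARTITION `p{day:%Y%m%d}` VALUES LESS THAN ({bound})")
    return clauses

print(",\n".join(daily_subpartitions(date(2024, 9, 1), 3)))
```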

This design ensures that writes and updates land in the newest sub-partition of each level-1 partition, so older sub-partitions can be evicted from the BP while the newest ones stay resident. As a result, write and query performance stays stable: even as the data volume grows, flushing in the BP remains steady and write throughput remains consistent.

How to Efficiently Process Real-time Data

In addition to data reporting and storage, IoT scenarios often need to query the latest batch of data, for example, to obtain the latest real-time status of a device: given a device ID, fetch its 1,000 most recent reporting records:

select * from device_data where device_id = '1'
order by create_time desc limit 1000;

The level-1 partition is split based on the device ID, and the level-2 partition uses the range partition. The preceding SQL statement can be pruned to a level-1 partition based on the device ID, but the level-2 partition cannot be further pruned.

Level-2 partitions are range partitions distributed in an ordered manner based on create_time. We use the adaptive prefetch mechanism to scan level-2 partitions:

  1. In the first round, the latest p20240925 partition is scanned. If 1,000 data entries that meet the conditions are found, the full query is completed; otherwise, continue.
  2. In the second round, the p20240924 and p20240923 partitions are scanned to continue to find the data that meets the conditions. If 1,000 records are found, the query ends; otherwise, proceed.
  3. In the third round, the partitions p20240922, p20240921, p20240920, and p20240919 are scanned to find records that meet the conditions. If 1,000 records are found, the query ends; otherwise, proceed.
  4. ...

By making full use of the ordering of range partitions, the adaptive prefetch mechanism effectively reduces both the number of level-2 partitions scanned and the query RT.
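The round-based widening scan described above can be sketched as a simple model (an illustration of the scanning strategy, not PolarDB-X internals; partitions here is assumed to be a newest-first list of the matching rows in each daily sub-partition):

```python
def adaptive_prefetch(partitions, limit):
    """Model of a round-based widening scan over newest-first daily
    sub-partitions: each round scans twice as many partitions as the
    previous one, stopping once `limit` matching rows are collected."""
    rows, start, fanout, rounds = [], 0, 1, 0
    while start < len(partitions) and len(rows) < limit:
        for part in partitions[start:start + fanout]:
            rows.extend(part)          # scan this round's partitions
        start += fanout
        fanout *= 2                    # widen the next round: 1, 2, 4, 8, ...
        rounds += 1
    return rows[:limit], rounds

# e.g. 400 matching rows per daily sub-partition, fetching the latest 1,000:
# round 1 scans 1 partition (400 rows), round 2 scans 2 more (1,200 total).
rows, rounds = adaptive_prefetch([[1] * 400 for _ in range(10)], 1000)
```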

In addition to queries, IoT scenarios also maintain and update the collected data by device_id:

update device_data set xxx where device_id = ? and id = ?;

Due to the nature of IoT services, updates are clearly time-skewed: most of the rows being updated are recent (older data is typically updated only when a device has been offline for a long time). This means that when the business issues an update, the target row can usually be found in the most recent daily range partition. In practice, we can therefore probe the newest sub-partition first, and fall back to the next one only if the row is not found. What previously required scanning many sub-partitions is, in most cases, reduced to scanning only the newest one, greatly improving update performance.

However, this probing happens serially, from the newest sub-partition backward. If the target row is several days old, the multiple round trips cause a significant performance regression. For this reason, we established a convention with the business side: if updates are indeed skewed toward the latest data, the SQL statement can be rewritten as:

update device_data set xxx
where device_id = ? and id = ?
and create_time <= now()
order by create_time desc limit 1;

  1. Adding the condition create_time <= now() avoids scanning sub-partitions that were pre-created for future dates.
  2. Because updates are time-skewed, order by create_time desc makes the scan start from the newest partition (leveraging Range Scan's ordering), and the update stops as soon as one record is modified.
  3. Add limit 1 as needed: if the primary key id is not part of the partition key, its global uniqueness is not guaranteed, so limit 1 should be added; if id is auto-increment and globally unique, limit 1 is unnecessary.

In this way, the database can accurately identify business scenarios and leverage the orderliness of range partitions to achieve efficient updates.

How to Minimize Storage Costs

Data is generated every moment in IoT scenarios, making storage cost optimization an inevitable topic. More IoT data can be leveraged to make predictions and identify potential issues or changes in demand in advance. At the same time, due to compliance requirements, data must often be retained for longer periods.

Take the data-collection table of the smart-home temperature sensor as an example. Suppose the data must be retained for one year; let's calculate the cost.

For 800 TB of IoT data per year, at a unit price of 1,000 yuan per TB per month, the storage cost for one year would be 8 million yuan.

This cost is quite high. However, IoT data generally has a strong time attribute, and it is usually the most recent (hot) data that matters most for analysis and decision-making. With the transparent archiving capability provided by PolarDB-X, data beyond a recent window can be archived to low-cost OSS storage. For more information, please refer to the official documentation: Cold Data Archiving (TTL).

For example, let's store the last month's data in PolarDB-X row storage and archive the remaining 11 months' data in the column storage format to OSS. Then, let's calculate the costs again.

Keeping roughly 66 TB (one month of data) in row storage costs about 660,000 yuan per year. The remaining 734 TB is archived to OSS in columnar format; at a five-fold compression ratio, it occupies about 146 TB, costing about 210,000 yuan per year in OSS. The total storage cost is therefore about 870,000 yuan.

From 8,000,000 yuan to 870,000 yuan, the storage cost drops to roughly a tenth of the original, a very significant reduction.
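The arithmetic above can be double-checked in a few lines (the unit prices and the five-fold compression ratio are the example's assumptions, not official pricing):

```python
# Figures from the example above (illustrative unit prices, not a quote).
oss_tb = 734 // 5                  # 734 TB of raw data, ~5x columnar compression
row_cost = 660_000                 # yuan/year for ~66 TB kept in row storage
oss_cost = 210_000                 # yuan/year for ~146 TB archived in OSS
total = row_cost + oss_cost        # combined yearly storage cost
saving = total / 8_000_000         # vs. ~8,000,000 yuan for all-row storage
print(oss_tb, total, round(saving * 100))
```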

Users only need to execute one DDL statement on device_data to enable archiving:

ALTER TABLE device_data
MODIFY TTL
SET
TTL_ENABLE = 'ON'
TTL_EXPR = `gmt_create` EXPIRE AFTER 30 DAY TIMEZONE '+08:00'
TTL_JOB = CRON '0 0 2 */1 * ? *' TIMEZONE '+00:00';

After this statement is executed, data in device_data older than 30 days is detected and archived to OSS every day after 2:00 a.m. Archiving runs online and does not affect the business. Our online archiving also has two notable features:

  1. DDL and data changes are supported on cold data.
  2. Archived cold data is queried in exactly the same way as row-store data, so user habits do not change.

End-to-end HTAP Capabilities

In IoT scenarios, data needs to be imported, updated, and queried efficiently in real time, and the volume of collected data is substantial. Analyzing this data helps drive product decisions and provide better services; for example, in industrial monitoring, preventive maintenance monitors equipment status to predict failures.

Previously, we would synchronize large amounts of IoT data stored in relational or time series databases to traditional data warehouse tools, such as Hadoop or Spark, for in-depth analysis of historical data.

Now, with the end-to-end HTAP capabilities provided by PolarDB-X, we can meet the requirements of efficient data import and updates in the PolarDB-X database and leverage our clustered columnar indexes to satisfy the demands for analyzing massive amounts of data.

For more information, please refer to our official usage guide and performance white papers.

In IoT scenarios, using relational databases appropriately combines high performance with low cost. In real IoT deployments, migrating from a time series database to PolarDB-X offers significant advantages in both cost and performance. At the same time, relational databases have advantages that time series databases cannot match:

  1. They have a well-established ecosystem with low integration costs; the interactive language is SQL, which means low development costs.
  2. They support complex SQL queries, including multi-table joins and nested queries.
  3. They provide strong support for transactional operations (ACID semantics).
  4. They support the creation of various indexes, which can flexibly meet query needs from different dimensions.