By Mengshi
IoT (Internet of Things) refers to the connection of various physical devices, sensors, software, and other technologies through the Internet to form an intelligent network. These devices can communicate with each other, exchange data, and collect and analyze information for automation, remote monitoring, and control.
Common IoT smart devices include:
The prominent features of IoT scenarios are:
Storing the massive amounts of data collected by IoT devices, and analyzing and processing that data, can bring huge economic and social benefits to a wide range of industries.
For example, let's take a look at a commonly used temperature sensor in smart homes.
How much data will be generated? Let's do a simple calculation: with 500,000 devices each reporting one record per minute, that is 500,000 × 60 × 24 ≈ 720 million records per day, or an average of about 8,300 records per second; over roughly three months (90 days) this adds up to about 65 billion records.
To store such a vast amount of data, time series databases are often used, mainly for the following reasons:
However, in some cases relational databases are also used to store IoT data, for example:
As long as a relational database is used properly, storing IoT data in it is not necessarily inferior to using a time series database. However, a single-instance RDBMS will run into problems storing such a vast amount of data, so in many cases a distributed relational database is considered instead.
In this article, we will discuss how to use the distributed database PolarDB-X to implement IoT services, achieving high efficiency and low cost while maintaining the usability of relational databases.
Generally, the collection table of IoT devices is as follows:
create table device_data(
  id bigint primary key auto_increment,
  create_time datetime,
  device_id varchar(128),
  c1 int,
  c2 varchar(128),
  c3 int,
  c4 int
);
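For illustration, a single reported record might look like the following insert (a minimal sketch; the device ID and the c1…c4 values are made-up examples and only show the shape of the data):
INSERT INTO device_data (create_time, device_id, c1, c2, c3, c4)
VALUES (NOW(), 'dev-000001', 23, 'living-room', 45, 1);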
In the example, device_id is the ID of the device that reports the data, create_time is the timestamp of the data collection, and the c1…c4 fields hold the various pieces of collected information. Designing this as a single table holding 65 billion records is simply impossible, for the following reasons:
Therefore, in PolarDB-X, the natural first step is to split the table by device ID and turn it into a partitioned table. The table structure is as follows:
create table device_data(
  id bigint primary key auto_increment,
  create_time datetime,
  device_id varchar(128),
  c1 int,
  c2 varchar(128),
  c3 int,
  c4 int
) PARTITION BY KEY(`device_id`)
PARTITIONS 512;
The table is split into 512 physical partitions.
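If you want to see how the logical table maps to the underlying physical partitions, PolarDB-X provides the SHOW TOPOLOGY statement; a quick check might look like this (the exact output columns can vary by version):
SHOW TOPOLOGY FROM device_data;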
After the table design was completed, we conducted a stress test. As the data volume increased, the database query and update performance became increasingly unstable, and the flushing rate in the buffer pool also gradually increased.
This is easy to understand: splitting only by device ID means that writes and queries touch all of the partitions, and each partition's pages have to be loaded into the buffer pool (BP). While a partition's hot pages fit in the BP, performance is good. However, as the data in each partition grows, there is no guarantee that all partitions can stay in the BP, so pages are frequently swapped in and out, resulting in a noticeable decline in performance.
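One way to observe this pressure is to look at the standard InnoDB buffer pool counters on the MySQL-based storage nodes (ordinary MySQL status output, not a PolarDB-X-specific feature); a growing ratio of Innodb_buffer_pool_reads to Innodb_buffer_pool_read_requests indicates that pages are increasingly being fetched from disk:
SHOW GLOBAL STATUS LIKE 'Innodb_buffer_pool_read%';
SHOW GLOBAL STATUS LIKE 'Innodb_buffer_pool_pages%';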
However, we know that there is a significant feature in IoT scenarios:
Data has a strong time attribute, and it is often necessary to query or update the latest data.
In this context, the following partitioning strategy is recommended for the table: use level-2 (composite) partitioning with hash + range. The level-1 partitions hash on the device ID, and the level-2 partitions range over time, with one sub-partition per day.
CREATE TABLE device_data (
  .....
  KEY idx (`device_id`, `create_time`)
)
PARTITION BY KEY(`device_id`)
PARTITIONS 128
SUBPARTITION BY RANGE(TO_DAYS(`create_time`))
(SUBPARTITION `p20240901` VALUES LESS THAN (739496), -- TO_DAYS('2024-09-02')
 SUBPARTITION `p20240902` VALUES LESS THAN (739497),
 SUBPARTITION `p20240903` VALUES LESS THAN (739498),
 SUBPARTITION `p20240904` VALUES LESS THAN (739499),
 SUBPARTITION `p20240905` VALUES LESS THAN (739500),
 SUBPARTITION `p20240906` VALUES LESS THAN (739501),
 SUBPARTITION `p20240907` VALUES LESS THAN (739502),
 SUBPARTITION `p20240908` VALUES LESS THAN (739503),
 SUBPARTITION `p20240909` VALUES LESS THAN (739504), -- TO_DAYS('2024-09-10')
 ....
);
This ensures that writes and updates land in the newest sub-partition of each level-1 partition, that older sub-partitions can gradually be evicted from the BP, and that the newest sub-partitions stay resident in the BP. As a result, write and query performance remains stable: as the data volume grows, the flushing pressure in the BP stays steady and write performance stays consistent.
In addition to data reporting and storage, IoT scenarios often need to query the latest batch of data, for example, to obtain the latest real-time status of a device: given a device ID, query its 1,000 most recent reporting records:
select * from device_data where device_id = '1'
order by create_time desc limit 1000;
The level-1 partitions are split by device ID, and the level-2 partitions are range partitions. The preceding SQL statement can therefore be pruned to a single level-1 partition based on the device ID, but the level-2 partitions cannot be pruned further. Because the level-2 range partitions are ordered by create_time, PolarDB-X scans them with an adaptive prefetch mechanism that takes full advantage of this ordering, effectively reducing both the number of level-2 partitions scanned and the response time (RT).
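To verify that pruning works as described, it can be helpful to inspect the execution plan, for example (the plan format depends on the PolarDB-X version):
EXPLAIN
select * from device_data where device_id = '1'
order by create_time desc limit 1000;
In a healthy plan, only one level-1 partition should be involved, with fewer level-2 sub-partitions scanned than the total.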
In addition to queries, IoT scenarios also need to maintain and update the collected data based on device_id:
update device_data set xxx where device_id=xx and id=?
Due to the characteristics of IoT services, data updates are clearly time-biased: most of the rows being updated are recent (understandably, only when a device has been offline for a long time does older data need to be updated). This means that when the business issues such an update, the target row can usually be found in the range sub-partition for the most recent day. During the actual update, we can therefore probe the newest sub-partition first and fall back to the next older one only if the row is not found there. In most cases, what would otherwise require scanning multiple sub-partitions is reduced to scanning only the newest one, greatly improving update performance.
However, this fallback probes the sub-partitions serially, from newest to oldest; if the row being updated is several days old, the extra round trips cause a significant performance regression. For this reason, we established a convention with the business side: if the workload mostly updates the latest data, the SQL statement can be rewritten as:
update device_data set xxx where device_id=? and id=?
and create_time <= now()
order by create_time desc limit 1;
In this way, the database can accurately identify business scenarios and leverage the orderliness of range partitions to achieve efficient updates.
Data is generated every moment in IoT scenarios, making storage cost optimization an inevitable topic. More IoT data can be leveraged to make predictions and identify potential issues or changes in demand in advance. At the same time, due to compliance requirements, data must often be retained for longer periods.
Take the data collection table of the temperature sensor in the smart home as an example. Suppose we need to keep one year of data; let's calculate the cost.
For roughly 800 TB of IoT data per year, at about 10,000 yuan per TB per year (roughly 830 yuan per TB per month) for online row storage, the storage cost for one year would be about 8 million yuan.
This cost is quite high. However, IoT data generally has a strong time attribute, and it is usually the most recently collected data that is most valuable for analysis and decision-making. With the transparent archiving capability provided by PolarDB-X, data that was not collected recently can be archived to low-cost storage in OSS. For more information, please refer to the official documentation: Cold Data Archiving (TTL).
For example, let's keep only the last month of data in PolarDB-X row storage, archive the remaining 11 months of data to OSS in columnar format, and then recalculate the cost.
Keeping 66 TB (about one month of data) in row storage costs 66 TB × 10,000 yuan/TB/year ≈ 660,000 yuan per year. The remaining 734 TB of row-store data is archived to OSS in columnar format; assuming a five-times compression ratio, it occupies about 146 TB, which costs about 210,000 yuan per year in OSS. The total storage cost is therefore about 870,000 yuan per year.
From 8 million yuan to 870,000 yuan, the storage cost drops to roughly 10% of the original, a very significant reduction.
Users only need to execute a single DDL statement on device_data to enable archiving:
ALTER TABLE device_data
MODIFY TTL
SET
TTL_ENABLE = 'ON'
TTL_EXPR = `create_time` EXPIRE AFTER 30 DAY TIMEZONE '+08:00'
TTL_JOB = CRON '0 0 2 */1 * ? *' TIMEZONE '+00:00';
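To double-check the result, you can inspect the table definition afterwards (whether and how the TTL settings are displayed in the output may depend on the PolarDB-X version):
SHOW CREATE TABLE device_data;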
After the above DDL statement is executed, data in the device_data table that is older than 30 days is detected and archived to OSS after 2:00 a.m. every day. Archiving runs online and does not affect the online business. In addition, the online archiving has two features:
In IoT scenarios, data needs to be imported, updated, and queried efficiently in real time, and the volume of collected data is substantial. Analyzing user behavior data helps make product decisions and provide better services. For example, in industrial monitoring, preventive maintenance involves monitoring the working status of equipment to predict failures.
Previously, we would synchronize large amounts of IoT data stored in relational or time series databases to traditional data warehouse tools, such as Hadoop or Spark, for in-depth analysis of historical data.
Now, with the end-to-end HTAP capabilities provided by PolarDB-X, we can meet the requirements for efficient data import and updates in the PolarDB-X database and leverage its clustered columnar indexes to satisfy the demands of analyzing massive amounts of data.
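For illustration, the kind of analytical query that benefits from columnar storage might look like the following daily aggregation (a hypothetical example that assumes c1 holds a numeric sensor reading such as temperature):
SELECT device_id,
       DATE(create_time) AS stat_day,
       AVG(c1) AS avg_c1,
       MAX(c1) AS max_c1
FROM device_data
WHERE create_time >= DATE_SUB(NOW(), INTERVAL 90 DAY)
GROUP BY device_id, DATE(create_time);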
For more information, please refer to our official usage guide and performance white papers.
In IoT scenarios, using a relational database appropriately combines high performance with low cost. In real IoT deployments, migrating from a time series database to PolarDB-X offers significant advantages in both cost and performance. At the same time, relational databases have advantages that time series databases cannot match: