How to Use Python to Analyze and Forecast Big Data on an IoT Platform

While IoT technology connects everything, it also brings us massive amounts of data. The urgent question is how to organize and analyze this data so that it can truly generate value. Large-scale data can help users improve efficiency, reduce costs, and enhance safety, but these benefits must be uncovered through data science and artificial intelligence before they can be applied in practice. This article uses data from an IoT system for urban energy monitoring as an example, and demonstrates basic analysis and forecasting through a Python data analysis platform.
1. Hardware and Software System
For the sake of rapid development, this project’s IoT system uses Raspberry Pi hardware and Alibaba Cloud edge computing technology, with Modbus drivers for data collection and upload. For details, see Using Alibaba Cloud IoT Edge Computing Functions to Quickly Enable Device Access for a Smart Energy Monitoring System.
The data uploaded to the IoT platform is transformed and stored in an RDS database. It could also be read directly from the IoT platform via API, but in this article, to improve data acquisition efficiency, the data is read from the database instead.
This article uses Anaconda, the Python data science distribution, together with Jupyter Notebook for data analysis. This is also a common combination in data analysis. Since the database used is MS SQL Server, the pymssql package needs to be installed for data acquisition.
The data to be collected mainly consists of on-site energy data. In addition to total energy consumption, the system monitors the energy usage of subsystems such as fans, lighting, and pumps, while also collecting environmental parameters such as temperature and humidity in real time. For demonstration purposes, this article only performs aggregated and simplified analysis on subsystem data.
2. Data Collection and Processing
2.1 Data Import
First, import the pymssql package for MS SQL Server and the pandas package for data processing, with the conventional alias pd. Since the IoT system uses 13-digit timestamps to represent time during transmission, the time module is also needed for time conversion.
import pymssql
import pandas as pd
import time
conn = pymssql.connect(host='****.aliyuncs.com:3433', user='****',
password='****', database='***', charset='utf8')
powersql='SELECT * FROM ***.dbo.***;'
powerdata = pd.read_sql_query(powersql, conn)
# Read energy consumption data
thsql='SELECT * FROM ***.dbo.***;'
thdata = pd.read_sql_query(thsql, conn)
# Read environmental data
A quick preview of the raw data, for example with `powerdata.head()`, shows one row per reading, with a 13-digit `gmtCreate` timestamp column alongside the measured values.
2.2 Timestamp Conversion and Local Data Saving
To convert 13-digit timestamps into datetime format, several functions from Python’s time library are needed. It is worth noting that after conversion, the time values are still only strings in time format as far as pandas is concerned. To truly use them as datetimes, further processing is needed later.
- `time.localtime(x*0.001)` converts a 13-digit timestamp into local time; multiplying by 0.001 converts milliseconds into seconds.
- `time.strftime("%Y-%m-%d %H:%M:%S", x)` formats the time into a human-readable string.
- In this example, Python's anonymous `lambda` function is used for fast data processing.
- To process an entire column, the pandas `map` method is used.
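As a minimal sketch of this conversion (the 13-digit timestamp below is a made-up example, and the printed result depends on the local timezone):

```python
import time

ts = 1565049600000  # hypothetical 13-digit millisecond timestamp
local = time.localtime(ts * 0.001)  # milliseconds -> seconds -> struct_time
formatted = time.strftime("%Y-%m-%d %H:%M:%S", local)
print(formatted)  # a string such as "2019-08-06 00:00:00", depending on timezone
```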
Some raw data is missing, so the fillna function is used for a simple zero-fill. Of course, the missing values could also be dropped or handled in other ways. The pandas to_csv function is used to save the processed data locally, so that the database does not need to be queried every time.
powerdata['gmtCreate'] = powerdata['gmtCreate'].map(
lambda x: time.strftime("%Y-%m-%d %H:%M:%S", time.localtime(x*0.001))
)
thdata['gmtCreate'] = thdata['gmtCreate'].map(
lambda x: time.strftime("%Y-%m-%d %H:%M:%S", time.localtime(x*0.001))
)
powerdata.fillna(0, inplace=True) # Replace missing values with 0 for processing
thdata.fillna(0, inplace=True) # Replace missing values with 0 for processing
powerdata.to_csv("powerdata.csv")
thdata.to_csv("thdata.csv")
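When the cached CSV is read back later, `parse_dates` restores the saved time strings to real datetimes in one step. A small self-contained sketch (the two-row frame here is a stand-in for the real `powerdata`):

```python
import pandas as pd

# A tiny stand-in frame; the real powerdata comes from the database query above
sample = pd.DataFrame({
    "gmtCreate": ["2019-08-06 10:00:00", "2019-08-06 10:01:00"],
    "current": [8.2, 8.5],
})
sample.to_csv("powerdata.csv", index=False)

# parse_dates turns the saved time strings back into real datetimes on reload
cached = pd.read_csv("powerdata.csv", parse_dates=["gmtCreate"])
```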
2.3 Data Processing and Aggregation
# Constant parameters used in the calculations
voltage = 0.23   # line voltage in kV (230 V), so current (A) x kV gives kW
phases = 3       # three-phase supply
dayscycle = 7    # analysis window: 7 days
To simplify the process, this example reads energy consumption data from the 7 days (one week) prior to the current day, and instead of analyzing each individual device, it aggregates the data by category.
Data slicing is performed using df.loc[filter] together with a filter expression.
- The filter is `(powerdata['gmtCreate'] >= startday) & (powerdata['gmtCreate'] <= endday)`.
- Since `powerdata['gmtCreate']` is still a string rather than a `datetime` object, and strings in this time format compare in chronological order, it can still be used as a filter in this way.
- A column named `load` is created to hold the total load.
- The original current readings are converted into power by multiplying by line voltage and phase count.
- To simplify processing, the original data is resampled at 1-minute intervals and averaged using `resample` and `mean`.
- The total power of each subsystem is then integrated into energy consumption data in kWh.
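The claim that these time strings can be compared directly is easy to verify: because every field in `"%Y-%m-%d %H:%M:%S"` is zero-padded and ordered from most to least significant, lexicographic order matches chronological order.

```python
# Zero-padded "%Y-%m-%d %H:%M:%S" strings sort chronologically
a = "2019-08-05 23:59:59"
b = "2019-08-06 00:00:00"
assert a < b  # string comparison agrees with time order
```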
# Analysis window: the 7 days before today
endday = pd.Timestamp.today().normalize()
startday = endday - pd.Timedelta(days=dayscycle)

powerdata['gmtCreate'] = pd.to_datetime(powerdata['gmtCreate'])
powerdata.set_index('gmtCreate', inplace=True)
powerdata = powerdata.loc[(powerdata.index >= startday) & (powerdata.index <= endday)]
powerdata['load'] = powerdata['current'] * voltage * phases  # current (A) x kV x phases -> kW
powerdata = powerdata.resample('1Min').mean()
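The final step, integrating mean power into energy, is not shown in the listing above. A sketch of how it works, using a made-up constant load: each 1-minute mean power value (in kW) represents 1/60 of an hour, so summing and dividing by 60 yields kWh.

```python
import pandas as pd

# Hypothetical 1-minute mean load: a constant 6 kW for one hour
idx = pd.date_range("2019-08-06 00:00", periods=60, freq="1Min")
load = pd.Series(6.0, index=idx)

# Each 1-minute mean contributes 1/60 of an hour, so summing and dividing
# by 60 integrates power (kW) into energy (kWh)
energy_kwh = load.sum() / 60
print(energy_kwh)  # 6.0 kWh for one hour at 6 kW
```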
3. Data Analysis and Visualization
After the data has been processed and aggregated, it can be visualized and analyzed.
A typical approach is to plot the energy consumption curve for a single day:
import matplotlib.pyplot as plt

powerdata.loc[(powerdata.index >= '2019-08-06 00:00:00') &
              (powerdata.index <= '2019-08-06 23:59:59')].plot.line(
    figsize=(16, 12), title='Typical Day'
)
plt.show()
In addition to line charts, bar charts are also commonly used. The following two charts show categorized energy consumption by day and in total, respectively.
powerdata.groupby(powerdata.index.floor('D')).sum().plot.bar(
figsize=(16, 12), title='Daily Energy Cost'
)
plt.show()
powerdata.sum().plot.bar(figsize=(16, 12), title='Total Energy Cost')
plt.show()
Finally, the user’s energy consumption data can be summarized simply:
print('Total energy consumption last week was approximately %.2f kWh.' % powerdata['totalpower'].sum())
print('Including:')
print('Maintenance subsystem: approximately %.2f kWh,' % powerdata['maintain'].sum())
print('Pump subsystem: approximately %.2f kWh,' % powerdata['pumps'].sum())
print('Lighting subsystem: approximately %.2f kWh,' % powerdata['lights'].sum())
print('Ventilation subsystem: approximately %.2f kWh,' % powerdata['fans'].sum())
print('Fire protection and other usage: approximately %.2f kWh,' % powerdata['others'].sum())
In addition, pandas also provides the convenient and comprehensive statistical function describe.
powerdata.describe()
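On a hypothetical frame of load readings, `describe` returns the count, mean, standard deviation, min, quartiles, and max for each numeric column in one call:

```python
import pandas as pd

# A hypothetical column of 1-minute load readings in kW
demo = pd.DataFrame({"load": [5.5, 6.1, 5.8, 6.4]})
stats = demo.describe()  # count, mean, std, min, quartiles, max
print(stats)
```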
4. Conclusion and Outlook
This example shows some of the ways Python can be used in data science. The case involves about 250,000 records. For pandas, however, reading, writing, and analyzing this amount of data is still very fast, and even the longest operations can usually be completed within a few seconds. This is something that traditional office software simply cannot achieve. Efficient analysis and processing of massive data requires proper data analysis tools.
Of course, to simplify the analytical process, this example performs significant aggregation and simplification of the data. Similar to most data science work, the majority of effort goes into data acquisition, cleaning, and processing. Once the data is standardized, visualization and further processing can be carried out quickly. At the same time, however, the heavy simplification in this example leads to a considerable loss of data value. If the goal is only simple data visualization, much sparser data may already be sufficient. But for deep learning and more detailed applications, the data must be analyzed and processed much more carefully and precisely, rather than merely aggregated and summarized as in this article.
This case can provide the foundational data for subsequent machine learning work. Through machine learning, the data in this example can serve at least the following purposes:
- Combine environmental parameters and real-time energy prices to provide more optimized and cost-saving operating strategies for the system
- Use pattern learning to automatically identify hidden hazards and incidents that cannot be detected manually, and trigger timely alarms
- Use time series analysis to forecast future energy usage and system safety conditions
- Continuously optimize and build mature reference models for similar projects
- ...


