炼丹记之国家电投2020风电机组异常数据识别与清洗 baseline f1=0.858分享

赛题地址:

https://www.datafountain.cn/competitions/451

赛题任务:

依据提供的12台风力电机1年的10min间隔SCADA运行数据,包括时间戳信息、风速信息和功率信息等,利用机器学习相关技术,建立鲁棒的风电机组异常数据检测模型,用于识别并剔除潜在的异常数据,提高数据质量。
此任务未给出异常数据标签,视为聚类任务,为引导选手向赛题需求对接,现简单阐述异常数据定义。异常数据是由风机运行过程与设计运行工况出现较大偏离时产生,如风速仪测风异常导致采集的功率散点明显偏离设计风功率。

数据介绍:

https://www.datafountain.cn/competitions/451/datasets

以下是线上 f1 0.858方案:

#!/usr/bin/env python
# coding: utf-8
import pandas as pd
import numpy as np
from tqdm import tqdm
from matplotlib import pyplot as plt

data_df = pd.read_csv('../data/dataset.csv')
fan_info = pd.read_csv('../data/12faninfo.csv', names=["WindNumber", "fan_diam", "rated_power", "speed_in", "speed_out", "speed_min", "speed_max", "speed_range"])

data_fan_df = data_df.merge(fan_info, on='WindNumber')
data_fan_df['label'] = 0

#异常值
#三列值小于0
data_fan_df.loc[(data_fan_df['WindSpeed']  0 ) ,'label'] = 1
#风速大于切入,功率小于等于0
data_fan_df.loc[(data_fan_df['WindSpeed'] >= data_fan_df['speed_in']) & (data_fan_df['Power']  data_fan_df['speed_out']) & (data_fan_df['Power'] > 0)  ,'label'] = 1

#功率大于额定功率1.2倍
data_fan_df.loc[ data_fan_df['Power'] > 1.2*data_fan_df['rated_power'] ,'label'] = 1
#风轮异常
data_fan_df.loc[ data_fan_df['RotorSpeed'] > 1.2*data_fan_df['speed_max'] ,'label'] = 1
data_fan_df.loc[ data_fan_df['RotorSpeed']  hig_ratio*temp_data_df[corr_col_name]),'label'] = 1

                cut_label = pd.concat([cut_label, temp_data_df[['index','label']].copy() ], axis=0)
                temp_data_df = temp_data_df[ (temp_data_df[corr_col]  hig_ratio*temp_data_df[corr_col_name])]
                print(temp_data_df.shape)
                #分别画异常点、所有点散点图
                plt.scatter(temp_data_df[cut_col_name], temp_data_df[corr_col])
                plt.scatter(temp_x, temp_y, alpha=0.1)
                plt.xlabel(cut_col)
                plt.ylabel(corr_col)
                plt.show()
            cut_label.columns = ['index', 'labelcut']
            data_fan_df_filter = data_fan_df_filter.merge(cut_label, how='outer', on='index')
            data_fan_df_filter['labelcut'] = data_fan_df_filter['labelcut'].fillna(0)
            data_fan_df_filter['label'] = data_fan_df_filter['label'] + data_fan_df_filter['labelcut']
            data_fan_df_filter['label'] = data_fan_df_filter['label'].replace(2,1)
            data_fan_df_filter.drop(['labelcut'], axis=1,inplace=True)

filter_res  = data_fan_df_filter[['index', 'label']].copy()
filter_res.columns = ['index', 'labelres']
data_fan_df_cp = data_fan_df_cp.merge(filter_res, how='outer', on='index')
data_fan_df_cp['labelres'] = data_fan_df_cp['labelres'].fillna(1)
data_fan_df_cp['label'] = data_fan_df_cp['label'] + data_fan_df_cp['labelres']
data_fan_df_cp['label'] = data_fan_df_cp['label'].replace(2,1)
data_fan_df_cp.drop(['labelres'], axis=1,inplace=True)

data_sub = data_fan_df_cp[['WindNumber','Time','label']].copy()
data_sub.to_csv('../subs/sub_6_cut_low0.5_hight1.5.csv',index=None)

以上提交结果,线上约为:0.85824

本文由 @龙岩 发布于 职涯宝 ,未经作者许可,禁止转载,欢迎您分享文章

发表评论

登录后才能评论
小程序
小程序
微信客服
微信客服
QQ客服 建站服务
分享本页
返回顶部