PhD Dissertations
Permanent URI for this community: https://hdl.handle.net/10679/9876
Browsing by Subject "Big data"
Now showing 1 - 2 of 2
PhD Dissertation (Metadata only)
Digital oil refinery: utilizing real-time analytics and cloud computing over industrial sensor data (2018-12-14)
Khodabakhsh, Athar; Arı, İsmail; Şensoy, Murat; Kayış, Enis; Aktaş, M.; Alkaya, A. F.; Department of Computer Science

This thesis addresses big data challenges seen in large-scale, mission-critical industrial plants such as oil refineries. These plants are equipped with heavy machinery (boilers, engines, turbines, etc.) that is continuously monitored by thousands of sensors of various types for process efficiency, environmental safety, and predictive maintenance purposes. However, the sensors themselves are also prone to errors and failures, so the quality of the data received from them should be verified before it is used in system modeling or prediction. There is a need for reliable methods and systems that can provide data validation and reconciliation in real time with high accuracy. Furthermore, it is necessary to develop accurate, yet simple and efficient, analytical models that can be used with high-speed industrial data streams. In this thesis, the design and implementation of a novel method called DREDGE is proposed and presented, first by developing methods for real-time data validation, gross error detection (GED), and gross error classification (GEC) over multivariate sensor data streams. The validated, high-quality data obtained from these processes is later used for pattern analysis and modeling of industrial plants. We obtained sensor data from the power and petrochemical plants of an oil refinery and analyzed it using various time-series modeling and data mining techniques integrated into a complex event processing (CEP) engine. Next, the computational performance implications of the proposed methods are studied, and regimes that are sustainable over fast streams of sensor data are uncovered. Distributed Control Systems (DCS) continuously monitor hundreds of sensors in industrial systems, and the relationships between system variables can change over time. Operational mode (or state) identification methods are developed and presented for these large-scale industrial systems using stream analytics, which are shown to be more effective than batch processing models, especially for time-varying systems. To detect drifts among modes, predictive modeling techniques such as regression analysis and K-means and DBSCAN clustering are used over sensor data streams from an oil refinery, and models are updated in real time using window-based analysis. In addition, shifts among the steady states of the data, which represent the system's multiple operating modes, are detected, and the time when a model reconstruction is required is identified using the DBSCAN algorithm. An adaptive window size tuning approach based on the TCP congestion control algorithm is proposed, which reduces model update costs as well as prediction errors. Finally, we propose a new Lambda architecture for the Oil & Gas industry for unified data and analytical processing over DCS. We discuss cloud integration issues and share our experiences with the implementation of sensor fault detection and classification modules inside the proposed architecture.
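The adaptive window tuning idea above borrows from TCP congestion control. Below is a minimal sketch of one plausible reading, assuming an AIMD (additive-increase/multiplicative-decrease) rule keyed to prediction error; the function and parameter names are hypothetical, and the dissertation's actual policy may differ.

```python
# Hypothetical AIMD-style window tuning for streaming model updates, keyed
# to prediction error. The dissertation bases its tuning on TCP congestion
# control; the exact rule and parameters below are assumptions.

def tune_window(window_size, prediction_error, error_threshold,
                min_size=32, max_size=4096, step=16, backoff=0.5):
    """Return the next window size for window-based model rebuilding."""
    if prediction_error <= error_threshold:
        # Additive increase: the model predicts well, so widen the window
        # and rebuild the model less often (lower update cost).
        return min(window_size + step, max_size)
    # Multiplicative decrease: an error spike suggests drift or a mode
    # shift, so shrink the window and let the model re-adapt quickly.
    return max(int(window_size * backoff), min_size)

# Example: errors stay low, then spike when the plant changes operating mode.
size = 128
for err in (0.02, 0.03, 0.02, 0.35, 0.04):
    size = tune_window(size, err, error_threshold=0.1)
    print(size)  # 144, 160, 176, 88, 104
```

As in TCP, the window probes upward while conditions are good and backs off sharply on a "loss" signal; here the loss signal is a prediction-error spike indicating that the current model no longer fits the stream.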
PhD Dissertation (Metadata only)
Scalable analysis of large-scale system logs for anomaly detection (2019-05-30)
Astekin, Merve; Sözer, Hasan; Arı, İsmail; Öztop, Erhan; Aktaş, M. S.; Akyokuş, S.; Department of Computer Science

System logs provide information regarding the status of system components and various events that occur at runtime. This information can support fault detection, diagnosis, and prediction activities. However, it is a challenging task to analyze and interpret a huge volume of log data, which does not always conform to a standardized structure. As the scale increases, distributed systems can generate logs as a huge volume of messages collected from several components. Thus, it becomes infeasible to monitor and detect anomalies efficiently and effectively by applying manual or traditional analysis techniques. There have been several studies that aim at detecting system anomalies automatically by applying machine learning techniques to system logs. However, they offer limited efficiency and scalability. We identified three shortcomings that cause these limitations: i) existing log parsing techniques do not parse unstructured log messages in a parallel and distributed manner; ii) log data is processed mainly in offline mode rather than online, that is, the entire log data is collected beforehand instead of being analyzed piece by piece as soon as more data becomes available; iii) existing studies employ centralized implementations of machine learning algorithms. In this dissertation, we address these shortcomings to facilitate end-to-end scalable analysis of large-scale system logs for anomaly detection. We introduce a framework for distributed analysis of unstructured log messages. We evaluated our framework with two sets of log messages obtained from real systems. Results showed that our framework achieves more than 30% performance improvement on average, compared to baseline approaches that do not employ fully distributed processing. In addition, it maintains the same accuracy level as benchmark studies, although, unlike those studies, it does not require the availability of the source code. Our framework also enables online processing, where log data is processed progressively in successive time windows. The benefit of this approach is that some anomalies can be detected earlier; the risk is that accuracy might be hampered. Experimental results showed that this risk occurs rarely, only when a window boundary cuts across a session of events. On the other hand, the average anomaly detection time is reduced significantly. Finally, we introduce a case study that evaluates distributed implementations of the PCA and K-means algorithms. We compared the accuracy and performance of these algorithms both with respect to each other and with respect to their centralized implementations. Results showed that the distributed versions can achieve the same accuracy and provide a performance improvement of orders of magnitude when compared to their centralized versions. The performance of PCA turns out to be better than that of K-means, although we observed that the difference between the two tends to decrease as the degree of parallelism increases.
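The online, window-based processing described above can be pictured with a small sketch. This is an illustrative single-process version under an assumed log format and a caller-supplied detector (the dissertation's framework is distributed); it also shows where the window-boundary risk mentioned in the abstract comes from.

```python
# Illustrative sketch of online, window-based log anomaly detection.
# The line format, session notion, and detector are simplified assumptions,
# not the dissertation's actual framework.
from collections import defaultdict

WINDOW_SECONDS = 60

def parse_line(line):
    # Assumed format: "<epoch_seconds> <session_id> <event_type>"
    ts, session_id, event = line.split(maxsplit=2)
    return float(ts), session_id, event

def detect_online(log_stream, is_anomalous):
    """Yield suspicious session ids window by window, as data arrives,
    instead of collecting the entire log first (offline mode)."""
    window_start, sessions = None, defaultdict(list)
    for line in log_stream:
        ts, sid, event = parse_line(line)
        window_start = ts if window_start is None else window_start
        if ts - window_start >= WINDOW_SECONDS:
            # Close the window. A session cut by this boundary is scored on
            # partial data -- the rare accuracy risk noted in the abstract.
            for s, events in sessions.items():
                if is_anomalous(events):
                    yield s
            window_start, sessions = ts, defaultdict(list)
        sessions[sid].append(event)
    for s, events in sessions.items():  # flush the final window
        if is_anomalous(events):
            yield s

# Example: flag sessions containing an ERROR event.
logs = ["0 a START", "10 a ERROR", "70 b START", "80 b OK", "140 c ERROR"]
print(list(detect_online(logs, lambda ev: any("ERROR" in e for e in ev))))
# -> ['a', 'c']
```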
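The case study's observation that distributed K-means matches centralized accuracy follows from the algorithm's map/reduce structure: each data partition contributes small per-cluster aggregates that merge exactly. Below is a minimal NumPy sketch of one such update step; it is not the dissertation's implementation, which runs on a distributed engine, and the partitioning here is simulated in one process.

```python
# Sketch of why the K-means update parallelizes naturally: each partition
# computes assignments and partial (k x d) sums independently, and only
# these small aggregates are merged before the centers are updated.
import numpy as np

def partial_stats(points, centers):
    """Map step: per-partition cluster sums and counts."""
    dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
    labels = dists.argmin(axis=1)
    sums = np.zeros_like(centers)
    counts = np.zeros(len(centers))
    np.add.at(sums, labels, points)
    np.add.at(counts, labels, 1)
    return sums, counts

def kmeans_step(partitions, centers):
    """Reduce step: merge partial aggregates, then update centers once."""
    sums = np.zeros_like(centers)
    counts = np.zeros(len(centers))
    for pts in partitions:  # each iteration could run on a separate worker
        s, c = partial_stats(pts, centers)
        sums += s
        counts += c
    nonempty = counts > 0
    centers = centers.copy()
    centers[nonempty] = sums[nonempty] / counts[nonempty, None]
    return centers

rng = np.random.default_rng(0)
parts = [rng.normal(loc, 0.1, (50, 2)) for loc in (0.0, 5.0)]
centers = np.array([[0.5, 0.5], [4.5, 4.5]])
print(kmeans_step(parts, centers))  # centers move toward (0,0) and (5,5)
```

Because the merged aggregates are identical to what a single machine would compute over the full data, the distributed version loses no accuracy, which is consistent with the case study's result.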