Scalable analysis of large-scale system logs for anomaly detection

Astekin, Merve

Authors

Astekin, Merve

Organizational Unit

Department of Computer Science

Type

PhD dissertation

Access

restrictedAccess

Publication Status

Unpublished

Abstract

System logs provide information regarding the status of system components and various events that occur at runtime. This information can support fault detection, diagnosis and prediction activities. However, it is a challenging task to analyze and interpret a huge volume of log data, which do not always conform to a standardized structure. As the scale increases, distributed systems can generate logs as a collection of huge volume of messages from several components. Thus, it becomes infeasible to monitor and detect anomalies efficiently and effectively by applying manual or traditional analysis techniques. There have been several studies that aim at detecting system anomalies automatically by applying machine learning techniques on system logs. However, they offer limited efficiency and scalability. We identified three shortcomings that cause these limitations: i) Existing log parsing techniques do not parse unstructured log messages in a parallel and distributed manner. ii) Log data is processed mainly in offline mode rather than online. That is, the entire log data is collected beforehand, instead of analyzing it piece-by-piece as soon as more data becomes available. iii) Existing studies employ centralized implementations of machine learning algorithms. In this dissertation, we address these shortcomings to facilitate end-to-end scalable analysis of large-scale system logs for anomaly detection. We introduce a framework for distributed analysis of unstructured log messages. We evaluated our framework with two sets of log messages obtained from real systems. Results showed that our framework achieves more than 30% performance improvement on average, compared to baseline approaches that do not employ fully distributed processing. In addition, it maintains the same accuracy level as those obtained with benchmark studies although it does not require the availability of the source code, unlike those studies.
Our framework also enables online processing, where log data is processed progressively in successive time windows. The benefit of this approach is that some anomalies can be detected earlier. The risk is that the accuracy might be hampered. Experimental results showed that this risk occurs rarely, only when a window boundary cross-cuts a session of events. On the other hand, average anomaly detection time is reduced significantly. Finally, we introduce a case study that evaluates distributed implementations of PCA and K-means algorithms. We compared the accuracy and performance of these algorithms both with respect to each other and with respect to their centralized implementations. Results showed that the distributed versions can achieve the same accuracy and provide a performance improvement by orders of magnitude when compared to their centralized versions. The performance of PCA turns out to be better than K-means, although we observed that the difference between the two tends to decrease as the degree of parallelism increases.
System logs provide information about the status of system components and the various events that occur at runtime. This information can support fault detection, diagnosis, and prediction activities. However, analyzing and interpreting large-scale log data that does not always conform to a standard structure can become a very difficult task. As their scale grows, distributed systems can produce system logs as a mass of numerous messages coming from their various components. For this reason, manual or traditional analysis techniques fall short in monitoring system logs at this scale efficiently and effectively and in detecting anomalies. Various studies have aimed to detect system anomalies automatically by applying machine learning techniques to system logs. However, these studies remain limited in terms of efficiency and scalability. We identified three shortcomings that cause these limitations: i) Existing log parsing techniques do not parse unstructured log messages in a parallel and distributed manner. ii) Log data is generally processed in offline (batch) mode rather than online (streaming) mode. In other words, the entire log data is analyzed as a batch collected beforehand, instead of being processed piece by piece as new log messages arrive. iii) Existing studies use centralized implementations of machine learning algorithms. In this thesis, we address these shortcomings in order to facilitate end-to-end scalable analysis of large-scale system logs for anomaly detection. First, we present a framework for the distributed analysis of unstructured log messages. We evaluated our framework with two data sets of log messages obtained from real systems. The results showed that our framework provides a performance improvement of more than 30% on average compared to baseline approaches that do not perform fully distributed processing.
In addition, unlike other studies, our framework maintains the same accuracy level as that obtained with the benchmark studies without requiring the availability of the source code. Second, our framework provides an online processing infrastructure in which system log data is processed progressively in successive time windows. This approach allows some anomalies to be detected earlier. On the other hand, online processing may introduce the risk of reduced accuracy in anomaly detection. Experimental results showed that this risk arises rarely, only when a window boundary cuts across a session of events. Meanwhile, the average anomaly detection time was shortened significantly. Finally, in this thesis we present a case study that evaluates distributed implementations of the PCA and K-means algorithms. We compared the accuracy and performance of these algorithms both against each other and against their centralized implementations. The results showed that the distributed versions can reach the same accuracy and provide a performance improvement of tens of times compared with their centralized versions. The performance of the PCA algorithm was observed to be better than that of the K-means algorithm, although the difference between the two tends to decrease as the degree of parallelism increases.

Date

2019-05-30

URI

http://hdl.handle.net/10679/6320
http://discover.ozyegin.edu.tr/iii/encore/record/C__Rb3781648?lang=eng
https://tez.yok.gov.tr/
