Real-Time Analytics with Hadoop: Integrating Streaming Engines for Performance Gains

Harsha Vardhan Reddy Goli

doi:10.61841/turcomat.v11i2.15250

Published: 2020-08-31

Keywords:

Hadoop, Real-time analytics, Apache Storm, Apache Flink, Hybrid big data architecture

Harsha Vardhan Reddy Goli

Software Developer, Alephys LLC, Texas, USA

Abstract

The rising demand for real-time data analytics in domains such as the Internet of Things (IoT) and telecommunications necessitates hybrid big data architectures that seamlessly combine batch and stream processing. This study investigates the integration of Hadoop with real-time streaming engines, specifically Apache Storm and Apache Flink, to address the challenges of low-latency analytics within traditional big data frameworks. We analyze performance tradeoffs, latency mitigation techniques, and fault tolerance mechanisms involved in such hybrid deployments. Through benchmarking and architectural evaluation, the research identifies key design considerations, including pipeline optimization and efficient resource management strategies that support concurrent batch and real-time workloads. Empirical insights from IoT and telecom use cases illustrate the effectiveness of integrating Hadoop’s scalable storage with the high-throughput, low-latency processing capabilities of modern stream engines. The findings affirm the practicality and performance benefits of adopting a unified analytics ecosystem for real-time data-driven decision-making.

Issue

Vol. 11 No. 2 (2020)

Section

Articles

This work is licensed under a Creative Commons Attribution 4.0 International License.

How to Cite

Goli, H. V. R. (2020). Real-Time Analytics with Hadoop: Integrating Streaming Engines for Performance Gains. Turkish Journal of Computer and Mathematics Education (TURCOMAT), 11(2), 1347-1358. https://doi.org/10.61841/turcomat.v11i2.15250
