Data Tarn: A New Approach for Management and Real-Time Analyses of Big Data

Document Type : Computer Article

Abstract

As the speed of data generation increases, so does the need to process, store, and analyze Big Data. Earlier work has produced real-time data warehouses, but because much of today's Big Data is unstructured, a data warehouse with the traditional structure does not meet the new management requirements of this type of data. Recently, the Data Lake has been proposed for unstructured data (with BASE properties). However, organizations hold important structured data (with ACID properties) on the one hand and less sensitive unstructured big data on the other, and managing both with either of these methods alone causes new problems. In this paper we offer a solution that stores structured and unstructured data simultaneously and responds to users' queries in real time. One important result of this research, reached by comparing the data warehouse and the Data Lake, is that the lake is not a replacement for the data warehouse: the data warehouse retains a particular use, especially for financial data, because it complies with the ACID properties, whereas the Data Lake caters to the requirements of BASE. The idea proposed in this paper has three main advantages: 1- Simultaneous use of the data warehouse and the Data Lake, to meet the organization's data needs with the benefits of both. 2- Separation of new data from old data, to achieve real-time response. 3- Greater parallelism, so that data loading and query processing run at the same time and reduce the time cost.
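
The three advantages listed in the abstract can be illustrated with a minimal Python sketch. It is not the paper's implementation. The class name `DataTarnSketch`, the `recent` and `history` tables, the in-memory SQLite database standing in for the ACID warehouse, and the plain list standing in for the BASE lake are all assumptions made only for illustration.

```python
import sqlite3
import threading


class DataTarnSketch:
    """Hypothetical illustration: warehouse + lake, new/old split, parallel load."""

    def __init__(self):
        # Assumption: in-memory SQLite stands in for the ACID data warehouse.
        self.warehouse = sqlite3.connect(":memory:", check_same_thread=False)
        # New data and old data live in separate tables (advantage 2).
        self.warehouse.execute(
            "CREATE TABLE recent (id INTEGER PRIMARY KEY, amount REAL)")
        self.warehouse.execute(
            "CREATE TABLE history (id INTEGER PRIMARY KEY, amount REAL)")
        # Assumption: an append-only list stands in for the BASE data lake.
        self.lake = []
        # One lock serializes access to the shared connection and the list.
        self.lock = threading.Lock()

    def ingest(self, record):
        """Route structured records to the warehouse, everything else to the lake."""
        structured = isinstance(record, dict) and record.keys() >= {"id", "amount"}
        with self.lock:
            if structured:
                with self.warehouse:  # commits on success, rolls back on error
                    self.warehouse.execute(
                        "INSERT INTO recent VALUES (?, ?)",
                        (record["id"], record["amount"]))
            else:
                self.lake.append(record)

    def archive(self):
        """Move new data into the old-data table in a single transaction."""
        with self.lock, self.warehouse:
            self.warehouse.execute("INSERT INTO history SELECT * FROM recent")
            self.warehouse.execute("DELETE FROM recent")

    def total_recent(self):
        """Query only the small 'recent' table, which keeps the query cheap."""
        with self.lock:
            row = self.warehouse.execute(
                "SELECT COALESCE(SUM(amount), 0) FROM recent").fetchone()
        return row[0]


def load_in_background(tarn, records):
    """Start loading on a separate thread so queries can run meanwhile (advantage 3)."""
    loader = threading.Thread(target=lambda: [tarn.ingest(r) for r in records])
    loader.start()
    return loader
```

A caller could start `load_in_background(tarn, records)`, call `tarn.total_recent()` while loading is still in progress, and then `join()` the returned thread. The single lock keeps the sketch simple. A real system would presumably need finer-grained concurrency control than this.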
