A comprehensive and systematic literature review on the big data management techniques in the internet of things

  • Original Paper
  • Published: 15 November 2022
  • Volume 29, pages 1085–1144 (2023)


  • Arezou Naghib 1,
  • Nima Jafari Navimipour 2,3,
  • Mehdi Hosseinzadeh 4,5,6 &
  • Arash Sharifi 1


The Internet of Things (IoT) is a communication paradigm and a collection of heterogeneous interconnected devices. It produces large-scale, distributed, and diverse data called big data. Big Data Management (BDM) in IoT is used for knowledge discovery and intelligent decision-making and is one of the most significant research challenges today. There are several mechanisms and technologies for BDM in IoT. This paper aims to study the important mechanisms in this area systematically, covering articles published between 2016 and August 2022. Initially, 751 articles were identified, and a paper selection process reduced this number to 110 significant studies. The BDM mechanisms in IoT are studied in four categories: BDM processes, BDM architectures/frameworks, quality attributes, and big data analytics types. This paper also presents a detailed comparison of the mechanisms in each category. Finally, the development challenges and open issues of BDM in IoT are discussed. The review shows that predictive analysis and classification methods are used in many articles, whereas quality attributes such as confidentiality, accessibility, and sustainability receive less attention. Also, none of the articles use key-value databases for data storage. This study can help researchers develop more effective BDM methods for IoT in complex environments.


1 Introduction

The Internet of Things (IoT) is an emerging information technology model and a dynamic network that enables interaction between self-configuring, smart, and interconnected devices and humans [ 1 ]. The IoT's ubiquitous data collection devices (such as Radio-Frequency Identification (RFID) tags, sensors, Global Positioning Systems (GPS), Geographical Information Systems (GIS), drives, Near-Field Communication (NFC), actuators, and mobile phones) collect and share real-time, mobile, and environmental data for automatic monitoring, identification, processing, maintenance, and control [ 2 , 3 , 4 ]. The IoT ecosystem generally has five main components: IoT devices, including sensors and actuators that collect data and perform actions on things; IoT connectivity, including protocols and gateways, which creates communication in the IoT ecosystem between smart devices, gateways, and the cloud; an IoT cloud that is responsible for data storage, processing, analysis, and decision-making; IoT analytics and data management, which process the data; and end-user devices and user interfaces, which help control and configure the system [ 5 ]. The most important applications of IoT include environmental monitoring, disaster management, smart homes/buildings, smart farms, healthcare, smart cities, urban planning, smart manufacturing, intelligent transport systems, smart floods, financial risk management, supply chain management, water management, enterprise culture, cultural heritage, smart surveillance, military tracking and environment, digital forensics, underwater environments, and understanding social phenomena [ 6 , 7 , 8 , 9 , 10 , 11 , 12 , 13 , 14 , 15 , 16 , 17 , 18 , 19 , 20 , 21 , 22 ]. The IoT devices and sensors in Wireless Sensor Networks (WSN) generate large amounts of data. According to the International Data Corporation forecast, there will be 41.6 billion IoT devices generating 79.4 zettabytes of data in 2025. This massive structured, semi-structured, and unstructured data, which is expanding rapidly over time, results in "Big Data" [ 23 ]. Big data technologies are a new generation of distributed architectures and technologies that provide distributed data mining capabilities to extract value inexpensively and effectively from huge datasets with characteristics such as volume, velocity, variety, variability, veracity, and value [ 24 ]. Big data provides both opportunities and challenges for organizations and enterprises. Big data can improve data precision, be used for forecasting and decision-making, and give stakeholders more in-depth analytical findings [ 2 ]. Traditional data processing systems cannot collect, process, manage, and interpret such data effectively using conventional mechanisms. Therefore, big data requires a scalable architecture or framework for effective capture, storage, management, and analysis [ 25 ].

A major challenge in implementing IoT in real and complex environments is analyzing heterogeneous data volumes that contain a wide variety of knowledge content [ 26 ]. Various platforms, tools, and technologies have been developed for big data monitoring, collection, ingestion, storage, processing, analysis, and visualization [ 10 , 27 ]. These platforms and tools include Apache Hadoop, MapReduce, 1010data, Apache Storm, Cloudera, Cassandra, HP-HAVEn, SAP-Hana, Hortonworks, MongoDB, Apache Kafka, Apache Spark, Infobright, etc. Industries and enterprises use Big Data Analytics (BDA) with IoT technologies to handle the timely analysis of information streams and intelligent decision-making [ 28 , 29 , 30 ]. BDM in the IoT involves different analytics types [ 31 ]. Marjani et al. [ 29 ] discussed analytics types at the real-time, offline, memory, business intelligence, and massive levels. Singh and Yassine [ 28 ] divided analytics types into preprocessing, pattern mining, and classification. Gandomi and Haider [ 32 ] divided big data processing into two major phases: data management and data analytics. Also, Ahmed et al. [ 33 ] identified five aspects of big data: acquisition and storage; programming model; benchmark process; analysis; and application. Finally, ur Rehman et al. [ 34 ] divided BDA into five main steps: data ingestion, cleaning, conformation, transformation, and shaping.

However, despite the importance of BDM in the IoT and the rising challenges in this area, as far as we know, there is not any complete and detailed systematic review in this field. Hence, this paper tries to analyze the mechanisms of BDM in the IoT. The main contributions of this paper are as follows:

Presenting a study of the existing methods for BDM in the IoT.

Dividing BDM methods in the IoT into four main categories: BDM processes, BDM architectures/frameworks, quality attributes, and big data analytics types.

Dividing the BDM process in the IoT into six main steps, including data collection, communication, data ingestion, data storage, processing and analysis, and post-processing.

Dividing the BDM architecture/framework in the IoT into two main subcategories: BDM architectures/frameworks in IoT-based applications and BDM architectures/frameworks in the IoT paradigms.

Exploring the primary challenges, issues, and future works for BDM in the IoT.

The following subsection discusses related work to show the main differences between this review and similar studies. Also, the abbreviations used in this paper are presented in Table 1 .

1.1 Related work and contributions of this review

This section studies some review and survey articles on BDM in the IoT to highlight the need for a new review. In addition, it outlines their main advantages and disadvantages to distinguish the present review from them.

Ahmed et al. [ 27 ] analyzed several techniques for IoT-based big data. This article categorizes the literature based on parameters, including big data sources, system components, big data enabling technologies, functional elements, and analytics types. The authors also discussed connectivity, storage, quality of services, real-time analytics, and benchmarking as the critical requirements for big data processing and analytics.

Constante Nicolalde et al. [ 35 ] overviewed the technical tools used to process big data and discussed the relationship between BDA and IoT. The big data challenges are divided into four general categories: data storage and analysis; the discovery of knowledge and computational complexities; information security; and scalability and data visualization.

Talebkhah et al. [ 36 ] investigated the architecture, challenges, and opportunities of big data systems in smart cities. This article suggested a 4-layer architecture for BDM in smart cities. The layers of this architecture are data acquisition, data preprocessing, data storage, and data analytics. This article also considered the opportunities and challenges for smart cities, such as heterogeneity, design and maintenance costs, failure management, throughput, etc.

Bansal et al. [ 37 ] investigated state-of-the-art research on IoT and BDM. This article proposed a taxonomy based on BDM in the IoT applications, including smart transport, smart cities, smart buildings, and smart living. BDM steps are considered as data acquisition, communication, storage, processing, and retrieval. Also, the related surveys on BDM were divided into three general categories: surveys on IoT BDA, domain-specific surveys on IoT big data, and surveys on challenges in IoT big data. The authors classified the articles based on four major vendor services (Google, Amazon, Microsoft, and IBM) to integrate IoT and IoT big data with case studies. The big data management challenges in the IoT are considered based on 13 V’s challenges.

Marjani et al. [ 29 ] investigated state-of-the-art research efforts directed toward big IoT data analytics and proposed a new architecture for big IoT data analytics. This article discusses big IoT data analytic types under real-time, offline, memory-level, business intelligence, and massive level analytics categories.

Simmhan and Perera [ 38 ] presented the analytics requirements of IoT applications. They defined the relationship between data volume capacity and processing latency of new big data platforms. This article divided decision systems into visual analytics, alerts and warnings, reactive systems, control and optimization, complex systems, knowledge-driven intelligent systems, and behavioral and probabilistic systems.

Shoumy et al. [ 39 ] discussed frameworks and techniques for multimodal big data analytics. They divided multimodal big data analytics techniques into four topics: affective framework; multimodal framework; big data and analytics framework; and fusion techniques. Furthermore, Ge et al. [ 40 ] discussed the similarities and differences among big data technologies used in IoT domains and developed a conceptual framework. This article interpreted big data research and application opportunities in eight IoT domains (healthcare, energy, transportation, building automation, smart cities, agriculture, industry, and military) and discussed the advantages and disadvantages of big data technologies. In addition, it examined four aspects of big data processes: storage, cleaning/cleansing, analysis/analytics, and visualization.

Siow et al. [ 41 ] considered the analytics infrastructure from data generation, collection, integration, storage, and computing. This article presented a comprehensive classification of analytical capabilities consisting of five categories: descriptive, diagnostic, discovery, predictive, and prescriptive analytics. In addition, a 3-layered taxonomy of data analytics was presented, including data, analytics, and applications.

Fawzy et al. [ 42 ] investigated the techniques and technologies of IoT systems from BDA architecture and software engineering perspectives. This article proposed a taxonomy based on BDA systems in the IoT, including smart environments, human, network, energy, and environmental analytics. The taxonomy also covers the BDA target, approach, technology, and challenges, as well as software architecture and design, model-driven engineering, separation of concerns, and system validation and verification. The authors presented the IoT data features as multidimensional, massive, timely, heterogeneous, inconsistent, traded, valuable, and spatially correlated. The proposed domain-independent BDA-based IoT architecture has six layers: data manager, system resources controller, system recovery manager, BDA handler, software engineering handler, and security manager.

Zhong et al. [ 43 ] investigated using BDA and data mining techniques in the IoT. This article divided the review articles into four categories: architecture and platform, framework, applications, and security. The data mining methods for BDA in the IoT were discussed in these four categories. The challenges investigated in the article are as follows: data volume, data diversity, speed, data value, security, data visualization, knowledge extraction, and real-time analysis.

Hajjaji et al. [ 44 ] discussed applications, tools, technologies, architectures, current developments, challenges, and opportunities in big data and IoT-based applications in smart environments. This article divided the benefits of combining the IoT and big data into several categories: multi-source and heterogeneous data; connectivity; data storage; data analysis; and cost-effectiveness.

Ahmadova et al. [ 45 ] discussed big data applications in the IoT. They proposed a taxonomy of big data in the IoT that includes healthcare, smart cities, security, big data algorithms, industry, and general view. In the article, the authors discussed big data technologies' advantages and disadvantages for IoT domains. Also, the evaluation factors that are considered in the article are security, throughput, cost, energy consumption, reliability, response time, and availability.

Table 2 shows the summary contributions of related survey articles. The publication year, methodology, discussion, and other disadvantages are shown for each article in this table. Due to the existing weaknesses in the review articles, this paper presents a systematic literature review and a proper categorization of BDM mechanisms in the IoT that addresses the shortcomings as follows:

This paper provides a complete research methodology that includes research questions and the article selection process.

This paper discusses the newly proposed mechanisms for BDM in the IoT between 2016 and August 2022.

This paper considers the architectures/frameworks of IoT-based applications, including healthcare, smart cities, smart homes/buildings, intelligent transport, traffic control and energy, urban planning, and other IoT applications (smart IoT systems, smart flood, smart farms, disaster management, laundry, digital manufacturing, and smart factory).

This paper investigates the quality attributes and categorizes the review articles based on the quality attributes used and the reference model of standard software quality attributes, i.e., ISO 25010.

This paper classifies the review articles based on BDA types in the IoT and their tactics.

This paper considers the big data storage systems and tools in the IoT based on relational databases, NoSQL databases, distributed file systems, and cloud/edge/fog/mist storage.

This paper discusses the BDM process in six steps: data collection, communication, data ingestion, data storage, processing and analysis, and post-processing, and proposes the main tools in each step.

This paper presents open issues and challenges on BDM in the IoT and divides challenges into two categories: BDM in the IoT and quality attributes challenges.

The rest of the paper is structured as follows: Sect.  2 explains the research methodology and the article selection process. The categories of the BDM methods in the IoT and their comparison are described in Sect.  3 . Section  4 discusses the challenges and some open issues. Finally, Sect.  5 represents the conclusion and the paper’s limitations.

2 Research methodology

Systematic literature review (SLR) is a research methodology that examines data and findings of the researchers relative to specified questions [ 46 , 47 ]. It aims to find as much relevant research on the defined questions as possible and to use explicit methods to identify what can reliably be said based on these studies [ 48 , 49 ]. This section provides an SLR to understand the BDM techniques in the IoT. The following subsection will explain the research questions and the article selection process.

2.1 Research questions

This study focuses more explicitly on the articles related to BDM in the IoT, focusing on their advantages and disadvantages, architectures, processing and analysis methods, storage systems, evaluation metrics, and tools. To achieve the goals mentioned above, the following research questions are presented.

RQ1: What is BDM in IoT?

Section  1 answered this question.

RQ2: What is the importance of BDM in the IoT?

This question aims to show the number of published articles about BDM in IoT between 2016 and August 2022.

Section  2 answers this question.

RQ3: How are the articles searched and chosen to be assessed?

Section 2.2 discusses the question.

RQ4: What are the classifications of BDM methods in the IoT?

This question aims to show the existing methods of BDM in the IoT environment. Section  3 will discuss this answer.

RQ5: What are the challenges and technical issues of BDM in the IoT?

This question identifies the challenges for BDM in the IoT and provides open issues for future research. Section  4 will discuss this answer.

2.2 Article selection process

In this study, the article search and selection process consists of three stages. These stages are shown in Fig. 1. In the first stage, the articles between 2016 and August 2022 were searched based on the keywords and terms presented in Table 3. These articles are the results of searching popular electronic databases, including Google Scholar, Elsevier, ACM, IEEE Xplore, Emerald Insight, MDPI, Springer Link, Taylor and Francis, Wiley, JST, DBLP, DOAJ, and ProQuest. The articles include journal papers, book chapters, conference papers, books, notes, technical reports, and special issues. 751 articles were found in Stage 1. In Stage 2, two steps select the final set of articles to review. First, the articles are screened based on the inclusion criteria in Fig. 2, leaving 314 articles. Next, the review articles are removed; of the remaining 314 articles, 85 (27.07%) were review articles. Elsevier has the highest number of review articles (31.76%, 27 articles), while Emerald and Taylor and Francis have the lowest (2.35%, one article each). The highest number of published review articles is in 2019 (24.71%), and the lowest is in 2022 (8.24%). The number of remaining articles at this stage is 229. In Stage 3, the title and abstract of the articles are reviewed. Also, to ensure that the articles are relevant to the study, we reviewed the methodology, evaluation, discussion, and conclusion sections. The number of selected articles retained at this stage is 110. Elsevier publishes most of the selected articles (30.91%, 34 articles), and ACM the fewest (0.91%, one article). 2018 has the highest number of published articles (26.36%, 29 articles). The Future Generation Computer Systems journal publishes the highest number of articles (11.82%, 13 articles).

Fig. 1 Article search and selection process stages

Fig. 2 Inclusion criteria in the article selection process

3 Big data management approaches in the IoT

This section presents four different categories for the reviewed articles. These categories include the BDM process in the IoT (Sect. 3.1 ), BDM architectures/frameworks for IoT applications (Sect. 3.2 ), quality attributes (Sect. 3.3 ), and big data analytics types (Sect. 3.4 ). Each category has subcategories that will be considered in its relevant section. Figure  3 shows this taxonomy.

Fig. 3 Taxonomy of the selected articles

3.1 Big data management process in the IoT

This section categorizes articles based on BDM process mechanisms and presents a comprehensive framework for BDM in the IoT. The comprehensive framework for BDM in the IoT is shown in Fig.  4 . The steps of BDM in IoT include data collection, communication, data ingestion, storage, processing and analysis, and post-processing.

Fig. 4 Big data management framework in IoT

3.1.1 Data collection

A variety of sources generates IoT data. There are different mechanisms for IoT data collection, but there is still no fully efficient and adaptive mechanism for IoT data collection [ 50 ]. This paper divides IoT sources into sensors, applications, devices, and other resources. Figure  5 shows the classification of the sources based on these four categories.

Fig. 5 Big data source categories in IoT

3.1.2 Communication

The data sources are located on various networks, such as IoT sensor networks, wired and wireless sensor networks, fiber-optic sensor networks, and machine-to-machine communications. Communication technologies are required to move these data for processing and analysis [ 51 , 52 ]. There are several communication technologies and protocols in the IoT. The communication protocols used in the articles are IPv6, RPL, MQTT, CoAP, SSL, AMQP, WebSocket, 6LoWPAN/IPv6, AllJoyn, TCP/IP, and HTTP/IP. Communication technologies are compared based on frequency, data rate, range, power usage, cost, latency, etc. There are several categories of these communication technologies. This paper divides big data communication technologies in the IoT based on distance into three categories: PAN (personal area network), LAN (local area network), and WAN (wide area network). Table 4 shows the articles' classification based on these three categories. Wi-Fi, ZigBee, Bluetooth, and 4G LTE are the most frequently used communication technologies, appearing in 29, 19, 17, and 17 articles, respectively.
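As a minimal, hedged illustration of how a device might publish readings over one of these protocols, the sketch below sends a JSON temperature reading over MQTT with the paho-mqtt client; the broker address, topic name, and payload fields are illustrative assumptions rather than details taken from the reviewed articles.

```python
# Hedged MQTT publish sketch: a sensor node sends one JSON reading to a broker.
# Broker address, topic, and payload fields are hypothetical placeholders.
import json
import time

import paho.mqtt.client as mqtt  # paho-mqtt 1.x constructor shown below;
                                 # 2.x additionally takes a CallbackAPIVersion.

BROKER_HOST = "broker.example.local"   # assumed broker address
TOPIC = "iot/sensors/temperature"      # assumed topic name

client = mqtt.Client(client_id="sensor-001")
client.connect(BROKER_HOST, port=1883, keepalive=60)

reading = {"sensor_id": "sensor-001", "temp_c": 22.4, "ts": time.time()}
client.publish(TOPIC, json.dumps(reading), qos=1)  # QoS 1: at-least-once delivery
client.disconnect()
```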

3.1.3 Data ingestion

Data ingestion is the process of importing and transporting data in different formats from various sources (shown in Fig. 4) to a storage medium, processing and analysis platform, and decision support engines [ 93 , 94 ]. The quality of the dataset used by ML-based prediction models (classification) plays a vital role in BDM in the IoT. A prediction model requires a lot of correctly labeled data for correct construction, assessment, and accurate result generation [ 95 ]. Therefore, the data ingestion layer should deliver high-volume, high-velocity, varied, valuable, variable, and validated data to the processing and analysis step. In different articles, this layer has multiple tasks. The data ingestion layer in [ 96 ] includes identification, filtration, validation, noise reduction, integration, transformation, and compression. The data ingestion layer in [ 97 ] provides data synchronization, data slicing, data splitting, and data indexing. Also, the data ingestion layer in [ 98 ] includes data stream acquisition, data stream extraction, enrichment, integration, and data stream distribution. Finally, the data ingestion layer in [ 99 ] includes data cleaning, data integration, and data compression.

There are three categories of data ingestion technologies: real-time data ingestion, batch data ingestion, and hybrid (both). Real-time data ingestion is used for time-sensitive data and real-time intelligent decision-making. Batch data ingestion is used for data collection from sources at regular intervals (daily reports and schedules) [ 100 ]. There are many tools and platforms for data ingestion, such as Apache Kafka, Apache NiFi, Apache Storm, Apache Flume, Apache Sqoop, Apache Samza, Apache MiNiFi, Confluent Platform, and Elastic Logstash. These tools can be compared based on throughput, latency, scalability, and security [ 98 ]. The data ingestion layer in this paper includes data cleaning, data integration, data transformation/discretization, and data reduction. Each of these steps uses special tools, methods, and algorithms. Table 5 shows the categorization of articles based on the tools used for data ingestion. Data ingestion tools are compared based on ingestion type, throughput, reliability, latency, scalability, security, and fault tolerance. Platforms in some articles use a combination of these tools, such as the Hortonworks DataFlow platform in [ 101 ], which includes Apache NiFi/MiNiFi, Apache Kafka, Apache Storm, and Druid. As Table 5 shows, Apache Kafka is the most commonly used data ingestion tool, appearing in 8 articles. Also, Table 6 shows the categorization of articles based on the big data preprocessing stage in the IoT.
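To make the real-time ingestion path concrete, the hedged sketch below pushes simulated sensor readings into a Kafka topic with the kafka-python client, from which a stream processor such as Spark or Storm could consume; the bootstrap server and topic name are assumptions for illustration.

```python
# Hedged Kafka ingestion sketch: simulated sensor readings are serialized as JSON
# and sent to a topic for downstream stream processing. Broker and topic names
# are hypothetical placeholders.
import json
import time

from kafka import KafkaProducer  # kafka-python package

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",                       # assumed broker
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

for i in range(10):
    reading = {"device": f"dev-{i % 3}", "humidity": 40 + i, "ts": time.time()}
    producer.send("iot-ingest", value=reading)                # assumed topic name

producer.flush()  # block until all buffered records have been delivered
```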

3.1.4 Data storage

This subsection categorizes articles based on storage mechanisms. The articles use various methods and tools to store big data. This study divides these mechanisms into four categories: relational, NoSQL, Distributed File Systems (DFS), and cloud/edge/fog/mist storage. Each of these categories has subcategories. One of the most critical big data challenges is categorization and scalability, which traditional relational databases such as MySQL, SQL Server, and Postgres cannot overcome. Therefore, NoSQL databases are used to store big data. NoSQL technologies are divided into four categories: key-value, column-oriented, document-oriented, and graph-oriented [ 102 ]. These NoSQL technologies have many platforms to support their operations. Key-value storage is the most straightforward and highly flexible type of NoSQL database and stores all the data as key-value pairs. A column-oriented database stores data as a set of columns rather than rows, whereas a relational database stores data in rows and reads it row-by-row. A document-oriented database stores data as semi-structured documents. A graph database focuses on the relationships between data elements, and each element is stored as a node. Tables 7 and 8 show the types of storage methods used in the articles. Table 7 shows the classification of articles based on relational databases, NoSQL databases, and DFS. Notably, none of the 110 selected articles uses a key-value database. Among relational databases, Hive is the most commonly used; among NoSQL databases, HBase; and among distributed file systems, HDFS. Table 7 compares these storage tools and platforms based on in-memory or disk-based storage, data type, scalability, security, availability, flexibility, performance, fault tolerance, ease of use, and replication.
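Although none of the selected articles uses a key-value store, the pattern is straightforward; the hedged sketch below shows how sensor readings could be kept in Redis with redis-py, where the key encodes the device and timestamp and the value is a JSON blob. The host, key scheme, and expiry are assumptions for illustration.

```python
# Hedged key-value storage sketch with Redis: one reading per key, JSON value.
# Host, key naming scheme, and expiry time are hypothetical choices.
import json

import redis

r = redis.Redis(host="localhost", port=6379, db=0)   # assumed local instance

reading = {"device": "dev-7", "temp_c": 21.9, "ts": 1700000000}
key = f"reading:{reading['device']}:{reading['ts']}"
r.set(key, json.dumps(reading), ex=86400)            # expire after one day

stored = json.loads(r.get(key))
print(stored["temp_c"])
```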

Table 8 shows the classification of articles based on cloud/edge/fog/mist storage. Cloud computing provides scalable computing, high-capacity data storage, and processing power, and ensures the quality of applications. However, it faces challenges such as latency, network overhead, bandwidth, data privacy, lower real-time responsiveness, location awareness, security, reliability, data availability, and accessibility [ 103 ]. Network architectures such as fog, edge, and mist computing emerged to overcome these challenges by moving data and computation closer to the consumer and offloading some of the workload from the cloud [ 104 ].

Fog computing is a type of decentralized computing that sits between cloud storage and IoT devices. Fog computing reduces service latency, bandwidth, energy consumption, and storage and computing costs, and improves the QoS [ 149 ]. Fog computing for the IoT supports real-time services, mobility, and geographic distribution [ 150 ]. Another alternative approach to cloud computing is edge computing. Data storage and processing in edge computing occur closer to the device or data source to improve data locality, performance, and decision-making [ 151 ]. Edge computing is less scalable than fog computing but provides near real-time analytics and high-speed data access and reduces data leakage during transmission [ 104 , 152 ]. Mist computing is an intermediate layer between fog/cloud and edge computing. It can mitigate fog/cloud challenges such as response time, location awareness, data privacy, local decision-making, network overhead, latency, and computing and storage costs. Mist nodes have low processing power and storage [ 153 ]. In some articles, HDFS and NoSQL databases are used alongside cloud/edge/fog/mist storage; the goal is to overcome the disadvantages of these technologies by using them together.

3.1.5 Processing and analysis

Big data processing and analysis in the IoT are techniques or programming models for extracting knowledge from large amounts of data to support intelligent decisions [ 154 ]. Efficient big data processing and analysis in IoT can help mitigate many challenges in event management, action management, and control and monitoring, and can lead to improved customer service, cost savings, and better business relationships [ 155 ]. This paper divides the big data processing and analysis step in IoT into a set of sub-steps: batch and stream processing, query processing, statistical and numerical analysis, graph processing, ML, resource management, and infrastructure/containers. Table 9 shows the articles' classification and a comparison of the tools based on criteria such as throughput, reliability, availability, latency, scalability, security, flexibility, ease of use, and cost-effectiveness. Big data processing in the IoT is generally done at both batch and stream levels. Many tools, platforms, and frameworks exist for batch and stream processing. The tools used in the articles are Apache Hadoop, Apache Spark, MapReduce, Apache Storm, Apache Flink, Anaconda, Apache S4, Weka, Streaming Analytics Manager, and CEP.

As Table 9 shows, Apache Hadoop, MapReduce, and Apache Spark are the most widely used processing tools, appearing in 45, 32, and 31 articles, respectively. Some of these tools include a set of libraries and procedures for efficient processing and analysis. In this study, the libraries and functions used by the articles are Hadoop-pcap-lib, Hadoop-pcap-serde, Hadoop-pcap-input (Apache Hadoop); MLlib, GraphX, Spark Streaming, Spark SQL, Spark Core (Apache Spark); Map, FlatMap, Filter, Reduce, Shuffle (MapReduce); Gelly, FlinkML, Table and SQL, FlinkCEP (Apache Flink); NumPy [ 132 ], Keras [ 108 ], Pandas [ 59 ]; and Scikit-Learn, Paho-MQTT (Anaconda). Also, various algorithms and methods are used to process and analyze data, such as classification, clustering, regression, optimization algorithms, and SVM. Most of these tools provide these algorithms.
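As a hedged sketch of the kind of batch job these tools run, the example below uses PySpark to load JSON sensor readings and compute hourly per-device averages; the HDFS paths and field names are illustrative assumptions, not taken from any reviewed article.

```python
# Hedged PySpark batch-processing sketch: read JSON readings, aggregate hourly
# averages per device, and write the result as Parquet. Paths and column names
# are hypothetical.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("iot-batch-aggregation").getOrCreate()

readings = spark.read.json("hdfs:///iot/readings/*.json")   # assumed input path

hourly_avg = (
    readings
    .withColumn("hour", F.date_trunc("hour", F.col("ts").cast("timestamp")))
    .groupBy("device", "hour")
    .agg(F.avg("temp_c").alias("avg_temp_c"))
)

hourly_avg.write.mode("overwrite").parquet("hdfs:///iot/agg/hourly")  # assumed output
spark.stop()
```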

3.1.6 Post-processing

The post-processing step is another vital task in knowledge discovery from big data in the IoT. This paper divides the post-processing step into evaluation and selection (data governance), visualization/dashboard, intelligent decision, and service and application. The evaluation and selection stage evaluates the results obtained using test methods on different types of datasets. There are various criteria for assessing the results. In this section, the articles are categorized based on the methods they used for testing. These methods are divided into four categories: test methods, classification, clustering, and regression. Each of them uses various criteria for evaluation. Table 10 shows the articles' classification based on these four categories. The visualization/dashboard stage uses tools, graphs, tables [ 75 ], graphical user interfaces [ 59 ], and charts [ 92 ] to display the results; the tools used include Kibana, Plotly, Tableau, Microsoft Power BI, Grafana, vSphere, NodeJS, and Matplotlib [ 59 , 105 , 106 , 109 , 110 , 113 , 140 ]. Intelligent decisions can be made using stochastic binary decisions [ 156 ], ML, pattern recognition, soft computing, and decision models [ 51 , 53 , 74 ].
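For the classification criteria used in this evaluation stage, the hedged snippet below computes the usual measures with scikit-learn on placeholder labels; it simply illustrates the metrics rather than reproducing any specific article's experiment.

```python
# Hedged evaluation sketch: common classification metrics on placeholder labels.
from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             precision_score, recall_score)

y_true = [1, 0, 1, 1, 0, 1, 0, 0, 1, 0]   # illustrative ground truth
y_pred = [1, 0, 1, 0, 0, 1, 1, 0, 1, 0]   # illustrative predictions

print("accuracy :", accuracy_score(y_true, y_pred))
print("precision:", precision_score(y_true, y_pred))
print("recall   :", recall_score(y_true, y_pred))
print("F1-score :", f1_score(y_true, y_pred))
print("confusion matrix:\n", confusion_matrix(y_true, y_pred))
```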

Tables 11 and 12 show the datasets that the articles used for investigating and numerically assessing techniques for BDM in the IoT. These datasets are divided into two categories: (1) datasets described by name, repository, dataset characteristics, attribute characteristics, number of instances/size, and number of attributes; and (2) datasets described by name, website address, and size. The UCI Machine Learning Repository has been used repeatedly in the articles as a source of datasets for evaluating BDM techniques in the IoT.

3.2 Big data management architectures/frameworks in the IoT

This subsection investigates and analyzes the 71 articles that presented frameworks and architectures for BDM techniques in the IoT. These articles are divided into two categories: BDM architectures/frameworks in IoT-based applications (63 articles) and BDM architectures/frameworks in IoT paradigms (8 articles).

3.2.1 Big data management architectures/frameworks in the IoT applications

The architectural models used in the selected articles are layered, component-based, and cloud/fog-based architectures. A layered architecture is organized hierarchically, and each layer performs a service. The layered architecture makes the system more adaptable to emerging technologies at each layer and improves data acquisition and integration processes [ 167 ]. Component-based architecture is a framework that decomposes the system into reusable and logical components. The advantages of component-based architecture are increased quality, reliability, and component reusability, and reduced development time. In cloud-based or fog-based architectures, operations and components related to processing or storage are placed in the cloud or fog. Most of the proposed architectures are layered, and the most common types of BDM architectures in the IoT are 3-layer and 4-layer (22 and 20 articles). Also, most of the proposed architectures are in IoT-based healthcare (33.33%), followed by IoT-based smart cities (22.22%). The selected articles in this study used nine different operating systems for BDM in the IoT; Ubuntu is the most widely used, with 18 articles. The articles used programming languages to analyze and process big data in the IoT; Java, Python, and MATLAB are the major ones. In the following, these architectures and frameworks are examined. For a better presentation, we have divided them into seven categories in terms of IoT applications (healthcare, smart cities, smart home/building, intelligent transport, traffic control and energy, urban planning, and other IoT applications (smart IoT systems, smart flood, smart farms, disaster management, laundry, digital manufacturing, and smart factory)). We then review the attributes of the architectures and frameworks, including their layers, the functions of the layers, the operating system, the programming language, and the advantages and disadvantages of each.

3.2.1.1 BDM architectural/framework for IoT-based healthcare

Predicting health and disease and preventing deaths are essential in our modern world [ 168 , 169 ]. Healthcare IoT (e.g., electronic and mobile health) uses wireless body sensor networks for monitoring patients' environmental, physiological, and behavioral parameters [ 170 ]. Wearables and other IoT devices within the healthcare industry generate a large amount of data. The health data must be collected, stored, processed, and analyzed for future intelligent decision-making. BDA plays a vital role in minimizing computation time, predicting the future status of individuals, providing reliable health services, prevention, healthy living, population health, early detection, and optimal management [ 133 , 158 , 171 ]. BDM mechanisms have different objectives and requirements for different types of medical data [ 172 ]. Various studies have presented mechanisms for BDM in IoT-based healthcare, each with advantages and disadvantages. Therefore, this subsection examines the articles (21 articles; 33.33%) that discussed the architectures or frameworks of BDM in IoT-based healthcare.

Rathore et al. [ 58 ] proposed Hadoop-based intelligent healthcare using a BDA approach. This system collected the big data and directed them to a 3-unit smart building for storing and processing. The units of this system are big data collection, Hadoop processing, and analysis and decision. This system used a 5-layer architecture for parallel, real-time, and offline processing. The layers of this architecture are the data collection, communication, processing, management, and service layers. The data collection layer includes data sensing, acquisition, buffering, and filtration. The big data are divided into small pieces in the processing layer, processed in parallel using HDFS and MapReduce, and stored. The management layer uses medical expert systems for processing the results and recommending corresponding actions.

Chui et al. [ 126 ] proposed a 6-layer architecture for patient behavior monitoring based on big data and IoT. Message queue, Apache Hadoop, behavior analytics, Mongo database, distributed stream processing, and exposer are the layers of this architecture. This architecture uses Hadoop for processing (descriptive, diagnostic, predictive, and prescriptive analytics), MongoDB for storing, Spark/Flink/Storm for stream processing, and Apache Kafka for breaking up the data stream into several partitions. Also, the authors have discussed the challenges of trust, security, privacy, and interoperability in the healthcare research field.

Ullah et al. [ 140 ] proposed a lightweight Semantic Interoperability Model for Big-Data in IoT (SIMB-IoT). The SIMB-IoT model has two main components: user interface and semantic interoperability. The semantic interoperability component is divided into three subcomponents: semantic interoperability, cloud services, and big data analytics. IoT data is collected and directed into an intelligent health cloud for online storage and processing. After processing, suitable medicine recommendations are sent to the patient's IoT devices. This article used SPARQL queries to find hidden patterns.

Elhoseny et al. [ 173 ] presented a Parallel Particle Swarm Optimization (PPSO) algorithm for IoT big data analysis in cloud computing healthcare applications. The aims of this article are to optimize virtual machine selection and storage using GA, PSO, and PPSO algorithms, to enable real-time processing, and to reduce execution time. This architecture has four components: stakeholders' devices, tasks, cloud broker, and network administrator. The cloud broker sends requests to and receives requests from the cloud. The network administrator finds the optimal selection of virtual machines in the cloud for task scheduling.
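To illustrate the optimization idea behind PPSO, the hedged sketch below runs a plain, serial particle swarm optimization over a toy objective standing in for the VM-selection cost; the parallel particle evaluation and the actual scheduling objective from the article are omitted, so this is only a sketch of the underlying technique.

```python
# Hedged, serial particle swarm optimization sketch. It minimizes a toy cost
# function standing in for the VM-selection objective; the parallel (PPSO)
# variant distributes particle evaluation, which is not shown here.
import numpy as np

rng = np.random.default_rng(0)

def cost(x):
    # Placeholder objective: sphere function as a stand-in for execution time.
    return np.sum(x ** 2, axis=-1)

n_particles, dim, iters = 20, 5, 100
w, c1, c2 = 0.7, 1.5, 1.5                      # inertia and acceleration weights

pos = rng.uniform(-5, 5, (n_particles, dim))
vel = np.zeros_like(pos)
pbest = pos.copy()
pbest_cost = cost(pbest)
gbest = pbest[np.argmin(pbest_cost)]

for _ in range(iters):
    r1, r2 = rng.random((2, n_particles, dim))
    vel = w * vel + c1 * r1 * (pbest - pos) + c2 * r2 * (gbest - pos)
    pos = pos + vel
    c = cost(pos)
    improved = c < pbest_cost
    pbest[improved], pbest_cost[improved] = pos[improved], c[improved]
    gbest = pbest[np.argmin(pbest_cost)]

print("best cost found:", cost(gbest))
```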

Manogaran et al. [ 141 ] proposed a secured cloud-fog-based architecture for storing and processing real-time data for healthcare applications. This architecture has two sub-architectures: meta fog-redirection and grouping-and-choosing. The meta fog-redirection architecture has three phases: data collection, data transfer, and big data storage. The data collection phase collects data from sensors in fog computing. The data transfer phase uses the 's3cmd utility' method for transferring data to Amazon S3. The big data storage phase uses Apache Pig and Apache HBase for storage. The grouping-and-choosing architecture protects data and provides security services in fog and cloud environments. Also, this architecture uses MapReduce for prediction.

García-Magariño et al. [ 156 ] proposed an agent-based simulation framework for IoT BDA in smart beds. This framework has two layers: a mechanism for simulating sleepers' postures and an analyzer of the collected information. The first layer simulates the poses of sleepers. The second layer analyzes the data collected by the first layer. The agent types in this framework are the sleeper agent, weight sensor agent, bed agent, observer agent, analyzer agent, stochastic sleeper agent, bed sleeper agent, restless sleeper agent, and healthy sleeper agent. This framework helps researchers test different sleeper posture recognition algorithms, examine other sleeper behaviors, and perform online or offline detection mechanisms.

Yacchirema et al. [ 59 ] proposed a 3-layer architecture for sleep monitoring based on IoT and big data at the network's edge. The layers of this architecture are the IoT layer, the fog layer, and the cloud layer. The IoT layer collected and aggregated the big data and directed them to the fog layer. The fog layer is responsible for connectivity and interoperability between heterogeneous devices, preprocessing the collected data, and sending notifications to react in real-time. The big data is stored, processed, and analyzed in the cloud layer for intelligent decision-making. This layer has three modules: data management, big data analyzer, and web application. This architecture used HDFS for data storage and Spark for offline and real-time processing.

BigReduce [ 137 ] is a cloud-based IoT framework for big data reduction for health monitoring in smart cities that focuses on reducing energy costs. This framework has two schemes: real-time big data reduction and intelligent big data decision-making. The big data reduction is made in two phases: at the time of acquisition and before transmission using an event-insensitive frequency content process.

Ma et al. [ 33 ] proposed a 3-layer architecture for the IoT big health system based on cloud-to-end fusion. The layers of this architecture are the big health perception layer, transport layer, and big health cloud service layer. In the big health perception layer, data are collected and preprocessed. The transport layer sends data to sensor nodes and receives data from the perception layer using network technologies. The big health cloud service layer has two sub-layers: the cloud service support and the cloud service application. The cloud service support sub-layer is responsible for compressing, storing, processing, and analyzing the real-time data. The cloud service application sub-layer is the interface between users and health networking. This sub-layer controls the sensor nodes and visualizes the big data.

Rathore et al. [ 61 ] proposed the 5-layer architecture for big data IoT analytics-based real-time medical emergency response systems. The data collection layer is responsible for data sensing, acquisition, buffering, filtration, and processing. This layer collected and aggregated data using a coordinator or relay node and transmitted them to a polarization mode dispersion. The communication layer provides device-to-device communication to various smart devices. The processing layer divides big data into small chunks. Each chunk is processed separately, aggregated, and stored. This article used MapReduce, HDFS, and Spark for data processing and analysis. The management layer is responsible for managing all types of outcomes using a medical expert system. The service layer is the interface between end-users and health networking. This architecture minimized the processing time and increased the throughput.

El‐Hasnony et al. [ 84 ] proposed a hybrid real-time remote patient monitoring framework based on mist, fog, and cloud computing. This article provided a 5-layer architecture for near real-time data analysis. The layers are the perception layer, the mist layer, the fog layer, the cloud layer, and the service provider layer. The mist layer is responsible for data filtering, data fusion, anomaly detection, and data transmission to the fog layer. The fog layer performs local monitoring and analysis, data aggregation, local storage, data pre-analysis, and data transmission to the cloud layer. The cloud layer implements several data analytics techniques for intelligent decision-making and storage. This article presented a case study comparing traditional data mining techniques, including REPtree, MLP, Naive Bayes (NB), and sequential minimal optimization algorithms. The results showed that the REPtree algorithm achieved the best accuracy, and NB required the least time.
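The hedged sketch below shows the general shape of such a classifier comparison using scikit-learn stand-ins (a decision tree for REPtree, SVC for sequential minimal optimization) on a synthetic dataset; the original case study used different implementations and real patient data, so this only illustrates the comparison pattern.

```python
# Hedged classifier-comparison sketch on synthetic data. DecisionTree and SVC
# are stand-ins for REPtree and SMO; dataset and parameters are hypothetical.
import time

from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=2000, n_features=20, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=42)

models = {
    "DecisionTree (REPtree stand-in)": DecisionTreeClassifier(random_state=42),
    "MLP": MLPClassifier(max_iter=500, random_state=42),
    "NaiveBayes": GaussianNB(),
    "SVC (SMO stand-in)": SVC(),
}

for name, model in models.items():
    start = time.perf_counter()
    model.fit(X_tr, y_tr)
    elapsed = time.perf_counter() - start
    acc = accuracy_score(y_te, model.predict(X_te))
    print(f"{name:32s} accuracy={acc:.3f} train_time={elapsed:.2f}s")
```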

Harb et al. [ 106 ] proposed a 4-layer architecture for real-time BDA for patient monitoring and decision-making in healthcare applications. The layers of this platform are real-time patient monitoring; real-time decision and data storage; patient classification and disease diagnosis; and data retrieval and visualization. The first layer is responsible for data ingestion using the Kafka and Sqoop tools. The second layer processes and stores data using Spark and Hadoop HDFS; it preprocesses data and finds missing records using MissRec (a script for Spark). The third layer is responsible for classifying data using stability-based K-means, an adapted version of K-means clustering, and for disease diagnosis using a modified version of the association rule mining algorithm. The last layer retrieves and visualizes data to understand the patient's situation using Hive, SparkSQL, and Matplotlib.

Zhou et al. [ 62 ] proposed a data mining technology based on the IoT. The layers of the proposed functional architecture are the data acquisition layer, data transmission layer, data storage layer, and cloud service center layer. This article used the WIT120 system for data collection, the adaptive k-means clustering method based on the MapReduce framework for data preprocessing, HDFS for storing, and the GM (1,1) grey model for users’ health status prediction.

Hong-Tan et al. [ 90 ] proposed a real-time Ambient Intelligence assisted Student Health Monitoring System (AmIHMS). The data required by the ambient intelligence environment are collected from the WSN and sent to the cloud for handling. Their work developed a framework for effective real-time alerting of student health information. The AmIHMS architecture has three layers. The IoT layer collects health data from medical devices and sensors and saves it on a mobile computer or smartphone. The cloud layer receives the data through internet platforms such as 4G, 5G, LTE, etc., and executes the mining algorithms to extract relevant data for processing. The student health monitoring layer performs four stages to provide information and warnings about student health status: data retrieval, preprocessing, normalization, and classification/health status recognition.

Li [ 30 ] designed the fog-based Smart and Real-time Healthcare Information Processing (SRHIP) system. The SRHIP architecture has three layers. The IoT body sensor network layer performs data collection (health, environment, and locality), aggregation, compression, and encryption. The fog processing and computation layer uses the Spark and Hadoop ecosystem for information extraction, data normalization, the rule engine, data filtration, and data processing; this layer performs classification using the NB classifier. The cloud computation layer performs in-depth data analysis, storage, and decision-making. SRHIP minimizes the delay, transmission cost, and data size. This article uses hierarchical symmetric-key data encryption to increase confidentiality.

The Improved Bayesian Convolution Network (IBCN) was proposed for human activity recognition [ 87 ]. The system architecture includes Wi-Fi and cloud onboard applications. The combination of a variational autoencoder with a standard deep-net classifier is used to improve the performance of IBCN. This article used convolution layers to extract the features and Enhanced Deep Learning (EDL) for security issues. IBCN provided the ability to download data via traditional radio frequency or low-power back-distribution communication. According to the experimental analysis, the proposed method allows the network to be continuously improved as new training sets are added and distinguishes between data-dependency and model-dependency. This architecture has high accuracy, versatility, flexibility, and reliability.

Sengupta and Bhunia [ 88 ] implemented a 3-layer IoT-enabled e-health framework for secure real-time data management using Cloudlet. The IoT layer uses an IoT Hub for communicating with IoT devices. The Cloudlet layer is an intermediate layer between the IoT and cloud layers and performs in-depth healthcare data analytics and processing. The cloud layer runs various analytics applications and processes queries. This framework uses SQLite for data storage in the IoT Hub and Cassandra for long-term storage of sensed data. The results demonstrated that this framework has high efficiency and low data transmission time, communication energy, data-packet loss, and query response time.

IBDAM [ 133 ] is an Intelligent BDA Model for efficient cardiac disease prediction in the IoT using multi-level fuzzy rules and valuable feature selection. This article uses the open-source UCI database. First, it performs preprocessing on the UCI database; the next step uses multi-level fuzzy rule generation for feature selection. IBDAM uses an optimized Recurrent Neural Network (RNN) to train on the features. Finally, the features are classified into labeled classes according to the risk evaluation by a medical practitioner. The results of this article demonstrate that this architecture has high performance and is quick and accurate.

Ahmed et al. [ 158 ] proposed an IoT-based health monitoring framework for pandemic disease analysis, prediction, and detection, such as COVID-19, using BDA. In this framework, the COVID-19 data set is collected from different data sources. Four data analysis techniques are performed on these data, including descriptive, diagnostic, predictive, and prescriptive. The experts opine on the results, and then users receive the results of these analyses through the internet and cloud servers. This article uses a neural network-based model for diagnosing and predicting the pandemic. The results of this article indicated that the accuracy, precision, F-score, and recall of the proposed architecture are better than AdaBoost, k-Nearest Neighbors (KNN), logistic regression, NB, and linear Support Vector Machine (SVM).

Ahanger et al. [ 71 ] proposed an IoT-based healthcare architecture for real-time COVID-19 data monitoring and prediction based on fog and cloud computing. This architecture has four layers. The data collection layer collects data from sensors and uses protocols to guarantee information security. The information classification layer classifies the information into four classes: health data, meteorological data, location data, and environmental data. The COVID-19 mining and extraction layer is responsible for splitting information into two groups using a fuzzy C-means procedure in the fog layer. The COVID-19 prediction and decision modeling layer uses a temporal RNN for estimating the results of the COVID-19 measure and a self-organizing map-based technique to increase the perceived viability of the model. In contrast to existing methods, the proposed approach has high classification efficiency, viability, precision, and reliability.

Oğur et al. [ 109 ] proposed a real-time data analytics architecture for smart healthcare in IoT. This architecture has two domains. Software-defined networking-based WSN and RFID technology are used in the vertical domain, and data analytics tools, including Kafka, Spark, MongoDB, and NodeJS, are used in the horizontal domain. The data collected from the WSN using RFID are transmitted to the Kafka platform using TCP sockets. Kafka sends the data to three consumers: the Apache Spark analysis engine that analyzes data in real time, the NodeJS web application that visualizes patient data, and the MongoDB database that stores data. This article uses logistic regression and Apache Spark MLlib for data classification. The results demonstrated that this architecture has high performance and accuracy and is appropriate for a time-saving experimental environment.
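In the spirit of that pipeline, the hedged sketch below trains a logistic regression model with Spark MLlib on tabular health readings; the input path, feature columns, and label column are illustrative assumptions rather than details from the article.

```python
# Hedged Spark MLlib logistic regression sketch. Input path and column names
# ("heart_rate", "spo2", "temp_c", "label") are hypothetical placeholders.
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.feature import VectorAssembler
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("health-classification").getOrCreate()

data = spark.read.csv("hdfs:///health/vitals.csv", header=True, inferSchema=True)

assembler = VectorAssembler(
    inputCols=["heart_rate", "spo2", "temp_c"],   # assumed feature columns
    outputCol="features",
)
dataset = assembler.transform(data).select("features", "label")

train, test = dataset.randomSplit([0.8, 0.2], seed=42)
model = LogisticRegression(featuresCol="features", labelCol="label").fit(train)

predictions = model.transform(test)
accuracy = predictions.filter("label = prediction").count() / test.count()
print("test accuracy:", accuracy)
spark.stop()
```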

Table 13 shows the result of the analysis of these articles. This table shows each article's architecture or framework name, operating system, programming language, advantages, and disadvantages. As the table shows, layered architecture is the most common, used in 14 articles.

3.2.1.2 BDM architectural/framework for IoT-based smart cities

According to United Nations forecasting, about 67% of the world population will live in urban areas by 2050, resulting in environmental pollution, ecosystem destruction, energy shortage, emission-reduction pressure, and resource limitation [ 36 , 174 , 175 ]. Smart cities are large-scale distributed systems that could be a solution to overcoming these problems and improving intelligent services for residents [ 112 , 176 ]. Smart cities have many deployed sensing devices that generate large amounts of data. These data must be stored, processed, and analyzed to extract valuable information [ 177 ]. BDM plays a significant role in this context and facilitates better resource management and decision-making [ 176 ]. Much research has focused on BDM mechanisms in IoT-based smart cities with different objectives, including improved monitoring and communication, real-time control, and improved quality attributes (such as reliability, throughput, energy conservation, accuracy, scalability, delay, and bandwidth usage). Therefore, this subsection examines the articles (14 articles; 22.22%) that have discussed the architectures or frameworks of BDM in IoT-based smart cities.

Jindal et al. [ 85 ] proposed a tensor-based big data processing technique for energy consumption in smart cities. This article aims to reduce the dimensionality of data and decrease the overall complexity. The proposed framework has two phases. The first phase is a 3-layer data gathering and processing architecture whose layers are data acquisition, transmission, and processing. In the second phase, the collected data are represented in tensor form, and SVM is used to identify the loads to manage demand-response services in smart cities. The technique reduces data storage by 38%.

ESTemd [ 105 ] is a distributed stream processing middleware framework for real-time analysis using big data techniques on Apache Kafka. The layers of this framework are the data ingestion layer, the data broker layer (source), the stream data processing engine and services, the data broker layer (sink), and the event hub. The data broker layer is responsible for data processing and transformation, with the support of multiple transport protocols. The third layer does stream processing and consists of the predictive data analytics model and Kafka CEP operators. This framework helps with performance improvement through data integration and distributed applications' interoperability.

CPSO [ 115 ] is a self-adaptive preprocessing approach for big data stream classification. This approach handles four mechanisms: sub-window processing; feature extraction; feature selection; and optimization of the window size and feature picking. CPSO uses clustering-based PSO for data stream mining, the sliding window technique for data segmentation, statistical feature extraction for variable partitioning, and correlation feature selection and information gain for feature selection. The proposed approach improves classification accuracy.

Rani and Chauhdary [ 72 ] proposed a novel approach for smart city applications based on BDA and a new protocol for mobile IoT. They presented a 5-layer architecture whose layers are data source, technology, data management, application, and utility programs. The data source layer collects, compresses, and filters data. The technology layer is responsible for communication between sensor nodes, edge nodes, and the base station. The management layer uses MapReduce, SQL, and HBase for analyzing, storing, and processing. The utility program layer uses WSN and IoT protocols to work with the other layers. Also, this article presented a new protocol that reduces energy consumption, increases throughput, and reduces delay and transmission time.

SCDAP [ 107 ] is the 3-layer BDA architecture for smart cities. The first layer is the platform that includes hardware clusters, the operating system, communication protocols, and other required computing nodes. The second layer is security. The last layer is the data processing layer that supports online and batch data processing. This layer has ten components: data acquisition; data preprocessing; online analytics; real-time analytics; batch data repository; batch data analytics; model management; model aggregation; smart application; and user interface. This architecture used Hadoop and Spark for data analysis. Also, this article presented a taxonomy of literature reviews based on six characteristics: focus, goal, organization, perspective, audience, and coverage.

Chilipirea et al. [ 80 ] proposed a data flow-based architecture for big data processing in smart cities. The architecture has seven steps: data sources; data normalization; data brokering; data storage; data analysis; data visualization; and decision support systems. This article used Extract, Transform, and Load (ETL) and Electronic Batchload Service (EBS) for normalizing the real-time and batch data. The data brokering step created the links between the collected data and the relevant context. This architecture used Hadoop for batch data processing and Storm for real-time data processing.

Gohar et al. [ 92 ] proposed a four-layer architecture for analyzing and storing data in the Internet of Small Things (IoST). The layers of this architecture are the small things layer, the infrastructure layer, the platform layer, and the application layer. The first layer collects data from LoRa devices through the LoRa gateway. The infrastructure layer provides connectivity to devices via the Internet. The platform layer is responsible for data preprocessing; it employs Max–Min normalization, the Kalman filter, the Round-Robin load balancing technique, the Least Slack Time (LST) scheduling algorithm, the divide-and-conquer approach for aggregation, and NoSQL databases for storage. In the last layer, data is visualized for decision-making. The architecture was implemented using Hadoop, Spark, and GraphX, and throughput increased as the data size grew.
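Two of the preprocessing techniques named above, Max–Min normalization and Kalman filtering, can be sketched as follows; the noise parameters and the synthetic sensor stream are assumptions for illustration, not values from the article.

```python
# Hedged sketch of Max-Min normalization followed by a constant-state Kalman
# smoother; q and r are assumed noise parameters, not values from the paper.
import numpy as np

def min_max(x: np.ndarray) -> np.ndarray:
    """Scale readings into [0, 1]."""
    return (x - x.min()) / (x.max() - x.min())

def kalman_1d(z: np.ndarray, q: float = 1e-3, r: float = 0.1) -> np.ndarray:
    """Smooth a noisy 1-D measurement sequence."""
    x_est, p = z[0], 1.0
    out = []
    for zk in z:
        p = p + q                          # predict
        k = p / (p + r)                    # Kalman gain
        x_est = x_est + k * (zk - x_est)   # update with the new measurement
        p = (1 - k) * p
        out.append(x_est)
    return np.array(out)

raw = 20 + np.random.normal(0, 0.5, 200)   # noisy sensor stream
clean = kalman_1d(min_max(raw))
print(clean[:5])
```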

Farmanbar and Rong [ 113 ] proposed an interactive cloud-based dashboard for online data visualization and a data analytics toolkit for smart city applications. The proposed architecture has three layers: the data layer, the application and analysis layer, and the presentation layer. The data layer is the core of the architecture and contains data acquisition units, data ingestion, data storage, and data access. This architecture uses Logstash for data ingestion, Elasticsearch for storage, and Kibana for data access and real-time monitoring. The platform has been tested on five datasets, covering transportation, electricity consumption, cargo e-bikes, parking vacancies, and energy. The results showed this architecture is robust and scalable and improves communication between users and urban service providers.

He et al. [ 116 ] proposed a big data architecture to achieve high Quality of Experience (QoE) performance in smart cities. This architecture has three planes: the data storage plane, the data processing plane, and the data application plane. This article used MongoDB and HDFS for data storage and Spark and a deep-learning-based greedy algorithm for data processing. The simulation results indicated that the proposed architecture's accuracy, precision, and recall are better than those of SVM and KNN.

Khan et al. [ 128 ] proposed an SDN-based 3-tier architecture, comprising data collection, data processing and management, and an application layer, for real-time big data processing in smart cities, with two intermediate levels that work on SDN principles. This architecture uses Spark and GraphX with Hadoop for offline and real-time data analysis and processing. Also, this article proposed an adaptive job scheduling mechanism for load balancing and achieving high performance. The results showed that the system's performance increases as the number of clusters and the processing time increase.

IoTDeM [ 73 ] is an IoT big-data-oriented multiple-edge-cloud architecture for MapReduce performance prediction with varying cluster scales. This architecture consists of three parts: multiple edge-cloud redirectors, an edge-cloud-based big data platform, and a centralized cloud-based big data platform. It uses historical job execution records and the Locally Weighted Linear Regression (LWLR) technique to predict jobs' execution times and Ceph for storage. Because of Ceph, there was no need to transfer data to newly added slave nodes. The article validated the accuracy of the proposed model using the TestDFSIO and Sort benchmark applications in a general implementation scenario based on Hadoop2 and Ceph, achieving an average relative error of less than 10%.

Ahab [ 112 ] is a generic, scalable, fault-tolerant, and cloud-based framework for online and offline big data processing. This framework has four components: the user API, repositories, messaging infrastructure, and stream processing. The API directs the published data streams from different sources. Ahab uses the component, stream, policy, and action repositories for storing data streams, management policies, and actions. Ahab uses distributed messaging for handling data streams, minimizing unnecessary network traffic, and it allows the components to choose an appropriate communication point freely. The Ahab architecture has two layers: the streaming layer and the service layer. The streaming layer is implemented as a lambda architecture and has three sub-layers for data stream processing: the batch layer, the speed layer, and the serving layer. HDFS and Apache Spark are used for data storage and stream processing. The service layer is responsible for analyzing, managing, and adapting components.

Mobi-Het [ 81 ] is a mobility-aware optimal resource allocation architecture for remote big data task execution in mobile cloud computing. This article uses the SMOOTH random mobility model to represent the free movement of mobile devices and estimate their speed and direction. Mobi-Het has three layers: mobile devices, cloudlets, and the master cloud. The mobile devices component has a decision-maker module that decides whether tasks should be executed remotely or locally. The master cloud component implements the resource allocation algorithm. The proposed architecture achieves low execution time, high execution reliability, and timeliness.

Hossain et al. [ 132 ] proposed a knowledge-driven framework that automatically selects suitable data mining and ML algorithms for a dynamic IoT smart city dataset. The system architecture has four units: extractDataKnowledge, extractGoalKnowledge, extractAlgoKnowledge, and matchKnowledge. The framework's inputs are three key factors: datasets, goals, and data mining and ML algorithms. This article discussed both supervised and unsupervised data mining. The results show that this framework reduces computational time and complexity and increases performance and flexibility while dynamically choosing a high-accuracy solution.

Table 14 shows the results of the analysis of the articles. This table shows the architecture or framework name, OS name, programming language, advantages, and disadvantages of each article. As can be seen, layered architecture is the most commonly used, appearing in 13 articles.

3.2.1.3 BDM architectural/framework for IoT-based smart home/building

BDM mechanisms (architectures/frameworks) in the IoT play a crucial role in the smart home/building, including processing data collected by home sensors; analyzing, classifying, monitoring, and managing energy consumption and saving; intelligently identifying user behavior patterns and home activities; and increasing safety and comfort at home [ 76 ]. This subsection presents a review of the articles (8 articles; 12.70%) that have discussed the architectures or frameworks of BDM in the IoT-based smart home/building.

Al-Ali et al. [ 68 ] proposed a smart home energy management architecture using IoT and BDA approaches. This architecture is divided into two sub-architectures: a hardware architecture and a software architecture. The hardware architecture includes sensors and actuators, high-end microcontrollers, and server blocks. The software architecture comprises a data acquisition module on the edge device, a middleware module, and a client application module. The first module monitors and collects data and transmits them to the middleware module. The second module uses several tools to provide different services, including facilitating communication between edge devices and the middleware, data storage, data analysis, and sending results to the requester. The third module develops the front-end mobile user interface using a cross-platform integrated development environment. The architecture was evaluated using a prototype. The results showed that the proposed architecture has high scalability, security, privacy, throughput, and speed.

Silva et al. [ 55 ] proposed a real-time BDA embedded architecture for the smart city with the RESTful web of things. This article integrated the web and smart control systems using a smart gateway system. The proposed architecture consists of four levels: data creation and collection; data processing and management; event and decision management; and application. The data processing and management level utilizes HDFS for primary data storage, MapReduce for processing, Hbase to speed up processing, and HIVE for data querying and management. The event and decision management level classifies events into two types, service events and resource events, based on the processed information. The application level remotely provides access to the smart city services and has three sub-layers: the departmental layer, the services layer, and the sub-services layer. The architecture achieves high performance and throughput, low processing time, and minimal energy consumption.

Khan et al. [ 57 ] proposed a scheduling algorithm, an IoT BDA architecture, and a real-time platform for managing sensors' energy consumption. This architecture has four steps: appliance discovery, sensor configuration and deployment, event management and scheduling, and information gathering and processing. Appliances are identified and classified in the first step based on user availability and usage time. The second step uses a Poisson distribution for sensor distribution in an IoT environment. In the third step, an appliance sleep-scheduling mechanism is presented for job scheduling. In the last step, the collected sensor data are directed to Hadoop, Spark, and GraphX for processing and analysis, with HDFS used for data storage. The proposed approach minimizes total execution time and energy consumption.

HEMS-IoT [ 76 ] is a 7-layer architecture based on big data and ML for home energy management. The layers of this architecture are the presentation layer, IoT services layer, security layer, management layer, communication layer, data layer, and device layer. The management layer uses the J48 ML algorithm and the Weka API for energy consumption reduction and user behavior pattern extraction. This layer also classifies the data and houses based on energy consumption using the C4.5 algorithm (of which J48 is the Weka implementation). The IoT services layer provides different REST-based web services. The security layer guarantees data confidentiality and has two components, namely authorization and authentication. This article uses RuleML and Apache Mahout to generate energy-saving recommendations.
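The classification idea behind this layer can be illustrated with a small decision-tree sketch; scikit-learn's CART-based tree stands in here for Weka's J48/C4.5, and the features, labels, and consumption threshold are synthetic assumptions.

```python
# Illustrative decision-tree classification of houses by energy consumption;
# synthetic data, CART used as a stand-in for J48/C4.5.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
# Features per house: [daily_kwh, occupants, avg_indoor_temp]
X = rng.normal(loc=[12.0, 3.0, 21.0], scale=[4.0, 1.0, 2.0], size=(300, 3))
y = (X[:, 0] > 12.0).astype(int)  # 1 = "high consumption" household (assumed rule)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)
tree = DecisionTreeClassifier(max_depth=3).fit(X_tr, y_tr)
print("accuracy:", tree.score(X_te, y_te))
```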

Yassine et al. [ 56 ] proposed a platform for IoT smart homes based on fog and cloud computing. The components of the proposed platform are smart home components, IoT management and integration services, fog computing nodes, and cloud systems. The smart home component is divided into three tiers: 1) the cyber-physical tier, which interacts with the outside world through the second tier; 2) the connectivity tier, which communicates with the smart home; and 3) the context-aware tier, which consists of user-defined rules and policies that create a privacy and security configuration. The IoT management and integration services component is in charge of providing interoperability, handling requests, authentication, and service registration. The fog computing nodes perform preprocessing, pattern mining, event detection, behavioral and predictive analytics, and visualization functions. The cloud system is responsible for storing historical data and performing historical data analytics.

Luo et al. [ 131 ] proposed a 4-layer ML-based predictive model for smart building energy demand. First, the sensing layer collects data and transfers them to the storage layer. The storage layer performs data cleaning and storage. The model's smart core is the analytics support layer, where an Artificial Neural Network (ANN) and k-means clustering are used to identify features in weather profile patterns. The service layer is an interface between the proposed model and the smart building management system. The proposed model improved accuracy and decreased the mean absolute percentage error.
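A minimal sketch of this two-stage idea, clustering weather profiles with k-means and feeding the result to a neural network regressor, is shown below; the data and hyperparameters are synthetic assumptions rather than the configuration used by Luo et al.

```python
# Cluster daily weather profiles, then predict energy demand with an ANN;
# all data and hyperparameters are synthetic assumptions.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(1)
weather = rng.normal(size=(365, 24))                           # hourly temperatures per day
demand = weather.mean(axis=1) * 3 + rng.normal(0, 0.5, 365)    # synthetic daily demand

clusters = KMeans(n_clusters=4, n_init=10, random_state=1).fit_predict(weather)
X = np.column_stack([weather, clusters])                       # cluster label as extra feature

model = MLPRegressor(hidden_layer_sizes=(32,), max_iter=2000, random_state=1)
model.fit(X[:300], demand[:300])
print("test R^2:", model.score(X[300:], demand[300:]))
```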

Bashir et al. [ 110 ] proposed an Integrated Big Data Management and Analytics (IBDMA) framework for smart buildings. The reference architecture and the metamodel are the two phases of this framework. The reference architecture has eight layers: data monitoring, sourcing, ingestion, storage, analysis, visualization, decision-making, and action. People, processes, technology, information, and facility are the components of the metamodel phase. The core component of the metamodel is people (IoT policymakers, developers, and residents of intelligent buildings). The process component includes data monitoring, sourcing, ingesting, storage, decision-making, analytics, and action/control. The technology component consists of the tools and software packages used to implement IBDMA, such as Apache Flume for data ingestion, HDFS for data storage, Apache Spark for data analysis, Microsoft Power BI for static data visualization, and Elasticsearch and Kibana for near-real-time data visualization. The information component manages disasters and controls various facilities based on the results obtained using the technology stack. The last component is the facility, which improves comfort, safety, and living conditions for the building's occupants.

Table 15 shows the results of the analysis of the articles. This table shows each article's architecture or framework name, OS name, programming language, advantages, and disadvantages. As can be seen, layered architecture is the most commonly used, appearing in five articles.

3.2.1.4 BDM architectural/framework for IoT-based intelligent transport

Safety, reliability, fault diagnosis, data transmission, and early warning in intelligent transport systems are critical for decision-making [ 178 ]. Intelligent transport systems use digital technologies, sensor networks, ML, and BDA mechanisms to address challenges in areas such as accident prevention, road safety, pollution reduction, automated driving, traffic control, intelligent navigation, and parking systems [ 179 ]. This subsection presents a review of the articles (2 articles; 3.17%) that have discussed the architectures or frameworks of BDM in IoT-based intelligent transport.

SMART TSS [ 129 ] is a modular BDA architecture for intelligent transportation systems. This architecture has four units: a big data acquisition and preprocessing unit, a big data processing unit, a big data analytics unit, and a data visualization unit. The big data processing unit stores offline data in the cloud system for future analysis, while online data is sent to the extraction and filtration unit for load balancing across NoSQL databases. The big data analytics unit uses the MapReduce mechanism for analysis. This article uses Hadoop, Spark, and GraphX for big data processing and analysis. The throughput of the proposed system increases with data size, but its accuracy and security are low.

Babar and Arif [ 89 ] proposed a real-time IoT big data analytics architecture for the smart transportation system. This architecture has three phases: big data organization and management, big data processing and analysis, and big data service management. The first phase performs data preprocessing, including big data detection, logging, integration, reduction, transformation, and cleaning; it uses the divide-and-conquer technique for data aggregation, the Min–Max method for data transformation, and the Kalman filter technique for data cleaning. The second phase uses Hadoop for big data processing, HDFS, Hive, and Hbase for data storage, and Spark for data stream analysis. This phase performs load balancing, which increases throughput, minimizes processor use, and reduces response time. The third phase is responsible for intelligent decision-making and event management.

Table 16 shows the results of the analysis of the articles. This table shows the architecture or framework name, OS name, programming language, advantages, and disadvantages of each article. As can be seen, layered architecture is the most commonly used, appearing in both articles.

3.2.1.5 BDM architectural/framework for IoT-based traffic control and energy

Two reviewed articles discussed the architectures or frameworks of BDM in IoT-based traffic control and energy and used ML for this purpose. ML4IoT [ 108 ] is a container-based ML framework for IoT data analytics and for coordinating ML workflows. This framework aims to define and automate the execution of ML workflows and uses several types of ML algorithms. The ML4IoT framework has two layers: ML4IoT data management and the ML4IoT core. The ML4IoT core layer trains and deploys ML models and consists of five components: a workflow designer, a workflow orchestrator, a workflow scheduler, container-based components, and a distributed data processing engine. ML4IoT data management is responsible for data ingestion and storage and has three sub-components: a messaging system, a distributed file system, and a NoSQL database. The results of this article reveal that the framework has high elasticity, scalability, robustness, and performance. Furthermore, Chhabra et al. [ 111 ] proposed a scalable and flexible cyber-forensics framework for IoT big data analytics with high precision and sensitivity. This framework consists of four modules: the data collector and information generator; feature analytics and extraction; designing ML models; and analyzing models on various efficiency metrics. This article used Google's programming model, MapReduce, as the core for traffic translation, extraction, and analysis of dynamic traffic features. The authors also presented a comparative study of globally accepted ML models for peer-to-peer malware analysis in a mocked real-time setting.
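Because MapReduce underlies both frameworks, the programming model is worth sketching in plain Python: a map function emits key-value pairs per record, a shuffle groups them by key, and a reduce aggregates each group; the traffic-record fields are illustrative assumptions.

```python
# Conceptual MapReduce sketch over traffic records (illustrative fields only).
from collections import defaultdict
from functools import reduce

records = [
    {"src": "10.0.0.1", "bytes": 1200},
    {"src": "10.0.0.2", "bytes": 300},
    {"src": "10.0.0.1", "bytes": 800},
]

def map_phase(rec):
    yield rec["src"], rec["bytes"]              # emit (source host, byte count)

def shuffle(pairs):
    groups = defaultdict(list)
    for k, v in pairs:
        groups[k].append(v)                     # group values by key
    return groups

def reduce_phase(values):
    return reduce(lambda a, b: a + b, values)   # total bytes per source

pairs = [p for rec in records for p in map_phase(rec)]
totals = {k: reduce_phase(v) for k, v in shuffle(pairs).items()}
print(totals)  # {'10.0.0.1': 2000, '10.0.0.2': 300}
```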

Table 17 shows the results of the analysis of the articles. This table shows the architecture or framework name, OS name, programming language, advantages, and disadvantages for each article. As can be seen, component-based architecture is the most commonly used, appearing in both articles.

3.2.1.6 BDM architectural/framework for IoT-based urban planning

BDM is responsible for the offline and online aggregation, management, processing, and analysis of large amounts of big data in urbanization, with the aims of improving the quality, planning, design, sustainability, living standards, dynamic organization, and mobility of urban space and structure, and of maintaining urban services [ 180 , 181 , 182 ]. Rathore et al. [ 51 ] proposed a 4-layer IoT-based BDA architecture for smart city development and urban planning. The first layer generates, aggregates, registers, and filters data from various IoT sources. The second layer creates communication between the sensors and the relay node using communication technologies. The third layer uses HDFS, Hbase, Hive, and SQL for storage; MapReduce for offline analysis; and Spark, VoltDB, and Storm for real-time analysis. The last layer is responsible for presenting the analysis results for intelligent and fast decision-making. The results show that the architecture provides efficient outcomes even on large IoT datasets: throughput increased as data size grew, and processing time decreased.

Silva et al. [ 63 ] proposed a reliable 3-layer BDA-embedded architecture for urban planning. The layers of this architecture are data aggregation, data management, and service management. The purpose of this article is to increase throughput and minimize processing time. The real-time data management layer is the main layer and performs data filtration, analysis, processing, and storage; it uses data filtration and min–max normalization techniques to refine the energy data. This architecture used MapReduce for offline data processing, Spark for online data processing, and Hbase for storage.

Table 18 shows the results of the analysis of the articles. This table shows the architecture or framework name, OS name, programming language, advantages, and disadvantages for each article. As can be seen, layered architecture is the most commonly used, appearing in both articles.

3.2.1.7 BDM architectural/framework for other IoT-based applications

This subsection presents a review of the articles (14 articles) that have discussed the architectures or frameworks of BDM in other IoT-based applications. These applications include smart IoT systems (4 articles), smart flood management (1 article), smart farms (2 articles), disaster management (1 article), laundry (1 article), smart pipelines (1 article), network traffic (1 article), digital manufacturing (1 article), and smart factories (2 articles).

Al-Osta et al. [ 121 ] proposed an event-driven and semantic-rule-based approach for IoT data processing. The main levels of this system are the sensor, edge, and cloud levels. This article has two purposes: reducing the required resources and reducing the volume of data before transfer to the cloud for storage. The collected data is first aggregated, filtered, and classified at the gateway level, which saves bandwidth and minimizes network traffic. This approach uses semantic rules for data filtering and employs a complex event processing module to analyze input events and determine processing priority.

Wang et al. [ 148 ] proposed a 3-layer edge-based architecture and a dynamic switching algorithm for IoT big data analytics. The layers of this architecture are the cloud layer, edge layer, and IoT layer. The edge layer performed some functions, including identifying IoT applications, classifying them, and sending classification results to the cloud layer. The LibSVM method is used for IoT application identification and classification based on system status and requirements. Also, this article presented a new algorithm, namely the dynamic switching algorithm, for task offloading from cloud to edge based on the delay and network conditions. This algorithm performed task offloading based on classification results. The results showed the proposed architecture reduced delay, processing time, and energy consumption.

IODML-BDA [ 124 ] is a model for Intelligent Outlier Detection in Apache Spark using ML-powered BDA for mobile edge computing. This model performs four steps: data preprocessing, outlier detection, feature selection, and classification. This article employs an Adaptive Synthetic Sampling (ADASYN)-based technique for outlier detection, the Oppositional Swallow Swarm Optimization (OSSO) for feature selection, and a Long Short-Term Memory (LSTM) model for classification. This model has high performance and accuracy in BDA.

Kumar et al. [ 3 ] presented a novel 4-layer architecture for IoT big data management in cloud computing networks and a collaborative filtering recommender system. The information layer collects data and transmits them to the second layer. The transport layer uses GPRS/CDMA, wireless RFID, or Ethernet channels for communication and for uploading data to the data mining layer. The data mining layer utilizes ML methods for data analysis. The application layer is responsible for data visualization based on the information extracted by the data mining layer. The article also proposed a collaborative filtering algorithm that improves prediction accuracy using a time-weighted decay function and an asymmetrical influence degree. The results demonstrated that this architecture achieves high accuracy.
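The time-weighted decay idea can be sketched as follows: older ratings are down-weighted before item-item similarities are computed. The decay constant and rating matrix are synthetic assumptions and do not reproduce the exact formulation of Kumar et al.

```python
# Time-decayed collaborative filtering sketch; data and decay constant assumed.
import numpy as np

ratings = np.array([[5, 3, 0], [4, 0, 2], [0, 4, 5]], dtype=float)   # users x items
age_days = np.array([[10, 200, 0], [5, 0, 30], [0, 15, 400]], dtype=float)
decay = np.exp(-age_days / 100.0)        # time-weighted decay function
weighted = ratings * decay               # recent ratings keep more influence

def cosine(a, b):
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom else 0.0

# Item-item similarity computed on the decay-weighted rating matrix.
n_items = weighted.shape[1]
sim = np.array([[cosine(weighted[:, i], weighted[:, j])
                 for j in range(n_items)] for i in range(n_items)])
print(np.round(sim, 2))
```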

Sood et al. [ 75 ] proposed a 4-layer flood forecasting and monitoring architecture based on the convergence of IoT, High-Performance Computing (HPC), and big data. The IoT layer is responsible for IoT device installation and data collection. The fog computing layer reduces the latency of application execution when predicting floods in real time. The data analysis layer receives, stores, and analyzes the collected data; it uses Singular Value Decomposition (SVD) for data reduction and the k-means clustering algorithm to estimate the flood situation and rating. Also, the Holt–Winters forecasting method is utilized to forecast floods. The last layer is the presentation layer, which generates information for decision-making. The results showed the proposed architecture reduced latency, complexity, completion time, and energy consumption.
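A minimal Holt–Winters forecasting sketch on a synthetic seasonal water-level series is shown below; the additive trend/seasonal settings and the period are assumptions for illustration, not parameters from Sood et al.

```python
# Holt-Winters forecasting sketch on a synthetic seasonal water-level series.
import numpy as np
from statsmodels.tsa.holtwinters import ExponentialSmoothing

rng = np.random.default_rng(2)
t = np.arange(120)
water_level = 5 + 0.01 * t + 2 * np.sin(2 * np.pi * t / 12) + rng.normal(0, 0.3, 120)

model = ExponentialSmoothing(water_level, trend="add", seasonal="add",
                             seasonal_periods=12).fit()
print(model.forecast(6))  # predicted levels for the next six periods
```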

Muangprathub et al. [ 79 ] proposed a WSN system for agricultural data analysis based on the IoT for crop watering. This system consists of three components. The hardware component collects data and sends them to the web application for real-time analysis; this analysis involves data preprocessing, data reduction using the equal-width histogram technique, data modeling/discovery using the association rule mining technique, and solution analysis. The web application manages real-time information. The mobile application component controls crop watering remotely. The architecture of this system has three layers: the environmental data acquisition layer, the data and communication layer, and the application layer. This system can help reduce costs and increase agricultural productivity.

Al-Qurabat et al. [ 65 ] proposed a two-level system for data traffic management in smart agriculture based on compression and Minimum Description Length (MDL) techniques. The first level is the sensor node level, which monitors environmental features and applies a lightweight lossless compression algorithm based on Differential Encoding (DE) and Huffman coding. The second level is the edge gateway level, which is responsible for processing, analyzing, filtering, storing, and sending the data to the cloud and which minimizes the first-level dataset using MDL and hierarchical clustering. The results demonstrated that the suggested method has a high compression ratio and accuracy and reduces data volume and energy consumption.
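The differential-encoding step of the first level can be sketched in a few lines: only the change between consecutive readings is kept, shrinking the symbol range before entropy coding. The Huffman stage is omitted here, and the readings are synthetic assumptions.

```python
# Lossless differential (delta) encoding sketch for sensor readings.
def delta_encode(readings):
    deltas = [readings[0]]                     # keep the first absolute value
    for prev, cur in zip(readings, readings[1:]):
        deltas.append(cur - prev)              # store only the change
    return deltas

def delta_decode(deltas):
    out = [deltas[0]]
    for d in deltas[1:]:
        out.append(out[-1] + d)                # rebuild the original sequence
    return out

temps = [21, 21, 22, 22, 23, 23, 23, 22]
encoded = delta_encode(temps)                  # [21, 0, 1, 0, 1, 0, 0, -1]
assert delta_decode(encoded) == temps          # lossless round trip
print(encoded)
```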

Shah et al. [ 53 ] proposed the 5-layer architecture for IoT BDA in a disaster-resilient smart city. The purpose of this architecture is to store, mine, and process big data from IoT devices. This architecture's layers include data resource, transmission, aggregation, analytics and management, and application and support services. This architecture used Apache Flume and Apache Sqoop for unstructured and structured data collection; Hadoop and Spark for real-time and offline data analysis; and HDFS for data storage. The proposed implementation model comprises data harvesting, data aggregation, data preprocessing, and a big data analytics and service platform. This article used a variety of datasets for validation and evaluation based on processing time and throughput.

Liu et al. [ 14 ] proposed a cloud laundry business model based on the IoT and BDA. This model used big data analytics, intelligent logistics management, and ML techniques for big data analytics. This model minimized human interference and increased system efficiency.

Tang et al. [ 7 ] proposed the 4-layer distributed fog computing-based architecture for big data analysis in smart cities. The layers of this architecture are the data center on the cloud layer, intermediate computing nodes layer, edge devices layer, and sensing networks on the critical infrastructure layer. This architecture reduces the communication bandwidth and data size. First, data was collected from the fiber sensor network and transmitted to the edge computing nodes layer. This layer performed two tasks: identifying potential threat patterns and feature extraction using supervised and non-supervised ML algorithms. The intermediate computing nodes layer used the hidden Markov model for big data analysis and hazardous event detection. The results showed the proposed architecture reduced the service response time and the number of service requests submitted to the cloud.

Kotenko et al. [ 136 ] introduced a framework for security monitoring of mobile IoT based on big data processing and ML. This framework consists of three layers: 1) extraction and decomposition of a data set using a heuristic approach; 2) compression of feature vectors using Principal Component Analysis (PCA); and 3) learning and classification using SVM, the k-nearest neighbors method, Gaussian naive Bayes, an artificial neural network, and a decision tree. This framework has high performance and accuracy in attack detection.
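The last two layers of this kind of pipeline can be illustrated with a scikit-learn sketch that compresses feature vectors with PCA and then compares several of the classifiers named above; the dataset, labels, and component count are synthetic assumptions.

```python
# PCA compression followed by several classifiers on synthetic traffic features.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(3)
X = rng.normal(size=(500, 40))                  # traffic feature vectors
y = (X[:, :5].sum(axis=1) > 0).astype(int)      # 1 = attack, 0 = benign (synthetic rule)

for name, clf in [("SVM", SVC()), ("kNN", KNeighborsClassifier()),
                  ("NB", GaussianNB()), ("Tree", DecisionTreeClassifier())]:
    pipe = make_pipeline(PCA(n_components=10), clf)
    score = cross_val_score(pipe, X, y, cv=5).mean()
    print(f"{name}: {score:.3f}")
```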

Bi et al. [ 157 ] proposed a new enterprise architecture that integrates IoT and BDA for managing the complexity and stability of the digital manufacturing system. This article used Shannon entropy to measure the complexity of a system based on the number of events and the probabilities of event occurrences. This architecture performs three processes: data acquisition, management, and utilization. The results demonstrated that this architecture decreases system complexity and increases flexibility, resilience, responsiveness, agility, and adaptability.
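The entropy-based complexity measure can be sketched directly from its definition, H = -Σ p_i log2 p_i over the event-occurrence probabilities; the event logs below are illustrative assumptions, not data from the article.

```python
# Shannon entropy of event occurrences as a simple complexity indicator.
import math
from collections import Counter

def shannon_entropy(events):
    counts = Counter(events)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

stable_line = ["ok"] * 95 + ["fault"] * 5
erratic_line = ["ok", "fault", "retool", "idle", "ok"] * 20
print(shannon_entropy(stable_line))   # low complexity: one event dominates
print(shannon_entropy(erratic_line))  # higher complexity: events are spread out
```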

Yu et al. [ 118 ] presented a BDA and IoT-based framework for health state monitoring in a smart factory. This framework consists of four phases. The data ingestion phase is responsible for extracting different data types, managing data collection, data security, data transformation using a secure file transfer protocol, and data storage issues. The big data management phase uses optimized HDFS for data storage on the cloud nodes and processing using Apache Spark. The data preparation phase performs sensor selection and noise detection processing to produce high-quality data. This phase uses the high-variance feature removal method for feature selection and a novel method for noise detection. The predictive modeling phase has four stages: PCA model training, streaming anomaly detection, contribution analysis, and alarm sequence analysis.

Kahveci et al. [ 183 ] proposed a secure, interoperable, resilient, scalable, and real-time end-to-end BDA platform for IoT-based smart factories. The platform architecture has five layers and several components that perform data collection, data integration, data storage, data analytics, and data visualization. The layers of the architecture are the control and sensing layer, the data collection layer, the data integration layer, the data storage and analytics layer, and the data presentation layer. All sensing and control activities are performed in the first layer. The data collection layer communicates with the first layer through a multi-node client/server architecture. The data integration layer uses a RESTful application program interface to transfer the collected data to the data storage layer, which uses InfluxDB for industrial metrics and events. Using this architecture, production line performance is improved, bottlenecks are identified, product quality is improved, and production costs are reduced.

Table 19 shows the results of the analysis of the articles. This table shows the architecture or framework name, OS name, programming language, advantages, and disadvantages for each article. As can be seen, layered architecture is the most commonly used, appearing in 14 articles.

3.2.2 BDM architectural/framework for IoT paradigms

Another category presented in this article is BDM architectures and frameworks in two important IoT paradigms, i.e., the Social Internet of Things (SIoT) and the Multiple Internet of Things (MIoT). SIoT is the integration of the IoT with social networking, which leads to improved scalability in information and service discovery, trustworthy relationships, security, performance, and high network navigability [ 91 , 184 ]. The SIoT establishes relationships and interactions between human-to-human, human-to-object, and object-to-object social networks in which humans are considered intellectual and relational objects [ 185 , 186 ]. The types of relationships between smart, complex, and social objects in SIoT are parental object relationships, co-location object relationships, co-work object relationships, ownership object relationships, social object relationships, stranger object relationships, guest object relationships, sibling object relationships, and service object relationships [ 187 , 188 ]. A MIoT is a collection of connected things linked by different kinds of relationships and objects.

In contrast to SIoT, the number of relationship types in MIoT is not predefined; therefore, SIoT is a specific case of MIoT in which the number of possible relationship types is limited [ 187 ]. The MIoT paradigm has advantages over the IoT and SIoT. Through MIoT, the IoT can be divided into multiple networks of interconnected smart objects. MIoT can handle situations where the same objects behave differently in different networks, and it allows objects from various networks to communicate without being directly connected [ 189 ]. Social objects in the SIoT and MIoT can perform tasks including physical condition detection, data collection, information exchange, big data processing and analysis, and visualization for decision-making, predicting human behavior, and increasing efficiency and scalability. Due to the heterogeneous nature of communication and social networks, which generate high-volume, multi-source, dynamic, and sparse data from SIoT and MIoT objects, BDA is a vital issue in these paradigms. BDA in SIoT and MIoT requires a large amount of memory, processing power, and bandwidth to store, define, process, and predict and to assist humans within a limited time [ 64 , 91 ]. Different researchers have examined BDA in these paradigms in various ways.

Paul et al. [ 91 ] proposed a system called SmartBuddy that performs BDA on SIoT-based smart city data to define real-time human dynamics. This architecture has three domains: the object domain, the SIoT server domain, and the application domain. The object domain collects the data and sends them to the SIoT server for balancing, storing, querying, processing, defining, and predicting human behavior. The application domain has four main components: security, the cloud server, result storage devices, and the data server; it presents the results produced by the SIoT server domain. This article uses MapReduce programming for offline data analysis and Apache Spark for real-time analysis. SmartBuddy has high throughput and applicability.

HABC [ 52 ] is a Hadoop-based architecture for social IoT big data feature selection and analysis. This architecture has four layers: data collection, communication, feature selection and processing, and service. The data collection layer collects, registers, and filters data. The communication layer provides end-to-end connectivity to various devices and uses the Kalman filter to remove noise. The feature selection and processing layer uses MapReduce for data analysis and HDFS, Hbase, and Hive for data manipulation and storage; the Artificial Bee Colony (ABC) algorithm is used for feature selection. The results indicate that the architecture increases throughput and accuracy and is more scalable.

Lakshmanaprabu et al. [ 64 ] proposed a hierarchical framework for feature extraction in SIoT big data using the MapReduce framework and a supervised classifier model. This framework has five steps: SIoT data collection, filtering, database reduction, feature selection, and classification. This article used the Gabor filter to reduce the noisy data, Hadoop MapReduce for database reduction, Elephant Herd Optimization (EHO) for feature selection, and a linear-kernel SVM-based classifier for data classification. The results showed the proposed architecture achieves high accuracy, specificity, sensitivity, and throughput.

The socio-cyber network [ 66 ] is a 4-layer architecture that integrates the social network with the technical network for analyzing human behavior using big data. This architecture uses the user's geolocation information to establish friendships and graph theory to examine the trust index. The data generation layer is responsible for data collection, aggregation, registration, and filtration. The communication layer provides end-to-end connectivity to various devices; it creates a graph of the data, and when new data are added to the system, this graph is updated. The data storage and processing layer performs load balancing and graph processing, using MapReduce for data processing, the Spark GraphX tool for real-time analysis, and HDFS for data storage. This article uses the Knowledge Pyramid for knowledge extraction. The service layer shows the results to users.

Shaji et al. [ 120 ] presented a 5-phase approach for big data classification in SIoT. The phases of this approach are data acquisition, data filtering, reduction, feature selection, and classification. This article uses an adaptive Savitzky–Golay filter for filtering and eliminating noisy data; the Hadoop MapReduce framework for data reduction; a modified Relief technique for optimal feature selection; and a deep neural network-based marine predator algorithm for classification. The proposed approach achieves high accuracy, precision, specificity, sensitivity, and throughput, with low energy consumption.
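The smoothing step can be illustrated with SciPy's standard Savitzky–Golay filter, which stands in here for the adaptive variant used in the article; the window length, polynomial order, and signal are assumptions for demonstration.

```python
# Savitzky-Golay smoothing of a noisy synthetic sensor signal.
import numpy as np
from scipy.signal import savgol_filter

t = np.linspace(0, 4 * np.pi, 400)
noisy = np.sin(t) + np.random.normal(0, 0.2, t.size)
smoothed = savgol_filter(noisy, window_length=31, polyorder=3)
print(float(np.abs(smoothed - np.sin(t)).mean()))  # residual error after smoothing
```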

Floris et al. [ 67 ] proposed a 4-layer architecture based on SIoT to deploy a full-stack smart parking solution. The layers of this architecture are the hardware layer, virtualization layer, aggregation layer, and application layer. The hardware layer collects data and consists of a vehicle detection board, a Bluetooth beacon, a data transmission board, and a concentrator. The SIoT paradigm is implemented in the virtualization layer using device virtualization. ML algorithms are implemented in the aggregation layer for data aggregation and data processing. The application layer includes the management platform that supports the control dashboard for smart parking management and the Android app for citizens.

Cauteruccio et al. [ 166 ] presented a framework for anomaly detection and classification in MIoT scenarios. This framework investigated two problems: analyzing the effects of anomalies on the MIoT and detecting the source of an anomaly. The anomalies in MIoT are divided into three categories: presence versus success anomalies, hard versus soft anomalies, and contact versus content anomalies.

Lo Giudice et al. [ 189 ] proposed a definition of a thing's profile and of topic-guided virtual IoTs. The profile of a thing has two components: a content-based component (past behavior) and a collaborative filtering component (the principal characteristics of the things it has previously interacted with the most). This article uses supervised and unsupervised approaches to build topic-guided virtual IoTs in a MIoT scenario. Table 20 shows the results of the analysis of the articles. The architecture or framework name, OS name, programming language, advantages, and disadvantages are shown for each article in this table. As can be seen, layered architecture is the most commonly used, appearing in five articles.

3.3 Categories based on quality attributes

Systems generally have two kinds of attributes: functional attributes and non-functional (quality) attributes. This section considers the quality attributes of the selected articles. Quality attributes indicate the system's characteristics, operating conditions, and constraints. There are different software quality models, such as McCall [ 190 ], Boehm [ 191 ], ISO/IEC 9126, and FURPS [ 192 ]. As far as we know, no systematic article has completely categorized articles based on quality attributes. Therefore, this paper categorizes the selected articles based on the 18 quality attributes presented in Table 21 . In this table, the first column shows the names of these 18 quality attributes. The reviewed articles used these quality attributes to characterize their proposed approaches, architectures, and frameworks, to analyze quality and performance, and to compare with other works. Performance has been analyzed in different articles based on different criteria; the reviewed articles utilized 12 quality attributes for performance analysis: load balancing, energy conservation, network lifetime, processing/execution time, response time, delay, CPU usage, memory usage, bandwidth usage, throughput, latency, and concurrency. In Table 21 , ↓ indicates a reduction of that quality attribute and ↑ indicates an increase. The second column in this table shows the articles that have used these attributes. Performance, efficiency, accuracy, and scalability are the most frequently addressed quality attributes, with 79, 62, 58, and 47 articles, respectively. From another point of view, the reference model of standard software quality attributes, i.e., ISO 25010, has been used to classify the articles based on quality attributes. Table 22 shows the articles' classification according to this standard. In the following, some quality attributes and their importance are defined.

Performance: Performance refers to the ability of BDM techniques in the IoT to provide results and services with high load balancing, energy conservation, throughput, and concurrency and with low processing/execution time, delay, CPU/memory/bandwidth usage, and latency.

Feasibility: Feasibility refers to the ability to perform successfully or study the current mode of operation, evaluate alternatives, and develop BDM techniques in the IoT.

Scalability: Scalability refers to the ability of BDM techniques in the IoT to exploit increasing computing resources effectively to maintain service quality when the real data volumes increase. BDM techniques in IoT must be scalable in performance and data storage. Some methods and advanced systems are used to improve the scalability of big data analysis, like parallel implementation, HPC systems, and clouds [ 193 ].

Accuracy: Accuracy refers to the ability to describe data and represent a real-world object or event correctly [ 194 ]. In the reviewed articles, various definitions of accuracy are provided, including clustering accuracy, classification accuracy, the accuracy of features selecting/extracting, and the accuracy of the prediction model. Each of these cases is evaluated in different ways.

Efficiency: Efficiency refers to the ability of BDM techniques in the IoT to deliver results with minimal energy consumption and response time and with high throughput, accuracy, and performance.

Reliability: Reliability refers to the ability of BDM techniques in the IoT to apply the specified functions under specified conditions and within the expected duration.

Availability: The main goal of many researchers is the availability of information and their analysis from heterogeneous data sources. Availability is one of the components of service trust and is part of reliability.

Interoperability: Interoperability refers to the ability to interconnect and communicate among smart objects, heterogeneous IoT devices, and different operating systems. Low-cost device interoperability is a vital issue in IoT [ 53 , 54 , 195 ].

Flexibility: Flexibility refers to the capacity of BDM techniques in the IoT to be adapted for different environments and situations to face external changes [ 196 ].

Robustness: Robustness refers to a stable BDM system in the IoT that can function despite erroneous, exceptional, or unexpected inputs and unexpected events.

3.4 Big data analytics types in IoT

There are different types of analytics. This study uses Gartner’s classification, Footnote 2 which includes four types of analysis: descriptive analysis (“What happened?”), diagnostic analysis (“Why did it happen?”), predictive analysis (“What could happen?”), and prescriptive analysis (“What should we do?”). In descriptive analytics, historical business data is analyzed to describe what happened in the past. Diagnostic analytics investigates and identifies the causes of trends and why they occurred. The goal of predictive analytics is to forecast the future using a variety of statistical and ML techniques. Prescriptive analytics proposes the best action to take to accomplish a business’s objective, using the data collected from descriptive and predictive analytics for decision-making based on future situations [ 197 ].

This paper investigates the applied methods for data analysis and categorizes them based on the type of analysis these methods provide. Organizations need statistics, AI, deep learning, data mining, prediction mechanisms, etc., for BDA and for evaluating the data [ 198 ]. The reviewed articles used ML algorithms to perform various analyses in the steps of BDA. ML algorithms are appropriate approaches and tools for BDA; decision-making; extracting meaningful, precise, and valuable information; and detecting hidden patterns in big datasets [ 199 , 200 ]. Utilizing ML algorithms in BDA has advantages such as improving and optimizing BDM processes; heterogeneous big data analysis; sustainability; fault detection, prediction, and prevention; accurate and reliable real-time processing; resource management and reduction; and increased quality prediction, visual inspection, and productivity in IoT applications [ 83 , 201 ]. These algorithms are divided into four types: supervised, semi-supervised, unsupervised, and reinforcement ML algorithms [ 53 , 202 ]. Table 23 shows the categorization of articles based on BDA types. The most common techniques that the selected articles use for BDM in the IoT include classification (51 articles), simulation (38 articles), optimization (30 articles), and clustering (25 articles).

The reason classification algorithms are used most is that they help to categorize unstructured and high-volume data, making BDM in the IoT faster and more efficient. Before classification begins, the inputs of the classification algorithm must be optimized. Data reduction strategies extract the optimal and required data from a large amount of data; these strategies include dimensionality reduction, numerosity reduction, and data compression. Some reviewed articles used PCA to standardize data, reduce redundancy and dimensionality, reduce cost and processing time, and preserve the original information [ 69 , 114 , 118 , 135 , 136 ]. Also, the authors in [ 160 ] used the fuzzy C-means algorithm to reduce the amount of data. Feature selection methods improve classification accuracy and reduce the number of features in BDA. The data collected from IoT applications and monitoring systems are usually anomalous, and it is difficult to distinguish between the original data and the anomaly [ 201 ]. Anomalies and outliers reduce the accuracy of classification and prediction models. For instance, NRDD-DBSCAN [ 114 ], DBSCAN-based outlier detection [ 83 ], GA, and the One-Class Support Tucker Machine (OCSTuM) [ 122 , 124 ] are some of the highly robust, high-performance, and noise-resistant anomaly detection methods presented in the reviewed articles.

In supervised classification, SVM is the most common classification-based method (10 articles) for BDM in the IoT. SVM is a non-parametric, memory-efficient, error-reducing classification method that performs well in theoretical analysis and real-world applications. It can model non-linear, complex, real-world problems in a high-dimensional feature space [ 2 , 69 , 203 ]. However, SVM is difficult to interpret, has a high computational cost, and is not scalable [ 204 ]. In unsupervised classification, the k-means clustering algorithm is the most common strategy (6 articles). The standard k-means clustering algorithm is a simple partitioning method that works well for small and structured datasets, but it is sensitive to the number of clusters, the initial input, and noisy data, and it must be modified to be used in BDA. Some research focuses on MapReduce/Spark implementations of traditional k-means clustering that improve accuracy and reduce time complexity [ 205 ]. Also, articles used the k-means clustering algorithm for flood prediction [ 75 ], security monitoring [ 136 ], energy management and improved prediction accuracy [ 56 , 131 ], and data access and resource utilization [ 144 ] in the IoT. Association rules are an unsupervised learning approach used to discover interesting, hidden relationships and correlations between variables and objects in large databases and for data modeling in the IoT [ 79 ]. Association rule mining uses various algorithms to identify frequent itemsets, such as the Apriori algorithm, the FP-growth algorithm, and the maximal frequent itemset algorithm [ 79 , 106 ]. Neural networks (NN) perform big data processing and analysis efficiently. NNs have a self-learning ability and play a significant role in BDA in the IoT; they are used for classification, big data mining, hidden pattern recognition, correlation recognition in raw big data, and decision-making in IoT applications. There are several kinds of neural network algorithms, including LSTM [ 108 ], radial basis function networks [ 69 ], deep NNs [ 101 , 162 ], convolutional NNs [ 163 ], etc.
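As a concrete reference for the support and confidence measures that Apriori-style algorithms compute, the following minimal sketch counts frequent pairs in a handful of sensor-event "transactions"; the transactions and thresholds are illustrative assumptions rather than data from any reviewed article.

```python
# Minimal support/confidence calculation behind association-rule mining.
from itertools import combinations

transactions = [
    {"soil_dry", "high_temp", "pump_on"},
    {"soil_dry", "pump_on"},
    {"high_temp", "fan_on"},
    {"soil_dry", "high_temp", "pump_on"},
]

def support(itemset):
    """Fraction of transactions containing every item in the itemset."""
    return sum(itemset <= t for t in transactions) / len(transactions)

# Frequent pairs (support >= 0.5), then rule confidence for X -> Y.
items = set().union(*transactions)
for x, y in combinations(sorted(items), 2):
    pair = {x, y}
    if support(pair) >= 0.5:
        conf = support(pair) / support({x})
        print(f"{x} -> {y}: support={support(pair):.2f}, confidence={conf:.2f}")
```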

Deep learning is a modern machine learning approach that employs supervised or unsupervised methods to learn and extract multi-level, high-level, and hierarchical features for big data classification tasks and pattern recognition [ 163 , 206 ]. Deep learning is a BDA tool that can speed up big data decision-making and feature extraction, improve the QoE level of the extracted information, and address security issues, data dimensionality, and the processing of unlabeled and uncategorized big data in IoT applications [ 116 , 207 ]. In the reviewed articles, deep learning methods are used for human activity recognition [ 87 ], flood detection [ 130 ], smart cities [ 116 ], and feature learning on big data in the IoT [ 163 ]. Optimization refers to selecting the best solution from a set of alternatives by minimizing or maximizing a specified objective function [ 208 ]. Bio-inspired algorithms are stochastic search techniques used by many researchers to solve optimization problems in BDM processes in the IoT, including data ingestion, processing, analytics, and virtualization [ 209 ]. The features of these algorithms are good applicability, simplicity, robustness, flexibility, self-organization, and the possibility of dealing with real-world problems [ 210 ]. Various articles categorize these algorithms differently; for instance, in [ 211 ] they are grouped into six categories: local search-based and global search-based; single-solution-based and population-based; memory-based and memoryless; greedy and iterative; parallel; and nature-inspired and hybridized. In the reviewed articles, GA and NN are used most often for BDM in the IoT (6 articles). GA has been used for feature extraction and selection, outlier detection, scheduling, optimizing energy consumption, reducing execution time and delay, and optimizing predictive models in IoT applications [ 69 , 86 , 115 , 122 , 146 , 173 ].
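A hedged sketch of GA-based feature selection, in the spirit of the studies cited above, is shown below: binary chromosomes mark which features are kept, and fitness is a classifier's cross-validated accuracy. The population size, mutation rate, and synthetic data are assumptions, not settings from any reviewed article.

```python
# GA-style feature selection sketch: evolve binary masks over feature columns.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(4)
X = rng.normal(size=(300, 20))
y = (X[:, 0] + X[:, 3] - X[:, 7] > 0).astype(int)   # only 3 features are informative

def fitness(mask):
    if mask.sum() == 0:
        return 0.0
    cols = mask.astype(bool)
    return cross_val_score(LogisticRegression(max_iter=500), X[:, cols], y, cv=3).mean()

pop = rng.integers(0, 2, size=(10, 20))               # initial population of masks
for _ in range(15):                                    # generations
    scores = np.array([fitness(ind) for ind in pop])
    parents = pop[np.argsort(scores)[-4:]]             # keep the fittest masks
    children = []
    while len(children) < len(pop):
        a, b = parents[rng.integers(0, 4, size=2)]
        cut = rng.integers(1, 19)
        child = np.concatenate([a[:cut], b[cut:]])      # one-point crossover
        flip = rng.random(20) < 0.05                    # mutation
        children.append(np.where(flip, 1 - child, child))
    pop = np.array(children)

best = pop[np.argmax([fitness(ind) for ind in pop])]
print("selected features:", np.flatnonzero(best))
```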

4 Open issues and challenges

This section offers a variety of vital issues and challenges that require future work. The IoT faces many challenges and open issues, including security, privacy, hardware, heterogeneity, data analysis, and virtualization challenges. IoT devices produce big data that must be monitored and managed using particular data patterns. For efficient decision-making, BDA in the IoT is applied to large datasets to reveal unseen patterns and correlations, so the key challenge of big data in the IoT is analyzing that data for knowledge discovery and virtualization. Various studies have presented different categorizations of the challenges and open issues of BDM in the IoT. Romero et al. [ 212 ] divided challenges into principal worries, security and monitoring, technological development, standardization, and privacy. Santana et al. [ 213 ] divided challenges into privacy, data management, heterogeneity, energy management, communication, scalability, security, lack of testbeds, city models, and platform maintenance. Ahmed et al. [ 27 ] divided challenges into diversity, security, data provenance, data management, and data governance and regulation. This study divides the challenges into BDM in the IoT challenges and quality attribute challenges.

4.1 Big data management challenges in the IoT

In many reviewed articles, IoT big data management depends on centralized centers, including cloud-based servers, and has technical limitations. These architectures are platform-centric and have costly customized access mechanisms. A centralized architecture can have a single point of failure, which is very inefficient in terms of scalability and reliability. Also, in these architectures, unauthorized access to the server can easily result in the modification, leakage, or manipulation of critical data [ 215 ]. In some research, the authors used blockchain technology to overcome these problems [ 215 , 216 ]. However, this technology has its own challenges; for example, blockchain platforms can heavily consume the computational resources of IoT devices. As reviewed in Sect.  3.1 , the BDM process in the IoT includes data collection, communication, data ingestion, data storage, processing and analysis, and post-processing, each of which faces a variety of challenges and problems. This section examines the challenges involved in each of these steps.

4.1.1 Data collection

Big data in the IoT is generated from different, distributed, multisource, heterogeneous, and unsupervised domains [ 217 , 218 ]. Collecting this large amount of diverse data faces challenges such as energy consumption, the limited battery life of sensors and other data collection devices, different hardware and operating systems, and multiple, disparate resources and their combination. It can be difficult to obtain complete and accurate data and to maintain data quality. The IoT and WSNs encompass a large number of distributed mobile nodes. Mobile nodes [ 219 ] must increase the amount of data collected while minimizing the power consumption of both the mobile node and the IoT devices. Therefore, the main challenge is mobile data collection management, i.e., determining and planning mobile sink trajectories for collecting data from nodes. Most existing mobile data collection approaches are static and only find a solution for a scenario with fixed parameters [ 220 ]. These solutions do not consider changes in the amount of data generated by the IoT nodes or devices, or the fact that an IoT device can move from one situation to another. For future work, we propose using AI techniques, including ML or deep learning, for the intelligent management of mobile data collection.

4.1.2 Communication

Transferring data from different sources to the data processing and analysis stage is one of the steps of BDM in the IoT. Communication protocols and technologies must share data at high speed and on time. The connectivity challenges include interoperability, bandwidth, traffic reduction, energy consumption, security, network and transport protocols, service delivery, network congestion, and communication cost. Another connectivity challenge is nodes accessing other nodes' information under different network topologies with different channel fading [ 221 ]. Concerning advances in mobile information infrastructure, integrating 6G technologies, mobile satellite communications, and AI can increase the frequency band, network speed, and network coverage and improve the number of connections [ 222 ]. Different approaches have been proposed for optimizing data transmission and overcoming these limitations, such as parsimonious/compressive sensing [ 223 , 224 ]. Compressive sensing is a theory of acquiring and compressing signals that exploits the sparsity of natural signals at the sensing stage to minimize power consumption and reduce data dimensionality [ 225 ]. In compressive sensing, the data collected from different sensors are first compressed and then transmitted; the complexity is therefore transferred from the sensors, which are usually resource-constrained and self-powered, to the receiver side [ 226 ]. For future work, we propose combining compressive sensing with AI technologies to provide a lightweight, real-time, and dynamic compressive sensing method for overcoming the communication challenges of BDM in the IoT.
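The compressive-sensing idea can be illustrated with a short sketch in which a sparse signal is sampled through a random measurement matrix on the sensor side and reconstructed with orthogonal matching pursuit on the receiver side; the signal length, number of measurements, and sparsity level are assumptions for demonstration.

```python
# Compressive-sensing sketch: random projections on the sensor side,
# sparse reconstruction (OMP) on the receiver side.
import numpy as np
from sklearn.linear_model import OrthogonalMatchingPursuit

rng = np.random.default_rng(5)
n, m, k = 256, 64, 8                      # signal length, measurements, sparsity
signal = np.zeros(n)
signal[rng.choice(n, k, replace=False)] = rng.normal(0, 1, k)   # k-sparse signal

phi = rng.normal(0, 1 / np.sqrt(m), size=(m, n))   # random sensing matrix
y = phi @ signal                                    # compressed measurements sent over the network

omp = OrthogonalMatchingPursuit(n_nonzero_coefs=k).fit(phi, y)
recovered = omp.coef_
print("reconstruction error:", float(np.linalg.norm(recovered - signal)))
```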

4.1.3 Data ingestion

Big data in the IoT have various characteristics: they are enormous, high-speed, heterogeneous in format, complex, of different resolutions, abnormal and incorrect, ambiguous, unbalanced, massively redundant, multidimensional, granular, continuous, inconsistent, probabilistic, sparse, sequential, dynamic, time-sensitive, non-randomly distributed, and misplaced [56, 63, 89, 117, 119, 125, 135, 137, 173, 227]. Each data ingestion step discussed in Sect. 3.1.3 has its own challenges, including anomaly detection, missing data, outlier detection, feature selection/extraction, dimensionality reduction, redundancy, standardization, rule discovery, computational cost, and normalization, and different mechanisms are used to address them. Missing data can lead to the loss of a large amount of valuable and reliable information and to bad decision-making. Many articles handle missing data by deleting or ignoring records, or by filling gaps with the mean/median or a constant global value; such naive methods may yield biased and untrustworthy results [228] (see the imputation sketch after this paragraph). Therefore, developing new techniques with higher efficiency and accuracy, minimal computational complexity, and lower time consumption is an interesting direction for the future; ML and nature-inspired optimization algorithms, or combinations thereof, can be used for this purpose. Parallel technology has made data ingestion and processing more efficient in recent years, saving space and time by eliminating the need to decompress data [229]. The BDA types in the IoT discussed in Sect. 3.4 are also used at this stage, and each of these methods has challenges. For example, clustering faces challenges such as real-time clustering, local optima, determining the number of clusters, updating the cluster centers, and choosing the initial cluster centers. ANNs face many issues, including determining the number of layers, the training and test samples, and the number of nodes, choosing an operable objective function, and improving the training speed of the network in a big data environment. Various articles solve these problems using meta-heuristic algorithms. However, these algorithms cannot handle big IoT data sets within the specified time due to high computation costs, limited memory and processing units, and premature convergence [145, 230]. For future work, we propose using new meta-heuristic optimization algorithms and AI methods based on these techniques, exploiting the strengths of MapReduce and Apache Spark.
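As a small illustration of why naive imputation can mislead, the following sketch (synthetic, correlated sensor readings generated purely for this example; it assumes the third-party scikit-learn package) compares mean imputation with a neighbor-based imputer; on correlated IoT readings the latter typically recovers missing values more faithfully:

import numpy as np
from sklearn.impute import SimpleImputer, KNNImputer

rng = np.random.default_rng(1)

# Synthetic temperature/humidity readings with correlated columns and ~10% missing values.
temp = rng.normal(25, 3, size=200)
hum = 80 - 1.5 * temp + rng.normal(0, 1, size=200)
data = np.column_stack([temp, hum])
mask = rng.random(data.shape) < 0.1
incomplete = data.copy()
incomplete[mask] = np.nan

mean_filled = SimpleImputer(strategy="mean").fit_transform(incomplete)
knn_filled = KNNImputer(n_neighbors=5).fit_transform(incomplete)

def rmse(filled):
    """Error of the imputed entries against the known ground truth."""
    return float(np.sqrt(np.mean((filled[mask] - data[mask]) ** 2)))

print("mean imputation RMSE:", rmse(mean_filled))
print("KNN imputation RMSE :", rmse(knn_filled))

Because humidity depends on temperature in this toy data, the neighbor-based imputer can exploit the correlation that a global mean ignores, which is the kind of bias the cited critique of naive methods refers to.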

4.1.4 Data storage

Data storage is another major challenge of BDM in the IoT. The big data storage mechanisms in the IoT were discussed in Sect. 3.1.4. The challenges in this area concern IoT-based big data storage systems in cloud computing and in complex environments such as Industry 4.0 applications, as well as data storage architecture. The main data storage challenges are IoT data replication and consistency management. Many researchers have proposed strategies for determining the best location for replica storage in geo-distributed storage systems based on cloud and fog computing. However, because of the geographical distance between distributed storage systems, many of them cannot handle the problems of high data access latency and replica synchronization cost [231]. Data consistency management strategies must also manage massive amounts of data with different consistency requirements and system heterogeneity. A minimal example of demand-aware replica placement is sketched below.
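The following sketch is only an illustration of the placement side of the problem (site names, latencies, request rates, and the replica budget are all hypothetical, and it ignores synchronization cost): a greedy heuristic that repeatedly adds the storage site that most reduces demand-weighted access latency.

def place_replicas(latency, demand, budget):
    """Greedy replica placement. latency[site][region] is the access latency from a
    client region to a candidate storage site; demand[region] is its request rate.
    Each step adds the site that most reduces total demand-weighted latency."""
    chosen = []

    def total_cost(sites):
        if not sites:
            return float("inf")
        return sum(demand[r] * min(latency[s][r] for s in sites) for r in demand)

    for _ in range(budget):
        best = min((s for s in latency if s not in chosen),
                   key=lambda s: total_cost(chosen + [s]))
        chosen.append(best)
    return chosen, total_cost(chosen)

latency = {
    "cloud-eu": {"eu": 20,  "asia": 180, "us": 120},
    "fog-asia": {"eu": 190, "asia": 15,  "us": 200},
    "cloud-us": {"eu": 110, "asia": 210, "us": 25},
}
demand = {"eu": 50, "asia": 80, "us": 30}
print(place_replicas(latency, demand, budget=2))

A full strategy of the kind discussed above would additionally weigh replica synchronization traffic and consistency requirements against this access-latency gain.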

4.1.5 Processing and analysis

Big data processing and analysis in BDM in the IoT faces different challenges, including task scheduling, real-time data analysis, developing the IoT data analysis infrastructure, data management in cloud-IoT environments, and query optimization. Authors have used data mining and AI algorithms to overcome these challenges. The challenge of using AI technologies for data analytics in the IoT is to balance the computational cost (or response time) against the accuracy of the prediction and analysis results [232]. Also, many multi-objective optimization problems have more than three objective functions, which raises challenges concerning the diversity and convergence speed of the algorithm [152]. Determining an algorithm to process a dynamic IoT dataset with better accuracy, based on application-specific goals, remains a challenge. Moreover, most current methods cannot meet user demands regarding the fundamental features of cloud-IoT environments, including heterogeneity, dynamism, reliability, flexibility, responsiveness, and elasticity. For future work, we propose studying various optimization algorithms, including (many-objective) metaheuristic algorithms and ML algorithms, as well as combined versions of these algorithms, for big data processing and analysis in the IoT. Given the limitations of wireless nodes (low power and computational capacity) and of cloud servers (high latency, privacy concerns, performance bottlenecks, context unawareness, etc.) for processing and analysis tasks, using mobile edge or fog computing is helpful to overcome these problems; a greedy scheduling baseline for such offloading decisions is sketched below.
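As an illustrative baseline for such offloading/scheduling decisions (node speeds, ready times, and task sizes are hypothetical; a metaheuristic such as a genetic algorithm would search over the assignment space that this greedy rule samples only once):

def assign_tasks(tasks, nodes):
    """Greedy scheduling baseline. tasks: dict task -> required CPU cycles;
    nodes: dict node -> {"speed": cycles/s, "ready": earliest free time in s}.
    Each task (largest first) goes to the node that finishes it earliest."""
    plan = {}
    ready = {n: info["ready"] for n, info in nodes.items()}
    for task, cycles in sorted(tasks.items(), key=lambda kv: -kv[1]):
        finish = {n: ready[n] + cycles / nodes[n]["speed"] for n in nodes}
        best = min(finish, key=finish.get)
        plan[task] = best
        ready[best] = finish[best]
    makespan = max(ready[n] for n in set(plan.values()))
    return plan, makespan

tasks = {"t1": 8e9, "t2": 3e9, "t3": 5e9, "t4": 1e9}
nodes = {"fog-1": {"speed": 2e9, "ready": 0.0},
         "cloud": {"speed": 8e9, "ready": 0.5}}   # cloud is faster but starts later (latency)
print(assign_tasks(tasks, nodes))

Extending the single makespan objective to several objectives (energy, cost, privacy) leads to the many-objective formulations discussed above.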

4.1.6 Post-processing

Providing insight from processed and analyzed data in the IoT requires selecting appropriate visualization techniques. Most of the reviewed methods use simulation tools such as CloudSim [143, 173], TRNSYS [131], Cooja [82], and ExtendSim [8] for evaluation. Additional studies are needed to evaluate the mentioned approaches on real-world systems and datasets.

4.2 QoS management

QoS is one of the critical factors in BDM in the IoT and requires research, management, and optimization (discussed in Sect. 3.3). The reviewed articles used these parameters and metrics for evaluation, but no article considers all of them thoroughly for its proposed architecture. Therefore, comparing various architectures against the different QoS parameters and quality attributes would be valuable future work. Security, privacy, and trust are critical issues in IoT BDA that most reviewed articles did not address, and the proposed architectures or frameworks did not involve the data perception layer. The security frame generally consists of confidentiality, integrity, authentication, non-repudiation, availability, and privacy [233]. We found no comprehensive and highly secure scheme or platform that meets all security requirements for all types of data collection, analysis, and sharing. Other main challenges are integrating privacy protection methods with data sharing platforms and selecting the best privacy protection algorithms to use during data processing [172]. Therefore, we suggest that future work utilize cryptographic mechanisms in the different layers of architectures or frameworks, add a data perception layer, and develop security protocols tailored to IoT devices, given their heterogeneity and resource limitations; an authenticated-encryption sketch for sensor payloads is given below.
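As a minimal sketch of applying authenticated encryption to a sensor payload before it leaves a gateway (it assumes the third-party Python cryptography package; the sensor name, metadata, and key handling are illustrative only, and a real deployment would need per-device key provisioning and rotation):

import os, json
from cryptography.hazmat.primitives.ciphers.aead import AESGCM

# Confidentiality and integrity for one reading: AES-GCM encrypts the value and
# authenticates both the ciphertext and the unencrypted device metadata.
key = AESGCM.generate_key(bit_length=128)      # in practice provisioned per device
aesgcm = AESGCM(key)

reading = json.dumps({"sensor": "temp-42", "value": 23.7}).encode()
aad = b"device-id:temp-42"                     # authenticated but not encrypted
nonce = os.urandom(12)                         # must never be reused with the same key

ciphertext = aesgcm.encrypt(nonce, reading, aad)
plaintext = aesgcm.decrypt(nonce, ciphertext, aad)   # raises an exception if tampered with
assert plaintext == reading

Placing such a primitive at the perception or gateway layer addresses confidentiality and integrity, but the key-management and resource-constraint issues noted above remain open.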

The blockchain framework is widely used in the IoT to improve protection, trust, reputation, management, control, and security. It provides decentralized security, authentication rules, and privacy for IoT devices. However, there are major challenges, such as high energy consumption, delay, and computational overhead, because of the resource constraints of IoT devices. Much research has been proposed to address these problems. For instance, Corradini et al. [234] proposed a two-tier blockchain framework, consisting of a point-to-point local tier and a community-oriented global tier, for increasing the security and autonomy of smart objects in the IoT by implementing a trust-based protection mechanism. Pincheira et al. [235] proposed a cost-effective blockchain-based architecture, with four components (the cloud module, mobile app, connected tool, and blockchain module), for ensuring data integrity, auditability, and traceability and for increasing trust in IoT devices. Tchagna Kouanou et al. [236] proposed a four-layer blockchain-based architecture, whose layers are tokens, smart contracts, blockchain, and peers, to secure data in the IoT and increase security, integrity, scalability, flexibility, and throughput. For future research, we suggest using AI techniques and a lightweight blockchain framework to increase protection, trust, reputation, and security in the IoT. The hash-chaining sketch below shows the core integrity mechanism such frameworks build on.
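The sketch is not any of the cited frameworks; it only illustrates the hash-chaining idea underlying them, using Python's standard hashlib: each block commits to its records and to the previous block's hash, so tampering with stored IoT data breaks the chain.

import hashlib, json, time

def block_hash(block):
    """Hash of a block's contents (everything except its own stored hash)."""
    payload = {k: v for k, v in block.items() if k != "hash"}
    return hashlib.sha256(json.dumps(payload, sort_keys=True).encode()).hexdigest()

def make_block(records, prev_hash):
    """Create an append-only block of IoT records linked to its predecessor."""
    block = {"timestamp": time.time(), "records": records, "prev_hash": prev_hash}
    block["hash"] = block_hash(block)
    return block

def verify_chain(chain):
    """Valid if every block's stored hash matches its contents and every block
    points to the hash of its predecessor."""
    ok_hashes = all(b["hash"] == block_hash(b) for b in chain)
    ok_links = all(b["prev_hash"] == p["hash"] for p, b in zip(chain, chain[1:]))
    return ok_hashes and ok_links

genesis = make_block([{"sensor": "s1", "value": 20.1}], prev_hash="0" * 64)
block2 = make_block([{"sensor": "s1", "value": 20.4}], prev_hash=genesis["hash"])
print(verify_chain([genesis, block2]))   # True until any block is altered

Real blockchain platforms add consensus, distribution, and smart contracts on top of this chaining, which is where the energy and computation overheads criticized above arise.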

Trust and reputation management are vital issues in SIoT and MIoT scenarios. In [237], the authors defined trust and reputation in the MIoT as: the trust of an instance in another instance of the same IoT; the trust of an object in another object of the MIoT; the reputation of an instance in an IoT; the reputation of an object in a MIoT; the reputation of an IoT in a MIoT; the trust of an IoT in another IoT; and the trust of an object in an IoT. Security in the SIoT aims to differentiate between secure and malicious things and to increase the safety and protection of SIoT networks [185]. Investigating trust and reputation in the SIoT and MIoT has many benefits, such as identifying, isolating, and managing malicious objects, supporting collaboration, and identifying and evaluating the objects' QoS parameters. Conversely, the lack of trust and reputation management in the SIoT and MIoT causes problems such as loss of accessibility, privacy, and security [237]. To overcome these issues, we suggest that future work combine trust and reputation management with AI methods to develop detection techniques for anomalous and malicious behaviors of things in the MIoT and SIoT. A minimal reputation-scoring sketch follows.
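As a minimal sketch of how such a reputation score could be maintained (a standard beta-reputation update with hypothetical object identifiers; real SIoT/MIoT trust models weigh many more factors, such as context, recommendations, and relationship type):

class TrustManager:
    """Beta-reputation-style trust scores for objects in an SIoT/MIoT setting:
    each satisfactory (s) or unsatisfactory (f) interaction updates the score
    E[trust] = (s + 1) / (s + f + 2); persistently low scores can flag
    potentially malicious things for isolation."""

    def __init__(self):
        self.counts = {}   # object id -> [satisfactory, unsatisfactory]

    def record(self, obj_id, satisfactory):
        s, f = self.counts.setdefault(obj_id, [0, 0])
        self.counts[obj_id] = [s + int(satisfactory), f + int(not satisfactory)]

    def trust(self, obj_id):
        s, f = self.counts.get(obj_id, (0, 0))
        return (s + 1) / (s + f + 2)

tm = TrustManager()
for outcome in [True, True, False, True]:
    tm.record("camera-7", outcome)
print(round(tm.trust("camera-7"), 3))   # 0.667 after three good and one bad interaction

An AI-based detector would replace the simple counting with learned models of normal behavior, which is the future direction suggested above.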

5 Conclusion

This paper presented a systematic review of BDM mechanisms in the IoT. First, we discussed the advantages and disadvantages of some systematic and review articles about BDM in the IoT and explained the purpose of this paper. Then, the research methodology and the details of the 110 selected articles were presented. These articles were divided into four main categories: BDM processes, BDM architectures/frameworks, quality attributes, and data analytics types in the IoT. Some of these categories were divided into subcategories: the BDM process in the IoT was divided into data collection, communication, data ingestion, data storage, processing and analysis, and post-processing; big data architectures/frameworks in the IoT were divided into BDM architectures/frameworks in IoT-based applications and BDM architectures/frameworks in IoT paradigms; big data analytics types were divided into descriptive, diagnostic, predictive, and prescriptive analysis; and big data storage systems in the IoT were divided into relational databases, NoSQL databases, DFS, and cloud/edge/fog/mist storage. The advantages and disadvantages of each BDM mechanism in the IoT were also discussed, and the tools and platforms used for BDM in the IoT in the articles were reviewed and compared against several criteria. The most common type of analysis, used in 57.27% of the articles, is predictive analysis based on ML algorithms; classification, optimization, and clustering algorithms are the most widely used for big data analysis in the IoT. The articles that present architectures mostly target IoT-based healthcare (33.33%) and IoT-based smart cities (22.22%); these architectures have two to eight layers, each performing a set of functions. In the review of quality attributes, we observed that most articles evaluated their proposals against criteria such as performance, efficiency, accuracy, and scalability, while some attributes, including confidentiality, sustainability, accessibility, portability, generality, and maintainability, are considered less often. NoSQL databases and DFS are used for data storage more than other databases. The BDM process in the IoT uses different algorithms and tools with various features, and various programming languages and operating systems are used to evaluate and implement the proposed mechanisms; Java, Python, and the Ubuntu operating system are used most.

This paper reviewed the BDM mechanisms in the IoT, specifically considering studies published in high-quality international journals. The most recent works on BDM mechanisms in the IoT have been compared and analyzed. We hope that this study will be helpful for the next generation of studies developing BDM mechanisms in real, complex environments.

https://www.idc.com/ .

http://www.gartner.com/it-glossary/predictive-analytics/ .

Cao, B., Zhang, Y., Zhao, J., Liu, X., Skonieczny, Ł, & Lv, Z. (2021). Recommendation based on large-scale many-objective optimization for the intelligent internet of things system. IEEE Internet of Things Journal . https://doi.org/10.1109/JIOT.2021.3104661


Hou, R., Kong, Y., Cai, B., & Liu, H. (2020). Unstructured big data analysis algorithm and simulation of internet of things based on machine learning. Neural Computing and Applications, 32 , 5399–5407.

Kumar, M., Kumar, S., & Kashyap, P. K. (2021). Towards data mining in IoT cloud computing networks: Collaborative filtering based recommended system. Journal of Discrete Mathematical Sciences and Cryptography, 24 , 1309–1326.


Cao, B., Zhao, J., Lv, Z., & Yang, P. (2020). Diversified personalized recommendation optimization based on mobile data. IEEE Transactions on Intelligent Transportation Systems, 22 , 2133–2139.

Sanislav, T., Mois, G. D., Zeadally, S., & Folea, S. C. (2021). Energy harvesting techniques for internet of things (IoT). IEEE Access, 9 , 39530–39549.

Zhou, H., Sun, G., Fu, S., Liu, J., Zhou, X., & Zhou, J. (2019). A Big data mining approach of PSO-based BP Neural network for financial risk management with IoT. IEEE Access, 7 , 154035–154043.

Tang, B., Chen, Z., Hefferman, G., Pei, S., Wei, T., He, H., et al. (2017). Incorporating intelligence in fog computing for big data analysis in smart cities. IEEE Transactions on Industrial informatics, 13 , 2140–2150.

Jiang, W. (2019). An intelligent supply chain information collaboration model based on internet of things and big data. IEEE Access, 7 , 58324–58335.

Xiao, S., Yu, H., Wu, Y., Peng, Z., & Zhang, Y. (2017). Self-evolving trading strategy integrating internet of things and big data. IEEE Internet of Things Journal, 5 , 2518–2525.

Sowe, S. K., Kimata, T., Dong, M., & Zettsu, K. (2014). Managing heterogeneous sensor data on a big data platform: IoT services for data-intensive science. In 2014 IEEE 38th International Computer Software and Applications Conference Workshops, Vasteras, Sweden, pp. 295–300.

Nie, X., Fan, T., Wang, B., Li, Z., Shankar, A., & Manickam, A. (2020). Big data analytics and IoT in operation safety management in under water management. Computer Communications, 154 , 188–196.

Liu, H., & Liu, X. (2019). A novel research on the influence of enterprise culture on internal control in big data and internet of things. Mobile Networks and Applications, 24 , 365–374.

Piccialli, F., Benedusi, P., Carratore, L., & Colecchia, G. (2020). An IoT data analytics approach for cultural heritage. Personal and Ubiquitous Computing . https://doi.org/10.1007/s00779-019-01323-z

Liu, C., Feng, Y., Lin, D., Wu, L., & Guo, M. (2020). Iot based laundry services: an application of big data analytics, intelligent logistics management, and machine learning techniques. International Journal of Production Research . https://doi.org/10.1080/00207543.2019.1677961

Wang, J., Wu, Y., Yen, N., Guo, S., & Cheng, Z. (2016). Big data analytics for emergency communication networks: A survey. IEEE Communications Surveys & Tutorials, 18 , 1758–1778.

Jahanbakht, M., Xiang, W., Hanzo, L., & Azghadi, M. R. (2020). Internet of underwater things and big marine data analytics – a comprehensive survey. arXiv preprint arXiv:2012.06712.

Stoyanova, M., Nikoloudakis, Y., Panagiotakis, S., Pallis, E., & Markakis, E. K. (2020). A survey on the internet of things (IoT) forensics: Challenges, approaches, and open issues. IEEE Communications Surveys & Tutorials, 22 , 1191–1221.

Aldalahmeh, S. A., & Ciuonzo, D. (2022). Distributed detection fusion in clustered sensor networks over multiple access fading channels. IEEE Transactions on Signal and Information Processing over Networks, 8 , 317–329.


Rajavel, R., Ravichandran, S. K., Harimoorthy, K., Nagappan, P., & Gobichettipalayam, K. R. (2022). IoT-based smart healthcare video surveillance system using edge computing. Journal of Ambient Intelligence and Humanized Computing, 13 , 3195–3207.

Shahid, H., Shah, M. A., Almogren, A., Khattak, H. A., Din, I. U., Kumar, N., et al. (2021). Machine learning-based mist computing enabled internet of battlefield things. ACM Transactions on Internet Technology (TOIT), 21 , 1–26.

Thomas, D., Orgun, M., Hitchens, M., Shankaran, R., Mukhopadhyay, S. C., & Ni, W. (2020). A graph-based fault-tolerant approach to modeling QoS for IoT-based surveillance applications. IEEE Internet of Things Journal, 8 , 3587–3604.

Vahdat, S. (2020). The role of IT-based technologies on the management of human resources in the COVID-19 era. Kybernetes.

Hassan, M., Awan, F. M., Naz, A., deAndrés-Galiana, E. J., Alvarez, O., Cernea, A., et al. (2022). Innovations in genomics and big data analytics for personalized medicine and health care: A review. International Journal of Molecular Sciences, 23 , 4645.

Honar Pajooh, H., Rashid, M. A., Alam, F., & Demidenko, S. (2021). IoT big data provenance scheme using blockchain on Hadoop ecosystem. Journal of Big Data, 8 , 1–26.

Priyadarshini, S. B. B., Bhusan Bagjadab, A., & Mishra, B. K. (2019). The role of IoT and big data in modern technological arena: A comprehensive study. In Internet of Things and Big Data Analytics for Smart Generation. Springer, pp. 13–25.

Zheng, W., Yin, L., Chen, X., Ma, Z., Liu, S., & Yang, B. (2021). Knowledge base graph embedding module design for Visual question answering model. Pattern Recognition, 120 , 108153.

Ahmed, E., Yaqoob, I., Hashem, I. A. T., Khan, I., Ahmed, A. I. A., Imran, M., et al. (2017). The role of big data analytics in internet of things. Computer Networks, 129 , 459–471.

Singh, S., & Yassine, A. (2018). IoT big data analytics with fog computing for household energy management in smart grids. In International Conference on Smart Grid and Internet of Things . pp. 13–22.

Marjani, M., Nasaruddin, F., Gani, A., Karim, A., Hashem, I. A. T., Siddiqa, A., et al. (2017). Big IoT data analytics: Architecture, opportunities, and open research challenges. IEEE Access, 5, 5247–5261.

Li, C. (2020). Information processing in internet of things using big data analytics. Computer Communications, 160 , 718–729.

Kwon, O., Lee, N., & Shin, B. (2014). Data quality management, data usage experience and acquisition intention of big data analytics. International journal of information management, 34 , 387–394.

Gandomi, A., & Haider, M. (2015). Beyond the hype: Big data concepts, methods, and analytics. International Journal of Information Management, 35 , 137–144.

Ahmed, M., Choudhury, S., & Al-Turjman, F. (2019). Big data analytics for intelligent internet of things. In Artificial Intelligence in IoT . Springer, pp. 107–127.

Urrehman, M. H., Ahmed, E., Yaqoob, I., Hashem, I. A. T., Imran, M., & Ahmad, S. (2018). Big data analytics in industrial IoT using a concentric computing model. IEEE Communications Magazine, 56 , 37–43.

Constante Nicolalde, F., Silva, F., Herrera, B., & Pereira, A. (2018). Big data analytics in IOT: challenges, open research issues and tools. In World conference on information systems and technologies , pp. 775–788.

Talebkhah, M., Sali, A., Marjani, M., Gordan, M., Hashim, S. J., & Rokhani, F. Z. (2021). IoT and big data applications in smart cities: Recent advances, challenges, and critical issues. IEEE Access, 9 , 55465–55484.

Bansal, M., Chana, I., & Clarke, S. (2020). A survey on iot big data: Current status, 13 v’s challenges, and future directions. ACM Computing Surveys (CSUR), 53 , 1–59.

Simmhan, Y., & Perera, S. (2016). Big data analytics platforms for real-time applications in IoT. In Big data analytics . Springer, pp. 115–135.

Shoumy, N. J., Ang, L.-M., Seng, K. P., Rahaman, D. M., & Zia, T. (2020). Multimodal big data affective analytics: A comprehensive survey using text, audio, visual and physiological signals. Journal of Network and Computer Applications, 149 , 102447.

Ge, M., Bangui, H., & Buhnova, B. (2018). Big data for internet of things: A survey. Future Generation Computer Systems, 87 , 601–614.

Siow, E., Tiropanis, T., & Hall, W. (2018). Analytics for the internet of things: A survey. ACM Computing Surveys (CSUR), 51 , 1–36.

Fawzy, D., Moussa, S. M., & Badr, N. L. (2022). The internet of things and architectures of big data analytics: Challenges of intersection at different domains. IEEE Access, 10 , 4969–4992.

Zhong, Y., Chen, L., Dan, C., & Rezaeipanah, A. (2022). A systematic survey of data mining and big data analysis in internet of things. The Journal of Supercomputing . https://doi.org/10.1007/s11227-022-04594-1

Hajjaji, Y., Boulila, W., Farah, I. R., Romdhani, I., & Hussain, A. (2021). Big data and IoT-based applications in smart environments: A systematic review. Computer Science Review, 39 , 100318.

Ahmadova, U., Mustafayev, M., Kiani Kalejahi, B., Saeedvand, S., & Rahmani, A. M. (2021). Big data applications on the internet of things: A systematic literature review. International Journal of Communication Systems, 34 , e5004.

Doewes, R. I., Gharibian, G., Zadeh, F. A., Zaman, B. A., Vahdat, S., & Akhavan-Sigari, R. (2022). An updated systematic review on the effects of aerobic exercise on human blood lipid profile. Current Problems in Cardiology . https://doi.org/10.1016/j.cpcardiol.2022.101108

Zadeh, F. A., Bokov, D. O., Yasin, G., Vahdat, S., & Abbasalizad-Farhangi, M. (2021). Central obesity accelerates leukocyte telomere length (LTL) shortening in apparently healthy adults: A systematic review and meta-analysis. Critical Reviews in Food Science and Nutrition . https://doi.org/10.1080/10408398.2021.1971155

Esmailiyan, M., Amerizadeh, A., Vahdat, S., Ghodsi, M., Doewes, R. I., & Sundram, Y. (2021). Effect of different types of aerobic exercise on individuals with and without hypertension: An updated systematic review. Current Problems in Cardiology . https://doi.org/10.1016/j.cpcardiol.2021.101034

Vahdat, S., & Shahidi, S. (2020). D-dimer levels in chronic kidney illness: a comprehensive and systematic literature review. Proceedings of the National Academy of Sciences, India Section b: Biological Sciences . https://doi.org/10.1007/s40011-020-01172-4

Zhou, D., Yan, Z., Fu, Y., & Yao, Z. (2018). A survey on network data collection. Journal of Network and Computer Applications, 116 , 9–23.

Rathore, M. M., Ahmad, A., Paul, A., & Rho, S. (2016). Urban planning and building smart cities based on the internet of things using big data analytics. Computer Networks, 101 , 63–80.

Ahmad, A., Khan, M., Paul, A., Din, S., Rathore, M. M., Jeon, G., et al. (2018). Toward modeling and optimization of features selection in big data based social Internet of Things. Future Generation Computer Systems, 82 , 715–726.

Shah, S. A., Seker, D. Z., Rathore, M. M., Hameed, S., Yahia, S. B., & Draheim, D. (2019). Towards disaster resilient smart cities: Can internet of things and big data analytics be the game changers? IEEE Access, 7 , 91885–91903.

Celesti, A., & Fazio, M. (2019). A framework for real time end to end monitoring and big data oriented management of smart environments. Journal of Parallel and Distributed Computing, 132 , 262–273.

Silva, B. N., Khan, M., & Han, K. (2017). Integration of big data analytics embedded smart city architecture with RESTful web of things for efficient service provision and energy management. Future generation computer systems . https://doi.org/10.1016/j.future.2017.06.024

Yassine, A., Singh, S., Hossain, M. S., & Muhammad, G. (2019). IoT big data analytics for smart homes with fog and cloud computing. Future Generation Computer Systems, 91 , 563–573.

Khan, M., Han, K., & Karthik, S. (2018). Designing smart control systems based on internet of things and big data analytics. Wireless Personal Communications, 99 , 1683–1697.

Rathore, M. M., Paul, A., Ahmad, A., Anisetti, M., & Jeon, G. (2017). Hadoop-based intelligent care system (HICS) analytical approach for big data in IoT. ACM Transactions on Internet Technology (TOIT), 18 , 1–24.

Yacchirema, D. C., Sarabia-Jácome, D., Palau, C. E., & Esteve, M. (2018). A smart system for sleep monitoring by integrating IoT with big data analytics. IEEE Access, 6 , 35988–36001.

Ma, Y., Wang, Y., Yang, J., Miao, Y., & Li, W. (2016). Big health application system based on health internet of things and big data. IEEE Access, 5 , 7885–7897.

Rathore, M. M., Ahmad, A., Paul, A., Wan, J., & Zhang, D. (2016). Real-time medical emergency response system: Exploiting IoT and big data for public health. Journal of medical systems, 40 , 283.

Zhou, Q., Zhang, Z., & Wang, Y. (2019). WIT120 data mining technology based on internet of things. Health Care Management Science . https://doi.org/10.1007/s10729-019-09497-x

Silva, B. N., Khan, M., Jung, C., Seo, J., Muhammad, D., Han, J., et al. (2018). Urban planning and smart city decision management empowered by real-time data processing using big data analytics. Sensors, 18 , 2994.

Lakshmanaprabu, S., Shankar, K., Khanna, A., Gupta, D., Rodrigues, J. J., Pinheiro, P. R., et al. (2018). Effective features to classify big data using social internet of things. IEEE access, 6 , 24196–24204.

Al-Qurabat, A. K. M., Mohammed, Z. A., & Hussein, Z. J. (2021). Data traffic management based on compression and MDL techniques for smart agriculture in IoT. Wireless Personal Communications, 120 , 2227–2258.

Ahmad, A., Babar, M., Din, S., Khalid, S., Ullah, M. M., Paul, A., et al. (2019). Socio-cyber network: The potential of cyber-physical system to define human behaviors using big data analytics. Future generation computer systems, 92 , 868–878.

Floris, A., Porcu, S., Atzori, L., & Girau, R. (2022). A Social IoT-based platform for the deployment of a smart parking solution. Computer Networks, 205 , 108756.

Al-Ali, A.-R., Zualkernan, I. A., Rashid, M., Gupta, R., & AliKarar, M. (2017). A smart home energy management system using IoT and big data analytics approach. IEEE Transactions on Consumer Electronics, 63 , 426–434.

Moreno, M. V., Terroso-Sáenz, F., González-Vidal, A., Valdés-Vela, M., Skarmeta, A. F., Zamora, M. A., et al. (2016). Applicability of big data techniques to smart cities deployments. IEEE Transactions on Industrial Informatics, 13 , 800–809.

Nasiri, H., Nasehi, S., & Goudarzi, M. (2019). Evaluation of distributed stream processing frameworks for IoT applications in smart cities. Journal of Big Data, 6 , 52.

Ahanger, T. A., Tariq, U., Nusir, M., Aldaej, A., Ullah, I., & Sulman, A. (2022). A novel IoT–fog–cloud-based healthcare system for monitoring and predicting COVID-19 outspread. The Journal of Supercomputing, 78 , 1783–1806.

Rani, S., & Chauhdary, S. H. (2018). A novel framework and enhanced QoS big data protocol for smart city applications. Sensors, 18 , 3980.

Lu, Z., Wang, N., Wu, J., & Qiu, M. (2018). IoTDeM: An IoT big data-oriented MapReduce performance prediction extended model in multiple edge clouds. Journal of Parallel and Distributed Computing, 118 , 316–327.

Rathore, M. M., Paul, A., Hong, W.-H., Seo, H., Awan, I., & Saeed, S. (2018). Exploiting IoT and big data analytics: Defining smart digital city using real-time urban data. Sustainable cities and society, 40 , 600–610.

Sood, S. K., Sandhu, R., Singla, K., & Chang, V. (2018). IoT, big data and HPC based smart flood management framework. Sustainable Computing: Informatics and Systems, 20 , 102–117.


Machorro-Cano, I., Alor-Hernández, G., Paredes-Valverde, M. A., Rodríguez-Mazahua, L., Sánchez-Cervantes, J. L., & Olmedo-Aguirre, J. O. (2020). HEMS-IoT: A big data and machine learning-based smart home system for energy saving. Energies, 13 , 1097.

Raptis, T. P., Passarella, A., & Conti, M. (2018). Performance analysis of latency-aware data management in industrial IoT networks. Sensors, 18 , 2611.

Seng, K. P., & Ang, L.-M. (2018). A big data layered architecture and functional units for the multimedia Internet of Things. IEEE Transactions on Multi-Scale Computing Systems, 4 , 500–512.

Muangprathub, J., Boonnam, N., Kajornkasirat, S., Lekbangpong, N., Wanichsombat, A., & Nillaor, P. (2019). IoT and agriculture data analysis for smart farm. Computers and electronics in agriculture, 156 , 467–474.

Chilipirea, C., Petre, A.-C., Groza, L.-M., Dobre, C., & Pop, F. (2017). An integrated architecture for future studies in data processing for smart cities. Microprocessors and Microsystems, 52 , 335–342.

Enayet, A., Razzaque, M. A., Hassan, M. M., Alamri, A., & Fortino, G. (2018). A mobility-aware optimal resource allocation architecture for big data task execution on mobile cloud in smart cities. IEEE Communications Magazine, 56 , 110–117.

Plageras, A. P., Psannis, K. E., Stergiou, C., Wang, H., & Gupta, B. B. (2018). Efficient IoT-based sensor BIG data collection–processing and analysis in smart buildings. Future Generation Computer Systems, 82 , 349–357.

Syafrudin, M., Alfian, G., Fitriyani, N. L., & Rhee, J. (2018). Performance analysis of IoT-based sensor, big data processing, and machine learning model for real-time monitoring system in automotive manufacturing. Sensors, 18 , 2946.

El-Hasnony, I. M., Mostafa, R. R., Elhoseny, M., & Barakat, S. I. (2021). Leveraging mist and fog for big data analytics in IoT environment. Transactions on Emerging Telecommunications Technologies . https://doi.org/10.1002/ett.4057

Jindal, A., Kumar, N., & Singh, M. (2020). A unified framework for big data acquisition, storage, and analytics for demand response management in smart cities. Future Generation Computer Systems, 108 , 921–934.

Hussain, M. M., Beg, M. S., & Alam, M. S. (2020). Fog computing for big data analytics in IoT aided smart grid networks. Wireless Personal Communications . https://doi.org/10.1007/s11277-020-07538-1

Zhou, Z., Yu, H., & Shi, H. (2020). Human activity recognition based on improved Bayesian convolution network to analyze health care data using wearable IoT device. IEEE Access, 8 , 86411–86418.

Sengupta, S., & Bhunia, S. S. (2020). Secure data management in cloudlet assisted IoT enabled e-health framework in smart city. IEEE Sensors Journal, 20 , 9581–9588.

Babar, M., & Arif, F. (2019). Real-time data processing scheme using big data analytics in internet of things based smart transportation environment. Journal of Ambient Intelligence and Humanized Computing, 10 , 4167–4177.

Hong-Tan, L., Cui-hua, K., Muthu, B., & Sivaparthipan, C. (2021). Big data and ambient intelligence in IoT-based wireless student health monitoring system. Aggression and Violent Behavior . https://doi.org/10.1016/j.avb.2021.101601

Paul, A., Ahmad, A., Rathore, M. M., & Jabbar, S. (2016). Smartbuddy: Defining human behaviors using big data analytics in social internet of things. IEEE Wireless communications, 23 , 68–74.

Gohar, M., Ahmed, S. H., Khan, M., Guizani, N., Ahmed, A., & Rahman, A. U. (2018). A big data analytics architecture for the internet of small things. IEEE Communications Magazine, 56 , 128–133.

Armoogum, S., & Li, X. (2019). Big data analytics and deep learning in bioinformatics with hadoop. In Deep Learning and Parallel Computing Environment for Bioengineering Systems . Elsevier, pp. 17–36.

Demchenko, Y., Turkmen, F., de Laat, C., Hsu, C. H., Blanchet, C., & Loomis, C. (2017). Cloud computing infrastructure for data intensive applications. In Big Data Analytics for Sensor-Network Collected Intelligence . Elsevier, pp. 21–62.

Wu, X., Zheng, W., Xia, X., & Lo, D. (2021). Data quality matters: A case study on data label correctness for security bug report prediction. IEEE Transactions on Software Engineering . https://doi.org/10.1109/TSE.2021.3063727

Erraissi, A., & Belangour, A. (2018). Data sources and ingestion big data layers: Meta-modeling of key concepts and features. International Journal of Engineering & Technology, 7 , 3607–3612.

Ji, C., Shao, Q., Sun, J., Liu, S., Pan, L., Wu, L., et al. (2016). Device data ingestion for industrial big data platforms with a case study. Sensors, 16 , 279.

Isah, H., & Zulkernine, F. (2018). A scalable and robust framework for data stream ingestion. In 2018 IEEE International Conference on Big Data (Big Data), pp. 2900–2905.

Dai, H.-N., Wong, R.C.-W., Wang, H., Zheng, Z., & Vasilakos, A. V. (2019). Big data analytics for large-scale wireless networks: Challenges and opportunities. ACM Computing Surveys (CSUR), 52 , 1–36.

Chawla, H., & Khattar, P., (2020). Data ingestion. In Data Lake Analytics on Microsoft Azure . Springer, pp. 43–85.

Sankaranarayanan, S., Rodrigues, J. J., Sugumaran, V., & Kozlov, S. (2020). Data flow and distributed deep neural network based low latency IoT-edge computation model for big data environment. Engineering Applications of Artificial Intelligence, 94 , 103785.

Davoudian, A., Chen, L., & Liu, M. (2018). A survey on NoSQL stores. ACM Computing Surveys (CSUR), 51 , 1–43.

Cao, B., Sun, Z., Zhang, J., & Gu, Y. (2021). Resource allocation in 5G IoV architecture based on SDN and fog-cloud computing. IEEE Transactions on Intelligent Transportation Systems, 22 , 3832–3840.

Sonbol, K., Özkasap, Ö., Al-Oqily, I., & Aloqaily, M. (2020). EdgeKV: Decentralized, scalable, and consistent storage for the edge. Journal of Parallel and Distributed Computing, 144 , 28–40.

Akanbi, A., & Masinde, M. (2020). A distributed stream processing middleware framework for real-time analysis of heterogeneous data on big data platform: case of environmental monitoring. Sensors, 20 , 3166.

Harb, H., Mroue, H., Mansour, A., Nasser, A., & Motta Cruz, E. (2020). A hadoop-based platform for patient classification and disease diagnosis in healthcare applications. Sensors, 20 , 1931.

Osman, A. M. S. (2019). A novel big data analytics framework for smart cities. Future Generation Computer Systems, 91 , 620–633.

Alves, J. M., Honório, L. M., & Capretz, M. A. (2019). ML4IoT: A framework to orchestrate machine learning workflows on internet of things data. IEEE Access, 7 , 152953–152967.

Oğur, N. B., Al-Hubaishi, M., & Çeken, C. (2022). IoT data analytics architecture for smart healthcare using RFID and WSN. ETRI Journal, 44 , 135–146.

Bashir, M. R., Gill, A. Q., Beydoun, G., & Mccusker, B. (2020). Big data management and analytics metamodel for IoT-enabled smart buildings. IEEE Access, 8 , 169740–169758.

Chhabra, G. S., Singh, V. P., & Singh, M. (2018). Cyber forensics framework for big data analytics in IoT environment using machine learning. Multimedia Tools and Applications . https://doi.org/10.1007/s11042-018-6338-1

Vögler, M., Schleicher, J. M., Inzinger, C., & Dustdar, S. (2017). Ahab: A cloud-based distributed big data analytics framework for the internet of things. Software: Practice and Experience, 47 , 443–454.

Farmanbar, M., & Rong, C. (2020). Triangulum city dashboard: An interactive data analytic platform for visualizing smart city performance. Processes, 8 , 250.

Ghallab, H., Fahmy, H., & Nasr, M. (2020). Detection outliers on internet of things using big data technology. Egyptian Informatics Journal, 21 , 131–138.

Lan, K., Fong, S., Song, W., Vasilakos, A. V., & Millham, R. C. (2017). Self-adaptive pre-processing methodology for big data stream mining in internet of things environmental sensor monitoring. Symmetry, 9 , 244.

He, X., Wang, K., Huang, H., & Liu, B. (2018). QoE-driven big data architecture for smart city. IEEE Communications Magazine, 56 , 88–93.

Singh, A., Garg, S., Batra, S., Kumar, N., & Rodrigues, J. J. (2018). Bloom filter based optimization scheme for massive data handling in IoT environment. Future Generation Computer Systems, 82 , 440–449.

Yu, W., Liu, Y., Dillon, T., Rahayu, W., & Mostafa, F. (2021). An integrated framework for health state monitoring in a smart factory employing IoT and big data techniques. IEEE Internet of Things Journal, 9 , 2443–2454.

Zhang, Q., Zhu, C., Yang, L. T., Chen, Z., Zhao, L., & Li, P. (2017). An incremental CFS algorithm for clustering large data in industrial Internet of Things. IEEE Transactions on Industrial Informatics, 13 , 1193–1201.

Shaji, B., Lal Raja Singh, R., & Nisha, K. (2022). A novel deep neural network based marine predator model for effective classification of big data from social internet of things. Concurrency and Computation: Practice and Experience . https://doi.org/10.1002/cpe.7244

Al-Osta, M., Bali, A., & Gherbi, A. (2019). Event driven and semantic based approach for data processing on IoT gateway devices. Journal of Ambient Intelligence and Humanized Computing, 10 , 4663–4678.

Deng, X., Jiang, P., Peng, X., & Mi, C. (2018). An intelligent outlier detection method with one class support tucker machine and genetic algorithm toward big sensor data in Internet of Things. IEEE Transactions on Industrial Electronics, 66 , 4672–4683.

Yao, X., Wang, J., Shen, M., Kong, H., & Ning, H. (2019). An improved clustering algorithm and its application in IoT data analysis. Computer Networks, 159 , 63–72.

Mansour, R. F., Abdel-Khalek, S., Hilali-Jaghdam, I., Nebhen, J., Cho, W., & Joshi, G. P. (2021). An intelligent outlier detection with machine learning empowered big data analytics for mobile edge computing. Cluster Computing . https://doi.org/10.1007/s10586-021-03472-4

Karyotis, V., Tsitseklis, K., Sotiropoulos, K., & Papavassiliou, S. (2018). Big data clustering via community detection and hyperbolic network embedding in IoT applications. Sensors, 18 , 1205.

Chui, K. T., Liu, R. W., Lytras, M. D., & Zhao, M. (2019). Big data and IoT solution for patient behaviour monitoring. Behaviour & Information Technology, 38 , 940–949.

Song, C.-W., Jung, H., & Chung, K. (2019). Development of a medical big-data mining process using topic modeling. Cluster Computing, 22 , 1949–1958.

Khan, M., Iqbal, J., Talha, M., Arshad, M., Diyan, M., & Han, K. (2018). Big data processing using internet of software defined things in smart cities. International Journal of Parallel Programming . https://doi.org/10.1007/s10766-018-0573-y

Gohar, M., Muzammal, M., & Rahman, A. U. (2018). SMART TSS: Defining transportation system behavior using big data analytics in smart cities. Sustainable cities and society, 41 , 114–119.

Anbarasan, M., Muthu, B., Sivaparthipan, C., Sundarasekar, R., Kadry, S., Krishnamoorthy, S., et al. (2020). Detection of flood disaster system based on IoT, big data and convolutional deep neural network. Computer Communications, 150 , 150–157.

Luo, X., Oyedele, L. O., Ajayi, A. O., Monyei, C. G., Akinade, O. O., & Akanbi, L. A. (2019). Development of an IoT-based big data platform for day-ahead prediction of building heating and cooling demands. Advanced Engineering Informatics, 41 , 100926.

Hossain, M. A., Ferdousi, R., Hossain, S. A., Alhamid, M. F., & El Saddik, A. (2020). A novel framework for recommending data mining algorithm in dynamic iot environment. IEEE Access, 8 , 157333–157345.

Safa, M., & Pandian, A. (2021). Intelligent big data analytics model for efficient cardiac disease prediction with IoT devices in WSN using fuzzy rules. Wireless Personal Communications . https://doi.org/10.1007/s11277-021-08788-3

Alsaig, A., Alagar, V., Chammaa, Z., & Shiri, N. (2019). Characterization and efficient management of big data in IoT-driven smart city development. Sensors, 19 , 2430.

Tang, R., & Fong, S. (2018). Clustering big IoT data by metaheuristic optimized mini-batch and parallel partition-based DGC in Hadoop. Future Generation Computer Systems, 86 , 1395–1412.

Kotenko, I., Saenko, I., & Branitskiy, A. (2018). Framework for mobile internet of things security monitoring based on big data processing and machine learning. IEEE Access . https://doi.org/10.1109/ACCESS.2018.2881998

Wang, T., Bhuiyan, M. Z. A., Wang, G., Rahman, M. A., Wu, J., & Cao, J. (2018). Big data reduction for a smart city’s critical infrastructural health monitoring. IEEE Communications Magazine, 56 , 128–133.

Kaur, I., Lydia, E. L., Nassa, V. K., Shrestha, B., Nebhen, J., Malebary, S., et al. (2021). Generative adversarial networks with quantum optimization model for mobile edge computing in IoT big data. Wireless Personal Communications . https://doi.org/10.1007/s11277-021-08706-7

Lakshmanaprabu, S., Shankar, K., Ilayaraja, M., Nasir, A. W., Vijayakumar, V., & Chilamkurti, N. (2019). Random forest for big data classification in the internet of things using optimal features. International journal of machine learning and cybernetics, 10 , 2609–2618.

Ullah, F., Habib, M. A., Farhan, M., Khalid, S., Durrani, M. Y., & Jabbar, S. (2017). Semantic interoperability for big-data in heterogeneous IoT infrastructure for healthcare. Sustainable Cities and Society, 34 , 90–96.

Manogaran, G., Varatharajan, R., Lopez, D., Kumar, P. M., Sundarasekar, R., & Thota, C. (2018). A new architecture of Internet of Things and big data ecosystem for secured smart healthcare monitoring and alerting system. Future Generation Computer Systems, 82 , 375–387.

Hendawi, A., Gupta, J., Liu, J., Teredesai, A., Ramakrishnan, N., Shah, M., et al. (2019). Benchmarking large-scale data management for internet of things. The Journal of Supercomputing, 75 , 8207–8230.

Mo, Y. (2019). A data security storage method for IoT under hadoop cloud computing platform. International Journal of Wireless Information Networks, 26 , 152–157.

Tu, L., Liu, S., Wang, Y., Zhang, C., Li, P. (2019). An optimized cluster storage method for real-time big data in internet of things. The Journal of Supercomputing . 1–17.

Tripathi, A. K., Sharma, K., Bala, M., Kumar, A., Menon, V. G., & Bashir, A. K. (2020). A parallel military-dog-based algorithm for clustering big data in cognitive industrial internet of things. IEEE Transactions on Industrial Informatics, 17 , 2134–2142.

Alelaiwi, A. (2017). A collaborative resource management for big IoT data processing in Cloud. Cluster Computing, 20 , 1791–1799.

Meerja, K. A., Naidu, P. V., & Kalva, S. R. K. (2019). Price versus performance of big data analysis for cloud based internet of things networks. Mobile Networks and Applications, 24 , 1078–1094.

Wang, T., Liang, Y., Zhang, Y., Arif, M., Wang, J., & Jin, Q. (2020). An intelligent dynamic offloading from cloud to edge for smart IoT systems with big data. IEEE Transactions on Network Science and Engineering . https://doi.org/10.1109/TNSE.2020.2988052

Vasconcelos, D., Andrade, R., Severino, V., & Souza, J. D. (2019). Cloud, fog, or mist in IoT? That is the question. ACM Transactions on Internet Technology (TOIT), 19 , 1–20.

Jamil, B., Ijaz, H., Shojafar, M., Munir, K., & Buyya, R. (2022). Resource allocation and task scheduling in fog computing and internet of everything environments: A taxonomy, review, and future directions. ACM Computing Surveys (CSUR) . https://doi.org/10.1145/3513002

Javadzadeh, G., & Rahmani, A. M. (2020). Fog computing applications in smart cities: A systematic survey. Wireless Networks, 26 , 1433–1457.

Cao, B., Zhang, J., Liu, X., Sun, Z., Cao, W., Nowak, R. M., et al. (2021). Edge–cloud resource scheduling in space–air–ground-integrated networks for internet of vehicles. IEEE Internet of Things Journal, 9 , 5765–5772.

Linaje, M., Berrocal, J., & Galan-Benitez, A. (2019). Mist and edge storage: Fair storage distribution in sensor networks. IEEE Access, 7 , 123860–123876.

Mehdipour, F., Noori, H., & Javadi, B. (2016). Energy-efficient big data analytics in datacenters. In Advances in Computers . Vol. 100. Elsevier, pp. 59–101.

Zhou, L., Mao, H., Zhao, T., Wang, V. L., Wang, X., & Zuo, P. (2022). How B2B platform improves Buyers’ performance: Insights into platform’s substitution effect. Journal of Business Research, 143 , 72–80.

García-Magariño, I., Lacuesta, R., & Lloret, J. (2017). Agent-based simulation of smart beds with Internet-of-Things for exploring big data analytics. IEEE Access, 6 , 366–379.

Bi, Z., Jin, Y., Maropoulos, P., Zhang, W.-J., & Wang, L. (2021). Internet of things (IoT) and big data analytics (BDA) for digital manufacturing (DM). International Journal of Production Research . https://doi.org/10.1080/00207543.2021.1953181

Ahmed, I., Ahmad, M., Jeon, G., & Piccialli, F. (2021). A framework for pandemic prediction using big data analytics. Big Data Research, 25 , 100190.

Puschmann, D., Barnaghi, P., & Tafazolli, R. (2016). Adaptive clustering for dynamic IoT data streams. IEEE Internet of Things Journal, 4 , 64–74.

Bu, F. (2018). An efficient fuzzy c-means approach based on canonical polyadic decomposition for clustering big data in IoT. Future Generation Computer Systems, 88 , 675–682.

Zhang, Q., Yang, L. T., Chen, Z., & Li, P. (2018). High-order possibilistic c-means algorithms based on tensor decompositions for big data in IoT. Information Fusion, 39 , 72–80.

Lavalle, A., Teruel, M. A., Maté, A., & Trujillo, J. (2020). Improving sustainability of smart cities through visualization techniques for big data from IoT devices. Sustainability, 12 , 5595.

Li, P., Chen, Z., Yang, L. T., Zhang, Q., & Deen, M. J. (2017). Deep convolutional computation model for feature learning on big data in internet of things. IEEE Transactions on Industrial Informatics, 14 , 790–798.

Patterson, E. K., Gurbuz, S., Tufekci, Z., & Gowdy, J. N. (2002). CUAVE: A new audio-visual database for multimodal human-computer interface research. In 2002 IEEE International conference on acoustics, speech, and signal processing , pp. II-2017-II-2020.

Zhang, Q., Yang, L. T., & Chen, Z. (2015). Deep computation model for unsupervised feature learning on big data. IEEE Transactions on Services Computing, 9 , 161–171.

Cauteruccio, F., Cinelli, L., Corradini, E., Terracina, G., Ursino, D., Virgili, L., et al. (2021). A framework for anomaly detection and classification in Multiple IoT scenarios. Future Generation Computer Systems, 114 , 322–335.

Liang, W., Li, W., & Feng, L. (2021). Information security monitoring and management method based on big data in the internet of things environment. IEEE Access, 9 , 39798–39812.

Vahdat, S. (2022). A review of pathophysiological mechanism, diagnosis, and treatment of thrombosis risk associated with COVID-19 infection. IJC Heart & Vasculature . https://doi.org/10.1016/j.ijcha.2022.101068

Abbasi, S., Naderi, Z., Amra, B., Atapour, A., Dadkhahi, S. A., Eslami, M. J., et al. (2021). Hemoperfusion in patients with severe COVID-19 respiratory failure, lifesaving or not? Journal of Research in Medical Sciences, 26 , 34.

Li, W., Chai, Y., Khan, F., Jan, S. R. U., Verma, S., Menon, V. G., et al. (2021). A comprehensive survey on machine learning-based big data analytics for IoT-enabled smart healthcare system. Mobile Networks and Applications, 26 , 234–252.

Biswas, R. (2022). Outlining big data analytics in health sector with special reference to Covid-19. Wireless Personal Communications, 124 , 2097–2108.

Wu, X., Zhang, Y., Wang, A., Shi, M., Wang, H., & Liu, L. (2020). MNSSp3: Medical big data privacy protection platform based on Internet of things. Neural Computing and Applications . https://doi.org/10.1007/s00521-020-04873-z

Elhoseny, M., Abdelaziz, A., Salama, A. S., Riad, A. M., Muhammad, K., & Sangaiah, A. K. (2018). A hybrid model of internet of things and cloud computing to manage big data in health services applications. Future generation computer systems, 86 , 1383–1394.

Jan, M. A., He, X., Song, H., & Babar, M. (2021). Machine learning and big data analytics for IoT-enabled smart cities. Mobile Networks and Applications, 26 , 156–158.

Liu, Z., Wang, Y., & Feng, J. (2022). Vehicle-type strategies for manufacturer’s car sharing. Kybernetes . https://doi.org/10.1108/K-11-2021-1095

Khan, M. A., Siddiqui, M. S., Rahmani, M. K. I., & Husain, S. (2021). Investigation of big data analytics for sustainable smart city development: An emerging country. IEEE Access, 10 , 16028–16036.

Sivaparthipan, C., Muthu, B. A., Manogaran, G., Maram, B., Sundarasekar, R., Krishnamoorthy, S., et al. (2020). Innovative and efficient method of robotics for helping the Parkinson’s disease patient using IoT in big data analytics. Transactions on Emerging Telecommunications Technologies, 31 , e3838.

Yang, L., Xiong, Z., Liu, G., Hu, Y., Zhang, X., & Qiu, M. (2021). An analytical model of page dissemination for efficient big data transmission of C-ITS. IEEE Transactions on Intelligent Transportation Systems . https://doi.org/10.1109/TITS.2021.3134557

Zantalis, F., Koulouras, G., Karabetsos, S., & Kandris, D. (2019). A review of machine learning and IoT in smart transportation. Future Internet, 11 , 94.

Guo, J., Liu, R., Cheng, D., Shanthini, A., & Vadivel, T. (2022). Urbanization based on IoT using big data analytics the impact of internet of things and big data in urbanization. Arabian Journal for Science and Engineering . https://doi.org/10.1007/s13369-021-06124-2

Shao, N. (2022). Research on architectural planning and landscape design of smart city based on computational intelligence. Computational Intelligence and Neuroscience. 2022.

Jia, T., Cai, C., Li, X., Luo, X., Zhang, Y., & Yu, X. (2022). Dynamical community detection and spatiotemporal analysis in multilayer spatial interaction networks using trajectory data. International Journal of Geographical Information Science . https://doi.org/10.1080/13658816.2022.2055037

Kahveci, S., Alkan, B., Mus’ab, H. A., Ahmad, B., & Harrison, R. (2022). An end-to-end big data analytics platform for IoT-enabled smart factories: A case study of battery module assembly system for electric vehicles. Journal of Manufacturing Systems, 63 , 214–223.

Nitti, M., Girau, R., & Atzori, L. (2013). Trustworthiness management in the social internet of things. IEEE Transactions on knowledge and data engineering, 26 , 1253–1266.

Shahab, S., Agarwal, P., Mufti, T., & Obaid, A. J. (2022). SIoT (social internet of things): A review. ICT Analysis and Applications . https://doi.org/10.1007/978-981-16-5655-2_28

Atzori, L., Iera, A., Morabito, G., & Nitti, M. (2012). The social internet of things (siot)–when social networks meet the internet of things: Concept, architecture and network characterization. Computer networks, 56 , 3594–3608.

Baldassarre, G., Giudice, P. L., Musarella, L., & Ursino, D. (2019). The MIoT paradigm: Main features and an “ad-hoc” crawler. Future Generation Computer Systems, 92 , 29–42.

Meghana, J., Hanumanthappa, J., & Prakash, S. S. (2021). Performance comparison of machine learning algorithms for data aggregation in social internet of things. Global Transitions Proceedings, 2 , 212–219.

Lo Giudice, P., Nocera, A., Ursino, D., & Virgili, L. (2019). Building topic-driven virtual iots in a multiple iots scenario. Sensors, 19 , 2956.

McCall, J. (1994). Quality factors, encyclopedia of software engineering. (vol. 2, p. 760). New York: Wiley

Boehm, B., & In, H. (1996). Identifying quality-requirement conflicts. IEEE software, 13 , 25–35.

Grady, R. B. (1992). Practical software metrics for project management and process improvement : Prentice-Hall, Inc.

Talia, D. (2019). A view of programming scalable data analysis: From clouds to exascale. Journal of Cloud Computing, 8 , 1–16.

Firmani, D., Mecella, M., Scannapieco, M., & Batini, C. (2016). On the meaningfulness of “big data quality.” Data Science and Engineering, 1 , 6–20.

Jabbar, S., Ullah, F., Khalid, S., Khan, M., & Han, K. (2017). Semantic interoperability in heterogeneous IoT infrastructure for healthcare. Wireless Communications and Mobile Computing, 2017

Rialti, R., Marzi, G., Caputo, A., & Mayah, K. A. (2020). Achieving strategic flexibility in the era of big data. Management Decision.

Roy, D., Srivastava, R., Jat, M., & Karaca, M. S. (2022). A complete overview of analytics techniques: descriptive, predictive, and prescriptive. Decision intelligence analytics and the implementation of strategic business management, 15–30.

Rahul, K., Banyal, R. K., Goswami, P., & Kumar, V. (2021). Machine learning algorithms for big data analytics. In Computational Methods and Data Engineering , Springer, pp. 359–367.

Nti, I. K., Quarcoo, J. A., Aning, J., & Fosu, G. K. (2022). A mini-review of machine learning in big data analytics: Applications, challenges, and prospects. Big Data Mining and Analytics, 5 , 81–97.

Rajendran, R., Sharma, P., Saran, N. K., Ray, S., Alanya-Beltran, J., & Tongkachok, K. (2022). An exploratory analysis of machine learning adaptability in big data analytics environments: A data aggregation in the age of big data and the internet of things. In 2022 2nd International Conference on Innovative Practices in Technology and Management (ICIPTM), pp. 32–36.

Angelopoulos, A., Michailidis, E. T., Nomikos, N., Trakadas, P., Hatziefremidis, A., Voliotis, S., et al. (2019). Tackling faults in the industry 4.0 era—a survey of machine-learning solutions and key aspects. Sensors, 20 , 109.

Zhou, L., Pan, S., Wang, J., & Vasilakos, A. V. (2017). Machine learning on big data: Opportunities and challenges. Neurocomputing, 237 , 350–361.

Prastyo, D. D., Khoiri, H. A., Purnami, S. W., Fam, S.-F., & Suhermi, N. (2020). Survival support vector machines: A simulation study and its health-related application. Supervised and Unsupervised Learning for Data Science (pp. 85–100). Cham: Springer.


Pink, C. M. (2016). Forensic ancestry assessment using cranial nonmetric traits traditionally applied to biological distance studies. In Biological Distance Analysis , Elsevier, pp. 213–230.

Lu, W. (2019). Improved K-means clustering algorithm for big data mining under Hadoop parallel framework. Journal of Grid Computing . https://doi.org/10.1007/s10723-019-09503-0

Zheng, W., Liu, X., & Yin, L. (2021). Research on image classification method based on improved multi-scale relational network. PeerJ Computer Science, 7 , e613.

Goswami, S., & Kumar, A. (2022). Survey of deep-learning techniques in big-data analytics. Wireless Personal Communications . https://doi.org/10.1007/s11277-022-09793-w

Roni, M., Karim, H., Rana, M., Pota, H., Hasan, M., & Hussain, M. (2022). Recent trends in bio-inspired meta-heuristic optimization techniques in control applications for electrical systems: A review. International Journal of Dynamics and Control . https://doi.org/10.1007/s40435-021-00892-3

Swayamsiddha, S. (2020). Bio-inspired algorithms: principles, implementation, and applications to wireless communication. In Nature-Inspired Computation and Swarm Intelligence . Elsevier, pp. 49–63.

Ni, J., Wu, L., Fan, X., & Yang, S. X. (2016). Bioinspired intelligent algorithm and its applications for mobile robot control: a survey. Computational intelligence and neuroscience, 2016 .

Game, P. S., & Vaze, D. (2020). Bio-inspired Optimization: metaheuristic algorithms for optimization. arXiv preprint arXiv:2003.11637 .

Romero, C. D. G., Barriga, J. K. D., & Molano, J. I. R. (2016) Big data meaning in the architecture of IoT for smart cities. In International Conference on Data Mining and Big Data , pp. 457–465.

Santana, E. F. Z., Chaves, A. P., Gerosa, M. A., Kon, F., & Milojicic, D. S. (2017). Software platforms for smart cities: Concepts, requirements, challenges, and a unified reference architecture. ACM Computing Surveys (Csur), 50 , 1–37.

Granat, J., Batalla, J. M., Mavromoustakis, C. X., & Mastorakis, G. (2019). Big data analytics for event detection in the IoT-multicriteria approach. IEEE Internet of Things Journal, 7 , 4418–4430.

Xiong, Z., Zhang, Y., Luong, N. C., Niyato, D., Wang, P., & Guizani, N. (2020). The best of both worlds: A general architecture for data management in blockchain-enabled Internet-of-Things. IEEE Network, 34 , 166–173.

Oktian, Y. E., Lee, S.-G., & Lee, B.-G. (2020). Blockchain-based continued integrity service for IoT big data management: A comprehensive design. Electronics, 9 , 1434.

Liu, F., Zhang, G., & Lu, J. (2020). Multisource heterogeneous unsupervised domain adaptation via fuzzy relation neural networks. IEEE Transactions on Fuzzy Systems, 29 , 3308–3322.

Dong, J., Cong, Y., Sun, G., Fang, Z., & Ding, Z. (2021). Where and how to transfer: knowledge aggregation-induced transferability perception for unsupervised domain adaptation. IEEE Transactions on Pattern Analysis and Machine Intelligence . https://doi.org/10.1109/TPAMI.2021.3128560

Zenggang, X., Xiang, L., Xueming, Z., Sanyuan, Z., Fang, X., Xiaochao, Z., et al. (2022). A service pricing-based two-stage incentive algorithm for socially aware networks. Journal of Signal Processing Systems . https://doi.org/10.1007/s11265-022-01768-1

Benhamaid, S., Lakhlef, H., & Bouabdallah, A. (2021) Towards energy efficient mobile data collection in cluster-based IoT networks. In 2021 IEEE International Conference on Pervasive Computing and Communications Workshops and other Affiliated Events (PerCom Workshops) , pp. 340-343.

Sun, W., Lv, X., & Qiu, M. (2020). Distributed estimation for stochastic Hamiltonian systems with fading wireless channels. IEEE Transactions on Cybernetics .

Lv, Z., Qiao, L., & You, I. (2020). 6G-enabled network in box for internet of connected vehicles. IEEE transactions on intelligent transportation systems, 22 , 5275–5282.

Xifilidis, T., & Psannis, K. E. (2022). Correlation-based wireless sensor networks performance: The compressed sensing paradigm. Cluster Computing, 25 , 965–981.

Mohammadi, A., Ciuonzo, D., Khazaee, A., & Rossi, P. S. (2022). Generalized locally most powerful tests for distributed sparse signal detection. IEEE Transactions on Signal and Information Processing over Networks, 8 , 528–542.

Aziz, A., Osamy, W., Khedr, A. M., El-Sawy, A. A., & Singh, K. (2020). Grey Wolf based compressive sensing scheme for data gathering in IoT based heterogeneous WSNs. Wireless Networks, 26 , 3395–3418.

Djelouat, H., Amira, A., & Bensaali, F. (2018). Compressive sensing-based IoT applications: A review. Journal of Sensor and Actuator Networks, 7 , 45.

Wang, K., Zhang, B., Alenezi, F., & Li, S. (2022). Communication-efficient surrogate quantile regression for non-randomly distributed system. Information Sciences, 588 , 425–441.

Lee, G. H., Han, J., & Choi, J. K. (2021). MPdist-based missing data imputation for supporting big data analyses in IoT-based applications. Future Generation Computer Systems, 125 , 421–432.

Zhang, F., Zhai, J., Shen, X., Mutlu, O., & Du, X. (2021). POCLib: A high-performance framework for enabling near orthogonal processing on compression. IEEE Transactions on Parallel and Distributed Systems, 33 , 459–475.

Abualigah, L., Diabat, A., & Elaziz, M. A. (2021). Intelligent workflow scheduling for big data applications in IoT cloud computing environments. Cluster Computing, 24 , 2957–2976.

Naas, M. I., Lemarchand, L., Raipin, P., & Boukhobza, J. (2021). IoT data replication and consistency management in fog computing. Journal of Grid Computing, 19 , 1–25.

Ma, Z., Zheng, W., Chen, X., & Yin, L. (2021). Joint embedding VQA model based on dynamic word vector. PeerJ Computer Science, 7 , e353.

Rahouma, K. H., Aly, R. H., & Hamed, H. F. (2020). Challenges and solutions of using the social internet of things in healthcare and medical solutions—a survey. Toward Social Internet of Things (SIoT): Enabling Technologies, Architectures and Applications (pp. 13–30). Cham: Springer.

Corradini, E., Nicolazzo, S., Nocera, A., Ursino, D., & Virgili, L. (2022). A two-tier Blockchain framework to increase protection and autonomy of smart objects in the IoT. Computer Communications, 181 , 338–356.

Pincheira, M., Antonini, M., & Vecchio, M. (2022). Integrating the IoT and blockchain technology for the next generation of mining inspection systems. Sensors, 22 , 899.

Tchagna Kouanou, A., Tchito Tchapga, C., Sone Ekonde, M., Monthe, V., Mezatio, B. A., Manga, J., et al. (2022). Securing data in an internet of things network using blockchain technology: smart home case. SN Computer Science, 3 , 1–10.

Ursino, D., & Virgili, L. (2020). An approach to evaluate trust and reputation of things in a Multi-IoTs scenario. Computing, 102 , 2257–2298.

Download references

Author information

Arezou Naghib

Present address: Department of Computer Engineering, Science and Research Branch, Islamic Azad University, Tehran, Iran

Authors and Affiliations

Department of Computer Engineering, Science and Research Branch, Islamic Azad University, Tehran, Iran

Arash Sharifi

Department of Computer Engineering, Faculty of Engineering and Natural Sciences, Kadir Has University, Istanbul, Turkey

Nima Jafari Navimipour

Department of Computer Engineering, Tabriz Branch, Islamic Azad University, Tabriz, Iran

Institute of Research and Development, Duy Tan University, Da Nang, Vietnam

Mehdi Hosseinzadeh

School of Medicine and Pharmacy, Duy Tan University, Da Nang, Vietnam

Computer Science, University of Human Development, Sulaymaniyah, 0778-6, Iraq


Corresponding author

Correspondence to Nima Jafari Navimipour .

Additional information

Publisher's note.

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article

Naghib, A., Jafari Navimipour, N., Hosseinzadeh, M. et al. A comprehensive and systematic literature review on the big data management techniques in the internet of things. Wireless Netw 29 , 1085–1144 (2023). https://doi.org/10.1007/s11276-022-03177-5


Accepted : 19 October 2022

Published : 15 November 2022

Issue Date : April 2023

DOI : https://doi.org/10.1007/s11276-022-03177-5


Keywords

  • Big data management
  • Internet of things
  • Knowledge discovery
  • Systematic literature review (SLR)


Data Science: Recently Published Documents


Assessing the effects of fuel energy consumption, foreign direct investment and GDP on CO2 emission: New data science evidence from Europe & Central Asia

Documentation matters: Human-centered AI system to assist data science code documentation in computational notebooks

Computational notebooks allow data scientists to express their ideas through a combination of code and documentation. However, data scientists often pay attention only to the code, and neglect creating or updating their documentation during quick iterations. Inspired by human documentation practices learned from 80 highly-voted Kaggle notebooks, we design and implement Themisto, an automated documentation generation system to explore how human-centered AI systems can support human data scientists in the machine learning code documentation scenario. Themisto facilitates the creation of documentation via three approaches: a deep-learning-based approach to generate documentation for source code, a query-based approach to retrieve online API documentation for source code, and a user prompt approach to nudge users to write documentation. We evaluated Themisto in a within-subjects experiment with 24 data science practitioners, and found that automated documentation generation techniques reduced the time for writing documentation, reminded participants to document code they would have ignored, and improved participants’ satisfaction with their computational notebook.

Data science in the business environment: Insight management for an Executive MBA

Adventures in financial data science

GeCoAgent: A conversational agent for empowering genomic data extraction and analysis

With the availability of reliable and low-cost DNA sequencing, human genomics is relevant to a growing number of end-users, including biologists and clinicians. Typical interactions require applying comparative data analysis to huge repositories of genomic information for building new knowledge, taking advantage of the latest findings in applied genomics for healthcare. Powerful technology for data extraction and analysis is available, but broad use of the technology is hampered by the complexity of accessing such methods and tools. This work presents GeCoAgent, a big-data service for clinicians and biologists. GeCoAgent uses a dialogic interface, animated by a chatbot, for supporting the end-users’ interaction with computational tools accompanied by multi-modal support. While the dialogue progresses, the user is accompanied in extracting the relevant data from repositories and then performing data analysis, which often requires the use of statistical methods or machine learning. Results are returned using simple representations (spreadsheets and graphics), while at the end of a session the dialogue is summarized in textual format. The innovation presented in this article is concerned with not only the delivery of a new tool but also our novel approach to conversational technologies, potentially extensible to other healthcare domains or to general data science.

Differentially Private Medical Texts Generation Using Generative Neural Networks

Technological advancements in data science have offered us affordable storage and efficient algorithms to query a large volume of data. Our health records are a significant part of this data, which is pivotal for healthcare providers and can be utilized in our well-being. The clinical note in electronic health records is one such category that collects a patient’s complete medical information during different timesteps of patient care available in the form of free-texts. Thus, these unstructured textual notes contain events from a patient’s admission to discharge, which can prove to be significant for future medical decisions. However, since these texts also contain sensitive information about the patient and the attending medical professionals, such notes cannot be shared publicly. This privacy issue has thwarted timely discoveries on this plethora of untapped information. Therefore, in this work, we intend to generate synthetic medical texts from a private or sanitized (de-identified) clinical text corpus and analyze their utility rigorously in different metrics and levels. Experimental results promote the applicability of our generated data as it achieves more than 80\% accuracy in different pragmatic classification problems and matches (or outperforms) the original text data.

Impact on Stock Market across Covid-19 Outbreak

Abstract: This paper analyses the impact of the pandemic on global stock exchanges. Stock listing values are determined by a variety of factors, including seasonal changes, catastrophic calamities, pandemics, fiscal year changes, and many more. This paper provides an analysis of the variation in listing prices over the worldwide outbreak of the novel coronavirus. The key reason for examining this outbreak was to provide a notion of the underlying regulation of stock exchanges. Daily closing prices of the stock indices from January 2017 to January 2022 have been utilized for the analysis. The predominant aim of the research is to analyse whether the global economic downturn impacts the financial stock exchanges. Keywords: Stock Exchange, Matplotlib, Streamlit, Data Science, Web scraping.

Information Resilience: the nexus of responsible and agile approaches to information use

AbstractThe appetite for effective use of information assets has been steadily rising in both public and private sector organisations. However, whether the information is used for social good or commercial gain, there is a growing recognition of the complex socio-technical challenges associated with balancing the diverse demands of regulatory compliance and data privacy, social expectations and ethical use, business process agility and value creation, and scarcity of data science talent. In this vision paper, we present a series of case studies that highlight these interconnected challenges, across a range of application areas. We use the insights from the case studies to introduce Information Resilience, as a scaffold within which the competing requirements of responsible and agile approaches to information use can be positioned. The aim of this paper is to develop and present a manifesto for Information Resilience that can serve as a reference for future research and development in relevant areas of responsible data management.

qEEG Analysis in the Diagnosis of Alzheimer's Disease: A Comparison of Functional Connectivity and Spectral Analysis

Alzheimer's disease (AD) is a brain disorder that is mainly characterized by a progressive degeneration of neurons in the brain, causing a decline in cognitive abilities and difficulties in engaging in day-to-day activities. This study compares an FFT-based spectral analysis against a functional connectivity analysis based on phase synchronization, for finding known differences between AD patients and Healthy Control (HC) subjects. Both of these quantitative analysis methods were applied on a dataset comprising bipolar EEG montage values from 20 diagnosed AD patients and 20 age-matched HC subjects. Additionally, an attempt was made to localize the identified AD-induced brain activity effects in AD patients. The obtained results showed the advantage of the functional connectivity analysis method compared to a simple spectral analysis. Specifically, while spectral analysis could not find any significant differences between the AD and HC groups, the functional connectivity analysis showed statistically higher synchronization levels in the AD group in the lower frequency bands (delta and theta), suggesting that the AD patients' brains are in a phase-locked state. Further comparison of functional connectivity between the homotopic regions confirmed that the traits of AD were localized in the centro-parietal and centro-temporal areas in the theta frequency band (4-8 Hz). The contribution of this study is that it applies a neural metric for Alzheimer's detection from a data science perspective rather than from a neuroscience one. The study shows that the combination of bipolar derivations with phase synchronization yields similar results to comparable studies employing alternative analysis methods.

Big Data Analytics for Long-Term Meteorological Observations at Hanford Site

A growing number of physical objects with embedded sensors with typically high volume and frequently updated data sets has accentuated the need to develop methodologies to extract useful information from big data for supporting decision making. This study applies a suite of data analytics and core principles of data science to characterize near real-time meteorological data with a focus on extreme weather events. To highlight the applicability of this work and make it more accessible from a risk management perspective, a foundation for a software platform with an intuitive Graphical User Interface (GUI) was developed to access and analyze data from a decommissioned nuclear production complex operated by the U.S. Department of Energy (DOE, Richland, USA). Exploratory data analysis (EDA), involving classical non-parametric statistics, and machine learning (ML) techniques, were used to develop statistical summaries and learn characteristic features of key weather patterns and signatures. The new approach and GUI provide key insights into using big data and ML to assist site operation related to safety management strategies for extreme weather events. Specifically, this work offers a practical guide to analyzing long-term meteorological data and highlights the integration of ML and classical statistics to applied risk and decision science.


An Overview of Big Data Concepts, Methods, and Analytics: Challenges, Issues, and Opportunities

  • Conference: 2023 5th Global Power, Energy and Communication Conference (GPECOM)
  • At: Cappadocia, Turkey

  • S. Mohammadali Zanjani, Islamic Azad University, Najafabad Branch
  • Hossein Shahinzadeh, Amirkabir University of Technology
  • Yasin Kabalci, Nigde Ömer Halisdemir University




Big Data, Artificial Intelligence, and Financial Economics

The proliferation of large unstructured datasets, along with advances in artificial intelligence (AI) technology, provides researchers in financial economics with new opportunities for data analysis, and it also changes the set of subjects these researchers study. As AI becomes increasingly important in making decisions using financial market data, it becomes crucial to study how AI interacts with both data resources and human decision-makers.

To promote research on emerging issues related to the methodology, applications, and socioeconomic implications of the growing availability of large datasets and AI tools, the National Bureau of Economic Research (NBER), with the generous support of the Office of Financial Research (OFR) and in collaboration with the Review of Financial Studies (RFS) , will convene a research conference on December 13, 2024. The program will be organized by RFS  Executive Editor Tarun Ramadorai of Imperial College London, and NBER Research Associates Itay Goldstein of the University of Pennsylvania, Chester Spatt of Carnegie Mellon University, and Mao Ye of Cornell University.

The organizers will consider submissions on topics including, but not limited to: 

 •  Unstructured Data Analysis and AI: The impact on financial markets of the growing use of AI technology to analyze unstructured data, such as text, images, audio, and video.

 •  Trading and AI: The impact of using AI in high-frequency trading, algorithmic trading, and the impacts of this use on financial markets.

 •  Big Data and AI in Investment: The rise of machines in asset management, particularly the growing analysis of high-dimensional datasets using machine learning techniques.

 •  Big Data and AI in Corporate Decisions: The impact of AI as well as other means of analyzing unstructured datasets and automating decision-making on corporate decisions, such as capital budgeting, working capital management, and regulatory compliance and reporting.

 •  Financial Institutions and Financial Intermediation: The impact of AI, fintech, and the analysis of large datasets on traditional financial institutions.

 •  AI and Regulation: The role of AI in detecting improper market conduct, the regulation of algorithms and winner-take-all markets, and strategies for ensuring accountability, fairness and transparency in AI models.

The organizers welcome submissions of both empirical and theoretical research papers and encourage submissions from scholars who are early in their careers, who are not NBER affiliates, and who are from under-represented groups in the financial economics profession.  Papers that are submitted for presentation at the conference may also be submitted to the RFS  under its dual review system at no extra cost. Papers that are rejected at any stage of this process are not considered to have been “rejected” at the RFS .  Authors may submit a future version of the same paper to the RFS , even if the paper is not selected for presentation at the conference. For a paper to be considered under the dual submission option, it may not be under review or invited revision at any journal, including the RFS, until the author has been notified of the outcome of the dual submission process. The details of the dual submission program may be found at http://sfs.org/dualsubmissionpolicy/. To be considered for inclusion on the program, papers must be uploaded by 11:59 pm EDT on Thursday, September 12, 2024 to one of the following sites:

For submissions to both the conference and the Review of Financial Studies

For submissions to the conference alone  

Please do not submit papers that have been accepted for publication or that will be published before the conference. Authors chosen to present papers will be notified in October, 2024. All presenters are expected to attend the meeting in person. The NBER will cover the travel and lodging cost of up to two presenters per paper. 

Questions about this conference may be addressed to  [email protected] .

  • Survey paper
  • Open access
  • Published: 01 October 2015

Big data analytics: a survey

  • Chun-Wei Tsai 1 ,
  • Chin-Feng Lai 2 ,
  • Han-Chieh Chao 1 , 3 , 4 &
  • Athanasios V. Vasilakos 5  

Journal of Big Data volume  2 , Article number:  21 ( 2015 ) Cite this article

148k Accesses

478 Citations

130 Altmetric

Metrics details

The age of big data is upon us, but traditional data analytics may not be able to handle such large quantities of data. The questions that arise now are how to develop a high-performance platform to efficiently analyze big data and how to design an appropriate mining algorithm to find useful things in big data. To discuss this issue in depth, this paper begins with a brief introduction to data analytics, followed by a discussion of big data analytics. Some important open issues and further research directions are also presented for the next steps of big data analytics.

Introduction

As information technology spreads rapidly, most data today are born digital and exchanged over the internet. According to the estimate of Lyman and Varian [1], more than 92% of new data were already stored on digital media devices in 2002, and the size of these new data exceeded five exabytes. In fact, the problems of analyzing large-scale data did not appear suddenly; they have existed for several years, because creating data is usually much easier than finding useful things in the data. Even though computer systems today are much faster than those of the 1930s, large-scale data remain a strain to analyze with the computers we have today.

In response to the problems of analyzing large-scale data, quite a few efficient methods [2], such as sampling, data condensation, density-based approaches, grid-based approaches, divide and conquer, incremental learning, and distributed computing, have been presented. These methods are constantly used to improve the performance of the operators of the data analytics process. Footnote 1 The results of these methods illustrate that, with efficient methods at hand, we may be able to analyze large-scale data in a reasonable time. Dimensionality reduction (e.g., principal components analysis, PCA [3]) is a typical example aimed at reducing the input data volume to accelerate the process of data analytics. Another reduction method, sampling [4], reduces the computation involved in data clustering and can also be used to speed up the computation time of data analytics.
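
As a concrete illustration of these two reduction ideas, the following minimal NumPy sketch projects a toy dataset onto a few principal components and keeps only a random fraction of its rows; the dataset, component count, and sampling fraction are illustrative assumptions rather than settings taken from the cited studies.

```python
import numpy as np

def pca_reduce(X, n_components):
    """Project X onto its top principal components (illustrative sketch)."""
    X_centered = X - X.mean(axis=0)
    # SVD of the centered data; rows of Vt are the principal directions
    _, _, Vt = np.linalg.svd(X_centered, full_matrices=False)
    return X_centered @ Vt[:n_components].T

def random_sample(X, fraction, seed=0):
    """Keep a random fraction of the rows to cut the analysis cost."""
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(X), size=int(len(X) * fraction), replace=False)
    return X[idx]

if __name__ == "__main__":
    X = np.random.default_rng(1).normal(size=(10_000, 50))   # toy dataset
    X_small = pca_reduce(random_sample(X, 0.1), n_components=5)
    print(X_small.shape)  # (1000, 5): fewer rows and fewer dimensions
```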

Although computing hardware has followed Moore's law for several decades, the problems of handling large-scale data still exist as we enter the age of big data. That is why Fisher et al. [5] pointed out that big data means data that cannot be handled and processed by most current information systems or methods: data in the big data era will not only be too big to be loaded into a single machine, but most traditional data mining methods and analytics developed for a centralized data analysis process may also not be directly applicable to big data. In addition to the issue of data size, Laney [6] presented a well-known definition (also called the 3Vs) to explain what makes data "big": volume, velocity, and variety. The 3Vs definition implies that the data size is large, that the data are created rapidly, and that the data exist in multiple types and are captured from different sources. Later studies [7, 8] pointed out that the 3Vs definition is insufficient to explain the big data we face now; thus, veracity, validity, value, variability, venue, vocabulary, and vagueness were added to complement the explanation of big data [8].

Fig. 1 Expected trend of the big data market between 2012 and 2018. The differently colored boxes (yellow, red, and blue) represent the order of appearance of the references in this paper for a particular year

The report of IDC [9] indicates that the big data market was about $16.1 billion in 2014. Another IDC report [10] forecasts that it will grow to $32.4 billion by 2017. The reports of [11] and [12] further predict that the big data market will reach $46.34 billion and $114 billion by 2018, respectively. As shown in Fig. 1, even though the market values of big data in these research and technology reports [9-15] differ, the forecasts consistently indicate that the scope of big data will grow rapidly in the near future.

In addition to the market, the results in disease control and prevention [16], business intelligence [17], and smart cities [18] make it easy to see that big data is of vital importance everywhere. Numerous studies are therefore focusing on developing effective technologies to analyze big data. To discuss big data analytics in depth, this paper gives not only a systematic description of traditional large-scale data analytics but also a detailed discussion of the differences between data analytics and big data analytics frameworks, for data scientists and researchers who focus on big data analytics.

Moreover, although several data analytics methods and frameworks have been presented in recent years, with their pros and cons discussed in different studies, a complete discussion from the perspective of data mining and knowledge discovery in databases is still needed. As a result, this paper aims to provide a brief review for researchers in the data mining and distributed computing domains so that they have a basic idea of how to use or develop data analytics for big data.

Roadmap of this paper

Figure 2 shows the roadmap of this paper, and the remainder of the paper is organized as follows. “ Data analytics ” begins with a brief introduction to the data analytics, and then “ Big data analytics ” will turn to the discussion of big data analytics as well as state-of-the-art data analytics algorithms and frameworks. The open issues are discussed in “ The open issues ” while the conclusions and future trends are drawn in “ Conclusions ”.

Data analytics

To make the whole process of knowledge discovery in databases (KDD) clearer, Fayyad and his colleagues summarized the KDD process with a few operations in [19]: selection, preprocessing, transformation, data mining, and interpretation/evaluation. As shown in Fig. 3, with these operators at hand we can build a complete data analytics system that first gathers data, then finds information in the data, and finally displays the knowledge to the user. According to our observation, the number of research articles and technical reports that focus on data mining is typically larger than the number focusing on the other operators, but this does not mean that the other operators of KDD are unimportant. They also play vital roles in the KDD process because they strongly affect its final result. To make the discussion of the main operators of the KDD process more concise, the following sections focus on those depicted in Fig. 3, which are simplified into three parts (input, data analytics, and output) and seven operators (gathering, selection, preprocessing, transformation, data mining, evaluation, and interpretation).

Fig. 3 The process of knowledge discovery in databases

As shown in Fig. 3, the gathering, selection, preprocessing, and transformation operators form the input part. The selection operator usually determines which kind of data is required for the data analysis and selects the relevant information from the gathered data or databases; the data gathered from different data sources then need to be integrated into the target data. The preprocessing operator plays a different role: it detects, cleans, and filters unnecessary, inconsistent, and incomplete data to turn the input into useful data. After the selection and preprocessing operators, the secondary data may still be in a number of different formats; therefore, the KDD process needs to transform them into a data-mining-capable format, which is performed by the transformation operator. Methods for reducing the complexity and downsizing the data scale to make the data useful for the data analysis part, such as dimensionality reduction, sampling, coding, or transformation, are usually employed in this step.

The data extraction, data cleaning, data integration, data transformation, and data reduction operators can be regarded as the preprocessing stage of data analysis [20], which attempts to extract useful data from the raw data (also called the primary data) and refine them so that they can be used by the subsequent analyses. If the data contain duplicate copies, or are incomplete, inconsistent, noisy, or contain outliers, these operators have to clean them up. If the data are too complex or too large to be handled, these operators also try to reduce them. If the raw data contain errors or omissions, the roles of these operators are to identify and reconcile them. These operators may affect the analytics result of KDD, positively or negatively. In summary, such systematic solutions usually aim to reduce the complexity of the data to accelerate the computation time of KDD and to improve the accuracy of the analytics result.
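
To make these operators concrete, here is a minimal Python sketch of cleaning, transformation, and reduction on a handful of toy records; the field names and cleaning rules are assumptions for illustration only, not part of the KDD definition.

```python
from statistics import mean

raw = [
    {"id": 1, "temp": 21.5, "unit": "C"},
    {"id": 1, "temp": 21.5, "unit": "C"},   # duplicate copy
    {"id": 2, "temp": None, "unit": "C"},   # incomplete record
    {"id": 3, "temp": 70.7, "unit": "F"},   # inconsistent unit
]

# Cleaning: drop duplicates and incomplete records
seen, cleaned = set(), []
for rec in raw:
    key = (rec["id"], rec["temp"], rec["unit"])
    if rec["temp"] is not None and key not in seen:
        seen.add(key)
        cleaned.append(rec)

# Transformation: bring every record into one uniform, mining-capable format
for rec in cleaned:
    if rec["unit"] == "F":
        rec["temp"] = (rec["temp"] - 32) * 5 / 9
        rec["unit"] = "C"

# Reduction: a single summary feature in place of the raw records
print(len(cleaned), round(mean(r["temp"] for r in cleaned), 2))
```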

Data analysis

Since the data analysis step (as shown in Fig. 3) of KDD is responsible for finding the hidden patterns, rules, and information in the data, most researchers in this field use the term data mining to describe how they refine the "ground" (i.e., raw data) into the "gold nugget" (i.e., information or knowledge). The data mining methods [20] are not limited to problem-specific methods. In fact, other technologies (e.g., statistical or machine learning technologies) have been used to analyze data for many years. In the early stages of data analysis, statistical methods were used to analyze the data and help us understand the situation we are facing, such as public opinion polls or TV programme ratings. Like statistical analysis, the problem-specific methods for data mining also attempt to understand the meaning of the collected data.

After the data mining problem was formulated, some domain-specific algorithms were also developed. An example is the apriori algorithm [21], one of the useful algorithms designed for the association rule problem. Although most definitions of data mining problems are simple, their computation costs are quite high. To speed up the response time of a data mining operator, machine learning [22], metaheuristic algorithms [23], and distributed computing [24] have been used alone or combined with traditional data mining algorithms to provide more efficient ways of solving the data mining problem. One well-known combination can be found in [25], where Krishna and Murty combined a genetic algorithm with k-means to obtain better clustering results than k-means alone.

Data mining algorithm

As Fig. 4 shows, most data mining algorithms contain initialization, data input and output, data scan, rule construction, and rule update operators [26]. In Fig. 4, D represents the raw data, d the data from the scan operator, r the rules, o the predefined measurement, and v the candidate rules. The scan, construct, and update operators are performed repeatedly until the termination criterion is met. The timing of the scan operator depends on the design of the data mining algorithm; thus, it can be considered an optional operator. Most data mining algorithms can be described by Fig. 4, which also shows that the representative algorithms (clustering, classification, association rules, and sequential patterns) apply these operators to find the hidden information in the raw data. Thus, modifying these operators is one possible way to enhance the performance of the data analysis.
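
The control flow described above can be summarized in a short Python skeleton; the operator bodies are deliberately left as parameters to be supplied by a concrete algorithm (an assumption on our part), since Fig. 4 fixes only the loop structure.

```python
def mine(D, init_rules, scan, construct, update, done):
    """Generic skeleton of the loop in Fig. 4: scan, construct, update until done."""
    r = init_rules(D)          # initialization
    while not done(r):
        d = scan(D, r)         # (optional) pass over the raw data D
        v = construct(d, r)    # build candidate rules v from the scanned data
        r = update(r, v)       # keep the candidates that satisfy the measurement o
    return r
```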

Clustering is one of the well-known data mining problems because it can be used to understand "new" input data. The basic idea of this problem [27] is to separate a set of unlabeled input data Footnote 2 into k different groups, as k-means does [28]. Classification [20] is the opposite of clustering because it relies on a set of labeled input data to construct a set of classifiers (i.e., groups), which are then used to classify the unlabeled input data into the groups to which they belong. To solve the classification problem, decision tree-based algorithms [29], naïve Bayesian classification [30], and support vector machines (SVM) [31] have been widely used in recent years.
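
As a simple, self-contained example from the clustering family, the following NumPy sketch implements plain k-means on toy two-dimensional data; it is a minimal textbook version for illustration, with the data and parameters chosen arbitrarily.

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Plain k-means: assign each point to its nearest centroid, then re-average."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # distances of every point to every centroid, then nearest-centroid labels
        d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

# Two well-separated toy clusters around 0 and 5
X = np.vstack([np.random.default_rng(1).normal(loc=c, size=(50, 2)) for c in (0, 5)])
labels, centroids = kmeans(X, k=2)
print(centroids.round(1))
```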

Unlike clustering and classification, which attempt to assign the input data to k groups, association rules and sequential patterns focus on finding "relationships" between the input data. The basic idea of association rules [21] is to find all the co-occurrence relationships in the input data. For the association rule problem, the apriori algorithm [21] is one of the most popular methods. Nevertheless, because it is computationally very expensive, later studies [32] have attempted to reduce its cost with different approaches, such as applying a genetic algorithm to this problem [33]. If, in addition to the relationships between the input data, we also consider their sequence or time series, the task is referred to as the sequential pattern mining problem [34]. Several apriori-like algorithms have been presented to solve it, such as generalized sequential patterns [34] and sequential pattern discovery using equivalence classes [35].
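
The apriori principle itself can be illustrated in a few lines: single items that are infrequent cannot appear in any frequent pair, so they are pruned before pairs are counted. The toy transactions and support threshold below are assumptions for illustration, not data from the cited studies.

```python
from itertools import combinations
from collections import Counter

transactions = [{"milk", "bread"}, {"milk", "eggs"},
                {"milk", "bread", "eggs"}, {"bread"}]
min_support = 2

# Frequent single items (apriori pruning step)
item_counts = Counter(i for t in transactions for i in t)
frequent_items = {i for i, c in item_counts.items() if c >= min_support}

# Count only pairs built from frequent items
pair_counts = Counter(
    pair
    for t in transactions
    for pair in combinations(sorted(frequent_items & t), 2)
)
frequent_pairs = {p: c for p, c in pair_counts.items() if c >= min_support}
print(frequent_pairs)   # {('bread', 'milk'): 2, ('eggs', 'milk'): 2}
```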

Output the result

Evaluation and interpretation are two vital operators of the output. Evaluation typically plays the role of measuring the results. It can also be one of the operators for the data mining algorithm, such as the sum of squared errors which was used by the selection operator of the genetic algorithm for the clustering problem [ 25 ].

To solve the data mining problems that attempt to classify the input data, two of the major goals are: (1) cohesion, meaning that the distance between each datum and the centroid (mean) of its cluster should be as small as possible, and (2) coupling, meaning that the distance between data belonging to different clusters should be as large as possible. In most studies of data clustering or classification problems, the sum of squared errors (SSE), which is used to measure the cohesion of the data mining results, can be defined as

$$\mathrm{SSE} = \sum_{i=1}^{k} \sum_{j=1}^{n_i} D\left(x_{ij}, c_i\right)^2,$$

where k is the number of clusters, which is typically given by the user; \(n_i\) is the number of data in the i-th cluster; \(x_{ij}\) is the j-th datum in the i-th cluster; \(c_i\) is the mean of the i-th cluster; and \(n= \sum ^k_{i=1} n_i\) is the total number of data. The most commonly used distance measure for the data mining problem is the Euclidean distance, which is defined as

$$D(p_i, p_j) = \left( \sum_{l=1}^{m} \left| p_{il} - p_{jl} \right|^{2} \right)^{1/2},$$

where \(p_i\) and \(p_j\) are the positions of two different data, each with m attributes. For solving different data mining problems, the distance measurement \(D(p_i, p_j)\) can also be the Manhattan distance, the Minkowski distance, or even the cosine similarity [36] between two different documents.

Accuracy (ACC) is another well-known measurement [37], which is defined as

$$\mathrm{ACC} = \frac{\text{Number of correctly classified data}}{\text{Total number of data}}.$$

To evaluate the classification results, precision (p), recall (r), and the F-measure can be used to measure how many data that do not belong to group A are incorrectly classified into group A, and how many data that belong to group A are not classified into group A. A simple confusion matrix of a classifier [37], as given in Table 1, can be used to cover all the situations of the classification results.

In Table 1, TP and TN indicate the numbers of positive examples and negative examples that are correctly classified, respectively; FN and FP indicate the numbers of positive examples and negative examples that are incorrectly classified, respectively. With the confusion matrix at hand, it is much easier to describe the meaning of precision (p), which is defined as

$$p = \frac{\mathrm{TP}}{\mathrm{TP} + \mathrm{FP}},$$

and the meaning of recall (r), which is defined as

$$r = \frac{\mathrm{TP}}{\mathrm{TP} + \mathrm{FN}}.$$

The F-measure can then be computed as

$$F = \frac{2pr}{p + r}.$$

In addition to the above-mentioned measurements for evaluating the data mining results, computation cost and response time are two other well-known measurements. When two different mining algorithms can find the same or similar results, how fast they can produce the final mining results naturally becomes the most important research topic.
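
A short Python sketch that computes these measures from a confusion matrix is given below; the toy label vectors are assumptions for illustration only.

```python
def confusion(y_true, y_pred, positive=1):
    """Count TP, TN, FP, FN for a binary classification result."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    tn = sum(t != positive and p != positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    return tp, tn, fp, fn

y_true = [1, 1, 0, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]
tp, tn, fp, fn = confusion(y_true, y_pred)

acc = (tp + tn) / len(y_true)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f_measure = 2 * precision * recall / (precision + recall)
print(acc, precision, recall, f_measure)   # 0.75 for each on this toy data
```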

After something (e.g., classification rules) is found by data mining methods, two essential research topics remain: (1) navigating and exploring the meaning of the results of the data analysis to further support the user in making an applicable decision, which can be regarded as the interpretation operator [38] and which in most cases provides a useful interface to display the information [39]; and (2) producing a meaningful summarization of the mining results [40] to make it easier for the user to understand the information from the data analysis. Data summarization is generally expected to be one of the simplest ways to provide a concise piece of information to the user, because humans have trouble understanding vast amounts of complicated information. A simple example of data summarization can be found in a clustering search engine: when the query "oasis" is sent to Carrot2 ( http://search.carrot2.org/stable/search ), it returns some keywords to represent each group of the clustering results for web links, helping the user recognize which category is needed, as shown on the left side of Fig. 5.

Fig. 5 Screenshot of the results of the clustering search engine

A useful graphical user interface is another way to provide meaningful information to the user. As explained by Shneiderman in [39], we need "overview first, zoom and filter, then retrieve the details on demand". A useful graphical user interface [38, 41] also makes it easier for the user to comprehend the meaning of the results when the number of dimensions is higher than three. How the results of data mining are displayed will affect the user's perspective when making decisions. For instance, data mining can help us find "type A influenza" in a particular region, but without the time series and flu-virus infection information of patients, the government cannot recognize which situation (pandemic or controlled) it is facing and thus cannot make appropriate responses. For this reason, a better solution that merges the information from different sources and mining algorithm results will help the user make the right decision.

Since the problems of handling and analyzing large-scale and complex input data have always existed in data analytics, several efficient analysis methods have been presented to accelerate the computation time or to reduce the memory cost of the KDD process, as shown in Table 2. The study of [42] shows that basic mathematical concepts (i.e., the triangle inequality) can be used to reduce the computation cost of a clustering algorithm. Another study [43] shows that new technologies (i.e., distributed computing by GPU) can also be used to reduce the computation time of a data analysis method. In addition to the well-known improvements of these analysis methods (e.g., the triangle inequality or distributed computing), a large proportion of studies design their efficient methods based on the characteristics of the mining algorithms or of the problem itself, as can be found in [32, 44, 45], and so forth. This kind of improved method is typically designed to address a drawback of the mining algorithm or to solve the mining problem in a different way. Such situations can be found in most association rule and sequential pattern problems because the original assumption of these problems is the analysis of large-scale datasets, and the earlier frequent pattern algorithms (e.g., the apriori algorithm) need to scan the whole dataset many times, which is computationally very expensive. How to reduce the number of times the whole dataset is scanned, so as to save the computation cost, is therefore one of the most important issues in all frequent pattern studies. A similar situation also exists in data clustering and classification studies; design concepts of earlier algorithms, such as mining the patterns on-the-fly [46], mining partial patterns at different stages [47], and reducing the number of times the whole dataset is scanned [32], were therefore presented to enhance the performance of these mining algorithms. Since some data mining problems are NP-hard [48] or their solution space is very large, several recent studies [23, 49] have attempted to use metaheuristic algorithms as the mining algorithm to obtain an approximate solution within a reasonable time.

Abundant research results on data analysis [20, 27, 63] show possible solutions for dealing with the dilemmas of data mining algorithms. This means that the open issues of data analysis reported in the literature [2, 64] can usually help us find possible solutions. For instance, the clustering result is extremely sensitive to the initial means, which can be mitigated by using multiple sets of initial means [65]. According to our observation, most data analysis methods have limitations for big data, which can be described as follows:

Unscalability and centralization. Most data analysis methods are not designed for large-scale and complex datasets. Traditional data analysis methods cannot be scaled up because their design does not take large or complex datasets into account. The design of traditional data analysis methods typically assumes that they will be performed on a single machine, with all the data in memory for the data analysis process. For this reason, the performance of traditional data analytics is limited in solving the volume problem of big data.

Non-dynamic. Most traditional data analysis methods cannot be dynamically adjusted to different situations, meaning that they do not analyze the input data on the fly. For example, the classifiers are usually fixed and cannot be changed automatically. Incremental learning [66] is a promising research trend because it can dynamically adjust the classifiers during the training process with limited resources. As a result, traditional data analytics may not be useful for the velocity problem of big data.

Uniform data structure. Most data mining problems assume that the format of the input data is the same. Therefore, traditional data mining algorithms may not be able to deal with the situation in which the formats of different input data differ and some of the data are incomplete. How to bring the input data from different sources into the same format will be a possible solution to the variety problem of big data.

Because traditional data analysis methods are not designed for large-scale and complex data, they are almost incapable of analyzing big data. Redesigning the data analysis methods and changing the way they are designed are two critical trends for big data analysis. Several important concepts in the design of big data analysis methods are given in the following sections.

Big data analytics

Nowadays, the data that need to be analyzed are not just large; they are composed of various data types and even include streaming data [67]. Big data has the unique features of being "massive, high dimensional, heterogeneous, complex, unstructured, incomplete, noisy, and erroneous," which may change the statistical and data analysis approaches [68]. Although it seems that big data makes it possible to collect more data and find more useful information, the truth is that more data do not necessarily mean more useful information; they may contain more ambiguous or abnormal data. For instance, a user may have multiple accounts, or an account may be used by multiple users, which may degrade the accuracy of the mining results [69]. Therefore, several new issues for data analytics arise, such as privacy, security, storage, fault tolerance, and quality of data [70].

Fig. 6 The comparison between traditional data analysis and big data analysis in a wireless sensor network

Big data may be created by handheld devices, social networks, the internet of things, multimedia, and many other new applications that all have the characteristics of volume, velocity, and variety. As a result, the whole data analytics process has to be re-examined from the following perspectives:

From the volume perspective, the deluge of input data is the very first thing we need to face, because it may paralyze the data analytics. Different from traditional data analytics, for wireless sensor network data analysis, Baraniuk [71] pointed out that the bottleneck of big data analytics will shift from the sensor to the processing, communications, and storage of sensing data, as shown in Fig. 6. This is because sensors can gather much more data, but uploading such large data to an upper-layer system may create bottlenecks everywhere.

In addition, from the velocity perspective, real-time or streaming data bring up the problem that a large quantity of data comes into the data analytics within a short duration, but the device and system may not be able to handle these input data. This situation is similar to that of network flow analysis, for which we typically cannot mirror and analyze everything we can gather.

From the variety perspective, because the incoming data may be of different types or incomplete, how to handle them also raises another issue for the input operators of data analytics.

In this section, we will turn the discussion to the big data analytics process.

Big data input

The problem of handling a vast quantity of data that the system is unable to process is not a brand-new research issue; in fact, it appeared in several early applications [2, 21, 72], e.g., marketing analysis, network flow monitoring, gene expression analysis, weather forecasting, and even astronomy analysis. This problem still exists in big data analytics today; thus, preprocessing is an important task that makes the computer, platform, and analysis algorithm able to handle the input data. The traditional data preprocessing methods [73] (e.g., compression, sampling, feature selection, and so on) are expected to be able to operate effectively in the big data age. However, a portion of the studies still focus on how to reduce the complexity of the input data, because even the most advanced computer technology cannot efficiently process the whole input data using a single machine in most cases. Using domain knowledge to design the preprocessing operator is a possible solution for big data. In [74], Ham and Lee used domain knowledge, the B-tree, and divide-and-conquer to filter unrelated log information for mobile web log analysis. A later study [75] considered that the computation cost of preprocessing is quite high for massive log, sensor, or marketing data analysis; thus, Dawelbeit and McCrindle employed a bin-packing partitioning method to divide the input data between the computing processors to handle the high computation cost of preprocessing on a cloud system. The cloud system is employed to preprocess the raw data and then output the refined data (e.g., data with a uniform format) to make it easier for the data analysis method or system to perform the further analysis work.
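
As a rough illustration of partitioning preprocessing work across processors, the sketch below uses a greedy first-fit-decreasing bin-packing heuristic; it is a generic textbook heuristic under assumed chunk sizes and worker capacity, not the specific partitioning method of [75].

```python
def first_fit_decreasing(chunk_sizes, capacity):
    """Assign data chunks to workers so no worker exceeds `capacity` units of work."""
    bins = []  # each bin is [remaining_capacity, [assigned chunk sizes]]
    for size in sorted(chunk_sizes, reverse=True):
        for b in bins:
            if b[0] >= size:          # first worker that still has room
                b[0] -= size
                b[1].append(size)
                break
        else:                          # no worker fits: open a new one
            bins.append([capacity - size, [size]])
    return [b[1] for b in bins]

# Toy log-chunk sizes (in GB) and a per-worker budget of 10 GB
print(first_fit_decreasing([7, 5, 4, 3, 2, 2, 1], capacity=10))
```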

Sampling and compression are two representative data reduction methods for big data analytics because reducing the size of the data makes the data analytics computationally less expensive and thus faster, especially for data coming into the system rapidly. In addition to making the sampled data represent the original data effectively [76], how many instances need to be selected for the data mining method is another research issue [77], because in most cases it affects the performance of the sampling method.
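
For data that arrive as a stream, reservoir sampling is one classic way to keep a uniform sample without knowing the stream length in advance; the sketch below is a generic textbook version, not a method proposed in the surveyed studies.

```python
import random

def reservoir_sample(stream, k, seed=0):
    """Keep a uniform random sample of k items from a stream of unknown length."""
    rng = random.Random(seed)
    reservoir = []
    for n, item in enumerate(stream):
        if n < k:
            reservoir.append(item)
        else:
            j = rng.randint(0, n)   # replace an old item with decreasing probability
            if j < k:
                reservoir[j] = item
    return reservoir

print(reservoir_sample(range(1_000_000), k=5))
```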

To avoid the application-level slow-down caused by the compression process, Jun et al. [78] attempted to use an FPGA to accelerate the compression process. I/O performance optimization is another issue for the compression method. For this reason, Zou et al. [79] employed tentative selection and predictive dynamic selection and switched between the appropriate compression methods of two different strategies to improve the performance of the compression process. To make it possible for the compression method to compress the data efficiently, a promising solution is to apply a clustering method to the input data to divide them into several different groups and then compress these input data according to the clustering information. The compression method described in [80] is one such solution: it first clusters the input data and then compresses them via the clustering results, while the study [81] also used a clustering method to improve the performance of the compression process.
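
The cluster-then-compress idea can be sketched as follows; grouping records by a simple key stands in for a real clustering step, so this is only a toy illustration of the intuition behind [80] and [81], not their actual schemes.

```python
import zlib

# Toy sensor records; in a real system these would come from heterogeneous sources
records = [f"sensor-{i % 3},temp,{20 + i % 3}" for i in range(3000)]

# Baseline: compress the whole stream in one block
whole = zlib.compress("\n".join(records).encode())

# Cluster-then-compress: group similar records first (here, by sensor id),
# so each compressed block contains highly redundant data
groups = {}
for r in records:
    groups.setdefault(r.split(",")[0], []).append(r)
clustered = [zlib.compress("\n".join(g).encode()) for g in groups.values()]

# Compare the two total compressed sizes
print(len(whole), sum(len(c) for c in clustered))
```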

In summary, in addition to handling large and fast data input, research issues such as heterogeneous data sources, incomplete data, and noisy data may also affect the performance of the data analysis. The input operators will have a stronger impact on data analytics in the big data age than they had in the past. As a result, the design of big data analytics needs to consider how to make these tasks (e.g., data cleaning, data sampling, data compression) work well.

Big data analysis frameworks and platforms

Various solutions have been presented for big data analytics, which can be divided [82] into (1) processing/compute: Hadoop [83], Nvidia CUDA [84], or Twitter Storm [85]; (2) storage: Titan or HDFS; and (3) analytics: MLPACK [86] or Mahout [87]. Although commercial products for data analysis exist [83-86], most studies on traditional data analysis focus on the design and development of efficient and/or effective "ways" to find useful things in the data. But as we enter the age of big data, most current computer systems will not be able to handle the whole dataset all at once; thus, how to design a good data analytics framework or platform Footnote 3 and how to design analysis methods are both important for the data analysis process. In this section, we start with a brief introduction to data analysis frameworks and platforms, followed by a comparison of them.

Fig. 7 The basic idea of big data analytics on a cloud system

Research on frameworks and platforms

To date, we can easily find tools and platforms presented by well-known organizations. Cloud computing technologies are widely used in these platforms and frameworks to satisfy the large demands for computing power and storage. As shown in Fig. 7, most of the work of KDD for big data can be moved to a cloud system to speed up the response time or to increase the memory space. With the advance of these works, handling and analyzing big data within a reasonable time is no longer so far away. Since the foundational functions for handling and managing big data have developed gradually, data scientists nowadays do not have to take care of everything, from raw data gathering to data analysis, by themselves if they use existing platforms or technologies to handle and manage the data. Data scientists can now pay more attention to finding useful information in the data, even though this task is typically like looking for a needle in a haystack. That is why several recent studies have tried to present efficient and effective frameworks to analyze big data, especially for finding useful things.

Performance-oriented. From the perspective of platform performance, Huai [88] pointed out that most traditional parallel processing models improve the performance of the system by replacing the old computer system with a new, larger one, which is usually referred to as "scale up", as shown in Fig. 8a. For big data analytics, however, most research improves the performance of the system by adding more similar computer systems, making it possible for a system to handle all the tasks that cannot be loaded or computed in a single computer system (called "scale out"), as shown in Fig. 8b, where M1, M2, and M3 represent computer systems with different computing power. For the scale-up-based solution, the computing power of the three systems is in the order \(\text {M3}>\text {M2}>\text {M1}\); for the scale-out-based system, all we have to do is keep adding similar computer systems to increase its capability. To build a scalable and fault-tolerant manager for big data analysis, Huai et al. [88] presented a matrix model, called DOT, which consists of three matrices: the data set (D), the concurrent data processing operations (O), and the data transformations (T). The big data is divided into n subsets, each of which is processed by a computer node (worker) in such a way that all the subsets are processed concurrently; the results from these n computer nodes are then collected and transformed by another computer node. Using this framework, the whole data analysis framework is composed of several DOT blocks, and the system performance can easily be enhanced by adding more DOT blocks.
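
A minimal Python sketch of the underlying divide/process-concurrently/collect pattern is shown below; the per-subset operation and the final transformation (simple sums) are illustrative assumptions, not the operators defined in [88].

```python
from multiprocessing import Pool

def local_op(subset):
    """Concurrent per-subset operation (O): here, a simple partial sum."""
    return sum(subset)

def transform(partials):
    """Final transformation (T): combine the partial results."""
    return sum(partials)

if __name__ == "__main__":
    data = list(range(1_000_000))             # data set (D)
    n = 4
    subsets = [data[i::n] for i in range(n)]  # divide D into n subsets
    with Pool(n) as pool:                     # one worker per subset
        partials = pool.map(local_op, subsets)
    print(transform(partials))                # collected and transformed result
```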

Fig. 8 The comparison between scale up and scale out

Another efficient big data analytics system, called the generalized linear aggregates distributed engine (GLADE), was presented in [ 89 ]. GLADE is a multi-level tree-based data analytics system which consists of two types of computer nodes: a coordinator and workers. The simulation results [ 90 ] show that GLADE can provide better performance than Hadoop in terms of execution time. Because Hadoop requires large memory and storage for data replication and has a single master, Footnote 4 Essa et al. [ 91 ] presented a mobile agent based framework, called map reduce agent mobility (MRAM), to solve these two problems. The main reason is that each mobile agent can send its code and data to any other machine; therefore, the whole system will not go down if the master fails. Compared to Hadoop, the architecture of MRAM was changed from client/server to distributed agents. The load time for MRAM is less than that of Hadoop even though both of them use the map-reduce solution and the Java language. In [ 92 ], Herodotou et al. considered the issues of user needs and system workloads. They presented a self-tuning analytics system built on Hadoop for big data analysis. Since one of the major goals of their system is to adjust itself to the user needs and system workloads to provide good performance automatically, the user usually does not need to understand and manipulate the Hadoop system. The study [ 93 ] took the perspectives of data centric architecture and operational models to present a big data architecture framework (BDAF) which includes: big data infrastructure, big data analytics, data structures and models, big data lifecycle management, and big data security. According to the observations of Demchenko et al. [ 93 ], cluster services, Hadoop related services, data analytics tools, databases, servers, and massively parallel processing databases are typically the required applications and services in a big data analytics infrastructure.

Result-oriented Fisher et al. [ 5 ] presented a big data pipeline to show the workflow of big data analytics for extracting valuable knowledge from big data, which consists of acquiring the data, choosing an architecture, shaping the data into the architecture, coding/debugging, and reflecting on the results. From the perspectives of statistical computation and data mining, Ye et al. [ 94 ] presented an architecture of a services platform which integrates R to provide better data analysis services, called the cloud-based big data mining and analyzing services platform (CBDMASP). The design of this platform is composed of four layers: the infrastructure services layer, the virtualization layer, the dataset processing layer, and the services layer. Several large-scale clustering problems (with datasets ranging in size from 0.1 G up to 25.6 G) were also used to evaluate the performance of the CBDMASP. The simulation results show that using map-reduce is much faster than using a single machine when the input data become too large. Although the size of the test dataset cannot be regarded as big data, this kind of testing shows that the performance of big data analytics can be sped up by using map-reduce. In this study, map-reduce is the better solution when the dataset is larger than 0.2 G, and a single machine is unable to handle a dataset larger than 1.6 G.

Another study [ 95 ] presented a theorem to explain the big data characteristics, called HACE: big data usually involves large-volume, Heterogeneous, Autonomous sources with distributed and decentralized control, and we usually try to find useful and interesting things in the Complex and Evolving relationships of the data. Based on these concerns and data mining issues, Wu and his colleagues [ 95 ] also presented a big data processing framework which includes a data accessing and computing tier, a data privacy and domain knowledge tier, and a big data mining algorithm tier. This work explains that data mining algorithms will become much more important and much more difficult to design; thus, challenges will also occur in the design and implementation of a big data analytics platform. In addition to the platform performance and data mining issues, the privacy issue of big data analytics has become a promising research topic in recent years. In [ 96 ], Laurila et al. explained that privacy is an essential problem when we try to find something from the data gathered from mobile devices; thus, data security and data anonymization should also be considered in analyzing this kind of data. Demirkan and Delen [ 97 ] presented a service-oriented decision support system (SODSS) for big data analytics which includes information source, data management, information management, and operations management.

Comparison between the frameworks/platforms of big data

In [ 98 ], Talia pointed out that cloud-based data analytics services can be divided into data analytics software as a service, data analytics platform as a service, and data analytics infrastructure as a service. A later study [ 99 ] presented a general architecture of big data analytics which contains multi-source big data collecting, distributed big data storing, and intra/inter big data processing. Since many kinds of data analytics frameworks and platforms have been presented, some studies attempted to compare them to give guidance on choosing the applicable frameworks or platforms for relevant works. To give a brief introduction to big data analytics, especially the platforms and frameworks, in [ 100 ], Cuzzocrea et al. first discuss how recent studies responded to the “computational emergency” issue of big data analytics. Some open issues, such as data source heterogeneity and uncorrelated data filtering, and possible research directions are also given in the same study. In [ 101 ], Zhang and Huang used the 5Ws model to explain what kind of framework and method we need for different big data approaches. Zhang and Huang further explained that the 5Ws model represents what kind of data, why we have these data, where the data come from, when the data occur, who receives the data, and how the data are transferred. A later study [ 102 ] used several features (i.e., owner, workload, source code, low latency, and complexity) to compare the frameworks of Hadoop [ 83 ], Storm [ 85 ], and Drill [ 103 ]. From this comparison, it can be easily seen that the framework of Apache Hadoop has high latency compared with the other two frameworks. To better understand the strong and weak points of big data solutions, Chalmers et al. [ 82 ] then employed volume, variety, variability, velocity, user skill/experience, and infrastructure to evaluate eight big data analytics solutions.

In [ 104 ], in addition to defining that a big data system should include data generation, data acquisition, data storage, and data analytics modules, Hu et al. also mentioned that a big data system can be decomposed into infrastructure, computing, and application layers. Moreover, promising research on NoSQL storage systems was also discussed in this study, which can be divided into key-value, column, document, and graph databases. Since big data analysis is generally regarded as a computationally expensive task, the high performance computing cluster system (HPCC) is also a possible solution in the early stage of big data analytics. Sagiroglu and Sinanc [ 105 ] therefore compared the characteristics of HPCC and Hadoop. They then emphasized that the HPCC system uses multikey and multivariate indexes on a distributed file system while Hadoop uses a column-oriented database. In [ 17 ], Chen et al. give a brief introduction to the big data analytics of business intelligence (BI) from the perspective of evolution, applications, and emerging research topics. In their survey, Chen et al. explained that the evolution of business intelligence and analytics (BI&A) has gone from BI&A 1.0 and BI&A 2.0 to BI&A 3.0, which are characterized by DBMS-based structured content, web-based unstructured content, and mobile and sensor based content, respectively.

Big data analysis algorithms

Mining algorithms for specific problems

Because big data issues have been around for nearly ten years, in [ 106 ], Fan and Bifet pointed out that the terms “big data” [ 107 ] and “big data mining” [ 108 ] were both first presented in 1998. That big data and big data mining appeared at almost the same time suggests that finding something useful from big data has been one of the major tasks in this research domain from the start. Data mining algorithms for data analysis also play a vital role in big data analysis, in terms of the computation cost, memory requirement, and accuracy of the end results. In this section, we give a brief discussion from the perspective of analysis and search algorithms to explain their importance for big data analytics.

Clustering algorithms In the big data age, traditional clustering algorithms will become even more limited than before because they typically require that all the data be in the same format and be loaded into the same machine in order to find something useful from the whole data. Although the problem [ 64 ] of analyzing large-scale and high-dimensional datasets has attracted many researchers from various disciplines in the last century, and several solutions [ 2 , 109 ] have been presented in recent years, the characteristics of big data still bring up several new challenges for data clustering. Among them, how to reduce the data complexity is one of the important issues for big data clustering. In [ 110 ], Shirkhorshidi et al. divided big data clustering into two categories: single-machine clustering (i.e., sampling and dimension reduction solutions) and multiple-machine clustering (parallel and MapReduce solutions). This means that traditional reduction solutions can also be used in the big data age because the complexity and memory space needed for the data analysis process will be decreased by using sampling and dimension reduction methods. More precisely, sampling can be regarded as reducing the “amount of data” entered into a data analyzing process while dimension reduction can be regarded as “downsizing the whole dataset” because irrelevant dimensions will be discarded before the data analyzing process is carried out.
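As a minimal single-machine illustration of the two reduction strategies above, the following sketch (which assumes NumPy and scikit-learn are available; the sizes and parameters are arbitrary) samples a large dataset and reduces its dimensionality with PCA before running k-means:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = rng.normal(size=(100_000, 50))          # a large, high-dimensional dataset

# Sampling: reduce the "amount of data" fed into the analysis.
sample_idx = rng.choice(len(X), size=10_000, replace=False)
X_sample = X[sample_idx]

# Dimension reduction: discard nearly irrelevant dimensions before clustering.
X_reduced = PCA(n_components=10).fit_transform(X_sample)

# Cluster the reduced sample; the fitted centroids can then label the full dataset.
kmeans = KMeans(n_clusters=5, n_init=10, random_state=0).fit(X_reduced)
print(kmeans.cluster_centers_.shape)        # (5, 10)
```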

CloudVista [ 111 ] is a representative solution for clustering big data which uses cloud computing to perform the clustering process in parallel. BIRCH [ 44 ] and a sampling method were used in CloudVista to show that it is able to handle large-scale data, e.g., 25 million census records. Using a GPU to enhance the performance of a clustering algorithm is another promising solution for big data mining. The multiple species flocking (MSF) algorithm [ 112 ] was applied to the CUDA platform from NVIDIA to reduce the computation time of the clustering algorithm in [ 113 ]. The simulation results show that the speedup factor can be increased from 30 up to 60 by using the GPU for data clustering. Since most traditional clustering algorithms (e.g., k-means) require centralized computation, how to make them capable of handling big data clustering problems is the major concern of Feldman et al. [ 114 ], who use a tree construction for generating the coresets in parallel, called the “merge-and-reduce” approach. Moreover, Feldman et al. pointed out that by using this solution for clustering, the update time per datum and the memory of the traditional clustering algorithms can be significantly reduced.
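The merge-and-reduce idea can be illustrated with the following simplified, single-machine sketch; it is not Feldman et al.'s actual coreset construction, only an analogous scheme in which each chunk is summarized by weighted local centers and the summaries are then merged and clustered again (NumPy and scikit-learn are assumed):

```python
import numpy as np
from sklearn.cluster import KMeans

def summarize_chunk(chunk, k=10):
    """Summarize one data chunk by k weighted centers (a crude coreset surrogate)."""
    km = KMeans(n_clusters=k, n_init=5, random_state=0).fit(chunk)
    weights = np.bincount(km.labels_, minlength=k)
    return km.cluster_centers_, weights

def merge_and_reduce(chunks, k_final=5):
    """Merge the per-chunk summaries and cluster them into the final k centers."""
    centers, weights = zip(*(summarize_chunk(c) for c in chunks))
    all_centers = np.vstack(centers)
    all_weights = np.concatenate(weights)
    final = KMeans(n_clusters=k_final, n_init=5, random_state=0)
    final.fit(all_centers, sample_weight=all_weights)
    return final.cluster_centers_

rng = np.random.default_rng(1)
data = rng.normal(size=(40_000, 8))
chunks = np.array_split(data, 8)            # in a real system, one chunk per worker
print(merge_and_reduce(chunks).shape)       # (5, 8)
```

The point of the design is that each chunk summary is small, so the summaries can be computed in parallel and merged up a tree without ever loading the whole dataset on one node.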

Classification algorithms Similar to the clustering algorithms for big data mining, several studies also attempted to modify traditional classification algorithms to make them work in a parallel computing environment or to develop new classification algorithms which work naturally in a parallel computing environment. In [ 115 ], the design of the classification algorithm took into account input data that are gathered by distributed data sources and processed by a heterogeneous set of learners. Footnote 5 In this study, Tekin et al. presented a novel classification algorithm called “classify or send for classification” (CoS). They assumed that each learner can process the input data in two different ways in a distributed data classification system: one is to perform a classification function by itself while the other is to forward the input data to another learner to have them labeled. Information is exchanged between the different learners. In brief, this kind of solution can be regarded as cooperative learning to improve the accuracy in solving the big data classification problem. An interesting solution uses quantum computing to reduce the memory space and computing cost of a classification algorithm. For example, in [ 116 ], Rebentrost et al. presented a quantum-based support vector machine for big data classification and argued that the classification algorithm they proposed can be implemented with a time complexity \(O(\log NM)\) , where N is the number of dimensions and M is the number of training data. There are bright prospects for big data mining using quantum-based search algorithms once the hardware of quantum computing has matured.
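The following toy sketch captures only the “classify or send” intuition, not Tekin et al.'s actual algorithm: a learner labels an instance itself when its confidence exceeds an assumed threshold and otherwise forwards the instance to another learner (scikit-learn is assumed):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=2000, n_features=10, random_state=0)

# Two learners, each trained on its own local portion of the distributed data.
learner_a = LogisticRegression(max_iter=1000).fit(X[:800], y[:800])
learner_b = LogisticRegression(max_iter=1000).fit(X[800:1600], y[800:1600])

def classify_or_send(x, own, other, threshold=0.8):
    """Classify locally if confident enough; otherwise send to the other learner."""
    proba = own.predict_proba(x.reshape(1, -1))[0]
    if proba.max() >= threshold:
        return int(np.argmax(proba)), "local"
    return int(other.predict(x.reshape(1, -1))[0]), "sent"

test = X[1600:]
decisions = [classify_or_send(x, learner_a, learner_b) for x in test]
sent_ratio = sum(1 for _, where in decisions if where == "sent") / len(decisions)
print(f"forwarded {sent_ratio:.0%} of the test instances")
```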

Frequent pattern mining algorithms Most of the research on frequent pattern mining (i.e., association rules and sequential pattern mining) focused on handling large-scale datasets from the very beginning, because some of the early approaches attempted to analyze the transaction data of large shopping malls. Because the number of transactions is usually more than “tens of thousands”, the issue of how to handle large-scale data has been studied for several years; for example, FP-tree [ 32 ] uses a tree structure to store the frequent patterns to further reduce the computation time of association rule mining. In addition to the traditional frequent pattern mining algorithms, parallel computing and cloud computing technologies have of course also attracted researchers in this research domain. Among them, the map-reduce solution was used in the studies [ 117 – 119 ] to enhance the performance of frequent pattern mining algorithms. Given the use of the map-reduce model for frequent pattern mining algorithms, it can be easily expected that their application to the “cloud platform” [ 120 , 121 ] will become a popular trend in the near future. The study of [ 119 ] not only used the map-reduce model, it also allowed users to express their specific interest constraints in the process of frequent pattern mining. The performance of these map-reduce based methods for big data analysis is, no doubt, better than that of traditional frequent pattern mining algorithms running on a single machine.
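A minimal sketch of the counting step on which such map-reduce approaches build is shown below; it performs a single map/reduce pass over toy transactions and keeps the itemsets that reach an assumed minimum support, whereas full distributed algorithms add further phases:

```python
from collections import Counter
from itertools import combinations

transactions = [
    {"milk", "bread", "beer"},
    {"milk", "bread"},
    {"bread", "beer"},
    {"milk", "beer"},
]

def map_phase(transaction):
    """Map: emit (itemset, 1) for every 1- and 2-itemset in one transaction."""
    items = sorted(transaction)
    singles = [((i,), 1) for i in items]
    pairs = [(p, 1) for p in combinations(items, 2)]
    return singles + pairs

def reduce_phase(mapped, min_support=2):
    """Reduce: sum the counts per itemset and keep only the frequent ones."""
    counts = Counter()
    for key, value in mapped:
        counts[key] += value
    return {k: v for k, v in counts.items() if v >= min_support}

mapped = [pair for t in transactions for pair in map_phase(t)]
print(reduce_phase(mapped))
```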

Machine learning for big data mining

The potential of machine learning for data analytics can be easily found in the early literature [ 22 , 49 ]. Different from data mining algorithms designed for specific problems, machine learning algorithms can be used for different mining and analysis problems because they are typically employed as the “search” algorithm of the required solution. Since most machine learning algorithms can be used to find an approximate solution for an optimization problem, they can be employed for most data analysis problems if those problems can be formulated as optimization problems. For example, the genetic algorithm, one of the machine learning algorithms, can not only be used to solve the clustering problem [ 25 ], it can also be used to solve the frequent pattern mining problem [ 33 ]. The potential of machine learning is not merely in solving different mining problems in the data analysis operator of KDD; it also has the potential to enhance the performance of the other parts of KDD, such as feature reduction for the input operators [ 72 ].

A recent study [ 68 ] shows that some traditional mining algorithms, statistical methods, preprocessing solutions, and even GUIs have been applied to several representative tools and platforms for big data analytics. The results show clearly that machine learning algorithms will be one of the essential parts of big data analytics. One of the problems in using current machine learning methods for big data analytics is similar to that of most traditional data mining algorithms, which are designed for sequential or centralized computing. One of the most promising solutions, however, is to make them work for parallel computing. Fortunately, some machine learning algorithms (e.g., population-based algorithms) can essentially be used for parallel computing, as has been demonstrated for several years, such as the parallel computing version of the genetic algorithm [ 122 ]. Different from the traditional GA, shown in Fig. 9 a, the population of the island model genetic algorithm, one of the parallel GAs, can be divided into several sub-populations, as shown in Fig. 9 b. This means that the sub-populations can be assigned to different threads or computer nodes for parallel computing, by a simple modification of the GA.
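A compact sketch of the island model in Fig. 9 b is given below, assuming a simple one-max objective and arbitrary parameter values; each sub-population evolves independently (and could therefore be mapped to its own thread or node), with occasional ring migration of the best individuals:

```python
import random

random.seed(0)
GENES, POP, ISLANDS, GENERATIONS = 30, 20, 4, 50

def fitness(ind):                      # one-max: maximize the number of 1 bits
    return sum(ind)

def evolve(pop):
    """One generation of tournament selection, one-point crossover, and mutation."""
    new_pop = []
    for _ in range(len(pop)):
        p1, p2 = (max(random.sample(pop, 3), key=fitness) for _ in range(2))
        cut = random.randrange(1, GENES)
        child = p1[:cut] + p2[cut:]
        if random.random() < 0.1:
            i = random.randrange(GENES)
            child[i] ^= 1
        new_pop.append(child)
    return new_pop

islands = [[[random.randint(0, 1) for _ in range(GENES)] for _ in range(POP)]
           for _ in range(ISLANDS)]

for gen in range(GENERATIONS):
    islands = [evolve(pop) for pop in islands]      # each island could run in parallel
    if gen % 10 == 9:                               # migration: pass the best individual along a ring
        best = [max(pop, key=fitness) for pop in islands]
        for i, pop in enumerate(islands):
            pop[random.randrange(POP)] = best[(i - 1) % ISLANDS]

print(max(fitness(ind) for pop in islands for ind in pop))
```

In a real deployment each island would run in its own process or on its own node, and the migration step would be the only communication between them.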

Fig. 9 The comparison between the basic idea of the traditional GA (TGA) and the parallel genetic algorithm (PGA)

For this reason, in [ 123 ], Kiran and Babu explained that the framework for distributed data mining algorithms still needs to aggregate the information from different computer nodes. As shown in Fig. 10 , the common design of a distributed data mining algorithm is as follows: each mining algorithm is performed on a computer node (worker) which has its locally coherent data, but not the whole data. To construct globally meaningful knowledge after each mining algorithm finds its local model, the local models from all the computer nodes have to be aggregated and integrated into a final model to represent the complete knowledge. Kiran and Babu [ 123 ] also pointed out that communication will be the bottleneck when using this kind of distributed computing framework.

Fig. 10 A simple example of a distributed data mining framework [ 86 ]

Bu et al. [ 124 ] found some research issues when trying to apply machine learning algorithms to parallel computing platforms. For instance, the early version of the map-reduce framework does not support “iteration” (i.e., recursion). The good news is that some recent works [ 87 , 125 ] have paid close attention to this problem and tried to fix it. Similar to the solutions for enhancing the performance of traditional data mining algorithms, one possible way to enhance the performance of a machine learning algorithm is to use CUDA, i.e., a GPU, to reduce the computing time of data analysis. Hasan et al. [ 126 ] used CUDA to implement the self-organizing map (SOM) and multiple back-propagation (MBP) for the classification problem. The simulation results show that using a GPU is faster than using a CPU. More precisely, SOM running on a GPU is three times faster than SOM running on a CPU, and MBP running on a GPU is twenty-seven times faster than MBP running on a CPU. Another study [ 127 ] attempted to apply the ant-based algorithm to a grid computing platform. Since the proposed mining algorithm is extended from the ant clustering algorithm of Deneubourg et al. [ 128 ], Footnote 6 Ku-Mahamud modified the ant behavior of this ant clustering algorithm for big data clustering; that is, each ant is randomly placed on the grid, which means the ant clustering algorithm can then be used in a parallel computing environment.
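To make the ant-based idea concrete, the following tiny sketch shows the pick-up/drop probability rule commonly associated with Deneubourg-style ant clustering; the constants and the neighbourhood similarity f are assumptions for illustration, and the exact rules used in [ 127 , 128 ] may differ:

```python
K1, K2 = 0.1, 0.15                    # pick-up / drop sensitivity constants (assumed values)

def p_pick(f):
    """Probability that an unladen ant picks up an item whose local similarity is f in [0, 1]."""
    return (K1 / (K1 + f)) ** 2

def p_drop(f):
    """Probability that a laden ant drops its item at a cell whose local similarity is f."""
    return (f / (K2 + f)) ** 2

for f in (0.05, 0.5, 0.95):           # low, medium, and high similarity to the local neighbours
    print(f"f={f:.2f}  pick={p_pick(f):.2f}  drop={p_drop(f):.2f}")
# Items in dissimilar neighbourhoods tend to be picked up; items in similar ones tend to stay put.
```

Because each ant only inspects its local neighbourhood, many ants can act on different parts of the grid at the same time, which is what makes the scheme amenable to parallel or grid computing.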

The trends of machine learning studies for big data analytics are twofold: one attempts to make machine learning algorithms run on parallel platforms, such as Radoop [ 129 ], Mahout [ 87 ], and PIMRU [ 124 ]; the other is to redesign the machine learning algorithms to make them suitable for parallel computing or for a parallel computing environment, such as neural network algorithms for the GPU [ 126 ] and the ant-based algorithm for the grid [ 127 ]. In summary, both make it possible to apply machine learning algorithms to big data analytics, although many research issues still need to be solved, such as the communication cost between different computer nodes [ 86 ] and the large computation cost most machine learning algorithms require [ 126 ].

Output the result of big data analysis

The benchmarks of PigMix [ 130 ], GridMix [ 131 ], TeraSort and GraySort [ 132 ], TPC-C, TPC-H, TPC-DS [ 133 ], and the yahoo cloud serving benchmark (YCSB) [ 134 ] have been presented for evaluating the performance of cloud computing and big data analytics systems. Ghazal et al. [ 135 ] presented another benchmark (called BigBench) to be used as an end-to-end big data benchmark which covers the 3V characteristics of big data and uses the loading time, time for queries, time for procedural processing queries, and time for the remaining queries as the metrics. Across these benchmarks, the computation time is one of the intuitive metrics for evaluating the performance of different big data analytics platforms or algorithms. That is why Cheptsov [ 136 ] compared high performance computing (HPC) and cloud systems by measuring computation time to understand their scalability for text file analysis. In addition to the computation time, the throughput (e.g., the number of operations per second) and the read/write latency of operations are other measurements of big data analytics [ 137 ]. In the study of [ 138 ], Zhao et al. believe that the maximum size of data and the maximum number of jobs are the two important metrics for understanding the performance of a big data analytics platform. Another study described in [ 139 ] presented a systematic evaluation method which contains the data throughput, concurrency during map and reduce phases, response times, and the execution time of map and reduce. Moreover, most benchmarks for evaluating the performance of big data analytics typically provide only the response time or the computation cost; however, several factors need to be taken into account at the same time when building a big data analytics system. The hardware, bandwidth for data transmission, fault tolerance, cost, and power consumption of these systems are all issues [ 70 , 104 ] to be considered as well. Since several solutions available today install the big data analytics on a cloud computing system or a cluster system, the measurements of fault tolerance, task execution, and cost of cloud computing systems can then be used to evaluate the performance of the corresponding factors of big data analytics.
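A small, generic sketch of how the computation-time, throughput, and latency metrics mentioned above can be collected is given below; the `operation` function is only a stand-in workload to be replaced by the query or analytics job under test:

```python
import statistics
import time

def operation(i):
    """Stand-in workload; replace with a query or analytics job under test."""
    return sum(x * x for x in range(1000 + i % 10))

def benchmark(n_ops=5000):
    latencies = []
    start = time.perf_counter()
    for i in range(n_ops):
        t0 = time.perf_counter()
        operation(i)
        latencies.append(time.perf_counter() - t0)
    elapsed = time.perf_counter() - start
    return {
        "total time (s)": round(elapsed, 3),
        "throughput (ops/s)": round(n_ops / elapsed, 1),
        "median latency (ms)": round(statistics.median(latencies) * 1e3, 3),
        "p99 latency (ms)": round(statistics.quantiles(latencies, n=100)[98] * 1e3, 3),
    }

print(benchmark())
```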

How to present the analysis results to a user is another important task in the output part of big data analytics, because if the user cannot easily understand the meaning of the results, the results will be entirely useless. Business intelligence and network monitoring are two common application areas in which the user interface plays the vital role of making the system workable. Zhang et al. [ 140 ] pointed out that the tasks of visual analytics for commercial systems can be divided into four categories: exploration, dashboards, reporting, and alerting. The study [ 141 ] showed that the interface for electroencephalography (EEG) interpretation is another noticeable research issue in big data analytics. The user interface for cloud systems [ 142 , 143 ] is a recent trend in big data analytics. The user interface plays vital roles in a big data analytics system: one is to simplify the explanation of the needed knowledge to the users, while the other is to make it easier for the users to steer the data analytics system according to their own views. According to our observations, a flexible user interface is needed because although big data analytics can help us find some hidden information, the information found is usually not yet knowledge. This situation is just like the example we mentioned in “ Output the result ”. The mining or statistical techniques can be employed to know the flu situation of each region, but data scientists sometimes need additional ways to display the information to find the knowledge they need or to prove their assumptions. Thus, the user interface should be adjustable by the user to display the knowledge that is urgently needed for big data analytics.

Summary of process of big data analytics

The discussion of big data analytics in this section was divided into input, analysis, and output to map the data analysis process of KDD. For the input (see also “ Big data input ”) and output (see also “ Output the result of big data analysis ”) of big data, several methods and solutions proposed before the big data age (see also “ Data input ”) can also be employed for big data analytics in most cases.

However, there still exist some new input and output issues that data scientists need to confront. A representative example we mentioned in “ Big data input ” is that the bottleneck will not only appear at the sensor or input devices, it may also appear in other places in the data analytics [ 71 ]. Although we can employ traditional compression and sampling technologies to deal with this problem, they can only mitigate the problem instead of solving it completely. Similar situations also exist in the output part. Although several measurements can be used to evaluate the performance of the frameworks, platforms, and even data mining algorithms, there still exist several new issues in the big data age, such as information fusion from different information sources or information accumulation from different times.

Several studies attempted to present an efficient or effective solution from the perspective of the system (e.g., framework and platform) or algorithm level. A simple comparison of these big data analysis technologies from different perspectives is given in Table 3 to provide a brief introduction to the current studies and trends of data analysis technologies for big data. The “Perspective” column of this table indicates whether the study is focused on the framework or algorithm level; the “Description” column gives the further goal of the study; and the “Name” column gives the abbreviated names of the methods or platforms/frameworks. From the analysis framework perspective, this table shows that big data framework , platform , and machine learning are the current research trends in big data analytics systems. From the mining algorithm perspective, the clustering , classification , and frequent pattern mining issues play the vital role in this research because several data analysis problems can be mapped to these essential issues.

A promising trend that can easily be found in these successful examples is to use machine learning as the search algorithm (i.e., mining algorithm) for the data mining problems of a big data analytics system. The machine learning-based methods are able to make the mining algorithms and relevant platforms smarter or to reduce the redundant computation costs. The strong impact of parallel computing and cloud computing technologies on big data analytics can also be recognized as follows: (1) most of the big data analytics frameworks and platforms use Hadoop and Hadoop-related technologies to design their solutions; and (2) most of the mining algorithms for big data analysis have been designed for parallel computing via software or hardware, or designed for a map-reduce-based platform.

Judging from recent studies of big data analytics, the field is still at an early stage of Nolan’s stages of growth model [ 146 ], which is similar to the situation for the research topics of cloud computing, the internet of things, and the smart grid. This is because several studies just attempted to apply traditional solutions to the new problems/platforms/environments. For example, several studies [ 114 , 145 ] used k-means as an example to analyze big data, but not many studies applied state-of-the-art data mining algorithms and machine learning algorithms to the analysis of big data. This suggests that the performance of big data analytics can be improved by the data mining algorithms and metaheuristic algorithms presented in recent years [ 147 ]. The relevant technologies for compression, sampling, or even the platforms presented in recent years may also be used to enhance the performance of big data analytics systems. As a result, although these research topics still have several open issues that need to be solved, these situations, on the contrary, also illustrate that everything is possible in these studies.

The open issues

The data analytics of today may be inefficient for big data because the environment, devices, systems, and even the problems are quite different from traditional mining problems, even though several characteristics of big data also exist in traditional data analytics. Several open issues caused by big data will be addressed from the platform/framework and data mining perspectives in this section to explain what dilemmas we may confront. Here are some of the open issues:

Platform and framework perspective

Input and output ratio of platform.

A large number of reports and studies mentioned that we will enter the big data age in the near future. Some of them suggested that the fruitful results of big data will lead us to a whole new world where “everything” is possible; therefore, big data analytics will be an omniscient and omnipotent system. From the pragmatic perspective, big data analytics is indeed useful and has many possibilities which can help us more accurately understand the so-called “things.” However, the situation in most studies of big data analytics is that they argue that the results of big data are valuable, while the business models of most big data analytics are not clear. Since assuming that we have infinite computing resources for big data analytics is a thoroughly impracticable plan, the input and output ratio (e.g., return on investment) will need to be taken into account before an organization constructs a big data analytics center.

Communication between systems

Since most big data analytics systems will be designed for parallel computing, and they typically will work on other systems (e.g., a cloud platform) or work with other systems (e.g., a search engine or knowledge base), the communication between the big data analytics and other systems will strongly impact the performance of the whole KDD process. The first research issue for communication is the communication cost incurred between the systems of data analytics; how to reduce this cost will be the very first thing that data scientists need to care about. Another research issue is how the big data analytics communicates with other systems. The consistency of data between different systems, modules, and operators is also an important open issue. Because communication will occur more frequently between the systems of big data analytics, how to reduce the cost of communication and how to make the communication between these systems as reliable as possible will be the two important open issues for big data analytics.

Bottlenecks on data analytics system

Bottlenecks will appear in different places in the data analytics for big data because the environments, systems, and input data have changed and are different from those of traditional data analytics. The data deluge of big data will fill up the “input” system of data analytics, and it will also increase the computation load of the data “analysis” system. This situation is just like a torrent of water (i.e., the data deluge) rushing down a mountain (i.e., the data analytics): how to split it and how to avoid it flowing into a narrow place (e.g., an operator that is not able to handle the input data) will be the most important things for avoiding bottlenecks in the data analytics system. One current solution to avoiding bottlenecks in a data analytics system is to add more computation resources, while another is to split the analysis work across different computation nodes. A complete consideration of the whole data analytics pipeline to avoid the bottlenecks of such an analytics system is still needed for big data.

Security issues

Since much more environmental data and human behavior data will be gathered by big data analytics, how to protect them will also be an open issue, because without a secure way to handle the collected data, big data analytics cannot be a reliable system. Although security has to be tightened for big data analytics before it can gather more data from everywhere, the fact is that, until now, there are still not many studies focusing on the security issues of big data analytics. According to our observation, the security issues of big data analytics can be divided into four parts: input, data analysis, output, and communication with other systems. The input can be regarded as the data gathering, which is relevant to sensors, handheld devices, and even the devices of the internet of things; one of the important security issues in this part is to make sure that the sensors will not be compromised by attacks. For the analysis and output, the concern can be regarded as the security problem of the analytics system itself. For communication with other systems, the security problem lies in the communications between the big data analytics and other external systems. Because of these latent problems, security has become one of the open issues of big data analytics.

Data mining perspective

Data mining algorithms for map-reduce solutions.

As we mentioned in the previous sections, most of the traditional data mining algorithms are not designed for parallel computing; therefore, they are not particularly useful for big data mining. Several recent studies have attempted to modify the traditional data mining algorithms to make them applicable to Hadoop-based platforms. As long as porting the data mining algorithms to Hadoop is inevitable, making the data mining algorithms work on a map-reduce architecture is the very first thing to do to apply traditional data mining methods to big data analytics. Unfortunately, not many studies have attempted to make data mining and soft computing algorithms work on Hadoop, because several different backgrounds are needed to develop and design such algorithms; for instance, the researchers need backgrounds in both data mining and Hadoop. Another open issue is that most data mining algorithms are designed for centralized computing, that is, they can only work on all the data at the same time, so how to make them work on a parallel computing system is also difficult. The good news is that some studies [ 145 ] have successfully applied traditional data mining algorithms to the map-reduce architecture, which implies that it is possible to do so. According to our observation, although traditional mining or soft computing algorithms can be used to help us analyze the data in big data analytics, until now not many studies have focused on this. As a consequence, it is an important open issue in big data analytics.
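To illustrate what porting a centralized algorithm to map-reduce involves, here is a minimal single-process sketch of k-means expressed as a map step (assign each point to its nearest centroid) and a reduce step (recompute each centroid); it mirrors the general idea of such studies rather than any specific implementation, and assumes NumPy:

```python
import numpy as np

def map_assign(points, centroids):
    """Map step: emit the index of the nearest centroid for every point."""
    dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
    return np.argmin(dists, axis=1)

def reduce_update(points, labels, old_centroids):
    """Reduce step: average the points assigned to each cluster to get the new centroids."""
    k = len(old_centroids)
    return np.array([points[labels == j].mean(axis=0) if np.any(labels == j)
                     else old_centroids[j] for j in range(k)])

rng = np.random.default_rng(0)
points = rng.normal(size=(10_000, 2))
centroids = points[rng.choice(len(points), size=3, replace=False)]

for _ in range(10):                       # each iteration corresponds to one map-reduce round
    labels = map_assign(points, centroids)
    centroids = reduce_update(points, labels, centroids)

print(np.round(centroids, 2))
```

On a real Hadoop-style platform the map step would run on the nodes holding the data partitions, and the reduce step would combine the per-partition sums and counts, which is exactly the kind of restructuring a centralized algorithm needs before it can be ported.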

Noise, outliers, incomplete and inconsistent data

Although big data analytics is a new age for data analysis, because several solutions adopt classical ways to analyze the data, the open issues of traditional data mining algorithms also exist in these new systems. The open issues of noise, outliers, and incomplete and inconsistent data in traditional data mining algorithms will also appear in big data mining algorithms. More incomplete and inconsistent data will easily appear because the data are captured by or generated from different sensors and systems, and the impact of noise, outliers, and incomplete and inconsistent data will be enlarged for big data analytics. Therefore, how to mitigate this impact will be an open issue for big data analytics.

Bottlenecks on data mining algorithm

Most of the data mining algorithms in big data analytics will be designed for parallel computing. However, once data mining algorithms are designed or modified for parallel computing, it is the information exchange between different data mining procedures that may incur bottlenecks. One of them is the synchronization issue, because different mining procedures will finish their jobs at different times even though they use the same mining algorithm to work on the same amount of data; thus, some of the mining procedures will have to wait until the others have finished their jobs. This situation may occur because the loading of different computer nodes may differ during the data mining process, or because the convergence speeds differ for the same data mining algorithm. The bottlenecks of data mining algorithms will become an open issue for big data analytics, which means that we need to take this issue into account when we develop and design a new data mining algorithm for big data analytics.

Privacy issues

The privacy concern typically makes most people uncomfortable, especially if systems cannot guarantee that their personal information will not be accessed by other people and organizations. Different from the security concern, the privacy issue is about whether it is possible for the system to restore or infer personal information from the results of big data analytics, even though the input data are anonymous. The privacy issue has become very important because data mining and other analysis technologies will be widely used in big data analytics, and private information may be exposed to other people after the analysis process. For example, although all the gathered data about shopping behavior are anonymous (e.g., someone buying a pistol), because the data can be easily collected by different devices and systems (e.g., the location of the shop and the age of the buyer), a data mining algorithm can easily infer who bought the pistol. More precisely, the data analytics is able to narrow the scope of the search because the location of the shop and the age of the buyer provide the information that helps the system find the possible persons. For this reason, any sensitive information needs to be carefully protected and used. Anonymization, temporary identification, and encryption are the representative technologies for the privacy of data analytics, but the critical factor is how to use, what to use, and why to use the collected data in big data analytics.
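The pistol example can be made concrete with a toy linkage query: even though the purchase records below carry no names, joining two quasi-identifiers (shop location and buyer age) against an auxiliary list narrows the candidates to a single person. All data in this sketch are fictitious:

```python
# Anonymous purchase log: no names, only quasi-identifiers.
purchases = [
    {"item": "pistol", "shop_city": "Springfield", "buyer_age": 34},
    {"item": "book",   "shop_city": "Springfield", "buyer_age": 34},
    {"item": "pistol", "shop_city": "Shelbyville", "buyer_age": 52},
]

# Auxiliary data gathered from other systems (e.g., a loyalty-card database).
residents = [
    {"name": "Alice", "city": "Springfield", "age": 34},
    {"name": "Bob",   "city": "Springfield", "age": 41},
    {"name": "Carol", "city": "Shelbyville", "age": 52},
]

target = next(p for p in purchases
              if p["item"] == "pistol" and p["shop_city"] == "Springfield")
candidates = [r["name"] for r in residents
              if r["city"] == target["shop_city"] and r["age"] == target["buyer_age"]]
print(candidates)   # ['Alice'] -- the "anonymous" buyer is re-identified
```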

Conclusions

In this paper, we reviewed studies on data analytics from traditional data analysis to recent big data analysis. From the system perspective, the KDD process is used as the framework for these studies and is summarized into three parts: input, analysis, and output. From the perspective of big data analytics frameworks and platforms, the discussions are focused on performance-oriented and result-oriented issues. From the perspective of data mining problems, this paper gives a brief introduction to the data and big data mining algorithms, which consist of clustering, classification, and frequent pattern mining technologies. To better understand the changes brought about by big data, this paper focuses on the data analysis of KDD from the platform/framework to data mining. The open issues of computation, quality of the end result, security, and privacy are then discussed to explain which open issues we may face. Last but not least, to help the audience of the paper find solutions to welcome the new age of big data, the possible high-impact research trends are given below:

For the computation time, there is no doubt that parallel computing is one of the important future trends for making data analytics work for big data, and consequently the technologies of cloud computing, Hadoop, and map-reduce will play important roles in big data analytics. To handle the computation resources of the cloud-based platform and to finish the task of data analysis as fast as possible, scheduling methods are another future trend.

Efficient methods to reduce the computation time at the input stage, such as compression, sampling, and a variety of other reduction methods, will play an important role in big data analytics. Because these methods typically do not consider the parallel computing environment, how to make them work in a parallel computing environment will be a future research trend. Similar to the input, the data mining algorithms also face the situation we mentioned in the previous section : how to make them work in a parallel computing environment will be a very important research trend because there are abundant research results on traditional data mining algorithms.

How to model the mining problem to find something from big data and how to display the knowledge obtained from big data analytics will also be two vital future trends, because the results of these two lines of research will decide whether the data analytics can practically work for real-world applications, not just remain theoretical.

The methods of extracting information from external and related knowledge resources to further reinforce big data analytics are, until now, not very popular in big data analytics. However, combining information from different resources to add value to the output knowledge is a common solution in the area of information retrieval, such as clustering search engines or document summarization. For this reason, information fusion will also be a future trend for improving the end results of big data analytics.

Because metaheuristic algorithms are capable of finding an approximate solution within a reasonable time, they have been widely used in solving data mining problems in recent years. Until now, many state-of-the-art metaheuristic algorithms still have not been applied to big data analytics. In addition, compared to some early data mining algorithms, the performance of metaheuristics is no doubt superior in terms of computation time and the quality of the end result. From these observations, the application of metaheuristic algorithms to big data analytics will also be an important research topic.

Because social networks are part of the daily life of most people and because their data are also a kind of big data, how to analyze the data of a social network has become a promising research issue. Obviously, such analysis can be used to predict the behavior of a user, after which we can devise applicable strategies for the user. For instance, a business intelligence system can use the analysis results to encourage particular customers to buy the goods they are interested in.

The security and privacy issues that accompany the work of data analysis are intuitive research topics which include how to safely store the data, how to make sure the data communication is protected, and how to prevent someone from finding out information about us. Many problems of data security and privacy are essentially the same as those of traditional data analysis even as we enter the big data age. Thus, how to protect the data will also appear in the research of big data analytics.

In this paper, by the data analytics, we mean the whole KDD process, while by the data analysis, we mean the part of data analytics that is aimed at finding the hidden information in the data, such as data mining.

In this paper, by unlabeled input data, we mean that it is unknown to which group the input data belong. If all the input data are unlabeled, it means that the distribution of the input data is unknown.

In this paper, the analysis framework refers to the whole system, from raw data gathering, data reformat, data analysis, all the way to knowledge representation.

For a system that has only one master, the whole system may go down when the master machine crashes.

The learner typically represents the classification function which creates the classifier to help us classify unknown input data.

The basic idea of [ 128 ] is that each ant will pick up and drop data items according to the similarity of its local neighbors.

Abbreviations

PCA: principal components analysis

3Vs: volume, velocity, and variety

IDC: International Data Corporation

KDD: knowledge discovery in databases

SVM: support vector machine

SSE: sum of squared errors

GLADE: generalized linear aggregates distributed engine

BDAF: big data architecture framework

CBDMASP: cloud-based big data mining and analyzing services platform

SODSS: service-oriented decision support system

HPCC: high performance computing cluster system

BI&A: business intelligence and analytics

DBMS: database management system

MSF: multiple species flocking

GA: genetic algorithm

SOM: self-organizing map

MBP: multiple back-propagation

YCSB: yahoo cloud serving benchmark

HPC: high performance computing

EEG: electroencephalography

Lyman P, Varian H. How much information 2003? Tech. Rep, 2004. [Online]. Available: http://www2.sims.berkeley.edu/research/projects/how-much-info-2003/printable_report.pdf .

Xu R, Wunsch D. Clustering. Hoboken: Wiley-IEEE Press; 2009.

Ding C, He X. K-means clustering via principal component analysis. In: Proceedings of the Twenty-first International Conference on Machine Learning, 2004, pp 1–9.

Kollios G, Gunopulos D, Koudas N, Berchtold S. Efficient biased sampling for approximate clustering and outlier detection in large data sets. IEEE Trans Knowl Data Eng. 2003;15(5):1170–87.

Fisher D, DeLine R, Czerwinski M, Drucker S. Interactions with big data analytics. Interactions. 2012;19(3):50–9.

Laney D. 3D data management: controlling data volume, velocity, and variety, META Group, Tech. Rep. 2001. [Online]. Available: http://blogs.gartner.com/doug-laney/files/2012/01/ad949-3D-Data-Management-Controlling-Data-Volume-Velocity-and-Variety.pdf .

van Rijmenam M. Why the 3v’s are not sufficient to describe big data, BigData Startups, Tech. Rep. 2013. [Online]. Available: http://www.bigdata-startups.com/3vs-sufficient-describe-big-data/ .

Borne K. Top 10 big data challenges a serious look at 10 big data v’s, Tech. Rep. 2014. [Online]. Available: https://www.mapr.com/blog/top-10-big-data-challenges-look-10-big-data-v .

Press G. $16.1 billion big data market: 2014 predictions from IDC and IIA, Forbes, Tech. Rep. 2013. [Online]. Available: http://www.forbes.com/sites/gilpress/2013/12/12/16-1-billion-big-data-market-2014-predictions-from-idc-and-iia/ .

Big data and analytics—an IDC four pillar research area, IDC, Tech. Rep. 2013. [Online]. Available: http://www.idc.com/prodserv/FourPillars/bigData/index.jsp .

Taft DK. Big data market to reach $46.34 billion by 2018, EWEEK, Tech. Rep. 2013. [Online]. Available: http://www.eweek.com/database/big-data-market-to-reach-46.34-billion-by-2018.html .

Research A. Big data spending to reach $114 billion in 2018; look for machine learning to drive analytics, ABI Research, Tech. Rep. 2013. [Online]. Available: https://www.abiresearch.com/press/big-data-spending-to-reach-114-billion-in-2018-loo .

Furrier J. Big data market $50 billion by 2017—HP vertica comes out #1—according to wikibon research, SiliconANGLE, Tech. Rep. 2012. [Online]. Available: http://siliconangle.com/blog/2012/02/15/big-data-market-15-billion-by-2017-hp-vertica-comes-out-1-according-to-wikibon-research/ .

Kelly J, Vellante D, Floyer D. Big data market size and vendor revenues, Wikibon, Tech. Rep. 2014. [Online]. Available: http://wikibon.org/wiki/v/Big_Data_Market_Size_and_Vendor_Revenues .

Kelly J, Floyer D, Vellante D, Miniman S. Big data vendor revenue and market forecast 2012-2017, Wikibon, Tech. Rep. 2014. [Online]. Available: http://wikibon.org/wiki/v/Big_Data_Vendor_Revenue_and_Market_Forecast_2012-2017 .

Mayer-Schonberger V, Cukier K. Big data: a revolution that will transform how we live, work, and think. Boston: Houghton Mifflin Harcourt; 2013.

Chen H, Chiang RHL, Storey VC. Business intelligence and analytics: from big data to big impact. MIS Quart. 2012;36(4):1165–88.

Kitchin R. The real-time city? big data and smart urbanism. Geo J. 2014;79(1):1–14.

Fayyad UM, Piatetsky-Shapiro G, Smyth P. From data mining to knowledge discovery in databases. AI Mag. 1996;17(3):37–54.

Han J. Data mining: concepts and techniques. San Francisco: Morgan Kaufmann Publishers Inc.; 2005.

Agrawal R, Imieliński T, Swami A. Mining association rules between sets of items in large databases. Proc ACM SIGMOD Int Conf Manag Data. 1993;22(2):207–16.

Witten IH, Frank E. Data mining: practical machine learning tools and techniques. San Francisco: Morgan Kaufmann Publishers Inc.; 2005.

Abbass H, Newton C, Sarker R. Data mining: a heuristic approach. Hershey: IGI Global; 2002.

Cannataro M, Congiusta A, Pugliese A, Talia D, Trunfio P. Distributed data mining on grids: services, tools, and applications. IEEE Trans Syst Man Cyber Part B Cyber. 2004;34(6):2451–65.

Krishna K, Murty MN. Genetic \(k\) -means algorithm. IEEE Trans Syst Man Cyber Part B Cyber. 1999;29(3):433–9.

Tsai C-W, Lai C-F, Chiang M-C, Yang L. Data mining for internet of things: a survey. IEEE Commun Surveys Tutor. 2014;16(1):77–97.

Jain AK, Murty MN, Flynn PJ. Data clustering: a review. ACM Comp Surveys. 1999;31(3):264–323.

McQueen JB. Some methods of classification and analysis of multivariate observations. In: Proceedings of the Berkeley Symposium on Mathematical Statistics and Probability, 1967. pp 281–297.

Safavian S, Landgrebe D. A survey of decision tree classifier methodology. IEEE Trans Syst Man Cyber. 1991;21(3):660–74.

McCallum A, Nigam K. A comparison of event models for naive bayes text classification. In: Proceedings of the National Conference on Artificial Intelligence, 1998. pp. 41–48.

Boser BE, Guyon IM, Vapnik VN. A training algorithm for optimal margin classifiers. In: Proceedings of the annual workshop on Computational learning theory, 1992. pp. 144–152.

Han J, Pei J, Yin Y. Mining frequent patterns without candidate generation. In : Proceedings of the ACM SIGMOD International Conference on Management of Data, 2000. pp. 1–12.

Kaya M, Alhajj R. Genetic algorithm based framework for mining fuzzy association rules. Fuzzy Sets Syst. 2005;152(3):587–601.

Srikant R, Agrawal R. Mining sequential patterns: generalizations and performance improvements. In: Proceedings of the International Conference on Extending Database Technology: Advances in Database Technology, 1996. pp 3–17.

Zaki MJ. Spade: an efficient algorithm for mining frequent sequences. Mach Learn. 2001;42(1–2):31–60.

Baeza-Yates RA, Ribeiro-Neto B. Modern Information Retrieval. Boston: Addison-Wesley Longman Publishing Co., Inc; 1999.

Liu B. Web data mining: exploring hyperlinks, contents, and usage data. Berlin, Heidelberg: Springer-Verlag; 2007.

d’Aquin M, Jay N. Interpreting data mining results with linked data for learning analytics: motivation, case study and directions. In: Proceedings of the International Conference on Learning Analytics and Knowledge, pp 155–164.

Shneiderman B. The eyes have it: a task by data type taxonomy for information visualizations. In: Proceedings of the IEEE Symposium on Visual Languages, 1996, pp 336–343.

Mani I, Bloedorn E. Multi-document summarization by graph search and matching. In: Proceedings of the National Conference on Artificial Intelligence and Ninth Conference on Innovative Applications of Artificial Intelligence, 1997, pp 622–628.

Kopanakis I, Pelekis N, Karanikas H, Mavroudkis T. Visual techniques for the interpretation of data mining outcomes. In: Proceedings of the Panhellenic Conference on Advances in Informatics, 2005. pp 25–35.

Elkan C. Using the triangle inequality to accelerate k-means. In: Proceedings of the International Conference on Machine Learning, 2003, pp 147–153.

Catanzaro B, Sundaram N, Keutzer K. Fast support vector machine training and classification on graphics processors. In: Proceedings of the International Conference on Machine Learning, 2008. pp 104–111.

Zhang T, Ramakrishnan R, Livny M. BIRCH: an efficient data clustering method for very large databases. In: Proceedings of the ACM SIGMOD International Conference on Management of Data, 1996. pp 103–114.

Ester M, Kriegel HP, Sander J, Xu X. A density-based algorithm for discovering clusters in large spatial databases with noise. In: Proceedings of the Second International Conference on Knowledge Discovery and Data Mining, 1996. pp 226–231.

Ester M, Kriegel HP, Sander J, Wimmer M, Xu X. Incremental clustering for mining in a data warehousing environment. In: Proceedings of the International Conference on Very Large Data Bases, 1998. pp 323–333.

Ordonez C, Omiecinski E. Efficient disk-based k-means clustering for relational databases. IEEE Trans Knowl Data Eng. 2004;16(8):909–21.

Kogan J. Introduction to clustering large and high-dimensional data. Cambridge: Cambridge Univ Press; 2007.

Mitra S, Pal S, Mitra P. Data mining in soft computing framework: a survey. IEEE Trans Neural Netw. 2002;13(1):3–14.

Mehta M, Agrawal R, Rissanen J. SLIQ: a fast scalable classifier for data mining. In: Proceedings of the 5th International Conference on Extending Database Technology: Advances in Database Technology. 1996. pp 18–32.

Micó L, Oncina J, Carrasco RC. A fast branch and bound nearest neighbour classifier in metric spaces. Pattern Recogn Lett. 1996;17(7):731–9.

Djouadi A, Bouktache E. A fast algorithm for the nearest-neighbor classifier. IEEE Trans Pattern Anal Mach Intel. 1997;19(3):277–82.

Ververidis D, Kotropoulos C. Fast and accurate sequential floating forward feature selection with the bayes classifier applied to speech emotion recognition. Signal Process. 2008;88(12):2956–70.

Pei J, Han J, Mao R. CLOSET: an efficient algorithm for mining frequent closed itemsets. In: Proceedings of the ACM SIGMOD Workshop on Research Issues in Data Mining and Knowledge Discovery, 2000. pp 21–30.

Zaki MJ, Hsiao C-J. Efficient algorithms for mining closed itemsets and their lattice structure. IEEE Trans Knowl Data Eng. 2005;17(4):462–78.

Burdick D, Calimlim M, Gehrke J. MAFIA: a maximal frequent itemset algorithm for transactional databases. In: Proceedings of the International Conference on Data Engineering, 2001. pp 443–452.

Chen B, Haas P, Scheuermann P. A new two-phase sampling based algorithm for discovering association rules. In: Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2002. pp 462–468.

Zaki MJ. SPADE: an efficient algorithm for mining frequent sequences. Mach Learn. 2001;42(1–2):31–60.

Yan X, Han J, Afshar R. CloSpan: mining closed sequential patterns in large datasets. In: Proceedings of the SIAM International Conference on Data Mining, 2003. pp 166–177.

Pei J, Han J, Asl MB, Pinto H, Chen Q, Dayal U, Hsu MC. PrefixSpan mining sequential patterns efficiently by prefix projected pattern growth. In: Proceedings of the International Conference on Data Engineering, 2001. pp 215–226.

Ayres J, Flannick J, Gehrke J, Yiu T. Sequential PAttern Mining using a bitmap representation. In: Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2002. pp 429–435.

Masseglia F, Poncelet P, Teisseire M. Incremental mining of sequential patterns in large databases. Data Knowl Eng. 2003;46(1):97–121.

Xu R, Wunsch-II DC. Survey of clustering algorithms. IEEE Trans Neural Netw. 2005;16(3):645–78.

Chiang M-C, Tsai C-W, Yang C-S. A time-efficient pattern reduction algorithm for k-means clustering. Inform Sci. 2011;181(4):716–31.

Bradley PS, Fayyad UM. Refining initial points for k-means clustering. In: Proceedings of the International Conference on Machine Learning, 1998. pp 91–99.

Laskov P, Gehl C, Krüger S, Müller K-R. Incremental support vector learning: analysis, implementation and applications. J Mach Learn Res. 2006;7:1909–36.

Russom P. Big data analytics. TDWI, Tech. Rep.; 2011.

Ma C, Zhang HH, Wang X. Machine learning for big data analytics in plants. Trends Plant Sci. 2014;19(12):798–808.

Boyd D, Crawford K. Critical questions for big data. Inform Commun Soc. 2012;15(5):662–79.

Katal A, Wazid M, Goudar R. Big data: issues, challenges, tools and good practices. In: Proceedings of the International Conference on Contemporary Computing, 2013. pp 404–409.

Baraniuk RG. More is less: signal processing and the data deluge. Science. 2011;331(6018):717–9.

Lee J, Hong S, Lee JH. An efficient prediction for heavy rain from big weather data using genetic algorithm. In: Proceedings of the International Conference on Ubiquitous Information Management and Communication, 2014. pp 25:1–25:7.

Famili A, Shen W-M, Weber R, Simoudis E. Data preprocessing and intelligent data analysis. Intel Data Anal. 1997;1(1–4):3–23.

Zhang H. A novel data preprocessing solution for large scale digital forensics investigation on big data, Master’s thesis, Norway, 2013.

Ham YJ, Lee H-W. International journal of advances in soft computing and its applications. Calc Paralleles Reseaux et Syst Repar. 2014;6(1):1–18.

Cormode G, Duffield N. Sampling for big data: a tutorial. In: Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2014. pp 1975–1975.

Satyanarayana A. Intelligent sampling for big data using bootstrap sampling and chebyshev inequality. In: Proceedings of the IEEE Canadian Conference on Electrical and Computer Engineering, 2014. pp 1–6.

Jun SW, Fleming K, Adler M, Emer JS. Zip-io: architecture for application-specific compression of big data. In: Proceedings of the International Conference on Field-Programmable Technology, 2012, pp 343–351.

Zou H, Yu Y, Tang W, Chen HM. Improving I/O performance with adaptive data compression for big data applications. In: Proceedings of the International Parallel and Distributed Processing Symposium Workshops, 2014. pp 1228–1237.

Yang C, Zhang X, Zhong C, Liu C, Pei J, Ramamohanarao K, Chen J. A spatiotemporal compression based approach for efficient big data processing on cloud. J Comp Syst Sci. 2014;80(8):1563–83.

Xue Z, Shen G, Li J, Xu Q, Zhang Y, Shao J. Compression-aware I/O performance analysis for big data clustering. In: Proceedings of the International Workshop on Big Data, Streams and Heterogeneous Source Mining: Algorithms, Systems, Programming Models and Applications, 2012. pp 45–52.

Pospiech M, Felden C. Big data—a state-of-the-art. In: Proceedings of the Americas Conference on Information Systems, 2012, pp 1–23. [Online]. Available: http://aisel.aisnet.org/amcis2012/proceedings/DecisionSupport/22 .

Apache Hadoop, February 2, 2015. [Online]. Available: http://hadoop.apache.org .

Cuda, February 2, 2015. [Online]. Available: URL: http://www.nvidia.com/object/cuda_home_new.html .

Apache Storm, February 2, 2015. [Online]. Available: URL: http://storm.apache.org/ .

Curtin RR, Cline JR, Slagle NP, March WB, Ram P, Mehta NA, Gray AG. MLPACK: a scalable C++ machine learning library. J Mach Learn Res. 2013;14:801–5.

Apache Mahout, February 2, 2015. [Online]. Available: http://mahout.apache.org/ .

Huai Y, Lee R, Zhang S, Xia CH, Zhang X. DOT: a matrix model for analyzing, optimizing and deploying software for big data analytics in distributed systems. In: Proceedings of the ACM Symposium on Cloud Computing, 2011. pp 4:1–4:14.

Rusu F, Dobra A. GLADE: a scalable framework for efficient analytics. In: Proceedings of LADIS Workshop held in conjunction with VLDB, 2012. pp 1–6.

Cheng Y, Qin C, Rusu F. GLADE: big data analytics made easy. In: Proceedings of the ACM SIGMOD International Conference on Management of Data, 2012. pp 697–700.

Essa YM, Attiya G, El-Sayed A. Mobile agent based new framework for improving big data analysis. In: Proceedings of the International Conference on Cloud Computing and Big Data. 2013, pp 381–386.

Wonner J, Grosjean J, Capobianco A, Bechmann D Starfish: a selection technique for dense virtual environments. In: Proceedings of the ACM Symposium on Virtual Reality Software and Technology, 2012. pp 101–104.

Demchenko Y, de Laat C, Membrey P. Defining architecture components of the big data ecosystem. In: Proceedings of the International Conference on Collaboration Technologies and Systems, 2014. pp 104–112.

Ye F, Wang ZJ, Zhou FC, Wang YP, Zhou YC. Cloud-based big data mining and analyzing services platform integrating r. In: Proceedings of the International Conference on Advanced Cloud and Big Data, 2013. pp 147–151.

Wu X, Zhu X, Wu G-Q, Ding W. Data mining with big data. IEEE Trans Knowl Data Eng. 2014;26(1):97–107.

Laurila JK, Gatica-Perez D, Aad I, Blom J, Bornet O, Do T, Dousse O, Eberle J, Miettinen M. The mobile data challenge: big data for mobile computing research. In: Proceedings of the Mobile Data Challenge by Nokia Workshop, 2012. pp 1–8.

Demirkan H, Delen D. Leveraging the capabilities of service-oriented decision support systems: putting analytics and big data in cloud. Decision Support Syst. 2013;55(1):412–21.

Talia D. Clouds for scalable big data analytics. Computer. 2013;46(5):98–101.

Lu R, Zhu H, Liu X, Liu JK, Shao J. Toward efficient and privacy-preserving computing in big data era. IEEE Netw. 2014;28(4):46–50.

Cuzzocrea A, Song IY, Davis KC. Analytics over large-scale multidimensional data: The big data revolution!. In: Proceedings of the ACM International Workshop on Data Warehousing and OLAP, 2011. pp 101–104.

Zhang J, Huang ML. 5Ws model for big data analysis and visualization. In: Proceedings of the International Conference on Computational Science and Engineering, 2013. pp 1021–1028.

Chandarana P, Vijayalakshmi M. Big data analytics frameworks. In: Proceedings of the International Conference on Circuits, Systems, Communication and Information Technology Applications, 2014. pp 430–434.

Apache Drill February 2, 2015. [Online]. Available: URL: http://drill.apache.org/ .

Hu H, Wen Y, Chua T-S, Li X. Toward scalable systems for big data analytics: a technology tutorial. IEEE Access. 2014;2:652–87.

Sagiroglu S, Sinanc D, Big data: a review. In: Proceedings of the International Conference on Collaboration Technologies and Systems, 2013. pp 42–47.

Fan W, Bifet A. Mining big data: current status, and forecast to the future. ACM SIGKDD Explor Newslett. 2013;14(2):1–5.

Diebold FX. On the origin(s) and development of the term “big data”, Penn Institute for Economic Research, Department of Economics, University of Pennsylvania, Tech. Rep. 2012. [Online]. Available: http://economics.sas.upenn.edu/sites/economics.sas.upenn.edu/files/12-037.pdf .

Weiss SM, Indurkhya N. Predictive data mining: a practical guide. San Francisco: Morgan Kaufmann Publishers Inc.; 1998.

Fahad A, Alshatri N, Tari Z, Alamri A, Khalil I, Zomaya A, Foufou S, Bouras A. A survey of clustering algorithms for big data: taxonomy and empirical analysis. IEEE Trans Emerg Topics Comp. 2014;2(3):267–79.

Shirkhorshidi AS, Aghabozorgi SR, Teh YW, Herawan T. Big data clustering: a review. In: Proceedings of the International Conference on Computational Science and Its Applications, 2014. pp 707–720.

Xu H, Li Z, Guo S, Chen K. Cloudvista: interactive and economical visual cluster analysis for big data in the cloud. Proc VLDB Endowment. 2012;5(12):1886–9.

Cui X, Gao J, Potok TE. A flocking based algorithm for document clustering analysis. J Syst Archit. 2006;52(89):505–15.

Cui X, Charles JS, Potok T. GPU enhanced parallel computing for large scale data clustering. Future Gener Comp Syst. 2013;29(7):1736–41.

Feldman D, Schmidt M, Sohler C. Turning big data into tiny data: Constant-size coresets for k-means, pca and projective clustering. In: Proceedings of the ACM-SIAM Symposium on Discrete Algorithms, 2013. pp 1434–1453.

Tekin C, van der Schaar M. Distributed online big data classification using context information. In: Proceedings of the Allerton Conference on Communication, Control, and Computing, 2013. pp 1435–1442.

Rebentrost P, Mohseni M, Lloyd S. Quantum support vector machine for big feature and big data classification. CoRR , vol. abs/1307.0471, 2014. [Online]. Available: http://dblp.uni-trier.de/db/journals/corr/corr1307.html#RebentrostML13 .

Lin MY, Lee PY, Hsueh SC. Apriori-based frequent itemset mining algorithms on mapreduce. In: Proceedings of the International Conference on Ubiquitous Information Management and Communication, 2012. pp 76:1–76:8.

Riondato M, DeBrabant JA, Fonseca R, Upfal E. PARMA: a parallel randomized algorithm for approximate association rules mining in mapreduce. In: Proceedings of the ACM International Conference on Information and Knowledge Management, 2012. pp 85–94.

Leung CS, MacKinnon R, Jiang F. Reducing the search space for big data mining for interesting patterns from uncertain data. In: Proceedings of the International Congress on Big Data, 2014. pp 315–322.

Yang L, Shi Z, Xu L, Liang F, Kirsh I. DH-TRIE frequent pattern mining on hadoop using JPA. In: Proceedings of the International Conference on Granular Computing, 2011. pp 875–878.

Huang JW, Lin SC, Chen MS. DPSP: Distributed progressive sequential pattern mining on the cloud. In: Proceedings of the Advances in Knowledge Discovery and Data Mining, vol. 6119, 2010, pp 27–34.

Paz CE. A survey of parallel genetic algorithms. Calc Paralleles Reseaux et Syst Repar. 1998;10(2):141–71.

kranthi Kiran B, Babu AV. A comparative study of issues in big data clustering algorithm with constraint based genetic algorithm for associative clustering. Int J Innov Res Comp Commun Eng 2014; 2(8): 5423–5432.

Bu Y, Borkar VR, Carey MJ, Rosen J, Polyzotis N, Condie T, Weimer M, Ramakrishnan R. Scaling datalog for machine learning on big data, CoRR , vol. abs/1203.0160, 2012. [Online]. Available: http://dblp.uni-trier.de/db/journals/corr/corr1203.html#abs-1203-0160 .

Malewicz G, Austern MH, Bik AJ, Dehnert JC, Horn I, Leiser N, Czajkowski G. Pregel: A system for large-scale graph processing. In: Proceedings of the ACM SIGMOD International Conference on Management of Data, 2010. pp 135–146.

Hasan S, Shamsuddin S,  Lopes N. Soft computing methods for big data problems. In: Proceedings of the Symposium on GPU Computing and Applications, 2013. pp 235–247.

Ku-Mahamud KR. Big data clustering using grid computing and ant-based algorithm. In: Proceedings of the International Conference on Computing and Informatics, 2013. pp 6–14.

Deneubourg JL, Goss S, Franks N, Sendova-Franks A, Detrain C, Chrétien L. The dynamics of collective sorting robot-like ants and ant-like robots. In: Proceedings of the International Conference on Simulation of Adaptive Behavior on From Animals to Animats, 1990. pp 356–363.

Radoop [Online]. https://rapidminer.com/products/radoop/ . Accessed 2 Feb 2015.

PigMix [Online]. https://cwiki.apache.org/confluence/display/PIG/PigMix . Accessed 2 Feb 2015.

GridMix [Online]. http://hadoop.apache.org/docs/r1.2.1/gridmix.html . Accessed 2 Feb 2015.

TeraSoft [Online]. http://sortbenchmark.org/ . Accessed 2 Feb 2015.

TPC, transaction processing performance council [Online]. http://www.tpc.org/ . Accessed 2 Feb 2015.

Cooper BF, Silberstein A, Tam E, Ramakrishnan R, Sears R. Benchmarking cloud serving systems with ycsb. In: Proceedings of the ACM Symposium on Cloud Computing, 2010. pp 143–154.

Ghazal A, Rabl T, Hu M, Raab F, Poess M, Crolotte A, Jacobsen HA. BigBench: Towards an industry standard benchmark for big data analytics. In: Proceedings of the ACM SIGMOD International Conference on Management of Data, 2013. pp 1197–1208.

Cheptsov A. Hpc in big data age: An evaluation report for java-based data-intensive applications implemented with hadoop and openmpi. In: Proceedings of the European MPI Users’ Group Meeting, 2014. pp 175:175–175:180.

Yuan LY, Wu L, You JH, Chi Y. Rubato db: A highly scalable staged grid database system for oltp and big data applications. In: Proceedings of the ACM International Conference on Conference on Information and Knowledge Management, 2014. pp 1–10.

Zhao JM, Wang WS, Liu X, Chen YF. Big data benchmark - big DS. In: Proceedings of the Advancing Big Data Benchmarks, 2014, pp. 49–57.

 Saletore V, Krishnan K, Viswanathan V, Tolentino M. HcBench: Methodology, development, and full-system characterization of a customer usage representative big data/hadoop benchmark. In: Advancing Big Data Benchmarks, 2014. pp 73–93.

Zhang L, Stoffel A, Behrisch M,  Mittelstadt S, Schreck T, Pompl R, Weber S, Last H, Keim D. Visual analytics for the big data era—a comparative review of state-of-the-art commercial systems. In: Proceedings of the IEEE Conference on Visual Analytics Science and Technology, 2012. pp 173–182.

Harati A, Lopez S, Obeid I, Picone J, Jacobson M, Tobochnik S. The TUH EEG CORPUS: A big data resource for automated eeg interpretation. In: Proceeding of the IEEE Signal Processing in Medicine and Biology Symposium, 2014. pp 1–5.

Thusoo A, Sarma JS, Jain N, Shao Z, Chakka P, Anthony S, Liu H, Wyckoff P, Murthy R. Hive: a warehousing solution over a map-reduce framework. Proc VLDB Endowment. 2009;2(2):1626–9.

Beckmann M, Ebecken NFF, de Lima BSLP, Costa MA. A user interface for big data with rapidminer. RapidMiner World, Boston, MA, Tech. Rep., 2014. [Online]. Available: http://www.slideshare.net/RapidMiner/a-user-interface-for-big-data-with-rapidminer-marcelo-beckmann .

Januzaj E, Kriegel HP, Pfeifle M. DBDC: Density based distributed clustering. In: Proceedings of the Advances in Database Technology, 2004; vol. 2992, 2004, pp 88–105.

Zhao W, Ma H, He Q. Parallel k-means clustering based on mapreduce. Proceedings Cloud Comp. 2009;5931:674–9.

Nolan RL. Managing the crises in data processing. Harvard Bus Rev. 1979;57(1):115–26.

Tsai CW, Huang WC, Chiang MC. Recent development of metaheuristics for clustering. In: Proceedings of the Mobile, Ubiquitous, and Intelligent Computing, 2014; vol. 274, pp. 629–636.

Download references

Authors’ contributions

CWT contributed to the paper review and drafted the first version of the manuscript. CFL contributed to the paper collection and manuscript organization. HCC and AVV double-checked the manuscript and provided several ideas for improving it. All authors read and approved the final manuscript.

Acknowledgements

The authors would like to thank the anonymous reviewers for their valuable comments and suggestions on the paper. This work was supported in part by the Ministry of Science and Technology of Taiwan, R.O.C., under Contracts MOST103-2221-E-197-034, MOST104-2221-E-197-005, and MOST104-2221-E-197-014.

Compliance with ethical guidelines

Competing interests: The authors declare that they have no competing interests.

Author information

Authors and affiliations

Department of Computer Science and Information Engineering, National Ilan University, Yilan, Taiwan

Chun-Wei Tsai & Han-Chieh Chao

Institute of Computer Science and Information Engineering, National Chung Cheng University, Chia-Yi, Taiwan

Chin-Feng Lai

Information Engineering College, Yangzhou University, Yangzhou, Jiangsu, China

Han-Chieh Chao

School of Information Science and Engineering, Fujian University of Technology, Fuzhou, Fujian, China

Department of Computer Science, Electrical and Space Engineering, Luleå University of Technology, SE-931 87, Skellefteå, Sweden

Athanasios V. Vasilakos


Corresponding author

Correspondence to Athanasios V. Vasilakos.

Rights and permissions

Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.


About this article

Cite this article

Tsai, CW., Lai, CF., Chao, HC. et al. Big data analytics: a survey. Journal of Big Data 2, 21 (2015). https://doi.org/10.1186/s40537-015-0030-3


Received: 14 May 2015

Accepted: 02 September 2015

Published: 01 October 2015

DOI: https://doi.org/10.1186/s40537-015-0030-3


Keywords

  • data analytics
  • data mining
