Data Science and Analytics: An Overview from Data-Driven Smart Computing, Decision-Making and Applications Perspective

Iqbal H. Sarker

1 Swinburne University of Technology, Melbourne, VIC 3122 Australia

2 Department of Computer Science and Engineering, Chittagong University of Engineering & Technology, Chittagong, 4349 Bangladesh

Abstract

The digital world has a wealth of data, such as internet of things (IoT) data, business data, health data, mobile data, urban data, security data, and many more, in the current age of the Fourth Industrial Revolution (Industry 4.0 or 4IR). Extracting knowledge or useful insights from these data can be used for smart decision-making in various application domains. In the area of data science, advanced analytics methods including machine learning modeling can provide actionable insights or deeper knowledge about data, which makes the computing process automatic and smart. In this paper, we present a comprehensive view on “Data Science” including various types of advanced analytics methods that can be applied to enhance the intelligence and capabilities of an application through smart decision-making in different scenarios. We also discuss and summarize ten potential real-world application domains, including business, healthcare, cybersecurity, urban and rural data science, and so on, by taking into account data-driven smart computing and decision making. Based on this, we finally highlight the challenges and potential research directions within the scope of our study. Overall, this paper aims to serve as a reference point on data science and advanced analytics for researchers, decision-makers, and application developers, particularly from the data-driven solution point of view for real-world problems.

Introduction

We are living in the age of “data science and advanced analytics”, where almost everything in our daily lives is digitally recorded as data [ 17 ]. Thus the current electronic world holds a wealth of various kinds of data, such as business data, financial data, healthcare data, multimedia data, internet of things (IoT) data, cybersecurity data, social media data, etc. [ 112 ]. These data can be structured, semi-structured, or unstructured, and their volume increases day by day [ 105 ]. Data science is typically a “concept to unify statistics, data analysis, and their related methods” in order to understand and analyze actual phenomena with data. According to Cao et al. [ 17 ], “data science is the science of data” or “data science is the study of data”, where a data product is a data deliverable, or data-enabled or guided, which can be a discovery, prediction, service, suggestion, insight into decision-making, thought, model, paradigm, tool, or system. The popularity of “Data science” is increasing day by day, as shown in Fig. 1 according to Google Trends data over the last 5 years [ 36 ]. In addition to data science, the figure also shows the popularity trends of the related areas “Data analytics”, “Data mining”, “Big data”, and “Machine learning”. According to Fig. 1, the popularity indication values for these data-driven domains, particularly “Data science” and “Machine learning”, are increasing day by day. This statistical information and the applicability of data-driven smart decision-making in various real-world application areas motivate us to briefly study “Data science” and machine-learning-based “Advanced analytics” in this paper.

Fig. 1 The worldwide popularity score of data science compared with relevant areas, on a scale of 0 (min) to 100 (max) over time, where the x-axis represents the timestamp information and the y-axis the corresponding score

Usually, data science is the field of applying advanced analytics methods and scientific concepts to derive useful business information from data. Advanced analytics places greater emphasis on anticipating the use of data to detect patterns and determine what is likely to occur in the future. Basic analytics offers a general description of data, while advanced analytics is a step forward, offering a deeper understanding of data and helping to analyze granular data, which is our interest here. In the field of data science, several types of analytics are popular, such as “Descriptive analytics”, which answers the question of what happened; “Diagnostic analytics”, which answers the question of why it happened; “Predictive analytics”, which predicts what will happen in the future; and “Prescriptive analytics”, which prescribes what action should be taken, discussed briefly in “ Advanced analytics methods and smart computing ”. Such advanced analytics and decision-making based on machine learning techniques [ 105 ], a major part of artificial intelligence (AI) [ 102 ], can also play a significant role in the Fourth Industrial Revolution (Industry 4.0) due to their learning capability for smart computing as well as automation [ 121 ].

Although the area of “data science” is huge, we mainly focus on deriving useful insights through advanced analytics, where the results are used to make smart decisions in various real-world application areas. For this, various advanced analytics methods such as machine learning modeling, natural language processing, sentiment analysis, neural network, or deep learning analysis can provide deeper knowledge about data, and thus can be used to develop data-driven intelligent applications. More specifically, regression analysis, classification, clustering analysis, association rules, time-series analysis, sentiment analysis, behavioral patterns, anomaly detection, factor analysis, log analysis, and deep learning, which originated from artificial neural networks, are taken into account in our study. These machine learning-based advanced analytics methods are discussed briefly in “ Advanced analytics methods and smart computing ”. Thus, it is important to understand the principles of the various advanced analytics methods mentioned above and how they can be applied in various real-world application areas. For instance, in our earlier paper Sarker et al. [ 114 ], we discussed how data science and machine learning modeling can play a significant role in the domain of cybersecurity for making smart decisions and providing data-driven intelligent security services. In this paper, we broadly take into account the data science application areas and real-world problems in ten potential domains, including business data science, health data science, IoT data science, behavioral data science, urban data science, and so on, discussed briefly in “ Real-world application domains ”.

Based on the importance of machine learning modeling for extracting useful insights from the data mentioned above, and of data-driven smart decision-making, in this paper we present a comprehensive view on “Data Science”, including various types of advanced analytics methods that can be applied to enhance the intelligence and the capabilities of an application. The key contribution of this study is thus understanding data science modeling, explaining different analytics methods from a solution perspective, and showing their applicability in the various real-world data-driven application areas mentioned earlier. Overall, the purpose of this paper is, therefore, to provide a basic guide or reference for those in academia and industry who want to study, research, and develop automated and intelligent applications or systems based on smart computing and decision making within the area of data science.

The main contributions of this paper are summarized as follows:

  • To define the scope of our study towards data-driven smart computing and decision-making in real-world settings. We also briefly discuss the concept of data science modeling, from business problems to data products and automation, to understand its applicability for providing intelligent services in real-world scenarios.
  • To provide a comprehensive view on data science including advanced analytics methods that can be applied to enhance the intelligence and the capabilities of an application.
  • To discuss the applicability and significance of machine learning-based analytics methods in various real-world application areas. We also summarize ten potential real-world application areas, from business to personalized applications in our daily life, where advanced analytics with machine learning modeling can be used to achieve the expected outcome.
  • To highlight and summarize the challenges and potential research directions within the scope of our study.

The rest of the paper is organized as follows. The next section provides the background and related work and defines the scope of our study. The following section presents the concepts of data science modeling for building a data-driven application. After that, we briefly discuss and explain different advanced analytics methods and smart computing. Various real-world application areas are discussed and summarized in the next section. We then highlight and summarize several research issues and potential future directions, and finally, the last section concludes this paper.

Background and Related Work

In this section, we first discuss various data terms and works related to data science and highlight the scope of our study.

Data Terms and Definitions

There is a range of key terms in the field, such as data analysis, data mining, data analytics, big data, data science, advanced analytics, machine learning, and deep learning, which are highly related and easily confused. In the following, we define these terms and differentiate them from the term “Data Science” according to our goal.

The term “Data analysis” refers to the processing of data by conventional (e.g., classic statistical, empirical, or logical) theories, technologies, and tools for extracting useful information and for practical purposes [ 17 ]. The term “Data analytics”, on the other hand, refers to the theories, technologies, instruments, and processes that allow for an in-depth understanding and exploration of actionable data insight [ 17 ]. Statistical and mathematical analysis of the data is the major concern in this process. “Data mining” is another popular term over the last decade, which has a similar meaning to several other terms such as knowledge mining from data, knowledge extraction, knowledge discovery from data (KDD), data/pattern analysis, data archaeology, and data dredging. According to Han et al. [ 38 ], it should have been more appropriately named “knowledge mining from data”. Overall, data mining is defined as the process of discovering interesting patterns and knowledge from large amounts of data [ 38 ]. Data sources may include databases, data centers, the Internet or Web, other repositories of data, or data dynamically streamed through the system. “Big data” is another popular term nowadays, which may change the statistical and data analysis approaches, as it has the unique features of being “massive, high dimensional, heterogeneous, complex, unstructured, incomplete, noisy, and erroneous” [ 74 ]. Big data can be generated by mobile devices, social networks, the Internet of Things, multimedia, and many other new applications [ 129 ]. Several unique features, including volume, velocity, variety, veracity, value (5Vs), and complexity, are used to understand and describe big data [ 69 ].

In terms of analytics, basic analytics provides a summary of data, whereas “Advanced Analytics” takes a step forward in offering a deeper understanding of data and helps to analyze granular data. Advanced analytics is characterized or defined as autonomous or semi-autonomous analysis of data or content using advanced techniques and methods to discover deeper insights, make predictions, or generate recommendations, typically going beyond traditional business intelligence or analytics. “Machine learning”, a branch of artificial intelligence (AI), is one of the major techniques used in advanced analytics and can automate analytical model building [ 112 ]. It is based on the premise that systems can learn from data, recognize patterns, and make decisions with minimal human involvement [ 38 , 115 ]. “Deep learning” is a subfield of machine learning based on algorithms inspired by the structure and function of the human brain, called artificial neural networks [ 38 , 139 ].

Unlike the above data-related terms, “Data science” is an umbrella term that encompasses advanced data analytics, data mining, machine and deep learning modeling, and several other related disciplines like statistics, to extract insights or useful knowledge from datasets and transform them into actionable business strategies. In [ 17 ], Cao et al. defined data science from the disciplinary perspective as “data science is a new interdisciplinary field that synthesizes and builds on statistics, informatics, computing, communication, management, and sociology to study data and its environments (including domains and other contextual aspects, such as organizational and social aspects) to transform data to insights and decisions by following a data-to-knowledge-to-wisdom thinking and methodology”. In “ Understanding data science modeling ”, we briefly discuss data science modeling from a practical perspective, starting from business problems to data products, which can assist data scientists in thinking and working within a particular real-world problem domain in the area of data science and analytics.

Related Work

Several researchers have reviewed data science and its significance in this area. For example, the authors in [ 19 ] identify the evolving field of data science and its importance in the broader knowledge environment, along with some issues that differentiate data science and informatics from conventional approaches in the information sciences. Donoho et al. [ 27 ] present 50 years of data science, including recent commentary on data science in the mass media and on how or whether data science differs from statistics. The authors formally conceptualize the theory-guided data science (TGDS) model in [ 53 ] and present a taxonomy of research themes in TGDS. Cao et al. include a detailed survey and tutorial on the fundamental aspects of data science in [ 17 ], which considers the transition from data analysis to data science, the principles of data science, as well as the discipline and competence of data education.

Besides, the authors include a data science analysis in [ 20 ], which aims to provide a realistic overview of the use of statistical features and related data science methods in bioimage informatics. The authors in [ 61 ] study the key streams of data science algorithm use at central banks and show how their popularity has risen over time. This research contributes to the creation of a research vector on the role of data science in central banking. In [ 62 ], the authors provide an overview and tutorial on the data-driven design of intelligent wireless networks. The authors in [ 87 ] provide a thorough understanding of computational optimal transport with application to data science. In [ 97 ], the authors present data science as theoretical contributions in information systems via text analytics.

Unlike the above recent studies, in this paper, we concentrate on the knowledge of data science including advanced analytics methods, machine learning modeling, real-world application domains, and potential research directions within the scope of our study. The advanced analytics methods based on machine learning techniques discussed in this paper can be applied to enhance the capabilities of an application in terms of data-driven intelligent decision making and automation in the final data product or systems.

Understanding Data Science Modeling

In this section, we briefly discuss how data science can play a significant role in the real-world business process. For this, we first categorize various types of data and then discuss the major steps of data science modeling, starting from business problems to data products and automation.

Types of Real-World Data

Typically, to build a data-driven real-world system in a particular domain, the availability of data is the key [ 17 , 112 , 114 ]. The data can be in different types such as (i) Structured—that has a well-defined data structure and follows a standard order, examples are names, dates, addresses, credit card numbers, stock information, geolocation, etc.; (ii) Unstructured—has no pre-defined format or organization, examples are sensor data, emails, blog entries, wikis, and word processing documents, PDF files, audio files, videos, images, presentations, web pages, etc.; (iii) Semi-structured—has elements of both the structured and unstructured data containing certain organizational properties, examples are HTML, XML, JSON documents, NoSQL databases, etc.; and (iv) Metadata—that represents data about the data, examples are author, file type, file size, creation date and time, last modification date and time, etc. [ 38 , 105 ].

In the area of data science, researchers use various widely-used datasets for different purposes. These are, for example, cybersecurity datasets such as NSL-KDD [ 127 ], UNSW-NB15 [ 79 ], Bot-IoT [ 59 ], ISCX’12 [ 15 ], CIC-DDoS2019 [ 22 ], etc., smartphone datasets such as phone call logs [ 88 , 110 ], mobile application usages logs [ 124 , 149 ], SMS Log [ 28 ], mobile phone notification logs [ 77 ] etc., IoT data [ 56 , 11 , 64 ], health data such as heart disease [ 99 ], diabetes mellitus [ 86 , 147 ], COVID-19 [ 41 , 78 ], etc., agriculture and e-commerce data [ 128 , 150 ], and many more in various application domains. In “ Real-world application domains ”, we discuss ten potential real-world application domains of data science and analytics by taking into account data-driven smart computing and decision making, which can help the data scientists and application developers to explore more in various real-world issues.

Overall, the data used in data-driven applications can be any of the types mentioned above, and they can differ from one application to another in the real world. Data science modeling, which is briefly discussed below, can be used to analyze such data in a specific problem domain and derive insights or useful information from the data to build a data-driven model or data product.

Steps of Data Science Modeling

Data science is typically an umbrella term that encompasses advanced data analytics, data mining, machine and deep learning modeling, and several other related disciplines like statistics, to extract insights or useful knowledge from datasets and transform them into actionable business strategies, as mentioned earlier in “ Background and related work ”. In this section, we briefly discuss how data science can play a significant role in the real-world business process. Figure 2 shows an example of data science modeling starting from real-world data to a data-driven product and automation. In the following, we briefly discuss each module of the data science process.

  • Understanding business problems: This involves gaining a clear understanding of the problem that needs to be solved, how it impacts the relevant organization or individuals, the ultimate goals for addressing it, and the relevant project plan. Thus, to understand and identify the business problems, data scientists formulate relevant questions while working with the end-users and other stakeholders. For instance, how much/many, which category/group, is the behavior unrealistic/abnormal, which option should be taken, what action, etc. could be relevant questions depending on the nature of the problem. This helps to get a better idea of what the business needs and what should be extracted from the data. Such business knowledge enables organizations to enhance their decision-making process, which is known as “Business Intelligence” [ 65 ]. Identifying the relevant data sources that can help to answer the formulated questions, and what kinds of actions should be taken from the trends that the data shows, is another important task associated with this stage. Once the business problem has been clearly stated, the data scientist can define the analytic approach to solve it.
  • Understanding data: Data science is largely driven by the availability of data [ 114 ]. Thus, a sound understanding of the data is needed to build a data-driven model or system. The reason is that real-world datasets are often noisy, contain missing values and inconsistencies, or have other data issues, which need to be handled effectively [ 101 ]. To gain actionable insights, appropriate, good-quality data must be sourced and cleansed, which is fundamental to any data science engagement. For this, a data assessment that evaluates what data are available and how they align with the business problem could be the first step in data understanding. Several aspects, such as data type/format, whether the quantity of data is sufficient to extract useful knowledge, data relevance, authorized access to data, feature or attribute importance, combining multiple data sources, important metrics to report the data, etc., need to be taken into account to clearly understand the data for a particular business problem. Overall, the data understanding module involves figuring out what data would be best needed and the best ways to acquire them.
  • Data pre-processing and exploration: Exploratory data analysis is defined in data science as an approach to analyzing datasets to summarize their key characteristics, often with visual methods [ 135 ]. This examines a broad data collection to discover initial trends, attributes, points of interest, etc. in an unstructured manner to construct meaningful summaries of the data. Thus, data exploration is typically used to figure out the gist of the data and to develop a first-step assessment of its quality, quantity, and characteristics. A statistical model may or may not be used, but primarily it offers tools for creating hypotheses by visualizing and interpreting the data through graphical representations such as charts, plots, histograms, etc. [ 72 , 91 ]. Before the data are ready for modeling, it is necessary to use data summarization and visualization to audit the quality of the data and provide the information needed to process them. To ensure data quality, data pre-processing, typically the process of cleaning and transforming raw data before processing and analysis [ 107 ], is important. It also involves reformatting information, making data corrections, and merging datasets to enrich the data. Thus, several aspects such as expected data, data cleaning, formatting or transforming data, dealing with missing values, handling data imbalance and bias issues, data distribution, searching for outliers or anomalies in the data and dealing with them, ensuring data quality, etc. could be the key considerations in this step.
  • Machine learning modeling and evaluation: Once the data are prepared for building the model, data scientists design a model, algorithm, or set of models to address the business problem. Model building depends on what type of analytics, e.g., predictive analytics, is needed to solve the particular problem, which is discussed briefly in “ Advanced analytics methods and smart computing ”. To best fit the data according to the type of analytics, different types of data-driven or machine learning models, summarized in our earlier paper Sarker et al. [ 105 ], can be built to achieve the goal. Data scientists typically separate the given dataset into training and test subsets, usually in an 80:20 ratio, or use the popular k-fold data splitting method [ 38 ]. This is to observe whether the model performs well on the data and to maximize the model performance. Various model validation and assessment metrics, such as error rate, accuracy, true positive, false positive, true negative, false negative, precision, recall, f-score, ROC (receiver operating characteristic curve) analysis, applicability analysis, etc. [ 38 , 115 ], are used to measure the model performance, which can guide the data scientists in choosing or designing the learning method or model (a minimal code sketch of this step is given just after this list). Besides, machine learning experts or data scientists can take into account several advanced techniques such as feature engineering, feature selection or extraction methods, algorithm tuning, ensemble methods, modifying existing algorithms, or designing new algorithms to improve the ultimate data-driven model and solve a particular business problem through smart decision making.
  • Data product and automation: A data product is typically the output of any data science activity [ 17 ]. A data product, in general terms, is a data deliverable, or data-enabled or guided product, which can be a discovery, prediction, service, suggestion, insight into decision-making, thought, model, paradigm, tool, application, or system that processes data and generates results. Businesses can use the results of such data analysis to obtain useful information like churn (a measure of how many customers stop using a product) prediction and customer segmentation, and use these results to make smarter business decisions and automation. Thus, to make better decisions in various business problems, various machine learning pipelines and data products can be developed. To highlight this, we summarize several potential real-world data science application areas in “ Real-world application domains ”, where various data products can play a significant role in relevant business problems to make them smart and automated.
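
As a concrete illustration of the modeling and evaluation step above, the following minimal sketch performs an 80:20 train/test split, k-fold cross-validation, and standard metric computation with scikit-learn; the synthetic dataset and the choice of a random forest classifier are purely illustrative assumptions.

```python
# A minimal sketch of the modeling-and-evaluation step described above:
# an 80:20 train/test split, k-fold cross-validation, and common metrics.
# The synthetic dataset and the choice of classifier are illustrative only.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

X, y = make_classification(n_samples=1000, n_features=20, random_state=42)

# 80:20 train/test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

model = RandomForestClassifier(n_estimators=100, random_state=42)

# k-fold cross-validation on the training data (k = 5)
cv_scores = cross_val_score(model, X_train, y_train, cv=5)
print("5-fold CV accuracy: %.3f (+/- %.3f)" % (cv_scores.mean(), cv_scores.std()))

# Fit on the training set and evaluate on the held-out test set
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
print("Accuracy :", accuracy_score(y_test, y_pred))
print("Precision:", precision_score(y_test, y_pred))
print("Recall   :", recall_score(y_test, y_pred))
print("F1-score :", f1_score(y_test, y_pred))
```

In practice, the model family, the split ratio, and the evaluation metrics are chosen according to the business problem and the type of analytics required.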

Overall, we can conclude that data science modeling can be used to help drive changes and improvements in business practices. A crucial part of the data science process is developing a deep understanding of the business problem to be solved. Without that, it would be much harder to gather the right data and extract the most useful information from the data for making decisions to solve the problem. In terms of role, “Data Scientists” typically interpret and manage data to uncover the answers to major questions that help organizations make objective decisions and solve complex problems. In summary, a data scientist proactively gathers and analyzes information from multiple sources to better understand how the business performs, and designs machine learning or data-driven tools, methods, or algorithms, focused on advanced analytics, which can make today’s computing process smarter and more intelligent, as discussed briefly in the following section.

Fig. 2 An example of data science modeling from real-world data to a data-driven system and decision making

Advanced Analytics Methods and Smart Computing

As mentioned earlier in “ Background and related work ”, basic analytics provides a summary of data, whereas advanced analytics takes a step forward in offering a deeper understanding of data and helps in granular data analysis. For instance, the predictive capabilities of advanced analytics can be used to forecast trends, events, and behaviors. Thus, “advanced analytics” can be defined as the autonomous or semi-autonomous analysis of data or content using advanced techniques and methods to discover deeper insights, make predictions, or produce recommendations, where machine learning-based analytical modeling is considered the key technology in the area. In the following, we first summarize the various types of analytics and the outcomes that are needed to solve the associated business problems, and then we briefly discuss machine learning-based analytical modeling.

Types of Analytics and Outcome

In the real-world business process, several key questions such as “What happened?”, “Why did it happen?”, “What will happen in the future?”, and “What action should be taken?” are common and important. Based on these questions, in this paper we categorize the analytics into four types: descriptive, diagnostic, predictive, and prescriptive, which are discussed below.

  • Descriptive analytics: This is the interpretation of historical data to better understand the changes that have occurred in a business. Thus, descriptive analytics answers the question “what happened in the past?” by summarizing past data such as statistics on sales and operations or marketing strategies, use of social media, and engagement with Twitter, LinkedIn or Facebook, etc. For instance, by analyzing trends, patterns, and anomalies in customers’ historical shopping data, descriptive analytics can provide the basis for predicting the probability of a customer purchasing a product. Thus, descriptive analytics can play a significant role in providing an accurate picture of what has occurred in a business and how it relates to previous periods, utilizing a broad range of relevant business data. As a result, managers and decision-makers can pinpoint areas of strength and weakness in their business, and eventually adopt more effective management strategies and business decisions.
  • Diagnostic analytics: It is a form of advanced analytics that examines data or content to answer the question, “why did it happen?” The goal of diagnostic analytics is to help to find the root cause of the problem. For example, the human resource management department of a business organization may use these diagnostic analytics to find the best applicant for a position, select them, and compare them to other similar positions to see how well they perform. In a healthcare example, it might help to figure out whether the patients’ symptoms such as high fever, dry cough, headache, fatigue, etc. are all caused by the same infectious agent. Overall, diagnostic analytics enables one to extract value from the data by posing the right questions and conducting in-depth investigations into the answers. It is characterized by techniques such as drill-down, data discovery, data mining, and correlations.
  • Predictive analytics: Predictive analytics is an important analytical technique used by many organizations for various purposes, such as to assess business risks, anticipate potential market patterns, and decide when maintenance is needed, in order to enhance their business. It is a form of advanced analytics that examines data or content to answer the question “what will happen in the future?”, and its primary goal is to answer this question with a high degree of probability. Data scientists can use historical data as a source to extract insights for building predictive models using various regression analyses and machine learning techniques, which can be used in various application domains for a better outcome. For example, companies can use predictive analytics to minimize costs by better anticipating future demand and adjusting output and inventory; banks and other financial institutions can reduce fraud and risks by predicting suspicious activity; medical specialists can make effective decisions by predicting patients who are at risk of diseases; retailers can increase sales and customer satisfaction by understanding and predicting customer preferences; manufacturers can optimize production capacity by predicting maintenance requirements; and many more. Thus, predictive analytics can be considered the core analytical method within the area of data science.
  • Prescriptive analytics: Prescriptive analytics focuses on recommending the best way forward with actionable information to maximize overall returns and profitability, and typically answers the question “what action should be taken?” In business analytics, prescriptive analytics is considered the final step. For its models, prescriptive analytics collects data from several descriptive and predictive sources and applies it to the decision-making process. Thus, we can say that it is related to both descriptive and predictive analytics, but it emphasizes actionable insights instead of data monitoring. In other words, it can be considered the opposite of descriptive analytics, which examines decisions and outcomes after the fact. By integrating big data, machine learning, and business rules, prescriptive analytics helps organizations make more informed decisions that produce the most successful business outcomes.

In summary, both descriptive analytics and diagnostic analytics look at the past to clarify what happened and why it happened. Predictive analytics and prescriptive analytics use historical data to forecast what will happen in the future and what steps should be taken to influence those outcomes. In Table 1, we have summarized these analytics methods with examples. Forward-thinking organizations in the real world can jointly use these analytical methods to make smart decisions that help drive changes and improvements in business processes. In the following, we discuss how machine learning techniques can play a big role in these analytical methods through their learning capabilities from the data.

Table 1 Various types of analytical methods with examples

Analytical method | Data-driven model building | Examples
Descriptive analytics | Answers the question, “what happened in the past?” | Summarizing past events, e.g., sales, business data, social media usage, reporting general trends, etc.
Diagnostic analytics | Answers the question, “why did it happen?” | Identifying anomalies and determining causal relationships, e.g., finding the reason for a business loss, identifying the influence of medications, etc.
Predictive analytics | Answers the question, “what will happen in the future?” | Predicting customer preferences, recommending products, identifying possible security breaches, predicting staff and resource needs, etc.
Prescriptive analytics | Answers the question, “what action should be taken?” | Improving business management and maintenance, improving patient care and healthcare administration, determining optimal marketing strategies, etc.

Machine Learning Based Analytical Modeling

In this section, we briefly discuss various advanced analytics methods based on machine learning modeling, which can make the computing process smart through intelligent decision-making in a business process. Figure 3 shows the general structure of machine learning-based predictive modeling, considering both the training and testing phases. In the following, we discuss a wide range of methods such as regression and classification analysis, association rule analysis, time-series analysis, behavioral analysis, log analysis, and so on within the scope of our study.

Fig. 3 A general structure of a machine learning-based predictive model considering both the training and testing phases

Regression Analysis

In data science, one of the most common statistical approaches used for predictive modeling and data mining tasks is regression [ 38 ]. Regression analysis is a form of supervised machine learning that examines the relationship between a dependent variable (target) and independent variables (predictors) to predict a continuous-valued output [ 105 , 117 ]. Equations 1, 2, and 3 [ 85 , 105 ] represent simple, multiple (or multivariate), and polynomial regression, respectively, where x represents an independent variable and y is the predicted/target output mentioned above:

y = a + bx  (1)
y = a + b1x1 + b2x2 + ... + bnxn  (2)
y = a + b1x + b2x^2 + ... + bnx^n  (3)
Regression analysis is typically conducted for one of two purposes: to predict the value of the dependent variable for individuals for whom some knowledge of the explanatory variables is available, or to estimate the effect of some explanatory variable on the dependent variable, i.e., to find the relationship of causal influence between the variables. Linear regression cannot fit non-linear data and may cause an underfitting problem. In that case, polynomial regression performs better; however, it increases model complexity. Regularization techniques such as Ridge, Lasso, Elastic-Net, etc. [ 85 , 105 ] can be used to optimize the linear regression model. Besides, support vector regression, decision tree regression, and random forest regression techniques [ 85 , 105 ] can be used for building effective regression models depending on the problem type, e.g., non-linear tasks. Financial forecasting or prediction, cost estimation, trend analysis, marketing, time-series estimation, drug response modeling, etc. are some examples where regression models can be used to solve real-world problems in the domain of data science and analytics.
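
To make the above concrete, here is a small, hedged sketch that fits a simple linear model, a polynomial model, and a Ridge-regularized polynomial model with scikit-learn; the synthetic, non-linear data and the degree-2 polynomial are illustrative assumptions.

```python
# Illustrative sketch of the regression variants discussed above (linear,
# polynomial, and ridge-regularized), fitted on synthetic data with scikit-learn.
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.preprocessing import PolynomialFeatures
from sklearn.pipeline import make_pipeline

rng = np.random.RandomState(0)
X = np.sort(rng.uniform(-3, 3, size=(100, 1)), axis=0)
y = 0.5 * X.ravel() ** 2 + X.ravel() + rng.normal(scale=0.5, size=100)  # non-linear target

linear = LinearRegression().fit(X, y)                       # y = a + bx
poly = make_pipeline(PolynomialFeatures(degree=2),
                     LinearRegression()).fit(X, y)          # y = a + b1*x + b2*x^2
ridge = make_pipeline(PolynomialFeatures(degree=2),
                      Ridge(alpha=1.0)).fit(X, y)           # regularized polynomial fit

for name, model in [("linear", linear), ("polynomial", poly), ("ridge", ridge)]:
    print(name, "R^2 =", round(model.score(X, y), 3))
```

On data like this, the polynomial and regularized fits are expected to score higher than the plain linear fit, which underfits the curved relationship.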

Classification Analysis

Classification is one of the most widely used and best-known data science processes. It is a form of supervised machine learning in which a class label is predicted for a given example, i.e., a predictive modeling problem [ 38 ]. Spam identification, such as ‘spam’ and ‘not spam’ in email service providers, is an example of a classification problem. Several forms of classification analysis exist in the area, such as binary classification, which refers to the prediction of one of two classes; multi-class classification, which involves the prediction of one of more than two classes; and multi-label classification, a generalization of multi-class classification in which an example may be assigned more than one class label [ 105 ].

Several popular classification techniques exist to solve classification problems, such as k-nearest neighbors [ 5 ], support vector machines [ 55 ], naive Bayes [ 49 ], adaptive boosting [ 32 ], extreme gradient boosting [ 85 ], logistic regression [ 66 ], decision trees such as ID3 [ 92 ] and C4.5 [ 93 ], and random forests [ 13 ]. Tree-based classification techniques, e.g., random forests built from multiple decision trees, often perform better than others on real-world problems due to their capability of producing logic rules [ 103 , 115 ]. Figure 4 shows an example of a random forest structure considering multiple decision trees. In addition, BehavDT, recently proposed by Sarker et al. [ 109 ], and IntrudTree [ 106 ] can be used for building effective classification or prediction models in the relevant tasks within the domain of data science and analytics.

Fig. 4 An example of a random forest structure considering multiple decision trees
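
As an illustrative sketch (not tied to any specific dataset in this paper), the snippet below trains three of the classifiers named above, naive Bayes, a decision tree, and a random forest, on a toy dataset bundled with scikit-learn and compares their test accuracy.

```python
# Comparing a few of the classification techniques discussed above on a
# bundled toy dataset (binary classification); dataset choice is illustrative.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

X, y = load_breast_cancer(return_X_y=True)   # binary 'malignant' vs 'benign' task
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)

classifiers = {
    "naive Bayes": GaussianNB(),
    "decision tree": DecisionTreeClassifier(random_state=1),
    "random forest": RandomForestClassifier(n_estimators=200, random_state=1),
}
for name, clf in classifiers.items():
    clf.fit(X_train, y_train)
    print(name, "accuracy:", round(accuracy_score(y_test, clf.predict(X_test)), 3))
```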

Cluster Analysis

Clustering is an unsupervised machine learning technique that is well known in many data science application areas for statistical data analysis [ 38 ]. Usually, clustering techniques search for structure inside a dataset and, when class labels are not known in advance, group homogeneous cases together. This means that data points are similar to each other within a cluster and different from data points in other clusters. Overall, the purpose of cluster analysis is to sort various data points into groups (or clusters) that are homogeneous internally and heterogeneous externally [ 105 ]. Clustering is often used to gain insight into how data are distributed in a given dataset or as a preprocessing phase for other algorithms. Data clustering, for example, assists retail businesses with understanding customer shopping behavior, sales campaigns, and customer retention, as well as with anomaly detection, etc.

Many clustering algorithms with the ability to group data have been proposed in the machine learning and data science literature [ 98 , 138 , 141 ]. In our earlier paper Sarker et al. [ 105 ], we have summarized these from several perspectives, such as partitioning methods, density-based methods, hierarchical-based methods, model-based methods, etc. In the literature, the popular K-means [ 75 ], K-medoids [ 84 ], CLARA [ 54 ], etc. are known as partitioning methods; DBSCAN [ 30 ], OPTICS [ 8 ], etc. are known as density-based methods; and single linkage [ 122 ], complete linkage [ 123 ], etc. are known as hierarchical methods. In addition, grid-based clustering methods, such as STING [ 134 ], CLIQUE [ 2 ], etc.; model-based clustering such as neural network learning [ 141 ], GMM [ 94 ], SOM [ 18 , 104 ], etc.; and constraint-based methods such as COP K-means [ 131 ], CMWK-Means [ 25 ], etc. are used in the area. Recently, Sarker et al. [ 111 ] proposed a hierarchical clustering method, BOTS [ 111 ], based on a bottom-up agglomerative technique for capturing users’ similar behavioral characteristics over time. The key benefit of agglomerative hierarchical clustering is that the tree-structured hierarchy it creates is more informative than an unstructured set of flat clusters, which can assist in better decision-making in relevant application areas in data science.
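
The following short sketch applies three of the clustering families mentioned above, partitioning (K-means), hierarchical (agglomerative), and density-based (DBSCAN), to synthetic two-dimensional data using scikit-learn; the data and parameter values are illustrative assumptions.

```python
# Sketch of three clustering families named above: partitioning (K-means),
# hierarchical (agglomerative), and density-based (DBSCAN), on synthetic 2-D data.
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans, AgglomerativeClustering, DBSCAN

X, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.8, random_state=7)

kmeans_labels = KMeans(n_clusters=3, n_init=10, random_state=7).fit_predict(X)
agglo_labels = AgglomerativeClustering(n_clusters=3, linkage="complete").fit_predict(X)
dbscan_labels = DBSCAN(eps=0.5, min_samples=5).fit_predict(X)   # -1 marks noise points

print("K-means cluster sizes  :", [list(kmeans_labels).count(c) for c in set(kmeans_labels)])
print("Agglomerative sizes    :", [list(agglo_labels).count(c) for c in set(agglo_labels)])
print("DBSCAN labels (incl. noise = -1):", sorted(set(dbscan_labels)))
```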

Association Rule Analysis

Association rule learning is a rule-based, unsupervised machine learning method typically used to establish relationships among variables. It is a descriptive technique often used to analyze large datasets for discovering interesting relationships or patterns. The association learning technique’s main strength is its comprehensiveness, as it produces all associations that meet user-specified constraints, including minimum support and confidence values [ 138 ].

Association rules allow a data scientist to identify trends, associations, and co-occurrences between data sets inside large data collections. In a supermarket, for example, associations infer knowledge about the buying behavior of consumers for different items, which helps to change the marketing and sales plan. In healthcare, physicians may use association rules to better diagnose patients. Doctors can assess the conditional likelihood of a given illness by comparing symptom associations in the data from previous cases using association rules and machine learning-based data analysis. Similarly, association rules are useful for consumer behavior analysis and prediction, customer market analysis, bioinformatics, weblog mining, recommendation systems, etc.

Several types of association rules have been proposed in the area, such as frequent pattern based [ 4 , 47 , 73 ], logic-based [ 31 ], tree-based [ 39 ], fuzzy-rules [ 126 ], belief rule [ 148 ], etc. Rule learning techniques such as AIS [ 3 ], Apriori [ 4 ], Apriori-TID and Apriori-Hybrid [ 4 ], FP-Tree [ 39 ], Eclat [ 144 ], and RARM [ 24 ] exist to solve the relevant business problems. Among these association rule learning techniques, Apriori [ 4 ] is the most commonly used algorithm for discovering association rules from a given dataset [ 145 ]. The recent association rule-learning technique ABC-RuleMiner, proposed in our earlier paper by Sarker et al. [ 113 ], can give significant results in terms of generating non-redundant rules that can be used for smart decision making according to human preferences, within the area of data science applications.
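
As a hedged, minimal illustration of Apriori-style rule mining (not the ABC-RuleMiner method itself), the sketch below assumes the third-party mlxtend library is installed and mines rules from a handful of made-up market-basket transactions.

```python
# A hedged sketch of Apriori-style association rule mining, assuming the
# third-party mlxtend library; the market-basket transactions are made up.
import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import apriori, association_rules

transactions = [
    ["bread", "milk"],
    ["bread", "diaper", "beer", "eggs"],
    ["milk", "diaper", "beer", "cola"],
    ["bread", "milk", "diaper", "beer"],
    ["bread", "milk", "diaper", "cola"],
]

# One-hot encode the transactions, then mine frequent itemsets and rules
te = TransactionEncoder()
onehot = pd.DataFrame(te.fit(transactions).transform(transactions), columns=te.columns_)
frequent_itemsets = apriori(onehot, min_support=0.4, use_colnames=True)
rules = association_rules(frequent_itemsets, metric="confidence", min_threshold=0.7)

print(rules[["antecedents", "consequents", "support", "confidence"]])
```

The minimum support and confidence thresholds are the user-specified constraints mentioned above; lowering them produces more (but weaker) rules.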

Time-Series Analysis and Forecasting

A time series is typically a series of data points indexed in time order, particularly by date or timestamp [ 111 ]. Depending on the frequency, a time series can be of different types, such as annual (e.g., annual budget), quarterly (e.g., expenditure), monthly (e.g., air traffic), weekly (e.g., sales quantity), daily (e.g., weather), hourly (e.g., stock price), minute-wise (e.g., inbound calls in a call center), and even second-wise (e.g., web traffic), and so on in relevant domains.

A mathematical method for dealing with such time-series data, or the procedure of fitting a time series to a proper model, is termed time-series analysis. Many different time-series forecasting algorithms and analysis methods can be applied to extract the relevant information. For instance, to forecast future patterns, the autoregressive (AR) model [ 130 ] learns the behavioral trends or patterns of past data. The moving average (MA) [ 40 ] is another simple and common form of smoothing used in time-series analysis and forecasting, which uses past forecast errors in a regression-like model to elaborate an averaged trend across the data. The autoregressive moving average (ARMA) [ 12 , 120 ] combines these two approaches, where the autoregressive part extracts the momentum and pattern of the trend and the moving average part captures the noise effects. The most popular and frequently used time-series model is the autoregressive integrated moving average (ARIMA) model [ 12 , 120 ]. The ARIMA model, a generalization of the ARMA model, is more flexible than other statistical models such as exponential smoothing or simple linear regression. In terms of data, the ARMA model can only be used for stationary time-series data, while the ARIMA model also covers the non-stationary case. Similarly, the seasonal autoregressive integrated moving average (SARIMA), autoregressive fractionally integrated moving average (ARFIMA), and autoregressive moving average model with exogenous inputs (ARMAX) are also used as time-series models [ 120 ].

In addition to the stochastic methods for time-series modeling and forecasting, machine and deep learning-based approaches can be used for effective time-series analysis and forecasting. For instance, in our earlier paper, Sarker et al. [ 111 ] present a bottom-up clustering-based time-series analysis to capture the mobile usage behavioral patterns of users. Figure 5 shows an example of producing aggregate time segments Seg_i from initial time slices TS_i based on similar behavioral characteristics, as used in our bottom-up clustering approach, where D represents the dominant behavior BH_i of the users mentioned above [ 111 ]. The authors in [ 118 ] used a long short-term memory (LSTM) model, a kind of recurrent neural network (RNN) deep learning model, for time-series forecasting that outperforms traditional approaches such as the ARIMA model. Time-series analysis is commonly used these days in various fields such as finance, manufacturing, business, social media, event data (e.g., clickstreams and system events), IoT and smartphone data, and generally in any applied science and engineering domain with temporal measurements. Thus, it covers a wide range of application areas in data science.

Fig. 5 An example of producing aggregate time segments from initial time slices based on similar behavioral characteristics
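
A minimal forecasting sketch with a classical ARIMA model is shown below, using the statsmodels library on a synthetic monthly series; the order (p, d, q) = (1, 1, 1) is an illustrative assumption rather than a tuned choice.

```python
# A minimal ARIMA forecasting sketch with statsmodels on a synthetic monthly
# series; the order (p, d, q) = (1, 1, 1) is illustrative, not tuned.
import numpy as np
import pandas as pd
from statsmodels.tsa.arima.model import ARIMA

rng = np.random.RandomState(0)
idx = pd.date_range("2018-01-01", periods=60, freq="MS")        # monthly timestamps
series = pd.Series(100 + np.cumsum(rng.normal(1, 2, size=60)), index=idx)

model = ARIMA(series, order=(1, 1, 1))   # AR(1), first differencing, MA(1)
result = model.fit()
forecast = result.forecast(steps=6)      # forecast the next 6 months
print(forecast)
```

In practice, the (p, d, q) order is selected by inspecting autocorrelation plots or by information-criterion-based search, and seasonal variants such as SARIMA are used when the series has a seasonal component.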

Opinion Mining and Sentiment Analysis

Sentiment analysis or opinion mining is the computational study of the opinions, thoughts, emotions, assessments, and attitudes of people towards entities such as products, services, organizations, individuals, issues, events, topics, and their attributes [ 71 ]. There are three kinds of sentiments: positive, negative, and neutral, along with more extreme feelings such as angry, happy and sad, or interested or not interested, etc. More refined sentiments to evaluate the feelings of individuals in various situations can also be found according to the problem domain.

Although the task of opinion mining and sentiment analysis is very challenging from a technical point of view, it is very useful in real-world practice. For instance, a business always aims to obtain the opinion of the public or its customers about its products and services in order to refine its business policy and make better business decisions. It can thus benefit a business to understand the social opinion of its brand, product, or service. Besides, potential customers want to know what existing consumers think before they use a service or purchase a product. Document level, sentence level, aspect level, and concept level are the possible levels of opinion mining in the area [ 45 ].

Several popular techniques such as lexicon-based including dictionary-based and corpus-based methods, machine learning including supervised and unsupervised learning, deep learning, and hybrid methods are used in sentiment analysis-related tasks [ 70 ]. To systematically define, extract, measure, and analyze affective states and subjective knowledge, it incorporates the use of statistics, natural language processing (NLP), machine learning as well as deep learning methods. Sentiment analysis is widely used in many applications, such as reviews and survey data, web and social media, and healthcare content, ranging from marketing and customer support to clinical practice. Thus sentiment analysis has a big influence in many data science applications, where public sentiment is involved in various real-world issues.
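
As a small, hedged illustration of the supervised machine learning route to sentiment analysis (lexicon-based and deep learning approaches are omitted here), the snippet below trains a TF-IDF plus logistic regression pipeline on a few invented labeled reviews.

```python
# A small supervised sentiment-classification sketch (TF-IDF features plus
# logistic regression); the labeled reviews are invented for illustration.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

reviews = ["great product, works perfectly", "terrible service, very disappointed",
           "happy with the quality", "awful experience, will not buy again",
           "excellent support and fast delivery", "poor build, broke in a week"]
labels = ["positive", "negative", "positive", "negative", "positive", "negative"]

sentiment_model = make_pipeline(TfidfVectorizer(), LogisticRegression())
sentiment_model.fit(reviews, labels)

print(sentiment_model.predict(["the delivery was fast and the product is great",
                               "disappointed with the poor quality"]))
```

Real deployments train on far larger labeled corpora and often add neutral and finer-grained sentiment classes, as discussed above.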

Behavioral Data and Cohort Analysis

Behavioral analytics is a recent trend that typically reveals new insights into e-commerce sites, online gaming, mobile and smartphone applications, IoT user behavior, and many more [ 112 ]. Behavioral analysis aims to understand how and why consumers or users behave, allowing accurate predictions of how they are likely to behave in the future. For instance, it allows advertisers to make the best offers to the right client segments at the right time. Behavioral analytics uses the large quantities of raw user event data gathered during sessions in which people use apps, games, or websites, including traffic data such as navigation paths, clicks, social media interactions, purchase decisions, and marketing responsiveness. In our earlier papers Sarker et al. [ 101 , 111 , 113 ], we have discussed how to extract users’ phone usage behavioral patterns utilizing real-life phone log data for various purposes.

In the real-world scenario, behavioral analytics is often used in e-commerce, social media, call centers, billing systems, IoT systems, political campaigns, and other applications to find opportunities for optimization to achieve particular outcomes. Cohort analysis is a branch of behavioral analytics that involves studying groups of people over time to see how their behavior changes. For instance, it takes data from a given dataset (e.g., an e-commerce website, web application, or online game) and separates it into related groups for analysis. Various machine learning techniques such as behavioral data clustering [ 111 ], behavioral decision tree classification [ 109 ], behavioral association rules [ 113 ], etc. can be used in the area according to the goal. Besides, the concept of RecencyMiner, proposed in our earlier paper Sarker et al. [ 108 ], which takes into account recent behavioral patterns, could be effective while analyzing behavioral data, as behavior may not be static in the real world and can change over time.
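
The following hedged sketch shows the basic mechanics of cohort analysis with pandas: users are grouped by the month of their first recorded event (their cohort), and the number of active users is counted per cohort and month; the event records are invented for illustration.

```python
# A hedged sketch of cohort analysis with pandas: users are grouped by their
# signup month (the cohort) and monthly activity is counted per cohort.
# The event records are invented for illustration.
import pandas as pd

events = pd.DataFrame({
    "user": ["u1", "u1", "u2", "u2", "u3", "u3", "u1"],
    "event_date": pd.to_datetime(["2021-01-05", "2021-02-10", "2021-01-20",
                                  "2021-03-02", "2021-02-15", "2021-03-20", "2021-03-25"]),
})

# Each user's cohort is the month of their first recorded event
events["event_month"] = events["event_date"].dt.to_period("M")
events["cohort"] = events.groupby("user")["event_month"].transform("min")

# Count active users per (cohort, month) pair -- the basis of a retention table
cohort_table = (events.groupby(["cohort", "event_month"])["user"]
                      .nunique()
                      .unstack(fill_value=0))
print(cohort_table)
```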

Anomaly Detection or Outlier Analysis

Anomaly detection, also known as outlier analysis, is a data mining step that detects data points, events, and/or observations that deviate from the regularities or normal behavior of a dataset. Anomalies are usually referred to as outliers, abnormalities, novelties, noise, inconsistencies, irregularities, and exceptions [ 63 , 114 ]. Anomaly detection techniques may flag new situations or cases as deviant based on historical data by analyzing the data patterns. For instance, identifying fraud or irregular transactions in finance is an example of anomaly detection.

It is often used in preprocessing tasks for the removal of anomalous or inconsistent records from real-world data collected from various sources, including user logs, devices, networks, and servers. For anomaly detection, several machine learning techniques can be used, such as k-nearest neighbors, isolation forests, cluster analysis, etc. [ 105 ]. The exclusion of anomalous data from the dataset can also result in a statistically significant improvement in accuracy during supervised learning [ 101 ]. However, extracting appropriate features, identifying normal behaviors, managing imbalanced data distributions, addressing variations in abnormal behavior or irregularities, the sparse occurrence of abnormal events, environmental variations, etc. can be challenging in the process of anomaly detection. Anomaly detection is applicable in a variety of domains such as cybersecurity analytics, intrusion detection, fraud detection, fault detection, health analytics, identifying irregularities, detecting ecosystem disturbances, and many more. Thus, anomaly detection can be considered a significant task for building effective systems with higher accuracy within the area of data science.
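
A short sketch of one of the techniques named above, an isolation forest from scikit-learn, is given below; the synthetic data with a few manually injected outliers is an illustrative assumption.

```python
# Sketch of unsupervised anomaly detection with an isolation forest;
# a few manually injected outliers illustrate what gets flagged.
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.RandomState(42)
normal = rng.normal(loc=0.0, scale=1.0, size=(200, 2))       # regular observations
outliers = rng.uniform(low=6, high=9, size=(5, 2))           # injected anomalies
X = np.vstack([normal, outliers])

detector = IsolationForest(contamination=0.03, random_state=42)
labels = detector.fit_predict(X)                             # -1 = anomaly, 1 = normal

print("Flagged as anomalous:", np.where(labels == -1)[0])
```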

Factor Analysis

Factor analysis is a collection of techniques for describing the relationships or correlations between variables in terms of more fundamental entities known as factors [ 23 ]. It’s usually used to organize variables into a small number of clusters based on their common variance, where mathematical or statistical procedures are used. The goals of factor analysis are to determine the number of fundamental influences underlying a set of variables, calculate the degree to which each variable is associated with the factors, and learn more about the existence of the factors by examining which factors contribute to output on which variables. The broad purpose of factor analysis is to summarize data so that relationships and patterns can be easily interpreted and understood [ 143 ].

Exploratory factor analysis (EFA) and confirmatory factor analysis (CFA) are the two most popular factor analysis techniques. EFA seeks to discover complex trends by analyzing the dataset and testing predictions, while CFA tries to validate hypotheses and uses path analysis diagrams to represent variables and factors [ 143 ]. Factor analysis is also one of the unsupervised machine learning algorithms used for dimensionality reduction. The most common methods for factor analysis are principal component analysis (PCA), principal axis factoring (PAF), and maximum likelihood (ML) [ 48 ]. Correlation analysis methods such as Pearson correlation, canonical correlation, etc. may also be useful in the field, as they quantify the statistical relationship, or association, between two continuous variables. Factor analysis is commonly used in finance, marketing, advertising, product management, psychology, and operations research, and thus can be considered another significant analytical method within the area of data science.
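
To make the dimensionality-reduction view concrete, the sketch below fits a two-factor model and a two-component PCA with scikit-learn on the iris dataset; the dataset and the number of components are illustrative assumptions.

```python
# A short sketch of dimensionality reduction with factor analysis and PCA,
# two of the methods named above, on the iris dataset (chosen for illustration).
from sklearn.datasets import load_iris
from sklearn.decomposition import FactorAnalysis, PCA

X, _ = load_iris(return_X_y=True)          # 4 correlated measurement variables

fa = FactorAnalysis(n_components=2, random_state=0).fit(X)
pca = PCA(n_components=2).fit(X)

print("Factor loadings:\n", fa.components_.round(2))
print("PCA explained variance ratio:", pca.explained_variance_ratio_.round(3))
```

The factor loadings show how strongly each original variable is associated with each latent factor, which is the kind of interpretation factor analysis is used for.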

Log Analysis

Logs are commonly used in system management, as logs are often the only data available that record detailed system runtime activities or behaviors in production [ 44 ]. Log analysis can thus be considered the method of analyzing, interpreting, and understanding computer-generated records or messages, also known as logs. These can be device logs, server logs, system logs, network logs, event logs, audit trails, audit records, etc. The process of creating such records is called data logging.

Logs are generated by a wide variety of programmable technologies, including networking devices, operating systems, software, and more. Phone call logs [ 88 , 110 ], SMS logs [ 28 ], mobile application usage logs [ 124 , 149 ], notification logs [ 77 ], game logs [ 82 ], context logs [ 16 , 149 ], web logs [ 37 ], smartphone life logs [ 95 ], etc. are some examples of log data for smartphone devices. The main characteristic of these log data is that they contain users’ actual behavioral activities with their devices. Other similar log data include search logs [ 50 , 133 ], application logs [ 26 ], server logs [ 33 ], network logs [ 57 ], event logs [ 83 ], network and security logs [ 142 ], etc.

Several techniques such as classification and tagging, correlation analysis, pattern recognition methods, anomaly detection methods, machine learning modeling, etc. [ 105 ] can be used for effective log analysis. Log analysis can assist in compliance with security policies and industry regulations, as well as provide a better user experience by facilitating the troubleshooting of technical problems and identifying areas where efficiency can be improved. For instance, web servers use log files to record data about website visitors. Windows event log analysis can help an investigator draw a timeline based on the logging information and the discovered artifacts. Overall, advanced analytics methods that take into account machine learning modeling can play a significant role in extracting insightful patterns from these log data, which can be used for building automated and smart applications, and thus can be considered a key working area in data science.
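
A minimal log-analysis sketch using only the Python standard library is shown below: web-server-style log lines are parsed with a regular expression and summarized by status code and requested path; the log format and the lines themselves are illustrative assumptions.

```python
# A minimal log-analysis sketch using only the standard library: parse
# web-server-style log lines with a regular expression and count status codes
# and top paths. The log format and lines are illustrative assumptions.
import re
from collections import Counter

log_lines = [
    '10.0.0.1 - - [12/Mar/2021:10:01:02] "GET /index.html HTTP/1.1" 200 512',
    '10.0.0.2 - - [12/Mar/2021:10:01:05] "GET /login HTTP/1.1" 401 128',
    '10.0.0.1 - - [12/Mar/2021:10:01:09] "POST /login HTTP/1.1" 200 256',
    '10.0.0.3 - - [12/Mar/2021:10:02:00] "GET /admin HTTP/1.1" 403 64',
]
pattern = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) \S+" (?P<status>\d{3}) (?P<size>\d+)')

status_counts, path_counts = Counter(), Counter()
for line in log_lines:
    m = pattern.match(line)
    if m:
        status_counts[m.group("status")] += 1
        path_counts[m.group("path")] += 1

print("Status codes:", dict(status_counts))
print("Most requested paths:", path_counts.most_common(2))
```

Parsed records like these can then feed the correlation, pattern recognition, or anomaly detection methods mentioned above.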

Neural Networks and Deep Learning Analysis

Deep learning is a form of machine learning that uses artificial neural networks to create a computational architecture that learns from data by combining multiple processing layers, such as the input, hidden, and output layers [ 38 ]. The key benefit of deep learning over conventional machine learning methods is that it performs better in a variety of situations, particularly when learning from large datasets [ 114 , 140 ].

The most common deep learning algorithms are: multi-layer perceptron (MLP) [ 85 ], convolutional neural network (CNN or ConvNet) [ 67 ], and long short-term memory recurrent neural network (LSTM-RNN) [ 34 ]. Figure 6 shows a structure of an artificial neural network modeling with multiple processing layers. The backpropagation technique [ 38 ] is used to adjust the weight values internally while building the model. Convolutional neural networks (CNNs) [ 67 ] improve on the design of traditional artificial neural networks (ANNs) by adding convolutional layers, pooling layers, and fully connected layers. CNNs are commonly used in a variety of fields, including natural language processing, speech recognition, image processing, and other autocorrelated data, since they take advantage of the two-dimensional (2D) structure of the input data. AlexNet [ 60 ], Xception [ 21 ], Inception [ 125 ], Visual Geometry Group (VGG) [ 42 ], ResNet [ 43 ], and other advanced deep learning models based on CNN are also used in the field.

Fig. 6: A structure of an artificial neural network modeling with multiple processing layers
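As a minimal illustration of the multi-layer architecture sketched in Fig. 6 (our example using scikit-learn's MLPClassifier on a synthetic dataset, not a model from any cited study), the hidden layers and backpropagation-based weight updates are handled internally by the library:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

# Synthetic classification data standing in for a real business or IoT dataset
X, y = make_classification(n_samples=1000, n_features=20, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Multi-layer perceptron: input layer -> two hidden layers -> output layer.
# Weights are adjusted internally via backpropagation during fit().
mlp = MLPClassifier(hidden_layer_sizes=(64, 32), max_iter=500, random_state=42)
mlp.fit(X_train, y_train)
print("Test accuracy:", mlp.score(X_test, y_test))
```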

In addition to CNN, recurrent neural network (RNN) architecture is another popular method used in deep learning. Long short-term memory (LSTM) is a popular type of recurrent neural network architecture used broadly in the area of deep learning. Unlike traditional feed-forward neural networks, LSTM has feedback connections. Thus, LSTM networks are well-suited for analyzing and learning from sequential data, such as classifying, processing, and making predictions based on time-series data. Therefore, when the data is in a sequential format, such as time series or sentences, LSTM can be used, and it is widely applied in the areas of time-series analysis, natural language processing, speech recognition, and so on.
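A minimal sketch of an LSTM-based sequence model is shown below (assuming TensorFlow/Keras and a synthetic sine-wave series; the window length and layer sizes are arbitrary illustrative choices):

```python
import numpy as np
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense

# Synthetic time series: predict the next value from the previous 10 values
series = np.sin(np.arange(0, 100, 0.1))
window = 10
X = np.array([series[i:i + window] for i in range(len(series) - window)])
y = series[window:]
X = X.reshape((X.shape[0], window, 1))  # (samples, timesteps, features)

model = Sequential([
    LSTM(32, input_shape=(window, 1)),  # recurrent layer with feedback connections
    Dense(1)                            # output: next value in the sequence
])
model.compile(optimizer='adam', loss='mse')
model.fit(X, y, epochs=5, batch_size=32, verbose=0)
print("Predicted next value:", float(model.predict(X[-1:], verbose=0)[0, 0]))
```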

In addition to the most popular deep learning methods mentioned above, several other deep learning approaches [ 104 ] exist in the field for various purposes. The self-organizing map (SOM) [ 58 ], for example, uses unsupervised learning to represent high-dimensional data as a 2D grid map, reducing dimensionality. Another learning technique that is commonly used for dimensionality reduction and feature extraction in unsupervised learning tasks is the autoencoder (AE) [ 10 ]. Restricted Boltzmann machines (RBM) can be used for dimensionality reduction, classification, regression, collaborative filtering, feature learning, and topic modeling, according to [ 46 ]. A deep belief network (DBN) is usually made up of a backpropagation neural network (BPNN) and unsupervised networks like restricted Boltzmann machines (RBMs) or autoencoders [ 136 ]. A generative adversarial network (GAN) [ 35 ] is a deep learning network that can produce data with characteristics similar to the input data. Transfer learning, which is typically the re-use of a pre-trained model on a new problem, is now widely used because it can train deep neural networks with a small amount of data [ 137 ]. These deep learning methods can perform well, particularly when learning from large-scale datasets [ 105 , 140 ]. In our previous article Sarker et al. [ 104 ], we have summarized a brief discussion of various artificial neural networks (ANN) and deep learning (DL) models mentioned above, which can be used in a variety of data science and analytics tasks.
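As a small example of the unsupervised methods above, the following sketch compresses a toy dataset with an autoencoder (an assumed Keras setup with arbitrary layer sizes, for illustration only); the bottleneck layer plays the same dimensionality-reduction role described for AE:

```python
import numpy as np
from tensorflow.keras.models import Model
from tensorflow.keras.layers import Input, Dense

# Toy high-dimensional data (500 samples, 20 features)
X = np.random.rand(500, 20).astype("float32")

inputs = Input(shape=(20,))
encoded = Dense(3, activation="relu")(inputs)       # bottleneck: 3-dimensional code
decoded = Dense(20, activation="sigmoid")(encoded)  # reconstruction of the input

autoencoder = Model(inputs, decoded)
autoencoder.compile(optimizer="adam", loss="mse")
autoencoder.fit(X, X, epochs=20, batch_size=32, verbose=0)  # learns to reproduce X

# The encoder alone gives a reduced-dimensionality representation of the data
encoder = Model(inputs, encoded)
X_reduced = encoder.predict(X, verbose=0)
print("Reduced shape:", X_reduced.shape)  # (500, 3)
```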

Real-World Application Domains

Almost every industry or organization is impacted by data, and thus “Data Science”, including advanced analytics with machine learning modeling, can be used in business, marketing, finance, IoT systems, cybersecurity, urban management, health care, government policies, and every possible industry where data is generated. In the following, we discuss the ten most popular application areas based on data science and analytics.

  • Business or financial data science: In general, business data science can be considered as the study of business or e-commerce data to obtain insights about a business that can typically lead to smart decision-making as well as taking high-quality actions [ 90 ]. Data scientists can develop algorithms or data-driven models that predict customer behavior and identify patterns and trends based on historical business data, which can help companies reduce costs, improve service delivery, and generate recommendations for better decision-making. Eventually, business automation, intelligence, and efficiency can be achieved through the data science process discussed earlier, where various advanced analytics methods and machine learning modeling based on the collected data are the keys. Many online retailers, such as Amazon [ 76 ], can improve inventory management, avoid out-of-stock situations, and optimize logistics and warehousing using predictive modeling based on machine learning techniques [ 105 ]. In finance, historical data helps financial institutions make high-stakes business decisions and is mostly used for risk management, fraud prevention, credit allocation, customer analytics, personalized services, algorithmic trading, etc. Overall, data science methodologies can play a key role in the future generation business or finance industry, particularly in terms of business automation, intelligence, and smart decision-making and systems.
  • Manufacturing or industrial data science: To compete in global production capability, quality, and cost, manufacturing industries have gone through many industrial revolutions [ 14 ]. The latest fourth industrial revolution, also known as Industry 4.0, is the emerging trend of automation and data exchange in manufacturing technology. Thus industrial data science, which is the study of industrial data to obtain insights that can typically lead to optimizing industrial applications, can play a vital role in such revolution. Manufacturing industries generate a large amount of data from various sources such as sensors, devices, networks, systems, and applications [ 6 , 68 ]. The main categories of industrial data include large-scale data devices, life-cycle production data, enterprise operation data, manufacturing value chain sources, and collaboration data from external sources [ 132 ]. The data needs to be processed, analyzed, and secured to help improve the system’s efficiency, safety, and scalability. Data science modeling thus can be used to maximize production, reduce costs and raise profits in manufacturing industries.
  • Medical or health data science: Healthcare is one of the most notable fields where data science is making major improvements. Health data science involves the extrapolation of actionable insights from sets of patient data, typically collected from electronic health records. To help organizations improve the quality of treatment, lower the cost of care, and improve the patient experience, data can be obtained and analyzed from several sources, e.g., electronic health records, billing claims, cost estimates, and patient satisfaction surveys. In reality, healthcare analytics using machine learning modeling can minimize medical costs, predict infectious outbreaks, prevent preventable diseases, and generally improve the quality of life [ 81 , 119 ]. Across the global population, the average human lifespan is growing, presenting new challenges to today’s methods of delivering care. Thus health data science modeling can play a role in analyzing current and historical data to predict trends, improve services, and even better monitor the spread of diseases. Eventually, it may lead to new approaches to improve patient care, clinical expertise, diagnosis, and management.
  • IoT data science: Internet of things (IoT) [ 9 ] is a revolutionary technical field that turns every electronic system into a smarter one and is therefore considered to be the big frontier that can enhance almost all activities in our lives. Machine learning has become a key technology for IoT applications because it uses expertise to identify patterns and generate models that help predict future behavior and events [ 112 ]. One of the IoT’s main fields of application is a smart city, which uses technology to improve city services and citizens’ living experiences. For example, using the relevant data, data science methods can be used for traffic prediction in smart cities, or to estimate citizens’ total energy usage for a particular period. Deep learning-based models in data science can be built based on large-scale IoT datasets [ 7 , 104 ]. Overall, data science and analytics approaches can aid modeling in a variety of IoT and smart city services, including smart governance, smart homes, education, connectivity, transportation, business, agriculture, health care, industry, and many others.
  • Cybersecurity data science: Cybersecurity, or the practice of defending networks, systems, hardware, and data from digital attacks, is one of the most important fields of Industry 4.0 [ 114 , 121 ]. Data science techniques, particularly machine learning, have become a crucial cybersecurity technology that continually learns to identify trends by analyzing data, better detecting malware in encrypted traffic, finding insider threats, predicting where bad neighborhoods are online, keeping people safe while surfing, or protecting information in the cloud by uncovering suspicious user activity [ 114 ]. For instance, machine learning and deep learning-based security modeling can be used to effectively detect various types of cyberattacks or anomalies [ 103 , 106 ]. To generate security policy rules, association rule learning can play a significant role to build rule-based systems [ 102 ]. Deep learning-based security models can perform better when utilizing the large scale of security datasets [ 140 ]. Thus data science modeling can enable professionals in cybersecurity to be more proactive in preventing threats and reacting in real-time to active attacks, through extracting actionable insights from the security datasets.
  • Behavioral data science: Behavioral data is information produced as a result of activities, most commonly commercial behavior, performed on a variety of Internet-connected devices, such as a PC, tablet, or smartphone [ 112 ]. Websites, mobile applications, marketing automation systems, call centers, help desks, and billing systems, etc. are all common sources of behavioral data. Behavioral data is more than just static data, as it evolves with users’ ongoing activities [ 108 ]. Advanced analytics of these data, including machine learning modeling, can facilitate several areas such as predicting future sales trends and product recommendations in e-commerce and retail; predicting usage trends, load, and user preferences in future releases in online gaming; determining how users use an application to predict future usage and preferences in application development; breaking users down into similar groups to gain a more focused understanding of their behavior in cohort analysis; and detecting compromised credentials and insider threats by locating anomalous behavior, or making suggestions, etc. Overall, behavioral data science modeling typically enables businesses to make the right offers to the right consumers at the right time on various common platforms such as e-commerce platforms, online games, web and mobile applications, and IoT. In a social context, analyzing human behavioral data using advanced analytics methods, and the insights extracted from social data, can be used for data-driven intelligent social services, which can be considered as social data science.
  • Mobile data science: Today’s smart mobile phones are considered as “next-generation, multi-functional cell phones that facilitate data processing, as well as enhanced wireless connectivity” [ 146 ]. In our earlier paper [ 112 ], we have shown that users’ interest in “Mobile Phones” has been higher in recent years than in other platforms like “Desktop Computer”, “Laptop Computer” or “Tablet Computer”. People use smartphones for a variety of activities, including e-mailing, instant messaging, online shopping, Internet surfing, entertainment, social media such as Facebook, Linkedin, and Twitter, and various IoT services such as smart cities, health, and transportation services, and many others. Intelligent apps are based on the insights extracted from the relevant datasets depending on app characteristics, such as action-oriented, adaptive in nature, suggestive and decision-oriented, data-driven, context-awareness, and cross-platform operation [ 112 ]. As a result, mobile data science, which involves gathering a large amount of mobile data from various sources and analyzing it using machine learning techniques to discover useful insights or data-driven trends, can play an important role in the development of intelligent smartphone applications.
  • Multimedia data science: Over the last few years, a big data revolution in multimedia management systems has resulted from the rapid and widespread use of multimedia data, such as image, audio, video, and text, as well as the ease of access and availability of multimedia sources. Currently, multimedia sharing websites, such as Yahoo Flickr, iCloud, and YouTube, and social networks such as Facebook, Instagram, and Twitter, are considered as valuable sources of multimedia big data [ 89 ]. People, particularly younger generations, spend a lot of time on the Internet and social networks to connect with others, exchange information, and create multimedia data, thanks to the advent of new technology and the advanced capabilities of smartphones and tablets. Multimedia analytics deals with the problem of effectively and efficiently manipulating, handling, mining, interpreting, and visualizing various forms of data to solve real-world problems. Text analysis, image or video processing, computer vision, audio or speech processing, and database management are among the solutions available for a range of applications including healthcare, education, entertainment, and mobile devices.
  • Smart cities or urban data science: Today, more than half of the world’s population live in urban areas or cities [ 80 ], which are considered drivers or hubs of economic growth, wealth creation, well-being, and social activity [ 96 , 116 ]. In addition to cities, “urban area” can refer to surrounding areas such as towns, conurbations, or suburbs. Thus, a large amount of data documenting daily events, perceptions, thoughts, and emotions of citizens or people are recorded, which are loosely categorized into personal data, e.g., household, education, employment, health, immigration, crime, etc.; proprietary data, e.g., banking, retail, online platforms data, etc.; government data, e.g., citywide crime statistics or government institutions, etc.; open and public data, e.g., data.gov, ordnance survey; and organic and crowdsourced data, e.g., user-generated web data, social media, Wikipedia, etc. [ 29 ]. The field of urban data science typically focuses on providing more effective solutions from a data-driven perspective, through extracting knowledge and actionable insights from such urban data. Advanced analytics of these data using machine learning techniques [ 105 ] can facilitate the efficient management of urban areas, including real-time management, e.g., traffic flow management; evidence-based planning decisions which pertain to the longer-term strategic role of forecasting for urban planning, e.g., crime prevention, public safety, and security; or framing the future, e.g., political decision-making [ 29 ]. Overall, it can contribute to government and public planning, as well as relevant sectors including retail, financial services, mobility, health, policing, and utilities within a data-rich urban environment, through data-driven smart decision-making and policies, which lead to smart cities and improve the quality of human life.
  • Smart villages or rural data science: Rural areas or the countryside are the opposite of urban areas and include villages, hamlets, or agricultural areas. The field of rural data science typically focuses on making better decisions and providing more effective solutions, including protecting public safety, providing critical health services, agriculture, and fostering economic development, from a data-driven perspective, through extracting knowledge and actionable insights from the collected rural data. Advanced analytics of rural data, including machine learning modeling [ 105 ], can provide new opportunities for rural communities to build insights and capacity to meet current needs and prepare for their futures. For instance, machine learning modeling [ 105 ] can help farmers to enhance their decisions to adopt sustainable agriculture utilizing the increasing amount of data captured by emerging technologies, e.g., the internet of things (IoT), mobile technologies and devices, etc. [ 1 , 51 , 52 ]. Thus, rural data science can play a very important role in the economic and social development of rural areas, through agriculture, business, self-employment, construction, banking, healthcare, governance, or other services, etc. that lead to smarter villages.

Overall, we can conclude that data science modeling can be used to help drive changes and improvements in almost every sector in our real-world life, where the relevant data is available to analyze. To gather the right data and extract useful knowledge or actionable insights from the data for making smart decisions is the key to data science modeling in any application domain. Based on our discussion on the above ten potential real-world application domains by taking into account data-driven smart computing and decision making, we can say that the prospects of data science and the role of data scientists are huge for the future world. The “Data Scientists” typically analyze information from multiple sources to better understand the data and business problems, and develop machine learning-based analytical modeling or algorithms, or data-driven tools, or solutions, focused on advanced analytics, which can make today’s computing process smarter, automated, and intelligent.

Challenges and Research Directions

Our study on data science and analytics, particularly data science modeling in “Understanding data science modeling”, advanced analytics methods and smart computing in “Advanced analytics methods and smart computing”, and real-world application areas in “Real-world application domains”, opens several research issues in the area of data-driven business solutions and eventual data products. Thus, in this section, we summarize and discuss the challenges faced and the potential research opportunities and future directions to build data-driven products.

  • Understanding the real-world business problems and the associated data, including their nature, e.g., forms, types, sizes, labels, etc., is the first challenge in data science modeling, discussed briefly in “Understanding data science modeling”. This is actually to identify, specify, represent and quantify the domain-specific business problems and data according to the requirements. For a data-driven effective business solution, there must be a well-defined workflow before beginning the actual data analysis work. Furthermore, gathering business data is difficult because data sources can be numerous and dynamic. As a result, gathering different forms of real-world data, such as structured or unstructured data, related to a specific business issue with legal access, which varies from application to application, is challenging. Moreover, data annotation, which is typically the process of categorization, tagging, or labeling of raw data for the purpose of building data-driven models, is another challenging issue. Thus, the primary task is to conduct a more in-depth analysis of data collection and dynamic annotation methods. Therefore, understanding the business problem, as well as integrating and managing the raw data gathered for efficient data analysis, may be one of the most challenging aspects of working in the field of data science and analytics.
  • The next challenge is the extraction of relevant and accurate information from the collected data mentioned above. The main focus of data scientists is typically to disclose, describe, represent, and capture data-driven intelligence for actionable insights from data. However, real-world data may contain many ambiguous values, missing values, outliers, and meaningless data [ 101 ]. The quality and availability of the data highly impact the advanced analytics methods, including the machine and deep learning modeling discussed in “Advanced analytics methods and smart computing”. Thus it is important to understand the real-world business scenario and the associated data, to determine whether, how, and why they are insufficient, missing, or problematic, and then to extend or redevelop the existing methods, such as large-scale hypothesis testing, learning inconsistency, and uncertainty handling, to address the complexities in data and business problems. Therefore, developing new techniques to effectively pre-process the diverse data collected from multiple sources, according to their nature and characteristics, could be another challenging task.
  • Understanding and selecting the appropriate analytical methods to extract useful insights for smart decision-making for a particular business problem is the main issue in the area of data science. The emphasis of advanced analytics is more on anticipating the use of data to detect patterns and determine what is likely to occur in the future. Basic analytics offer a general description of the data, while advanced analytics is a step forward in offering a deeper understanding of data and supporting granular data analysis. Thus, understanding the advanced analytics methods, especially machine and deep learning-based modeling, is the key. The traditional learning techniques mentioned in “Advanced analytics methods and smart computing” may not be directly applicable for the expected outcome in many cases. For instance, in a rule-based system, the traditional association rule learning technique [ 4 ] may produce redundant rules from the data that make the decision-making process complex and ineffective [ 113 ]. Thus, a scientific understanding of the learning algorithms, their mathematical properties, and how the techniques are robust or fragile to input data is needed. Therefore, a deeper understanding of the strengths and drawbacks of the existing machine and deep learning methods [ 38 , 105 ] to solve a particular business problem is needed; consequently, improving or optimizing the learning algorithms according to the data characteristics, or proposing new algorithms/techniques with higher accuracy, becomes a significant challenge for the future generation of data scientists.
  • The traditional data-driven models or systems typically use a large amount of business data to generate data-driven decisions. In several application fields, however, the new trends are more likely to be interesting and useful for modeling and predicting the future than older ones. For example, smartphone user behavior modeling, IoT services, stock market forecasting, health or transport service, job market analysis, and other related areas where time-series and actual human interests or preferences are involved over time. Thus, rather than considering the traditional data analysis, the concept of RecencyMiner, i.e., recent pattern-based extracted insight or knowledge proposed in our earlier paper Sarker et al. [ 108 ] might be effective. Therefore, to propose the new techniques by taking into account the recent data patterns, and consequently to build a recency-based data-driven model for solving real-world problems, is another significant challenging issue in the area.
  • The most crucial task for a data-driven smart system is to create a framework that supports data science modeling discussed in “ Understanding data science modeling ”. As a result, advanced analytical methods based on machine learning or deep learning techniques can be considered in such a system to make the framework capable of resolving the issues. Besides, incorporating contextual information such as temporal context, spatial context, social context, environmental context, etc. [ 100 ] can be used for building an adaptive, context-aware, and dynamic model or framework, depending on the problem domain. As a result, a well-designed data-driven framework, as well as experimental evaluation, is a very important direction to effectively solve a business problem in a particular domain, as well as a big challenge for the data scientists.
  • Several important application areas, such as autonomous cars, criminal justice, health care, recruitment, housing, human resource management, and public safety, involve decisions made by models or AI agents that have a direct effect on human lives. As a result, there is growing concern about whether these decisions can be trusted to be right, reasonable, ethical, personalized, accurate, robust, and secure, particularly in the context of adversarial attacks [ 104 ]. If we can explain the result in a meaningful way, then the model can be better trusted by the end-user. For machine-learned models, new trust properties yield new trade-offs, such as privacy versus accuracy, robustness versus efficiency, and fairness versus robustness. Therefore, incorporating trustworthy AI, particularly in data-driven or machine learning modeling, could be another challenging issue in the area.

In the above, we have summarized and discussed several challenges and the potential research opportunities and directions, within the scope of our study in the area of data science and advanced analytics. The data scientists in academia/industry and the researchers in the relevant area have the opportunity to contribute to each issue identified above and build effective data-driven models or systems, to make smart decisions in the corresponding business domains.

In this paper, we have presented a comprehensive view on data science, including various types of advanced analytical methods that can be applied to enhance the intelligence and the capabilities of an application. We have also visualized the current popularity of data science and machine learning-based advanced analytical modeling and differentiated these from the relevant terms used in the area, to position this paper. We have then provided a thorough study of data science modeling with its various processing modules that are needed to extract actionable insights from the data for a particular business problem and the eventual data product. Thus, according to our goal, we have briefly discussed how different data modules can play a significant role in a data-driven business solution through the data science process. For this, we have also summarized various types of advanced analytical methods and outcomes, as well as machine learning modeling, that are needed to solve the associated business problems. Thus, this study’s key contribution has been identified as the explanation of different advanced analytical methods and their applicability in various real-world data-driven application areas, including business, healthcare, cybersecurity, urban and rural data science, and so on, by taking into account data-driven smart computing and decision making.

Finally, within the scope of our study, we have outlined and discussed the challenges we faced, as well as possible research opportunities and future directions. The challenges identified thus provide promising research opportunities in the field that can be explored with effective solutions to improve data-driven models and systems. Overall, we conclude that our study of advanced analytical solutions based on data science and machine learning methods leads in a positive direction and can be used as a reference guide for future research and applications in the field of data science and its real-world applications by both academia and industry professionals.

Declarations

The author declares no conflict of interest.

This article is part of the topical collection “Advances in Computational Approaches for Artificial Intelligence, Image Processing, IoT and Cloud Applications” guest edited by Bhanu Prakash K N and M. Shivakumar.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.


Part 1. Putting Data Driven Research In Perspective

1.1 What is Data Driven Research?

What is Data Driven Research?

Data Driven Research is a subset of the field of Digital Humanities that deals particularly with data analytic and visualization methodologies. This field broadly consists of theories and methodologies from a range of humanities disciplines that inform how a researcher gathers, analyzes, and filters datasets to gain insight about a particular subject.

When data is processed, organized, structured or presented in a given context so as to make it useful, it is called information. Data are the facts or details from which information is derived. Individual pieces of data are rarely useful alone. For data to become information, data needs to be put into context.

With data driven research, scholars are able to harness the power of spreadsheets, text mining software, and public archives to produce visualizations that communicate findings and insight into fields of study. Producing visual stories for online discourse is just one of the exciting possibilities of data driven research. Therefore, we must develop a general working knowledge of data analytics in order to derive insights from datasets.

In this chapter, we will (1) provide a general overview of data driven research; (2) explain why spreadsheets are valuable tools for data analysis; (3) explain how humanists and social scientists can make key contributions to the field of data analysis; (4) use a data visualization to demonstrate how data driven research allows us to comprehend large scale analysis of topics.

Keywords: Data, Datasets, Data Curation, Data Visualization, Spreadsheet, Quantitative Data, Qualitative Data.

Understanding Quantitative and Qualitative Data

Figure 1.1.1 (Quantitative vs Qualitative): an image explaining the difference between quantitative and qualitative data by focusing on key attributes.

Data is organized into collections known as datasets, which are foundational for data-driven research. Datasets contain quantitative and qualitative data (sometimes both).

Quantitative data can be numbers and values. It can be used to ask the questions “how much?” or “how many?” Qualitative data is descriptive and conceptual. Also, it can be categorized based on traits and characteristics like serial codes and social security numbers since these categorical values are unique to one item or person. Collections of data can be used to perform different types of analyses, derive insight, and produce information.

The most common format for datasets is a spreadsheet, especially the .csv format. These documents are a single file organized as a table of rows and columns. These files can be opened in common spreadsheet applications like Microsoft Excel or Google Sheets. Datasets can also be stored in other formats, ranging from a Microsoft Excel document to multiple datasets bundled in a zip file.
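For readers who are curious how the same sorting and filtering operations look outside a spreadsheet application, the short sketch below uses Python's pandas library on a hypothetical file called musicians.csv (the file name and column names are invented for illustration):

```python
import pandas as pd

# Load a hypothetical dataset of jazz musicians from a .csv file
musicians = pd.read_csv("musicians.csv")

# Sort the rows alphabetically by name (like sorting A-Z in a spreadsheet)
alphabetical = musicians.sort_values("Name")

# Filter to musicians with more than 50 recordings (like a spreadsheet filter)
prolific = musicians[musicians["Recordings"] > 50]

# Count musicians per instrument (similar to a simple pivot table)
print(prolific.groupby("Instrument")["Name"].count())
```

The result mirrors what a sort, a filter, and a simple pivot table would produce in Excel or Google Sheets.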

Despite several advances in computational humanities, the role of data curation is oftentimes overlooked, especially when it comes to the humanities and other scholarly disciplines.

Data curation is the work of organizing and managing a collection of datasets to meet the needs and interests of specific groups of people. The creation of datasets is a very labor-intensive and time-consuming endeavor. Curating datasets takes time and a particular type of expertise to create this valuable resource and make it available for researchers. This process also requires access to archival materials.

Figure 1.1.2 (Contested Freedom by Marquis Taylor): a screenshot from the Journal of Slavery and Data Preservation documenting Marquis Taylor's dataset on free persons of color in Savannah, GA, 1823-1842.

Marquis Taylor, a graduate student in the history department at Northwestern University, published “Contested Freedom: Free Persons of Color in Savannah, GA, 1823-1842” in the Journal of Slavery and Data Preservation. This dataset is compiled “entirely of information for free persons of color who resided in the city of Savannah, Georgia, registered between 1823 and 1842,” which comprises “1,321 named individuals residing in Chatham County.”

Marquis’s dataset is small only in comparison to a massive dataset that identifies millions of people in the state over several decades, but this project is nonetheless large and significant in other contexts in the humanities. Archiving data is an important aspect of data driven research. If the field is to grow, we need to provide more access points for people to explore topics that traditionally do not make quantitative and qualitative data a central part of the analysis.

Data Driven Research and the Power of Spreadsheets

Figure 1.1.3 (Michael Kramer – Digitizing Folk Music): an image from Michael J. Kramer’s essay “What Does Digital Humanities Bring to the Table?” showing how students can use spreadsheets to create a table for audio annotation.

Spreadsheets are among the most valuable tools for pursuing data-driven research. These widely accessible electronic ledgers, including Microsoft Excel, Numbers, and Google Sheets, are used for storing, managing, and analyzing data. Sorting and filtering options, formulas, pivot tables, and other spreadsheet functions make it possible for researchers to manage thousands of categories and data points.

Humanists might find spreadsheets beneficial as well for collecting and organizing sources, performing statistical functions on numerical information, and visualizing data to name just a few tasks. Michael Kramer , an Assistant Professor of History at SUNY Brockport, believes that “lots of data organized in usefully systemized ways and analyzed using the tools of statistics and other more quantitative approaches are worth the time of humanities scholars.” In his “Digitizing Folk Music History” course, Kramer’s students produced detailed analyses, documented their research, and developed “an extended line of meta-metadata that was transitioning out from the evidence to analysis.”

In 2017, Kathleen Clarke, then a Ph.D. candidate in Higher Education at the University of Toronto, described using spreadsheets to organize information for her dissertation. She notes that spreadsheets enabled her to move information around easily and sort information from A-Z or Z-A within a second. Some of the headings she had in her spreadsheet include:

  • (1) ID number (2) Author(s) + Year (3) Title (4) Main Findings (5) APA/MLA Reference (6) Type of Resource (7) Abstract (8) Keywords (9) Location – Canada, United States, United Kingdom, Other (10) Notes -quotations I might want to use.

Journalists understand the significance of spreadsheets. The New York Times now conducts workshops on spreadsheets for its employees. The three-week training, based in Google Sheets, teaches participants how to search, sort, and filter information in spreadsheets, while also learning how to work with pivot tables, advanced data cleaning, and visualization techniques.

Lindsey Rogers Cook explained that “Even with some of the best data and graphics journalists in the business … data knowledge wasn’t spread widely among desks in our newsroom and wasn’t filtering into news desks’ daily reporting.” Cook produced a document called “How 5 Data Dynamos Do Their Jobs” to showcase how journalists have optimized the use of spreadsheets to produce groundbreaking work.

Figure 1.1.4 (The NYTimes Upshot): a screenshot from the New York Times article “Extensive Data Shows Punishing Reach of Racism for Black Boys,” depicting a chart of what happens to men from childhood to adulthood based on racial and economic factors.

On March 19, 2019, the NYTime’s Upshot published “ Extensive Data Shows Punishing Reach of Racism for Black Boys ” by Emily Badger, Claire Cain Miller, Adam Pearce, and Kevin Quearly. This visual article examined how Black boys raised in America, even in the wealthiest families and living in some of the most well-to-do neighborhoods, still earn less in adulthood than white boys with similar backgrounds. Understanding how to analyze data in spreadsheets is a first crucial step to creating engaging visualizations that communicate findings.

After items are entered into spreadsheets, they can be re-presented in a variety of ways. The organized information can then serve as the building blocks for data visualizations.

A spreadsheet focusing on hundreds of jazz musicians would assist in highlighting their collaborations and diverse trajectories. Historians who want to map migration patterns and population shifts across regions in the US might organize their findings in a spreadsheet. A spreadsheet that contains census records for each state makes filtering through decades’ worth of data more efficient. Ultimately, spreadsheets facilitate the process of organizing bodies of information.

Comprehending the Scale of Slavery

As a practice that relies heavily on data analysis, data visualizations are invaluable across a range of fields and professions as people seek to understand the implications of large bodies of information. Visual compositions can help us to comprehend scale and interpret shifts in years, people, and other variables using our eyes. This is beneficial when communicating a story that contains several points.

“The Atlantic Slave Trade in Two Minutes” interactive by Andrew Kahn on Slate represents a powerful possibility of collecting and showcasing data. The visualization displays 315 years and 20,528 voyages of the Transatlantic Slave Trade. The two-minute interactive composition is at once compelling and troubling, for it is ultimately a distillation of the horrifying transportation of ten million enslaved humans presented in 120 seconds.

“The Atlantic Slave Trade in Two Minutes” is an intriguing data visualization with an extended backstory. The project is the product of decades of research that began in the late 1960s when a group of scholars began collecting information about slave trade voyages. Subsequent scholars contributed to the expansion of the data. Today, the Trans-Atlantic Slave Trade Database provides information on more than 36,000 voyages, 91,000 people who were enslaved and thousands of names of shipowners and ship captains. The display of so much information about the slave trade, presented in an interactive brief visualization, indicates the power of innovative data management.

Of course, even a skillfully designed visualization like “The Atlantic Slave Trade in Two Minutes” gives us reason to pause. The use of small dots to represent slave ships and the truncation of 300 years into two minutes trivializes the horrors of enslavement. Thus, responsible creators and viewers must be prepared to consider how data visualizations can simultaneously suppress people’s experiences while at the same time enhancing knowledge.

Visualizations keep the mind engaged and therefore able to process and receive several interrelated pieces of information. Specifically, this visualization organizes hundreds of records related to the transatlantic slave trade, making it easier to remember points and even retrieve information in an efficient manner.

Key Takeaways – What is Data Driven Research?

  • Data allows us to quantify and consider the scale or rate at which events or recurring trends happen.
  • Data is hardly ever neutral; therefore, a single dataset can be used in multiple ways and for several purposes.
  • Data can be transformed into visuals and be published on online mediums, thereby contributing to online discourses.
  • Spreadsheets are valuable beyond academic research for structuring work data and other uses. The tools can be used to organize sources, to-do lists, numerical and statistical data, etc.
  • Pre-assembled datasets allow for more time to be spent on data analysis and visualizations.

By Kenton Rambsy

(See bibliography for sources)

Media Attributions

  • Private: Figure 1.1.1 – Quantitative vs Qualitative © Devin Pickell
  • Private: Figure 1.1.2 – Contested Freedom by Marquis Taylor © Marquis Taylor
  • Private: Figure 1.1.3 – Michael Kramer – Digitizing Folk Music © Michael Kramer
  • Figure 1.1.4 – The NYTimes Upshot © Emily Badger, Claire Cain Miller, Adam Pearce, and Kevin Quearly

Data are units of information, often numeric, that are collected through observation.[1] In a more technical sense, data are a set of values of qualitative or quantitative variables about one or more persons or objects.

Source: https://en.wikipedia.org/wiki/Data

A data set (or dataset) is a collection of data. In the case of tabular data, a data set corresponds to one or more database tables, where every column of a table represents a particular variable, and each row corresponds to a given record of the data set in question.

Source: https://en.wikipedia.org/wiki/Data_set

Data curation is the organization and integration of data collected from various sources. It involves annotation, publication and presentation of the data such that the value of the data is maintained over time, and the data remains available for reuse and preservation.

Source: https://en.wikipedia.org/wiki/Data_curation

Data visualization refers to the techniques used to communicate data or information by encoding it as visual objects (e.g., points, lines or bars) contained in graphics. The goal is to communicate information clearly and efficiently to users.

Source: https://en.wikipedia.org/wiki/Data_visualization

A spreadsheet is a computer application for organization, analysis, and storage of data in tabular form. Spreadsheets were developed as computerized analogs of paper accounting worksheets. The program operates on data entered in cells of a table.

Source: https://en.wikipedia.org/wiki/Spreadsheet

Quantitative Data is a classification that describes the nature of information within the values assigned to variables.

Source: https://en.wikipedia.org/wiki/Level_of_measurement

Qualitative data refers to generally nonnumerical data obtained by the researcher. The data is collected from first-hand observation, interviews, questionnaires, focus groups, participant-observation, recordings made in natural settings, documents, and artifacts.

Source: https://en.wikipedia.org/wiki/Qualitative_research

The Data Notebook Copyright © 2021 by Peace Ossom-Williamson and Kenton Rambsy is licensed under a Creative Commons Attribution 4.0 International License , except where otherwise noted.

  • Open access
  • Published: 09 September 2022

Machine learning in project analytics: a data-driven framework and case study

  • Shahadat Uddin 1 ,
  • Stephen Ong 1 &
  • Haohui Lu 1  

Scientific Reports volume 12, Article number: 15252 (2022)


  • Applied mathematics
  • Computational science

The analytic procedures incorporated to facilitate the delivery of projects are often referred to as project analytics. Existing techniques focus on retrospective reporting and understanding the underlying relationships to make informed decisions. Although machine learning algorithms have been widely used in addressing problems within various contexts (e.g., streamlining the design of construction projects), limited studies have evaluated pre-existing machine learning methods within the delivery of construction projects. Due to this, the current research aims to contribute further to this convergence between artificial intelligence and the execution of construction projects through the evaluation of a specific set of machine learning algorithms. This study proposes a machine learning-based data-driven research framework for addressing problems related to project analytics. It then illustrates an example of the application of this framework. In this illustration, existing data from an open-source data repository on construction projects and cost overrun frequencies was studied, in which several machine learning models (Python’s Scikit-learn package) were tested and evaluated. The data consisted of 44 independent variables (from materials to labour and contracting) and one dependent variable (project cost overrun frequency), which was categorised for processing under several machine learning models. These models include support vector machine, logistic regression, k-nearest neighbour, random forest, stacking (ensemble) model and artificial neural network. Feature selection and evaluation methods, including univariate feature selection, recursive feature elimination, SelectFromModel and the confusion matrix, were applied to determine the most accurate prediction model. This study also discusses the generalisability of using the proposed research framework in other research contexts within the field of project management. The proposed framework, its illustration in the context of construction projects and its potential to be adopted in different contexts will significantly contribute to project practitioners, stakeholders and academics in addressing many project-related issues.


Introduction

Successful projects require the presence of appropriate information and technology 1 . Project analytics provides an avenue for informed decisions to be made through the lifecycle of a project. Project analytics applies various statistics (e.g., earned value analysis or Monte Carlo simulation) among other models to make evidence-based decisions. They are used to manage risks as well as project execution 2 . There is a tendency for project analytics to be employed due to other additional benefits, including an ability to forecast and make predictions, benchmark with other projects, and determine trends such as those that are time-dependent 3 , 4 , 5 . There has been increasing interest in project analytics and how current technology applications can be incorporated and utilised 6 . Broadly, project analytics can be understood on five levels 4 . The first is descriptive analytics which incorporates retrospective reporting. The second is known as diagnostic analytics , which aims to understand the interrelationships and underlying causes and effects. The third is predictive analytics which seeks to make predictions. Subsequent to this is prescriptive analytics , which prescribes steps following predictions. Finally, cognitive analytics aims to predict future problems. The first three levels can be applied with ease with the help of technology. The fourth and fifth steps require data that is generally more difficult to obtain as they may be less accessible or unstructured. Further, although project key performance indicators can be challenging to define 2 , identifying common measurable features facilitates this 7 . It is anticipated that project analytics will continue to experience development due to its direct benefits to the major baseline measures focused on productivity, profitability, cost, and time 8 . The nature of project management itself is fluid and flexible, and project analytics allows an avenue for which machine learning algorithms can be applied 9 .

Machine learning within the field of project analytics falls into the category of cognitive analytics, which deals with problem prediction. Generally, machine learning explores the possibilities of computers to improve processes through training or experience 10 . It can also build on the pre-existing capabilities and techniques prevalent within management to accomplish complex tasks 11 . Due to its practical use and broad applicability, recent developments have led to the invention and introduction of newer and more innovative machine learning algorithms and techniques. Artificial intelligence, for instance, allows for software to develop computer vision, speech recognition, natural language processing, robot control, and other applications 10 . Specific to the construction industry, it is now used to monitor construction environments through a virtual reality and building information modelling replication 12 or risk prediction 13 . Within other industries, such as consumer services and transport, machine learning is being applied to improve consumer experiences and satisfaction 10 , 14 and reduce the human errors of traffic controllers 15 . Recent applications and development of machine learning broadly fall into the categories of classification, regression, ranking, clustering, dimensionality reduction and manifold learning 16 . Current learning models include linear predictors, boosting, stochastic gradient descent, kernel methods, and nearest neighbour, among others 11 . Newer and more applications and learning models are continuously being introduced to improve accessibility and effectiveness.

Specific to the management of construction projects, other studies have also been made to understand how copious amounts of project data can be used 17 , the importance of ontology and semantics throughout the nexus between artificial intelligence and construction projects 18 , 19 as well as novel approaches to the challenges within this integration of fields 20 , 21 , 22 . There have been limited applications of pre-existing machine learning models on construction cost overruns. They have predominantly focussed on applications to streamline the design processes within construction 23 , 24 , 25 , 26 , and those which have investigated project profitability have not incorporated the types and combinations of algorithms used within this study 6 , 27 . Furthermore, existing applications have largely been skewed towards one type or another 28 , 29 .

In addition to the frequently used earned value method (EVM), researchers have been applying many other powerful quantitative methods to address a diverse range of project analytics research problems over time. Examples of those methods include time series analysis, fuzzy logic, simulation, network analytics, and network correlation and regression. Time series analysis uses longitudinal data to forecast an underlying project's future needs, such as the time and cost 30 , 31 , 32 . Few other methods are combined with EVM to find a better solution for the underlying research problems. For example, Narbaev and De Marco 33 integrated growth models and EVM for forecasting project cost at completion using data from construction projects. For analysing the ongoing progress of projects having ambiguous or linguistic outcomes, fuzzy logic is often combined with EVM 34 , 35 , 36 . Yu et al. 36 applied fuzzy theory and EVM for schedule management. Ponz-Tienda et al. 35 found that using fuzzy arithmetic on EVM provided more objective results in uncertain environments than the traditional methodology. Bonato et al. 37 integrated EVM with Monte Carlo simulation to predict the final cost of three engineering projects. Batselier and Vanhoucke 38 compared the accuracy of the project time and cost forecasting using EVM and simulation. They found that the simulation results supported findings from the EVM. Network methods are primarily used to analyse project stakeholder networks. Yang and Zou 39 developed a social network theory-based model to explore stakeholder-associated risks and their interactions in complex green building projects. Uddin 40 proposed a social network analytics-based framework for analysing stakeholder networks. Ong and Uddin 41 further applied network correlation and regression to examine the co-evolution of stakeholder networks in collaborative healthcare projects. Although many other methods have already been used, as evident in the current literature, machine learning methods or models are yet to be adopted for addressing research problems related to project analytics. The current investigation is derived from the cognitive analytics component of project analytics. It proposes an approach for determining hidden information and patterns to assist with project delivery. Figure  1 illustrates a tree diagram showing different levels of project analytics and their associated methods from the literature. It also illustrates existing methods within the cognitive component of project analytics to where the application of machine learning is situated contextually.

Figure 1. A tree diagram of different project analytics methods, showing where the current study belongs. Although earned value analysis is commonly used in project analytics, we do not include it in this figure since it is used in the first three levels of project analytics.

Machine learning models have several notable advantages over traditional statistical methods that play a significant role in project analytics 42 . First, machine learning algorithms can quickly identify trends and patterns by simultaneously analysing a large volume of data. Second, they are more capable of continuous improvement. Machine learning algorithms can improve their accuracy and efficiency for decision-making through subsequent training from potential new data. Third, machine learning algorithms efficiently handle multi-dimensional and multi-variety data in dynamic or uncertain environments. Fourth, they are effective in automating various decision-making tasks. For example, machine learning-based sentiment analysis can easily detect a negative tweet and automatically take further necessary steps. Last but not least, machine learning has been helpful across various industries, from defence to education 43 . Current research has seen the development of several different branches of artificial intelligence (including robotics, automated planning and scheduling, and optimisation) within safety monitoring, risk prediction, cost estimation and so on 44 . This has progressed from the applications of regression on project cost overruns 45 to the current deep-learning implementations within the construction industry 46 . Despite this, the uses remain largely limited and are still in a developmental state. The benefits of applications are noted, such as optimising and streamlining existing processes; however, high initial costs form a barrier to accessibility 44 .

The primary goal of this study is to demonstrate the applicability of different machine learning algorithms in addressing problems related to project analytics. Limitations in applying machine learning algorithms within the context of construction projects have been explored previously. However, preceding research has mainly been conducted to improve the design processes specific to construction 23 , 24 , and those investigating project profitabilities have not incorporated the types and combinations of algorithms used within this study 6 , 27 . For instance, preceding research has incorporated a different combination of machine-learning algorithms in research of predicting construction delays 47 . This study first proposed a machine learning-based data-driven research framework for project analytics to contribute to the proposed study direction. It then applied this framework to a case study of construction projects. Although there are three different machine learning algorithms (supervised, unsupervised and semi-supervised), the supervised machine learning models are most commonly used due to their efficiency and effectiveness in addressing many real-world problems 48 . Therefore, we will use machine learning to represent supervised machine learning throughout the rest of this article. The contribution of this study is significant in that it considers the applications of machine learning within project management. Project management is often thought of as being very fluid in nature, and because of this, applications of machine learning are often more difficult 9 , 49 . Further to this, existing implementations have largely been limited to safety monitoring, risk prediction, cost estimation and so on 44 . Through the evaluation of machine-learning applications, this study further demonstrates a case study for which algorithms can be used to consider and model the relationship between project attributes and a project performance measure (i.e., cost overrun frequency).

Machine learning-based framework for project analytics

When and why machine learning for project analytics

Machine learning models are typically used for research problems that involve predicting the classification outcome of a categorical dependent variable. Therefore, they can be applied in the context of project analytics if the underlying objective variable is a categorical one. If that objective variable is non-categorical, it must first be converted into a categorical variable. For example, if the objective or target variable is the project cost, we can convert this variable into a categorical variable by taking only two possible values. The first value would be 0 to indicate a low-cost project, and the second could be 1 for showing a high-cost project. The average or median cost value for all projects under consideration can be considered for splitting project costs into low-cost and high-cost categories.
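As a minimal illustration of this step, the following Python sketch (using pandas; the project_cost column name and its values are hypothetical) converts a continuous cost variable into a binary low-cost/high-cost class with a median split:

```python
# Minimal sketch: convert a continuous cost variable into a binary class label.
# The column name "project_cost" and the values below are illustrative assumptions.
import pandas as pd

df = pd.DataFrame({"project_cost": [1.2, 3.4, 0.8, 5.1, 2.0, 4.7]})

threshold = df["project_cost"].median()                           # or .mean() for an average-based split
df["cost_class"] = (df["project_cost"] > threshold).astype(int)   # 0 = low-cost, 1 = high-cost
print(df)
```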

For data-driven decision-making, machine learning models are advantageous. Traditional statistical methods (e.g., ordinary least square (OLS) regression) make assumptions about the underlying research data to produce explicit formulae for the objective target measures. Unlike these statistical methods, machine learning algorithms figure out patterns on their own, directly from the data. For instance, for a non-linear but separable dataset, an OLS regression model will not be the right choice because it assumes that the underlying relationship is linear. However, a machine learning model can easily separate the dataset into the underlying classes. Figure  2 (a) presents a situation where machine learning models perform better than traditional statistical methods.

figure 2

( a ) An illustration showing the superior performance of machine learning models compared with the traditional statistical models using an abstract dataset with two attributes (X 1 and X 2 ). The data points within this abstract dataset consist of two classes: one represented with a transparent circle and the second class illustrated with a black-filled circle. These data points are non-linear but separable. Traditional statistical models (e.g., ordinary least square regression) will not accurately separate these data points. However, any machine learning model can easily separate them without making errors; and ( b ) Traditional programming versus machine learning.
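The situation in Fig. 2(a) can be reproduced in spirit with a small, self-contained Python sketch. The concentric-circles dataset below is synthetic (scikit-learn's make_circles), not the study's data, and logistic regression stands in for a traditional linear model:

```python
# Minimal sketch: a non-linear but separable dataset that a linear model cannot
# split well, while a kernel-based machine learning model can.
from sklearn.datasets import make_circles
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

X, y = make_circles(n_samples=300, factor=0.3, noise=0.05, random_state=0)

linear_model = LogisticRegression().fit(X, y)   # linear decision boundary
svm_rbf = SVC(kernel="rbf").fit(X, y)           # non-linear decision boundary

print("Linear model accuracy:", linear_model.score(X, y))  # roughly chance level
print("RBF SVM accuracy:", svm_rbf.score(X, y))            # close to 1.0
```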

Similarly, machine learning models are compelling if the underlying research dataset has many attributes or independent measures. Such models can identify features that significantly contribute to the corresponding classification performance regardless of their distributions or collinearity, whereas traditional statistical methods are prone to biased results when independent variables are correlated. Current machine learning-based studies specific to project analytics remain largely limited. Despite this, there have been tangential studies on the use of artificial intelligence to improve cost estimation as well as risk prediction 44 . Additionally, models have been implemented in the optimisation of existing processes 50 .

Machine learning versus traditional programming

Machine learning can be thought of as a process of teaching a machine (i.e., computers) to learn from data and adjust or apply its present knowledge when exposed to new data 42 . It is a type of artificial intelligence that enables computers to learn from examples or experiences. Traditional programming requires some input data and some logic in the form of code (program) to generate the output. Unlike traditional programming, the input data and their corresponding output are fed to an algorithm to create a program in machine learning. This resultant program can capture powerful insights into the data pattern and can be used to predict future outcomes. Figure  2 (b) shows the difference between machine learning and traditional programming.

Proposed machine learning-based framework

Figure  3 illustrates the proposed machine learning-based research framework of this study. The framework starts with splitting the project research dataset into training and test components. As mentioned in the previous section, the research dataset may have many numerical and/or categorical independent variables, but its single dependent variable must be categorical. Although there is no strict rule for this split, the training data size is generally more than or equal to 50% of the original dataset 48 .

figure 3

The proposed machine learning-based data-driven framework.

Machine learning algorithms can only handle variables that have numerical outcomes. So, when one or more of the underlying categorical variables have a textual or string outcome, we must first convert them into corresponding numerical values. Suppose a variable can take only three textual outcomes (low, medium and high). In that case, we could use, for example, 1 to represent low , 2 to represent medium , and 3 to represent high . Other statistical techniques, such as RIDIT (relative to an identified distribution) scoring 51 , can also be used to convert ordered categorical measurements into quantitative ones. RIDIT is a scoring approach that uses probabilistic comparison to determine the statistical differences between ordered categorical groups. The remaining components of the proposed framework are briefly described in the following subsections.
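A minimal sketch of this encoding step is given below; the column name and the low/medium/high mapping are illustrative assumptions rather than variables from the study's dataset:

```python
# Minimal sketch: map ordered textual outcomes to numerical codes.
import pandas as pd

df = pd.DataFrame({"risk_level": ["low", "high", "medium", "low", "high"]})

mapping = {"low": 1, "medium": 2, "high": 3}              # ordered encoding
df["risk_level_encoded"] = df["risk_level"].map(mapping)
print(df)
```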

Model-building procedure

The next step of the framework is to follow the model-building procedure to develop the desired machine learning models using the training data. The first step of this procedure is to select suitable machine learning algorithms or models. Among the available machine learning algorithms, the commonly used ones are support vector machine, logistic regression, k -nearest neighbours, artificial neural network, decision tree and random forest 52 . One can also select an ensemble machine learning model as the desired algorithm. An ensemble machine learning method uses multiple algorithms, or the same algorithm multiple times, to achieve better predictive performance than could be obtained from any of the constituent learning models alone 52 . Three widely used ensemble approaches are bagging, boosting and stacking. In bagging, the research dataset is divided into different equal-sized subsets, and the underlying machine learning algorithm is applied to these subsets for classification. In boosting, a random sample of the dataset is selected and then fitted and trained sequentially with different models, each compensating for the weaknesses observed in the previous model. Stacking combines different weak machine learning models in a heterogeneous way to improve the predictive performance. For example, the random forest algorithm is an ensemble of different decision tree models 42 .

Second, each selected machine learning model is processed through the k -fold cross-validation approach to improve predictive efficiency. In k -fold cross-validation, the training data is divided into k folds. In each iteration, (k−1) folds are used to train the selected machine learning models, and the remaining fold is used for validation purposes. This process continues until each of the k folds has been used once for validation. The final predictive efficiency of the trained models is based on the average of the outcomes of these iterations. In addition to this average value, researchers report the standard deviation of the results from the different iterations as part of the predictive training efficiency. Supplementary Fig 1 shows an illustration of the k -fold cross-validation.
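For illustration, the sketch below runs fivefold cross-validation with scikit-learn on a synthetic dataset of roughly the same size as the case study and reports the mean and standard deviation of the fold accuracies; the random forest classifier is only an example choice:

```python
# Minimal sketch: k-fold cross-validation with mean and standard deviation of scores.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=139, n_features=44, random_state=0)  # synthetic stand-in data

scores = cross_val_score(RandomForestClassifier(random_state=0), X, y, cv=5, scoring="accuracy")
print("Mean accuracy: %.3f (+/- %.3f)" % (scores.mean(), scores.std()))
```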

Third, most machine learning algorithms require pre-defined values for several parameters, known as hyperparameters; the process of finding suitable values is known as hyperparameter tuning. The settings of these parameters play a vital role in the achieved performance of the underlying algorithm. For a given machine learning algorithm, the optimal values for these parameters can differ from one dataset to another, so the same algorithm needs to run multiple times with different parameter values to find its optimal parameter values for a given dataset. Many algorithms are available in the literature, such as Grid search 53 , to find the optimal parameter values. In Grid search, the hyperparameters are laid out on a discrete grid, where each grid point represents a specific combination of the underlying model parameters. The parameter values of the point that results in the best performance are the optimal parameter values 53 .
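The following sketch illustrates a grid search with scikit-learn's GridSearchCV; the classifier and the parameter grid are illustrative assumptions, not the hyperparameter settings used in this study (those are listed in Supplementary Table 1):

```python
# Minimal sketch: hyperparameter tuning with an exhaustive grid search.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=139, n_features=44, random_state=0)

param_grid = {
    "n_estimators": [100, 300, 500],   # illustrative grid values
    "max_depth": [None, 5, 10],
}
search = GridSearchCV(RandomForestClassifier(random_state=0), param_grid, cv=5, scoring="accuracy")
search.fit(X, y)

print("Best parameters:", search.best_params_)
print("Best cross-validated accuracy:", search.best_score_)
```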

Testing of the developed models and reporting results

Once the desired machine learning models have been developed using the training data, they need to be tested using the test data. The underlying trained model is then applied to predict its dependent variable for each data instance. Therefore, for each data instance, two categorical outcomes will be available for its dependent variable: one predicted using the underlying trained model, and the other is the actual category. These predicted and actual categorical outcome values are used to report the results of the underlying machine learning model.

The fundamental tool to report results from machine learning models is the confusion matrix, which consists of four integer values 48 . The first value represents the number of positive cases correctly identified as positive by the underlying trained model (true-positive). The second value indicates the number of positive instances incorrectly identified as negative (false-negative). The third value represents the number of negative cases incorrectly identified as positive (false-positive). Finally, the fourth value indicates the number of negative instances correctly identified as negative (true-negative). Researchers also use a few performance measures based on the four values of the confusion matrix to report machine learning results. The most widely used measure is accuracy, which is the ratio of the number of correct predictions (true-positive + true-negative) to the total number of data instances (the sum of all four values of the confusion matrix). Other measures commonly used to report machine learning results are precision, recall and F1-score. Precision refers to the ratio between true-positives and the total number of positive predictions (i.e., true-positive + false-positive), often used to indicate the quality of a positive prediction made by a model 48 . Recall, also known as the true-positive rate, is calculated by dividing true-positives by the number of data instances that should have been predicted as positive (i.e., true-positive + false-negative). F1-score is the harmonic mean of the last two measures, i.e., [(2 × Precision × Recall)/(Precision + Recall)], and the error rate equals (1 − Accuracy).
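The sketch below shows how the four confusion-matrix values and these measures can be computed with scikit-learn's metrics module; the actual and predicted labels are illustrative:

```python
# Minimal sketch: confusion matrix and the derived performance measures.
from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             precision_score, recall_score)

y_actual    = [1, 0, 1, 1, 0, 1, 0, 0, 1, 0]   # actual classes (illustrative)
y_predicted = [1, 0, 1, 0, 0, 1, 1, 0, 1, 0]   # classes predicted by a trained model

tn, fp, fn, tp = confusion_matrix(y_actual, y_predicted).ravel()
print("TP, FN, FP, TN:", tp, fn, fp, tn)
print("Accuracy :", accuracy_score(y_actual, y_predicted))
print("Precision:", precision_score(y_actual, y_predicted))
print("Recall   :", recall_score(y_actual, y_predicted))
print("F1-score :", f1_score(y_actual, y_predicted))
```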

Another essential tool for reporting machine learning results is variable or feature importance, which identifies the independent variables (features) contributing most to the classification performance. The importance of a variable refers to how much a given machine learning algorithm uses that variable in making accurate predictions 54 . A widely used related technique is principal component analysis, which reduces the dimensionality of the data while minimising information loss, which eventually increases the interpretability of the underlying machine learning outcome. It further helps in finding the important features in a dataset as well as plotting them in 2D and 3D 54 .

Ethical approval

Ethical approval is not required for this study since this study used publicly available data for research investigation purposes. All research was performed in accordance with relevant guidelines/regulations.

Informed consent

Due to the nature of the data sources, informed consent was not required for this study.

Case study: an application of the proposed framework

This section illustrates an application of this study’s proposed framework (Fig.  3 ) in a construction project context. We apply this framework to classify projects into two classes based on their cost overrun experience: projects that rarely experience a cost overrun belong to the first class (Rare class), and projects that often experience a cost overrun belong to the second class (Often class). In doing so, we consider a list of independent variables or features.

Data source

The research dataset is taken from an open-source data repository, Kaggle 55 . This survey-based research dataset was collected to explore the causes of the project cost overrun in Indian construction projects 45 , consisting of 44 independent variables or features and one dependent variable. The independent variables cover a wide range of cost overrun factors, from materials and labour to contractual issues and the scope of the work. The dependent variable is the frequency of experiencing project cost overrun (rare or often). The dataset size is 139; 65 belong to the rare class, and the remaining 74 are from the often class. We converted each categorical variable with a textual or string outcome into an appropriate numerical value range to prepare the dataset for machine learning analysis. For example, we used 1 and 2 to represent rare and often class, respectively. The correlation matrix among the 44 features is presented in Supplementary Fig 2 .

Machine learning algorithms

This study considered four base machine learning algorithms (support vector machine, logistic regression, k -nearest neighbours and random forest), together with an artificial neural network and a stacking ensemble (described below), to explore the causes of project cost overrun using the research dataset mentioned above.

Support vector machine (SVM) is a supervised learning algorithm used to classify data. For instance, if one wants to determine which projects can be classified as programmatically successful by processing precedent data, SVM provides a practical approach for prediction. SVM functions by assigning labels to objects 56 . The comparison attributes are used to separate these objects into different groups or classes by maximising their marginal distances and minimising the classification errors. The attributes are plotted multi-dimensionally, allowing a separation boundary, known as a hyperplane (see Supplementary Fig 3 (a)), to distinguish between the underlying classes or groups 52 . Support vectors are the data points that lie closest to the decision boundary on both sides. In Supplementary Fig 3 (a), they are the circles (both transparent and shaded ones) close to the hyperplane. Support vectors play an essential role in deciding the position and orientation of the hyperplane. Various computational methods, including kernel functions that create more derived attributes, are applied to accommodate this process 56 . Support vector machines are not limited to binary classes but can also be generalised to a larger variety of classifications. This is accomplished through the training of separate SVMs 56 .

Logistic regression (LR) builds on the linear regression model and predicts the outcome of a dichotomous variable 57 , for example, the presence or absence of an event. It models the connection between the dependent variable and one or more independent variables (see Supplementary Fig 3 (b)). The LR model fits the data to a sigmoidal curve instead of fitting it to a straight line. The natural logarithm is used when developing the model, and the model provides a value between 0 and 1 that is interpreted as the probability of class membership. Best estimates are determined by refining approximate estimates until a level of stability is reached 58 . Generally, LR offers a straightforward approach for determining and observing interrelationships and is more efficient than ordinary regression 59 .

k -nearest neighbours (KNN) is an algorithm that plots prior information and considers a specified number ( k ) of the nearest training examples to determine the most likely class 52 . The nearest training examples are found using a distance measure, and the final classification is made by counting the most common class (or votes) among the k neighbours. As illustrated in Supplementary Fig 3 (c), the four nearest neighbours in the small circle are three grey squares and one white square. The majority class is grey; hence, KNN will predict the instance (i.e., Χ ) as grey. On the other hand, if we look at the larger circle of the same figure, the nearest neighbours consist of ten white squares and four grey squares. The majority class is white; thus, KNN will classify the instance as white. KNN’s advantage lies in its ability to produce a simplified result and handle missing data 60 . In summary, KNN utilises similarities (as well as differences) and distances between instances when developing models.

Random forest (RF) is a machine learning method that consists of many decision trees. A decision tree is a tree-like structure in which each internal node represents a test on an input attribute; it may have multiple internal nodes at different levels, and the leaf or terminal nodes represent the decision outcomes. Each tree in the forest produces a prediction for a given input vector. For numerical outcomes, the forest considers the average value across trees, and for categorical outcomes, it considers the majority of the votes 52 . Supplementary Fig 3 (d) shows three decision trees to illustrate the function of a random forest. The outcomes from trees 1, 2 and 3 are class B, class A and class A, respectively. According to the majority vote, the final prediction will be class A. Because it considers specific attributes, the random forest can tend to emphasise some attributes over others, which may result in attributes being unevenly weighted 52 . Advantages of the random forest include its ability to handle multidimensionality and multicollinearity in data, despite its sensitivity to sampling design.

Artificial neural network (ANN) simulates the way in which human brains work. This is accomplished by modelling logical propositions and incorporating weighted inputs, a transfer function and one output 61 (Supplementary Fig 3 (e)). It is advantageous because it can be used to model non-linear relationships and handle multivariate data 62 . ANN learns through three major avenues: error-back propagation (supervised), the Kohonen network (unsupervised) and counter-propagation ANN (supervised) 62 . ANNs can therefore be trained in both supervised and unsupervised modes. ANN has been used in a myriad of applications ranging from pharmaceuticals 61 to electronic devices 63 . It also possesses great levels of fault tolerance 64 and learns by example and through self-organisation 65 .

Ensemble techniques are a type of machine learning methodology in which numerous basic classifiers are combined to generate an optimal model 66 . An ensemble technique considers many models and combines them into a single model; the final model aims to eliminate the weaknesses of each individual learner, resulting in a powerful model with improved performance. The stacking model is a general architecture comprising two classifier levels: base classifiers and a meta-learner 67 . The base classifiers are trained with the training dataset, and a new dataset is constructed for the meta-learner. Afterwards, this new dataset is used to train the meta-classifier. This study uses four models (SVM, LR, KNN and RF) as base classifiers and LR as the meta-learner, as illustrated in Supplementary Fig 3 (f).
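A minimal scikit-learn sketch of this stacking design (SVM, LR, KNN and RF as base classifiers, LR as the meta-learner) is given below; it uses a synthetic dataset and default settings rather than the study's tuned hyperparameters:

```python
# Minimal sketch: stacking ensemble with four base classifiers and a logistic
# regression meta-learner.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

X, y = make_classification(n_samples=139, n_features=44, random_state=0)

base_classifiers = [
    ("svm", SVC(probability=True, random_state=0)),
    ("lr", LogisticRegression(max_iter=1000)),
    ("knn", KNeighborsClassifier()),
    ("rf", RandomForestClassifier(random_state=0)),
]
stack = StackingClassifier(estimators=base_classifiers,
                           final_estimator=LogisticRegression(max_iter=1000),
                           cv=5)
stack.fit(X, y)
print("Training accuracy:", stack.score(X, y))
```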

Feature selection

Feature selection is the process of selecting the optimal subset of features that significantly influence the predicted outcome; it can improve model performance and save running time. This study considers three different feature selection approaches: Univariate feature selection (UFS), Recursive feature elimination (RFE) and the SelectFromModel (SFM) approach. UFS examines each feature separately to determine the strength of its relationship with the response variable 68 . This method is straightforward to use and comprehend and helps acquire a deeper understanding of data. In this study, we calculate the chi-square value between each feature and the response variable. RFE is a type of backwards feature elimination in which the model is first fit using all features in the given dataset, and the least important features are then removed one by one 69 . After that, the model is refit until the desired number of features, determined by a user-specified parameter, is left. SFM is used to choose effective features based on the feature importance of the best-performing model 70 . This approach selects features by establishing a threshold based on the feature importance indicated by the model on the training set. Features whose importance exceeds the threshold are retained, while those below the threshold are removed. In this study, we apply SFM after comparing the performance of the machine learning methods and then train the best-performing model again using the features selected by the SFM approach.
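The three approaches can be sketched with scikit-learn as follows. The dataset is synthetic, the number of retained features (19) simply mirrors the case-study outcome reported later, and the features are rescaled to non-negative values because the chi-square test requires them:

```python
# Minimal sketch: UFS (chi-square), RFE and SelectFromModel feature selection.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE, SelectFromModel, SelectKBest, chi2
from sklearn.preprocessing import MinMaxScaler

X, y = make_classification(n_samples=139, n_features=44, random_state=0)
X_nonneg = MinMaxScaler().fit_transform(X)          # chi2 requires non-negative values

ufs = SelectKBest(score_func=chi2, k=19).fit(X_nonneg, y)                  # univariate selection
rfe = RFE(RandomForestClassifier(random_state=0), n_features_to_select=19).fit(X, y)
sfm = SelectFromModel(RandomForestClassifier(random_state=0)).fit(X, y)    # importance threshold

print("UFS selected:", ufs.get_support().sum(), "features")
print("RFE selected:", rfe.get_support().sum(), "features")
print("SFM selected:", sfm.get_support().sum(), "features")
```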

Findings from the case study

We split the dataset into 70:30 for training and test purposes of the selected machine learning algorithms. We used Python’s Scikit-learn package to implement these algorithms 70 . Using the training data, we first developed six models based on the six algorithms described above. We used fivefold cross-validation, targeting improvement in the accuracy value, and then applied these models to the test data. We also executed all required hyperparameter tunings for each algorithm to obtain the best possible classification outcome. Table 1 shows the performance outcomes for each algorithm during the training and test phases. The hyperparameter settings for each algorithm are listed in Supplementary Table 1 .

As revealed in Table 1 , random forest outperformed the other algorithms in terms of accuracy for both the training and test phases. It showed an accuracy of 78.14% and 77.50% for the training and test phases, respectively. The second-best performer in the training phase is k -nearest neighbours (76.98%); for the test phase, the second-best performers are the support vector machine, k -nearest neighbours and artificial neural network (each 72.50%).

Since random forest showed the best performance, we explored it further. We applied the three feature optimisation approaches (UFS, RFE and SFM) to the random forest. The result is presented in Table 2 . SFM shows the best outcome among these three approaches: its accuracy is 85.00%, whereas the accuracies of UFS and RFE are 77.50% and 72.50%, respectively. As can be seen in Table 2 , the accuracy for the testing phase increases from 77.50% in Table 1 (b) to 85.00% with the SFM feature optimisation. Table 3 shows the 19 selected features from the SFM output. Out of 44 features, SFM found that 19 play a significant role in predicting the outcomes.

Further, Fig.  4 illustrates the confusion matrix when the random forest model with the SFM feature optimiser was applied to the test data. There are 18 true-positive, five false-negative, one false-positive and 16 true-negative cases. Therefore, the accuracy for the test phase is (18 + 16)/(18 + 5 + 1 + 16) = 85.00%.

figure 4

Confusion matrix results based on the random forest model with the SFM feature optimiser (1 for the rare class and 2 for the often class).

Figure  5 illustrates the top-10 most important features or variables based on the random forest algorithm with the SFM optimiser. We used feature importance based on the mean decrease in impurity in identifying this list of important variables. The mean decrease in impurity computes each feature’s importance as the sum over the number of splits that include the feature, in proportion to the number of samples it splits 71 . According to this figure, the delays in decision making attribute contributed most to the classification performance of the random forest algorithm, followed by the cash flow problem and construction cost underestimation attributes. The current construction project literature also highlights these top-10 factors as significant contributors to project cost overrun. For example, using construction project data from Jordan, Al-Hazim et al. 72 ranked 20 causes of cost overrun, including several similar to those identified here.

figure 5

Feature importance (top-10 out of 19) based on the random forest model with the SFM feature optimiser.
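For illustration, a fitted scikit-learn random forest exposes mean-decrease-in-impurity importances directly through its feature_importances_ attribute; the sketch below ranks them (the data and feature names are synthetic placeholders, not the study's attributes):

```python
# Minimal sketch: rank features by mean decrease in impurity from a fitted random forest.
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=139, n_features=44, random_state=0)
feature_names = [f"feature_{i}" for i in range(X.shape[1])]   # hypothetical names

rf = RandomForestClassifier(random_state=0).fit(X, y)
importances = pd.Series(rf.feature_importances_, index=feature_names)
print(importances.sort_values(ascending=False).head(10))      # top-10 most important features
```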

Further, we conducted a sensitivity analysis of the model’s ten most important features (from Fig.  5 ) to explore how a change in each feature affects the cost overrun. We utilise the partial dependence plot (PDP), a typical visualisation tool for non-parametric models 73 , to display the outcomes of this analysis. A PDP can demonstrate whether the relation between the target and a feature is linear, monotonic, or more complicated. The result of the sensitivity analysis is presented in Fig.  6 . For the ‘delays in decision making’ attribute, the PDP shows that the probability is below 0.4 until the rating value reaches three and increases thereafter. A higher value for this attribute indicates a higher risk of cost overrun. On the other hand, no significant differences can be seen for the remaining nine features as their values change.

figure 6

The result of the sensitivity analysis from the partial dependency plot tool for the ten most important features.
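A hedged sketch of producing such a plot with scikit-learn's PartialDependenceDisplay (available in recent versions, roughly 1.0 onwards) and matplotlib is shown below; the model, data and chosen feature indices are illustrative:

```python
# Minimal sketch: partial dependence plots for selected features of a fitted classifier.
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import PartialDependenceDisplay

X, y = make_classification(n_samples=139, n_features=44, random_state=0)
rf = RandomForestClassifier(random_state=0).fit(X, y)

PartialDependenceDisplay.from_estimator(rf, X, features=[0, 1, 2])  # illustrative feature indices
plt.show()
```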

Summary of the case study

We illustrated an application of the proposed machine learning-based research framework in classifying construction projects. RF showed the highest accuracy in predicting the test dataset. For a new data instance with information on its 19 features but no information on its classification, RF can identify its class ( rare or often ) correctly with a probability of 85.00%. If more data are provided to the machine learning algorithms, in addition to the 139 instances of the case study, their accuracy and efficiency in making project classifications will improve with subsequent training. For example, if we provide 100 more data instances, these algorithms will have an additional 70 instances for training with a 70:30 split. This continuous improvement facility puts machine learning algorithms in a superior position over other traditional methods. In the current literature, some studies explore the factors contributing to project delay or cost overrun. In most cases, they applied factor analysis or other related statistical methods for research data analysis 72 , 74 , 75 . In addition to identifying important attributes, the proposed machine learning-based framework, when applied to this case study, identified the ranking of factors and showed how eliminating less important factors affects the prediction accuracy.

We shared the Python software developed to implement the machine learning algorithms considered in this case study on GitHub 76 , a software hosting website. A user-friendly version of this software can be accessed at https://share.streamlit.io/haohuilu/pa/main/app.py . The accuracy findings from this link could be slightly different from one run to another due to the hyperparameter settings of the corresponding machine learning algorithms.

Due to their robust prediction ability, machine learning methods have already gained wide acceptability across a wide range of research domains. On the other hand, EVM is the most commonly used method in project analytics due to its simplicity and ease of interpretability 77 . Substantial research efforts have been made to improve its generalisability over time. For example, Naeni et al. 34 developed a fuzzy approach for earned value analysis to make it suitable for analysing project scenarios with ambiguous or linguistic outcomes. Acebes 78 integrated Monte Carlo simulation with EVM for project monitoring and control for a similar purpose. Another prominent method frequently used in project analytics is time series analysis, which is compelling for the longitudinal prediction of project time and cost 30 . However, as is evident in the current literature, not much effort has been made to bring machine learning into project analytics for addressing project management research problems. This research made a significant attempt to contribute to filling this gap.

Our proposed data-driven framework only includes the fundamental model development and application process components for machine learning algorithms. It does not include some advanced-level machine learning methods. This study intentionally did not consider them for the proposed model since they are required only in particular designs of machine learning analysis. For example, the framework does not contain any methods or tools to handle the data imbalance issue. Data imbalance refers to a situation where the research dataset has an uneven distribution of the target class 79 . For example, a binary target variable will cause a data imbalance issue if one of its class labels has a very high number of observations compared with the other class. Commonly used techniques to address this issue are undersampling and oversampling. The undersampling technique decreases the size of the majority class, whereas the oversampling technique randomly duplicates the minority class until the class distribution becomes balanced 79 . The class distribution of the case study did not produce any data imbalance issues.
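As an illustration only (the case study itself did not require it), the sketch below oversamples a minority class by random duplication using scikit-learn's resample utility; the data are synthetic:

```python
# Minimal sketch: random oversampling of the minority class to balance a dataset.
import numpy as np
from sklearn.utils import resample

rng = np.random.default_rng(0)
X = rng.random((100, 5))
y = np.array([0] * 90 + [1] * 10)            # imbalanced target: 90 majority vs 10 minority

X_min, y_min = X[y == 1], y[y == 1]
X_min_up, y_min_up = resample(X_min, y_min, replace=True, n_samples=90, random_state=0)

X_balanced = np.vstack([X[y == 0], X_min_up])
y_balanced = np.concatenate([y[y == 0], y_min_up])
print("Balanced class counts:", np.bincount(y_balanced))
```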

This study considered only six fundamental machine learning algorithms for the case study, although many other such algorithms are available in the literature. For example, it did not consider the extreme gradient boosting (XGBoost) algorithm. XGBoost is based on the decision tree algorithm, similar to the random forest algorithm 80 . It has become dominant in applied machine learning due to its performance and speed. Naïve Bayes and convolutional neural networks are other popular machine learning algorithms that were not considered when applying the proposed framework to the case study. In addition to the three feature selection methods, multi-view learning can be adopted when applying the proposed framework to the case study. Multi-view learning is another direction in machine learning that considers learning with multiple views of the existing data with the aim of improving predictive performance 81 , 82 . Similarly, although we considered five performance measures, there are other potential candidates. One such example is the area under the receiver operating curve, which measures the ability of the underlying classifier to distinguish between classes 48 . We leave these as potential application scopes while applying our proposed framework in other project contexts in future studies.

Although this study used only one case study for illustration, our proposed research framework can be used in other project analytics contexts. In such application contexts, the underlying research goal should be to predict the outcome classes and to find the attributes playing a significant role in making correct predictions. For example, by considering two types of projects based on the time required to accomplish them (e.g., on-time and delayed ), the proposed framework can develop machine learning models that predict the class of a new data instance and find the attributes contributing most to this prediction performance. The framework can also be used at any stage of a project. For example, the framework’s results allow project stakeholders to screen projects for excessive cost overruns and forecast budget loss at bidding and before contracts are signed. In addition, various factors that contribute to project cost overruns can be identified at an earlier stage, as these elements emerge at each stage of a project’s life cycle. The framework’s feature importance helps project managers locate the critical contributors to cost overrun.

This study has made an important contribution to the current project analytics literature by considering the applications of machine learning within project management. Project management is often thought of as being very fluid in nature, and because of this, applications of machine learning are often more difficult. Further, existing implementations have largely been limited to safety monitoring, risk prediction and cost estimation. Through the evaluation of machine learning applications, this study further demonstrates how algorithms can be used to consider and model the relationship between project attributes and cost overrun frequency.

The applications of machine learning in project analytics are still undergoing constant development. Within construction projects, its applications have been largely limited and focused on profitability or the design of structures themselves. In this regard, our study made a substantial effort by proposing a machine learning-based framework to address research problems related to project analytics. We also illustrated an example of this framework’s application in the context of construction project management.

Like any other research, this study also has a few limitations that could provide scope for future research. First, the framework does not include a few advanced machine learning techniques, such as methods for handling data imbalance and kernel density estimation. Second, we considered only one case study to illustrate the application of the proposed framework. Illustrations of this framework using case studies from different project contexts would confirm the robustness of its application. Finally, this study did not consider all machine learning models and performance measures available in the literature for the case study. For example, we did not consider the Naïve Bayes model or the precision measure in applying the proposed research framework to the case study.

Data availability

This study obtained research data from publicly available online repositories. We mentioned their sources using proper citations. Here is the link to the data https://www.kaggle.com/datasets/amansaxena/survey-on-road-construction-delay .

Venkrbec, V. & Klanšek, U. In: Advances and Trends in Engineering Sciences and Technologies II 685–690 (CRC Press, 2016).


Damnjanovic, I. & Reinschmidt, K. Data Analytics for Engineering and Construction Project Risk Management (Springer, 2020).


Singh, H. Project Management Analytics: A Data-driven Approach to Making Rational and Effective Project Decisions (FT Press, 2015).

Frame, J. D. & Chen, Y. Why Data Analytics in Project Management? (Auerbach Publications, 2018).

Ong, S. & Uddin, S. Data Science and Artificial Intelligence in Project Management: The Past, Present and Future. J. Mod. Proj. Manag. 7 , 26–33 (2020).

Bilal, M. et al. Investigating profitability performance of construction projects using big data: A project analytics approach. J. Build. Eng. 26 , 100850 (2019).


Radziszewska-Zielina, E. & Sroka, B. Planning repetitive construction projects considering technological constraints. Open Eng. 8 , 500–505 (2018).

Neely, A. D., Adams, C. & Kennerley, M. The Performance Prism: The Scorecard for Measuring and Managing Business Success (Prentice Hall Financial Times, 2002).

Kanakaris, N., Karacapilidis, N., Kournetas, G. & Lazanas, A. In: International Conference on Operations Research and Enterprise Systems. 135–155 Springer.

Jordan, M. I. & Mitchell, T. M. Machine learning: Trends, perspectives, and prospects. Science 349 , 255–260 (2015).


Shalev-Shwartz, S. & Ben-David, S. Understanding Machine Learning: From Theory to Algorithms (Cambridge University Press, 2014).


Rahimian, F. P., Seyedzadeh, S., Oliver, S., Rodriguez, S. & Dawood, N. On-demand monitoring of construction projects through a game-like hybrid application of BIM and machine learning. Autom. Constr. 110 , 103012 (2020).

Sanni-Anibire, M. O., Zin, R. M. & Olatunji, S. O. Machine learning model for delay risk assessment in tall building projects. Int. J. Constr. Manag. 22 , 1–10 (2020).

Cong, J. et al. A machine learning-based iterative design approach to automate user satisfaction degree prediction in smart product-service system. Comput. Ind. Eng. 165 , 107939 (2022).

Li, F., Chen, C.-H., Lee, C.-H. & Feng, S. Artificial intelligence-enabled non-intrusive vigilance assessment approach to reducing traffic controller’s human errors. Knowl. Based Syst. 239 , 108047 (2021).

Mohri, M., Rostamizadeh, A. & Talwalkar, A. Foundations of Machine Learning (MIT press, 2018).


Whyte, J., Stasis, A. & Lindkvist, C. Managing change in the delivery of complex projects: Configuration management, asset information and ‘big data’. Int. J. Proj. Manag. 34 , 339–351 (2016).

Zangeneh, P. & McCabe, B. Ontology-based knowledge representation for industrial megaprojects analytics using linked data and the semantic web. Adv. Eng. Inform. 46 , 101164 (2020).

Akinosho, T. D. et al. Deep learning in the construction industry: A review of present status and future innovations. J. Build. Eng. 32 , 101827 (2020).

Soman, R. K., Molina-Solana, M. & Whyte, J. K. Linked-Data based constraint-checking (LDCC) to support look-ahead planning in construction. Autom. Constr. 120 , 103369 (2020).

Soman, R. K. & Whyte, J. K. Codification challenges for data science in construction. J. Constr. Eng. Manag. 146 , 04020072 (2020).

Soman, R. K. & Molina-Solana, M. Automating look-ahead schedule generation for construction using linked-data based constraint checking and reinforcement learning. Autom. Constr. 134 , 104069 (2022).

Shi, F., Soman, R. K., Han, J. & Whyte, J. K. Addressing adjacency constraints in rectangular floor plans using Monte-Carlo tree search. Autom. Constr. 115 , 103187 (2020).

Chen, L. & Whyte, J. Understanding design change propagation in complex engineering systems using a digital twin and design structure matrix. Eng. Constr. Archit. Manag. (2021).

Allison, J. T. et al. Artificial intelligence and engineering design. J. Mech. Des. 144 , 020301 (2022).

Dutta, D. & Bose, I. Managing a big data project: The case of ramco cements limited. Int. J. Prod. Econ. 165 , 293–306 (2015).

Bilal, M. & Oyedele, L. O. Guidelines for applied machine learning in construction industry—A case of profit margins estimation. Adv. Eng. Inform. 43 , 101013 (2020).

Tayefeh Hashemi, S., Ebadati, O. M. & Kaur, H. Cost estimation and prediction in construction projects: A systematic review on machine learning techniques. SN Appl. Sci. 2 , 1–27 (2020).

Arage, S. S. & Dharwadkar, N. V. In: International Conference on I-SMAC (IoT in Social, Mobile, Analytics and Cloud)(I-SMAC). 594–599 (IEEE, 2017).

Cheng, C.-H., Chang, J.-R. & Yeh, C.-A. Entropy-based and trapezoid fuzzification-based fuzzy time series approaches for forecasting IT project cost. Technol. Forecast. Soc. Chang. 73 , 524–542 (2006).

Joukar, A. & Nahmens, I. Volatility forecast of construction cost index using general autoregressive conditional heteroskedastic method. J. Constr. Eng. Manag. 142 , 04015051 (2016).

Xu, J.-W. & Moon, S. Stochastic forecast of construction cost index using a cointegrated vector autoregression model. J. Manag. Eng. 29 , 10–18 (2013).

Narbaev, T. & De Marco, A. Combination of growth model and earned schedule to forecast project cost at completion. J. Constr. Eng. Manag. 140 , 04013038 (2014).

Naeni, L. M., Shadrokh, S. & Salehipour, A. A fuzzy approach for the earned value management. Int. J. Proj. Manag. 29 , 764–772 (2011).

Ponz-Tienda, J. L., Pellicer, E. & Yepes, V. Complete fuzzy scheduling and fuzzy earned value management in construction projects. J. Zhejiang Univ. Sci. A 13 , 56–68 (2012).

Yu, F., Chen, X., Cory, C. A., Yang, Z. & Hu, Y. An active construction dynamic schedule management model: Using the fuzzy earned value management and BP neural network. KSCE J. Civ. Eng. 25 , 2335–2349 (2021).

Bonato, F. K., Albuquerque, A. A. & Paixão, M. A. S. An application of earned value management (EVM) with Monte Carlo simulation in engineering project management. Gest. Produção 26 , e4641 (2019).

Batselier, J. & Vanhoucke, M. Empirical evaluation of earned value management forecasting accuracy for time and cost. J. Constr. Eng. Manag. 141 , 05015010 (2015).

Yang, R. J. & Zou, P. X. Stakeholder-associated risks and their interactions in complex green building projects: A social network model. Build. Environ. 73 , 208–222 (2014).

Uddin, S. Social network analysis in project management–A case study of analysing stakeholder networks. J. Mod. Proj. Manag. 5 , 106–113 (2017).

Ong, S. & Uddin, S. Co-evolution of project stakeholder networks. J. Mod. Proj. Manag. 8 , 96–115 (2020).

Khanzode, K. C. A. & Sarode, R. D. Advantages and disadvantages of artificial intelligence and machine learning: A literature review. Int. J. Libr. Inf. Sci. (IJLIS) 9 , 30–36 (2020).

Loyola-Gonzalez, O. Black-box vs. white-box: Understanding their advantages and weaknesses from a practical point of view. IEEE Access 7 , 154096–154113 (2019).

Abioye, S. O. et al. Artificial intelligence in the construction industry: A review of present status, opportunities and future challenges. J. Build. Eng. 44 , 103299 (2021).

Doloi, H., Sawhney, A., Iyer, K. & Rentala, S. Analysing factors affecting delays in Indian construction projects. Int. J. Proj. Manag. 30 , 479–489 (2012).

Alkhaddar, R., Wooder, T., Sertyesilisik, B. & Tunstall, A. Deep learning approach’s effectiveness on sustainability improvement in the UK construction industry. Manag. Environ. Qual. Int. J. 23 , 126–139 (2012).

Gondia, A., Siam, A., El-Dakhakhni, W. & Nassar, A. H. Machine learning algorithms for construction projects delay risk prediction. J. Constr. Eng. Manag. 146 , 04019085 (2020).

Witten, I. H. & Frank, E. Data Mining: Practical Machine Learning Tools and Techniques (Morgan Kaufmann, 2005).

Kanakaris, N., Karacapilidis, N. I. & Lazanas, A. In: ICORES. 362–369.

Heo, S., Han, S., Shin, Y. & Na, S. Challenges of data refining process during the artificial intelligence development projects in the architecture engineering and construction industry. Appl. Sci. 11 , 10919 (2021).


Bross, I. D. How to use ridit analysis. Biometrics 14 , 18–38 (1958).

Uddin, S., Khan, A., Hossain, M. E. & Moni, M. A. Comparing different supervised machine learning algorithms for disease prediction. BMC Med. Inform. Decis. Mak. 19 , 1–16 (2019).

LaValle, S. M., Branicky, M. S. & Lindemann, S. R. On the relationship between classical grid search and probabilistic roadmaps. Int. J. Robot. Res. 23 , 673–692 (2004).

Abdi, H. & Williams, L. J. Principal component analysis. Wiley Interdiscip. Rev. Comput. Stat. 2 , 433–459 (2010).

Saxena, A. Survey on Road Construction Delay , https://www.kaggle.com/amansaxena/survey-on-road-construction-delay (2021).

Noble, W. S. What is a support vector machine?. Nat. Biotechnol. 24 , 1565–1567 (2006).


Hosmer, D. W. Jr., Lemeshow, S. & Sturdivant, R. X. Applied Logistic Regression Vol. 398 (John Wiley & Sons, 2013).

LaValley, M. P. Logistic regression. Circulation 117 , 2395–2399 (2008).


Menard, S. Applied Logistic Regression Analysis Vol. 106 (Sage, 2002).

Batista, G. E. & Monard, M. C. A study of K-nearest neighbour as an imputation method. His 87 , 48 (2002).

Agatonovic-Kustrin, S. & Beresford, R. Basic concepts of artificial neural network (ANN) modeling and its application in pharmaceutical research. J. Pharm. Biomed. Anal. 22 , 717–727 (2000).

Zupan, J. Introduction to artificial neural network (ANN) methods: What they are and how to use them. Acta Chim. Slov. 41 , 327–327 (1994).


Hopfield, J. J. Artificial neural networks. IEEE Circuits Devices Mag. 4 , 3–10 (1988).

Zou, J., Han, Y. & So, S.-S. Overview of artificial neural networks. Artificial Neural Networks . 14–22 (2008).

Maind, S. B. & Wankar, P. Research paper on basic of artificial neural network. Int. J. Recent Innov. Trends Comput. Commun. 2 , 96–100 (2014).

Wolpert, D. H. Stacked generalization. Neural Netw. 5 , 241–259 (1992).

Pavlyshenko, B. In: IEEE Second International Conference on Data Stream Mining & Processing (DSMP). 255–258 (IEEE).

Jović, A., Brkić, K. & Bogunović, N. In: 38th International Convention on Information and Communication Technology, Electronics and Microelectronics (MIPRO). 1200–1205 (Ieee, 2015).

Guyon, I., Weston, J., Barnhill, S. & Vapnik, V. Gene selection for cancer classification using support vector machines. Mach. Learn. 46 , 389–422 (2002).


Pedregosa, F. et al. Scikit-learn: Machine learning in python. J. Mach. Learn. Res. 12 , 2825–2830 (2011).


Louppe, G., Wehenkel, L., Sutera, A. & Geurts, P. Understanding variable importances in forests of randomized trees. Adv. Neural. Inf. Process. Syst. 26 , 431–439 (2013).

Al-Hazim, N., Salem, Z. A. & Ahmad, H. Delay and cost overrun in infrastructure projects in Jordan. Procedia Eng. 182 , 18–24 (2017).

Breiman, L. Random forests. Mach. Learn. 45 , 5–32. https://doi.org/10.1023/A:1010933404324 (2001).

Shehu, Z., Endut, I. R. & Akintoye, A. Factors contributing to project time and hence cost overrun in the Malaysian construction industry. J. Financ. Manag. Prop. Constr. 19 , 55–75 (2014).

Akomah, B. B. & Jackson, E. N. Contractors’ perception of factors contributing to road project delay. Int. J. Constr. Eng. Manag. 5 , 79–85 (2016).

GitHub: Where the world builds software , https://github.com/ .

Anbari, F. T. Earned value project management method and extensions. Proj. Manag. J. 34 , 12–23 (2003).

Acebes, F., Pereda, M., Poza, D., Pajares, J. & Galán, J. M. Stochastic earned value analysis using Monte Carlo simulation and statistical learning techniques. Int. J. Proj. Manag. 33 , 1597–1609 (2015).

Japkowicz, N. & Stephen, S. The class imbalance problem: A systematic study. Intell. data anal. 6 , 429–449 (2002).

Chen, T. et al. Xgboost: extreme gradient boosting. R Packag. Version 0.4–2.1 1 , 1–4 (2015).

Guarino, A., Lettieri, N., Malandrino, D., Zaccagnino, R. & Capo, C. Adam or Eve? Automatic users’ gender classification via gestures analysis on touch devices. Neural Comput. Appl. 1–23 (2022).

Zaccagnino, R., Capo, C., Guarino, A., Lettieri, N. & Malandrino, D. Techno-regulation and intelligent safeguards. Multimed. Tools Appl. 80 , 15803–15824 (2021).


Acknowledgements

The authors acknowledge the insightful comments from Prof Jennifer Whyte on an earlier version of this article.

Author information

Authors and affiliations

School of Project Management, The University of Sydney, Level 2, 21 Ross St, Forest Lodge, NSW, 2037, Australia

Shahadat Uddin, Stephen Ong & Haohui Lu


Contributions

S.U.: Conceptualisation; Data curation; Formal analysis; Methodology; Supervision; and Writing (original draft, review and editing). S.O.: Data curation; and Writing (original draft, review and editing). H.L.: Methodology; and Writing (original draft, review and editing). All authors reviewed the manuscript.

Corresponding author

Correspondence to Shahadat Uddin .

Ethics declarations

Competing interests.

The authors declare no competing interests.

Additional information

Publisher's note.

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary Information

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/ .


About this article

Cite this article.

Uddin, S., Ong, S. & Lu, H. Machine learning in project analytics: a data-driven framework and case study. Sci Rep 12 , 15252 (2022). https://doi.org/10.1038/s41598-022-19728-x


Received : 13 April 2022

Accepted : 02 September 2022

Published : 09 September 2022

DOI : https://doi.org/10.1038/s41598-022-19728-x




  • Open access
  • Published: 14 July 2023

Data-driven research and healthcare: public trust, data governance and the NHS

  • Angeliki Kerasidou 1 &
  • Charalampia (Xaroula) Kerasidou 2  

BMC Medical Ethics volume 24, Article number: 51 (2023)


A Correction to this article was published on 04 October 2023

This article has been updated

It is widely acknowledged that trust plays an important role for the acceptability of data sharing practices in research and healthcare, and for the adoption of new health technologies such as AI. Yet there is reported distrust in this domain. Although in the UK, the NHS is one of the most trusted public institutions, public trust does not appear to accompany its data sharing practices for research and innovation, specifically with the private sector, that have been introduced in recent years. In this paper, we examine the question of, what is it about sharing NHS data for research and innovation with for-profit companies that challenges public trust? To address this question, we draw from political theory to provide an account of public trust that helps better understand the relationship between the public and the NHS within a democratic context, as well as, the kind of obligations and expectations that govern this relationship. Then we examine whether the way in which the NHS is managing patient data and its collaboration with the private sector fit under this trust-based relationship. We argue that the datafication of healthcare and the broader ‘health and wealth’ agenda adopted by consecutive UK governments represent a major shift in the institutional character of the NHS, which brings into question the meaning of public good the NHS is expected to provide, challenging public trust. We conclude by suggesting that to address the problem of public trust, a theoretical and empirical examination of the benefits but also the costs associated with this shift needs to take place, as well as an open conversation at public level to determine what values should be promoted by a public institution like the NHS.


There seems to be wide acknowledgement of the importance of trust in the context of data driven research and healthcare. It has been argued that building trust in this area can facilitate public acceptability of data sharing and the adoption of new technologies such as AI [ 1 , 2 ]. And yet, what it means, and how to promote trust in this context remains vague [ 3 , 4 ]. In the UK, the NHS is one of the most trusted public institutions, and also the single biggest holder of health data worldwide [ 5 ]. This could create favourable conditions for the establishment and promotion of a robust data driven health research and innovation sector and of data driven healthcare, something that successive governments in recent years have been trying to achieve [ 6 , 7 , 8 ]. However, in recent years, data sharing initiatives as developed by successive governments and implemented through the NHS have been met with public distrust [ 9 , 10 ]. One of the concerns cited is the involvement of private companies in the healthcare space, and worries of a potential “sell-off” of NHS patient data to for-profit companies for research and product development [ 11 , 12 , 13 , 14 ]. What it is about the sharing of NHS data with private for-profit companies in research and product development that is problematic and might be negatively affecting public trust is less clear. To answer this question, we first provide an account of public trust to help elucidate the kind of relationship it signifies between the public and health institutions, such as the NHS, and the kind of obligations and expectations that can be reasonably extrapolated from this relationship. Then we examine whether the way in which the NHS is managing patient data, and the particular collaborations and data sharing practices developed with the private sector fit under this trust-based relationship between the public and the NHS. We argue that the datafication of healthcare and the broader ‘health and wealth’ agenda adopted by consecutive UK governments represent a major shift in the institutional character of the NHS. This shift brings into question the meaning of public good that the NHS is supposed to provide, and thus challenges public trust. We suggest that in order to address the problem of public trust, a substantive examination – theoretical and empirical – of the benefits but also, and importantly, the costs associated with this shift needs to take place, as well as an open public conversation to determine what values should be endorsed and promoted by a public institution like the NHS.

Public trust

A number of scholars have been discussing the issue of trust and data governance, particularly in the context of healthcare [ 3 , 4 , 15 , 16 ]. These discussions have engaged with the philosophical and sociological literature to understand and explain the role and importance of trust in this context, and to draw meaningful distinctions between related concepts such as trust, trustworthiness, reliance, reliability and confidence [ 3 , 15 , 16 ]. A proper analysis of these terms falls outside the scope of this paper, yet a short clarification of these concepts might be of use here. Trust denotes a relationship where one has a reasonable belief in the other person’s ability and good will or expressed commitment to perform a certain action [ 17 , 18 , 19 , 20 ]. Reliance and confidence, the latter a term most often found in the sociological literature [ 21 , 22 ], confer the same reasonable expectation that an action will be performed, but do not entail affective elements, in the sense that there is no expectation of good will or expressed commitment [ 17 , 20 , 23 ]. Trustworthiness and reliability, on the other hand, refer to characteristics of the agent that is (or could be) trusted or relied upon [ 17 , 20 , 24 , 25 ]. As O’Neill notes, when it comes to trust, the crucial point is to be able to discern who is trustworthy and who is untrustworthy in order to be sure to place our trust appropriately [ 19 , 25 ]. However, whilst accounts of interpersonal trust are useful for making sense of trust between collectives [ 26 ], they do not capture the distinct and morally relevant context that frames the relationship between the public and public institutions, namely, the democratic structure that underpins this trust relationship. It is this political context, we maintain, that creates specific conditions both for the public and public institutions. When technological, policy or economic developments impact on the ways in which these rights and obligations are understood and practiced, the question arises of whether a rethinking of the relationship between the public and the public institution is needed [ 27 , 28 ]. For these reasons, and in order to tackle the question we set out in this paper, we turn to political theory for accounts of public trust, and specifically to the work of Mark Warren.

According to Warren, the term public trust denotes the trust warranted to public institutions tasked with providing a public good or service [ 29 ]. The basic conditions for public trust are convergent interests between the public and the institution, and a commitment to serving and promoting the public good [ 29 ]. Democratic societies ought to be organised in a way that facilitates a clear separation between impartial bodies that provide public goods, and partial bodies, such as political branches of the government and political institutions (e.g. elected executives), which do not take on such commitments. According to Warren, partial bodies warrant public distrust because their interests, motivations and intentions do not correspond with those of the general public, but only with those of certain sub-groups of the general population (e.g. party members) [ 29 ]. In other words, an institution’s raison d’être of providing a public good is what warrants public trust (even if it does not guarantee it). A commitment to the public good can be seen as an indication of the good will such institutions hold towards the trustor, i.e. the public, and of their moral motivation to validate the public’s dependency on them. On the other hand, public distrust towards partial institutions is based on the recognition that these institutions do not necessarily have good will towards the public or accept a commitment to provide a benefit for all (even if it could be argued that they might have good will towards certain subgroups that comprise the public, e.g. party members).

Warren maintains that a democratic society should welcome distrust towards partial institutions and put measures in place to facilitate and support it, namely, to institutionalise it [ 29 ]. This can happen by providing citizens with the means to check and control the behaviour of partial bodies, to ensure that they do not harm the public interest and, as much as possible, align their activities with those that promote the public good. Institutionalising distrust can take the form of independent governance structures that provide checks and balances, monitor the actions of partial bodies, and introduce enforceable sanctions that are meaningful enough to deter wrongdoing [ 29 ]. Furthermore, providing functional and appropriate forms of monitoring can prevent distrust from generalising and spilling over to impartial institutions [ 29 ]. One way of understanding the institutionalisation of distrust is as a guarantor of trust in the democratic system as a whole. It reinforces the idea that democracies are organised in such a way as to benefit the ‘demos’, rather than certain individuals or subgroups within society.

The trust relationship between the public and the NHS

The NHS is widely regarded as a public institution that warrants and holds public trust. Its main function is to provide a public good, specifically to support and improve people’s health and wellbeing. Since its inception in 1948 under Nye Bevan, the NHS’s key characteristic has been its moral motivation, as expressed in its solidaristic character [ 28 , 30 ]. It is founded on a set of principles and values, which enjoy wide public support [ 31 ], and which bind together the communities and people that it serves and the staff that work for it. As its Constitution states, ‘The NHS belongs to the people’ [ 32 ]. The NHS was founded on the principle that care should be delivered on the basis of need rather than on ability to pay. According to Veitch, this forms the basis of the relationship between the British public and the NHS: a public service user and a public service provider with a shared understanding of the public good at stake [ 30 ]. In this relationship, the public has a moral and civic duty to fund the health service through taxation and to use its resources prudently [ 32 ], while the State is obliged, through its appropriate financing and oversight of the healthcare service, to deliver appropriate care at the point of need. It is this stated commitment to serve the public good of healthcare that suggests an alignment of values and convergence of interests between the involved partners.

Although the latest British Social Attitudes survey demonstrates the public’s commitment and faith in the core principles of the NHS [ 31 ], public trust in its data sharing practices appears somewhat problematic. Various studies conducted across the years to gauge public attitudes to data sharing in healthcare and beyond have reported disquiet at the involvement of private companies in the healthcare space, with some voicing concerns about a potential “sell-off” of NHS patient data to private for-profit companies for research and product development [ 12 , 13 , 14 , 33 , 34 , 35 , 36 , 37 , 38 , 39 , 40 ].

As it is often pointed out, the involvement of the private sector in the NHS is not a new phenomenon [ 41 ]. It can be traced back to the National Health Service and Community Care Act 1990, which separated the healthcare provision function of the NHS from the healthcare purchase function. Later, the Health and Social Care Act 2012 further fragmented the service, inviting more private companies into the healthcare space by introducing a tendering process for the commissioning of health care services under the principle of competition [ 30 , 42 ]. More recently, the Health and Care Act 2022 represented a significant shift away from competition and towards collaboration between health providers [ 43 , 44 ]. As some argue, whether these reforms will manage to halt the erosion of public services and the further involvement of private interests in the NHS depends on how and the context within which they will be implemented [ 43 ]. In any case, such changes can have a significant impact on the relationship between the public and the healthcare institution [ 30 ].

Similarly, since the introduction of IT medical systems and Electronic Health Records (EHR), questions about private involvement when it comes to health data access and sharing have increasingly come to the foreground. Existing surveys and research studies reveal snapshots of a complex picture about public attitudes towards data sharing; namely, while participants are generally open to the sharing of health data for the benefit of patients, the NHS and for the broader public benefit, there is consistent uneasiness associated with the sharing of health data with private organisations and the commercial exploitation of NHS data [ 12 , 13 , 33 , 34 , 35 , 36 , 37 , 38 , 39 , 45 ]. As we demonstrate in the next section, the digitisation of the NHS and the broader datafication of healthcare along with the adopted ‘health and wealth’ agenda introduce new challenges in the context of healthcare, further impacting on the relationship between the public and the NHS.

Digitisation and datafication of the NHS – the ‘health and wealth’ agenda

The digitisation of the NHS started in 2002, aiming (and missing) to achieve a ‘paperless’ NHS by 2018 and a ‘core level of digitisation’ by 2024 [ 46 ]. This process coincided with what Faulkner-Gurstein and Wyatt call ‘the research turn’ in the NHS [ 47 ]. In 2003, the vision of the UK Department of Trade and Industry’s Biotechnology Innovation and Growth Team [ 48 ], and later, in 2006, the Department of Health’s publication Best Research for Best Health [ 8 ], made research a key element of the NHS, placing it at the centre of a national ‘health and wealth’ agenda. As the foreword of the latter report stated:

The vision that this strategy describes is underpinned by our determination to ensure that the NHS contribution to health research is a centrepiece of the Government’s ambition to raise the level of research and development (R&D) to 2.5% of GDP by 2014 [ 8 ].

The digitisation of the NHS, and more specifically the introduction of Electronic Health Records (EHRs), were central to this strategic vision. Moving from paper records to highly mobile recorded data heralded a new infrastructure and, with it, a new set of private partners that facilitated a ‘more entrepreneurial approach to data’ [ 47 ].

This entrepreneurial approach was further crystallised in 2012. The pledge newly introduced in the NHS Constitution for England in 2013, ‘to anonymise the information collected during the course of your treatment and use it to support research and improve care for others’ [ 49 ], was accompanied by the Health and Social Care Act 2012, which established a new legal framework for the flow of patient information within the NHS [ 50 ]. While special exemptions Footnote 1 existed even before its introduction that facilitated (and continue to facilitate) the legal disclosure of patient-identifiable data to a third party without consent, it was the 2012 Act that enabled the obligatory release-upon-request of patient information from every health and care provider in England, directed by the then newly formed body NHS England, an executive non-departmental public body sponsored by the Department of Health and Social Care [ 51 , 52 ]. Following the intervention of the National Data Guardian, this was later accompanied by a (conditional) Opt Out option, which does not apply when data is needed for the purposes of individual care, and may only apply to purposes beyond individual care [ 53 ]. Footnote 2 Specifically, in relation to the use of health data beyond a patient’s individual care, the option to Opt Out only applies to identifiable data, and does not apply to anonymised data. As the NHS website states, your choice [to opt out] does not apply ‘[w]hen information that can identify you is removed’ [ 54 ]. Footnote 3 However, as it is becoming increasingly understood, removing identifiable information does not necessarily render data anonymous [ 55 ]. Footnote 4
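To illustrate the point made in [ 55 ] that stripping direct identifiers does not by itself anonymise data, the following minimal sketch may be of use; it is not drawn from the paper, and the dataset, column names and attribute values are entirely hypothetical. It simply shows that records with names and NHS numbers removed can still be unique on a handful of quasi-identifiers (partial postcode, year of birth, sex), and therefore potentially re-identifiable by anyone who already knows those attributes about a person.

```python
# Minimal, hypothetical sketch: quasi-identifier uniqueness in a "de-identified" extract.
# All values below are invented for illustration only.
from collections import Counter

# Records with direct identifiers (name, NHS number) already removed,
# but with quasi-identifiers retained for research use.
records = [
    {"postcode_district": "OX3", "birth_year": 1947, "sex": "F", "diagnosis": "type 2 diabetes"},
    {"postcode_district": "OX3", "birth_year": 1947, "sex": "F", "diagnosis": "asthma"},
    {"postcode_district": "OX3", "birth_year": 1982, "sex": "M", "diagnosis": "hypertension"},
    {"postcode_district": "LA1", "birth_year": 1990, "sex": "F", "diagnosis": "depression"},
    {"postcode_district": "LA1", "birth_year": 1990, "sex": "M", "diagnosis": "eczema"},
]

quasi_identifiers = ("postcode_district", "birth_year", "sex")

# Count how many records share each combination of quasi-identifier values.
combo_counts = Counter(tuple(r[q] for q in quasi_identifiers) for r in records)

# A record whose combination occurs only once is unique in the dataset
# (k-anonymity with k = 1): anyone who knows these three facts about a person
# can link the "anonymised" record, including its diagnosis, back to them.
unique_records = [
    r for r in records
    if combo_counts[tuple(r[q] for q in quasi_identifiers)] == 1
]

print(f"{len(unique_records)} of {len(records)} records are unique on {quasi_identifiers}")
for r in unique_records:
    print("  potentially re-identifiable:", r)
```

On this toy data the script reports three of five records as unique on the chosen quasi-identifiers; population-scale studies such as [ 55 ] report the same effect at much higher rates once enough attributes are combined.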

These developments meant that information shared between doctors and patients in confidence was being transformed into usable streams of data that could be accessed by a range of actors for both primary and secondary uses, such as health services management and research, in an increasingly datafied landscape. Traditionally, when patients were enrolled in healthcare research, this would happen as part of specific and contained research projects, and with specific and individual consent. Similarly, anonymised data have long been offered for research and secondary purposes without the option of an Opt Out [ 57 ]. Both these practices continue to this day. However, the advent of the Big Data era with its major advancements in data science [ 58 ], and the subsequent datafication of healthcare [ 59 , 60 ], along with the regulatory, legal and policy changes briefly described above, have given rise to a more complex situation. As the next section demonstrates, the financialisation of healthcare complicates matters further [ 61 , 62 ].

In 2011, the then UK Prime Minister David Cameron promised that everyone would be a research subject by default [ 62 ]. Cameron’s promise was part of a broader vision – still persisting today – to facilitate research and innovation by making Britain a world leader in the life sciences sector and the health-tech industry [ 6 ]. Footnote 5 The NHS, with its longitudinal generational health data, has become instrumental in the realisation of this vision, as consecutive UK governments have been strategising for ways not to ‘waste’ what is now seen as a valuable asset but to exploit it as a way of competing in the global health knowledge market [ 63 ]. The assetisation of NHS data, through their transformation into a financial resource that can participate in a speculative economy, enables not only the financialisation of healthcare research and innovation [ 64 ] but also the further collapse of boundaries between research, innovation and the provision of healthcare. Nowadays, NHS patients, staff and the public as a whole are expected to support the NHS as the main healthcare providing institution, but also to participate in ‘power[ing] the UK economy’ [ 65 ] by supporting the NHS’ role as ‘a major investor and wealth creator in the UK’ [ 7 ], as the boundaries between research and direct clinical care become increasingly blurred [ 61 ].

The aforementioned digitisation of the NHS and assetisation of NHS data coincide with the entry of new private commercial interests and companies into the healthcare space. Global consumer tech companies such as Google, Microsoft and Amazon, along with numerous other intermediaries and start-ups that were, up until very recently, alien to the healthcare space, can now buy, have access to or be handed over NHS patient data for the development of for-profit tools and products [ 66 , 67 , 68 ] and for the furthering of their own commercial strategies [ 69 ]. Furthermore, such private companies can secure strategic infrastructural positions, acting as health ‘data intermediaries’ [ 69 ] or data ‘prospectors’ [ 70 ], as they provide the proprietary bricks and mortar of ‘essential infrastructures’ [ 71 , 72 ]. As Prainsack notes, the increased digitisation of health and health-related activities allows these companies to perform a number of roles simultaneously, from producing devices used for patient monitoring to developing the software that collects and processes the data, which affords them not only a much greater role but also increasing power in this space [ 72 , 73 ].

These developments have raised concerns about the ways in which healthcare data are utilised and the types of data sharing strategies employed, resulting, in recent years, in some costly – in both monetary and reputational terms – policy decisions [ 14 , 39 , 45 , 74 ]. So, in 2016, with more than a million people opting out (when given the chance) [ 75 ], the care.data plan was abandoned. After its failure, some argued that what was needed was greater transparency and a better information campaign about the benefits of sharing and using NHS patient data [ 76 ]. However, as others have pointed out, attempts to address this perceived public trust deficit by trying to inform and educate people on the benefits of data sharing and new technologies fail to take into account the underlying reasons that lead to public distrust [ 3 , 77 ]. So it is no surprise that when, in May 2021, the government attempted to share NHS patient data with private companies for research and development under the General Practice Data for Planning and Research (GPDPR) scheme, concerns were raised once again. After more than a million people opted out of the scheme within a month of its announcement, again the government had to pause [ 10 ]. Although there are still plans for the GPDPR scheme to go ahead, it is an open question whether public trust will follow [ 14 , 39 , 45 ]. Recent concerns raised about the involvement of the controversial private company Palantir in key NHS data operations, during and beyond the pandemic, indicate that this issue is not going away any time soon [ 78 , 79 ].

Changing roles

Nye Bevan’s original vision for the NHS was of a healthcare system based on solidarity; namely, a principled relationship between the public, the health service and the State, within which certain rights but also duties and obligations for all parties involved emerge as a type of ‘communal responsibility of and for each and all’ for the provision of the public good that is healthcare [ 30 ]. This solidaristic character is still reflected in the NHS Constitution, which outlines rights and responsibilities and sets out ‘how patients, the public and staff can help the NHS work effectively and ensure that finite resources are used fairly’ [ 49 ].

In recent decades, arguments have been made to defend a new way in which the public ought to discharge its solidarity-based obligation in the healthcare context and promote this public good, that is, through (voluntary) participation in biomedical research, especially when participation is minimally risky and minimally invasive, such as submitting samples and data to a biobank [ 80 , 81 , 82 , 83 ]. Others, though, have rejected these claims, suggesting that research participation can only be understood as an imperfect moral duty, rather than a strong moral obligation [ 84 ], and raising justice-based concerns, particularly in relation to who stands to benefit and whether research participation is the best way to promote the public good of healthcare or demonstrate solidarity [ 85 , 86 , 87 ]. Interestingly, even those defending research participation as a moral obligation do not go as far as to suggest that such an obligation should be mandatory or legally enforced, maintaining the importance of autonomy in this context [ 80 , 82 ]. Yet, in the case of anonymised NHS data, the policy decision to make everyone a default research subject, in the sense that all patients’ anonymised data can potentially be shared, bypasses these ethical debates and forecloses the normative conclusion. In the monopolistic healthcare system that is the NHS, participation in this particular version of research and innovation, and in the broader ‘health and wealth’ agenda, becomes not just a moral duty, but an unavoidable civic obligation inextricably tied to the public’s ability to access healthcare, and part of the solidarity-based relationship between the public and the healthcare service.

The adoption and operationalisation of the ‘health and wealth’ agenda in England stealthily, yet fundamentally, changes the role of the NHS: from a public institution tasked with the provision of healthcare to one tasked with (also) using its position as the main healthcare provider to generate a resource to promote research and innovation, including in the private sector. This change inevitably impacts on the relationship between the public and the NHS. Nowadays, members of the public are not just citizens of a welfare state who support the system through taxation and the prudent use of resources, but also data subjects enlisted in supporting a particular approach to research and innovation merely by virtue of seeking healthcare [ 63 ]. In this sense, the public is required to “pay twice” for their healthcare: once through taxation, and again through their data. In this new relationship, patients are both citizens of the welfare state (whose taxes are still funding the healthcare service) and also datafied entities whose data are the asset to be traded in this new economy, as the roles of citizen, patient and data subject merge together.

And it is not just the public that seem to have acquired a multiplicity of roles in this new healthcare landscape. The healthcare service itself is also taking on new aims and objectives, including stimulating economic growth and supporting private sector collaborations [ 88 ]. The expansion of the role the NHS plays as a public institution in society is neither straightforward nor uncontroversial. The fact that different governments even within the UK have chosen to adopt different strategies when it comes to managing health data demonstrates that co-opting a solidarity-based public healthcare system to support a wealth-generating agenda is a political choice rather than a socio-economic inevitability [ 89 ]. This change unites the pursuit of universal healthcare as a public good with the aims of successive UK governments to derive wealth based on a particular neoliberal model of bioeconomy [ 43 , 90 ]. As such, and to return to Warren’s analysis of public trust, it mixes an impartial public and trusted institution with a partial one that seeks to secure its political goals using the former as its means [ 29 ]. Furthermore, it brings into question the NHS’ commitment to serving and promoting the public good of healthcare, as it forces it to endorse multiple aims, including the generation of wealth for the private sector. The cost, as cases such as the implementation of care.data demonstrate, is the growth of distrust resulting in the seemingly contradictory situation where the public declares its trust for the NHS to handle its data [ 31 ], while, given the opportunity, it votes with its feet when such initiatives are introduced.

So far, we have seen that the public is committed to healthcare as a public good and continues to support a national healthcare service that is dedicated to providing care to all according to need [ 31 ]. They are happy to support this service with their taxes as well as with their data, as long as it is for the “greater public good” [ 12 , 91 ]. At the same time, they repeatedly object to a version of the NHS that uses patient data to promote economic and industrial growth, including in the private sector. For example, a systematic review of public opinions on the use of patient data for research in the UK and Republic of Ireland showed that the public’s support is widespread yet conditional upon competence in keeping data secure and upon freedom from the interference of “private interests” [ 91 ]. In their workshops on patients’ views of the possible benefits from reuse of personal health data, Aitken et al. note that none of their participants ‘spoke of societal benefits in terms of economic benefit’ [ 92 ]. Furthermore, earlier work has shown that regardless of their success in attaining targets and improving cost-efficiency, NHS reform programmes that threaten the legitimacy of the public service are met with public distrust and unease [ 93 ].

Following our analysis, the problem that emerges is one of a misalignment of aims and values between the public and the healthcare service, which impacts on the trust relationship. Adding further economic aims to the function of the NHS calls into question its commitment to health as the main and sole public good served by this public institution. Under the ‘health and wealth’ agenda, the health service is required to promote multiple goals, and it is this that introduces distrust into the relationship. Warren’s theory that partial institutions do not warrant public trust suggests that it is not possible for the NHS to behave like a partial institution, one that serves aims beyond that of promoting the public good, and still warrant public trust [ 29 ].

Healthcare is widely accepted as a public good and public healthcare institutions, such as the NHS, are founded on that understanding. The ‘health and wealth’ policy agenda adopted by successive governments in the past couple of decades disrupts this understanding as it collapses one term onto the other. The datafication of healthcare, as it is currently pursued in England, further reinforces this relationship as illustrated by the recent Data Saves Lives policy report which proclaims: ‘So that we can continue to provide the best care for the citizens we serve, we must safely grasp the opportunities for data-driven innovation […] and power the UK economy’ [ 65 ].

However, recent reports on the complex and fragmented landscape of NHS data reveal that the ways in which NHS data are shared with external and private actors, and the extent to which any benefits that derive from these data then return to the NHS and the general public, if at all, are far from straightforward [ 67 , 68 ]. This demonstrates that, despite its rhetorical neatness, the formulation data = wealth = health that many policy reports, such as the aforementioned Data Saves Lives [ 65 ], appear to assume needs to be carefully examined and robustly demonstrated, rather than just wishfully proclaimed, especially if it is to convince a sceptical public. As such, an important theoretical question that needs to be investigated is the following: under what conditions could the solidarity-based obligation to promote health between the public and the State be reasonably expected to also include the promotion of research and innovation? This is a question which cannot be settled merely by arguing that the public has a moral obligation to participate in research. It needs to be demonstrated that this is not just an imperfect or weak obligation, but one that should become a mandatory civic duty, part of the solidarity-based relationship between the national health institution and the public. This would require both philosophical and empirical work. For example, in order for such an expansion of aims to be acceptable within the existing relationship between the public and the health service, one should examine whether these multiple goals are compatible with each other or whether they might lead to the corruption or corrosion of existing and accepted values and priorities [ 94 ]. What are the consequences of taking on multiple aims for the provision of care on the ground, and how should conflicts between achieving these different aims be resolved?

Furthermore, one would need to explore whether and to what extent these activities, as they are currently pursued, serve, directly and primarily, the public good of healthcare. Are these activities the most effective and efficient ways to promote health, as opposed to other types of social interventions? Is research and innovation, as currently practiced, able to address existing and widespread social, economic and political inequalities, thus making the promotion of research and innovation the preferred expression of civic or even global solidarity? The latter is particularly pertinent, as there are many who argue that tackling inequalities in the distribution of power, money and resources, and improving the conditions in which people are born, grow, live, work and age, can do more to promote the health of the public than investment in research and in data-intensive health technologies like genomics and AI [ 85 , 86 , 95 ].

Finally, and once we have a clearer theoretical and empirical understanding of these issues, it would be necessary for an honest and informed public debate to take place to ascertain whether this expansion of the aims served by the NHS should become part of a new social contract. More fundamentally, it needs to be determined through public dialogue what values should be endorsed and promoted by a public institution like the NHS. Would the public be ready and willing to support the changing character of the NHS and adopt its new role in this relationship? And, most importantly, would it still be able to trust it?

Conclusion

This paper set out to address the question: what is it about sharing NHS data for research and innovation that challenges public trust? In order to do so, it drew on political theory to provide an account of public trust that helps better understand the relationship between the public and the NHS within a democratic context, as well as the kind of obligations and expectations that govern this relationship. After examining whether the ways in which the NHS is managing patient data, along with its collaboration with the private sector, fit under this trust-based relationship, it argued that the digitisation of the NHS and the broader ‘health and wealth’ agenda adopted by consecutive UK governments represent a major shift in the institutional character of the NHS. We demonstrate that this shift brings into question the meaning of public good the NHS is expected to provide, hence challenging public trust. In conclusion, the paper argues that in order to address the problem of public trust the following are needed: (a) a theoretical and empirical examination of the benefits but also the costs associated with this shift, and (b) an open conversation at a public level to determine what values a public institution like the NHS should promote.

Data Availability

This manuscript is based on the review and analysis of relevant literature. It does not contain any original data.

Change history

04 October 2023

A Correction to this paper has been published: https://doi.org/10.1186/s12910-023-00955-4

Notes

Footnote 1: As Cheung explains, ‘Access to [identifiable] health data without consent is possible in England and Wales through Sect. 251 of the NHS Act 2006 and its current regulations (usually referred to as ‘Sect. 251’). Through this process, the Common Law of Confidentiality is temporarily set aside for the specific purpose applied for, although responsibilities resulting from the Data Protection Act are still applicable (e.g. the obligation to be ‘lawful, fair and transparent’).’ [ 28 ].

Footnote 2: While the option of an Opt-Out was nominally offered, Vezyridis gives an account of the problems, complications and omissions that rendered this option problematic when the care.data scheme was introduced. These include lack of public information on its availability and the inability to process the high number of subsequent Opt Outs. Although a number of steps were taken in an attempt to address these issues, the care.data system was finally withdrawn [ 28 ].

Footnote 3: The National Data Opt-out Operational Policy Guidance offers a more comprehensive explanation: ‘The opt-out does not apply when the individual has consented to the sharing of their data or where the data is anonymised in line with the Information Commissioner’s Office (ICO) Code of Practice on Anonymisation.’ [ 53 ].

Footnote 4: See also Meszaros for a detailed account of how the data protection terminology in key policy documents and guidelines in the UK contains inconsistencies, leading to confusion between the terms anonymisation, de-identification and pseudonymisation and challenging the soundness of this regulatory framework [ 56 ].

Footnote 5: Although the Health and Social Care Act 2012 has been superseded by the Health and Care Act 2022, the political appetite to facilitate research and innovation through the use of NHS data remains. The new Act merges NHSX and NHS Digital into NHSE/I. This move has been described as ‘a significant retrograde step in defending the rights of citizens with respect to the collection and use of their health data. And has the potential for undermining the relationship between clinicians and their patients’ [ 63 ].

WHO. Big data and artificial intelligence for achieving universal health coverage: an international consultation on ethics. Geneva: World Health Organisation; 2018.


Topol E. The Topol Review: preparing the healthcare workforce to deliver the digital future. National Health Service; 2019.

Kerasidou CX, Kerasidou A, Buscher M, Wilkinson S. Before and beyond trust: reliance in medical AI. J Med Ethics. 2022;48(11):852–6.


Knowles B, Richards JT, editors. The sanction of authority: Promoting public trust in ai. Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency; 2021.

Ghafur S, Fontana G, Halligan J, O’Shaughnessy J, Darzi A. NHS data: maximising its impact on the health and wealth of the United Kingdom. Imperial College London; 2020.

Department of Health and Social Care. New data strategy to drive innovation and improve efficiency [press release]. 2022 13 June. https://www.gov.uk/government/news/new-data-strategy-to-drive-innovation-and-improve-efficiency . Accessed 01 June 2023.

Department of Health. Innovation Health and Wealth: Accelerating Adoption and Diffusion in the NHS. London, UK. 2011.

Department of Health. Best research for Best Health: A new national health research strategy. London, UK. 2006.

Temperton J. NHS care.data scheme closed after years of controversy. WIRED. 2016 6 July. https://www.wired.co.uk/article/care-data-nhs-england-closed . Accessed 01 June 2023.

Jayanetti C. NHS data grab on hold as millions opt out. The Observer. 2021 22 August. https://www.theguardian.com/society/2021/aug/22/nhs-data-grab-on-hold-as-millions-opt-out . Accessed 01 June 2023

Moberly T. Should we be worried about the NHS selling patient data? BMJ. 2020;368:m113.

Centre for Data Ethics and Innovation. Public attitudes to data and AI: Tracker survey (Wave 2). 2022 2 November. https://www.gov.uk/government/publications/public-attitudes-to-data-and-ai-tracker-survey-wave-2/public-attitudes-to-data-and-ai-tracker-survey-wave-2 . Accessed 01 June 2023.

Papoutsi C, Reed JE, Marston C, Lewis R, Majeed A, Bell D. Patient and public views about the security and privacy of Electronic Health Records (EHRs) in the UK: results from a mixed methods study. BMC Med Inf Decis Mak. 2015;15(1):86.

Ipsos Mori. The one-way mirror: public attitudes to commercial access to health data. London: Wellcome Trust. 2016.

Sheehan M, Friesen P, Balmer A, Cheeks C, Davidson S, Devereux J, et al. Trust, trustworthiness and sharing patient data for research. J Med Ethics. 2021;47(12):e26–e.

Graham M. Data for sale: trust, confidence and sharing health data with commercial companies. J Med Ethics. 2021.

Baier A. Trust and Antitrust. Ethics. 1986;96(2):231–60.

Hawley K. Trust, distrust and commitment. Noûs. 2014;48(1):1–20.

O’Neill O. Trust, trustworthiness, and accountability. In: Morris N, Vines D, editors. Capital failure: rebuilding Trust in Financial Services. Oxford University Press; 2014. p. 172–190.

Jones K. Trust as an affective attitude. Ethics. 1996;107(1):4–25.

Möllering G. The nature of Trust: from Georg Simmel to a theory of expectation, interpretation and suspension. Sociology. 2001;35(2):403–20.

Luhmann N. Familiarity, Confidence, Trust: Problems and Alternatives. In: Gambetta D, editor. Trust: Making and Breaking Cooperative Relations. Blackwell; 1988.

O’Neill O. Autonomy and trust in Bioethics. Cambridge: Cambridge Univ Pr; 2002.


Wright S. Trust and trustworthiness. Philosophia. 2010;38(3):615–27.

O’Neill O. Linking Trust to Trustworthiness. Int J Philosophical Stud. 2018;26(2):293–300.

Kerasidou A. Trustworthy institutions in Global Health Research Collaborations. In: Ganguli-Mitra A, Sorbie A, McMillan C, Dove E, Postan E, Laurie G, et al. editors. The Cambridge Handbook of Health Research Regulation. Cambridge Law Handbooks. Cambridge: Cambridge University Press; 2021. pp. 81–9.


Lucassen A, Montgomery J, Parker M. Ethics and the social contract for genomics in the NHS. In: Davies SC. Annual Report of the Chief Medical Officer 2016, Generation Genome. Department of Health, London.

Horn R, Kerasidou A. Sharing whilst caring: solidarity and public trust in a data-driven healthcare system. BMC Med Ethics. 2020;21(1):110.

Warren M. Trust and democracy. In: Uslaner EM, editor. The Oxford handbook of social and political trust. Oxford: Oxford University Press; 2018. pp. 75–94.

Veitch K. Obligation and the changing nature of publicly funded Healthcare. Med Law Rev. 2018;27(2):267–94.

Wellings DJD, Maguire D, Appleby J, Hemmings N, Morris J, Schlepper L. Public satisfaction with the NHS and social care in 2021: results from the British Social Attitudes survey. The King’s Fund; 2022.

NHS. The NHS Constitution for England. 2021 https://www.gov.uk/government/publications/the-nhs-constitution-for-england/the-nhs-constitution-for-england . Accessed 01 June 2023.

Hill EM, Turner EL, Martin RM, Donovan JL. “Let’s get the best quality research we can”: public awareness and acceptance of consent to use existing data in health research: a systematic review and qualitative study. BMC Med Res Methodol. 2013;13:1–10.

Wellcome Trust. Summary report of qualitative research into public attitudes to personal data and linking personal data. Wellcome Trust London; 2013.

Clemence M, Gilby N, Shah J, Swiecicka J, Warren D, Smith P et al. Wellcome Trust Monitor Wave 2: Tracking Public Views on Science. Biomedical Research and Science Education, London: Research, Ipsos Mori. 2013:148.

Ritchie O, Reid S, Smith L. Review of public and professional attitudes towards confidentiality of healthcare data: final report. General Medical Council; 2015.

Singleton P, Lea N, Tapuria A, Kalra D. Public and Professional attitudes to privacy of healthcare data: a survey of the literature. General Medical Council. 2007.

Ghafur S, Van Dael J, Leis M, Darzi A, Sheikh A. Public perceptions on data sharing: key insights from the UK and the USA. Lancet Digit Health. 2020;2(9):e444–e6.

Jones LA, Nelder JR, Fryer JM, Alsop PH, Geary MR, Prince M, et al. Public opinion on sharing data from health services for clinical and research purposes without explicit consent: an anonymous online survey in the UK. BMJ Open. 2022;12(4):e057579.

Understanding Patient Data. Public attitudes to the use of patient data 2010–2018 and 2018–2021. https://understandingpatientdata.org.uk/how-do-people-feel-about-use-data . Accessed 01 June 2023.

Buckingham H, Dayan M. Privatisation in the English NHS: fact or fiction? Nuffield Trust [Internet]. 2019 15 November. https://www.nuffieldtrust.org.uk/news-item/privatisation-in-the-english-nhs-fact-or-fiction . Accessed 01 June.

Squires M. Is the NHS a business? British Journal of Medical Practice. 2014;64(622):257–8.

Bayliss K. Can England's National Health System reforms overcome the neoliberal legacy?. Int J Health Serv. 2022;52(4):480–91.

Moberly T. Ten things you need to know about the Health and Care Bill. BMJ. 2022;376:o361.

Chico V, Hunn A, Taylor M. Public views on sharing anonymised patient-level data where there is a mixed public and private benefit. NHS Health Research Authority, University of Sheffield School of Law. 2019.

Department of Health and Social Care. Digital transformation in the NHS. National Audit Office. 2020 May 15.

Faulkner-Gurstein R, Wyatt D. Platform NHS: reconfiguring a public service in the age of digital capitalism. Science, Technology & Human Values. 2021. p.1–21.

Department of Trade and Industry. Bioscience 2015: Improving National Health, Increasing National Wealth. Biotechnology Innovation and Growth Team. London, UK. 2003.

NHS. The NHS Constitution for England 2013. https://assets.publishing.service.gov.uk/government/uploads/system/uploads/attachment_data/file/170656/NHS_Constitution.pdf . Accessed 01 June 2023

Grace J, Taylor MJ. Disclosure of confidential patient information and the duty to consult: the role of the Health and Social Care Information Centre. Med Law Rev. 2013;21(3):415–47.

Vezyridis P, Timmons S. Dissenting from care.data: an analysis of opt-out forms. J Med Ethics. 2016;42(12):792–6.

Taylor M. Information governance as a force for good - Lessons to be Learnt from Care.Data. SCRIPTed. 2014;11:1.

NHS Digital. National Data Opt-out. 2020. https://digital.nhs.uk/services/national-data-opt-out/operational-policy-guidance-document . Accessed 01 June 2023.

NHS. Your NHS Data Matters: When your choice about sharing data from your health records does not apply. https://www.nhs.uk/your-nhs-data-matters/where-your-choice-does-not-apply/ . Accessed 01 June 2023.

Rocher L, Hendrickx JM, de Montjoye Y-A. Estimating the success of re-identifications in incomplete datasets using generative models. Nat Commun. 2019;10(1):3069.

Meszaros J, Ho CH. Building trust and transparency? Challenges of the opt-out system and the secondary use of health data in England. Med Law Int. 2019;19(2–3):159–81.

Department of Health. Confidentiality: NHS Code of Practice. 2003.

Kudyba S, Kwatinetz M. Introduction to the Big Data Era. In: Kudyba S, editor. Big Data, Mining, and Analytics. CRC Press. 2014. p. 1–17.

Ruckenstein M, Schüll ND. The datafication of Health. Annu Rev Anthropol. 2017;46(1):261–78.

Sharon T, Lucivero F. Introduction to the Special Theme: The expansion of the health data ecosystem – Rethinking data ethics and governance. Big Data & Society. 2019;6(2):1–5.

Hallowell N. Research or clinical care: what’s the difference? J Med Ethics. 2018;44:359–60.

BBC News. Everyone ‘to be research patient’, says David Cameron. 2011 5 December. https://www.bbc.co.uk/news/uk-16026827 . Accessed 1 June 2023.

Wienroth M, Pearce C, McKevitt C. Research campaigns in the UK National Health Service: patient recruitment and questions of valuation. Sociol Health Illn. 2019;41(7):1444–61.

Birch K. Rethinking value in the bio-economy: Finance, assetization, and the management of value. Sci Technol Hum Values. 2017;42(3):460–90.

Department of Health and Social Care. Data Saves Lives: reshaping health and social care with data. London, UK. 2022.

Horn R, Kerasidou A. Sharing whilst caring: solidarity and public trust in a data-driven healthcare system. BMC Med Ethics. 2020;21:110.

Murgia M, Harlow M. NHS shares English hospital data with dozens of companies. Financial Times. 2021 27 July. https://www.ft.com/content/6f9f6f1f-e2d1-4646-b5ec-7d704e45149e . Accessed 1 June 2023.

Goldacre B, MacKenna B. The NHS deserves better use of hospital medicines data. BMJ. 2020;370.

Sadowski J. Rethinking Data- blog series. 2022 1 June. https://www.adalovelaceinstitute.org/blog/political-economy-data-intermediaries/ . Accessed 1 June 2023.

Powles J, Hodson H. Google DeepMind and healthcare in an age of algorithms. Health and technology. 2017;7(4):351–67.

Srnicek N. Platform Capitalism. John Wiley & Sons. 2017.

Sharon T. The googlization of health research: from disruptive innovation to disruptive ethics. Per Med. 2016;13(6):563–74.

Prainsack B. The political economy of digital data: introduction to the special issue. Policy Stud. 2020;41(5):439–46.

van Staa T-P, Goldacre B, Buchan I, Smeeth L. Big health data: the need to earn public trust. BMJ. 2016;354:i3636.

NHS Digital. Care Information Choices - April 2016. https://digital.nhs.uk/data-and-information/publications/statistical/care-information-choices/care-information-choices-april-2016 . Accessed 1 June 2023.

Godlee F. What can we salvage from care.data? BMJ. 2016;354:i3907.

Sterckx S, Rakic V, Cockbain J, Borry P. “You hoped we would sleep walk into accepting the collection of our data”: controversies surrounding the UK care. data scheme and their wider relevance for biomedical research. Med Health Care Philos. 2016;19:177–90.

BBC News. Palantir: NHS faces legal action over data firm contract. 2021 24 February. https://www.bbc.co.uk/news/technoogy-56183785 . Accessed 1 June 2023.

Ungoed-Thomas J. Controversial £360m NHS England data platform ‘lined up’ for Trump backer’s firm. 2022 13 November. https://www.theguardian.com/society/2022/nov/13/controversial-360m-nhs-england-data-platform-lined-up-for-trump-backers-firm . Accessed 1 June 2023.

Harris J. Scientific research is a moral duty. J Med Ethics. 2005;31(4):242–8.

Rhodes R. In defense of the duty to participate in Biomedical Research. Am J Bioethics. 2008;8:37.

Schaefer GO, Emanuel EJ, Wertheimer A. The obligation to participate in biomedical research. JAMA. 2009;302(1):67–72.

Prainsack B, Buyx A. Solidarity in biomedicine and beyond. Cambridge University Press; 2017.

Shapshay S, Pimple KD. Participation in biomedical research is an imperfect moral duty: a response to John Harris. J Med Ethics. 2007;33(7):414–7.

Rennie S. Viewing research participation as a moral obligation: in whose interests? Hastings Cent Rep. 2011;41(2):40–7.

Neuhaus CP. Does Solidarity require “All of Us” to participate in Genomics Research? Hastings Cent Rep. 2020;50(S1):62–69.

de Melo-Martín I. A duty to Participate in Research: does Social Context Matter? Am J Bioeth. 2008;8(10):28–36.

NHS. The NHS Long Term Plan: Research and innovation to drive future outcomes improvement. 2019.

McCartney M. Care.data: why are Scotland and Wales doing it differently? BMJ. 2014;348.

Viens AM. Neo-liberalism, austerity and the political determinants of health. Health Care Anal. 2019;27(3):147–52.

Stockdale J, Cassell J, Ford E. “Giving something back”: a systematic review and ethical enquiry into public views on the use of patient data for research in the United Kingdom and the Republic of Ireland. Wellcome Open Res. 2018;3:6.

Aitken M, Porteous C, Creamer E, Cunningham-Burley S. Who benefits and how? Public expectations of public benefits from data-intensive health research. Big Data & Society. 2018;July–December:1–12.

Taylor-Gooby P, Wallace A. Public values and public trust: responses to welfare state reform in the UK. J social policy. 2009;38(3):401–19.

Walsh A. Commercialisation and the corrosion of the ideals of medical professionals. In: Therese Feiler JH, Andrew, Papanikitas, editors. Marketisation, Ethics and Healthcare: policy, practice and moral formation. Routledge; 2018. pp. 133–46.

Marmot M, Bell R. Social inequalities in health: a proper concern of epidemiology. Ann Epidemiol. 2016;26(4):238–40.


Acknowledgements

The authors would like to acknowledge the support of NDPH Senior Fellowship for this work. The authors would also like to thank the anonymous reviewers for their insightful comments.

Funding

This work is supported by an NDPH Senior Fellowship.

Author information

Authors and affiliations.

Ethox Centre, Oxford Population Health (Nuffield Department of Population Health), Big Data Institute, Li Ka Shing Centre for Health Information and Discovery, University of Oxford, Oxford, UK

Angeliki Kerasidou

Charalampia (Xaroula) Kerasidou


Contributions

AK conceived and prepared the first draft. XK further analysed the argument. AK and XK both revised, edited and approved the final manuscript.

Corresponding author

Correspondence to Angeliki Kerasidou .

Ethics declarations

Ethics approval and consent to participate.

Not applicable.

Consent for publication

Not applicable.

Competing interests.

The authors declare that they have no competing interests.

Additional information

Publisher’s note.

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/ . The Creative Commons Public Domain Dedication waiver ( http://creativecommons.org/publicdomain/zero/1.0/ ) applies to the data made available in this article, unless otherwise stated in a credit line to the data.

Reprints and permissions

About this article

Cite this article.

Kerasidou, A., Kerasidou, C. Data-driven research and healthcare: public trust, data governance and the NHS. BMC Med Ethics 24 , 51 (2023). https://doi.org/10.1186/s12910-023-00922-z


Received : 24 March 2023

Accepted : 13 June 2023

Published : 14 July 2023

DOI : https://doi.org/10.1186/s12910-023-00922-z


Keywords

  • Data sharing
  • Data governance

BMC Medical Ethics

ISSN: 1472-6939





  • Moon, M.J. Shifting from old open government to new open government: Four critical dimensions and case illustrations. Public Perform. Manag. Rev. 2020 , 43 , 535–559. [ Google Scholar ] [ CrossRef ]
  • Zuiderwijk, A.; Janssen, M.; Choenni, S.; Meijer, R.; Alibaks, R.S. Socio-technical impediments of open data. Electr. J. e-Gov. 2012 , 10 , 156–172. [ Google Scholar ]
  • Zhao, Y.; Fan, B. Effect of an agency’s resources on the implementation of open government data. Inf. Manag. 2021 , 58 , 103465. [ Google Scholar ] [ CrossRef ]
  • Hopp, W.J.; Li, J.; Wang, G. Big Data and the Precision Medicine Revolution. Prod. Oper. Manag. 2018 , 27 , 1647–1664. [ Google Scholar ] [ CrossRef ]
  • Chen, P.-T. Medical Big Data Applications: Intertwined Effects and Effective Resource Allocation Strategies Identified through IRA-NRM Analysis. Technol. Forecast. Soc. Chang. 2018 , 130 , 150–164. [ Google Scholar ] [ CrossRef ]
  • Heimstädt, M.; Saunderson, F.; Heath, T. From toddler to teen: Growth of an open data ecosystem. JeDEM-eJournal eDemocracy Open Gov. 2014 , 6 , 123–135. [ Google Scholar ] [ CrossRef ]
  • Jetzek, T.; Avital, M.; Bjørn-Andersen, N. The sustainable value of open government data. J. Assoc. Inf. Syst. 2019 , 20 , 702–734. [ Google Scholar ] [ CrossRef ]
  • Wang, H.J.; Lo, J. Adoption of open government data among government agencies. Gov. Inf. Q. 2016 , 33 , 80–88. [ Google Scholar ] [ CrossRef ]
  • Fang, J.; Zhao, L.; Li, S. Exploring open government data ecosystems across data, information, and business. Gov. Inf. Q. 2024 , 41 , 101934. [ Google Scholar ] [ CrossRef ]
  • Yang, T.M.; Lo, J.; Shiang, J. To open or not to open? Determinants of open government data. J. Inf. Sci. 2015 , 41 , 596–612. [ Google Scholar ] [ CrossRef ]
  • Liu, Z.G.; Li, X.Y.; Zhu, X.H. Scenario Modeling for Government Big Data Governance Decision-Making: Chinese Experience with Public Safety Services. Inf. Manag. 2022 , 59 , 103622. [ Google Scholar ] [ CrossRef ]
  • Safarov, I.; Meijer, A.; Grimmelikhuijsen, S. Utilization of open government data: A systematic literature review of types, conditions, effects and users. Inf. Polity 2017 , 22 , 1–24. [ Google Scholar ] [ CrossRef ]
  • Jetzek, T.; Avital, M.; Bjorn-Andersen, N. Data-driven innovation through open government data. J. Theor. Appl. Electron. Commer. Res. 2014 , 9 , 100–120. [ Google Scholar ] [ CrossRef ]
  • Magalhaes, G.; Roseira, C. Open Government Data and the Private Sector: An Empirical View on Business Models and Value Creation. Gov. Inf. Q. 2020 , 37 , 101248. [ Google Scholar ] [ CrossRef ]
  • Trabucchi, D.; Buganza, T.; Pellizzoni, E. Give Away Your Digital Services: Leveraging Big Data to Capture Value New models that capture the value embedded in the data generated by digital services may make it viable for companies to offer those services for free. Res. Technol. Manag. 2017 , 60 , 43–52. [ Google Scholar ] [ CrossRef ]
  • Minatogawa, V.L.F.; Franco, M.M.V.; Rampasso, I.S.; Anholon, R.; Quadros, R.; Durán, O.; Batocchio, A. Operationalizing Business Model Innovation through Big Data Analytics for Sustainable Organizations. Sustainability 2019 , 12 , 277. [ Google Scholar ] [ CrossRef ]
  • Akter, S.; Wamba, S.F.; Gunasekaran, A.; Dubey, R.; Childe, S.J. How to improve firm performance using big data analytics capability and business strategy alignment? Int. J. Prod. Econ. 2016 , 182 , 113–131. [ Google Scholar ] [ CrossRef ]
  • Erevelles, S.; Fukawa, N.; Swayne, L. Big Data consumer analytics and the transformation of marketing. J. Bus. Res. 2016 , 69 , 897–904. [ Google Scholar ] [ CrossRef ]
  • Wu, L.; Hitt, L.; Lou, B. Data analytics, innovation, and firm productivity. Manag. Sci. 2020 , 66 , 2017–2039. [ Google Scholar ] [ CrossRef ]
  • Story, V.; O’Malley, L.; Hart, S. Roles, role performance, and radical innovation competencies. Ind. Mark. Manag. 2011 , 40 , 952–966. [ Google Scholar ] [ CrossRef ]
  • Callaway, B.; Sant’Anna, P.H. Difference-in-differences with multiple time periods. J. Econom. 2021 , 225 , 200–230. [ Google Scholar ] [ CrossRef ]
  • Beraja, M.; Yang, D.Y.; Yuchtman, N. Data-intensive innovation and the state: Evidence from AI firms in China. Rev. Econ. Stud. 2023 , 90 , 1701–1723. [ Google Scholar ] [ CrossRef ]
  • Sun, L.; Abraham, S. Estimating dynamic treatment effects in event studies with heterogeneous treatment effects. J. Econom. 2021 , 225 , 175–199. [ Google Scholar ] [ CrossRef ]
  • Frank, A.G.; Dalenogare, L.S.; Ayala, N.F. Industry 4.0 technologies: Implementation patterns in manufacturing companies. Int. J. Prod. Econ. 2019 , 210 , 15–26. [ Google Scholar ] [ CrossRef ]
  • Jha, A.K.; Agi, M.A.N.; Ngai, E.W.T. A note on big data analytics capability development in supply chain. Decis. Support Syst. 2020 , 138 , 113382. [ Google Scholar ] [ CrossRef ]
  • Jetzek, T.; Avital, M.; Bjørn-Andersen, N. The Value of Open Government Data: A Strategic Analysis Framework. In Proceedings of the 2012 Pre-ICIS Workshop: Open Data and Open Innovation in eGovernment, Orlando, FL, USA, 15 December 2012. [ Google Scholar ]


| Variables | (1) Substantial Data | (2) Little Data |
| --- | --- | --- |
| Total product innovation | 15.4 (37.2) | 10.6 (44.8) |
| Minor version | 5.0 (16.2) | 2.9 (21.0) |
|  | 371.5 (1676.5) | 148.7 (931.0) |
|  | 56.7 (869.8) | 9.5 (58.0) |
| Duration of the company | 38.4 (13.1) | 31.4 (13.0) |
| Operation capital | 55.6 (893.5) | 8.5 (57.2) |
| N | 14,884 | 8,316 |


Gao, X.; Feng, H. Data-Driven Business Innovation Processes: Evidence from Authorized Data Flows in China. Systems 2024 , 12 , 280. https://doi.org/10.3390/systems12080280


Towards data-driven decision making: the role of analytical culture and centralization efforts

  • Original Paper
  • Open access
  • Published: 16 September 2023


  • Ágnes Szukits   ORCID: orcid.org/0000-0001-5719-7543
  • Péter Móricz   ORCID: orcid.org/0000-0002-2534-0162


The surge in data-related investments has drawn the attention of both managers and academia to the question of whether and how this (re)shapes decision making routines. Drawing on the information processing theory of the organization and the agency theory, this paper addresses how putting a strategic emphasis on business analytics supports an analytical decision making culture that makes enhanced use of data in each phase of the decision making process, along with a potential change in authorities resulting from shifts in information asymmetry. Based on a survey of 305 medium-sized and large companies, we propose a multiple-mediator model. We provide support for our hypothesis that top management support for business analytics and perceived data quality are good predictors of an analytical culture. Furthermore, we argue that the analytical culture increases the centralization of data use, but interestingly, we found that this centralization is not associated with data-driven decision making. Our paper positions a long-running debate about information technology-related centralization of authorities in the new context of business analytics.


1 Introduction

There is a substantial body of research demonstrating the effects of information technology (IT) on organizations, which supports the idea that technological advancements are related to organizational change, although the nature of this relationship has been much debated (Orlikowski 1992 ; Robey et al. 2013 ). The seemingly endless stream of new IT solutions and the ever-expanding range of human activities affected by IT mean that research into the effects of IT must continue (Robey et al. 2013 ). Recently, digital technologies have been rightly at the center of research endeavors. The proliferation of newer and newer technologies enables data storage and processing and increases the accessibility of data. Due to this increased accessibility, data is no longer just a valuable asset used primarily to support management, but becomes the core of all organizational processes (Alaimo and Kallinikos 2022 ).

These new digital technologies have a distinctive nature in several aspects (Yoo 2012 ). The distinctive nature of recent business analytics (BA) techniques and related digital technologies challenges the reliability of previous results on IT-induced organizational change and calls for a rethinking of what we already know in this field. The emerging nature of these technologies does not only refer to the continuous development they undergo, but also signals that their use and organizational consequences remain diverse and unexplored (Bailey et al. 2022 ). Therefore, this paper is motivated by this serious interest in understanding the nature of organizational consequences related to the new emerging field of BA.

In the last 15 years, we have learned a lot about some aspects and organizational implications of BA use. With the growing adoption of BA, the attention of both practitioners and academics was first directed toward how it produces higher profits and growth (Davenport and Harris 2007 ), can create business value (Seddon et al. 2017 ; Krishnamoorthi and Mathew 2018 ; Grover et al. 2018 ), and improve overall firm performance (Ferraris et al. 2018 ; Aydiner et al. 2019 ; Karaboga et al. 2022 ). Despite repeated reinforcement of the relationship between BA and firm performance, the benefits of BA investments are far from evident (Ross et al. 2013 ). To reap the benefits of BA, top-level management must be open to putting data ahead of their instincts and prior knowledge (McAfee and Brynjolfsson 2012 ; Gupta and George 2016 ). Thus, if data-driven decision making is a prerequisite for generating value from BA, understanding organizational factors that support the use of data in the decision making process is vital.

Previous research had made significant efforts to identify these organizational factors, mainly in the context of business intelligence (BI) use. While some companies tend to have a longer, others shorter, or no tradition in data-driven decision making and fact-supporting culture, this is something that can be built and strengthened starting at the top (Ross et al. 2013 ). This facilitating role of top management has been repeatedly proven for various types of IT initiative (Young and Jordan 2008 ; Wirges and Neyer 2022 ). Due to increased organizational dependence on information systems and data, managerial attention is directed toward maintaining data quality (Gorla et al. 2010 ). The good quality of data has long been argued as an antecedent to benefiting from BI systems (Işık et al. 2013 ; Wieder and Ossimitz 2015 ) and more recently from BA (Seddon et al. 2017 ; Torres and Sidorova 2019 ), usually justified on the ground that improved data quality will likely support analytic culture (Popovič et al. 2012 ; Kulkarni et al. 2017 ) and improve decision making processes (Gorla et al. 2010 ; Côrte-Real et al. 2020 ).

Prior research endeavored to gather factors that support data-driven decision making, resulting in a growing but scattered body of evidence. However, the joint effects of these factors in the context of BA remain poorly understood. To address this gap, we present a top management perspective that explains the managerial choice to meet information processing needs. The proposed model, rooted in organizational information processing theory (OIPT), posits that BA enhances organizational information processing capabilities, allowing informed decisions by top management (Cao et al. 2015 ). Accentuating the choice of top management over technological options, top management support for BA initiatives is an exogenous construct that influences data quality perception factors and the development of an analytical decision making culture. Accordingly, we hypothesize that top management support, along with improved data quality perceptions, fosters an analytical decision making culture, facilitating the data-driven decision making process .

Moreover, we call attention to another, somewhat neglected means of managing the information processing capacity of organizations: the adequate organizational design (Tushman and Nadler 1978 ), particularly the decisions surrounding centralization. We propose that due to changes in information distribution, analytical decision making culture acts as a centralizing force, and centralization further motivates top management to rely on data in decision making . Agency theory explains the degree of centralization as a consequence of information asymmetry, where top management’s information disadvantage necessitates decentralization of decision making (Qiao 2022 ). Information systems can alter the level of information asymmetry (Jensen and Meckling 1976 ), although the impact of technologies remains ambiguous (Bloom et al. 2014 ). Furthermore, due to the unique nature of BA, previous research findings on the effects of technology on centralization cannot be directly extrapolated. Limited evidence suggests that BA may improve autonomy at lower levels due to cost reduction in data acquisition, but inexpensive communication may counteract this effect by allowing swift alerting of top management (Bloom et al. 2014 ). To fill this empirical gap, this paper addresses centralization in data use and introduces it as a factor that influences data-driven decision making in BA context.

To find support for our hypotheses, we analyzed the survey data of 305 mid-size and large companies registered in Hungary and suggested a multiple mediator model estimated by the WPLS algorithm with weights applied to ensure the representativeness of the results. Based on the opinion of the highest paid person (McAfee and Brynjolfsson 2012 ), we conclude that top management support of BA plays a crucial role in building an analytical decision making culture, an effect that is partially mediated by perceived data quality. Both the improvements in data quality and the analytical culture support the main endogenous construct, data-driven decision making. Interestingly, centralization in data use proved to be dual in nature: analytical culture strengthens centralization in data use without any consequences on the extent to which decision making will be data-driven.

Our research contributes to the literature in several ways. First, the paper verifies that the main drivers of an analytical culture emphasized in prior research also hold in the BA context. Second, we propose a measurement scale for data-driven decision making, an organizational phenomenon that has been investigated but measured in a fragmented way in the earlier literature. We rely on Herbert Simon, the forefather of information processing in relation to decision making (Joseph and Gaba 2020 ). An 8-item scale is proposed based on Simon’s ( 2013 ) decision making phases, a model that is often referred to and applied conceptually but, to our knowledge, has not yet been used for operationalization. Third, drawing on OIPT and agency theory, we identify centralization in data use as a consequence of analytical culture. Previous research has debated for decades whether the application of novel technologies causes centralization or decentralization, yet we know little about the effect of current BA initiatives and the supporting organizational environment in this regard. This paper adds to this underinvestigated research topic by revealing the dual nature of centralization in data use. This duality can explain the contradictions reported in previous investigations.

In the remainder of the paper, we first introduce the concepts of information processing and information asymmetry, the interpreting frameworks for the key constructs of data-driven decision making and analytical decision making culture. We describe the phenomenon of centralization in data use drawing on agency theory and review the information system literature to show the controversies. On the basis of our theoretical foundations and findings of prior research, we develop several hypotheses resulting in a multiple-mediator model. The third chapter presents the data collection method, measurement properties, and the results of hypothesis testing. While most hypotheses have been supported, we pay more attention to the non-verified relationships in the discussion section. Finally, we draw implications for both theory and practice and outline further research possibilities that can overcome the limitations of current research.

2 Theoretical background and hypotheses development

2.1 Organizational information processing, information asymmetry and information systems

The OIPT conceptualizes organizations as open systems that seek to adapt to contextual factors by reducing uncertainty in decision making processes (Zhu et al. 2018 ). The open system approach does not simply mean that the system adapts to its environment, but that a failure of adaptation, mismatch, undermines the viability of the organization (Scott 1981 ; Scott and Davis 2015 ). OIPT conceptualizes the mismatch “between contextual factors and management practices as a gap between information-processing requirements and information-processing capacity” (Zelt et al. 2018 , p. 70). To improve information processing capacity, organizations can invest in vertical information systems that process information gathered during task execution without overwhelming communication tiers. Alternatively, they establish lateral relationships that extend across the hierarchy and push decision making down within the organization (Galbraith 2014 ). Vertical information systems, such as an enterprise resource planning (ERP) system, allow organizations to process data efficiently and intelligently, and this increase in information processing capacity enables organizations to respond rapidly to the growing information processing needs (Srinivasan and Swink 2018 ). Information systems that improve information availability (visibility) support lateral relations. Corporate intranets, BI platforms, or systems that acquire current and valuable information from customers or suppliers enhance information processing capacity, result in more effective decision making, and contribute to the overall responsiveness of the organization (Bloom et al. 2014 ; Srinivasan and Swink 2018 ).

The term BI denotes the cluster of decision support technologies that aim to systematically gather and transform information from internal and external sources (systems) into actionable insights to support decision making (Rouibah and Ould-ali 2002 ; Chaudhuri et al. 2011 ). More recently, BA has been interpreted as evidence-based problem recognition and resolution by applying descriptive, predictive, or prescriptive analytical methods (Holsapple et al. 2014 ; Appelbaum et al. 2017 ), as well as a business application of data analytics in a broad sense (Duan and Xiong 2015 ). It is a subset of BI (Gudfinnsson et al. 2015 ) or an important extension of BI that goes beyond elementary reporting solutions (Laursen and Thorlund 2016 ), often referred to jointly as business intelligence and analytics (BI&A) (Kowalczyk and Gerlach 2015 ; Arnott and Pervan 2016 ). Informed by OIPT, researchers extensively investigated BI&A and confirmed that it strengthens information processing capacity with wide-ranging organizational impacts on decision making effectiveness (Cao et al. 2015 ), supply chain performance (Yu et al. 2021a , b ), and resilience (Dubey et al. 2021 ).

OIPT further posits that organizations use information processing activities that are most suited for the type and quantity of information asymmetry with which they must deal (Aben et al. 2021 ). In this view, information asymmetry is understood as uncertainty and equivocality, where uncertainty refers to the lack of information and equivocality refers to its ambiguity (Weick 1979 ; Daft and Lengel 1986 ). This interpretation of information asymmetry replicates the information processing capacity requirement discussed above. Over and above, information asymmetry refers to the differences in information available to different decision makers within organizations (Saam 2007 ). This fundamental claim of another information theory, namely agency theory, draws attention to the unequal distribution of information and assumes that lower-level managers have an information advantage (Bergh et al. 2019 ). Since lower levels have the edge in terms of information, management is compelled to transfer decision making authority (Jensen and Meckling 1976 ), while also seeking to reduce information asymmetries.

Utilizing the possibilities provided by BA can mitigate problems caused by equivocality and ambiguity in decision contexts (Kowalczyk and Buxmann 2014 ) by generating more data and making sense of it (Aben et al. 2021 ). Furthermore, BA moves the information advantage from local to top management by establishing increased central control over data (Labro et al. 2022 ). Underpinned by these prior research results, we argue that BA is able to cope with information asymmetries both in terms of reducing uncertainty and equivocality and in terms of diminishing the information disadvantage at the top. This potential of BA leads us to examine how BA shapes data utilization in a decision context.

2.2 Link to decision making culture and data-driven decision making process

One of the main objectives of organizational information processing is to make decisions (Choo 1996 ). Recognizing the differences in the gathering and evaluation of managerial information, the literature proposed various decision strategies (Evans 2010 ). The nonintuitive analytic processing type focuses on collecting, collating, analyzing, and interpreting the available information and ends with consciously derived rational choices (Hammond 1996 ). Rationality in decision making manifests itself in the comprehensive search for information, inventorizing, and evaluation of alternatives (Thunholm 2004 ), suggesting the presence of rationality in the whole decision making process instead of limiting it to the final choice. This long-established process approach to decisions (Svenson 1979 ) requires the identification of a sequence of successive phases such as 3 phases of identification, development, and selection (Mintzberg et al. 1976 ) or 4 phases of problem identification, problem definition, prioritization, and test for cause-effect relationships (Kepner and Tregoe 2005 ). Process descriptions of decision making are rather rational approaches, whereby the most widely applied and referred model has been developed by Simon ( 2013 ), differentiating between the intelligence, design, choice, implementation, and monitoring phases. In line with these process approaches, we conceptualize data-driven decision making as collecting, analyzing, and using verifiable data (Maxwell et al. 2016 ) through each step of the decision making process. The existence of decision rationality and the use of available data can be interpreted and investigated not only at an individual level but also at the organizational level, referred to as analytical decision making culture (Popovič et al. 2012 ), analytical decision making orientation (Kulkarni et al. 2017 ), rational decision making praxis (Cabantous and Gond 2011 ), or data-driven culture (Karaboga et al. 2022 ). The literature argues that this organizational-level social practice promotes rationality and reliance on data in individual decisions.

Previous discussions of data-driven decision making are consistent in proving that characteristics of data strongly shape the use of data in a decision context (Popovič et al. 2012 ; Provost and Fawcett 2013 ; Puklavec et al. 2018 ), although data quality literature captured the categorization of requirements regarding the data to be used and its adequate measurement very differently (Wang and Strong 1996 ; Pipino et al. 2002 ; Batini et al. 2009 ; Knauer et al. 2020 ). While some scholars distinguish data and information (Davenport and Prusak 1998 ; Zack 2007 ), in this paper we do not interpret the differences, and we use data interchangeably with information in line with prominent studies on data and information quality (Wang 1998 ; Pipino et al. 2002 ). The diffuse set of quality attributes was investigated in the BI context to find evidence of how data quality is influenced by the analytical decision making orientation of the company (Kulkarni et al. 2017 ), how data quality impacts its use moderated by the decision making culture (Popovič et al. 2012 ), and how data quality and data use support higher perceived decision quality (Visinescu et al. 2017 ).

Access to the same set of high-quality data diminishes the uneven distribution of information, and this decrease in information asymmetry directly influences decision making, as has been repeatedly evidenced in inter-organizational settings (Afzal et al. 2009 ; Mandal and Jain 2021 ; Ahearne et al. 2022 ). Furthermore, information asymmetry might also arise from a lack of means through which organizational members process information and communicate it (Bergh et al. 2019 ), a problem that calls for enhancing information processing capacity at the firm level. Consequently, mechanisms suggested for resolving information asymmetry and supporting managerial decision making often rely on the implementation of various information systems (Saam 2007 ).

2.3 Centralization and information systems

Organizations centralize or decentralize decision making to varying degrees, and this is not independent of the technology that processes the data and supports the decisions (Robey and Boudreau 1999 ). Theorizing about whether and how these data-processing technologies induce organizational changes started with the deterministic approach of the technological imperative. It assumes that IT investments change the information processing capabilities of the firm, thereby determining the optimally feasible formal decision making structure (Jasperson et al. 2002 ). In this first stream of research, studies made divergent observations about the computerization of data and information processing. Arguments for recentralizing previously delegated decision making power in top management were published (Leavitt and Whisler 1958 ), along with surveys reporting that department-wide computer use is linked to decentralization (George and King 1991 ). With the proliferation of ERP systems, researchers demonstrated how such packaged software required the standardization of business processes coupled with decentralization: lower-level managers gained the ability to make decisions that they had been unable to make before the implementation of ERP (Järvenpää 2007 ; Doherty et al. 2010 ). Meanwhile, ERP systems also reduced information asymmetry and shifted the information advantage to higher management, allowing them to closely monitor inputs and outputs (Bloom et al. 2014 ). The effect of this was twofold: it allowed high-level managers to take decisions themselves as their information disadvantage was diminished, but it also reduced the risk of delegation, which gave managers a chance to relieve their overload.

Overcoming the above discussion on how IT constraints organizational changes, the organizational imperative (social determinism) approach proposed a reversed direction, arguing that companies are likely to choose technology that reinforces the existing power structures of the organization (George and King 1991 ; Robey and Boudreau 1999 ). Research has shown that top managers are prone to use IT to reduce the size of middle management when both IT and organizational decisions are centralized. Where these decisions are decentralized, the number of middle managers increases with the implementation of IT (Pinsonneault and Kraemer 1993 ). Information systems will be easily implemented to the extent that their implications for power distribution are consistent with other sources of power (Markus and Pfeffer 1983 ).

The reasoning of OIPT underpins the argumentation of the organizational imperative research stream by emphasizing that IT is not an independent factor, but the result of an organization’s information processing needs and its managers’ decisions on how to satisfy those needs (Markus and Robey 1988 ). When organizational actors make choices about IT, to meet information processing needs, they are guided by the characteristics of the technologies available. These perceived and actual properties of a technology, also referred to as materiality (Leonardi et al. 2012 ) or affordance (Davis and Chouinard 2016 ) in the literature, determine how it can be used and how it will shape the organization. In the perspective of the technological imperative, these properties project the trajectory of organizational structural change. However, in the perspective of the organizational imperative, these attributes would be the ones that managers or the particular group of decision makers consider when trying to choose the technology that best serves their interests.

With the advent of digital technologies, new properties have come to the foreground compared to previous organizational IT, such as the homogenization of data, reprogrammability, replicability, and its self-referential nature (Yoo 2012 ). In contrast to transactional systems that have been prevalent since the 1990s, today’s ERP systems and other core operations technologies tend to be characterized by strong standardization and reliability, less customization, or even the adoption of software as a service (SaaS) (Sebastian et al. 2017 ). The SaaS deployment mode results in a greater degree of standardization, as further adjustments of these off-the-shelf cloud solutions are expensive and difficult. This not only reduces the degree of freedom in system design, but also shifts the focus from local and tailor-made information production to organization-wide standard information consumption. This prefabricated information can benefit the central management by increasing their power (Carlsson-Wall et al. 2022 ). Digital technologies, on the other hand, are increasingly plug-and-play, which means that they allow for experimentation, rapid deployment, and quick replacement (Sebastian et al. 2017 ). This means that organizational decision makers of IT deployments, be they senior management or local leaders of bottom-up digital initiatives, have greater flexibility in choosing the right technology.

The variety of BA software solutions available and the constant renewal of the offerings suggest that the properties of emerging BA technologies are not predetermined, but depend on the choices of the organizational actors deciding on their adoption: the organization’s information processing needs and managers’ decisions about how to meet them are what drive BA. The proliferation of it not only has interesting behavioral aspects (e.g., changing managerial decision making habits, processes), but the organizational consequences are considerable (Lismont et al. 2017 ). As lower-level managers can more easily translate the data into actionable insights than the top management, the use of analytics might reproduce information asymmetry, a force in the direction of decentralization (Wohlstetter et al. 2008 ). At the same time, BA allows higher levels of management access to more comprehensive data, that is, the information asymmetry is reduced. In line with this, predictive analytics has been observed to be associated with increased top management control of data gathering and less delegation of decision making power to local managers (Labro et al. 2022 ).

2.4 Hypotheses development

Similarly to all other types of organizational phenomena, the factors influencing analytical culture and decision making are numerous. Grounding our research on lenses of information processing and information asymmetry as discussed by OIPT and agency theory, we postulate successive effects of selected key constructs resulting in a multiple mediator model (Fig.  1 ).

Fig. 1 Research model

Establishing and maintaining adequate information structures and information processing capability is argued to be among the major tasks of organizations (Daft and Lengel 1986 ). In a recent research, BA proved to be an effective tool for strengthening the information processing capability of organizations (Cao et al. 2015 ). At the same time, the literature in the context of BA argued that top management strongly influences changes in organizational information processing (Cruz-Jesus et al. 2018 ; Maroufkhani et al. 2020 ). More broadly, the IS literature considers top management support as highly important in implementation (Sharma and Yetton 2003 ) and adoption of new tools and emerging technologies (Khayer et al. 2020 ). Despite its essential role, the interpretation and measurement of top management support lack clarity and consistency. In their discussion, Kulkarni et al. ( 2017 ) differentiate the concepts of top management involvement (attitudinal intervention) and participation (behavioral intervention), both of them covered by the umbrella term top management support. In line with this view, where attitudes and supportive behaviors jointly shape the construct of top management support (Maroufkhani et al. 2020 ), in this paper we interpret top management support of BA as the extent to which a firm’s top management considers building and maintaining BA capability as strategically important (Kulkarni et al. 2017 ).

Although top management support is a crucial driving force behind the use of analytics (Chen et al. 2015 ), we cannot expect that the mere use of analytical tools alters managerial information processing. BA’s positive effect on information processing capabilities is expressed through the mediation of a data-driven environment that is understood as a set of organizational practices for the development of strategy, policies, and processes that ensure the embeddedness of BA (Cao et al. 2015 ). This supportive environment was also denoted as analytical decision making culture to emphasize the focus on decision making support (Popovič et al. 2012 ). The supporting culture is strongly shaped by top management, as it serves as a driver for altering corporate values, norms, and cultures, making it possible for other organization members to adopt the new analytics technologies (Chen et al. 2015 ). These findings suggest that the support of BA by top management strengthens the analytical decision making culture. On this basis, we formulate the following hypothesis.

H1a: Top management support of business analytics is positively associated with an analytical decision making culture.

Various technologies applied by the organization, each of them implemented with different purposes, shape the information structure of the organization. Organizations design BA to improve further aspects of data quality defined here as the accuracy, relevancy, representation, and accessibility of the data (in line with Wang and Strong 1996 ). By mitigating the problem of data reach, BA is expected to reduce uncertainty and by reducing the lack of clarity in data it is able to cope with equivocality (Kowalczyk and Buxmann 2014 ). As a result of its supportive attitude and behavior, top management can rightfully expect these improvements in data quality, which, in fact, was supported by previous findings about the link between top management support and detail, relevance, reliability, and timeliness of information (Kulkarni et al. 2017 ). Thus, we argue that the positive attitude of top management and its supportive behavior toward BA is a precursor of improved perceptions of data quality.

H1b: Top management support of business analytics is positively associated with data quality as perceived by the top management.

However, investing in data and analytical tools is not sufficient to cultivate a data-driven culture (Rikhardsson and Yigitbasioglu 2018 ); rather, the whole set of applied assets, tools, and techniques powerfully shapes social practice at the organizational level, i.e., the decision making culture (Cabantous and Gond 2011 ). As enterprise data is inimitable and therefore the most valuable strategic asset of companies (Djerdjouri and Mehailia 2017 ), improvements in the quality of this strategic asset will strengthen the orientation toward using information to make decisions (Popovič et al. 2012 ). Thus, we assume that data quality is positively associated with analytical culture, and that the above-mentioned catalyst effect of top management support on analytical culture is explained mainly by improved data quality. The proposed mediator effect of perceived data quality calls for two related hypotheses.

H1c: Perceived data quality is positively associated with analytical decision making culture.

H1d: Perceived data quality mediates between top management support and analytical decision making culture.

Organizational culture significantly impacts information processing (Stoica et al. 2004 ) by constraining or enabling individual behavior. Similarly, decision making culture shapes individual decision making processes, whereas an analytical culture helps individual actors to make rational decisions (Cabantous and Gond 2011 ), as decision makers are encouraged to systematically use and analyze data for their decision making tasks (Kulkarni et al. 2017 ). The positive and direct association between a data-driven environment and data-driven decision making demonstrated earlier (Cao et al. 2015 ) has been reasoned by Davenport et al. ( 2001 , p. 127) emphasizing that "without solid values underlying analytic efforts, the already difficult-to-sustain behaviors … are easily neglected”. Based on these arguments, we postulate that organizational level attitudes to the use of data will support its individual level use in a decision context.

H2a: Analytical decision making culture is positively associated with data-driven decision making.

Anecdotal evidence in the scholarly and managerial literature suggests that BA eliminates the informational advantage of local or functional managers, and there is a tendency toward centralization of data use supported by central units dedicated to business analysis (Sharma et al. 2014 ). As the results generated by BA are typically more objective, measurable, and transmittable to headquarters (Labro et al. 2022 ), BA enhances the information processing capability: top management is no longer compelled to delegate decisions to avoid information overload. This suggests a possible shift of decision making power from local to top management. Drawing on agency theory and developing our arguments based on top management’s desire to reduce information asymmetry, we also posit that this shift might be associated with limiting the availability of information produced by BA. This corresponds to the arguments that a shift in the distribution of decision rights is possible if top management controls and accesses the output of analytics (Labro et al. 2022 ). Consequently, we introduce the term centralization in data use to denote the reduced availability of analytical results for lower-level managers accompanied by the reduced delegation of decision making and suggest that the analytical culture supports the centralization of data use.

H2b: Analytical decision making culture is positively associated with centralization in data use.

Furthermore, increased information availability at higher organizational levels has a comforting effect on managers who acquire a sense of control and manageability (Kuvaas 2002 ). As access to information makes them more confident, it can be assumed that top managers will rely more on data to make their decisions and also convincingly serve as a role model across the organization (Maroufkhani et al. 2020 ). Moreover, given sufficient information, managers can make decisions without the involvement of lower levels, thus becoming the driving force behind data-driven decisions (George and King 1991 ). Thus, we assume that the centralization of data use is positively associated with data-driven decision making, and that the increased centralization of data use partly explains the catalyst effect of analytical culture discussed above. The proposed mediator effect of centralization calls for two related hypotheses.

H2c: Centralization of data use is positively associated with data-driven decision making.

H2d: Centralization in data use mediates between analytical decision making culture and data-driven decision making.

Prior research suggests that information processing and data quality are closely related. The data quality aspects of importance and usability are associated with the two aspects of information processing, perception and judgment, highlighting that different modes of information processing favor different sets of information (Blaylock and Rees 1984 ). Data accuracy and the amount of data have an impact on decision making performance, while the consistency of the data was argued to affect judgment time (Samitsch 2014 ). The link between data quality aspects and decision making quality aspects appears straightforward, as management decisions based on data analytics methodologies are only as good as the underlying data (Hazen et al. 2014 ). At the same time, we argue that this is a second-order link rather than a direct effect. This paper suggests a more direct link from data quality to data use in the decision context, without judging time, effectiveness, or other quality criteria for decision making. We propose this first-order link based on the idea that data first needs to be utilized in the decision process before any quality assessment of the process can be made. As suspicion regarding data quality often prevents managers from making any decisions (Redman 1998 ), in the following hypothesis we posit that good perceptions of data quality are required for utilizing the data during decision making.

H3a: Perceived data quality is positively related to data-driven decision making.

The possible adverse effect of poor data quality on its use in a decision context is reinforced by its negative consequences on the decision making culture. Data quality problems perceived by management will increase mistrust (Redman 1998 ) and undermine the positive attitude toward data use at the organizational level. Such damage to the supportive culture carries the risk of not being able to influence managerial behavior toward being data-driven in their decision making process. Therefore, we argue that the effect of data quality on data-driven decision making is mainly explained by the underlying decision making culture.

H3b: Analytical decision making culture mediates between perceived data quality and data-driven decision making.

In summary, the above hypotheses first examine the factors that influence data-driven decision making. Based on OIPT, we assume that data-driven decision making relies on the increase in information processing capacity enabled by BA technologies. As custodian of information processing needs, top management supports the establishment of an analytical decision making culture (H1a) and strives to improve data quality (H1b), which serves as a mediator and partially explains how top management support results in an analytical culture (H1d). We hypothesize that these factors, namely analytical decision making culture (H2a) and perceived data quality (H3a), are the prerequisites for using data in the decision process. The properties of BA suggest that it is likely to channel a wide range of data towards top management, an opportunity for top management to reduce information asymmetries. Arguing for the organizational imperative, it will be top management that exploits the analytical decision making culture to shape BA practices that centralize data use (H2b), on which data-driven decisions can be built (H2c, H2d). Figure 1 provides a visual summary of the hypothesized relationships between the constructs.
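To make the mediation logic behind H1d, H2d, and H3b concrete, the sketch below shows how a single indirect effect could be estimated and bootstrapped from construct scores. This is an illustration only: the construct scores are synthetic, the variable names (tms, dq, ac) are hypothetical, and the paper itself estimates all paths jointly with weighted PLS-SEM rather than with the simple regression-based product-of-coefficients approach shown here.

```python
import numpy as np

rng = np.random.default_rng(0)

# synthetic construct scores standing in for survey-based measures
n = 305
tms = rng.normal(size=n)                                    # top management support
dq = 0.5 * tms + rng.normal(scale=0.8, size=n)              # perceived data quality
ac = 0.4 * dq + 0.3 * tms + rng.normal(scale=0.8, size=n)   # analytical culture

def indirect_effect(x, m, y):
    """Product-of-coefficients estimate of the x -> m -> y indirect effect."""
    a = np.polyfit(x, m, 1)[0]                              # path x -> m
    X = np.column_stack([np.ones_like(x), x, m])
    b = np.linalg.lstsq(X, y, rcond=None)[0][2]             # path m -> y, controlling for x
    return a * b

def bootstrap_ci(x, m, y, reps=5000, alpha=0.05):
    """Percentile bootstrap confidence interval for the indirect effect."""
    estimates = []
    for _ in range(reps):
        idx = rng.integers(0, len(x), len(x))
        estimates.append(indirect_effect(x[idx], m[idx], y[idx]))
    return np.percentile(estimates, [100 * alpha / 2, 100 * (1 - alpha / 2)])

print("indirect effect (e.g., H1d-type path):", indirect_effect(tms, dq, ac))
print("95% bootstrap CI:", bootstrap_ci(tms, dq, ac))
```

A confidence interval that excludes zero would indicate a significant mediated path in this simplified setup; the full model in the paper tests all mediators simultaneously.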

3 Research method

3.1 Data collection and sample characteristics

The survey was conducted between March and April 2022, targeting medium-size and large companies registered in Hungary across all industries (covering all NACE Footnote 1 codes excluding 97–99). The size of the companies was primarily defined by their number of full-time employees (50–99 employees for smaller medium-sized companies, 100–249 employees for medium large, and 250 + employees for large companies). When implementing a stratified random sampling, the definitions of the strata were based on company size, industry, and region quotas provided by the Central Statistics Office of Hungary. Out of the total population of almost 6000 companies, 1369 were contacted by phone after a random selection by quota cell. In total, 306 companies responded to our survey questions, representing a response rate of 22.35%. The variables of the perception questionnaire were supplemented with the data of the balance sheet and income statement for the years 2017–2020, downloaded from the Electronic Reporting Portal database of the Ministry of Justice. Based on its incomplete financial data, one sample company was removed from the database, resulting in a final sample of 305 case companies. 53.5% of the companies in the sample are smaller medium-sized (50–99 employees), 29.7% medium-large (100–249 employees), and 16.8% large (250 + employees), reflecting the rates of the sampling frame. Table 1 contains the industry characteristics of the final sample.
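As an illustration of the quota-based selection described above, the following sketch draws a proportional stratified sample from a company register; the file name and column names are assumptions, not the actual sampling frame used in the study.

```python
import pandas as pd

# hypothetical sampling frame of roughly 6000 medium-sized and large companies
frame = pd.read_csv("company_register.csv")   # assumed columns: size_class, industry, region

# quota cells are defined jointly by company size, industry, and region
frame["stratum"] = (frame["size_class"].astype(str) + "|"
                    + frame["industry"].astype(str) + "|"
                    + frame["region"].astype(str))

# draw the same fraction from every cell so that about 1369 firms are contacted in total
contacted = (frame.groupby("stratum", group_keys=False)
                  .sample(frac=1369 / len(frame), random_state=42))
```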

The research was designed to investigate the perceptions of executives. Therefore, 125 top-level executives (CEO, Managing Director, Chief Executive Officer, President), 65 CXO-level executives (executives reporting to the top-level executive), and 115 strategic decision makers in other positions (e.g., board members) responded during the telephone inquiry in a total of 30–40 min. They typically have an economic qualification and 24.9 years of total work experience, as reported in Table 2 .

Taking top executives’ perceptions as a point of reference is justified not only by their most extensive decision making authority and practice, but by their crucial influence on organizational data flows. We recognize that different levels of senior managers may have divergent perceptions of the investigated organizational phenomena, but based on the results of the non-parametric Kruskal–Wallis test we could exclude that the three executive positions perceive the two main dependent constructs differently: data-driven decision making (H = 0.344 and p  = 0.844) and analytical culture (H = 5.705 and p  = 0.058). Consequently, we conclude that positions do not significantly affect the results at a significance level of p  = 0.05. However, studying organizational-level phenomena with a single respondent involves the risk of accelerated natural correlations between causes and outcomes (Van der Stede et al. 2005 ). To assess this common method bias, we executed the Harman’s Single Factor Test. The total variance extracted by one factor (41.5%) is below the recommended threshold of 50%, suggesting no issues with common method bias that may distort the data when the same measurement instrument is used for independent and dependent variables.
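The two robustness checks reported above can be reproduced along the following lines. This is a minimal sketch with assumed column names (a respondent position column and item columns prefixed by construct), and it uses a principal-component approximation of Harman's single-factor test; the study's exact implementation may differ.

```python
import pandas as pd
from scipy.stats import kruskal
from sklearn.decomposition import PCA

df = pd.read_csv("survey.csv")   # hypothetical file with respondent-level answers

# Kruskal-Wallis H-test: do the three executive groups rate a construct differently?
groups = [g["data_driven_dm"].dropna() for _, g in df.groupby("position")]
h_stat, p_val = kruskal(*groups)
print(f"H = {h_stat:.3f}, p = {p_val:.3f}")

# Harman's single-factor check: variance captured by one unrotated component
items = df.filter(regex="^(TMS|DQ|AC|CE|DM)_").dropna()
standardized = (items - items.mean()) / items.std()
share = PCA(n_components=1).fit(standardized).explained_variance_ratio_[0]
print(f"Variance extracted by a single factor: {share:.1%}")   # > 50% would signal concern
```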

Using empirical data exclusively from a single country, namely Hungary, requires that country conditions allow the collection of relevant empirical data. Relevance is established by a satisfactory level of development of companies in order to be able to respond adequately. In 2022, Hungary ranked 22nd out of the 27 EU member states in the Digital Economy and Society Index, but it progressed in line with the EU over the last five years (European Commission 2022 ). Similarly to the other components measured by the index, the country scores below the EU average in integrating digital technology into the activities of enterprises, although significant improvements have been achieved. This allows us to conclude that the Hungarian sample will be sufficiently heterogeneous with regard to the phenomena under consideration: there will be companies that are more developed, as well as those where managerial attention is lacking, and therefore they do not exploit the opportunities offered by digital technologies, such as advancements on BA.

3.2 Measurement properties

For operationalization of most of the constructs presented in Fig.  1 , multi-item scales were re-adopted from the BI literature, measured on a Likert-type scale from 1 to 5, as shown in Appendix I. Top management support is understood as the commitment of top-level management to BA measured by the 4-item of Kulkarni et al. ( 2017 ). The scale initially measuring top management support of BI initiatives could be readopted into the BA context without any additional change in the wording. Accordingly, the construct of top management support is operationalized here by the extent to which top-level managers consider BA as strategically important, the extent to which they sponsor related initiatives, their commitment to BA policy/guidelines, and their supportive behaviors expressed in hiring and retaining people with analytical skills.

Although researchers suggest extensive lists of items measuring different facets of perceived data quality (Batini et al. 2009 ), in our research setup, we strongly restricted the number of items to keep the survey length limited and balanced. Wang and Strong ( 1996 ) not only generated an extensive list of data quality attributes but also assessed the importance of these attributes as perceived by data consumers. Their hierarchical data quality framework resulted in four target categories of data accuracy, relevancy, representation, and accessibility. Based on Wang’s and Strong’s interpretation of these categories, we applied a four-item scale to assess data quality as perceived by top managers. Data accessibility is emphasized as a data quality attribute in this framework: data is available in the company’s information system or obtainable. Here, we suggest introducing a further aspect of data accessibility, namely the extent of managers with access rights, which is no longer a data quality attribute, but a system attribute, a question of top-level decision. This centralization of accessibility and a managerial desire toward concentrating decision making power in case sufficient data are available is denoted here as centralization in data use. We suggest this interpretation based on the idea of George and King ( 1991 ) about managerial action imperative. Consequently, this two-item scale merges two facets of supportive managerial behavior: a managerial action (decision about limited accessibility) and a managerial intent of possible centralization in the decision making process.

Analytical decision making culture is understood here as an attitude to use numerical information in decision making processes and is measured by the scale suggested and previously applied by Popovič et al. ( 2012 ). This construct covers the existence and awareness of the decision making process and the presence of policies about incorporating information in the process and general consideration of information during decision making. The latter item points in the direction of information use, but still measures an attitude instead of the actual extent of information use. Therefore, we propose to examine a data-driven decision making construct separately from the analytical culture and measure it by the degree to which top management relies on available data in the decision context. As decision making is a process rather than a single action (Rajagopalan et al. 1993 ), we need to delimit and operationalize the extent of data use in each step of the decision making process. Simon’s ( 2013 ) decision phases not only have a long tradition in management research, but are argued to fit the research context of recent data and decision-related studies (Chiheb et al. 2019 ; Liberatore and Wagner 2022 ). Thus, in our research setup, we distinguish between the four phases of intelligence, design, choice, and implementation and measure the extent of data use with two items in each of phases according to its content as described in a computerized decision support context by Turban et al. ( 2011 ).

The size of the company (measured by the number of employees) as a continuous variable and the industry as a binary variable were included as control variables in the model. Although the sample companies cover all sectors, 37% are manufacturers, as reported in Table 1. Therefore, we control for the effect of the manufacturing industry by including a dummy variable indicating whether the firm is active in the manufacturing industry.

The study uses partial least squares structural equation modeling (PLS-SEM), a method widely applied in European management research (Richter et al. 2016). Preliminary data analysis was executed in SPSS 27, and the PLS model was calculated using SmartPLS 4. Descriptive statistics reported in Appendix II show a low number of missing values per variable, ranging from 0 to 2%. Missing values were treated by mean value replacement, as suggested by Hair et al. (2021). Although the distribution of sample firms along different characteristics closely approximates the distributions found in the total population (see Table 1), to further improve the sample's fit to the quotas (by company size, industry, and region), sampling weights were calculated and the weighted PLS-SEM algorithm was run. The weight variable is close to one for all respondents (0.88–1.21); hence, non-weighted and weighted PLS estimates show little variation. The estimation of the PLS model presented in the following incorporates sample weighting to obtain unbiased estimates of population effects (Sarstedt et al. 2018).
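For readers who wish to replicate the preprocessing, the following minimal sketch (not the authors' code) illustrates mean value replacement and quota-based sampling weights; the file name, item prefixes, size classes, and population shares are hypothetical placeholders.

```python
# Minimal, hypothetical sketch of the preprocessing described above.
import pandas as pd

df = pd.read_csv("survey_305.csv")  # hypothetical export of the 305 responses

# 1) Mean value replacement: missingness was only 0-2% per indicator
item_cols = [c for c in df.columns if c.startswith(("TM_", "DQ_", "AC_", "CE_", "DM_"))]
df[item_cols] = df[item_cols].fillna(df[item_cols].mean())

# 2) Quota-based weight: population share / sample share per size class
pop_share = {"small": 0.55, "medium": 0.30, "large": 0.15}       # hypothetical shares
sample_share = df["size_class"].value_counts(normalize=True)     # observed shares
df["weight"] = df["size_class"].map(pop_share) / df["size_class"].map(sample_share)
```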

The survey also meets the minimum sample size requirements of conservative methods. At a significance level of 5% and with the lowest statistically significant path coefficient value of 0.2, the inverse square root method (Kock and Hadaya 2018 ) requires a minimum sample size of 155, which is considerably exceeded by the final sample size of 305.
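For illustration, the minimum sample size implied by the inverse square root method can be reproduced in a few lines; the constant 2.486 corresponds to a 5% significance level as proposed by Kock and Hadaya (2018).

```python
# Inverse square root method: n_min > (2.486 / |p_min|)^2 at the 5% level.
import math

p_min = 0.20                                   # smallest significant path coefficient
n_min = math.ceil((2.486 / abs(p_min)) ** 2)   # = 155
print(n_min, "<=", 305)                        # minimum requirement vs. realized sample
```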

4.1 Measurement model assessment

The reflective measurement model was assessed using reliability and validity measures, as shown in Table 3. We obtained high outer loadings, implying that the indicators measure the same phenomenon, except for one item of the construct centralization in data use (CE_1), for which an outer loading (0.694) just below the established threshold of 0.7 was obtained. In the case of newly developed scales, Hulland (1999) suggested not automatically eliminating indicators with somewhat weaker outer loadings but carefully investigating them. Having a solid theoretical rationale for including this variable and acceptable values for construct-level reliability, we keep it in the model. The value of Cronbach's alpha (0.604) is above the acceptable threshold recommended for exploratory research (Hair et al. 2021). Moreover, Cronbach's alpha is sensitive to the number of items and assumes equal outer loadings on the construct, which limits its usefulness as a measure of internal consistency reliability in this case. The composite reliability values rho_a and rho_c, prioritized by the PLS-SEM algorithm and reported above 0.8, show adequate reliability at the construct level for each construct. Convergent validity assessed by the average variance extracted (AVE) is clearly above the threshold of 0.5, which indicates that the items constituting each construct share a high proportion of variance.
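For transparency, the construct-level statistics discussed above can be computed directly from standardized outer loadings; the sketch below is illustrative only (not the authors' code), and the loadings are hypothetical placeholders rather than values from Table 3.

```python
# Illustrative computation of composite reliability (rho_c) and AVE from loadings.
import numpy as np

def composite_reliability(loadings):
    """rho_c = (sum of loadings)^2 / ((sum of loadings)^2 + sum of error variances)."""
    l = np.asarray(loadings)
    return l.sum() ** 2 / (l.sum() ** 2 + (1 - l ** 2).sum())

def average_variance_extracted(loadings):
    """AVE = mean of the squared standardized loadings."""
    l = np.asarray(loadings)
    return float((l ** 2).mean())

loadings = [0.82, 0.88, 0.79]   # hypothetical loadings of a three-item construct
print(round(composite_reliability(loadings), 3))        # about 0.87, above 0.8
print(round(average_variance_extracted(loadings), 3))   # about 0.69, above 0.5
```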

As the square root of each construct's AVE is higher than its highest correlation with any other construct, the Fornell-Larcker criterion establishes discriminant validity. The heterotrait-monotrait ratio (HTMT), a newer criterion for assessing discriminant validity, also supports it (Henseler et al. 2015), as the correlation ratios are far below the conservative threshold of 0.85 (see Table 4). The values of the variance inflation factor (VIF), clearly below 3 (below 1.303 for the predictors of (3) AC and below 1.514 for the predictors of (5) DM), ensure that collinearity among the predictor constructs has no substantial effect on the model estimation. Overall, the assessment results point to an appropriate reflective measurement model.
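A compact way to see how the Fornell-Larcker criterion works is shown in the sketch below; the AVE values and the construct correlation matrix are hypothetical placeholders, not the values reported in Table 4.

```python
# Fornell-Larcker check: sqrt(AVE) of each construct must exceed its correlations
# with every other construct. All values below are hypothetical.
import numpy as np

constructs = ["TM", "DQ", "AC", "CE", "DM"]
ave = np.array([0.70, 0.65, 0.68, 0.62, 0.60])
corr = np.array([
    [1.00, 0.55, 0.45, 0.20, 0.40],
    [0.55, 1.00, 0.50, 0.15, 0.35],
    [0.45, 0.50, 1.00, 0.30, 0.50],
    [0.20, 0.15, 0.30, 1.00, 0.10],
    [0.40, 0.35, 0.50, 0.10, 1.00],
])

sqrt_ave = np.sqrt(ave)
off_diag = corr - np.eye(len(constructs))          # zero out the diagonal
passes = sqrt_ave > np.abs(off_diag).max(axis=1)   # criterion per construct
print(dict(zip(constructs, passes)))               # True for each construct here
```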

4.2 Structural model assessment

For the estimation of the structural model, we selected the path weighting scheme, and bootstrapping was executed with 10,000 samples based on bias-corrected and accelerated (BCa) bootstrap confidence intervals, a procedure suggested by Henseler et al. (2009) to overcome the shortcomings of other methods. Table 5 summarizes the results of the significance tests of the individual path coefficients, interpreted as standardized coefficients of ordinary least squares regressions. All direct effects are significant, with one exception: centralization in data use does not impact data-driven decision making.
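To make the resampling logic concrete, the sketch below bootstraps a single standardized coefficient on synthetic data; it uses a simple percentile interval for brevity, whereas the reported results rely on SmartPLS 4 with BCa intervals, so this is an illustration rather than a reproduction.

```python
# Bootstrap of one structural path on synthetic construct scores (illustrative only).
import numpy as np

rng = np.random.default_rng(42)
n = 305
ac = rng.normal(size=n)                              # synthetic "analytical culture" scores
dm = 0.45 * ac + rng.normal(scale=0.9, size=n)       # synthetic "data-driven decision making"

def std_beta(x, y):
    x = (x - x.mean()) / x.std()
    y = (y - y.mean()) / y.std()
    return np.polyfit(x, y, 1)[0]                    # slope of the standardized regression

boot = np.empty(10_000)
for b in range(boot.size):
    idx = rng.integers(0, n, n)                      # resample cases with replacement
    boot[b] = std_beta(ac[idx], dm[idx])

lo, hi = np.percentile(boot, [2.5, 97.5])
print(f"beta = {std_beta(ac, dm):.3f}, 95% CI = [{lo:.3f}, {hi:.3f}]")
```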

The values of the significant path coefficients (b) indicate a varying degree of association. Based on Cohen's (2013) suggestion that b-values of 0.5, 0.3, and 0.15 signify strong, medium, and weak effects, we can conclude that top management support is strongly related to perceived data quality, which in turn impacts analytical decision making culture, supporting H1b and H1c. Additionally, top management support is associated with analytical culture at the medium level, supporting H1a. Analytical culture moderately affects centralization in data use (H2b) and strongly influences data-driven decision making (H2a). While perceived data quality is weakly but significantly associated with data-driven decision making (H3a), centralization in data use does not contribute to the use of data in decision making. Therefore, the model does not support H2c.

Individual mediating effects of data quality, analytical culture, and centralization in data use were hypothesized and tested. Specific indirect effects through mediators are quantified by multiplying the direct effects reported in Table 5. The significance test for specific indirect effects (see Table 6) supports the mediating role of perceived data quality; thus, we can accept H1d. Similarly, analytical decision making culture underlies the relationship between perceived data quality and data-driven decision making; thus, the model also confirms H3b. As both the indirect and direct effects are significant and point in the same direction, both mediations are complementary mediating relationships. H2d, suggesting a mediating role of centralization in data use, is not supported by the data. Although the direct effect of analytical culture on data-driven decision making is significant, the indirect effect is not, indicating a situation of direct-only non-mediation. Total effects calculated as the sum of direct and indirect effects (see Table 7) suggest strong relationships between the key target constructs (3) AC and (5) DM and the predictor constructs, apart from the predictor role of centralization in data use (see Table 6).
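The arithmetic of this mediation analysis is simple; the sketch below shows how a specific indirect effect and a total effect are derived from direct paths, using hypothetical coefficients rather than those reported in Table 5.

```python
# Specific indirect and total effects from hypothetical direct path coefficients.
b_tm_dq = 0.52      # top management support -> perceived data quality (hypothetical)
b_dq_ac = 0.35      # perceived data quality -> analytical culture     (hypothetical)
b_tm_ac = 0.30      # top management support -> analytical culture     (hypothetical)

indirect_tm_ac = b_tm_dq * b_dq_ac          # specific indirect effect through data quality
total_tm_ac = b_tm_ac + indirect_tm_ac      # total effect = direct + indirect
print(round(indirect_tm_ac, 3), round(total_tm_ac, 3))   # 0.182 0.482
```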

To rule out the confounding effects of company size and industry, we added two control variables to the model. Company size, measured by the number of employees, and the binary variable indicating manufacturing activity were included and linked to each endogenous construct (model 2). As the path coefficients of the hypothesized relationships in model 2 are very close to those in the original model without controls, we can rule out a confounding effect of company size and industry (see Table 7).

The model's in-sample predictive power, measured by the coefficient of determination (R²), is rated moderate for the key target constructs analytical decision making culture (0.373) and data-driven decision making (0.35), assessed against prior classifications of magnitude (Chin 1998). This suggests that the model fits the data at hand. The out-of-sample predictive power of the model was evaluated with Q² statistics obtained from the PLSpredict procedure (Shmueli et al. 2016), where the tenfold cross-validation was repeated r = 10 times. The Q² predict values of each indicator measuring the key target constructs analytical decision making culture and data-driven decision making are above zero (see Table 8), indicating that the model meets the minimum criteria. We then calculated the differences between the predicted and observed values for the PLS-SEM model and for the linear regression model (LM), using the root mean square error (RMSE) as the prediction statistic. Drawing on the idea that PLS-SEM based predictions should outperform the LM (Shmueli et al. 2016), the RMSE values of the LM and the PLS-SEM model are compared in Table 8. As most, but not all, of the indicators in the PLS model yield smaller prediction errors than the LM, the model has medium predictive power.
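The RMSE benchmark underlying Table 8 reduces to comparing prediction errors per indicator; the sketch below illustrates the comparison with hypothetical prediction vectors.

```python
# Per-indicator RMSE comparison of PLS and linear-model predictions (hypothetical data).
import numpy as np

def rmse(actual, predicted):
    return float(np.sqrt(np.mean((np.asarray(actual) - np.asarray(predicted)) ** 2)))

y_true   = np.array([4, 3, 5, 2, 4, 3])                 # observed indicator values
pls_pred = np.array([3.8, 3.2, 4.6, 2.4, 3.9, 3.1])     # hypothetical PLS predictions
lm_pred  = np.array([3.6, 3.5, 4.3, 2.7, 3.7, 3.4])     # hypothetical LM predictions

print(rmse(y_true, pls_pred) < rmse(y_true, lm_pred))   # True -> PLS outperforms LM here
```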

As shown in Table 8, the PLS-SEM RMSE value exceeds that of the linear model for one indicator (DM_6). Hair et al. (2021) suggest exploring potential explanations for the low predictive power of such indicators. As this indicator has the highest outer loading (0.826) among all indicators associated with the construct data-driven decision making, we can exclude problems arising from the measurement model. Data issues can be excluded as well, as the indicator is characterized by an unremarkable standard deviation (0.8092) and a non-extreme distribution (skewness −0.973, kurtosis 1.055). Overall, the structural model estimates suggest that the model is not confounded by a third variable and that it has moderate in-sample and out-of-sample predictive power. Figure 2 provides a visual summary of the results concerning the relationships between the constructs.

Fig. 2 Effect sizes of direct and specific indirect effects

5 Discussion

This study aimed to reveal insights on building an analytical decision making culture, as well as the driving forces of data-driven decision making, arguing that top management support for BA is a crucial foundation of the related changes. The findings of the structural model confirm the claim in the literature that top management support is positively associated with analytical decision making culture (Popovič et al. 2012; Chen et al. 2015) in the BA context. Prior evidence that analytical culture is the greatest obstacle, and thus the greatest challenge, in benefiting from BA (LaValle et al. 2011; McAfee and Brynjolfsson 2012) draws attention to the importance of the finding that top management's efforts are not in vain: they can promote an analytical decision making culture. Furthermore, the results confirm the importance of perceived data quality, as evidenced earlier in the BI context (Kulkarni et al. 2017). However, improved perceptions of data quality have been considered both an antecedent and a consequence of culture in the prior literature. While Kulkarni et al. (2017) proposed that analytical culture improves data quality perceptions, our research aimed to argue for and reveal a positive effect in the opposite direction. The mismatch can be attributed to the fact that previous research constructed a model to explain BI capabilities rather than decision making. Quality aspects, denoted by Kulkarni et al. (2017) as information capabilities, were considered part of the BI capability, understood as a firm's ability to provide high-quality information and systems to support decision makers. In our model, which focuses on the driving forces of data-driven decision making, managerial perceptions of data quality represent a precursor to analytical decision making culture. The strong effect reveals that top management support builds an analytical culture, and the related improvements in managerial perceptions of data quality partially explain this effect. This partial mediation effect is explained by the fact that attitudes towards data use are expected to improve if the quality of data content is considered good (Popovič et al. 2012).

We also found support for a significant direct effect of data quality on the nature of decision making. Although this effect is weak, the medium-strong total effect indicates the importance of data quality in decision making. Here, we measure and evaluate management perceptions of data quality. Data quality information, i.e., metadata that objectively describe data quality, is rarely found in organizational information systems, although it clearly impacts decision making (Chengalur-Smith et al. 1999). If data quality information is not provided, decision makers “develop a feel for the nuances and eccentricities of the data used” (Fisher et al. 2003, p. 170) and decide about the use of data based on their subjective judgment, making managerial perceptions of data quality crucial.

With high values for both the direct and indirect effects, we found support for analytical decision making culture positively influencing data-driven decision making. This highlights the importance of a supportive social practice at the organizational level in guiding individual behavior (Stoica et al. 2004) and supports the claim that utilizing analytical capabilities requires a change in corporate mentality to view data and information as vital organizational assets (Galbraith 2014). The literature is consistent in that analytical culture is difficult to create (Davenport and Bean 2018), but once established, it is a competitive force that supports organizational performance (Karaboga et al. 2022). At the same time, the most common objectives of firms are related not directly to performance but to better decisions through advanced analytics (Davenport and Bean 2018), suggesting a direct link to data-driven decision making. The results supported this claim and verified our OIPT-based argument that analytical culture helps managers to make information processing consistently fact-based.

We argued that this effect of analytical culture on decision making can be partially explained by a shift in the power balance, described as the centralization of data use. Although the results did not verify the existence of this mediation, we found that analytical culture is associated with the centralization of data use, interpreted as the limited availability of analytical results accompanied by a reduced delegation of decision making. BA tools and techniques frame the culture of evidence-based decision making, which does not leave existing structures, roles, and processes untouched (Ross et al. 2013). Although the broad organizational impact of analytical culture is undeniable, the direction of the changes is not apparent at all. Ross et al. (2013) argue that data analytics equips lower-level managers with the data that helps them make decisions locally, suggesting a wide availability of data along with a potential for enhanced delegation of operative decision making. If so, this would keep the information advantage at the local level, maintaining the information asymmetries. At the same time, predictive analytics is able to decrease the local information advantage because, unlike traditional local information, the results of BA are less subjective and require less local expertise to interpret (Labro et al. 2022). By reducing information asymmetry, BA can alter the existing power balance of the organization. A similar shift in information distribution has also been shown in the context of the use of predictive analytics in planning. The benefits of participative budgeting in the case of high information asymmetry have long been argued in the accounting literature (Heinle et al. 2014). When employing predictive analytics in driver-based planning, which involves the systematic utilization of company data to investigate and verify causal relationships, data are capable of partially replacing local expertise (Valjanow et al. 2019). This not only counterbalances local information advantages by drawing on more objective methods in the planning process but also limits information availability by involving a relatively narrow group of analysts and executives in centralized, driver-based planning.

Our results show that analytical culture has the potential to alter the centralization of data use: it eliminates the information disadvantage at the top level and allows decisions not to be delegated. As an unexpected result, this is not associated with a more robust integration of data use into the decision making process. The availability of a large amount of information has been argued to be ambiguous. On the one hand, much information available to managers was reported to increase feelings of satisfaction (O'Reilly 1980) and to improve perceptions about the level of control and manageability (Kuvaas 2002), and accessibility was evidenced to predict the frequency of data use in the decision context (O'Reilly 1982). However, studies rooted in OIPT draw attention to the possibility of information overload (O'Reilly 1980): organizational information processing capacity can be improved, but individual information processing remains limited by nature (Simon 1978); therefore, patterns of individual decision making cannot be expected to be altered. Uncertainty, originally defined as the difference between the amount of information available and the amount of information required (Galbraith 1973), no longer arises from the lack of information but from its oversupply, resulting in information fatigue (Buchanan and Kock 2001). Therefore, reducing information asymmetry by making increasingly more data available to top management cannot be expected to alter managerial decision making. Rejecting the mediation effect of centralization in data use suggests that top management's efforts should be directed towards establishing and maintaining a strong analytical culture and eliminating possible negative effects of information asymmetry by using other tools of alignment in decision making, such as incentives (Prendergast 2002). This is supported by the findings of Labro et al. (2022), who showed that the application of predictive analytics is related to more precise goals and stronger ties between employee rewards and measured performance.

We based our chain of thought on the argument that managers have unprecedented freedom to choose between technologies supporting BA, in contrast to robust backbone systems. This has two significant consequences. First, as the properties of BA result from the choices of senior managers, it is more likely that they can choose the solutions that best serve their information needs and, even more importantly, their actual interests. Second, as the chosen solutions exhibit minimal localization potential, other users are confronted with all of their constraints (Leonardi and Barley 2008). This concept of planned change, which refers to alterations conceived, directed, and managed by management, is analogous to the organizational imperative. Critics of the technological and organizational imperatives point out that it makes no difference whether the technology or the organization is the dependent factor; both are contingency approaches assuming that outcomes can be explained by a combination of known determinants (Markus and Robey 1988; Jasperson et al. 2002). Instead, the introduction of IT is associated with more complex social processes, as well as unanticipated and unintended organizational effects (Robey et al. 2013). The literature occasionally refers to this as the emergent perspective, which sees organizational IT implementations as catalysts of the chain of causes and effects that creates the actual use of technology, as well as the organizational outcomes (Markus and Robey 1988; Orlikowski 1992; Pinsonneault and Kraemer 1993; Jasperson et al. 2002; Bailey et al. 2022). The complexity arises from the reciprocal and cyclical nature of changes, where the social system poses informational requirements to the IT system, while the corresponding IT system has organizational requirements (Lee 2010). We argued that these information requirements are articulated mainly by top management in the form of supporting or not supporting BA initiatives, and that this, in turn, shapes the social artifacts of the organization, namely the decision making culture and the decision making process.

The socio-technical systems theorists, again aiming to overcome the dilemma of technological or organizational determinism, emphasize a system approach in which continuously interacting elements make up the system (Robey et al. 2013), here the information system comprising technology, data, and organization (Lee 2010). In their view, IS research must comprehend how to make changes in these elements to reach desired ends, where the term desired is determined by the subjective values of the key organizational actors, namely the managers (Lee 2010). Not only is this intention very close to that of the organizational imperative approach, in which strategic choice is emphasized, but socio-technical studies also suggest that IT can be adapted with various intentions and, accordingly, it will have different implications for the organization (Zuboff 1988).

6 Conclusions and limitations

BA, the fastest growing area within the domain of BI&A, is accompanied by a number of organizational changes inducing new fields of research. A stream of research with a clear focus on technology investigates emerging novel solutions, their implementation, and their use in organizations. Another stream with a managerial focus studies changes in the information environment, culture, organization, and processes, such as decision making. This paper, positioned in the latter stream, emphasizes the importance of building an analytical culture to foster data use in decision making, a prerequisite for a company to benefit from BA. An analytical culture interpreted as social practice (Cabantous and Gond 2011) or corporate mindset (Galbraith 2014) is promoted as a precondition of the rational model of decision making, a position supported by this paper as well. While promoting the rational model relying on data instead of intuition or prior experience, we must acknowledge that it faces challenges arising from limited human cognition (Simon 1990). Claims about how an analytical culture framed by the use of BA tools and techniques can support a more intense reliance on data seem to fail to take this limitation into account, implicitly assuming that BA is able to overcome it, at least partially.

As a key finding of this research, we showed that the deliberate focus of top management on business analytics can build a culture of analytical decision making, and that this alters managerial information processing, namely it enhances reliance on data in each phase of the decision making process. Moreover, we argued that the existing information asymmetry within the organization changes due to the use of BA, and this can lead to a shift of authority. Based on the results of the mediator model, we found support for this argument and evidenced that the analytical culture promotes the centralization of data use, but does not further strengthen the data-driven nature of decision making at the top level. With these key findings, the paper contributes to the academic literature in several ways.

First, this research enriched the literature on managerial decision making through the OIPT lens, a theory aimed at identifying the factors that improve an organization's information processing capacity. Drawing on the idea that BA is a powerful means of increasing organizational information processing capabilities (Cao et al. 2015), we verified, in the context of BA, the influence of organizational factors previously validated in scattered studies. The special context of BA is of importance here, responding to the criticism that, in many cases, technology itself seems to be forgotten when analyzing the induced human behavior and organizational processes (Leonardi et al. 2012). Materiality matters (Orlikowski 2007); that is, we cannot neglect the fact that recent developments in BA have a distinctive nature, and therefore they call for new evidence to explain how they shape the information processes of organizations. In an effort to explain the effects arising from BA, we argued that the choice of top management is what drives the related changes. This choice refers to the features of technologies that are embedded in the organizational context and the internal social relationships in which people develop, implement, and use them (Bailey et al. 2022).

Second, we suggested and validated Simon's model as a measurement framework for the operationalization of the data-driven decision making process. Prior operationalizations of this concept are usually based on quantifying the opposing tendencies of relying more on intuition or more on data in managerial decision making (Covin et al. 2001; LaValle et al. 2011; Szukits 2022), occasionally extended by some context-specific dimensions (Bokrantz et al. 2020). Out of the many possible aspects of data-driven decision making (Colombari et al. 2023), our theoretical lens suggested keeping the focus on the dimension of information processing. The theorizing of information collection, analysis, and synthesis processes by OIPT (Tushman and Nadler 1978) resonates with Simon's (1978, 2013) process-oriented view of decision making. Therefore, our proposed scale (see Appendix I) measures data-driven decision making with eight variables that unfold the steps of Simon's (2013) four phases, from assessing the current situation to monitoring the implementation. The values reported for internal consistency and convergence in Table 3 are statistically convincing. This not only makes the scale suitable for future use but also confirms the scattered literature (Turban et al. 2011; Chiheb et al. 2019) that proposes to operationalize the use of data in decision making along the steps of rational decision making.

Third, our research joins the long-argued proposition about the continuing interplay of technologies and organizational design (Dibrell and Miller 2002; Sor 2004). It brings potential shifts in authority into the discussion, based on the diminished information disadvantage at the top. Agency theory recognizes the information gap between local and top management as negative, as the lower-level manager is assumed to use private information to make self-interested decisions, which is detrimental to the organization (Chia 1995). This underlying assumption of agency theory positions information asymmetry as an organizational phenomenon that should be reduced by different means (Rajan and Saouma 2006). We argued that BA is a powerful means to mitigate information asymmetry and evidenced that analytical culture is associated with a possible change in authority. In one of his last publications, Galbraith (2014), the father of OIPT, also argued that taking advantage of analytics capability shifts power in the organization. But instead of shifting power to top management, he posited a shift to the analytics experts who can analyze and read the data. This suggests that power in terms of information advantage and power in terms of decision making authority might develop differently. Analytical experts might gain an information advantage without broader authority, as the right to decide remains with top management. To fill this gap, a power shift from judgmental decision makers to digital decision makers is required (Galbraith 2014), devaluing the role of other decision making strategies in contemporary organizations.

Lastly, this paper contributed to theory not only by showing the revived relevance of information asymmetry and authority in the BA context but also by revealing its duality: analytical culture can alter the power balance, but this does not affect the rational decision making model. The extent to which decision making is data-driven at the top does not depend on a potential change in information distribution to the detriment of lower-level managers. This conclusion adds to prior agency theory research investigating the ambiguity of behavioral implications and arguing that more comprehensive data at the top do not necessarily establish the contextual experience that lower-level managers have (Brown-Liburd et al. 2015). The missing contextual experience, along with the feeling of information fatigue, does not support the use of available data in a decision context (Buchanan and Kock 2001).

As our main contribution to practice, we emphasize the role that top management has to play in achieving data-driven decision making. Resonating with the organizational imperative assumption, we draw attention to the amplifying influence of top management, as their choices have a significant impact on both perceived data quality and the analytical decision making culture. The literature extensively illustrates the significance of their commitment to the adoption of novel technology (Ross et al. 2013; Chen et al. 2015; Kulkarni et al. 2017; Cruz-Jesus et al. 2018). Our research has also shown that when senior managers support and choose BA technology that is suitable for their organization, they determine the trajectory of data access patterns within the organization. If additional information becomes available to lower levels of management, it is possible to delegate decisions to them (Järvenpää 2007; Wohlstetter et al. 2008), while more information at higher levels allows for tighter control of lower levels or even centralization of decisions (Sharma et al. 2014; Labro et al. 2022). Thus, top management needs to consciously consider to what extent the available solutions allow them to reduce or, occasionally, reproduce information asymmetry. Furthermore, depending on whether centralization of data use or greater empowerment is the aim of top management, they should look for the BA solution that best supports this aim (Robey and Boudreau 1999; Leonardi et al. 2012). Top management has a degree of freedom here, particularly because our research concludes that data-driven decision making can be achieved either way: it is clearly influenced by analytical decision making culture and perceived data quality, but not by the degree of centralization of data use.

When evaluating our results, it is important to keep in mind that this research did not inventory the concrete BA techniques and tools used by organizations, even though the paper claims that BA calls for new evidence. We justify this choice with three reasons. First, the underlying IT and statistical solutions are diverse and constantly growing, making a comprehensive survey difficult. Second, the mere number of techniques applied reveals little in itself. Third, the mere adoption of any tool or technique was argued not to transform decision making; rather, the relevance of a supporting culture is widely emphasized in studies (Popovič et al. 2012; Grublješič and Jaklič 2015; Kulkarni et al. 2017) and, to a lesser extent, in the context of BA (Cao et al. 2015).

Some other limitations of this research need to be acknowledged. First, by measuring data utilization in decision making, we do not explicitly address other information processing modes. By excluding the discussion of the possible dichotomy of intuitive and rational decision making (Sadler-Smith and Shefy 2004), the paper cannot draw conclusions about a shift of focus in information processing modes from a more intuitive to a less intuitive and more analytical one. Second, the data are based on single respondents, namely the senior managers of the surveyed companies. Relying on the opinion of the highest-paid persons (McAfee and Brynjolfsson 2012) neglects the opinion of lower-level managers and other employees, which can distort the results, as these groups were argued to express different perceptions of their organizations' data-driven decision making and culture (Maxwell et al. 2016). Although test statistics did not report a significant effect of the investigated positions on our results, CEOs, CXO-level executives, and other strategic decision makers cannot be treated as a homogeneous group, as CEOs differ from other top managers in many aspects (Kaplan and Sorensen 2021). A multi-respondent survey design is suggested to explore the potentially diverging perspectives of subordinates, different managerial levels, and the CEO. Third, the questionnaire survey was conducted among firms registered in Hungary, which could limit the generalizability of the results. We do not expect that country conditions, such as economic or political factors, impact the relationships hypothesized in the study. At the same time, the social and cultural values of decision makers could affect decision making itself (Forquer Gupta 2012). Differences have been reported not only between distant cultures (Calhoun et al. 2002) but even between neighboring countries of the Central-Eastern European region (Dabić et al. 2015) in this respect. Thus, the suggested measurement model has the potential to be extended to other countries and regions to exclude possible bias arising from cultural contingency.

Data availability

The datasets generated and analyzed during the current study are available from the corresponding author on reasonable request. The database is available in .sav and .csv format.

Code availability

SmartPLS 4.0.

NACE is the nomenclature for economic activities introduced by the European Commission for a standard delineating of industries, as shown in Table 1 .

Aben TAE, van der Valk W, Roehrich JK, Selviaridis K (2021) Managing information asymmetry in public–private relationships undergoing a digital transformation: The role of contractual and relational governance. Int J Oper Prod Manag 41(7):1145–1191. https://doi.org/10.1108/IJOPM-09-2020-0675


Afzal W, Roland D, Al-Squri MN (2009) Information asymmetry and product valuation: an exploratory study. J Inf Sci 35(2):192–203. https://doi.org/10.1177/0165551508097091

Ahearne M, Atefi Y, Lam SK, Pourmasoudi M (2022) The future of buyer–seller interactions: a conceptual framework and research agenda. J Acad Mark Sci 50(1):22–45. https://doi.org/10.1007/s11747-021-00803-0

Alaimo C, Kallinikos J (2022) Organizations decentered: data objects, technology and knowledge. Organ Sci 33(1):19–37. https://doi.org/10.1287/orsc.2021.1552

Appelbaum D, Kogan A, Vasarhelyi M, Yan Z (2017) Impact of business analytics and enterprise systems on managerial accounting. Int J Account Inf Syst 25:29–44. https://doi.org/10.1016/j.accinf.2017.03.003

Arnott D, Pervan G (2016) A critical analysis of decision support systems research revisited: the rise of design science. In: Willcocks LP, Sauer C, Lacity MC (eds) Enacting research methods in information systems: volume 3. Springer: Berlin, pp 43–103. https://doi.org/10.1007/978-3-319-29272-4_3

Aydiner AS, Tatoglu E, Bayraktar E, Zaim S, Delen D (2019) Business analytics and firm performance: the mediating role of business process performance. J Bus Res 96:228–237. https://doi.org/10.1016/j.jbusres.2018.11.028

Bailey DE, Faraj S, Hinds PJ, Leonardi PM, von Krogh G (2022) We are all theorists of technology now: a relational perspective on emerging technology and organizing. Organ Sci 33(1):1–18. https://doi.org/10.1287/orsc.2021.1562

Batini C, Cappiello C, Francalanci C, Maurino A (2009) Methodologies for data quality assessment and improvement. ACM Comput Surv 41(3):1–52. https://doi.org/10.1145/1541880.1541883

Bergh DD, Ketchen DJ, Orlandi I, Heugens PPMAR, Boyd BK (2019) Information asymmetry in management research: past accomplishments and future opportunities. J Manag 45(1):122–158. https://doi.org/10.1177/0149206318798026

Blaylock BK, Rees LP (1984) Cognitive style and the usefulness of information*. Decis Sci 15(1):74–91. https://doi.org/10.1111/j.1540-5915.1984.tb01197.x

Bloom N, Garicano L, Sadun R, Van Reenen J (2014) The distinct effects of information technology and communication technology on firm organization. Manage Sci 60(12):2859–2885. https://doi.org/10.1287/mnsc.2014.2013

Bokrantz J, Skoogh A, Berlin C, Wuest T, Stahre J (2020) Smart maintenance: an empirically grounded conceptualization. Int J Prod Econ 223:107534. https://doi.org/10.1016/j.ijpe.2019.107534

Brown-Liburd H, Issa H, Lombardi D (2015) Behavioral implications of big data’s impact on audit judgment and decision making and future research directions. Account Horiz 29(2):451–468. https://doi.org/10.2308/acch-51023

Buchanan J, Kock N (2001) Information overload: a decision making perspective. In: Köksalan M, Zionts S (eds) Multiple criteria decision making in the new millennium. Springer, Berlin, pp 49–58. https://doi.org/10.1007/978-3-642-56680-6_4

Cabantous L, Gond J-P (2011) Rational decision making as performative praxis: explaining rationality’s Éternel retour. Organ Sci 22(3):573–586. https://doi.org/10.1287/orsc.1100.0534

Calhoun KJ, Teng JTC, Cheon MJ (2002) Impact of national culture on information technology usage behaviour: an exploratory study of decision making in Korea and the USA. Behav Inf Technol 21(4):293–302. https://doi.org/10.1080/0144929021000013491

Cao G, Duan Y, Li G (2015) Linking business analytics to decision making effectiveness: a path model analysis. IEEE Trans Eng Manage 62(3):384–395. https://doi.org/10.1109/TEM.2015.2441875

Carlsson-Wall M, Goretzki L, Hofstedt J, Kraus K, Nilsson C-J (2022) Exploring the implications of cloud-based enterprise resource planning systems for public sector management accountants. Financ Account Manag 38(2):177–201. https://doi.org/10.1111/faam.12300

Chaudhuri S, Dayal U, Narasayya V (2011) An overview of business intelligence technology. Commun ACM 54(8):88–98. https://doi.org/10.1145/1978542.1978562

Chen DQ, Preston DS, Swink M (2015) How the use of big data analytics affects value creation in supply chain management. J Manag Inf Syst 32(4):4–39. https://doi.org/10.1080/07421222.2015.1138364

Chengalur-Smith IN, Ballou DP, Pazer HL (1999) The impact of data quality information on decision making: an exploratory analysis. IEEE Trans Knowl Data Eng 11(6):853–864. https://doi.org/10.1109/69.824597

Chia Y-M (1995) The interaction effect of information asymmetry and decentralization on managers’ job satisfaction: a research note. Hum Relations 48(6):609–624. https://doi.org/10.1177/001872679504800601

Chiheb F, Boumahdi F, Bouarfa H (2019) A new model for integrating big data into phases of decision-making process. Procedia Comput Sci 151:636–642. https://doi.org/10.1016/j.procs.2019.04.085

Chin WW (1998) The partial least squares approach to structural equation modeling. In: Marcoulides GA (ed), Modern methods for business research: vol. 295(2) (pp. 295–330). Psychology Press

Choo CW (1996) The knowing organization: How organizations use information to construct meaning, create knowledge and make decisions. Int J Inf Manage 16(5):329–340. https://doi.org/10.1016/0268-4012(96)00020-5

Cohen J (2013) Statistical power analysis for the behavioral sciences. Academic Press


Colombari R, Geuna A, Helper S, Martins R, Paolucci E, Ricci R, Seamans R (2023) The interplay between data-driven decision-making and digitalization: a firm-level survey of the Italian and US automotive industries. Int J Prod Econ 255:108718. https://doi.org/10.1016/j.ijpe.2022.108718

Côrte-Real N, Ruivo P, Oliveira T (2020) Leveraging internet of things and big data analytics initiatives in European and American firms: Is data quality a way to extract business value? Inf Manag 57(1):103141. https://doi.org/10.1016/j.im.2019.01.003

Covin JG, Slevin DP, Heeley MB (2001) Strategic decision making in an intuitive vs. technocratic mode: structural and environmental considerations. J Bus Res 52(1):51–67. https://doi.org/10.1016/S0148-2963(99)00080-6

Cruz-Jesus F, Oliveira T, Naranjo M (2018) Understanding the adoption of business analytics and intelligence. In: Rocha Á, Adeli H, Reis LP, Costanzo S (eds) Trends and advances in information systems and technologies (pp 1094–1103). Springer. https://doi.org/10.1007/978-3-319-77703-0_106

Dabić M, Tipurić D, Podrug N (2015) Cultural differences affecting decision-making style: a comparative study between four countries. J Bus Econ Manag 16(2):275–289. https://doi.org/10.3846/16111699.2013.859172

Daft RL, Lengel RH (1986) Organizational information requirements, media richness and structural design. Manage Sci 32(5):554–571

Davenport TH, Bean R (2018). Big companies are embracing analytics, but most still don’t have a data-driven culture

Davenport TH, Harris JG (2007) Competing on analytics: the new science of winning (1st edition). Harvard Business Review Press

Davenport TH, Harris JG, De Long DW, Jacobson AL (2001) Data to knowledge to results: building an analytic capability. Calif Manage Rev 43(2):117–138. https://doi.org/10.2307/41166078

Davenport TH, Prusak L (1998) Working knowledge: how organizations manage what they know. Harvard Business Press

Davis JL, Chouinard JB (2016) Theorizing affordances: from request to refuse. Bull Sci Technol Soc 36(4):241–248. https://doi.org/10.1177/0270467617714944

Dibrell CC, Miller TR (2002) Organization design: the continuing influence of information technology. Manag Decis 40(6):620–627. https://doi.org/10.1108/00251740210434016

Djerdjouri M, Mehailia A (2017). Adopting business analytics to leverage enterprise data assets. In: Benlamri R, Sparer M (eds) Leadership, innovation and entrepreneurship as driving forces of the global economy (pp 57–67). Springer. https://doi.org/10.1007/978-3-319-43434-6_5

Doherty NF, Champion D, Wang L (2010) An holistic approach to understanding the changing nature of organisational structure. Inf Technol People 23(2):116–135. https://doi.org/10.1108/09593841011052138

Duan L, Xiong Y (2015) Big data analytics and business analytics. J Manag Anal 2(1):1–21. https://doi.org/10.1080/23270012.2015.1020891

Dubey R, Gunasekaran A, Childe SJ, Fosso Wamba S, Roubaud D, Foropon C (2021) Empirical investigation of data analytics capability and organizational flexibility as complements to supply chain resilience. Int J Prod Res 59(1):110–128. https://doi.org/10.1080/00207543.2019.1582820

European Commission (2022) Digital economy and society index (DESI) 2022 | Shaping Europe’s digital future. https://digital-strategy.ec.europa.eu/en/library/digital-economy-and-society-index-desi-2022

Evans JSBT (2010) Intuition and reasoning: a dual-process perspective. Psychol Inq 21(4):313–326. https://doi.org/10.1080/1047840X.2010.521057

Ferraris A, Mazzoleni A, Devalle A, Couturier J (2018) Big data analytics capabilities and knowledge management: Impact on firm performance. Manag Decis 57(8):1923–1936. https://doi.org/10.1108/MD-07-2018-0825

Fisher CW, Chengalur-Smith I, Ballou DP (2003) The impact of experience and time on the use of data quality information in decision making. Inf Syst Res 14(2):170–188. https://doi.org/10.1287/isre.14.2.170.16017

Forquer Gupta S (2012) Integrating national culture measures in the context of business decision making: an initial measurement development test of a mid level model. Cross Cult Manag: Int J 19(4):455–506. https://doi.org/10.1108/13527601211269987

Galbraith JR (1973) Designing complex organizations. Addison-Welsey, Reading, MA

Galbraith JR (2014) Organizational design challenges resulting from big data (SSRN Scholarly Paper ID 2458899). Social Science Research Network. https://papers.ssrn.com/abstract=2458899

George JF, King JL (1991) Examining the computing and centralization debate. Commun ACM 34(7):62–72. https://doi.org/10.1145/105783.105796

Gorla N, Somers TM, Wong B (2010) Organizational impact of system quality, information quality, and service quality. J Strateg Inf Syst 19(3):207–228. https://doi.org/10.1016/j.jsis.2010.05.001

Grover V, Chiang RHL, Liang T-P, Zhang D (2018) Creating strategic business value from big data analytics: a research framework. J Manag Inf Syst 35(2):388–423. https://doi.org/10.1080/07421222.2018.1451951

Grublješič T, Jaklič J (2015) Business intelligence acceptance: the prominence of organizational factors. Inf Syst Manag 32(4):299–315. https://doi.org/10.1080/10580530.2015.1080000

Gudfinnsson K, Strand M, Berndtsson M (2015) Analyzing business intelligence maturity. J Decis Syst 24(1):37–54. https://doi.org/10.1080/12460125.2015.994287

Gupta M, George JF (2016) Toward the development of a big data analytics capability. Inf Manag 53(8):1049–1064. https://doi.org/10.1016/j.im.2016.07.004

Hair JF, Hult GTM, Ringle CM, Sarstedt M (2021) A primer on partial least squares structural equation modeling (PLS-SEM) (Third Edition). SAGE Publications. https://uk.sagepub.com/en-gb/eur/a-primer-on-partial-least-squares-structural-equation-modeling-pls-sem/book270548

Hammond KR (1996) Human judgment and social policy: Irreducible uncertainty, inevitable error, unavoidable injustice. Oxford University Press

Hazen BT, Boone CA, Ezell JD, Jones-Farmer LA (2014) Data quality for data science, predictive analytics, and big data in supply chain management: an introduction to the problem and suggestions for research and applications. Int J Prod Econ 154:72–80. https://doi.org/10.1016/j.ijpe.2014.04.018

Heinle MS, Ross N, Saouma RE (2014) A theory of participative budgeting. Account Rev 89(3):1025–1050

Henseler J, Ringle CM, Sarstedt M (2015) A new criterion for assessing discriminant validity in variance-based structural equation modeling. J Acad Mark Sci 43(1):115–135. https://doi.org/10.1007/s11747-014-0403-8

Henseler J, Ringle C, Sinkovics R (2009) The use of partial least squares path modeling in international marketing. In: Advances in international marketing (vol 20, pp 277–319). https://doi.org/10.1108/S1474-7979(2009)0000020014

Holsapple C, Lee-Post A, Pakath R (2014) A unified foundation for business analytics. Decis Support Syst 64:130–141. https://doi.org/10.1016/j.dss.2014.05.013

Hulland J (1999) Use of partial least squares (PLS) in strategic management research: a review of four recent studies. Strateg Manag J 20(2):195–204

Işık Ö, Jones MC, Sidorova A (2013) Business intelligence success: the roles of BI capabilities and decision environments. Inf Manag 50(1):13–23. https://doi.org/10.1016/j.im.2012.12.001

Järvenpää M (2007) Making business partners: a case study on how management accounting culture was changed. Eur Account Rev 16(1):99–142. https://doi.org/10.1080/09638180701265903

Jasperson J, Carte TA, Saunders CS, Butler BS, Croes HJP, Zheng W (2002) Review: Power and information technology research: a metatriangulation review. MIS Q 26(4):397–459. https://doi.org/10.2307/4132315

Jensen MC, Meckling WH (1976) Theory of the firm: managerial behavior, agency costs and ownership structure. J Financ Econ 3(4):305–360. https://doi.org/10.1016/0304-405X(76)90026-X

Joseph J, Gaba V (2020) Organizational structure, information processing, and decision-making: a retrospective and road map for research. Acad Manag Ann 14(1):267–302. https://doi.org/10.5465/annals.2017.0103

Kaplan SN, Sorensen M (2021) Are CEOs different? J Finance 76(4):1773–1811. https://doi.org/10.1111/jofi.13019

Karaboga T, Zehir C, Tatoglu E, Karaboga HA, Bouguerra A (2022) Big data analytics management capability and firm performance: the mediating role of data-driven culture. RMS. https://doi.org/10.1007/s11846-022-00596-8

Kepner CH, Tregoe BB (2005) The new rational manager, rev. Kepner-Tregoe, NY


Khayer A, Talukder MdS, Bao Y, Hossain MdN (2020) Cloud computing adoption and its impact on SMEs’ performance for cloud supported operations: a dual-stage analytical approach. Technol Soc 60:101225. https://doi.org/10.1016/j.techsoc.2019.101225

Knauer T, Nikiforow N, Wagener S (2020) Determinants of information system quality and data quality in management accounting. J Manag Control 31(1):97–121. https://doi.org/10.1007/s00187-020-00296-y

Kock N, Hadaya P (2018) Minimum sample size estimation in PLS-SEM: the inverse square root and gamma-exponential methods. Inf Syst J 28(1):227–261. https://doi.org/10.1111/isj.12131

Kowalczyk M, Buxmann P (2014) Big data and information processing in organizational decision processes: a multiple case study. Bus Inf Syst Eng 6(5):267–278. https://doi.org/10.1007/s12599-014-0341-5

Kowalczyk M, Gerlach J (2015) Business intelligence and analytics and decision quality: Insights on analytics specialization and information processing modes. In: ECIS 2015 completed research papers. https://doi.org/10.18151/7217398

Krishnamoorthi S, Mathew SK (2018) Business analytics and business value: a comparative case study. Inf Manag 55(5):643–666. https://doi.org/10.1016/j.im.2018.01.005

Kulkarni U, Robles-Flores J, Popovič A (2017) Business intelligence capability: the effect of top management and the mediating roles of user participation and analytical decision making orientation. J Assoc Inf Syst 18(7). https://doi.org/10.17705/1jais.00462

Kuvaas B (2002) An exploration of two competing perspectives on informational contexts in top management strategic issue interpretation. J Manage Stud 39(7):977–1001. https://doi.org/10.1111/1467-6486.00320

Labro E, Lang M, Omartian JD (2022) Predictive analytics and centralization of authority. J Account Econ 101526. https://doi.org/10.1016/j.jacceco.2022.101526

Laursen GHN, Thorlund J (2016) Business analytics for managers: taking business intelligence beyond reporting. Wiley.

LaValle S, Lesser E, Shockley R, Hopkins MS, Kruschwitz N (2011) Big data, analytics and the path from insights to value. MIT Sloan Manag Rev 52(2)

Leavitt HJ, Whisler TL (1958) Management in the 1980’s. Harvard Bus Rev. https://hbr.org/1958/11/management-in-the-1980s

Lee AS (2010) Retrospect and prospect: Information systems research in the last and next 25 years. J Inf Technol 25(4):336–348. https://doi.org/10.1057/jit.2010.24

Leonardi PM, Barley SR (2008) Materiality and change: challenges to building better theory about technology and organizing. Inf Organ 18(3):159–176. https://doi.org/10.1016/j.infoandorg.2008.03.001

Leonardi PM, Nardi BA, Kallinikos J (2012) Materiality and organizing: Social interaction in a technological world. OUP Oxford

Liberatore MJ, Wagner WP (2022) Simon’s decision phases and user performance: an experimental study. J Comput Inf Syst 62(4):667–679. https://doi.org/10.1080/08874417.2021.1878476

Lismont J, Vanthienen J, Baesens B, Lemahieu W (2017) Defining analytics maturity indicators: a survey approach. Int J Inf Manage 37(3):114–124. https://doi.org/10.1016/j.ijinfomgt.2016.12.003

Mandal P, Jain T (2021) Partial outsourcing from a rival: quality decision under product differentiation and information asymmetry. Eur J Oper Res 292(3):886–908. https://doi.org/10.1016/j.ejor.2020.11.018

Markus ML, Pfeffer J (1983) Power and the design and implementation of accounting and control systems. Acc Organ Soc 8(2):205–218. https://doi.org/10.1016/0361-3682(83)90028-4

Markus ML, Robey D (1988) Information technology and organizational change: causal structure in theory and research. Manage Sci 34(5):583–598. https://doi.org/10.1287/mnsc.34.5.583

Maroufkhani P, Wan Ismail WK, Ghobakhloo M (2020) Big data analytics adoption model for small and medium enterprises. J Sci Technol Policy Manag 11(4):483–513. https://doi.org/10.1108/JSTPM-02-2020-0018

Maxwell NL, Rotz D, Garcia C (2016) Data and decision making: same organization, different perceptions; different organizations, different perceptions. Am J Eval 37(4):463–485. https://doi.org/10.1177/1098214015623634

McAfee A, Brynjolfsson E (2012) Big data: the management revolution. Harv Bus Rev, 9

Mintzberg H, Raisinghani D, Théorêt A (1976) The structure of ‘unstructured’ decision processes. Adm Sci Q 21(2):246–275. https://doi.org/10.2307/2392045

O’Reilly CA (1980) Individuals and information overload in organizations: Is more necessarily better? Acad Manag J 23(4):684–696. https://doi.org/10.2307/255556

O’Reilly CA (1982) Variations in decision makers’ use of information sources: the impact of quality and accessibility of information. Acad Manag J 25(4):756–771. https://doi.org/10.2307/256097

Orlikowski WJ (1992) The duality of technology: rethinking the concept of technology in organizations. Organ Sci 3(3):398–427

Orlikowski WJ (2007) Sociomaterial practices: exploring technology at work. Organ Stud 28(9):1435–1448. https://doi.org/10.1177/0170840607081138

Pinsonneault A, Kraemer KL (1993) The impact of information technology on middle managers. MIS Q 17(3):271–292. https://doi.org/10.2307/249772

Pipino LL, Lee YW, Wang RY (2002) Data quality assessment. Commun ACM 45(4):211–218. https://doi.org/10.1145/505248.506010

Popovič A, Hackney R, Coelho PS, Jaklič J (2012) Towards business intelligence systems success: effects of maturity and culture on analytical decision making. Decis Support Syst 54(1):729–739. https://doi.org/10.1016/j.dss.2012.08.017

Prendergast C (2002) The tenuous trade-off between risk and incentives. J Polit Econ 110(5):1071–1102. https://doi.org/10.1086/341874

Provost F, Fawcett T (2013) Data science and its relationship to big data and data-driven decision making. Big Data 1(1):51–59. https://doi.org/10.1089/big.2013.1508

Puklavec B, Oliveira T, Popovič A (2018) Understanding the determinants of business intelligence system adoption stages: an empirical study of SMEs. Ind Manag Data Syst 118(1):236–261. https://doi.org/10.1108/IMDS-05-2017-0170

Qiao Y (2022) To delegate or not to delegate? On the quality of voluntary corporate financial disclosure. Rev Manag Sci 1–36. https://doi.org/10.1007/s11846-022-00576-y

Rajan MV, Saouma RE (2006) Optimal information asymmetry. Acc Rev 81(3):677–712. https://doi.org/10.2308/accr.2006.81.3.677

Rajagopalan N, Rasheed AMA, Datta DK (1993) Strategic decision processes: critical review and future directions. J Manag 19(2):349–384. https://doi.org/10.1016/0149-2063(93)90057-T

Redman TC (1998) The impact of poor data quality on the typical enterprise. Commun ACM 41(2):79–82. https://doi.org/10.1145/269012.269025

Richter NF, Cepeda G, Roldán JL, Ringle CM (2016) European management research using partial least squares structural equation modeling (PLS-SEM). Eur Manag J 34(6):589–597. https://doi.org/10.1016/j.emj.2016.08.001

Rikhardsson P, Yigitbasioglu O (2018) Business intelligence AND analytics in management accounting research: status and future focus. Int J Account Inf Syst 29:37–58. https://doi.org/10.1016/j.accinf.2018.03.001

Robey D, Anderson C, Raymond B (2013) Information technology, materiality, and organizational change: a professional odyssey. J Assoc Inf Syst 14(7). https://doi.org/10.17705/1jais.00337

Robey D, Boudreau M-C (1999) Accounting for the contradictory organizational consequences of information technology: Theoretical directions and methodological implications. Inf Syst Res 10(2):167–185. https://doi.org/10.1287/isre.10.2.167

Ross JW, Beath CM, Quaadgras A (2013) You may not need big data after all. Harv Bus Rev 91(12):90–98

Rouibah K, Ould-ali S (2002) PUZZLE: a concept and prototype for linking business intelligence to business strategy. J Strateg Inf Syst 11(2):133–152. https://doi.org/10.1016/S0963-8687(02)00005-7

Saam NJ (2007) Asymmetry in information versus asymmetry in power: implicit assumptions of agency theory? J Socio-Econ 36(6):825–840. https://doi.org/10.1016/j.socec.2007.01.018

Sadler-Smith E, Shefy E (2004) The intuitive executive: understanding and applying ‘gut feel’ in decision-making. Acad Manag Perspect 18(4):76–91. https://doi.org/10.5465/ame.2004.15268692

Samitsch C (2014). Data quality and its impacts on decision-making: How managers can benefit from good data. Springer

Sarstedt M, Bengart P, Shaltoni AM, Lehmann S (2018) The use of sampling methods in advertising research: a gap between theory and practice. Int J Advert 37(4):650–663. https://doi.org/10.1080/02650487.2017.1348329

Scott WR (1981). Organizations: rational, natural and open systems. Prentice Hall, NJ

Scott WR, Davis G (2015) Organizations and organizing: rational, natural and open systems perspectives. Routledge. https://doi.org/10.4324/9781315663371

Sebastian I, Ross J, Beath C, Mocker M, Moloney K, Fonstad N (2017) How big old companies navigate digital transformation. MIS Q Execut 16(3). https://aisel.aisnet.org/misqe/vol16/iss3/6

Seddon PB, Constantinidis D, Tamm T, Dod H (2017) How does business analytics contribute to business value? Inf Syst J 27(3):237–269. https://doi.org/10.1111/isj.12101

Sharma R, Mithas S, Kankanhalli A (2014) Transforming decision-making processes: a research agenda for understanding the impact of business analytics on organisations. Eur J Inf Syst 23(4):433–441. https://doi.org/10.1057/ejis.2014.17

Sharma R, Yetton P (2003) The contingent effects of management support and task interdependence on successful information systems implementation. MIS Q 27(4):533–556. https://doi.org/10.2307/30036548

Shmueli G, Ray S, Velasquez Estrada JM, Chatla SB (2016) The elephant in the room: predictive performance of PLS models. J Bus Res 69(10):4552–4564. https://doi.org/10.1016/j.jbusres.2016.03.049

Simon HA (1978) Information processing theory of human problem solving. In: Estes WK (ed) Handbook of learning and cognitive processes: human information processing (vol 5, pp 271–295). Psychology Press

Simon HA (1990) Bounded rationality. In: Eatwell J, Milgate M, Newman P (eds) Utility and probability (pp 15–18). Palgrave Macmillan UK. https://doi.org/10.1007/978-1-349-20568-4_5

Simon HA (2013) Administrative behavior, 4th edn. Simon and Schuster

Sor R (2004) Information technology and organisational structure: vindicating theories from the past. Manag Decis 42(2):316–329. https://doi.org/10.1108/00251740410513854

Srinivasan R, Swink M (2018) An investigation of visibility and flexibility as complements to supply chain analytics: an organizational information processing theory perspective. Prod Oper Manag 27(10):1849–1867. https://doi.org/10.1111/poms.12746

Stoica M, Liao J, Welsch H (2004) Organizational culture and patterns of information processing: the case of small and medium-sized enterprises. J Dev Entrepreneur 9(3)

Svenson O (1979) Process descriptions of decision making. Organ Behav Hum Perform 23(1):86–112. https://doi.org/10.1016/0030-5073(79)90048-5

Szukits Á (2022) The illusion of data-driven decision making: the mediating effect of digital orientation and controllers’ added value in explaining organizational implications of advanced analytics. J Manag Control 33(3):403–446. https://doi.org/10.1007/s00187-022-00343-w

Thunholm P (2004) Decision-making style: habit, style or both? Personal Individ Differ 36(4):931–944. https://doi.org/10.1016/S0191-8869(03)00162-4

Torres R, Sidorova A (2019) Reconceptualizing information quality as effective use in the context of business intelligence and analytics. Int J Inf Manage 49:316–329. https://doi.org/10.1016/j.ijinfomgt.2019.05.028

Turban E, Sharda R, Delen D, Aronson JE, Liang T-P, King D (2011) Decision support and business intelligence systems (9th ed). Pearson

Tushman ML, Nadler DA (1978) Information processing as an integrating concept in organizational design. Acad Manag Rev 3(3):613–624. https://doi.org/10.2307/257550

Valjanow S, Enzinger P, Dinges F (2019) Leveraging predictive analytics within a value driver-based planning framework. In: Liermann V, Stegmann C (eds) The impact of digital transformation and fintech on the finance professional (pp 99–115). Springer. https://doi.org/10.1007/978-3-030-23719-6_7

Van der Stede WA, Young SM, Chen CX (2005) Assessing the quality of evidence in empirical management accounting research: the case of survey studies. Acc Organ Soc 30(7):655–684. https://doi.org/10.1016/j.aos.2005.01.003

Visinescu LL, Jones MC, Sidorova A (2017) Improving decision quality: the role of business intelligence. J Comput Inf Syst 57(1):58–66. https://doi.org/10.1080/08874417.2016.1181494

Wang RY (1998) A product perspective on total data quality management. Commun ACM 41(2):58–65. https://doi.org/10.1145/269012.269022

Wang RY, Strong DM (1996) Beyond accuracy: what data quality means to data consumers. J Manag Inf Syst 12(4):5–33

Weick KE (1979) The social psychology of organizing, 2nd edn. McGraw-Hill

Wieder B, Ossimitz M-L (2015) The impact of business intelligence on the quality of decision making: A mediation model. Procedia Comput Sci 64:1163–1171. https://doi.org/10.1016/j.procs.2015.08.599

Wirges F, Neyer A-K (2022) Towards a process-oriented understanding of HR analytics: implementation and application. Rev Manag Sci. https://doi.org/10.1007/s11846-022-00574-0

Wohlstetter P, Datnow A, Park V (2008) Creating a system for data-driven decision-making: applying the principal-agent framework. Sch Eff Sch Improv 19(3):239–259. https://doi.org/10.1080/09243450802246376

Yoo Y (2012) Digital materiality and the emergence of an evolutionary science of the artificial. In: Leonardi PM, Nardi BA, Kallinikos J (eds) Materiality and organizing (1st ed., pp. 134–154). Oxford University Press, Oxford. https://doi.org/10.1093/acprof:oso/9780199664054.003.0007

Young R, Jordan E (2008) Top management support: mantra or necessity? Int J Project Manage 26(7):713–725. https://doi.org/10.1016/j.ijproman.2008.06.001

Yu W, Wong CY, Chavez R, Jacobs MA (2021a) Integrating big data analytics into supply chain finance: the roles of information processing and data-driven culture. Int J Prod Econ 236:108135. https://doi.org/10.1016/j.ijpe.2021.108135

Yu W, Zhao G, Liu Q, Song Y (2021b) Role of big data analytics capability in developing integrated hospital supply chains and operational flexibility: an organizational information processing theory perspective. Technol Forecast Soc Change 163:120417. https://doi.org/10.1016/j.techfore.2020.120417

Zack MH (2007) The role of decision support systems in an indeterminate world. Decis Support Syst 43(4):1664–1674. https://doi.org/10.1016/j.dss.2006.09.003

Zelt S, Schmiedel T, vom Brocke J (2018) Understanding the nature of processes: an information-processing perspective. Bus Process Manag J 24(1):67–88. https://doi.org/10.1108/BPMJ-05-2016-0102

Zhu S, Song J, Hazen BT, Lee K, Cegielski C (2018) How supply chain analytics enables operational supply chain transparency: an organizational information processing theory perspective. Int J Phys Distrib Logist Manag 48(1):47–68. https://doi.org/10.1108/IJPDLM-11-2017-0341

Zuboff S (1988) In the age of the smart machine: the future of work and power. Heinemann Professional

Funding

Open access funding provided by Corvinus University of Budapest. Project no. NKFIH-869-10/2019 has been implemented with the support provided from the National Research, Development and Innovation Fund of Hungary, financed under the ‘Tématerületi Kiválósági Program’ funding scheme.

Author information

Authors and affiliations

Department of Strategic Management, Institute of Strategy and Management, Corvinus University of Budapest, Fővám Square 8, Budapest, Hungary

Ágnes Szukits & Péter Móricz


Contributions

Both authors contributed to the study’s conception and design. Both authors read and approved the final manuscript.

Corresponding author

Correspondence to Ágnes Szukits.

Ethics declarations

Conflict of interest

The authors have no relevant financial or non-financial interests to disclose.

Additional information

Publisher's note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendix I: Questionnaire items and scales

(Each item's variable name is given in parentheses; the construct, its source reference, and the scale anchors follow each question stem.)

1. To what extent do you consider the following statements about your company to be true? (1 = Not at all … 5 = Completely)
   Construct: TM – Top Management Support (reference: Kulkarni et al.)
   a) Top management considers that business analytics plays a strategically important role (TM_1)
   b) Top management sponsors business analytics initiatives (TM_2)
   c) Top management demonstrates commitment to business analytics via policy/guidelines (TM_3)
   d) Top management hires and retains people with analytical skills (TM_4)

2. To what extent do you consider the following statements about the data available in your company's information system to be true? (1 = Not at all … 5 = Completely)
   Construct: DQ – Perceived data quality (reference: Wang and Strong 1996)
   a) The data values are in conformance with the actual or true values (DQ_1)
   b) The data are applicable (pertinent) to the task of the data user (DQ_2)
   c) The data are presented in an intelligible and clear manner (DQ_3)
   d) The data are available or obtainable (DQ_4)

3. To what extent do you consider the following statements about your company to be true? (1 = Strongly disagree … 5 = Strongly agree)
   Construct: AC – Analytical decision making culture (reference: Popovic et al.)
   a) The decision-making process is well established and known to its stakeholders (AC_1)
   b) It is our organization's policy to incorporate available information within any decision-making process (AC_2)
   c) We consider the information provided regardless of the type of decision to be taken (AC_3)

4. To what extent do you consider the following statements about your company to be true? (1 = Strongly disagree … 5 = Strongly agree)
   Construct: CE – Centralization in data use (reference: George and King)
   a) The results of numerical analyses are only available to a narrow group of people (CE_1)
   b) Where sufficient quantity and quality of data is available, senior management can easily make decisions without involving lower levels of management (CE_2)

5. To what extent do you consider the following statements about your company to be true? (1 = Strongly disagree … 5 = Strongly agree)
   Construct: DM – Data-driven decision making (reference: Simon)
   In our organization, top management relies on available data …
   a) … in assessing the current situation (DM_1)
   b) … when identifying problems (DM_2)
   c) … exploring alternative courses of action (DM_3)
   d) … in the assessment of alternative courses of action (DM_4)
   e) … when choosing between alternative courses of action (DM_5)
   f) … when planning the implementation of decisions (DM_6)
   g) … in communicating decisions (DM_7)
   h) … to monitor the implementation of decisions (DM_8)
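To make the mapping between these item variables and their composite constructs concrete, the following is a minimal Python sketch. It is illustrative only: the DataFrame `responses`, whose columns are assumed to carry the Appendix I variable names (TM_1 … DM_8) as 1–5 Likert answers, is hypothetical, and simple unweighted item averages plus Cronbach's alpha are used for illustration rather than the latent-variable estimation reported in the article itself.

```python
# Illustrative sketch: maps the Appendix I variable names to their constructs
# and derives simple composite scores. The `responses` DataFrame is assumed,
# not taken from the article; averaging and Cronbach's alpha are illustrative.
import pandas as pd

CONSTRUCTS = {
    "TM": ["TM_1", "TM_2", "TM_3", "TM_4"],   # Top Management Support
    "DQ": ["DQ_1", "DQ_2", "DQ_3", "DQ_4"],   # Perceived data quality
    "AC": ["AC_1", "AC_2", "AC_3"],           # Analytical decision making culture
    "CE": ["CE_1", "CE_2"],                   # Centralization in data use
    "DM": [f"DM_{i}" for i in range(1, 9)],   # Data-driven decision making
}

def cronbach_alpha(items: pd.DataFrame) -> float:
    """Internal-consistency estimate for one construct's Likert items."""
    complete = items.dropna()
    k = complete.shape[1]
    item_variances = complete.var(axis=0, ddof=1).sum()
    total_variance = complete.sum(axis=1).var(ddof=1)
    return k / (k - 1) * (1 - item_variances / total_variance)

def construct_scores(responses: pd.DataFrame) -> pd.DataFrame:
    """Unweighted mean of each construct's 1-5 items, one row per respondent."""
    return pd.DataFrame(
        {name: responses[cols].mean(axis=1) for name, cols in CONSTRUCTS.items()}
    )
```

With a suitable `responses` table, `construct_scores(responses)` would yield one composite per construct, and `cronbach_alpha(responses[CONSTRUCTS["DM"]])` would gauge the internal consistency of the eight decision-making items.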

 

Appendix II: Descriptive statistics of the variables

Variable name   Missing (count)   N     Theoretical range   Observed min   Observed max   Mean    SD
TM_1            1                 304   1–5                 1              5              4.342   0.836
TM_2            4                 301   1–5                 1              5              4.272   0.932
TM_3            2                 303   1–5                 1              5              4.228   0.929
TM_4            3                 302   1–5                 1              5              4.205   0.958
DQ_1            4                 301   1–5                 2              5              4.628   0.572
DQ_2            3                 302   1–5                 1              5              4.52    0.708
DQ_3            2                 303   1–5                 2              5              4.564   0.625
DQ_4            3                 302   1–5                 2              5              4.533   0.639
AC_1            0                 305   1–5                 1              5              4.41    0.789
AC_2            0                 305   1–5                 1              5              4.334   0.861
AC_3            0                 305   1–5                 1              5              4.37    0.779
CE_1            0                 305   1–5                 1              5              3.931   1.052
CE_2            2                 303   1–5                 1              5              4.096   0.776
DM_1            4                 301   1–5                 1              5              4.312   0.726
DM_2            5                 300   1–5                 1              5              4.193   0.838
DM_3            3                 302   1–5                 1              5              4.159   0.764
DM_4            6                 299   1–5                 1              5              4.157   0.775
DM_5            6                 299   1–5                 2              5              4.194   0.769
DM_6            5                 300   1–5                 1              5              4.21    0.808
DM_7            4                 301   1–5                 1              5              4.173   0.849
DM_8            2                 303   1–5                 1              5              4.201   0.768
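The per-item figures above can be recomputed directly from raw item-level responses. Below is a minimal sketch under stated assumptions: the file name `survey_items.csv` is a placeholder, missing answers are assumed to be stored as empty cells (read as NaN), and the column names are assumed to match the variable names in Appendix I.

```python
# Minimal sketch: recompute the Appendix II columns from item-level responses.
# "survey_items.csv" is a placeholder file with one column per variable (TM_1 ... DM_8).
import pandas as pd

responses = pd.read_csv("survey_items.csv")

summary = pd.DataFrame({
    "Missing (count)": responses.isna().sum(),
    "N": responses.notna().sum(),
    "Observed min": responses.min(),
    "Observed max": responses.max(),
    "Mean": responses.mean().round(3),
    "SD": responses.std(ddof=1).round(3),  # sample standard deviation
})
print(summary)
```

Because pandas skips NaN values by default, each item's mean and SD are computed over that item's own set of valid answers, consistent with the per-item N reported in the table.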

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.


About this article

Szukits, Á., Móricz, P. Towards data-driven decision making: the role of analytical culture and centralization efforts. Rev Manag Sci (2023). https://doi.org/10.1007/s11846-023-00694-1


Received: 07 December 2022

Accepted: 10 August 2023

Published: 16 September 2023

DOI: https://doi.org/10.1007/s11846-023-00694-1


Keywords

  • Data-driven decision making
  • Business analytics
  • Decision making culture
  • Centralization




