ADVANCED DATA SCIENCE
Table of Contents
Abstract
Introduction
Background
Aim and objective
Dataset description
Problem to be addressed
Machine learning
Results
Conclusion
Reference list
An abstract summarizes the entire project in one place so that the reader can understand what the project does. The aim of this project is to detect whether a mushroom is poisonous from its sepal length and width and its petal length and width. The poison is indicated by the values 1, 2 and 3: if the value is 1, the mushroom is poisonous, but if the value is 2 or 3, the mushroom is not poisonous. The dataset used, the Iris dataset, contains three types of species, and these species are used to detect the poison. The implementation is done using machine learning in Python. A regression algorithm is also used to display the decision tree, including the sepal length and width and the petal length and width.
In this project, the developer detects whether a mushroom is poisonous using data science. The code is written in Python, with the help of a regression table and a decision tree, and a machine learning process is used to implement the work. A vector schematics process is also needed, and the dataset has to be loaded in Python. Python is a high-level, general-purpose programming language whose design philosophy emphasizes code readability, notably through its use of significant indentation. A decision tree is then built for the project. A decision tree is a decision support tool: a tree-like model of decisions and their possible consequences, including chance event outcomes, resource costs and utility. It is one way of displaying an algorithm that consists of conditional control statements. The decision tree is built to validate the result and is also plotted as a tree graph. Data exploration and data visualization are done in Python. The developer uses the pandas, scikit-learn and seaborn libraries, which are included in Anaconda. The machine learning model is evaluated with basic performance metrics in Python. The regression analysis is then done with the scikit-learn library, using a simple version of the program and the various steps of the algorithm. The next step is to compare classifiers in scikit-learn. Clustering reveals properties of the data that are difficult for the user to discover directly. Word clustering can be performed on text taken from an ebook or a large document.
Python is also used for the deep learning process, and all the required information is retrieved through the machine learning process.
The project is to identify poisonous mushrooms with the help of machine learning in Python. It requires a dataset that indicates whether a mushroom is poisonous or not on the basis of the mushroom's characteristics. A machine learning process is then built to address the problem, and the code is evaluated and concluded. The machine learning process is carried out in Google Colab, which executes the Python code and also helps with documenting it. In the past, black marketers have sold poisonous mushrooms, and many people have suffered from food poisoning after eating them. In this project, the developer develops machine learning software that detects whether a mushroom is poisonous or not. The project is coded in Python. The taste of the mushroom is predetermined; this is also handled in the code with the help of a regression table and the decision table. The software checks whether a mushroom is safe or poisonous, which can save lives.
Aim and objective
The aim of this project is to develop software that helps determine whether a mushroom is poisonous or not. In the past year, poisonous mushrooms were mixed with non-poisonous ones, and people suffered from food poisoning. To address the problem, software needs to be built that determines the quality of a mushroom, that is, whether it is poisonous or not. The code uses machine learning, a process that analyzes datasets to create a data model. Machine learning allows the user to make decisions based on what the project developer has built. It is a branch of artificial intelligence in which the system learns from data. The software is coded in Python. A decision tree is built to represent all the decisions in the code, that is, the characteristics of the mushroom, and a poisonous value is set. If the value is higher than the poisonous value, the mushroom is detected as poisonous, and if the value is under the poisonous value, the mushroom is denoted as healthy. The regression table explains the process. The vector schematics also need to be developed. The datasets are created, uploaded and shared with the help of Google Colab (Patle et al., 2019).
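The threshold rule described in this section can be sketched in a few lines of Python. This is only an illustration: the report does not state the actual poisonous value, so `POISON_THRESHOLD`, the function name `label_mushroom` and the sample scores are all hypothetical, and a value exactly equal to the threshold is assumed here to count as healthy.

```python
# Hypothetical threshold; the report does not give the actual poisonous value.
POISON_THRESHOLD = 0.5


def label_mushroom(score: float, threshold: float = POISON_THRESHOLD) -> str:
    """Return 'poisonous' if the score is above the threshold, else 'healthy'."""
    return "poisonous" if score > threshold else "healthy"


# Illustrative scores only; they are not taken from the dataset.
labels = [label_mushroom(score) for score in (0.2, 0.5, 0.9)]
print(labels)  # ['healthy', 'healthy', 'poisonous']
```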
A dataset is a collection of data. A dataset lists the values of each variable, such as the height and weight of an object, for every member of the dataset. Each value is known as a datum. A dataset can also consist of a collection of files. In this dataset, the sepal length is the stipe length of the mushroom, and the sepal width denotes the volva of the mushroom (Senaviratna et al., 2019). The petal length denotes the ring of the mushroom, and the petal width denotes the cap of the mushroom (Marčiukaitis et al., 2017). The dataset is described as follows.
In the first case, the stipe length of the mushroom is 5.1 units, the volva is 3.5 units, the ring is 1.4 units and the cap is 0.2 units, so the mushroom is said to be poisonous (Marčiukaitis et al., 2017). In the second case, the stipe length is 4.9 units, the volva is 3 units, the ring is 1.4 units and the cap is 0.2 units, so the output shows 1, which means the mushroom is poisonous (Miao et al., 2018). In the third case, the stipe length is 4.7 units, the volva is 3.2 units, the ring is 1.2 units and the cap is 0.2 units, so the output is again poisonous. If the output is zero, the mushroom is non-poisonous (Yousof et al., 2018).
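As a rough illustration, the three cases above can be written as a small pandas table. The column names (`stem_length`, `volva`, `ring`, `cap`, `poisonous`) are hypothetical, and the sketch assumes that the second and third cases list their four measurements in the same order as the first case and that a poisonous output is encoded as 1.

```python
import pandas as pd

# Column names are illustrative; the values are the three cases described above.
samples = pd.DataFrame(
    {
        "stem_length": [5.1, 4.9, 4.7],
        "volva": [3.5, 3.0, 3.2],
        "ring": [1.4, 1.4, 1.2],
        "cap": [0.2, 0.2, 0.2],
        "poisonous": [1, 1, 1],  # assumed encoding: 1 means poisonous
    }
)

# Split the table into a feature matrix and a label vector.
features = samples[["stem_length", "volva", "ring", "cap"]].to_numpy()
labels = samples["poisonous"].to_numpy()

print(samples)
print(features.shape, labels.shape)  # (3, 4) (3,)
```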
Problem to be addressed:
According to Praveena and Jagdishan, machine learning is a type of artificial intelligence that runs on a computer. Machine learning is classified into three categories: supervised, unsupervised and reinforcement learning. It is chosen to extract knowledge from data for human use in applied data mining. A machine learning algorithm has several steps. The first step is to establish the type of machine learning. The next step is to gather the dataset. The next step is to represent the function to be learned. The structure of that function is then settled and the algorithm is designed. Finally, its correctness is evaluated.
Machine learning is a field of computer science that helps computer systems make sense of data in much the same way as human beings do (Czajkowski et al., 2019).
In other words, machine learning is a kind of artificial intelligence (AI) that uses algorithms to extract patterns from raw data. The main goal of machine learning is to let computer systems learn from experience without being explicitly programmed and without human intervention (Harrison et al., 2018).
Need for machine learning:
Today, as the most advanced and intelligent species on earth, human beings are ahead in the race. They have the capability to think, examine and solve complex problems (Zhao et al., 2020). AI, by contrast, is still at an early stage and has not yet surpassed human intelligence. The question then is why people still want to use machine learning (Ahmad et al., 2018). The most appropriate answer is "to make decisions, based on data, with efficiency and scale" (Rizvi et al., 2019).
Recently, technologies such as artificial intelligence, deep learning and machine learning have become a main topic of investigation for organizations. The purpose is to extract key information from data in order to perform real-world tasks and solve problems. Decisions taken by machines in this way, specifically to automate a process, are called "data-driven decisions". Such decisions can be used instead of explicit programming logic for problems that cannot be programmed inherently. Human intelligence remains necessary, but real-world problems also need to be solved efficiently and at large scale. This is why the need for machine learning arises (Meng et al., 2020).
The reason and timing to make machines learn: The purpose of using machine learning has already been discussed, but that is not all. Another question is in which scenarios machines must be made to learn. There are several circumstances in which machines are needed to take "data-driven decisions" with efficiency and at large scale (Cai et al., 2018).
Lack of human expertise: The first scenario in which machines need to learn and take "data-driven decisions" is a domain where there is not enough human expertise (Zhang et al., 2017).
Dynamic scenarios: Some scenarios are dynamic by nature and keep changing over time. For such scenarios and behaviors, machines need to learn and take "data-driven decisions". An example is the network connectivity and availability of infrastructure in an organization (Rajkomar et al., 2019).
Difficulty in translating expertise into computational tasks: There are several domains in which humans have expertise but cannot translate that expertise into computational tasks. In such cases, machine learning is needed. For example, cognitive speech recognition is a domain that needs to be implemented using machine learning (Biran et al., 2017).
Machine Learning Model:
The definition of machine learning should be clear before discussing the machine learning model. The following definition is given by Professor Mitchell:
"A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E."
The definition is based on three parameters, which are also the major components of any machine learning algorithm: task (T), experience (E) and performance (P) (Ke et al., 2017). The definition can be simplified as follows.
Machine learning is a field of artificial intelligence concerned with algorithms that
improve their performance (P)
at executing some task (T)
with experience over time (E) (Mu et al., 2018).
Figure 2: Machine Learning
Let us discuss the above topics in detail:
Task (T): From the perspective of a problem, a task T is a real-world problem that needs to be solved, such as finding the best house price in a particular location or finding the most suitable marketing strategy. Machine learning-based tasks, however, are difficult to solve with a traditional programming approach, which is why the definition of a task is different in machine learning.
A task T is said to be a machine learning-based task when it is based on the process that the system should follow to operate on data points. Examples of machine learning-based tasks are regression, classification, clustering, structured annotation and transcription.
Experience (E): Experience is the knowledge acquired from the data points provided to the algorithm or model. Once provided with the dataset, the model runs iteratively and learns some inherent patterns. The learning acquired in this way is called experience (E). By analogy, a human being learns or gains experience from things such as relationships and situations. There are several ways of learning or gaining experience: supervised, unsupervised and reinforcement learning. The experience gained by the machine learning model or algorithm is used to solve the task T (Beam et al., 2018).
Performance (P): A machine learning algorithm is supposed to perform a task and gain experience over time. Performance (P) is the measure of whether the machine learning algorithm is performing as expected. P is a quantitative metric that tells how well a model performs the task T using its experience E. Accuracy score, sensitivity, confusion matrix, F1 score, precision and recall are some of the many metrics that help in understanding the performance of ML.
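The performance metrics named above can be computed with scikit-learn's `sklearn.metrics` module. The sketch below uses made-up labels (`y_true`, `y_pred`), not results from this project, purely to show how each metric is obtained. Sensitivity is the same quantity as recall for the positive class.

```python
from sklearn.metrics import (
    accuracy_score,
    confusion_matrix,
    f1_score,
    precision_score,
    recall_score,
)

# Made-up example labels: 1 = poisonous, 0 = not poisonous.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

accuracy = accuracy_score(y_true, y_pred)    # share of correct predictions
precision = precision_score(y_true, y_pred)  # TP / (TP + FP)
recall = recall_score(y_true, y_pred)        # TP / (TP + FN), i.e. sensitivity
f1 = f1_score(y_true, y_pred)                # harmonic mean of precision and recall
matrix = confusion_matrix(y_true, y_pred)    # rows: true class, columns: predicted class

print(accuracy, precision, recall, f1)  # 0.75 0.75 0.75 0.75
print(matrix)
```

In this example there are three true positives, three true negatives, one false positive and one false negative, so all four scores come out at 0.75.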
Challenges in Machine Learning: Although machine learning is evolving rapidly and making significant strides in areas such as autonomous cars and cybersecurity, this part of artificial intelligence still has a long way to go, because machine learning still faces many challenges. These are:
Quality of data: One of the largest challenges for machine learning algorithms is having good-quality data. Using low-quality data leads to issues with data processing and feature extraction (Butler et al., 2018).
Time-consuming task: Another challenge faced by machine learning models is the large amount of time consumed, especially by data acquisition, feature extraction and retrieval.
Lack of specialists: Expertise in machine learning is still rare because the technology is still in its infancy (Wei et al., 2019).
In this project, data science is implemented using Python. The project is done in Google Colab (Maxwell et al., 2018). The dataset is the Iris dataset, which is an Excel sheet covering three species of flowers. The aim is to implement a decision tree for these species based on measurements such as petal length and sepal length. The decision tree is described next.
Decision Tree: There are two types of decision tree: regression trees and classification trees.
Figure 3: Decision Tree
Classification trees are used to separate the dataset into the different classes of the response variable. Here the decision tree is implemented in Python.
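A minimal sketch of a classification tree is given below. It is not the project's own notebook: it assumes the copy of the Iris data bundled with scikit-learn (`load_iris`) rather than the project's `Iris.csv`, and the split ratio, `max_depth=3` and `random_state=0` are arbitrary illustrative choices. `export_text` prints the fitted tree as nested conditional rules.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_text

# Assumption: use scikit-learn's bundled Iris data instead of Iris.csv.
iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.3, random_state=0, stratify=iris.target
)

# Fit a shallow classification tree on the training split.
clf = DecisionTreeClassifier(max_depth=3, random_state=0)
clf.fit(X_train, y_train)

# Accuracy on the held-out split and the learned rules as text.
test_accuracy = clf.score(X_test, y_test)
rules = export_text(clf, feature_names=list(iris.feature_names))

print(f"test accuracy: {test_accuracy:.2f}")
print(rules)
```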
Decision Tree Regression: A decision tree can also be built as a regression tree. Regression trees are used when the response variable is continuous or numeric.
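To show what a regression tree does with a numeric response, the sketch below fits scikit-learn's `DecisionTreeRegressor` to a tiny made-up dataset with two clearly separated groups. The data and the `max_depth=1` setting are assumptions chosen for illustration. With a single split, the tree predicts the mean response of each group.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Made-up data: low inputs have response 1.0, high inputs have response 5.0.
X = np.array([[1.0], [2.0], [3.0], [10.0], [11.0], [12.0]])
y = np.array([1.0, 1.0, 1.0, 5.0, 5.0, 5.0])

# A depth-1 tree makes one split and predicts the mean of each side.
reg = DecisionTreeRegressor(max_depth=1, random_state=0)
reg.fit(X, y)

predictions = reg.predict(np.array([[2.5], [11.5]]))
print(predictions)  # [1. 5.]
```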
Random regression forest:
Random forest is an “ensemble learning method” for regression and classification. It is also a “supervised learning algorithm”. Random forest is a bagging technique, not a boosting technique, so the trees in a random forest do not interact with each other while they are built and can be built in parallel. It constructs many decision trees at training time and outputs the mode of the classes (classification) or the mean prediction (regression) of the individual trees.
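The statement that a regression forest outputs the mean prediction of its individual trees can be checked directly. The sketch below is an illustration under assumptions: it uses scikit-learn's bundled Iris data, treats the first three measurements as features and the fourth (petal width) as the numeric target, and picks `n_estimators=10` and `random_state=0` arbitrarily.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

# Assumption: predict the fourth measurement from the other three.
iris = load_iris()
X = iris.data[:, :3]
y = iris.data[:, 3]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

forest = RandomForestRegressor(n_estimators=10, random_state=0)
forest.fit(X_train, y_train)

# Prediction of the whole forest versus the average over its trees.
forest_pred = forest.predict(X_test)
tree_mean = np.mean([tree.predict(X_test) for tree in forest.estimators_], axis=0)

matches = bool(np.allclose(forest_pred, tree_mean))
print(matches)
```

Each tree in `forest.estimators_` is trained independently on a bootstrap sample, which is why the trees can be built in parallel.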
Figure 4: Random regression forest
This technique has been used here to build the decision tree for poisonous and non-poisonous mushrooms. Using random forest regression, two or three of the mushroom's sepal length, sepal width, petal length and petal width are predicted at random, and the fourth is then predicted from those random predictions.
# Import the libraries used in the following steps
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
# Read the sample data from the dataset file into a DataFrame
dataset = pd.read_csv('Iris.csv')
Output for this code segment
(Source: Self created)
# Features: columns 1 to 3 of the dataset; target: column 5
X = dataset.iloc[:, 1:4].values
y = dataset.iloc[:, 5].values
Output for this code segment:
(Source: Self created)
# Train the decision tree regression model on the data
from sklearn.tree import DecisionTreeRegressor
regressor = DecisionTreeRegressor(random_state=0)
regressor.fit(X, y)
Output for this code segment:
Figure 3: Output
(Source: Self created)
# Random forest model
from sklearn.ensemble import RandomForestRegressor
regRf = RandomForestRegressor(n_estimators=10)
Output for this code segment
(Source: Self created)
#plotting a graph
import matplotlib.pyplot as plt
Output for this code segment:
Figure 4: Graph for poison detecting
With the help of machine learning, the developer builds software that can detect whether a mushroom is poisonous or not. The software is coded in Python, using NumPy, pandas and SciPy-type libraries. A decision tree is built with the scikit-learn library, and the results are visualized with the matplotlib library. Data exploration and data visualization are done in Python. The pandas, NumPy and scikit-learn libraries are used to understand the datasets. To evaluate the machine learning model, cross-validation is used together with basic metrics. The datasets are split and cross-validated with the scikit-learn library. The regression analysis is then done, and the classifiers are compared in scikit-learn. Clustering allows the user to discover patterns in the datasets. Deep learning in Python is used to retrieve the values. Google Colab is used to execute the Python code. In future, the same Python approach can be used to build software that finds poisonous vegetables or fruit.
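The split-and-cross-validate step mentioned above can be sketched with scikit-learn's `cross_val_score`. This is an assumed setup, not the project's recorded run: it uses the bundled Iris data, a default `DecisionTreeClassifier` and five folds.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Assumption: bundled Iris data and 5-fold cross-validation.
iris = load_iris()
clf = DecisionTreeClassifier(random_state=0)

# One accuracy score per fold; the mean summarizes overall performance.
scores = cross_val_score(clf, iris.data, iris.target, cv=5)
mean_score = scores.mean()

print(scores)
print(f"mean accuracy: {mean_score:.2f}")
```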
Butler, K.T., Davies, D.W., Cartwright, H., Isayev, O. and Walsh, A., 2018. Machine learning for molecular and materials science. Nature, 559(7715), pp.547-555.
Wei, J., Chu, X., Sun, X.Y., Xu, K., Deng, H.X., Chen, J., Wei, Z. and Lei, M., 2019. Machine learning in materials science. InfoMat, 1(3), pp.338-358.
Maxwell, A.E., Warner, T.A. and Fang, F., 2018. Implementation of machine-learning classification in remote sensing: An applied review. International Journal of Remote Sensing, 39(9), pp.2784-2817.
Beam, A.L. and Kohane, I.S., 2018. Big data and machine learning in health care. JAMA, 319(13), pp.1317-1318.
Rajkomar, A., Dean, J. and Kohane, I., 2019. Machine learning in medicine. New England Journal of Medicine, 380(14), pp.1347-1358.
Ahmad, M.A., Eckert, C. and Teredesai, A., 2018, August. Interpretable machine learning in healthcare. In Proceedings of the 2018 ACM international conference on bioinformatics, computational biology, and health informatics (pp. 559-560).
Meng, T., Jing, X., Yan, Z. and Pedrycz, W., 2020. A survey on machine learning for data fusion. Information Fusion, 57, pp.115-129.
Cai, J., Luo, J., Wang, S. and Yang, S., 2018. Feature selection in machine learning: A new perspective. Neurocomputing, 300, pp.70-79.
Zhang, L., Tan, J., Han, D. and Zhu, H., 2017. From machine learning to deep learning: progress in machine intelligence for rational drug discovery. Drug discovery today, 22(11), pp.1680-1685.
Biran, O. and Cotton, C., 2017, August. Explanation and justification in machine learning: A survey. In IJCAI-17 workshop on explainable AI (XAI) (Vol. 8, No. 1, pp. 8-13).
Ke, G., Meng, Q., Finley, T., Wang, T., Chen, W., Ma, W., Ye, Q. and Liu, T.Y., 2017. LightGBM: A highly efficient gradient boosting decision tree. Advances in Neural Information Processing Systems, 30, pp.3146-3154.
Mu, Y., Liu, X. and Wang, L., 2018. A Pearson’s correlation coefficient based decision tree and its parallel implementation. Information Sciences, 435, pp.40-58.
Rizvi, S., Rienties, B. and Khoja, S.A., 2019. The role of demographics in online learning; A decision tree based approach. Computers & Education, 137, pp.32-47.
Harrison, P.A., Dunford, R., Barton, D.N., Kelemen, E., Martín-López, B., Norton, L., Termansen, M., Saarikoski, H., Hendriks, K., Gómez-Baggethun, E. and Czúcz, B., 2018. Selecting methods for ecosystem service assessment: A decision tree approach. Ecosystem Services, 29, pp.481-498.
Czajkowski, M. and Kretowski, M., 2019. Decision tree underfitting in mining of gene expression data. An evolutionary multi-test tree approach. Expert Systems with Applications, 137, pp.392-404.
Zhao, R., Zhan, L., Yao, M. and Yang, L., 2020. A geographically weighted regression model augmented by Geodetector analysis and principal component analysis for the spatial distribution of PM2.5. Sustainable Cities and Society, 56, p.102106.
Yousof, H.M., Jahanshahi, S.M.A., Ramires, T.G., Aryal, G.R. and Hamedani, G.G., 2018. A new distribution for extreme values: Regression model, characterizations and applications. Journal of Data Science, 16(4).
Miao, F., Wu, Y., Xie, Y. and Li, Y., 2018. Prediction of landslide displacement with step-like behavior based on multialgorithm optimization and a support vector regression model. Landslides, 15(3), pp.475-488.
Senaviratna, N.A.M.R. and Cooray, T.M.J.A., 2019. Diagnosing multicollinearity of logistic regression model. Asian Journal of Probability and Statistics, pp.1-9.
Patle, G.T., Sikar, T.T., Rawat, K.S. and Singh, S.K., 2019. Estimation of infiltration rate from soil properties using regression model for cultivated land. Geology, Ecology, and Landscapes, 3(1), pp.1-13.
Marčiukaitis, M., Žutautaitė, I., Martišauskas, L., Jokšas, B., Gecevičius, G. and Sfetsos, A., 2017. Non-linear regression model for wind turbine power curve. Renewable Energy, 113, pp.732-741.