WhatsApp Chat Sentiment Analysis in R

Introduction to Data Mining



"The capacity of digital data storage worldwide has doubled every nine months for at least a decade, at twice the rate predicted by Moore's Law for the growth of computing power during the same period."

We are in an age often referred to as the information age. Because we believe that information leads to power and success, we have started investing our time in collecting large amounts of it.

What type of information are we collecting?

Business transactions, which include stocks, banking, exchanges, etc.

Scientific data, such as data about oceanic activity gathered by a South Pole iceberg centre, or information about human psychology gathered by an American university.

Games, which involve gathering a tremendous amount of data about players, matches, etc.

Medical and personal data, which includes an individual's medical records such as disease history and any current diseases.

This also includes gathering an individual's personal data to help organizations manage human resources, understand the market, etc.


What is data mining?

With the enormous amount of data being stored and used on a daily basis, it is important to develop powerful means of analyzing and interpreting such data, which in turn helps in decision making and in extracting particularly useful information.

Data mining refers to a set of methods applied to large and complex databases to eliminate randomness and discover hidden patterns. Data mining methods are almost always computationally intensive.

It is an essential process where intelligent methods are applied to extract data patterns.

The overall goal of the data mining process is to extract information from a data set and transform it into an understandable structure for further use.

The actual data mining task is the automatic analysis of large quantities of data to extract previously unknown and interesting patterns such as groups of data records (cluster analysis), unusual records and dependencies.
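As a rough illustration of what finding "unusual records" can mean, here is a minimal Python sketch that flags values lying far from the mean. The function name `unusual_records`, the threshold of 2 standard deviations, and the sample numbers are all assumptions made up for this example; real data mining tools use far more robust methods.

```python
from statistics import mean, stdev

def unusual_records(values, threshold=2.0):
    """Return the values whose z-score (distance from the mean,
    measured in sample standard deviations) exceeds the threshold."""
    mu = mean(values)
    sigma = stdev(values)
    if sigma == 0:
        # All values are identical, so nothing is unusual.
        return []
    return [v for v in values if abs(v - mu) / sigma > threshold]

# Hypothetical data: number of messages sent per day.
daily_messages = [12, 15, 11, 14, 13, 12, 16, 95, 14, 13]
print(unusual_records(daily_messages))  # prints [95]
```

Only the day with 95 messages stands out from the rest, which is the kind of previously unknown pattern an analyst would then investigate.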

The data mining process:


Problem Definition

A data mining project begins with understanding the business.

Here, data mining experts, business experts and domain experts communicate and work with each other to define the objective from a business perspective.


Data Exploration

The task of the domain experts is to collect the data and then explore it. This is important because they also identify any problems in the data. A frequent exchange of data between the business experts and the data mining experts is necessary.


Data Preparation

The main task of the domain experts is to build a data model for the modeling process, which is why they have collected the data. Once the data is collected, they cleanse it, i.e., remove unwanted data, error-producing data, etc. The cleansed data is then formatted, because certain functions only accept data in a specific format. A domain expert also creates derived attributes, which are attributes computed from other attributes. For example, an average is a derived attribute computed from two numeric attributes.

In the data preparation phase, data is tweaked multiple times, in no prescribed order. This phase involves preparing data for a model by selecting the tables, records and attributes.

However, remember that the meaning of the data does not change in this step.
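The cleansing and derived-attribute steps described above can be sketched in a few lines of Python. The record layout, the field names (`exam_1`, `exam_2`, `average`) and the helper functions are hypothetical, chosen only to mirror the "average of two numeric attributes" example.

```python
# Hypothetical raw records, as they might arrive from a spreadsheet export.
raw_records = [
    {"student": "A", "exam_1": "72", "exam_2": "80"},
    {"student": "B", "exam_1": "", "exam_2": "65"},     # missing value
    {"student": "C", "exam_1": "88", "exam_2": "n/a"},  # error-producing value
    {"student": "D", "exam_1": "55", "exam_2": "61"},
]

def to_number(text):
    """Format a text field as a number, or return None if it cannot be parsed."""
    try:
        return float(text)
    except ValueError:
        return None

def prepare(records):
    """Cleanse the records and add a derived 'average' attribute."""
    prepared = []
    for record in records:
        exam_1 = to_number(record["exam_1"])
        exam_2 = to_number(record["exam_2"])
        if exam_1 is None or exam_2 is None:
            continue  # cleansing: drop records with unusable values
        prepared.append({
            "student": record["student"],
            "exam_1": exam_1,
            "exam_2": exam_2,
            "average": (exam_1 + exam_2) / 2,  # derived attribute
        })
    return prepared

for row in prepare(raw_records):
    print(row["student"], row["average"])
```

Records B and C are dropped during cleansing, while A and D keep their original values and gain the derived attribute, so the meaning of the remaining data is unchanged.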

Modeling

For one type of data mining problem, different mining functions can be applied. However, some functions require fixed and specific data types.

In this phase, there is a constant exchange with the domain experts and with the data preparation phase.

The modeling phase and the evaluation phase of data mining can be grouped together. They are repeated several times with different parameters, trying out different combinations, and the result of this phase is a high-quality and efficient model.

Evaluation

Once the modeling phase is over and a model is generated, the next important step is evaluating the model. Here, we check whether the model meets expectations. If it doesn't, it is either discarded and a new model is generated from scratch, or it goes back to the modeling phase, where the parameters are changed. If the new parameters yield optimal values, they are chosen. When the experts are finally satisfied with the model, they can extract business explanations from it.

At the end of the evaluation phase, the data mining experts decide how to use the data mining results.
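The loop between modeling and evaluation can be pictured with a small Python sketch: build a model for each candidate parameter, score it on labelled data, and keep the best one only if it meets the expectation. The threshold "model", the candidate values, the validation data and the 0.9 accuracy requirement are all assumptions invented for illustration.

```python
def build_model(threshold):
    """Modeling: a toy model that predicts True when value >= threshold."""
    return lambda value: value >= threshold

def evaluate(model, labelled):
    """Evaluation: fraction of labelled examples the model gets right."""
    correct = sum(1 for value, label in labelled if model(value) == label)
    return correct / len(labelled)

def tune(candidates, labelled):
    """Repeat modeling and evaluation for each parameter; keep the best."""
    best_threshold, best_score = None, 0.0
    for threshold in candidates:
        score = evaluate(build_model(threshold), labelled)
        if score > best_score:
            best_threshold, best_score = threshold, score
    return best_threshold, best_score

# Hypothetical labelled validation data: (value, expected label).
validation = [(1, False), (2, False), (3, False), (4, False),
              (5, True), (6, True), (7, True), (9, True)]
required_accuracy = 0.9  # assumed business expectation

best_threshold, best_score = tune([2, 4, 5, 8], validation)
if best_score >= required_accuracy:
    print("accepted: threshold", best_threshold, "accuracy", best_score)
else:
    print("rejected: go back to the modeling phase")
```

With these values the threshold of 5 classifies every validation example correctly, so the model is accepted; a stricter requirement or noisier data would send the process back to modeling.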

Deployment

In this stage, the mining result obtained from the previous step is exported or stored in a database table or spreadsheet. Tools such as The Intelligent Miner™ can assist you in following this process.

This can be done in three ways: independently, where individual results are stored in the database; iteratively, where results are stored one after the other; or in combination, where a combination of results is stored together in the database.
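To make the "database table or spreadsheet" step concrete, here is a minimal Python sketch that writes the same results both to CSV text (which a spreadsheet can open) and to a SQLite table. The table name `mining_results`, the column names and the sample rows are hypothetical, and this is unrelated to how any particular mining tool exports its output.

```python
import csv
import io
import sqlite3

# Hypothetical mining results: (record id, label assigned by the model).
results = [("A", "positive"), ("B", "negative")]

def export_to_csv(rows):
    """Return the results as CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.writer(buffer)
    writer.writerow(["record_id", "label"])
    writer.writerows(rows)
    return buffer.getvalue()

def export_to_database(rows, connection):
    """Store the results in a table of the given SQLite connection."""
    connection.execute(
        "CREATE TABLE IF NOT EXISTS mining_results (record_id TEXT, label TEXT)"
    )
    connection.executemany("INSERT INTO mining_results VALUES (?, ?)", rows)
    connection.commit()

connection = sqlite3.connect(":memory:")  # in-memory database for the demo
export_to_database(results, connection)
print(connection.execute("SELECT COUNT(*) FROM mining_results").fetchone()[0])
print(export_to_csv(results))
```

In a real project the connection would point at a file or a database server instead of `:memory:`, and the CSV text would be written to disk.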

Applications of data mining:

Risk management

Through data analysis and prediction, data mining helps in risk management. It can also be used to determine project feasibility, by collecting data about an individual's area of expertise, achievements, etc. Identifying fraudulent behavior and detecting fraudulent credit card usage are among the risk management facilities that data mining provides.

Marketing Plan


Data mining can be essentially used for developing a marketing plan.

Here, the data is collected in relational database systems. Based on this, a mining model is generated which includes the elements as columns. Previous marketing data and information is then incorporated into this model. The model thus holds information about previous as well as present marketing strategies, allowing it to make predictions and analyses about the future profits to be expected from a particular marketing plan.

Education

The education system in today's world faces a major difficulty in determining the path of future students and alumni. This involves asking questions such as: What course will a student choose? Which students will need assistance during their course?

One way of handling this problem is through data mining. Data mining allows institutions to understand study patterns from their databases. These insights help institutions allocate staff and resources effectively. For example, if a study at a university shows that the majority of students choose management as a core subject, the university will hire more staff in that domain.

Customer Relations Management (CRM)

Customer relations management deals with acquiring customers and retaining them. It also includes improving customer loyalty and supporting customer-focused strategies. In order to maintain a cordial relationship with customers, collecting data becomes very important. Data mining is therefore used in CRM to analyse the collected data and arrive at accurate answers to problems.

Future healthcare

Data mining has great potential in healthcare. In order to provide accurate healthcare, collecting data is necessary. This is where data mining comes into the picture. Through data mining, the data can be analysed and interpreted so that proper treatment can be provided. For example, an individual's data, such as previous surgeries if any, medical illnesses, and hereditary history for a particular illness, is collected. This data is then analysed to provide the correct remedy or treatment to that patient for a particular condition.

Fraud Detection

Fraud has always been present in the world, and over the past decade the amount of fraud has increased on a massive scale.

Traditional methods of fraud detection are time-consuming and inefficient. Data mining can be used to build an effective fraud detection mechanism. Here, all the criminal records are stored and classified as fraudulent or non-fraudulent. A model is then built from this data to identify whether a given record is fraudulent or not.
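As a minimal sketch of building a model from records labelled fraudulent or non-fraudulent, the Python example below uses a nearest-centroid classifier: it averages the features of each class and assigns a new record to the closer average. The two features (transaction amount and transactions per day), the sample records and the choice of classifier are assumptions for illustration, not a recommended fraud detection method.

```python
from statistics import mean

# Hypothetical labelled records: ((amount, transactions_per_day), is_fraud).
labelled = [
    ((20.0, 1), False), ((35.0, 2), False), ((15.0, 1), False),
    ((900.0, 9), True), ((1200.0, 12), True), ((750.0, 8), True),
]

def train(records):
    """Build the model: one centroid (mean feature vector) per class."""
    centroids = {}
    for label in (True, False):
        rows = [features for features, is_fraud in records if is_fraud == label]
        centroids[label] = tuple(mean(column) for column in zip(*rows))
    return centroids

def predict(centroids, features):
    """Classify a record by the class whose centroid is nearest."""
    def squared_distance(centroid):
        return sum((a - b) ** 2 for a, b in zip(features, centroid))
    return min(centroids, key=lambda label: squared_distance(centroids[label]))

model = train(labelled)
print(predict(model, (1000.0, 10)))  # prints True
print(predict(model, (25.0, 1)))     # prints False
```

The features are left unscaled here, so the amount dominates the distance; a realistic model would normalise the features and be evaluated on records it has not seen.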