Data mining is also known as knowledge discovery in data. It is the process of analyzing data from different perspectives and summarizing it into useful information, which can then be used to increase revenue, cut costs, or both.
Data mining can also be thought of as a process of finding patterns and correlations among different fields in relational databases.
Data mining is carried out with specialized software. Powerful computers sift through large volumes of data, and research reports are drawn up from the results. Continuous innovation in the field is increasing the accuracy of analysis while lowering its cost, as disk storage and processing power become cheaper. Data mining software is also more powerful than it was in earlier years and can extract far more information. The results can now be presented visually, so that a non-specialist can get a clearer picture from graphs or pictures rather than by deciphering tables of numbers.
The difference between data, information and knowledge is as follows:
- Data consists of numbers, text or facts that can be processed. It can be divided into:
  - Operational data, such as cost, payroll, inventory, sales or accounting data.
  - Non-operational data, such as forecast data, industry sales data and macroeconomic data.
  - Metadata, which is data about the data itself, such as data dictionary definitions or the logical design of a database.
- Information consists of the relationships, patterns or associations that can be gathered from data.
- Knowledge is what information is converted into; it allows decisions to be made on the basis of that information.
Uses of data mining
Data mining is used mainly by customer-focused organizations. It helps companies find relationships between internal factors, such as product positioning, price and staff skills, and external factors, such as competition, customer demographics and economic indicators. It also lets a company determine the impact of these factors on customer satisfaction, sales and corporate profits, and drill down from summary information to the underlying transactional data.
For example, a retailer can use point-of-sale records to target promotions based on each customer's purchase history. Using customer demographics, it can develop products and promotions aimed at a specific segment, taking into account buying patterns and the season of the year.
Different levels of analysis are available:
- Artificial neural networks are nonlinear predictive models that learn through training and resemble biological neural networks in structure.
- Genetic algorithms are optimization techniques that use processes such as genetic combination, mutation and natural selection, in a design based on the concepts of natural evolution.
- Decision trees are tree-shaped structures that represent sets of decisions. Specific methods include CART (Classification and Regression Trees) and CHAID (Chi-Square Automatic Interaction Detection).
- The nearest neighbor method classifies each record based on the classes of the k records most similar to it in a historical data set. It is also known as the k-nearest neighbor technique.
- Rule induction extracts useful if-then rules from the data.
- Data visualization is the visual interpretation of complex relationships in multidimensional data, for example through graphs.
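As an illustration of one technique from the list, the k-nearest neighbor method can be sketched in a few lines of Python. This is a minimal sketch, not a production classifier: the customer records, the two features (age and monthly spend) and the segment labels are all invented for the example, and the features are left unscaled for brevity.

```python
from collections import Counter
import math


def knn_classify(training, query, k=3):
    """Classify `query` by majority vote among its k nearest training records.

    `training` is a list of (features, label) pairs; features are numeric tuples.
    """
    # Sort the training records by Euclidean distance to the query point.
    by_distance = sorted(training, key=lambda rec: math.dist(rec[0], query))
    # Count the labels of the k closest records and return the most common one.
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]


# Hypothetical customer records: (age, monthly spend) -> segment label.
customers = [
    ((22, 40.0), "budget"),
    ((25, 55.0), "budget"),
    ((27, 60.0), "budget"),
    ((45, 300.0), "premium"),
    ((50, 320.0), "premium"),
    ((52, 280.0), "premium"),
]

new_customer_segment = knn_classify(customers, (48, 290.0), k=3)
print(new_customer_segment)  # premium
```

In practice the features would usually be scaled to comparable ranges first, since otherwise the attribute with the largest numeric range dominates the distance.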
Advantages of data mining
Data mining has many advantages.
As part of the knowledge discovery process, it makes it possible to analyze huge data sets and bring useful knowledge and hidden information to light. It is applied not only in business but also in fields such as medicine, weather forecasting, transportation, insurance, health care and government.
- In marketing, it is used to build models based on historical data that predict who is most likely to respond to new campaigns such as online marketing or direct mail. Using the results, marketers can promote profitable products to the customer segments that have been identified.
- In retail, it helps companies arrange products so that items customers usually buy together are easy to find and purchase together. It also helps companies decide which products to discount in order to attract more customers.
- In financial institutions, it helps with credit reporting and loan information. With a model built from customers' historical data, a bank can tell good loans from bad ones. It also helps banks identify fraudulent credit card transactions and so protect cardholders.
- In manufacturing, it is used to detect faulty equipment and to determine optimal control parameters.
- Government agencies use it to analyze financial transaction records, build patterns from them and detect money laundering and other criminal activity.
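The retail use case in the list above, finding products that are usually bought together, is commonly approached by measuring the support and confidence of item pairs. The sketch below shows one simple way to compute both, assuming a handful of invented shopping baskets; the item names and the helper functions `pair_support` and `confidence` are hypothetical and not taken from any particular tool.

```python
from collections import Counter
from itertools import combinations

# Hypothetical point-of-sale baskets; item names are illustrative only.
baskets = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"bread", "jam"},
    {"milk", "butter"},
    {"bread", "butter", "jam"},
]


def pair_support(baskets):
    """Return the fraction of baskets that contain each unordered item pair."""
    counts = Counter()
    for basket in baskets:
        # Sorting gives every pair a canonical (alphabetical) order.
        for pair in combinations(sorted(basket), 2):
            counts[pair] += 1
    return {pair: n / len(baskets) for pair, n in counts.items()}


def confidence(baskets, antecedent, consequent):
    """Estimate how often `consequent` appears in baskets containing `antecedent`."""
    with_antecedent = [b for b in baskets if antecedent in b]
    if not with_antecedent:
        return 0.0
    return sum(consequent in b for b in with_antecedent) / len(with_antecedent)


support = pair_support(baskets)
print(support[("bread", "butter")])            # 0.6
print(confidence(baskets, "bread", "butter"))  # 0.75
```

Here bread and butter appear together in three of the five baskets, and three of the four baskets that contain bread also contain butter, which is the kind of evidence a retailer might use when placing the two products near each other.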
Disadvantages of data mining
However, data mining has a few disadvantages. The major one is the concern over personal privacy: people fear that their personal information will be misused or used unethically. Businesses collect information in order to understand the purchasing behavior of their customers. But when a business closes, is acquired or merges, that personal information could be sold or, worse still, leaked, putting the consumer at risk.
Data mining also poses a security risk, because the stored information, which often includes very personal and financial data, could be hacked. Credit card theft and identity theft are major problems for businesses.
Data collected for ethical purposes can be transformed into information that can then be misused, either to discriminate against people or to take undue advantage of those who are vulnerable.
When data mining is not done accurately, it leads to wrong decisions, and these can have serious and in some cases disastrous consequences.
Data mining and privacy
Various methods are used to uphold privacy in data mining. The first is the non-interactive method, in which the database is sanitized first and only then released. The second is the interactive method, in which multiple questions can be asked and each answer is given adaptively, depending on the questions asked.
In the interactive sanitizer method, the query function is applied to the database and a noisy result is returned. Because the noise is random, it introduces uncertainty, which protects privacy. The amount of noise, and with it the level of privacy, is configurable.
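A noisy query answer of this kind can be sketched as follows. The text does not say which noise distribution is used; this sketch assumes Laplace noise, a common choice for counting queries, with a parameter named `epsilon` controlling the trade-off (smaller values mean more noise). The function name `noisy_count` and the sample data are hypothetical.

```python
import random


def noisy_count(records, predicate, epsilon, rng=random):
    """Answer a counting query with Laplace noise of scale 1/epsilon.

    Adding or removing one record changes a count by at most 1, so the
    noise scale depends only on epsilon. Smaller epsilon means more noise.
    """
    true_count = sum(1 for r in records if predicate(r))
    # The difference of two independent exponential draws with rate epsilon
    # follows a Laplace distribution with mean 0 and scale 1/epsilon.
    noise = rng.expovariate(epsilon) - rng.expovariate(epsilon)
    return true_count + noise


# Hypothetical database of ages; the true answer to the query below is 4.
ages = [23, 35, 41, 29, 62, 57, 38, 44]
rng = random.Random(0)  # seeded only so that the example is repeatable
answer = noisy_count(ages, lambda a: a >= 40, epsilon=0.5, rng=rng)
print(answer)
```

Each call returns a different value scattered around the true count, so a single answer reveals little about whether any one person is in the database, while the average over many queries of the same kind remains close to the truth.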
A further concern is that joining the database could make some bad event more likely for the person who joins. The goal of strong privacy is that joining the database should neither noticeably increase nor noticeably decrease the probability of any such event.
The data distribution method also protects privacy. Privacy-preserving algorithms are run either on centralized data or on distributed data. Distributed data can be partitioned horizontally or vertically. With horizontal partitioning, different records are held at different sites. With vertical partitioning, different sites hold the values of different attributes for the same records.
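The difference between the two kinds of partitioning is easy to see on a small table. The following sketch uses an invented patient table; the column names, the `id` join key and the two helper functions are assumptions made for the illustration.

```python
# Hypothetical patient table; column names and values are illustrative.
records = [
    {"id": 1, "age": 34, "zip": "10001", "diagnosis": "flu"},
    {"id": 2, "age": 51, "zip": "10002", "diagnosis": "asthma"},
    {"id": 3, "age": 29, "zip": "10001", "diagnosis": "flu"},
    {"id": 4, "age": 46, "zip": "10003", "diagnosis": "diabetes"},
]


def horizontal_partition(rows, n_sites):
    """Give each site a disjoint subset of the rows, with every attribute."""
    return [rows[i::n_sites] for i in range(n_sites)]


def vertical_partition(rows, key, attribute_groups):
    """Give each site every row, but only its own attributes plus a join key."""
    return [
        [{key: row[key], **{a: row[a] for a in attrs}} for row in rows]
        for attrs in attribute_groups
    ]


site_rows = horizontal_partition(records, 2)
site_cols = vertical_partition(records, "id", [["age", "zip"], ["diagnosis"]])

print(site_rows[0])  # records 1 and 3, all columns
print(site_cols[1])  # all four records, only id and diagnosis
```

In the horizontal case each site could mine its own customers or patients independently; in the vertical case no single site sees a complete record, so any joint analysis has to combine the sites' results through the shared key.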
Data distortion protects privacy by modifying the original records in the database. It is subcategorized into blocking, perturbation, merging, aggregation, sampling and swapping, each of which alters attribute values or their granularity.
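Four of these distortion operations can be sketched on a single column of values. The functions below are simplified illustrations written for this article, not implementations from any specific privacy tool, and the sample ages and diagnoses are invented.

```python
import random


def perturb(values, scale, rng):
    """Perturbation: add zero-mean random noise to each numeric value."""
    return [v + rng.uniform(-scale, scale) for v in values]


def swap(values, rng):
    """Swapping: shuffle one attribute's values across the records.

    The set of values in the column is unchanged, but the link between
    each value and its original record is broken.
    """
    shuffled = list(values)
    rng.shuffle(shuffled)
    return shuffled


def aggregate(values, bucket):
    """Aggregation: coarsen integer values into ranges of width `bucket`."""
    ranges = []
    for v in values:
        low = (v // bucket) * bucket
        ranges.append(f"{low}-{low + bucket - 1}")
    return ranges


def block(values, sensitive):
    """Blocking: replace sensitive values with an unknown marker."""
    return ["?" if v in sensitive else v for v in values]


rng = random.Random(42)  # seeded only so that the example is repeatable
ages = [23, 37, 41, 58]
noisy_ages = perturb(ages, scale=3, rng=rng)
swapped_ages = swap(ages, rng)
age_ranges = aggregate(ages, bucket=10)
blocked = block(["flu", "HIV", "asthma"], sensitive={"HIV"})

print(age_ranges)  # ['20-29', '30-39', '40-49', '50-59']
print(blocked)     # ['flu', '?', 'asthma']
```

Perturbation and swapping change individual values while trying to keep overall statistics useful, whereas aggregation and blocking reduce the level of detail that is released.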
Privacy-preserving techniques are also grouped by the data mining algorithm they are designed for: classification mining, clustering, Bayesian networks and association rule mining.
The data or rule hiding method hides either the original data or the rules that can be derived from it. Hiding rules is very complex, so heuristic methods are used.
Privacy protection techniques themselves fall into three groups. Heuristic methods modify only selected values so that data loss is kept to a minimum. Encryption-based methods rely on secure multiparty computation. Reconstruction-based methods rebuild the original data distribution from randomized data.
Technologies to protect privacy
To ensure privacy protection, sensitive information in the database is hidden in such a way that the modified data keeps the same characteristics as the original data, and mining it yields the same accuracy as mining the original data set. How to modify, select or purify the data for association rule discovery, clustering and classification is therefore an issue, and it is addressed by distorting the data. The various ways of doing this are:
- Perturbation-based association rule mining: in this method, the association rules considered are those whose support and confidence are equal to or greater than the user-defined thresholds. Sensitive information is hidden under the following conditions:
  - Sensitive rules that can be mined from the original data set must not be minable from the purified data set at the same or a higher level of support and confidence.
  - Non-sensitive rules mined from the original data set can also be mined from the purified data set at the same confidence and support levels.
  - Rules that could not be mined from the original data set must not become minable from the purified data set at the same confidence and support levels.
- Blocking-based association rule mining is another way to hide association rules. An attribute value is replaced with a question mark, so that an unknown value is used rather than a false one. This is usually preferred in fields such as medicine.
- Blocking-based classification rule mining combines classification rule analysis with parsimonious downgrading: the data administrator blocks the values of a class label so that the recipient of the downgraded information cannot build informative models from it.
- Distributed privacy preservation is based on encryption. Two or more parties mine their combined data without revealing their own data to one another. Problems in the distributed environment are converted into secure multiparty computation (SMC) problems, which addresses tasks such as clustering, generalization and aggregation of data.
- Reconstruction techniques rebuild the original data distribution from perturbed data, for example in order to build a decision tree, using a Bayesian reconstruction process or the EM (expectation-maximization) algorithm. The EM algorithm produces a maximum likelihood estimate of the original distribution from the perturbed data.
- Anonymous release means either not publishing sensitive data or publishing it with lower precision. It relies on the k-anonymity principle, under which each record cannot be distinguished from at least k-1 other records. The larger the value of k, the stronger the privacy protection, but the greater the information loss. An attacker with background knowledge can still identify sensitive data and personal relationships, which leads to loss of privacy or anonymity.
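The k-anonymity principle from the last item can be checked mechanically. The sketch below computes the largest k for which a released table is k-anonymous, assuming that age and zip code are the quasi-identifiers (the attributes an attacker could link to outside data); the table and the function name `k_anonymity` are invented for the example.

```python
from collections import Counter


def k_anonymity(rows, quasi_identifiers):
    """Return the largest k for which `rows` is k-anonymous.

    This is the size of the smallest group of rows that share the same
    values on all quasi-identifier columns.
    """
    groups = Counter(tuple(row[q] for q in quasi_identifiers) for row in rows)
    return min(groups.values())


# Hypothetical released table with generalized age and zip code.
released = [
    {"age": "30-39", "zip": "100**", "diagnosis": "flu"},
    {"age": "30-39", "zip": "100**", "diagnosis": "asthma"},
    {"age": "40-49", "zip": "100**", "diagnosis": "flu"},
    {"age": "40-49", "zip": "100**", "diagnosis": "diabetes"},
    {"age": "40-49", "zip": "100**", "diagnosis": "flu"},
]

k = k_anonymity(released, ["age", "zip"])
print(k)  # 2
```

Here every combination of age range and zip prefix is shared by at least two records, so the table is 2-anonymous. Generalizing the ages further into a single range would raise k to 5 but would also remove the age information, which is the trade-off between protection and information loss described above.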
Privacy protection is a growing concern in all fields, and especially in data mining. It involves several disciplines and is still evolving. A number of issues still need to be studied, including the mining of mobile data and of data streams. New user behaviors are also emerging from geographic and spatial data applications and from user mobility.