Data Mining: Input and Output
The definition given in the previous item implies that data mining takes inputs and produces outputs.
Inputs can be:
- concept, the thing that is to be learned;
- instance, an individual and independent example of the concept to be learned (classified, associated or clustered). Each instance is characterized by the values of a set of predetermined attributes;
- attribute, a feature of an instance that takes its value from a predetermined set.
A row of a database represents an instance, while a column represents an attribute. If no instance has a null attribute value, the database is called dense; if at least one instance has at least one null attribute value, the database is called sparse.
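The dense/sparse distinction can be illustrated with a minimal Python sketch. The toy `weather` database, the attribute names and the helper functions `is_dense` and `describe` are hypothetical, and `None` is assumed to stand for a null attribute value.

```python
# Hypothetical toy database: each dict is an instance (a row),
# each key is an attribute (a column). None stands for a null value.
weather = [
    {"outlook": "sunny", "temperature": 85, "play": "no"},
    {"outlook": "rainy", "temperature": None, "play": "yes"},
    {"outlook": "overcast", "temperature": 72, "play": "yes"},
]


def is_dense(instances):
    """Return True when no instance has a null attribute value."""
    return all(
        value is not None
        for instance in instances
        for value in instance.values()
    )


def describe(instances):
    """Label the database 'dense' or 'sparse' as defined in the text."""
    return "dense" if is_dense(instances) else "sparse"


print(describe(weather))                    # sparse (one null temperature)
print(describe([weather[0], weather[2]]))   # dense (no null values)
```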
The most common ways of representing outputs are reported below.
Decision table. It is a schema of rows and columns, just like the input, but reduced to the instances and attributes of interest. The figure reported below shows an example of a decision table.
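One possible way to obtain such a table is sketched below in Python. The function `build_decision_table`, the toy instances and the choice of storing the majority class for each combination of attribute values are assumptions made for illustration; the text does not prescribe a construction method.

```python
from collections import Counter, defaultdict

# Hypothetical input instances; "play" is assumed to be the class attribute.
instances = [
    {"outlook": "sunny", "windy": False, "humidity": "high", "play": "no"},
    {"outlook": "sunny", "windy": True, "humidity": "high", "play": "no"},
    {"outlook": "overcast", "windy": False, "humidity": "normal", "play": "yes"},
    {"outlook": "rainy", "windy": False, "humidity": "high", "play": "yes"},
    {"outlook": "rainy", "windy": True, "humidity": "normal", "play": "no"},
]


def build_decision_table(rows, attributes, class_attribute):
    """Keep only the attributes of interest and map each combination of
    their values to the most frequent class (an assumed tie-free choice)."""
    votes = defaultdict(Counter)
    for row in rows:
        key = tuple(row[a] for a in attributes)
        votes[key][row[class_attribute]] += 1
    return {key: counts.most_common(1)[0][0] for key, counts in votes.items()}


# "humidity" is dropped: only "outlook" and "windy" are of interest here.
table = build_decision_table(instances, ["outlook", "windy"], "play")
print(table[("sunny", False)])     # no
print(table[("overcast", False)])  # yes
```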
Decision tree. This structure derives from graph theory. It has nodes, which contain information, and edges, which link two nodes. If the edge is directed, like an arrow, it links the head node to the tail node. The root is the head of the whole tree, while a leaf is one of the tails of the whole tree. The figure reported below shows a decision tree in which "A" is the root and "C", "D", "D" are the leaves.
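The node, root and leaf terminology can be sketched in Python as follows. The `Node` class and the `leaves` helper are hypothetical, and the tree shape (an intermediate node "B" under the root "A") is an assumption chosen only so that the leaves match the labels mentioned above.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """A tree node: a label plus directed edges to its child nodes."""
    label: str
    children: list = field(default_factory=list)

    def is_leaf(self):
        # A leaf has no outgoing edges.
        return not self.children


def leaves(node):
    """Collect the labels of all leaves reachable from `node`, left to right."""
    if node.is_leaf():
        return [node.label]
    found = []
    for child in node.children:
        found.extend(leaves(child))
    return found


# Assumed shape: A is the root; B is an internal node; C, D, D are leaves.
root = Node("A", [Node("C"), Node("B", [Node("D"), Node("D")])])
print(leaves(root))  # ['C', 'D', 'D']
```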
Rule. A rule has an antecedent, or precondition, which is a logical expression, and a consequent, or conclusion, which is a predicate. There are two kinds of rule: classification rules and association rules.
In a classification rule, the consequent gives the class or classes that apply to the instances covered by the rule, and all the tests in the antecedent must succeed for the rule to fire. Rules can be grouped in a set and fired in order as a decision list. Each rule seems to represent an independent nugget of knowledge, so that a new rule can be added to an existing rule set without disturbing the ones already there, but it is important to consider how the set of rules is executed. The figure below illustrates two examples of classification rules.
Unlike classification rules, association rules can predict any attribute, not just the class, and so they can also predict combinations of attributes. Because so many different association rules can be derived from even a tiny database, interest is restricted to those that apply to a reasonably large number of instances and have a high accuracy. The two most important measures are:
- support, or coverage, the number of instances for which both the antecedent and the consequent of the rule hold;
- confidence, or accuracy, the number of instances for which the whole rule holds relative to the number of instances that satisfy its antecedent.
Typically an association rule must reach a minimum value of support and confidence. The figure below illustrates two examples of association rules.
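The following Python sketch shows how a decision list fires its rules in order and how support and confidence can be computed as defined above. The toy instances, the rules and the function names (`covers`, `classify`, `support`, `confidence`) are hypothetical, and an antecedent is assumed to be a simple conjunction of attribute = value tests.

```python
# Hypothetical toy instances.
instances = [
    {"outlook": "sunny", "humidity": "high", "play": "no"},
    {"outlook": "sunny", "humidity": "high", "play": "no"},
    {"outlook": "sunny", "humidity": "normal", "play": "yes"},
    {"outlook": "rainy", "humidity": "high", "play": "yes"},
    {"outlook": "overcast", "humidity": "high", "play": "yes"},
]


def covers(tests, instance):
    """True when every attribute = value test succeeds on the instance."""
    return all(instance.get(attr) == value for attr, value in tests.items())


def classify(decision_list, instance, default=None):
    """Fire the first rule, in list order, whose antecedent covers the instance."""
    for antecedent, predicted_class in decision_list:
        if covers(antecedent, instance):
            return predicted_class
    return default


def support(rows, antecedent, consequent):
    """Number of instances for which antecedent and consequent both hold."""
    return sum(covers({**antecedent, **consequent}, row) for row in rows)


def confidence(rows, antecedent, consequent):
    """Support divided by the number of instances satisfying the antecedent."""
    covered = sum(covers(antecedent, row) for row in rows)
    return support(rows, antecedent, consequent) / covered if covered else 0.0


# Classification rules as a decision list; the empty antecedent is a catch-all.
decision_list = [({"outlook": "sunny", "humidity": "high"}, "no"), ({}, "yes")]
print(classify(decision_list, instances[0]))        # no
print(classify(decision_list[::-1], instances[0]))  # yes (order matters)

# Association rule: humidity = high -> play = no.
print(support(instances, {"humidity": "high"}, {"play": "no"}))     # 2
print(confidence(instances, {"humidity": "high"}, {"play": "no"}))  # 0.5
```

Reversing the decision list changes the prediction for the same instance, which is why the execution order of a rule set has to be considered even though each rule looks independent.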
Cluster. It groups similar instances according to one or more characteristics. The output takes the form of a diagram that shows how the instances fall into the clusters. Some algorithms allow one instance to belong to more than one cluster. The figure reported below illustrates how 5 instances have been clustered in 2 different clusters.
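As a rough illustration, the sketch below groups 5 one-dimensional instances into 2 clusters. The text does not name a clustering algorithm; a very small k-means variant is assumed here, and the values, the function name `kmeans_1d` and the naive initialisation (the first k values as starting centroids) are all hypothetical. In this variant each instance belongs to exactly one cluster.

```python
def kmeans_1d(values, k, iterations=10):
    """Tiny k-means on one numeric attribute.

    Returns (labels, centroids), where labels[i] is the cluster index
    assigned to values[i].
    """
    centroids = list(values[:k])  # naive, deterministic initialisation
    labels = [0] * len(values)
    for _ in range(iterations):
        # Assignment step: each instance joins its nearest centroid.
        labels = [
            min(range(k), key=lambda c: abs(v - centroids[c])) for v in values
        ]
        # Update step: each centroid moves to the mean of its members.
        for c in range(k):
            members = [v for v, lab in zip(values, labels) if lab == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = sum(members) / len(members)
    return labels, centroids


values = [1.0, 1.5, 2.0, 8.0, 9.0]  # 5 hypothetical instances
labels, centroids = kmeans_1d(values, k=2)
print(labels)     # [0, 0, 0, 1, 1]
print(centroids)  # [1.5, 8.5]
```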
The patterns discovered by data mining must:
- be meaningful in that they lead to some advantage, usually an economic advantage;
- allow nontrivial forecasting;
- infer previously unknown information.
Data mining applications involve four basic styles of learning:
- classification, the learning scheme is presented with a set of classified examples from which it is expected to learn a way of classifying unseen examples;
- association, relations among attribute values are sought, not only those that predict the class;
- clustering, instances belonging to the same category are grouped in a cluster;
- numeric prediction, the outcome to be predicted is a numeric quantity.
The terms machine learning and data mining are commonly confused, as they often employ the same methods. Machine learning focuses on prediction, based on known properties learned from the training data, while data mining focuses on the discovery of unknown properties in the data. The two areas overlap in many ways: data mining uses many machine learning methods, but often with a slightly different goal in mind. Conversely, machine learning also employs data mining methods, either as "unsupervised learning" or as a preprocessing step to improve learner accuracy.