Design and Development of Apriori Algorithm for Sequential to Concurrent Mining Using MPI

Abstract: Owing to the rise of big data and massive data processing, there are growing concerns about the temporal aspects of data processing. To address these issues, rapid progress is being made in data collection, storage technologies, and the design and implementation of large-scale parallel algorithms for data mining. In this regard, the Apriori algorithm has had a great impact on finding frequent itemsets using candidate generation. This paper presents a parallel algorithm for mining association rules that uses MPI for message passing in a Master-Slave structural model.


INTRODUCTION
Data mining can be characterized as the methodology of discovering hidden patterns in databases. The main objective of data mining is to turn data into knowledge. Association rule mining is one kind of data mining process [1]. It extracts interesting correlations, patterns, and associations among items in a transaction database or other data repositories [2]. Association rules are widely used in areas such as telecommunication networks, marketing, risk management, and inventory control. Data mining is often applied directly to enormous databases that have hundreds of attributes and a huge number of records containing complex relationships among the data, which inevitably leads to a dramatic increase in the size of the search space [3]. Parallel data mining can clearly improve efficiency. Hence, designing parallel algorithms for efficiently mining association rules on high-performance parallel workstations has become an important problem.
In data mining, Apriori is a classic algorithm for learning association rules. Apriori is designed to operate on databases containing transactions, for example, collections of items bought by customers or details of website visits [4].
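As a minimal sketch of the level-wise Apriori idea described above (the function and variable names are our own, and an absolute support count is assumed rather than a percentage):

```python
from itertools import combinations

def apriori(db, min_support):
    """Level-wise frequent-itemset mining (sketch).

    db: list of transactions (sets of items); min_support: absolute count.
    Returns a dict mapping each frequent itemset (frozenset) to its count.
    """
    # Pass 1: count single items to obtain L1.
    counts = {}
    for t in db:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    frequent = {s: c for s, c in counts.items() if c >= min_support}
    result = dict(frequent)
    k = 2
    while frequent:
        # Candidate generation Ck: k-subsets of the surviving items,
        # pruned so that every (k-1)-subset is itself frequent.
        items = sorted({i for s in frequent for i in s})
        candidates = [frozenset(c) for c in combinations(items, k)
                      if all(frozenset(sub) in frequent
                             for sub in combinations(c, k - 1))]
        # One pass over the database to count the candidates.
        counts = {c: sum(1 for t in db if c <= t) for c in candidates}
        frequent = {s: c for s, c in counts.items() if c >= min_support}
        result.update(frequent)
        k += 1
    return result
```

For example, with the toy database `[{"a","b","c"}, {"a","b"}, {"a","c"}, {"b","c"}]` and a minimum support of 2, the sketch finds every single item (count 3) and every pair (count 2) frequent, while the triple {a, b, c} occurs only once and is discarded.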
Association rule mining is a vital area of research in data mining. A number of potentially interesting relationships can be found in large amounts of data by mining the relationships between the itemsets of the database [5]. These relationships play an important guiding and reference role in market basket analysis, cross-selling of commodities, and business decision-making such as advertising mail analysis [6].
In this paper, in order to achieve high-performance parallel computing, we present an algorithm that uses a Master-Slave structure and communicates between hosts via MPI, making full use of the workstations' resources through unified scheduling and coordinated processing in a cluster environment.

ASSOCIATION RULE
The association rules problem is as follows. Let I = {i1, i2, ..., in} be a set of n binary attributes called items. Let D = {t1, t2, ..., tm} be a set of transactions called the database. Each transaction in D has a unique transaction ID and contains a subset of the items in I. A rule is defined as an implication of the form X ⇒ Y, where X ⊆ I, Y ⊆ I, and X ∩ Y = ∅.

Measures Association Rule
Basically, association mining is about discovering a set of rules that is shared among a large percentage of the data [7]. Association rule mining tends to produce a large number of rules; the objective is to find the rules that are useful to users. There are two methods of measuring usefulness: objective and subjective. Objective measures involve statistical analysis of the data, such as support and confidence [8].

Confidence
A rule X ⇒ Y holds with confidence c if c% of the transactions in D that contain X also contain Y. Rules with a confidence greater than a user-specified threshold are said to satisfy minimum confidence [8].
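As an illustration of these objective measures, the following sketch computes support and confidence for a rule X ⇒ Y over a small transaction database (the item names and function names are hypothetical, not from the paper):

```python
def support(db, itemset):
    """Fraction of transactions containing every item in `itemset`."""
    hits = sum(1 for t in db if itemset <= t)
    return hits / len(db)

def confidence(db, x, y):
    """Of the transactions containing X, the fraction that also contain Y."""
    return support(db, x | y) / support(db, x)

D = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]

print(support(D, {"bread", "milk"}))       # 2 of 4 transactions contain both
print(confidence(D, {"bread"}, {"milk"}))  # 2 of the 3 bread transactions
```

Here the rule {bread} ⇒ {milk} has support 0.5 and confidence 2/3, so it would survive a minimum-confidence threshold of, say, 60%.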

IDENTIFYING LARGE KEYWORD SETS
Numerous association rule algorithms are available, but among them the classic Apriori algorithm is the standard for data mining. The algorithm spends a great deal of time on I/O because it must repeatedly scan the database and produce a large number of frequent itemsets, which results in very low mining efficiency [9]. Therefore, in order to improve the efficiency of the Apriori algorithm, it has been improved to a large extent in search of an efficient and reliable algorithm for mining frequent itemsets, but most of these efforts are confined to optimizing and improving the serial algorithm [6]. Optimizing the serial algorithm improves efficiency to a certain extent, but it still runs on a single computer; in theory, if N workstations run the Apriori algorithm, efficiency will be enhanced further. Therefore, attempts have been made to parallelize association rule mining [10].

PARALLEL ASSOCIATION RULE MINING BASED ON MPI
The key to parallel association rule mining is handling the communication between processors and the load balancing well, so this paper presents a parallel association rule mining algorithm based on MPI [12]. In the algorithm design, MPI is used to distribute resources uniformly in order to achieve better load balancing. A centralized architecture is adopted: one processor acts as the control processor, dedicated to generating the global frequent itemsets and responsible for exchanging information with the other processors, while the remaining processors act as workstation processors, responsible only for generating the local candidate sets, pruning, and counting. There is no exchange of information between the workstations, so that communication time is reduced and efficiency is enhanced [13].

PARALLEL APRIORI ALGORITHM
The algorithm assumes a shared-nothing architecture in which each processor has a private memory and a private disk. The processors are connected by a communication network and can communicate only by passing messages [14]. The communication primitives used by our algorithms are part of the MPI (Message Passing Interface) communication library supported on the IBM SP and are a subset of a message-passing communication standard currently under discussion. Data is distributed equally across the disks attached to the processors [11,14]: each processor's disk holds roughly an equal number of transactions, and we do not require transactions to be placed on the disks in any particular way. Parallelism in the Apriori algorithm can be achieved in different ways: at the instruction level, the data level, or the control level [15]. We follow data-level parallelism. The given database is used to generate the dominant group and is also divided into N partitions, and each partition is assigned to a processor. Data-level parallelism of the Apriori algorithm addresses the problem of finding all frequent keyword sets and generating the rules from the frequent keyword sets [16]. Refer to table 4 for a summary of the notation used in the algorithm description. We use superscripts to indicate the processor id or rank and subscripts to indicate the pass number (also the size of the keyword set).

Ck: set of candidate k-keyword sets (potentially frequent keyword sets). Each member of this set has two fields: 1) keyword set and 2) support count.
Our proposed data-level parallelism approach performs non-redundant computations in parallel, and we avoid communication between the child (slave) processors.
1. Select one processor to be the master; the other N-1 processors are slaves.
2. The master processor divides the data equally among the N-1 slave processors.
3. Each processor Pi receives its part of the database from the parent (master) processor, 0 < i < N.
4. Processor Pi performs a pass over its data partition Di and develops a local support count for the candidates in Ck.
5. Each processor Pi then computes Lk from Ck.
6. Each processor Pi sends its local frequent itemsets to the master processor.
7. The master processor gathers these summaries to generate the global frequent itemsets.
8. The master processor partitions the frequent itemsets and sends them to the local processors together with the global frequent itemsets.
The cycle repeats until the most frequent itemsets are found. Finally, the master processor combines the output of the nodes to generate the set of globally most frequent itemsets and deletes redundant information according to the confidence.
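The master-slave loop above can be simulated in plain Python; this is only a sketch under our own assumptions: the function names (`local_counts`, `merge_counts`, `parallel_apriori`) are ours, and ordinary function calls stand in for the MPI sends, receives, and gathers of the real algorithm.

```python
from itertools import combinations

def local_counts(partition, candidates):
    """Slave step: one pass over the local partition Di, counting Ck."""
    return {c: sum(1 for t in partition if c <= t) for c in candidates}

def merge_counts(all_counts):
    """Master step: merge the gathered local counts into global counts."""
    merged = {}
    for counts in all_counts:
        for c, n in counts.items():
            merged[c] = merged.get(c, 0) + n
    return merged

def parallel_apriori(db, num_slaves, min_support):
    """Simulated master-slave Apriori (steps 1-8 above, no real MPI)."""
    # Step 2: the master divides the database equally among the slaves.
    partitions = [db[i::num_slaves] for i in range(num_slaves)]
    candidates = {frozenset([i]) for t in db for i in t}  # C1
    result, k = {}, 1
    while candidates:
        # Steps 4-6: each "slave" counts the candidates over its partition.
        gathered = [local_counts(p, candidates) for p in partitions]
        # Step 7: the master merges local counts into global Lk.
        frequent = {c: n for c, n in merge_counts(gathered).items()
                    if n >= min_support}
        result.update(frequent)
        # Step 8: the master generates Ck+1 and sends it back to the slaves.
        k += 1
        items = {i for s in frequent for i in s}
        candidates = {frozenset(c) for c in combinations(sorted(items), k)
                      if all(frozenset(s) in frequent
                             for s in combinations(c, k - 1))}
    return result
```

Because the global counts are the sum of the per-partition counts, the simulated run returns exactly the frequent itemsets a sequential Apriori would find on the whole database; in a real deployment the list comprehension over `partitions` would run concurrently on the slave ranks, with the gather and broadcast done by MPI primitives.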