Design and Analysis of JIC Algorithm on Big Data
Abstract
Computers have significantly impacted various fields, but managing vast amounts of information remains challenging. Artificial intelligence helps machines make data-driven decisions, yet large datasets still pose difficulties for researchers. To tackle these challenges and gain deep insights from large datasets, we proposed a novel technique that uses Geometric Progression series numbers (GPLN) to label singleton frequent itemsets and Cumulative Geometric Progression series numbers (CGPLN) to label itemsets containing multiple frequent items. Initially, the algorithm used 2 as the constant 'r' for generating the series, but this proved inadequate for large datasets. This paper therefore proposes the Jagged Itemset Counting (JIC) algorithms 1 and 2, which reduce the value of 'r' and introduce dotted pairs for CGPLN labels to represent frequent itemsets. The redefined methodology requires two passes over the transaction database: the first pass performs pre-processing, identifies singleton (1-k) frequent itemsets, determines GPLNs, and partitions the dataset. The partitions are then processed sequentially: JIC-Algorithm-1 is applied to the first partition and JIC-Algorithm-2 to the remaining partitions. The n-k frequent itemsets from all partitions are combined using the Join algorithm, together with the 1-k itemsets identified earlier. For small and medium-sized datasets, the JIC methodology outperforms the Apriori and Eclat algorithms, achieving better execution times even at low support thresholds. In Big Data scenarios, where FP-Growth and Eclat struggled, the proposed methodology excelled in execution time, main memory consumption, and disk memory utilization.
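To make the labelling idea concrete, the sketch below illustrates GPLN labels for singleton frequent items and a summed CGPLN label for a multi-item itemset, assuming r = 2 and a first term of 1 as in the original scheme; the item names and function names are illustrative and not taken from the paper. It is a minimal sketch of the labelling step only, not of the JIC algorithms or dotted pairs introduced for reduced values of 'r'.

```python
# Minimal sketch of GPLN/CGPLN labelling, assuming r = 2 and a first term of 1.
# Function and item names are illustrative, not from the paper.

def gpln_labels(frequent_items, r=2, first_term=1):
    """Assign each singleton frequent item a Geometric Progression number:
    first_term, first_term*r, first_term*r^2, ..."""
    return {item: first_term * r ** i for i, item in enumerate(sorted(frequent_items))}

def cgpln(itemset, labels):
    """Label a multi-item itemset with the cumulative (summed) GPLNs of its members."""
    return sum(labels[item] for item in itemset)

if __name__ == "__main__":
    items = ["bread", "butter", "milk", "eggs"]
    labels = gpln_labels(items)
    print(labels)                           # {'bread': 1, 'butter': 2, 'eggs': 4, 'milk': 8}
    print(cgpln({"bread", "milk"}, labels)) # 1 + 8 = 9
```

With r = 2 every itemset maps to a distinct sum (a bitmask-like encoding), e.g. {bread, milk} is the only itemset whose CGPLN is 9; handling a reduced 'r', where the paper introduces dotted pairs for CGPLN labels, is beyond this sketch.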