In this assignment, you will implement the Apriori algorithm in Hadoop.
***********
1. Overview
***********
The Apriori algorithm finds all itemsets whose support is greater than or equal to minsup.
The algorithm makes multiple passes over the transaction data.
In the first pass, the support of each individual item is counted and the frequent items are determined (based on minsup).
In each subsequent pass, the seed set of itemsets found to be frequent in the previous pass is used to generate new potentially frequent itemsets, called candidate itemsets;
their actual support is then counted while making a pass over the transaction data.
At the end of the pass, the itemsets that satisfy the minimum support constraint are collected, that is, the frequent itemsets are determined, and they become the seed for the next pass.
This process is repeated until no new frequent itemsets are found. (In this assignment, the number of passes is given as a parameter.)
In this assignment, you will implement the Apriori algorithm in the MapReduce (Hadoop) framework.
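The overall pass structure can be sketched as follows (a minimal sketch only, assuming hypothetical helper names such as runPass1(), generateCandidates(), and runPassK(); the template's AprioriDriver may organize this differently):

  // One MapReduce job per pass: pass 1 counts individual items, and each
  // later pass counts the candidates generated from the previous result.
  // runPass1(), generateCandidates(), runPassK() are hypothetical helpers.
  List<ItemSet> frequent = runPass1(minsup);                   // frequent 1-itemsets
  for (int k = 1; k < numPass && !frequent.isEmpty(); k++) {
      List<ItemSet> candidates = generateCandidates(frequent); // candidate (k+1)-itemsets
      frequent = runPassK(candidates, minsup);                 // frequent (k+1)-itemsets
  }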
We briefly describe the algorithm below; in the description we use the following notation:
L_k : frequent itemsets of length k.
C_k : candidate frequent itemsets of length k.
To find the 1-itemsets (frequent itemsets of size 1, L_1) in MapReduce, we simply emit (item, 1) in the Map phase, then add the counts up in the Reduce phase.
To generate C_(k+1) from L_k in MapReduce, we do the following:
In the pre-Map phase (in the setup() function, which is invoked once for each map instance):
1) Self-join L_k to create itemsets of size k+1.
In the self-join step, you can only generate a (k+1)-itemset from two k-itemsets that have the same first (k-1) items.
For example,
Case 1
Suppose we have two itemsets of length 3 (list[0..2]):
[1, 2, 3], [1, 2, 4]
In this case, the itemsets are the same from index 0 to index 1, so you generate the new 4-itemset [1, 2, 3, 4] and then prune it (see step 2; a sketch of steps 1 and 2 is given right after step 3).
Case 2
Suppose we have two itemsets of length 3 (list[0..2]):
[1, 2, 3], [2, 3, 4]
In this case, you only check index 0 and break;
you do not need to generate a (k+1)-itemset.
Think about why a (k+1)-itemset is generated only from pairs whose first (k-1) items are identical.
2) For each k-subset of each joined itemset, check whether it is in L_k; we do this check approximately by keeping the hash codes of the itemsets in L_k in a set and checking against this set.
If all of the subsets are in L_k, then add the itemset to C_(k+1).
3) Add the itemsets in C_(k+1) into the Trie (implemented in the previous assignment).
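A minimal sketch of steps 1) and 2), assuming itemsets are kept as lexicographically sorted ArrayList<Integer> and that L_k itself is sorted; the names selfJoinAndPrune and lkHashes are illustrative, not the template's exact API (in the template, getCandidateItemSets(..) and getSubSets(..) presumably play this role):

  import java.util.ArrayList;
  import java.util.List;
  import java.util.Set;

  public class CandidateSketch {
      // Join pairs of k-itemsets that share their first (k-1) items, then
      // keep only the candidates whose every k-subset is in L_k.
      static List<ArrayList<Integer>> selfJoinAndPrune(
              List<ArrayList<Integer>> lk, Set<Integer> lkHashes) {
          List<ArrayList<Integer>> ck1 = new ArrayList<>();
          for (int i = 0; i < lk.size(); i++) {
              for (int j = i + 1; j < lk.size(); j++) {
                  ArrayList<Integer> a = lk.get(i), b = lk.get(j);
                  int k = a.size();
                  // Case 2: the first (k-1) items differ; since L_k is
                  // sorted, no later j can match either, so break.
                  if (!a.subList(0, k - 1).equals(b.subList(0, k - 1))) break;
                  // Case 1: the prefixes match; join into a (k+1)-itemset.
                  ArrayList<Integer> cand = new ArrayList<>(a);
                  cand.add(b.get(k - 1));
                  if (allSubsetsFrequent(cand, lkHashes)) ck1.add(cand); // prune
              }
          }
          return ck1;
      }

      // Approximate membership test: every k-subset's hash code must be in
      // the set of hash codes built from L_k.
      static boolean allSubsetsFrequent(ArrayList<Integer> cand,
                                        Set<Integer> lkHashes) {
          for (int drop = 0; drop < cand.size(); drop++) {
              ArrayList<Integer> sub = new ArrayList<>(cand);
              sub.remove(drop);          // remove the element at index 'drop'
              if (!lkHashes.contains(sub.hashCode())) return false;
          }
          return true;
      }
  }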
In the n-Map phase: (n > 1)
4) Read a part of the transaction file and, for each transaction, call the findItemSets() function to get the candidate itemsets appearing in that transaction.
5) Emit each of these itemsets as (itemset, 1).
In the Reduce phase:
6) Add the counts up and, if the total is at least minsup, add the itemset into L_(k+1).
The following example illustrates how the above algorithm runs.
* number of transactions = 6.
* min_sup = 40%.
* minimum support count = 2 (40% of 6 transactions, rounded down).
Iteration 1 (find 1-itemsets):
Database (A, 1) Output
+-----+----------------+ (B, 1) +---------------+
| TID | Items | +--------+ (C, 1) ... +------------+ | <Item, Count> |
| t1 | A, B, C |--->| Map - 1| ----------------> | Reduce - 1 | +---------------+
| t2 | A, C | +--------+ +-> +------------+ | <A, 3> |
| t3 | B, C, D, E | ------------| | <B, 2> |
| t4 | A, D | +--------+ / +------------+ | <C, 4> |
| t5 | E |--->| Map - 2| ----------------> | Reduce - 2 | | <D, 3> |
| t6 | C, D, E, F, G | +--------+ +------------+ | <E, 5> |
+-----+----------------+ +---------------+
|
Iteration 2 (find 2-itemsets) +--------------------------------------------------+
| Use the previous output
Database | (in setup() build Trie of C_(k+1) ) Output
+-----+----------------+ v +---------------+
| TID | Items | +--------+ +------------+ | <Item, Count> |
| t1 | A, B, C |--->| Map - 1| ----------------> | Reduce - 1 | +---------------+
| t2 | A, C | +--------+ +-> +------------+ | <A C, 2> |
| t3 | B, C, D, E | ------------| | <B C, 2> |
| t4 | A, D | +--------+ / +------------+ | <C D, 2> |
| t5 | E |--->| Map - 2| ----------------> | Reduce - 2 | | <C E, 2> |
| t6 | C, D, E, F, G | +--------+ ((A,C), 1) +------------+ | <D E, 2> |
+-----+----------------+ ((B,C), 1) +---------------+
... (assuming (A,C), (B,C) are in the Trie)
...
Iteration k
(repeated until L_k is empty or the given number of passes has been run)
************
2. Template Code
************
In the template code, we provide the following classes:
AprioriDriver class
AprioriPass1Mapper class
AprioriPassKMapper class
AprioriReducer class
ItemSet class
Transaction class
Trie class
TrieNode class
AprioriUtils class
AprioriDriver class : sets up the execution information for the MapReduce jobs.
AprioriPass1Mapper class :
It reads the dataset line by line and emits <individual item, 1>.
Each dataset line is composed of [line number] [itemset].
For example,
sample.txt
+---------------+
| 1 1 3 4 2 5 | -> line number : 0 Items : 1 3 4 2 5
| 2 2 3 5 | -> line number : 1 Items : 2 3 5
| 3 1 2 3 5 | -> line number : 2 Items : 1 2 3 5
| 4 2 5 | -> line number : 3 Items : 2 5
+---------------+
In this class, the key is the line number and txnRecord holds the items.
You should emit in this form: [individual item] 1
e.g. [1] 1
[3] 1
[4] 1
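As a rough sketch (assuming the usual Hadoop Text/IntWritable types; the bracketed key simply mirrors the emit format shown above), map() could look like this:

  import java.io.IOException;
  import java.util.StringTokenizer;
  import org.apache.hadoop.io.IntWritable;
  import org.apache.hadoop.io.LongWritable;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapreduce.Mapper;

  public class Pass1MapperSketch
          extends Mapper<LongWritable, Text, Text, IntWritable> {
      private static final IntWritable ONE = new IntWritable(1);

      @Override
      public void map(LongWritable key, Text value, Context context)
              throws IOException, InterruptedException {
          StringTokenizer st = new StringTokenizer(value.toString());
          if (!st.hasMoreTokens()) return; // skip empty lines
          st.nextToken();                  // skip the leading line number
          while (st.hasMoreTokens()) {
              // Emit <[item], 1> for each individual item in the line.
              context.write(new Text("[" + st.nextToken() + "]"), ONE);
          }
      }
  }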
AprioriPassKMapper class:
Pre-Map phase
n-Map phase
You should emit in this form: [itemset] 1
e.g. if the itemset length is 2,
[1, 2] 1
[2, 3] 1
[3, 4] 1
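A rough sketch of the n-Map phase inside AprioriPassKMapper (with the same imports as the pass-1 sketch). It assumes setup() has already built the Trie of C_(k+1) into a field named trie, that Transaction can be constructed from a line of text, and that ItemSet.toString() prints the [i1, i2, ...] form shown above; all of these are assumptions, not the template's confirmed API:

  @Override
  public void map(LongWritable key, Text value, Context context)
          throws IOException, InterruptedException {
      // Parse one line of the transaction file (hypothetical constructor).
      Transaction txn = new Transaction(value.toString());
      ArrayList<ItemSet> matched = new ArrayList<ItemSet>();
      // Collect the candidate itemsets that occur in this transaction.
      trie.findItemSets(matched, txn);
      for (ItemSet itemset : matched) {
          // Emit <[i1, i2, ...], 1> for each matched candidate.
          context.write(new Text(itemset.toString()), new IntWritable(1));
      }
  }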
AprioriReducer class:
Reduce phase (step 6 above).
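A minimal sketch of reduce(), assuming the absolute support threshold (minSup * number of transactions) has been computed in setup() and stored in a field named minSupCount (an illustrative name):

  @Override
  public void reduce(Text key, Iterable<IntWritable> values, Context context)
          throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable v : values) {
          sum += v.get();              // add the partial counts up
      }
      // Only itemsets meeting the minimum support go into L_(k+1).
      if (sum >= minSupCount) {
          context.write(key, new IntWritable(sum));
      }
  }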
Transaction class:
It represents a transaction as a list of Integers.
AprioriUtils class:
It contains utility methods for the Apriori algorithm.
TrieNode class
Trie class
Refer to assignment 3.
You should use the classes that you implemented there, changing some method parameters:
findItemSets(ArrayList<ItemSet> matchedItemSet, ArrayList<Integer> transaction)
-> findItemSets(ArrayList<ItemSet> matchedItemSet, Transaction transaction)
ItemSet class
Refer to assignment 3,
but do not reuse the previous file, because the previous ItemSet class has some errors.
*********
3. What you need to implement
*********
In the template code, you will find comments containing the word COMPLETE (all in capitals).
You are required to implement the necessary code in those locations.
Here we list all the functions you need to implement:
1) AprioriPass1Mapper class
- map()
2) AprioriPassKMapper class
- setup()
- map()
3) AprioriReducer class
- reduce()
4) Trie class
- Use your assignment3 files.
5) AprioriUtils class
- getCandidateItemSets(..)
- generateItemSetMap(..)
- getSubSets(..)
********
4. Building the Project
********
On ferrari,
First, you should set the environment variables.
Open the file by typing vi ~/.bashrc
and, at the bottom of the .bashrc, type exactly as shown below:
+-------------------------------------------------------------------+
.bashrc file
export JAVA_HOME=/usr/lib/jvm/java-8-oracle
export HADOOP_INSTALL=/scratch/hadoop-2.7.3
export HADOOP_HOME=$HADOOP_INSTALL
export PATH=$PATH:$HADOOP_INSTALL/bin
export PATH=$PATH:$HADOOP_INSTALL/sbin
export HADOOP_MAPRED_HOME=$HADOOP_INSTALL
export HADOOP_COMMON_HOME=$HADOOP_INSTALL
export HADOOP_HDFS_HOME=$HADOOP_INSTALL
export HADOOP_YARN_HOME=$HADOOP_INSTALL
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_INSTALL/lib/native
export HADOOP_OPTS="-Djava.library.path=$HADOOP_INSTALL/lib"
export HADOOP_CONF_DIR=$HADOOP_INSTALL/etc/hadoop/
+-------------------------------------------------------------------+
Second, after completing your implementation, execute run.sh (./run.sh).
(If you edited .bashrc in your current shell, run source ~/.bashrc first so that the variables take effect.)
You can ignore output like the following:
Note: ./src/apriori/AprioriDriver.java uses or overrides a deprecated API.
Note: Recompile with -Xlint:deprecation for details.
adding: apriori/(in = 0) (out= 0)(stored 0%)
adding: apriori/AprioriPass1Mapper.class(in = 2060) (out= 914)(deflated 55%)
adding: apriori/AprioriReducer.class(in = 2320) (out= 1013)(deflated 56%)
adding: apriori/AprioriPassKMapper.class(in = 4613) (out= 2128)(deflated 53%)
adding: apriori/AprioriDriver.class(in = 3351) (out= 1734)(deflated 48%)
adding: utils/(in = 0) (out= 0)(stored 0%)
adding: utils/AprioriUtils.class(in = 3058) (out= 1666)(deflated 45%)
adding: trie/(in = 0) (out= 0)(stored 0%)
adding: trie/TrieNode.class(in = 1179) (out= 670)(deflated 43%)
adding: trie/Trie.class(in = 1795) (out= 961)(deflated 46%)
adding: list/(in = 0) (out= 0)(stored 0%)
adding: list/ItemSet.class(in = 1300) (out= 748)(deflated 42%)
adding: list/Transaction.class(in = 957) (out= 565)(deflated 40%)
Third, enter the following command:
hdfs dfs -put dataset/[dataset-file]
(you can check that the file is stored in HDFS with the following command: hdfs dfs -ls)
Fourth, enter the following command:
hadoop jar HadoopBasedApriori.jar apriori.AprioriDriver [dataset file] [output file] [numPass] [minSup] [number of transactions]
(Note that [output file] is a prefix of the output file names: pass k writes its results to [output file]k.)
After running the above command, you can see the results (hdfs://[output file][passnum]/part-r-00000):
hdfs dfs -cat output[passnum]/part-r-00000
You can also copy the files to the local file system:
hdfs dfs -get output[passnum]
Note: if you reuse the same output filename, you must always remove the old output files before starting the MapReduce job:
hdfs dfs -rmr [file to remove]
HDFS shell reference: https://hadoop.apache.org/docs/r2.7.3/hadoop-project-dist/hadoop-common/FileSystemShell.html#get
********
5. DataSet
********
1. sample
number of transactions : 6
2. mushroom.dat
number of transactions : 8124
3. retail.dat
number of transactions : 88162
4. connect.dat
number of transactions : 67557
********
6. Usage Examples
********
hdfs dfs -put dataSet/sample
hadoop jar HadoopBasedApriori.jar apriori.AprioriDriver sample output 3 0.5 6
minsup = 50%
output1/part-r-00000
[1] 4
[2] 6
[3] 4
[5] 5
output2/part-r-00000
[1, 2] 4
[1, 3] 3
[1, 5] 3
[2, 3] 4
[2, 5] 5
[3, 5] 3
output3/part-r-00000
[1, 2, 3] 3
[1, 2, 5] 3
[2, 3, 5] 3
hdfs dfs -put dataSet/mushroom.dat
hadoop jar HadoopBasedApriori.jar apriori.AprioriDriver mushroom.dat output 5 0.5 8124
.
.
output5/part-r-00000
[24, 34, 85, 86, 90] 4208
[34, 36, 39, 85, 86] 4346
[34, 36, 59, 85, 86] 4232
[34, 36, 85, 86, 90] 6272
[34, 39, 85, 86, 90] 4784
[34, 53, 85, 86, 90] 4608
[34, 59, 85, 86, 90] 4544
[34, 63, 85, 86, 90] 4304
*********
7. Requirements
*********
- You need to fill in the blanks in the template code.
(Search for the string 'COMPLETE', which marks the places where you need to implement your code.)
- You must write a report describing your implementation. The report needs to be in Word or PDF format.
- You must name your submission files as follows:
hw4_studentNumber.tar, hw4_studentNumber.pdf (or .docx)
and submit both files (hw4_studentNumber.tar, hw4_studentNumber.pdf)
- You must fully describe the code you have implemented.
- You must test how much the solution space is reduced by pruning.
- You must explain why, in the self-join step, a (k+1)-itemset is generated only from pairs of k-itemsets whose first (k-1) items are identical.
*********
8. Submission
*********
You must submit the report for this assignment as well as your source code.
The report must be in PDF/Word format and include the following:
- A description of the code you have implemented.
- A test of how much the solution space is reduced by pruning.
- An explanation of why, in the self-join step, we generate a (k+1)-itemset only from pairs of k-itemsets whose first (k-1) items are identical.