Skip to content

Application for Hebrew word prediction based on Amazon Elastic Map-Reduce ( Hadoop ).

Notifications You must be signed in to change notification settings

AradCarmi/Hadoop-Word-Prediction

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

1 Commit
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

number_of_instance: 2
ec2_instances: m4.large
hSadoop.version: 2.10.1
emr_version: emr-5.32.0

link to output file: s3://output-hadoop-test/outputfinal/part


------------------------------------How to run the project-------------------------

Run RunJob main class with params jar : s3://input-file-hadoop/<jarfile>, input : s3://input-file-hadoop/<input> 
,output: s3://output-hadoop-test/<output>
The RunJob main will run an EMR Cluster which has 5 steps that do the following:
Step 1- Sums up the occurrences by year in both parts of the corpus into a single file in the following format(tr: r0) ,(tr: r1).
Step 2- Combine the outcomes of Step1 into a single file in the following format: (Tr: r0 r1).Also sum the total number of trigram in the corpus (n) and upload it to s3 bucket.
Step 3- Calculate and append the Nr0 and Tr01 for each trigram in the Step2 output file.  format: (tr: r0 r1 tr01 nr0)
Step 4- Calculate and append the Nr1 and Tr10 for each trigram in the Step3 output file.  format: (tr: r0 r1 tr01 nr0 tr10 nr1)
Step 5- Calculate the Probability for each trigram in the output of Step4 which is the last step of the program.
final file format(tr: pr(tr)).
----------------------------------*Interesting Trigrams*---------------------------

-------------------------------אכפת לך Trigram------------------------------------ 

אכפת לך .	1.6722451814285212E-7
אכפת לך ?	7.188126477404688E-8
אכפת לך אם	1.0268893965762584E-7
אכפת לך מה	8.214684991071963E-8
אכפת לך שאני	3.667596224308859E-8


-------------------------------אכתוב לך Trigram------------------------------------ 

אכתוב לך .	1.1882409088271745E-7
אכתוב לך את	7.188126477404688E-8
אכתוב לך יותר	3.5208923753365046E-8
אכתוב לך מה	4.694523167115339E-8
אכתוב לך מכתב	6.748180278357322E-8
אכתוב לך עוד	6.307859396040948E-8
אכתוב לך על	1.6722451814285212E

-------------------------------בגלל שאין Trigram------------------------------------ 

בגלל שאין לה	1.7604461876682523E-8
בגלל שאין להם	2.7873731304747326E-8
בגלל שאין לו	6.601394507111965E-8
בגלל שאין לי	8.947967230303418E-8
בגלל שאין לנו	1.7604461876682523E-8

-------------------------------בגלל שלא Trigram------------------------------------ 


בגלל שלא היה	1.3202521146782627E-7
בגלל שלא היו	9.975523140644139E-8
בגלל שלא היתה	4.40111546917063E-8
בגלל שלא רצה	3.227484677391796E-8
בגלל שלא רציתי	5.281338563004756E-8

-------------------------------בגלל תנאי Trigram------------------------------------ 

בגלל תנאי האקלים	4.694523167115339E-8
בגלל תנאי החיים	6.161561656838883E-8
בגלל תנאי המלחמה	4.107707771225922E-8
בגלל תנאי העבודה	3.8143000732812135E-8
בגלל תנאי מזג	4.40111546917063E-8

-------------------------------אחד המנהיגים Trigram------------------------------------ 

אחד המנהיגים הבולטים	1.0709380974981869E-7
אחד המנהיגים החשובים	3.5208923753365046E-8
אחד המנהיגים הציוניים	4.40111546917063E-8
אחד המנהיגים הרוחניים	3.0807808284194415E-8
אחד המנהיגים של	9.38848458681329E-8

-------------------------------אחד המרכיבים Trigram------------------------------------ 

אחד המרכיבים הבולטים	3.667596224308859E-8
אחד המרכיבים החשובים	2.464624662735553E-7
אחד המרכיבים המרכזיים	6.748180278357322E-8
אחד המרכיבים העיקריים	1.2614315641822366E-7
אחד המרכיבים של	2.02451311581849E-7

-------------------------------אחד הצדדים Trigram------------------------------------ 

אחד הצדדים .	3.5502331451309753E-7
אחד הצדדים או	6.454620475237879E-8
אחד הצדדים הלוחמים	3.0807808284194415E-8
אחד הצדדים לא	4.254411620198276E-8
אחד הצדדים של	3.667596224308859E-8

-------------------------------אכתי לא Trigram------------------------------------ 

אכתי לא הוה	4.547819318142985E-8
אכתי לא הוי	8.06836631587083E-8
אכתי לא היה	4.40111546917063E-8
אכתי לא ידע	3.667596224308859E-8
אכתי לא ידעינן	8.508017404309765E-8

-------------------------------אל אדוני Trigram------------------------------------ 

אל אדוני אבי	3.227484677391796E-8
אל אדוני המלך	9.828501948022541E-8
אל אדוני יש	5.281338563004756E-8
אל אדוני שעירה	2.420613508043847E-7
אל אדוני .	8.655094053192732E-8

-----------------------------------------Statics---------------------------------------


Step1-

	File System Counters
		FILE: Number of bytes read=121225985
		FILE: Number of bytes written=377503712
		FILE: Number of read operations=0
		FILE: Number of large read operations=0
		FILE: Number of write operations=0
		HDFS: Number of bytes read=3627
		HDFS: Number of bytes written=0
		HDFS: Number of read operations=39
		HDFS: Number of large read operations=0
		HDFS: Number of write operations=0
		S3: Number of bytes read=2621736532
		S3: Number of bytes written=81258482
		S3: Number of read operations=0
		S3: Number of large read operations=0
		S3: Number of write operations=0
	Job Counters 
		Killed map tasks=1
		Launched map tasks=39
		Launched reduce tasks=1
		Data-local map tasks=39
		Total time spent by all maps in occupied slots (ms)=53322240
		Total time spent by all reduces in occupied slots (ms)=4210080
		Total time spent by all map tasks (ms)=1110880
		Total time spent by all reduce tasks (ms)=43855
		Total vcore-milliseconds taken by all map tasks=1110880
		Total vcore-milliseconds taken by all reduce tasks=43855
		Total megabyte-milliseconds taken by all map tasks=1706311680
		Total megabyte-milliseconds taken by all reduce tasks=134722560
	Map-Reduce Framework
		Map input records=83465129
		Map output records=83465128
		Map output bytes=1938833557
		Map output materialized bytes=247482313
		Input split bytes=3627
		Combine input records=83465128
		Combine output records=21257948
		Reduce input groups=3172863
		Reduce shuffle bytes=247482313
		Reduce input records=21257948
		Reduce output records=3172863
		Spilled Records=42515896
		Shuffled Maps =39
		Failed Shuffles=0
		Merged Map outputs=39
		GC time elapsed (ms)=23707
		CPU time spent (ms)=695150
		Physical memory (bytes) snapshot=33822687232
		Virtual memory (bytes) snapshot=134840287232
		Total committed heap usage (bytes)=29107421184

File System Counters
		FILE: Number of bytes read=117541907
		FILE: Number of bytes written=363529811
		FILE: Number of read operations=0
		FILE: Number of large read operations=0
		FILE: Number of write operations=0
		HDFS: Number of bytes read=3534
		HDFS: Number of bytes written=0
		HDFS: Number of read operations=38
		HDFS: Number of large read operations=0
		HDFS: Number of write operations=0
		S3: Number of bytes read=2513049968
		S3: Number of bytes written=81229681
		S3: Number of read operations=0
		S3: Number of large read operations=0
		S3: Number of write operations=0
	Job Counters 
		Killed map tasks=1
		Launched map tasks=38
		Launched reduce tasks=1
		Data-local map tasks=38
		Total time spent by all maps in occupied slots (ms)=52676928
		Total time spent by all reduces in occupied slots (ms)=3177600
		Total time spent by all map tasks (ms)=1097436
		Total time spent by all reduce tasks (ms)=33100
		Total vcore-milliseconds taken by all map tasks=1097436
		Total vcore-milliseconds taken by all reduce tasks=33100
		Total megabyte-milliseconds taken by all map tasks=1685661696
		Total megabyte-milliseconds taken by all reduce tasks=101683200
	Map-Reduce Framework
		Map input records=80006835
		Map output records=80006835
		Map output bytes=1858473280
		Map output materialized bytes=237412377
		Input split bytes=3534
		Combine input records=80006835
		Combine output records=20379707
		Reduce input groups=3172856
		Reduce shuffle bytes=237412377
		Reduce input records=20379707
		Reduce output records=3172856
		Spilled Records=40759414
		Shuffled Maps =38
		Failed Shuffles=0
		Merged Map outputs=38
		GC time elapsed (ms)=23779
		CPU time spent (ms)=674020
		Physical memory (bytes) snapshot=32753164288
		Virtual memory (bytes) snapshot=131428712448
		Total committed heap usage (bytes)=28233957376
Step2

File System Counters
		FILE: Number of bytes read=56078680
		FILE: Number of bytes written=126434930
		FILE: Number of read operations=0
		FILE: Number of large read operations=0
		FILE: Number of write operations=0
		HDFS: Number of bytes read=428
		HDFS: Number of bytes written=0
		HDFS: Number of read operations=4
		HDFS: Number of large read operations=0
		HDFS: Number of write operations=0
		S3: Number of bytes read=162506061
		S3: Number of bytes written=91295486
		S3: Number of read operations=0
		S3: Number of large read operations=0
		S3: Number of write operations=0
	Job Counters 
		Killed map tasks=1
		Launched map tasks=4
		Launched reduce tasks=1
		Data-local map tasks=4
		Total time spent by all maps in occupied slots (ms)=4927104
		Total time spent by all reduces in occupied slots (ms)=1805664
		Total time spent by all map tasks (ms)=102648
		Total time spent by all reduce tasks (ms)=18809
		Total vcore-milliseconds taken by all map tasks=102648
		Total vcore-milliseconds taken by all reduce tasks=18809
		Total megabyte-milliseconds taken by all map tasks=157667328
		Total megabyte-milliseconds taken by all reduce tasks=57781248
	Map-Reduce Framework
		Map input records=6345719
		Map output records=6345719
		Map output bytes=186810791
		Map output materialized bytes=69258245
		Input split bytes=428
		Combine input records=0
		Combine output records=0
		Reduce input groups=3172967
		Reduce shuffle bytes=69258245
		Reduce input records=6345719
		Reduce output records=3172967
		Spilled Records=12691438
		Shuffled Maps =4
		Failed Shuffles=0
		Merged Map outputs=4
		GC time elapsed (ms)=2670
		CPU time spent (ms)=67930
		Physical memory (bytes) snapshot=4027146240
		Virtual memory (bytes) snapshot=18022395904
		Total committed heap usage (bytes)=3529506816
Step3-

File System Counters
		FILE: Number of bytes read=45487383
		FILE: Number of bytes written=91739368
		FILE: Number of read operations=0
		FILE: Number of large read operations=0
		FILE: Number of write operations=0
		HDFS: Number of bytes read=224
		HDFS: Number of bytes written=0
		HDFS: Number of read operations=2
		HDFS: Number of large read operations=0
		HDFS: Number of write operations=0
		S3: Number of bytes read=91306211
		S3: Number of bytes written=133777440
		S3: Number of read operations=0
		S3: Number of large read operations=0
		S3: Number of write operations=0
	Job Counters 
		Killed map tasks=1
		Launched map tasks=2
		Launched reduce tasks=1
		Data-local map tasks=2
		Total time spent by all maps in occupied slots (ms)=2466480
		Total time spent by all reduces in occupied slots (ms)=1985568
		Total time spent by all map tasks (ms)=51385
		Total time spent by all reduce tasks (ms)=20683
		Total vcore-milliseconds taken by all map tasks=51385
		Total vcore-milliseconds taken by all reduce tasks=20683
		Total megabyte-milliseconds taken by all map tasks=78927360
		Total megabyte-milliseconds taken by all reduce tasks=63538176
	Map-Reduce Framework
		Map input records=3172967
		Map output records=3172967
		Map output bytes=93938584
		Map output materialized bytes=45593294
		Input split bytes=224
		Combine input records=0
		Combine output records=0
		Reduce input groups=6774
		Reduce shuffle bytes=45593294
		Reduce input records=3172967
		Reduce output records=3172967
		Spilled Records=6345934
		Shuffled Maps =2
		Failed Shuffles=0
		Merged Map outputs=2
		GC time elapsed (ms)=1995
		CPU time spent (ms)=46150
		Physical memory (bytes) snapshot=2649477120
		Virtual memory (bytes) snapshot=11367759872
		Total committed heap usage (bytes)=2303197184

Step4-

File System Counters
		FILE: Number of bytes read=49311794
		FILE: Number of bytes written=99283306
		FILE: Number of read operations=0
		FILE: Number of large read operations=0
		FILE: Number of write operations=0
		HDFS: Number of bytes read=230
		HDFS: Number of bytes written=0
		HDFS: Number of read operations=2
		HDFS: Number of large read operations=0
		HDFS: Number of write operations=0
		S3: Number of bytes read=133792428
		S3: Number of bytes written=176259954
		S3: Number of read operations=0
		S3: Number of large read operations=0
		S3: Number of write operations=0
	Job Counters 
		Launched map tasks=2
		Launched reduce tasks=1
		Data-local map tasks=2
		Total time spent by all maps in occupied slots (ms)=2629488
		Total time spent by all reduces in occupied slots (ms)=2055840
		Total time spent by all map tasks (ms)=54781
		Total time spent by all reduce tasks (ms)=21415
		Total vcore-milliseconds taken by all map tasks=54781
		Total vcore-milliseconds taken by all reduce tasks=21415
		Total megabyte-milliseconds taken by all map tasks=84143616
		Total megabyte-milliseconds taken by all reduce tasks=65786880
	Map-Reduce Framework
		Map input records=3172967
		Map output records=3172967
		Map output bytes=136420228
		Map output materialized bytes=49312812
		Input split bytes=230
		Combine input records=0
		Combine output records=0
		Reduce input groups=6773
		Reduce shuffle bytes=49312812
		Reduce input records=3172967
		Reduce output records=3172967
		Spilled Records=6345934
		Shuffled Maps =2
		Failed Shuffles=0
		Merged Map outputs=2
		GC time elapsed (ms)=1581
		CPU time spent (ms)=51510
		Physical memory (bytes) snapshot=2729062400
		Virtual memory (bytes) snapshot=11358625792
		Total committed heap usage (bytes)=2363490304

Step5-

File System Counters
		FILE: Number of bytes read=53409444
		FILE: Number of bytes written=102163833
		FILE: Number of read operations=0
		FILE: Number of large read operations=0
		FILE: Number of write operations=0
		HDFS: Number of bytes read=345
		HDFS: Number of bytes written=0
		HDFS: Number of read operations=3
		HDFS: Number of large read operations=0
		HDFS: Number of write operations=0
		S3: Number of bytes read=176289862
		S3: Number of bytes written=138765621
		S3: Number of read operations=0
		S3: Number of large read operations=0
		S3: Number of write operations=0
	Job Counters 
		Killed map tasks=1
		Launched map tasks=3
		Launched reduce tasks=1
		Data-local map tasks=3
		Total time spent by all maps in occupied slots (ms)=3502992
		Total time spent by all reduces in occupied slots (ms)=2205792
		Total time spent by all map tasks (ms)=72979
		Total time spent by all reduce tasks (ms)=22977
		Total vcore-milliseconds taken by all map tasks=72979
		Total vcore-milliseconds taken by all reduce tasks=22977
		Total megabyte-milliseconds taken by all map tasks=112095744
		Total megabyte-milliseconds taken by all reduce tasks=70585344
	Map-Reduce Framework
		Map input records=3172967
		Map output records=3172967
		Map output bytes=176259954
		Map output materialized bytes=47875548
		Input split bytes=345
		Combine input records=0
		Combine output records=0
		Reduce input groups=3172967
		Reduce shuffle bytes=47875548
		Reduce input records=3172967
		Reduce output records=3172967
		Spilled Records=6345934
		Shuffled Maps =3
		Failed Shuffles=0
		Merged Map outputs=3
		GC time elapsed (ms)=2120
		CPU time spent (ms)=64790
		Physical memory (bytes) snapshot=3459653632
		Virtual memory (bytes) snapshot=14685036544
		Total committed heap usage (bytes)=3101163520

About

Application for Hebrew word prediction based on Amazon Elastic Map-Reduce ( Hadoop ).

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages