Calling prepare!(StringDocument, strip_case | strip_stopwords) even on a small ~3.4MB file takes forever to return (other than for small strings, I have not seen this function finish successfully).
function method_textanalysis(str_rec)
    sdoc = StringDocument(str_rec)
    prepare!(sdoc, strip_case | strip_stopwords)
    return sdoc
end
I have tracked the slowness down to the following function (TextAnalysis.jl/src/preprocessing.jl, line 253 at c8ae7a2).
function remove_patterns(s::AbstractString, rex::Regex)
    iob = IOBuffer()
    ibegin = 1
    v = codeunits(s)
    for m in eachmatch(rex, s)
        len = m.match.offset - ibegin + 1
        next = nextind(s, lastindex(m.match) + m.match.offset)
        if len > 0
            Base.write_sub(iob, v, ibegin, len)
            if next != length(s) + 1
                write(iob, ' ')
            end
        end
        ibegin = next
    end
    len = length(v) - ibegin + 1
    (len > 0) && Base.write_sub(iob, v, ibegin, len)
    String(take!(iob))
end
Manually performing a similar task takes ~1.4 seconds on the same 3.4MB text file. I say "similar" because, to eliminate stop words manually, I first tokenize the document and then filter out the stop words. That is functionally very different from executing a regex on a large string, and it may not be the ideal approach for preserving the structure of the document. (Complete code at the end.)
function method_manual(str_rec)
    stop_words = Languages.stopwords(Languages.English())
    str_rec = lowercase(str_rec)
    word_tokens = tokenize(str_rec)
    res = filter(x -> !in(x, stop_words), word_tokens)
    return res
end
I was wondering if there could be a more efficient way to perform this elimination of stop words from a StringDocument?
using Pkg

@info "Installing required packages.."
required_pkg = ["TextAnalysis", "Languages", "WordTokenizers"]
installed_pkg = Pkg.installed()
[(in(p, keys(installed_pkg)) || Pkg.add(p)) for p in required_pkg]

using TextAnalysis
using WordTokenizers
using Languages
"""Download data if it does not already exists"""functiondownload_data(;
url="http://www.gutenberg.org/files/2600/2600-0.txt",
localfile="2600-0.txt")
if!isfile(localfile)
download(url, localfile)
else@info"file $(localfile) already exists, skipping download"endend"""Return data (~100MB uncompressed) in form of a string"""functiongetdata(fn="2600-0.txt")
download_data()
str_rec =read(fn, String)
end"""Pre Process data using TextAnalysis - strip_case | strip_stopwords"""functionmethod_textanalysis(str_rec)
sdoc =StringDocument(str_rec)
#prepare!(sdoc, strip_case | strip_stopwords)prepare!(sdoc, strip_stopwords)
return sdoc
end"""Pre Process data without using `prepare!`"""functionmethod_manual(str_rec)
stop_words = Languages.stopwords(Languages.English())
str_rec =lowercase(str_rec)
word_tokens =tokenize(str_rec)
res =filter(x->!in(x, stop_words), word_tokens)
return res
end"""Main"""functionmain()
str_rec =getdata()
@info"Manual Pre Processing"@time res2 =method_manual(str_rec)
@info"Pre Processing using TextAnalysis"@time res1 =method_textanalysis(str_rec)
endmain()
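As a point of comparison, here is a minimal sketch of one possible alternative (not from the original report; the function name is made up for illustration): build a single alternation regex from the stop word list and run one replace pass over the lowercased string, rather than one pass per stop word. It assumes the stop words contain no regex metacharacters.

using Languages

# Sketch: one combined \b(?:w1|w2|...)\b pattern instead of one regex per stop word.
# Assumes the stop words contain no regex metacharacters.
function remove_stopwords_single_pass(str_rec)
    stop_words = Languages.stopwords(Languages.English())
    rex = Regex("\\b(?:" * join(stop_words, "|") * ")\\b")
    return replace(lowercase(str_rec), rex => "")
end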
FYI: a quick test of using replace while looping over stop_words appears to be much faster than the existing remove_patterns method: ~<2s vs ~940s.
function preprocess(str_rec)
    stop_words = Languages.stopwords(Languages.English())
    str_rec = lowercase(str_rec)
    for sw in stop_words
        rex = Regex("\\b" * sw * "\\b")
        str_rec = replace(str_rec, rex => "")
    end
    return str_rec
end
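A hypothetical call, assuming the same 2600-0.txt file used in the script above:

str_rec = read("2600-0.txt", String)
@time cleaned = preprocess(str_rec)   # reported above as ~<2s, vs ~940s via remove_patterns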
Not sure if my input is warranted, but I just wanted to post a solution I found that worked. Note, however, that this process removes stop words from the return value of tokenize (using WordTokenizers).
using Languages, WordTokenizers

STOPWORDS = stopwords(Languages.English());

"""
    my_tokenize(text, sw)

Return the tokenized words in `text` with stopwords removed by default.
To return only the stopwords in `text`, set argument `sw` to "only".
"""
function my_tokenize(text, sw::String="remove")
    if sw == "remove"
        return collect(word for word in tokenize(text) if !in(word, STOPWORDS))
    elseif sw == "only"
        return collect(word for word in tokenize(text) if in(word, STOPWORDS))
    else
        return collect(word for word in tokenize(text))
    end
end
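For illustration, a hypothetical call (the input string is made up):

text = "this is a small example sentence for tokenization"
tokens = my_tokenize(text)             # tokens with stopwords removed
only_sw = my_tokenize(text, "only")    # only the stopwords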