Author: S. Wu
This is an exploratory data analysis for a text mining project using the R tm package. The goal is to sample and understand the corpus before implementing an n-gram prediction model.
# Check for, install, and load required R packages
pkg <- c("knitr", "ggplot2", "grid", "gridExtra", "RColorBrewer", "tidyr", "plyr", "dplyr", "tm", "wordcloud")
pkgCheck <- pkg %in% rownames(installed.packages())
for (i in seq_along(pkg)) {
  if (!pkgCheck[i]) {
    install.packages(pkg[i])
  }
  library(pkg[i], character.only = TRUE, quietly = TRUE)
}
The dataset was obtained from HC Corpora. Three English-language corpus documents were used: blogs, news, and tweets; no other files were included. See the HC Corpora readme file for details on the corpora available.
#blog
con<- file('en_US.blogs.txt', 'rb');
blog<- readLines(con, encoding = 'UTF-8', skipNul = TRUE);
close(con);
#news
con<- file('en_US.news.txt', 'rb');
news<- readLines(con, encoding = 'UTF-8', skipNul = TRUE);
close(con);
#twitter
con<- file('en_US.twitter.txt', 'rb');
twitter<- readLines(con, encoding = 'UTF-8', skipNul = TRUE);
close(con); rm(con);
#count lines
lines<- c(length(blog), length(news), length(twitter));
names(lines)<- c('blogs', 'news', 'twitter');
#count vocabulary (number of words separated by whitespace, without any data cleaning)
vCount <- function(x) {
  sum(sapply(x, function(y) length(strsplit(y, " ")[[1]])));
}
vocabulary0<- c(vCount(blog), vCount(news), vCount(twitter));
names(vocabulary0)<- c('blogs', 'news', 'twitter');
#count word types (number of unique words, without any data cleaning)
wCount <- function(x) {
  tokens <- unlist(lapply(x, function(y) unique(strsplit(y, " ")[[1]])));
  length(unique(tokens));
}
wordType0<- c(wCount(blog), wCount(news), wCount(twitter));
names(wordType0)<- c('blogs', 'news', 'twitter');
wordsPerLine0<- vocabulary0/lines;
ttr0<- wordType0/vocabulary0;
diversity0<- wordType0/sqrt(vocabulary0*2);
Documents | Line Count | Word Count | Avg. Words/Line | Word Types | Word Type/Count Ratio |
---|---|---|---|---|---|
Blogs | 899,288 | 37,334,131 | 41.5 | 1,103,548 | 0.030 |
News | 1,010,242 | 34,372,530 | 34.0 | 876,772 | 0.026 |
Twitter | 2,360,148 | 30,373,583 | 12.9 | 1,290,173 | 0.042 |
Table 1 shows the total line count, word count, word types, and word type/count ratio for each document source before any data cleaning and transformation. Word count is the number of words separated by whitespace. Word type count is the number of unique words which occur in the data set. Avg. words/line is the average word count per line. Word type/count ratio is the total word types divided by word count; it indicates complexity, with higher values indicating a more complex data set.
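One way to render Table 1 is directly from the vectors computed above; a minimal sketch using knitr::kable (the column formatting is an assumption):
# Assemble the raw-corpus summary (Table 1) from lines, vocabulary0,
# wordsPerLine0, wordType0, and ttr0 computed above.
summary0 <- data.frame(
  Documents = c('Blogs', 'News', 'Twitter'),
  `Line Count` = format(lines, big.mark = ','),
  `Word Count` = format(vocabulary0, big.mark = ','),
  `Avg. Words/Line` = round(wordsPerLine0, 1),
  `Word Types` = format(wordType0, big.mark = ','),
  `Word Type/Count Ratio` = round(ttr0, 3),
  check.names = FALSE, row.names = NULL
)
kable(summary0)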
20,000 lines from each of the three document sources were randomly selected for data exploration.
The data set was preprocessed with the following transformations:
- remove all numbers
- convert all text to lower case
- remove stop words, using the built-in list of English stop words from the R tm package
- remove profanity words based on George Carlin's definition, including variations (-ed, -ing, etc.)
- remove punctuation except intra-word dashes
- remove unnecessary whitespace
- transform selected contractions (e.g., do not -> don't, i am -> i'm); see the sketch after this list
- remove symbols, foreign characters, and Twitter markers (# and @)
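The contraction and Twitter-token handling is not shown in the code below (removeSymbol() only strips the characters themselves), so the following is a minimal sketch of how it might be done; the fixContraction and removeTwitterToken helpers and their pattern lists are illustrative assumptions, not the exact rules used.
# Illustrative helpers (assumptions) for the contraction and Twitter steps above.
fixContraction <- function(x) {
  # normalize a few common spelled-out forms into their contractions
  x <- gsub("\\bdo not\\b", "don't", x, ignore.case = TRUE)
  x <- gsub("\\bi am\\b", "i'm", x, ignore.case = TRUE)
  x
}
removeTwitterToken <- function(x) {
  # drop whole hashtag/handle tokens rather than only the # and @ characters
  gsub("(#|@)\\S+", " ", x)
}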
len<- 20000;
set.seed(123);
blogSample20k<- sample(blog, len, replace=FALSE);
set.seed(456);
newsSample20k<- sample(news, len, replace=FALSE);
set.seed(789);
twitterSample20k<- sample(twitter, len, replace=FALSE);
removeSymbol<- function(x) {
gsub("[^a-zA-Z0-9 '-]", "", x)
}
blogSample20k<- removeSymbol(blogSample20k);
newsSample20k<- removeSymbol(newsSample20k);
twitterSample20k<- removeSymbol(twitterSample20k);
badword<- readRDS('badword.rds');
cname<- file.path(getwd(), 'sample20k');
docSample <- Corpus(DirSource(cname));
docSample <- tm_map(docSample, content_transformer(tolower)); # lower case, keeping documents as PlainTextDocument
docSample <- tm_map(docSample, removeNumbers); #remove numbers
docSample <- tm_map(docSample, removeWords, stopwords("english")); #remove stop words
docSample <- tm_map(docSample, removePunctuation, preserve_intra_word_dashes=TRUE); #remove punctuation
docSample <- tm_map(docSample, removeWords, badword); #remove profanity words
docSample <- tm_map(docSample, stripWhitespace); #remove unnecessary whitespace
names(docSample)<- c('blogs', 'news', 'twitter');
saveRDS(docSample, 'docSample.rds');
(A) Before cleaning:
Documents | Line Count | Word Count | Avg. Words/Line | Word Types | Word Type/Count Ratio |
---|---|---|---|---|---|
Blogs | 20,000 | 834,955 | 41.7 | 58,903 | 0.071 |
News | 20,000 | 678,948 | 33.9 | 56,116 | 0.083 |
Twitter | 20,000 | 255,468 | 12.8 | 31,745 | 0.124 |
(B) After cleaning:
Documents | Line Count | Word Count | Avg. Words/Line | Word Types | Word Type/Count Ratio |
---|---|---|---|---|---|
Blogs | 20,000 | 418,945 | 20.9 | 45,127 | 0.108 |
News | 20,000 | 376,736 | 18.8 | 43,367 | 0.115 |
Twitter | 20,000 | 133,941 | 6.7 | 22,868 | 0.171 |
Table 2 shows the total line count, word count, and word types for each document source of the sample corpus, (A) before and (B) after data cleaning and transformation. Word count is the number of words separated by whitespace. Word type count is the number of unique words which occur in the data set. Avg. words/line is the average word count per line. Word type/count ratio is the total word types divided by word count; it indicates complexity, with higher values indicating a more complex data set.
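The counts in part (B) can be recomputed from the cleaned corpus with the same helpers used for Table 1; a sketch, assuming vCount() and wCount() from above (tm's content() returns each document's character vector):
# Recompute word counts and word types on the cleaned sample corpus (B).
cleanText <- lapply(docSample, content)          # blogs, news, twitter character vectors
vocabularyClean <- sapply(cleanText, vCount)     # word count per source
wordTypeClean <- sapply(cleanText, wCount)       # word types per source
ttrClean <- wordTypeClean / vocabularyClean      # word type/count ratio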
tdmSample<- TermDocumentMatrix(docSample);
tdmSample;
## <<TermDocumentMatrix (terms: 75104, documents: 3)>>
## Non-/sparse entries: 111362/113950
## Sparsity : 51%
## Maximal term length: 111
## Weighting : term frequency (tf)
mSample<- as.matrix(tdmSample);
freqSample<- rowSums(mSample); # total term frequency across the three sources
mSample<- cbind(mSample, freqSample);
mSample<- mSample[order(freqSample, decreasing=TRUE),];
dSample<- data.frame(word=row.names(mSample), as.data.frame(mSample, row.names=NULL), stringsAsFactors = FALSE);
freqSample<- sort(freqSample, decreasing=TRUE);
Word Frequency
Figure 1 Word Cloud - Top 100 Most Frequently Occurring Words
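Figure 1 can be drawn with the wordcloud package loaded earlier; a minimal sketch using the overall term frequencies (the seed and palette are assumptions):
# Word cloud of the 100 most frequent words across the whole sample corpus.
set.seed(1234)                                   # cloud layout is random
wordcloud(names(freqSample), freqSample,
          max.words = 100, random.order = FALSE,
          colors = brewer.pal(8, 'Dark2'))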
Figure 2 Top 30 Most Frequently Occurring Words by Document Source
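Figure 2 compares per-source counts for the 30 most frequent words; a sketch with tidyr and ggplot2, assuming the dSample data frame built above (columns word, blogs, news, twitter, freqSample):
# Top 30 words overall, plotted by document source.
top30 <- head(dSample, 30)
top30Long <- gather(top30, source, count, blogs, news, twitter)
ggplot(top30Long, aes(x = reorder(word, count), y = count, fill = source)) +
  geom_bar(stat = 'identity') +
  facet_wrap(~ source) +
  coord_flip() +
  labs(x = 'Word', y = 'Frequency')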
BigramTokenizer <- function(x) {
unlist(lapply(ngrams(words(x), 2), paste, collapse = " "), use.names = FALSE);
}
tdmSample_Bi<- TermDocumentMatrix(docSample, control = list(tokenize = BigramTokenizer));
tdmSample_Bi;
## <<TermDocumentMatrix (terms: 769580, documents: 3)>>
## Non-/sparse entries: 816197/1492543
## Sparsity : 65%
## Maximal term length: 117
## Weighting : term frequency (tf)
mSample_Bi<- as.matrix(tdmSample_Bi);
freqSample_Bi<- rowSums(mSample_Bi); # total bi-gram frequency across the three sources
mSample_Bi<- cbind(mSample_Bi, freqSample_Bi);
mSample_Bi<- mSample_Bi[order(freqSample_Bi, decreasing=TRUE),];
dSample_Bi<- data.frame(word=row.names(mSample_Bi), as.data.frame(mSample_Bi, row.names=NULL), stringsAsFactors = FALSE);
freqSample_Bi<- sort(freqSample_Bi, decreasing=TRUE);
Word Frequency
Figure 3 Top 30 Most Frequently Occurring Bi-gram Words by Document Source
TrigramTokenizer <- function(x) {
unlist(lapply(ngrams(words(x), 3), paste, collapse = " "), use.names = FALSE);
}
tdmSample_Tri<- TermDocumentMatrix(docSample, control = list(tokenize = TrigramTokenizer));
tdmSample_Tri;
## <<TermDocumentMatrix (terms: 939207, documents: 3)>>
## Non-/sparse entries: 941869/1875752
## Sparsity : 67%
## Maximal term length: 123
## Weighting : term frequency (tf)
mSample_Tri<- as.matrix(tdmSample_Tri);
freqSample_Tri<- rowSums(mSample_Tri); # total tri-gram frequency across the three sources
mSample_Tri<- cbind(mSample_Tri, freqSample_Tri);
mSample_Tri<- mSample_Tri[order(freqSample_Tri, decreasing=TRUE),];
dSample_Tri<- data.frame(word=row.names(mSample_Tri), as.data.frame(mSample_Tri, row.names=NULL), stringsAsFactors = FALSE);
freqSample_Tri<- sort(freqSample_Tri, decreasing=TRUE);
Word Frequency
Figure 4 Top 30 Most Frequently Occurring Tri-gram Words by Document Source
Distributions of frequencies were further examined.
Single Words
Among the 75,104 unique words (word types) in this sample corpus, 40,150 (about 53%) appear only once, and many others appear very infrequently. The 1,077 most frequent word types (top 1%) cover 50% of the sample corpus, whereas 16,096 word types (top 21%) are needed to account for 90%.
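These coverage figures follow from the cumulative distribution of the sorted unigram frequencies; a sketch, assuming freqSample from above:
# Coverage of the corpus by the most frequent word types.
coverage <- cumsum(freqSample) / sum(freqSample)
sum(freqSample == 1)          # word types that appear only once
which(coverage >= 0.5)[1]     # word types needed to cover 50% of all words
which(coverage >= 0.9)[1]     # word types needed to cover 90% of all words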
N-Gram
Figure 5 Distributions of Frequencies by N-Gram Models
The x axis represents the unique words sorted by number of occurrences; the y axis represents frequency. Each panel keeps its model's original scale. The blue line marks 50% coverage of the total words in the data set, and the orange line marks 90% coverage. The red line indicates the beginning of the single-occurrence words, i.e., frequencies to the right of the red line are all one: 53% of word types for the uni-gram model, 90% for bi-gram, and 99% for tri-gram.
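One panel of Figure 5 (the uni-gram model) could be sketched in base graphics as follows; the actual figure was likely drawn with ggplot2/gridExtra, so this is only an approximation using freqSample and the coverage vector from the sketch above:
# Frequency vs. rank for uni-grams, with the three reference lines from Figure 5.
plot(seq_along(freqSample), freqSample, type = 'l',
     xlab = 'Word types sorted by number of occurrences', ylab = 'Frequency')
abline(v = which(coverage >= 0.5)[1], col = 'blue')     # 50% coverage
abline(v = which(coverage >= 0.9)[1], col = 'orange')   # 90% coverage
abline(v = which(freqSample == 1)[1], col = 'red')      # first single-occurrence word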
Table 3 Number of Word Types (Unique Words) and Highest Frequency by N-Gram Model
N-Gram | Word Types | Highest Frequency |
---|---|---|
1-Gram | 75,104 | 6,015 |
2-Gram | 769,580 | 390 |
3-Gram | 939,207 | 56 |