Text Counting

Sometimes we need to count the characters or words in a piece of text.

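Let's load two handy libraries, stringr and stringi, and create some sample texts.

##### Load Libraries #####
library(stringr)
library(stringi)

##### Create Sample Data Frame #####
sample <- data.frame(text = c("Hello", "Good Morning", "that's cool.", "thats cool."))
# Make sure the column is character rather than factor (the default before R 4.0);
# the outer parentheses print the result
(sample$text <- as.character(sample$text))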

> (sample$text <- as.character(sample$text))
[1] "Hello"        "Good Morning" "that's cool." "thats cool."

Number of Characters

str_count(sample$text)

> str_count(sample$text)
[1]  5 12 12 11

Number of Words

str_count(sample$text, '\\w+')

> str_count(sample$text, '\\w+')
[1] 1 2 3 2

How str_count handles the apostrophe (') can give a misleading result. The pattern '\\w+' matches runs of word characters, and an apostrophe is not a word character, so "that's" is counted as two words: "that" and "s". So, to count the words correctly, we probably need to strip apostrophes with str_replace_all before using str_count.
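As a minimal sketch of that workaround, the snippet below removes apostrophes before counting. The variable names texts, cleaned and word_counts are made up for illustration, and treating the curly apostrophe (U+2019) the same as the straight one is an assumption about what the input may contain.

```r
library(stringr)

texts <- c("Hello", "Good Morning", "that's cool.", "thats cool.")

# Drop straight and curly apostrophes so "that's" stays a single run
# of word characters
cleaned <- str_replace_all(texts, "['\u2019]", "")

# Count runs of word characters in the cleaned strings
word_counts <- str_count(cleaned, "\\w+")
word_counts
```

With the apostrophe gone, "that's cool." and "thats cool." should both come out as two words.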

stringi Function

stringi has a cool function: stri_stats_latex.

stri_stats_latex(sample[1,])

> stri_stats_latex(sample[1,])
    CharsWord CharsCmdEnvir    CharsWhite         Words          Cmds        Envirs 
            5             0             0             1             0             0

With just one function, you get everything. The only disadvantage of stri_stats_latex is that it aggregates over its whole input, so we need to specify which observation to count. If we were to pass the whole column…

stri_stats_latex(sample$text)

> stri_stats_latex(sample$text)
    CharsWord CharsCmdEnvir    CharsWhite         Words          Cmds        Envirs 
           34             0             6             8             0             0

Yep… not too good: the counts are totals across all four texts. One way around this is to use apply, which calls the function once per row.

apply(sample, 1, stri_stats_latex)

> apply(sample, 1, stri_stats_latex)
              [,1] [,2] [,3] [,4]
CharsWord        5   11    9    9
CharsCmdEnvir    0    0    0    0
CharsWhite       0    1    3    2
Words            1    2    3    2
Cmds             0    0    0    0
Envirs           0    0    0    0
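apply works here, but it first coerces the data frame to a character matrix. A possible alternative, sketched below, is to call sapply on the character vector directly and transpose the result so that each text gets its own row. The names texts, stats and per_text are made up for illustration.

```r
library(stringi)

texts <- c("Hello", "Good Morning", "that's cool.", "thats cool.")

# One call to stri_stats_latex per string; each result becomes a column
# named after the string it came from
stats <- sapply(texts, stri_stats_latex)

# Transpose: one row per text, one column per statistic
per_text <- t(stats)
per_text
```

Because sapply names the columns after the input strings, the transposed matrix shows which text each row of counts belongs to, which the [,1] to [,4] headers from apply do not.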
