I like to think of myself as the anti-data scientist, or the troll of data science. I do my posts for me (and to entertain my one loyal reader, Sam). If I find some data to play around with and want to make a funky chart, then I can't be stopped. That's why I decided to make a word cloud of words from a Bank of Canada press release. I also decided to use R. On top of that, I am using R inside Jupyter. A slew of faux pas.

People in the data world tend to hate word clouds. Some data scientists hate them so much that they feel the urge to write an entire article about why Word Clouds are Lame. I only skimmed this article to find a funny quote. Here is one: "word clouds [are] the pie chart of text data." So if people don't like them, of course I'm going to make one.

Also, I agree that pie charts kind of suck, but I have worked in places where pie charts are the only natural, easily-digestible choice. Maybe I will create 100 pie charts for my next post.

Packages

I'm using pdftools to grab the text from the PDF press release, tokenizers to break the text into sentences and words, corpus to analyze the words, and wordcloud to create the word cloud. I threw tidyverse in there because you basically need it for everything in R.

I think technically you should only have to use one of tokenizers or corpus, but I found that using both gave me cleaner data.
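If you're curious what the corpus-only route would look like, here's a rough sketch. It's not what I actually ran, and the filter settings are just my guess at a sensible starting point:

library(corpus)

# a stand-in string; in the real post the text comes out of the PDF
raw_text <- "Growth in Canada is expected to slow in the second half of 2019."

# split into sentences, then into lowercased word tokens,
# dropping punctuation and numbers along the way
text_split(raw_text, units = "sentences")
text_tokens(raw_text, filter = text_filter(drop_punct = TRUE, drop_number = TRUE))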

In [10]:
library(pdftools)
library(tidyverse)
library(tokenizers)
library(wordcloud)
library(corpus)

Download PDF, extract text and tokenize

In [2]:
url = "https://www.bankofcanada.ca/wp-content/uploads/2019/10/fad-press-release-2019-10-30.pdf"
file_name = "fad-press-release-2019-10-30.pdf"
download.file(url, file_name, mode = "wb")

I apply the tokenize_sentences and tokenize_words functions from the tokenizers library to the raw text to break it into sentences and words. I've picked out sentences 3 to 26 since those are the body of the text. Sentences 1 and 2 contain the letterhead information and the title, so they're not useful here. Sentences after 26 aren't useful either; they just give the date of the next press release and the Bank's contact info.

In [3]:
sentences <- pdf_text(file_name) %>% tokenize_sentences()

txt <- sentences[[1]][3:26]
txt <- txt %>% tokenize_words()

Below is the result of tokenizing the full text into sentences and then each sentence into single words. Notice that everything has been lowercased. There are still some things in there, like numbers, that we want to get rid of. For that I use the corpus package.

In [4]:
txt
[[1]] 'the' 'outlook' 'for' 'the' 'global' 'economy' 'has' 'weakened' 'further' 'since' 'the' 'bank’s' 'july' 'monetary' 'policy' 'report' 'mpr'
[[2]] 'ongoing' 'trade' 'conflicts' 'and' 'uncertainty' 'are' 'restraining' 'business' 'investment' 'trade' 'and' 'global' 'growth'
[[3]] 'a' 'growing' 'number' 'of' 'countries' 'have' 'responded' 'with' 'monetary' 'and' 'other' 'policy' 'measures' 'to' 'support' 'their' 'economies'
[[4]] 'still' 'global' 'growth' 'is' 'expected' 'to' 'slow' 'to' 'around' '3' 'percent' 'this' 'year' 'before' 'edging' 'up' 'over' 'the' 'next' 'two' 'years'
[[5]] 'canada' 'has' 'not' 'been' 'immune' 'to' 'these' 'developments'
[[6]] 'commodity' 'prices' 'have' 'fallen' 'amid' 'concerns' 'about' 'global' 'demand'
[[7]] 'despite' 'this' 'the' 'canada' 'us' 'exchange' 'rate' 'is' 'still' 'near' 'its' 'july' 'level' 'and' 'the' 'canadian' 'dollar' 'has' 'strengthened' 'against' 'other' 'currencies'
[[8]] 'growth' 'in' 'canada' 'is' 'expected' 'to' 'slow' 'in' 'the' 'second' 'half' 'of' 'this' 'year' 'to' 'a' 'rate' 'below' 'its' 'potential'
[[9]] 'this' 'reflects' 'the' 'uncertainty' 'associated' 'with' 'trade' 'conflicts' 'continuing' 'adjustment' 'in' 'the' 'energy' 'sector' 'and' 'the' 'unwinding' 'of' 'temporary' 'factors' 'that' 'boosted' 'growth' 'in' 'the' 'second' 'quarter'
[[10]] 'business' 'investment' 'and' 'exports' 'are' 'likely' 'to' 'contract' 'before' 'expanding' 'again' 'in' '2020' 'and' '2021'
[[11]] 'at' 'the' 'same' 'time' 'government' 'spending' 'and' 'lower' 'borrowing' 'rates' 'are' 'supporting' 'domestic' 'demand' 'and' 'activity' 'in' 'the' 'services' 'sector' 'remains' 'robust'
[[12]] 'employment' 'is' 'showing' 'continuing' 'strength' 'and' 'wage' 'growth' 'is' 'picking' 'up' 'although' 'with' 'some' 'variation' 'among' 'regions'
[[13]] 'consumer' 'spending' 'has' 'been' 'choppy' 'but' 'will' 'be' 'supported' 'by' 'solid' 'income' 'growth'
[[14]] 'meanwhile' 'housing' 'activity' 'is' 'picking' 'up' 'in' 'most' 'markets'
[[15]] 'the' 'bank' 'continues' 'to' 'monitor' 'the' 'evolution' 'of' 'financial' 'vulnerabilities' 'in' 'light' 'of' 'lower' 'mortgage' 'rates' 'and' 'past' 'changes' 'to' 'housing' 'market' 'policies'
[[16]] 'the' 'bank' 'projects' 'real' 'gdp' 'will' 'grow' 'by' '1.5' 'percent' 'this' 'year' '1.7' 'percent' 'in' '2020' 'and' '1.8' 'percent' 'in' '2021'
[[17]] 'this' 'implies' 'that' 'the' 'current' 'modest' 'output' 'gap' 'will' 'narrow' 'over' 'the' 'projection' 'horizon'
[[18]] 'measures' 'of' 'inflation' 'are' 'all' 'around' '2' 'percent'
[[19]] 'cpi' 'inflation' 'likely' 'will' 'dip' 'temporarily' 'in' '2020' 'as' 'the' 'effect' 'of' 'a' 'previous' 'spike' 'in' 'energy' 'prices' 'fades'
[[20]] 'overall' 'the' 'bank' 'expects' 'inflation' 'to' 'track' 'close' 'to' 'the' '2' 'percent' 'target' 'over' 'the' 'projection' 'horizon'
[[21]] 'all' 'things' 'considered' 'governing' 'council' 'judges' 'it' 'appropriate' 'to' 'maintain' 'the' 'current' 'level' 'of' 'the' 'overnight' 'rate' 'target'
[[22]] 'governing' 'council' 'is' 'mindful' 'that' 'the' 'resilience' 'of' 'canada’s' 'economy' 'will' 'be' 'increasingly' 'tested' 'as' 'trade' 'conflicts' 'and' 'uncertainty' 'persist'
[[23]] 'in' 'considering' 'the' 'appropriate' 'path' 'for' 'monetary' 'policy' 'the' 'bank' 'will' 'be' 'monitoring' 'the' 'extent' 'to' 'which' 'the' 'global' 'slowdown' 'spreads' 'beyond' 'manufacturing' 'and' 'investment'
[[24]] 'in' 'this' 'context' 'it' 'will' 'pay' 'close' 'attention' 'to' 'the' 'sources' 'of' 'resilience' 'in' 'the' 'canadian' 'economy' 'notably' 'consumer' 'spending' 'and' 'housing' 'activity' 'as' 'well' 'as' 'to' 'fiscal' 'policy' 'developments'

Convert to corpus data frame

The corpus package makes working with text data easier. I've applied corpus functions to my already-tokenized text because I found that the tokenizers library alone wasn't enough. corpus lets you create corpus data frame objects with corpus_frame. You can still pass other object types, such as a regular data.frame, to the corpus functions, but a corpus data frame gives you more functionality, such as the ability to set text_filter properties, which control things like whether punctuation and numbers get dropped. See this vignette for an introduction to corpus.

Important note: the corpus_frame function expects the input to have a column called text.
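For example, something shaped like this (made-up strings, just to show the required text column) should work:

corpus_frame(data.frame(text = c("first sentence", "second sentence")))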

In [5]:
# name each tokenized sentence
names(txt) <- paste("sentence", seq_along(txt), sep = "")

# start with an empty data frame that has a text column,
# then stack the words from each sentence underneath it
df <- data.frame(text = character(), stringsAsFactors = FALSE)

for (item in txt)
{
    df_temp <- data.frame(text = item, stringsAsFactors = FALSE)
    df <- rbind(df, df_temp)
}

# convert to corpus data frame
data <- corpus_frame(df)
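Side note: the loop isn't strictly necessary. Since each element of txt is just a character vector of words, I'm fairly sure you can build the same data frame in one shot:

# same idea without the loop: flatten the list of word vectors first
data <- corpus_frame(data.frame(text = unlist(txt), stringsAsFactors = FALSE))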

Filter out numbers and punctuation

The drop_number and drop_punct options default to FALSE, so we need to flip them in the text_filter properties:

In [6]:
text_filter(data)$drop_number <- TRUE
text_filter(data)$drop_punct <- TRUE

Create a list of Bank-specific stop words

Stop words are common words, such as "the", that we need to instruct the program to ignore. Bank of Canada press releases also tend to use a handful of words specific to economics and to the Bank itself, so I've created a custom stop word list to filter those out too.

In [7]:
boc_stopwords <- c('per','cent', 'percent','rate','rates','bank',"bank\'s",'canada',
                   'monetary','policy','report','mpr','governing','council','year',
                   'january','february','march','april','may','june','july','august',
                   'september','october','november','december')

Tally word counts

Corpus has a term_stats function that counts the occurrences of each term. Below we see that "growth" is the most frequently used term. Coming from a central bank, that's not surprising.

In [8]:
term_stats <- term_stats(data, subset = !term %in% boc_stopwords & !term %in% stopwords_en)
head(term_stats)
term       count  support
growth         6        6
global         5        5
trade          4        4
activity       3        3
conflicts      3        3
economy        3        3

Create word cloud

In [9]:
# save image to a 12x8 inch PNG
png("oct2019.png", width = 12, height = 8, units = 'in', res = 300)

wordcloud(words = term_stats$term, freq = term_stats$count, min.freq = 1,
          max.words = 500, random.order = FALSE, rot.per = 0.35,
          colors = brewer.pal(6, "Dark2"))

# close the graphics device so the PNG actually gets written
dev.off()
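Since I'm running R inside Jupyter, the cell above only writes the file to disk; nothing shows up inline while the png device is open. To preview the cloud in the notebook itself, I believe you can just call wordcloud again without the png()/dev.off() wrapper (the repr options below come from the IRkernel setup and control the inline plot size):

# preview inline in Jupyter; the sizes are just a starting point
options(repr.plot.width = 12, repr.plot.height = 8)

wordcloud(words = term_stats$term, freq = term_stats$count, min.freq = 1,
          max.words = 500, random.order = FALSE, rot.per = 0.35,
          colors = brewer.pal(6, "Dark2"))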