Language Processing

Measuring the vulgarity level of popular French podcasts

Year: 2020
Project: Personal
Stack: youtube-dl, Python
See the project (French)   

Mike Ward Sous écoute

Sous écoute is without a doubt one of the most popular podcasts in Quebec. With several hundred episodes available for free on YouTube, each averaging over two hours, it is an important archive of informal conversation among French-speaking Quebecers. Given the comedic and rather vulgar nature of the podcast, I decided to analyze YouTube's automatically generated captions to see what insights they might yield. The results are surprising.

Approach

After a quick read-through, I realized that the captions generated by YouTube are only roughly faithful to what is actually said. That said, given the sheer volume of words, it is still possible to observe trends in the expressions used and in the level of vulgarity across episodes.

The analysis required a few steps:

  • Collect the transcripts of 162 episodes (340 hours of podcasts) in txt format with the youtube-dl command-line tool.
  • Build a (subjective) dictionary of the most vulgar words in Quebec, along with the variations observed in the YouTube captions. (Click here for a crash course on Quebecois swear words).
  • Score each expression, where 1 is the most vulgar and 0.4 the least.
  • Compute the trash score of every episode: the sum of the scores of all the expressions used by the host and the guests.
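
The steps above can be sketched roughly as follows. This is a minimal illustration rather than the project's actual code: the function names, the three dictionary entries and their weights are placeholders I made up (only the 0.4 to 1 range comes from the list above), the youtube-dl flags are one plausible way to fetch auto-generated captions, and the matching only handles single-word expressions.

```python
import re
import unicodedata
from collections import Counter

# Hypothetical dictionary: these entries and weights are placeholders,
# not the project's real list. Only the 0.4 to 1 range is from the text.
VULGARITY_SCORES = {
    "tabarnak": 1.0,
    "câlisse": 0.8,
    "maudit": 0.4,
}


def caption_download_command(video_url, lang="fr"):
    """Build (but do not run) a youtube-dl command that fetches captions only."""
    return [
        "youtube-dl",
        "--skip-download",   # do not download the video itself
        "--write-auto-sub",  # write YouTube's automatically generated subtitles
        "--sub-lang", lang,  # caption language to request
        video_url,
    ]


def trash_score(transcript, scores=VULGARITY_SCORES):
    """Sum the scores of every dictionary expression found in a transcript."""
    # Normalize accents to their composed form so "câlisse" is a single token.
    text = unicodedata.normalize("NFC", transcript.lower())
    counts = Counter(re.findall(r"\w+", text))
    # Each occurrence of an expression contributes its weight once.
    return sum(weight * counts[expr] for expr, weight in scores.items())
```

Calling `trash_score` on the text of each episode and sorting the episodes by the result would give a ranking from most to least vulgar. The command list returned by `caption_download_command` could be passed to `subprocess.run` for each episode URL.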

Results

I was surprised to find that the least vulgar episodes usually featured guests who were older or more articulate, without being snobby. Check out the project for details and code.