Licence CC BY-NC-ND Thierry Parmentelat & Arnaud Legout
WordCounts class
we want to compute how often each word appears in a text
to do so, you are asked to write a class that is used like this
from wordcounts import WordCounts
wc = WordCounts("wordcounts-data.txt")
# we arbitrarily choose to display the 5 most frequent words
print(wc)

wordcounts-data.txt: 1580 total words570 different words   the : 65
    he : 56
     a : 52
    to : 52
    it : 40
# next, the number of occurrences of a word can be looked up like this
for word in ['arthur', 'people']:
    print(f"word {word} was found {wc.counter[word]} times")

word arthur was found 16 times
word people was found 9 times
# and check whether a word appears at all
for word in ['arthur', 'armageddon']:
    present = word in wc.vocabulary()
    print(f"is word '{word}' present ? : {present} ")

is word 'arthur' present ? : True
is word 'armageddon' present ? : False
Hints
it is reasonable to convert everything to lowercase once and for all, at the very start of the processing
you may want to look at the standard module string, and in particular string.punctuation
also be aware that the text in question contains non-ASCII quotation marks “”
see also the class collections.Counter, which will make your life much easier
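The hints above can be tried out on a short in-memory string before touching the data file. The sketch below is only an illustration: the sample sentence and the variable names are made up for this example, and it strips punctuation with `str.translate`, which is one possible approach among several.

```python
from collections import Counter
from string import punctuation

# hypothetical sample text, not taken from wordcounts-data.txt
text = "“Don't panic,” said Arthur. Arthur did panic!"

# map every ASCII punctuation character, plus the two curly quotes, to a space
table = str.maketrans({char: " " for char in punctuation + "“”"})

# lowercase once, erase punctuation, then split on whitespace
words = text.lower().translate(table).split()

# Counter builds the word -> number of occurrences mapping in one go
counter = Counter(words)
print(counter["arthur"], counter["panic"])
```

Note that, since the apostrophe is part of `string.punctuation`, a word such as "don't" ends up split into "don" and "t" with this approach.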
Variants
how would you find all the words that appear between 30 and 40 times in the text?
if you feel comfortable enough (this requires operator overloading), make it possible to also write:
for word in ['arthur', 'people']:
    # here we can index the WordCounts instance directly
    print(f"word {word} was found {wc[word]} times")

word arthur was found 16 times
word people was found 9 times
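For readers who have not met operator overloading yet, here is a minimal sketch of the mechanism on a toy example. The `Scores` class and its data are hypothetical and unrelated to the exercise; the only point is that Python evaluates `obj[key]` by calling the special method `__getitem__`.

```python
class Scores:
    """hypothetical toy class, used only to illustrate indexing"""

    def __init__(self, data: dict[str, int]) -> None:
        self.data = data

    def __getitem__(self, key: str) -> int:
        # called by the interpreter when evaluating scores[key];
        # here a missing key counts as 0 instead of raising KeyError
        return self.data.get(key, 0)


scores = Scores({"alice": 3})
print(scores["alice"], scores["bob"])   # prints: 3 0
```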
Solution
the class
wordcounts.py
"""
playing with word frequencies
"""
from collections import Counter
from string import punctuation

# the text has unicode quotes in it
punctuation += "“”"


class WordCounts:

    def __init__(self, filename) -> None:
        # just in case, keep for future reference
        self.filename = filename
        # the words as they appear in the text, all lowercase and with punctuation removed
        words = []
        # read and erase punctuation
        with open(self.filename) as feed:
            for line in feed:
                line = line.strip().lower()
                for char in punctuation:
                    line = line.replace(char, " ")
                # add the words in the list
                words.extend(line.split())
        # using Counter makes it easier
        self.counter = Counter(words)

    def __repr__(self) -> str:
        result = ""
        result += f"{self.filename}:"
        result += f" {self.size()} total words"
        result += f"{len(self.vocabulary())} different words"
        result += "\n".join(f" {w:>5} : {c}" for w, c in self.counter.most_common(5))
        return result

    def size(self) -> int:
        """
        number of words in the original text
        """
        # Counter.total() is only available in Python 3.10 and later
        #return self.counter.total()
        # a Counter is a dict, so we can sum its values
        return sum(self.counter.values())

    def vocabulary(self) -> set[str]:
        """
        return the set of words used in the text
        """
        # iterating over a Counter yields its keys, i.e. the distinct words
        return set(self.counter)

    # for the variant
    def __getitem__(self, word: str) -> int:
        return self.counter[word]
the searches
# the words that appear between 30 and 40 times
{word for word, count in wc.counter.items() if 30 <= count <= 40}
-> {'and', 'it', 'was'}
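The same set comprehension can be wrapped in a small reusable function. This is only a sketch: the helper name `words_in_range` and the sample counts below are made up for illustration and are not part of the original solution.

```python
from collections import Counter


def words_in_range(counter: Counter, low: int, high: int) -> set[str]:
    """
    hypothetical helper: the words whose count lies
    between low and high, both bounds included
    """
    return {word for word, count in counter.items() if low <= count <= high}


# made-up counts, not the ones from wordcounts-data.txt
sample = Counter({"the": 65, "it": 40, "and": 35, "arthur": 16})
print(sorted(words_in_range(sample, 30, 40)))   # prints: ['and', 'it']
```

With an instance of the class above, it would be called as `words_in_range(wc.counter, 30, 40)`.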