Counting Words in a Text
For this assignment, you will write a program that counts how many times each word appears in a text file and prints out the results. For example, in caesar.txt, one of the files we will provide (part of the text of Julius Caesar), you will find that the word “caesar” appears 34 times (ignoring case).
As you have not yet learned to read files, we have provided the count_file function to you:
def count_file(fname):
    counts = {}
    with open(fname) as f:
        for line in f:
            count_words(counts, line)
    print_results(counts)
This function makes an empty dictionary and then reads each line in the file, calling your count_words function to update the dictionary, which maps words to their counts. When it finishes with the file, it calls your print_results function to print out the final results.
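To see what count_words receives on each call, here is a small sketch that uses io.StringIO as a stand-in for an open file, so that it runs without any file on disk. The sample text and the names fake_file and lines_seen are illustrative assumptions, not part of the assignment. Each line arrives as a string that still ends with its newline character.

```python
import io

# io.StringIO stands in for an open text file (an assumption for illustration).
fake_file = io.StringIO("Friends, Romans, countrymen,\nlend me your ears;\n")

lines_seen = []
for line in fake_file:
    # Each iteration yields one line, including its trailing "\n".
    lines_seen.append(line)

print(lines_seen)
```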
Step 0:
Download and unzip this zip file, which contains four .txt text files with content from the plays Julius Caesar, Hamlet, Henry VIII, and Macbeth. Place these in the directory in which you plan to work. (You don’t need to upload these files to the autograder; it will have its own copies.)
Then:
- Create a file called countwords.py in that same directory.
- Copy-paste into it the scaffolding code below and save.
- Quit and re-open VS Code.
- Under File in VS Code, select "Open Folder..." and open the directory where you placed countwords.py and caesar.txt.
Step 1:
Write the function count_words(counts, line), which takes:
- counts: a dictionary that maps words to how many times they appear, and
- line: a string, which you should split into words and, for each word you see, update counts to reflect that occurrence of the word.
This function should return the updated “counts” dictionary.
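One common way to update a dictionary of counts is sketched below on unrelated data. The names tally and the list of fruit are hypothetical, and this is only one possible idiom rather than the required approach.

```python
# Hypothetical data: tally how often each item appears in a list.
tally = {}
for item in ["apple", "pear", "apple", "fig", "apple"]:
    # .get(item, 0) returns the current count, or 0 if item is not yet a key.
    tally[item] = tally.get(item, 0) + 1

print(tally)  # {'apple': 3, 'pear': 1, 'fig': 1}
```

An equivalent version checks "if item in tally" first and either adds one to the existing count or stores 1 for a new key.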
A few hints:
- Python strings come with a number of standard methods for string manipulation you can read about here.
- Of particular interest for this exercise is probably the .split() method (e.g., my_string.split()). .split() splits a string into a list of substrings, making a split anywhere it sees whitespace (like a space or newline). So in:
      my_string = "hello world!"
      my_string.split()
  the call to my_string.split() will return ["hello", "world!"].
- We have provided a simple “print_results” which just prints counts: you can use it to test your count_words function before you proceed.
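The following short sketch shows how .split() treats the kind of lines that come out of a file. The sample strings are made up for illustration. Runs of whitespace count as a single separator, the trailing newline disappears, and a line containing only whitespace produces an empty list.

```python
line = "Friends, Romans,  countrymen,\n"
words = line.split()
print(words)  # ['Friends,', 'Romans,', 'countrymen,']

# A blank line (only whitespace) yields no words at all.
blank = "   \n".split()
print(blank)  # []
```

Note that punctuation stays attached to the words; .split() only looks at whitespace.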
Step 2:
If you look at the results of count_words, you will see that sometimes we end up with words like "now," (with a comma) which really should just be "now". Likewise, "The" and "the" are counted as two separate words.
You can improve your count_words by using the following two string methods:
- string.strip("-?.!,[]—:;'\"") will return a string with any leading or trailing occurrences of any of the characters -?.!,[]—:;"' removed.
- string.lower() will return the string with all characters converted to lowercase letters. (There is also .upper() to go the other way if you ever need it.)
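As a rough illustration of how these two methods combine, the sketch below wraps them in a helper named clean. That name is hypothetical; weird_characters is the string from the scaffolding. Notice that .strip() only touches the ends of the string, so punctuation in the middle of a token is left alone.

```python
weird_characters = "-?.!,[]—:;\"'"

def clean(word):
    # Strip punctuation from both ends, then convert to lower case.
    return word.strip(weird_characters).lower()

print(clean("now,"))          # now
print(clean("The"))           # the
print(clean("'tis"))          # tis
print(clean("depart,—that"))  # depart,—that (interior characters are kept)
```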
Step 3:
print_results currently just prints the dictionary using its built-in conversion to a string. That is fine for testing, but not a great output format.
For this step, you should replace the code in print_results so that it prints out each word and its corresponding count, one per line, with a space, a colon, and a space between the word and the count, e.g.:
and : 59
a : 27
the : 72
caesar : 34
To help with this, it may be useful to know about the three main ways to get data out of a dictionary: .items(), .keys(), and .values().
- .items() returns a collection of tuples, where each tuple has a key (in this case, the word) and the associated value (in this case, the count).
- .keys() returns a collection of all the keys in the dictionary.
- .values() returns a collection of all the values in the dictionary.
Finally, if you just loop over a dictionary object (for k in my_dict:), it will iterate over the keys, the same as it would if you used for k in my_dict.keys():.
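Here is a small sketch of these three methods on a hand-made dictionary. The names sample and formatted are illustrative assumptions, and the f-string is just one of several ways to build the "word : count" text.

```python
sample = {"and": 59, "a": 27, "the": 72}

# .items() yields (key, value) tuples, which can be unpacked in the loop header.
formatted = []
for word, count in sample.items():
    formatted.append(f"{word} : {count}")

print("\n".join(formatted))
print(list(sample.keys()))    # ['and', 'a', 'the']
print(list(sample.values()))  # [59, 27, 72]
```

Calling print(word, ":", count) inside the loop would produce the same spacing, because print separates its arguments with a single space by default.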
Step 4:
For this final step, you will make your output nicer still by sorting it by word, so that the words appear in alphabetical order. To accomplish this, you will want to put the items from the dictionary into a list, then call .sort() on that list.
To convert one of the collections you get back from .items(), .keys(), or .values() into a list, just use the list() function, e.g.: keys_as_list = list(my_dict.keys()).
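The sorting step might look like the sketch below, again on a small hand-made dictionary (sample and pairs are hypothetical names). Tuples are compared by their first element first, so a list of (word, count) tuples ends up ordered by word.

```python
sample = {"the": 72, "and": 59, "caesar": 34, "a": 27}

pairs = list(sample.items())
# .sort() reorders the list in place and returns None,
# so do not write pairs = pairs.sort().
pairs.sort()

print(pairs)  # [('a', 27), ('and', 59), ('caesar', 34), ('the', 72)]
```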
Scaffolding
weird_characters = "-?.!,[]—:;\"'"

def count_words(counts, line):
    # You should write this function in Step 1,
    # and improve it in Step 2
    pass

def print_results(counts):
    # You should replace this code
    # with something more sophisticated
    # in Step 3 and 4.
    print(counts.items())

# You do not need to modify this function.
# It will call your count_words and print_results functions
def count_file(fname):
    counts = {}
    with open(fname) as f:
        for line in f:
            count_words(counts, line)
    print_results(counts)
    return counts

if __name__ == "__main__":
    counts = count_file("caesar.txt")
Expected Output
Below is the output you may expect from the test code at the bottom of your file if you implement everything correctly.
a : 27
abide : 1
about : 2
abroad : 1
action : 1
afoot : 1
after : 1
again : 1
against : 2
alas : 2
all : 23
allow'd : 1
alone : 2
already : 1
am : 3
ambition : 3
ambitious : 7
an : 5
ancestors : 1
and : 59
angel : 1
another : 1
answer : 2
answer'd : 1
antony : 39
any : 7
arbours : 1
are : 12
arms : 1
art : 2
as : 14
ascended : 1
assembly : 1
at : 3
audience : 1
awake : 1
away : 5
awhile : 1
back : 3
base : 2
be : 16
bear : 2
bearing : 1
beasts : 1
beg : 1
begins : 1
behold : 1
beholding : 2
being : 1
believe : 2
belike : 1
benches : 1
benefit : 1
bequeathing : 1
best : 2
better : 2
bid : 1
blest : 1
blood : 4
bloody : 2
blunt : 1
body : 6
bondman : 1
bones : 1
brands : 1
bring : 3
brought : 1
brutish : 1
brutus : 38
burn : 3
burst : 1
bury : 1
but : 11
by : 2
caesar : 34
caesar's : 15
can : 1
capitol : 1
captives : 1
casca : 1
cassius : 8
cause : 3
censure : 1
certain : 2
chair : 1
choose : 1
citizen : 52
citizens : 7
clamours : 1
closet : 1
coffers : 1
coffin : 1
come : 10
comes : 4
common : 1
commons : 1
commonwealth : 1
compare : 1
compel : 1
consider : 1
conspirators : 1
corpse : 2
country : 2
countrymen : 7
course : 1
cried : 1
crown : 2
crown'd : 1
cursed : 1
cut : 1
dagger : 2
daggers : 1
day : 2
dead : 3
dear : 2
dearly : 1
death : 7
deed : 1
demand : 1
depart : 2
depart,—that : 1
descend : 2
deserved : 1
did : 5
die : 2
dint : 1
dip : 1
disposed : 1
disprove : 1
do : 14
does : 1
done : 2
doors : 1
doubt : 1
down : 5
drachmas : 1
drops : 1
dumb : 1
dying : 2
ears : 1
em : 1
enforced : 1
enrolled : 1
enter : 3
entreat : 1
envious : 1
even : 1
evening : 1
ever : 2
every : 3
evil : 1
exeunt : 2
exit : 2
extenuated : 1
eyes : 1
face : 1
faithful : 1
fall : 1
far : 1
fault : 1
fear : 3
feel : 1
fell : 2
fellow : 1
fetch : 1
fill : 1
finds : 1
fire : 4
first : 15
fled : 1
flood : 1
flourish'd : 1
follow : 3
follow'd : 1
for : 27
forgot : 1
forms : 1
fortunate : 1
fortune : 2
found : 2
fourth : 11
free : 1
friend : 4
friends : 7
from : 2
full : 1
funeral : 1
gates : 1
gave : 1
general : 1
gentle : 2
give : 4
gives : 1
glories : 1
glory : 1
go : 7
gods : 1
goes : 2
good : 5
grace : 2
gracious : 1
great : 2
griefs : 1
grievous : 1
grievously : 1
had : 5
hair : 1
hand : 1
harm : 1
has : 2
hath : 7
have : 20
he : 30
hear : 21
heard : 1
hearse : 1
heart : 2
hearts : 2
heirs : 2
here : 13
here's : 1
him : 33
himself : 2
his : 33
ho : 5
holy : 1
home : 2
honour : 4
honourable : 11
house : 4
houses : 1
how : 5
i : 57
if : 12
in : 20
inflame : 1
ingratitude : 1
interred : 1
into : 4
is : 23
issue : 1
it : 25
joy : 1
judge : 2
judgment : 1
just : 1
kill : 1
kind : 1
kingly : 1
kiss : 1
knock'd : 1
know : 12
last : 1
leave : 4
left : 2
legacy : 1
lend : 1
lepidus : 1
less : 2
let : 14
let's : 1
lies : 1
like : 1
live : 5
lives : 1
living : 1
look : 2
lost : 1
love : 5
loved : 5
lover : 1
lovers : 1
loves : 1
lupercal : 1
mad : 1
made : 4
madmen : 1
make : 3
man : 8
mantle : 2
many : 1
mark : 6
mark'd : 1
marr'd : 1
masters : 2
matter : 1
may : 3
me : 25
mean : 1
meet : 1
memory : 1
men : 10
men's : 1
mention : 1
merry : 1
methinks : 1
might : 1
mighty : 1
minds : 1
mine : 2
mischief : 1
mood : 1
more : 3
moreover : 1
most : 6
mourn : 1
mourned : 1
mouths : 1
move : 1
moved : 1
much : 1
muffling : 1
murderers : 1
must : 3
mutiny : 4
my : 11
myself : 3
napkins : 1
nay : 2
need : 1
neither : 1
nervii : 1
never : 2
new-planted : 1
no : 7
noble : 9
nobler : 1
none : 4
nor : 5
not : 26
notice : 1
now : 6
numbers : 1
o : 12
o'ershot : 1
octavius : 2
of : 36
off : 1
offences : 1
offended : 4
oft : 1
on : 5
once : 1
only : 1
or : 1
orator : 1
orchards : 1
other : 1
others : 1
our : 2
out : 1
over : 1
overcame : 1
parchment : 1
pardon : 1
part : 1
parts : 1
patience : 2
patient : 2
pause : 2
peace : 7
people : 1
perceive : 1
permission : 1
piteous : 1
pity : 1
place : 4
plain : 1
please : 1
pleasures : 1
pluck : 2
pluck'd : 1
pompey's : 1
poor : 5
power : 1
praise : 1
prepare : 1
presented : 1
press : 1
private : 2
public : 3
pulpit : 2
put : 2
question : 1
quite : 1
rage : 1
ran : 2
ransoms : 1
rather : 2
read : 7
reason : 2
reasons : 3
receive : 1
recreate : 1
red : 1
refuse : 1
rejoice : 1
remember : 1
rendered : 2
rent : 1
reply : 1
resolved : 1
respect : 1
rest : 1
revenge : 2
revenged : 1
reverence : 1
rich : 1
rid : 2
right : 1
rightly : 1
ring : 2
rise : 1
roman : 2
romans : 3
rome : 8
room : 2
rose : 1
round : 1
royal : 1
rude : 1
ruffle : 1
rushing : 1
sacred : 1
sake : 3
same : 1
satisfied : 2
save : 1
saw : 1
say : 4
sayings : 1
says : 4
seal : 2
second : 14
see : 3
seek : 2
seem : 1
senses : 1
servant : 4
seventy-five : 1
several : 3
severally : 1
shall : 9
shed : 1
should : 4
shouts : 1
show : 2
side : 1
sight : 1
silence : 2
silent : 1
sir : 1
slaves : 1
slay : 1
slew : 2
so : 10
some : 3
soul : 1
souls : 1
speak : 14
speaks : 1
spectacle : 1
speech : 2
spirits : 1
spoke : 2
stab : 1
stabb'd : 2
stand : 5
statua : 1
statue : 1
stay : 6
steal : 1
steel : 1
sterner : 1
stir : 3
stones : 2
stood : 1
straight : 1
street : 1
strong : 1
stuff : 1
such : 3
sudden : 1
suffered : 1
summer's : 1
sure : 1
sweet : 2
take : 3
tears : 2
tell : 3
tending : 1
tent : 1
testament : 2
than : 6
that : 27
that's : 1
the : 72
their : 6
them : 8
then : 10
there : 9
there's : 1
therefore : 1
these : 1
they : 9
thing : 2
third : 12
this : 14
thither : 1
those : 2
thou : 5
though : 1
thrice : 2
throng : 1
through : 3
thus : 1
tiber : 1
till : 3
time : 1
tis : 3
to : 47
told : 2
tongue : 1
traitor : 1
traitors : 5
treason : 1
triumph : 1
true : 1
twas : 1
twere : 1
tyrant : 1
under : 2
unkindest : 1
unkindly : 1
unto : 2
up : 6
upon : 2
us : 9
utterance : 1
valiant : 1
valour : 1
vanquish'd : 1
vesture : 1
vile : 1
villains : 2
visit : 1
walk : 1
walks : 1
was : 18
we : 5
we'll : 11
weep : 3
weeping : 1
well : 1
well-beloved : 1
wept : 1
were : 8
what : 12
when : 6
where : 1
wherein : 2
which : 7
while : 1
whilst : 1
who : 5
whose : 2
why : 2
will : 39
wills : 1
wilt : 1
windows : 1
wisdom : 1
wise : 1
wish : 1
wit : 1
with : 20
withholds : 1
within : 1
without : 1
woful : 1
wood : 1
word : 1
words : 2
work : 1
world : 1
worse : 1
worth : 1
worthy : 1
would : 6
wound : 1
wounded : 1
wounds : 2
wrong : 8
ye : 1
yea : 1
yesterday : 1
yet : 4
you : 58
your : 8
yourselves : 2