Today we will see how my learning Python from scratch is going: Sorting, Dictionaries, and Files.
But before that, let's go back three weeks.
Python from scratch so far
Three weeks ago I published my first post about how I started my journey into Python.
It was a huge hit, and it led to a second post about me facing more learning,
and more problems, on my way to becoming a Python developer.
That was two weeks ago.
I was so eager to read more, study more, solve exercises, and eventually write about it.
All of that is true, but when you lack time it just can't happen.
To study a language seriously you need time and a clear mind.
As you can imagine, I didn't have either at the time, since I have my family duties.
Python from scratch – Sort
Last night I finally found some quiet hours and continued reading Google's class: Chapter #5, Sorting, and Chapter #6, Dictionaries and Files.
The Sorting chapter was actually a follow-up to the previous chapter.
Those of you who read my previous post know that I was complaining that the sorted() function was not explained at all.
Well, that function is covered in this chapter.
So I read the sorting page and learned what a tuple is.
And when I came to solve some exercises, I found out that I had accidentally already solved them while studying the 'lists' chapter.
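Since that chapter leans heavily on sorted() and tuples, here is a tiny illustration of the two points that mattered most to me (the example values are made up):

```python
# sorted() returns a new, sorted list and leaves the original untouched.
words = ['banana', 'Cherry', 'apple']
print(sorted(words))                  # default: uppercase sorts before lowercase
print(sorted(words, key=str.lower))   # key= makes the sort case-insensitive

# Tuples compare element by element, which is handy for (count, word) pairs.
pairs = [(2, 'b'), (1, 'c'), (2, 'a')]
print(sorted(pairs))  # sorted by count first, then by word
```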
Python from scratch – Dictionaries and files
From here I jumped to the new page: Dictionaries and Files.
I learned it, memorized it, and did all the required tasks.
Only to find out that my last task was to solve the exercise 'wordcount' (see below).
And guys, this exercise was very hard to understand.
I read the question a few times until I got a sense of what was needed.
It got to a point where I became a bit frustrated.
I even thought of leaving the Google class and moving to another one.
Somehow, after wandering through some Python sites and getting help, it hit me.
The fact that I am searching and reading more sites is a strength I am developing, thanks to the Google search engine.
It is also a blessing to use Google's class.
I would like to thank the Python community out there, which is willing to help with any Python issue.
Just go to the Google search engine and type: "Python <your question>"
As you may have guessed, I was eventually able to finish the exercise.
When I compared my code to the solution provided by Google, I found out that they are not that different, which is very inspiring.
So here is the question:
# Define print_words(filename) and print_top(filename) functions.
# You could write a helper utility function that reads a file
# and builds and returns a word/count dict for it.
# Then print_words() and print_top() can just call the utility function.
1. For the --count flag, implement a print_words(filename) function that counts how often each word appears in the text and prints:
word1 count1
word2 count2
...
Print the list in order sorted by word (python will sort punctuation to come before letters- that's fine).
Store all the words as lowercase, so 'The' and 'the' count as the same word.
2. For the --topcount flag, implement a print_top(filename) which is similar to print_words() but prints just the top 20 most common words, sorted so the most common word is first, then the next most common, and so on.
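To make the requirements concrete, here is a minimal sketch of the shape such a solution can take. build_counts is my own hypothetical name for the helper utility the exercise suggests; this is one way to do it, not Google's official answer:

```python
def build_counts(filename):
    """Read a file and return a dict mapping lowercase word -> count."""
    counts = {}
    with open(filename) as f:
        for line in f:
            for word in line.lower().split():
                counts[word] = counts.get(word, 0) + 1
    return counts

def print_words(filename):
    # --count: print every word with its count, sorted by word.
    counts = build_counts(filename)
    for word in sorted(counts):
        print(word, counts[word])

def print_top(filename):
    # --topcount: print the 20 most common words, most common first.
    counts = build_counts(filename)
    top = sorted(counts.items(), key=lambda pair: pair[1], reverse=True)
    for word, count in top[:20]:
        print(word, count)
```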
Now, the full question is a bit longer, but I shortened it for you.
The point that made me lose some valuable time was the fact that nowhere is it stated what the target of this exercise is.
Only when I ran the Python command as-is and got the following reply did I realize that I should type --count or --topcount on the command line,
which activates some calculation over a text file of my own.
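For anyone as confused as I was: the starter file dispatches on that flag roughly like the sketch below. The print_words and print_top bodies here are stand-in stubs for the real functions the exercise asks you to write:

```python
import sys

# Hypothetical stand-ins for the functions the exercise asks you to implement.
def print_words(filename):
    print('would print all word counts for', filename)

def print_top(filename):
    print('would print the top 20 words for', filename)

def main():
    # The script expects exactly one flag and one filename.
    if len(sys.argv) != 3:
        print('usage: ./wordcount.py {--count | --topcount} file')
        sys.exit(1)
    option, filename = sys.argv[1], sys.argv[2]
    if option == '--count':
        print_words(filename)
    elif option == '--topcount':
        print_top(filename)
    else:
        print('unknown option: ' + option)
        sys.exit(1)

# In the real script this runs as: if __name__ == '__main__': main()
```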
Some thinking
Understanding the task is a must.
Once you get it, everything becomes easier.
From there it took me around an hour to solve the exercise.
So, here is my solution (which is not far from Google's).
It would be my pleasure to see how you would handle it better.
Click here for the next post: Python and RegEx
Not knowing how large the input file might be, I might try something more like this:

def file_to_dict(filename):
    counts = {}
    for line in open(filename, 'r'):
        for word in line.split():
            counts[word] = counts.get(word, 0) + 1
    return counts
FYI, there is no need to remove the newline character; split() works on whitespace characters.
I modified the code following your suggestion and got an error:
"
for line in open(file1,'r'):
TypeError: coercing to Unicode: need string or buffer, file found
"
I have no clue how to solve it.
Anyone to the rescue?
I tried what you suggested and it didn't help; the '\n' still appeared.
The error message means file1 is a file object, but it should be a filename, aka a string.
You have to do data.split(), not data.split(" "). Split with no arguments works significantly differently from split with an argument: it splits on runs of any whitespace characters and also gets rid of empty strings.
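A quick example of the difference:

```python
data = 'one  two\nthree '
print(data.split(' '))  # ['one', '', 'two\nthree', ''] - empties and the newline survive
print(data.split())     # ['one', 'two', 'three'] - any whitespace, no empty strings
```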
What the other anonymous commenter said. Also, yeah, I should have figured that the formatting would have been borked. Please try to imagine some indentation as appropriate.
file1 is actually filename in my code.
So I guess the error is not related to file1 not being a string, since it is a string.
I do understand that my code has performance issues with large input; I just don't know how to re-code it to handle that better.
Very good.
Thanks.
# Split by space
for word in data:
    line_data = data.split (" ");
This loop is going over every CHARACTER in "data" (which is the entire file) – that's what happens when you iterate over a string. It doesn't help much because what you are doing inside the loop is running the (relatively heavy) split() over and over again on the same data.
This is what's causing your performance issues, I believe.
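To see it concretely: iterating over a string visits characters, so the fix is to split the text once, outside the loop, and iterate over the resulting words.

```python
data = 'to be or not to be'
# Iterating over a string yields single characters, not words:
print([ch for ch in data][:4])  # ['t', 'o', ' ', 'b']
# Split once, then loop over the words it produced:
for word in data.split():
    print(word)
```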
— Arik