Monday, February 20, 2012

Python from scratch- The journey continues



Three weeks ago I published my first post about how I started my journey into Python.
It was a huge hit, and it led to a second post about facing more learning and more problems on my way to becoming a Pythoner.
That was two weeks ago; I was eager to read more, study more, solve problems and eventually write about it.
All true, but when you lack time it can't happen; to study a language seriously you need time and a clear mind, which I didn't have then because of my family duties.

Last night I finally found some quiet hours and continued reading Google's Class: Chapter #5- Sorting and Chapter #6- Dictionaries and Files.
The Sorting chapter is actually a follow-up to the previous chapter, and those of you who read my previous post know that I complained that the function sorted() was not explained at all; well, the explanation turned up in this chapter.

So I read the sorting page and learnt what a tuple is, and when I came to solve the exercises I found out that I had accidentally already solved them while studying the 'lists' chapter.

From here I jumped to a new page- Dict and Files. I learned it, memorized it, did all the tasks, and then my last task was to solve the 'wordcount' exercise (see below), and guys, this exercise was very hard to understand.
I read the question a few times until I got a sense of what was needed. It got to the point where I became a bit frustrated and even thought of leaving Google's class and moving to another one.
Somehow, after wandering around some Python sites and getting help, it hit me- looking up and reading more sites is a strength I am developing thanks to the Google search engine, and it is a blessing to use Google's class.
I would like to thank the Python community out there, which is willing to help with any Python issue- just go to the Google search engine and type- “Python <your question>”


As you may guess, I eventually managed to finish the exercise, and when I compared my code to the solution provided by Google I found out that they are not that different, which is very inspiring.
So here is the question:
# Define print_words(filename) and print_top(filename) functions.
# You could write a helper utility function that reads a file
# and builds and returns a word/count dict for it.
# Then print_words() and print_top() can just call the utility function.
1. For the --count flag, implement a print_words(filename) function that counts how often each word appears in the text and prints:
word1 count1
word2 count2
...
Print the list in order sorted by word (python will sort punctuation to come before letters- that's fine).
Store all the words as lowercase, so 'The' and 'the' count as the same word.
2. For the --topcount flag implement a print_top(filename) which is similar to print_words() but which prints just the top 20 most common words sorted so the most common word is first, then the next most common and so on

Now, the question is actually a bit longer, but I shortened it for you.
What made me lose some valuable time was that nowhere does it say what the goal of this exercise is.
Only when I ran the Python script as is and got the following reply did I realize that I should type --count or --topcount on the command line, which runs the corresponding calculation over a text file of my own.
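
To make the command-line part concrete, here is a minimal sketch of how such a script might check its arguments. It assumes the script is called with a flag followed by a filename; the names parse_args and main and the exact usage text are illustrative placeholders, not taken from the exercise file.

```python
# Illustrative usage text; the real script's wording may differ.
USAGE = 'usage: ./wordcount.py {--count | --topcount} file'


def parse_args(argv):
    """Return (option, filename) for a valid command line, else None."""
    # argv[0] is the script name, so a valid call has exactly 3 items.
    if len(argv) != 3:
        return None
    option, filename = argv[1], argv[2]
    if option not in ('--count', '--topcount'):
        return None
    return option, filename


def main(argv):
    """Print the usage text on bad input; otherwise report what would run."""
    parsed = parse_args(argv)
    if parsed is None:
        print(USAGE)
        return 1
    option, filename = parsed
    # A real script would call print_words() or print_top() here.
    print('would run %s on %s' % (option, filename))
    return 0
```

Calling main(sys.argv) with no flag prints the usage line, which is roughly the kind of reply that tells you a flag and a file name are expected.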



Understanding the task is key; once you get it, everything becomes easier.
From here it took me around one hour to solve it, and here is my solution (which is not far from Google's).
It will be my pleasure to see how you can handle it better.
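
For reference, one way the structure described in the question could be sketched is shown below. This is only a minimal sketch under the assumptions stated in the exercise text (lowercase words, sort by word for --count, top 20 by count for --topcount); it is not necessarily the code from this post or Google's solution, and the helper name build_word_count is a placeholder.

```python
def build_word_count(filename):
    """Return a dict mapping each lowercase word in the file to its count."""
    counts = {}
    with open(filename, 'r') as f:
        # Reading line by line avoids loading the whole file at once.
        for line in f:
            # split() with no argument breaks on any whitespace.
            for word in line.split():
                word = word.lower()
                counts[word] = counts.get(word, 0) + 1
    return counts


def print_words(filename):
    """Print every word and its count, sorted by word."""
    counts = build_word_count(filename)
    for word in sorted(counts):
        print('%s %d' % (word, counts[word]))


def print_top(filename, limit=20):
    """Print the most common words, most frequent first."""
    counts = build_word_count(filename)
    top = sorted(counts.items(), key=lambda item: item[1], reverse=True)
    for word, count in top[:limit]:
        print('%s %d' % (word, count))
```

Keeping the counting in one helper means both printing functions differ only in how they sort the same dict.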



Click here for the next post- Python and RegEx

10 comments:

  1. not knowing how large the input file might be I might try something more like this:

    file_to_dict(filename):
    counts = {}
    for line in open(filename,'r'):
    for word in line.split(' '):
    counts[word] = counts.get(word,0)+1
    return counts

    1. I modified the code following your suggestion and got error-

      "
      for line in open(file1,'r'):
      TypeError: coercing to Unicode: need string or buffer, file found
      "

      Have no clue how to solve it.
      Anyone for rescue?

    2. The error message means file1 is a file object, but it should be a filename, aka a string.

    3. What the other anonymous commenter said. Also, yeah, I should have figured that the formatting would have been borked. Please try to imagine some indentation as appropriate.

    4. file1 is actually the filename in my code.
      So I guess the error is not about file1 needing to be a string, since it is a string.

      I do understand that my code has performance issues with large input; I just don't know how to re-code it to handle it better.

  2. FYI, there is no need to remove the newline character. Split works on whitespace characters.

    1. I tried what you suggested and it didn't help- the ' \n ' appeared.

    2. You have to do data.split(), not data.split(" "). Split with no arguments works significantly differently from split with an argument: it splits on any run of whitespace (spaces, tabs, newlines) and discards empty strings.

  3. # Split by space
    for word in data:
    line_data = data.split (" ");

    This loop is going over every CHARACTER in "data" (which is the entire file) - that's what happens when you iterate over a string. It doesn't help much because what you are doing inside the loop is running the (relatively heavy) split() over and over again on the same data.

    This is what's causing your performance issues I believe.

    -- Arik

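
The two points raised in the comments (split() versus split(" "), and what a for loop over a string actually visits) can be checked with a small sketch. The sample text and variable names here are made up for illustration.

```python
data = 'the quick\nbrown  fox\n'

# split(' ') breaks only on single spaces: newlines stay attached to
# the words and two spaces in a row produce an empty string.
by_space = data.split(' ')

# split() with no argument breaks on any run of whitespace and drops
# empty strings, so no separate newline cleanup is needed.
by_whitespace = data.split()

# Iterating over a string yields one character at a time, not words.
first_items = [item for item in data][:3]

print(by_space)       # ['the', 'quick\nbrown', '', 'fox\n']
print(by_whitespace)  # ['the', 'quick', 'brown', 'fox']
print(first_items)    # ['t', 'h', 'e']
```

So a loop like "for word in data" runs once per character, and calling split inside it repeats the same work each time.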