Find the most occuring word given some text?

0 ⤊

Find the most occuring word given some text?

2.You are given the following problem:
Find within a text (given as a text file) the most occurrent word.

a.To this end, design an algorithm that takes as input a text, and prints the most occurrent word, along with its number of occurrences.

How would I go about solving this? I was considering a hash table using chaining and a counter but it might get complex and inefficient. I was also considering using some sort of binary search tree. Anyhow, I am not looking for a solution but rather some opinions and ideas. Thanks

2007-01-21 16:48:43 · 3 answers · asked by procomp9 1 in Computers & Internet ➔ Programming & Design

3 answers

I would make two arrays with a common key. As I read the sentence, I compare the words with those already stored in the array. If so, I increase the count of the corresponding array. For ex., take a simple five word sentence.

word word2 word3 word wor2

As I read through the words, following would be my sequence.

read "word"
word_array[0]="word";count_array[0]=1
read "word2"
word_array[1]="word2";count_array[1]=1
read "word3"
word_array[2]="word3";count_array[2]=1
read "word"
exists; count_array[0]=2
read "word2"
exists: count_array[1]=2

so on..

2007-01-21 17:02:29 · answer #1 · answered by jaggie_c 4 · 0⤊ 0⤋

Using VB you can use a collection of integers for totalizing. The nice thing about collections is that each member of the collection can be identified by either an index (like an array) or with a unique key string.

So if you use string keys made up of the extracted words from your text you will be able to increment the integer item every time a matching word is extracted.

All you need to do is test to see if the extracted word key is used in the collection, if it isn't add a member to the collection and add 1 to the item. If it already exists then just add one to the item.

You will also get a count of the members in the collection which will correspond to the number of unique words in your text with this method.

The text will have to have some prepossessing before being used as a key. I would trim away any white space from each word then convert all letters to uppercase. Other than misspellings this will ensure that the same word is not counted two or more different ways due to capitalization's of letters.

2007-01-22 01:26:29 · answer #2 · answered by MarkG 7 · 0⤊ 0⤋

I kind of like a hash. Think of it this way, it does not matter how large the text is, but only how many different words there are. I believe they say an English newspaper vocabulary is about 30k words, and a very educated person knows about 80-100k. So, somewhere around 1MB of memory would be sufficient, and this is not unreasonable.

The other would work as well, but binary trees can get pretty uneven. This might not be a problem, or if it is you could always balance them.

Are you given any other requirements? That is, should you be thinking in terms of optimizing for space, speed, efficiency, portability...

But my first thought would be to go with the hash. It is easiest, workable, and should not be too inefficient.

2007-01-22 01:00:59 · answer #3 · answered by sofarsogood 5 · 0⤊ 0⤋