Tuesday, February 01, 2005

Bayesian Filter

Explanation of Bayesian Filter
Article : A plan for Spam

There are 2 corpus one(corpus1) which is used to store the legitimate mails ie the mails that are deleted by the user ordinarily goes to this corpus. Another one(corpus2) for storing spam messages that are obtained when the user deletes his received mail using the option of "delete as spam".

Now the messages are scanned entirely in the corpus1(good corpus) and separated as tokens. Here the whole message including the header, Java script, html are scanned. Then each token is taken and the number of time that token occur in the whole corpus is counted and stored in a hash table(good). Thus the hash table contains the mapping between the token and the number of times it occurs in the corpus.
The same procedure is repeated with the corpus2(spam corpus) to get another hash table(bad) which also contains the mapping between the tokens in the corpus2 and their number of occurances.

A third has table is created mapping the token with the probabilty that the mail containing it is a spam.
The fomula used for calculating the probability is as follows.

(let((g(*2(or gethash word good) 0)))
(b(or (gethash word bad) 0)))
(unless (> ( + g b) 5)
(max 0.1 (min .99 (float (/ (min 1(/ b nbad)) ( + (min 1 (/ g ngood)) ( min 1 ( / b nbad)))))))

Explanation:
Here word is the token for which the probability is to be calculated. Gethash means getting the number of occurance of the word from the hash tables ( good or bad). ngood and nbad is the total number of messages in the corpus1 and corpus2 respectively.

The code here is a LISP code. In Lisp the expression (a+b) is written as (+ a b).

To understand the code we need to know first about the probability. Suppose there are 2 red,3 white balls. Then the probability of white balls is 3/5 ie number of white balls divided by total number of balls present.
And also another point to be noted is the probability always lies between 0 and 1.

Here in our code in order to reduce/avoid false positive we actually do two things
1) double the good words.
2) then while calculating the probability we consider the divisor to be total number of messages in the corpus rather than total number of words in the corpos which is the actual thing we have to consider by the definition of probability.


Thus the variable g = 2 * (the number of occurance of the word in the good hash table if present or 0).

And b = the number of occurance of the word in the good hash table if present or 0.

One point to be considered is that the probability of the new word(ie the token that is not present in any of the corpus) is considered as 0.4.

We consider only those words that have occured more than 5 times when both the corpuses are taken together. This is checked using the line of code (> (+ g b) 5) which is similar to g+b>5. If the condition satisfies then the following steps are performed.

1)Then calculate g/ngood and compare it with 1 and get the minimum of both.
2)Calculate b/nbad and compare it with 1 and get the minimum of both.

The comparsion with 1 is done in order to have a maximum of only 1 ie some times it so happens that the value of the ratio might be > 1 but the probabilty can be from 0 to 1 thus the values > 1 are rounded of to 1 to satisfy the probability condition.

3) Add the results of step 1 nad step 2.
4) divide the b/nbad by taking the result of step 3 as denominator.
5) Adjust the value so that it should be betwwen 0.1 to 0.99.


For better explnation consider the word zzzz that surely indicates the msg containg it is spam. The word has occured 20 times and there are 10 msg in the corpus2. The b/nbad becomes 2. And suppose g/ngood is around 0.1 (suppose) then applying step 3 will result in value like 2.10. Step 4 will then be 2/2.10 which will surely be around 1 , in our case it is 0.9523. Now the min(0.99,0.9523) is taken. here it is 0.9523. And then max(0.9523,0.1) is taken .The result is 0.9523 which becomes the probabilty of the word zzzz.

After calculating the probability of each token from the message to be tested as spam, the most interesting 15 tokens are taken based on how far they are from the neutral 0.5.(I dont know what they mean by this ).

Then the combined probability is calculated using the formula(This formula seems to be difficult to understand):
(let((prod(apply # ' * probs)))
(/ prod ( + prod ( apply # ' * (mapcar # ' (lambda(x) (-1 x)) probs ))))).

The mail is considered as spam only if this combined probability results in a value greater than 0.9.



0 Comments:

Post a Comment

<< Home