Wednesday, August 30, 2006
Monday, March 21, 2005
A Statistical Approach to the Spam Problem | Linux Journal
A Statistical Approach to the Spam Problem | Linux Journal: "This article discusses one of many possible mathematical foundations for a key aspect of spam filtering--generating an indicator of 'spamminess' from a collection of tokens representing the content of an e-mail."
Wednesday, February 16, 2005
jgc's spam and anti-spam newsletter #7
SESSION 1
9:00 Bill Yerazunis Unified Model of Spam Filtration (0.17)
9:20 Eugene Koontz Bayesian Phishing Classification (18.37)
9:40* Jonathan Zdziarski Bayesian Noise Reduction (39.02)
10:00* Jonathan Oliver Lexicographical Distancing (58.05)
http://web.mit.edu/webcast/spamconf05/spam-conference-21jan05-morning1-80k.ram
http://web.mit.edu/webcast/spamconf05/spam-conference-21jan05-morning1-220k.ram
SESSION 2
10:40* Richard Segal et al Classifier Aggregation (0.20)
11:00* Jim Fenton Message vs. User Authentication (19.50)
11:20 Rui Dai et al Regulation (39.50)
11:40* Oscar Boykin Personal Email Network Structure (1:00.15)
http://web.mit.edu/webcast/spamconf05/spam-conference-21jan05-morning2-80k.ram
http://web.mit.edu/webcast/spamconf05/spam-conference-21jan05-morning2-220k.ram
SESSION 3
13:40 Brian McWilliams Spam Kings (0.15)
14:00 John Graham-Cumming People and Spam (19.45)
14:20* Constance Bommelaer French Government and Spam (39.05)
14:40* Matthew Prince Project Honeypot (1:00.15)
15:00* Jon Praed Jeremy Jaynes Spam Trial (1:19.40)
http://web.mit.edu/webcast/spamconf05/spam-conference-21jan05-afternoon1-80k.ram
http://web.mit.edu/webcast/spamconf05/spam-conference-21jan05-afternoon1-220k.ram
SESSION 4
15:40 Gordon Cormack Standardized Filter Evaluation (0.20)
16:00* Dave Mazieres Mail Avenger (25.20)
http://web.mit.edu/webcast/spamconf05/spam-conference-21jan05-afternoon2-80k.ram
http://web.mit.edu/webcast/spamconf05/spam-conference-21jan05-afternoon2-220k.ram
Thursday, February 03, 2005
review questions
Today I had my project review. Some of the questions asked by my guide were:
1) On what attributes do you filter the mail? List out all the things you check in order to conclude whether a message is spam or legitimate.
2) If the mail is a literal transliteration of Tamil words written in English, what does the filter do? Will it consider it spam, or will it pass through the filter?
Ans: (I guess)
If the mail is written in English but the words represent Tamil words (i.e. a literal transliteration of Tamil sentences into English), then each token will be given a probability of 0.4 (as the tokens are all new) according to the Bayesian filter concept. Thus the mail will be considered non-spam (see the sketch after question 4 below).
3) Since the Tamil sentences are written in English, there is a chance that users will use different spellings for the same word. What happens then?
Ans: (I guess)
Only a few words are likely to be misspelled or spelled differently. In that case, while calculating the combined probability, those words are likely to be left out, and so, because of how the combined probability is calculated, the mail will still pass through the filter.
4) Which other filters is the Bayesian filter better than? Give reasons.
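As a rough check on my guess for question 2, here is a small Lisp sketch of my own (the combining formula is the one from the Plan for Spam notes further down this page): fifteen unseen tokens, each given 0.4, combine to a value far below the 0.9 spam threshold.
;; Fifteen new tokens, each assigned the neutral-ish probability 0.4.
(let ((probs (make-list 15 :initial-element 0.4)))
  ;; Combining rule: prod(p) / (prod(p) + prod(1 - p))
  (let ((prod (apply #'* probs)))
    (/ prod (+ prod (apply #'* (mapcar #'(lambda (x) (- 1 x)) probs))))))
;; => about 0.0023, well under 0.9, so the filter would accept the mail.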
--
O.R.Vaishnavi Devi
Thiagarajar College
Wednesday, February 02, 2005
Tuesday, February 01, 2005
Whitelist
A whitelist is similar to an address book in which we keep details about the known senders.
Having a whitelist in the filter has the advantage that we can accept mail from the senders on the whitelist without any filtering, thereby saving computation.
But the problem is that one sender can have several email addresses. If a known sender sends a mail from a new address, and the mail contains words that raise its probability of being considered spam, there is a chance of a false positive when it is passed through the filter.
A whitelist combined with the Bayesian filter solves this problem: since the entire header is checked, the sender's route, the protocol used and the IP address are all noted. So even though the sender uses a different address, the other fields may indicate that the message is from the known person, and the mail will be accepted without filtering.
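As a small illustration of this idea (my own sketch, not code from any actual filter; the addresses and the IP below are made up), a whitelist check could fall back on header fields such as the sending IP when the From address is new:
;; Hypothetical whitelist entries.
(defparameter *whitelisted-addresses* '("friend@example.com" "guide@example.org"))
(defparameter *whitelisted-ips* '("192.0.2.10"))

(defun whitelisted-p (from-address sending-ip)
  ;; Accept without filtering if either the From address or the sending IP
  ;; (taken from the headers) matches a whitelist entry; the IP check covers
  ;; the case where a known sender writes from a new address.
  (or (member from-address *whitelisted-addresses* :test #'string-equal)
      (member sending-ip *whitelisted-ips* :test #'string=)))

;; (whitelisted-p "friend@new-domain.example" "192.0.2.10") is true, so the
;; mail would skip the Bayesian filter even though the address is new.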
Bayesian Filter
Explanation of Bayesian Filter
Article: A Plan for Spam
There are two corpora: one (corpus1) stores the legitimate mails, i.e. the mails that the user deletes in the ordinary way go into this corpus; the other (corpus2) stores the spam messages, which are collected when the user deletes a received mail with the "delete as spam" option.
The messages in corpus1 (the good corpus) are scanned entirely and split into tokens. The whole message is scanned, including the header, JavaScript and HTML. Each token is then counted: the number of times it occurs in the whole corpus is stored in a hash table (good). Thus the hash table maps each token to the number of times it occurs in the corpus.
The same procedure is repeated on corpus2 (the spam corpus) to get another hash table (bad), which maps the tokens in corpus2 to their numbers of occurrences.
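The article does not give the counting step as code, so here is a minimal sketch of my own of how the tokens of each message could be counted into such a hash table (the tokenizer is deliberately crude and is only an assumption about how tokens might be split):
(defun tokenize (text)
  ;; Very crude tokenizer: runs of letters and digits become tokens.
  (let ((tokens '())
        (start nil))
    (loop for i from 0 to (length text)
          for ch = (if (< i (length text)) (char text i) #\Space)
          do (if (alphanumericp ch)
                 (unless start (setf start i))
                 (when start
                   (push (subseq text start i) tokens)
                   (setf start nil))))
    (nreverse tokens)))

(defun count-tokens (messages)
  ;; Returns a hash table mapping each token to its number of occurrences.
  (let ((table (make-hash-table :test 'equal)))
    (dolist (msg messages table)
      (dolist (tok (tokenize msg))
        (incf (gethash tok table 0))))))

;; Example: (count-tokens '("buy now" "buy cheap now now")) maps "buy" to 2,
;; "now" to 3 and "cheap" to 1. Run it once over the good corpus and once
;; over the spam corpus to get the good and bad tables.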
A third hash table is created, mapping each token to the probability that a mail containing it is spam.
The formula used for calculating this probability is as follows.
(let ((g (* 2 (or (gethash word good) 0)))
      (b (or (gethash word bad) 0)))
  (unless (< (+ g b) 5)
    (max .01
         (min .99 (float (/ (min 1 (/ b nbad))
                            (+ (min 1 (/ g ngood))
                               (min 1 (/ b nbad)))))))))
Explanation:
Here word is the token for which the probability is to be calculated. gethash looks up the number of occurrences of the word in the hash tables (good or bad). ngood and nbad are the total numbers of messages in corpus1 and corpus2 respectively.
The code is Lisp code. In Lisp the expression a + b is written as (+ a b).
To understand the code we first need to recall what a probability is. Suppose there are 2 red balls and 3 white balls. Then the probability of a white ball is 3/5, i.e. the number of white balls divided by the total number of balls.
Another point to note is that a probability always lies between 0 and 1.
Here, in order to reduce/avoid false positives, the code actually does two things:
1) it doubles the good counts;
2) while calculating the probability it uses the total number of messages in the corpus as the divisor, rather than the total number of words in the corpus, which is what the definition of probability would actually call for.
Thus the variable g = 2 * (the number of occurrences of the word in the good hash table, or 0 if it is absent),
and b = the number of occurrences of the word in the bad hash table, or 0 if it is absent.
One point to note is that the probability of a new word (i.e. a token that is not present in either corpus) is taken to be 0.4.
Only those words that have occurred at least 5 times when both corpora are taken together are considered. This is checked by the line (unless (< (+ g b) 5) ...), i.e. the probability is computed only when g + b >= 5. When the condition is satisfied, the following steps are performed.
1) Calculate g/ngood, compare it with 1 and take the minimum of the two.
2) Calculate b/nbad, compare it with 1 and take the minimum of the two.
The comparison with 1 is done so that the value never exceeds 1: sometimes the ratio comes out greater than 1, but a probability must lie between 0 and 1, so values above 1 are clamped to 1.
3) Add the results of step 1 and step 2.
4) Divide min(1, b/nbad) by the result of step 3.
5) Clamp the value so that it lies between .01 and .99.
For a better explanation, consider a word zzzz that strongly indicates that a message containing it is spam. Say the word has occurred 20 times and there are 10 messages in corpus2. Then b/nbad = 2, which the min clamps to 1. Suppose min(1, g/ngood) is around 0.1. Step 3 then gives 1 + 0.1 = 1.1, and step 4 gives 1/1.1, which is about 0.909. Then min(0.99, 0.909) = 0.909 and max(.01, 0.909) = 0.909, so 0.909 becomes the probability of the word zzzz.
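To double-check that arithmetic, here is the same formula wrapped in a small function of my own; the good-side numbers (one occurrence in twenty good messages) are just assumed so that g/ngood comes out to 0.1:
(defun word-probability (good-count bad-count ngood nbad)
  ;; good-count and bad-count are the raw counts from the two hash tables;
  ;; ngood and nbad are the numbers of messages in each corpus.
  (let ((g (* 2 good-count))   ; good counts are doubled, as above
        (b bad-count))
    (unless (< (+ g b) 5)      ; only tokens seen at least 5 times in total
      (max .01
           (min .99 (float (/ (min 1 (/ b nbad))
                              (+ (min 1 (/ g ngood))
                                 (min 1 (/ b nbad))))))))))

;; The zzzz example: 20 occurrences in 10 spam messages, and (assumed) one
;; occurrence in 20 good messages, so g/ngood = 0.1:
(word-probability 1 20 20 10)   ; => about 0.909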
After the probability of each token in the message under test has been calculated, the 15 most interesting tokens are taken, where a token's interestingness is how far its probability is from the neutral 0.5 (so tokens with probabilities near 0 or near 1 are the most interesting).
Then the combined probability is calculated using the formula:
(let ((prod (apply #'* probs)))
  (/ prod (+ prod (apply #'* (mapcar #'(lambda (x) (- 1 x)) probs)))))
The mail is considered spam only if this combined probability is greater than 0.9.
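Putting the last two steps together, here is a small sketch of my own of how the fifteen most interesting tokens could be picked and combined; the helper names are mine, and only the combining formula and the 0.9 threshold come from the article:
(defun most-interesting (probs &optional (n 15))
  ;; Interestingness = distance |p - 0.5| from the neutral value.
  (subseq (sort (copy-list probs) #'> :key #'(lambda (p) (abs (- p 0.5))))
          0 (min n (length probs))))

(defun combined-probability (probs)
  ;; prod(p) / (prod(p) + prod(1 - p)), as in the formula above.
  (let ((prod (apply #'* probs)))
    (/ prod (+ prod (apply #'* (mapcar #'(lambda (x) (- 1 x)) probs))))))

(defun spam-p (token-probs)
  (> (combined-probability (most-interesting token-probs)) 0.9))

;; Example: (spam-p '(0.99 0.97 0.95 0.6 0.4 0.2)) returns T because the
;; very spammy tokens dominate the product; a mail made up of mostly neutral
;; tokens scores well below 0.9 and is accepted.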
Sunday, January 30, 2005
Some Useful Terms
Spyware:
Programs that cause your computer to display ads even when you are not using the program in question for its intended purpose.
Spyware hijacks computers, secretly changing their settings, barraging them with pop-up ads and installing adware and other software that may cause the computer to malfunction, slow down or even crash.
Phishing Attacks:
The fraudulent solicitation for account information such as credit card numbers and passwords by impersonating the domain and email content of a company.