Abstract:
Recently, spammers have proliferated "image spam": emails that carry the text of the spam message in a human-readable image rather than in the message body, making detection by conventional content filters difficult. New techniques are needed to filter these messages. Our goal is to classify an image directly as spam or ham. We present features that focus on simple properties of the image, making classification as fast as possible. Our evaluation shows that they classify spam images with accuracy in excess of 90%, and up to 99%, on real-world data. Furthermore, we introduce a new feature selection algorithm that selects features for classification based on their speed as well as their predictive power. This technique produces an accurate system that runs in a tiny fraction of the time. Finally, we introduce Just-in-Time (JIT) feature extraction, which creates features at classification time as the classifier needs them. We demonstrate JIT extraction using a JIT decision tree that further increases system speed. This paper makes image spam classification practical by providing both high-accuracy features and a method for learning fast classifiers.
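To illustrate the JIT idea (a minimal sketch, not the paper's implementation), a JIT decision tree can attach a feature *extractor* to each internal node and compute a feature only when the classification path first reaches a node that tests it; all names below are hypothetical:

```python
class Node:
    """A decision-tree node; internal nodes hold (name, extractor) pairs."""
    def __init__(self, label=None, feature=None, threshold=None,
                 left=None, right=None):
        self.label, self.feature, self.threshold = label, feature, threshold
        self.left, self.right = left, right
        self.is_leaf = label is not None

def classify_jit(root, image):
    """Walk the tree, extracting each feature only when a node first needs it."""
    cache, node = {}, root
    while not node.is_leaf:
        name, extractor = node.feature
        if name not in cache:              # just-in-time feature extraction
            cache[name] = extractor(image)
        node = node.left if cache[name] <= node.threshold else node.right
    return node.label, cache               # cache shows which features were computed
```

Features on branches the path never visits are never extracted, which is where the speedup comes from.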
Abstract:
In contextual computing, where cues beyond direct user input are used to trigger computation, one of the most daunting challenges is inferring what the user is doing. For the domain of task management, we have developed a new approach to reducing the ambiguity of user actions for intelligent systems. We introduce a construct we call an Activity, designed to reduce this ambiguity by providing a meaningful structure for task information that assists users with their work. We present ethnographic research and prototype evaluations to assess the value of the Activity construct from an end-user's perspective. Our findings suggest that the Activity structure is useful to people and therefore could be exploited for inference.
Abstract:
Near-duplicate detection is not only an important pre- and post-processing task in Information Retrieval but also an effective spam-detection technique. Among different approaches to near-duplicate detection, methods based on document signatures are particularly attractive due to their scalability to massive document collections and their ability to handle high throughput rates. Their weakness lies in the potential brittleness of signatures to small changes in content, which makes them vulnerable to various types of noise. In the important spam-filtering application, this vulnerability can also be exploited by dedicated attackers aiming to maximally fragment signatures corresponding to the same email campaign. We focus on the I-Match algorithm and present a method of strengthening it by considering the usage context when deciding which portions of a document should affect signature generation. This substantially (almost 100-fold in some cases) increases the difficulty of dedicated attacks and provides effective protection against document noise in non-adversarial settings. Our analysis is supported by experiments using a real email collection.
Abstract:
This paper analyzes trends seen in phishing attacks throughout 2006 based on real-world data obtained through Symantec's phishing data collection fabric. We examine both the prevalence and breakdown of phishing web sites as well as the frequency and breakdown of phishing emails. Beyond just the extent of data collected, our study differs from previously published studies in this area in two regards: first, we discuss the data collection methodology (together with its limitations and biases) so that readers are better positioned to place the results in the appropriate context; second, we perform a fine-grained analysis considering seasonal and day-of-week effects, geographic distinctions, brand segmentations, and geographic/population targets. We found a number of intriguing properties of phishing attacks. These include seasonal and day-of-week fluctuations in activity and fluctuations related to which brands are being spoofed. We also determined the industries, regions, languages, and population segments that appear to be targeted in these attacks.
Abstract:
A new trend in email spam is the emergence of image spam. Although current anti-spam technologies are quite successful in filtering text-based spam emails, the new image spams are substantially more difficult to detect, as they employ a variety of image creation and randomization algorithms. Spam image creation algorithms are designed to defeat well-known vision algorithms such as optical character recognition (OCR), whereas randomization techniques ensure the uniqueness of each image. We observe that image spam is often sent in batches that consist of visually similar images that differ only due to the application of randomization algorithms. Based on this observation, we propose an image spam detection system that uses near-duplicate detection to detect spam images. We rely on traditional anti-spam methods to detect a subset of spam images and then use multiple image spam filters to detect all the spam images that "look" like the spam caught by traditional methods. We have implemented a prototype system that achieves a high detection rate with a false positive rate below 0.001%.
Abstract:
In this paper, we propose a new asymmetric boosting method, Boosting with Different Costs. Traditional boosting methods assume the same cost for misclassified instances from different classes, and in this way focus on good performance with respect to overall accuracy. Our method is more generic, and is designed to be more suitable for problems where the major concern is a low false positive (or negative) rate, such as spam filtering. Experimental results on a large scale email spam data set demonstrate the superiority of our method over state-of-the-art techniques.
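A minimal sketch of the general idea behind cost-asymmetric boosting (illustrative only: a generic AdaBoost-style variant with class-dependent initial weights over decision stumps, not the paper's Boosting with Different Costs algorithm; all names and cost values are assumptions):

```python
import math

def stump_predict(x, feat, thresh, sign):
    # A decision stump: predicts `sign` when x[feat] > thresh, else -sign.
    return sign if x[feat] > thresh else -sign

def train_asymmetric_boost(X, y, cost_pos, cost_neg, rounds=10):
    """Boost decision stumps; misclassifying the costly class is penalized more
    by seeding the example weights with per-class costs."""
    n = len(X)
    w = [cost_pos if yi == 1 else cost_neg for yi in y]
    s = sum(w); w = [wi / s for wi in w]
    ensemble = []
    for _ in range(rounds):
        best = None
        # Exhaustively pick the stump with lowest weighted error.
        for feat in range(len(X[0])):
            for thresh in sorted(set(x[feat] for x in X)):
                for sign in (1, -1):
                    err = sum(wi for xi, yi, wi in zip(X, y, w)
                              if stump_predict(xi, feat, thresh, sign) != yi)
                    if best is None or err < best[0]:
                        best = (err, feat, thresh, sign)
        err, feat, thresh, sign = best
        err = min(max(err, 1e-10), 1 - 1e-10)     # avoid log(0)
        alpha = 0.5 * math.log((1 - err) / err)
        ensemble.append((alpha, feat, thresh, sign))
        # Standard exponential reweighting; the asymmetric seed keeps the
        # costly class heavier throughout.
        w = [wi * math.exp(-alpha * yi * stump_predict(xi, feat, thresh, sign))
             for xi, yi, wi in zip(X, y, w)]
        s = sum(w); w = [wi / s for wi in w]
    return ensemble

def predict(ensemble, x):
    score = sum(a * stump_predict(x, f, t, s) for a, f, t, s in ensemble)
    return 1 if score > 0 else -1
```

Setting `cost_neg` higher than `cost_pos` biases the ensemble toward a low false positive rate, the regime the abstract describes for spam filtering.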
Abstract:
Email has become an integral and sometimes overwhelming part of users' personal and professional lives. In this paper, we measure the flow and frequency of user email toward the identification of communities of interest (COI)--groups of users that share common goals or associations. If detectable, such associations will be useful in automating email management, e.g., topical classification, flagging important missives, and spam mitigation. An analysis of a large corpus of university email is used to drive the generation and validation of algorithms for automatically determining COIs. We examine how the structure and transience of COIs affect the algorithms and validate the algorithms using user-labeled data. Our analysis shows that the proposed algorithms correctly identify email as being sent from the human-identified COI with high accuracy. The structure and characteristics of COIs are explored analytically and broader conclusions about email use are posited.
Abstract:
Active learning methods seek to reduce the number of labeled examples needed to train an effective classifier, and have natural appeal in spam filtering applications where trustworthy labels for messages may be costly to acquire. Past investigations of active learning in spam filtering have focused on the pool-based scenario, where there is assumed to be a large, unlabeled data set and the goal is to iteratively identify the best subset of examples for which to request labels. However, even with optimizations this is a costly approach. We investigate an online active learning scenario where the filter is exposed to a stream of messages which must be classified one at a time. The filter may only request a label for a given message immediately after it has been classified. The goal is to achieve strong online classification performance with few label requests. This is a novel scenario for low-cost active spam filtering, fitting for application in large-scale systems. We draw from the label efficient machine learning literature to investigate several approaches to selective sampling in this scenario using linear classifiers. We show that online active learning can dramatically reduce labeling and training costs while maintaining high levels of classification performance with negligible additional computational overhead.
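One common selective-sampling rule from the label-efficient literature queries a label with probability b / (b + |margin|), so uncertain messages are labeled more often; the sketch below pairs it with a perceptron purely as an illustration (the paper evaluates several approaches, and parameter names here are assumptions):

```python
import random

def online_active_perceptron(stream, dim, b=1.0, lr=1.0, seed=0):
    """Classify a message stream one at a time, requesting a label for each
    message immediately after classification with probability b/(b+|margin|)."""
    rng = random.Random(seed)
    w = [0.0] * dim
    queries = 0
    for x, y in stream:                     # y in {-1, +1}, revealed only on request
        margin = sum(wi * xi for wi, xi in zip(w, x))
        # Small-margin (uncertain) messages are queried with high probability;
        # confident ones are rarely queried, saving labeling cost.
        if rng.random() < b / (b + abs(margin)):
            queries += 1
            if margin * y <= 0:             # mistake-driven perceptron update
                w = [wi + lr * y * xi for wi, xi in zip(w, x)]
    return w, queries
```

The parameter `b` trades label budget against accuracy: larger `b` queries more labels and converges faster.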
Abstract:
The growing popularity of IP telephony systems has made them attractive targets for spammers. Voice call spam, also known as Spam over Internet Telephony (SPIT), is potentially a more serious problem because of the real-time processing requirements of voice packets. We explore a novel mechanism that uses the duration of calls between users to combat SPIT. Our proposed scheme, CallRank, uses call duration to establish social network linkages and global reputations for callers, based on which call recipients can decide whether a caller is legitimate. CallRank has been implemented within a VoIP system simulation, and our results show that we are able to achieve a false negative rate of 10% and a false positive rate of 3% even in the presence of a significant fraction of spammers.
Abstract:
Web spam research has been hampered by a lack of statistically significant collections. In this paper, we perform the first large-scale characterization of web spam using content and HTTP session analysis techniques on the Webb Spam Corpus -- a collection of about 350,000 web spam pages. Our content analysis results are consistent with the hypothesis that web spam pages are different from normal web pages, showing far more duplication of physical content and URL redirections. An analysis of session information collected during the crawling of the Webb Spam Corpus shows significant concentration of hosting IP addresses in two narrow ranges as well as significant overlaps among session header values. These findings suggest that content and HTTP session analysis may contribute a great deal towards future efforts to automatically distinguish web spam pages from normal web pages.
Abstract:
We propose a discriminative classifier learning approach to image modeling for spam image identification. We analyze a large number of images extracted from the SpamArchive spam corpora and identify four key spam image properties: color moment, color heterogeneity, conspicuousness, and self-similarity. These properties emerge from a large variety of spam images and are more robust than simply using visual content to model images. We apply multi-class characterization to model images sent with emails. A maximal figure-of-merit (MFoM) learning algorithm is then proposed to design classifiers for spam image identification. Experimental results on about 240 spam and legitimate images show that multi-class characterization is more suitable than single-class characterization for spam image identification. Our proposed framework classifies 81.5% of spam images correctly and misclassifies only 5.6% of legitimate images. We also demonstrate the generalization capabilities of our proposed framework on the TREC 2005 email corpus. Our results show that the technique operates robustly, even when the images in the testing set are very different from the training images.
Abstract:
Many of the first successful anti-spam filters were personalized classifiers trained on an individual user's spam and ham e-mail. Proponents of personalized filters argue that statistical text learning is effective because it can identify the unique aspects of each individual's e-mail. On the other hand, a single classifier learned for a large population of users can leverage the data provided by hundreds or even thousands of individual users. This paper investigates the tradeoff between globally and personally trained anti-spam classifiers. We find that globally-trained text classification easily outperforms personally-trained classification under realistic settings. This result does not mean that personalization is not valuable. We show that the two techniques can be combined to produce a modest improvement in overall performance.
Abstract:
Email protocols were not designed to provide protection against falsification of a message’s address of origin, referred to as "spoofing". DomainKeys Identified Mail (DKIM) defines a mechanism for using digital signatures on email at the domain level, allowing the receiving domain to confirm that mail came from the domain it claims to. Using the associated DKIM sender signing policy specification, the receiving domain may also have more information for deciding how to treat mail without a valid signature. The use of DKIM signatures and signing policies gives sending domains one tool to help recipients identify legitimate messages from their domain, and a reliable identifier that can be used to combat spam and phishing.
Abstract:
The massive increase of spam is posing a very serious threat to email, which has become an important means of communication. Not only does it annoy users, it also consumes a lot of the Internet's bandwidth. Most existing spam filters are based in one way or another on the content of email. While these anti-spam tools have proven very useful, they do not prevent bandwidth from being wasted, and spammers are learning to bypass them via clever manipulation of spam content. A very different approach to spam detection is based on the behavior of email senders. In this paper, we propose a learning approach to spam sender detection based on features extracted from social networks constructed from email exchange logs. Legitimacy scores are assigned to senders based on their likelihood of being a legitimate sender. Also, various spam filtering and resisting possibilities are explored.
Abstract:
Social Networking Services (SNS), such as MySpace and Facebook, are increasing in popularity. They encourage and enable users to communicate with previously unknown people on an unprecedented scale. Now our increased social sphere is requiring us to potentially distinguish more legitimate strangers than illegitimate. We look to machines to buffer and mediate the new communications landscape. Automatically rejecting 'unwanted' messages has long been the function of spam filtering; however, as we move from Viagra ads in e-mail to Betty the saleswoman in SNS, identifying a nuisance requires placing value judgments. Nuisances take the shape of legitimate social humans and commercial entities, and sometimes both: is Britney Spears spam? We seek to redefine spam in the context of SNS to facilitate the evaluation of people in addition to machines. We have developed a prototype that categorizes senders into broader categories using features unique to SNS in order to facilitate the demands of larger and more obtrusive social networks.
Abstract:
Email is a key communication tool in collaborative workgroups. In this paper, we investigate how team leadership roles can be inferred from a collection of email messages exchanged among team members. This task can be useful for monitoring a group leader's performance, as well as for studying other aspects of workgroup dynamics. Using a large email collection with several workgroups whose leaders were previously identified, we demonstrate that leadership positions can be predicted by a combination of traffic-based and text-based email patterns. Traffic-based patterns consist of information that can be extracted from message headers, such as frequency counts, message thread position, and whether or not the message was broadcast to the entire workgroup. Textual patterns are represented by the message's "email speech acts", i.e., semantic information about the sender's intent that can be automatically inferred from language usage. Using off-the-shelf learning algorithms, we obtained 96% accuracy and an 88.2% F-measure in predicting leadership roles in 34 email-centered workgroups.
Abstract:
To evade blacklisting, the vast majority of spam email is sent from exploited MTAs (i.e., botnets) and with forged ``From'' addresses. In response, the anti-spam community has developed a number of domain-based authentication systems -- such as SPF and DomainKeys -- to validate the binding between individual domain names and legitimate mail sources for those domains. In this paper, we explore an alternative solution in which the mail recipient requests a real-time affirmation for each e-mail from the declared sender's MX of record. The {\em Occam} protocol is trivial to implement, offers authenticating power equivalent to SPF and DomainKeys and, most importantly, forces spammers to deploy and expose blacklistable servers for each domain they use during a campaign. We discuss the details of the protocol, compare its strengths and weaknesses with existing solutions, and describe a prototype implementation in Sendmail.
Abstract:
We show how a game-theoretic model of spam e-mailing, which we had introduced in previous work, can be extended to include the possibility of employing Human Interactive Proofs (HIPs) in conjunction with filters that classify incoming messages as legitimate or spam. Using our extended model, we show that making HIPs widely available to e-mail users will reduce the volume of spam on the Internet and increase the benefit that legitimate users obtain from e-mail.
Abstract:
We address the problem of recognizing the so-called \textit{image spam}, a rapidly spreading kind of spam which consists of embedding the text message into attached images to defeat spam filtering techniques based on the analysis of e-mail's body text. We propose an approach based on low-level image processing techniques aimed at detecting one of the specific characteristics of image spam, namely the use of content obscuring techniques to defeat OCR tools. An implementation of this approach is described, aimed at detecting content obscuring techniques whose consequence is to compromise OCR effectiveness through character breaking, or through the presence of background noise interfering with characters. An experimental evaluation of our approach is reported on a publicly available personal data set of spam images.
Abstract:
In this paper we evaluate the performance of the highest probability SVM nearest neighbor (HP-SVM-NN) classifier, which combines the ideas of the SVM and k-NN classifiers, on the task of spam filtering, using the pure SVM classifier as a quality baseline. To classify a sample, the HP-SVM-NN classifier does the following: for each k in a predefined set {k_1, ..., k_N} it trains an SVM model on the k nearest labeled samples, uses this model to classify the given sample, and transforms the output of the SVM into posterior probabilities of the two classes using a sigmoid approximation; then it selects the one of the 2 * N resulting answers that has the highest probability. The experimental evaluation shows that in terms of ROC curves the algorithm is able to achieve higher accuracy than the pure SVM classifier.
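The selection step above can be sketched structurally as follows. For self-containment, a tiny logistic-regression trainer (whose score is already passed through a sigmoid) stands in for the SVM-plus-sigmoid-approximation of the paper, and the k values are illustrative assumptions:

```python
import math

def train_logreg(X, y, epochs=200, lr=0.5):
    # Stand-in local model for the per-k SVM; y in {-1, +1}.
    w = [0.0] * len(X[0]); b = 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            z = yi * (sum(wj * xj for wj, xj in zip(w, xi)) + b)
            g = -yi / (1.0 + math.exp(z))            # log-loss gradient factor
            w = [wj - lr * g * xj for wj, xj in zip(w, xi)]
            b -= lr * g
    return w, b

def hp_nn_classify(x, X, y, ks=(3, 5)):
    """For each k, train a local model on the k nearest labeled samples,
    collect (posterior, class) pairs, and return the class of the highest one."""
    order = sorted(range(len(X)),
                   key=lambda i: sum((a - c) ** 2 for a, c in zip(X[i], x)))
    answers = []
    for k in ks:
        idx = order[:k]
        w, b = train_logreg([X[i] for i in idx], [y[i] for i in idx])
        score = sum(wj * xj for wj, xj in zip(w, x)) + b
        p_pos = 1.0 / (1.0 + math.exp(-score))       # sigmoid posterior
        answers.append((p_pos, +1))
        answers.append((1.0 - p_pos, -1))
    # 2 * N candidate answers; pick the one with the highest probability.
    return max(answers)[1]
```

The key design point is that each neighborhood size k yields its own local model and posterior, and the final label comes from whichever of the 2 * N (probability, class) candidates is most confident.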
Abstract:
The existing tools for testing spam filters evaluate a filter instance by simply feeding it a stream of emails, possibly also providing feedback to the filter about the correctness of the detection. In such a scenario the evaluated filter is disconnected from the network of email servers, filters, and users, which makes the approach inappropriate for testing many of the filters that exploit information about spam bulkiness, users' actions, and social relations among users. The corresponding evaluation results might be wrong, because the information that is normally used by the filter is missing, incomplete, or inappropriate. In this paper we present a tool for testing spam filters in a very realistic scenario. Our tool consists of a set of Python scripts for a Unix/Linux environment. The tool takes as inputs the filter to be tested and an affordable set of interconnected machines (e.g., PlanetLab machines, or locally created virtual machines). When started from a central place, the tool uses the provided machines to build a network of real email servers, installs instances of the filter, deploys and runs simulated email users and spammers, and computes the detection result statistics. Email servers are implemented using Postfix, a standard Linux email server. Only per-email-server filters are currently supported, whereas testing per-email-client filters would require additional tool development. The size of the created emailing network is constrained only by the number of available PlanetLab or virtual machines. The run time is much shorter than the simulated system time, due to a time scaling mechanism. Testing a new filter is as simple as installing one copy of it in a real emailing network, which unifies the jobs of new filter development, testing, and prototyping. As a usage example, we test the SpamAssassin filter.
Abstract:
It is common to think of email as a one-to-one communication medium, but at the ISP level, many email flows are mailing-lists (one-to-many) or forwarded traffic (many-to-one). Some anti-spam systems have foundered on misapprehensions as to the nature and importance of these flows. However, although understanding has grown, there are no quantitative studies in the literature as to the relative importance of these different types of email flow. This brief study is a snapshot of the types of email that can be distinguished amongst the 331 million items that arrived at a medium-sized ISP in March 2007, and is intended to provoke the publication of further data, to better illuminate the relative importance of different types of email.
Abstract:
Current mail server architectures spawn a new process upon every new connection received. The new process handles the mail from accepting the 'Helo' information until the end of the connection. While forking a new process for each separate connection has many advantages in terms of security and modularity, this architecture has severe problems in view of increasing unsolicited email (spam) and rogue connections. With spammers guessing users' email IDs, the number of emails that bounce off a mail server is increasing. For each such email, the mail server spawns a process that wastes its resources. In this paper we propose a new architecture for mail servers that keeps all the advantages the process architecture has for receiving mail, but at the same time does not waste server resources on bounced emails or rogue connections. Essentially, we do not fork a new process until we are sure that the mail will not bounce. We present a detailed evaluation of our scheme and show that the new architecture uses server resources efficiently.
Abstract:
We address the problem of gray mail -- messages that people could disagree on whether they are good or spam. We propose three simple methods for detecting gray mail and compare their performance using recall-precision curves. Given an imperfect gray mail detector, we show how it can help improve the spam filter consistently over the regular framework.
Abstract:
Blogs are becoming an increasingly popular target for spammers. The existence of multiple vectors for spam injection, the potential of reaching many eyeballs with a single spam, and the limited deployment of anti-spam technologies have led to a sustained increase in the volume and sophistication of attacks. This paper reviews the current state of spam in the blogosphere at large and in particular as seen at TypePad, a major hosted blog service. Furthermore, the effectiveness of two popular open-source email anti-spam programs at classifying blog comment spam is evaluated.