Statistical Methods for Fast Anomaly Detection


Material Information

Statistical Methods for Fast Anomaly Detection
Physical Description:
1 online resource (136 p.)
Wu, Mingxi
University of Florida
Place of Publication:
Gainesville, Fla.
Publication Date:

Thesis/Dissertation Information

Doctorate (Ph.D.)
Degree Grantor:
University of Florida
Degree Disciplines:
Computer Engineering, Computer and Information Science and Engineering
Committee Chair:
Jermaine, Christophe
Committee Members:
Kahveci, Tamer
Dobra, Alin
Ranka, Sanjay
Pardalos, Panagote M.


Subjects / Keywords:
anomaly, extreme, outlier, randomized, sampling, spatial
Computer and Information Science and Engineering -- Dissertations, Academic -- UF
Computer Engineering thesis, Ph.D.
Electronic Thesis or Dissertation
bibliography   ( marcgt )
theses   ( marcgt )
government publication (state, provincial, territorial, dependent)   ( marcgt )


In general, the task of detecting anomalies is to find the most anomalous points, or subsets of points, in a given data set according to a user-defined score function. To give users the freedom to try different score functions during data exploration, the generic approach is to evaluate the user-defined score function on each candidate data point or subset of points, sort the candidates by score, and return the top few. Though simple, this approach is slow when the data set is huge and/or the score function is expensive to compute. This thesis presents statistical methods to speed up three specific anomaly detection tasks. First, a simple sampling algorithm is proposed to efficiently detect distance-based outliers in domains where every distance computation is expensive. Unlike existing algorithms, the sampling algorithm allows the user to control the number of distance computations and can return good results with accuracy guarantees. The most computationally expensive aspect of estimating the accuracy of the result is sorting all of the distances computed by the sampling algorithm. Second, a Bayesian framework is devised to predict the $k^{th}$ largest (or smallest) value in a real set. The Bayesian framework combines the domain knowledge derived from the available database workload with the sampled information obtained at query time, and returns confidence bounds to the user. Third, a likelihood ratio test (LRT) framework is designed to efficiently detect spatial anomalies according to a user-supplied likelihood function. Given a spatial data set placed on an $n\times n$ grid, the goal is to find the rectangular regions within which subsets of the data set exhibit anomalous behavior. For an arbitrary user-supplied likelihood function, the naive algorithm would conduct an LRT over each rectangular region in the grid, rank all of the rectangles by their computed LRT statistics, and return the few most interesting rectangles. To speed up this process, the LRT framework uses novel and effective pruning methods to eliminate a large fraction of the rectangles without computing their associated LRT statistics. For all three research issues, extensive experiments show significant speedups compared with the alternative algorithms on real problems over real data.
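The naive baseline that the abstract describes (score every candidate, rank, return the top few) can be sketched as follows for the grid setting. This is only an illustration under stated assumptions: the names `top_anomalies`, `all_rectangles`, `toy_score`, and `grid` are hypothetical, and `toy_score` is a stand-in mean-count score, not the likelihood ratio statistic or any method from the thesis.

```python
import heapq
from itertools import combinations
from typing import Callable, Iterable, Iterator, List, Tuple, TypeVar

T = TypeVar("T")
# (row_start, row_end, col_start, col_end), with end indices exclusive.
Rect = Tuple[int, int, int, int]


def top_anomalies(
    candidates: Iterable[T], score: Callable[[T], float], k: int
) -> List[Tuple[float, T]]:
    """Naive baseline: score every candidate and keep the k highest scores."""
    scored = ((score(c), c) for c in candidates)
    return heapq.nlargest(k, scored, key=lambda pair: pair[0])


def all_rectangles(n: int) -> Iterator[Rect]:
    """Enumerate every axis-aligned rectangle on an n x n grid.

    Choosing 2 of the n + 1 grid lines per axis gives (n(n+1)/2)^2
    rectangles, which is why exhaustive scoring becomes expensive.
    """
    for r0, r1 in combinations(range(n + 1), 2):
        for c0, c1 in combinations(range(n + 1), 2):
            yield (r0, r1, c0, c1)


# Hypothetical toy data: event counts per grid cell.
grid = [
    [0, 0, 0, 0],
    [0, 5, 6, 0],
    [0, 7, 8, 0],
    [0, 0, 0, 0],
]


def toy_score(rect: Rect) -> float:
    """Assumed stand-in score: mean count per cell inside the rectangle."""
    r0, r1, c0, c1 = rect
    total = sum(grid[r][c] for r in range(r0, r1) for c in range(c0, c1))
    return total / ((r1 - r0) * (c1 - c0))


best = top_anomalies(all_rectangles(len(grid)), toy_score, k=3)
```

Even this 4 x 4 grid yields 100 candidate rectangles, and the count grows on the order of n^4, which is the cost that pruning is meant to avoid.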
Statement of Responsibility:
by Mingxi Wu.
General Note:
In the series University of Florida Digital Collections.
General Note:
Includes vita.
Includes bibliographical references.
General Note:
Description based on online resource; title from PDF title page.
General Note:
This bibliographic record is available under the Creative Commons CC0 public domain dedication. The University of Florida Libraries, as creator of this bibliographic record, has waived all rights to it worldwide under copyright law, including all related and neighboring rights, to the extent allowed by law.
Thesis (Ph.D.)--University of Florida, 2008.
General Note:
Adviser: Jermaine, Christophe.
General Note:

Record Information

Source Institution:
Rights Management:
Applicable rights reserved.
lcc - LD1780 2008
System ID: