Browsing by Subject "algorithms"
Now showing 1 - 4 of 4
Item: Comparative Analysis of the Impact of Arrival Rates on the Performance of Distance-based Streaming Outlier Detection Algorithms (2021-05). Durrani, Areeha

Outlier detection in data streams comes with many challenges. Among these is the variable arrival rate of streams. When data packets are sent across an unreliable network, the sending process is interrupted by temporary loss of signal; once the signal resumes, the sender attempts to transmit all of the buffered data at once, causing data points to be dropped and leading to faulty outlier detection. Which algorithm performs best in such cases has, until now, remained an open question. This research studies the impact of the arrival rate, queue capacity, and slide size on the performance of state-of-the-art outlier detection algorithms for data streams. Our experiments show that using a bounded queue for incoming data points and allowing data drop has an average detrimental impact on the F-1 score of 100% for NETS, 99.78% for Thresh-Leap, 99.69% for Micro-Cluster, 67.5% for Exact Storm, and 0.422% for DUE. The number of outliers lost is 0% for Thresh-Leap, -0.33% for NETS, 0% for Micro-Cluster, 39.2% for Exact Storm, and 38.6% for DUE, observed on default parameters.

Item: Efficient geometric algorithms for preference top-k queries, stochastic line arrangements, and proximity problems (2017-06). Li, Yuan

Problems arising in diverse real-world applications can often be modeled by geometric objects such as points, lines, and polygons. The goal of this dissertation research is to design efficient algorithms for such geometric problems and to provide guarantees on their performance via rigorous theoretical analysis. Three related problems are discussed in this thesis. The first revisits the well-known problem of answering preference top-k queries, which arise in a wide range of applications in databases and computational geometry.
Given a set of n points, each with d real-valued attributes, the goal is to organize the points into a suitable data structure so that user preference queries can be answered efficiently. A query consists of a d-dimensional vector w, representing a user's preference for each attribute, and an integer k, representing the number of data points to be retrieved. The answer to a query is the k highest-scoring points relative to w, where the score of a point, p, is designed to reflect how well it captures, in aggregate, the user's preferences for the different attributes. This thesis contributes efficient exact solutions in low dimensions (2D and 3D) and a new sampling-based approximation algorithm in higher dimensions. The second problem extends the fundamental geometric concept of a line arrangement to stochastic data. A line arrangement in the plane is a partition of the plane into vertices, edges, and faces. Surprisingly, diverse problems, including the preference top-k query and the k-order Voronoi Diagram, essentially boil down to answering questions about the set of k-topmost lines at some abscissa. This thesis considers line arrangements in a new setting, where each line has an associated existence probability representing uncertainty that is inherent in real-world data. An upper bound is derived on the expected number of changes in the set of k-topmost lines, taken over the entire x-axis, and a worst-case upper bound is given for k = 1. An efficient algorithm is also given to compute the most likely k-topmost lines in the arrangement. Applications of this problem, including the most likely Voronoi Diagram in R^1 and the stochastic preference top-k query, are discussed. The third problem is geometric proximity search in both the stochastic setting and the query-retrieval setting. Under the stochastic setting, the thesis considers two fundamental problems, namely, the stochastic closest pair problem and the k most likely nearest neighbor search.
In both problems, the data points are assumed to lie on a tree embedded in R^2, and distances are measured along the tree (a so-called tree space). For the former, efficient solutions are given to compute the probability that the closest pair distance of a realization of the input is at least l, and to compute the expected closest pair distance. For the latter, the thesis generalizes the concept of the most likely Voronoi Diagram from R^1 to tree space and bounds its combinatorial complexity. A data structure for the diagram and an algorithm to construct it are also given. For the query-retrieval version, which is considered in R^2, the goal is to retrieve the closest pair within a user-specified query range. The contributions here include efficient data structures and algorithms that have fast query time while using linear or near-linear space for a variety of query shapes. In addition, a generic framework is presented, which returns a closest pair that is no farther apart than the closest pair in a suitably shrunken version of the query range.

Item: High-performance tools for precise microbiome characterization (2018-08). Al-Ghalith, Gabriel

The microbiome, defined as the vast number of microorganisms inhabiting both human and non-human environments, has been associated with human disease as well as other important ecological phenomena. However, its quantitative study is complicated in part by measurement error and computational limitations, pointing to a need for more sensitive and reproducible DNA sequence analysis techniques. To this end, I have developed a variety of improved methods, including a flexible short-read quality control pipeline, curated databases of marker genes and whole genomes, streamlined OTU picking software, and a high-throughput optimal aligner with taxonomy interpolation.
Together, these methods represent advances over traditional sequence analysis pipelines and may improve the quality of downstream statistical analyses.

Item: Journalism in an Era of Big Data: Cases, Concepts, and Critiques (Digital Journalism, 2015). Lewis, Seth C.

"Journalism in an era of big data" is thus a way of seeing journalism as interpolated through the conceptual and methodological approaches of computation and quantification. It is about both the ideation and implementation of computational and mathematical mindsets and skill sets in newswork, as well as the necessary deconstruction and critique of such approaches. Taking such a wide-angle view of this phenomenon, including both practice and philosophy within this conversation, means attending to the social/cultural dynamics of computation and quantification, such as the grassroots groups that are seeking to bring pro-social "hacking" into journalism (Lewis and Usher 2013, 2014), as well as the material/technological characteristics of these developments. It means recognizing that algorithms and related computational tools and techniques "are neither entirely material, nor are they entirely human—they are hybrid, composed of both human intentionality and material obduracy" (Anderson 2013, 1016). As such, we need a set of perspectives that highlight the distinct and interrelated roles of social actors and technological actants at this emerging intersection of journalism (Lewis and Westlund 2014a). To trace the broad outline of journalism in an era of big data, we need (1) empirical cases that describe and explain such developments, whether at the micro (local) or macro (institutional) levels of analysis; (2) conceptual frameworks for organizing, interpreting, and ultimately theorizing about such developments; and (3) critical perspectives that call into question taken-for-granted norms and assumptions. This special issue takes up this three-part emphasis on cases, concepts, and critiques.
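As an aside on the preference top-k query defined in the second item above: the query (a weight vector w and an integer k, answered by the k highest-scoring points) admits a simple brute-force baseline. The sketch below assumes the score of a point p is the weighted sum (dot product) of w with p's attributes; the abstract does not fix the scoring function, and the thesis's contributed data structures exist precisely to avoid this per-query linear scan.

```python
import heapq

def top_k(points, w, k):
    """Brute-force preference top-k query.

    points: iterable of d-tuples of real-valued attributes.
    w:      d-dimensional preference vector.
    k:      number of points to retrieve.
    Scores each point by the weighted sum sum_i w[i] * p[i]
    (an assumed scoring function) and returns the k highest-scoring
    points in descending score order.
    """
    score = lambda p: sum(wi * pi for wi, pi in zip(w, p))
    return heapq.nlargest(k, points, key=score)

# Hypothetical example: three 2D points, equal preference for both attributes.
pts = [(1.0, 0.0), (0.0, 2.0), (2.0, 1.0)]   # scores: 1.0, 2.0, 3.0
best_two = top_k(pts, (1.0, 1.0), 2)          # [(2.0, 1.0), (0.0, 2.0)]
```

This scan touches every point on every query, i.e. O(n log k) time; the exact 2D/3D structures and higher-dimensional sampling-based approximation described in the abstract aim to answer the same query without examining all n points.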