Browsing by Author "Gay, Gregory"
Now showing 1 - 20 of 21
Item: A Baseline Method For Search-Based Software Engineering (2010). Gay, Gregory.
Background: Search-based Software Engineering (SBSE) uses a variety of techniques, such as evolutionary algorithms or meta-heuristic searches, but lacks a standard baseline method. Aims: The KEYS2 algorithm meets the criteria of a baseline: it is fast, stable, easy to understand, and presents results that are competitive with standard techniques. Method: KEYS2 operates on the theory that a small subset of variables controls the majority of the search space. It uses a greedy search and a Bayesian ranking heuristic to fix the values of these variables, which rapidly forces the search towards stable high-scoring areas. Results: KEYS2 is faster than standard techniques, presents competitive results (assessed with a rank-sum test), and offers stable solutions. Conclusions: KEYS2 is a valid candidate to serve as a baseline technique for SBSE research.
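
The Method sentence above describes the whole KEYS2 loop: sample the space, rank (variable, value) settings, and greedily fix the best ones. A minimal sketch of that idea follows; the `score` callback, the `domains` mapping, the "best vs. rest" support ranking, and fixing one setting per round are illustrative assumptions, not the published implementation.

```python
import random

def keys2_sketch(domains, score, rounds=20, samples=100, top_frac=0.1):
    """Illustrative greedy search in the spirit of KEYS2: repeatedly sample
    configurations, rank open (variable, value) settings by how often they
    appear in the best-scoring samples versus the rest (a Bayesian-style
    'best vs. rest' ranking), and permanently fix the top-ranked setting."""
    fixed, free = {}, dict(domains)          # committed vs. still-open variables
    for _ in range(rounds):
        if not free:
            break
        configs = []
        for _ in range(samples):
            c = {v: random.choice(vals) for v, vals in free.items()}
            c.update(fixed)                  # respect earlier commitments
            configs.append((score(c), c))
        configs.sort(key=lambda t: t[0], reverse=True)
        cut = max(1, int(top_frac * samples))
        best, rest = configs[:cut], configs[cut:]

        def rank(var, val):                  # b^2 / (b + r) support ranking
            b = sum(1 for _, c in best if c[var] == val) / len(best)
            r = sum(1 for _, c in rest if c[var] == val) / max(1, len(rest))
            return b * b / (b + r) if b + r else 0.0

        var, val = max(((v, x) for v in free for x in free[v]),
                       key=lambda p: rank(*p))
        fixed[var] = val                     # greedily lock in the setting
        del free[var]
    return fixed
```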
Item: Automated Oracle Creation Support, or: How I Learned to Stop Worrying About Fault Propagation and Love Mutation Testing (IEEE Press, 2012). Staats, Matt; Gay, Gregory; Heimdahl, Mats.
In testing, the test oracle is the artifact that determines whether an application under test executes correctly. The choice of test oracle can significantly impact the effectiveness of the testing process. However, despite the prevalence of tools that support the selection of test inputs, little work exists for supporting oracle creation. In this work, we propose a method of supporting test oracle creation. This method automatically selects the oracle data - the set of variables monitored during testing - for expected value test oracles. This approach is based on the use of mutation analysis to rank variables in terms of fault-finding effectiveness, thus automating the selection of the oracle data. Experiments over four industrial examples demonstrate that our method may be a cost-effective approach for producing small, effective oracle data, with fault finding improvements over current industrial best practice of up to 145.8% observed.

Item: Automated Oracle Data Selection Support (IEEE, 2015-06). Gay, Gregory; Staats, Matt; Whalen, Michael; Heimdahl, Mats.
The choice of test oracle—the artifact that determines whether an application under test executes correctly—can significantly impact the effectiveness of the testing process. However, despite the prevalence of tools that support test input selection, little work exists for supporting oracle creation. We propose a method of supporting test oracle creation that automatically selects the oracle data—the set of variables monitored during testing—for expected value test oracles. This approach is based on the use of mutation analysis to rank variables in terms of fault-finding effectiveness, thus automating the selection of the oracle data. Experimental results obtained by employing our method over six industrial systems (while varying test input types and the number of generated mutants) indicate that our method—when paired with test inputs generated either at random or to satisfy specific structural coverage criteria—may be a cost-effective approach for producing small, effective oracle data sets, with fault finding improvements over current industrial best practice of up to 1,435% observed (with typical improvements of up to 50%).

Item: Automated Oracle Data Selection Support (2015-11). Gay, Gregory; Staats, Matt; Whalen, Michael; Heimdahl, Mats.
The choice of test oracle—the artifact that determines whether an application under test executes correctly—can significantly impact the effectiveness of the testing process. However, despite the prevalence of tools that support test input selection, little work exists for supporting oracle creation. We propose a method of supporting test oracle creation that automatically selects the oracle data—the set of variables monitored during testing—for expected value test oracles. This approach is based on the use of mutation analysis to rank variables in terms of fault-finding effectiveness, thus automating the selection of the oracle data. Experimental results obtained by employing our method over six industrial systems (while varying test input types and the number of generated mutants) indicate that our method—when paired with test inputs generated either at random or to satisfy specific structural coverage criteria—may be a cost-effective approach for producing small, effective oracle data sets, with fault finding improvements over current industrial best practice of up to 1,435% observed (with typical improvements of up to 50%).
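
The three records above share one mechanism: use mutation analysis to rank candidate variables by fault-finding effectiveness, then keep the best few as oracle data. A minimal sketch of that ranking, assuming hypothetical helpers `run(program, test)` (returning a {variable: value} snapshot of the monitored values) and a prepared list of `mutants`; none of these names come from the papers.

```python
def select_oracle_data(program, mutants, tests, run, size):
    """Rank variables by fault-finding power: a variable 'kills' a mutant if
    its value under the mutant differs from its value under the original
    program on some test. Keep the `size` variables killing the most mutants."""
    kills = {}                                   # variable -> mutants it detects
    for test in tests:
        original = run(program, test)            # {variable: value}
        for m_id, mutant in enumerate(mutants):
            mutated = run(mutant, test)
            for var, value in original.items():
                if mutated.get(var) != value:    # difference is observable here
                    kills.setdefault(var, set()).add(m_id)
    ranked = sorted(kills, key=lambda v: len(kills[v]), reverse=True)
    return ranked[:size]                         # the selected oracle data
```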
Item: Automated Steering of Model-Based Test Oracles to Admit Real Program Behaviors (2015-05). Gay, Gregory.
The test oracle - a judge of the correctness of the system under test (SUT) - is a major component of the testing process. Specifying test oracles is challenging for some domains, such as real-time embedded systems, where small changes in timing or sensory input may cause large behavioral differences. Models of such systems, often built for analysis and simulation, are appealing for reuse as test oracles. These models, however, typically represent an idealized system, abstracting away certain issues such as non-deterministic timing behavior and sensor noise. Thus, even with the same inputs, the model's behavior may fail to match an acceptable behavior of the SUT, leading to many false positives reported by the test oracle. We propose an automated steering framework that can adjust the behavior of the model to better match the behavior of the SUT and thereby reduce the rate of false positives. This model steering is limited by a set of constraints (defining the differences in behavior that are acceptable) and is based on a search process attempting to minimize a dissimilarity metric. This framework allows non-deterministic, but bounded, behavioral differences, while preventing future mismatches by guiding the oracle - within limits - to match the execution of the SUT. Results show that steering significantly increases SUT-oracle conformance with minimal masking of real faults and, thus, has significant potential for reducing false positives and, consequently, testing and debugging costs while improving the quality of the testing process.
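
A sketch of the steering decision described above: on a mismatch, search the model's candidate states for one that satisfies the tolerance constraints and minimizes a dissimilarity metric, and only report a fault when no such state exists. The `model` interface, the constraint signature, and the Euclidean metric are assumptions for illustration; the published framework also backtracks the model before searching.

```python
def steered_oracle_step(model, sut_state, constraints, dissimilarity):
    """One steering step: if the model matches the SUT, continue; otherwise
    steer the model to the closest acceptable state. A None result means no
    acceptable state exists, so a genuine mismatch (fault) is reported."""
    if model.state == sut_state:
        return model.state
    candidates = [s for s in model.successors(model.state)
                  if all(c(s, sut_state) for c in constraints)]
    if not candidates:
        return None                      # no bounded explanation: report fault
    best = min(candidates, key=lambda s: dissimilarity(s, sut_state))
    model.state = best                   # steer the oracle, within limits
    return best

def euclidean(s, t):
    """Example dissimilarity over numeric state vectors (an assumption)."""
    return sum((a - b) ** 2 for a, b in zip(s, t)) ** 0.5
```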
Item: Automatically Finding the Control Variables for Complex System Behavior (2010). Gay, Gregory; Menzies, Tim; Davies, Misty; Gundy-Burlet, Karen.
Testing large-scale systems is expensive in terms of both time and money. Running simulations early in the process is a proven method of finding the design faults likely to lead to critical system failures, but determining the exact cause of those errors is still time-consuming and requires access to a limited number of domain experts. It is desirable to find an automated method that explores the large number of combinations and is able to isolate likely fault points. Treatment learning is a subset of minimal contrast-set learning that, rather than classifying data into distinct categories, focuses on finding the unique factors that lead to a particular classification. That is, it finds the smallest change to the data that causes the largest change in the class distribution. These treatments, when imposed, identify the factors most likely to cause a mission-critical failure. The goal of this research is to comparatively assess treatment learning against state-of-the-art numerical optimization techniques. To achieve this, this paper benchmarks the TAR3 and TAR4.1 treatment learners against optimization techniques across three complex systems, including two projects from the Robust Software Engineering (RSE) group within the National Aeronautics and Space Administration (NASA) Ames Research Center. The results clearly show that treatment learning is both faster and more accurate than traditional optimization methods.
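
A minimal sketch of the contrast-set idea behind treatment learning: enumerate small conjunctions of attribute values and score each by how strongly imposing it shifts the data toward high-utility classes. The exhaustive enumeration and the `lift` score below are simplified stand-ins for TAR3's actual heuristics, and `utility` is an assumed mapping from class labels to positive scores.

```python
from itertools import combinations

def lift(rows, treatment, utility):
    """Mean utility of the rows selected by the treatment, relative to the
    mean utility of all rows. rows: list of (attributes_dict, class_label);
    assumes utilities are positive."""
    selected = [utility(label) for attrs, label in rows
                if all(attrs.get(a) == v for a, v in treatment)]
    baseline = sum(utility(label) for _, label in rows) / len(rows)
    return (sum(selected) / len(selected)) / baseline if selected else 0.0

def best_treatments(rows, candidate_pairs, utility, max_size=2, keep=5):
    """Enumerate small conjunctions of (attribute, value) pairs and keep the
    highest-lift ones: the smallest changes with the largest class effect."""
    scored = [(lift(rows, t, utility), t)
              for size in range(1, max_size + 1)
              for t in combinations(candidate_pairs, size)]
    return sorted(scored, key=lambda s: s[0], reverse=True)[:keep]
```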
Item: Community-Assisted Software Engineering Decision Making (2013-04-15). Gay, Gregory; Heimdahl, Mats.
There is a class of problems - such as deciding on a series of development practices - that challenges AI approaches for two reasons: (1) well-defined data is necessary, but it is not clear which factors are important, and (2) multiple recommendations must be made, and dependence between them must be considered. Recommender systems offer inspiration for addressing these problems. First, they can make use of data models that broadly pair a series of recommendations with generalized information like project descriptions and design documents. More importantly, they offer feedback mechanisms to refine the calculated recommendations. Detailed feedback could be factored back into the data model to, over time, build evidence and context for recommendations. Existing algorithms and data models would be amenable to the addition of feedback mechanisms, and the use of these dynamic models could lower start-up costs and generate more accurate results as the model grows. We believe that feedback-driven dynamic prediction models will become an exciting research topic in the field of AI-based software engineering.

Item: Efficient Observability-based Test Generation by Dynamic Symbolic Execution (IEEE, 2015). You, Dongjiang; Rayadurgam, Sanjai; Whalen, Michael; Heimdahl, Mats; Gay, Gregory.
Structural coverage metrics have been widely used to measure test suite adequacy as well as to generate test cases. In previous investigations, we have found that the fault-finding effectiveness of tests satisfying structural coverage criteria is highly dependent on program syntax – even if the faulty code is exercised, its effect may not be observable at the output. To address these problems, observability-based coverage metrics have been defined. Specifically, Observable MC/DC (OMC/DC) is a criterion that appears to be both more effective at detecting faults and more robust to program restructuring than MC/DC. Traditional counterexample-based test generation for OMC/DC, however, can be infeasible on large systems. In this study, we propose an incremental test generation approach that combines the notion of observability with dynamic symbolic execution. We evaluated the efficiency and effectiveness of our approach using seven systems from the avionics and medical device domains. Our results show that the incremental approach requires much lower generation time, while achieving even higher fault-finding effectiveness compared with regular OMC/DC generation.

Item: Finding Robust Solutions in Requirements Models (2010). Gay, Gregory; Menzies, Tim; Jalali, Omid; Mundy, Gregory; Gilkerson, Beau; Feather, Martin; Kiper, James.
Solutions to non-linear requirements engineering problems may be "brittle"; i.e., small changes may dramatically alter solution effectiveness. Hence, it is not enough to just generate solutions to requirements problems; we must also assess solution robustness. The KEYS2 algorithm can generate decision ordering diagrams. Once generated, these diagrams can assess solution robustness in linear time. In experiments with real-world requirements engineering models, we show that KEYS2 can generate decision ordering diagrams in O(N^2) time. When assessed in terms of (a) reducing inference times, (b) increasing solution quality, and (c) decreasing the variance of the generated solutions, KEYS2 outperforms other search algorithms (simulated annealing, ASTAR, MaxWalkSat).
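
The abstract above does not spell out how decision ordering diagrams assess robustness, so the following is only a plausible reading, not the paper's procedure: given decisions ranked in the order KEYS2 fixed them, fix the top k and randomize the rest, reading the variance of the objective as a brittleness signal. All names and the sampling scheme are assumptions.

```python
import random
import statistics

def robustness_profile(ordering, domains, score, samples=50):
    """For each prefix of the decision ordering [(var, value), ...], fix the
    first k decisions and randomize the remainder; low score variance at
    small k suggests the solution region is robust rather than brittle."""
    profile = []
    for k in range(1, len(ordering) + 1):
        fixed = dict(ordering[:k])
        free = [v for v, _ in ordering[k:]]
        scores = []
        for _ in range(samples):
            config = dict(fixed)
            config.update({v: random.choice(domains[v]) for v in free})
            scores.append(score(config))
        profile.append((k, statistics.pvariance(scores)))
    return profile
```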
Item: How to Build Repeatable Experiments (2008). Gay, Gregory; Menzies, Tim; Cukic, Bojan; Turhan, Burak.
The mantra of the PROMISE series is "repeatable, improvable, maybe refutable" software engineering experiments. This community has successfully created a library of reusable software engineering data sets. The next challenge for the PROMISE community will be to share not only data, but also experiments. Our experience with existing data mining environments is that these tools are not suitable for publishing or sharing repeatable experiments. OURMINE is an environment for the development of data mining experiments. OURMINE offers a succinct notation for describing experiments. Adding new tools to OURMINE, in a variety of languages, is a rapid and simple process, which makes it a useful research tool. Complicated graphical interfaces have been eschewed for simple command-line prompts, which eases the learning curve for data mining novices. The simplicity also encourages large-scale modification of and experimentation with the code. In this paper, we show the OURMINE code required to reproduce a recent experiment checking how defect predictors learned from one site apply to another. This is an important result for the PROMISE community, since it shows that our shared repository is not just a useful academic resource. Rather, it is a valuable resource for industry: companies that lack the local data required to build defect predictors can use PROMISE data to build them.

Item: Implications of Ceiling Effects in Defect Predictors (2008). Menzies, Tim; Turhan, Burak; Gay, Gregory; Bener, Ayse; Cukic, Bojan; Jiang, Yue.
Context: There are many methods that input static code features and output a predictor for faulty code modules. These data mining methods have hit a "performance ceiling", i.e., some inherent upper bound on the amount of information offered by, say, static code features when identifying modules which contain faults. Objective: We seek an explanation for this ceiling effect. Perhaps static code features have "limited information content", i.e., their information can be quickly and completely discovered by even simple learners. Method: An initial literature review documents the ceiling effect in other work. Next, using three sub-sampling techniques (under-, over-, and micro-sampling), we look for the lower useful bound on the number of training instances. Results: Using micro-sampling, we find that as few as 50 instances yield as much information as larger training sets. Conclusions: We have found much evidence for the limited information hypothesis. Further progress in learning defect predictors may not come from better algorithms; rather, we need to improve the information content of the training data, perhaps with case-based reasoning methods.
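
A sketch of the micro-sampling experiment behind the Results sentence above: train a defect predictor on progressively larger random subsets and look for the point where performance flattens. The learner, metric, and sizes below are placeholders rather than the study's exact setup; scikit-learn is used for brevity, and labels are assumed binary (1 = faulty).

```python
import random
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import recall_score

def micro_sampling_curve(X, y, sizes=(25, 50, 100, 200, 400), trials=10):
    """Mean recall of a defect predictor per training-set size; a curve that
    flattens past roughly 50 instances matches the 'limited information
    content' observation reported above."""
    curve = []
    for size in sizes:
        scores = []
        for _ in range(trials):
            chosen = set(random.sample(range(len(X)), size))
            train = [i for i in range(len(X)) if i in chosen]
            test = [i for i in range(len(X)) if i not in chosen]
            model = DecisionTreeClassifier().fit(
                [X[i] for i in train], [y[i] for i in train])
            preds = model.predict([X[i] for i in test])
            scores.append(recall_score([y[i] for i in test], preds))
        curve.append((size, sum(scores) / trials))
    return curve
```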
Item: Improving the Accuracy of Oracle Verdicts Through Automated Model Steering (2014). Gay, Gregory; Rayadurgam, Sanjai; Heimdahl, Mats.
The oracle--a judge of the correctness of the system under test (SUT)--is a major component of the testing process. Specifying test oracles is challenging for some domains, such as real-time embedded systems, where small changes in timing or sensory input may cause large behavioral differences. Models of such systems, often built for analysis and simulation, are appealing for reuse as oracles. These models, however, typically represent an idealized system, abstracting away certain issues such as non-deterministic timing behavior and sensor noise. Thus, even with the same inputs, the model's behavior may fail to match an acceptable behavior of the SUT, leading to many false positives reported by the oracle. We propose an automated steering framework that can adjust the behavior of the model to better match the behavior of the SUT and thereby reduce the rate of false positives. This model steering is limited by a set of constraints (defining acceptable differences in behavior) and is based on a search process attempting to minimize a dissimilarity metric. This framework allows non-deterministic, but bounded, behavior differences, while preventing future mismatches, by guiding the oracle--within limits--to match the execution of the SUT. Results show that steering significantly increases SUT-oracle conformance with minimal masking of real faults and, thus, has significant potential for reducing false positives and, consequently, development costs.

Item: Measuring the Heterogeneity of Crosscompany Datasets (2010). Chen, Jia; Yang, Ye; Zhang, Wen; Gay, Gregory.
As a standard practice, general effort estimation models are calibrated from large cross-company datasets. However, many of the records within such datasets are taken from companies that have calibrated the model to match their own local practices. Locally calibrated models are a double-edged sword: they often improve estimate accuracy for that particular organization, but they also encourage the growth of local biases. Such biases remain present when projects from that firm are used in a new cross-company dataset. Over time, such biases compound, and the reliability and accuracy of a general model derived from the data will be affected by the increased level of heterogeneity. In this paper, we propose a statistical measure of the exact level of heterogeneity of a cross-company dataset. In experimental tests, we measure the heterogeneity of two COCOMO-based datasets and demonstrate that one is more homogeneous than the other. Such a measure has potentially important implications for both model maintainers and model users. Furthermore, a heterogeneity measure can be used to inform users of the appropriate data handling techniques.

Item: Moving the Goalposts: Coverage Satisfaction is Not Enough (ACM, 2014). Gay, Gregory; Staats, Matt; Whalen, Michael; Heimdahl, Mats.
Structural coverage criteria have been proposed to measure the adequacy of testing efforts. Indeed, in some domains--e.g., critical systems areas--structural coverage criteria must be satisfied to achieve certification. The advent of powerful search-based test generation tools has given us the ability to generate test inputs to satisfy these structural coverage criteria. While tempting, recent empirical evidence indicates these tools should be used with caution, as merely achieving high structural coverage is not necessarily indicative of high fault detection ability. In this report, we review some of these findings and offer recommendations on how the strengths of search-based test generation methods can alleviate these issues.

Item: Observable Modified Condition/Decision Coverage (IEEE, 2013). Whalen, Michael; Gay, Gregory; You, Dongjiang; Heimdahl, Mats; Staats, Matt.
In many critical systems domains, test suite adequacy is currently measured using structural coverage metrics over the source code. Of particular interest is the modified condition/decision coverage (MC/DC) criterion required for, e.g., critical avionics systems. In previous investigations we have found that the efficacy of such test suites is highly dependent on the structure of the program under test and the choice of variables monitored by the oracle. MC/DC adequate tests would frequently exercise faulty code, but the effects of the faults would not propagate to the monitored oracle variables. In this report, we combine the MC/DC coverage metric with a notion of observability that helps ensure that the result of a fault encountered when covering a structural obligation propagates to a monitored variable; we term this new coverage criterion Observable MC/DC (OMC/DC). We hypothesize this path requirement will make structural coverage metrics (1) more effective at revealing faults, (2) more robust to changes in program structure, and (3) more robust to the choice of variables monitored. We assess the efficacy and sensitivity to program structure of OMC/DC as compared to masking MC/DC using four subsystems from the civil avionics domain and the control logic of a microwave. We have found that test suites satisfying OMC/DC are significantly more effective than test suites satisfying MC/DC, revealing up to 88% more faults, and are less sensitive to program structure and the choice of monitored variables.
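
A toy illustration of the observability gap that OMC/DC closes: a test suite can exercise an internal decision (toward MC/DC) while a downstream condition masks the decision's effect, so a fault in it never reaches the monitored output. The function and variable names are invented for illustration.

```python
def step(a, b, gate):
    decision = a and b                   # suppose a fault flips this to 'a or b'
    out = decision if gate else False    # 'gate' can mask the decision entirely
    return out                           # only 'out' is monitored by the oracle

# These inputs exercise the conditions of 'a and b' (toward MC/DC), but with
# gate=False the monitored output is False no matter what: a faulty 'a or b'
# would produce identical outputs, so the fault cannot be revealed.
for a, b in [(True, True), (True, False), (False, True)]:
    assert step(a, b, gate=False) is False

# OMC/DC additionally demands an observability path: with gate=True the
# decision reaches the output, and the input (True, False) now separates
# the correct 'a and b' (out=False) from the faulty 'a or b' (out=True).
assert step(True, False, gate=True) is False
```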
Item: On the Danger of Coverage Directed Test Case Generation (Springer-Verlag, 2012). Gay, Gregory; Staats, Matt; Whalen, Michael; Heimdahl, Mats.
In the avionics domain, the use of structural coverage criteria is legally required in determining test suite adequacy. With the success of automated test generation tools, it is tempting to use these criteria as the basis for test generation. To more firmly establish the effectiveness of such approaches, we have generated and evaluated test suites to satisfy two coverage criteria using counterexample-based test generation and a random generation approach, contrasted against purely random test suites of equal size. Our results yield two key conclusions. First, coverage criteria satisfaction alone is a poor indication of test suite effectiveness. Second, the use of structural coverage as a supplement, not a target, for test generation can have a positive impact. These observations point to the dangers inherent in the increase in test automation in critical systems and the need for more research into how coverage criteria, generation approach, and system structure jointly influence test effectiveness.

Item: On the Use of Relevance Feedback in IR-based Concept Location (2009). Gay, Gregory; Haiduc, Sonia; Marcus, Andrian; Menzies, Tim.
Concept location is a critical activity during software evolution, as it produces the location where a change is to start in response to a modification request, such as a bug report or a new feature request. Lexical-based concept location techniques rely on matching the text embedded in the source code to queries formulated by the developers. The efficiency of such techniques is strongly dependent on the ability of the developer to write good queries. We propose an approach that augments information retrieval (IR) based concept location via an explicit relevance feedback (RF) mechanism. RF is a two-part process in which the developer judges existing results returned by a search, and the IR system uses this information to perform a new search, returning more relevant information to the user. A set of case studies performed on open source software systems reveals the impact of RF on IR-based concept location.
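
Explicit relevance feedback of the kind described above is commonly realized with a Rocchio-style query update; the abstract does not say which formulation the paper uses, so treat the weights and update rule below as illustrative rather than the paper's configuration.

```python
import numpy as np

def rocchio_update(query_vec, relevant, nonrelevant,
                   alpha=1.0, beta=0.75, gamma=0.15):
    """Move the query vector toward documents the developer judged relevant
    and away from those judged irrelevant; the updated query is then re-run
    against the corpus of source-code documents (e.g., methods)."""
    q = alpha * query_vec
    if len(relevant):
        q = q + beta * np.mean(relevant, axis=0)
    if len(nonrelevant):
        q = q - gamma * np.mean(nonrelevant, axis=0)
    return q

def search(query_vec, doc_matrix, top_k=10):
    """Cosine-similarity ranking of documents against the (updated) query."""
    norms = np.linalg.norm(doc_matrix, axis=1) * np.linalg.norm(query_vec)
    scores = doc_matrix @ query_vec / np.where(norms == 0, 1, norms)
    return np.argsort(-scores)[:top_k]
```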
Item: Sharing Experiments Using Open Source Software (2011). Nelson, Adam; Menzies, Tim; Gay, Gregory.
When researchers want to repeat, improve, or refute prior conclusions, it is useful to have a complete and operational description of prior experiments. If those descriptions are overly long or complex, then sharing their details may not be informative. OURMINE is a scripting environment for the development and deployment of data mining experiments. Using OURMINE, data mining novices can specify and execute intricate experiments, while researchers can publish their complete experimental rig alongside their conclusions. This is achievable because of OURMINE's succinctness. For example, this paper presents two experiments documented in the OURMINE syntax. Thus, the brevity and simplicity of OURMINE recommend it as a better tool for documenting, executing, and sharing data mining experiments.

Item: Steering Model-Based Oracles to Admit Real Program Behaviors (2014). Gay, Gregory; Rayadurgam, Sanjai; Heimdahl, Mats.
The oracle - an arbiter of correctness of the system under test (SUT) - is a major component of the testing process. Specifying oracles is challenging for real-time embedded systems, where small changes in time or sensor inputs may cause large differences in behavior. Behavioral models of such systems, often built for analysis and simulation, are appealing for reuse as oracles. However, these models typically provide an idealized view of the system. Even when given the same inputs, the model's behavior can frequently be at variance with an acceptable behavior of the SUT executing on a real platform. We therefore propose steering the model when used as an oracle, to admit an expanded set of behaviors when judging the SUT's adherence to its requirements. On detecting a behavioral difference, the model is backtracked and then searched for a new state that satisfies certain constraints and minimizes a dissimilarity metric. The goal is to allow non-deterministic, but bounded, behavior differences while preventing future mismatches, by guiding the oracle - within limits - to match the execution of the SUT. Early results show that steering significantly increases SUT-oracle conformance with minimal masking of real faults and, thus, has significant potential for reducing development costs.

Item: The Risks of Coverage-Directed Test Case Generation (IEEE, 2015). Gay, Gregory; Staats, Matt; Whalen, Michael; Heimdahl, Mats.
A number of structural coverage criteria have been proposed to measure the adequacy of testing efforts. In the avionics and other critical systems domains, test suites satisfying structural coverage criteria are mandated by standards. With the advent of powerful automated test generation tools, it is tempting to simply generate test inputs to satisfy these structural coverage criteria. However, while techniques to produce coverage-providing tests are well established, the effectiveness of such approaches in terms of fault detection ability has not been adequately studied. In this work, we evaluate the effectiveness of test suites generated to satisfy four coverage criteria through counterexample-based test generation and a random generation approach—where tests are randomly generated until coverage is achieved—contrasted against purely random test suites of equal size. Our results yield three key conclusions. First, coverage criteria satisfaction alone can be a poor indication of fault finding effectiveness, with inconsistent results between the seven case examples (and random test suites of equal size often providing similar—or even higher—levels of fault finding). Second, the use of structural coverage as a supplement—rather than a target—for test generation can have a positive impact, with random test suites reduced to a coverage-providing subset detecting up to 13.5% more faults than test suites generated specifically to achieve coverage. Finally, Observable MC/DC, a criterion designed to account for program structure and the selection of the test oracle, can—in part—address the failings of traditional structural coverage criteria, allowing for the generation of test suites achieving higher levels of fault detection than random test suites of equal size. These observations point to risks inherent in the increase in test automation in critical systems, and the need for more research in how coverage criteria, test generation approaches, the test oracle used, and system structure jointly influence test effectiveness.