SOURCE ARCHIVE
EXTRACTED CONTENT
64,308 charsGenesys: Automatically Generating Representative Training Sets for Predictive Benchmarking
Reena Panda, Xinnian Zheng, Shuang Song, Jee Ho Ryoo, Michael LeBeane, Andreas Gerstlauer and Lizy K John The University of Texas at Austin {reena.panda, xzheng1, songshuang1990, jr45842, mlebeane}@utexas.edu, {gerstl, ljohn}@ece.utexas.edu
Abstract—Fast and efficient design space exploration is a Training Phase
critical requirement for designing computer systems, however, Prog 1 f , f₂,.. fN T F(f )=T
the growing complexity of hardware/software systems and sig- 1 Statistical i
Prog 2 f , f₂,.. fN T Learning
nificantly long run-times of detailed simulators often makes it 1 Algorithm
challenging. Machine learning (ML) models have been proposed Prog N f₁, f₂,.. fN T
as popular alternatives that enable fast exploratory studies. The
accuracy of any ML model depends heavily on the representa- Training Set Feature Extraction Training Algorithm Prediction Model
tiveness of applications used for training the predictive models.
While prior studies have used standard benchmarks or hand- Testing Phase
tuned micro-benchmarks to train their predictive models, in this F(f i)=T Prediction of target metric
paper, we argue that it is often sub-optimal because of their Prog X ( performance/power)
limited coverage of the program state-space and their inability to
be representative of the larger suite of real-world applications. In Test Program Prediction Model Prediction Result order to overcome challenges in creating representative training sets, we propose Genesys, an automatic workload generation Fig. 1: Machine learning based prediction framework methodology and framework, which builds upon key low-level application characteristics and enables systematic generation of based on the behavior of training applications, its accuracy applications covering a broad range of program behavior state- depends significantly on the ability of the training set to space without increasing the training time. We demonstrate that represent the wide-range of real-world applications that are run the automatically generated training sets improve upon the state- on modern computer systems. If the target applications have space coverage provided by applications from popular bench- similar performance behavior as the training applications, the marking suites like SPEC-CPU2006, MiBench, MediaBench, accuracy of the model will be higher and vice versa. TPC-H by over 11x and improve the accuracy of two machine learning based power and performance prediction systems by Most prior studies [2, 5, 6, 8, 11, 13] leveraging machine over 2.5x and 3.6x respectively. learning for computer design exploration select their training I. INTRODUCTION sets from standard benchmarking suites (e.g., SPEC CPU [14, 15], SPECjbb [16], MiBench [17], Implantbench [18], TPC- Fast and efficient design space exploration is a critical H [19], etc.), which consist of a ”few” selected benchmarks requirement for designing and optimizing computer systems. representing certain classes of applications. But the number of But the growing complexity of hardware systems coupled applications included in such suites is often far from being with the rapid pace of software development makes effi- sufficient for capturing the workload space or training any cient design space exploration challenging. Generally, detailed statistical model effectively. Also, as standard benchmarking simulation-based techniques are employed for performing ac- suites evolve at a much slower pace as compared to real-world curate performance/power analysis of programs. However, the software development, they are not always representative of high computational cost/run-time of detailed simulators [1] the entire suite of applications that are run on modern-day and large workload sizes affect the efficiency of elaborate computer systems. Furthermore, the significantly long run- exploratory studies. Machine learning (ML) based predictive times of many standard benchmarks limits several past studies modeling has been explored in several prior research studies to train their models by running either short subsets of the [2, 3, 4, 5, 6, 7, 8, 9, 10] as a popular alternative that enables benchmarks or using reduced input sets (e.g. - Minnespec fast design space exploration. ML models have also been [20]), which further reduces the representativeness of the adopted in the industry, finding applications in areas such training application sets. Few other studies [4, 9, 12] use hand- as server-scheduling, cluster maintenance models, operating tuned micro-benchmarks for training their predictive models, system scheduling [11, 12] etc. For example, Google uses ML however they are harder to write and increase the training time, models for making scheduling decisions on their clusters. which limits the number of micro-benchmarks that can be used A typical supervised learning-based framework employed to have a representative training set. for computer system modeling is shown in Figure 1. It consists To overcome these challenges in creating representative of two phases: training and prediction/testing. The training training programs for machine learning models, in this paper, phase involves learning the inherent behavioral patterns of a we propose “Genesys”, an automatic workload generation set of training applications and creating a predictive model methodology and framework that enables systematic gener- based on the learned patterns. The prediction phase involves ation of applications covering a broad range of the pro- using the created ML model for predicting desired metrics gram behavior state-space. Genesys builds on core low-level for test applications. However, as the model learns patterns application characteristics and systematically varies them to
generate sets of synthetic benchmarks that can provide targeted coverage of the workload space effectively. By enabling fast and effective coverage of the program state-space, Genesys aims to improve the accuracy of ML models used for design- space exploration of computer systems. In this paper, we PC2 demonstrate that using training sets automatically generated by Genesys improves upon the state-space coverage provided by applications from popular benchmarking suites (e.g., SPEC- CPU2006, MiBench, MediaBench, TPC-H) by over 11x. We also show that Genesys improves the accuracy of two machine learning based power and performance prediction systems by over 2.5x and 3.6x respectively. The key contributions of this PC1 paper are as follows: Fig. 2: KMEANS cluster plot • We propose Genesys, an automatic workload synthesis • programs that are representative of the target applications. framework to systematically improve the representative- As a wide-range of applications are run on modern com- • ness and coverage of training application sets. puter systems, it is important that the training programs We propose a novel metric, SpreadRatio, to quantify and cover the application design-space sufficiently well. This compare the state-space coverage provided by different increases the likelihood of finding programs with similar • program sets. characteristics during the prediction phase. We show that by controlling 12 key application-specific • a large number of program instances that provide broader metrics in a systematic manner, Genesys can improve and more denser coverage of the application state space. A upon the state-space coverage provided by popular, stan- training set based on few program instances increases the dard benchmarks like SPEC-CPU2006, MiBench, Medi- risk of over-fitting the model with very similar programs. aBench, TPC-H by over 11x. • We also demonstrate that using targeted synthetic training However, it is challenging to find the right set of applications sets, the accuracy of two machine learning based power to create a balanced and broader-coverage training set. and performance prediction models can be increased by Most prior studies [2, 5, 6, 8, 11, 12, 13] either use standard over 2.5x and 3.6x respectively. benchmarks or special, hand-tuned micro-benchmarks [4, 9] The rest of this paper is organized as follows: In Section 2, to train the predictive models. While standard benchmarking we provide a brief background about ML modeling followed suites usually consist of applications representing particular by discussing Genesys’s methodology and evaluation in sec- application domains, they are not always representative of the tions 3 and 4 respectively. Section 5 discusses prior work and wide-variety of applications that are run on modern-day com- section 6 concludes the paper. puter systems. For example, Figure 2 compares a subset of the SPEC CPU2006, MiBench benchmarks with popular big-data applications (Memcached [21], HiBench [22] and Cassandra II. BACKGROUND [23] running the Yahoo! cloud serving benchmark [24]) using A. Machine learning based modeling an extensive set of performance metrics (e.g., instructions per ML techniques have been widely explored for architectural cycle (IPC), control flow performance, cache misses, etc.). performance/power prediction and micro-architectural design- We can observe from the distinct cluster formations (KMeans space studies. In this paper, we use the supervised learning- clustering-based) that there are significant differences in per- based models, which typically consist of a training and a formance characteristics between the two sets of applications. prediction phase (Figure 1). During the training phase, a set Also, it is usually difficult to assess any application’s low-level of programs, which constitute the training set, are executed behavior by looking at the high-level application, unless you to capture their performance features as well as the target are a domain expert. For example, Phansalkar et al [25] showed reference metric(s) e.g., performance or power. In [6, 7, 9, 13], that many benchmarks in SPEC CPU2006 suite from different these features are obtained from hardware performance events, application domains exhibit similar behavior and vice versa. whereas in [2, 5, 8, 10], the underlying machine configurations Special, hand-tuned micro-benchmarks are definitely useful in are also used. The learning algorithm constructs a predictive stressing particular program properties, but the process is very model which associates applications’ characteristics with the tedious and time-consuming which can significantly increase reference metric(s). During the prediction phase, test program the training time. Finally, the number of program instances features are used as an input to the predictive model, which that can be created in this manner is usually far from being produces estimates of the target characteristics. sufficient for training any statistical model. The accuracy and usefulness of any such learning-based B. Overcoming the limitations approach depends heavily on the training sets that are used In order to overcome the challenges in creating a balanced for creating the predictive models. A well-formed training set set of programs for training ML models, we propose Genesys, can improve the accuracy and applicability of the predictive an automatic workload generation methodology and frame- models to a wide range of test applications, while an ill-formed work that builds upon core low-level program characteristics training set can affect the statistical model (e.g, cause over- and enables systematic generation of applications covering a fitting) and lead to inaccuracies during the prediction phase. broad range of the program behavior state-space. In the next Ideally, a good training set should consist of: section, we will describe Genesys’s methodology in detail.
Metric 1 TABLE I: Core metrics forming the workload-specific profile
Metric 2 User provides Syn Prog 1
: target values Syn Prog 2 Category Metrics Count
Metric i Workload Synthetic Syn Prog 3 1. Instruction mix 5 categories
Metric j Specific Code . Instruction-level 2. Instruction count 1
Variable Profile Generator . Characteristics 3. Instruction cache miss rate (ICMR) 1 Point Metric k Systematically . : randomized 4. Instruction level parallelism (ILP) 32 bins Metric N values Syn Prog N Control-flow 5. Average basic block size 1 Set of core metrics Training application-set Characteristics 6. Branch transition rate (BTR) 1 7. Branch misprediction rate 1 Fig. 3: Genesys’s overall methodology and framework 8. Data footprint 1 III. M Memory-access 9. Regular/irregular behavior 1 ETHODOLOGY Characteristics 10. Spatial locality stride bins 32 bins 11. Temporal locality bins 8 bins Genesys is a workload generation framework (shown in 12. L1/L2 Data cache miss rates 2 Figure 3) that enables systematic generation of synthetic appli- cations covering a broad range of the application state-space. It wise, IMIX fractions, randomized within bounded ranges, are builds upon a set of key workload-specific metrics that can be used to generate the suite of training applications. controlled systematically to generate workloads with desired 2) Instruction count: The second metric that we consider properties. Each workload-specific metric corresponds to a is instruction count (IC), which controls the static instruction low-level program feature, which defines particular application footprint of the application. IC can be provided by the user characteristics, and is available as a user-controllable knob. The directly or estimated automatically based on the target instruc- user can choose to fix the values of some (or all) core metrics tion cache miss rate (ICMR, metric 3). If the ICMR metric is to generate targeted program behavior. For the remaining set provided, Genesys determines the number of static instructions of core metrics (if any) whose values are not fixed by the user, to instantiate in the synthetic benchmark to achieve the target Genesys randomizes their values within reasonable bounds in ICMR. An initial estimate of the number of static instructions order to achieve well-rounded program state-space coverage is made based on the assumption of a default instruction cache around the target behavior. By allowing each workload-specific (Icache) size/configuration (32KB, 64 byte blocks, 2-way set- metric to be controlled using easy-to-use programmable knobs, associativity). This serves as an initial estimate only, the final Genesys allows to create targeted benchmarks with desired static code size is further tuned to achieve the target ICMR. program features. Together, the values chosen for the core 3) Instruction-level parallelism: Instruction-level paral- metrics act as unique profiles for the synthetic workloads. lelism (ILP) is an important determinant of an application’s These workload profiles are fed into a code generator algorithm performance. Tight producer-consumer chains in program se- that uses the target metric values to generate a suite of quences limit ILP and performance because of dependency- synthetic applications. Together, these applications form a set induced serialization effects. Genesys models ILP by control- of unique training programs, which target particular aspects of ling the dependencies between instructions in the application the program behavior depending upon the choice of the core sequence using the dependency distance metric. Dependency metrics and can be used to train any supervised ML model. distance is defined as the total number of instructions in the In this section, we will first present the core workload- dynamic instruction stream between the production (write) and specific metrics used by Genesys followed by the workload consumption (read) of a register/memory location. We classify generation methodology. The core feature set is divided at a dependency distance into 32 bins (values varying between 1 high-level into three categories, depending upon the aspects to 32 and higher), where each bin represents the fraction of of program behavior that the individual metrics control. The instructions having that particular dependency distance. The three categories, together with their associated sub-categories desired dependency distance can be provided as an input and component metrics are shown in Table I and are described to Genesys or automatically randomized (within bounds) to below. It should be noted that the set of core metrics used generate training programs with varying degrees of ILP. in this paper are not meant to be conclusive, rather they are key metrics that affect different aspects of program behavior. B. Control-flow characteristics Nevertheless, Genesys’s framework is flexible enough to add These metrics affect the program’s control-flow behavior. new metrics to control other aspects of program behavior. 1) Average basic block size: Average basic block size is A. Instruction-level characteristics an important metric because it determines the average number of instructions that can be executed in the program sequence These metrics correspond to the instruction-level behavior without executing any control instructions. This can affect of the applications. performance significantly depending on the branch predictor 1) Instruction mix: The first metric is the application’s performance. Again, this metric could be provided directly as instruction mix (IMIX). Genesys categorizes instructions into an input or inferred from the ICMR metric (described before) fractions of loads and stores (memory), control-flow, integer and the fraction of control instructions. and floating-point instructions. It should be noted that the 2) Branch predictability model: We consider two other framework is very flexible and can be easily extended to control-flow metrics: branch transition rate (BTR) and branch support specific instruction categories. The target IMIX can misprediction rate. Prior research studies [26] have shown that be provided as an input to Genesys directly, in which case it an application’s branch misprediction rate is highly correlated generates programs having the desired overall IMIX. Other- with the transition characteristics of the branch instructions.
The key idea behind this correlation is that higher the transition 1) Generate a random number in the interval [0, 1] and select probability of a branch instruction, the more difficult it is to a basic block based on this number and the block’s access predict its next direction and vice versa. To model a certain frequency. BTR, we choose a set of control instructions to be modeled 2) Basic block’s size is determined to satisfy the mean and with high transition probabilities (frequent switching) and the standard deviation of the target basic block size. remaining branch instructions to have very low transition prob- 3) The basic block is populated with instructions based on abilities (infrequent switching activity). Similarly, Genesys the IMIX metrics, while ensuring that the last instruction can also model the transition probability of individual branch of the basic block is a branch instruction. instructions in a directly correlated fashion to achieve a target 4) Every instruction is assigned a dependency distance (i.e., branch misprediction rate (metric 7). a previous instruction that it is dependent upon) in order C. Memory-level characteristics to satisfy the dependency distance criterion. 5) Load and store instructions are assigned a stride-offset This section describes metrics that affect the memory based on the memory access model described in the performance (data-side) of applications. previous section (regular or irregular). 1) Data footprint: Data footprint metric determines the 6) An X86 test operation is used to set the condition codes range of data addresses accessed by the synthetic application that affects the outcome of the conditional branch instruc- during its execution time. This is important because it can tion at the end of each basic block. The “test” operand is determine performance of different levels of caches and mem- controlled to achieve the target BTR metric. ory based on how large the footprint is with respect to the 7) Increment the number of basic blocks generated. available cache size and memory structure. It controls the size 8) If target number of basic blocks have been generated, go of the memory regions, which are accessed by the synthetic to step 9, else update the individual metric distributions application. and go back to step 1. 2) Memory access regularity: This metric determines if 9) Available architected registers are assigned to each in- the memory accesses made by load/store instructions of an struction while satisfying the data dependencies estab- application should have regular or irregular behavior. For lished in step 4. irregular memory behavior, Genesys generates load/store in- 10) The above instruction sequence is generated as a part of structions that access allocated and initialized memory regions two-level nested loops where the inner loop controls the based on a randomly generated sequence. Regular memory application’s data footprint and the outer loop controls the behavior is achieved using additional metrics (spatial-temporal number of dynamic instructions (overall runtime). Every locality or L1/L2 cache miss rate metrics) as described below. static load or store instruction resets to the first element 3) Spatial and temporal locality: The principle of data of the strided memory streams and re-walks the entire locality and its impact on applications performance is widely stream in the outer loop iterations. recognized. Genesys models regular data memory accesses The code generator generates the instruction sequence using simple strided stream classes over fixed-size data arrays, using C-language with embedded X86-based assembly in- where strides are defined to be the difference between consec- structions. An example code snippet is shown in Figure 4. utive effective addresses. Strides can be provided directly as The code generator can be modified to generate instructions an input to Genesys to control spatial locality characteristics for a different ISA. The code is encompassed inside a main (bins representing strides from -1K to 1K in multiples of header and malloc library calls are used to allocate memory 64B). Genesys also provides knobs to control the temporal for the data streams. Volatile directives are used for each locality (8 bins expressed as powers-of-two from 0 to 128) in asm statement and the program is compiled using the lowest the memory accesses. Temporal locality metric controls the compiler optimization level (-O0 with gcc) in order to prevent number of unique memory accesses between access to the the compiler from optimizing out the machine instructions. same memory location and affects the achieved cache miss rates as well. Together, the stride and temporal locality bin IV. EVALUATION values are used to generate the sequence of memory addresses. This section describes our experimental setup and evalua- Genesys can also automatically estimate the strides (off- tion in detail. sets) of the load/store instructions based on the target data A. Experimental setup cache miss rate statistics. This approach is similar to that adopted by Bell et al [27]. The strides for a particular memory All our experiments are conducted on Intel Xeon E5-2430 access is determined first by matching the L1 hit rate of a v2 server class machines with Ivy-bridge micro-architecture load/store, followed by the L2 hit rate. We generate a table based processing cores, three levels of caches (1.5MB L2 and that holds the correlation between L1/L2 cache hit rates and 15MB L3 cache) and 64 GB of main memory. For measuring the corresponding stride values to be used. We use the target { L1/L2 hit rate information along with this table to generate asm volatile("BBL1INS0:add %%ebx,%%ebx”) stride values of load and store instructions. By treating all asm volatile("BBL1INS1:mov 0(%%r8),%%ecx”) asm volatile("BBL1INS2:add %%ebx,%%ebx”) memory accesses as streams and working from a base cache asm volatile("BBL1INS3:mov 0(%%r14),%%ecx”) configuration, the memory access model is kept simple. asm volatile("BBL1INS4:add %%ebx,%%ebx”) asm volatile("BBL1INS5:test $0,%%eax”) asm volatile("BBL1INS6:jz BBL2INS0“) D. Workload generation methodology } The workload synthesis algorithm, based on the metrics Fig. 4: Example synthetic code snippet discussed before, is as follows:
hardware performance of different applications, we use Linux using subsets of performance characteristics, followed by using perf tool [28] that provides an interface to the processor the entire set. Since the number of metrics is large, it is performance counters. Power consumption is monitored using difficult to visualize all the variables simultaneously to draw Intel’s RAPL counters. any meaningful conclusions. Thus, we use statistical data To show the efficacy of Genesys, we compare the synthetic analysis techniques to simplify the comparison. Using a large programs generated using Genesys with a training set compris- number of correlated variables tends to unduly overemphasize ing of benchmarks drawn from several popular benchmarking the importance of a particular property. Therefore, we first suites (hereafter referred to as the REAL training set). The normalize the raw data to a unit normal distribution (mean REAL training set includes 70 standard benchmarks: 29 bench- = 0, standard deviation = 1) and then, pre-process them marks from the SPEC CPU2006 suite (using ref inputs), 20 using Principal Component Analysis (PCA) [30]. PCA is benchmarks from MiBench, 10 benchmarks from MediaBench an effective statistical data analysis technique to reduce the and 11 TPC-H queries. dimensionality of a data-set, while maintaining most of its Details about the synthetic training programs created using original information. PCA transforms the original variables Genesys (hereafter referred to as the GEN training sets) is pro- into a set of uncorrelated principal components (PCs). If vided in the following sections. Each GEN training program’s significant correlation exists between the original variables, size is restricted to complete within 1 to 15 seconds on the then most of the original information will be captured using target machine. just the top few PCs. First, we compare the memory subsystem performance B. State-space coverage behavior of the GEN and REAL program sets based on the In this section, we show how Genesys can be leveraged to L1 Dcache, Icache, L2 and LLC misses per kilo instruction automatically create programs with different features leading (MPKI) metrics. Figure 5a and 5b shows the scatterplot of the to a wider coverage of the program state-space. To do so, top 4 PCs respectively. We can observe that applications from we compare the program state-space coverage provided by the the REAL set do not stress the instruction side performance REAL training set against the GEN training set. For this study, much, because of which the Icache MPKI for the REAL the GEN set consists of 500 synthetic programs created using programs is mostly very low. Such behavior is different from Genesys. The GEN programs are uniquely generated by using the emerging big-data and cloud applications which have been random combinations of individual metric values (chosen sys- shown to stress the instruction side performance more heavily tematically within respective metric bounds). It takes roughly [31, 32]. Nevertheless, by controlling Genesys’s I/D memory- a few (∼5-20) seconds to generate each GEN training program side metrics, it is possible to create programs that stress and every program completes execution within 1 to 15 seconds the instruction and data-side performance to varying degrees. on the target machine. Thus, the training time to generate and Overall, we can see that GEN programs provide 25.4 times collect feature sets for the GEN training set is roughly equal (SpreadRatio = 25.4) higher coverage area than the REAL to running the 70 programs from the REAL set due to the programs for the first two principal components and 12.9 times significantly longer run-times of several REAL benchmarks. (SpreadRatio = 12.9) higher coverage than REAL programs in the PC3 versus PC4 space. In order to compare the program state-space coverage Figure 5c compares the I/D TLB performance of the REAL achieved by either training sets, we define a novel metric and GEN programs. The x-axis corresponds to the ITLB MPKI called SpreadRatio, which is defined as the ratio of the area of whereas the y-axis corresponds to the DTLB MPKI of the the convex hull envelope of the REAL versus GEN program programs. We can see that although the standard benchmarks features. The convex hull [29] of a set S of points in the provide good coverage in terms of DTLB performance, but Euclidean space is defined as the smallest convex set that none of the REAL programs stress the ITLB much whereas the contains S. The convex hull of a set of points S in n dimensions GEN programs provide extensive state-space coverage in terms is the intersection of all convex sets containing S. For N pointsp1 of both the instruction and data TLB performance. Overall, , ..., pN in n-dimensions, the convex hull C is given by the GEN training set provides 12.8 times higher coverage than expression: the REAL training set (SpreadRatio = 12.8). N N Next, we consider the instruction-level performance char- C ≡ ∑ λ j p j : λ j ≥ 0 ∀ j and ∑ λ j = 1 acteristics (shown in Figure 5d), based on the overall IPC, j=1 j=1 μOps/instruction, IMIX and ILP (given by dependency-driven pipeline stalls) metrics. We can see that GEN programs provide Based on this definition of a convex hull, let CREAL 8.1 times broader state-space coverage as compared to the represent the convex hull of the points covered by the REAL REAL programs for the instruction-level metrics as well. training set and CGEN represent the convex hull of the points Next, we compare the control-flow performance coverage covered by the GEN training set. Then, SpreadRatio can be of the REAL and GEN programs as shown in Figure 5e. We defined using the following expression: specifically consider the branch misprediction rate, average SpreadRatio = Area(CGEN ) basic block size, percentage of branch instructions. We can see Area(CREAL) that the REAL programs have much better branch performance coverage as compared to their cache and TLB performance. Next, we compare the state-space coverage provided by But GEN programs still outperform the REAL set by providing the GEN and REAL programs using the SpreadRatio metric. 2.4x higher coverage (SpreadRatio = 2.4). To better demonstrate the degree of controllability provided Figure 5f shows the state-space coverage provided by by Genesys, we first compare the GEN and REAL programs the REAL and GEN training programs in the PC1 versus
SpreadRatio = 25.4 SpreadRatio = 12.9
(a) (b)
SpreadRatio = 12.8 SpreadRatio = 8.1
(c) (d)
SpreadRatio = 2.4 SpreadRatio = 4.5
(e) (f)
Fig. 5: State-space coverage of REAL and GEN programs using (a) Cache/memory behavior - PC1 vs PC2 (b) Cache/memory behavior - PC3 vs PC4 (c) TLB behavior (d) Instruction-level behavior (e) Control-flow behavior (f) Overall characteristics
PC2 space using all performance features shown in Table II important to note that Genesys’s framework and methodology including IPC. Overall, GEN provides 4.5x higher state-space is applicable to any supervised learning based ML model and coverage as compared to the REAL set using all the program this case study merely serves as an example to demonstrate features. We can thus, conclude that Genesys’s methodology of the utility of Genesys’s framework in a particular context. controlling key low-level application metrics allows to easily For this case study, Genesys is used to create targeted GEN generate programs with varied performance characteristics. training sets (consisting of 60 synthetic benchmarks) having C. Case-study: performance and power modeling using ma- desired performance characteristics. As such, the training time chine learning using the GEN training applications is significantly lower as In this section, we demonstrate that Genesys framework compared to the REAL training set. For building the predictive can be leveraged to create targeted training sets by increasing models, we collect a set of performance features, given in the coverage density around regions of interest in the perfor- Table II, by running the REAL and GEN training sets together mance state-space. Using such targeted programs sets for train- with the reference IPC or power metric. The choice of these ing ML models, we demonstrate that significant improvements metrics is intended to characterize the application’s behavior in prediction accuracy for two power/performance prediction from a micro-architecture perspective. These characteristics are frameworks can be achieved as compared to using standard significant predictors of application’s overall performance and benchmarks. For this case study, we consider the case of per- power. For example, the performance effect of increasing the formance and power modeling of a given hardware platform. L1 Dcache size will have a larger impact on applications with The performance metric that we predict is the standard IPC of a high L1 Dcache miss rate. the test program and the target power metric that we predict is 1) Learning models: In this section, we introduce the the average power consumption of the program in watts. It is learning framework that we use for constructing the perfor-
TABLE II: Measured hardware performance features 45% GEN
IPC Prediction Error (%) 40%
Performance Features 35% REAL
μOps/instruction FP Ops/instruction 30%
branch/instruction branch miss/instruction 25%
Icache MPKI Dcache MPKI 20%
ITLB MPKI DTLB MPKI 15%
L2 MPKI LLC MPKI 10%
5%
0%
mance and power prediction models. Formally, for a given program i, we denote xi ∈ Rd as its d-dimensional performance feature vector, and yi ∈ R as the corresponding reference IPC or power. The goal of the learning algorithm is to find a function F : Rd → R such that, Fig. 6: Performance prediction accuracy modeled using GEN F(xi) ≈ yi ∀i in the training set. and REAL training sets Similar to prior research work [2, 7, 10], in this case-study, 3) Power prediction results: This section presents the we consider only the family of linear functions for F. Under results from predicting power consumption of a set of 18 test such assumptions, we formulate our problem of finding F as applications from SPEC-CPU2006, MiBench, Mediabench and the following optimization problem: TPC-H benchmarks. Again, we use and compare two different minimize( ‖X T w −Y ‖22 + λ ‖w‖22 (1) training sets: the REAL training set benchmarks excluding where X = xTw. . . xT ) is a matrix with each row corre- the test benchmark and the GEN training set benchmarks Figure 7 compares the power prediction accuracy using the sponding to the performance feature vectors1 n xi of each training generated for the performance prediction experiments before. program, and Y = (y1 . . . yn) is a vector consisting of the two training sets. We can see that power prediction accuracy corresponding reference performance or power values. In our using GEN set improves for most benchmarks except for five case, λ is a tuning parameter chosen to be 1. The optimization benchmarks libquantum, milc, soplex, leslie3D and zeusmp problem in equation 1 is called Ridge Regression [33]. While where the prediction accuracy reduces marginally. On an minimizing the sum of square errors of the predictor, it average though, creating models using the GEN training set penalizes the model parameter w if its magnitude is large. has significantly lower prediction error (6.19%) as compared The λ ‖w‖22 term is often called the l2-regularizer, which is to the REAL training set (15.90%), which is a 2.57 times widely used to prevent over-fitting of the model [33]. Ridge improvement in prediction accuracy. regression can be solved efficiently by using standard gradient descent method [33]. We use the CVXOPT v1.1.7 library in V. RELATED WORK Python as our main computation tool for solving the ridge Analytical approaches are being used for performance pre- regression problem. diction for several decades [34, 35]. Lee and Brooks [2] used 2) Performance prediction results: This section presents regression-based modeling for micro-architecture design space the results from predicting the performance (IPC) of a set of 18 exploration. Joseph et al. [8] and Ipek et al. [3] used regression- test applications from SPEC-CPU2006, MiBench, Mediabench based and artificial neural networks based modeling for creat- and TPC-H benchmarks. The learning model described before ing performance models. Zheng et al [6] used performance is trained using two different training sets: the REAL training counter based measurements on a host machine to predict set benchmarks excluding the test benchmark and the GEN performance of another system. Genesys is a framework to training set benchmarks containing 60 synthetic programs systematically generate training sets providing both broader targeting the test benchmark performance characteristics. Indi- and denser coverage of the program state-space and can be vidual sets of GEN training programs are generated for each applied to all the discussed ML proposals. test application by providing the target performance metrics Synthetic benchmarking has been applied for benchmark (e.g., IMIX, branch misprediction rate, cache miss rates, etc. cloning [27, 36, 37, 38] in prior studies. Bell et al. [27] profiled collected using performance counters) as an input to Genesys. applications at runtime to extract execution related metrics Figure 6 compares the IPC prediction accuracy using the two 50% training sets. We can see that using the GEN training set, Power Prediction Error (%) 45% GEN performance prediction accuracy improves for all test programs 40% REAL 35% except tonto from SPEC-CPU2006 suite, where the accuracy 30% reduces marginally. Overall, creating models using the GEN 25% training set has significantly lower prediction error (3.32%) as 20% 15% compared to the REAL training set (12.26%), which is a 3.69 10% times improvement in prediction accuracy without increasing 5% 0% the average training time significantly (owing to the shorter runtime of the workload generation algorithm and the synthetic benchmarks). Additionally, generating synthetic benchmarks based on target performance metrics is a one-time effort, and they can be reused for training different prediction models Fig. 7: Power prediction accuracy modeled using GEN and (as shown in the next section), which results in significantly REAL training sets shorter training time due to their shorter run-times.
and created proxy binaries that could be executed on simu- [10] G. Wu, J. L. Greathouse, A. Lyashevsky, N. Jayasena, and D. Chiou,
lators or real hardware. Other studies [37] cloned proprietary “Gpgpu performance and power estimation using machine learning,” in
benchmarks into synthetic proxies using microarchitecture- HPCA, 2015, pp. 564–576.
independent attributes. Automatic program generation tech- [11] S.-w. Liao, T.-H. Hung, D. Nguyen, C. Chou, C. Tu, and H. Zhou, “Ma-
niques [39, 40] have also been used for functional hardware chine learning-based prefetch optimization for data center applications,”
verification. In contrast, this paper is the first proposal that [12] in SC, 2009, pp. 56:1–56:10.
A. Negi and P. K. Kumar, “Applying machine learning techniques to
uses automatic workload generation to systematically improve improve linux process scheduling,” in TENCON 2005 2005 IEEE Region
the state-space coverage of training sets for ML models. [13] 10, Nov 2005, pp. 1–6. S. Sharkawi, D. DeSota, R. Panda, R. Indukuru, S. Stevens, V. E. Taylor, and X. Wu, “Performance projection of hpc applications using spec VI. CONCLUSION [14] cfp2006 benchmarks.” in IPDPS. IEEE, 2009, pp. 1–12. In this paper, we propose Genesys, a novel automatic [15] “SPEC CPU2006,” https://www.spec.org/cpu2006. “SPEC CPU2000,” https://www.spec.org/cpu2000. workload generation framework that enables systematic gen- [16] “SPECjbb 2005,” https://www.spec.org/jbb2005/. eration of representative training set applications, providing a [17] M. R. Guthaus, J. S. Ringenberg, D. Ernst, T. M. Austin, T. Mudge, and wider coverage of program behavior state-space, for effectively R. B. Brown, “Mibench: A free, commercially representative embedded training machine learning models. Genesys allows to control benchmark suite,” in WWC-4. Washington, DC, USA: IEEE Computer a set of key workload-specific characteristics using easy-to- [18] Society, 2001, pp. 3–14. Z. Jin and A. C. Cheng, “Implantbench: Characterizing and projecting use, programmable knobs and thereby, provides the ability to representative benchmarks for emerging bioimplantable computing,” generate applications targeting specific program properties as IEEE Micro, vol. 28, no. 4, pp. 71–91, July 2008. well. In order to compare the state-space coverage provided by [19] “TPC-H Benchmark Suite,” http://www.tpc.org/tpch. different sets of applications, we define a novel metric called [20] A. J. KleinOsowski and D. J. Lilja, “Minnespec: A new spec benchmark workload for simulation-based computer architecture research.” Com- SpreadRatio, which is based on the area of the convex hull puter Architecture Letters, 2002. envelope surrounding the program points. We demonstrate that [21] B. Fitzpatrick, “Distributed caching with memcached,” Linux J., vol. by using automatically generated training sets, it is possible [22] 2004, no. 124, Aug. 2004. to achieve over 11 times higher state-space coverage than S. Huang, J. Huang, J. Dai, T. Xie, and B. H. 0002, “The hibench bench- that provided by popular, standard benchmarks such as SPEC- mark suite: Characterization of the mapreduce-based data analysis.” in ICDE Workshops. IEEE, 2010, pp. 41–51. CPU2006, MiBench, MediaBench and TPC-H. We also show [23] “Cassandra,” wiki.apache.org/cassandra/FrontPage. that modeling using targeted synthetic training sets improves [24] B. F. Cooper, A. Silberstein, E. Tam, R. Ramakrishnan, and R. Sears, the predictability/accuracy of two machine learning based “Benchmarking cloud serving systems with ycsb,” in SoCC, 2010, pp. power and performance prediction systems by over 2.5x and [25] 143–154. A. Phansalkar, A. Joshi, L. Eeckhout, and L. K. John, “Measuring 3.6x respectively. program similarity: Experiments with spec cpu benchmark suites,” in IEEE ISPASS, 2005, pp. 10–20. VII. ACKNOWLEDGEMENT [26] M. Haungs, P. Sallee, and M. K. Farrens, “Branch transition rate: A new metric for improved branch classification analysis.” in HPCA. IEEE This research is supported partially by SRC under Task Computer Society, 2000, pp. 241–250. ID 2504 and National Science Foundation (NSF) under grant [27] R. H. Bell, Jr. and L. K. John, “Improved automatic testcase synthesis number 1337393. Any opinions, findings, conclusions or rec- for performance model validation,” in ICS. New York, NY, USA: ACM, 2005, pp. 111–120. ommendations expressed in this material are those of the [28] “Linux perf tool,” https://perf.wiki.kernel.org/index.php/Main Page. authors and do not necessarily reflect the views of the NSF [29] C. B. Barber, D. P. Dobkin, and H. Huhdanpaa, “The quickhull algo- or other sponsors. rithm for convex hulls,” ACM TRANSACTIONS ON MATHEMATICAL SOFTWARE, vol. 22, no. 4, pp. 469–483, 1996. [30] G. Dunteman, Principal Component Analysis. Sage Publications, 1989. REFERENCES [31] R. Panda, C. Erb, M. LeBeane, J. Ryoo, and L. K. John, “Performance [1] A. Patel, F. Afram, S. Chen, and K. Ghose, “Marss: A full system characterization of modern databases on out-of-order cpus,” in IEEE simulator for multicore x86 cpus,” in DAC, 2011, pp. 1050–1055. SBAC-PAD, 2015. [2] B. C. Lee and D. M. Brooks, “Illustrative design space studies with [32] R. Panda and L. K. John, “Data analytics workloads: Characterization microarchitectural regression models,” in HPCA, 2007, pp. 340–351. and similarity analysis.” in IPCCC. IEEE, 2014, pp. 1–9. [3] E. ¨Ipek, S. A. McKee, R. Caruana, B. R. de Supinski, and M. Schulz, [33] K. P. Murphy, Machine Learning: A Probabilistic Perspective. The MIT “Efficiently exploring architectural design spaces via predictive model- Press, 2012. ing,” SIGPLAN Not., vol. 41, no. 11, pp. 195–206, Oct. 2006. [34] D. B. Noonburg and J. P. Shen, “Theoretical modeling of superscalar [4] N. Ardalani, C. Lestourgeon, K. Sankaralingam, and X. Zhu, “Cross- processor performance,” in Micro, 1994, pp. 52–62. architecture performance prediction (xapp) using cpu code to predict [35] P. G. Emma and E. S. Davidson, “Characterization of branch and data gpu performance.” in Micro, 2015, pp. 725–737. dependencies in programs for evaluating pipeline performance,” IEEE [5] C. Dubach, T. Jones, and M. O’Boyle, “Microarchitectural design space Transactions on Computers, vol. 36, no. 7, pp. 859–875, 1987. exploration using an architecture-centric approach,” in MICRO 2007, [36] L. Eeckhout, K. D. Bosschere, and H. Neefs, “Performance analysis Dec 2007, pp. 262–271. through synthetic trace generation,” IEEE ISPASS, vol. 0, p. i, 2000. [6] X. Zheng, P. Ravikumar, L. K. John, and A. Gerstlauer, “Learning-based [37] A. Joshi, L. Eeckhout, R. H. B. Jr., and L. K. John, “Performance analytical cross-platform performance prediction,” in SAMOS, 2015, pp. cloning: A technique for disseminating proprietary applications as bench- 52–59. marks.” in IISWC. IEEE Computer Society, 2006, pp. 105–115. [7] W. Lee, Y. Kim, J. H. Ryoo, D. Sunwoo, A. Gerstlauer, and L. K. John, [38] E. Deniz, A. Sen, B. Kahne, and J. Holt, “Minime: Pattern-aware “Powertrain: A learning-based calibration of mcpat power models.” in multicore benchmark synthesizer.” IEEE Trans. Computers, vol. 64, pp. ISLPED. IEEE, 2015, pp. 189–194. 2239–2252, 2015. [8] P. J. Joseph, K. Vaswani, and M. J. Thazhuthaveetil, “A predictive [39] A. Aharon, D. Goodman, M. Levinger, Y. Lichtenstein, Y. Malka, performance model for superscalar processors,” in MICRO. Washington, C. Metzger, M. Molcho, and G. Shurek, “Test program generation for DC, USA: IEEE Computer Society, 2006, pp. 161–170. functional verification of powerpc processors in ibm.” in DAC, 1995, pp. [9] K. Singh, M. Bhadauria, and S. A. McKee, “Real time power estimation 279–285. and thread scheduling via performance counters,” SIGARCH Comput. [40] S. Ur and Y. Yadin, “Micro architecture coverage directed generation of test programs,” in DAC, 1999, pp. 175–180. Archit. News, vol. 37, no. 2, pp. 46–55, Jul. 2009.