
5.1.3 Model Performance

The parametric sweep described in section 4.1.6 was run on three different trace files in order to test how well the machine learning models performed when trained and tested on different data. The purpose of this parameter search was to evaluate which parameters had an impact on the performance of machine learning models for data prefetching. Both the training data and the validation data were filtered so as to only contain training examples whose output offsets fell within a certain range. This range parameter was also varied in the parameter search.
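As a rough illustration of this filtering step, the sketch below (in Python; the function name build_examples and the byte-granular offsets are assumptions for illustration, not taken from the thesis code) turns a load-address trace into (history, offset) training examples and keeps only those whose target offset falls within the configured range:

    from typing import List, Tuple

    def build_examples(addresses: List[int],
                       seq_len: int = 5,
                       offset_range: Tuple[int, int] = (-2048, 2048)):
        """Turn a raw load-address trace into (history, target-offset) pairs.

        Each example consists of the last seq_len address offsets and the
        next offset as the label; examples whose label falls outside
        offset_range are dropped, mirroring the filtering described above.
        """
        lo, hi = offset_range
        # Deltas between consecutive load addresses (in bytes).
        offsets = [b - a for a, b in zip(addresses, addresses[1:])]
        examples = []
        for i in range(len(offsets) - seq_len):
            history = offsets[i:i + seq_len]
            target = offsets[i + seq_len]
            if lo <= target <= hi:  # keep only in-range labels
                examples.append((history, target))
        return examples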

The result of this analysis should not be considered a basis for deciding which configuration is optimal, but rather an indication of which parameters play a role in a model's ability to be trained and to perform well for data prefetching. It was found that the number of LSTM parameters in a model did not seem to contribute positively to its performance. LSTM layers are more difficult to train due to their deep architecture in the temporal domain [46], which could explain the lack of performance when they were used in a model configuration.

In all of the trace files there were several runs where configurations with LSTM layers performed worse than other configurations.

To study this further, a couple of configurations were tested in which the number of dense layers and the number of LSTM layers were varied, while the other configurable parameters were kept constant. The result of this sweep is shown in figure 5.7. As explained in section 4.1.7, this type of graph was used to study general trends in performance based on parameter choices. The two yellow lines in the figure show the trend that a configuration with 0 LSTM layers and 2 dense layers seems to result in better performance. Runs with 1 or 2 LSTM layers generally result in worse performance in this parameter search.
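A hypothetical sketch of how such a focused sweep could be expressed in Keras is shown below; build_model, the layer width of 128 and the class count of 4097 (one class per byte offset in [-2048, 2048]) are illustrative assumptions, not the sweep tooling described in section 4.1.7:

    import itertools
    import tensorflow as tf

    def build_model(num_lstm, num_dense, seq_len=5, num_classes=4097, width=128):
        """Stack num_lstm LSTM layers followed by num_dense dense layers."""
        layers = [tf.keras.Input(shape=(seq_len, 1))]
        for i in range(num_lstm):
            # All LSTM layers except the last must return full sequences.
            layers.append(tf.keras.layers.LSTM(width,
                                               return_sequences=i < num_lstm - 1))
        if num_lstm == 0:
            layers.append(tf.keras.layers.Flatten())
        for _ in range(num_dense):
            layers.append(tf.keras.layers.Dense(width, activation="relu"))
        layers.append(tf.keras.layers.Dense(num_classes, activation="softmax"))
        return tf.keras.Sequential(layers)

    # Vary only the layer counts; all other hyperparameters stay fixed.
    for num_lstm, num_dense in itertools.product(range(3), (1, 2)):
        model = build_model(num_lstm, num_dense)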

The two best runs both used models without any LSTM layers, indicating that dense layers work as well as or better than LSTM layers. In addition, given the project's aim to construct machine learning models that can be used in a hardware environment, it was decided to select a model configuration with one input layer that flattens the input, two hidden layers and an output layer. The other parameters were chosen based on the broad parameter sweep described in section 4.1.6. In this sweep there were no major performance differences when varying the input sequence length. With regard to hardware constraints, one of the smallest input sequence lengths that was tested was chosen, i.e. a length of 5. Given the large cluster of offsets centered around 0, shown in section 5.1.1 for certain traces, an offset range of [-2048, 2048] was selected.


Figure 5.7: Performance for different layer configurations when other hyperparameters were held constant. The configurations were trained and tested on the same trace file.

This offset range can capture common offsets such as "+64" and "-64". Given the principle of locality, it would also allow for other access patterns in the vicinity of the last fetched address. The configuration that was chosen for further training and testing is shown in table 5.1.

Table 5.1: Model configuration

Input sequence length     5
Number of dense layers    2
Offset range              [-2048, 2048]
Output representation     One-hot encoding
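A minimal sketch of what the configuration in table 5.1 could look like in Keras follows; the hidden-layer width of 128 is an illustrative assumption, and the output size of 4097 assumes one one-hot class per byte offset in [-2048, 2048]:

    import tensorflow as tf

    SEQ_LEN = 5                  # input sequence length (table 5.1)
    NUM_CLASSES = 2 * 2048 + 1   # one-hot classes, assuming byte-granular offsets

    model = tf.keras.Sequential([
        tf.keras.Input(shape=(SEQ_LEN, 1)),
        tf.keras.layers.Flatten(),                      # flattening input layer
        tf.keras.layers.Dense(128, activation="relu"),  # hidden layer 1
        tf.keras.layers.Dense(128, activation="relu"),  # hidden layer 2
        tf.keras.layers.Dense(NUM_CLASSES, activation="softmax"),
    ])
    model.compile(optimizer="adam", loss="categorical_crossentropy",
                  metrics=["accuracy"])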

After selecting a specific parameter configuration, the model was trained and tested individually on each of the three load traces described in section 5.1.1. It was trained on 80% of the data in each trace and tested on the remaining 20%.

Compared to the split used in section 5.1.3, the proportion of data in the test set was increased in order to have more test data and thus obtain a fairer performance metric. The test was conducted to determine whether the model was able to learn patterns not found in the training files that were used to find the final configuration. Figure 5.8 shows the performance based on two metrics.
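Since a trace is time ordered, such a split is naturally chronological rather than shuffled; a small sketch of this assumption, reusing build_examples from the earlier sketch:

    def chronological_split(examples, train_fraction=0.8):
        """Split time-ordered examples without shuffling, so the test set
        only contains accesses that occur after the training data."""
        cut = int(len(examples) * train_fraction)
        return examples[:cut], examples[cut:]

    # train_set, test_set = chronological_split(build_examples(addresses))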

The majority of the addresses within range are predicted correctly. This indicates that the model is capable of learning different access patterns and predicting them correctly on data it has not been trained on, within the same trace file. The accuracy metric is significantly lower than the filtered accuracy metric for 471.omnetpp-s0 and 437.leslie3d-s0. The reason for this is that the offset range only covers 33.89% and 25.07% of all of the offsets in the respective trace files.
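To make the two metrics concrete, the sketch below follows one plausible reading of the definitions in section 5.1.2: accuracy divides correct predictions by all accesses, while filtered accuracy divides only by the accesses whose true offset lies within the predictable range:

    def accuracy_metrics(predictions, targets, offset_range=(-2048, 2048)):
        """Overall accuracy and accuracy filtered to predictable offsets."""
        lo, hi = offset_range
        pairs = list(zip(predictions, targets))
        in_range = [(p, t) for p, t in pairs if lo <= t <= hi]
        accuracy = sum(p == t for p, t in pairs) / len(pairs)
        filtered = (sum(p == t for p, t in in_range) / len(in_range)
                    if in_range else 0.0)
        return accuracy, filtered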


Figure 5.8: Test results after training and testing on data from the same file (471.omnetpp-s0, 437.leslie3d-s0 and 433.milc-s2). Accuracy (proportion of correct predictions out of the total) is displayed in blue and filtered accuracy (proportion correct out of the addresses within the predictable range) is displayed in orange. The metrics are defined in section 5.1.2.

With these results it is possible to outperform the baseline, although this varies depending on the trace file, as shown in figure 5.9. The performance on the 433.milc-s2 trace file is worse than the next line baseline for this run. However, a simple next line prefetcher already performs well on this trace due to the large over-representation of offsets of one block, the 64-bar in figure 5.5, and it appears hard for the model to beat this already high performance.
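For reference, a next line prefetcher in this setting always predicts that the next load lands one cache block further on; a minimal sketch, assuming 64-byte blocks with byte-granular offsets, and assuming the relative measure in figure 5.9 is a simple ratio of accuracies:

    CACHE_BLOCK_BYTES = 64

    def next_line_accuracy(targets):
        """Baseline accuracy of always predicting a +64-byte offset."""
        return sum(t == CACHE_BLOCK_BYTES for t in targets) / len(targets)

    def relative_to_baseline(model_accuracy, targets):
        # Values above 1.0 mean the model beats the next line prefetcher.
        return model_accuracy / next_line_accuracy(targets)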

A prefetcher should be able to predict a large number of different addresses for it to be useful in an actual computer system. To see how well the final ML model works in a more general setting, the model was trained on three trace files whose common offsets are displayed in figures 5.1, 5.2 and 5.3. The model was then tested, one trace at a time, on three other traces it had not been exposed to; the results from these tests are displayed in figure 5.10. The test shows that the model has a worse accuracy compared to the previous result in figure 5.8. In comparison to the next line baseline, the performance is similar, because the trained model predicted next line for most of the inputs in the trace files used for testing.

Furthermore, some offsets were more common in the test files than in the trace files used for training, which made it hard for the model to predict them.


Figure 5.9: Performance compared to the next line baseline when trained on 80% of the data in the traces and tested on the remaining 20%. The bars represent the relative performance compared to the next line prefetcher in terms of the accuracy of correctly predicted load addresses: 20.055 for 471.omnetpp-s0, 1.001 for 437.leslie3d-s0 and 0.76 for 433.milc-s2. The height of each bar is relative to the performance of the baseline, and the horizontal red line indicates the accuracy of a next line prefetcher.

Another reason is that offsets outside the predictable range of the final model were quite common, making them impossible for the model to predict. For instance, in the 471.omnetpp-s0 trace file the offset 1344 was the second most common offset, occurring 770588 times, while in the training files that offset only occurred 373 times. In 437.leslie3d-s0, only 25.07% of the offsets are within the predictable range, and most of those are next line offsets, which the model predicted correctly.
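Mismatches like these are straightforward to surface by comparing offset histograms of the training and test traces; a sketch (offset_histogram is a hypothetical helper, not from the thesis code):

    from collections import Counter

    def offset_histogram(addresses):
        """Count how often each consecutive-load offset occurs in a trace."""
        return Counter(b - a for a, b in zip(addresses, addresses[1:]))

    # e.g. compare hist_test[1344] against hist_train[1344] to spot offsets
    # that are common in the test trace but rare in the training traces.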

Even though the performance is worse than when the model was trained on 80% of the data and tested on the remaining 20%, it provides an indication that the model is able to apply patterns such as next line, learned from the training data, to previously unseen test data from a completely different trace file. Yet, for models that are trained on trace files with particular prefetch patterns, it could be harder to generalize and predict patterns in other contexts.

This indicates that the models need to be trained on a larger quantity of data to be able to generalize better. However, it might be possible to improve on the next line prefetcher if the model were retrained for a specific program, given the results displayed in figure 5.8: the model showed better performance than the next line prefetcher when predicting on held-out data from within the traces it was trained on.


Figure 5.10: Test results after testing on a different file. Accuracy is displayed in blue and filtered accuracy is displayed in orange: 0.015 and 0.044 for 471.omnetpp-s0, 0.246 and 0.98 for 437.leslie3d-s0, and 0.482 and 0.742 for 433.milc-s2. The metrics are defined in section 5.1.2.
