Department of Computer and Information Science
Final thesis
Estimation of Expected Lowest Fare
in Flight Meta Search
by
Lars Kristensson
LIU-IDA/LITH-EX-G--14/022--SE
2014-06-19
Linköping University
Supervisor: Arne Jönsson
Examiner: Arne Jönsson
Abstract
This thesis explores the possibility of estimating the outcome of a flight ticket fare comparison search, also called a flight meta search, before it has been performed, as being able to do this could be highly useful in improving the flight meta search technology used today. The algorithm explored in this thesis is a distance-weighted k-nearest neighbour, where the distance metric is a linear equation over sixteen first-degree features extracted from the input of the search. It is found that while the approach may have potential, the distance metric used in this thesis is not sufficient to capture the similarities needed, and the end algorithm performs only slightly better than random. At the end of this thesis a series of possible further improvements is presented that could potentially help raise the performance of the algorithm to a more useful level.

Table of contents
Abstract
Table of contents
1. Motivation
1.1 Fare overviews
1.2 Guided combinatorial searches
1.3 Inspiration searches
2. Previous Attempts
3. The Data
3.1 Collecting data
3.2 Data aging
3.3 Fare distribution
4. Defining The Problem
4.1 Basic definitions
4.2 Constraints
4.3 Segmentation
5. Defining The Algorithm
5.1 Input features
5.1.1 Hour of day
5.1.2 Day of week
5.1.3 Day of month
5.1.4 Month of year
5.1.5 Days to departure
5.1.6 Date
5.2 Similarity features
5.2.1 Circular dimensions
5.2.2 Linear dimensions
5.2.3 Binary dimensions
5.3 Similarity distance
5.4 Historic searches
5.5 Kernel function
5.6 The algorithm
6. Learning W
6.1 Learning curves
6.2 Testing the assumption
6.3 Feature analysis
7. Choosing k
8. Performance testing
8.1 Performance metrics
8.2 Results
9. Conclusions
10. Further improvements
10.1 Improved δ(I, I′)
10.2 Multiple segments
10.3 Adjusting σ
10.4 Online learning
11. References
Appendix 1: Fare development
Appendix 2: Fare distributions
Appendix 3: Features and fares
Appendix 4: Sample selection
Appendix 5: Learning curves
Appendix 6: Feature t-values
Appendix 7: Σ(V_s, H) at different k-values
Appendix 8: Test results

1. Motivation
Today a large portion of all airline tickets sold worldwide are sold online, and there are over one thousand OTA [1] and airline sites selling those tickets. In many cases several of those sites sell the very same ticket, but often at different fares. Furthermore, each OTA and, in particular, airline site holds only a limited selection of all tickets available for sale.

To give an overview of this jungle of tickets available at different fares, a large number of flight ticket fare comparison sites have been created, so called flight meta search sites [2]. Those sites have become very popular, and a few of them have turned into multimillion industries.

A typical flight meta search site takes as input a route [3], a departure date, an eventual return date and the number of passengers. It then aggregates flight data and ticket fares matching those criteria from a number of suppliers [4], and presents those in some visual comparison.

Since the end of the 90's, when the first flight meta search sites were launched, this has been the standard approach, with very few variations, even though there has, throughout the years, been a continuous flow of user requests for more complex, open and combinatorial searches. Over the last few years there have also been a couple of research reports showing the need for more open search criteria and inspiration oriented searches [Foolproof, 2008] [Balen, 2014].

The problem with these kinds of searches is that with a classic combinatorial approach they are prohibitively complex and resource consuming. In this thesis I'll explore the possibility of solving the problem using an approximation approach. Solving this opens up for development of a series of features that have been requested by end users for a very long time.

[1] Online Travel Agent.
[2] The name 'meta search' comes from those sites originally aggregating their tickets from different OTAs, who in their turn aggregated tickets from different airlines, often through some broker system called GDS.
[3] A pair of origin airport and destination airport.
[4] Typically a number of OTAs and airlines.

1.1 Fare overviews

If it is possible to estimate the cheapest ticket fare in a series of atomic searches [5] without first performing them, the estimates can be presented to the user in the form of fare graphs or fare matrices, where the user can review the different options before selecting the atomic searches he wants to perform.

[5] An atomic search is a search containing exactly one origin, one destination and one departure date per route segment.
1.2 Guided combinatorial searches
If it is possible to estimate the cheapest ticket fare of the atomic searches within a particular search space, this could be used as a heuristic function to guide a best-first search algorithm, opening up for guided meta-meta searches performed over a large set of atomic meta searches.
1.3 Inspiration searches
If it is possible to estimate the cheapest ticket fare in any atomic search, it is also possible to construct searches where the return value is not the outcome of a particular atomic search, but rather of a set of atomic searches. By leaving out one or several of the parameters of a normal atomic search, all potential options of atomic searches could be returned for the user to select from. This could be further enhanced by including other external data and by applying other constraining filters.

2. Previous Attempts
5
The idea of trying to estimate the outcome of a not yet performed atomic search is not entirely new. There have been previous attempts at using recorded extracts from the results of previously performed searches. Those stored result extracts have been used directly, to present price graphs [6] and to suggest potentially cheap atomic searches to the user [7].

All of those attempts have, though, been constrained by the number of possible atomic searches [8], causing the recorded data to become very sparse, and by the aging of the recorded data: a record stored yesterday may no longer be valid as an estimate of the currently cheapest fare.

Related to this problem there have also been attempts to predict whether the fare is going to go up or down in the near future [Etzioni et al., 2003] [Groves and Gini, 2013]. Those attempts, however, focus mainly on classifying the future development of the fare, given the current fare, and not on estimating the current fare.

[6] Found earlier at momondo.com (2009-2012).
[7] Found at skyscanner.com and flygresor.se, among many others.
[8] There are roughly 30,000,000 possible one way searches and roughly 10,000,000,000 possible return searches.

3. The Data

The first step required is to study the data available and needed.
3.1 Collecting data
To be able to calculate estimates of upcoming fares, we need data from previously seen searches and their resulting fares. A single atomic meta search typically returns a result of hundreds or even thousands of ticket fares. With millions of searches being performed each day, it is not feasible to store all this data long term. Instead it is necessary to select relevant extracts and store those. For the purpose of this thesis, the cheapest fare found in every result set of every search was extracted and stored. This could be extended into storing the cheapest fare available under certain criteria on the flight data, to be able to calculate estimates under those criteria, but that goes beyond the scope of this thesis.

3.2 Data aging
Fares on flight tickets fluctuate highly. The airlines continuously adjust the fare of the cheapest available tickets on a particular flight, in response to ticket availability and popularity. Similarly, the OTAs continuously adjust their markups, to meet demand and to compete with each other over their positioning in the comparisons on the meta search sites. On top of this there are also airlines running temporary promotions, where they dump the fare on some particular tickets, and subsystems temporarily breaking down or getting disconnected, rendering sets of ticket fares temporarily unavailable.

All this causes the cheapest available fare for a particular atomic search to fluctuate highly over time, down to a frequency of minutes, similar to a slow moving stock market.

To study this closer, ten searches were randomly selected and the locale and route extracted from those [9]. A program was then set up to search those locale and route combinations for a fixed departure date repeatedly, once per hour, for two weeks up until the departure date. The result of this has been summarized in Appendix 1: Fare development.

[9] See Appendix 4: Sample selection.

3.3 Fare distribution

Besides tracking the fare development over time, a distribution summary was also made for those locale and route combinations, where all recorded lowest fares for searches over those combinations up until the sample moment were extracted. Histograms over those ten data sets are provided in Appendix 2: Fare distributions.

It turns out that most sets have a positively skewed distribution and often contain multiple peaks. Beside that, they tend to be quite different, which may come naturally as each route is operated by a different set of airlines and each locale and route is covered by a different set of suppliers. The search volume and demand for each of those combinations also vary significantly.
4. Defining The Problem
The problem can be formalized through the following series of definitions.
4.1 Basic definitions
Define the input parameters needed for an atomic search of n segments as:

I^n = {I^n_L, ∪_{k=1..n} I^n_{O_k}, ∪_{k=1..n} I^n_{D_k}, ∪_{k=1..n} I^n_{DT_k}, I^n_P, I^n_T}

I^n_L = locale
I^n_{O_k} = origin of segment k
I^n_{D_k} = destination of segment k
I^n_{DT_k} = departure date of segment k
I^n_P = passengers
I^n_T = time when the search is being performed

Define the relevant components of the search result as:

R^n = ∪_i O^n_i

f(O^n_i) = total fare of offer O^n_i
f_min(R^n) = minimum total fare found in any O^n_i in R^n

The process of an atomic search of n segments is then defined as:

S^n: I^n → R^n

The looked-for algorithm can then be defined as:

E(I^n) = expected f_min(R^n)

4.2 Constraints

To simplify the work of this thesis, it is limited to studying only one way routes for only one adult passenger, giving the following simplifying definitions:

I = I^1 = {I_L, I_O, I_D, I_DT, I_T}
R = R^1
S: I → R
4.3 Segmentation
Studying the input parameters, there are three dimensions between which we cannot generalize:

{I_L, I_O, I_D}

For every {I_O, I_D} there is a different set of timetables and serving airlines, and for every {I_L, I_O} there is a different set of data sources chosen by the search engine. We may hence not be able to compare data between any of those. This leaves us with the remaining set within which we may do generalizations:

{I_DT, I_T}
5. Defining The Algorithm
Since the fares are continually changing and there is a continuous flow of new data being recorded, it would be favorable to use an approach that takes this new data into account, rather than trying to learn a static function. One such lazy approach is the distance-weighted k-nearest neighbours algorithm [Russell and Norvig, 2003] [Mitchell, 1997] [Segaran, 2007]. The algorithm is based on looking up a number of similar examples in a set of recorded examples and computing a weighted mean of their outcomes. An example in our case is a performed and stored search, defined as:

S′ = {I′, R′}
5.1 Input features
What is needed then is a similarity metric between I and I′. A proper definition of similarity is essential to the possible performance of the algorithm, and hence needs careful treatment. The first task is to identify all features in I and I′ that may have an influence on the fare. The only possibilities available are to be found in the set {I_DT, I_T, I′_DT, I′_T}. The features selected below are based on available domain expert knowledge.
5.1.1 Hour of day
Fares are assumed not to have more than one significant change within one hour, partially due to the many levels of caching down through the supplier chain, which interfere with any precision attempted below one hour. However, there are known cases of fares fluctuating periodically throughout the day, providing different markups during morning, daytime and evening. Consequently, the hour of day feature is defined as:

H(I_T) = hour of day part of I_T
5.1.2 Day of week
There is rich material available showing that the day of week of the departure has a strong influence on the fare. There is also a suspicion that the fare may in some cases depend on which day of week the search is being performed. Consequently, the following features are defined:

DW(I_DT) = day of week part of I_DT
DW(I_T) = day of week part of I_T

5.1.3 Day of month
There are no indications available that fares would cycle periodically throughout the days of a month. It could however be argued that such a pattern could exist, due to salary days, etcetera. For experimental purposes it is therefore included as:

DM(I_DT) = day of month part of I_DT
DM(I_T) = day of month part of I_T

5.1.4 Month of year
It is well known that there are seasonal changes in the fare over the year with regard to the departure date, but it is also commonly believed that this applies to when the ticket is being booked. This is covered by the following definitions:

MY(I_DT) = month part of I_DT
MY(I_T) = month part of I_T

5.1.5 Days to departure
The number of days remaining until the date of departure is known as the strongest factor in determining the fare. Important to notice in this case is how the impact of an absolute difference becomes stronger closer to the departure date, with the fare rising exponentially in the last days before departure. To represent this, the logarithm of the difference is used in the feature definition. Also, to make sure that an absolute difference of zero is represented by a zero, with rising differences giving positive values, plus one is added to the date difference:

DD(I_DT, I_T) = log(days difference in date between I_T and I_DT + 1)
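As an illustration, the input features above can be sketched in Python. The function name, dictionary keys and the use of `datetime` objects are my own assumptions for this sketch, not the thesis implementation:

```python
import math
from datetime import datetime

def input_features(I_DT: datetime, I_T: datetime) -> dict:
    """Extract the input features of section 5.1 from a departure date
    I_DT and a search time I_T (names are illustrative)."""
    return {
        "H":     I_T.hour,        # hour of day when the search is performed
        "DW_DT": I_DT.weekday(),  # day of week of the departure
        "DW_T":  I_T.weekday(),   # day of week of the search
        "DM_DT": I_DT.day,        # day of month of the departure
        "DM_T":  I_T.day,         # day of month of the search
        "MY_DT": I_DT.month,      # month of the departure
        "MY_T":  I_T.month,       # month of the search
        # DD: log of whole days between search and departure, plus one,
        # so that a same-day departure maps to log(1) = 0
        "DD":    math.log((I_DT.date() - I_T.date()).days + 1),
    }

f = input_features(datetime(2014, 5, 29), datetime(2014, 5, 15, 10, 30))
```

For a search performed 14 days before departure, the DD feature above becomes log(15), matching the plus-one convention in the definition.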
5.1.6 Date
It may be argued that the particular date of departure should be taken into consideration, as a particular event may raise the fare for a particular date or period, but this is already taken care of by including both of the features DD and A.

5.2 Similarity features
With the input features defined, it is then possible to define some similarity features between them.

5.2.1 Circular dimensions
Hour of day, day of week and day of month are all defined over circular dimensions of different sizes. Those distances are hence defined as the difference along the shorter way around each dimension:

δ(H(I_T), H(I′_T)) = (||H(I_T) − H(I′_T)|| > 12 → 24 − ||H(I_T) − H(I′_T)||) ⋀ (||H(I_T) − H(I′_T)||)
δ(DW(I_DT), DW(I′_DT)) = (||DW(I_DT) − DW(I′_DT)|| > 3.5 → 7 − ||DW(I_DT) − DW(I′_DT)||) ⋀ (||DW(I_DT) − DW(I′_DT)||)
δ(DW(I_T), DW(I′_T)) = (||DW(I_T) − DW(I′_T)|| > 3.5 → 7 − ||DW(I_T) − DW(I′_T)||) ⋀ (||DW(I_T) − DW(I′_T)||)
δ(DM(I_DT), DM(I′_DT)) = (||DM(I_DT) − DM(I′_DT)|| > 15.5 → 31 − ||DM(I_DT) − DM(I′_DT)||) ⋀ (||DM(I_DT) − DM(I′_DT)||)
δ(DM(I_T), DM(I′_T)) = (||DM(I_T) − DM(I′_T)|| > 15.5 → 31 − ||DM(I_T) − DM(I′_T)||) ⋀ (||DM(I_T) − DM(I′_T)||)
δ(MY(I_DT), MY(I′_DT)) = (||MY(I_DT) − MY(I′_DT)|| > 6 → 12 − ||MY(I_DT) − MY(I′_DT)||) ⋀ (||MY(I_DT) − MY(I′_DT)||)
δ(MY(I_T), MY(I′_T)) = (||MY(I_T) − MY(I′_T)|| > 6 → 12 − ||MY(I_T) − MY(I′_T)||) ⋀ (||MY(I_T) − MY(I′_T)||)
5.2.2 Linear dimensions
Since days to departure already has the logarithmic factor built in, its distance can be defined as a linear relationship:

δ(DD(I_DT, I_T), DD(I′_DT, I′_T)) = ||DD(I_DT, I_T) − DD(I′_DT, I′_T)||

As discussed previously, the validity of stored data decays as it grows older. Using hours as the lowest resolution of time measurement, and using the same reasoning as for days to departure, the age of data distance can be defined as:

δ(I_T, I′_T) = log(hours between I_T and I′_T + 1)
5.2.3 Binary dimensions
By studying the selected graphs in Appendix 3: Features and fares, it may be argued that an exact match is sometimes more important than the distance to a nearby feature value. The following definitions are hence added as a complement to the ones above:

δ_b(H(I_T), H(I′_T)) = (H(I_T) = H(I′_T) → 0) ⋀ (1)
δ_b(DW(I_DT), DW(I′_DT)) = (DW(I_DT) = DW(I′_DT) → 0) ⋀ (1)
δ_b(DW(I_T), DW(I′_T)) = (DW(I_T) = DW(I′_T) → 0) ⋀ (1)
δ_b(DM(I_DT), DM(I′_DT)) = (DM(I_DT) = DM(I′_DT) → 0) ⋀ (1)
δ_b(DM(I_T), DM(I′_T)) = (DM(I_T) = DM(I′_T) → 0) ⋀ (1)
δ_b(MY(I_DT), MY(I′_DT)) = (MY(I_DT) = MY(I′_DT) → 0) ⋀ (1)
δ_b(MY(I_T), MY(I′_T)) = (MY(I_T) = MY(I′_T) → 0) ⋀ (1)
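The three kinds of similarity features can be sketched as follows. The function names are illustrative; the periods 24, 7, 31 and 12 come from the circular definitions above:

```python
import math

def circular_delta(a: float, b: float, period: float) -> float:
    """Circular distance of section 5.2.1: the shorter way around a
    dimension that wraps at `period` (24 h, 7 days, 31 days, 12 months)."""
    d = abs(a - b)
    return period - d if d > period / 2 else d

def age_delta(hours_between: float) -> float:
    """Age-of-data distance of section 5.2.2: log of the hours between
    the two search times, plus one."""
    return math.log(hours_between + 1)

def binary_delta(a, b) -> int:
    """Binary distance of section 5.2.3: 0 on an exact match, else 1."""
    return 0 if a == b else 1
```

For example, hours 23 and 1 are two hours apart along the shorter way around the 24-hour circle, not 22.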
5.3 Similarity distance
Assuming independence between the different similarity features, a Manhattan distance between two sets of input parameters can be expressed as a linear relationship:

δ(I, I′) = w_Bias
  + w_CirH · δ(H(I_T), H(I′_T))
  + w_CirDWDT · δ(DW(I_DT), DW(I′_DT))
  + w_CirDWT · δ(DW(I_T), DW(I′_T))
  + w_CirDMDT · δ(DM(I_DT), DM(I′_DT))
  + w_CirDMT · δ(DM(I_T), DM(I′_T))
  + w_CirMYDT · δ(MY(I_DT), MY(I′_DT))
  + w_CirMYT · δ(MY(I_T), MY(I′_T))
  + w_LinDD · δ(DD(I_DT, I_T), DD(I′_DT, I′_T))
  + w_LinA · δ(I_T, I′_T)
  + w_BinH · δ_b(H(I_T), H(I′_T))
  + w_BinDWDT · δ_b(DW(I_DT), DW(I′_DT))
  + w_BinDWT · δ_b(DW(I_T), DW(I′_T))
  + w_BinDMDT · δ_b(DM(I_DT), DM(I′_DT))
  + w_BinDMT · δ_b(DM(I_T), DM(I′_T))
  + w_BinMYDT · δ_b(MY(I_DT), MY(I′_DT))
  + w_BinMYT · δ_b(MY(I_T), MY(I′_T))

where all the weights can be summarized into a vector:

W = <w_Bias, w_CirH, w_CirDWDT, w_CirDWT, w_CirDMDT, w_CirDMT, w_CirMYDT, w_CirMYT, w_LinDD, w_LinA, w_BinH, w_BinDWDT, w_BinDWT, w_BinDMDT, w_BinDMT, w_BinMYDT, w_BinMYT>
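Treating W as a bias weight followed by one weight per similarity feature, the weighted Manhattan distance reduces to a dot product. The list representation below is an assumption for illustration:

```python
def similarity_distance(W: list, deltas: list) -> float:
    """δ(I, I′) of section 5.3 as a weighted sum: W holds the bias weight
    followed by one weight per similarity feature, and `deltas` holds the
    sixteen feature distances in the same order (illustrative sketch)."""
    assert len(W) == len(deltas) + 1, "one weight per feature plus a bias"
    return W[0] + sum(w * d for w, d in zip(W[1:], deltas))
```

With W = <1, 2, 3> and feature distances (10, 100), the distance is 1 + 2·10 + 3·100 = 321.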
5.4 Historic searches
The algorithm also requires a set of historic searches, H, defined in Appendix 4: Sample selection.

The similarity distance between I and each search in H can then be defined as:

Δ(I, H) = {δ(I, I′), I′ ∈ S′ ∈ H}
5.5 Kernel function
δ(I, I′) now gives the similarity distance between the inputs of two searches. However, it cannot be applied directly, as the distance-weighted k-nearest neighbours algorithm requires similar entities to have a higher value than dissimilar ones, while δ(I, I′) gives the opposite.

While it is possible to invert δ(I, I′), that may not be a good idea, since it gives disproportionally large values for two inputs that are very similar, and will even fail to be defined for two inputs that are exactly the same. Instead a kernel function is applied, to smooth out the influence of large similarities and dissimilarities. One useful such kernel function is the Gaussian distribution function [Segaran, 2007]:

K(δ(I, I′)) = (1 / (σ√(2π))) · e^(−(δ(I, I′) − μ)² / (2σ²))

Since only proportional values are of interest, and we want to keep the values closest to zero the highest, this kernel function can be simplified into:

K(δ(I, I′)) = e^(−δ(I, I′)² / (2σ²))

with:

σ²(I, H) = (1 / |Δ(I, H)|) · Σ_{δ(I, I′) ∈ Δ(I, H)} δ(I, I′)²
5.6 The algorithm
Using this kernel function, the final algorithm can now be defined as:

E(I, H) = (Σ_{S′ ∈ H_k} K(δ(I, I′)) · f_min(R′)) / (Σ_{S′ ∈ H_k} K(δ(I, I′)))

where H_k is the subset of the k searches S′ with the lowest δ(I, I′) in H.
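A minimal sketch of E(I, H), assuming H is a list of stored searches and that `delta` and `fare_of` implement δ(I, I′) and f_min(R′); all names and the list representation are illustrative, not the thesis implementation:

```python
import math

def estimate(I, H, k, delta, fare_of):
    """Distance-weighted k-nearest-neighbour estimate E(I, H) of
    section 5.6 (sketch)."""
    dists = [delta(I, s) for s in H]
    # sigma squared over the whole set Δ(I, H); guarded so that a set of
    # identical inputs does not divide by zero
    sigma2 = sum(d * d for d in dists) / len(dists) or 1.0
    # H_k: the k historic searches with the lowest distance to I
    nearest = sorted(zip(dists, H), key=lambda pair: pair[0])[:k]
    weights = [math.exp(-d * d / (2 * sigma2)) for d, _ in nearest]
    fares = [fare_of(s) for _, s in nearest]
    return sum(w * f for w, f in zip(weights, fares)) / sum(weights)
```

When all k neighbours are equally distant the estimate reduces to their plain mean; a closer neighbour pulls the estimate toward its own fare.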
6. Learning W

With the estimation algorithm well defined, the next step is to find W. To make this possible, the first observation required is that δ(I, I′) relates to the difference between f_min(R) and f_min(R′), such that:

δ(I, I′) ~ ||f_min(R) − f_min(R′)||

There are many ways this relationship could be strengthened, but considering the nonlinearity of the fare distributions in the data sets, the following relationship is assumed:

δ(I, I′) ≃ log(||f_min(R) − f_min(R′)|| + 1)

Using this assumption, finding W can be considered an optimization problem where an error function J(W, T_p) is minimized over a set of pairs of old searches:

T_p = {<S′_Ts, S′_H>, S′_Ts ∈ T_s, S′_H ∈ H}

J(W, T_p) = (1 / 2m) · Σ_{<S′_Ts, S′_H> ∈ T_p} (δ(I′_Ts, I′_H) − log(||f_min(R′_Ts) − f_min(R′_H)|| + 1))²

Multivariate linear regression may now be used to find a W for which J(W, T_p) is minimal over the given training set T_p.
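The regression step can be sketched with ordinary least squares. The row layout, names and use of NumPy are assumptions for illustration, not the thesis code:

```python
import numpy as np

def fit_W(delta_rows, fare_pairs):
    """Least-squares fit of W (section 6). Each row of `delta_rows` holds
    the sixteen similarity-feature distances for one pair <S'_Ts, S'_H>,
    and `fare_pairs` holds the corresponding (f_min(R'_Ts), f_min(R'_H)).
    The target is log(||f_min(R'_Ts) - f_min(R'_H)|| + 1)."""
    # design matrix: a leading column of ones for the bias weight w_Bias
    X = np.hstack([np.ones((len(delta_rows), 1)),
                   np.asarray(delta_rows, dtype=float)])
    y = np.log(np.abs(np.asarray([a - b for a, b in fare_pairs],
                                 dtype=float)) + 1.0)
    W, *_ = np.linalg.lstsq(X, y, rcond=None)
    return W
```

With fare differences constructed as e^d − 1 for feature distance d, the fit recovers a zero bias and a unit weight, since the log target is then exactly d.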
6.1 Learning curves
Beside the training set T_p, a validation set V_p is also defined such that:

V_p = {<S′_Vs, S′_H>, S′_Vs ∈ V_s, S′_H ∈ H}

Note that:

V_p ⋂ T_p = ∅

For a V_p of fixed size it is then possible to plot curves of the minimum J(W, T_p) given by linear regression, and of J(W, V_p) for the same W, for different sizes of T_p.
A series of such learning curves are given in Appendix 5: Learning curves.
6.2 Testing the assumption
To show that the chosen assumption holds some degree of validity, a few other assumptions were tested against it:
A1: δ(I, I′) = ||log(f_min(R) + 1) − log(f_min(R′) + 1)||
A2: δ(I, I′) = log(||f_min(R) − f_min(R′)|| + 1)
A3: δ(I, I′) = ||f_min(R) − f_min(R′)||
A4: ||δ(I, I′)||² = ||log(f_min(R) + 1) − log(f_min(R′) + 1)||²
A5: ||δ(I, I′)||² = log(||f_min(R) − f_min(R′)|| + 1)²
A6: ||δ(I, I′)||² = ||f_min(R) − f_min(R′)||²
Table 1: r² for |T_p| = 1024 for the different assumptions.

                  A1       A2       A3       A4       A5       A6
<daDK, BKK, HAM>  0.2747   0.3292   0.1746   0.1323   0.3234   0.04537
<ruRU, MOW, LBD>  0.4586   0.589    0.4002   0.03134  0.5049   0.1606
<nbNO, OSL, AMS>  0.05314  0.05222  0.03432  0.03996  0.05933  0.02567
<deDE, OTP, MUC>  0.08581  0.1222   0.0911   0.1214   0.1057   0.148
<deDE, MUC, OTP>  0.258    0.3272   0.1996   0.204    0.27     0.09757
<daDK, ATH, CPH>  0.04862  0.07004  0.09414  0.04511  0.06315  0.09643
<daDK, KRP, OSL>  0.3276   0.2266   0.2478   0.2837   0.2345   0.181
<daDK, OSL, KRP>  0.2125   0.2156   0.164    0.1287   0.2351   0.1089

Compared with each of the other assumptions, A2 gives a higher r² in at least 5 out of 8 cases.
6.3 Feature analysis
While the r² above gives a metric on the coverage provided by the features, a further study of the individual features was done by calculating the t-value for each of them. A listing of those, together with findings, is available in Appendix 6: Feature t-values.

7. Choosing k
The final parameter to be chosen is k. To do this, the following standard deviation is computed over the full validation set for a series of values of k:

Σ(V_s, H) = √((1 / |V_s|) · Σ_{S′ ∈ V_s} (E(I′, H) − f_min(R′))²)

Appendix 7: Σ(V_s, H) at different k-values shows Σ(V_s, H) for a series of k-values for all locales and routes. Since computing those values is a heavy operation compared to fitting W, it would be preferable if a single value of k could be used for all routes. Choosing the lowest value for which most routes seem to perform fairly well gives k = 5.
8. Performance testing
8.1 Performance metrics

To be able to measure the performance of the algorithm, a few more metrics are defined. First, to ensure that the actual fare can be estimated at all using the available data, the outer borders of the fare range covered by the historic set H are defined as:

Θ(H) = <min(f_min(R′)), max(f_min(R′))>, R′ ∈ S′ ∈ H

The error between the estimated fare and the actual fare is defined as:

ε(S, H) = E(I, H) − f_min(R)

A normalized standard error is used to determine the relative deviation of the estimated fare from the mean actual fare, defined as:

ρ(F_s, H) = Σ(F_s, H) / ((1 / |H|) · Σ_{S′ ∈ H} f_min(R′))

8.2 Results
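Σ from section 7 and the normalized error ρ can be sketched as follows (function names are illustrative):

```python
import math

def sigma_error(estimates: list, actuals: list) -> float:
    """Σ of section 7: root mean squared error between the estimated and
    the actual lowest fares over a set of searches."""
    assert len(estimates) == len(actuals)
    return math.sqrt(sum((e - a) ** 2
                         for e, a in zip(estimates, actuals)) / len(actuals))

def rho(sigma: float, historic_fares: list) -> float:
    """ρ(F_s, H) of section 8.1: Σ(F_s, H) relative to the mean actual
    fare over the historic set, i.e. the error relative to price level."""
    return sigma / (sum(historic_fares) / len(historic_fares))
```

Dividing by the mean fare makes routes with very different price levels comparable in the tables that follow.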
Figure 1: The result for <nbNO, OSL, AMS> over an F_s with 527 examples, sorted by actual fare.

A full listing of the results for all locales and routes is presented in Appendix 8: Test results. While the final results turn out to be very weak, as can be seen in Figure 1 and further in Appendix 8, the algorithm still performs better than random overall, as shown in Table 2.
Table 2: Σ(F_s, H) for the normal algorithm, for a run where the distance δ(I, I′) is generated randomly, and for a run where the estimated value is simply the mean of all f_min(R′) in H, that is E(I) = (1 / |H|) · Σ_{S′ ∈ H} f_min(R′). The best performing run for each locale and route is the one with the lowest value in its row.

                  Normal   Random δ(I, I′)   Mean over H
<daDK, BKK, HAM>  118      137               127
<ukUA, DXB, IEV>  211      118               175
<ruRU, MOW, LBD>  22       85                82
<nbNO, OSL, AMS>  116      151               149
<deDE, OTP, MUC>  95       98                90
<deDE, MUC, OTP>  106      105               108
<daDK, ATH, CPH>  186      214               199
<daDK, KRP, OSL>  96       98                96
<daDK, OSL, KRP>  18       71                51
9. Conclusions
While performing better than random overall, the results are not satisfying. Even when excluding <ukUA, DXB, IEV>, where the test set ended up being particularly bad compared to the training set, there are only a few locales and routes where the algorithm performs clearly better than random.

Considering that H, with only a few exceptions, contains the span of recorded fares needed for full performance, the main cause of the poor performance is that δ(I, I′) fails to capture the similarities with those. This is not surprising, since the model used for δ(I, I′) only captures linear relationships in the input features, while their actual shape tends to be more complex than that. This is also verified by the very low r² for several locales and routes.

If a model for δ(I, I′) that better captures the characteristics of the input, and the similarity between inputs, could be found, the algorithm still has the potential of performing very well, since H tends to contain the records needed and the overall price level tends to be fairly stable, based on a few plateaus with only a limited number of deviating exceptions. The wide variation in the amount of data available for different locales and routes, though, puts constraints on what model can be chosen for δ(I, I′). While it might be natural to choose a very complex model for δ(I, I′) to capture the complexity of the inputs, this might cause the model to overfit, or be unable to fit at all, for the majority of locales and routes where the amount of available data is small. Further exploration of this is needed.
10. Further improvements
Here are some suggestions on how the algorithm could potentially be improved.

10.1 Improved δ(I, I′)

The most obvious, and required, further improvement is to improve the ability of δ(I, I′) to detect similarities. One option would of course be to find another algorithm altogether, using a vector space model, a clustering technique, a probability distribution, or some other technique. Keeping the regression model, similarity features of higher degree could be added. Such polynomial expressions could help capture the actual shape of the input features, which tends to be more complex than a straight line. New input and similarity features could also be introduced, for example the week of the year, or combinations of features that may be related.
10.2 Multiple segments
Once the algorithm has been found to perform acceptably for one-way flights, the next natural improvement would be to extend it to multiple segments, that is, to S^n. This expands the number of I_DT from which input features can be extracted. Although, since there will still be only one I_T, the complexity of the similarity model will grow by less than a factor of n.

10.3 Adjusting σ

The value of σ directly influences how strongly the difference in distance affects the final fare estimate. By adjusting σ, the influence of a single deviating record in H matching a particular I could be corrected, potentially compensating for an insufficient spread in δ(I, I′).
10.4 Online learning
Considering how a continuous flow of new data is coming into the system, it would be interesting to turn the algorithm into an online learning variant, where δ(I, I′) is adjusted for every search that is first estimated and then performed, feeding the error back into the parameters of the model. This could be particularly useful for adapting quickly to changes in the market and for quickly learning new routes that become popular and simply haven't had sufficient data available before.
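Such an update could be sketched as a stochastic-gradient nudge of W toward the assumed log fare relationship of section 6. This is hypothetical; the thesis does not implement it, and the function and parameter names are my own:

```python
import math

def online_update(W, deltas, observed_fare_diff, learning_rate=0.01):
    """One hypothetical online-learning step in the spirit of section
    10.4: after a search has been estimated and then performed, nudge W
    so that the weighted distance for this pair moves toward the observed
    log fare difference, mirroring the batch objective J(W, T_p)."""
    x = [1.0] + list(deltas)                          # bias term plus features
    predicted = sum(w * xi for w, xi in zip(W, x))    # delta(I, I') under current W
    target = math.log(abs(observed_fare_diff) + 1.0)  # assumed relationship, section 6
    error = predicted - target
    # gradient step on the squared error for this single pair
    return [w - learning_rate * error * xi for w, xi in zip(W, x)]
```

Each performed search then costs one dot product and one vector update, which is what makes the approach attractive for a continuous data stream.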
11. References
[Foolproof, 2008] Natalie Machon, Chris Meeke, Julia Williams and Tom Wood for Foolproof Inc. Online Shopping Survey: Travel. Retrieved from http://www.foolproof.co.uk/, 2008.

[Balen, 2014] John Balen. We're Still Traveling Like It's 1996. http://techcrunch.com/2014/05/10/whyarewestilltravelinglikeits1996/ (Accessed 2014-06-23).

[Etzioni et al., 2003] Oren Etzioni, Rattapoom Tuchinda, Craig Knoblock, and Alexander Yates. To Buy or Not To Buy: Mining Airfare Data To Minimize Ticket Purchase Price. In SIGKDD Conf. on Knowledge Discovery and Data Mining, pages 119–128, 2003.

[Groves and Gini, 2013] W. Groves and M. Gini. Optimal Airline Ticket Purchasing Using Automated User-Guided Feature Selection. In IJCAI '13: Proc. 23rd Int'l Joint Conf. on Artificial Intelligence, 2013.

[Russell and Norvig, 2003] Stuart Russell and Peter Norvig. Instance-Based Learning. In Artificial Intelligence: A Modern Approach, Second Edition, pages 733–736, 2003.

[Mitchell, 1997] Tom M. Mitchell. Instance-Based Learning. In Machine Learning, International Edition, pages 230–248, 1997.

[Segaran, 2007] Toby Segaran. Building Price Models. In Programming Collective Intelligence, pages 167–195, 2007.

Appendix 1: Fare development
Graphs showing the fare development per hour, for a fixed departure date of 2014-05-29, for different locales and routes.
Appendix 2: Fare distributions
Histograms over all recorded lowest fares in the results of all searches with the given locale and route.
Appendix 3: Features and fares
Examples of where the fare distribution for a particular group of departure dates differ significantly from that of nearby groups.
Appendix 4: Sample selection

This appendix gives an overview of how the sets used in this thesis were selected. First a sample of 10 searches was selected by capturing the 10 latest searches done at a random point in time:

Q = ∪_{i=1..10} S′_i

The segment-defining parameters {I′_L, I′_O, I′_D} and the search time I′_T were extracted from each of those.

All searches belonging to the same segment {I′_L, I′_O, I′_D} as some search in Q, and requested during the week before min(I′_T, S′ ∈ Q), were then extracted into a training set:

T_s = {S′, {I′_L, I′_O, I′_D} of S′ matches some S′(Q) ∈ Q, min(I′_T(Q), S′(Q) ∈ Q) − 1 week ≤ I′_T < min(I′_T(Q), S′(Q) ∈ Q)}

A similar set for validation was then extracted from the week before the oldest search in the training set:

V_s = {S′, {I′_L, I′_O, I′_D} of S′ matches some S′(Q) ∈ Q, min(I′_T(T_s), S′(T_s) ∈ T_s) − 1 week ≤ I′_T < min(I′_T(T_s), S′(T_s) ∈ T_s)}

Next a set for testing purposes was extracted from the week before the oldest search in the validation set:

F_s = {S′, {I′_L, I′_O, I′_D} of S′ matches some S′(Q) ∈ Q, min(I′_T(V_s), S′(V_s) ∈ V_s) − 1 week ≤ I′_T < min(I′_T(V_s), S′(V_s) ∈ V_s)}

Finally a similar set of historic records was extracted, with all the recorded searches that happened before any search in the test set:

H = {S′, {I′_L, I′_O, I′_D} of S′ matches some S′(Q) ∈ Q, I′_T < min(I′_T(F_s), S′(F_s) ∈ F_s)}

This gives the following composition of samples:

I_L    I_O   I_D   T_s    V_s    F_s    H
daDK   ATH   CPH   1225   1093   1083   10154
nbNO   OSL   AMS   502    478    527    3400
daDK   BKK   HAM   291    303    312    1231
deDE   MUC   OTP   27     26     34     62
deDE   OTP   MUC   26     26     32     65
daDK   KRP   OSL   21     13     13     367
daDK   OSL   KRP   17     14     15     184
ukUA   DXB   IEV   15     14     19     60

Please note that those sets contain only 9 segments of searches, due to the initial sample Q containing one search that had never been recorded before. It is also interesting to note the stability in the number of searches being done within a particular segment over consecutive weeks.
Similarity sets T_p and V_p, containing similarity features, were further generated by extracting the input features from T_s and V_s respectively and cross-combining those with H.
Appendix 5: Learning curves
Here are a couple of learning curves showing J(W, T_p) (green, bottom) and J(W, V_p) (blue, top) for a series of sizes of the training set T_p, while keeping the validation set V_p fixed and equal to the maximum tested size of T_p.

It can be seen that for most sets J(W, V_p) converges at a |T_p| of about 1,000 random samples, and only negligible improvements happen beyond a |T_p| of 10,000 random samples. Further, a |T_p| above approximately 100 random samples is required to be able to learn a generalized W at all.
Appendix 6: Feature t-values
Feature                            <daDK, BKK, HAM>  <ruRU, MOW, LBD>  <nbNO, OSL, AMS>  <daDK, ATH, CPH>
δ(H(I_T), H(I′_T))                 1.871             0.993             2.548             0.903
δ(DW(I_DT), DW(I′_DT))             1.009             0.990             0.454             0.201
δ(DW(I_T), DW(I′_T))               1.406             0.918             0.993             0.717
δ(DM(I_DT), DM(I′_DT))             1.214             0.530             0.034             1.196
δ(DM(I_T), DM(I′_T))               6.673             5.666             1.332             0.782
δ(MY(I_DT), MY(I′_DT))             2.934             7.203             2.110             16.766
δ(MY(I_T), MY(I′_T))               6.223             37.240            1.987             3.174
δ(DD(I_DT, I_T), DD(I′_DT, I′_T))  15.651            1.831             19.669            1.096
δ(I_T, I′_T)                       7.837             44.704            3.809             1.373
δ_b(H(I_T), H(I′_T))               0.111             0.573             2.075             0.651
δ_b(DW(I_DT), DW(I′_DT))           0.342             1.936             1.722             3.435
δ_b(DW(I_T), DW(I′_T))             1.234             1.263             1.956             0.528
δ_b(DM(I_DT), DM(I′_DT))           0.258             0.301             2.889             2.006
δ_b(DM(I_T), DM(I′_T))             0.033             0.911             0.647             0.713
δ_b(MY(I_DT), MY(I′_DT))           4.871             3.403             0.501             0.621
δ_b(MY(I_T), MY(I′_T))             14.685            6.207             2.861             0.703

The table shows t-values for all features over a few locales and routes, with |T_p| = 8192. While there might be some trends, the importance of the different features seems to be very dependent on the locale and route. A full treatment of which features are globally useful to the algorithm would require a much larger collection of locales and routes than is available for this thesis, and is hence left as a future improvement.
Appendix 7: Σ(V_s, H) at different k-values

With a perfect δ(I, I′) the optimal value of k should be very low, with performance degrading as k increases. While some locales and routes do show this behaviour, not all graphs do.
Appendix 8: Test results
Line diagrams show the actual fare, f_min(R′), in black and the estimated fare, E(I′), in red, over all searches S′ in the test set F_s, sorted by actual fare for visibility. The outer borders of the fare range covered by the historic set, Θ(H), are also shown as two pink lines.

In <ruRU, MOW, LBD> and <daDK, OSL, KRP> the upper limit of Θ(H) has been removed, since it lies far above the shown graphs.