Malvina Bozhidarova, Modelling extremes of environmental data
Degree: BSc (Hons) Mathematics, University of Manchester
Supervisor: Callum Barltrop
Non-stationarity is a term used to describe data for which the underlying distribution is not fixed over time. This is a common feature of many environmental datasets; for example, we often see temperature data increasing over time. When this is observed, standard statistical methodology that assumes the data are identically distributed cannot be applied. This can lead to many inferential challenges, especially in the case of extreme value theory.
This branch of statistics is the theoretical framework used to model ‘extreme’ events (events considered to be rare or uncommon). By definition, there is little data for such events, so any analysis relies heavily upon theoretical results. However, when non-stationarity is present, the standard theory cannot be applied. Moreover, we have to question what we mean by an ‘extreme’ event: an event considered ‘extreme’ today may not be considered ‘extreme’ in 10 years’ time.
Non-stationarity can arise from a range of different factors. One factor of particular interest is climate change. Climate change is expected to continue to increase temperatures, while also increasing the frequency and magnitude of some extreme weather events, such as storms and floods. It is therefore important to find ways to build climate change into our framework.
In this project, we will consider methods for modelling extreme events of environmental data using variables that drive climate change, such as CO2 levels. Initially, we will use a range of simulated datasets to develop a model that can capture climate trends in extreme events. We will then apply this model to UK climate projection data for a range of different variables, including temperature and humidity.
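As a flavour of the simulated-data starting point, here is a minimal sketch (not the project's actual model; all numbers are hypothetical) of a non-stationary extreme value fit: annual maxima are simulated with a location parameter that trends upwards over time, and the trend is recovered by maximising a GEV likelihood. Note that scipy's `genextreme` shape parameter `c` is the negative of the usual GEV shape.

```python
# A minimal sketch: fit a GEV distribution whose location parameter increases
# linearly with time, as a simple way of letting extremes trend upwards.
import numpy as np
from scipy.stats import genextreme
from scipy.optimize import minimize

rng = np.random.default_rng(1)
t = np.arange(100)                                     # e.g. years
# simulated annual maxima with a true upward trend of 0.05 per year in the location
x = genextreme.rvs(c=-0.1, loc=20 + 0.05 * t, scale=2, random_state=rng)

def nll(params):
    """Negative log-likelihood with mu(t) = b0 + b1 * t."""
    b0, b1, log_sigma, c = params
    return -genextreme.logpdf(x, c=c, loc=b0 + b1 * t, scale=np.exp(log_sigma)).sum()

fit = minimize(nll, x0=[x.mean(), 0.0, np.log(x.std()), 0.0], method="Nelder-Mead")
b0, b1, log_sigma, c = fit.x
print(f"estimated trend in location: {b1:.3f} per year (true 0.05)")
```

In a climate application, the time index would be replaced or supplemented by covariates such as CO2 levels.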
James Boyle, Modelling Populations of Networks
Degree: MMath Mathematics, University of Warwick
Supervisor: George Bolt
Network data arises when we have relational information between entities of a system. A canonical example is social network data, where we may observe information on friendships within a sample of a population. We typically represent this data as a mathematical graph, i.e. a set of vertices and edges, where vertices correspond to entities and edges represent relationships between them.
The development of statistical models for network data has become an active area of research, see Goldenberg et al. (2009) for a review. The majority of these models assume the observed network was generated in some pre-specified stochastic manner, dependent on a choice of model parameters. The task of the statistician is then to take an observed network and infer what parameters could have, or were most likely to have, led to its appearance.
Many of the traditional network models were constructed with application to individually observed networks in mind. However, with datasets becoming larger and richer, there has been recent interest in developing models to describe the generative process of a population of networks. An interesting example of such data is connectomes, which are network representations of brain connectivity inferred from MRI scans. Typically, a scan would be taken on a sample of patients, and so after some data processing, we end up with a network for each patient.
In this project, the student will be introduced to the statistical problem of modelling network data. Through simulation experiments, they will explore the benefits and drawbacks of some popular network models, before comparing these with models recently proposed in the literature that deal specifically with the problem of modelling a population of networks.
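To illustrate the idea of a network generated in a pre-specified stochastic manner, here is a minimal sketch (assuming the networkx library; block sizes and probabilities are made up) that simulates a single network from a stochastic block model and recovers the block connection probabilities from the observed edge densities. A population-of-networks version would repeat this across many sampled networks.

```python
# A minimal sketch: simulate from a two-block stochastic block model and estimate
# the block connection probabilities, given known block labels.
import itertools
import networkx as nx
import numpy as np

sizes = [30, 30]                                   # two communities
P = [[0.30, 0.05], [0.05, 0.30]]                   # within/between-block edge probabilities
G = nx.stochastic_block_model(sizes, P, seed=0)

labels = np.repeat([0, 1], sizes)                  # known block membership of each vertex
counts = np.zeros((2, 2)); trials = np.zeros((2, 2))
for i, j in itertools.combinations(G.nodes, 2):
    a, b = sorted((labels[i], labels[j]))
    trials[a, b] += 1
    counts[a, b] += G.has_edge(i, j)

# observed edge densities are the maximum likelihood estimates of the entries of P
est = {blocks: round(counts[blocks] / trials[blocks], 3) for blocks in [(0, 0), (0, 1), (1, 1)]}
print(est)                                         # should be close to the entries of P
```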
Luke Fairley, Extreme events: what are the odds?
Degree: BSc Mathematics, Lancaster University
Supervisor: Stan Tendijck
In this project, we compare different models for bivariate extremes. This has oceanographic applications, for example in the joint modelling of wave height and wind speed, both of which are important variables in the calculation of failure probabilities of offshore facilities. Other applications include the joint modelling of losses on financial assets such as the FTSE100 and the AEX, or the modelling of the composition of certain gases in the atmosphere.
In the project, we compare the Heffernan-Tawn model introduced in [2] and a number of derived models. The Heffernan-Tawn model is a conditional extremes model that captures a wide variety of different dependence structures; essentially, it is a form of regression model fitted only to the extremes. It is currently one of the most flexible models used in the field (500+ citations). Heffernan and Tawn chose their particular model form because it works asymptotically for almost all bivariate dependence structures (copulas) that had been developed. The model is not perfect, however, and many small variations have been proposed since its introduction, e.g. in [3]. We will compare a few of these and potentially come up with a new one.
The Heffernan-Tawn model is an asymptotic model, i.e., it works as long as we push observations far enough into the tail. However, in the area of extremes, we do not want to model observations with an occurrence probability of 10^-30, but rather 10^-2. At such levels, the Heffernan-Tawn model might not have converged enough for its model form to be the best possible one.
Based on different model choices, we compare the differences by estimating probabilities of extreme sets, e.g. the probability that a wave higher than 5m occurs together with a wind speed higher than 40 knots.
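The sketch below is not the Heffernan-Tawn model itself; on simulated data with made-up marginal transformations, it simply illustrates the kind of joint exceedance probability being estimated, and why ignoring the dependence between the two variables understates the risk.

```python
# A minimal sketch: estimate the probability of a joint extreme set empirically,
# and compare with the answer obtained by wrongly assuming independence.
import numpy as np

rng = np.random.default_rng(0)
n = 200_000
# dependent "wave height" and "wind speed" built from correlated Gaussian noise
z1, z2 = rng.multivariate_normal([0, 0], [[1, 0.7], [0.7, 1]], size=n).T
wave = 2.0 + 1.0 * np.exp(0.5 * z1)     # hypothetical marginal transformations
wind = 20.0 + 8.0 * np.exp(0.5 * z2)

joint = np.mean((wave > 5) & (wind > 40))
indep = np.mean(wave > 5) * np.mean(wind > 40)
print(f"empirical joint probability: {joint:.5f}")
print(f"under an independence assumption: {indep:.5f}")   # underestimates the risk
```

In practice, empirical estimates like this break down for sets beyond the range of the data, which is exactly where conditional extremes models such as Heffernan-Tawn are needed.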
Daniel Hodgson, Uncertain Predictions in Resource Allocation Models
Degree: MSc Natural Sciences, Durham University
Supervisor: Ben Black
Resource allocation problems are extremely common in the operational research literature. They consist of allocating a fixed set of resources, such as human workers or machines, to a set of skills, jobs or functions, to try to meet as much demand as possible. This could be allocating electricians to jobs (Chen et al., 2018) or multi-skilled handlers to calls (Koole and Pot, 2005). These problems handle one day at a time, and as such, the demand (jobs and calls) is known.
However, in some problems, such as the medium-term planning of a telecommunications company’s engineers (Ainslie et al., 2015), we need to plan for a large number of days in the future. This means we need forecasts of the demand that we will need to meet over this period. These forecasts are almost always uncertain, but many optimisation models still treat them as though they aren’t. This can lead to significant issues in the resulting allocations, such as wasted resources and unmet demand. This project will entail studying a variety of methods that can be used in mathematical and dynamic programming to reduce or incorporate this uncertainty in the models used. An example starting point could be training a reinforcement learning (RL) model (Sutton and Barto, 1998) that learns how best to correct a poor forecast in real time.
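As a toy illustration of why treating a forecast as known can go wrong (all numbers are hypothetical, and this is not the project's model), the sketch below allocates a fixed pool of engineers according to a point forecast and then checks by Monte Carlo how much demand goes unmet once demand uncertainty is simulated.

```python
# A minimal sketch: allocate engineers to three job types using a point forecast,
# then evaluate unmet demand under uncertainty in that forecast.
import numpy as np

rng = np.random.default_rng(0)
forecast = np.array([30, 50, 20])          # point forecast of jobs per type
capacity = forecast.sum()                  # exactly enough engineers "on paper"

# allocate engineers proportionally to the point forecast (ignores uncertainty)
allocation = forecast

# true demand is uncertain - here modelled as Poisson around the forecast
demand = rng.poisson(forecast, size=(10_000, 3))
unmet = np.maximum(demand - allocation, 0).sum(axis=1)
print(f"expected unmet jobs per day: {unmet.mean():.1f}")   # > 0 despite matching capacity
```

Methods studied in the project would aim to reduce this gap, for example by building buffers into the allocation or by correcting the forecast as information arrives.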
Katharina Limbeck, Solving small-scale Arc Routing Problems
Degree: MSci Mathematics and Statistics, University of Glasgow
Supervisor: Thu Dang
The Arc Routing Problem (ARP) arises in several applications, such as postal delivery, meter reading, snow removal, salt spreading, and waste collection. The aim is to find a vehicle route, or a set of routes, in a network at minimum cost, such that certain arcs are traversed by at least one vehicle, possibly subject to various side constraints such as limited vehicle capacities, time windows, one-way streets and so on.
Linear programming (LP) guides decision-makers toward the best options by making use of a mathematical model whose requirements are expressed as linear relationships; it is a special case of mathematical optimisation. The focus of this project is expected to be on how to model small-scale instances of the ARP.
Initially, simulated data will be used and suitable working code will be provided. Then, once a general framework for the model is established, small-scale real data will be used to test it. The model will then be tweaked to make use of all the specific features available in the data.
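To give a flavour of how such requirements translate into a linear program (a toy three-node directed network with made-up costs, not a full ARP formulation), the sketch below uses scipy's linprog to cover a set of required arcs at minimum cost while balancing traversals at every node.

```python
# A minimal sketch: three "required" arcs must each be traversed at least once,
# traversals must balance at every node (leave a junction as often as you enter it),
# and total cost is minimised with a linear program.
import numpy as np
from scipy.optimize import linprog

arcs = [(0, 1), (1, 2), (2, 0), (1, 0), (2, 1), (0, 2)]   # (from, to)
cost = [1, 1, 1, 2, 2, 2]
required = {(0, 1), (1, 2), (2, 0)}

# flow conservation: for each node, traversals out minus traversals in equals zero
A_eq = np.zeros((3, len(arcs)))
for k, (i, j) in enumerate(arcs):
    A_eq[i, k] += 1      # arc leaves node i
    A_eq[j, k] -= 1      # arc enters node j
b_eq = np.zeros(3)

bounds = [(1, None) if arc in required else (0, None) for arc in arcs]
res = linprog(cost, A_eq=A_eq, b_eq=b_eq, bounds=bounds)
print(dict(zip(arcs, np.round(res.x, 2))), "total cost:", res.fun)
```

Real ARP instances add integrality, capacities and time windows, but the same building blocks (a linear objective and linear constraints) remain.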
Adeeb Mahmood, Classification in an online setting
Degree: MSci Mathematics, Lancaster University
Supervisor: Chloe Fearn
In classification, a model is trained on some historic data, and for subsequent data the features are used to predict the responses. However, sometimes the underlying distribution of the responses given the features changes over time. In this case, if a model is trained once and then used indefinitely to classify incoming data (whose responses are not known), it will eventually be rendered useless, since the responses of the test data will no longer be related to their features in the same way as in the training data.
Another problem that arises in classification is that it is sometimes expensive to view the responses of instances. In this situation, we want to request the responses of the instances that bring the most information to the classifier, while viewing as few as possible to save on cost.
There is plenty of literature on online classification, much of which involves forgetting factors or sliding windows. This project will first explore offline binary classification methods, and then move on to online methods. If time allows, active learning will be explored; this involves methods that select a subset of the full data set to learn from when label requests are expensive.
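Here is a minimal sketch of the sliding-window idea (simulated data with an artificially rotating decision boundary, using scikit-learn): the classifier is refitted on only the most recent labelled points, so that data generated under an outdated relationship is gradually forgotten.

```python
# A minimal sketch of online classification under concept drift via a sliding window.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n, window = 2000, 200
X = rng.normal(size=(n, 2))
drift = np.linspace(0, np.pi, n)                      # decision boundary rotates over time
w = np.column_stack([np.cos(drift), np.sin(drift)])
y = ((X * w).sum(axis=1) + 0.3 * rng.normal(size=n) > 0).astype(int)

correct = []
for t in range(window, n):
    # refit on the most recent `window` labelled points, then predict the next point
    model = LogisticRegression().fit(X[t - window:t], y[t - window:t])
    correct.append(model.predict(X[[t]])[0] == y[t])
print(f"prequential accuracy with a sliding window: {np.mean(correct):.3f}")
```

A model trained once on the first 200 points and never updated would perform far worse by the end of the stream, which is the motivation for online methods.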
Jack McGinn, Modelling Waves in the Ocean
Degree: MPhys Physics with Theoretical Physics, University of Manchester
Supervisor: Jake Grainger
The world’s oceans continue to play an important part in many aspects of modern life. Waves in the ocean can cause damage to structures and ships alike, endangering their crews and causing significant financial and environmental damage. Waves also propagate onshore, where they cause erosion and flooding. As such, it is important to understand their behaviour.
Observations of ocean waves come in the form of measurements of the displacement of the sea surface at a given location. These observations can then be used to develop parametric models that can describe the sea surface, many of which are summarised by Michel (1999). Typically these models describe the spectral density function of the process of interest (the frequency domain analogue of the autocovariance). To fit such models to actual data we can use pseudo-likelihood approaches such as the de-biased Whittle likelihood (Sykulski et al., 2019).
The way in which these parameters evolve over time is of increasing interest to engineers, especially in rapidly developing weather systems such as tropical cyclones. During this project, the student will use existing techniques to fit models to data sets and then use the model fits to explore the behaviour of the different parameters as the sea evolves.
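As a simplified illustration (using the standard Whittle likelihood rather than the de-biased version of Sykulski et al., and an AR(1) spectral form instead of an ocean-wave spectrum), the sketch below fits a parametric spectral density to the periodogram of a simulated series.

```python
# A minimal sketch: fit an AR(1) spectral density to a simulated series by
# minimising the Whittle objective over the Fourier frequencies.
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
n, phi_true, sigma_true = 2048, 0.7, 1.0
x = np.zeros(n)
for t in range(1, n):                       # simulate an AR(1) process
    x[t] = phi_true * x[t - 1] + sigma_true * rng.normal()

freqs = 2 * np.pi * np.arange(1, n // 2) / n
I = np.abs(np.fft.fft(x)[1:n // 2]) ** 2 / (2 * np.pi * n)     # periodogram

def whittle(params):
    phi, log_sigma = params
    # AR(1) spectral density f(w) = sigma^2 / (2*pi*(1 - 2*phi*cos(w) + phi^2))
    f = np.exp(2 * log_sigma) / (2 * np.pi * (1 - 2 * phi * np.cos(freqs) + phi ** 2))
    return np.sum(np.log(f) + I / f)

fit = minimize(whittle, x0=[0.0, 0.0], bounds=[(-0.99, 0.99), (None, None)])
print(f"estimated phi: {fit.x[0]:.3f} (true {phi_true})")
```

In the project, the same fitting strategy would be applied with a parametric ocean-wave spectrum in place of the AR(1) form, and repeated over time windows to track how the parameters evolve.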
Daniel Morton, Input Uncertainty Quantification for Stochastic Simulation
Degree: BSc MORSE, Lancaster University
Supervisor: Drupad Parmar
The behaviour of many real-world systems, such as airports, hospitals, and manufacturing lines depends greatly upon some level of inherent randomness, and therefore such systems are frequently modelled using stochastic simulation. The randomness in the simulation is driven by input models, represented by probability distributions or processes, which are often estimated via data collected from the real-world system. Since the samples of data are finite, uncertainty arises in the estimated input models and this propagates through the simulation model to performance measure outputs.
Rarely is this propagation of input model uncertainty considered in simulation output analysis. Common practice is to report simulation-based confidence intervals for performance measures; however, these typically ignore input uncertainty and only account for stochastic estimation error. Without considering the propagation of input model uncertainty in simulation output analysis, decisions risk being made with misleading levels of confidence. Interest therefore lies in quantifying the uncertainty that arises in the simulation output as a result of the uncertainty in estimating the input models.
The aim of this project is to develop an understanding of input uncertainty and implement some existing methodologies for quantifying input uncertainty on provided stochastic simulation models, whilst considering the relative advantages and disadvantages of each method.
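One common starting point is the bootstrap; the sketch below (a toy single-server queue with made-up parameters, not one of the provided simulation models) resamples the input data, re-estimates the arrival rate, and re-runs the simulation to see how input uncertainty shows up in the output.

```python
# A minimal sketch: bootstrap the input data to see how uncertainty in an estimated
# arrival rate propagates through a simple single-server queue simulation
# (waiting times via the Lindley recursion), on top of the usual stochastic noise.
import numpy as np

rng = np.random.default_rng(0)

def mean_wait(arrival_rate, service_rate=1.0, n_customers=5000):
    inter = rng.exponential(1 / arrival_rate, n_customers)
    service = rng.exponential(1 / service_rate, n_customers)
    w = np.zeros(n_customers)
    for i in range(1, n_customers):                    # Lindley recursion
        w[i] = max(0.0, w[i - 1] + service[i - 1] - inter[i])
    return w.mean()

real_data = rng.exponential(1 / 0.8, size=50)          # small sample of real inter-arrival times
outputs = []
for _ in range(100):                                   # bootstrap the input model
    resample = rng.choice(real_data, size=real_data.size, replace=True)
    outputs.append(mean_wait(arrival_rate=1 / resample.mean()))
print(f"mean waiting time: {np.mean(outputs):.2f}, "
      f"spread (partly due to input uncertainty): {np.std(outputs):.2f}")
```

More refined methods aim to separate the input-uncertainty component of this spread from the stochastic estimation error, which is part of what the project will explore.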
Thomas Newman, Machine learning in simulation
Degree: MSci Statistics, University of Glasgow
Supervisor: Graham Laidler
Simulation is commonly used to model many real-world operations. For example, queueing systems naturally arise in the operational running of facilities such as hospitals, call centres, and manufacturing processes. The complexity of such systems often makes mathematical analysis infeasible, so to optimise their performance a simulation model is used instead. Briefly, a simulation model replaces the random processes that occur in the real-world system, such as customer arrivals and service times, with appropriately distributed random variables. Sampling these random variables allows the system to be simulated, and its performance can then be evaluated with regard to some measurable performance indicator, such as customer waiting times. However, owing to the stochastic nature of the systems being modelled, performance indicators can fluctuate significantly over time. As such, traditional time-averaged performance indicators give an incomplete picture of a highly variable system. There is growing interest in obtaining a deeper understanding of simulation behaviour; for example, we want to uncover the main causes of time-varying performance.
This project will include some exploratory data analysis of the data generated by a simulation model, with a focus on visualising the fluctuations in performance. We will then turn our attention to some common machine learning methods, and consider ways to exploit them for our purpose. Namely, we want to uncover the driving factors behind observed simulation performance. This project offers the chance to produce some novel methodology and can be flexible depending on the interests and prior experience of the student.
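Here is a minimal sketch of the kind of analysis intended (a toy single-server queue and a regression tree from scikit-learn, with made-up parameters): window-level summaries of the simulation are used as features to explain average waiting times.

```python
# A minimal sketch: simulate a single-server queue, summarise its performance in
# windows of customers, and fit a regression tree to see which features are
# associated with high average waiting times.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(1)
n = 20_000
inter = rng.exponential(1 / 0.9, n)
service = rng.exponential(1.0, n)
wait = np.zeros(n)
for i in range(1, n):                                  # Lindley recursion for waiting times
    wait[i] = max(0.0, wait[i - 1] + service[i - 1] - inter[i])

# features and outcome per window of 100 customers
windows = np.arange(n).reshape(-1, 100)
X = np.column_stack([inter[windows].mean(axis=1),      # average inter-arrival time
                     service[windows].mean(axis=1),    # average service time
                     service[windows].max(axis=1)])    # longest single service
y = wait[windows].mean(axis=1)

tree = DecisionTreeRegressor(max_depth=3).fit(X, y)
print(dict(zip(["mean_interarrival", "mean_service", "max_service"],
               np.round(tree.feature_importances_, 2))))
```

The feature importances give a first, crude answer to "what drives the fluctuations?", and the project would consider more principled ways of asking the same question.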
Taj Patel, Dynamic Latent Space Network Models
Degree: BSc Mathematics with Statistics, University of Warwick
Supervisor: Amiee Rice
Networks are often used to represent real-world interactions, and therefore the ability to model real-world behaviour is paramount in the field of network analysis and modelling.
Latent space models have been used to capture a high level of transitivity in networks (the common phrase that you will come across is “a friend of a friend is a friend of mine too”). The first task in this project is to understand the latent space modelling of networks.
Dynamic network models allow us to capture how time affects an interaction network: how do we accurately describe the change in affinity for connection between two individuals as time goes on? Using this information, you will then be able to implement models that combine both approaches to capture realistic interaction patterns.
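Here is a minimal sketch of a static latent space model (made-up dimensions and parameters; a dynamic version would let the positions move over time): edges are more likely between nodes whose latent positions are close, which is what generates the "friend of a friend" transitivity.

```python
# A minimal sketch: simulate a network from a latent space model and check that
# it exhibits transitivity.
import numpy as np

rng = np.random.default_rng(0)
n, alpha = 50, 2.0
z = rng.normal(size=(n, 2))                              # latent positions in 2-d space
dist = np.linalg.norm(z[:, None, :] - z[None, :, :], axis=-1)
p = 1 / (1 + np.exp(-(alpha - dist)))                    # logistic link on alpha - distance
A = (rng.uniform(size=(n, n)) < p).astype(int)
A = np.triu(A, 1); A = A + A.T                           # keep an undirected simple graph

closed = np.trace(A @ A @ A)                             # each triangle counted 6 times
paths2 = (A @ A).sum() - np.trace(A @ A)                 # ordered paths of length two
print(f"global transitivity: {closed / paths2:.2f}")     # noticeably above the edge density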
Ryan Pownall, Anomaly detection using functional data analysis, with applications to sea surface temperature data
Degree: BSc Mathematics with Finance, Newcastle University
Supervisor: Edward Austin
Functional Data Analysis is used to model phenomena observed over a period of time as a continuous function. This is of particular use in situations where the observations are recorded at a high frequency over the time period, as this means that a large collection of points can be represented as a single observed curve. Inference using the observed functions can then take a variety of forms, and this project will focus on anomaly detection using Functional Data. Anomaly detection is the process by which the data are examined to test whether an observation differs significantly from the other observations or some underlying expected process.
Anomaly detection for point data is a well-studied area, and the challenge is to extend the classical notions of an anomaly to the functional domain. In particular, how can outlyingness be measured with respect to a continuous function, given that each observation is a smooth curve which varies over time? Furthermore, how can this definition of outlyingness be described so that sensible conclusions can be drawn from the data?
This project will seek to address these challenges, first by performing a review of the existing functional data anomaly detection methods, and then using these to detect anomalies within Pacific Sea Surface Temperature data. The aim is to detect not only the effect of a changing climate on the sea surface temperature, but also to identify periods where anomalous weather has led to unexpected temperatures being recorded.
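As a simplified illustration (simulated curves rather than the sea surface temperature data, and a deliberately crude outlyingness score rather than a formal functional depth), the sketch below flags curves that sit unusually far from the pointwise median curve.

```python
# A minimal sketch: score each curve by its average distance from the pointwise
# median curve, and flag curves whose score is unusually large.
import numpy as np

rng = np.random.default_rng(0)
t = np.linspace(0, 1, 200)
curves = np.array([np.sin(2 * np.pi * t + rng.normal(0, 0.2)) + rng.normal(0, 0.1, t.size)
                   for _ in range(40)])
curves[7] += 1.5 * np.exp(-((t - 0.5) ** 2) / 0.01)      # inject a localised anomaly

median_curve = np.median(curves, axis=0)
score = np.mean(np.abs(curves - median_curve), axis=1)   # one outlyingness score per curve

q1, q3 = np.percentile(score, [25, 75])
flagged = np.where(score > q3 + 1.5 * (q3 - q1))[0]      # boxplot-style rule on the scores
print("flagged curves:", flagged)                        # curve 7 should be among them
```

Proper functional depth measures refine this idea by accounting for both the magnitude and the shape of each curve relative to the rest of the sample.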
Matthew Speers, Online Sparse Temporal Disaggregation
Degree: BSc Mathematics, Lancaster University
Supervisor: Luke Mosley
Due to the significant adverse effect of the coronavirus pandemic on the global economy, there has never been a more important time to understand the short-term movements of headline macroeconomic variables. We can no longer rely on infrequent publications of GDP or traditional annual business surveys to inform us of the current state of the economy. Ever since the 2007-2008 global financial crisis, national statistics institutes, such as the ONS here in the UK, have motivated the need for a vast set of high-frequency indicator time series that are readily available and measure numerous processes. This set will be used to create disaggregated series of infrequent headline variables, which will provide early warning signals of potentially large economic impacts such as financial crashes and pandemics, and deliver more accurate measurements of the rapidly evolving modern economy.
With the digital revolution we witness today, there are many potential sources of high-frequency indicators. To disaggregate GDP, we could use credit card transaction data or VAT returns data. To disaggregate inflation, we could use scanned supermarket price data or social media news articles. To disaggregate unemployment, we could use web-scraped online job advertisement data.
In the econometrics literature, the process of disaggregating a low-frequency time series by making use of indicator series recorded at the desired high frequency is known as temporal disaggregation. This is a two-step procedure that involves finding a preliminary estimate of the high-frequency disaggregated series (usually by performing GLS regression) and then distributing the aggregated residuals among the preliminary series. With the vast number of indicator series we would now like to use when performing temporal disaggregation, standard techniques such as GLS become statistically infeasible due to the curse of dimensionality, and therefore current methods fail. To resolve this difficulty, one can set up the temporal disaggregation problem in a sparse modelling framework by incorporating a LASSO regularisation penalty, which focuses on selecting a small set of indicators with the most informative power for the variable of interest. The resulting high-frequency estimates from sparse temporal disaggregation are informative on two fronts: firstly, they provide an accurate visualisation of the short-term movements of the headline variable, and secondly, they give an interpretation of which indicator series are most relevant for future estimations.
The aim of this project is to devise a way in which sparse temporal disaggregation can be performed in the online setting, i.e. how to automatically update the model in light of data revisions. Data revisions are very common when performing temporal disaggregation. For example, they may be due to a new time period becoming available for the low-frequency variable, or to changes in the indicator series set due to improved data sources. More major revisions occur when there is a change in legislation or accounting definitions, or in times of financial crisis. Understanding how estimates are affected by revisions plays an important role in assessing how reliable the sparse temporal disaggregation model is. We would like estimates to remain precise but also stable over time.
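Here is a minimal sketch of the two-step idea in a sparse setting (simulated monthly indicators and a quarterly target, using scikit-learn's LASSO; this is not the project's full method or its online extension): a LASSO regression on the aggregated indicators picks out a few informative series, and the low-frequency residual is then spread evenly within each period so that the estimates respect the aggregation constraint.

```python
# A minimal sketch of sparse temporal disaggregation on simulated data.
import numpy as np
from sklearn.linear_model import LassoCV

rng = np.random.default_rng(0)
quarters, per_quarter, p = 40, 3, 50
X = rng.normal(size=(quarters * per_quarter, p))         # monthly indicator series
beta = np.zeros(p); beta[:3] = [1.0, -0.5, 0.8]          # only 3 indicators truly matter
y_monthly = X @ beta + 0.1 * rng.normal(size=quarters * per_quarter)

C = np.kron(np.eye(quarters), np.ones((1, per_quarter))) # aggregation: months -> quarters
y_quarterly = C @ y_monthly                              # only this low-frequency series is observed

lasso = LassoCV(cv=5).fit(C @ X, y_quarterly)            # step 1: sparse preliminary fit
prelim = X @ lasso.coef_ + lasso.intercept_ / per_quarter
residual = y_quarterly - C @ prelim
estimate = prelim + C.T @ residual / per_quarter         # step 2: distribute residuals evenly

print("indicators selected:", np.flatnonzero(lasso.coef_))
print("aggregation constraint satisfied:", np.allclose(C @ estimate, y_quarterly))
```

An online version would need to update the selected indicators and the disaggregated estimates efficiently as new quarters arrive and as past values of the low-frequency or indicator series are revised.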