So, You Want to be a Data Scientist...
Lightning strikes worldwide account for 240,000 injuries per year, and about 10% of those are fatal [Holle, 2008]. Airport ground handling crews are particularly vulnerable because they can be struck by a 'side flash': a static electric discharge that occurs when workers are standing near the aircraft, even though they are not actually touching it.
As a data scientist, you want to use past data on lightning strikes to predict when a lightning strike might be imminent at Changi International Airport in Singapore. That is what this DIY (Do-It-Yourself) exercise is about.
Step through menu items on the left to experience the modeling and prediction process. There are mini-challenges and animations throughout the journey. By the way, this was a Term 4 project for two teams of students in our Data and Business Analytics course, Fall 2020.
Thanks go to the Vaisala Corporation for sharing the lightning strike data and to students Max Koh Junran, Heather Lee Xuan Hui, Tan Peck Kee, Wang Jiahui, Ho Bing Xuan, Mohamad Arshad S/O Khaja Moinudeen, Li Tianshu, and Ong Yao De for conducting the analysis. This presentation is made with R Shiny, which you will learn in our Term 5 course, Engineering Systems Architecture.
What the Data Look Like
The raw data for this study is a record of every lightning strike in the Singapore-Johor region for September-December, 2019.
We access these data using queries to our database such as:
SELECT * FROM LightningExtract LIMIT 10
Pick the query we would use if we wanted to just look at strikes which occurred on November 11, 2019.
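To get a feel for this kind of filter, here is a minimal sketch using Python's built-in sqlite3 module. The column names (strike_time, latitude, longitude) and the sample rows are assumptions made up for illustration; the real LightningExtract schema is not shown here.

```python
import sqlite3

# Hypothetical schema and toy rows: the real LightningExtract columns may differ.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE LightningExtract (strike_time TEXT, latitude REAL, longitude REAL)"
)
conn.executemany(
    "INSERT INTO LightningExtract VALUES (?, ?, ?)",
    [
        ("2019-11-10 23:59:10", 1.35, 103.98),
        ("2019-11-11 00:05:42", 1.36, 103.99),
        ("2019-11-11 17:20:03", 1.40, 103.90),
        ("2019-11-12 00:00:00", 1.30, 103.80),
    ],
)

# Half-open range: from midnight on 11 Nov up to, but not including, midnight on 12 Nov.
query = """
    SELECT * FROM LightningExtract
    WHERE strike_time >= '2019-11-11 00:00:00'
      AND strike_time <  '2019-11-12 00:00:00'
    ORDER BY strike_time
"""
nov11_strikes = conn.execute(query).fetchall()
for row in nov11_strikes:
    print(row)
```

Because the timestamps are stored as text in year-month-day order, comparing them as strings gives the same result as comparing them as dates.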
Plot the Data
The first lesson in data science and statistics is to look at your data. You should always find some way to visualize your data. In this case, we have geographical coordinates of each lightning strike so it makes sense to see these on a map.
[Suggestion: once you select the slider, use the left ('<') and right ('>') keys on your keyboard.]
Choose a Grid Size
Let's break up the region into rectangular cells. We want a grid for which one of the cells nicely encloses Changi airport.
Pick a grid size and we will tell you if it matches the one we picked.
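If you are curious how a strike gets assigned to a cell, the sketch below shows one simple way. The region bounds and the 5 x 6 layout are made-up illustration values, not the grid we actually picked.

```python
# All bounds and grid dimensions below are assumptions for illustration only.
LAT_MIN, LAT_MAX = 1.0, 2.0
LON_MIN, LON_MAX = 103.4, 104.6
N_ROWS, N_COLS = 5, 6  # 5 x 6 = 30 rectangular cells


def cell_id(lat, lon):
    """Return a cell number from 0 to N_ROWS * N_COLS - 1, or None if outside the region."""
    if not (LAT_MIN <= lat < LAT_MAX and LON_MIN <= lon < LON_MAX):
        return None
    row = int((lat - LAT_MIN) / (LAT_MAX - LAT_MIN) * N_ROWS)
    col = int((lon - LON_MIN) / (LON_MAX - LON_MIN) * N_COLS)
    return row * N_COLS + col


# A point roughly at Changi airport (approximate coordinates).
print(cell_id(1.36, 103.99))
```

Changing N_ROWS and N_COLS changes how tightly a single cell can enclose the airport, which is the tradeoff behind the choice of grid size.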
Choose the Time Step
We chose to segment time into 15-minute intervals. This is more for simplicity of presentation than for accuracy. There are 4*24*30=2880 such intervals in November: we use two slider bars to select one for visualization.
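Here is a small sketch of how a timestamp could be turned into an interval number. Counting from 0 at midnight on November 1 is an assumption for illustration; the actual numbering we used may differ.

```python
from datetime import datetime

NOVEMBER_START = datetime(2019, 11, 1)
STEP_MINUTES = 15


def interval_id(ts):
    """0-based index of the 15-minute interval that contains the timestamp ts."""
    elapsed = ts - NOVEMBER_START
    return int(elapsed.total_seconds() // (STEP_MINUTES * 60))


# 4 intervals per hour, 24 hours per day, 30 days in November.
intervals_in_november = 4 * 24 * 30

print(interval_id(datetime(2019, 11, 1, 0, 15)))
print(intervals_in_november)
```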
Summarize the Data
Having chosen a spatial grid and a time step, we next summarize the data: count the number of lightning strikes in each grid cell for each time step. (In a hidden step we already labelled each lightning strike with the grid cell and time segment it belongs to.) Aggregation is a frequent task in data management, so get started with aggregation by picking which query we should use. The resulting table will be revealed as soon as you pick the correct one.
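For reference, counting rows per group generally looks like the sketch below. The table name StrikeLabels, its columns, and the toy rows are assumptions for illustration and may not match the tables in our database.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical table: one row per strike, already labelled with interval and cell.
conn.execute("CREATE TABLE StrikeLabels (intervalid INTEGER, cellid INTEGER)")
conn.executemany(
    "INSERT INTO StrikeLabels VALUES (?, ?)",
    [(1, 8), (1, 8), (1, 9), (2, 8), (2, 8), (2, 8)],
)

# One output row per (interval, cell) pair, with the number of strikes in it.
counts = conn.execute(
    """
    SELECT intervalid, cellid, COUNT(*) AS strikes
    FROM StrikeLabels
    GROUP BY intervalid, cellid
    ORDER BY intervalid, cellid
    """
).fetchall()
print(counts)
```

Note that pairs with no strikes at all do not appear in the result, because there are no rows to count for them.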
Fill in the Blanks
The time intervals when there are no lightning strikes are just as important as intervals when there are. We can get all possible combinations of time intervals and grid cells with the following query:
SELECT intervalid,id FROM TimeSeq,Cells
With 2880 time intervals and 30 cells, that makes for 2880*30=86400 possible combinations. We next match up those combinations with the interval-cells with lightning strikes using what is called a database join.
Our students found they could be just as accurate in their predictions by simplifying the data. Instead of looking at the number of lightning strikes in each cell, they looked at simply whether there were any lightning strikes or not. They coded the time interval-cell id with a '1' if lightning struck that cell in that time interval, and with a '0' otherwise.
We display the '1's on the map by shading the cell. We call these 'active cells'.
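The fill-in and coding steps can be sketched together as follows. TimeSeq and Cells follow the query shown earlier in this section; the StrikeCounts table, its columns, and the toy rows are assumptions for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE TimeSeq (intervalid INTEGER);
    CREATE TABLE Cells (id INTEGER);
    -- Hypothetical summary table: only interval-cells that had strikes.
    CREATE TABLE StrikeCounts (intervalid INTEGER, cellid INTEGER, strikes INTEGER);
    INSERT INTO TimeSeq VALUES (1), (2), (3);
    INSERT INTO Cells VALUES (7), (8);
    INSERT INTO StrikeCounts VALUES (1, 8, 2), (3, 7, 1);
    """
)

# Every interval-cell combination, matched to strike counts where they exist.
# Unmatched combinations get NULL from the LEFT JOIN and are coded as 0.
rows = conn.execute(
    """
    SELECT t.intervalid, c.id,
           CASE WHEN s.strikes IS NULL THEN 0 ELSE 1 END AS active
    FROM TimeSeq AS t
    CROSS JOIN Cells AS c
    LEFT JOIN StrikeCounts AS s
           ON s.intervalid = t.intervalid AND s.cellid = c.id
    ORDER BY t.intervalid, c.id
    """
).fetchall()
for row in rows:
    print(row)
```

With 3 intervals and 2 cells the toy result has 3*2=6 rows, in the same way that the full data set has 2880*30=86400.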
What to Correlate?
The next big decision is what to correlate. That is, what variable are we trying to predict (the dependent variable), and what variables will we use for the prediction (the independent variables)?
Clearly, we want to be able to predict whether Changi airport will experience a lightning strike. So, the dependent variable is whether or not the Changi cell is active in a given time interval. We want advance warning of such a strike, so we will base the predictions on which cells are active in the previous time interval. That is not much warning but it will likely be the most accurate we could achieve. So, the independent variables are the active cells, or a subset of them, in the previous interval.
Let's display the dependent variable in one map and the independent variables in a second map so you can look for correlation.
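In code, pairing each interval with the one before it might look like this sketch. The cell ids, the choice of 8 as the Changi cell, and the toy activity data are all assumptions for illustration.

```python
CHANGI_CELL = 8  # hypothetical id for the cell enclosing the airport
ALL_CELLS = [7, 8, 9]

# Toy data: for each interval, the set of cells that were active in it.
active = {
    0: {7},
    1: {7, 8},
    2: set(),
    3: {9},
}


def lagged_examples(active, all_cells, target_cell):
    """Pair the 0/1 cell pattern at interval t-1 with the target cell's state at t."""
    examples = []
    for t in sorted(active):
        if t - 1 not in active:
            continue  # the first interval has no previous interval to learn from
        x = [1 if c in active[t - 1] else 0 for c in all_cells]
        y = 1 if target_cell in active[t] else 0
        examples.append((x, y))
    return examples


examples = lagged_examples(active, ALL_CELLS, CHANGI_CELL)
for x, y in examples:
    print(x, "->", y)
```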
The Logistic Function
Hold onto your hat! The next two tableaus introduce you to a powerful technique known as logistic regression. We want a function of the independent variables to estimate the probability that the dependent variable will take the value 1 (that is, the probability that lightning will strike).
One function could simply be the sum of the independent variables, that is, the number of cells which just experienced a lightning strike. The problem with that function is that it can easily exceed 1 and probabilities should always range between 0 and 1 (or between 0 and 100%).
That is where the logistic function (sometimes called the sigmoid function because of its S-shape) comes in: It can convert any number on the number line into a number between 0 and 1.
We illustrate in the chart below where the sum of active cells is converted to a probability through the logistic function. [We have performed a hidden optimization on where to center the logistic curve.]
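A minimal sketch of this conversion is shown here. The center and scale values are placeholders for illustration, not the optimized values used for the chart.

```python
import math


def logistic(z):
    """Map any real number to a value strictly between 0 and 1."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    # Equivalent form for negative z that avoids overflow in exp().
    e = math.exp(z)
    return e / (1.0 + e)


def strike_probability(active_count, center, scale=1.0):
    """Probability estimate from a count of active cells.

    center is the count at which the estimate equals 0.5 (assumed value).
    """
    return logistic(scale * (active_count - center))


for count in (0, 5, 10):
    print(count, round(strike_probability(count, center=5), 3))
```

Counts below the center map to probabilities under 0.5, counts above it map to probabilities over 0.5, and the output never leaves the range between 0 and 1.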
So we can take the number of active cells in one interval and turn that into an estimate of the probability lightning will strike Changi airport in the next interval. But, as you may have noticed, the correlation of this estimated probability with whether or not Changi actually experienced a strike was not particularly good.
Fortunately, there is one more step in the method which improves the result. Instead of just counting the active cells as input to the logistic function, we can compute a weighted sum of the active cells. That is, we can give more weight to some cells in the prediction than to others. Furthermore, there is an automated way in which we can compute these weights optimally. This method is called logistic regression.
As you move your mouse over the cells in the map at the bottom, you can see the optimized weight which the method assigned to each cell. Notice that the cells closest to Changi airport, including the airport cell itself, turn out to be the best predictors: They get the highest weight.
Use the slider bars to explore the relation between the estimated probability of a strike using optimized weighted active cells and whether or not lightning actually struck the airport.
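To show what computing the weights automatically can mean, here is a small sketch that fits them by plain gradient descent. This is one common way to fit a logistic regression, not necessarily the routine used for this exercise, and the two-cell toy data, learning rate, and number of passes are assumptions for illustration.

```python
import math


def logistic(z):
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def predict_probability(weights, bias, x):
    """Logistic function applied to the weighted sum of the inputs."""
    return logistic(bias + sum(w * v for w, v in zip(weights, x)))


def fit_logistic_regression(X, y, learning_rate=0.5, epochs=2000):
    """Fit weights and bias by gradient descent on the average log-loss."""
    n, n_features = len(X), len(X[0])
    weights, bias = [0.0] * n_features, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * n_features, 0.0
        for xi, yi in zip(X, y):
            err = predict_probability(weights, bias, xi) - yi
            for j, v in enumerate(xi):
                grad_w[j] += err * v
            grad_b += err
        weights = [w - learning_rate * g / n for w, g in zip(weights, grad_w)]
        bias -= learning_rate * grad_b / n
    return weights, bias


# Toy data: column 0 is a cell near the airport, column 1 a far-away cell.
# The label follows the near cell exactly; the far cell carries no information.
X = [[1, 0], [1, 1], [0, 1], [0, 0], [1, 0], [0, 1], [1, 1], [0, 0]]
y = [1, 1, 0, 0, 1, 0, 1, 0]

weights, bias = fit_logistic_regression(X, y)
print("weights:", [round(w, 2) for w in weights], "bias:", round(bias, 2))
```

On this toy data the near cell ends up with the larger weight, which mirrors what you see on the map. Real data is noisier, so the fitted weights would be less extreme.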
The Last Step
Now that we have an optimized probability estimator, the last step is to set a threshold value. If the probability estimate is at or above this threshold we will say that we predict a lightning strike at Changi airport in the next 15 minutes. If the estimate is below the threshold, we will predict no strike.
- If we predict a strike and it happens, we call that a true positive.
- If we predict no strike and no strike happens, we call that a true negative.
- On the other hand, if we predict a strike and it doesn't happen, we call that a false positive.
- Finally, if we predict no strike and a strike happens, we call that a false negative.
Clearly, there is a tradeoff to be made in setting the threshold value: set it too high and we miss predicting a life-threatening event. Set it too low and we waste people's time with false alarms.
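The four outcomes can be tallied with a few lines of code. The probabilities, outcomes, and thresholds below are toy values for illustration, not results from the lightning data.

```python
def confusion_counts(probabilities, actual, threshold):
    """Count true/false positives/negatives, predicting a strike when p >= threshold."""
    counts = {"TP": 0, "TN": 0, "FP": 0, "FN": 0}
    for p, y in zip(probabilities, actual):
        predicted_strike = p >= threshold
        if predicted_strike and y == 1:
            counts["TP"] += 1
        elif predicted_strike and y == 0:
            counts["FP"] += 1
        elif not predicted_strike and y == 1:
            counts["FN"] += 1
        else:
            counts["TN"] += 1
    return counts


# Toy estimates and what actually happened (1 = strike, 0 = no strike).
probs = [0.90, 0.60, 0.50, 0.30, 0.10, 0.05]
actual = [1, 0, 1, 1, 0, 0]

for threshold in (0.2, 0.48, 0.8):
    print(threshold, confusion_counts(probs, actual, threshold))
```

Raising the threshold removes false positives but adds false negatives, and lowering it does the reverse.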
What would you choose? Adjust the slider bar to explore the tradeoff. We compute the results of all predictions for the November data set.
For the threshold value, we chose 48% to be the probability at or above which we would predict a lightning strike at Changi airport in the next 15 minutes. It provided a high number of true positives and, we thought, an acceptable number of false positives in the November data.
Now comes the test: how well would this probability estimator and threshold value perform for a different month? For this we test it, without modification, by applying it to the data from December of the same year. You can see the results in the table below.
Our conclusion is that the accuracy of these predictions is not yet sufficient for implementation. On the other hand, you can see that this approach holds promise. With more data and more tools, such as those you learn in our Statistical and Machine Learning course, the day is approaching when you can save lives through lightning predictions.
Find more games and exercises at ESD Games.
Learn more about joining ESD starting at For Prospective Students.