In-class Exercise 5


Zhao Yuetong

Setting the scene

To build an explanatory model to discover the factors affecting water point status in Osun State, Nigeria.

Study area: Osun State, Nigeria

Data sets:

  • Osun.rds, which contains the LGA boundaries of Osun State. It is an sf polygon data frame; and

  • Osun_wp_sf.rds, which contains the water points within Osun State. It is an sf point data frame.

Model Variables

  • Dependent variable: Water point status (i.e. functional/non-functional)

  • Independent variables:

  • distance_to_primary_road

  • distance_to_secondary_road

  • distance_to_tertiary_road

  • distance_to_city

  • distance_to_town

  • water_point_population

  • local_population_1km

  • usage_capacity

  • is_urban

  • water_source_clean

Getting started

  • Create In-class Exercise 5 folder

  • Write a code chunk to load the following packages: sf, tidyverse, funModeling, blorr, corrplot, ggpubr, spdep, GWmodel, tmap, skimr, caret, NbClust and report

    pacman::p_load(spdep, tmap, sf, blorr, caret, corrplot, ggpubr,  NbClust, GWmodel, tidyverse, funModeling, skimr,report)

Data Import

In this in-class exercise, two data sets will be used. They are:

Osun <- read_rds("rds/Osun.rds")
Osun_wp_sf <- read_rds("rds/Osun_wp_sf.rds")
Osun_wp_sf %>%
  freq(input = 'status')
Warning: The `<scale>` argument of `guides()` cannot be `FALSE`. Use "none" instead as
of ggplot2 3.3.4.
ℹ The deprecated feature was likely used in the funModeling package.
  Please report the issue at <>.

  status frequency percentage cumulative_perc
1   TRUE      2642       55.5            55.5
2  FALSE      2118       44.5           100.0
tmap_mode("view")
tmap mode set to interactive viewing
tm_shape(Osun) +
  tm_polygons(alpha = 0.4) +
  tm_shape(Osun_wp_sf) +
  tm_dots(col = "status",
          alpha = 0.6) +
  tm_view(set.zoom.limits = c(9,12))

Exploratory Data Analysis

Summary Statistics with skimr

Osun_wp_sf %>%
  skim()
Warning: Couldn't find skimmers for class: sfc_POINT, sfc; No user-defined `sfl`
provided. Falling back to `character`.
Data summary
Name Piped data
Number of rows 4760
Number of columns 75
Column type frequency:
character 47
logical 5
numeric 23
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
source 0 1.00 5 44 0 2 0
report_date 0 1.00 22 22 0 42 0
status_id 0 1.00 2 7 0 3 0
water_source_clean 0 1.00 8 22 0 3 0
water_source_category 0 1.00 4 6 0 2 0
water_tech_clean 24 0.99 9 23 0 3 0
water_tech_category 24 0.99 9 15 0 2 0
facility_type 0 1.00 8 8 0 1 0
clean_country_name 0 1.00 7 7 0 1 0
clean_adm1 0 1.00 3 5 0 5 0
clean_adm2 0 1.00 3 14 0 35 0
clean_adm3 4760 0.00 NA NA 0 0 0
clean_adm4 4760 0.00 NA NA 0 0 0
installer 4760 0.00 NA NA 0 0 0
management_clean 1573 0.67 5 37 0 7 0
status_clean 0 1.00 9 32 0 7 0
pay 0 1.00 2 39 0 7 0
fecal_coliform_presence 4760 0.00 NA NA 0 0 0
subjective_quality 0 1.00 18 20 0 4 0
activity_id 4757 0.00 36 36 0 3 0
scheme_id 4760 0.00 NA NA 0 0 0
wpdx_id 0 1.00 12 12 0 4760 0
notes 0 1.00 2 96 0 3502 0
orig_lnk 4757 0.00 84 84 0 1 0
photo_lnk 41 0.99 84 84 0 4719 0
country_id 0 1.00 2 2 0 1 0
data_lnk 0 1.00 79 96 0 2 0
water_point_history 0 1.00 142 834 0 4750 0
clean_country_id 0 1.00 3 3 0 1 0
country_name 0 1.00 7 7 0 1 0
water_source 0 1.00 8 30 0 4 0
water_tech 0 1.00 5 37 0 20 0
adm2 0 1.00 3 14 0 33 0
adm3 4760 0.00 NA NA 0 0 0
management 1573 0.67 5 47 0 7 0
adm1 0 1.00 4 5 0 4 0
New Georeferenced Column 0 1.00 16 35 0 4760 0
lat_lon_deg 0 1.00 13 32 0 4760 0
public_data_source 0 1.00 84 102 0 2 0
converted 0 1.00 53 53 0 1 0
created_timestamp 0 1.00 22 22 0 2 0
updated_timestamp 0 1.00 22 22 0 2 0
Geometry 0 1.00 33 37 0 4760 0
ADM2_EN 0 1.00 3 14 0 30 0
ADM2_PCODE 0 1.00 8 8 0 30 0
ADM1_EN 0 1.00 4 4 0 1 0
ADM1_PCODE 0 1.00 5 5 0 1 0

Variable type: logical

skim_variable n_missing complete_rate mean count
rehab_year 4760 0 NaN :
rehabilitator 4760 0 NaN :
is_urban 0 1 0.39 FAL: 2884, TRU: 1876
latest_record 0 1 1.00 TRU: 4760
status 0 1 0.56 TRU: 2642, FAL: 2118

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
row_id 0 1.00 68550.48 10216.94 49601.00 66874.75 68244.50 69562.25 471319.00 ▇▁▁▁▁
lat_deg 0 1.00 7.68 0.22 7.06 7.51 7.71 7.88 8.06 ▁▂▇▇▇
lon_deg 0 1.00 4.54 0.21 4.08 4.36 4.56 4.71 5.06 ▃▆▇▇▂
install_year 1144 0.76 2008.63 6.04 1917.00 2006.00 2010.00 2013.00 2015.00 ▁▁▁▁▇
fecal_coliform_value 4760 0.00 NaN NA NA NA NA NA NA
distance_to_primary_road 0 1.00 5021.53 5648.34 0.01 719.36 2972.78 7314.73 26909.86 ▇▂▁▁▁
distance_to_secondary_road 0 1.00 3750.47 3938.63 0.15 460.90 2554.25 5791.94 19559.48 ▇▃▁▁▁
distance_to_tertiary_road 0 1.00 1259.28 1680.04 0.02 121.25 521.77 1834.42 10966.27 ▇▂▁▁▁
distance_to_city 0 1.00 16663.99 10960.82 53.05 7930.75 15030.41 24255.75 47934.34 ▇▇▆▃▁
distance_to_town 0 1.00 16726.59 12452.65 30.00 6876.92 12204.53 27739.46 44020.64 ▇▅▃▃▂
rehab_priority 2654 0.44 489.33 1658.81 0.00 7.00 91.50 376.25 29697.00 ▇▁▁▁▁
water_point_population 4 1.00 513.58 1458.92 0.00 14.00 119.00 433.25 29697.00 ▇▁▁▁▁
local_population_1km 4 1.00 2727.16 4189.46 0.00 176.00 1032.00 3717.00 36118.00 ▇▁▁▁▁
crucialness_score 798 0.83 0.26 0.28 0.00 0.07 0.15 0.35 1.00 ▇▃▁▁▁
pressure_score 798 0.83 1.46 4.16 0.00 0.12 0.41 1.24 93.69 ▇▁▁▁▁
usage_capacity 0 1.00 560.74 338.46 300.00 300.00 300.00 1000.00 1000.00 ▇▁▁▁▅
days_since_report 0 1.00 2692.69 41.92 1483.00 2688.00 2693.00 2700.00 4645.00 ▁▇▁▁▁
staleness_score 0 1.00 42.80 0.58 23.13 42.70 42.79 42.86 62.66 ▁▁▇▁▁
location_id 0 1.00 235865.49 6657.60 23741.00 230638.75 236199.50 240061.25 267454.00 ▁▁▁▁▇
cluster_size 0 1.00 1.05 0.25 1.00 1.00 1.00 1.00 4.00 ▇▁▁▁▁
lat_deg_original 4760 0.00 NaN NA NA NA NA NA NA
lon_deg_original 4760 0.00 NaN NA NA NA NA NA NA
count 0 1.00 1.00 0.00 1.00 1.00 1.00 1.00 1.00 ▁▁▇▁▁
Osun_wp_sf_clean <- Osun_wp_sf %>%
  mutate(usage_capacity = as.factor(usage_capacity))
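
The skim summary above reports 4 missing values in each of water_point_population and local_population_1km, and the models that follow are fitted on 4756 of the 4760 records. A minimal base R sketch of how rows with missing model variables can be dropped before modelling is shown here. The data frame toy_wp and its values are made up for illustration and are not the Osun data.

```r
# Hypothetical toy data standing in for the water point attributes
toy_wp <- data.frame(
  status = c(TRUE, FALSE, TRUE, FALSE, TRUE),
  water_point_population = c(120, NA, 40, 900, 15),
  local_population_1km = c(1500, 300, NA, 4200, 80)
)

# Keep only the rows with no missing value in the model variables
model_vars <- c("status", "water_point_population", "local_population_1km")
toy_wp_clean <- toy_wp[complete.cases(toy_wp[, model_vars]), ]

nrow(toy_wp)        # number of rows before filtering
nrow(toy_wp_clean)  # number of rows after filtering
```

Dropping incomplete rows up front keeps the cleaned data frame and any model output derived from it the same length, which matters when the two are later bound together column-wise.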

Correlation Analysis

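Osun_wp <- Osun_wp_sf_clean %>%
  select(c(7,35:39,42:43,46:47,57)) %>%
  st_set_geometry(NULL)
# Assumption: the correlation matrix is computed on the numeric columns only
cluster_vars.cor = cor(Osun_wp %>% select(where(is.numeric)),
                       use = "complete.obs")
corrplot.mixed(cluster_vars.cor,
               lower = "ellipse",
               upper = "number",
               tl.pos = "lt",
               diag = "l",
               tl.col = "black")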

Building a Logistic Regression Model

model <- glm(status ~ distance_to_primary_road +
               distance_to_tertiary_road +
               distance_to_city +
               distance_to_town +
               is_urban +
               usage_capacity +
               water_source_clean +
               water_point_population +
               local_population_1km,
             data = Osun_wp_sf_clean,
             family = binomial(link ='logit'))

Instead of using the typical R report, blr_regress() of the blorr package is used.

blr_regress(model)

                             Model Overview                              
Data Set    Resp Var    Obs.    Df. Model    Df. Residual    Convergence 
  data       status     4756      4755           4745           TRUE     

                    Response Summary                     
Outcome        Frequency        Outcome        Frequency 
   0             2114              1             2642    

                                 Maximum Likelihood Estimates                                   
               Parameter                    DF    Estimate    Std. Error    z value     Pr(>|z|) 
              (Intercept)                   1      0.3731        0.1088      3.4303       6e-04 
        distance_to_primary_road            1      0.0000        0.0000     -0.7328      0.4637 
       distance_to_tertiary_road            1      1e-04         0.0000      4.7127      0.0000 
            distance_to_city                1      0.0000        0.0000     -4.7653      0.0000 
            distance_to_town                1      0.0000        0.0000     -5.4356      0.0000 
              is_urbanTRUE                  1     -0.2843        0.0785     -3.6210       3e-04 
           usage_capacity1000               1     -0.6222        0.0697     -8.9270      0.0000 
water_source_cleanProtected Shallow Well    1      0.4986        0.0851      5.8559      0.0000 
   water_source_cleanProtected Spring       1      1.2944        0.4387      2.9505      0.0032 
         water_point_population             1      -5e-04        0.0000    -11.4007      0.0000 
          local_population_1km              1      3e-04         0.0000     19.4015      0.0000 

 Association of Predicted Probabilities and Observed Responses  
% Concordant          0.7345          Somers' D        0.4691   
% Discordant          0.2655          Gamma            0.4691   
% Tied                0.0000          Tau-a            0.2317   
Pairs                5585188          c                0.7345   
report(model)
We fitted a logistic model (estimated using ML) to predict status with
distance_to_primary_road (formula: status ~ distance_to_primary_road +
distance_to_tertiary_road + distance_to_city + distance_to_town + is_urban +
usage_capacity + water_source_clean + water_point_population +
local_population_1km). The model's explanatory power is moderate (Tjur's R2 =
0.16). The model's intercept, corresponding to distance_to_primary_road = 0, is
at 0.37 (95% CI [0.16, 0.59], p < .001). Within this model:

  - The effect of distance to primary road is statistically non-significant and
negative (beta = -4.75e-06, 95% CI [-1.75e-05, 7.95e-06], p = 0.464; Std. beta
= -0.03, 95% CI [-0.10, 0.04])
  - The effect of distance to tertiary road is statistically significant and
positive (beta = 9.75e-05, 95% CI [5.71e-05, 1.38e-04], p < .001; Std. beta =
0.16, 95% CI [0.10, 0.23])
  - The effect of distance to city is statistically significant and negative
(beta = -1.69e-05, 95% CI [-2.38e-05, -9.95e-06], p < .001; Std. beta = -0.19,
95% CI [-0.26, -0.11])
  - The effect of distance to town is statistically significant and negative
(beta = -1.54e-05, 95% CI [-2.09e-05, -9.83e-06], p < .001; Std. beta = -0.19,
95% CI [-0.26, -0.12])
  - The effect of is urbanTRUE is statistically significant and negative (beta =
-0.28, 95% CI [-0.44, -0.13], p < .001; Std. beta = -0.28, 95% CI [-0.44,
-0.13])
  - The effect of usage capacity [1000] is statistically significant and negative
(beta = -0.62, 95% CI [-0.76, -0.49], p < .001; Std. beta = -0.62, 95% CI
[-0.76, -0.49])
  - The effect of water source clean [Protected Shallow Well] is statistically
significant and positive (beta = 0.50, 95% CI [0.33, 0.67], p < .001; Std. beta
= 0.50, 95% CI [0.33, 0.67])
  - The effect of water source clean [Protected Spring] is statistically
significant and positive (beta = 1.29, 95% CI [0.49, 2.23], p = 0.003; Std.
beta = 1.29, 95% CI [0.49, 2.23])
  - The effect of water point population is statistically significant and
negative (beta = -5.11e-04, 95% CI [-6.02e-04, -4.27e-04], p < .001; Std. beta
= -0.75, 95% CI [-0.88, -0.62])
  - The effect of local population 1km is statistically significant and positive
(beta = 3.46e-04, 95% CI [3.12e-04, 3.82e-04], p < .001; Std. beta = 1.45, 95%
CI [1.31, 1.60])
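
The coefficients reported above are on the log-odds scale, which is hard to read directly. A minimal sketch of converting a few of them to odds ratios by exponentiation follows. The estimates are copied from the Maximum Likelihood Estimates table, and the vector names log_odds and odds_ratio are introduced here for illustration only.

```r
# Estimates copied from the Maximum Likelihood Estimates table above
log_odds <- c(
  is_urbanTRUE = -0.2843,
  usage_capacity1000 = -0.6222,
  water_source_cleanProtected_Shallow_Well = 0.4986,
  water_source_cleanProtected_Spring = 1.2944
)

# Exponentiating a logit coefficient gives an odds ratio
odds_ratio <- exp(log_odds)
round(odds_ratio, 3)
```

An odds ratio below 1 indicates lower odds of status being TRUE relative to the reference level, holding the other predictors constant, and a value above 1 indicates higher odds.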

blr_confusion_matrix(model, cutoff = 0.5)
Confusion Matrix and Statistics 

Prediction FALSE TRUE
         0  1292  740
         1   822 1902

                Accuracy : 0.6716 
     No Information Rate : 0.4445 

                   Kappa : 0.3324 

McNemars's Test P-Value  : 0.0404 

             Sensitivity : 0.7199 
             Specificity : 0.6112 
          Pos Pred Value : 0.6982 
          Neg Pred Value : 0.6358 
              Prevalence : 0.5555 
          Detection Rate : 0.3999 
    Detection Prevalence : 0.5728 
       Balanced Accuracy : 0.6655 
               Precision : 0.6982 
                  Recall : 0.7199 

        'Positive' Class : 1
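
The headline statistics in this output can be checked by hand from the four cell counts. The sketch below recomputes accuracy, sensitivity and specificity in base R. The counts are copied from the matrix above, and the variable names tn, fn, fp and tp are introduced here for illustration.

```r
# Counts copied from the blr_confusion_matrix() output above
# (rows: predicted 0/1, columns: observed FALSE/TRUE)
tn <- 1292; fn <- 740
fp <- 822;  tp <- 1902

accuracy    <- (tp + tn) / (tp + tn + fp + fn)
sensitivity <- tp / (tp + fn)   # share of observed TRUE predicted as 1
specificity <- tn / (tn + fp)   # share of observed FALSE predicted as 0

round(c(accuracy = accuracy,
        sensitivity = sensitivity,
        specificity = specificity), 4)
```

The same arithmetic applies to any two-class confusion matrix, provided the positive class is identified first.
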
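Osun_wp_sp <- Osun_wp_sf_clean %>%
  select(c(status, distance_to_primary_road, distance_to_secondary_road, distance_to_tertiary_road, distance_to_city, distance_to_town,
           water_point_population, local_population_1km, is_urban, usage_capacity, water_source_clean)) %>%
  as_Spatial()
Osun_wp_sp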
class       : SpatialPointsDataFrame 
features    : 4756 
extent      : 182502.4, 290751, 340054.1, 450905.3  (xmin, xmax, ymin, ymax)
crs         : +proj=tmerc +lat_0=4 +lon_0=8.5 +k=0.99975 +x_0=670553.98 +y_0=0 +a=6378249.145 +rf=293.465 +towgs84=-92,-93,122,0,0,0,0 +units=m +no_defs 
variables   : 11
names       : status, distance_to_primary_road, distance_to_secondary_road, distance_to_tertiary_road, distance_to_city, distance_to_town, water_point_population, local_population_1km, is_urban, usage_capacity, water_source_clean 
min values  :      0,        0.014461356813335,          0.152195902540837,         0.017815121653488, 53.0461399623541, 30.0019777713073,                      0,                    0,        0,           1000,           Borehole 
max values  :      1,         26909.8616132094,           19559.4793799085,          10966.2705628969,  47934.343603562, 44020.6393368124,                  29697,                36118,        1,            300,   Protected Spring 
bw.fixed <- bw.ggwr(status ~ distance_to_primary_road +
                     distance_to_secondary_road +
                     distance_to_tertiary_road +
                     distance_to_city +
                     distance_to_town +
                     is_urban +
                     usage_capacity +
                     water_source_clean +
                     water_point_population +
                     local_population_1km,
                   data = Osun_wp_sp,
                   family = "binomial",
                   approach = "AIC",
                   kernel = "gaussian",
                   adaptive = FALSE,
                   longlat = FALSE)
Take a cup of tea and have a break, it will take a few minutes.
          -----A kind suggestion from GWmodel development group
 Iteration    Log-Likelihood:(With bandwidth:  95768.67 )
       0        -2889 
       1        -2836 
       2        -2830 
       3        -2829 
       4        -2829 
       5        -2829 
Fixed bandwidth: 95768.67 AICc value: 5684.357 
 Iteration    Log-Likelihood:(With bandwidth:  59200.13 )
       0        -2875 
       1        -2818 
       2        -2810 
       3        -2808 
       4        -2808 
       5        -2808 
Fixed bandwidth: 59200.13 AICc value: 5646.785 
 Iteration    Log-Likelihood:(With bandwidth:  36599.53 )
       0        -2847 
       1        -2781 
       2        -2768 
       3        -2765 
       4        -2765 
       5        -2765 
       6        -2765 
Fixed bandwidth: 36599.53 AICc value: 5575.148 
 Iteration    Log-Likelihood:(With bandwidth:  22631.59 )
       0        -2798 
       1        -2719 
       2        -2698 
       3        -2693 
       4        -2693 
       5        -2693 
       6        -2693 
Fixed bandwidth: 22631.59 AICc value: 5466.883 
 Iteration    Log-Likelihood:(With bandwidth:  13998.93 )
       0        -2720 
       1        -2622 
       2        -2590 
       3        -2581 
       4        -2580 
       5        -2580 
       6        -2580 
       7        -2580 
Fixed bandwidth: 13998.93 AICc value: 5324.578 
 Iteration    Log-Likelihood:(With bandwidth:  8663.649 )
       0        -2601 
       1        -2476 
       2        -2431 
       3        -2419 
       4        -2417 
       5        -2417 
       6        -2417 
       7        -2417 
Fixed bandwidth: 8663.649 AICc value: 5163.61 
 Iteration    Log-Likelihood:(With bandwidth:  5366.266 )
       0        -2436 
       1        -2268 
       2        -2194 
       3        -2167 
       4        -2161 
       5        -2161 
       6        -2161 
       7        -2161 
       8        -2161 
       9        -2161 
Fixed bandwidth: 5366.266 AICc value: 4990.587 
 Iteration    Log-Likelihood:(With bandwidth:  3328.371 )
       0        -2157 
       1        -1922 
       2        -1802 
       3        -1739 
       4        -1713 
       5        -1713 
Fixed bandwidth: 3328.371 AICc value: 4798.288 
 Iteration    Log-Likelihood:(With bandwidth:  2068.882 )
       0        -1751 
       1        -1421 
       2        -1238 
       3        -1133 
       4        -1084 
       5        -1084 
Fixed bandwidth: 2068.882 AICc value: 4837.017 
 Iteration    Log-Likelihood:(With bandwidth:  4106.777 )
       0        -2297 
       1        -2095 
       2        -1997 
       3        -1951 
       4        -1938 
       5        -1936 
       6        -1936 
       7        -1936 
       8        -1936 
Fixed bandwidth: 4106.777 AICc value: 4873.161 
 Iteration    Log-Likelihood:(With bandwidth:  2847.289 )
       0        -2036 
       1        -1771 
       2        -1633 
       3        -1558 
       4        -1525 
       5        -1525 
Fixed bandwidth: 2847.289 AICc value: 4768.192 
 Iteration    Log-Likelihood:(With bandwidth:  2549.964 )
       0        -1941 
       1        -1655 
       2        -1503 
       3        -1417 
       4        -1378 
       5        -1378 
Fixed bandwidth: 2549.964 AICc value: 4762.212 
 Iteration    Log-Likelihood:(With bandwidth:  2366.207 )
       0        -1874 
       1        -1573 
       2        -1410 
       3        -1316 
       4        -1274 
       5        -1274 
Fixed bandwidth: 2366.207 AICc value: 4773.081 
 Iteration    Log-Likelihood:(With bandwidth:  2663.532 )
       0        -1979 
       1        -1702 
       2        -1555 
       3        -1474 
       4        -1438 
       5        -1438 
Fixed bandwidth: 2663.532 AICc value: 4762.568 
 Iteration    Log-Likelihood:(With bandwidth:  2479.775 )
       0        -1917 
       1        -1625 
       2        -1468 
       3        -1380 
       4        -1339 
       5        -1339 
Fixed bandwidth: 2479.775 AICc value: 4764.294 
 Iteration    Log-Likelihood:(With bandwidth:  2593.343 )
       0        -1956 
       1        -1674 
       2        -1523 
       3        -1439 
       4        -1401 
       5        -1401 
Fixed bandwidth: 2593.343 AICc value: 4761.813 
 Iteration    Log-Likelihood:(With bandwidth:  2620.153 )
       0        -1965 
       1        -1685 
       2        -1536 
       3        -1453 
       4        -1415 
       5        -1415 
Fixed bandwidth: 2620.153 AICc value: 4761.89 
 Iteration    Log-Likelihood:(With bandwidth:  2576.774 )
       0        -1950 
       1        -1667 
       2        -1515 
       3        -1431 
       4        -1393 
       5        -1393 
Fixed bandwidth: 2576.774 AICc value: 4761.889 
 Iteration    Log-Likelihood:(With bandwidth:  2603.584 )
       0        -1960 
       1        -1678 
       2        -1528 
       3        -1445 
       4        -1407 
       5        -1407 
Fixed bandwidth: 2603.584 AICc value: 4761.813 
 Iteration    Log-Likelihood:(With bandwidth:  2609.913 )
       0        -1962 
       1        -1680 
       2        -1531 
       3        -1448 
       4        -1410 
       5        -1410 
Fixed bandwidth: 2609.913 AICc value: 4761.831 
 Iteration    Log-Likelihood:(With bandwidth:  2599.672 )
       0        -1958 
       1        -1676 
       2        -1526 
       3        -1443 
       4        -1405 
       5        -1405 
Fixed bandwidth: 2599.672 AICc value: 4761.809 
 Iteration    Log-Likelihood:(With bandwidth:  2597.255 )
       0        -1957 
       1        -1675 
       2        -1525 
       3        -1441 
       4        -1403 
       5        -1403 
Fixed bandwidth: 2597.255 AICc value: 4761.809 
[1] 2599.672
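
The selected bandwidth is expressed in the units of the projected coordinate system, which is metres here. To give a feel for what a fixed bandwidth of about 2.6 km implies, the sketch below computes Gaussian kernel weights at several distances, assuming the common form w = exp(-0.5 * (d / bw)^2). The helper gaussian_weight() is hypothetical and written for illustration; it is not a GWmodel function.

```r
# Hypothetical helper: Gaussian distance-decay weight for a fixed bandwidth
gaussian_weight <- function(d, bw) exp(-0.5 * (d / bw)^2)

bw <- 2599.672                      # bandwidth reported above, in metres
d  <- c(0, 1000, bw, 5000, 10000)   # example distances, in metres

# Weight given to a neighbour at each distance from the regression point
round(gaussian_weight(d, bw), 3)
```

Under this form, a water point at the regression location receives full weight, and the weight decays smoothly with distance rather than dropping to zero at the bandwidth.
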
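gwlr.fixed <- ggwr.basic(status ~ distance_to_primary_road +
                      distance_to_secondary_road + distance_to_tertiary_road +
                      distance_to_city + distance_to_town + is_urban +
                      usage_capacity + water_source_clean +
                      water_point_population + local_population_1km,
                    data = Osun_wp_sp,
                    bw = bw.fixed,
                    family = "binomial",
                    kernel = "gaussian",
                    adaptive = FALSE,
                    longlat = FALSE)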
 Iteration    Log-Likelihood
       0        -1958 
       1        -1676 
       2        -1526 
       3        -1443 
       4        -1405 
       5        -1405 
gwr.fixed <- as.data.frame(gwlr.fixed$SDF)
gwr.fixed <- gwr.fixed %>%
  mutate(most = ifelse(
    gwr.fixed$yhat >= 0.5, T,F))
gwr.fixed$y <- as.factor(gwr.fixed$y)
gwr.fixed$most <- as.factor(gwr.fixed$most)
CM <- confusionMatrix(data=gwr.fixed$most, reference = gwr.fixed$y)
Confusion Matrix and Statistics

Prediction FALSE TRUE
     FALSE  1824  263
     TRUE    290 2379
               Accuracy : 0.8837          
                 95% CI : (0.8743, 0.8927)
    No Information Rate : 0.5555          
    P-Value [Acc > NIR] : <2e-16          
                  Kappa : 0.7642          
 Mcnemar's Test P-Value : 0.2689          
            Sensitivity : 0.8628          
            Specificity : 0.9005          
         Pos Pred Value : 0.8740          
         Neg Pred Value : 0.8913          
             Prevalence : 0.4445          
         Detection Rate : 0.3835          
   Detection Prevalence : 0.4388          
      Balanced Accuracy : 0.8816          
       'Positive' Class : FALSE           
Osun_wp_sf_selected <- Osun_wp_sf_clean %>%
  select(c(ADM2_EN, ADM2_PCODE,
           ADM1_EN, ADM1_PCODE))
gwr_sf.fixed <- cbind(Osun_wp_sf_selected, gwr.fixed)
tmap_mode("view")
tmap mode set to interactive viewing
prob_T <- tm_shape(Osun) +
  tm_polygons(alpha = 0.1) +
  tm_shape(gwr_sf.fixed) +
  tm_dots(col = "yhat",
          border.col = "gray60",
          border.lwd =1)+
  tm_view(set.zoom.limits = c(8,14))

Removing Non-significant Variables and Rebuilding the Logistic Regression Model

model2 <- glm(status ~ distance_to_tertiary_road +
               distance_to_city +
               distance_to_town +
               is_urban +
               usage_capacity +
               water_source_clean +
               water_point_population +
               local_population_1km,
             data = Osun_wp_sf_clean,
             family = binomial(link = 'logit'))
blr_regress(model2)
                             Model Overview                              
Data Set    Resp Var    Obs.    Df. Model    Df. Residual    Convergence 
  data       status     4756      4755           4746           TRUE     

                    Response Summary                     
Outcome        Frequency        Outcome        Frequency 
   0             2114              1             2642    

                                 Maximum Likelihood Estimates                                   
               Parameter                    DF    Estimate    Std. Error    z value     Pr(>|z|) 
              (Intercept)                   1      0.3540        0.1055      3.3541       8e-04 
       distance_to_tertiary_road            1      1e-04         0.0000      4.9096      0.0000 
            distance_to_city                1      0.0000        0.0000     -5.2022      0.0000 
            distance_to_town                1      0.0000        0.0000     -5.4660      0.0000 
              is_urbanTRUE                  1     -0.2667        0.0747     -3.5690       4e-04 
           usage_capacity1000               1     -0.6206        0.0697     -8.9081      0.0000 
water_source_cleanProtected Shallow Well    1      0.4947        0.0850      5.8228      0.0000 
   water_source_cleanProtected Spring       1      1.2790        0.4384      2.9174      0.0035 
         water_point_population             1      -5e-04        0.0000    -11.3902      0.0000 
          local_population_1km              1      3e-04         0.0000     19.4069      0.0000 

 Association of Predicted Probabilities and Observed Responses  
% Concordant          0.7349          Somers' D        0.4697   
% Discordant          0.2651          Gamma            0.4697   
% Tied                0.0000          Tau-a            0.2320   
Pairs                5585188          c                0.7349   
blr_confusion_matrix(model2, cutoff = 0.5)
Confusion Matrix and Statistics 

Prediction FALSE TRUE
         0  1300  743
         1   814 1899

                Accuracy : 0.6726 
     No Information Rate : 0.4445 

                   Kappa : 0.3348 

McNemars's Test P-Value  : 0.0761 

             Sensitivity : 0.7188 
             Specificity : 0.6149 
          Pos Pred Value : 0.7000 
          Neg Pred Value : 0.6363 
              Prevalence : 0.5555 
          Detection Rate : 0.3993 
    Detection Prevalence : 0.5704 
       Balanced Accuracy : 0.6669 
               Precision : 0.7000 
                  Recall : 0.7188 

        'Positive' Class : 1
bw.fixed2 <- bw.ggwr(status ~ distance_to_tertiary_road +
                     distance_to_city +
                     distance_to_town +
                     is_urban +
                     usage_capacity +
                     water_source_clean +
                     water_point_population +
                     local_population_1km,
                   data = Osun_wp_sp,
                   family = "binomial",
                   approach = "AIC",
                   kernel = "gaussian",
                   adaptive = FALSE,
                   longlat = FALSE)
Take a cup of tea and have a break, it will take a few minutes.
          -----A kind suggestion from GWmodel development group
 Iteration    Log-Likelihood:(With bandwidth:  95768.67 )
       0        -2890 
       1        -2837 
       2        -2830 
       3        -2829 
       4        -2829 
       5        -2829 
Fixed bandwidth: 95768.67 AICc value: 5681.18 
 Iteration    Log-Likelihood:(With bandwidth:  59200.13 )
       0        -2878 
       1        -2820 
       2        -2812 
       3        -2810 
       4        -2810 
       5        -2810 
Fixed bandwidth: 59200.13 AICc value: 5645.901 
 Iteration    Log-Likelihood:(With bandwidth:  36599.53 )
       0        -2854 
       1        -2790 
       2        -2777 
       3        -2774 
       4        -2774 
       5        -2774 
       6        -2774 
Fixed bandwidth: 36599.53 AICc value: 5585.354 
 Iteration    Log-Likelihood:(With bandwidth:  22631.59 )
       0        -2810 
       1        -2732 
       2        -2711 
       3        -2707 
       4        -2707 
       5        -2707 
       6        -2707 
Fixed bandwidth: 22631.59 AICc value: 5481.877 
 Iteration    Log-Likelihood:(With bandwidth:  13998.93 )
       0        -2732 
       1        -2635 
       2        -2604 
       3        -2597 
       4        -2596 
       5        -2596 
       6        -2596 
Fixed bandwidth: 13998.93 AICc value: 5333.718 
 Iteration    Log-Likelihood:(With bandwidth:  8663.649 )
       0        -2624 
       1        -2502 
       2        -2459 
       3        -2447 
       4        -2446 
       5        -2446 
       6        -2446 
       7        -2446 
Fixed bandwidth: 8663.649 AICc value: 5178.493 
 Iteration    Log-Likelihood:(With bandwidth:  5366.266 )
       0        -2478 
       1        -2319 
       2        -2250 
       3        -2225 
       4        -2219 
       5        -2219 
       6        -2220 
       7        -2220 
       8        -2220 
       9        -2220 
Fixed bandwidth: 5366.266 AICc value: 5022.016 
 Iteration    Log-Likelihood:(With bandwidth:  3328.371 )
       0        -2222 
       1        -2002 
       2        -1894 
       3        -1838 
       4        -1818 
       5        -1814 
       6        -1814 
Fixed bandwidth: 3328.371 AICc value: 4827.587 
 Iteration    Log-Likelihood:(With bandwidth:  2068.882 )
       0        -1837 
       1        -1528 
       2        -1357 
       3        -1261 
       4        -1222 
       5        -1222 
Fixed bandwidth: 2068.882 AICc value: 4772.046 
 Iteration    Log-Likelihood:(With bandwidth:  1290.476 )
       0        -1403 
       1        -1016 
       2       -807.3 
       3       -680.2 
       4       -680.2 
Fixed bandwidth: 1290.476 AICc value: 5809.719 
 Iteration    Log-Likelihood:(With bandwidth:  2549.964 )
       0        -2019 
       1        -1753 
       2        -1614 
       3        -1538 
       4        -1506 
       5        -1506 
Fixed bandwidth: 2549.964 AICc value: 4764.056 
 Iteration    Log-Likelihood:(With bandwidth:  2847.289 )
       0        -2108 
       1        -1862 
       2        -1736 
       3        -1670 
       4        -1644 
       5        -1644 
Fixed bandwidth: 2847.289 AICc value: 4791.834 
 Iteration    Log-Likelihood:(With bandwidth:  2366.207 )
       0        -1955 
       1        -1675 
       2        -1525 
       3        -1441 
       4        -1407 
       5        -1407 
Fixed bandwidth: 2366.207 AICc value: 4755.524 
 Iteration    Log-Likelihood:(With bandwidth:  2252.639 )
       0        -1913 
       1        -1623 
       2        -1465 
       3        -1376 
       4        -1341 
       5        -1341 
Fixed bandwidth: 2252.639 AICc value: 4759.188 
 Iteration    Log-Likelihood:(With bandwidth:  2436.396 )
       0        -1980 
       1        -1706 
       2        -1560 
       3        -1479 
       4        -1446 
       5        -1446 
Fixed bandwidth: 2436.396 AICc value: 4756.675 
 Iteration    Log-Likelihood:(With bandwidth:  2322.828 )
       0        -1940 
       1        -1656 
       2        -1503 
       3        -1417 
       4        -1382 
       5        -1382 
Fixed bandwidth: 2322.828 AICc value: 4756.471 
 Iteration    Log-Likelihood:(With bandwidth:  2393.017 )
       0        -1965 
       1        -1687 
       2        -1539 
       3        -1456 
       4        -1422 
       5        -1422 
Fixed bandwidth: 2393.017 AICc value: 4755.57 
 Iteration    Log-Likelihood:(With bandwidth:  2349.638 )
       0        -1949 
       1        -1668 
       2        -1517 
       3        -1432 
       4        -1398 
       5        -1398 
Fixed bandwidth: 2349.638 AICc value: 4755.753 
 Iteration    Log-Likelihood:(With bandwidth:  2376.448 )
       0        -1959 
       1        -1680 
       2        -1530 
       3        -1447 
       4        -1413 
       5        -1413 
Fixed bandwidth: 2376.448 AICc value: 4755.48 
 Iteration    Log-Likelihood:(With bandwidth:  2382.777 )
       0        -1961 
       1        -1683 
       2        -1534 
       3        -1450 
       4        -1416 
       5        -1416 
Fixed bandwidth: 2382.777 AICc value: 4755.491 
 Iteration    Log-Likelihood:(With bandwidth:  2372.536 )
       0        -1958 
       1        -1678 
       2        -1528 
       3        -1445 
       4        -1411 
       5        -1411 
Fixed bandwidth: 2372.536 AICc value: 4755.488 
 Iteration    Log-Likelihood:(With bandwidth:  2378.865 )
       0        -1960 
       1        -1681 
       2        -1532 
       3        -1448 
       4        -1414 
       5        -1414 
Fixed bandwidth: 2378.865 AICc value: 4755.481 
 Iteration    Log-Likelihood:(With bandwidth:  2374.954 )
       0        -1959 
       1        -1679 
       2        -1530 
       3        -1446 
       4        -1412 
       5        -1412 
Fixed bandwidth: 2374.954 AICc value: 4755.482 
 Iteration    Log-Likelihood:(With bandwidth:  2377.371 )
       0        -1959 
       1        -1680 
       2        -1531 
       3        -1447 
       4        -1413 
       5        -1413 
Fixed bandwidth: 2377.371 AICc value: 4755.48 
 Iteration    Log-Likelihood:(With bandwidth:  2377.942 )
       0        -1960 
       1        -1680 
       2        -1531 
       3        -1448 
       4        -1414 
       5        -1414 
Fixed bandwidth: 2377.942 AICc value: 4755.48 
 Iteration    Log-Likelihood:(With bandwidth:  2377.018 )
       0        -1959 
       1        -1680 
       2        -1531 
       3        -1447 
       4        -1413 
       5        -1413 
Fixed bandwidth: 2377.018 AICc value: 4755.48 
gwlr.fixed2 <- ggwr.basic(status ~distance_to_tertiary_road +
                      distance_to_city +
                      distance_to_town +
                      water_point_population +
                      local_population_1km +
                      is_urban +
                      usage_capacity +
                      water_source_clean,
                    data = Osun_wp_sp,
                    bw = bw.fixed2,
                    family = "binomial",
                    kernel = "gaussian",
                    adaptive = FALSE,
                    longlat = FALSE)
 Iteration    Log-Likelihood
       0        -1959 
       1        -1680 
       2        -1531 
       3        -1447 
       4        -1413 
       5        -1413 
   *                       Package   GWmodel                             *
   Program starts at: 2022-12-18 02:52:45 
   ggwr.basic(formula = status ~ distance_to_tertiary_road + distance_to_city + 
    distance_to_town + water_point_population + local_population_1km + 
    is_urban + usage_capacity + water_source_clean, data = Osun_wp_sp, 
    bw = bw.fixed2, family = "binomial", kernel = "gaussian", 
    adaptive = FALSE, longlat = FALSE)

   Dependent (y) variable:  status
   Independent variables:  distance_to_tertiary_road distance_to_city distance_to_town water_point_population local_population_1km is_urban usage_capacity water_source_clean
   Number of data points: 4756
   Used family: binomial
   *              Results of Generalized linear Regression               *


Deviance Residuals: 
     Min        1Q    Median        3Q       Max  
-129.368    -1.750     1.074     1.742    34.126  

                                           Estimate Std. Error z value Pr(>|z|)    
Intercept                                 3.540e-01  1.055e-01   3.354 0.000796 ***
distance_to_tertiary_road                 1.001e-04  2.040e-05   4.910 9.13e-07 ***
distance_to_city                         -1.764e-05  3.391e-06  -5.202 1.97e-07 ***
distance_to_town                         -1.544e-05  2.825e-06  -5.466 4.60e-08 ***
water_point_population                   -5.098e-04  4.476e-05 -11.390  < 2e-16 ***
local_population_1km                      3.452e-04  1.779e-05  19.407  < 2e-16 ***
is_urbanTRUE                             -2.667e-01  7.474e-02  -3.569 0.000358 ***
usage_capacity1000                       -6.206e-01  6.966e-02  -8.908  < 2e-16 ***
water_source_cleanProtected Shallow Well  4.947e-01  8.496e-02   5.823 5.79e-09 ***
water_source_cleanProtected Spring        1.279e+00  4.384e-01   2.917 0.003530 ** 
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 6534.5  on 4755  degrees of freedom
Residual deviance: 5688.9  on 4746  degrees of freedom
AIC: 5708.9

Number of Fisher Scoring iterations: 5

 AICc:  5708.923
 Pseudo R-square value:  0.129406
   *          Results of Geographically Weighted Regression              *

   *********************Model calibration information*********************
   Kernel function: gaussian 
   Fixed bandwidth: 2377.371 
   Regression points: the same locations as observations are used.
   Distance metric: A distance matrix is specified for this model calibration.

   ************Summary of Generalized GWR coefficient estimates:**********
                                                   Min.     1st Qu.      Median
   Intercept                                -3.7021e+02 -4.3797e+00  3.5590e+00
   distance_to_tertiary_road                -3.1622e-02 -4.5462e-04  9.1291e-05
   distance_to_city                         -5.4555e-02 -6.5623e-04 -1.3507e-04
   distance_to_town                         -8.6549e-03 -5.2754e-04 -1.6785e-04
   water_point_population                   -2.9696e-02 -2.2705e-03 -1.2277e-03
   local_population_1km                     -7.7730e-02  4.4281e-04  1.0548e-03
   is_urbanTRUE                             -7.3554e+02 -3.4675e+00 -1.6596e+00
   usage_capacity1000                       -5.5889e+01 -1.0347e+00 -4.1960e-01
   water_source_cleanProtected.Shallow.Well -1.8842e+02 -4.7295e-01  6.2378e-01
   water_source_cleanProtected.Spring       -1.3630e+03 -5.3436e+00  2.7714e+00
                                                3rd Qu.      Max.
   Intercept                                 1.3755e+01 2171.6373
   distance_to_tertiary_road                 6.3011e-04    0.0237
   distance_to_city                          1.5921e-04    0.0162
   distance_to_town                          2.4490e-04    0.0179
   water_point_population                    4.5879e-04    0.0765
   local_population_1km                      1.8479e-03    0.0333
   is_urbanTRUE                              1.0554e+00  995.1840
   usage_capacity1000                        3.9113e-01    9.2449
   water_source_cleanProtected.Shallow.Well  1.9564e+00   66.8914
   water_source_cleanProtected.Spring        7.0805e+00  208.3749
   ************************Diagnostic information*************************
   Number of data points: 4756 
   GW Deviance: 2815.659 
   AIC : 4418.776 
   AICc : 4744.213 
   Pseudo R-square value:  0.5691072 

   Program stops at: 2022-12-18 02:53:12 
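
The two sets of diagnostics above can be compared directly. The following is a minimal sketch in base R, not part of the original exercise: the figures are copied from the printed output rather than recomputed from the data, and it assumes the pseudo R-square is 1 minus the ratio of model deviance to null deviance, which reproduces the printed values to about four decimal places.

```r
# Values copied from the ggwr.basic() printout above
null_dev    <- 6534.5    # null deviance of the global model
global_dev  <- 5688.9    # residual deviance, global logistic regression
gw_dev      <- 2815.659  # GW deviance, geographically weighted model
global_aicc <- 5708.923  # AICc, global model
gw_aicc     <- 4744.213  # AICc, geographically weighted model

# Assumed definition: pseudo R-square = 1 - deviance / null deviance
pseudo_r2 <- function(dev, null_dev) 1 - dev / null_dev

global_r2 <- pseudo_r2(global_dev, null_dev)
gw_r2     <- pseudo_r2(gw_dev, null_dev)
aicc_drop <- global_aicc - gw_aicc  # positive value favours the GW model

round(c(global_r2 = global_r2, gw_r2 = gw_r2, aicc_drop = aicc_drop), 4)
```

Under these assumptions the geographically weighted model lowers the AICc by about 965 and raises the pseudo R-square from roughly 0.13 to roughly 0.57.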
gwr.fixed2 <- as.data.frame(gwlr.fixed2$SDF)
# Classify a water point as functional when its fitted probability is at least 0.5
gwr.fixed2 <- gwr.fixed2 %>%
  mutate(most = ifelse(yhat >= 0.5, TRUE, FALSE))
gwr.fixed2$y <- as.factor(gwr.fixed2$y)
gwr.fixed2$most <- as.factor(gwr.fixed2$most)
CM2 <- confusionMatrix(data = gwr.fixed2$most,
                       reference = gwr.fixed2$y,
                       positive = "TRUE")
CM2
Confusion Matrix and Statistics

Prediction FALSE TRUE
     FALSE  1833  268
     TRUE    281 2374
               Accuracy : 0.8846          
                 95% CI : (0.8751, 0.8935)
    No Information Rate : 0.5555          
    P-Value [Acc > NIR] : <2e-16          
                  Kappa : 0.7661          
 Mcnemar's Test P-Value : 0.6085          
            Sensitivity : 0.8986          
            Specificity : 0.8671          
         Pos Pred Value : 0.8942          
         Neg Pred Value : 0.8724          
             Prevalence : 0.5555          
         Detection Rate : 0.4992          
   Detection Prevalence : 0.5582          
      Balanced Accuracy : 0.8828          
       'Positive' Class : TRUE            
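
The headline statistics can be recovered by hand from the four cell counts. The following is a minimal sketch in base R, not part of the original exercise: the counts are copied from the printed matrix (rows are predictions, columns are the reference), with TRUE treated as the positive class.

```r
# Cell counts copied from the confusion matrix above
tn <- 1833  # predicted FALSE, reference FALSE
fn <- 268   # predicted FALSE, reference TRUE
fp <- 281   # predicted TRUE,  reference FALSE
tp <- 2374  # predicted TRUE,  reference TRUE

accuracy    <- (tp + tn) / (tp + tn + fp + fn)  # share of all points classified correctly
sensitivity <- tp / (tp + fn)                   # true positive rate (functional points)
specificity <- tn / (tn + fp)                   # true negative rate (non-functional points)
ppv         <- tp / (tp + fp)                   # positive predictive value
npv         <- tn / (tn + fn)                   # negative predictive value
balanced    <- (sensitivity + specificity) / 2  # balanced accuracy

round(c(accuracy = accuracy, sensitivity = sensitivity,
        specificity = specificity, ppv = ppv, npv = npv,
        balanced = balanced), 4)
```

The rounded results agree with the Accuracy, Sensitivity, Specificity, Pos Pred Value, Neg Pred Value and Balanced Accuracy lines reported by confusionMatrix().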
gwr_sf.fixed2 <- cbind(Osun_wp_sf_selected, gwr.fixed2)
tmap mode set to interactive viewing
prob_T2 <- tm_shape(Osun) +
  tm_polygons(alpha = 0.1) +
tm_shape(gwr_sf.fixed2) +
  tm_dots(col = "yhat",
          border.col = "gray60",
          border.lwd = 1) +
  tm_view(set.zoom.limits = c(8,14))


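After removing the two variables (distance_to_primary_road and distance_to_secondary_road), the true positive rate (sensitivity) does not improve, while the true negative rate (specificity) improves slightly. The reduced gwLR model is therefore a reasonable choice when the focus is on identifying non-functional water points.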