Commit d875915

Updated some styling elements and moved some notes, removed BIC
1 parent d88cad5 commit d875915

File tree

3 files changed: +97 −56 lines changed


to_explain_or_predict.Rmd

Lines changed: 47 additions & 19 deletions
@@ -114,13 +114,14 @@ ___Box (1976)___
 What is your question?
 </h2></center>
 
+???
 
+Throughout this, keep asking yourself: what is your question?
 
 ---
 
 # The two broad classes of DS/modelling question:
 
---
 
 ## Explain
 
@@ -142,8 +143,11 @@ __You can use many of the same models to fit in either context, but how you do i
 
 ???
 Prof. Shmueli's paper laments that statisticians focused almost exclusively on 'explanatory' models.
-I'd like to suggest that, with the increasing accessibility of Data Science and Machine Learning, the focus of many
-modern practitioners has swung the other way. Some of you may always be approaching a model as a prediction question.
+
+With the increasing accessibility of Data Science and Machine Learning, the focus of many
+modern practitioners has swung the other way.
+
+Some of you may always be approaching a model as a prediction question.
 
 What I'm presenting here today is fairly agnostic to your approach, be it Bayesian / frequentist / whatever.
 
@@ -172,9 +176,10 @@ $$E(Y) = f(X)$$
 Shmueli, G. (2010), http://www.jstor.org/stable/41058949
 ]
 
+
 ???
 
-Firstly, don't be scared by the representation here, as I'll explain.
+...don't be scared, it's not that bad...
 
 We are trying to model how X causes something, without being constrained by what data we have.
 This can be concepts such as Y = depression, and f(X) could be things like: anxiety, past trauma, physical health, stress... etc.
@@ -197,7 +202,7 @@ We can't measure them directly, so
 What do I mean by 'causes'? It's not the same as 'associated with'. There is an 'exposure' to 'outcome' effect, and a temporal element: i.e. exposure before outcome.
 This DAG hypothesises the causal relationship between chemotherapy and venous thromboembolism (VTE).
 
-The arrows indicator the direction of causal relationships. Age, sex, tumor site and tumour size are confounding this relationship and should be adjusted for in a model, but platelet count is a mediator and should not.
+The arrows indicate the direction of causal relationships. Age, sex, tumour site and tumour size are confounding this relationship and should be adjusted for in a model, but platelet count is a mediator and should not be.
 
 ---
 # Simple Example:
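The mediator point in the hunk above can be checked with a small simulation (a hypothetical sketch, not part of this commit; all variable names and coefficients are made up): when an exposure's effect runs entirely through a mediator, adjusting for that mediator removes the very effect you are trying to estimate.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 5_000

# Hypothetical variables: exposure X, mediator M on the causal path, outcome Y.
x = rng.integers(0, 2, n).astype(float)
m = 2.0 * x + rng.normal(size=n)      # X -> M
y = 1.5 * m + rng.normal(size=n)      # M -> Y, so the total effect of X is 2.0 * 1.5 = 3.0

def coef_of_x(design, y):
    """Least-squares coefficient on X (second column of the design matrix)."""
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    return beta[1]

total = coef_of_x(np.column_stack([np.ones(n), x]), y)      # Y ~ X: recovers ~3.0
direct = coef_of_x(np.column_stack([np.ones(n), x, m]), y)  # Y ~ X + M: ~0, effect "blocked"

print(round(total, 2), round(direct, 2))
```

A confounder behaves the opposite way: omitting it biases the estimate, which is why the DAG, not the data alone, decides what goes in the model.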
@@ -356,21 +361,19 @@ print(py_model1_exp_summary)
 ---
 # Testing Fit
 
-Interested in fit within our sample:
-* Significance of coefficients in our summaries
+* Significance of coefficients in our model summaries
 * Assumptions of regression being met - _a topic for another day_
 
 ```{r auc_exp_r, message=FALSE, warning=FALSE}
 library(ModelMetrics)
 auc(r_model_exp)
-BIC(r_model_exp)
+
 ```
 
 ```{python auc_exp_py}
 from sklearn import metrics
 py_auc = metrics.roc_auc_score(heart_failure_pd['DEATH_EVENT'], py_model1_exp.fittedvalues)
 print(py_auc)
-print(py_model1_exp.bic)
 ```
 
 --
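As background to the `auc()` / `roc_auc_score` calls in the hunk above: AUC is the Mann-Whitney probability that a randomly chosen event case scores higher than a randomly chosen non-case. A minimal numpy-only sketch (synthetic data; the names are hypothetical, not from the heart-failure example):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 400

# Synthetic risk score and binary outcome: higher score -> higher event probability.
score = rng.normal(size=n)
y = rng.binomial(1, 1 / (1 + np.exp(-2 * score)))

def auc(y, score):
    """AUC as P(score_case > score_control), counting ties as one half."""
    case, control = score[y == 1], score[y == 0]
    diff = case[:, None] - control[None, :]
    return (diff > 0).mean() + 0.5 * (diff == 0).mean()

print(auc(y, score))  # well above 0.5, since the score carries real signal
```

This in-sample rank interpretation is why AUC says nothing about calibration, and why out-of-sample checks follow on the next slides.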
@@ -497,10 +500,11 @@ print(py_auc)
 
 ]
 
+???
 
 So you might leave multiple 'non-significant' predictors in an explanatory model, as they are rational and all effects are conditional on each other.
 
-You might be happy with a 'wrong' model for in predction, if it gives better predictions.
+You might be happy with a 'wrong' model for prediction, if it gives better predictions.
 
 ---
 # Explain or predict Bingo (1):
@@ -509,9 +513,11 @@ You might be happy with a 'wrong' model for in predction, if it gives better pre
 
 .big[Forecasting attendances at an Emergency Department]
 
+<br><br>
+
 --
 
-# Predict!
+## Predict!
 
 ---
 
@@ -525,7 +531,7 @@ You might be happy with a 'wrong' model for in predction, if it gives better pre
 
 --
 
-##Explain!
+## Explain!
 
 
 ---
@@ -539,7 +545,7 @@ You might be happy with a 'wrong' model for in predction, if it gives better pre
 
 --
 
-##It depends...is it about the person's individual risk based on explanatory factors, the best prediction you can make, or is it for risk-adjustment?
+## It depends... is it about the person's individual risk based on explanatory factors, the best prediction you can make, or is it for risk-adjustment?
 
 ---
 # Explain or predict Bingo (4):
@@ -567,10 +573,10 @@ You might be happy with a 'wrong' model for in predction, if it gives better pre
 
 --
 
-##Predict
+## Predict!
 
 ---
-# Explain or predict Bingo (5):
+# Explain or predict Bingo (6):
 
 <br><br><br>
 
@@ -580,7 +586,31 @@ You might be happy with a 'wrong' model for in predction, if it gives better pre
 
 --
 
-##It depends...: are you testing what causes it, or predicting future states of the population?
+## It depends... are you testing what causes it, or predicting future states of the population?
+
+---
+
+# Summary
+
+.pull-left[
+
+## Consider what the purpose of your model is:
+
+* What is your question?
+
+* Is it predictive or explanatory?
+
+* Are you using the right modelling framework?
+
+* Are you doing anything that is incompatible with the framework you've identified?
+]
+
+.pull-right[
+<br><br><br>
+> "With great power comes great responsibility"
+- Stan Lee (via Spider-Man's Uncle Ben)
+
+]
 
 ---
 
@@ -605,7 +635,7 @@ Shmueli, G. (2010) 'To Explain or to Predict?' _Statistical Science_ __25__, no.
 ---
 
 
-## Predictive Model - R bonus
+## Predictive Model - R bonus (ridge regression, which scikit-learn assumes you want...)
 
 ```{r r_ridge, message=FALSE, warning=FALSE}
 heart_failure_dt$sc_serum_creatinin <- scale(heart_failure_dt$serum_creatinine)
@@ -628,8 +658,6 @@ cv <- cv.glmnet(x, y, alpha = 0, family="binomial")
 
 ridge1 <- glmnet(x, y, alpha = 0, lambda = cv$lambda.min, family = "binomial")
 
-
-
 # Make predictions on the test data
 x.test <- model.matrix(DEATH_EVENT ~ sc_serum_creatinin + sc_ejection_fraction, Test)[,-1]
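A rough Python counterpart to the glmnet ridge chunk above (a hedged sketch with synthetic stand-in data, not the heart-failure dataset): scikit-learn's `LogisticRegressionCV` with its default L2 penalty plays the role of `cv.glmnet(alpha = 0)` followed by `glmnet(lambda = cv$lambda.min)`, with `C` acting as the inverse of glmnet's `lambda`.

```python
import numpy as np
from sklearn.linear_model import LogisticRegressionCV
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)
n = 500

# Stand-ins for sc_serum_creatinin / sc_ejection_fraction and DEATH_EVENT.
X = rng.normal(size=(n, 2))
y = rng.binomial(1, 1 / (1 + np.exp(-(0.8 * X[:, 0] - 0.6 * X[:, 1]))))

X = StandardScaler().fit_transform(X)  # same role as scale() in the R chunk

# Ridge (L2-penalised) logistic regression, penalty strength chosen by CV.
model = LogisticRegressionCV(Cs=10, cv=5, penalty="l2").fit(X, y)

pred = model.predict_proba(X)[:, 1]  # predicted probabilities, cf. predict() on x.test
print(model.C_[0])
```

Note the design-choice contrast with the R code: scikit-learn penalises by default, so the "explanatory" unpenalised fit is the one you have to ask for, which is the point the slide heading is making.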

to_explain_or_predict.html

Lines changed: 48 additions & 35 deletions
@@ -65,13 +65,14 @@
 What is your question?
 </h2></center>
 
+???
 
+Throughout this, keep asking yourself: what is your question?
 
 ---
 
 # The two broad classes of DS/modelling question:
 
---
 
 ## Explain
 
@@ -93,8 +94,11 @@
 
 ???
 Prof. Shmueli's paper laments that statisticians focused almost exclusively on 'explanatory' models.
-I'd like to suggest that, with the increasing accessibility of Data Science and Machine Learning, the focus of many
-modern practitioners has swung the other way. Some of you may always be approaching a model as a prediction question.
+
+With the increasing accessibility of Data Science and Machine Learning, the focus of many
+modern practitioners has swung the other way.
+
+Some of you may always be approaching a model as a prediction question.
 
 What I'm presenting here today is fairly agnostic to your approach, be it Bayesian / frequentist / whatever.
 
@@ -123,9 +127,10 @@
 Shmueli, G. (2010), http://www.jstor.org/stable/41058949
 ]
 
+
 ???
 
-Firstly, don't be scared by the representation here, as I'll explain.
+...don't be scared, it's not that bad...
 
 We are trying to model how X causes something, without being constrained by what data we have.
 This can be concepts such as Y = depression, and f(X) could be things like: anxiety, past trauma, physical health, stress... etc.
@@ -148,7 +153,7 @@
 What do I mean by 'causes'? It's not the same as 'associated with'. There is an 'exposure' to 'outcome' effect, and a temporal element: i.e. exposure before outcome.
 This DAG hypothesises the causal relationship between chemotherapy and venous thromboembolism (VTE).
 
-The arrows indicator the direction of causal relationships. Age, sex, tumor site and tumour size are confounding this relationship and should be adjusted for in a model, but platelet count is a mediator and should not.
+The arrows indicate the direction of causal relationships. Age, sex, tumour site and tumour size are confounding this relationship and should be adjusted for in a model, but platelet count is a mediator and should not be.
 
 ---
 # Simple Example:
@@ -298,7 +303,7 @@
 ## Model: Logit Df Residuals: 296
 ## Method: MLE Df Model: 2
 ## Date: Mon, 18 Nov 2024 Pseudo R-squ.: 0.1359
-## Time: 13:09:11 Log-Likelihood: -162.16
+## Time: 13:57:20 Log-Likelihood: -162.16
 ## converged: True LL-Null: -187.67
 ## Covariance Type: nonrobust LLR p-value: 8.308e-12
 ## =====================================================================================
@@ -314,8 +319,7 @@
 ---
 # Testing Fit
 
-Interested in fit within our sample:
-* Significance of coefficients in our summaries
+* Significance of coefficients in our model summaries
 * Assumptions of regression being met - _a topic for another day_
 
 
@@ -328,14 +332,6 @@
 ## [1] 0.7614173
 ```
 
-``` r
-BIC(r_model_exp)
-```
-
-```
-## [1] 341.4225
-```
-
 
 ``` python
 from sklearn import metrics
@@ -347,14 +343,6 @@
 ## 0.7614172824302136
 ```
 
-``` python
-print(py_model1_exp.bic)
-```
-
-```
-## 341.422453885286
-```
-
 --
 
 ### Is over-fitting an issue?
@@ -444,7 +432,7 @@
 ```
 
 ```
-## 0.7284541723666211
+## 0.6331249999999999
 ```
 
 ---
@@ -476,10 +464,11 @@
 
 ]
 
+???
 
 So you might leave multiple 'non-significant' predictors in an explanatory model, as they are rational and all effects are conditional on each other.
 
-You might be happy with a 'wrong' model for in predction, if it gives better predictions.
+You might be happy with a 'wrong' model for prediction, if it gives better predictions.
 
 ---
 # Explain or predict Bingo (1):
@@ -488,9 +477,11 @@
 
 .big[Forecasting attendances at an Emergency Department]
 
+<br><br>
+
 --
 
-# Predict!
+## Predict!
 
 ---
 
@@ -504,7 +495,7 @@
 
 --
 
-##Explain!
+## Explain!
 
 
 ---
@@ -518,7 +509,7 @@
 
 --
 
-##It depends...is it about the person's individual risk based on explanatory factors, the best prediction you can make, or is it for risk-adjustment?
+## It depends... is it about the person's individual risk based on explanatory factors, the best prediction you can make, or is it for risk-adjustment?
 
 ---
 # Explain or predict Bingo (4):
@@ -546,10 +537,10 @@
 
 --
 
-##Predict
+## Predict!
 
 ---
-# Explain or predict Bingo (5):
+# Explain or predict Bingo (6):
 
 <br><br><br>
 
@@ -559,7 +550,31 @@
 
 --
 
-##It depends...: are you testing what causes it, or predicting future states of the population?
+## It depends... are you testing what causes it, or predicting future states of the population?
+
+---
+
+# Summary
+
+.pull-left[
+
+## Consider what the purpose of your model is:
+
+* What is your question?
+
+* Is it predictive or explanatory?
+
+* Are you using the right modelling framework?
+
+* Are you doing anything that is incompatible with the framework you've identified?
+]
+
+.pull-right[
+<br><br><br>
+> "With great power comes great responsibility"
+- Stan Lee (via Spider-Man's Uncle Ben)
+
+]
 
 ---
 
@@ -584,7 +599,7 @@
 ---
 
 
-## Predictive Model - R bonus
+## Predictive Model - R bonus (ridge regression, which scikit-learn assumes you want...)
 
 
 ``` r
@@ -608,8 +623,6 @@
 
 ridge1 <- glmnet(x, y, alpha = 0, lambda = cv$lambda.min, family = "binomial")
 
-
-
 # Make predictions on the test data
 x.test <- model.matrix(DEATH_EVENT ~ sc_serum_creatinin + sc_ejection_fraction, Test)[,-1]
 