precision medicine

data sharing and transparency

real-world evidence

big data and predictive analytics

1. command line interface

2. file-system structure: a tree of directories

3. / : root directory

4. ~ : home directory

5. cd : change directory

6. pwd : print working directory

7. mkdir : make a directory

8. cp : copy

9. mv : move (or rename)
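A quick session tying the commands above together (the directory and file names here are made up for illustration):

```shell
cd "$(mktemp -d)"        # start in a scratch directory (path is illustrative)
pwd                      # print the working directory
mkdir notes              # make a directory
cd notes                 # change into it
touch draft.txt          # create an empty file to work with
cp draft.txt backup.txt  # copy draft.txt to backup.txt
mv backup.txt old.txt    # move (rename) backup.txt to old.txt
ls                       # lists: draft.txt old.txt
cd /                     # jump to the root directory
cd ~                     # jump back to the home directory
```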

introduction to Git

types of data science analysis:

descriptive analysis

exploratory analysis

inferential analysis

predictive analysis

sensitivity -> Pr(positive test | disease)

specificity -> Pr(negative test | no disease)

positive predictive value -> Pr(disease | positive test)

negative predictive value -> Pr(no disease | negative test)

accuracy -> Pr(correct outcome)

causal analysis

mechanistic analysis
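The five diagnostic metrics above can all be read off the four cells of a confusion matrix; a minimal sketch (the counts below are made up):

```python
def diagnostic_metrics(tp, fp, fn, tn):
    """Standard screening-test metrics from confusion-matrix counts."""
    return {
        "sensitivity": tp / (tp + fn),          # Pr(positive | disease)
        "specificity": tn / (tn + fp),          # Pr(negative | no disease)
        "ppv":         tp / (tp + fp),          # Pr(disease | positive)
        "npv":         tn / (tn + fn),          # Pr(no disease | negative)
        "accuracy":    (tp + tn) / (tp + fp + fn + tn),
    }

# hypothetical screen: 100 diseased and 900 healthy subjects
m = diagnostic_metrics(tp=90, fp=50, fn=10, tn=850)
```

Note how PPV (about 0.64 here) stays low even with 90% sensitivity, because healthy subjects greatly outnumber diseased ones.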

1. while: test a condition

2. Stack Overflow

3. GitHub

4. RMySQL

1. know how much memory your system has

2. know a little about your data, e.g. nrow and class

3. to convert bytes to MB, divide by 2^20

4. remove NAs; find them with:

bad <- is.na(x)  # logical vector: TRUE where x is NA

x[!bad]  # keep only the non-NA values

good <- complete.cases(x, y)  # TRUE where both x and y are non-NA
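The memory rule of thumb from point 3, as a worked example (the row and column counts are hypothetical): each numeric (double) value takes 8 bytes, and 2^20 bytes make one MB.

```python
rows, cols = 1_500_000, 120      # hypothetical all-numeric data frame
bytes_needed = rows * cols * 8   # 8 bytes per double-precision value
mb = bytes_needed / 2**20        # bytes -> MB: divide by 2^20
print(round(mb, 1))              # ~1373.3 MB, so check free RAM first
```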

Glassdoor: quantitative analyst positions at Google

questions from KDnuggets:

Q1: regularization

smooth, L1 (lasso), L2 (ridge)
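A sketch of how the two penalties behave in the one-variable, no-intercept case (data and penalty values below are made up): L2 (ridge) shrinks the coefficient smoothly toward zero, while L1 (lasso) soft-thresholds it and can make it exactly zero.

```python
def ridge_1d(x, y, lam):
    # minimize sum((y - b*x)^2) + lam * b^2  ->  closed form
    sxy = sum(xi * yi for xi, yi in zip(x, y))
    sxx = sum(xi * xi for xi in x)
    return sxy / (sxx + lam)                 # L2: smooth shrinkage toward 0

def lasso_1d(x, y, lam):
    # minimize sum((y - b*x)^2) + lam * |b|  ->  soft-thresholding
    sxy = sum(xi * yi for xi, yi in zip(x, y))
    sxx = sum(xi * xi for xi in x)
    sign = 1.0 if sxy >= 0 else -1.0
    return sign * max(abs(sxy) - lam / 2, 0.0) / sxx   # L1: exact zeros

x, y = [1, 2, 3], [2, 4, 6]      # toy data; the OLS slope is 2
print(ridge_1d(x, y, 0))         # 2.0 (no penalty = OLS)
print(ridge_1d(x, y, 14))        # 1.0 (shrunk, but never exactly 0)
print(lasso_1d(x, y, 60))        # 0.0 (thresholded all the way to zero)
```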

Q3: validation

jackknife, mean squared error, R squared
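A minimal jackknife sketch (the data values are made up): for the sample mean, the jackknife standard error reproduces the familiar s/sqrt(n).

```python
import math

def jackknife_se(data):
    """Jackknife standard error of the sample mean via leave-one-out means."""
    n, total = len(data), sum(data)
    loo = [(total - x) / (n - 1) for x in data]   # leave-one-out means
    loo_bar = sum(loo) / n
    return math.sqrt((n - 1) / n * sum((t - loo_bar) ** 2 for t in loo))

data = [1, 2, 3, 4, 5]
print(jackknife_se(data))   # 0.7071... = stdev(data) / sqrt(5)
```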

Q4: ROC curve

ROC: a **receiver operating characteristic (ROC) curve** illustrates the performance of a binary classifier as its discrimination threshold is varied. The curve is created by plotting the true positive rate (TPR; sensitivity, or recall in machine learning) against the false positive rate (FPR; fall-out, 1 - specificity) at each possible cutpoint of a diagnostic test. The ROC curve is thus sensitivity as a function of fall-out.

An ROC curve demonstrates several things:

- It shows the tradeoff between sensitivity and specificity (any increase in sensitivity will be accompanied by a decrease in specificity).
- The closer the curve follows the left-hand border and then the top border of the ROC space, the more accurate the test.
- The closer the curve comes to the 45-degree diagonal of the ROC space, the less accurate the test.
- The slope of the tangent line at a cutpoint gives the likelihood ratio (LR) for that value of the test. In the thyroid (T4) screening example this material is drawn from, the LR for T4 < 5 is 52, corresponding to the far-left, steep portion of the curve, while the LR for T4 > 9 is 0.2, corresponding to the far-right, nearly horizontal portion.
- The area under the curve (AUC) is a measure of test accuracy.

R packages: {ROCR}, {pROC}
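A pure-Python sketch of building an ROC curve and its AUC by sweeping the threshold, as described above (the scores and labels are invented):

```python
def roc_points(scores, labels):
    """(FPR, TPR) pairs as the discrimination threshold is lowered."""
    pos = sum(labels)
    neg = len(labels) - pos
    pts = [(0.0, 0.0)]
    for thr in sorted(set(scores), reverse=True):   # sweep the cutpoint
        tp = sum(s >= thr and y == 1 for s, y in zip(scores, labels))
        fp = sum(s >= thr and y == 0 for s, y in zip(scores, labels))
        pts.append((fp / neg, tp / pos))
    return pts

def auc(pts):
    """Area under the ROC curve by the trapezoid rule."""
    return sum((x2 - x1) * (y1 + y2) / 2
               for (x1, y1), (x2, y2) in zip(pts, pts[1:]))

scores = [0.9, 0.8, 0.35, 0.2]          # classifier scores (made up)
labels = [1,   1,   0,    0]            # true classes
print(auc(roc_points(scores, labels)))  # 1.0: perfectly separated toy data
```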

Q5: Good algorithm

no selection bias; representative of the population; control and treatment groups are comparable; reproducible; local or global

Q8: power

power is sensitivity: the probability of correctly rejecting the null hypothesis when the alternative (H1) is true
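Power as described above, sketched for a one-sided z-test with known sigma (the effect size and sample size are arbitrary), using only the standard library:

```python
from math import sqrt
from statistics import NormalDist

def power_one_sided_z(mu1, sigma, n, alpha=0.05):
    """Pr(reject H0: mu = 0 | true mean is mu1 > 0) for a one-sided z-test."""
    z_crit = NormalDist().inv_cdf(1 - alpha)   # rejection cutoff under H0
    shift = mu1 * sqrt(n) / sigma              # mean of the z-statistic under H1
    return 1 - NormalDist().cdf(z_crit - shift)

print(round(power_one_sided_z(mu1=0.5, sigma=1.0, n=30), 3))  # ~0.863
```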

Q9: resampling

bootstrap: used for estimating the precision of sample statistics; also permutation tests and randomization tests
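A seeded bootstrap sketch for the standard error of the mean (the sample values and replicate count are arbitrary):

```python
import random
import statistics

def bootstrap_se(data, stat=statistics.mean, n_boot=2000, seed=0):
    """Std. dev. of the statistic across resamples drawn with replacement."""
    rng = random.Random(seed)
    reps = [stat([rng.choice(data) for _ in data]) for _ in range(n_boot)]
    return statistics.stdev(reps)

data = list(range(1, 51))     # toy sample: 1..50
print(bootstrap_se(data))     # close to stdev(data)/sqrt(50), roughly 2.0
```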

Q10: false positive and false negative

a false positive is a type I error; a false negative is a type II error

in medical testing a false positive is tolerable (a follow-up test can rule it out); in spam filtering a false negative is tolerable (stray spam is less costly than a blocked legitimate message)

Q11: avoid selection bias by randomization
