R learning_GIT

1, command line interface

2, computer structure: tree, directory

3, / root directory

4, ~ home directory

5, cd change directory

6, pwd: print working directory

7, mkdir: make a directory

8, cp: copy to

9, mv: move
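
A minimal sketch of the same operations (cd, pwd, mkdir, cp, mv) done from Python's standard library instead of the shell. The directory name `notes` and the file names are hypothetical, and everything happens inside a temporary directory so nothing real is touched.

```python
import os
import shutil
import tempfile
from pathlib import Path


def demo_navigation():
    """Mirror cd, pwd, mkdir, cp and mv inside a throwaway directory."""
    start = Path.cwd()                      # pwd: the current working directory
    with tempfile.TemporaryDirectory() as tmp:
        try:
            os.chdir(tmp)                   # cd: change directory
            Path("notes").mkdir()           # mkdir: make a directory
            Path("a.txt").write_text("hello")
            shutil.copy("a.txt", "notes/a_copy.txt")    # cp: copy a file
            shutil.move("a.txt", "notes/a_moved.txt")   # mv: move (or rename) a file
            return sorted(p.name for p in Path("notes").iterdir())
        finally:
            os.chdir(start)                 # step back out before the temp dir is removed
```

Calling `demo_navigation()` returns the names of the files that ended up in `notes`. `Path.home()` gives the home directory (`~`), and `Path.cwd().anchor` gives the root of the current path (`/` on Unix-like systems).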

introduction to git

 

types of data science analysis:

descriptive

exploratory analysis

inferential analysis

predictive analysis

 

sensitivity-> pr(positive test|disease)

specificity-> pr(negative test|no disease)

positive predictive value-> pr(disease|positive test)

negative predictive value-> pr(no disease|negative test)

accuracy-> pr(correct outcome)

causal analysis

mechanistic analysis
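
The test measures listed under predictive analysis can all be read off a 2x2 confusion matrix. A minimal sketch, assuming the four counts (true positives, false positives, false negatives, true negatives) are already known; the function name and the dictionary keys are illustrative only.

```python
def diagnostic_metrics(tp, fp, fn, tn):
    """Compute the standard 2x2 test measures from confusion-matrix counts."""
    total = tp + fp + fn + tn
    return {
        "sensitivity": tp / (tp + fn),   # pr(positive test | disease)
        "specificity": tn / (tn + fp),   # pr(negative test | no disease)
        "ppv": tp / (tp + fp),           # pr(disease | positive test)
        "npv": tn / (tn + fn),           # pr(no disease | negative test)
        "accuracy": (tp + tn) / total,   # pr(correct outcome)
    }
```

For made-up counts tp=90, fp=40, fn=10, tn=860, sensitivity is 90/100 and accuracy is 950/1000, while the positive predictive value is only 90/130 because the disease is rare in that sample.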

 

 

Data science interview (1)

questions from KDnuggets:

Q1: regularization

smooth, L1 (lasso), L2 (ridge)
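
A small sketch of how the L1 and L2 penalties change a fitted coefficient, assuming the simplest possible setting: one predictor, no intercept. The objectives assumed here are 0.5*sum((y - b*x)^2) + 0.5*lam*b^2 for ridge and 0.5*sum((y - b*x)^2) + lam*|b| for lasso; other scalings of lam are common and would change the numbers.

```python
def soft_threshold(z, lam):
    """Shrink z toward zero by lam, and set it to zero once |z| <= lam."""
    if z > lam:
        return z - lam
    if z < -lam:
        return z + lam
    return 0.0


def fit_one_predictor(x, y, lam):
    """Closed-form OLS, ridge (L2) and lasso (L1) slopes for y ~ b * x."""
    sxy = sum(a * b for a, b in zip(x, y))
    sxx = sum(a * a for a in x)
    return {
        "ols": sxy / sxx,                          # no penalty
        "ridge": sxy / (sxx + lam),                # L2: shrinks, never exactly zero
        "lasso": soft_threshold(sxy, lam) / sxx,   # L1: shrinks, can be exactly zero
    }
```

With x = [1, 2, 3] and y = [2, 4, 6], a large enough lam drives the lasso slope to exactly 0, while the ridge slope only gets smaller. That is the usual reason lasso is described as doing variable selection.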

 

Q3: validation

jackknife, mean squared error, R squared
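
A minimal sketch of the three validation tools named above, in plain Python. The function names are illustrative; the jackknife here estimates the standard error of any statistic by leaving out one observation at a time.

```python
def mean_squared_error(y_true, y_pred):
    """Average squared difference between observed and predicted values."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def r_squared(y_true, y_pred):
    """1 - (residual sum of squares / total sum of squares)."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


def jackknife_se(values, statistic):
    """Jackknife standard error: recompute the statistic with each point left out."""
    values = list(values)
    n = len(values)
    leave_one_out = [statistic(values[:i] + values[i + 1:]) for i in range(n)]
    center = sum(leave_one_out) / n
    return ((n - 1) / n * sum((v - center) ** 2 for v in leave_one_out)) ** 0.5
```

For the sample mean, the jackknife standard error equals the familiar s / sqrt(n), which makes a convenient sanity check.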

 

Q4: ROC curve

ROC: a receiver operating characteristic (ROC) curve illustrates the performance of a binary classifier as its discrimination threshold is varied. The curve is created by plotting the true positive rate (TPR, sensitivity, or recall in machine learning) against the false positive rate (FPR, fall-out, 1 - specificity) for the different possible cutpoints of a diagnostic test. The ROC curve is thus sensitivity as a function of fall-out.

An ROC curve demonstrates several things:

  1. It shows the tradeoff between sensitivity and specificity (any increase in sensitivity will be accompanied by a decrease in specificity).
  2. The closer the curve follows the left-hand border and then the top border of the ROC space, the more accurate the test.
  3. The closer the curve comes to the 45-degree diagonal of the ROC space, the less accurate the test.
  4. The slope of the tangent line at a cutpoint gives the likelihood ratio (LR) for that value of the test. In the T4 example, the LR for T4 < 5 is 52, which corresponds to the far left, steep portion of the curve; the LR for T4 > 9 is 0.2, which corresponds to the far right, nearly horizontal portion of the curve.
  5. The area under the curve (AUC) is a measure of test accuracy.
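
The points above can be made concrete with a small sketch that builds the ROC points by sweeping the threshold from high to low and then integrates the area with the trapezoid rule. It assumes labels are coded 1 (positive) and 0 (negative) and that both classes are present; the function names are illustrative and this is not how the R packages below are implemented.

```python
def roc_points(scores, labels):
    """Return (FPR, TPR) points as the decision threshold is lowered."""
    pos = sum(labels)
    neg = len(labels) - pos
    pairs = sorted(zip(scores, labels), key=lambda p: -p[0])
    points = [(0.0, 0.0)]
    tp = fp = 0
    i = 0
    while i < len(pairs):
        threshold = pairs[i][0]
        # Take all observations tied at this score in one step.
        while i < len(pairs) and pairs[i][0] == threshold:
            if pairs[i][1] == 1:
                tp += 1
            else:
                fp += 1
            i += 1
        points.append((fp / neg, tp / pos))
    return points


def auc(points):
    """Area under the ROC curve by the trapezoid rule."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))
```

A classifier that ranks every positive above every negative hugs the left and top borders and gets an area of 1; a curve along the 45-degree diagonal gets 0.5.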

R packages: {ROCR}, {pROC}

Q5: Good algorithm

no selection bias, representative of the population, control groups are comparable, reproducible, local or global

 

Q8: power

power is sensitivity: the probability of correctly rejecting the null hypothesis when H1 is true
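
A sketch of a power calculation, assuming the simplest textbook case: a one-sided z-test of H0: mu = 0 against H1: mu = effect > 0 with known standard deviation. The function name and arguments are illustrative.

```python
from statistics import NormalDist


def z_test_power(effect, sigma, n, alpha=0.05):
    """Probability of rejecting H0 when the true mean equals `effect`."""
    z = NormalDist()                       # standard normal distribution
    critical = z.inv_cdf(1 - alpha)        # rejection cutoff under H0
    shift = effect * n ** 0.5 / sigma      # mean of the z statistic under H1
    return 1 - z.cdf(critical - shift)
```

When the effect is zero the function returns alpha (the type 1 error rate), and power rises toward 1 as the sample size or the effect grows.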

 

Q9: resampling

bootstrap (used for estimating the precision of sample statistics), permutation test, randomization test
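
A minimal sketch of two of these resampling methods using only the standard library. The defaults (`n_boot`, `n_perm`, `seed`) are arbitrary choices for illustration, and the permutation test shown is the two-sided test for a difference in means.

```python
import random
from statistics import mean, stdev


def bootstrap_se(values, statistic=mean, n_boot=2000, seed=0):
    """Bootstrap standard error: resample with replacement, recompute, take the SD."""
    rng = random.Random(seed)
    values = list(values)
    n = len(values)
    estimates = [statistic([rng.choice(values) for _ in range(n)])
                 for _ in range(n_boot)]
    return stdev(estimates)


def permutation_p_value(a, b, n_perm=5000, seed=0):
    """Two-sided permutation test for a difference in group means."""
    rng = random.Random(seed)
    observed = abs(mean(a) - mean(b))
    pooled = list(a) + list(b)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)                # reassign group labels at random
        diff = abs(mean(pooled[:len(a)]) - mean(pooled[len(a):]))
        if diff >= observed - 1e-12:       # small tolerance for float rounding
            hits += 1
    return (hits + 1) / (n_perm + 1)       # add-one correction avoids p = 0
```

The bootstrap answers "how precise is this estimate?", while the permutation test answers "could a difference this large arise from random labelling alone?".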

 

Q10: false positive and false negative

A false positive (FP) is a type 1 error; a false negative (FN) is a type 2 error. In medical testing, a false positive is tolerable.

In spam filtering, a false negative is tolerable.

Q11: avoid selection bias by randomization
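
A sketch of simple random assignment, assuming a list of study units to be split into two groups of (nearly) equal size. The function name and group labels are illustrative.

```python
import random


def randomize(units, seed=None):
    """Shuffle the units and split them into treatment and control groups."""
    rng = random.Random(seed)
    shuffled = list(units)
    rng.shuffle(shuffled)                  # chance, not the analyst, decides the order
    half = len(shuffled) // 2
    return {"treatment": shuffled[:half], "control": shuffled[half:]}
```

Because assignment depends only on the random shuffle, neither the analyst nor the subjects can influence who ends up in which group.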