Data science interview (1)

Questions from KDnuggets:

Q1: regularization

smooth, L1 (lasso), L2 (ridge)


Q3: validation

jackknife, mean squared error, R squared


Q4: ROC curve

ROC: the receiver operating characteristic (ROC) curve illustrates the performance of a binary classifier as its discrimination threshold is varied. The curve is created by plotting the true positive rate (TPR, also called sensitivity or recall) against the false positive rate (FPR, also called fall-out, equal to 1 - specificity). The ROC curve is thus sensitivity as a function of fall-out.

Equivalently, for a diagnostic test, the ROC curve is a plot of the true positive rate against the false positive rate for the different possible cutpoints of the test.

An ROC curve demonstrates several things:

  1. It shows the tradeoff between sensitivity and specificity (any increase in sensitivity will be accompanied by a decrease in specificity).
  2. The closer the curve follows the left-hand border and then the top border of the ROC space, the more accurate the test.
  3. The closer the curve comes to the 45-degree diagonal of the ROC space, the less accurate the test.
  4. The slope of the tangent line at a cutpoint gives the likelihood ratio (LR) for that value of the test. In the source's T4 example, the LR for T4 < 5 is 52, which corresponds to the far left, steep portion of the curve; the LR for T4 > 9 is 0.2, which corresponds to the far right, nearly horizontal portion of the curve.
  5. The area under the curve is a measure of test accuracy.
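
A minimal sketch of how the curve and its area could be computed by hand, assuming binary labels (1 = positive) and one score per example. The function names `roc_points` and `auc` and the toy data are made up for illustration; they are not from any package.

```python
def roc_points(labels, scores):
    """Return (FPR, TPR) points obtained by lowering the threshold step by step."""
    pos = sum(labels)
    neg = len(labels) - pos
    # Highest scores first: each distinct score is one candidate cutpoint.
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    points = [(0.0, 0.0)]  # threshold above every score: nothing is predicted positive
    tp = fp = 0
    i = 0
    while i < len(ranked):
        cut = ranked[i][0]
        # Admit every example tied at this score before emitting a point.
        while i < len(ranked) and ranked[i][0] == cut:
            if ranked[i][1] == 1:
                tp += 1
            else:
                fp += 1
            i += 1
        points.append((fp / neg, tp / pos))
    return points


def auc(points):
    """Area under the piecewise-linear ROC curve (trapezoidal rule)."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))


# Toy data, invented for this sketch.
labels = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
points = roc_points(labels, scores)
area = auc(points)
print(points)
print(area)
```

The area equals the fraction of (positive, negative) pairs in which the positive example gets the higher score, which is one way to sanity-check the result on small data.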

R packages: {ROCR}, {pROC}

Q5: Good algorithm

no selection bias, representative of the population, control groups are comparable, reproducible, local or global


Q8: power

Power is sensitivity: the probability of correctly rejecting the null hypothesis when H1 is true.
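
A rough sketch of what that probability looks like in practice, assuming a one-sided z-test of H0: mean = 0 with known sigma. The setup, the function names and the parameter values are assumptions chosen for illustration. The closed form and a simulation should roughly agree.

```python
import random
from statistics import NormalDist


def analytic_power(effect, sigma, n, alpha=0.05):
    """Power of a one-sided z-test when the true mean is `effect`."""
    z_crit = NormalDist().inv_cdf(1 - alpha)
    # Under H1 the z statistic is normal with mean effect * sqrt(n) / sigma.
    return 1 - NormalDist().cdf(z_crit - effect * n ** 0.5 / sigma)


def simulated_power(effect, sigma, n, alpha=0.05, trials=5000, seed=0):
    """Fraction of simulated samples drawn under H1 in which H0 is rejected."""
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - alpha)
    rejections = 0
    for _ in range(trials):
        sample = [rng.gauss(effect, sigma) for _ in range(n)]
        z = (sum(sample) / n) * n ** 0.5 / sigma
        if z > z_crit:
            rejections += 1
    return rejections / trials


# Hypothetical scenario: true effect 0.5, sigma 1, n = 25.
exact = analytic_power(0.5, 1.0, 25)
approx = simulated_power(0.5, 1.0, 25)
print(exact, approx)
```

With effect = 0 the same formula returns alpha, the type I error rate, which is the link between power and the error types in Q10.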


Q9: resampling

Bootstrap (used for estimating the precision of sample statistics), permutation test, randomization test
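
A small sketch of the bootstrap idea, assuming the statistic of interest is the sample mean and that its precision is summarized by a standard error. `bootstrap_se` is a hypothetical helper and the data values are invented.

```python
import random
from statistics import mean, stdev


def bootstrap_se(data, stat=mean, n_resamples=2000, seed=0):
    """Estimate the standard error of `stat` by resampling with replacement."""
    rng = random.Random(seed)
    n = len(data)
    # Each resample has the same size as the original sample.
    estimates = [stat(rng.choices(data, k=n)) for _ in range(n_resamples)]
    return stdev(estimates)


# Invented measurements, for illustration only.
data = [2.1, 3.4, 1.9, 5.6, 4.2, 3.8, 2.7, 4.9, 3.1, 3.6]
se = bootstrap_se(data)
classic_se = stdev(data) / len(data) ** 0.5  # textbook formula for comparison
print(se, classic_se)
```

For the mean the two numbers should be close. The appeal of the bootstrap is that the same loop works for statistics with no simple formula, such as the median (pass `stat=median` from `statistics`).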


Q10: false positive and false negative

A false positive is a type I error; a false negative is a type II error. In medical screening, a false positive is tolerable.

In spam filtering, a false negative is tolerable.

Q11: avoid selection bias by randomization



