Last modified on 01 Oct 2021.
This is my note for the course (Structuring Machine Learning Projects). The code snippets in this note have been rewritten to be clearer and more concise.
👉 Course 1 – Neural Networks and Deep Learning.
👉 Course 2 – Improving Deep Neural Networks: Hyperparameter tuning, Regularization and Optimization.
👉 Course 3 – Structuring Machine Learning Projects.
👉 Course 4 – Convolutional Neural Networks.
👉 Course 5 – Sequence Models.
⭐ Case study (should read): Bird recognition in the city of Peacetopia.
⭐ Case study (should read): Autonomous driving (I copied it from this).
This course gives you strategies for analyzing your problem so that you can pick a direction that actually leads to better results.
Introduction to ML Strategy
Why ML strategy?
- “ML strategy” = How to structure your ML project?
Ideas to improve your ML systems:
- Collect more data.
- Collect more diverse training set.
- Train the algorithm longer with gradient descent.
- Try a different optimization algorithm (e.g. Adam).
- Try bigger network.
- Try smaller network.
- Try dropout.
- Add L2 regularization.
- Change network architecture (activation functions, # of hidden units, etc.)
- However, don't spend too much time on any single one of the above ideas; we need to pick the right direction first!
Orthogonalization
- In orthogonalization, you have some controls, but each control does a specific task and doesn’t affect other controls.
- Chain of assumptions in ML:
- You'll have to fit the training set well on the cost function (near human-level performance, if possible).
- If that's not achieved, you could try a bigger network, another optimization algorithm (like Adam)…
- Fit the dev set well on the cost function.
- If that's not achieved, you could try regularization, a bigger training set…
- Fit the test set well on the cost function.
- If that's not achieved, you could try a bigger dev set…
- Perform well in the real world.
- If that's not achieved, you could try changing the dev set or the cost function…
Setting up your goal
Single number evaluation metric
- Advice: it's better and faster to settle on a single-number evaluation metric for your project before you start it.
- Example: instead of using both precision and recall, just use the F1 score. Check this note.
- A dev set + a single real-number evaluation metric are enough to make a choice!
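A minimal sketch (plain Python; the precision/recall values are made up) of how precision and recall collapse into one F1 number so that two classifiers can be compared directly:

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (a single real-number metric)."""
    return 2 * precision * recall / (precision + recall)

# Hypothetical classifiers: each wins on one of the two metrics, so P/R alone can't decide.
classifiers = {"A": (0.95, 0.90), "B": (0.98, 0.85)}
for name, (p, r) in classifiers.items():
    print(name, round(f1_score(p, r), 4))
# Pick the classifier with the highest F1 on the dev set.
```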
Satisficing and optimizing metrics
- It's sometimes hard to fold everything into one single real-number evaluation metric → instead, set up (many) satisficing metrics + (one) optimizing metric.
- Satisficing (use a threshold): being good enough on this metric is all we need.
- Optimizing: the metric we actually want to be as good as possible, e.g. accuracy!
- Example: the "Hey Siri" wake word,
- Accuracy: does it wake up when called? → optimizing.
- False positives: it wakes up although we didn't call it → set the satisficing threshold to at most 1 false positive per day!
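A small illustrative sketch (plain Python, made-up numbers) of model selection with this split: filter by the satisficing threshold first, then take the best optimizing metric among the survivors:

```python
# (accuracy, false positives per 24h) for some hypothetical wake-word models
models = {"m1": (0.94, 3), "m2": (0.92, 1), "m3": (0.90, 0)}

MAX_FALSE_POSITIVES = 1  # satisficing metric: at most 1 false wake-up per day

# Keep only models that satisfy the threshold, then maximize accuracy (optimizing metric).
ok = {name: (acc, fp) for name, (acc, fp) in models.items() if fp <= MAX_FALSE_POSITIVES}
best = max(ok, key=lambda name: ok[name][0])
print(best)  # -> "m2": the most accurate model among those meeting the threshold
```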
Train/dev/test distributions
- The way we set up the distributions of the train / dev / test sets can have a big impact on how quickly we make progress.
- Dev set = development set = hold-out cross-validation set.
- Advice: Make dev set and test set come from the same distribution!
Size of the dev and test sets
- Old era (less data, < 100,000 examples): 70% train / 30% test, or something like that.
- Now (big data): 98% / 1% / 1%.
- Test set: make your test set big enough to give high confidence in the overall performance of your system.
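A tiny sketch (plain Python with NumPy; `X` is a stand-in for a big dataset) of the 98/1/1 split mentioned above:

```python
import numpy as np

X = np.random.rand(1_000_000, 10)       # stand-in for a big dataset (1M examples)
idx = np.random.permutation(len(X))      # shuffle once, then slice
n_dev = n_test = len(X) // 100           # 1% dev, 1% test, the remaining 98% for training

dev_idx = idx[:n_dev]
test_idx = idx[n_dev:n_dev + n_test]
train_idx = idx[n_dev + n_test:]
X_train, X_dev, X_test = X[train_idx], X[dev_idx], X[test_idx]
print(len(X_train), len(X_dev), len(X_test))  # 980000 10000 10000
```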
When to change dev/test sets and metrics
- Sometimes we've placed our target in the wrong place → we should change the metric!
- Example: cat classification,
- Algorithm A: 3% error but it lets pornographic images through → the train/test metric prefers this one!
- Algorithm B: 5% error but no pornographic images → humans/users prefer this one!
- This is actually an example of an orthogonalization where you should take a machine learning problem and break it into distinct steps:
- Figure out how to define a metric that captures what you want to do (place the target).
- Worry about how to actually do well on this metric (how to aim/shoot accurately at the target).
- Conclusion: if doing well on your metric + dev/test set doesn’t correspond to doing well in your application, change your metric and/or dev/test set.
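One way to "place the target" better in the cat example above is a weighted error metric that heavily penalizes the images you really don't want misclassified; this sketch (plain Python with NumPy; the weight of 10 and the tiny dev set are illustrative) follows that idea:

```python
import numpy as np

def weighted_error(y_pred, y_true, is_porn, w_porn=10.0):
    """Classification error in which mistakes on porn images count w_porn times more."""
    w = np.where(is_porn, w_porn, 1.0)
    mistakes = (y_pred != y_true).astype(float)
    return float((w * mistakes).sum() / w.sum())

# Illustrative dev set: the only mistakes are two porn images that were let through.
y_true  = np.array([1, 0, 1, 1, 0, 0])
y_pred  = np.array([1, 1, 1, 1, 1, 0])
is_porn = np.array([0, 1, 0, 0, 1, 0], dtype=bool)
print(weighted_error(y_pred, y_true, is_porn))  # ~0.83, much worse than the plain 2/6 error
```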
Comparing to human-level performance
Why human-level performance?
- Reasons:
- ML algorithms now work much better and are easier to apply than in the past, but algorithmic progress alone is not enough → we compare against human-level performance (HLP).
- The workflow of building an ML system is much more efficient when the task is something humans can also do.
- Bayes error = best possible error (theory).
- After surpassing HLP, progress slows down. Why?
- HLP is often very close to the Bayes optimal error, e.g. humans can still recognize things in blurry images.
- As long as you are below HLP, there are certain tools you can use to improve performance; once you surpass HLP, those tools no longer help.
- So long as ML is worse than HLP, you can:
- Get labeled data from human.
- Gain insight from manual error analysis: why did a person get this right?
- Better analysis of bias / variance.
Avoidable bias
- Sometimes we don't want the algorithm to fit the training set TOO well (beyond what is actually achievable) → use HLP as the reference.
- Example: cat recognition with the same training and dev errors in two scenarios, but different human-level error:

| | Scenario A | Scenario B |
| --- | --- | --- |
| Humans | 1% | 7.5% |
| Training error | 8% | 8% |
| Dev error | 10% | 10% |

- Scenario A: big gap between training and human error → focus on reducing bias (bigger NN, run training longer, …) → underfitting!
- Scenario B: small gap between training and human error → focus on reducing variance → overfitting!
- Based on the human-level error, decide whether bias or variance is the bigger problem → choose bias reduction or variance reduction accordingly!
- Gap between human and training error → avoidable bias.
- Gap between training and dev error → variance!
Understanding human-level performance
- Use the value nearest to the Bayes error, i.e. the smallest error achieved by humans (e.g. a team of experts), as the human-level error!
- The way we choose the human-level error can change which improvement we focus on (bias reduction or variance reduction).
- Use human level error as a proxy of Bayes error!
Surpassing human-level performance
- When the training error is less than the human error, it's difficult to tell how much avoidable bias is left!
- In some problems, deep learning has surpassed human-level performance. Example: Online advertising, Product recommendation, Loan approval. Structured data.
- In natural perception tasks (speech recognition, computer vision, …), it is much harder for ML to surpass humans!
- In short:
- Machine > humans → structured data.
- Machine > one person → some natural perception tasks.
- Machine > humans → natural perception tasks (the hardest).
Improving your model performance
- The two fundamental assumptions of supervised learning:
- You can fit the training set pretty well. This is roughly saying that you can achieve low avoidable bias.
- The training set performance generalizes pretty well to the dev/test set. This is roughly saying that variance is not too bad.
- To improve your deep learning supervised system follow these guidelines:
- Look at the difference between human level error and the training error - avoidable bias.
- Look at the difference between the dev/test set and training set error - Variance.
- If avoidable bias is large you have these options:
- Train bigger model.
- Train longer/better optimization algorithm (like Momentum, RMSprop, Adam).
- Find better NN architecture/hyperparameters search.
- If variance is large you have these options:
- Get more training data.
- Regularization (L2, Dropout, data augmentation).
- Find better NN architecture/hyperparameters search.
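A small sketch (plain Python; the error values are illustrative) of the decision rule these guidelines describe: compute both gaps and work on whichever one is larger:

```python
def diagnose(human_err, train_err, dev_err):
    """Follow the avoidable-bias / variance recipe and suggest what to try next."""
    avoidable_bias = train_err - human_err
    variance = dev_err - train_err
    if avoidable_bias >= variance:
        return "avoidable bias", ["bigger model", "train longer / better optimizer",
                                  "NN architecture / hyperparameter search"]
    return "variance", ["more training data", "regularization (L2, dropout, data augmentation)",
                        "NN architecture / hyperparameter search"]

print(diagnose(human_err=0.01, train_err=0.08, dev_err=0.10))  # -> avoidable bias dominates here
```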
Error Analysis
Carrying out error analysis
- Error analysis = manually examining the mistakes your algorithm is making; it can give you insight into what to do next.
- Example: in cat recognition, several factors contribute to the errors. Build an ERROR ANALYSIS table to evaluate multiple ideas in parallel:

| Image | Dog | Great Cats | Blurry | Instagram filters | Comments |
| --- | --- | --- | --- | --- | --- |
| 1 | ✓ | | | ✓ | Pitbull |
| 2 | ✓ | | ✓ | ✓ | |
| 3 | | | | | Rainy day at zoo |
| 4 | | ✓ | | | |
| … | | | | | |
| % of total | 8% | 43% | 61% | 12% | |

- We focus on Great Cats and Blurry (they account for the largest share of the errors).
- To carry out error analysis, find a set of misclassified examples in the dev set → go through them by hand (false positives and false negatives), counting the errors per category → decide whether to create a new category.
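A short sketch (plain Python; the manual annotations are hypothetical) of the counting behind such a table: tag the misclassified dev examples with categories, then compute each category's share:

```python
from collections import Counter

# Hypothetical manual annotations of misclassified dev-set images
# (one set of tags per example; an example may belong to several categories).
mislabeled = [
    {"dog", "instagram"},
    {"dog", "blurry", "instagram"},
    set(),            # "rainy day at zoo": none of the tracked categories
    {"great cats"},
    # ... in practice you would examine ~100 examples
]

counts = Counter(tag for tags in mislabeled for tag in tags)
for tag, n in counts.most_common():
    print(f"{tag}: {100 * n / len(mislabeled):.0f}% of examined errors")
# Work first on the categories with the largest share: they set the ceiling on improvement.
```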
Cleaning up incorrectly labeled data
- In the training set: DL algorithms are quite robust to random label errors → we can often ignore them!
- However, DL algorithms are much LESS robust to systematic errors.
- Solution: use the error-analysis table to decide which types of error to focus on next (based on their fraction of the total errors).
- (Recall) The purpose of the dev set is to help you choose between two classifiers A and B.
- If you decide to fix labels:
- Apply the same process to dev and test sets and make sure they come from the same distribution!
- Also examine examples your algorithm got right (not only the ones it got wrong) → otherwise we risk an overfitting / biased-estimate problem!
- Train and dev/test may come from slightly different distributions → it is less important to correct mislabeled examples in the training set!
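A quick back-of-the-envelope sketch (plain Python; the numbers are illustrative, in the spirit of the course's example) for deciding whether fixing labels is worth it: compare the error caused by wrong labels with the error from everything else:

```python
dev_error = 0.10           # overall dev-set error (illustrative)
frac_due_to_labels = 0.06  # share of the examined errors caused by incorrect labels

error_from_labels = dev_error * frac_due_to_labels  # 0.6% of the dev set
error_from_other = dev_error - error_from_labels    # 9.4% of the dev set
print(error_from_labels, error_from_other)
# 9.4% >> 0.6%, so fixing labels is probably not the highest-impact next step here.
```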
When starting a new project?
Advice: Build your first system quickly and then iterate!!
- Quickly set up dev/test sets + metric.
- Build initial system quickly.
- Do bias/variance analysis and error analysis → prioritize the next step!
Mismatched training and dev/test set
Training & testing on different distributions
- Example: training (photos from internet, 200K), dev & test (photos from phone, 4K).
- Shouldn't: shuffle all 204K images and split them into train/dev/test!
- Should:
- Train = 200K (web) + 2K (mobile).
- Dev = Test = 1K (mobile).
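A small sketch (plain Python with NumPy; `web_imgs` and `mobile_imgs` are stand-in arrays) of this split, where all dev/test data comes from the distribution you actually care about (mobile) and everything else goes into training:

```python
import numpy as np

web_imgs = np.arange(200_000)    # stand-in for 200K web images
mobile_imgs = np.arange(4_000)   # stand-in for 4K mobile images (the target distribution)

rng = np.random.default_rng(0)
mobile_shuffled = rng.permutation(mobile_imgs)

dev = mobile_shuffled[:1_000]                                 # mobile only
test = mobile_shuffled[1_000:2_000]                           # mobile only
train = np.concatenate([web_imgs, mobile_shuffled[2_000:]])   # web + remaining mobile
print(len(train), len(dev), len(test))                        # 202000 1000 1000
```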
Bias and Variance with mismatched data distributions
- Sometimes dev error > training error simply because (possibly) the data in the dev set is harder to predict than the data in the training set.
- When going from the training error to the dev error, two things change at once:
- The algorithm saw the data in the training set but not in the dev set.
- The distribution of the data in the dev set is different!
- IDEA: create a new "train-dev" set which has the same distribution as the training data but is not used for training.
- Keys to be considered: Human error, Train error, Train-dev error, Dev error, Test error:
- Avoidable bias = train - human.
- Variance problem = train-dev - train
- Data mismatch = dev - train-dev
- Overfitting to dev set = test - dev
- If there is a huge gap between dev & test error → you have over-tuned to the dev set → you may need a bigger dev set!
- Example 1: A high variance problem! (train/train-dev big, train-dev/dev small)
- Human error: 0%
- Train error: 1%
- Train-dev error: 9%
- Dev error: 10%
- Example 2: data mismatch problem (train/train-dev small, train-dev/dev big)
- Human error: 0%
- Train error: 1%
- Train-dev error: 1.5%
- Dev error: 10%
- Example 3: avoidable bias problem (because training err is much worse than human level, others are small)
- Human error: 0%
- Train error: 10%
- Train-dev error: 11%
- Dev error: 12%
- Example 4: avoidable bias problem AND data mismatch problem (human/train gap big, train-dev/dev gap big)
- Human error: 0%
- Train error: 10%
- Train-dev error: 11%
- Dev error: 20%
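A compact sketch (plain Python; the numbers follow example 4 above, and the test error is an extra illustrative value) that computes all four gaps from this section at once:

```python
errors = {"human": 0.00, "train": 0.10, "train_dev": 0.11, "dev": 0.20, "test": 0.21}

gaps = {
    "avoidable bias":     errors["train"] - errors["human"],
    "variance":           errors["train_dev"] - errors["train"],
    "data mismatch":      errors["dev"] - errors["train_dev"],
    "overfitting to dev": errors["test"] - errors["dev"],
}
for name, gap in sorted(gaps.items(), key=lambda kv: -kv[1]):
    print(f"{name}: {gap:.2%}")
# The biggest gaps (here avoidable bias and data mismatch) tell you what to address first.
```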
- Remark: most of the time the errors get worse step by step as we go from human-level error down to test error. Sometimes, however, the dev error is actually lower than the train-dev error (the dev/test data is easier) → in that case, rewrite all the above errors into a more general table like this,
Error table. Image from the course.
- We measure by hand the two 6% errors in the figure (the human-level error on the dev/test-distribution data and the model's error on it) to judge the quality of the dev/test error → in the figure, the model is in fact doing GOOD!
Addressing data mismatch
- Addressing data mismatch (not guaranteed to work, but worth trying):
- Carry out manual error analysis → try to understand the differences between the training set and the dev/test sets.
- Make the training data more similar to the dev/test data, or collect more data similar to the dev/test sets.
- Artificial data synthesis:
- e.g. "the quick brown fox jumps over the lazy dog" is a short sentence that contains all the letters A-Z in English.
- Manually create data (combine two different types of data, e.g. clean audio + car noise). However, BE CAREFUL if one of the two sources is much smaller than the other: the model may overfit to the smaller one!
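A minimal sketch (plain Python with NumPy; the audio arrays are synthetic stand-ins, following the clean-speech + car-noise idea) of artificial data synthesis, and of why a tiny noise set is risky:

```python
import numpy as np

rng = np.random.default_rng(0)
clean_speech = rng.standard_normal((1_000, 16_000), dtype=np.float32)  # 1,000 one-second clips (stand-ins)
car_noise = rng.standard_normal((10, 16_000), dtype=np.float32)        # only 10 distinct noise clips!

# Synthesize "in-car" speech by mixing each clean clip with a randomly chosen noise clip.
noise_idx = rng.integers(0, len(car_noise), size=len(clean_speech))
synthetic = clean_speech + 0.3 * car_noise[noise_idx]

# Each noise clip gets reused ~100 times: to a human it all "sounds like car noise",
# but the model may overfit to these 10 specific recordings.
print(synthetic.shape, np.bincount(noise_idx))
```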
Learning from multiple tasks
Transfer learning
- IDEA: you already have a network trained on one task (Task A) + you don't have enough data for the current task (Task B) → you can reuse the trained network for the current task.
Transfer learning. Image from the course.
- To do transfer learning, delete the last layer of the NN together with its weights, and then:
- Option 1: if you have a small data set → keep all the other weights fixed. Add new last layer(s), initialize the new layer's weights randomly, feed the new data to the NN, and learn only the new weights.
- Option 2: if you have enough data, you can retrain all the weights.
- Pretraining = training on task A.
- Fine-tuning = starting from the pretrained weights + using the new data to train on task B.
- This idea is useful because some layers of the trained NN contain information that is helpful for the new problem.
- Transfer learning makes sense when (transferring from A to B):
- Task A and B have the same input X.
- You have a lot more data for task A than task B.
- Low level features from A could be helpful for learning B.
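A minimal sketch of option 1 (using PyTorch/torchvision with an ImageNet-pretrained ResNet-18 as a stand-in for the Task-A network; the 2-class head for Task B is an assumption for illustration):

```python
import torch
import torch.nn as nn
from torchvision import models

# Pretrained on Task A (ImageNet); the `weights` argument assumes torchvision >= 0.13.
model = models.resnet18(weights="DEFAULT")

# Option 1 (small Task-B dataset): freeze every pretrained weight...
for p in model.parameters():
    p.requires_grad = False

# ...then replace the last layer with a freshly initialized head for Task B (here: 2 classes).
model.fc = nn.Linear(model.fc.in_features, 2)

# Only the new head's parameters get trained (fine-tuning on Task-B data).
optimizer = torch.optim.Adam(model.fc.parameters(), lr=1e-3)
# Option 2 (enough data): skip the freezing loop and pass model.parameters() to the optimizer.
```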
Multi-task learning
- One NN does several tasks at the same time, and each of these tasks hopefully helps all of the others!
- Example: autonomous driving → detect several things (not only one) at the same time: pedestrians, other cars, stop signs, traffic lights, …
Multi-task learning. Image from the course.
- We use logistic regression (one sigmoid per label) for the last layer. It's DIFFERENT from softmax regression because here a single image can carry several labels at the same time!
- If some entries of Y are unknown (e.g. we don't know whether there is a traffic light or not), we sum the loss only over the known entries and simply ignore the unknown ones!
- Multi-task learning makes sense when:
- Training on a set of tasks that could benefit from having shared lower-level features.
- Usually: the amount of data you have for each task is quite similar.
- You can train a big enough NN to do well on all the tasks.
- In general (if you have ENOUGH DATA), multi-task learning gives better performance!
- Other remarks:
- Multi-task learning (usually) works well in object detection.
- (Usually) transfer learning is USED MORE OFTEN (and works better in practice) than multi-task learning!
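A small sketch (PyTorch; the tiny network, the 4 labels, and the use of -1 as the "unknown label" marker are illustrative assumptions) of the multi-label output plus the "ignore unknown labels" trick:

```python
import torch
import torch.nn as nn

n_labels = 4  # e.g. pedestrian, car, stop sign, traffic light
net = nn.Sequential(nn.Linear(128, 64), nn.ReLU(), nn.Linear(64, n_labels))  # one logit per label

x = torch.randn(8, 128)                              # a batch of 8 (fake) feature vectors
y = torch.randint(0, 2, (8, n_labels)).float()       # multi-label targets
y[0, 3] = -1.0                                       # -1 marks "label unknown" for this entry

logits = net(x)
loss_fn = nn.BCEWithLogitsLoss(reduction="none")     # independent sigmoid loss per label
per_label = loss_fn(logits, y.clamp(min=0.0))        # clamp so masked entries are still valid targets
mask = (y >= 0).float()                              # 1 where the label is known, 0 otherwise
loss = (per_label * mask).sum() / mask.sum()         # average only over the known labels
loss.backward()
```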
End-to-end deep learning
- Some data-processing systems require multiple stages of processing. End-to-end learning replaces all of them with a single NN.
- Example: machine translation from English to French: in this case, end-to-end works better than a pipeline of separate components because there is plenty of (English, French) paired data!
End-to-end learning. Image from the course.
- Example: an automatic gate-opening (face recognition) system: in this case, splitting into separate tasks works better than end-to-end:
- First detect the person's head/face.
- Then identify the person (the name).
- When end-to-end works, it works very well!
- Pros & Cons:
- Pros:
- Lets the data speak.
- Less hand-designing of components is needed.
- Cons:
- May need a lot of data.
- Excludes potentially useful hand-designed components.
- If you have enough data → you can think of using end-to-end!
- Advice: carefully choose which X → Y mapping to learn, depending on which tasks you can get data for!
👉 Course 4 – Convolutional Neural Networks.