So, I personally keep a local test set which I don’t use for any modeling purpose. I use it only to check the final performance of a model before submitting to Kaggle. It’s important that performance on this local test set moves hand-in-hand with the LB score, and that’s why creating it gets tricky sometimes. I try both a random split and a time-based split (if possible). There’s no simple rule for this.
Create a validation set in a similar manner; the leftover data is your actual training data. Do all the trend and noisy-feature analysis on this train and validation data, as in the sketch below.
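Here’s a minimal sketch of the random version of that split, assuming a pandas DataFrame loaded from application_train.csv with its TARGET column; the split fractions and random seed are just placeholders, not anything I tuned.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.read_csv("application_train.csv")

# Carve out the local test set first -- never used for modeling,
# only for a final sanity check before submitting to Kaggle.
train_val, local_test = train_test_split(
    df, test_size=0.15, stratify=df["TARGET"], random_state=42
)

# From what's left, carve out a validation set; the remainder is the
# actual training data used for trend / noisy-feature analysis.
train, valid = train_test_split(
    train_val, test_size=0.15, stratify=train_val["TARGET"], random_state=42
)

print(len(train), len(valid), len(local_test))
```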
In real-life problems, though, you should be fine with just a time-based split for both the test and validation sets. The reason is that you want your model to perform consistently across different time periods, and you aren’t trying to optimize your model against a single fixed test set like on Kaggle.
In this competition, you can try using the ‘DAYS_’ features for a time-based split. Some of them tell you how long ago the loan was applied (or something along those lines). I personally didn’t compete in this one and only used the data for illustration here, so I’m not sure what works best to get a good LB rank. A rough sketch of such a split follows.
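Below is a rough, hypothetical sketch of a time-based split. I’m using previous_application.csv and its DAYS_DECISION column purely as an illustration; swap in whichever table and ‘DAYS_’ column actually reflects recency for your setup, and treat the split fractions as arbitrary.

```python
import pandas as pd

# DAYS_ columns are (negative) day counts relative to the current application,
# so values closer to 0 correspond to more recent rows.
df = pd.read_csv("previous_application.csv")
df = df.sort_values("DAYS_DECISION")  # oldest rows first

n = len(df)
train = df.iloc[: int(0.70 * n)]                 # oldest 70% -> train
valid = df.iloc[int(0.70 * n): int(0.85 * n)]    # next 15%  -> validation
local_test = df.iloc[int(0.85 * n):]             # most recent 15% -> local test

print(train["DAYS_DECISION"].max(), valid["DAYS_DECISION"].min())
```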