The goal of the competition is to find browsing logs which belong to the same user.
More details at http://cikmcup.org and https://competitions.codalab.org/competitions/11171
- Convert user ids into integers so they occupy less RAM
- Split the train data into 2 folds based on connected components
- Use Elastic Search and More-Like-This queries to find top pair candidates
- Split logs into sessions (using 30 minute intervals) and compute the user "profile" (log features):
- Number of sessions
- Clicks within session
- Duration of breaks between sessions
- Starts and ends of sessions
- Title-based, Domain-based and Url-based similarities within sessions
- For candidates retrieved with Elastic Search, compute the following features:
- Absolute difference between the profile features
- Cosine between domains, full urls and titles
- Train an xgboost model for predicting if a candidate pair corresponds to the same user or not
1_prepare_data.py
: preprocesses the data2_data_to_elastic.py
: puts the log data to elastic search3_candidates_elastic.py
: uses elastic search for retrieving top 70 candidates for each user4_session_vectorizers.py
: "trains" count vectorizers for urls, domains and titles for user sessions5_user_profiles.py
: extracts profile information from each user log6_pair_features.py
: computes features for each candidate pair7_model.py
: trains the xgb model and creates the submission file
- This solution was presented at Berlin Machine Learning meetup. See the slides here.