2 Data Description
2.1 Original Dataset
The original dataset provided by Zheng et al. (2011) was collected in the GeoLife project (Microsoft Research Asia) by 182 users in a period of over three years (from April 2007 to August 2012). A GPS trajectory of this dataset is represented by a sequence of time-stamped points, each of which contains the information of latitude, longitude and altitude. This dataset contains 17’621 trajectories with a total distance of about 1.2 million kilometers and a total duration of 48,000+ hours. These trajectories were recorded by different GPS loggers and GPS-phones, and have a variety of sampling rates. 91 percent of the trajectories are logged in a dense representation, e.g. every 1~5 seconds or every 5~10 meters per point.
This dataset consists of a broad range of users’ outdoor movements, including not only life routines like go home and go to work but also some entertainments and sports activities, such as shopping, sightseeing, dining, hiking, and cycling. 73 users have labeled their trajectories with transportation mode, such as driving, taking a bus, riding a bike and walking.
2.2 Processed dataset
We downloaded Version 1.2.2 of the original dataset (on the 19.11.2024) and processed it in the following manner:
- Merged the data of all users into a single dataset
- Added transport mode labels and removed all trajectories without a transport mode label.
- Split the trajectories into segments based on the user id, transportation mode and time difference between consecutive points. A new segment is created if the time difference is larger than 10 minutes.
- Split the segments (from the previous step) further based on the distance between consecutive points. A new segment is created if the distance is larger than 100 meters. The created segment ids are unique across all users.
- Removed all segments with less than 100 points.
- Projected the data into UTM zone 50N (EPSG: 32650)
- Removed all segments that move outside of the bounding box of Beijing (406993 , 487551 , 4387642, 4463488 in EPSG 32650)
- Split the data into 4 sets of training, testing and validation data.
The full process is documented in this GitHub Repository.