How Do We Map Geolocation?
For every device connected to the Internet, we try to map its geolocation. The geolocation data comes from two sources: the client and the IP address. For users (clients) who share their GPS location, we convert the latitude and longitude to the corresponding location using a data analytics algorithm. For users who do not share their GPS location, we derive the location from the IP address.
What Does Geolocation Mean To Us?
Geolocation is one of our most important data points, with multiple uses:
- Personalizing the Indus App Bazaar experience.
- Sourcing apps from developers based on how users are distributed across geographies.
- Targeting users with relevant apps.
- Localization by recommending locally relevant apps and content to the user based on their demographics.
The Inaccuracy Involved
Through extensive research on GPS location, we are able to place a user to within 10 meters. However, most users feel that “my location should remain confidential”, so for the majority of users we do not receive a GPS location.
When we map geolocation from IP addresses using third-party services, we observe that localities are often marked as cities, for example Koramangala, Lajpat Nagar and Worli. This leads to mistargeting: some users are not targeted at all, which in turn hurts campaign performance, personalization and VOCs.
Our early attempts to fix this ran into several challenges:
- Scalability: We started with K-means and DBSCAN clustering using Euclidean distance. With 100M+ users, clustering took an infeasible amount of time.
- Finding the right K (number of clusters): The elbow method gives the statistically best (most heterogeneous) clusters, but in practice the right K depends on the city clusters under consideration. Too few clusters do not make sense for the problem, while too many make the results inaccurate and hard to scale.
- Distance measure: Euclidean distance is readily available in most frameworks, but it is the wrong way to measure geographic distance between latitude/longitude points.
- Slow web scraping: We tried to map each city’s center to the user’s IP address location. In our distributed system, only the master node was used for scraping geo-coordinates, so scraping just 24K addresses took almost 24 hours.
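To illustrate the distance-measure point above, here is a minimal sketch of why Euclidean distance on raw latitude/longitude degrees is misleading: on a sphere, one degree of longitude covers fewer kilometers the further you move from the equator, while one degree of latitude stays roughly constant. The Earth radius constant, the function name and the sample latitudes are assumptions made for this illustration, not part of our pipeline.

```python
import math

EARTH_RADIUS_KM = 6371.0  # assumed mean Earth radius (spherical model)

# On a sphere, one degree of latitude spans the same arc length everywhere.
KM_PER_DEGREE_LATITUDE = math.radians(1.0) * EARTH_RADIUS_KM


def km_per_degree_longitude(lat_deg):
    """Length in km of one degree of longitude at the given latitude (degrees)."""
    return KM_PER_DEGREE_LATITUDE * math.cos(math.radians(lat_deg))


# Illustrative latitudes: the equator and two approximate Indian city latitudes.
for lat in (0.0, 12.97, 28.61):
    print(f"lat {lat:>5}: 1 deg lon = {km_per_degree_longitude(lat):.1f} km, "
          f"1 deg lat = {KM_PER_DEGREE_LATITUDE:.1f} km")
```

Because the two axes do not share a scale, a Euclidean distance computed directly on degree values weighs east-west offsets differently depending on latitude.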
After researching multiple ways to address the inaccuracy in geolocation, we arrived at an approach that merges localities into the closest city, a.k.a. the city cluster. This improves user coverage under a city cluster (i.e. the recall metric) as well as accuracy. The motive behind this exercise is to group anomalies in the location, particularly in the city.
To find the nearest city to a locality, we need the distance between the locality and the city center. After extensive research we settled on the Haversine distance: the great-circle distance between two points on a sphere, given their latitudes and longitudes. The first coordinate of each point is assumed to be the latitude and the second the longitude, both in radians. We wrote our own implementation to calculate this distance.
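As a rough illustration of the Haversine distance described above, here is a minimal sketch using the standard formula. It is not our production implementation: the function name `haversine_km`, the mean Earth radius of 6371 km and the sample coordinates are assumptions made for this example. Following the convention above, each point is a (latitude, longitude) pair in radians.

```python
import math

EARTH_RADIUS_KM = 6371.0  # assumed mean Earth radius (spherical model)


def haversine_km(p, q):
    """Great-circle distance in km between p and q, each (lat, lon) in radians."""
    (lat1, lon1), (lat2, lon2) = p, q
    a = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    # Clamp to 1.0 to guard against floating-point overshoot near antipodal points.
    return 2 * EARTH_RADIUS_KM * math.asin(min(1.0, math.sqrt(a)))


# Approximate, illustrative coordinates in degrees, converted to radians.
bengaluru_center = (math.radians(12.9716), math.radians(77.5946))
koramangala = (math.radians(12.9352), math.radians(77.6245))
print(f"{haversine_km(koramangala, bengaluru_center):.2f} km")
```

Coordinates usually arrive in degrees, so the conversion with `math.radians` is an easy step to forget; passing degrees straight in gives wrong distances without raising any error.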
Final Approach – Nearest Neighbour Classifier
Since the number of clusters (K) was an open question, we started with predefined centroids (cluster centers) and then assigned each locality to its nearest cluster center.
- Client GPS location is given preference over the location derived from the IP address.
- Standardize the locality string to Camel Case. Also, remove special characters and numbers.
- Get the unique set of cities and states. Form the address by concatenating city and state.
- Filter out addresses for which geo-coordinates have not been scraped.
- Scrape geo-coordinates only for newly added locations. Multithreading is used to speed up scraping.
- Calculate the Haversine distance (in km) between the city clusters and the locality coordinates using the custom-built algorithm.
- Select the nearest city cluster for each locality, along with the distance in km. The cities to be treated as city clusters can be customized.
- Decide the range in km from the city cluster within which a locality is considered part of the city.
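The last three steps can be sketched roughly as follows. This is a simplified illustration under stated assumptions, not our actual code: the names `nearest_city_cluster` and `CITY_CENTERS`, the approximate coordinates and the 30 km range are all hypothetical and chosen only for the example.

```python
import math

EARTH_RADIUS_KM = 6371.0  # assumed mean Earth radius (spherical model)


def haversine_km(p, q):
    """Great-circle distance in km between p and q, each (lat, lon) in radians."""
    (lat1, lon1), (lat2, lon2) = p, q
    a = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(min(1.0, math.sqrt(a)))


def to_radians(point_deg):
    """Convert a (lat, lon) pair from degrees to radians."""
    return tuple(math.radians(value) for value in point_deg)


def nearest_city_cluster(locality_deg, city_centers_deg, max_km):
    """Return (city, distance_km) for the nearest predefined center.

    The city is None when even the nearest center lies beyond max_km.
    """
    point = to_radians(locality_deg)
    city, distance = min(
        ((name, haversine_km(point, to_radians(center)))
         for name, center in city_centers_deg.items()),
        key=lambda pair: pair[1],
    )
    return (city, distance) if distance <= max_km else (None, distance)


# Hypothetical predefined centroids: approximate (lat, lon) in degrees.
CITY_CENTERS = {
    "Bengaluru": (12.9716, 77.5946),
    "Mumbai": (19.0760, 72.8777),
    "Delhi": (28.6139, 77.2090),
}

koramangala = (12.9352, 77.6245)  # approximate locality coordinates
city, km = nearest_city_cluster(koramangala, CITY_CENTERS, max_km=30)
print(city, round(km, 1))
```

Because the centroids are fixed in advance, each locality needs only one distance computation per candidate city, which avoids both the choice of K and the iterative cost of clustering every user.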