MaxMind has a nice geo IP database that maps known IP ranges to geocoordinates. Looking up geo-address information from IP addresses is more and more essential in today’s data-driven processes. MaxMind provides the data in various forms in their GeoLite2 database, and they also provide some tools that optimize the lookup process. However, I often find that these methods are not fit for certain applications where no integration methods are available and the processing is done purely in the database (think of data warehouses or old-school fraud-monitoring software). I mostly work with Oracle, and I haven’t found a ready-made solution for this, so I’ll share mine with the general public. You can access it on GitHub: https://github.com/harangp/maxmind-oracle
The database contains the geolocation of an IP range - and not individual IP addresses. You can find more on the range / subnet topic here:
The main thing is, if you examine the country_blocks.network field, you’ll find data something like this: 192.168.5.64/26, which is a text representation, so it must be transformed before it can be worked with.
There can be more than one subnet that matches your query against the database - and this is totally normal. You have your ISP’s internal subnet, your ISP’s global subnet, your country’s subnet, a regional one, etc. The challenge is to find the one that has the longest match for the IP you are trying to geolocate. It’s obvious that matching text input against text in the database column (and essentially doing the calculations at query time) won’t result in an efficient solution, so we’ll have to design a different approach.
There’s a lesser-known (albeit old) feature called function-based virtual columns. The main concept is that you can add a column that is derived from the original (physical) columns through some expression, so the value can be calculated from the original data on the fly, or even stored upon creation/modification. If the expression is deterministic (meaning it yields the same output for the same input), it can be used for other purposes as well - in our case, to be included in an index.
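As a minimal sketch of what such a column could look like - assuming a country_blocks table whose network column holds the CIDR text shown above (the names follow the GeoLite2 CSV import, not necessarily the actual schema):

```sql
-- Hypothetical names: country_blocks / network come from the GeoLite2 CSV import.
-- The expression pulls the trailing digits after the '/'
-- (e.g. '26' from '192.168.5.64/26').
ALTER TABLE country_blocks ADD (
  significant_bits NUMBER GENERATED ALWAYS AS (
    TO_NUMBER(REGEXP_SUBSTR(network, '\d+$'))
  ) VIRTUAL
);
```

Since the expression is deterministic, Oracle lets us index this column or include it in a composite index later on.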
Comparing characters, or chains of characters, is less effective than comparing numbers. IPv4 addresses are essentially 32-bit unsigned integers, so comparing one or multiple integers against one input number yields a significant performance gain on modern processors with vector/matrix instruction sets. An additional gain is that storing numbers instead of varchar2 takes up less space both on disk and in memory.
Combining the above-mentioned approaches, we generate virtual columns that extract and calculate the values we need. All of them are based only on the string given in the network field. The IP address comes in the usual format of 192.168.5.64, and the IP range looks like 192.168.5.64/26. So to retrieve the numerical representation as an integer, we have to split, convert, shift, and add the numbers. Thanks to regexp, this can easily be done like this*:
*considering the address comes in the variable
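The original snippet isn’t reproduced here, but as a hedged sketch, the dotted-quad-to-integer conversion could look like this (assuming the address arrives in a column or variable named ip - a name of my choosing):

```sql
-- Each octet is extracted with REGEXP_SUBSTR, then shifted into place by
-- multiplying with the matching power of 256 (a left shift of 24/16/8/0 bits).
  TO_NUMBER(REGEXP_SUBSTR(ip, '\d+', 1, 1)) * 16777216
+ TO_NUMBER(REGEXP_SUBSTR(ip, '\d+', 1, 2)) * 65536
+ TO_NUMBER(REGEXP_SUBSTR(ip, '\d+', 1, 3)) * 256
+ TO_NUMBER(REGEXP_SUBSTR(ip, '\d+', 1, 4))
```

For 192.168.5.64 this expression yields 3232236864.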
Upon querying the input address, there’s the problem of comparing the IP range to a number that is specifically masked for the defined range (the /xx part of the network field, or the significant_bits field we’ve just calculated). This would mean that, for each record, we would have to generate the masked address of the input IP address and compare it against the masked network number (found in the virtual column). Instead, we generate all the possible masked variants of the input address - there are 32 of them - and select those from the lookup table.
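A hedged sketch of generating those 32 variants with a hierarchical query (BITAND and the power-of-two masks do the per-prefix masking; the bind variable name is an assumption):

```sql
-- For each prefix length n (LEVEL = 1..32) the mask keeps the top n bits:
-- mask = 2^32 - 2^(32-n). BITAND applies it to the input address.
SELECT BITAND(:numip, 4294967296 - POWER(2, 32 - LEVEL)) AS masked_ip
FROM dual
CONNECT BY LEVEL <= 32
```

The LEVEL = 32 row produces the unmasked address itself (mask of all ones), so the exact /32 match is covered too.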
numip is the numerical representation of the IP address we try to look up.
From this point on, it is quite simple: just select all the records that match on the masked_network field. Ordering the results descending by the significant_bits virtual column and selecting the first element yields our best match, which is the result we’d like to get.
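Putting it together, a sketch of the full lookup might read like this - assuming the country_blocks table and the masked_network / significant_bits virtual columns named earlier, with :numip as the numeric input address:

```sql
SELECT *
FROM (
  SELECT cb.*
  FROM country_blocks cb
  JOIN (
    -- all 32 masked variants of the input address
    SELECT BITAND(:numip, 4294967296 - POWER(2, 32 - LEVEL)) AS masked_ip
    FROM dual
    CONNECT BY LEVEL <= 32
  ) m ON cb.masked_network = m.masked_ip
  ORDER BY cb.significant_bits DESC
)
WHERE ROWNUM = 1
```

The inner query returns every matching subnet; the outer ROWNUM = 1 filter keeps only the longest-prefix match.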
This means we’ll have to generate only one index, on the masked_network field, and we are good to go. Using this technique, instead of lengthy scans, we have 32 index-based direct lookups, which the database can further parallelize during execution.
If you try this method, you’ll get a nearly optimal solution on the latest Oracle releases. My experience is that with a small dataset (a partial import) this approach works fine, but with the whole dataset the DB tries to outsmart you by doing full-table scans, since it needs to fetch the data for the significant_bits column, and Oracle doesn’t like that. The optimal approach is to include the significant_bits field in the index, and then it can be used for sorting.
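A sketch of such a composite index (names assumed as above); with significant_bits included, both the equality match and the descending sort can be satisfied from the index alone:

```sql
CREATE INDEX ix_country_blocks_lookup
  ON country_blocks (masked_network, significant_bits DESC)
```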