year=2024
month=5
day=10
hour=12
webrequest has multiple countries for the same IP address, and multiple ISPs for the same country.
select * from webrequest where year = 2024 and month = 5 and day = 10 and hour = 12 and ip = ''
Hypothesis so far: maybe some workers are getting MaxMind updates on a staggered schedule relative to others, so there's always some variation?
Indeed, different versions of the database seem to be present on cluster hosts.
```python
#%%
@F.udf()
def maxmind() -> str:
    import maxminddb
    from datetime import datetime

    reader = maxminddb.open_database('/usr/share/GeoIP/GeoIP2-City.mmdb')
    ts_epoch = reader.metadata().build_epoch
    return datetime.fromtimestamp(ts_epoch).strftime('%Y-%m-%d %H:%M:%S')


@F.udf()
def host() -> str:
    import socket

    return socket.gethostname()


#%%
out = (
    spark.range(1, 10000)
    .repartition("id")
    .withColumn("maxmind", maxmind())
    .withColumn("host", host())
).cache()

#%%
out.groupBy("maxmind").agg(F.collect_set("host").alias("hosts")).show(truncate=False)
```
Returns
```
+-------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|maxmind            |hosts                                                                                                                                                                                                             |
+-------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|2024-05-28 15:24:59|[an-worker1142, an-worker1104, an-worker1107, an-worker1139, an-worker1125, an-worker1120, analytics1076, an-worker1165, an-worker1159, an-worker1161, an-worker1118, an-worker1103, an-worker1098, an-worker1126]|
|2024-04-23 12:31:09|[an-worker1132, an-worker1168, an-worker1141, an-worker1144, analytics1071, an-worker1111, an-worker1160, an-worker1102, an-worker1122]                                                                            |
+-------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
```
Note this is only looking at the city db, and only for a random subset of hosts that spark happens to run partitions on.
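For context on what the `build_epoch` read by the UDF actually is: it is a field of the metadata map stored at the end of every `.mmdb` file. Per the MaxMind DB file format spec, that section begins right after the last occurrence of a fixed marker, so locating the raw metadata bytes needs no library at all. A stdlib-only sketch (decoding the map itself is left to `maxminddb`):

```python
# Per the MaxMind DB file format spec, the metadata section follows the
# LAST occurrence of this marker in the file.
MMDB_MARKER = b"\xab\xcd\xefMaxMind.com"


def metadata_bytes(data: bytes) -> bytes:
    """Return the raw metadata map bytes of an in-memory MMDB image.

    Raises ValueError if the marker is missing (i.e. not an MMDB file).
    """
    idx = data.rfind(MMDB_MARKER)
    if idx == -1:
        raise ValueError("not an MMDB file: metadata marker not found")
    return data[idx + len(MMDB_MARKER):]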
@BTullis I added DPE-SRE. Can you look into this and see if the version of the maxmind dbs is the same on all hadoop workers?
With the exception of one worker that seemingly can't be reached, the DB seems to be exactly the same everywhere:
```
brouberol@cumin2002:~$ sudo cumin 'an-worker*.eqiad.wmnet' 'md5sum /usr/share/GeoIP/GeoIP2-City.mmdb'
100 hosts will be targeted:
an-worker[1078-1177].eqiad.wmnet
OK to proceed on 100 hosts? Enter the number of affected hosts to confirm or "q" to quit: 100
===== NODE GROUP =====
(1) an-worker1165.eqiad.wmnet
----- OUTPUT of 'md5sum /usr/shar...GeoIP2-City.mmdb' -----
ssh: connect to host an-worker1165.eqiad.wmnet port 22: Connection timed out
===== NODE GROUP =====
(99) an-worker[1078-1164,1166-1177].eqiad.wmnet
----- OUTPUT of 'md5sum /usr/shar...GeoIP2-City.mmdb' -----
784d268a38b213e9fafa572b193d1e1a /usr/share/GeoIP/GeoIP2-City.mmdb
================
PASS |███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████ | 99% (99/100) [00:21<00:00, 4.66hosts/s]
FAIL |█▉
```
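The cumin node-group output above is essentially a group-by on the file digest. The same check can be sketched in plain Python, assuming the per-host digests have already been collected (host names and digests below are illustrative, not real fleet data):

```python
import hashlib
from collections import defaultdict


def md5_of(path: str) -> str:
    """md5 hex digest of a file, read in 1 MiB chunks."""
    h = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()


def group_by_digest(digests: dict[str, str]) -> dict[str, list[str]]:
    """Map digest -> hosts reporting it; a single key means the fleet agrees."""
    groups = defaultdict(list)
    for host, digest in digests.items():
        groups[digest].append(host)
    return dict(groups)
```

A fleet is consistent exactly when `group_by_digest` returns one key; any second key lists the hosts with a stale copy.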
The snippet pasted above now returns the same MaxMind metadata for all hosts the job ran on: 2024-10-29 19:58:23.