year=2024
month=5
day=10
hour=12
webrequest has multiple countries for the same IP address, and multiple ISPs for the same country.
select * from webrequest where year = 2024 and month = 5 and day = 10 and hour = 12 and ip = ''
Hypothesis so far: maybe some workers are getting MaxMind updates on a staggered schedule relative to others, so there's always some variation?
Indeed, different versions of the database seem to be present on cluster hosts.
```python
#%%
@F.udf()
def maxmind() -> str:
    import maxminddb
    from datetime import datetime

    reader = maxminddb.open_database('/usr/share/GeoIP/GeoIP2-City.mmdb')
    ts_epoch = reader.metadata().build_epoch
    return datetime.fromtimestamp(ts_epoch).strftime('%Y-%m-%d %H:%M:%S')


@F.udf()
def host() -> str:
    import socket

    return socket.gethostname()


#%%
out = (
    spark.range(1, 10000)
    .repartition("id")
    .withColumn("maxmind", maxmind())
    .withColumn("host", host())
).cache()

#%%
out.groupBy("maxmind").agg(F.collect_set("host").alias("hosts")).show(truncate=False)
```
Returns
```
+-------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|maxmind            |hosts                                                                                                                                                                                                             |
+-------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|2024-05-28 15:24:59|[an-worker1142, an-worker1104, an-worker1107, an-worker1139, an-worker1125, an-worker1120, analytics1076, an-worker1165, an-worker1159, an-worker1161, an-worker1118, an-worker1103, an-worker1098, an-worker1126]|
|2024-04-23 12:31:09|[an-worker1132, an-worker1168, an-worker1141, an-worker1144, analytics1071, an-worker1111, an-worker1160, an-worker1102, an-worker1122]                                                                            |
+-------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
```
Note this is only looking at the city db, and only for a random subset of hosts that spark happens to run partitions on.
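For context on what the `build_epoch` read by the UDF actually is: it is a field of the metadata map stored at the end of every `.mmdb` file. Per the MaxMind DB file format spec, that section begins right after the last occurrence of a fixed marker, so locating the raw metadata bytes needs no library at all. A stdlib-only sketch (decoding the map itself is left to `maxminddb`):

```python
# Per the MaxMind DB file format spec, the metadata section follows the
# LAST occurrence of this marker in the file.
MMDB_MARKER = b"\xab\xcd\xefMaxMind.com"


def metadata_bytes(data: bytes) -> bytes:
    """Return the raw metadata map bytes of an in-memory MMDB image.

    Raises ValueError if the marker is missing (i.e. not an MMDB file).
    """
    idx = data.rfind(MMDB_MARKER)
    if idx == -1:
        raise ValueError("not an MMDB file: metadata marker not found")
    return data[idx + len(MMDB_MARKER):]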
@BTullis I added DPE-SRE. Can you look into this and see if the version of the maxmind dbs is the same on all hadoop workers?
With the exception of one worker that seemingly can't be reached, the DB seems to be exactly the same everywhere:
```
brouberol@cumin2002:~$ sudo cumin 'an-worker*.eqiad.wmnet' 'md5sum /usr/share/GeoIP/GeoIP2-City.mmdb'
100 hosts will be targeted:
an-worker[1078-1177].eqiad.wmnet
OK to proceed on 100 hosts? Enter the number of affected hosts to confirm or "q" to quit: 100
===== NODE GROUP =====
(1) an-worker1165.eqiad.wmnet
----- OUTPUT of 'md5sum /usr/shar...GeoIP2-City.mmdb' -----
ssh: connect to host an-worker1165.eqiad.wmnet port 22: Connection timed out
===== NODE GROUP =====
(99) an-worker[1078-1164,1166-1177].eqiad.wmnet
----- OUTPUT of 'md5sum /usr/shar...GeoIP2-City.mmdb' -----
784d268a38b213e9fafa572b193d1e1a /usr/share/GeoIP/GeoIP2-City.mmdb
================
PASS |███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████ | 99% (99/100) [00:21<00:00, 4.66hosts/s]
FAIL |█▉
```
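The cumin node-group output above is essentially a group-by on the file digest. The same check can be sketched in plain Python, assuming the per-host digests have already been collected (host names and digests below are illustrative, not real fleet data):

```python
import hashlib
from collections import defaultdict


def md5_of(path: str) -> str:
    """md5 hex digest of a file, read in 1 MiB chunks."""
    h = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()


def group_by_digest(digests: dict[str, str]) -> dict[str, list[str]]:
    """Map digest -> hosts reporting it; a single key means the fleet agrees."""
    groups = defaultdict(list)
    for host, digest in digests.items():
        groups[digest].append(host)
    return dict(groups)
```

A fleet is consistent exactly when `group_by_digest` returns one key; any second key lists the hosts with a stale copy.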
The snippet pasted above now returns the same MaxMind metadata for all hosts the job ran on: 2024-10-29 19:58:23.