
Issue with ut_* fields when tld is not in lists #7

Open
dbranger opened this issue Feb 3, 2023 · 2 comments
Assignees
Labels
bug Something isn't working

Comments

@dbranger

dbranger commented Feb 3, 2023

Hi,

We encounter an issue when using URL Toolbox with hostnames whose TLD is not in the DAT lists used by the Python script.

It seems that the script truncates and merges the end of the URL instead of keeping the last string after a dot.

Here are some examples:

  • test.containers.internal --> ut_subdomain = "test.containers" instead of "test", ut_domain = "int.host" instead of "containers.internal", ut_tld = "host" instead of "internal"
  • test.redhat.com.localdomain --> ut_subdomain = "test.redhat.com" instead of "test.redhat", ut_domain = "localdo.com" instead of "com.localdomain", ut_tld = "com" instead of "localdomain"
  • test.centos.pool.ntp.org.xxxlocal --> ut_subdomain = "test.centos.pool.ntp.org" instead of "test.centos.pool.ntp", ut_domain = "xxxl.org" instead of "org.xxxlocal", ut_tld = "org" instead of "xxxlocal"
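The expected behavior in the examples above can be sketched as a simple fallback split (a hypothetical helper for illustration, not the actual URL Toolbox code): when no known TLD matches, treat the last label as ut_tld, the last two labels as ut_domain, and everything before that as ut_subdomain.

```python
def fallback_split(host):
    """Hypothetical fallback when the TLD is not in any list:
    last label -> ut_tld, last two labels -> ut_domain,
    everything before that -> ut_subdomain."""
    labels = host.split(".")
    if len(labels) < 2:
        return "", host, ""
    return (".".join(labels[:-2]),  # ut_subdomain
            ".".join(labels[-2:]),  # ut_domain
            labels[-1])             # ut_tld

print(fallback_split("test.containers.internal"))
# ('test', 'containers.internal', 'internal')
print(fallback_split("test.redhat.com.localdomain"))
# ('test.redhat', 'com.localdomain', 'localdomain')
```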

When we add the TLD to the DAT files used by the Python script, it works well. Nevertheless, we cannot add every possible case. The impact of this issue is that correlation searches do not detect the correct values.

Would it please be possible to update the Python script so that, when it does not find the TLD in the DAT files, it keeps the correct values? Or is there a reason for the current behavior?

We thank you in advance.

Best regards,

D.BRANGER

@ggokdemir
Collaborator

Hi @dbranger,

Thank you for bringing this up!

>>> import tldextract
>>> tldextract.extract('http://forums.news.cnn.com/')
ExtractResult(subdomain='forums.news', domain='cnn', suffix='com')
>>> tldextract.extract('http://test.containers.internal/')
ExtractResult(subdomain='test.containers', domain='internal', suffix='')
>>> tldextract.extract('http://test.redhat.com.localdomain/')
ExtractResult(subdomain='test.redhat.com', domain='localdomain', suffix='')
>>> tldextract.extract('http://test.centos.pool.ntp.org.xxxlocal/')
ExtractResult(subdomain='test.centos.pool.ntp.org', domain='xxxlocal', suffix='')
>>> tldextract.extract('http://1.something.com.local')
ExtractResult(subdomain='1.something.com', domain='local', suffix='')

Even a library like tldextract, which relies on the Public Suffix List (PSL) to separate a URL's subdomain, domain, and public suffix, doesn't solve the issue: as the output above shows, when the suffix is unknown it also folds the last label into the domain and returns an empty suffix. The workaround you mention (adding the TLD to the DAT files used by the Python script) works, but so far, changes to the code and tests have not produced a common pattern that handles all cases.
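For reference, the core of PSL-style extraction plus the fallback this issue asks for can be sketched as longest-suffix matching with tldextract-style (subdomain, domain, suffix) output. This is a toy illustration, not the actual tldextract or URL Toolbox code, and KNOWN_SUFFIXES is a stand-in for the real suffix list:

```python
# Toy stand-in for the Public Suffix List / DAT files.
KNOWN_SUFFIXES = {"com", "org", "co.uk"}

def extract(host):
    """Longest-match against the known suffix list, as PSL parsers do,
    with the fallback requested in this issue: if nothing matches,
    treat the last label as the suffix instead of folding it into
    the domain."""
    labels = host.split(".")
    if len(labels) < 2:
        return "", host, ""
    # Candidates from longest to shortest: labels[1:], labels[2:], ...
    for i in range(1, len(labels)):
        suffix = ".".join(labels[i:])
        if suffix in KNOWN_SUFFIXES:
            return ".".join(labels[:i - 1]), labels[i - 1], suffix
    # Fallback: last label becomes the suffix.
    return ".".join(labels[:-2]), labels[-2], labels[-1]

print(extract("forums.news.cnn.com"))       # ('forums.news', 'cnn', 'com')
print(extract("test.containers.internal"))  # ('test', 'containers', 'internal')
```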

I’ll keep this open. Please let me know if you have any thoughts or suggestions. I’d greatly appreciate any help or feedback!
Thank you! I'll keep you posted if I make any changes.

@ggokdemir
Collaborator

I updated the repository at https://github.com/splunk/utbox/tree/utbox-PSL-update with the latest Public Suffix List from https://publicsuffix.org/list/.
