
Issue with ut_* fields when tld is not in lists #7

Open
dbranger opened this issue Feb 3, 2023 · 2 comments
Assignees
Labels
bug Something isn't working

Comments

@dbranger

dbranger commented Feb 3, 2023

Hi,

We encounter an issue when using URL Toolbox with hostnames whose TLD is not in the DAT lists used by the Python script.

It seems that the script truncates and merges the end of the URL instead of keeping the last string after a dot.

Here are some examples:

  • test.containers.internal --> ut_subdomain = "test.containers" instead of "test", ut_domain = "int.host" instead of "containers.internal", ut_tld = "host" instead of "internal"
  • test.redhat.com.localdomain --> ut_subdomain = "test.redhat.com" instead of "test.redhat", ut_domain = "localdo.com" instead of "com.localdomain", ut_tld = "com" instead of "localdomain"
  • test.centos.pool.ntp.org.xxxlocal --> ut_subdomain = "test.centos.pool.ntp.org" instead of "test.centos.pool.ntp", ut_domain = "xxxl.org" instead of "org.xxxlocal", ut_tld = "org" instead of "xxxlocal"
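The expected behavior in the examples above can be sketched as a simple fallback split (a hypothetical helper for illustration, not the actual URL Toolbox code): when no known TLD matches, treat the last label as ut_tld, the last two labels as ut_domain, and everything before that as ut_subdomain.

```python
def fallback_split(host):
    """Hypothetical fallback when the TLD is not in any list:
    last label -> ut_tld, last two labels -> ut_domain,
    everything before that -> ut_subdomain."""
    labels = host.split(".")
    if len(labels) < 2:
        return "", host, ""
    return (".".join(labels[:-2]),  # ut_subdomain
            ".".join(labels[-2:]),  # ut_domain
            labels[-1])             # ut_tld

print(fallback_split("test.containers.internal"))
# ('test', 'containers.internal', 'internal')
print(fallback_split("test.redhat.com.localdomain"))
# ('test.redhat', 'com.localdomain', 'localdomain')
```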

When we add the TLD to the DAT files used by the Python script, it works well. Nevertheless, we cannot add every possible case. The impact of this issue is that correlation searches do not detect the correct values.

Would it please be possible to update the Python script so that, when it does not find the TLD in the DAT files, it keeps the correct values? Or is there a reason for the current behavior?

We thank you in advance.

Best regards,

D.BRANGER

@ggokdemir
Collaborator

Hi @dbranger,

Thank you for bringing this up!

>>> import tldextract
>>> tldextract.extract('http://forums.news.cnn.com/')
ExtractResult(subdomain='forums.news', domain='cnn', suffix='com')
>>> tldextract.extract('http://test.containers.internal/')
ExtractResult(subdomain='test.containers', domain='internal', suffix='')
>>> tldextract.extract('http://test.redhat.com.localdomain/')
ExtractResult(subdomain='test.redhat.com', domain='localdomain', suffix='')
>>> tldextract.extract('http://test.centos.pool.ntp.org.xxxlocal/')
ExtractResult(subdomain='test.centos.pool.ntp.org', domain='xxxlocal', suffix='')
>>> tldextract.extract('http://1.something.com.local')
ExtractResult(subdomain='1.something.com', domain='local', suffix='')

Even a library like tldextract, which relies on the Public Suffix List (PSL) to separate a URL's subdomain, domain, and public suffix, doesn't solve the issue: as the output above shows, when the suffix is unknown it also folds the last label into the domain and returns an empty suffix. The workaround you mention (adding the TLD to the DAT files used by the Python script) works, but so far, changes to the code and tests have not produced a common pattern that handles all cases.
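For reference, the core of PSL-style extraction plus the fallback this issue asks for can be sketched as longest-suffix matching with tldextract-style (subdomain, domain, suffix) output. This is a toy illustration, not the actual tldextract or URL Toolbox code, and KNOWN_SUFFIXES is a stand-in for the real suffix list:

```python
# Toy stand-in for the Public Suffix List / DAT files.
KNOWN_SUFFIXES = {"com", "org", "co.uk"}

def extract(host):
    """Longest-match against the known suffix list, as PSL parsers do,
    with the fallback requested in this issue: if nothing matches,
    treat the last label as the suffix instead of folding it into
    the domain."""
    labels = host.split(".")
    if len(labels) < 2:
        return "", host, ""
    # Candidates from longest to shortest: labels[1:], labels[2:], ...
    for i in range(1, len(labels)):
        suffix = ".".join(labels[i:])
        if suffix in KNOWN_SUFFIXES:
            return ".".join(labels[:i - 1]), labels[i - 1], suffix
    # Fallback: last label becomes the suffix.
    return ".".join(labels[:-2]), labels[-2], labels[-1]

print(extract("forums.news.cnn.com"))       # ('forums.news', 'cnn', 'com')
print(extract("test.containers.internal"))  # ('test', 'containers', 'internal')
```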

I’ll keep this open. Please let me know if you have any thoughts or suggestions. I’d greatly appreciate any help or feedback!
Thank you! I'll keep you posted if I make any changes.

@ggokdemir
Collaborator

I updated the repository at https://github.com/splunk/utbox/tree/utbox-PSL-update with the latest Public Suffix List from https://publicsuffix.org/list/.
