Pierre Chaumeil edited this page Oct 9, 2024 · 23 revisions

Scrape LPSN

cd /srv/db/gtdb/metadata/release213
mkdir lpsn
cd lpsn
mkdir lpsn_<date>  # e.g. lpsn_20220718

gtdb_migration_tk lpsn pull_html -o /srv/db/gtdb/metadata/release<version>/lpsn/
gtdb_migration_tk lpsn parse_html --in_dir . -o parse_html/ --lpsn_gss_file lpsn_gss_2022-09-19.csv

It may be necessary to run lpsn pull_html multiple times, since downloads of LPSN pages can fail. Failed downloads are listed in the *_failed.lst files. The --skip_taxa_per_letter_dl flag can be used to speed up additional runs of pull_html. There should be no failed downloads before proceeding with lpsn parse_html.
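The check for remaining failures can be scripted. The following is a minimal sketch, not part of gtdb_migration_tk: the function name pending_failures is hypothetical, and it assumes that the *_failed.lst files sit somewhere under the pull_html output directory, hold one failed entry per line, and are empty (or absent) when nothing failed.

```python
from pathlib import Path


def pending_failures(lpsn_dir):
    """Return {file name: [entries]} for every non-empty *_failed.lst file.

    Assumption: one failed download per line; blank lines are ignored.
    """
    failures = {}
    for lst in sorted(Path(lpsn_dir).rglob("*_failed.lst")):
        entries = [line.strip() for line in lst.read_text().splitlines() if line.strip()]
        if entries:
            failures[lst.name] = entries
    return failures
```

An empty dictionary returned for the output directory would indicate that pull_html does not need to be run again.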

Create the date table

gtdb_migration_tk strains date_table --lpsn_scraped_species_info /srv/db/gtdb/metadata/release<release#>/lpsn/parse_html/all_ranks/lpsn_species.tsv --lpsn_gss_file ../lpsn/lpsn_gss_2022-09-19.csv --output_file year_table.tsv

The file passed to --lpsn_gss_file is obtained from the Download section of the LPSN website.

Create a summary table for each source

gtdb_migration_tk strains type_table --lpsn_dir ../lpsn/lpsn_20240917/parse_html/all_ranks/ --year_table year_table.tsv --metadata /srv/db/gtdb/metadata/release226/strain_table/metadata_file.tsv --ncbi_names /srv/db/gtdb/metadata/release226/ncbi/taxonomy/taxdump_20240914/names.dmp --ncbi_nodes /srv/db/gtdb/metadata/release226/ncbi/taxonomy/taxdump_20240914/nodes.dmp --cpus 2 --output_dir . --lpsn_gss_file ../lpsn/lpsn_20240917/lpsn_gss_2024-10-04.csv

Compare with the old parsing

The previous parsing disregarded strains shorter than 2 characters and strains without any digit. The new parsing includes them. To compare the old and new parsing, there is a boolean in the check_format_strain function in taxon_utils.py. It is set as follows by default; switch it to True to reproduce the old parsing:

old_format = False
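To make the difference between the two modes concrete, here is a minimal sketch of the filtering described above. It is a hypothetical stand-in, not the actual check_format_strain code: the function name passes_strain_filter is invented for illustration, and it assumes the new parsing applies no other format checks.

```python
def passes_strain_filter(strain, old_format=False):
    """Hypothetical illustration of the old vs. new strain filtering.

    Not the real check_format_strain implementation from taxon_utils.py.
    """
    if not old_format:
        # New parsing: short and digit-free strains are kept.
        return True
    # Old parsing: drop strains shorter than 2 characters...
    if len(strain) < 2:
        return False
    # ...and strains that do not contain any digit.
    return any(ch.isdigit() for ch in strain)
```

Under these assumptions, a strain such as "ATCC" or "K" is rejected with old_format=True and accepted with old_format=False, while "DSM 1234" is accepted in both modes.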

Then rerun type_table in a separate directory:

mkdir old_format_with_digits
cd old_format_with_digits
gtdb_migration_tk strains type_table --lpsn_dir ../../lpsn/lpsn_20240917/parse_html/all_ranks/ --year_table ../year_table.tsv --metadata /srv/db/gtdb/metadata/release226/strain_table/metadata_file.tsv --ncbi_names /srv/db/gtdb/metadata/release226/ncbi/taxonomy/taxdump_20240914/names.dmp --ncbi_nodes /srv/db/gtdb/metadata/release226/ncbi/taxonomy/taxdump_20240914/nodes.dmp --cpus 2 --output_dir . --lpsn_gss_file ../../lpsn/lpsn_20240917/lpsn_gss_2024-10-04.csv

Then compare both files with compare_strain_files.py (in the local sandbox; it's a bit of a hack for now).
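Since compare_strain_files.py is not shown here, the following is only a rough sketch of what such a comparison could look like, not a copy of that script. It assumes both outputs are tab-separated files with a header row and a unique identifier in the first column; the function names load_rows and compare_tables are hypothetical.

```python
import csv


def load_rows(path):
    """Map first-column value -> full row for a tab-separated file with a header."""
    with open(path, newline="") as handle:
        reader = csv.reader(handle, delimiter="\t")
        next(reader, None)  # skip the header row
        return {row[0]: row for row in reader if row}


def compare_tables(new_path, old_path):
    """Return (keys only in new, keys only in old, keys whose rows differ)."""
    new_rows = load_rows(new_path)
    old_rows = load_rows(old_path)
    only_new = sorted(new_rows.keys() - old_rows.keys())
    only_old = sorted(old_rows.keys() - new_rows.keys())
    changed = sorted(
        key for key in new_rows.keys() & old_rows.keys()
        if new_rows[key] != old_rows[key]
    )
    return only_new, only_old, changed
```

The third list is the interesting one for this comparison: it contains the entries whose values change when short or digit-free strains are included.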