Run write_dwc() from CSV files #348

peterdesmet · 2025-03-31T12:19:25Z

Rationale

In the March 31 VLIZ-INBO meeting I suggested having write_dwc() run from the CSV files generated by download_acoustic_dataset() rather than from the database. This is in line with the idea that a Marine Data Archive for an animal acoustic project will contain both the source data and the Darwin Core Archive data:

# source data, generated with download_acoustic_dataset()
datapackage.json
animals.csv
tags.csv
acoustic_detections.csv
archival_data.csv
deployments.csv
receivers.csv
projects.csv

# DwC-A data, generated with write_dwc()
dwc_occurrence.csv
dwc_emof.csv
meta.xml

Running from the CSV files has several advantages:

Only need to query data from DB once (in download_acoustic_dataset()).
Darwin Core data will always be consistent with the CSV files. Currently the two can drift apart, e.g. when write_dwc() is run weeks later (and DB data have been updated) or when the scientific_name argument was used in download_acoustic_dataset() (it is not available in write_dwc()).
Can update datapackage.json to reference Darwin Core files.
Once you have the CSV files, it's faster to run write_dwc().

The process would thus be:

1. Run download_acoustic_dataset()
2. Quality assurance
3. Fix errors in database
4. Repeat steps 1-3 until all is correct
5. Run write_dwc() on local CSV files

Implementation

Implementation would be similar to https://inbo.github.io/movepub/reference/write_dwc.html, where a Frictionless Data Package is provided.

Discuss with @PietrH what branch to use

Parameters

package (no default): a Data Package object as returned by frictionless::read_package(). Alternatively, we ask the user for an input directory.
~~connection~~: remove
~~animal_project_code~~: remove, context is provided by package
directory (no default): output directory
contact (cf. movepub), not sure this is needed.
rights_holder (default NULL)
license (default "CC-BY")

Error checking

Check that all required resources are available. I assume those will be at least animals and detections.
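A minimal sketch of such a check, assuming the resource names match the CSV file names without extension (e.g. acoustic_detections). The helper name check_required_resources() and the default list of required resources are hypothetical. It only reads the resources property of the Data Package, so it works on the list returned by frictionless::read_package() as well as on the stand-in list used here:

```r
# Hypothetical helper: name, signature and default `required` are assumptions.
check_required_resources <- function(package,
                                     required = c("animals", "acoustic_detections")) {
  # A Data Package stores its resources as a list, each with a `name`
  available <- vapply(package$resources, function(r) r$name, character(1))
  missing <- setdiff(required, available)
  if (length(missing) > 0) {
    stop(
      "Package is missing required resource(s): ",
      paste(missing, collapse = ", "),
      call. = FALSE
    )
  }
  invisible(package)
}

# Stand-in for a package read from datapackage.json, lacking detections
package <- list(resources = list(list(name = "animals"), list(name = "tags")))
result <- tryCatch(
  check_required_resources(package),
  error = function(e) conditionMessage(e)
)
print(result)
```

Failing early with the names of the missing resources tells the user which files to regenerate with download_acoustic_dataset().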

Transformation

Convert [dwc_occurrence.sql](https://github.com/inbo/etn/blob/main/inst/sql/dwc_occurrence.sql) to dplyr
Test that all necessary information is available in the source CSVs. If not, then download_acoustic_dataset() should be updated.
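As an illustration of what one step of the dplyr version could look like, here is a sketch of hourly subsampling (the step discussed in #347) that always selects the same detection. It is not a translation of the existing SQL: the column names (detection_id, animal_id, date_time) and the rule "keep the earliest detection per animal per hour" are assumptions.

```r
library(dplyr)

# Hypothetical miniature of acoustic_detections.csv; column names are
# assumptions for illustration, not the confirmed export schema.
detections <- tibble(
  detection_id = 1:4,
  animal_id = c(10, 10, 10, 11),
  date_time = as.POSIXct(
    c("2025-01-01 10:05:00", "2025-01-01 10:40:00",
      "2025-01-01 11:15:00", "2025-01-01 10:20:00"),
    tz = "UTC"
  )
)

# Keep one detection per animal per hour. Sorting first, with detection_id
# as tie-breaker, makes the selected row reproducible between runs.
subsampled <- detections %>%
  mutate(hour = format(date_time, "%Y-%m-%d %H", tz = "UTC")) %>%
  arrange(animal_id, date_time, detection_id) %>%
  group_by(animal_id, hour) %>%
  slice_head(n = 1) %>%
  ungroup()

print(subsampled)
```

Because the input is a fixed CSV file and the ordering is explicit, repeated runs on the same files should return the same rows.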

Testing

Create snapshot files for a small animal project with the current implementation of write_dwc()
Make sure the same snapshot files are created with the new version of write_dwc()
Test if this resolves #347 ("write_dwc() hourly subsampling can return different detections between exports")

peterdesmet · 2025-03-31T12:22:05Z

Ping @CLAUMEMO @sannegovaert

PietrH · 2025-03-31T12:27:14Z

When do you plan for work on this to happen, Peter?

peterdesmet · 2025-03-31T12:33:16Z

It should be finished before @CLAUMEMO ramps up "publish all non-embargo data", which is the goal for DTO-BioFlow deliverable D3.5 (Month 38, end of October, 2026). It can happen in parallel to OpenCPU development, since it doesn't directly depend on it.

PietrH · 2025-03-31T15:17:00Z

> It can happen in parallel to OpenCPU development, since it doesn't directly depend on it.

write_dwc() and the way the package places requests have changed in branches other than main, so it'll probably save us some difficult merges if we coordinate the order in which this happens.

#317, for example, moves the SQL completely away from etn and into etnservice, so that we only have to maintain one copy.

peterdesmet · 2025-03-31T17:11:31Z

I suggest a new write_dwc_csv() during development. Once everything is operational, we can see where to merge and rename it to write_dwc() (replacing old functionality).

peterdesmet assigned sannegovaert Mar 31, 2025

peterdesmet mentioned this issue Mar 31, 2025

write_dwc() hourly subsampling can return different detections between exports #347

Closed

PietrH added the enhancement and blocked labels Mar 31, 2025
