Run write_dwc() from CSV files #348

Open
14 tasks
peterdesmet opened this issue Mar 31, 2025 · 5 comments
Labels: blocked, enhancement

peterdesmet commented Mar 31, 2025

Rationale

In the March 31 VLIZ-INBO meeting, I suggested having write_dwc() run from the CSV files generated by download_acoustic_dataset() rather than from the database. This is in line with the idea that a Marine Data Archive for an animal acoustic project will contain both the source data and the Darwin Core Archive data:

# source data, generated with download_acoustic_dataset()
datapackage.json
animals.csv
tags.csv
acoustic_detections.csv
archival_data.csv
deployments.csv
receivers.csv
projects.csv

# DwC-A data, generated with write_dwc()
dwc_occurrence.csv
dwc_emof.csv
meta.xml

Running from the CSV files has several advantages:

  • Data only need to be queried from the database once (in download_acoustic_dataset()).
  • Darwin Core data will always be consistent with the CSV files. Currently the two can drift apart, e.g. when write_dwc() is run weeks later (and the database has been updated in the meantime) or when the scientific_name argument was used in download_acoustic_dataset() (an argument that is not available in write_dwc()).
  • datapackage.json can be updated to reference the Darwin Core files.
  • Once you have the CSV files, it's faster to run write_dwc().

The process would thus be:

  1. Run download_acoustic_dataset()
  2. Quality assurance
  3. Fix errors in database
  4. Repeat steps 1-3 until all is correct
  5. Run write_dwc() on local CSV files

Implementation

Implementation would be similar to https://inbo.github.io/movepub/reference/write_dwc.html, where a Frictionless Data Package is provided.

  • Discuss with @PietrH what branch to use

Parameters

  • package (no default): a Data Package object as returned by frictionless::read_package(). Alternatively, we ask the user for an input directory.
  • connection: remove
  • animal_project_code: remove, context is provided by package
  • directory (no default): output directory
  • contact (cf. movepub): not sure this is needed.
  • rights_holder (default NULL)
  • license (default "CC-BY")
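
The parameter list above could translate into a signature along these lines. This is only a sketch under assumptions: the name write_dwc_csv() follows the suggestion made in this thread, contact is left out because it may not be needed, and the body is a placeholder that validates the arguments and creates the output directory. The actual transformation is not sketched.

```r
# Hypothetical signature sketch for the proposed function (base R only).
# The body is a placeholder, not the actual implementation.
write_dwc_csv <- function(package, directory, rights_holder = NULL,
                          license = "CC-BY") {
  # `package` and `directory` have no default, so fail early when absent
  if (missing(package)) stop("`package` is required.", call. = FALSE)
  if (missing(directory)) stop("`directory` is required.", call. = FALSE)

  # Create the output directory when it does not exist yet
  if (!dir.exists(directory)) dir.create(directory, recursive = TRUE)

  # The transformation to Darwin Core would happen here (not sketched)

  # Return the resolved settings invisibly so callers can inspect them
  invisible(list(
    directory = directory,
    rights_holder = rights_holder,
    license = license
  ))
}
```

Note that connection and animal_project_code no longer appear, since the context would come from package.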

Error checking

  • Check that all required resources are available. I assume those will be at least animals, detections.
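
A minimal sketch of such a check, assuming each resource is stored as a CSV file named after the resource in the input directory, and that "detections" refers to acoustic_detections.csv. The helper name check_required_resources() and its arguments are hypothetical; only base R is used.

```r
# Hypothetical helper: verify that the CSV files for required resources exist.
# The default `required` set is an assumption based on the text above.
check_required_resources <- function(directory,
                                     required = c("animals", "acoustic_detections")) {
  # Expected file path for each required resource, e.g. <directory>/animals.csv
  paths <- file.path(directory, paste0(required, ".csv"))

  # Names of the resources whose file is not present
  missing_resources <- required[!file.exists(paths)]

  if (length(missing_resources) > 0) {
    stop(
      "Missing required resource(s): ",
      paste(missing_resources, collapse = ", "),
      call. = FALSE
    )
  }

  invisible(TRUE)
}
```

If a Data Package object is passed instead of a directory, the same idea would apply to the resource names listed in datapackage.json rather than to file paths.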

Transformation

  • Convert [dwc_occurrence.sql](https://github.com/inbo/etn/blob/main/inst/sql/dwc_occurrence.sql) to dplyr
  • Test that all necessary information is available in the source CSVs. If not, download_acoustic_dataset() should be updated.

Testing

@peterdesmet

Ping @CLAUMEMO @sannegovaert


PietrH commented Mar 31, 2025

When do you plan for work on this to happen, Peter?

PietrH added the enhancement and blocked labels on Mar 31, 2025
@peterdesmet

It should be finished before @CLAUMEMO ramps up "publish all non-embargo data", which is the goal for DTO-BioFlow deliverable D3.5 (Month 38, end of October, 2026). It can happen in parallel to OpenCPU development, since it doesn't directly depend on it.


PietrH commented Mar 31, 2025

> It can happen in parallel to OpenCPU development, since it doesn't directly depend on it.

write_dwc() and the way the package places requests have changed in the branches that diverge from main, so it'll probably save us some difficult merges if we coordinate the order in which this happens.

#317, for example, moves the SQL completely out of etn and into etnservice, so that we only have to maintain one copy.

@peterdesmet

I suggest a new write_dwc_csv() during development. Once everything is operational, we can see where to merge and rename it to write_dwc() (replacing old functionality).
