The ETL algorithm to create interoperable databases for the BETTER project.
- Docker Desktop is installed on the host machine.
- Get the image of I-ETL:
  - Either download the TAR image available in the repository (recommended):
    - Go to the deployment artifacts page: https://git.rwth-aachen.de/padme-development/external/better/data-cataloging/etl/-/artifacts
    - Click on the "folder" icon of the latest build (generally the first line of the table on the page)
    - Download the TAR archive named `the-ietl-image.tar`
  - Or build it from the repository (not recommended, see section "For developers")
- Download the `compose.yaml` file, available in the repository (https://git.rwth-aachen.de/padme-development/external/better/data-cataloging/etl/-/blob/main/compose.yaml?ref_type=heads)
- Download the settings file `.env`, available in the repository (https://git.rwth-aachen.de/padme-development/external/better/data-cataloging/etl/-/blob/main/.env?ref_type=heads)
- Create a folder with:
  - The I-ETL Docker (TAR) image
  - The `.env` file template
  - The `compose.yaml` file
- In that folder, load the TAR image into Docker: `docker load < the-ietl-image.tar`
- In that folder, follow any scenario described in the "Scenarios" section. The complete list of parameters is described in the "Parameters" section.
- To check whether the ETL has finished, run `docker ps`: if `the-etl` does not show in the list, it is done (see the sketch after this list).
- To check the logs of the ETL:
  - Either look at the log file produced by I-ETL, if the `SERVER_FOLDER_LOG_ETL` parameter was specified
  - Otherwise, use `docker logs the-etl`.
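For example, once a scenario has been launched, its progress and logs can be checked as follows (`the-etl` is the container name used above):

```
docker ps             # I-ETL is done when the-etl no longer appears in this list
docker logs the-etl   # or inspect the log file written to SERVER_FOLDER_LOG_ETL, if set
```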
For any scenario, you need to set the following in the `.env` file (an example excerpt is given after this list):
- The hospital name in `HOSPITAL_NAME` (see accepted values in Section 4)
- The database name in `DB_NAME`
- The absolute path to your MongoDB folder in `SERVER_FOLDER_MONGODB` (this can be any folder on the host machine; it will contain MongoDB's data)
- The locale to be used to read your data in `USE_LOCALE` (this is important to read dates and numeric values with your country's norm)
- If you want to keep the execution log: the absolute path to your log folder in `SERVER_FOLDER_LOG_ETL` (this can be any folder on the host machine; it will contain the ETL's log files)
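For illustration, a minimal `.env` excerpt covering these common parameters could look as follows (all values are examples; pick your own hospital name, paths, and locale):

```
# Illustrative values only; adapt to your setup
# Accepted hospital names are listed in Section 4
HOSPITAL_NAME=it_buzzi_uc1
DB_NAME=better_default
# Any folder on the host machine; it will contain MongoDB's data
SERVER_FOLDER_MONGODB=/data/mongodb
# Locale used to parse dates and numeric values
USE_LOCALE=it_IT
# Optional: folder in which I-ETL writes its log files
SERVER_FOLDER_LOG_ETL=/data/ietl-logs
```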
- Variables are columns, patients are rows, and patients have identifiers (which will be further anonymized by I-ETL), so please pre-process your data if this is not the case.
- The input files have exactly the same names as specified in the metadata (https://drive.google.com/drive/u/0/folders/1eJOtoXj192Z9u0VENn4BdOfJ2wm5xZ5Z)
Set the following in the `.env` file (an example excerpt is given after this list):
- `SERVER_FOLDER_METADATA` with the absolute path to the folder containing the metadata file.
- `METADATA` with the name of the metadata file.
- `SERVER_FOLDER_DATA` with the absolute path to the folder containing your data files.
- `DATA_FILES` with the comma-separated list of the data file names (no space).
- `PATIENT_ID` with the column name containing your patient IDs (e.g., `PatientID`, or `pid`, etc.).
- If you previously loaded data in the database and are now adding new data to it, set `DB_DROP=False`. If this is NOT set to `False`, previously loaded data will be lost.
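For example, assuming, purely for illustration, a metadata file named `metadata.csv` and two data files `samples.csv` and `diagnoses.csv` (your actual file names must match those specified in the metadata), the corresponding `.env` entries could look like:

```
# Placeholder folder and file names; data file names must match the metadata
SERVER_FOLDER_METADATA=/data/metadata
METADATA=metadata.csv
SERVER_FOLDER_DATA=/data/datasets
# Comma-separated list of data file names, without spaces
DATA_FILES=samples.csv,diagnoses.csv
# Column holding the patient identifiers in your data files
PATIENT_ID=PatientID
# Keep the data already loaded in the database
DB_DROP=False
```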
Set the following environment variables:

```
export CONTEXT_MODE=DEV
export ETL_ENV_FILE_NAME=.env
export ABSOLUTE_PATH_ENV_FILE=X
```

where `X` is the absolute path to your `.env` file.
- From your folder (step 5 in section 2), run `docker compose --env-file ${ABSOLUTE_PATH_ENV_FILE} up -d` (`-d` stands for `--detach`, meaning that I-ETL will run as a background process). A complete example is given below.
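For example, assuming the folder prepared in section 2 is `/home/user/ietl` (a placeholder path), the whole launch could look like:

```
export CONTEXT_MODE=DEV
export ETL_ENV_FILE_NAME=.env
# Placeholder path: point this to your own .env file
export ABSOLUTE_PATH_ENV_FILE=/home/user/ietl/.env

cd /home/user/ietl
docker compose --env-file ${ABSOLUTE_PATH_ENV_FILE} up -d
```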
- `SERVER_FOLDER_DATA_GENERATION` with the folder for storing synthetic data.
- `NB_ROWS` with the number of rows (~= number of patients) to generate.
- Note: synthetic data will be generated based on the hospital name, because each hospital has its own data structure.
Set the following environment variables:

```
export CONTEXT_MODE=GENERATION
export ETL_ENV_FILE_NAME=.env
export ABSOLUTE_PATH_ENV_FILE=X
```

where `X` is the absolute path to your `.env` file.
- From your folder (step 5 in section 2), run `docker compose --env-file ${ABSOLUTE_PATH_ENV_FILE} up -d`. A complete example is given below.
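As a sketch, with the same placeholder paths as above, generating synthetic data could be configured and launched as follows:

```
# Added to the .env file (placeholder folder):
#   SERVER_FOLDER_DATA_GENERATION=/data/synthetic
#   NB_ROWS=100

export CONTEXT_MODE=GENERATION
export ETL_ENV_FILE_NAME=.env
export ABSOLUTE_PATH_ENV_FILE=/home/user/ietl/.env
docker compose --env-file ${ABSOLUTE_PATH_ENV_FILE} up -d
```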
The template for the environment file is given in `.env`.
Parameters with a * (star) are required; others can be left empty.
The first value in the column "Values" is the default one.
We provide a synthetic data generator for each medical center in the BETTER project.
These generators rely on the metadata described by each center and generate `NB_ROWS` rows of data.
Parameter name | Description | Values |
---|---|---|
`SERVER_FOLDER_DATA_GENERATION` | The absolute path to the folder into which generated synthetic data will be saved. | `/dev/null` or an absolute folder path |
`NB_ROWS` | The number of generated patients. | `100` or any other non-zero positive integer |
Parameter name | Description | Values |
---|---|---|
`SERVER_FOLDER_METADATA` * | The absolute path to the folder containing the metadata file. | `/dev/null` or an absolute folder path |
`METADATA` * | The metadata filename. | a filename |
`SERVER_FOLDER_DATA` * | The absolute path to the folder containing the datasets. | `/dev/null` or a folder path |
`DATA_FILES` | The list of comma-separated filenames. | (empty) or filename(s) |
`SERVER_FOLDER_PIDS` * | The absolute path to the folder containing the patient anonymization data. | `/dev/null` or a folder path |
`ANONYMIZED_PIDS` | The patient anonymized IDs filename. I-ETL will create it if it does not exist. | (empty) or filename(s) |
The FAIR database is in the Docker image, but can be accessed by opening a port, e.g., 27018.
Then, one can access it using that port (for instance, `mongosh --port 27018`).
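For example, assuming the MongoDB port has indeed been published on the host as 27018, the FAIR database can be inspected with standard `mongosh` commands (`better_default` is the default database name set by `DB_NAME`):

```
# Connect through the published port (27018 is an example)
mongosh --port 27018
# Then, at the mongosh prompt:
#   show dbs            // list the available databases
#   use better_default  // switch to the FAIR database (name set by DB_NAME)
#   show collections    // list the collections created by I-ETL
```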
Parameter name | Description | Values |
---|---|---|
`HOSPITAL_NAME` * | The hospital name. | `it_buzzi_uc1`, `rs_imgge`, `es_hsjd`, `it_buzzi_uc3`, `es_terrassa`, `de_ukk`, `es_lafe`, `il_hmc` |
`DB_NAME` * | The database name. | `better_default` |
`DB_DROP` | Whether to drop the database. WARNING: if `True`, this action is not reversible! | `False`, `True` |
`SERVER_FOLDER_MONGODB` * | The absolute path to the folder in which MongoDB will store its databases. | a folder path |
Parameter name | Description | Values |
---|---|---|
`SERVER_FOLDER_LOG_ETL` * | The absolute path to the folder in which I-ETL will write its log files. | `/dev/null` or a folder path |
`USE_LOCALE` * | The locale to be used for reading numerics and dates. | `en_GB`, `en_US`, `fr_FR`, `it_IT`, etc. |
`COLUMNS_TO_REMOVE` * | The set of column names to exclude from the final database. | `[]` (empty list), or a list of strings being the column names |
`RECORD_CARRIER_PATIENTS` | Whether to record patients who carry a disease without being affected by it as diagnosed patients. | `False`, `True` |
`PATIENT_ID` | The name of the column in the data containing patient IDs. | `Patient ID`, or any other column name |
`SAMPLE_ID` | The name of the column in the data containing sample IDs. | (empty) if you do not have sample data, else a column name |
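For illustration only, a `.env` excerpt for these parameters might look as follows (values are examples; the exact syntax for list-valued parameters such as `COLUMNS_TO_REMOVE` should be taken from the `.env` template in the repository):

```
# Locale used to parse dates and numbers
USE_LOCALE=en_GB
# Default: exclude no column from the final database
COLUMNS_TO_REMOVE=[]
# Do not record carrier patients as diagnosed patients
RECORD_CARRIER_PATIENTS=False
# Column holding patient IDs in your data files (illustrative name)
PATIENT_ID=PatientID
# Leave empty if you have no sample data
SAMPLE_ID=
```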
To be used when working with the I-ETL repository
- Install Docker Desktop and open it
- From the root of the project, run `docker build . --tag ietl` (a quick check is sketched after this list)
  - If an error saying `ERROR: Cannot connect to the Docker daemon at XXX. Is the docker daemon running?` occurs, Docker Desktop has not started.
  - If an error saying `error getting credentials` occurs while building, go to your Docker config file (probably `~/.docker/config.json`) and remove the line containing `credsStore`. Then, save the file and build the image again.
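As a quick sanity check, assuming the tag `ietl` used above, the freshly built image should be listed by Docker:

```
docker build . --tag ietl
docker image ls ietl   # the freshly built image should appear here
```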
To be used when deploying I-ETL within a center
- Locally, build the Docker image: see the section above
- Locally, create a TAR image of I-ETL (only the ETL, not the MongoDB image): `docker save ietl > the-ietl-image.tar`
- Send that TAR image to the host machine, e.g., using `scp the-ietl-image.tar "username@A.B.C.D:/somewhere/in/host/machine"`
- Send the env file to the host machine, in the same folder as the TAR image, e.g., using `scp .env "username@A.B.C.D:/somewhere/in/host/machine"`
- Send the compose file to the host machine, in the same folder as the TAR image, e.g., using `scp compose.yaml "username@A.B.C.D:/somewhere/in/host/machine"`
- On the host machine, move to `/somewhere/in/host/machine/` using `cd /somewhere/in/host/machine`
- On the host machine, load the TAR image into the host's Docker: `docker load < the-ietl-image.tar`
- On the host machine, follow any scenario above, i.e., tune the `.env` file and run I-ETL (see the consolidated sketch below)
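Putting the transfer steps together, a consolidated sketch could look like this (the username, host address, and target folder are placeholders from the examples above):

```
# On the local machine
docker save ietl > the-ietl-image.tar
scp the-ietl-image.tar compose.yaml .env "username@A.B.C.D:/somewhere/in/host/machine"

# On the host machine
cd /somewhere/in/host/machine
docker load < the-ietl-image.tar
# Then tune the .env file and run I-ETL following one of the scenarios above
```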