SetReplication Error. #18721

Open
JiGuoDing opened this issue Mar 30, 2025 · 5 comments
Labels
type-bug This issue is about a bug

Comments

JiGuoDing commented Mar 30, 2025

Alluxio Version:
v2.9.4

Describe the bug
First, I deployed Alluxio with Helm in a K8s cluster that has 1 master node and 7 worker nodes.

Second, inside an Alluxio worker pod I ran "alluxio fs setReplication --max 3 --min 3 /test_ufs.txt", and it worked fine the first time. However, when I then ran "alluxio fs setReplication --max 4 --min 4 /test_ufs.txt", it didn't take effect: the replication number remained 3.

Third, I found the following in the Alluxio master logs:

2025-03-30 06:33:39,802 WARN Master Replication Check - Unexpected exception encountered when starting a REPLICATE job (uri=/test_ufs.txt, block ID=16777216, num replicas=5) : alluxio.exception.status.NotFoundException: There's SetReplica job running for path:/test_ufs.txt blockId:16777216, try later
2025-03-30 06:44:39,783 WARN Master Replication Check - Unexpected exception encountered when starting a REPLICATE job (uri=/test_ufs.txt, block ID=16777216, num replicas=5) : alluxio.exception.status.NotFoundException: There's SetReplica job running for path:/test_ufs.txt blockId:16777216, try later

and the following in the Alluxio job master logs:

2025-03-30 06:43:39,782 WARN grpc-default-executor-0 - Exit (Error): run: request=jobConfig: "\254\355\000\005sr\000+alluxio.job.plan.replicate.SetReplicaConfig\031\027\020|\037\027z\302\002\000\003J\000\bmBlockIdI\000\tmReplicasL\000\005mPatht\000\022Ljava/lang/String;xp\000\000\000\000\001\000\000\000\000\000\000\005t\000\r/test_ufs.txt"
, Error=alluxio.exception.JobDoesNotExistException: There's SetReplica job running for path:/test_ufs.txt blockId:16777216, try later
2025-03-30 06:44:39,782 WARN grpc-default-executor-3 - Exit (Error): run: request=jobConfig: "\254\355\000\005sr\000+alluxio.job.plan.replicate.SetReplicaConfig\031\027\020|\037\027z\302\002\000\003J\000\bmBlockIdI\000\tmReplicasL\000\005mPatht\000\022Ljava/lang/String;xp\000\000\000\000\001\000\000\000\000\000\000\005t\000\r/test_ufs.txt"
, Error=alluxio.exception.JobDoesNotExistException: There's SetReplica job running for path:/test_ufs.txt blockId:16777216, try later
2025-03-30 06:45:39,783 WARN grpc-default-executor-3 - Exit (Error): run: request=jobConfig: "\254\355\000\005sr\000+alluxio.job.plan.replicate.SetReplicaConfig\031\027\020|\037\027z\302\002\000\003J\000\bmBlockIdI\000\tmReplicasL\000\005mPatht\000\022Ljava/lang/String;xp\000\000\000\000\001\000\000\000\000\000\000\005t\000\r/test_ufs.txt"
, Error=alluxio.exception.JobDoesNotExistException: There's SetReplica job running for path:/test_ufs.txt blockId:16777216, try later
2025-03-30 06:46:39,783 WARN grpc-default-executor-3 - Exit (Error): run: request=jobConfig: "\254\355\000\005sr\000+alluxio.job.plan.replicate.SetReplicaConfig\031\027\020|\037\027z\302\002\000\003J\000\bmBlockIdI\000\tmReplicasL\000\005mPatht\000\022Ljava/lang/String;xp\000\000\000\000\001\000\000\000\000\000\000\005t\000\r/test_ufs.txt"
, Error=alluxio.exception.JobDoesNotExistException: There's SetReplica job running for path:/test_ufs.txt blockId:16777216, try later
2025-03-30 06:47:39,782 WARN grpc-default-executor-5 - Exit (Error): run: request=jobConfig: "\254\355\000\005sr\000+alluxio.job.plan.replicate.SetReplicaConfig\031\027\020|\037\027z\302\002\000\003J\000\bmBlockIdI\000\tmReplicasL\000\005mPatht\000\022Ljava/lang/String;xp\000\000\000\000\001\000\000\000\000\000\000\005t\000\r/test_ufs.txt"
, Error=alluxio.exception.JobDoesNotExistException: There's SetReplica job running for path:/test_ufs.txt blockId:16777216, try later

Fourth, I entered an Alluxio worker pod and checked the Alluxio job list:

sh-4.2# alluxio job ls
1743316123474 Persist COMPLETED
1743316123475 Replicate COMPLETED

This indicated that all the jobs were completed.

My confusion is: why does the job list say all jobs are completed, while the logs still report that a SetReplica job is running? This prevents me from adjusting the number of replicas of a file more than once in Alluxio.

To Reproduce
Load a file into Alluxio from the UFS, then run setReplication on it twice with different replica counts; the first call takes effect, the second does not (see the sketch below).
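A minimal sketch of the commands I ran inside a worker pod (the file name is just an example, and the file is already loaded in Alluxio):

# first call works, replication becomes 3
alluxio fs setReplication --max 3 --min 3 /test_ufs.txt
# second call has no visible effect, replication stays at 3
alluxio fs setReplication --max 4 --min 4 /test_ufs.txt
# inspect the file's current replication settings
alluxio fs stat /test_ufs.txt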

Expected behavior
Every setReplication call should take effect, so that the replication number of an existing file can be adjusted repeatedly.

Urgency
Describe the impact and urgency of the bug.

Are you planning to fix it
Yes.

Additional context
properties in values.yaml

properties:
  alluxio.security.stale.channel.purge.interval: 365d
  alluxio.conf.dynamic.update.enabled: true
  alluxio.user.file.metadata.sync.interval: 0
  alluxio.master.mount.table.root.ufs: "hdfs://<hadoop-ip>:9001/alluxio/ufs"
  alluxio.underfs.address: "hdfs://<hadoop-ip>:9001/alluxio/ufs"
  alluxio.underfs.hdfs.configuration: "/secrets/hdfsConfig/core-site.xml:/secrets/hdfsConfig/hdfs-site.xml"
  alluxio.master.journal.ufs.option.alluxio.underfs.hdfs.configuration: "/secrets/hdfsConfig/core-site.xml:/secrets/hdfsConfig/hdfs-site.xml" 
  alluxio.master.journal.ufs.folder: "hdfs://<hadoop-ip>:9001/alluxio/journal"
  alluxio.security.authentication.type: "NOSASL"
  alluxio.security.authorization.permission.enabled: false
  alluxio.debug: true
  alluxio.proxy.s3.v2.version.enabled: false
  alluxio.proxy.s3.v2.async.processing.enabled: false
  alluxio.underfs.hdfs.user: "root"
  alluxio.user.metadata.cache.enabled: true
  alluxio.security.login.username: "root"
JiGuoDing added the type-bug (This issue is about a bug) label on Mar 30, 2025
@JiGuoDing (Author)

I have now discovered new information after executing sudo kubectl logs alluxio-master-0 -c alluxio-master -n cm-alluxio:

2025-04-01 02:58:48,019 WARN master-rpc-executor-TPE-thread-305 - Failed to sync metadata on root path InodeSyncStream{rootPath=LockingScheme{path=/catalog_sales_1_16.dat.backup, desiredLockPattern=READ, shouldSync={Should sync: true, Last sync time: 1743476215769}}, descendantType=NONE, commonOptions=syncIntervalMs: 0
ttl: -1
ttlAction: FREE
, forceSync=false} because it does not exist on the UFS or in Alluxio

I think this might be a metadata synchronization issue: after I adjusted the replica count, the SetReplica job appeared to complete, but the file metadata was not actually fully synchronized.
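As a check I plan to force a metadata sync for the file and then re-inspect it (just a sketch with an example path; loadMetadata and stat are standard fs subcommands in Alluxio 2.x):

# re-load the file's metadata from the UFS, then inspect its attributes
alluxio fs loadMetadata /test_ufs.txt
alluxio fs stat /test_ufs.txt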

@JiGuoDing (Author)

I would like to know whether the alluxio fs setReplication command only applies to new files and does not affect files whose replication number has already been set. If so, how can I manually change the replication number of an existing file in the cache system whose replication number was already set?

Could someone please clarify this for me? I would greatly appreciate it!

yuzhu (Contributor) commented Apr 2, 2025

setReplication changes the number of replicas of an existing file. If the job service is running, it will add replicas up to the target number or remove extra ones. Hope this helps.
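For example (a rough sketch, with an illustrative path), with the job service running:

# ask for exactly 2 replicas of an already-existing file
alluxio fs setReplication --min 2 --max 2 /existing_file
# the file's replication settings should reflect the new target
alluxio fs stat /existing_file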

@JiGuoDing (Author)

@yuzhu thanks for your kind explanation, but I'm still encountering this issue: I can't repeatedly set the replication factor for a newly loaded file. The first setting took effect, but the second attempt didn't, and the job master pod log shows the following:

2025-04-02 04:33:04,187 WARN grpc-default-executor-19 - Exit (Error): run: request=jobConfig: "\254\355\000\005sr\000+alluxio.job.plan.replicate.SetReplicaConfig\031\027\020|\037\027z\302\002\000\003J\000\bmBlockIdI\000\tmReplicasL\000\005mPatht\000\022Ljava/lang/String;xp\000\000\000\000\005\000\000\001\000\000\000\004t\000\027/catalog_sales_1_16.dat"
, Error=alluxio.exception.JobDoesNotExistException: There's SetReplica job running for path:/catalog_sales_1_16.dat blockId:83886081, try later
2025-04-02 04:33:04,188 WARN grpc-default-executor-19 - Exit (Error): run: request=jobConfig: "\254\355\000\005sr\000+alluxio.job.plan.replicate.SetReplicaConfig\031\027\020|\037\027z\302\002\000\003J\000\bmBlockIdI\000\tmReplicasL\000\005mPatht\000\022Ljava/lang/String;xp\000\000\000\000\005\000\000\002\000\000\000\004t\000\027/catalog_sales_1_16.dat"
, Error=alluxio.exception.JobDoesNotExistException: There's SetReplica job running for path:/catalog_sales_1_16.dat blockId:83886082, try later
2025-04-02 04:33:04,188 WARN grpc-default-executor-19 - Exit (Error): run: request=jobConfig: "\254\355\000\005sr\000+alluxio.job.plan.replicate.SetReplicaConfig\031\027\020|\037\027z\302\002\000\003J\000\bmBlockIdI\000\tmReplicasL\000\005mPatht\000\022Ljava/lang/String;xp\000\000\000\000\005\000\000\003\000\000\000\004t\000\027/catalog_sales_1_16.dat"
, Error=alluxio.exception.JobDoesNotExistException: There's SetReplica job running for path:/catalog_sales_1_16.dat blockId:83886083, try later

and the master pod log shows the following:

2025-04-02 04:30:14,297 WARN master-rpc-executor-TPE-thread-119 - Failed to sync metadata on root path InodeSyncStream{rootPath=LockingScheme{path=/default_tests_files/BASIC_NON_BYTE_BUFFER_NO_CACHE_ASYNC_THROUGH, desiredLockPattern=READ, shouldSync={Should sync: true, Last sync time: 1743568213103}}, descendantType=ONE, commonOptions=syncIntervalMs: 0
ttl: -1
ttlAction: FREE
operationId {
mostSignificantBits: 2945768404521208448
leastSignificantBits: -9117608880348615192
}
, forceSync=false} because it does not exist on the UFS or in Alluxio
2025-04-02 04:31:04,196 WARN Master Log Config Report Scheduling - Inconsistent configuration detected. Only a limited set of inconsistent configuration will be shown here. For details, please visit Alluxio web UI or run fsadmin doctor CLI.
Warnings: [InconsistentProperty{key=alluxio.fuse.mount.options, values=allow_other (192.28.132.20:29999, 210.28.132.15:29999, 192.28.132.19:29999, 192.28.132.18:29999, 192.28.132.17:29999, 192.28.132.16:29999, 192.28.132.14:29999, 192.28.132.21:29999), attr_timeout=600,entry_timeout=600 (100.67.102.138:19998)}, InconsistentProperty{key=alluxio.worker.data.server.domain.socket.as.uuid, values=false (100.67.102.138:19998), true (192.28.132.20:29999, 210.28.132.15:29999, 192.28.132.19:29999, 192.28.132.18:29999, 192.28.132.17:29999, 192.28.132.16:29999, 192.28.132.14:29999, 192.28.132.21:29999)}, InconsistentProperty{key=alluxio.fuse.mount.point, values=/mnt/alluxio-fuse (100.67.102.138:19998), /mnt/fuse (192.28.132.20:29999, 210.28.132.15:29999, 192.28.132.19:29999, 192.28.132.18:29999, 192.28.132.17:29999, 192.28.132.16:29999, 192.28.132.14:29999, 192.28.132.21:29999)}]

I executed sudo kubectl exec -ti alluxio-master-0 -c alluxio-master -n cm-alluxio -- alluxio job ls and got:

1743564671463 Replicate COMPLETED
1743564671464 Replicate COMPLETED
1743564671465 Replicate COMPLETED
1743564671466 Replicate COMPLETED
1743564671467 Replicate COMPLETED

Is there something wrong with my metadata that is causing this problem?

JiGuoDing (Author) commented Apr 3, 2025

I tried deploying Alluxio directly on a server, without K8s, and ran into the same issue. When I attempted to adjust the replica number of a new file that had been loaded into the Alluxio cache, the first attempt always worked, but the second attempt failed.
The job_master.log shows the same errors:

[jgd@pasak8s-15 alluxio-2.9.5]$ tail -f logs/job_master.log
2025-04-03 10:05:49,704 INFO grpc-default-executor-1 - Loaded job definition BatchedJobDefinition for config alluxio.job.plan.BatchedJobConfig
2025-04-03 10:05:49,705 INFO grpc-default-executor-1 - Loaded job definition LoadDefinition for config alluxio.job.plan.load.LoadConfig
2025-04-03 10:05:49,706 INFO grpc-default-executor-1 - Loaded job definition MigrateDefinition for config alluxio.job.plan.migrate.MigrateConfig
2025-04-03 10:05:49,707 INFO grpc-default-executor-1 - Loaded job definition PersistDefinition for config alluxio.job.plan.persist.PersistConfig
2025-04-03 10:05:49,708 INFO grpc-default-executor-1 - Loaded job definition MoveDefinition for config alluxio.job.plan.replicate.MoveConfig
2025-04-03 10:05:49,709 INFO grpc-default-executor-1 - Loaded job definition SetReplicaDefinition for config alluxio.job.plan.replicate.SetReplicaConfig
2025-04-03 10:05:49,710 INFO grpc-default-executor-1 - Loaded job definition StressBenchDefinition for config alluxio.stress.job.StressBenchConfig
2025-04-03 10:05:49,711 INFO grpc-default-executor-1 - Loaded job definition CompactDefinition for config alluxio.job.plan.transform.CompactConfig
2025-04-03 10:05:49,711 INFO grpc-default-executor-1 - Loaded job definition NoopPlanDefinition for config alluxio.job.plan.NoopPlanConfig
2025-04-03 10:05:51,347 WARN grpc-default-executor-4 - JobType does not belong to Load, Migrate and Persist
2025-04-03 10:13:49,666 WARN grpc-default-executor-7 - Exit (Error): run: request=jobConfig: "\254\355\000\005sr\000+alluxio.job.plan.replicate.SetReplicaConfig\031\027\020|\037\027z\302\002\000\003J\000\bmBlockIdI\000\tmReplicasL\000\005mPatht\000\022Ljava/lang/String;xp\000\000\000\000&\000\000\000\000\000\000\003t\000\017/clear_cache.py"
, Error=alluxio.exception.JobDoesNotExistException: There's SetReplica job running for path:/clear_cache.py blockId:637534208, try later
2025-04-03 10:14:49,661 WARN grpc-default-executor-7 - Exit (Error): run: request=jobConfig: "\254\355\000\005sr\000+alluxio.job.plan.replicate.SetReplicaConfig\031\027\020|\037\027z\302\002\000\003J\000\bmBlockIdI\000\tmReplicasL\000\005mPatht\000\022Ljava/lang/String;xp\000\000\000\000&\000\000\000\000\000\000\003t\000\017/clear_cache.py"
, Error=alluxio.exception.JobDoesNotExistException: There's SetReplica job running for path:/clear_cache.py blockId:637534208, try later
2025-04-03 10:15:49,660 WARN grpc-default-executor-8 - Exit (Error): run: request=jobConfig: "\254\355\000\005sr\000+alluxio.job.plan.replicate.SetReplicaConfig\031\027\020|\037\027z\302\002\000\003J\000\bmBlockIdI\000\tmReplicasL\000\005mPatht\000\022Ljava/lang/String;xp\000\000\000\000&\000\000\000\000\000\000\003t\000\017/clear_cache.py"

By the way, my UFS is HDFS, and my alluxio-site.properties is below:

# Common properties
alluxio.master.hostname=pasak8s-15
alluxio.master.mount.table.root.ufs=hdfs://pasak8s-15:9001/local-alluxio
alluxio.underfs.hdfs.configuration=/data1/jgd/software/hadoop-3.3.6/etc/hadoop/core-site.xml:/data1/jgd/software/hadoop-3.3.6/etc/hadoop/hdfs-site.xml
alluxio.debug=true
alluxio.conf.dynamic.update.enabled: true

# Security properties
# alluxio.security.authorization.permission.enabled=true
# alluxio.security.authentication.type=SIMPLE

# Worker properties
alluxio.worker.ramdisk.size=1GB
alluxio.worker.tieredstore.levels=1
alluxio.worker.tieredstore.level0.alias=MEM
alluxio.worker.tieredstore.level0.dirs.path=/data1/jgd/data/local-alluxio
alluxio.worker.memory.size=32GB 
alluxio.worker.block.manager.type=EMBEDDED 
alluxio.master.web.address=pasak8s-15:19999 
alluxio.worker.web.address=pasak8s-15:30000 
alluxio.worker.memory.freeSpaceThreshold=0.1 
alluxio.worker.memory.maxAllocatableSpace=80% 

# User properties
alluxio.user.file.metadata.sync.interval: 0
