LU-1101: Incorrect permission handling when creating existing directories

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Critical
    • Resolution: Duplicate
    • Affects Version/s: Lustre 2.1.0
    • Fix Version/s: None
    • Labels:
      None
    • Environment:
      Lustre 2.1 on clients and servers, Scientific Linux 5
    • Severity:
      3
    • Bugzilla ID:
      23459
    • Rank:
      4012

      Description

      Lustre seems to handle permissions on mkdir incorrectly in some cases. This makes it hard (or impossible) to use the Torque scheduler directly on top of a Lustre filesystem. This is in fact a copy of Bugzilla bug #23459, which we reported some time ago for the 1.8 branch; however, it looks like the bug is still present in 2.1. All the symptoms described in Bugzilla are identical, and the reproducer code provided by Lukasz Flis still triggers the issue.

          Activity

          Marek Magrys added a comment - - edited

          To clarify:
          The problem occurs when the Torque (pbs_mom) $tmpdir setting in /var/torque/mom_priv/config points to a Lustre filesystem (in our case, $tmpdir /mnt/lustre/scratch/jobs). We occasionally get errors like:

          Feb 14 13:56:23 n6-4-16 pbs_mom: LOG_ERROR::Permission denied (13) in TMakeTmpDir, Unable to make job transient directory: /mnt/lustre/scratch/jobs/18555647.batch.grid.cyf-kr.edu.pl
          Feb 14 14:37:35 n6-4-16 pbs_mom: LOG_ERROR::Permission denied (13) in TMakeTmpDir, Unable to make job transient directory: /mnt/lustre/scratch/jobs/18557701.batch.grid.cyf-kr.edu.pl
          Feb 14 14:38:17 n6-4-16 pbs_mom: LOG_ERROR::Permission denied (13) in TMakeTmpDir, Unable to make job transient directory: /mnt/lustre/scratch/jobs/18557716.batch.grid.cyf-kr.edu.pl
          Feb 14 14:50:46 n6-4-16 pbs_mom: LOG_ERROR::Permission denied (13) in TMakeTmpDir, Unable to make job transient directory: /mnt/lustre/scratch/jobs/18559037.batch.grid.cyf-kr.edu.pl
          Feb 14 15:01:44 n6-4-16 pbs_mom: LOG_ERROR::Permission denied (13) in TMakeTmpDir, Unable to make job transient directory: /mnt/lustre/scratch/jobs/18559949.batch.grid.cyf-kr.edu.pl

          An example output of the reproducer:

          [b14flis@n6-4-16 repro]$ ./a.out /mnt/lustre/scratch/jobs/
          Iteration: 1
          Creating directory: /mnt/lustre/scratch/jobs/1804289383
          mkdir(/mnt,mode) errno: 17
          mkdir(/mnt/lustre,mode) errno: 17
          mkdir(/mnt/lustre/scratch,mode) errno: 17
          mkdir(/mnt/lustre/scratch/jobs,mode) errno: 13
          mkdirtree: failed: rc=13
          sleeping for 2 seconds

          Iteration: 2
          doing stat before creating directory
          Creating directory: /mnt/lustre/scratch/jobs/846930886
          mkdir(/mnt,mode) errno: 17
          mkdir(/mnt/lustre,mode) errno: 17
          mkdir(/mnt/lustre/scratch,mode) errno: 17
          mkdir(/mnt/lustre/scratch/jobs,mode) errno: 17
          mkdirtree: successful: rc=0

          ERROR: inconsistency detected: previous rc: 13 vs current rc: 0
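
          The reproducer itself is not included in this ticket. The create-every-component pattern visible in the output above can be sketched roughly as follows. This is an illustrative sketch only, not the reproducer by Lukasz Flis: the name mkdirtree is borrowed from the output, while the Python rendering and the return convention are assumptions. On Linux local filesystems, mkdir on an existing directory normally returns EEXIST (17), which the walk tolerates; the bug shows up as EACCES (13) on a component that already exists.

```python
import errno
import os


def mkdirtree(path, mode=0o777):
    """Create every component of path in turn.

    Returns 0 on success, or the errno of the first mkdir that fails
    with anything other than EEXIST (hypothetical convention).
    """
    current = "/" if path.startswith("/") else ""
    for part in [p for p in path.split("/") if p]:
        current = os.path.join(current, part)
        try:
            os.mkdir(current, mode)
        except FileExistsError:
            # EEXIST (17): the component is already there, keep walking.
            continue
        except OSError as exc:
            # Any other error, e.g. EACCES (13), aborts the walk.
            return exc.errno
    return 0
```

          Calling mkdirtree twice on the same path would be expected to return 0 both times; a return value of 13 for a path whose parents already exist corresponds to the failure in iteration 1.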

          Lukasz Flis added a comment -

          Hi,

          One of our users running the Quantum Espresso application hit the bug today.
          The user had set the outdir variable to her directory on the Lustre filesystem.

          %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
          task # 39
          from parallel_mkdir : error # 1
          /mnt/lustre/scratch/people/xuser/ non existent or non writable
          %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

          %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
          task # 14
          from parallel_mkdir : error # 1
          /mnt/lustre/scratch/people/xuser/ non existent or non writable
          %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% % %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
          task # 44
          from parallel_mkdir : error # 1
          /mnt/lustre/scratch/people/xuser/ non existent or non writable
          %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

          The strace dump showed that the mkdir result was:
          wrapper.6736.3477:mkdir("/mnt/lustre/scratch/people/xuser/", 0777) = -1 EACCES (Permission denied)

          After doing a stat on the directory before invoking the application, the problem disappeared:
          wrapper.23170.3574:mkdir("/mnt/lustre/scratch/people/xuser/", 0777) = -1 EEXIST (File exists)
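
          The observation above suggests a client-side workaround: stat the directory before calling mkdir. A minimal sketch, assuming the preceding stat is what makes the subsequent mkdir return EEXIST as in the second strace line. The helper name mkdir_after_stat is hypothetical, and this has only been observed to help here; it is not a confirmed fix.

```python
import os


def mkdir_after_stat(path, mode=0o777):
    """stat() the target, then mkdir it; True if the directory exists afterwards."""
    try:
        # The lookup that was observed to make the later mkdir behave.
        os.stat(path)
    except FileNotFoundError:
        # Target does not exist yet; the mkdir below will create it.
        pass
    try:
        os.mkdir(path, mode)
    except FileExistsError:
        # EEXIST is the expected answer for an existing directory.
        pass
    return os.path.isdir(path)
```

          Any other OSError (including EACCES) is deliberately left to propagate so that a remaining permission failure stays visible to the caller.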

          Cheers,

          Kit Westneat (Inactive) added a comment -

          To answer the last question in the bugzilla report, the code that causes this bug was added here as an MDS optimization:
          https://bugzilla.lustre.org/show_bug.cgi?id=18534

          Lukasz Flis added a comment -

          Hi,

          Just to update:

          We have tested this, and the problem does not appear with 2.2.0 clients.
          However, 2.1.1 clients with 2.2 servers are still affected by the issue.

          Lukasz Flis added a comment -

          Hello,

          2.2.0 clients are not usable for us yet (one unreported LBUG).

          Is there any plan to include a fix for this issue in the upcoming 2.1.2?

          Andreas Dilger added a comment -

          Closing as a duplicate of LU-4185.


            People

            • Assignee:
              HPDD Triage
            • Reporter:
              Marek Magrys
            • Votes:
              1
            • Watchers:
              4