On Sun, 30 Oct 2011, Krisztian Krajczar wrote:

> Hi all,
>
> this is a follow up on the failing Pede job reported in my earlier email
> (marked by an asterisk in the list of geometries). A simple resubmit did not
> help, the job did not finish. Thus, as of now, I was able to produce the
> final geometries for 5 out of 6 scenarios.

Hi Krisztian,

the resubmission has been done as jobm3 instead of jobm2, right?
I do not really know what went wrong, jobData/jobm3/STDOUT simply states

sh: line 1: 32417 Killed /afs/cern.ch/user/c/ckleinw/bin/rev81/pede_32GB pedeSteerMaster.txt > pede.dump

You should have received an email from LSF about this job - does this tell
you something? Otherwise you should follow Claus' advice:

> To maximize the (debug) output for a failing PEDE one should prevent
> buffering:
> export GFORTRAN_UNBUFFERED_ALL=1 (in run script)

i.e. add this line at the beginning of jobData/jobm3/theScript.sh and
resubmit:

mps_retry.pl -m FAIL # OK, will reset all failed pede jobs
                     # better set back to FAIL in mps.db jobm3 or 2
mps_fire.pl -m

Then we will have at least better diagnostics...

BTW: Meanwhile the CPU limit of the millepede queues should have been
doubled. Could you retry to run the original starting-from-ideal pede job,
including the bows (best to add 'export GFORTRAN_UNBUFFERED_ALL=1' in case
it fails again)?

Cheers
Gero

> I would appreciate any comment on how to proceed with the failing job.
>
> Thanks,
> Krisztian
>
>> all the pede jobs have finished; let me summarize here the locations of the
>> various geometries.
>>
>> starting from current MC scenario:
>> mp0896/jobData/jobm/alignments_MP.db (default)
>> mp0896/jobData/jobm1/alignments_MP.db (cosmics weighted up)
>> mp0896/jobData/jobm2/alignments_MP.db [*] (cosmics weighted down)
>>
>> Starting from ideal alignment (the bow-misalignment and the bow
>> determination were removed from the alignment jobs, according to the "1a)"
>> recipe):
>> mp0899/jobData/jobm/alignments_MP.db (default)
>> mp0899/jobData/jobm1/alignments_MP.db (cosmics weighted up)
>> mp0899/jobData/jobm2/alignments_MP.db (cosmics weighted down)
>>
>> I have checked again the Pede dumps of these jobs, and found that [*]
>> failed with the following message (all the other jobs ended correctly!):
>> ------------------
>> Record 26600000 ... still reading
>>
>> Read cache usage (#blocks, #records, min,max records/block
>> 11062 26693609 989 2725
>> Write cache usage (#flush,#overrun,,peak(levels))
>> 88496 3492, 75.6% 74.4% 106.7% 197.5%
>>
>> Data rejected in initial loop:
>> 85 (rank deficit/NaN) 0 (Ndf=0) 930 (huge) 382 (large)
>> ------------------
>>
>> I will try to simply resubmit this job. The other geometries are final.
>>
>> Cheers,
>> Krisztian
>>
>>> method "1a)" does work, the Pede job finished correctly (output db file is
>>> at
>>> /afs/cern.ch/cms/CAF/CMSALCA/ALCA_TRACKERALIGN/MP/MPproduction/mp0899/jobData/jobm).
>>>
>>> I go on with the submission of the weighted cosmics samples.
>>>
>>> Cheers,
>>> Krisztian
>>>
>>>> method "1)" did not work, the Pede job failed again with the same
>>>> symptoms. The output in pede.dump simply stops again without reaching the
>>>> end:
>>>>
>>>> Record 12900000 ... still reading
>>>> Record
>>>>
>>>> The dump is at
>>>> /afs/cern.ch/cms/CAF/CMSALCA/ALCA_TRACKERALIGN/MP/MPproduction/mp0898/jobData/jobm
>>>>
>>>> I will proceed with your other method, "1a)".
>>>>
>>>> Cheers,
>>>> Krisztian
>>>>
>>>>> thanks for the comments!
>>>>>
>>>>> I will modify the alignment_x.py config files according to your
>>>>> suggestion "1)".
>>>>>
>>>>> I have moved the diagnostic files to a backup directory for future
>>>>> reference:
>>>>> /afs/cern.ch/cms/CAF/CMSALCA/ALCA_TRACKERALIGN/MP/MPproduction/mp0897/backup_failingJobVer1
>>>>>
>>>>> Cheers,
>>>>> Krisztian
>>>>>
>>>>>>> The mps_stat.pl command reports that the Pede job for the alignment of
>>>>>>> ideal geometry failed. However, there are outputs produced in the
>>>>>>> directory you indicated in your earlier email.
>>>>>>>
>>>>>>> I have checked the Pede dump in search of any errors, but found no
>>>>>>> errors.
>>>>>>
>>>>>> Hi Krisztian,
>>>>>> (adding Claus as pede expert asking for advice in the end)
>>>>>> indeed this is the first file to look into. And it does not look
>>>>>> healthy, but simply stops at some point - the last line should be
>>>>>> something like
>>>>>>
>>>>>> < Millepede II-P ending ... Wed Oct 26 22:52:11 2011
>>>>>>
>>>>>> as in mp0896/jobData/jobm/pede.dump. MPS looks for that line and
>>>>>> reports failure since it is not there.
>>>>>>
>>>>>>> The memory usage was normal, although it was slightly higher than for
>>>>>>> the previous alignments:
>>>>>>>
>>>>>>> Memory space: total 32.000000 GB
>>>>>>> used 31.226771 GB = 97.58 %
>>>>>>>
>>>>>>> In STDOUT I found a possible source of the "fail" report of the
>>>>>>> mps_stat.pl. One of the automatic root macros failed to run:
>>>>>>>
>>>>>>> ---------
>>>>>>> Processing readPedeHists.C+("print nodraw")...
>>>>>>> Info in : creating shared library
>>>>>>> /pool/lsf/krajczar/182146920/./readPedeHists_C.so
>>>>>>> Error in : failed reading x-y-dx-dy
>>>>>>> content
>>>>>>> ---------
>>>>>>
>>>>>> Before that I see
>>>>>>
>>>>>> sh: line 1: 27036 CPU time limit exceeded
>>>>>> /afs/cern.ch/user/c/ckleinw/bin/rev81/pede_32GB pedeSteerMaster.txt >
>>>>>> pede.dump
>>>>>>
>>>>>> and that tells us the reason why pede did not run through - it is a
>>>>>> serious problem!
>>>>>> It is also stated in alignment.log.gz from CMSSW:
>>>>>>
>>>>>> %MSG-e Alignment: AfterModEndJob PedeReader() 28-Oct-2011 07:12:10 CEST PostEndRun
>>>>>> Problem opening pede output file millepede.res
>>>>>> %MSG
>>>>>> %MSG-i Alignment: AfterModEndJob PedeReader::read() 28-Oct-2011 07:12:10 CEST PostEndRun
>>>>>> will read parameters for run range 1 - 4294967295
>>>>>> %MSG
>>>>>> %MSG-i Alignment: AfterModEndJob PedeReader::read() 28-Oct-2011 07:12:10 CEST PostEndRun
>>>>>> 0 parameters for 0 alignables
>>>>>>
>>>>>> What you point to is a consequence of that: pede did not run through,
>>>>>> so millepede.his, with histogram-like information about the pede job,
>>>>>> is not well formed and cannot be correctly converted into ROOT/.ps -
>>>>>> and that is where the error you see comes from.
>>>>>>
>>>>>>> For the previous rounds of alignments this problem did not appear.
>>>>>>>
>>>>>>> Reference:
>>>>>>> /afs/cern.ch/cms/CAF/CMSALCA/ALCA_TRACKERALIGN/MP/MPproduction/mp0897/jobData/jobm/pede.dump
>>>>>>> /afs/cern.ch/cms/CAF/CMSALCA/ALCA_TRACKERALIGN/MP/MPproduction/mp0897/jobData/jobm/STDOUT
>>>>>>>
>>>>>>> Is this a serious issue? Can I submit the Pede jobs for the weighted
>>>>>>> samples regardless of this error?
>>>>>>
>>>>>> The question is:
>>>>>> Why does it need more CPU starting from ideal (but with bows)?
>>>>>> Internally it is using an iterative procedure (MINRES) for solving the
>>>>>> big matrix - and this is done three (4?) times with your settings.
>>>>>> Then after each solving there is a line search in 1D. Procedures like
>>>>>> that tend to have difficulties if we start too close to the final
>>>>>> result (needing more MINRES iterations - see e.g. the last page of
>>>>>> mp0896/jobData/jobm/millepede.his.ps.gz for how much this can vary in
>>>>>> a successful job)...
>>>>>>
>>>>>> So - what to do?
>>>>>>
>>>>>> 1) We can introduce a bit of noise in the procedure by adding some
>>>>>>    random misalignment.
>>>>>> 1a) If that does not help, we could remove the bow-misalignment and the
>>>>>>     bow determination from the alignment job - in the very end we could
>>>>>>     probably use the bows that are the result of the jobs starting from
>>>>>>     current MC scenario
>>>>>> 2) I'll ask for a larger CPU limit on the special millepede queue.
>>>>>>
>>>>>> Claus - do you have another suggestion?
>>>>>>
>>>>>> about 1)
>>>>>> add to configs
>>>>>> process.AlignmentProducer.doMisalignmentScenario = True
>>>>>> process.AlignmentProducer.MisalignmentScenario = cms.PSet(
>>>>>>     setRotations = cms.bool(True),
>>>>>>     setTranslations = cms.bool(True),
>>>>>>     seed = cms.int32(1234567),
>>>>>>     distribution = cms.string('gaussian'), #fixed'),
>>>>>>     setError = cms.bool(True),
>>>>>>     TIBBarrels = cms.PSet(DetUnits = cms.PSet(
>>>>>>         dXlocal = cms.double(0.001))
>>>>>>     )
>>>>>>     # same for TIDEndcaps, TECEndcap, TPBBarrels and TPEEndcaps
>>>>>>     # but leave out TOB for now
>>>>>> )
>>>>>> about 1a)
>>>>>> - set up a new directory
>>>>>> - remove process.trackerBowedSensors stuff from startgeometry.txt
>>>>>> - deselect the bow parameters in alignables.txt:
>>>>>>   * set the last three '1' to '0' for single sensors (SelectorBowed)
>>>>>>   * remove 'SelectorTwoBowed' and add double sensor modules (TOB, outer
>>>>>>     TEC) to SelectorBowed with '101111 000' parameterisation.
>>>>>>
>>>>>>
>>>>>> Cheers
>>>>>>
>>>>>> Gero
>>>>>>
>>>>>> --
>>>>>> -----------------------------------------------------------------------
>>>>>> Gero Flucke
>>>>>> - Analysis Centre, Helmholtz Alliance "Physics at the Terascale"
>>>>>> * Statistics Tools
>>>>>> - CMS: Tracker Alignment Convenor
>>>>>> DESY/CMS, Notkestr. 85, D-22607 Hamburg, Germany
>>>>>> Bldg. 1e, Rm.
>>>>>> 02.501
>>>>>> phone: +49 (0)40 8998 3525
>>>>>> fax: +49 (0)40 8998 3092
>>>>>>
>>>>>
>>>>
>>>
>>
>
--
-----------------------------------------------------------------------
Gero Flucke
- Analysis Centre, Helmholtz Alliance "Physics at the Terascale"
* Statistics Tools
- CMS: Tracker Alignment Convenor
DESY/CMS, Notkestr. 85, D-22607 Hamburg, Germany
Bldg. 1e, Rm. 02.501
phone: +49 (0)40 8998 3525
fax: +49 (0)40 8998 3092
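
Note on the diagnostics discussed in this thread: a finished pede.dump ends with a "Millepede II-P ending" line, and when pede is terminated the shell writes a "Killed" or "CPU time limit exceeded" message to STDOUT. The sketch below only illustrates that two-step check in Python. It is not the MPS implementation; the function name diagnose_pede_job, the constant name and the regular expression are assumptions made for this example.

```python
import re

# End marker that a complete pede.dump contains, as quoted in the thread.
PEDE_END_MARKER = "Millepede II-P ending"

# Shell messages of the form "sh: line 1: <pid> Killed ..." or
# "sh: line 1: <pid> CPU time limit exceeded ..." (pattern is an assumption
# based on the two STDOUT excerpts quoted in the thread).
SHELL_MESSAGE = re.compile(
    r"sh: line \d+:\s*\d+\s+(Killed|CPU time limit exceeded)"
)


def diagnose_pede_job(dump_text, stdout_text):
    """Hypothetical helper: return (finished, reason) for one pede job.

    dump_text   -- content of pede.dump
    stdout_text -- content of the job's STDOUT
    """
    if PEDE_END_MARKER in dump_text:
        return True, "pede.dump contains the end marker"
    match = SHELL_MESSAGE.search(stdout_text)
    if match:
        return False, match.group(1)
    return False, "no end marker in pede.dump and no shell message in STDOUT"


if __name__ == "__main__":
    dump = "Record 12900000 ... still reading\nRecord\n"
    stdout = "sh: line 1: 27036 CPU time limit exceeded\n"
    print(diagnose_pede_job(dump, stdout))
```

Called with the contents of jobData/jobm3/pede.dump and jobData/jobm3/STDOUT, such a helper would distinguish a job that hit the CPU limit from one that was killed for another reason; the LSF email remains the authoritative source for why a job was killed.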