NDS Troubleshooting Methods.
Start from a server that contains a Read/Write or Master Replica of ROOT.
At the console prompt of this server, type "Time".
The third line from the bottom will say "Time is/is not synchronized to the network"
If time is not synchronized to the network, see TID 2908867 for troubleshooting steps.
Check the version of dsrepair you are using (top middle of the screen). If it is earlier than 4.35, immediately get 41nds9 or later and upgrade DS and dsrepair to the latest released versions. Novell has improved both DS and DSRepair significantly since the release of NetWare 4.10.
If you are on 4.35 or later:
Run "Time Synchronization"
If this pauses while running, suspect that the server it is trying to contact has communication problems. For example, it could be down.
When this finishes, you will be viewing the dsrepair log file.
On this screen check the following:
1. Check the second column for versions of DS. Make sure all servers are running DS 5.01 or later. Make sure all servers are on the same version.
2. Check the left-hand column for extra servers, missing servers, and old dead servers that are still listed. If a server that crashed or was removed is still listed, see TID 2908056 on removing a server.
3. Time is in sync? Check the 5th column over to see that all servers say they are in sync. If any aren't, see TID 2908867 for ideas on timesync. If you recently fixed time, give the servers 10 minutes to run their time poll before troubleshooting further. If one server can't get time no matter what you try, check that server's timesync.cfg in the system directory and if needed, use configured sources on it to point it to the reference or single reference in your tree.
4. Check the 4th column for Time Server Types: verify that there exists either single and secondaries OR reference, primaries and secondaries. See TID 2908056 for ideas and sample configurations.
Make sure a single reference or reference time server exists.
5. Check the 3rd column for any replica depth of -1 on a server that should contain a replica. If a server is having difficulty getting a replica, or its replica is in a transition or new state, it will often show up on this screen with a -1 for replica depth. A server that doesn't contain a replica will normally also show a depth of -1. This can help you find which server to suspect if you don't already know which one is causing the problems. Replica depth shows the highest-level replica that the server holds. Example: if a server contains [Root], it always reports a 0. If a server contains a replica at the Organization level, but no copy of [Root], it will report 1, and so on. If there is not a partition at the Org level, but there is at the OU below that, any servers holding replicas of that partition will report a 2.
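Two of the checks above (items 4 and 5) can be sketched in a few lines of Python. This is only an illustration, not a Novell tool: the partition names, the type labels ("Single", "Reference", "Primary", "Secondary") and both function names are assumptions, and the values would have to be copied by hand from the dsrepair screen.

```python
def replica_depth(partition_roots):
    """Return the depth a server would report for the partitions it holds.

    partition_roots: names of the partition roots the server holds replicas
    of, e.g. "[Root]", "O=Acme", "OU=Sales.O=Acme" (hypothetical names).
    """
    if not partition_roots:
        return -1  # no replica on this server
    depths = []
    for name in partition_roots:
        if name.strip().lower() == "[root]":
            depths.append(0)
        else:
            # Each dot-separated component is one level below [Root].
            depths.append(len(name.strip(".").split(".")))
    return min(depths)  # the highest-level replica decides the value


def time_types_ok(server_types):
    """Item 4: one single plus secondaries, OR reference, primaries, secondaries.

    Assumption: a reference server needs at least one primary alongside it.
    """
    kinds = [t.strip().lower() for t in server_types]
    present = set(kinds)
    if "single" in present:
        return kinds.count("single") == 1 and present <= {"single", "secondary"}
    if "reference" in present:
        return "primary" in present and present <= {"reference", "primary", "secondary"}
    return False  # no single reference or reference server at all
```

For example, a server holding only "OU=Sales.O=Acme" gives a depth of 2, and a column reading Secondary on every server fails the type check because no single reference or reference server exists.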
Possible Errors on this screen:
-251 errors to any pre-4.x servers (2.2, 3.1x, etc.). This reflects servers contacted for timesync that didn't understand the request. This is normal.
-251 errors to any 4.x servers. This represents a problem. See TID 1004643. Usually this is caused by an out-of-date dsprcfx.nlm from earlier than May 1995.
-625 errors to any servers. These are most often caused by lack of reliable communication to a server. It could be down, it could have an old .LAN driver, or it could have out-of-date LAN support NLMs (nbi.nlm, ethertsm.nlm, msm.nlm). Update to the latest LANDRx.exe update kit from Novell, and get the latest .LAN driver from the NIC manufacturer.
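The error rules for this screen can be written down as a small lookup. A minimal sketch, assuming the operator supplies the error code and the major NetWare version of the server it points at; the function name and the return strings are hypothetical.

```python
def classify_timesync_error(code, netware_major_version):
    """Map an error seen on the Time Synchronization log to the guidance above.

    code: numeric error, e.g. -251 or -625.
    netware_major_version: major NetWare version of the server the error
    refers to (2, 3, 4, ...), supplied by the operator.
    """
    if code == -251:
        if netware_major_version < 4:
            return "normal: pre-4.x server did not understand the timesync request"
        return "problem: see TID 1004643 (often an out-of-date dsprcfx.nlm)"
    if code == -625:
        return "communication: check the server is up, then update LAN drivers and support NLMs"
    return "not covered here: look the code up in the error documentation"
```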
Next: Run "Report Synchronization Status"
This report follows the format:
partition: root (or whichever)
Replica: server.context Date Time
[server][CN: error server or object] [error][remote]
Check the upper right-hand corner of the menu bar, third line down, to see whether any errors were reported. If there are, check the right-hand column to see what they are.
-603 errors: check the version of DS on those servers. We have seen that most -603 errors on this screen are not the dstrace -603s, but reflect older DS versions on those servers.
-625 errors: if the error is marked [remote], make sure those servers are up and that the latest LAN and support modules (ethertsm, msm) are loaded on them. If the error is local (non-remote), run "repair all network addresses" and "verify remote IDs" from the advanced options menu of dsrepair.
-672 errors: check the replica ring on this server for the partition in question, and compare it to the replica ring on the server giving the -672 error.
See DSDOCx on the web for explanations of all error codes and possible solutions for them.
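When the report lists many errors, it can help to group servers by the follow-up each error calls for. The sketch below assumes the errors have been copied by hand into (server, code, is_remote) tuples; it does not parse the dsrepair log, and the function names and action strings are made up for illustration.

```python
from collections import defaultdict


def sync_status_action(code, is_remote=False):
    """Return the follow-up suggested above for one error on this screen."""
    if code == -603:
        return "check the DS version on that server"
    if code == -625:
        if is_remote:
            return "confirm the remote server is up and has current LAN and support modules"
        return "run 'repair all network addresses' and 'verify remote IDs' in dsrepair"
    if code == -672:
        return "compare this server's replica ring with the ring on the server giving the error"
    return "look the code up in DSDOCx"


def plan_followups(errors):
    """Group servers by follow-up action.

    errors: iterable of (server, code, is_remote) tuples transcribed from
    the Report Synchronization Status screen.
    """
    plan = defaultdict(list)
    for server, code, is_remote in errors:
        plan[sync_status_action(code, is_remote)].append(server)
    return dict(plan)
```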
Next: Select "Replica and Partition Operations"
Are any replicas in a state other than ON? If so, which state?
If a replica on this server is in a NEW REPLICA state, hit enter on it, select "View Replica Ring", find the server containing the Master replica, rconsole to that server and check dstrace for errors. When a replica is in a NEW REPLICA state, it is the master replica's responsibility to get a copy of all relevant data to this server. Thus, we check the master when a replica is in a new state.
If a replica is in a TRANSITION ON state, check this server's dstrace screen for errors. When a replica is in a TRANSITION ON state, it is that server's responsibility to verify that it contains the latest information and turn to an ON state.
If there exists a SPLIT STATE on a partition, view the ring and see that all replicas are in a SPLIT STATE. If they are not, check the servers that aren't to see if their replica rings are consistent. Then check the child partition, if it exists, and see that its ring is consistent and that every server in the parent ring is in the child ring.
If a replica is in a JOIN STATE, check the replica rings for the parent and child to make sure they are the same. The data for both the parent and child partitions has to exist on all servers in the parent ring for a merge partition operation to complete. Merging partitions is easier if both replica rings, the parent and child, are identical: where a read/write replica of the parent exists, a read/write replica of the child does also, and so on.
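The ring comparison described for SPLIT and JOIN states can be sketched as a dictionary comparison. This assumes each ring has been copied by hand from "View Replica Ring" into a mapping of server name to replica type; the function name and the server names in the usage note are hypothetical.

```python
def ring_problems(parent_ring, child_ring):
    """List differences between a parent and a child replica ring.

    parent_ring, child_ring: dicts mapping server name to replica type,
    e.g. {"FS1": "Master", "FS2": "Read/Write"}.
    """
    problems = []
    for server, replica_type in parent_ring.items():
        if server not in child_ring:
            # Every server in the parent ring needs the child's data.
            problems.append(f"{server}: in parent ring but missing from child ring")
        elif child_ring[server] != replica_type:
            # Not fatal by itself, but identical rings make a merge easier.
            problems.append(
                f"{server}: parent replica is {replica_type}, "
                f"child replica is {child_ring[server]}"
            )
    for server in child_ring:
        if server not in parent_ring:
            problems.append(f"{server}: in child ring but missing from parent ring")
    return problems
```

With a parent ring of FS1 (Master) and FS2 (Read/Write) and a child ring holding only FS1, the function reports FS2 as missing from the child ring. An empty list means the two rings are identical.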