To ensure the availability and reliability of your SharePoint Server
2010 environment, you must actively monitor the physical platform, the
operating system, and all important SharePoint Server 2010 services.
Preventative maintenance will help you identify potential errors before
an error causes problems with the operation of your SharePoint
environment. Preventative maintenance combined with disaster recovery
planning and regular backups will help minimize problems if they occur.
Monitoring your SharePoint environment involves checking for problems
with connections, services, server resources, and system resources. You
can also set alerts to notify administrators when problems occur.
Windows Server and SharePoint Server 2010 provide many monitoring tools
and services to ensure that your SharePoint environment is running
smoothly.
The following maintenance tasks let you establish criteria for normal
behavior of your environment and detect abnormal activity. It is
important to implement these daily maintenance tasks so that you can
capture and maintain data about your SharePoint environment, such as
usage levels, possible performance bottlenecks, and administrative
changes. The following sections describe specific monitoring tasks, which
map to the checklists described below.
1. Diagnostic Logging
The Unified Logging Service (ULS) provides a single,
centralized location for logging error and informational messages
related to SharePoint Server and SharePoint solutions. Systems
administrators have one place to look when they need to troubleshoot an
issue or monitor the overall health of the environment.
SharePoint Server 2010 includes improvements to the management of the
Unified Logging Service logs (the ULS logs, also called Trace Logs) that
make it easier for administrators to troubleshoot issues. These are
described in the following sections.
2. Event Throttling
Event throttling enables administrators to control the types of
events that SharePoint Server logs, based on the level of severity. The
administration of throttling is divided into two sections:
1. Destination
Log entries can be reported in two places. The first is the “Event Log”,
which is the standard Windows Event Log. Administrators can use the
Windows Event Viewer application to review entries. The second is the
ULS or “Trace Log”, a text-based log format that is specific to
SharePoint Server and is stored on the file system. The default location
is C:\Program Files\Common Files\Microsoft Shared\Web Server
Extensions\14\LOGS.
2. Category
The event throttling dial can be applied to specific categories which
map directly to SharePoint Server functionality. This enables the
administrator to increase the logging detail for SharePoint components
individually, thereby managing the size of the logs and the amount of
information to review.
The default settings for all categories are as follows:
• Event Log: Information
• Trace Log: Medium Level
During normal operation, these settings are an appropriate balance of
detail and performance. During substantial reconfiguration of
SharePoint Server, during the installation of custom solutions, or when
SharePoint Server is experiencing issues, the throttling level should be
lowered so that more detailed events are logged (for example, by choosing
the Verbose level). This ensures that as much information as possible is
available for troubleshooting.
Finally, after completing any troubleshooting, logging can be
returned to the default by selecting the “Reset to default” option in
the throttling drop-downs. Settings that are not currently configured
with the default option will appear in a bold font.
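The per-category throttling described above can be pictured as a severity threshold check. The following Python sketch is only an illustration of that idea and is not SharePoint code: the function name, the category names, and the ordering of the level names are assumptions made for the example.

```python
# Trace-log severities, ordered from most to least severe.
# The list and its ordering are assumptions made for illustration.
TRACE_LEVELS = ["Unexpected", "Monitorable", "High", "Medium", "Verbose"]

# The document states that the default trace-log level is Medium.
DEFAULT_TRACE_LEVEL = "Medium"


def should_log(event_level, category, thresholds):
    """Return True if the event is at least as severe as its category's threshold."""
    threshold = thresholds.get(category, DEFAULT_TRACE_LEVEL)
    return TRACE_LEVELS.index(event_level) <= TRACE_LEVELS.index(threshold)


# Hypothetical override: only the "Search" category is made more verbose.
thresholds = {"Search": "Verbose"}

print(should_log("Verbose", "Search", thresholds))    # True
print(should_log("Verbose", "Workflow", thresholds))  # False
print(should_log("High", "Workflow", thresholds))     # True
```

Raising the detail for a single category, as in the example, keeps the other categories at the default and so limits how quickly the logs grow.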
Correlation IDs
Correlation IDs are GUIDs that are assigned to events which occur during
the lifecycle of a resource request. This value is surfaced within
error messages, the ULS logs, and tools like the Developer Dashboard.
This value helps an administrator locate and isolate a specific request
across the ULS log, Usage Logging database, and SQL Server Profiler data
sets for debugging purposes.
For example, administrators can take the Correlation ID that appears on
an error page in their browsers and then rapidly locate any related
entries in the ULS logs through a simple search.
Correlation IDs also span machine boundaries. If a request, such as a
front-end Web server calling a Web service on an application server,
crosses a machine boundary, the assigned Correlation ID can provide a
complete overview of activities during the lifecycle of the request.
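The "simple search" mentioned above can be as basic as scanning the trace log files for the Correlation ID text. The following Python sketch shows one way to do that. It assumes the log files are readable as UTF-8 text and have a .log extension; the function names, the sample rows, and the GUID are invented for the example and do not reproduce the real log layout.

```python
from pathlib import Path


def find_by_correlation_id(lines, correlation_id):
    """Yield log lines that mention the given Correlation ID (case-insensitive)."""
    needle = correlation_id.strip().lower()
    for line in lines:
        if needle in line.lower():
            yield line.rstrip("\n")


def search_log_folder(folder, correlation_id):
    """Search every *.log file in a folder; return (file name, line) pairs."""
    matches = []
    for path in sorted(Path(folder).glob("*.log")):
        # errors="replace" keeps the scan going if a file has unexpected bytes.
        with path.open(encoding="utf-8", errors="replace") as handle:
            for line in find_by_correlation_id(handle, correlation_id):
                matches.append((path.name, line))
    return matches


# Invented sample rows; real trace log rows contain more columns.
sample = [
    "w3wp.exe\tGeneral\tPage request started\t5ca5a6c3-0c7e-4a2b-9f10-3b1d2e4f5a6b\n",
    "OWSTIMER.EXE\tTimer\tJob ran\t00000000-0000-0000-0000-000000000000\n",
]
hits = list(find_by_correlation_id(sample, "5CA5A6C3-0C7E-4A2B-9F10-3B1D2E4F5A6B"))
print(len(hits))  # 1
```

Because a request can cross machine boundaries, the same search would need to be run against the log folder on each server that took part in the request.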
Event Log Flood Protection
Event Log Flood Protection prevents the “Event Log” from being
overwhelmed with many repetitive events. When Event Log Flood Protection
is enabled (the default), it begins suppressing an event after that
event has been logged five times within two minutes. After an additional
two minutes, it logs a summary event that describes the number of times
that the event would have been repeated. An administrator can modify
these thresholds.
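The suppression behavior described above can be simulated in a few lines. The following Python sketch is a simplified model, not the actual SharePoint implementation: the class name and parameters are hypothetical, timestamps are passed in explicitly as seconds, and the summary is written lazily on the first matching event after the quiet period rather than by a timer.

```python
class FloodProtector:
    """Simplified model: suppress an event once it repeats too often."""

    def __init__(self, threshold=5, window=120.0, quiet_period=120.0):
        self.threshold = threshold        # repeats allowed inside the window
        self.window = window              # seconds in which repeats are counted
        self.quiet_period = quiet_period  # seconds of suppression before a summary
        self._recent = {}      # event id -> timestamps of recently written entries
        self._suppressed = {}  # event id -> (time suppression began, suppressed count)

    def log(self, event_id, now):
        """Return the entries that are actually written for this event."""
        written = []
        if event_id in self._suppressed:
            started, count = self._suppressed[event_id]
            if now - started < self.quiet_period:
                self._suppressed[event_id] = (started, count + 1)
                return written
            # Quiet period is over: write the summary, then log normally again.
            del self._suppressed[event_id]
            written.append(f"summary: {event_id} suppressed {count} times")
        recent = [t for t in self._recent.get(event_id, []) if now - t < self.window]
        recent.append(now)
        written.append(event_id)
        if len(recent) >= self.threshold:
            self._suppressed[event_id] = (now, 0)
            recent = []
        self._recent[event_id] = recent
        return written


# The same event fires every 10 seconds for 160 seconds (17 occurrences).
protector = FloodProtector()
entries = []
for second in range(0, 170, 10):
    entries.extend(protector.log("E1", float(second)))

print(len(entries))  # 7
print(entries[5])    # summary: E1 suppressed 11 times
```

In this run, five entries are written, eleven are suppressed, and the seventeenth occurrence triggers the summary before being written itself.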
3. ULS or Trace Logging
Trace Logs can quickly consume disk space, especially when configured to
use the more verbose output settings. To manage this growth,
administrators can implement two types of restrictions:
a) Administrators can determine the number of days that log files should be kept. By default this is set to 14 days.
b) Administrators can also place a limit on the overall disk space
that log files can consume. This is disabled by default but provides
an additional layer of protection aimed at preventing excessive disk
space consumption.
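The two restrictions above can be combined into a single retention check. The following Python sketch shows one possible way to decide which log files to remove. It is an illustration only: the function name and file names are hypothetical, and removing the oldest files first when the disk limit is exceeded is an assumption, because the source does not say how SharePoint chooses files to trim.

```python
from datetime import datetime, timedelta


def files_to_delete(files, now, max_age_days=14, max_total_bytes=None):
    """files: list of (name, modified, size_bytes). Return names to remove, oldest first."""
    cutoff = now - timedelta(days=max_age_days)
    ordered = sorted(files, key=lambda f: f[1])  # oldest first
    doomed = [f for f in ordered if f[1] < cutoff]
    kept = [f for f in ordered if f[1] >= cutoff]
    if max_total_bytes is not None:
        total = sum(size for _, _, size in kept)
        # Assumption: when over the limit, drop the oldest remaining files first.
        while kept and total > max_total_bytes:
            oldest = kept.pop(0)
            doomed.append(oldest)
            total -= oldest[2]
    return [name for name, _, _ in doomed]


now = datetime(2010, 10, 16)
logs = [
    ("a.log", now - timedelta(days=20), 100),
    ("b.log", now - timedelta(days=10), 400),
    ("c.log", now - timedelta(days=1), 300),
]

print(files_to_delete(logs, now))                       # ['a.log']
print(files_to_delete(logs, now, max_total_bytes=500))  # ['a.log', 'b.log']
```

With only the 14-day rule, one file is removed. Adding the disk space limit removes a second file even though it is still within the retention period.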
2. Usage Data and Health Data Collection
In addition to Diagnostic Logging, SharePoint Server 2010 also
proactively logs information that is related to the overall health of
the farm. As an administrator, you can individually select which events
are monitored, for example the usage of features, page load times, and
search queries.
This functionality both consumes disk space and has a performance
overhead. As with Diagnostic Logging, take care to manage it
appropriately. The following options are available to administrators:
1) Health Data Collection
Health reports are built by taking snapshots of various resources, data,
and processes at specific points in time. The number of Timer Jobs to
schedule will depend on the number of events that you selected to
monitor. The frequency of these jobs can be modified to manage the
performance impact.
2) Log Collection Schedule
The Log Collection Schedule Timer Job is responsible for collecting
Usage Logs from the various servers in the farm, processing them, and
then populating a centralized database from where they can be queried
for reporting. Once processed, the logs are deleted from disk, freeing
up the space they were consuming. The frequency of this job can be
modified to manage the consumption of disk space.
Note:
Everything that is being logged to the Windows Event Viewer and to the
SharePoint log files is also being stored in the SharePoint Server 2010
logging database. The logging database is also used by the SharePoint
Health Analyzer and by SharePoint usage reporting.
Use the following checklist to implement these features in your daily operations:
4. SharePoint Health Analyzer
SharePoint has a number of features that log and gather detailed
statistics about all aspects of the health of the environment. The
SharePoint Health Analyzer aggregates all of this data, identifies
possible problems, and then proactively looks for and recommends solutions.
Many solutions that it finds will include a “Repair Now” link, which,
when selected, will automatically resolve the problem. Other solutions
will link to online help content, which is constantly updated with the
latest information about the problem.
Like the “Best Practices Analyzers” available for other platforms (such
as Microsoft Exchange Server), Health Analyzer includes a set of rules
which can be extended by developers and which is continuously compared
to the existing settings and metrics drawn from your production
environment. Rules are applied across a number of categories, including
security, performance, configuration and availability.
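The idea of a rule set that is compared against observed settings can be sketched as follows. This Python example is a conceptual model only and does not use the SharePoint object model: the class, the rule names, the setting names, and the thresholds are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class HealthRule:
    name: str
    category: str                   # for example: security, performance, configuration, availability
    passes: Callable[[dict], bool]  # True when the observed settings are acceptable
    advice: str


def evaluate(rules, observed):
    """Return (category, rule name, advice) for every rule that fails."""
    return [(r.category, r.name, r.advice) for r in rules if not r.passes(observed)]


# Hypothetical rules and observed settings, invented for this example.
rules = [
    HealthRule("Trace log retention is bounded", "configuration",
               lambda s: s["trace_log_days"] <= 14, "Reduce the retention period."),
    HealthRule("Drive has free space", "availability",
               lambda s: s["free_disk_gb"] >= 5, "Free up disk space."),
]
observed = {"trace_log_days": 14, "free_disk_gb": 2}

for category, name, advice in evaluate(rules, observed):
    print(f"[{category}] {name}: {advice}")
```

Adding a rule in this model means appending another entry to the list, which mirrors the way the rule set can be extended by developers.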
5. Timer Jobs
The monitoring features in SharePoint Server 2010 use specific timer
jobs to perform monitoring tasks and collect monitoring data. The health
and usage data might consist of performance counter data, event log
data, timer service data, metrics for site collections and sites, search
usage data, or various performance aspects of the Web servers. The
system uses this data to create health reports, Web Analytics reports,
and administrative reports. The system writes usage and health data to
the logging folder and subsequently to the logging database.
You might want to change the schedules that the timer jobs run on to
collect data more frequently or less frequently. You might even want to
disable jobs that collect data if you are not interested in them. You
can perform the following tasks on timer jobs:
• Modify the schedule that the timer job runs on.
• Run timer jobs immediately.
• Enable or disable timer jobs.
• View timer job status. You can view currently scheduled jobs, failed
jobs, currently running jobs, and a complete timer job history.
6. Web Analytics
The reports that the Web Analytics functionality in SharePoint
Server generates provide detailed insight into how your SharePoint
environment is being used, and how well it’s performing. Administrators
should become familiar with these reports and how they can create their
own (directly in the browser) to plan future capacity and to produce
benchmarks to compare with future farm configurations.
All of these reports can be used to help you decide if the current
architecture remains “fit for purpose,” meaning that it meets the
desired service levels.
The reports are broken down into three categories and can be reviewed
at the Web Application, Site Collection, Site, and Search Service levels.
Reference:
http://www.learningsharepoint.com/2010/10/16/monitoring-sharepoint-2010-%E2%80%93-tutorial/