== Kbases ==
: https://pcp.io/
: https://github.com/performancecopilot/pcp
: Performance Co-Pilot (PCP) Data Sheet
:: https://access.redhat.com/articles/3119481
: How do I install Performance Co-Pilot (PCP) on my RHEL server to capture performance logs
:: https://access.redhat.com/solutions/1137023
: How do I gather performance data logs to upload to my support case using Performance Co-Pilot (PCP)
:: https://access.redhat.com/articles/1146063
: Side-by-side comparison of PCP tools with legacy tools
:: https://access.redhat.com/articles/2372811
: Using Performance Co-Pilot for CEE as a replacement for collectl for storage performance cases
:: https://access.redhat.com/articles/3901181
: Introduction to storage performance analysis with PCP
:: https://access.redhat.com/articles/2450251
: How to begin Network performance debugging
:: https://access.redhat.com/articles/1311173
: [Troubleshooting] Gathering system baseline resource usage for IO performance issues
:: https://access.redhat.com/articles/279063
: How do I find out what process are eating up all my system resources like memory and CPU?
:: https://access.redhat.com/solutions/62077
: How to read Vmstat output
:: https://access.redhat.com/solutions/1160343
== General ==
;Troubleshooting Tools
: sar
:: provided by the sysstat package
:: http://pagesperso-orange.fr/sebastien.godard/faq.html
:: http://ksar.atomique.net/download.html
: iostat
:: provided by the sysstat package
: vmstat
:: provided by the procps package
: tcpdump
:: http://www.tcpdump.org/#documentation
: oProfile
:: http://oprofile.sourceforge.net/about/
: SystemTap
:: http://sourceware.org/systemtap/
: gdb
:: http://www.gnu.org/software/gdb/documentation/
: cron
:: http://www.redhat.com/docs/manuals/enterprise/RHEL-4-Manual/sysadmin-guide/ch-autotasks.html
;Tuning Parameters
: /etc/sysctl.conf
: /etc/security/limits.conf
: NIC bonding
:: http://www.linuxfoundation.org/collaborate/workgroups/networking/bonding
;Benchmarking
: bonnie++
:: http://download.fedora.redhat.com/pub/epel/5/i386/bonnie++-1.03a-6.el5.i386.rpm
:: http://download.fedora.redhat.com/pub/epel/5/x86_64/bonnie++-1.03a-6.el5.x86_64.rpm
: Oracle stored procedures
: apache benchmark - ab
== Disk IO ==
* http://www.faqs.org/docs/Linux-mini/Ultra-DMA.html#ss8.2
* If the scsi/ide module is either not supported or not loaded, and you attempt to set the DMA mode, the attempt fails with "HDIO_SET_DMA failed: Operation not permitted"
* http://www.redhat.com/magazine/008jun05/features/schedulers/
** deadline or cfq for Oracle
* http://www.pythian.com/news/247/basic-io-monitoring-on-linux/
Traditionally, it’s common to assume that the closer a device is to 100% utilization, the more saturated it is. This may be true when the system device corresponds to a single physical disk. However, when the device represents a LUN on a modern storage array, the story can be completely different. Rather than looking at device utilization, there is another way to estimate how loaded a device is: look at qutim, the average time a request spends in the queue. iostat has no such column, but it can be derived as await minus svctm. If qutim is insignificant compared to svctm, the IO device is not saturated. When it becomes comparable to svctm or exceeds it, requests are queued longer and a major part of the response time is actually time spent waiting in the queue. The figure in the await column should be as close to that in the svctm column as possible. If await goes much above svctm, watch out! The IO device is probably overloaded.
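As a rough worked example of the comparison described above, the sketch below derives the queue time from the await and service-time figures that iostat -x reports. The function names, the "queue time at least equal to service time" threshold, and the sample numbers are illustrative assumptions, not measurements from a real host.

```shell
# Hypothetical helper: estimate the average queue time in ms.
# Assumes await = queue time + service time, so queue time = await - svctm.
queue_time() {
    awk -v await="$1" -v svctm="$2" 'BEGIN { printf "%.2f\n", await - svctm }'
}

# Hypothetical helper: flag the device when requests wait in the queue
# at least as long as they take to service (an assumed threshold).
queue_verdict() {
    awk -v await="$1" -v svctm="$2" 'BEGIN {
        q = await - svctm
        print ((q >= svctm) ? "queueing dominates" : "not saturated")
    }'
}

queue_time 4.20 3.90       # await barely above service time
queue_verdict 4.20 3.90
queue_time 25.00 5.00      # await far above service time
queue_verdict 25.00 5.00
```

The first pair of calls models a healthy device, the second a device where most of the response time is spent queued.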
=== sar disktop ===
#!/bin/bash

# Report the peak disk read rate, write rate and utilization that sar
# recorded on each host.
#       note1: run this on a trusted host (it logs in as root over ssh)
#       note2: bonnie++ is a better tool for benchmarking IO
#       note3: field numbers assume 12-hour timestamps (time, AM/PM, DEV, tps, ...)

HOSTS="ut40xl016 ut40xl045 ut40xl017 ut40xl046 ut40xl018 ut40xl047"

echo "host,topread,topwrite,toputil"

# Print the integer part of the largest numeric value in field $1 of the
# sar report on stdin, ignoring header and Average rows
top_field() {
        grep -v -e DEV -e Average | awk -v col="$1" '$col ~ /^[0-9.]+$/ {print $col}' | sort -n | tail -n1 | cut -d. -f1
}

# Loop through hosts in sorted order
for h in $(echo "$HOSTS" | tr ' ' '\n' | sort); do

        # Get list of sar data files
        SAFILES=$(ssh "root@$h" 'ls /var/log/sa/sa[0-9][0-9]')

        # Loop through sar data files and find top READ, WRITE and UTILIZATION
        TOPREAD=0
        TOPWRITE=0
        TOPUTIL=0
        for f in $SAFILES; do
                # Fetch the report once per file instead of once per metric
                REPORT=$(ssh "root@$h" "sar -d -f $f")
                READ=$(echo "$REPORT" | top_field 5)
                WRITE=$(echo "$REPORT" | top_field 6)
                UTIL=$(echo "$REPORT" | top_field 11)

                # ${VAR:-0} keeps the numeric test valid when a file has no disk data
                if [ "${READ:-0}" -gt "$TOPREAD" ]; then
                        TOPREAD=$READ
                fi

                if [ "${WRITE:-0}" -gt "$TOPWRITE" ]; then
                        TOPWRITE=$WRITE
                fi

                if [ "${UTIL:-0}" -gt "$TOPUTIL" ]; then
                        TOPUTIL=$UTIL
                fi
        done

        echo "$h,$TOPREAD,$TOPWRITE,$TOPUTIL"
done
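The hard-coded field numbers (5, 6 and 11) in the script above only hold when sar prints 12-hour timestamps with a separate AM/PM field. As a possible alternative, the sketch below looks the column up by its header name. The function name sar_max_column and the sample report are made up for illustration, and the header names are assumed to match what the installed sysstat version prints.

```shell
# Hypothetical helper: print the maximum value of a named column in
# `sar -d` text read from stdin, locating the column by its header.
sar_max_column() {
    awk -v name="$1" '
        # Header rows contain the literal DEV; remember where the column is
        /DEV/ {
            idx = 0
            for (i = 1; i <= NF; i++) if ($i == name) idx = i
            next
        }
        # Skip the end-of-report summary rows
        /^Average/ { next }
        # Data rows: track the largest numeric value seen in that column
        idx && $idx ~ /^[0-9.]+$/ && ($idx + 0 > max) { max = $idx + 0 }
        END { printf "%.2f\n", max }
    '
}

# Made-up sample in the layout the script above assumes
sample='12:00:01 AM       DEV       tps  rd_sec/s  wr_sec/s  avgrq-sz  avgqu-sz     await     svctm     %util
12:10:01 AM    dev8-0      1.50     10.00    120.50     87.00      0.01      4.20      1.10      0.17
12:20:01 AM    dev8-0      9.00    300.25     80.00     42.25      0.20     22.00      3.00     97.50
Average:       dev8-0      5.25    155.12    100.25     64.62      0.10     13.10      2.05     48.83'

printf '%s\n' "$sample" | sar_max_column rd_sec/s
printf '%s\n' "$sample" | sar_max_column %util
```

On a real host the input would come from something like sar -d -f /var/log/sa/sa01 instead of the sample text.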
== Network ==
* http://fasterdata.es.net/TCP-tuning/linux.html
* http://www-iepm.slac.stanford.edu/bw/tcp-eval/
   # don't cache ssthresh from previous connection
   net.ipv4.tcp_no_metrics_save = 1
   # recommended to increase this for 10G NICS
   net.core.netdev_max_backlog = 30000 
* sysctl net.ipv4.tcp_available_congestion_control
/sbin/modprobe tcp_htcp 
/sbin/modprobe tcp_cubic
* For long fast paths, we highly recommend using cubic or htcp.
* Warning on Large MTUs: If you have configured your Linux host to use 9K MTUs, but the connection is using 1500 byte packets, then you actually need 9/1.5 = 6 times more buffer space in order to fill the pipe. In fact some device drivers only allocate memory in power-of-two sizes, so you may even need 16/1.5 = 11 times more buffer space!
* Warning for both 2.4 and 2.6: for very large BDP paths where the TCP window is > 20 MB, you are likely to hit the Linux SACK implementation problem. If Linux has too many packets in flight when it gets a SACK event, it takes too long to locate the SACKed packet, and you get a TCP timeout and CWND goes back to 1 packet. Restricting the TCP buffer size to about 12 MB seems to avoid this problem, but clearly limits your total throughput. Another solution is to disable SACK.
----
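To make the large-MTU and BDP warnings above concrete, the following sketch computes the bandwidth-delay product of a path and the buffer multiplier caused by an MTU mismatch. The function names and the sample path (10 Gbit/s, 50 ms round trip) are assumptions chosen for illustration.

```shell
# Hypothetical helper: bandwidth-delay product in bytes.
#   $1 = link speed in Mbit/s, $2 = round-trip time in ms
bdp_bytes() {
    awk -v mbit="$1" -v rtt_ms="$2" 'BEGIN {
        printf "%d\n", mbit * 1000000 / 8 * rtt_ms / 1000
    }'
}

# Hypothetical helper: buffer multiplier when memory is allocated for one
# packet size but the connection carries smaller packets (sizes in bytes).
# Rounds up, matching the 16/1.5 = 11 example in the warning.
mtu_factor() {
    awk -v alloc="$1" -v actual="$2" 'BEGIN {
        f = alloc / actual
        printf "%d\n", (f == int(f)) ? f : int(f) + 1
    }'
}

bdp_bytes 10000 50       # 10 Gbit/s path with a 50 ms RTT
mtu_factor 9000 1500     # 9K MTU configured, 1500-byte packets on the wire
mtu_factor 16000 1500    # driver rounds the allocation up to 16K
```

The sample path needs a window of 62.5 MB to fill the pipe, which is well past the 20 MB mark where the SACK warning starts to apply.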
jtanner@trainwreck:~$ ifconfig -a | fgrep -e eth -e br -e que
br0       Link encap:Ethernet  HWaddr 40:61:86:BE:7D:D0  
          collisions:0 txqueuelen:0 
eth0      Link encap:Ethernet  HWaddr 00:10:4B:1F:95:71  
          collisions:0 txqueuelen:1000 
eth1      Link encap:Ethernet  HWaddr 40:61:86:BE:7D:D0  
          collisions:0 txqueuelen:1000 
          collisions:0 txqueuelen:0 
virbr0    Link encap:Ethernet  HWaddr 1E:16:34:68:69:02  
          collisions:0 txqueuelen:0 
          collisions:0 txqueuelen:500 
jtanner@trainwreck:~$ tc -s qdisc show dev eth0
qdisc pfifo_fast 0: root refcnt 2 bands 3 priomap  1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
 Sent 41873920 bytes 349870 pkt (dropped 0, overlimits 0 requeues 0) 
 backlog 0b 0p requeues 0 

jtanner@trainwreck:~$ tc -s qdisc show dev eth1
qdisc pfifo_fast 0: root refcnt 2 bands 3 priomap  1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
 Sent 417024970 bytes 479786 pkt (dropped 0, overlimits 0 requeues 0) 
 backlog 0b 0p requeues 0 
jtanner@trainwreck:~$ cat /proc/net/softnet_stat
001def9c 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
000dad07 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
-------- --------
   \        \_ drop count
    \_ packet count
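The counters in /proc/net/softnet_stat are hexadecimal, with one row per CPU. As a sketch, the helper below converts the first two columns (packet count and drop count, as annotated above) to decimal. The function name softnet_summary is hypothetical, and it reads from stdin so that it can be tried on the two sample rows shown here.

```shell
# Hypothetical helper: print "cpuN packets drops" in decimal for each row
# of softnet_stat-formatted text on stdin (one row per CPU, hex columns).
softnet_summary() {
    cpu=0
    while read -r packets drops _rest; do
        # printf %d accepts hexadecimal constants written with a 0x prefix
        printf 'cpu%d %d %d\n' "$cpu" "0x$packets" "0x$drops"
        cpu=$((cpu + 1))
    done
}

# On a live system: softnet_summary < /proc/net/softnet_stat
printf '%s\n' \
    '001def9c 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000' \
    '000dad07 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000' |
    softnet_summary
```

A non-zero third field in the output would mean that CPU has dropped packets, which is where net.core.netdev_max_backlog comes in.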
jtanner@trainwreck:~$ cat /proc/net/snmp | grep '^Ip:' | cut -f17 -d' ' 
ReasmFails
0
jtanner@trainwreck:~$ ps axo pid,comm,util | grep softirq
    4 ksoftirqd/0      0
    7 ksoftirqd/1      0
== Applications ==
=== Oracle ===
==== vmstat ====
vmstat is the first place to check for performance issues.
* If the wa (time waiting for I/O) column is high, this is usually an indication that the storage subsystem is overloaded. See recipes 8-9 and 8-10 for identifying the sources of I/O contention.
* If b (processes in uninterruptible sleep) is consistently greater than 0, processes are blocked, usually on I/O. If r (runnable processes) is consistently greater than the number of CPUs, then you may not have enough CPU processing power. See recipe 8-2 for identifying Oracle processes and SQL statements consuming the most CPU.
* If so (memory swapped out to disk) and si (memory swapped in from disk) are consistently greater than 0, you may have a memory bottleneck. See recipe 8-5 for details on identifying Oracle processes and SQL statements consuming the most memory.
==== Enumerating high cpu oracle pids ====
Grab the pid number
ps -e -o pcpu,pid,user,tty,args | grep -i oracle | sort -n -k 1 -r | head
Query the database for more information
SET LINESIZE 80 HEADING OFF FEEDBACK OFF
SELECT
  RPAD('USERNAME : ' || s.username, 80) ||
  RPAD('OSUSER    : ' || s.osuser, 80) ||
  RPAD('PROGRAM : ' || s.program, 80) ||
  RPAD('SPID      : ' || p.spid, 80) ||
  RPAD('SID       : ' || s.sid, 80) ||
  RPAD('SERIAL# : ' || s.serial#, 80) ||
  RPAD('MACHINE : ' || s.machine, 80) ||
  RPAD('TERMINAL : ' || s.terminal, 80)
FROM v$session s,
     v$process p
WHERE s.paddr = p.addr
AND    p.spid = '&PID_FROM_OS';
==== Enumerating high memory oracle pids ====
Grab the pid number
ps -e -o pmem,pid,user,tty,args | grep -i oracle | sort -n -k 1 -r | head
Query the database for more information
SET LINESIZE 80 HEADING OFF FEEDBACK OFF
SELECT
  RPAD('USERNAME : ' || s.username, 80) ||
  RPAD('OSUSER    : ' || s.osuser, 80) ||
  RPAD('PROGRAM : ' || s.program, 80) ||
  RPAD('SPID      : ' || p.spid, 80) ||
  RPAD('SID       : ' || s.sid, 80) ||
  RPAD('SERIAL# : ' || s.serial#, 80) ||
  RPAD('MACHINE : ' || s.machine, 80) ||
  RPAD('TERMINAL : ' || s.terminal, 80) ||
  RPAD('SQL TEXT : ' || q.sql_text, 80)
FROM v$session s
    ,v$process p
    ,v$sql      q
WHERE s.paddr           = p.addr
AND    p.spid           = '&PID_FROM_OS'
AND    s.sql_address    = q.address(+)
AND    s.sql_hash_value = q.hash_value(+);
==== Enumerating high disk utilization ====
* Look for devices with abnormally high blocks read or written per second.
* If any device is near 100 percent utilization, that’s a strong indicator I/O is a bottleneck.
If the bottlenecked disks are used by Oracle, then you can query the data dictionary to identify sessions with high I/O activity. The following query is useful for determining which SQL statements generate the most read/write activity:
SELECT *
FROM
(SELECT
  parsing_schema_name
 ,direct_writes
 ,SUBSTR(sql_text,1,75)
 ,disk_reads
FROM v$sql
ORDER BY disk_reads DESC)
WHERE rownum < 20;
Determining which objects produce the heaviest I/O activity in the database:
SELECT *
FROM
(SELECT
  s.statistic_name
 ,s.owner
 ,s.object_type
 ,s.object_name
 ,s.value
  FROM v$segment_statistics s
  WHERE s.statistic_name IN
     ('physical reads', 'physical writes', 'logical reads',
      'physical reads direct', 'physical writes direct')
ORDER BY s.value DESC)
WHERE rownum < 20;
==== Enumerating network utilization ====
netstat -ptc
Proto Recv-Q Send-Q Local Address Foreign Address State              PID/Program name
tcp        0      0 rmug.com:62386 rmug.com:1521      ESTABLISHED    22864/ora_pmon_RMDB
tcp        0      0 rmug.com:53930 rmug.com:1521      ESTABLISHED    6091/sqlplus
tcp        0      0 rmug.com:1521 rmug.com:53930      ESTABLISHED    6093/oracleRMDB1
tcp        0      0 rmug.com:1521 rmug.com:62386      ESTABLISHED    10718/tnslsnr
If the Send-Q (bytes not acknowledged by remote host) column has an unusually high value for a process, this may indicate an overloaded network. The useful aspect of the previous output is that you can determine the operating system process ID (PID) associated with a network connection.
=== Tomcat ===
* http://tomcat.apache.org/articles/performance.pdf
* http://www.devshed.com/c/a/BrainDump/Tomcat-Performance-Tuning/
   1. Decide what needs to be measured.
   2. Decide how to measure.
   3. Measure.
   4. Understand the implications of what you learned.
   5. Modify the configuration in ways that are expected to improve the measurements.
   6. Measure and compare with previous measurements.
   7. Go back to step 4. 
* Use ab (apache benchmark) as the benchmarking tool.
* http://jakarta.apache.org/jmeter/
* http://www.itworld.com/networking/83035/tomcat-performance-tuning-tips
** Start the JVM with a higher heap memory maximum using the -Xmx switch.
** Start the JVM with its initial heap memory size (the -Xms switch) set to the same value as its maximum memory size.
** Tune the Connector (web server) thread pool settings to more closely match the web request load you have.
** Tune some additional Connector attribute settings:
*** compression
*** compressableMimeTypes
** If the application uses a database, the connection pool settings are very important: mainly the maxActive, maxIdle, and maxWait attributes of the Resource element where you define your database connection pool.
** HTTP caching headers
* http://www.solutionhacker.com/implement-your-idea/scale-your-website/tomcat-performance-tuning/
== Profiling resource usage with cron and ps ==
[root@t5400 ~]# crontab -l
*/2	*	*	*	* 	ps aux > /root/ps.output/ps.`date \+\%Y-\%m-\%d_\%H_\%M_\%S`.out
To archive and clean out the /root/ps.output directory every night, do something like this. Put this in /root/ps.archive.sh:
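#!/bin/bash

# Remove the previous night's archive, if any
if [ -f /root/ps.output-archive.tar.bz2 ]
then
 /bin/rm -f /root/ps.output-archive.tar.bz2
fi

# Archive the snapshots, then clear them out only if tar succeeded
/bin/tar -cjf /root/ps.output-archive.tar.bz2 /root/ps.output &&
 /bin/rm -f /root/ps.output/ps.*.out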
* chmod +x /root/ps.archive.sh
* crontab -e
* 10 0 * * * /root/ps.archive.sh > /dev/null 2>&1
With that cron entry and script, at 10 past midnight every night cron tars up what is in /root/ps.output/ and removes the snapshots in there. That should leave enough history to determine the problem if you have a hang.
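Once the snapshots exist, one way to narrow down a hang is to list the busiest process in each of them. The sketch below is a hypothetical helper, not part of the original procedure. It assumes the files follow the ps.*.out naming used by the cron entry and the standard ps aux column layout.

```shell
# Hypothetical helper: for each ps snapshot in a directory, print the file
# name followed by the PID, %CPU and command of the busiest process.
# Assumes `ps aux` layout: USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
busiest_per_snapshot() {
    for f in "$1"/ps.*.out; do
        # Skip the unexpanded pattern when the directory has no snapshots
        [ -f "$f" ] || continue
        awk -v file="${f##*/}" '
            BEGIN { max = -1 }
            # NR > 1 skips the header row; keep the row with the highest %CPU
            NR > 1 && ($3 + 0 > max) { max = $3 + 0; pid = $2; cmd = $11 }
            END { if (pid != "") print file, pid, max, cmd }
        ' "$f"
    done
}

# Example: busiest_per_snapshot /root/ps.output
```

Sorting on field 4 instead of field 3 would give the top memory consumer per snapshot.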