== Kbases ==
* https://pcp.io/
* https://github.com/performancecopilot/pcp
* Performance Co-Pilot (PCP) Data Sheet: https://access.redhat.com/articles/3119481
* How do I install Performance Co-Pilot (PCP) on my RHEL server to capture performance logs: https://access.redhat.com/solutions/1137023
* How do I gather performance data logs to upload to my support case using Performance Co-Pilot (PCP): https://access.redhat.com/articles/1146063
* Side-by-side comparison of PCP tools with legacy tools: https://access.redhat.com/articles/2372811
* Using Performance Co-Pilot for CEE as a replacement for collectl for storage performance cases: https://access.redhat.com/articles/3901181
* Introduction to storage performance analysis with PCP: https://access.redhat.com/articles/2450251
* How to begin Network performance debugging: https://access.redhat.com/articles/1311173
* [Troubleshooting] Gathering system baseline resource usage for IO performance issues: https://access.redhat.com/articles/279063
* How do I find out what process are eating up all my system resources like memory and CPU? https://access.redhat.com/solutions/62077
* How to read Vmstat output: https://access.redhat.com/solutions/1160343

== General ==
;Troubleshooting Tools
: sar
:: provided by the sysstat package
:: http://pagesperso-orange.fr/sebastien.godard/faq.html
:: http://ksar.atomique.net/download.html
: iostat
:: provided by the sysstat package
: vmstat
:: provided by the procps package (procps-ng on newer releases)
: tcpdump
:: http://www.tcpdump.org/#documentation
: oProfile
:: http://oprofile.sourceforge.net/about/
: SystemTap
:: http://sourceware.org/systemtap/
: gdb
:: http://www.gnu.org/software/gdb/documentation/
: cron
:: http://www.redhat.com/docs/manuals/enterprise/RHEL-4-Manual/sysadmin-guide/ch-autotasks.html
; Tuning Parameters:
: /etc/sysctl.conf
: /etc/security/limits.conf
: NIC bonding
:: http://www.linuxfoundation.org/collaborate/workgroups/networking/bonding
; Benchmarking:
: bonnie++
:: http://download.fedora.redhat.com/pub/epel/5/i386/bonnie++-1.03a-6.el5.i386.rpm
:: http://download.fedora.redhat.com/pub/epel/5/x86_64/bonnie++-1.03a-6.el5.x86_64.rpm
: Oracle stored procedures
: apache benchmark - ab

== Disk IO ==
* http://www.faqs.org/docs/Linux-mini/Ultra-DMA.html#ss8.2
* If the scsi/ide module is either not supported or not loaded and you attempt to set the DMA mode, the attempt fails with: "HDIO_SET_DMA fail operation not permitted"
* http://www.redhat.com/magazine/008jun05/features/schedulers/
** deadline or cfq for Oracle
* http://www.pythian.com/news/247/basic-io-monitoring-on-linux/
Traditionally, it is common to assume that the closer a device is to 100% utilization, the more saturated it is. This may be true when the system device corresponds to a single physical disk. However, when a device represents a LUN on a modern storage box, the story can be completely different. Rather than looking at device utilization, estimate the load from the average time a request spends in the queue, called qutim here. iostat does not report this value directly; it can be derived as await minus svctm. If qutim is insignificant compared to svctm, the IO device is not saturated. When it becomes comparable to svctm or exceeds it, requests are queued longer and a major part of the response time is time spent waiting in the queue. The figure in the await column should be as close as possible to the figure in the svctm column. If await goes much above svctm, watch out: the IO device is probably overloaded.

=== sar disktop ===
 #!/bin/bash
 # Check particular hosts for their disk performance metrics
 # note1: run this on a trusted host
 # note2: bonnie++ is a better tool for benchmarking IO
 # note3: the awk column numbers assume sar prints 12-hour timestamps (time plus AM/PM)
 HOSTS="ut40xl016 ut40xl045 ut40xl017 ut40xl046 ut40xl018 ut40xl047"
 echo "host,topread,topwrite,toputil"
 # Loop through hosts in sorted order
 for h in $(echo $HOSTS | tr ' ' '\n' | sort); do
     # Get list of sar data files
     SAFILES=$(ssh root@$h 'ls /var/log/sa/sa[0-9][0-9]')
     # Loop through sar data files and find top READ, WRITE and UTILIZATION
     TOPREAD=0
     TOPWRITE=0
     TOPUTIL=0
     for f in $SAFILES; do
         # Fetch each report once; drop header lines and the Average lines,
         # whose columns are shifted by one
         SARDATA=$(ssh root@$h "sar -d -f $f" | grep -v -e DEV -e Average)
         READ=$(echo "$SARDATA" | awk '{print $5}' | sort -n | tail -n1 | cut -d. -f1)
         WRITE=$(echo "$SARDATA" | awk '{print $6}' | sort -n | tail -n1 | cut -d. -f1)
         UTIL=$(echo "$SARDATA" | awk '{print $11}' | sort -n | tail -n1 | cut -d. -f1)
         # Treat an empty result as 0 so the integer comparisons do not error out
         if [ "${READ:-0}" -gt "$TOPREAD" ]; then
             TOPREAD=$READ
         fi
         if [ "${WRITE:-0}" -gt "$TOPWRITE" ]; then
             TOPWRITE=$WRITE
         fi
         if [ "${UTIL:-0}" -gt "$TOPUTIL" ]; then
             TOPUTIL=$UTIL
         fi
     done
     echo "$h,$TOPREAD,$TOPWRITE,$TOPUTIL"
 done

== Network ==
* http://fasterdata.es.net/TCP-tuning/linux.html
* http://www-iepm.slac.stanford.edu/bw/tcp-eval/
 # don't cache ssthresh from previous connection
 net.ipv4.tcp_no_metrics_save = 1
 # recommended to increase this for 10G NICS
 net.core.netdev_max_backlog = 30000
* sysctl net.ipv4.tcp_available_congestion_control
 /sbin/modprobe tcp_htcp
 /sbin/modprobe tcp_cubic
* For long fast paths, we highly recommend using cubic or htcp.
* Warning on large MTUs: if you have configured your Linux host to use 9K MTUs, but the connection is using 1500-byte packets, then you actually need 9/1.5 = 6 times more buffer space in order to fill the pipe. In fact, some device drivers only allocate memory in power-of-two sizes, so you may even need 16/1.5 = 11 times more buffer space!
* Warning for both 2.4 and 2.6 kernels: for very large BDP paths where the TCP window is > 20 MB, you are likely to hit the Linux SACK implementation problem. If Linux has too many packets in flight when it gets a SACK event, it takes too long to locate the SACKed packet, and you get a TCP timeout and CWND goes back to 1 packet. Restricting the TCP buffer size to about 12 MB seems to avoid this problem, but clearly limits your total throughput. Another solution is to disable SACK.
----
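The large-MTU arithmetic in the warning above can be reproduced with a short shell sketch. The helper names buffer_multiplier and next_pow2 are hypothetical, and 1 KB is treated as 1000 bytes only to match the 9/1.5 figure quoted in the text; real driver allocation sizes may differ.

```shell
#!/bin/sh
# buffer_multiplier <alloc_kb> <packet_bytes>  (hypothetical helper)
# Prints how many times more buffer space is needed when every packet of
# <packet_bytes> occupies an allocation of <alloc_kb> KB (1 KB = 1000 bytes,
# an assumption chosen to match the 9/1.5 arithmetic in the text).
buffer_multiplier() {
    awk -v alloc_kb="$1" -v pkt="$2" \
        'BEGIN { printf "%.1f\n", (alloc_kb * 1000) / pkt }'
}

# next_pow2 <n>  (hypothetical helper)
# Rounds n up to the next power of two, mimicking a driver that only
# allocates power-of-two sized buffers.
next_pow2() {
    n=$1
    p=1
    while [ "$p" -lt "$n" ]; do
        p=$((p * 2))
    done
    echo "$p"
}

buffer_multiplier 9 1500                  # 6.0
buffer_multiplier "$(next_pow2 9)" 1500   # 10.7
```

The second figure is the "about 11 times" case: a 9K buffer rounded up to 16K while carrying 1500-byte packets.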
 jtanner@trainwreck:~$ ifconfig -a | fgrep -e eth -e br -e que
 br0       Link encap:Ethernet  HWaddr 40:61:86:BE:7D:D0
           collisions:0 txqueuelen:0
 eth0      Link encap:Ethernet  HWaddr 00:10:4B:1F:95:71
           collisions:0 txqueuelen:1000
 eth1      Link encap:Ethernet  HWaddr 40:61:86:BE:7D:D0
           collisions:0 txqueuelen:1000
           collisions:0 txqueuelen:0
 virbr0    Link encap:Ethernet  HWaddr 1E:16:34:68:69:02
           collisions:0 txqueuelen:0
           collisions:0 txqueuelen:500
 jtanner@trainwreck:~$ tc -s qdisc show dev eth0
 qdisc pfifo_fast 0: root refcnt 2 bands 3 priomap  1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
  Sent 41873920 bytes 349870 pkt (dropped 0, overlimits 0 requeues 0)
  backlog 0b 0p requeues 0
 jtanner@trainwreck:~$ tc -s qdisc show dev eth1
 qdisc pfifo_fast 0: root refcnt 2 bands 3 priomap  1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
  Sent 417024970 bytes 479786 pkt (dropped 0, overlimits 0 requeues 0)
  backlog 0b 0p requeues 0
 jtanner@trainwreck:~$ cat /proc/net/softnet_stat
 001def9c 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
 000dad07 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
 -------- --------
     \        \_ drop count
      \_ packet count
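The counters in softnet_stat are hexadecimal, so they are easier to read after conversion. A minimal sketch, assuming one row per CPU in order and using only the first two columns annotated above; softnet_summary is a hypothetical helper name, not an existing tool.

```shell
#!/bin/sh
# softnet_summary [file]  (hypothetical helper)
# Decodes the first two hex columns of softnet_stat into decimal,
# printing one line per row. Rows are assumed to map to CPUs in order.
softnet_summary() {
    cpu=0
    while read -r packets dropped _rest; do
        # $((0x...)) converts a hexadecimal string to decimal
        echo "cpu$cpu packets=$((0x$packets)) dropped=$((0x$dropped))"
        cpu=$((cpu + 1))
    done < "${1:-/proc/net/softnet_stat}"
}

# Only run against the live file where it exists (Linux)
if [ -r /proc/net/softnet_stat ]; then
    softnet_summary
fi
```

For the sample output above, the first row decodes to 1961884 packets and 0 drops. A drop count that keeps growing is the usual reason to raise net.core.netdev_max_backlog.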
 jtanner@trainwreck:~$ cat /proc/net/snmp | grep '^Ip:' | cut -f17 -d' '
 ReasmFails
 0
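The cut -f17 approach above depends on the position of the field. As an alternative sketch, the counter can be looked up by name from the header row; snmp_ip_field is a hypothetical helper, and the only assumption is the usual /proc/net/snmp layout of an "Ip:" header line followed by an "Ip:" value line.

```shell
#!/bin/sh
# snmp_ip_field <name> [file]  (hypothetical helper)
# Prints the value of one Ip: counter from /proc/net/snmp, selected by the
# name in the header row instead of by a fixed column number.
snmp_ip_field() {
    awk -v name="$1" '
        # First Ip: line is the header: remember the column of each name
        $1 == "Ip:" && !seen { for (i = 2; i <= NF; i++) col[$i] = i; seen = 1; next }
        # Second Ip: line holds the values
        $1 == "Ip:"          { if (name in col) print $(col[name]); exit }
    ' "${2:-/proc/net/snmp}"
}

# Only run against the live file where it exists (Linux)
if [ -r /proc/net/snmp ]; then
    snmp_ip_field ReasmFails
fi
```

A non-zero and growing ReasmFails value means IP fragments are failing to reassemble.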
 jtanner@trainwreck:~$ ps axo pid,comm,util | grep softirq
     4 ksoftirqd/0       0
     7 ksoftirqd/1       0

== Applications ==
=== Oracle ===
==== vmstat ====
vmstat is the first place to check for performance issues.
* If the wa (time waiting for I/O) column is high, this is usually an indication that the storage subsystem is overloaded. See recipes 8-9 and 8-10 for identifying the sources of I/O contention.
* If r (processes waiting for run time) is consistently greater than the number of CPUs, then you may not have enough CPU processing power; b (processes in uninterruptible sleep) consistently greater than 0 usually means processes are blocked on I/O. See recipe 8-2 for identifying Oracle processes and SQL statements consuming the most CPU.
* If so (memory swapped out to disk) and si (memory swapped in from disk) are consistently greater than 0, you may have a memory bottleneck. See recipe 8-5 for details on identifying Oracle processes and SQL statements consuming the most memory.

==== Enumerating high cpu oracle pids ====
Grab the pid number:
 ps -e -o pcpu,pid,user,tty,args | grep -i oracle | sort -n -k 1 -r | head
Query the database for more information:
 SET LINESIZE 80 HEADING OFF FEEDBACK OFF
 SELECT
   RPAD('USERNAME : ' || s.username, 80) ||
   RPAD('OSUSER   : ' || s.osuser, 80) ||
   RPAD('PROGRAM  : ' || s.program, 80) ||
   RPAD('SPID     : ' || p.spid, 80) ||
   RPAD('SID      : ' || s.sid, 80) ||
   RPAD('SERIAL#  : ' || s.serial#, 80) ||
   RPAD('MACHINE  : ' || s.machine, 80) ||
   RPAD('TERMINAL : ' || s.terminal, 80)
 FROM v$session s, v$process p
 WHERE s.paddr = p.addr
 AND p.spid = '&PID_FROM_OS';

==== Enumerating high memory oracle pids ====
Grab the pid number:
 ps -e -o pmem,pid,user,tty,args | grep -i oracle | sort -n -k 1 -r | head
Query the database for more information:
 SET LINESIZE 80 HEADING OFF FEEDBACK OFF
 SELECT
   RPAD('USERNAME : ' || s.username, 80) ||
   RPAD('OSUSER   : ' || s.osuser, 80) ||
   RPAD('PROGRAM  : ' || s.program, 80) ||
   RPAD('SPID     : ' || p.spid, 80) ||
   RPAD('SID      : ' || s.sid, 80) ||
   RPAD('SERIAL#  : ' || s.serial#, 80) ||
   RPAD('MACHINE  : ' || s.machine, 80) ||
   RPAD('TERMINAL : ' || s.terminal, 80) ||
   RPAD('SQL TEXT : ' || q.sql_text, 80)
 FROM v$session s
     ,v$process p
     ,v$sql q
 WHERE s.paddr = p.addr
 AND p.spid = '&PID_FROM_OS'
 AND s.sql_address = q.address(+)
 AND s.sql_hash_value = q.hash_value(+);

==== Enumerating high disk utilization ====
* Look for devices with abnormally high blocks read or written per second.
* If any device is near 100 percent utilization, that is a strong indicator that I/O is a bottleneck.
If the bottlenecked disks are used by Oracle, then you can query the data dictionary to identify sessions with high I/O activity. The following query is useful for determining which SQL statements generate the most read/write activity:
 SELECT *
 FROM (SELECT
         parsing_schema_name
        ,direct_writes
        ,SUBSTR(sql_text,1,75)
        ,disk_reads
       FROM v$sql
       ORDER BY disk_reads DESC)
 WHERE rownum < 20;
Determining which objects produce the heaviest I/O activity in the database:
 SELECT *
 FROM (SELECT
         s.statistic_name
        ,s.owner
        ,s.object_type
        ,s.object_name
        ,s.value
       FROM v$segment_statistics s
       WHERE s.statistic_name IN
         ('physical reads', 'physical writes', 'logical reads',
          'physical reads direct', 'physical writes direct')
       ORDER BY s.value DESC)
 WHERE rownum < 20;

==== Enumerating network utilization ====
 netstat -ptc
 Proto Recv-Q Send-Q Local Address    Foreign Address   State        PID/Program name
 tcp        0      0 rmug.com:62386   rmug.com:1521     ESTABLISHED  22864/ora_pmon_RMDB
 tcp        0      0 rmug.com:53930   rmug.com:1521     ESTABLISHED  6091/sqlplus
 tcp        0      0 rmug.com:1521    rmug.com:53930    ESTABLISHED  6093/oracleRMDB1
 tcp        0      0 rmug.com:1521    rmug.com:62386    ESTABLISHED  10718/tnslsnr
If the Send-Q (bytes not acknowledged by remote host) column has an unusually high value for a process, this may indicate an overloaded network. A useful aspect of this output is that it shows the operating system process ID (PID) associated with each network connection.

=== Tomcat ===
* http://tomcat.apache.org/articles/performance.pdf
* http://www.devshed.com/c/a/BrainDump/Tomcat-Performance-Tuning/
 1. Decide what needs to be measured.
 2. Decide how to measure.
 3. Measure.
 4. Understand the implications of what you learned.
 5. Modify the configuration in ways that are expected to improve the measurements.
 6. Measure and compare with previous measurements.
 7. Go back to step 4.
* Use ab (apache benchmark) as the benchmarking tool.
* http://jakarta.apache.org/jmeter/
* http://www.itworld.com/networking/83035/tomcat-performance-tuning-tips
** Start the JVM with a higher heap memory maximum using the -Xmx switch.
** Start the JVM with its initial heap memory size (the -Xms switch) set to the same value as its maximum memory size.
** Tune the Connector (web server) thread pool settings to more closely match the web request load you have.
** Tune some additional Connector attribute settings:
*** compression
*** compressableMimeType (compressibleMimeType in newer Tomcat releases)
** If the application uses a database, the connection pool settings are very important: mainly the maxActive, maxIdle, and maxWait attributes of the Resource element where you define your database connection pool.
** HTTP caching headers
* http://www.solutionhacker.com/implement-your-idea/scale-your-website/tomcat-performance-tuning/

== Profiling resource usage with cron and ps ==
 [root@t5400 ~]# crontab -l
 */2 * * * * ps aux > /root/ps.output/ps.`date \+\%Y-\%m-\%d_\%H_\%M_\%S`.out
To archive and clean out the /root/ps.output directory every night, put something like this in /root/ps.archive.sh:
 #!/bin/bash
 # Replace the previous archive, if any
 if [ -f /root/ps.output-archive.tar.bz2 ]
 then
     /bin/rm -f /root/ps.output-archive.tar.bz2
 fi
 # Archive the snapshots and, only if that succeeds, clean out the directory
 /bin/tar -cjf /root/ps.output-archive.tar.bz2 /root/ps.output && /bin/rm -f /root/ps.output/ps.*.out
* chmod +x /root/ps.archive.sh
* crontab -e
 10 0 * * * /root/ps.archive.sh > /dev/null 2>&1
With that cron entry and script, at 10 past midnight every night the contents of /root/ps.output/ are archived into a tarball and then removed. That should leave enough data to determine the problem if the system hangs.
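One way to mine the collected snapshots after a hang is to list the busiest process in each file. This is a minimal sketch, assuming the standard ps aux column order (USER PID %CPU ... COMMAND) and the ps.*.out file naming used by the crontab entry; top_cpu_per_snapshot is a hypothetical helper name.

```shell
#!/bin/sh
# top_cpu_per_snapshot [dir]  (hypothetical helper)
# For each ps aux snapshot in the directory, prints the file name followed by
# the %CPU, PID and command name of the process with the highest %CPU.
# Snapshots in which every process shows 0.0 %CPU produce no line.
top_cpu_per_snapshot() {
    for f in "${1:-/root/ps.output}"/ps.*.out; do
        [ -f "$f" ] || continue
        awk -v file="${f##*/}" '
            # Skip the header row; column 3 is %CPU, 2 is PID, 11 is COMMAND
            NR > 1 && $3 + 0 > max { max = $3 + 0; pid = $2; cmd = $11 }
            END { if (pid != "") print file, max, pid, cmd }
        ' "$f"
    done
}

# Only run where the snapshot directory exists and is readable
if [ -d /root/ps.output ]; then
    top_cpu_per_snapshot /root/ps.output
fi
```

Because the file names embed the timestamp, the output reads as a timeline of the top CPU consumer leading up to the hang. Swapping column 3 for column 4 gives the same view for %MEM.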