Thirty-four years in IT – The Application That Almost Broke Me (Part 9)
The last half of 2011 was a really, really tough time for me and my team.
As I hinted at in this post, by August 2011 we were buried in Oracle 11 and application performance problems. By the time we were back in a period of relative stability that December, we had:
- Six Oracle Sev 1’s open at once, the longest open for months. The six incidents were updated a combined 800 times before they were all finally resolved.
- Multiple extended database outages, most during peak activity at the beginning of the semester.
- Multiple 24-hour+ Oracle support calls.
- An on-site Oracle engineer.
- A corrupt on-disk database, forcing a point-in-time recovery from backups of our primary student records/finance/payroll database.
- Extended work hours, with database patches and configuration changes more weekends than not.
- A forced re-write of major sections of the application to mitigate extremely poor design choices.
- Our application, in order to work around old RDB bugs, was deliberately coded with literal strings in queries instead of passing variables as parameters.
- The application also carried large amounts of legacy code that scanned large, multi-million row database tables one row at a time, selecting each row in turn and performing operations on that row. Just like in the days of Hollerith cards.
- The combination of literals and single-row queries resulted in the Oracle SGA shared pool becoming overrun with simple queries, each used only once, cached, and then discarded. At times we were hard-parsing many thousands of queries per second, each with a literal string in the query, and each referenced and executed exactly once.
- A database engine that mutexed itself to death while trying to parse, insert and expire those queries from the SGA library cache.
- Listener crashes that caused the app – lacking basic error handling – to fail and require an hour or so to recover.
- We missed one required Solaris patch that may have impacted the database.
- We likely were overrunning the interrupts and network stack on the E25k network cards and/or Solaris 10 drivers as we performed many thousands of trivial queries per second. This may have been the cause of our frequent listener crashes.
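To make the row-at-a-time pattern from the list above concrete, here is a minimal sketch in Python. It is not our application code: it uses SQLite from the standard library as a stand-in for Oracle, and the `accounts` table, the fee rule, and the function names are all hypothetical. It only counts how many statements each approach sends to the database to do the same work.

```python
import sqlite3


def make_db(n):
    """Build a small in-memory table (hypothetical schema, SQLite stand-in)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE accounts (id INTEGER PRIMARY KEY, balance INTEGER, fee INTEGER)"
    )
    conn.executemany(
        "INSERT INTO accounts VALUES (?, ?, 0)",
        [(i, i * 10) for i in range(n)],
    )
    return conn


def apply_fee_row_by_row(conn):
    """Legacy style: fetch the keys, then select and update one row at a time."""
    statements = 1  # the initial key scan
    ids = [row[0] for row in conn.execute("SELECT id FROM accounts")]
    for account_id in ids:
        (balance,) = conn.execute(
            "SELECT balance FROM accounts WHERE id = ?", (account_id,)
        ).fetchone()
        statements += 1
        if balance >= 100:
            conn.execute(
                "UPDATE accounts SET fee = 5 WHERE id = ?", (account_id,)
            )
            statements += 1
    return statements


def apply_fee_set_based(conn):
    """Set-based style: one statement, and the database does the scan."""
    conn.execute("UPDATE accounts SET fee = 5 WHERE balance >= 100")
    return 1


def total_fees(conn):
    """Sum of all fees, used to check both styles produce the same result."""
    return conn.execute("SELECT SUM(fee) FROM accounts").fetchone()[0]
```

Both functions leave the table in the same state, but the row-by-row version issues a statement or two per row. On a multi-million row table that means millions of round trips through the listener, the network stack and the parser.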
… we have also identified another serious issue that is stemming from your application design using literals and is also a huge contributor to the fragmentation issues. There is one sql that is the same but only differs with literals and had 67,629 different versions in the shared pool.
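The difference between literals and bind variables can be sketched in a few lines of Python. This is an illustration only, under stated assumptions: SQLite from the standard library stands in for Oracle, the `students` table and the `distinct_statements` helper are hypothetical, and SQLite has no shared pool. The sketch simply counts how many distinct SQL texts the application generates, since each distinct text is what a statement cache would have to parse and store separately.

```python
import sqlite3


def distinct_statements(student_ids, use_binds):
    """Run one lookup per id and return the set of distinct SQL texts issued."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE students (id INTEGER PRIMARY KEY, name TEXT)")
    conn.executemany(
        "INSERT INTO students VALUES (?, ?)",
        [(i, f"student-{i}") for i in student_ids],
    )
    seen = set()
    for sid in student_ids:
        if use_binds:
            # One SQL text for every lookup; the value travels as a parameter.
            sql = "SELECT name FROM students WHERE id = ?"
            conn.execute(sql, (sid,)).fetchone()
        else:
            # The value is baked into the text, so every lookup is a new statement.
            sql = f"SELECT name FROM students WHERE id = {sid}"
            conn.execute(sql).fetchone()
        seen.add(sql)
    conn.close()
    return seen


if __name__ == "__main__":
    ids = range(1000)
    print(len(distinct_statements(ids, use_binds=False)))  # 1000
    print(len(distinct_statements(ids, use_binds=True)))   # 1
```

With literals, a thousand lookups produce a thousand different statements that differ only in the embedded value, which is the same shape as the 67,629 versions noted above. With a bind parameter there is one statement text, reused for every lookup.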
*** This is a Security Bloggers Network syndicated blog from Last In - First Out authored by Michael Janke. Read the original post at: http://feedproxy.google.com/~r/LastInFirstOut/~3/_JvF_RZcIW4/thirty-four-years-application-that.html