Monday, September 12, 2016

Anomaly Detection





  • https://en.wikipedia.org/wiki/Kolmogorov%E2%80%93Smirnov_test
  • 2d KS Test 
  • Wilcoxon signed rank - https://en.wikipedia.org/wiki/Wilcoxon_signed-rank_test
  • Detecting regime change 
    • http://www.beringclimate.noaa.gov/regimes/rodionov_overview.pdf
    • https://www.r-bloggers.com/detecting-regime-change-in-irregular-time-series/
  • DBSCAN - http://techblog.netflix.com/2015/07/tracking-down-villains-outlier.html
  • https://github.com/twitter/AnomalyDetection 
  • https://blog.twitter.com/2015/introducing-practical-and-robust-anomaly-detection-in-a-time-series
  •  https://metamarkets.com/2012/algorithmic-trendspotting-the-meaning-of-interesting/
  • http://techblog.netflix.com/2015/02/rad-outlier-detection-on-big-data.html 
  • https://en.wikipedia.org/wiki/CUSUM
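
Of the techniques linked above, CUSUM is small enough to sketch inline. The following is a minimal two-sided tabular CUSUM, assuming the in-control mean is known up front; the class name, the `slack` and `threshold` parameters, and the reset-after-alarm behaviour are illustrative choices, not taken from any of the linked libraries.

```java
import java.util.ArrayList;
import java.util.List;

/** Minimal two-sided tabular CUSUM detector (illustrative sketch). */
final class Cusum {
    private final double target;    // assumed in-control mean
    private final double slack;     // allowance k: deviations smaller than this are ignored
    private final double threshold; // decision limit h: alarm when a sum exceeds this
    private double high;            // accumulated evidence of an upward shift
    private double low;             // accumulated evidence of a downward shift

    Cusum(double target, double slack, double threshold) {
        this.target = target;
        this.slack = slack;
        this.threshold = threshold;
    }

    /** Feeds one observation; returns true if either cumulative sum crosses the threshold. */
    boolean update(double x) {
        high = Math.max(0.0, high + (x - target - slack));
        low = Math.max(0.0, low + (target - x - slack));
        boolean alarm = high > threshold || low > threshold;
        if (alarm) {
            // Assumption: restart accumulation after each alarm.
            high = 0.0;
            low = 0.0;
        }
        return alarm;
    }

    /** Returns the indices of the observations at which an alarm was raised. */
    static List<Integer> detect(double[] series, double target, double slack, double threshold) {
        Cusum cusum = new Cusum(target, slack, threshold);
        List<Integer> alarms = new ArrayList<>();
        for (int i = 0; i < series.length; i++) {
            if (cusum.update(series[i])) {
                alarms.add(i);
            }
        }
        return alarms;
    }
}
```

For example, `Cusum.detect(new double[] {10, 10, 10, 10, 13, 13, 13, 13}, 10.0, 0.5, 4.0)` raises its first alarm at index 5, two observations after the level shifts from 10 to 13. A larger `threshold` trades slower detection for fewer false alarms.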

Regressions:

Isotonic Regressions
Polynomial
Regressive Random Forests
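
As a reminder of what the first item does, here is a minimal sketch of isotonic regression using the pool-adjacent-violators algorithm, assuming equally weighted points that are already sorted by x. The class and method names are made up for illustration.

```java
/** Isotonic (non-decreasing) least-squares fit via pool-adjacent-violators (illustrative sketch). */
final class IsotonicRegression {

    /** Returns the non-decreasing sequence closest to y in the least-squares sense. */
    static double[] fit(double[] y) {
        int n = y.length;
        double[] sum = new double[n]; // sum of the values pooled into each block
        int[] count = new int[n];     // number of points pooled into each block
        int blocks = 0;

        for (double value : y) {
            // Start a new block holding just this point.
            sum[blocks] = value;
            count[blocks] = 1;
            blocks++;
            // Pool backwards while the previous block's mean exceeds the last block's mean.
            while (blocks > 1
                    && sum[blocks - 2] / count[blocks - 2] > sum[blocks - 1] / count[blocks - 1]) {
                sum[blocks - 2] += sum[blocks - 1];
                count[blocks - 2] += count[blocks - 1];
                blocks--;
            }
        }

        // Expand each block back out to one fitted value per input point.
        double[] fitted = new double[n];
        int i = 0;
        for (int b = 0; b < blocks; b++) {
            double mean = sum[b] / count[b];
            for (int j = 0; j < count[b]; j++) {
                fitted[i++] = mean;
            }
        }
        return fitted;
    }
}
```

Fitting `{1, 3, 2, 4}` pools the out-of-order pair 3 and 2 into their mean, giving `{1, 2.5, 2.5, 4}`.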

Tuesday, October 22, 2013

... Java so fast it blows women's clothes off!

The question I am trying to answer is: what does it take to get Java to perform so fast it blows women's clothes off? The goal is not to understand what is fast, but simply to look at techniques that are faster than anything else available.

1. Never use java.util (collections). They generate a tremendous amount of garbage and are slow.
2. Avoid garbage collection
3. Reuse strings.
4. Async logging - i.e. don't spend all your time writing logs
5. Don't use the heap - i.e. no heap = no GC.
6. Know what the heck your system is doing. Maybe it's the machine and not your crappy code?
7. Use a ring instead of a queue / thread message passing
8. Maybe the problem isn't with your code at all - but with the watch you use?
9. Keep your methods short and sweet - helps with HotSpot
10. Try some exotic features
11. CAS / optimistic locking / lock free
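
To make the ring and CAS items concrete, here is a minimal sketch under stated assumptions: exactly one producer thread and one consumer thread for the ring, primitive `long` messages, and a power-of-two capacity. The class names `SpscRing` and `CasMax` are invented for this example; this is not the code of any particular library.

```java
import java.util.concurrent.atomic.AtomicLong;

/** Single-producer, single-consumer ring of long values (illustrative sketch). */
final class SpscRing {
    private final long[] slots;                       // allocated once, so no garbage per message
    private final int mask;                           // capacity - 1; works because capacity is a power of two
    private final AtomicLong head = new AtomicLong(); // sequence of the next slot to read
    private final AtomicLong tail = new AtomicLong(); // sequence of the next slot to write

    SpscRing(int capacity) {
        if (capacity <= 0 || Integer.bitCount(capacity) != 1) {
            throw new IllegalArgumentException("capacity must be a power of two");
        }
        slots = new long[capacity];
        mask = capacity - 1;
    }

    /** Producer side: returns false instead of blocking when the ring is full. */
    boolean offer(long value) {
        long t = tail.get();
        if (t - head.get() == slots.length) {
            return false;
        }
        slots[(int) (t & mask)] = value;
        tail.set(t + 1); // volatile store publishes the slot write to the consumer
        return true;
    }

    /** Consumer side: returns emptyValue when there is nothing to read. */
    long poll(long emptyValue) {
        long h = head.get();
        if (h == tail.get()) {
            return emptyValue;
        }
        long value = slots[(int) (h & mask)];
        head.set(h + 1); // volatile store hands the slot back to the producer
        return value;
    }
}

/** Lock-free running maximum using a compare-and-set retry loop. */
final class CasMax {
    private final AtomicLong max = new AtomicLong(Long.MIN_VALUE);

    void record(long sample) {
        long current;
        do {
            current = max.get();
            if (sample <= current) {
                return; // nothing to update
            }
        } while (!max.compareAndSet(current, sample)); // retry if another thread won the race
    }

    long get() {
        return max.get();
    }
}
```

The ring never allocates after construction and never takes a lock. The producer only writes `tail` and the consumer only writes `head`, so neither needs CAS. `CasMax` shows the optimistic pattern for the case where several threads do write the same value: read, compute, and retry if someone else got there first.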

DemiGods in this space - 










Monday, October 14, 2013

JVM Monitoring List

Yes - it's another list. In the last few days, I've had a perverse desire to make lists. This one is for tools that allow for JVM monitoring, both internal and external.


Tools:
Open Source:

Vendors:


Metrics:
Log Aggregation


Tutorials:

Distributed File System List

LP Solvers List

I am feeling like making lists. I guess it's the human condition to want to organize and categorize. For this post, I will focus on linear programming solver libraries and links:


Open Source:
Commercial:

Saturday, October 12, 2013

GC Parameters List

I am always on the lookout for common GC parameters. So I figured I'd compile a list of some of the common parameters and the places where I found them:

Super Lists of All Parameters:
  1. http://www.oracle.com/technetwork/java/javase/tech/vmoptions-jsp-140102.html 
  2. http://stas-blogspot.blogspot.fi/2011/07/most-complete-list-of-xx-options-for.html
  3. http://blog.ragozin.info/2011/09/hotspot-jvm-garbage-collection-options.html
  4. http://reins.altervista.org/java/A_Collection_of_JVM_Options_MP.html 

Blog Roll
  • http://practicingtechie.wordpress.com/2013/06/15/java-vm-options/
    • -XX:+UseConcMarkSweepGC
    • -XX:+HeapDumpOnOutOfMemoryError
    • -XX:HeapDumpPath=$APP_HOME_DIR
    • -XX:OnOutOfMemoryError=
    • -XX:OnError=
    • -XX:+PrintGCDetails
    • -XX:+PrintGCTimeStamps
    • -Xloggc:$APP_HOME_DIR/gc.log
    • -XX:+UseGCLogFileRotation
    • -XX:GCLogFileSize=
    • -XX:NumberOfGCLogFiles=
  • http://forum.openspaces.org/thread.jspa?messageID=9277 
    • -Xms2g 
    • -Xmx2g 
    • -XX:+UseConcMarkSweepGC 
    • -XX:+CMSIncrementalMode 
    • -XX:+CMSIncrementalPacing 
    • -XX:CMSIncrementalDutyCycleMin=10 
    • -XX:CMSIncrementalDutyCycle=50 
    • -XX:ParallelGCThreads=8 
    • -XX:+UseParNewGC 
    • -Xmn150m 
    • -XX:MaxGCPauseMillis=2000 
    • -XX:GCTimeRatio=10 
    • -XX:+DisableExplicitGC 
  • http://java-is-the-new-c.blogspot.com/
    • -Xms11g 
    • -Xmx11g 
    • -verbose:gc 
    • -XX:-UseAdaptiveSizePolicy 
    • -XX:SurvivorRatio=12 
    • -XX:NewSize=100m 
    • -XX:MaxNewSize=100m 
    • -XX:MaxTenuringThreshold=2
  • http://blog.igorminar.com/2010/07/dgc-ii-jvm-tuning.html 
    • -XX:+UseConcMarkSweepGC
    • -XX:+UseParNewGC
    • -XX:CMSInitiatingOccupancyFraction=68
    • -XX:MaxTenuringThreshold=31
    • -XX:+CMSParallelRemarkEnabled
    • -XX:SurvivorRatio=6
    • -XX:TargetSurvivorRatio=90 
    • -XX:+AggressiveOpts
    • -XX:+DoEscapeAnalysis 
    • -Xloggc:/some/path/
    • -XX:+PrintGCDetails 
    • -XX:+PrintGCTimeStamps
    • -XX:+PrintGCDateStamps
    • -XX:+PrintTenuringDistribution
    • -XX:+HeapDumpOnOutOfMemoryError
    • -Xmn2818m 
  • http://blog.performize-it.com/2013/09/jvm-params-everyone-should-have-in.html
    • -Xms{#MB}m -Xmx{#MB}m
    • -XX:PermSize={#MB}m -XX:MaxPermSize={#MB}m
    • -XX:+HeapDumpOnOutOfMemoryError
    • -XX:+PrintFlagsFinal
    • -server
    • -XX:+PrintGCDetails
    • -XX:+PrintGCDateStamps 
    • -XX:+PrintTenuringDistribution 
    • -XX:+PrintGCApplicationStoppedTime 
    • -XX:+PrintGCApplicationConcurrentTime  
    • -XX:+UseGCLogFileRotation
    • -XX:NumberOfGCLogFiles={#files}
    • -XX:GCLogFileSize={#MB}M
    • -Xloggc:{some gc log file}.gc 
    • -Dcom.sun.management.jmxremote
    • -Dcom.sun.management.jmxremote.port={a port}
    • -Dcom.sun.management.jmxremote.authenticate=false
  • Big Bank System with a very large Heap (~80gb)
    • -d64
    • -server
    • -XX:+AggressiveOpts
    • -XX:+UseConcMarkSweepGC
    • -XX:+UseParNewGC
    • -XX:ParallelGCThreads=4
    • -XX:NewRatio=4
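
A flag list like the ones above only helps if the flags actually reach the JVM. Here is a minimal sketch, using the standard `java.lang.management` API, that reports the `-X`/`-XX` options the running JVM was started with and the collectors it ended up using. The class name `GcFlagCheck` and the output format are made up for illustration.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;
import java.util.ArrayList;
import java.util.List;

/** Reports the JVM options and garbage collectors in effect (illustrative sketch). */
final class GcFlagCheck {

    /** Returns the -X and -XX options this JVM was started with. */
    static List<String> jvmOptions() {
        List<String> options = new ArrayList<>();
        for (String arg : ManagementFactory.getRuntimeMXBean().getInputArguments()) {
            if (arg.startsWith("-X")) {
                options.add(arg);
            }
        }
        return options;
    }

    /** Returns one line per collector: name, collection count, accumulated time in ms. */
    static List<String> collectorSummary() {
        List<String> lines = new ArrayList<>();
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            lines.add(gc.getName()
                    + ": count=" + gc.getCollectionCount()
                    + ", timeMs=" + gc.getCollectionTime());
        }
        return lines;
    }

    public static void main(String[] args) {
        System.out.println("JVM options:");
        for (String option : jvmOptions()) {
            System.out.println("  " + option);
        }
        System.out.println("Collectors:");
        for (String line : collectorSummary()) {
            System.out.println("  " + line);
        }
    }
}
```

Running it with different collector flags should change the collector names that are printed, which is a quick sanity check that a start script is passing the options through. Which flags a given JVM accepts depends on its version, so treat the lists above as historical.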
Tools
  1. https://github.com/foursquare/heapaudit
  2. https://github.com/twitter/jvmgcprof
  3. https://github.com/Netflix/gcviz
  4. https://github.com/chewiebug/GCViewer

Tutorials
  1. http://www.slideshare.net/aszegedi/everything-i-ever-learned-about-jvm-performance-tuning-twitter
  2. http://java.dzone.com/articles/how-tame-java-gc-pauses
  3. http://blog.ragozin.info/p/garbage-collection.html
  4. http://blog.ragozin.info/2011/10/techtalk-garbage-collection-in-java.html
  5. https://blogs.oracle.com/jonthecollector/entry/the_second_most_important_gc
  6. http://stackoverflow.com/questions/17009961/understanding-the-java-memory-model-and-garbage-collection
  7. http://mechanical-sympathy.blogspot.com/2013/07/java-garbage-collection-distilled.html
  8. http://blog.mgm-tp.com/2013/03/garbage-collection-tuning/
  9. http://www.youtube.com/watch?v=o6qx_zvpOyI  
  10. http://www.infoq.com/presentations/Virtualizing-Tuning-JVM
  11. https://blog.codecentric.de/en/2013/10/useful-jvm-flags-part-7-cms-collector/
  12. http://blog.headius.com/2009/01/my-favorite-hotspot-jvm-flags.html 
  13. http://www.oracle.com/technetwork/java/javase/gc-tuning-6-140523.html
  14. http://java.dzone.com/articles/java-garbage-collection-0 

Tuesday, July 02, 2013

Greenfield Enterprise Architecture for an IT BU

Let's say you had the power to have a completely greenfield development for an entire Enterprise Architecture for an IT BU - what would it look like?

Well, which IT BU you might ask? Does it really matter?

You need a big database. What kind of enterprise architecture can you have without a database?

Next, we need a bunch of ETL to load data from source systems, because there is always a source system to source from. We can either code it or buy it - doesn't matter which.

And there shall be database performance problems with the data-load.

Next, we need some engine code. Let's call it "The System". And it shall be in Java. And it shall have memory problems regardless of whether it's 32-bit or 64-bit or how much memory you allocate to it.

And next, there shall be a web UI, for which countless hours will be spent on things that will never be used. And there shall be an Excel download link on every page, because that's the only feature the users seem to care about.

And there shall be data quality problems, and performance problems, and scalability problems, and extensibility problems, and bulk upload requirements, and usability issues, and technical debt, and The Business will cry out for Salvation.

Surely, there must be another way.

I have spent the last 5 years as an Enterprise Architect at a Tier 1 Investment Bank designing systems that solve Big and Expensive problems. I would like to make a few observations about what worked and what didn't.

1. Users are smart, love Excel, and are better at coding than the H1B coder you got in the 2-for-1 sale from a body shop.
2. Build functions not systems - and expose your functions to your users. Allow the users to create a managed ecosystem around the functionality.
3. Make sure your functions work natively in Excel - think COM C# library.
4. Use elastic infrastructure like HDFS, compute clouds, data grids, etc. Don't build dedicated systems; build services that have clean inputs and outputs and can run on scalable hardware like compute clouds.
5. The database and ETL have been my Achilles heel. The database schema is too rigid for the fast pace of change. On the other hand, the rigidity is required given how central the data is to everything. I have yet to really embrace the NoSQL movement, given the lack of ACID guarantees. There are some promising developments in the form of Impala, which is closer to a pure MPP database running on commodity elastic hardware. Perhaps a middle ground can be found between the strict data model of a traditional database and the loose schema of a NoSQL database.

To be continued....