Monday, December 14, 2015

Check sklearn version

We can write a simple Python script to check the current version of sklearn:


import sklearn 
print('The scikit-learn version is {}.'.format(sklearn.__version__))
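If a script needs to guard against an old release, the version string can be compared numerically. Below is a minimal sketch; the helper name and the minimum version '0.16.1' are just examples, not part of sklearn's API:

```python
def version_at_least(version, minimum):
    """Compare dotted version strings numerically, ignoring non-numeric parts."""
    parse = lambda v: tuple(int(p) for p in v.split('.') if p.isdigit())
    return parse(version) >= parse(minimum)

# e.g. with sklearn.__version__ == '0.17'
print(version_at_least('0.17', '0.16.1'))   # True
print(version_at_least('0.15.2', '0.16.1')) # False
```

Tuple comparison handles the fact that '0.17' is newer than '0.16.1' even though it has fewer components.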

Wednesday, October 21, 2015

Use pip to install packages for a specific Python version

cd /Library/Python/2.*/site-packages/
sudo pip install -t . package_name

Install NumPy and SciPy on Mac OS X El Capitan


Apple ships its own version of Python with OS X. However, it is recommended to install the official Python distribution.

My MacBook Pro already had the Apple Developer Tools installed, but I still needed to install gfortran and the Cython compiler.

To install gfortran, I first tried Homebrew (brew install gcc), but it failed because of a version problem, so I installed it following the instructions here

Thereafter I cloned NumPy and SciPy from their GitHub repositories and installed each with:
python setup.py build
python setup.py install

After these steps, I could import the numpy and scipy modules but could not run their test cases. The error reported was
"ImportError: Need nose >= 0.10.0 for tests - see http://somethingaboutorange.com/mrl/projects/nose"

I tried installing nose via pip, but it failed (I still could not import the nose module afterwards, probably because I have two versions of Python installed). Installing with easy_install later finally succeeded.


Wednesday, October 14, 2015

Upgrade to PHP 5.6 on Ubuntu 12.04

Three commands solve the problem:

sudo add-apt-repository ppa:ondrej/php5-5.6
sudo apt-get update && sudo apt-get dist-upgrade
sudo apt-get install php5

Tuesday, October 6, 2015

Setting up a local web server on OS X

This is the guide I followed to set up my personal web server on Yosemite. Full credit goes to the author, etresoft. The original article can be found here


Here is my definitive guide to getting a local web server running on OS X. This is meant to be a development platform so that you can build and test your sites locally, then deploy to an internet server. This User Tip contains instructions for configuring Apache and PHP. I have another User Tip for installing and configuring MySQL and Perl.
Note: Yosemite introduces some significant changes. Pay attention to your OS version.
Another note: These instructions apply to the client versions of OS X, not Server. Server does a few specific tricks really well and is a good choice for those. For things like database, web, and mail services, I have found it easier to just set up the client OS version manually.

Requirements:
  1. Basic understanding of Terminal.app and how to run command-line programs.
  2. Basic understanding of web servers.
  3. Basic usage of vi. You can substitute nano if you want.
Optional:
  1. Xcode is required for adding PHP modules.

Lines in bold are what you will have to type in at the Terminal.
Replace <your local host> with the name of your machine. Ideally, it should be a one-word name with no spaces or punctuation. It just makes life easier.
Replace <your short user name> with your short user name.

Here goes... Enjoy!

Lion and later versions no longer create personal web sites by default. If you already had a Sites folder in Snow Leopard, it should still be there. To create one manually, enter the following:
mkdir ~/Sites
echo "<html><body><h1>My site works</h1></body></html>" > ~/Sites/index.html.en

PHP is not enabled by default in recent versions of OS X. To enable it, do:
sudo vi /etc/apache2/httpd.conf

Uncomment the following line:
#LoadModule php5_module libexec/apache2/libphp5.so
to
LoadModule php5_module libexec/apache2/libphp5.so
(if you aren't familiar with vi, just press 'x' over the '#' character to delete it. Then type ':w!' to save and then 'ZZ' to quit.)
The LoadModule line is at a different line number depending on your OS version:
10.7 Lion - line 111
10.8 Mountain Lion - line 117
10.9 Mavericks - line 118
10.10 Yosemite - line 169

For Yosemite only, uncomment the following line at line 166:
#LoadModule userdir_module libexec/apache2/mod_userdir.so
to
LoadModule userdir_module libexec/apache2/mod_userdir.so

and do the same at line 493:
#Include /private/etc/apache2/extra/httpd-userdir.conf
to
Include /private/etc/apache2/extra/httpd-userdir.conf
Save and exit.

And again, for Yosemite only, open the file above with:
sudo vi /etc/apache2/extra/httpd-userdir.conf
and uncomment the following line at line 16:
#Include /private/etc/apache2/users/*.conf
to
Include /private/etc/apache2/users/*.conf
Save and exit.

While you are in /etc/apache2, double-check to make sure you have a user config file. It should exist at the path /etc/apache2/users/<your short user name>.conf. That file may not have been created in Lion, and if you upgrade to Mountain Lion you still won't have it. It does appear to be created when you create a new user in Mountain Lion. If the file doesn't exist, you will need to create it with:

sudo vi /etc/apache2/users/<your short user name>.conf

For all systems other than Yosemite, use the following as the content:
<Directory "/Users/<your short user name>/Sites/">
    Options Indexes MultiViews
    AllowOverride None
    Order allow,deny
    Allow from localhost
</Directory>

For Yosemite, use this content:
<Directory "/Users/<your short user name>/Sites/">
    AddLanguage en .en
    LanguagePriority en fr de
    ForceLanguagePriority Fallback
    Options Indexes MultiViews
    AllowOverride None
    Order allow,deny
    Allow from localhost
    Require all granted
</Directory>

In vi, press <esc> and then ZZ to save and quit.

If you want to run Perl scripts, you will have to do something similar:
Note: This section cannot be done on Yosemite. Yosemite does not include /usr/libexec/apache2/mod_perl.so. It should be possible to build your own mod_perl, but that would be outside the scope of this User Tip.

Uncomment the following line: (In Lion this is on line 110. In Mountain Lion it is on line 116. In Mavericks it is on 117.)
#LoadModule perl_module libexec/apache2/mod_perl.so
to
LoadModule perl_module libexec/apache2/mod_perl.so

Then, in /etc/apache2/users/<your short user name>.conf change the line that says:
    Options Indexes MultiViews
to:
    AddHandler perl-script .pl
    PerlHandler ModPerl::Registry
    Options Indexes MultiViews FollowSymLinks ExecCGI

Now you are ready to turn on Apache itself.

In Lion, do the following:
To turn on Apache, go to System Preferences > Sharing and enable Web Sharing.

NOTE: There appears to be a bug in Lion for which I haven't found a workaround. If web sharing doesn't start, just keep trying.

In more recent versions of OS X, the Web Sharing checkbox in System Preferences > Sharing is gone. Instead, do the following:
sudo launchctl load -w /System/Library/LaunchDaemons/org.apache.httpd.plist

In Safari, navigate to your web site with the following address:
http://<your local host>/

It should say:

It works!
Now try your user home directory:
http://<your local host>/~<your short user name>

It should say:

My site works
Now try PHP. Create a PHP info file with:
echo "<?php echo phpinfo(); ?>" > ~/Sites/info.php

And test it by entering the following into Safari's address bar:
http://<your local host>/~<your short user name>/info.php

You should see your PHP configuration information.

If you want to setup MySQL, see my User Tip on Installing MySQL.

If you want to add modules to PHP, I suggest the following site. I can't explain it any better.

If you want to make further changes to your Apache system or user config files, you will need to restart the Apache server with:
sudo apachectl graceful

Monday, October 5, 2015

Argus Examples (ra, racount, racluster, rabins, rasort)

Argus is a data network transaction auditing tool originally developed at CERT in 1993. 

This article briefly summarizes the usage of some Argus tools (ra, racount, racluster, rabins & rasort), based on past experience. For more detailed documentation, please refer to http://qosient.com/argus/manuals.shtml

1. List traffic records under a certain filtering condition
ra -r filename.arg - tcp //list all TCP records 

2. Display record statistics
racount -r filename.arg - udp port domain //display record/packet/byte counts with DNS filtering 

3. Process traffic data into structured ‘bins’ (usually time bins) 
rabins -r filename.arg -M time 1h -m srcid -s load //align data into hourly bins, aggregates on srcid and display load (bps) for each hourly aggregation 

4. Aggregate traffic data and sort (For example, find out what IP address is receiving the most traffic) 
racluster -r filename.arg -m daddr -w - | rasort -r - -m sbytes -s daddr sbytes //use racluster tool to aggregate the records by destination address then pass the aggregated output to rasort tool to sort on the source to destination transaction bytes in descending order

Implement Naive Bayes Classifier with Matlab

Implement Logistic Regression Classifier with Matlab

Implement Neural Network Classifier with Matlab

Implement K-Means Clustering with Matlab


Notes about writing a Makefile

This post contains notes copied from my project writeup demonstrating how to write a Makefile on our own.

A Makefile is a plain text file that contains a set of rules. When "make" is invoked, it builds the first target listed in the Makefile, which is typically the project executable in its default configuration.

Each rule is of the form:

target: prerequisites ...
	commands to build target from prerequisites ...

(Note that each command line must begin with a tab character.)

If one of the prerequisite files specified by a rule doesn’t exist, make attempts to build that prerequisite from another rule that specifies the prerequisite as its target.
Once a target is built, it will not be rebuilt by subsequent invocations of make unless a prerequisite has been modified (thus making the target out of date).

"make" allows a Makefile to assign variables with the syntax "var = value" and to substitute variables into rules with the syntax "$(var)". Variable assignments may be overridden on the "make" command line. For example, most Makefiles assign the CC variable to gcc as the default compiler and write compile rules using "$(CC)" to invoke it. To use the avr-gcc cross compiler instead of the default, you would execute "make CC=avr-gcc" instead of "make".

"make" automatically assigns the variables $@, $<, and $^ when evaluating the commands for a rule: $@ is assigned the file name of the target, $< the file name of the first prerequisite, and $^ a string consisting of the file names of all the prerequisites separated by spaces. This feature allows the rule:

foo.o: foo.c common.h
	gcc -c -Wall -Werror -o foo.o foo.c

to be simplified to:

foo.o: foo.c common.h
	gcc -c -Wall -Werror -o $@ $<


A pattern rule is an ordinary rule that specifies a target, prerequisite, and commands for building the target, except that the file names of the target and prerequisite contain a wildcard ("%") that matches the common stem of the file names. Pattern rules define how to build files of a certain type. For example, the following pattern rule specifies how to build an object file from any C source file:

%.o: %.c
	gcc -c -Wall -Werror -o $@ $<
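
Putting the pieces together, a minimal complete Makefile might look like the following. The names app, foo.c, bar.c, and common.h are hypothetical, chosen only to illustrate the variables, automatic variables, and pattern rule described above:

```make
# Default compiler; override on the command line with "make CC=avr-gcc"
CC = gcc
CFLAGS = -Wall -Werror

# First rule: the default target built by a plain "make"
app: foo.o bar.o
	$(CC) -o $@ $^

# Pattern rule: build any object file from the matching C source file
%.o: %.c common.h
	$(CC) $(CFLAGS) -c -o $@ $<

clean:
	rm -f app *.o
```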

Launch EC2 instance and check status via AWS SDK for Java

Apart from the AWS web console, AWS functionality is also accessible via the Command Line Interface (CLI), APIs, or SDKs. This blog post summarizes how to launch an EC2 instance and test its running status via the AWS SDK for Java.

The programming environment:
Eclipse Java EE IDE for Web Developers.
Version: Luna Release (4.4.0)
Install the AWS toolkit according to the instructions listed on http://aws.amazon.com/eclipse/.

Credentials are important for authentication.
The AmazonEC2Client is the basis for calling the various APIs. (Since in this project I used the instance for load generation, I named the request/result objects runLoadGeneratorRequest/Result.) To launch an instance, the key is the runInstances() method; we can configure the RunInstancesRequest object as needed.

The “loadGeneratorID” variable is used for identification.
Various tasks cannot be performed until the instance is in the "running" state, and the initialization process can take a while.
The code below demonstrates how to test whether the instance specified by "loadGeneratorID" is ready.
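
The original Java snippet for this check is not reproduced in this copy, but the pattern is a simple polling loop. Below is a sketch of that pattern in plain Python, with the status lookup stubbed out: get_state stands in for a describeInstances() call that reads the instance's state by its ID, and the names and timeouts are assumptions, not the post's actual code:

```python
import time

def wait_until_running(get_state, timeout=300, interval=5):
    """Poll get_state() until it returns 'running', or give up after timeout seconds."""
    deadline = time.time() + timeout
    while time.time() < deadline:
        if get_state() == 'running':
            return True
        time.sleep(interval)
    return False

# Example with a stubbed state sequence: pending, pending, running
states = iter(['pending', 'pending', 'running'])
print(wait_until_running(lambda: next(states), timeout=10, interval=0))  # True
```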

Using Elastic Load Balancing and Auto Scaling via AWS SDK for Java

Elastic Load Balancer acts as a network router that sends incoming requests to multiple EC2 instances sitting behind it in a round-robin fashion. The instances it points to can be added dynamically with an Auto Scaling Group. Amazon's Auto Scaling service automatically adds or removes computing resources allocated to an application, by responding to changes in demand.

Interact with the Elastic Load Balancer via AWS SDK. 


The “listeners” object redirects HTTP:80 requests from the load balancer to HTTP:80 on the instance.

An ELB health check can be set up via the HealthCheck class and the ConfigureHealthCheckRequest class. Details can be found in the "AWS SDK for Java API Reference" (http://docs.aws.amazon.com/AWSJavaSDK/latest/javadoc/)

Auto Scaling is commonly used together with CloudWatch, which triggers the appropriate scaling policies.
Before launching an Auto Scaling group, specify a launch configuration using the CreateLaunchConfigurationRequest class and the createLaunchConfiguration() method.
Similarly, launch the Auto Scaling group using the CreateAutoScalingGroupRequest class and the createAutoScalingGroup() method.
Use the PutScalingPolicyRequest class to specify scale-in/scale-out policies; after setting up a policy with the putScalingPolicy() method, record the policy's Amazon Resource Name (ARN) by calling the getPolicyARN() method.

Creating a CloudWatch alarm is a little different. First, specify the alarm dimension with its value set to the name of the launched Auto Scaling group.

The addARN variable refers to the ARN of the scale-out policy, while snsARN refers to the ARN of an SNS action (such as sending email). Then create a PutMetricAlarmRequest object, set its Namespace to "AWS/EC2", and set its Dimensions and AlarmActions to the ones defined above. Other settings such as Statistic, Period, and Threshold depend on the scenario.

Finally, please note that EC2, ELB, Auto Scaling, etc. each use a different Tag class.

Hadoop MapReduce jobs in the native Java framework

Developed from the common word count example, this post demonstrates how to write an inverted-index MapReduce program in the native Hadoop Java framework. Run on a dataset consisting of text files, the inverted index lists all the file names each word appears in, in the form:
word1      filename1, filename2, filename3...
word2      filename1, filename2, filename3...
…...
Environment:
Amazon EMR, AMI version 2.4.6, Hadoop 1.0.3
Code written and compiled in Eclipse using JRE 1.7
Based on MapReduce 1.0 (Hadoop 1.x), using the org.apache.hadoop.mapreduce library

The basic idea: tokenize each text line in the mapper to get the words, output (word, filename) pairs, and aggregate the file names for each word in the reducer.
In the mapper, the Context and FileSplit classes allow you to fetch the name of the file currently being processed:
FileSplit fs = (FileSplit) context.getInputSplit();
String location = fs.getPath().getName();

Unlike the word count example, the mapper output types should be <Text, Text> rather than <Text, IntWritable>. Note that String won't work as a Hadoop key/value type.
In the reducer, remember to remove duplicate file names.
Export the code as an executable jar.
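
The mapper/reducer flow above can be sketched outside Hadoop in a few lines of plain Python. The file names and contents below are made up for illustration; the real job emits Hadoop Text key/value pairs, but the logic is the same:

```python
from collections import defaultdict

def map_phase(filename, text):
    """Mapper: emit a (word, filename) pair for each token in the text."""
    for word in text.split():
        yield word, filename

def reduce_phase(pairs):
    """Reducer: collect the de-duplicated, sorted file names for each word."""
    index = defaultdict(set)
    for word, filename in pairs:
        index[word].add(filename)
    return {word: sorted(files) for word, files in index.items()}

pairs = []
for fname, text in [('doc1.txt', 'apple banana apple'),
                    ('doc2.txt', 'banana cherry')]:
    pairs.extend(map_phase(fname, text))

index = reduce_phase(pairs)
print(index['banana'])  # ['doc1.txt', 'doc2.txt']
print(index['apple'])   # ['doc1.txt']
```

Using a set in the reducer handles the duplicate removal mentioned above, since a word can appear many times in the same file.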

In the meantime, upload the input files into HDFS by executing
hadoop dfs -put <filename> /input/directory
In the code, the input and output paths are specified as the first and second arguments:
FileInputFormat.addInputPath(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));


The generated invertedindex.jar file can be run with
hadoop jar invertedindex.jar /input/directory /output/directory
Please note that /output/directory must not exist beforehand; otherwise the job fails immediately.