Ding's Programming Blog: October 2015

Wednesday, October 21, 2015

Use pip to install packages for specific Python version

cd /Library/Python/2.*/site-packages/
sudo pip install -t . package_name
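
An alternative to writing into site-packages by hand is to run pip through the interpreter you want the package installed for (the `python -m pip` form). A minimal sketch in Python that only builds that command; `pip_install_command` is a hypothetical helper name, and `/usr/bin/python2.7` is an assumed interpreter path that may differ on your machine:

```python
import sys


def pip_install_command(package, interpreter=sys.executable):
    """Build the command that runs pip under one specific interpreter."""
    # "-m pip" makes the chosen interpreter run its own pip module, so the
    # package is installed into that interpreter's site-packages.
    return [interpreter, "-m", "pip", "install", package]


# Only prints the command; pass the list to subprocess.check_call to run it.
print(" ".join(pip_install_command("package_name", "/usr/bin/python2.7")))
```

This assumes pip is already installed for the target interpreter.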

Install NumPy and SciPy on Mac OS X El Capitan

Apple ships its own version of Python with OS X. However, it is recommended to install the official Python distribution.

My MacBook Pro already had the Apple Developer Tools installed, but I still needed to install gfortran and the Cython compiler.

To install gfortran, I tried Homebrew (brew install gcc), but it failed with errors because of a version problem. So I installed it according to the instructions here.

Thereafter I cloned NumPy and SciPy from their GitHub repositories and installed each of them as follows:

python setup.py build

python setup.py install

After these steps, I could import the numpy and scipy modules but could not run the test cases. The error reported was:

"ImportError: Need nose >= 0.10.0 for tests - see http://somethingaboutorange.com/mrl/projects/nose"

I tried installing nose via pip, but it failed (I still could not import the nose module afterwards, probably because I have two versions of Python installed). Later I installed it with easy_install, which finally succeeded.

Wednesday, October 14, 2015

Upgrade to PHP 5.6 on Ubuntu 12.04

Three commands solve the problem:

sudo add-apt-repository ppa:ondrej/php5-5.6
sudo apt-get update && sudo apt-get dist-upgrade
sudo apt-get install php5

Tuesday, October 6, 2015

Setting up a local web server on OS X

This is the guide I followed to set up my personal web server on Yosemite. Full credit goes to the author, etresoft. The original article can be found here.

Here is my definitive guide to getting a local web server running on OS X. This is meant to be a development platform so that you can build and test your sites locally, then deploy to an internet server. This User Tip contains instructions for configuring the Apache and PHP. I have another User Tip for installing and configuring MySQL and Perl.

Note: Yosemite introduces some significant changes. Pay attention to your OS version.

Another note: These instructions apply to the client versions of OS X, not Server. Server does a few specific tricks really well and is a good choice for those. For things like database, web, and mail services, I have found it easier to just set up the client OS version manually.

Requirements:

Basic understanding of Terminal.app and how to run command-line programs.
Basic understanding of web servers.
Basic usage of vi. You can substitute nano if you want.

Optional:

Xcode is required for adding PHP modules.

Lines in bold are what you will have to type in at the Terminal.

Replace <your local host> with the name of your machine. Ideally, it should be a one-word name with no spaces or punctuation. It just makes life easier.

Replace <your short user name> with your short user name.

Here goes... Enjoy!

Lion and later versions no longer create personal web sites by default. If you already had a Sites folder in Snow Leopard, it should still be there. To create one manually, enter the following:

mkdir ~/Sites

echo "<html><body><h1>My site works</h1></body></html>" > ~/Sites/index.html.en

PHP is not enabled in recent versions of OS X. To enable it, do:

sudo vi /etc/apache2/httpd.conf

Uncomment the following line:

#LoadModule php5_module libexec/apache2/libphp5.so

to:

LoadModule php5_module libexec/apache2/libphp5.so

(If you aren't familiar with vi, just press 'x' over the '#' character to delete it. Then type ':w!' to save and then 'ZZ' to quit.)

10.7 Lion - line 111

10.8 Mountain Lion - line 117

10.9 Mavericks - line 118

10.10 Yosemite - line 169

For Yosemite only, uncomment the following line at line 166:

#LoadModule userdir_module libexec/apache2/mod_userdir.so

to:

LoadModule userdir_module libexec/apache2/mod_userdir.so

and do the same at line 493:

#Include /private/etc/apache2/extra/httpd-userdir.conf

to:

Include /private/etc/apache2/extra/httpd-userdir.conf

Save and exit.

And again, for Yosemite only, open the file above with:

sudo vi /etc/apache2/extra/httpd-userdir.conf

and uncomment the following line at line 16:

#Include /private/etc/apache2/users/*.conf

to:

Include /private/etc/apache2/users/*.conf

Save and exit.
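
The "uncomment this line" edits above can also be expressed as a small text transformation. A minimal sketch in Python, assuming you work on a copy of the file contents held in a string; `uncomment_directive` is a hypothetical helper name, and `mod_foo` in the sample is a made-up module used only to show that other lines are left alone:

```python
def uncomment_directive(config_text, directive):
    """Remove the leading '#' from lines whose content starts with directive."""
    result = []
    for line in config_text.splitlines():
        stripped = line.lstrip()
        # Only touch commented lines that carry exactly the wanted directive.
        if stripped.startswith("#") and stripped[1:].lstrip().startswith(directive):
            line = stripped[1:].lstrip()
        result.append(line)
    return "\n".join(result)


sample = (
    "#LoadModule userdir_module libexec/apache2/mod_userdir.so\n"
    "#LoadModule foo_module libexec/apache2/mod_foo.so"
)
print(uncomment_directive(sample, "LoadModule userdir_module"))
```

The sketch does not read or write /etc/apache2/httpd.conf itself; changing the real file still needs sudo, and keeping a backup first is a good idea.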

While you are in /etc/apache2, double-check that you have a user config file. It should exist at the path: /etc/apache2/users/<your short user name>.conf. That file may not be created in Lion, and if you upgrade to Mountain Lion, you still won't have it. It does appear to be created when you create a new user in Mountain Lion. If that file doesn't exist, you will need to create it with:

sudo vi /etc/apache2/users/<your short user name>.conf

For all systems other than Yosemite, use the following as the content:

<Directory "/Users/<your short user name>/Sites/">

Options Indexes MultiViews

AllowOverride None

Order allow,deny

Allow from localhost

</Directory>

For Yosemite, use this content:

<Directory "/Users/<your short user name>/Sites/">

AddLanguage en .en

LanguagePriority en fr de

ForceLanguagePriority Fallback

Options Indexes MultiViews

AllowOverride None

Order allow,deny

Allow from localhost

Require all granted

</Directory>

In vi, press <esc> and then ZZ to save and quit.

If you want to run Perl scripts, you will have to do something similar:

Note: This section cannot be done on Yosemite. Yosemite does not include /usr/libexec/apache2/mod_perl.so. It should be possible to build your own mod_perl, but that would be outside the scope of this User Tip.

Uncomment the following line: (In Lion this is on line 110. In Mountain Lion it is on line 116. In Mavericks it is on 117.)

#LoadModule perl_module libexec/apache2/mod_perl.so

to:

LoadModule perl_module libexec/apache2/mod_perl.so

Then, in /etc/apache2/users/<your short user name>.conf change the line that says:

Options Indexes MultiViews

to:

AddHandler perl-script .pl

PerlHandler ModPerl::Registry

Options Indexes MultiViews FollowSymLinks ExecCGI

Now you are ready to turn on Apache itself.

In Lion, do the following:

To turn on Apache, go to System Preferences > Sharing and enable Web Sharing.

NOTE: There appears to be a bug in Lion for which I haven't found a workaround. If web sharing doesn't start, just keep trying.

In more recent versions of OS X, the Web Sharing checkbox in System Preferences > Sharing is gone. Instead, do the following:

sudo launchctl load -w /System/Library/LaunchDaemons/org.apache.httpd.plist

In Safari, navigate to your web site with the following address:

http://<your local host>/

It should say:

It works!

Now try your user home directory:

http://<your local host>/~<your short user name>

It should say:

My site works

Now try PHP. Create a PHP info file with:

echo "<?php echo phpinfo(); ?>" > ~/Sites/info.php

And test it by entering the following into Safari's address bar:

http://<your local host>/~<your short user name>/info.php

You should see your PHP configuration information.

If you want to set up MySQL, see my User Tip on Installing MySQL.

If you want to add modules to PHP, I suggest the following site. I can't explain it any better.

If you want to make further changes to your Apache system or user config files, you will need to restart the Apache server with:

sudo apachectl graceful

Monday, October 5, 2015

Argus Examples (ra, racount, racluster, rabins, rasort)

Argus is a data network transaction auditing tool originally developed at CERT in 1993.

This article briefly summarizes the usage of some Argus tools (ra, racount, racluster, rabins and rasort), based on past experience. For more detailed documentation, please refer to http://qosient.com/argus/manuals.shtml

1. List traffic records under certain filtering condition

ra -r filename.arg - tcp //list all TCP records

2. Display records statistics

racount -r filename.arg - udp port domain //display record/packet/byte counts with DNS filtering

3. Process traffic data into structured ‘bins’ (usually time bins)

rabins -r filename.arg -M time 1h -m srcid -s load //align data into hourly bins, aggregate on srcid, and display the load (bps) for each hourly aggregation

4. Aggregate traffic data and sort (For example, find out what IP address is receiving the most traffic)

racluster -r filename.arg -m daddr -w - | rasort -r - -m sbytes -s daddr sbytes //use racluster tool to aggregate the records by destination address then pass the aggregated output to rasort tool to sort on the source to destination transaction bytes in descending order
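
To make the last pipeline concrete, here is a minimal sketch in Python of the same idea: aggregate by destination address, then sort by source-to-destination bytes in descending order. It does not call Argus or read .arg files; the records are made-up dictionaries, and `top_receivers` is a hypothetical name:

```python
from collections import defaultdict


def top_receivers(records):
    """Sum sbytes per destination address and sort the totals, largest first."""
    totals = defaultdict(int)
    for rec in records:
        totals[rec["daddr"]] += rec["sbytes"]  # the aggregation step
    # The sorting step: descending by total bytes.
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)


# Made-up flow records, for illustration only.
records = [
    {"daddr": "10.0.0.2", "sbytes": 500},
    {"daddr": "10.0.0.3", "sbytes": 1200},
    {"daddr": "10.0.0.2", "sbytes": 900},
]
for daddr, sbytes in top_receivers(records):
    print(daddr, sbytes)
```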

Implement Naive Bayes Classifier with Matlab

Estimating the joint density directly from the joint density table requires intensive computation, especially with large datasets.

The Naive Bayes classifier simplifies the problem by assuming that the features are conditionally independent given the class.

To implement a Naive Bayes classifier in Matlab, two separate functions can be developed: nb_train and nb_test.

The nb_train() function takes the training dataset x and y, where each row of x is the feature vector of one training instance and the corresponding row of y contains the class label for that instance.

Inside the nb_train() function, the data points are grouped by class. For each class, the conditional probability of each feature is calculated. In Matlab, computation time can be greatly reduced by using matrix arithmetic rather than looping through every element. Also note that smoothing is widely used in Naive Bayes implementations when calculating conditional probabilities. One common approach is additive smoothing, as illustrated below. This example calculates the conditional probabilities of all n features for a binary class problem. catXOnes and catXZeros are the concatenated matrices for the two classes individually.
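
A minimal sketch of additive smoothing, written in Python rather than Matlab and assuming binary (0/1) features, which the post does not state. `conditional_probs` and `alpha` are hypothetical names; `cat_x_ones` and `cat_x_zeros` play the role of catXOnes and catXZeros with made-up data:

```python
def conditional_probs(rows, alpha=1.0):
    """Estimate P(feature_j = 1 | class) for binary features with smoothing."""
    n_rows = len(rows)
    n_features = len(rows[0])
    # Count how often each feature equals 1 within this class.
    counts = [sum(row[j] for row in rows) for j in range(n_features)]
    # Add alpha to each count and 2 * alpha to the total, because a binary
    # feature has two possible values. No probability ends up at exactly 0 or 1.
    return [(c + alpha) / (n_rows + 2 * alpha) for c in counts]


cat_x_ones = [[1, 0, 1], [1, 1, 1]]              # instances labelled 1
cat_x_zeros = [[0, 0, 1], [0, 1, 0], [0, 0, 0]]  # instances labelled 0
print(conditional_probs(cat_x_ones))
print(conditional_probs(cat_x_zeros))
```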

The computed conditional probabilities matrices as well as class priors should be put together inside a model struct as the return value.

Inside nb_test function, class label prediction is performed according to the following formula:

Information retrieved from the model struct is applied on the testing data points to calculate the probability of belonging to each class. The class label is determined as the one with the highest probability.
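
The prediction step can be sketched as follows, again in Python and again assuming binary features. It uses the standard Naive Bayes decision rule: pick the class that maximizes the prior times the product of the per-feature conditional probabilities, computed here as a sum of logarithms to avoid underflow. The layout of `model` (a dict with "priors" and "cond") is an assumption, not the struct from the post:

```python
import math


def nb_predict(model, x):
    """Return the class label with the highest log-posterior for vector x."""
    best_label, best_score = None, -math.inf
    for label, prior in model["priors"].items():
        score = math.log(prior)
        for xj, pj in zip(x, model["cond"][label]):
            # pj is P(feature = 1 | class); use 1 - pj when the feature is 0.
            score += math.log(pj if xj == 1 else 1.0 - pj)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Made-up model values, for illustration only.
model = {
    "priors": {0: 0.6, 1: 0.4},
    "cond": {0: [0.2, 0.4, 0.4], 1: [0.75, 0.5, 0.75]},
}
print(nb_predict(model, [1, 0, 1]))
```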

Implement Logistic Regression Classifier with Matlab

The logistic regression classifier is a linear classifier, roughly of the form P(y|x,w) = sigmoid(x.w) (compared with sign(x.w) in a plain linear classifier).

The cost function is defined as follows:

where S() is the sigmoid function, theta is the current logistic regression parameter vector, and lambda weights the regularization term.
The code for this representation is straightforward to write.

One common approach to minimize the cost function is gradient descent.
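
A minimal sketch of the two helpers in Python, assuming the usual L2-regularized cross-entropy cost, which may differ in scaling from the formula used in the original post. It also regularizes every component of theta, including any bias term, for brevity. `cost_and_gradient` and `lam` are hypothetical names:

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def cost_and_gradient(theta, xs, ys, lam):
    """Regularized cross-entropy cost and its gradient with respect to theta."""
    m = len(xs)
    cost = 0.0
    grad = [0.0] * len(theta)
    for x, y in zip(xs, ys):
        h = sigmoid(sum(t * xi for t, xi in zip(theta, x)))
        cost += -(y * math.log(h) + (1 - y) * math.log(1 - h))
        for j, xi in enumerate(x):
            grad[j] += (h - y) * xi
    # Average over the m examples, then add the L2 penalty and its derivative.
    cost = cost / m + lam / (2 * m) * sum(t * t for t in theta)
    grad = [g / m + lam / m * t for g, t in zip(grad, theta)]
    return cost, grad


print(cost_and_gradient([0.0, 0.0], [[1.0, 0.0], [1.0, 1.0]], [0, 1], 0.0))
```

A gradient descent step then moves theta a small distance against this gradient.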

With these helper functions in place, the training phase can be simplified to:

1. randomly initialize theta

2. call minimize() on the cost function with the initial theta.

3. update theta to the value finally returned by minimize()

The prediction function could hence be written as:
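
A minimal sketch of training and prediction in Python. It swaps the minimize() routine mentioned above for plain fixed-step gradient descent, so it is not the original implementation. `train`, `predict`, `rate`, `steps` and the toy data are all assumptions made for illustration:

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def gradient(theta, xs, ys, lam):
    """Gradient of the regularized cross-entropy cost."""
    m = len(xs)
    grad = [lam / m * t for t in theta]
    for x, y in zip(xs, ys):
        err = sigmoid(sum(t * xi for t, xi in zip(theta, x))) - y
        for j, xi in enumerate(x):
            grad[j] += err * xi / m
    return grad


def train(xs, ys, lam=0.0, rate=0.5, steps=2000, seed=0):
    rng = random.Random(seed)
    theta = [rng.uniform(-0.01, 0.01) for _ in xs[0]]  # 1. random initial theta
    for _ in range(steps):                             # 2. minimize the cost
        grad = gradient(theta, xs, ys, lam)
        theta = [t - rate * g for t, g in zip(theta, grad)]
    return theta                                       # 3. keep the final theta


def predict(theta, x):
    """Predict label 1 when P(y = 1 | x, theta) is at least 0.5, else 0."""
    return 1 if sigmoid(sum(t * xi for t, xi in zip(theta, x))) >= 0.5 else 0


# Toy data: the first column is a constant 1 acting as the bias input.
xs = [[1.0, 0.0], [1.0, 1.0], [1.0, 3.0], [1.0, 4.0]]
ys = [0, 0, 1, 1]
theta = train(xs, ys)
print([predict(theta, x) for x in xs])
```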

Implement Neural Network Classifier with Matlab

A neural network classifier is a multilayer network of logistic units: each unit takes some inputs and produces one output using a logistic classifier, and the output of one unit can be the input of another.
The following schematic represents a neural network with one hidden layer.

As in linear regression, squared error loss is used. Below is the forward propagation algorithm to compute the activation “act”, where W1/W2 are the weights of the edges and b1/b2 are the biases.
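
A minimal sketch of forward propagation for one hidden layer, in pure Python instead of Matlab. The layer sizes and the weight values are made up for illustration; `layer` and `forward` are hypothetical names:

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def layer(weights, biases, inputs):
    """One layer of logistic units: sigmoid(w . inputs + b) for each unit."""
    return [sigmoid(sum(w * v for w, v in zip(row, inputs)) + b)
            for row, b in zip(weights, biases)]


def forward(x, W1, b1, W2, b2):
    hidden = layer(W1, b1, x)    # activations of the hidden layer
    act = layer(W2, b2, hidden)  # activations of the output layer
    return hidden, act


# Made-up weights: 2 inputs, 2 hidden units, 1 output unit.
W1 = [[0.1, -0.2], [0.3, 0.4]]
b1 = [0.0, 0.1]
W2 = [[0.5, -0.6]]
b2 = [0.2]
print(forward([1.0, 2.0], W1, b1, W2, b2))
```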

According to the back propagation algorithm, the cost function and gradients could be calculated as follows.
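
A minimal sketch of the cost and gradients for a single training example, in pure Python. It assumes sigmoid units in both layers and the cost 0.5 * sum((act - y)^2); the original Matlab code may batch examples or scale the cost differently. `cost_and_grads` is a hypothetical name:

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def layer(weights, biases, inputs):
    return [sigmoid(sum(w * v for w, v in zip(row, inputs)) + b)
            for row, b in zip(weights, biases)]


def cost_and_grads(x, y, W1, b1, W2, b2):
    """Squared-error cost and its gradients for one training example."""
    hidden = layer(W1, b1, x)
    act = layer(W2, b2, hidden)
    cost = 0.5 * sum((a - t) ** 2 for a, t in zip(act, y))
    # Output layer: error times the sigmoid derivative a * (1 - a).
    delta2 = [(a - t) * a * (1 - a) for a, t in zip(act, y)]
    # Hidden layer: propagate delta2 back through W2.
    delta1 = [sum(W2[k][j] * delta2[k] for k in range(len(delta2))) * h * (1 - h)
              for j, h in enumerate(hidden)]
    grads = {
        "W2": [[d * h for h in hidden] for d in delta2],
        "b2": delta2,
        "W1": [[d * xi for xi in x] for d in delta1],
        "b1": delta1,
    }
    return cost, grads
```

A quick way to check an implementation like this is to compare one analytic gradient entry against a finite-difference estimate of the cost.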

Implement K-Means Clustering with Matlab

K-Means is a popular clustering algorithm that runs fast and scales well. Moreover, it can be implemented easily using the Matlab built-in function “pdist2” (for details, see http://www.mathworks.com/help/stats/pdist2.html).

[clusterCenters, clusterBelonging] = k_means(data, k, startingPoints)
data: points to be clustered
k: number of clusters
startingPoints: the starting centroids of the k clusters. If not given explicitly, one common approach is to randomly select k points from the input dataset
clusterCenters: the centers of the clusters after running K-Means
clusterBelonging: the cluster each data point belongs to
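
A minimal sketch of a function with this interface, in pure Python. It computes squared Euclidean distances directly instead of calling pdist2; `max_iter` and `seed` are assumptions that are not part of the signature above:

```python
import random


def k_means(data, k, starting_points=None, max_iter=100, seed=0):
    """Plain K-Means; returns (cluster_centers, cluster_belonging)."""
    if starting_points is None:
        # Common default: pick k of the input points at random.
        starting_points = random.Random(seed).sample(data, k)
    centers = [list(p) for p in starting_points]
    belonging = None
    for _ in range(max_iter):
        # Assignment step: index of the nearest center for every point.
        new_belonging = [
            min(range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centers[c])))
            for p in data
        ]
        if new_belonging == belonging:
            break  # assignments stopped changing
        belonging = new_belonging
        # Update step: move each center to the mean of its assigned points.
        for c in range(k):
            members = [p for p, b in zip(data, belonging) if b == c]
            if members:  # an empty cluster keeps its previous center
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return centers, belonging


data = [[0, 0], [0, 1], [10, 10], [10, 11]]
print(k_means(data, 2, [[0, 0], [10, 10]]))
```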


Notes about writing a Makefile

This post contains notes copied from my project writeup that demonstrate how to write a Makefile on our own.

When "make" is invoked, it builds the first target listed in the Makefile, which is typically the project executable in its default configuration.

A Makefile is a plain text file that contains a set of rules.

Each rule is of the form:
target: prerequisites ...
	commands to build target from prerequisites ...

If one of the prerequisite files specified by a rule doesn’t exist, make attempts to build that prerequisite from another rule that specifies the prerequisite as its target.
Once a target is built, it will not be rebuilt by subsequent invocations of make unless a prerequisite is modified (thus making the target out of date).
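
The out-of-date check can be sketched in Python as a comparison of file modification times. This is only an illustration of the idea, not how make is implemented, and it assumes all prerequisites already exist; `needs_rebuild` is a hypothetical name:

```python
import os


def needs_rebuild(target, prerequisites):
    """True if the target is missing or older than any of its prerequisites."""
    if not os.path.exists(target):
        return True
    target_mtime = os.path.getmtime(target)
    # A prerequisite modified after the target makes the target out of date.
    return any(os.path.getmtime(p) > target_mtime for p in prerequisites)
```

For example, `needs_rebuild("foo.o", ["foo.c", "common.h"])` would be true right after editing foo.c and false again once foo.o has been rebuilt.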

"make" allows a Makefile to assign variables with the syntax “var = value”, and substitution of variables into rules with the syntax “$(var)”. Variable assignments may be overridden
on the “make” command line. For example, most Makefiles assign the CC variable to gcc as the default compiler and write compile rules using “$(CC)” to invoke it. If you wanted to substitute the default compiler with the avr-gcc cross compiler, you would execute “make CC=avr-gcc” instead of “make”.

"make" automatically assigns the variables $@, $<, and $^ when evaluating commands for a rule: $@ is assigned the file name of the target, $< is assigned the file name of the first prerequisite, and $^ is assigned a string consisting of the file names of all the prerequisites with spaces between them. This feature allows the rule:

foo.o: foo.c common.h
	gcc -c -Wall -Werror -o foo.o foo.c

to be simplified to:

foo.o: foo.c common.h
	gcc -c -Wall -Werror -o $@ $<

A pattern rule is an ordinary rule that specifies a target, prerequisite, and commands for building the target, except that the file names for the target and prerequisite contain a wildcard (“%”) that matches any non-empty string (in the example below, the part of the file name before the extension). Pattern rules define how to build files of a certain type. For example, the following pattern rule specifies how to build an object file from any C source file:

%.o: %.c
	gcc -c -Wall -Werror -o $@ $<
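
To see how the wildcard and the automatic variables work together, here is a minimal sketch in Python that imitates the substitution for a single rule. It is a simplification (for instance, real make also removes duplicate names from $^); `match_pattern` and `expand_command` are hypothetical names:

```python
def match_pattern(pattern, filename):
    """Return the text matched by '%' if filename fits the pattern, else None."""
    prefix, _, suffix = pattern.partition("%")
    if (filename.startswith(prefix) and filename.endswith(suffix)
            and len(filename) > len(prefix) + len(suffix)):
        return filename[len(prefix):len(filename) - len(suffix)]
    return None


def expand_command(command, target, prerequisites):
    """Substitute $@, $< and $^ in a rule's command."""
    return (command.replace("$@", target)
                   .replace("$<", prerequisites[0])
                   .replace("$^", " ".join(prerequisites)))


stem = match_pattern("%.o", "foo.o")  # the part matched by '%'
prereq = "%.c".replace("%", stem)     # the same stem fills in the prerequisite
print(expand_command("gcc -c -Wall -Werror -o $@ $<", "foo.o", [prereq]))
```

For the target foo.o this prints the same command as the explicit foo.o rule shown earlier.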

Launch EC2 instance and check status via AWS SDK for Java

Apart from the AWS web console, AWS functionality is also accessible via the Command Line Interface (CLI), APIs, or SDKs. This blog post summarizes how to launch an EC2 instance and check its running status via the AWS SDK for Java.

The programming environment:
Eclipse Java EE IDE for Web Developers.
Version: Luna Release (4.4.0)
Install the AWS toolkit according to the instructions listed on http://aws.amazon.com/eclipse/.

Credentials are important for authentication.

AmazonEC2Client is the basis for calling the various APIs. (Since I used the instance for load generation in this project, I named the objects runLoadGeneratorRequest/Result.) To launch an instance, the key is the .runInstances() method. We can configure the RunInstancesRequest object as needed.

The “loadGeneratorID” variable is used for identification.
Many tasks cannot be performed until the instance is in the “running” state, and initialization can take a while.
The code below demonstrates how to test whether the instance specified by “loadGeneratorID” is ready.
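
The waiting logic itself is independent of the SDK and can be sketched in Python as a simple polling loop. `get_state` is a hypothetical callable standing in for whatever SDK call returns the instance state name (for example, a wrapper around a describe-instances request); `poll_seconds` and `max_polls` are assumed parameters:

```python
import time


def wait_until_running(get_state, instance_id, poll_seconds=5.0, max_polls=60,
                       sleep=time.sleep):
    """Poll get_state(instance_id) until it returns "running".

    Returns True once the instance is running, or False if it is still not
    running after max_polls checks.
    """
    for _ in range(max_polls):
        if get_state(instance_id) == "running":
            return True
        sleep(poll_seconds)  # wait before asking again
    return False
```

Passing `sleep` as a parameter keeps the loop easy to test without real delays.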

Using Elastic Load Balancing and Auto Scaling via AWS SDK for Java

Elastic Load Balancer acts as a network router that sends incoming requests to multiple EC2 instances sitting behind it in a round-robin fashion. The instances it points to can be added dynamically with an Auto Scaling Group. Amazon's Auto Scaling service automatically adds or removes computing resources allocated to an application, by responding to changes in demand.
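
Round-robin dispatch simply hands each new request to the next instance in a repeating cycle. A minimal sketch in Python with made-up instance names; it illustrates the scheduling idea only and does not talk to AWS:

```python
import itertools


def round_robin(instances):
    """Return an iterator that repeats the instances in order, forever."""
    return itertools.cycle(instances)


backend = round_robin(["instance-a", "instance-b", "instance-c"])
routed = [next(backend) for _ in range(5)]  # targets for five requests
print(routed)
```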

Interact with the Elastic Load Balancer via AWS SDK.

The “listeners” object redirects HTTP:80 requests from the load balancer to HTTP:80 on the instance.

The ELB health check page can be set up via the HealthCheck class and the ConfigureHealthCheckRequest class. Details can be found in the "AWS SDK for Java API Reference" (http://docs.aws.amazon.com/AWSJavaSDK/latest/javadoc/).

AutoScaling is commonly used together with CloudWatch, which invokes the appropriate scaling policies.
Prior to launching an Auto Scaling group, specify a launch configuration using the CreateLaunchConfigurationRequest class and the createLaunchConfiguration() method.
Similarly, launch the Auto Scaling group using the CreateAutoScalingGroupRequest class and the createAutoScalingGroup() method.
Use the PutScalingPolicyRequest class to specify the scale-in and scale-out policies. After setting up a policy with the putScalingPolicy() method, record the policy's Amazon Resource Name (ARN) by calling the getPolicyARN() method.

Creating a CloudWatch alarm is a little different. First specify the alarm dimension, with the name of the launched ASG as its value.

The addARN variable refers to the ARN of the scale-out policy, while snsARN refers to the ARN of an SNS action (such as sending emails). Thereafter, create a PutMetricAlarmRequest object and specify the Namespace as “AWS/EC2” and the Dimensions and AlarmActions as the ones defined above. Other settings such as Statistic, Period, and Threshold depend on the scenario.

Finally, please note that EC2, ELB, ASG, etc. each use a different Tag class.

Hadoop MapReduce jobs in the native Java framework

Developed from the common word count example, this post demonstrates how to write an inverted index MapReduce program in the native Hadoop Java framework. Run on a dataset consisting of text files, an inverted index lists all the file names that each word appears in, in the form:
word1 filename1, filename2, filename3...
word2 filename1, filename2, filename3...
…...
Environment:
Amazon EMR 2.4.6 AMI version Hadoop 1.0.3
Write code and compile in Eclipse using JRE 1.7
Based on MapReduce 1.0 (Hadoop 1.xx), uses the org.apache.hadoop.mapreduce library

The basic idea is to tokenize each text line in the mapper to get the words, output a (word, filename) pair for each one, and aggregate the file names for each word in the reducer.
In the mapper, the Context and FileSplit classes allow you to fetch the name of the file currently being processed.
FileSplit fs = (FileSplit) context.getInputSplit();
String location = fs.getPath().getName();

Different from the word count example, the mapper output types should be changed to <Text, Text> instead of <Text, IntWritable>. Note that "String" won’t work.
In the reducer, remember to remove duplicate file names.
Export the code as an executable jar.
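
The map and reduce logic can be sketched in memory with plain Python, which may help before writing the Hadoop classes. This is not Hadoop code: the dictionary `grouped` stands in for the shuffle phase, the file names and contents are made up, and all function names are hypothetical:

```python
from collections import defaultdict


def map_line(filename, line):
    """Mapper: emit a (word, filename) pair for every token in the line."""
    return [(word, filename) for word in line.split()]


def reduce_word(word, filenames):
    """Reducer: drop duplicate file names and join the rest."""
    return word, ", ".join(sorted(set(filenames)))


def inverted_index(files):
    grouped = defaultdict(list)  # stands in for the shuffle phase
    for filename, lines in files.items():
        for line in lines:
            for word, name in map_line(filename, line):
                grouped[word].append(name)
    return dict(reduce_word(w, names) for w, names in grouped.items())


files = {"a.txt": ["hello world", "hello"], "b.txt": ["world peace"]}
for word, names in sorted(inverted_index(files).items()):
    print(word, names)
```

The `set()` call in the reducer is the duplicate removal step mentioned above.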

In the meantime, upload the input file into HDFS by executing:
hadoop dfs -put <filename> /input/directory
In the code, the input and output paths are specified as the first and second arguments:
FileInputFormat.addInputPath(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));

The generated invertedindex.jar file can be run with:
hadoop jar invertedindex.jar /input/directory /output/directory
Please note that /output/directory must not exist beforehand; otherwise the job will fail immediately.