Red Hat Developer Toolset 1.1

Tru Huynh of centos.org has built the Red Hat Developer Toolset 1.1 for CentOS, and it contains GCC 4.7.2.

So you can simply use his repo and install just GCC right away.

cd /etc/yum.repos.d
wget http://people.centos.org/tru/devtools-1.1/devtools-1.1.repo 
yum --enablerepo=testing-1.1-devtools-6 install devtoolset-1.1-gcc devtoolset-1.1-gcc-c++

This will most likely install it into /opt/centos/devtoolset-1.1/root/usr/bin/.

Then you can tell your build process to use GCC 4.7 instead of 4.4 via the CC, CPP, and CXX variables:

export CC=/opt/centos/devtoolset-1.1/root/usr/bin/gcc  
export CPP=/opt/centos/devtoolset-1.1/root/usr/bin/cpp
export CXX=/opt/centos/devtoolset-1.1/root/usr/bin/c++

Also worth noting: instead of setting individual variables you can run
 scl enable devtoolset-1.1 bash
(this just starts a new shell with all the appropriate variables already set).
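As a quick sanity check (a minimal sketch; the paths assume the default devtoolset location shown above):

scl enable devtoolset-1.1 bash
gcc --version    # should now report gcc (GCC) 4.7.2
which gcc        # should point into /opt/centos/devtoolset-1.1/root/usr/bin/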

How to install Maven on CentOS

Apache Maven is project management software that manages the building, reporting and documentation of a Java development project. To install and configure Apache Maven on CentOS, follow these steps.

First of all, you need to install the Java 1.7 JDK. Make sure to install the JDK, not the JRE.

Then go ahead and download the latest Maven binary from its official site. For example, for version 3.0.5:

$ wget http://mirror.cc.columbia.edu/pub/software/apache/maven/maven-3/3.0.5/binaries/apache-maven-3.0.5-bin.tar.gz
$ sudo tar xzf apache-maven-3.0.5-bin.tar.gz -C /usr/local
$ cd /usr/local
$ sudo ln -s apache-maven-3.0.5 maven

 

Next, set up Maven path system-wide:

$ sudo vi /etc/profile.d/maven.sh
export M2_HOME=/usr/local/maven
export PATH=${M2_HOME}/bin:${PATH}

Finally, log out and log back in to activate the above environment variables.
To verify that Maven was installed successfully, check its version:

$ mvn -version

 

Optionally, if you are using Maven behind a proxy, configure it as follows.

$ vi ~/.m2/settings.xml
<settings>
  <proxies>
    <proxy>
      <active>true</active>
      <protocol>http</protocol>
      <host>proxy.host.com</host>
      <port>port_number</port>
      <username>proxy_user</username>
      <password>proxy_user_password</password>
      <nonProxyHosts>www.google.com</nonProxyHosts>
    </proxy>
  </proxies>
</settings>
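To confirm that Maven picks up the proxy configuration, you can print the effective settings and look for the proxy block (a quick check; the host and port shown are the placeholders from the example above):

$ mvn help:effective-settings | grep -A 7 '<proxy>'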

Getting Started With Hadoop Using Hortonworks Sandbox

Getting started with a distributed system like Hadoop can be a daunting task for developers. From installing and configuring Hadoop to learning the basics of MapReduce and other add-on tools, the learning curve is pretty high.

Hortonworks recently released the Hortonworks Sandbox for anyone interested in learning and evaluating enterprise Hadoop.

The Hortonworks Sandbox provides:

  1. A virtual machine with Hadoop preconfigured.
  2. A set of hands-on tutorials to get you started with Hadoop.
  3. An environment to help you explore related projects in the Hadoop ecosystem like Apache Pig, Apache Hive, Apache HCatalog and Apache HBase.

You can download the Sandbox from Hortonworks website:

http://hortonworks.com/products/hortonworks-sandbox/

The Sandbox download is available for both VirtualBox and VMware Fusion/Player environments. Just follow the instructions to import the Sandbox into your environment.

The download is an OVA (open virtual appliance), which is really a TAR file.

tar -xvf Hortonworks+Sandbox+1.2+1-21-2012-1+vmware.ova

Untar it; the archive consists of an OVF (Open Virtualization Format) descriptor file, a manifest file, and a disk image in VMDK format.

Rackspace Cloud doesn’t let you upload your own images, but if you have an OpenStack based cloud, you can boot a virtual machine with the image provided.

First, you can convert the vmdk image to a more familiar format like qcow2.

qemu-img convert -c -O qcow2 Hortonworks_Sandbox_1.2_1-21-2012-1_vmware-disk1.vmdk hadoop-sandbox.qcow2

file hadoop-sandbox.qcow2
hadoop-sandbox.qcow2: QEMU QCOW Image (v2), 17179869184 bytes

Now, let’s upload the image to Glance.

glance add name="hadoop-sandbox" is_public=true container_format=bare disk_format=qcow2 < /path/to/hadoop-sandbox.qcow2

Now let's create a virtual server from the new image – give it at least 4 GB of RAM.

nova boot --flavor $flavor_id --image $image_id hadoop-sandbox
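If you are unsure which IDs to pass for $flavor_id and $image_id, you can look them up first (a quick sketch using the classic nova client; flavor names vary per cloud):

nova flavor-list     # pick a flavor with at least 4 GB of RAM
nova image-list      # note the ID of the "hadoop-sandbox" image uploaded above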

Once the instance reaches ACTIVE status and responds to ping, you can ssh into it using:

  • Username: root
  • Password: hadoop

Watch /var/log/boot.log as the services come up; it will let you know when the installation is complete. This can take about 10 minutes.
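For example, once logged in as root you can simply follow the boot log until it reports completion:

tail -f /var/log/boot.log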

At the end, you should have these java processes running:

jps
2912 TaskTracker
2336 DataNode
2475 SecondaryNameNode
3343 HRegionServer
2813 JobHistoryServer
2142 NameNode
3012 QuorumPeerMain
4215 RunJar
4591 Jps
3568 RunJar
3589 RunJar
1559 Bootstrap
2603 JobTracker
3857 RunJar

Point your browser at http://instance_ip and your single-node Hadoop cluster should be running. Just follow the UI; it has demos, videos and step-by-step hands-on tutorials on Hadoop, Pig, Hive and HCatalog.

Make your web site faster

Google’s mod_pagespeed speeds up your site and reduces page load time. This open-source Apache HTTP server module automatically applies web performance best practices to pages, and associated assets (CSS, JavaScript, images) without requiring that you modify your existing content or workflow.

Features
  • Automatic website and asset optimization
  • Latest web optimization techniques
  • 40+ configurable optimization filters
  • Free, open-source, and frequently updated
  • Deployed by individual sites, hosting providers, and CDNs

How does mod_pagespeed speed up websites?

mod_pagespeed improves web page latency and bandwidth usage by changing the resources on that web page to implement web performance best practices. Each optimization is implemented as a custom filter in mod_pagespeed; the filters are executed when the Apache HTTP server serves the website assets. Some filters simply alter the HTML content, and other filters change references to CSS, JavaScript, or images to point to more optimized versions.

mod_pagespeed implements custom optimization strategies for each type of asset referenced by the website, to make them smaller, reduce the loading time, and extend the cache lifetime of each asset. These optimizations include combining and minifying JavaScript and CSS files, inlining small resources, and others. mod_pagespeed also dynamically optimizes images by removing unused meta-data from each file, resizing the images to specified dimensions, and re-encoding images to be served in the most efficient format available to the user.

mod_pagespeed ships with a set of core filters designed to safely optimize the content of your site without affecting the look or behavior of your site. In addition, it provides a number of more advanced filters which can be turned on by the site owner to gain higher performance improvements.

mod_pagespeed can be deployed and customized for individual web sites, as well as being used by large hosting providers and CDNs to help their users improve performance of their sites, lower the latency of their pages, and decrease bandwidth usage.

Installing mod_pagespeed

Supported platforms

  • CentOS/Fedora (32-bit and 64-bit)
  • Debian/Ubuntu (32-bit and 64-bit)

To install the packages, on Debian/Ubuntu, please run (as root) the following command:

dpkg -i mod-pagespeed-*.deb
apt-get -f install

For CentOS/Fedora, please execute (also as root):

yum install at  # if you do not already have 'at' installed
rpm -U mod-pagespeed-*.rpm

Installing mod_pagespeed will add the Google repository so your system will automatically keep mod_pagespeed up to date. If you don’t want Google’s repository, do sudo touch /etc/default/mod-pagespeed before installing the package.

You can also download a number of system tests. These are the same tests available on ModPageSpeed.com.

What is installed

  • The mod_pagespeed packages install two versions of the mod_pagespeed code itself, mod_pagespeed.so for Apache 2.2 and mod_pagespeed_ap24.so for Apache 2.4.
  • Configuration files: pagespeed.conf, pagespeed_libraries.conf, and (on Debian) pagespeed.load. If you modify one of these configuration files, that file will not be upgraded automatically in the future.
  • A standalone JavaScript minifier, pagespeed_js_minify, based on the one used in mod_pagespeed, which can both minify JavaScript and generate metadata for library canonicalization.
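After restarting Apache, one quick way to verify that the module is active is to look for the X-Mod-Pagespeed response header (a simple check; replace the URL with your own site):

curl -s -D - -o /dev/null http://www.example.com/ | grep -i X-Mod-Pagespeed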

Facebook Events Join the Contextual-Computing Party

Facebook made a tweak to its Events system this week, adding a little embedded forecast that shows projected weather on the day of the event. It’s a small change, but part of a big shift in computing.

Facebook CEO Mark Zuckerberg at a product launch earlier this month. Photo: Alex Washburn/Wired

The new feature, described by Facebook in briefings with individual reporters, pulls forecasts for the location of the event from monitoring company Weather Underground and attaches it to the Facebook pages of events happening within the next 10 days. The data is also shown while the event is being created, helping organizers avoid rained-out picnics and the like.

The change makes Facebook more sensitive to contextual information, data like location and time of day that the user doesn’t even have to enter. Facebook rival Google has drawn big praise for its own context-sensitive application Google Now, which, depending on your habits, might show you weather and the day’s appointments when you wake up, traffic information when you get in your car, and your boarding pass when you arrive at the airport. Google Now was so successful on Android smartphones that Google is reportedly porting the app to Apple’s iOS.

Apple’s own stab at contextual computing, the Siri digital assistant, has been less successful, but that seems to have more to do with implementation issues – overloaded servers, bad maps, and tricky voice-recognition problems – than with the idea of selecting information based on location and other situational data.

Hungry as Facebook is to sell ever-more-targeted ads at ever-higher premiums, expect the social network to add more context-sensitive features. One natural step is putting the Graph Search search engine on mobile phones and tailoring results more closely to location. Another is to upgrade Facebook’s rapidly evolving News Feed, which already filters some information based on your past check-ins, along the same lines. Done right, pushing information to Facebook users based on context could multiply the social network’s utility. Done wrong, it could be creepy on a whole new level.

Running a Multi-Broker Apache Kafka 0.8 Cluster on a Single Node

In this article I describe how to install, configure and run a multi-broker Apache Kafka 0.8 (trunk) cluster on a single machine. The final setup consists of one local ZooKeeper instance and three local Kafka brokers. We will test-drive the setup by sending messages to the cluster via a console producer and receiving those messages via a console consumer. I will also describe how to build Kafka for Scala 2.9.2, which makes it much easier to integrate Kafka with other Scala-based frameworks and tools that require Scala 2.9 instead of Kafka’s default Scala 2.8.

What we want to do

Here is an overview of what we want to do:

  • Build Kafka 0.8-trunk for Scala 2.9.2.
    • I also provide instructions for the default 2.8.0, just in case.
  • Use a single machine for this Kafka setup.
  • Run 1 ZooKeeper instance on that machine.
  • Run 3 Kafka brokers on that machine.
  • Create a Kafka topic called “zerg.hydra” and send/receive messages for that topic via the console. The topic will be configured to use 3 partitions and 2 replicas per partition.

The purpose of this article is not to present a production-ready configuration of a Kafka cluster. However it should get you started with using Kafka as a distributed messaging system in your own infrastructure.

Installing Kafka

Background: Why Kafka and Scala 2.9?

Personally I’d like to use Scala 2.9.2 for Kafka – which is still built for Scala 2.8.0 by default as of today – because many related software packages that are of interest to me (such as Finagle, Kestrel) are based on Scala 2.9. Also, the current versions of many development and build tools (e.g. IDEs, sbt) for Scala require at least version 2.9. If you are working in a similar environment you may want to build Kafka for Scala 2.9 just like Michael G. Noll did – otherwise you can expect to run into issues such as Scala version conflicts.

Option 1 (preferred): Kafka 0.8-trunk with Scala 2.9.2

Unfortunately the current trunk of Kafka has problems building against Scala 2.9.2 out of the box. Michael G. Noll created a fork of Kafka 0.8-trunk that includes the required fix (a change to one file) in the branch “scala-2.9.2”. The fix ties the Scala version used by Kafka’s shell scripts to 2.9.2 instead of 2.8.0.

The following instructions will use this fork to download, build and install Kafka for Scala 2.9.2:

$ cd $HOME
$ git clone git@github.com:miguno/kafka.git
$ cd kafka
# this branch includes a patched bin/kafka-run-class.sh for Scala 2.9.2
$ git checkout -b scala-2.9.2 remotes/origin/scala-2.9.2
$ ./sbt update
$ ./sbt "++2.9.2 package"

Option 2: Kafka 0.8-trunk with Scala 2.8.0

If you are fine with Scala 2.8 you need to build and install Kafka as follows.

$ cd $HOME
$ git clone git@github.com:apache/kafka.git
$ cd kafka
$ ./sbt update
$ ./sbt package

Configuring and running Kafka

Unless noted otherwise all commands below assume that you are in the top level directory of your Kafka installation. If you followed the instructions above, this directory is $HOME/kafka/.

Configure your OS

For Kafka 0.8 it is recommended to increase the maximum number of open file handles, because 0.8 keeps more file handles open than 0.7 did. The exact number depends on your usage patterns, of course, but on the Kafka mailing list the ballpark figure “tens of thousands” was shared:

In Kafka 0.8, we keep the file handles for all segment files open until they are garbage collected. Depending on the size of your cluster, this number can be pretty big. Few 10 K or so.

For instance, to increase the maximum number of open file handles for the user kafka to 98,304 (change kafka to whatever user you are running the Kafka daemons with – this can be your own user account, of course) you must add the following line to /etc/security/limits.conf:

/etc/security/limits.conf
kafka    -    nofile    98304
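Once the kafka user logs in again, you can verify that the new limit is in effect (a quick check run as that user):

$ ulimit -n
98304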

Start ZooKeeper

Kafka ships with a reasonable default ZooKeeper configuration for our simple use case. The following command launches a local ZooKeeper instance.

Start ZooKeeper
$ bin/zookeeper-server-start.sh config/zookeeper.properties

By default the ZooKeeper server will listen on *:2181/tcp.

Configure and start the Kafka brokers

We will create 3 Kafka brokers, whose configurations are based on the default config/server.properties. Apart from the settings below the configurations of the brokers are identical.

The first broker:

Create the config file for broker 1
$ cp config/server.properties config/server1.properties

Edit config/server1.properties and replace the existing config values as follows:

broker.id=1
port=9092
log.dir=/tmp/kafka-logs-1

The second broker:

Create the config file for broker 2
$ cp config/server.properties config/server2.properties

Edit config/server2.properties and replace the existing config values as follows:

broker.id=2
port=9093
log.dir=/tmp/kafka-logs-2

The third broker:

Create the config file for broker 3
$ cp config/server.properties config/server3.properties

Edit config/server3.properties and replace the existing config values as follows:

broker.id=3
port=9094
log.dir=/tmp/kafka-logs-3

Now you can start each Kafka broker in a separate console:

Start the first broker in its own terminal session
$ env JMX_PORT=9999  bin/kafka-server-start.sh config/server1.properties
Start the second broker in its own terminal session
$ env JMX_PORT=10000 bin/kafka-server-start.sh config/server2.properties
Start the third broker in its own terminal session
$ env JMX_PORT=10001 bin/kafka-server-start.sh config/server3.properties

Here is a summary of the configured network interfaces and ports that the brokers will listen on:

        Broker 1     Broker 2      Broker 3
----------------------------------------------
Kafka   *:9092/tcp   *:9093/tcp    *:9094/tcp
JMX     *:9999/tcp   *:10000/tcp   *:10001/tcp

Excursus: Topics, partitions and replication in Kafka

In a nutshell Kafka partitions incoming messages for a topic, and assigns those partitions to the available Kafka brokers. The number of partitions is configurable and can be set per-topic and per-broker.

First the stream [of messages] is partitioned on the brokers into a set of distinct partitions. The semantic meaning of these partitions is left up to the producer and the producer specifies which partition a message belongs to. Within a partition messages are stored in the order in which they arrive at the broker, and will be given out to consumers in that same order.

A new feature of Kafka 0.8 is that those partitions will now be replicated across Kafka brokers to make the cluster more resilient against host failures:

Partitions are now replicated. Previously the topic would remain available in the case of server failure, but individual partitions within that topic could disappear when the server hosting them stopped. If a broker failed permanently any unconsumed data it hosted would be lost. Starting with 0.8 all partitions have a replication factor and we get the prior behavior as the special case where replication factor = 1. Replicas have a notion of committed messages and guarantee that committed messages won’t be lost as long as at least one replica survives. Replicas are byte-for-byte identical across replicas.

Producer and consumer are replication aware. When running in sync mode, by default, the producer send() request blocks until the message sent is committed to the active replicas. As a result the sender can depend on the guarantee that a message sent will not be lost. Latency-sensitive producers have the option to tune this to block only on the write to the leader broker or to run completely async if they are willing to forsake this guarantee. The consumer will only see messages that have been committed.

The following diagram illustrates the relationship between topics, partitions and replicas.

The relationship between topics, partitions and replicas in Kafka.

Logically this relationship is very similar to how Hadoop manages blocks and replication in HDFS.

When a topic is created in Kafka 0.8, Kafka determines how each replica of a partition is mapped to a broker. In general Kafka tries to spread the replicas across all brokers (source). Messages are first sent to the first replica of a partition (i.e. to the current “leader” broker of that partition) before they are replicated to the remaining brokers. Message producers may choose from different strategies for sending messages (e.g. synchronous mode, asynchronous mode). Producers discover the available brokers in a cluster and the number of partitions on each, by registering watchers in ZooKeeper.

If you wonder how to configure the number of partitions per topic/broker, here’s feedback from LinkedIn developers:

At LinkedIn, some of the high volume topics are configured with more than 1 partition per broker. Having more partitions increases I/O parallelism for writes and also increases the degree of parallelism for consumers (since partition is the unit for distributing data to consumers). On the other hand, more partitions adds some overhead: (a) there will be more files and thus more open file handlers; (b) there are more offsets to be checkpointed by consumers which can increase the load of ZooKeeper. So, you want to balance these tradeoffs.

Create a Kafka topic

In Kafka 0.8, there are 2 ways of creating a new topic:

  1. Turn on the auto.create.topics.enable option on the broker. When the broker receives the first message for a new topic, it creates that topic with num.partitions and default.replication.factor.
  2. Use the admin command bin/kafka-topics.sh.

We will use the latter approach. The following command creates a new topic “zerg.hydra”. The topic is configured to use 3 partitions and a replication factor of 2. Note that in a production setting we’d rather set the replication factor to 3, but a value of 2 is better for illustrative purposes (i.e. we intentionally use different values for the number of partitions and the replication factor to better see the effects of each setting).

Create the “zerg.hydra” topic
$ bin/kafka-topics.sh --zookeeper localhost:2181 \
    --create --topic zerg.hydra --partitions 3 --replication-factor 2

This has the following effects:

  • Kafka will create 3 logical partitions for the topic.
  • Kafka will create a total of two replicas (copies) per partition. For each partition it will pick two brokers that will host those replicas. For each partition Kafka will elect a “leader” broker.

Ask Kafka for a list of available topics. The list should include the new zerg.hydra topic:

List the available topics in the Kafka cluster
$ bin/kafka-topics.sh --zookeeper localhost:2181 --list
<snipp>
zerg.hydra
</snipp>

You can also inspect the configuration of the topic as well as the currently assigned brokers per partition and replica. Because a broker can only host a single replica per partition, Kafka has opted to use a broker’s ID also as the corresponding replica’s ID.

Describe the zerg.hydra topic
$ bin/kafka-topics.sh --zookeeper localhost:2181 --describe --topic zerg.hydra
<snipp>
zerg.hydra
    configs:
    partitions: 3
        partition 0
        leader: 1 (192.168.0.153:9092)
        replicas: 1 (192.168.0.153:9092), 2 (192.168.0.153:9093)
        isr: 1 (192.168.0.153:9092), 2 (192.168.0.153:9093)
        partition 1
        leader: 2 (192.168.0.153:9093)
        replicas: 2 (192.168.0.153:9093), 3 (192.168.0.153:9094)
        isr: 2 (192.168.0.153:9093), 3 (192.168.0.153:9094)
        partition 2
        leader: 3 (192.168.0.153:9094)
        replicas: 3 (192.168.0.153:9094), 1 (192.168.0.153:9092)
        isr: 3 (192.168.0.153:9094), 1 (192.168.0.153:9092)
<snipp>

In this example output the first broker (with broker.id = 1) happens to be the designated leader for partition 0 at the moment. Similarly, the second and third brokers are the leaders for partitions 1 and 2, respectively.

The following diagram illustrates the setup (and also includes the producer and consumer that we will run shortly).

Overview of our Kafka setup including the current state of the partitions and replicas. The colored boxes represent replicas of partitions. “P0 R1” denotes the replica with ID 1 for partition 0. A bold box frame means that the corresponding broker is the leader for the given partition.

You can also inspect the local filesystem to see how the --describe output above matches actual files. By default Kafka persists topics as “log files” (Kafka terminology) in the log.dir directory.

Local files that back up the partitions of Kafka topics
$ tree /tmp/kafka-logs-{1,2,3}
/tmp/kafka-logs-1                   # first broker (broker.id = 1)
├── zerg.hydra-0                    # replica of partition 0 of topic "zerg.hydra" (this broker is leader)
│   ├── 00000000000000000000.index
│   └── 00000000000000000000.log
├── zerg.hydra-2                    # replica of partition 2 of topic "zerg.hydra"
│   ├── 00000000000000000000.index
│   └── 00000000000000000000.log
└── replication-offset-checkpoint

/tmp/kafka-logs-2                   # second broker (broker.id = 2)
├── zerg.hydra-0                    # replica of partition 0 of topic "zerg.hydra"
│   ├── 00000000000000000000.index
│   └── 00000000000000000000.log
├── zerg.hydra-1                    # replica of partition 1 of topic "zerg.hydra" (this broker is leader)
│   ├── 00000000000000000000.index
│   └── 00000000000000000000.log
└── replication-offset-checkpoint

/tmp/kafka-logs-3                   # third broker (broker.id = 3)
├── zerg.hydra-1                    # replica of partition 1 of topic "zerg.hydra"
│   ├── 00000000000000000000.index
│   └── 00000000000000000000.log
├── zerg.hydra-2                    # replica of partition 2 of topic "zerg.hydra" (this broker is leader)
│   ├── 00000000000000000000.index
│   └── 00000000000000000000.log
└── replication-offset-checkpoint

6 directories, 15 files

Caveat: Deleting a topic via bin/kafka-topics.sh --delete will apparently not delete the corresponding local files for that topic. I am not sure whether this behavior is expected or not.
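For reference, the delete command mentioned above would look like this (note that topic deletion was still immature in Kafka 0.8, so treat this as a sketch rather than a guaranteed clean removal):

$ bin/kafka-topics.sh --zookeeper localhost:2181 --delete --topic zerg.hydra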

Start a producer

Start a console producer in sync mode:

Start a console producer in sync mode
$ bin/kafka-console-producer.sh --broker-list localhost:9092,localhost:9093,localhost:9094 --sync \
    --topic zerg.hydra

Example producer output:

[...] INFO Verifying properties (kafka.utils.VerifiableProperties)
[...] INFO Property broker.list is overridden to localhost:9092,localhost:9093,localhost:9094 (...)
[...] INFO Property compression.codec is overridden to 0 (kafka.utils.VerifiableProperties)
[...] INFO Property key.serializer.class is overridden to kafka.serializer.StringEncoder (...)
[...] INFO Property producer.type is overridden to sync (kafka.utils.VerifiableProperties)
[...] INFO Property queue.buffering.max.messages is overridden to 10000 (...)
[...] INFO Property queue.buffering.max.ms is overridden to 1000 (kafka.utils.VerifiableProperties)
[...] INFO Property queue.enqueue.timeout.ms is overridden to 0 (kafka.utils.VerifiableProperties)
[...] INFO Property request.required.acks is overridden to 0 (kafka.utils.VerifiableProperties)
[...] INFO Property request.timeout.ms is overridden to 1500 (kafka.utils.VerifiableProperties)
[...] INFO Property send.buffer.bytes is overridden to 102400 (kafka.utils.VerifiableProperties)
[...] INFO Property serializer.class is overridden to kafka.serializer.StringEncoder (...)

You can now enter new messages, one per line. Here we enter two messages “Hello, world!” and “Rock: Nerf Paper. Scissors is fine.”:

Hello, world!
Rock: Nerf Paper. Scissors is fine.

After the messages are produced, you should see the data being replicated to the three log directories for each of the broker instances, i.e. /tmp/kafka-logs-{1,2,3}/zerg.hydra-*/.
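A quick way to confirm this is to check that the .log segment files of the replicas are no longer empty (a simple sanity check, not a byte-for-byte comparison):

$ ls -l /tmp/kafka-logs-{1,2,3}/zerg.hydra-*/*.log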

Start a consumer

Start a console consumer that reads messages in zerg.hydra from the beginning (in a production setting you would usually NOT want to add the --from-beginning option):

Start a console consumer
$ bin/kafka-console-consumer.sh --zookeeper localhost:2181 --topic zerg.hydra --from-beginning

The consumer will see a new message whenever you enter a message in the producer above.

Example consumer output:

<snipp>
[...] INFO [console-consumer-28434_panama.local-1363174829799-954ed29e], Connecting to zookeeper instance at localhost:2181 ...
[...] INFO Starting ZkClient event thread. (org.I0Itec.zkclient.ZkEventThread)
[...] INFO Client environment:zookeeper.version=3.3.3-1203054, built on 11/17/2011 05:47 GMT ...
[...] INFO Client environment:host.name=192.168.0.153 (org.apache.zookeeper.ZooKeeper)
<snipp>
[...] INFO Fetching metadata with correlation id 0 for 1 topic(s) Set(zerg.hydra) (kafka.client.ClientUtils$)
[...] INFO Connected to 192.168.0.153:9092 for producing (kafka.producer.SyncProducer)
[...] INFO Disconnecting from 192.168.0.153:9092 (kafka.producer.SyncProducer)
[...] INFO [ConsumerFetcherThread-console-consumer-28434_panama.local-1363174829799-954ed29e-0-3], Starting ...
[...] INFO [ConsumerFetcherManager-1363174829916] adding fetcher on topic zerg.hydra, partion 2, initOffset -1 to broker 3 with fetcherId 0 ...
[...] INFO [ConsumerFetcherThread-console-consumer-28434_panama.local-1363174829799-954ed29e-0-2], Starting ...
[...] INFO [ConsumerFetcherManager-1363174829916] adding fetcher on topic zerg.hydra, partion 1, initOffset -1 to broker 2 with fetcherId 0 ...
[...] INFO [ConsumerFetcherThread-console-consumer-28434_panama.local-1363174829799-954ed29e-0-1], Starting ...
[...] INFO [ConsumerFetcherManager-1363174829916] adding fetcher on topic zerg.hydra, partion 0, initOffset -1 to broker 1 with fetcherId 0 ...

And at the end of the output you will see the following messages:

Hello, world!
Rock: Nerf Paper. Scissors is fine.

That’s it!

A note when using Kafka with Storm

The maximum parallelism you can have on a KafkaSpout is the number of partitions of the corresponding Kafka topic. The following question-answer thread (Michael G. Noll slightly modified the original text for clarification purposes) is from the Storm user mailing list, but supposedly refers to Kafka pre-0.8, i.e. before the replication feature was added:

Question: Suppose the number of Kafka partitions per broker is configured as 1 and the number of hosts is 2. If we set the spout parallelism as 10, then how does Storm handle the difference between the number of Kafka partitions and the number of spout tasks? Since there are only 2 partitions, does every other spout task (greater than first 2) not read the data or do they read the same data?

Answer (by Nathan Marz): The remaining 8 (= 10 – 2) spout tasks wouldn’t read any data from the Kafka topic.

My current understanding is that the number of partitions (i.e. regardless of replicas) is still the limiting factor for the parallelism of a KafkaSpout. Why? Because Kafka does not allow consumers to read from replicas other than the (replica of the) leader of a partition, in order to simplify concurrent access to data in Kafka.

A note when using Kafka with Hadoop

LinkedIn has published their Kafka->HDFS pipeline named Camus. It is a MapReduce job that does distributed data loads out of Kafka.

Where to go from here

The Kafka project documentation provides plenty of information that goes way beyond what I covered in this article.

Awesome MediaWiki theme

For anyone who saw the recent launch of the new oVirt website a while back and was wondering how they could make such an attractive theme and layout for a MediaWiki wiki, wonder no more. In fact, you don’t even have to be jealous! The theme, called Strapping (so called because it’s based on the Bootstrap web framework), has just been published by Garrett on GitHub.

Kudos to Garrett, who did amazing work on this theme to make it as beautiful and reusable as possible. I’m looking forward to using it for other websites in the near future. And so can you!

LinkedIn has just announced the release of Camus

Kafka is a high-throughput, persistent, distributed messaging system that was originally developed at LinkedIn. It forms the backbone of Wikimedia’s new data analytics pipeline.

Kafka is both performant and durable. To make it easier to achieve high throughput on a single node it also does away with lots of stuff message brokers ordinarily provide (making it a simpler distributed messaging system).

LinkedIn has just announced the release of Camus: their Kafka to HDFS pipeline.

 

Connecting to HBase from Erlang using Thrift

The key was to piece together steps from the following two pages:

The Thrift API and the Hbase.thrift file can be found here:
http://wiki.apache.org/hadoop/Hbase/ThriftApi

Download the latest thrift*.tar.gz from http://thrift.apache.org/download/

sudo apt-get install libboost-dev
tar -zxvf thrift*.tar.gz
cd thrift*
./configure
make
cd compiler/cpp
./thrift -gen erl Hbase.thrift

Take all the files in the gen-erl directory and copy them to your application's /src.
Copy the Thrift Erlang client files from thrift*/lib/erl to your application, or copy/symlink them to $ERL_LIB.

You can connect using either approach:

{ok, TFactory} = thrift_socket_transport:new_transport_factory("localhost", 9090, []).
{ok, PFactory} = thrift_binary_protocol:new_protocol_factory(TFactory, []).
{ok, Protocol} = PFactory().
{ok, C0} = thrift_client:new(Protocol, hbase_thrift).

Or by using the utility module (I still need to investigate the difference):

{ok, C0} = thrift_client_util:new("localhost", 9090, hbase_thrift, []).

Basic CRUD commands

% Load records into the shell
rr(hbase_types).
% Get a list of tables
{C1, Tables} = thrift_client:call(C0, getTableNames, []).
% Create a table
{C2, _Result} = thrift_client:call(C1, createTable, ["test", [#columnDescriptor{name="test_col:"}]]).
% Insert a column value
% TODO: Investigate the attributes dictionary's purpose
{C3, _Result} = thrift_client:call(C2, mutateRow, ["test", "key1", [#mutation{isDelete=false,column="test_col:", value="wooo"}], dict:new()]).
% Delete
{C4, _Result} = thrift_client:call(C3, mutateRow, ["test", "key1", [#mutation{isDelete=true}], dict:new()]).
% Get data
% TODO: Investigate the attributes dictionary's purpose
thrift_client:call(C4, getRow, ["test", "key1", dict:new()]).
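To cross-check what the Erlang client wrote, you can inspect the table from the HBase shell (a quick sanity check; assumes the HBase shell is available on the same machine):

echo "scan 'test'" | hbase shell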

 

Are You a Force Multiplier?

On most days, my To Do List seems longer than the Nile River.  It contains everything from the quotidian (remember the milk!) to the critical — tasks that trigger serious consequences. On days when it seems like I add two tasks for every one I complete, it can be tempting to focus on the noisiest ones.  What are noisy tasks?  The tasks with the most pressing deadline or the most vocal sponsor. And so it goes, racing from one due date to another, with barely enough time for a breath much less a moment to consider the true results of what I am doing.

Writers on productivity, time management and strategy have told us for a long time that we should focus on the IMPORTANT not the URGENT. That’s excellent advice.  However, I’ve recently started thinking about another lens through which to view and prioritize tasks:  Will the completion of the task (or project) act as a force multiplier?

To understand this better, let’s spend a moment on force multiplication.  The military calls a factor a “force multiplier” when that factor enables a force to work much more effectively.  The example in Wikipedia relates to GPS:  “if a certain technology like GPS enables a force to accomplish the same results of a force five times as large but without GPS, then the multiplier is 5.”  Interestingly, while technology can be an enormous advantage, force multipliers are not limited to technology.  Some of the force multipliers listed in that Wikipedia article have nothing at all to do with technology.

Now come back to that growing To Do List and take another look at those tasks.  How many of them are basically chores — things that simply need to get done in order to get people off your back or to move things forward (perhaps towards an unclear goal)? How many of them are (or are part of) force multipliers — things that will allow you or your organization to work in a dramatically more effective fashion?  Viewed through this lens, the chores seem much less relevant, akin to rearranging the deck chairs on the Titanic, while the force multipliers are clearly much more deserving of your time and attention.

The challenge of course is that the noisy tasks grab your attention because others insist on it.  They want something when they want it because they want it.  They may not have a single strategic thought in their head, but they are demanding and persistent.  So how do you limit the encroachment of purveyors of noisy tasks?  One answer is to limit the amount of time available for chores.  To do this credibly, you’ll need to know where you and your activities fit within the strategy of your organization.  If the task does not advance strategy, don’t do it.  Or decide upfront to allow a fixed percentage of your time for chores that may be of minimal use to you, but may be important to keep the people around you happy.  Another approach is to get a better understanding of the task and its context.  If your job is to copy documents, one page looks much like another.  However, it matters if the document you are copying contains the cafeteria menu or the firm’s emergency response guidelines. Finally, you need to educate the folks around you.  With your subordinates, do your decision making aloud — explaining how you determine if a particular task or project is a force multiplier. With your superiors, ask them to help you understand better the force multiplication attributes they see in the tasks they assign.  (This will either provide you with more useful contextual information or smoke out a chore that is masquerading as an important task.) Finally, with the others, engage them in conversation. When you cannot see your way clear to handle their chore, explain your reasoning.  They won’t always be happy about it, but they will start learning when to call on you and when to dump their requests on someone else.

Of course, the concept of force multiplication goes far beyond your To Do List.  Do your projects have a force multiplying effect on your department?  Does your department have a force multiplying effect on your firm? These are important questions for everyone, but especially for people engaged in the sometime amorphous field of knowledge management. Sure, most of what we do helps.  But do we make a dramatic difference?  If not, why not?

[Photo Credit: Leo Reynolds]

Written By: V Mary Abraham

Anonymous members speak out about WikiLeaks’ fundraising tactics

In the past, Anonymous has been among the most supportive of WikiLeaks and the mission behind it — which is still halfway true, but since everything seems to have funneled off into the ‘one man Julian Assange show,’ the majority of the hacktivist group no longer embraces the site.

AnonymousIRC released a statement on Pastebin yesterday, shortly after announcing their withdrawal of support for WikiLeaks via Twitter:

The end of an era. We unfollowed @wikileaks and withdraw our support. It was an awesome idea, ruined by Egos. Good Bye.

WikiLeaks is funded entirely through donations — which is fine, according to Anonymous, but the problem is that it began demanding that users donate money in order to access any content at all.

Since yesterday visitors of the Wikileaks site are presented a red overlay banner that asks them to donate money. This banner cannot be closed and unless a donation is made, the content like GIFiles and the Syria emails are not displayed.

That’s a great way for any donation-driven service to pull in a ton of donations in a short amount of time, but like Anonymous has already said, it clearly demonstrates that WikiLeaks’ primary focus has changed from releasing information and serving its users, to just another money-making scheme.

“The idea behind WikiLeaks was to provide the public with information that would otherwise be kept secret by industries and governments. Information we strongly believe the public has a right to know,” the statement said.

“But this has been pushed more and more into the background, instead we only hear about Julian Assange, like he had dinner last night with Lady Gaga. That’s great for him but not much of our interest. We are more interested in transparent governments and bringing out documents and information they want to hide from the public.”

I think I’ll have to agree with the group’s Pastebin statement — I’m all for establishing an online business or service and monetizing it to no end, but certainly not if you’re a not-for-profit organization whose mission statement is to “bring important news and information to the public.”

Any organization – especially non-profit groups – needs funding to survive, but in the case of WikiLeaks, a fee shouldn’t be charged in order to access content — not if it wants to keep its credibility and supporters, anyway.

The banner has since been taken down, and Anonymous already made it clear that it still supports the original idea, and that it is completely in opposition to any legal action being taken against Assange:

It goes without saying that we oppose any plans of extraditing Julian to the USA. He is a content provider and publisher, not a criminal.

This whole ordeal could definitely cause some turbulence for WikiLeaks – a fair amount of content is believed to have been submitted by Anonymous in the past (including the recent Stratfor email cache).

So if Anonymous is cutting off ties to the organization, that could mean less information-leaks, and thus, less content for WikiLeaks.

T-Mobile Merging With MetroPCS

Last year T-Mobile tried to merge with AT&T but the deal was blocked by the FCC. Now T-Mobile and MetroPCS have agreed to merge in a $1.5 billion deal. There doesn’t seem to be much concern that the FCC will disagree with this deal, perhaps because the two companies combined will have a user base of 42.5 million, which will still be smaller than the #3 player Sprint’s 56.4 million. Because the two companies have similar spectrum holdings, T-Mobile claims the merger will allow them to offer better coverage. They also say they will continue to offer a range of both on- and off-contract plans.

r2d2b2g: an experimental prototype Firefox OS test environment

Developers building apps for Firefox OS should be able to test them without having to deploy them to actual devices.  Myk Melez looked into the state of the art recently and found that the existing desktop test environments, like B2G Desktop, the B2G Emulators, and Firefox’s Responsive Design View, are either difficult to configure or significantly different from Firefox OS on a phone.

Firefox add-ons provide one of the simplest software installation and update experiences. And B2G Desktop is a lot like a phone. So, Myk Melez decided to experiment with distributing B2G Desktop via an add-on. And the result is r2d2b2g, an experimental prototype test environment for Firefox OS.

How It Works

r2d2b2g bundles B2G Desktop with Firefox menu items for accessing that test environment and installing an app into it. With r2d2b2g, starting B2G Desktop is as simple as selecting Tools > B2G Desktop:

To install an app into B2G Desktop, navigate to it in Firefox, then select Tools > Install Page as App:

r2d2b2g will install the app and start B2G Desktop so you can see the app the way it’ll appear to Firefox OS users:

Try It Out!

Note that r2d2b2g is an experiment, not a product! It is neither stable nor complete, and its features may change or be removed over time. Or Mozilla might end the project after learning what they can from it. But if you’re the adventurous sort, and you’d like to provide feedback on this investigation into a potential future product direction, then they’d love to hear from you!

Install r2d2b2g via these platform-specific XPIs: Mac, Linux (32-bit), or Windows (caveat: the Windows version of B2G Desktop currently crashes on startup due to bug 794662/795484), or fork it on GitHub, and let us know what you think!

Also, try out the Wikipedia Mobile for Firefox OS application available on GitHub. You can see it in action here.

Google Glass, Augmented Reality Spells Data Headaches

Google seems determined to press forward with Google Glass technology, filing a patent for a Google Glass wristwatch. As pointed out by CNET, the timepiece includes a camera and a touch screen that, once flipped up, acts as a secondary display. In the patent, Google refers to the device as a ‘smart-watch.’ Whether or not a Google Glass wristwatch ever appears on the marketplace — just because a tech titan patents a particular invention doesn’t mean it’s bound for store shelves anytime soon — the appearance of augmented-reality accessories brings up a handful of interesting issues for everyone from app developers to those tasked with handling massive amounts of corporate data. For app developers, augmented-reality devices raise the prospect of broader ecosystems and spiraling complexity. It’s one thing to build an app for smartphones and tablets — but what if that app also needs to handle streams of data ported from a pair of tricked-out sunglasses or a wristwatch, or send information in a concise and timely way to a tiny screen an inch in front of someone’s left eye?

How to install Python 2.7 and 3.3 on CentOS 6

CentOS 6.2 and 6.3 ship with Python 2.6.6. You can manually install Python 2.7 and Python 3.3, but you must be careful to leave the system version alone. Several critical utilities, for example yum, depend on Python 2.6.6, and if you replace it bad things will happen.

Below are the steps necessary to install Python 2.7.3 and Python 3.3.0 without touching the system version of Python. The procedure is exactly the same for both versions except for the filenames. People have reported that this also works for CentOS 5.8 but I haven’t tested that. Execute all the commands below as root either by logging in as root or by using sudo.

Install development tools

In order to compile Python you must first install the development tools and a few extra libs. The extra libs are not strictly needed to compile Python but without them your new Python interpreter will be quite useless.

# yum groupinstall "Development tools"
# yum install zlib-devel bzip2-devel openssl-devel ncurses-devel sqlite-devel readline-devel tk-devel

Download and install Python

It is critical that you use make altinstall below. If you use make install you will end up with two different versions of Python in the filesystem both named python. This can lead to problems that are very hard to diagnose.

Download and install Python 2.7.3

# wget http://python.org/ftp/python/2.7.3/Python-2.7.3.tar.bz2
# tar xf Python-2.7.3.tar.bz2
# cd Python-2.7.3
# ./configure --prefix=/usr/local
# make && make altinstall

Download and install Python 3.3.0

# wget http://python.org/ftp/python/3.3.0/Python-3.3.0.tar.bz2
# tar xf Python-3.3.0.tar.bz2
# cd Python-3.3.0
# ./configure --prefix=/usr/local
# make && make altinstall

After running the commands above your newly installed Python interpreter will be available as /usr/local/bin/python2.7 or /usr/local/bin/python3.3. The system version of Python 2.6.6 will continue to be available as /usr/bin/python and /usr/bin/python2.6.
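A quick check that the new interpreters live alongside, rather than replace, the system Python (version strings may differ slightly on your machine):

# python --version
Python 2.6.6
# python2.7 --version
Python 2.7.3
# python3.3 --version
Python 3.3.0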

Download and install Distribute

Distribute provides a framework for installing packages from the Python Package Index. Each Python interpreter on your system needs its own install of Distribute.

You can find out what the latest version of Distribute is here. At the time of this edit the current version is 0.6.35. Replace the version number below if there is a newer version available.

Download and install Distribute for Python 2.7

# wget http://pypi.python.org/packages/source/d/distribute/distribute-0.6.35.tar.gz
# tar xf distribute-0.6.35.tar.gz
# cd distribute-0.6.35
# python2.7 setup.py install

This generates the script /usr/local/bin/easy_install-2.7 that you use to install packages for Python 2.7. It puts your packages in /usr/local/lib/python2.7/site-packages/.
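For example, to install a package for the new 2.7 interpreter and confirm it lands in the right site-packages directory (simplejson is just an example package):

# easy_install-2.7 simplejson
# python2.7 -c "import simplejson; print simplejson.__file__"   # path should be under /usr/local/lib/python2.7/site-packages/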

Download and install Distribute for Python 3.3

# wget http://pypi.python.org/packages/source/d/distribute/distribute-0.6.35.tar.gz
# tar xf distribute-0.6.35.tar.gz
# cd distribute-0.6.35
# python3.3 setup.py install

This generates the script /usr/local/bin/easy_install-3.3 that you use to install packages for Python 3.3. It puts your packages in /usr/local/lib/python3.3/site-packages/.

What’s next?

Working with multiple versions of Python is difficult and error-prone. I strongly recommend that you install virtualenv and learn how to use it. Virtualenv is a Virtual Python Environment builder that makes it possible to run Python in a sandbox-like environment. Each sandbox can have its own Python version and packages. This is very useful when you work on multiple projects, each with its own dependencies.

Install and use virtualenv for Python 2.7

# easy_install-2.7 virtualenv
# virtualenv-2.7 --distribute someproject
New python executable in someproject/bin/python2.7
Also creating executable in someproject/bin/python
Installing distribute...................done.
Installing pip................done.
# source someproject/bin/activate
(someproject)# python --version
Python 2.7.3
(someproject)#

Install and use virtualenv for Python 3.3

# easy_install-3.3 virtualenv
# virtualenv-3.3 --distribute otherproject
New python executable in otherproject/bin/python3.3
Also creating executable in otherproject/bin/python
Installing distribute...................done.
Installing pip................done.
# source otherproject/bin/activate
(otherproject)# python --version
Python 3.3.0
(otherproject)#

Analyzing Mobile Browser Energy Consumption

Recently, technology reporter Jacob Aron wrote a blog post on newscientist.com that talks about how bloated website code drains your smartphone’s battery.

He mentions how Stanford computer scientist Narendran Thiagarajan and colleagues used an Android phone hooked up to a multimeter to measure the energy used in downloading and rendering popular websites. Using their experimental setup they measured the energy needed to render popular web sites as well as the energy needed to render individual web elements such as images, Javascript, and Cascading Style Sheets (CSS). They claim that complex Javascript and CSS can be as expensive to render as images. Moreover, dynamic Javascript requests (in the form of XMLHttpRequest) can greatly increase the cost of rendering the page, since it prevents the page contents from being cached. Finally, they show that on the Android browser, rendering JPEG images is considerably cheaper than other formats, such as GIF and PNG, for comparably sized images.

One example that is cited is that simply loading the mobile version of Wikipedia over a 3G connection consumed just over 1 per cent of the phone’s battery, while browsing apple.com, which does not have a mobile version, used 1.4 per cent.
Yet, in the summary of the paper they find that the results from this study are not meaningful except for the initial loading of just a single page resource. It would be interesting to extend these results in a meaningful way, and study the energy signature of an entire browsing session at a site such as Wikipedia, where a user typically moves from page to page. So, during that session, downloaded web elements such as Javascript, CSS and images would mostly be cached locally. Therefore, we really can’t estimate the energy cost of a total session by simply summing the energy usage of pages visited during that session. Measuring an entire typical session may help optimize the power signature of the entire site. Custom CSS that is applicable to every page of a site would easily outweigh the cost of the apparently excessive CSS download for the render of just the first page.
So, one of the ways that we are looking to improve our mobile browser energy consumption is by implementing the MediaWiki ResourceLoader in order to improve the load times for JavaScript and CSS. ResourceLoader is the delivery system in MediaWiki for the optimized loading and managing of modules. Its purpose is to improve MediaWiki’s front-end performance and user experience by making use of strong caching while still allowing near-instant deployment of new code that all clients start using within 5 minutes. Modules are built of JavaScript, CSS and interface messages; it was first released in MediaWiki 1.17.
On Wikimedia wikis, every page view includes hundreds of kilobytes of JavaScript. In many cases, some or all of this code goes unused due to browser support or because users do not make use of the features on the page. In these cases, bandwidth and loading time spent on downloading, parsing and executing JavaScript code are wasted. This is especially true when users visit MediaWiki sites using older browsers, like Internet Explorer 6, where almost all features are unsupported, and parsing and executing JavaScript is extremely slow.
ResourceLoader solves this problem by loading resources on demand and only for browsers that can run them. Although there is too much to summarize in a simple list, the major improvements for client-side performance are gained by:
  • Minifying and concatenating → reduces the code’s size and parsing/download time. JavaScript files, CSS files and interface messages are loaded in a single specially formatted “ResourceLoader Implement” server response.
  • Batch loading → reduces the number of requests made. The server response for module loading supports loading multiple modules, so a single response contains multiple ResourceLoader Implements, which in turn contain the minified and concatenated result of multiple JavaScript/CSS files.
  • Data URI embedding → further reduces the number of requests, response time and bandwidth. Optionally, images referenced in stylesheets can be embedded as data URIs. Together with the gzipping of the server response, those embedded images function as a “super sprite”.
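For illustration, a batched ResourceLoader request looks roughly like this (a sketch; the wiki host and module names are only examples):

curl -s 'https://en.wikipedia.org/w/load.php?modules=jquery|mediawiki.util&only=scripts' | head -c 300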

Patrick Reilly, Senior Software Developer, Mobile
  • Copyright notes: “Phone charging” by Eml5526.sp11.team1.adam, in the public domain, from Wikimedia Commons.

HBase and Hive Thrift PHP Client

Due to my newest project I built a PHP client to access the HBase and Hive services within a Hadoop cluster.

Those services are accessible via Thrift, a high-performance protocol for backend services.

As building a client with Thrift is not that easy, I decided to put my HBase and Hive PHP Thrift client packages online for others.

Links:
Hadoop

Hive
HBase
Thrift

How it works

Start the HBase and Hive Thrift servers via shell:
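(The exact commands depend on your Hadoop distribution; a typical sketch for a stock HBase/Hive install of that era, using the default Thrift ports, looks like this.)

hbase thrift start                      # HBase Thrift service, port 9090 by default
hive --service hiveserver -p 10000 &    # Hive Thrift service (HiveServer1) on port 10000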

Download the HBase and Hive Thrift PHP client packages and write your own client.

MySQL ROW_NUMBER()

Have you ever been in a situation where you are selecting records from a database that need to be ranked, but the column(s) you’re attempting to ORDER BY are not unique? For example, using the Orders table below, how could you display a distinct list of customers and their last order? The problem is that the date field contains only the date, but not the time. Therefore, it’s possible that two orders can be placed by the same customer on the same day. Now if we had a field defined as AUTO_INCREMENT in MySQL or IDENTITY in Microsoft SQL Server and the records were entered sequentially, this would be a simple task.

Orders Table

Example:
Customer OrderDate Amount
Jane 2011-01-05 12
Jane 2011-01-07 15
Jane 2011-01-07 17
John 2011-01-01 11
John 2011-01-02 27
John 2011-01-02 13
Pat 2011-02-05 5
Pat 2011-02-07 34
Pat 2011-02-07 12
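If you want to follow along, here is one way to create this sample table locally (a sketch; it assumes a local MySQL instance and a database named test, and mysql will prompt for the password):

mysql -u root -p test <<'SQL'
CREATE TABLE Orders (
  Customer  VARCHAR(50),
  OrderDate DATE,
  Amount    INT
);
INSERT INTO Orders (Customer, OrderDate, Amount) VALUES
  ('Jane','2011-01-05',12), ('Jane','2011-01-07',15), ('Jane','2011-01-07',17),
  ('John','2011-01-01',11), ('John','2011-01-02',27), ('John','2011-01-02',13),
  ('Pat','2011-02-05',5),   ('Pat','2011-02-07',34),  ('Pat','2011-02-07',12);
SQL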

This can be solved in MS SQL, and with a little more code in MySQL, as well. In MS SQL Server 2005+, uniquely identifying the above records is a breeze using the ROW_NUMBER() function. If you’re not familiar with ROW_NUMBER(), MSDN defines the T-SQL ROW_NUMBER() as “Returns the sequential number of a row within a partition of a result set, starting at 1 for the first row in each partition.”

So, in order to display a distinct list of customers and uniquely identify their last order, we could write something like:

SELECT  ROW_NUMBER() OVER (PARTITION BY Customer ORDER BY OrderDate DESC) AS RowNumber
       ,Customer
       ,OrderDate
       ,Amount
  FROM Orders
Results:
RowNumber Customer OrderDate Amount
1 Jane 2011-01-07 15
2 Jane 2011-01-07 17
3 Jane 2011-01-05 12
1 John 2011-01-02 27
2 John 2011-01-02 13
3 John 2011-01-01 11
1 Pat 2011-02-07 34
2 Pat 2011-02-07 12
3 Pat 2011-02-05 5

Notice how a unique row number is now apparent on each row within the partition. The next step would be to encompass the statement in a subquery or Common Table Expression (CTE), and filter out the unwanted records based on the generated row number.

SELECT  Customer
       ,OrderDate
       ,Amount
  FROM
      (
        SELECT  ROW_NUMBER() OVER (PARTITION BY Customer ORDER BY OrderDate DESC) AS RowNumber
               ,Customer
               ,OrderDate
               ,Amount
          FROM Orders
      ) subquery WHERE RowNumber = 1
Results:
Customer OrderDate Amount
Jane 2011-01-07 15
John 2011-01-02 27
Pat 2011-02-07 34

The result is a single record for each customer, even when the customer has more than one order on the same day.


MySQL Implementation

Recently, I ran into a similar situation on a WordPress implementation, but with horses rather than orders and customers. The requirement was to display each horse’s last workout and next race. MySQL does not have a ROW_NUMBER() function. However, MySQL does allow for inline assignment of variables and the ability to reassign and reference those variables as the query works its way through the execution. This allows the same functionality that ROW_NUMBER() provides to be achieved in MySQL. Sticking to the same example used above, the MySQL solution would be:

SELECT  @row_num := IF(@prev_value=o.Customer,@row_num+1,1) AS RowNumber
       ,o.Customer
       ,o.OrderDate
       ,o.Amount
       ,@prev_value := o.Customer
  FROM Orders o,
      (SELECT @row_num := 1) x,
      (SELECT @prev_value := '') y
  ORDER BY o.Customer, o.OrderDate DESC
Results:
RowNumber Customer OrderDate Amount
1 Jane 2011-01-07 15
2 Jane 2011-01-07 17
3 Jane 2011-01-05 12
1 John 2011-01-02 27
2 John 2011-01-02 13
3 John 2011-01-01 11
1 Pat 2011-02-07 34
2 Pat 2011-02-07 12
3 Pat 2011-02-05 5

A unique row number is now apparent on each row within the partition. The @row_num variable holds the current row number and the @prev_value variable holds the current value of the partition-by field. The variables are defined and assigned a default value within subqueries. The Orders table and the subqueries are then combined in a single select statement. The @row_num variable is incremented by one until @prev_value does not equal the Customer, and is then reset back to one.

Important

  • The @row_num variable must be set before the @prev_value variable
  • The first field in the ORDER BY must be the field that you are partitioning by
  • The default value assigned to the @prev_value variable must not exist in the partition by field

As we did with the MS SQL ROW_NUMBER() example, we will need to encompass the statement in a subquery in order to filter based on the generated row number.

SELECT  Customer
       ,OrderDate
       ,Amount
  FROM
     (
      SELECT  @row_num := IF(@prev_value=o.Customer,@row_num+1,1) AS RowNumber
             ,o.Customer 
             ,o.OrderDate
             ,o.Amount
             ,@prev_value := o.Customer
        FROM Orders o,
             (SELECT @row_num := 1) x,
             (SELECT @prev_value := '') y
       ORDER BY o.Customer, o.OrderDate DESC
     ) subquery
 WHERE RowNumber = 1
Results:
Customer OrderDate Amount
Jane 2011-01-07 15
John 2011-01-02 27
Pat 2011-02-07 34

The result is a single record for each customer, even when the customer had more than one order on the same day.

The best solution is to avoid being in this situation in the first place. However, an existing data schema does not always lend itself to new requirements.

Calling mobile testers for round two

Thanks to everyone for participating in our first round of mobile gateway testing.

This time around we’d like you to have our new mobile gateway for your default experience.

Follow this link on your mobile phone to opt in: http://tinyurl.com/woptin and send us feedback.

Visually the gateway should look pretty much the same minus a beta logo. All the other changes are under the hood. If you can’t tell the difference between this and the old gateway then we’ve done our job.

Please let us know of any issues on our feedback page and if you don’t want to be in the beta then follow this link to opt out: http://tinyurl.com/woptout .

For those coming back, here are some of the issues you reported that we fixed:

  • Missing templates – We’re now using live content so this shouldn’t be an issue.
  • Mismatched Japanese & English templates – see above
  • Missing devices – We’ve added a lot more devices through WURFL. If yours is still having issues let us know.
  • Don’t redirect tablets – Fixed!
  • Remove the donate banners – Fixed!
  • … and numerous others

You can learn more about our mobile projects and future work by visiting our Mobile Projects page. If you are a developer and would like to get involved, check out the page detailing our work. And if you just want to say hello or give us some super quick feedback, then join us on IRC in #wikimedia-mobile on freenode.

Thanks for making Wikipedia Mobile better for everyone.

2004 Porsche 911 GT3 — For Sale (EXPORT ONLY)

Got Questions? Call (818) 767-7243
– 3,600 cc 3.6 liters horizontal 6 rear engine with 100 mm bore, 76.4 mm stroke, 11.7 compression ratio, double overhead cam, variable valve timing/camshaft and four valves per cylinder
– Premium unleaded fuel 91
– Fuel economy EPA highway (mpg): 23 and EPA city (mpg): 15
– Multi-point injection fuel system
– 23.5 gallon main premium unleaded fuel tank
– Power: 283 kW , 380 HP EEC @ 7,400 rpm; 284 ft lb , 385 Nm @ 5,000 rpm

Read more: http://www.motortrend.com/cars/2004/porsche/911/gt3_coupe/830/specifications/index.html#ixzz0anZxtz5R