Thursday, 10 April 2008

Getting started with distributed Erlang - Mnesia table relocation

Mnesia is a distributed database that forms part of the Erlang release. One of the features that I think is potentially powerful, is transparent table relocation across machines. With Mnesia, you can replicate tables to any nodes you wish in your network, and Mnesiatakes care of all the back end bits for you. With "transparent", I mean that you don't need to do anything in your clients to make them "aware" of the new tables. Reads that were taking place from a table on one machine, will now be distributed across multiple nodes (where the nodes reside on single or multiple machines).

I wanted to see how difficult it is to achieve this. For the setup, I installed two virtual Ubuntu 7.10 machines using VMware Player. You can get images for most Ubuntu distros at http://isv-image.ubuntu.com/vmware/. FYI, the username and password for these images is ubuntu:ubuntu. I named the two nodes

node1.21ccw.blogspot.com and
node2.21ccw.blogspot.com

You'll need to edit the network configurations with the IP addresses if you want to reproduce this experiment. If you need some help, post a question as comment :)

I now had two machines that could ping each other using the full names, and a warm and fuzzy feeling inside:


The next step was to start up an Erlang node on each machine. There's a catch here though. I got some problems using erl -sname, probably because of the way I set up the hostnames of the machines. So, I had to specify the fully qualified names manually:


ubuntu@node1:~/node1$ erl -name 'node1@node1.21ccw.blogspot.com'
Erlang (BEAM) emulator version 5.5.5 [source] [async-threads:0] [kernel-poll:false]

Eshell V5.5.5 (abort with ^G)


ubuntu@node2:~/node2$ erl -name 'node2@node2.21ccw.blogspot.com'
Erlang (BEAM) emulator version 5.5.5 [source] [async-threads:0] [kernel-poll:false]

Eshell V5.5.5 (abort with ^G)
(node1@node1.21ccw.blogspot.com)1> nodes().
[]


Notice the output of the nodes() command. This will return a list of other Erlang nodes that this node is aware of. Initially there's no awareness. To let a node know of another node, you can use net_adm:ping/1 to ping the other node. Both nodes will then become be aware of each other:


(node1@node1.21ccw.blogspot.com)4> net_adm:ping('node2@node2.21ccw.blogspot.com').
pong

(node1@node1.21ccw.blogspot.com)5> nodes().
['node1@node1.21ccw.blogspot.com']


(node2@node2.21ccw.blogspot.com)1> nodes().
['node1@node1.21ccw.blogspot.com']


Cool. Now the nodes know of each other. To get Mnesia started, you have to create a schema on each node. A schema is located on the file system, in the same location where the actual disc-copies of tables will reside. [node()|nodes()] creates a list of the current node and all the other connected nodes. ls() shows the directory that Mnesia has created for the database.


(node1@node1.21ccw.blogspot.com)5> mnesia:create_schema([node()|nodes()]).
ok

(node1@node1.21ccw.blogspot.com)6> ls().
Mnesia.node1@node1.21ccw.blogspot.com
ok



(node2@node2.21ccw.blogspot.com)2> ls().
Mnesia.node2@node2.21ccw.blogspot.com
ok


Now we have to start Mnesia on both nodes. You will notice that when we do an mnesia:info on node2 at this point, that it shows both nodes as being running database nodes.



(node1@node1.21ccw.blogspot.com)8> mnesia:start().
ok



(node2@node2.21ccw.blogspot.com)3> mnesia:start().
ok

(node2@node2.21ccw.blogspot.com)4> mnesia:info().
...
running db nodes = ['node1@node1.21ccw.blogspot.com','node2@node2.21ccw.blogspot.com']
...

Next we'll create an actual database table, and populate it with some data. We define a record using rd(), then create a table on node1 (by default, this table will reside in RAM and have a disc copy), write a record to it and then read the record again. The primary key of the table is the first field of the record, i.e. the name.


(node1@node1.21ccw.blogspot.com)9> rd(person, {name, email_address}).
person

(node1@node1.21ccw.blogspot.com)10> mnesia:create_table(person, [{attributes, record_info(fields, person)}, {disc_copies, [node()]}]).
{atomic,ok}

(node1@node1.21ccw.blogspot.com)11> mnesia:transaction(fun() -> mnesia:write(#person{name = "John", email_address = "john@21ccw.blogspot.com"}) end).
{atomic,ok}

(node1@node1.21ccw.blogspot.com)14> mnesia:transaction(fun() -> mnesia:read({person, "John"}) end).
{atomic,[#person{name = "John",email_address = "john@21ccw.blogspot.com"}]}

(node1@node1.21ccw.blogspot.com)15> mnesia:info(). ...
...
[{'node1@node1.21ccw.blogspot.com',disc_copies}] = [person]
...


What happens when we do the same read on node2? Remember that node has access to the person table only via the network, since it resides in RAM and on disc on node1.

node2@node2.21ccw.blogspot.com)5> mnesia:transaction(fun() -> mnesia:read({person, "John"}) end).
{atomic,[{person,"John","john@21ccw.blogspot.com"}]}
Nice. Mnesia has transparently read the record from a table that's on another machine :)

Now we decide to copy the table to node2. This requires a single command. Mnesia does the copying of the actual data for you to the other machine, and when you look at the file system on node2, there will now be "person.DCD" file, which is the disc copy of the table.


(node1@node1.21ccw.blogspot.com)15> mnesia:add_table_copy(person, 'node2@node2.21ccw.blogspot.com', disc_copies).
{atomic,ok}



(node2@node2.21ccw.blogspot.com)9> ls("Mnesia.node2@node2.21ccw.blogspot.com").
DECISION_TAB.LOG LATEST.LOG person.DCD
schema.DAT
ok


At this point, when you do a query on the person table, the actual data can come from either node. I'm not sure how Mnesia decides how to distribute the data, that's something to investigate further.

Since the table is resident on both nodes, we can actually delete it from node1, and doing a read on node1 will now read the table over the network from node2:


(node1@node1.21ccw.blogspot.com)23> mnesia:del_table_copy(person, node()).
{atomic,ok}

(node1@node1.21ccw.blogspot.com)19> mnesia:info().
...
[{'node2@node2.21ccw.blogspot.com',disc_copies}] = [person]
...

(node1@node1.21ccw.blogspot.com)18> mnesia:transaction(fun() -> mnesia:read({person, "John"}) end).
{atomic,[#person{name = "John",email_address = "john@21ccw.blogspot.com"}]}

Cool.

What I've show is how to start up an Erlang/Mnesia node on two machines that are networked together, create tables on either node, and move the tables to other nodes by copying and then deleting them. Mnesia has the ability to configure tables to be RAM only, RAM and disc and disc only, which gives you lots of power for optimisation. Couple this with the fact that you can change your configuration dynamically and you have powerful, dynamically configurable distributed database!

12 comments:

Harish Mallipeddi said...

Can mnesia only handle replication across a bunch of nodes or can it actually be used to build a distributed database - data is split across multiple mnesia instances?

Benjamin Nortier said...

I'm not sure what you mean with "multiple mnesia instances"...? The data is split across multiple machines already, and each node has it's own "instance" which is started with mnesia:start().

So for me, each node is already an "instance". Perhaps you could elaborate on how you define the separation between instances in your understanding?

Harish Mallipeddi said...

My bad. I just read the last part and assumed that you just replicated the table created on the first node to the second node. And hence I was wondering how one would split data across multiple nodes (each running 1 mnesia instance).

But I think you answered my qn right above, where you demonstrated that you could retrieve a row from a table on a different node transparently over the network.

Kode said...
This comment has been removed by the author.
Bosky said...
This comment has been removed by the author.
Bosky said...

hi,
could you comment more on the hosts file on either machine.
1)did you set it in /etc/hosts or .hosts.erlang or neither

2)Also, where else have you specified hosts before calling "erl -name node1@..."

3) did you have to shutdown / keep-alive the firewall/epnb

Keep Clicking,
~B

Benjamin Nortier said...

1) I had to set the IPs in /etc/hosts, since the VMs don't have proper host names.

I didn't specify the hosts in .hosts.erlang, since I explicitly pinged the other node (which makes them aware of each other)

2) Nowhere else

3) I didn't have any problems with firewalls, I'm not even sure the VMs had a firewall running.

Cheers
Ben

Jimmy said...

Interesting. I read somewhere that mnesia has an upper limit of 2 GB disk storage. I guess that is 2 GB per node then?

Also, I guess at some point adding nodes to the system will actually decrease performance. Any ideas what the max nr of nodes would be?

Benjamin Nortier said...

I believe it should be 2GB per node, but don't take my word for it :)

The performance degradation will definitely be application-specific, so it's impossible to say at which N that would be. But yes, you're right, there will be some point where the synchronisation overhead starts to trump the benefits of doing distributed reads. I think you'll have to start partitioning your data at that point.

P.S. Since writing this I've come to really like CouchDB (implemented in Erlang), you should have a look at it if you're interested in databases...

Jimmy said...

Hi Benjamin, I looked into couchdb, but I can't escape the highly relational structure of the data I have to deal with. Also, couchdb performs poorly when you have the requirement to support ad-hoc queries.

Perhaps a marriage between couchdb and mnesia is possible. I.e., use mnesia to store relational structures (in essence only integers would be stored) and store the rest (strings, text fields, blobs) in couchdb. Ad hoc queries would query mnesia for the relational stuff and couchdb via an inverted index doing text searches only (the latter should perform well for ad-hoc text queries).

Sephiroth said...

Benjamin,

Thanks a lot for your post. It has been a life saver xD

Cheers!,
Sephiroth

Magnetic Crack Detector said...

Great job you people are doing with this website. NDT Machine