Solr Cloud 4.5

SolrCloud



Solr Cloud is designed to provide the highly available, fault tolerant environment that can index data for searching. In this environment data can be organized into multiple pieses, shards or can reside in several machines, with replicas it provides the redundancy for both scalability and fault tolerance. Integration with ZooKeeper it helps to manage the overall environment so that both indexing and search requests can be routed properly.

Following section you will get an understanding on
  • How to distribute data over multiple instances by using ZooKeeper and creating shards.
  • How to create redundancy for shards by using replicas.
  • How to create redundancy for the overall cluster by running multiple ZooKeeper instances
This section explains SolrCloud and its inner workings in detail, but before you dive in, it's best to have an idea of what it is you're trying to accomplish. This page provides a simple tutorial that explains how SolrCloud works on a practical level, and how to take advantage of its capabilities. We'll use simple examples of configuring SolrCloud on a single machine.

Simple Two-Shard Cluster on the Same Machine

Creating a cluster with multiple shards involves two steps:
  1. Start the "overseer" node, which includes an embedded ZooKeeper server to keep track of your cluster.
  2. Start any remaining shard nodes and point them to the running ZooKeeper.
Make sure to run Solr from the example directory in non-SolrCloud mode at least once 
before beginning; this process unpacks the jar files necessary to run SolrCloud. 
However, do not load documents yet, just start it once and shut it down.
In this example, you'll create two separate Solr instances on the same machine. This is not a production-ready installation, but just a quick exercise to get you familiar with SolrCloud.
For this exercise, we'll start by creating two copies of the example directory that is part of the Solr distribution:
cd <SOLR_DIST_HOME>
cp -r example node1
cp -r example node2
These copies of the example directory can really be called anything. All we're trying to do is copy Solr's example app to the side so we can play with it and still have a stand-alone Solr example to work with later if we want.
Next, start the first Solr instance, including the -DzkRun parameter, which also starts a local ZooKeeper instance:
cd node1
java -DzkRun -DnumShards=2 -Dbootstrap_confdir=./solr/collection1/conf -Dcollection.configName=myconf -jar start.jar
Let's look at each of these parameters:
-DzkRun Starts up a ZooKeeper server embedded within Solr. This server will manage the cluster configuration. Note that we're doing this example all on one machine; when you start working with a production system, you'll likely use multiple ZooKeepers in an ensemble (or at least a stand-alone ZooKeeper instance). In that case, you'll replace this parameter with zkHost=<ZooKeeper Host:Port>, which is the hostname:port of the stand-alone ZooKeeper.
-DnumShards Determines how many pieces you're going to break your index into. In this case we're going to break the index into two pieces, or shards, so we're setting this value to 2. Note that once you start up a cluster, you cannot change this value. So if you expect to need more shards later on, build them into your configuration now (you can do this by starting all of your shards on the same server, then migrating them to different servers later).
-Dbootstrap_confdir ZooKeeper needs to get a copy of the cluster configuration, so this parameter tells it where to find that information.
-Dcollection.configName This parameter determines the name under which that configuration information is stored by ZooKeeper. We've used "myconf" as an example, it can be anything you'd like.
The -DnumShards-Dbootstrap_confdir, and -Dcollection.configName parameters 
need only be specified once, the first time you start Solr in SolrCloud mode. They load 
your configurations into ZooKeeper; if you run them again at a later time, they will re-load
 your configurations and may wipe out changes you have made.
At this point you have one sever running, but it represents only half the shards, so you will need to start the second one before you have a fully functional cluster. To do that, start the second instance in another window as follows:
cd node2
java -Djetty.port=7574 -DzkHost=localhost:9983 -jar start.jar
Because this node isn't the overseer, the parameters are a bit less complex:
-Djetty.port The only reason we even have to set this parameter is because we're running both servers on the same machine, so they can't both use Jetty's default port. In this case we're choosing an arbitrary number that's different from the default. When you start on different machines, you can use the same Jetty ports if you'd like.
-DzkHost This parameter tells Solr where to find the ZooKeeper server so that it can "report for duty". By default, the ZooKeeper server operates on the Solr port plus 1000. (Note that if you were running an external ZooKeeper server, you'd simply point to that.)
At this point you should have two Solr windows running, both being managed by ZooKeeper. To verify that, open the Solr Admin UI in your browser and go to the Cloud screen:
Use the port of the first Solr you started; this is your overseer. You can go to the
You should see both node1 and node2, as in:

Post data to newly created nodes. ( You can do this any way you like, but the easiest way is to use the exampledocs)
java -Durl=http://localhost:<port1>/solr/collection1/update  -jar post.jar mem.xml
java -Durl=http://localhost:<port2>/solr/collection1/update  -jar post.jar monitor2.xml

At this point each shard contains a subset of the data, but a search directed at either server should span both shards. For example, the following searches should both return the identical set of all results:
The reason that this works is that each shard knows about the other shards, so the search is carried out on all cores, then the results are combined and returned by the called server.
In this way you can have two cores or two hundred, with each containing a separate portion of the data.
If you want to check the number of documents on each shard, you could add
 distrib=false to each query and your search would not span all shards.
But what about providing high availability, even if one of these servers goes down? To do that, you'll need to look at replicas.

Two-Shard Cluster with Replicas

In order to provide high availability, you can create replicas, or copies of each shard that run in parallel with the main core for that shard. The architecture consists of the original shards, which are called the leaders, and their replicas, which contain the same data but let the leader handle all of the administrative tasks such as making sure data goes to all of the places it should go. This way, if one copy of the shard goes down, the data is still available and the cluster can continue to function.
Start by creating two more fresh copies of the example directory:
cd <SOLR_DIST_HOME>
cp -r example node3
cp -r example node4
Just as when we created the first two shards, you can name these copied directories whatever you want.
If you don't already have the two instances you created in the previous section up and running, go ahead and restart them. From there, it's simply a matter of adding additional instances. Start by adding node3:
cd node3
java -Djetty.port=8900 -DzkHost=localhost:9983 -jar start.jar
Notice that the parameters are exactly the same as they were for starting the second node; you're simply pointing a new instance at the original ZooKeeper. But if you look at the SolrCloud admin page, you'll see that it was added not as a third shard, but as a replica for the first:

This is because the cluster already knew that there were only two shards and they were already accounted for, so new nodes are added as replicas. Similarly, when you add the fourth instance, it's added as a replica for the second shard:
cd node4
java -Djetty.port=7500 -DzkHost=localhost:9983 -jar start.jar

Post data to newly created nodes. ( You can do this any way you like, but the easiest way is to use the exampledocs)
java -Durl=http://localhost:<port3>/solr/collection1/update  -jar post.jar money.xml
Execute the search queries on all for nodes and check the data distribution through out the node. use distrib=false  for easily identify the data distribution among the nodes.
If you were to add additional instances, the cluster would continue this round-robin, adding replicas as necessary. Replicas are attached to leaders in the order in which they are started, unless they are assigned to a specific shard with an additional parameter of shardId (as a system property, as in -DshardId=1, the value of which is the ID number of the shard the new node should be attached to). Upon restarts, the node will still be attached to the same leader even if the shardId is not defined again (it will always be attached to that machine).
So where are we now? You now have four servers to handle your data. If you were to send data to a replica, as in:
curl http://localhost:7500/solr/update?commit=true -H "Content-Type: text/xml" -d "@money.xml"
the course of events goes like this:
  1. Replica (in this case the server on port 7500) gets the request.
  2. Replica forwards request to its leader (in this case the server on port 7574).
  3. The leader processes the request, and makes sure that all of its replicas process the request as well.
In this way, the data is available via a request to any of the running instances, as you can see by requests to:
But how does this help provide high availability? Simply put, a cluster must have at least one server running for each shard in order to function. To test this, shut down the server on port 7574, and then check the other servers:
You should continue to see the full set of data, even though one of the servers is missing. In fact, you can have multiple servers down, and as long as at least one instance for each shard is running, the cluster will continue to function. If the leader goes down – as in this example – a new leader will be "elected" from among the remaining replicas.
Note that when we talk about servers going down, in this example it's crucial that one particular server stays up, and that's the one running on port 8983. That's because it's our overseer – the instance running ZooKeeper. If that goes down, the cluster can continue to function under some circumstances, but it won't be able to adapt to any servers that come up or go down.
That kind of single point of failure is obviously unacceptable. Fortunately, there is a solution for this problem: multiple ZooKeepers.

Using Multiple ZooKeepers in an Ensemble

To simplify setup for this example we're using the internal ZooKeeper server that comes
 with Solr, but in a production environment, you will likely be using an external ZooKeeper. The concepts are the same, however. You can find instructions on setting up an external ZooKeeper server here: http://zookeeper.apache.org/doc/r3.3.4/zookeeperStarted.html
To truly provide high availability, we need to make sure that not only do we also have at least one shard server running at all times, but also that the cluster also has a ZooKeeper running to manage it. To do that, you can set up a cluster to use multiple ZooKeepers. This is called using a ZooKeeper ensemble.
A ZooKeeper ensemble can keep running as long as more than half of its servers are up and running, so at least two servers in a three ZooKeeper ensemble, 3 servers in a 5 server ensemble, and so on, must be running at any given time. These required servers are called a quorum.
In this example, you're going to set up the same two-shard cluster you were using before, but instead of a single ZooKeeper, you'll run a ZooKeeper server on three of the instances. Start by cleaning up any ZooKeeper data from the previous example:
cd <SOLR_DIST_DIR>
rm -r shard*/solr/zoo_data
Next you're going to restart the Solr servers, but this time, rather than having them all point to a single ZooKeeper instance, each will run ZooKeeper and listen to the rest of the ensemble for instructions.
You're using the same ports as before – 8983, 7574, 8900 and 7500 – so any ZooKeeper instances would run on ports 9983, 8574, 9900 and 8500. You don't actually need to run ZooKeeper on every single instance, however, so assuming you run ZooKeeper on 9983, 8574, and 9900, the ensemble would have an address of:
localhost:9983,localhost:8574,localhost:9900
This means that when you start the first instance, you'll do it like this:
cd node1
java -DzkRun -DnumShards=2 -Dbootstrap_confdir=./solr/collection1/conf \
   -Dcollection.configName=myconf -DzkHost=localhost:9983,localhost:8574,localhost:9900 \
   -jar start.jar
Note that the order of the parameters matters. Make sure to specify the -DzkHost
 parameter after the other ZooKeeper-related parameters.
You'll notice a lot of error messages scrolling past; this is because the ensemble doesn't yet have a quorum of ZooKeepers running.
Notice also, that this step takes care of uploading the cluster's configuration information to ZooKeeper, so starting the next server is more straightforward:
cd node2
java -Djetty.port=7574 -DzkRun -DnumShards=2 \
    -DzkHost=localhost:9983,localhost:8574,localhost:9900 -jar start.jar
Once you start this instance, you should see the errors begin to disappear on both instances, as the ZooKeepers begin to update each other, even though you only have two of the three ZooKeepers in the ensemble running.
Next start the last ZooKeeper:
cd node3
java -Djetty.port=8900 -DzkRun -DnumShards=2 \
    -DzkHost=localhost:9983,localhost:8574,localhost:9900 -jar start.jar
Finally, start the last replica, which doesn't itself run ZooKeeper, but references the ensemble:
cd node4
java -Djetty.port=7500 -DzkHost=localhost:9983,localhost:8574,localhost:9900 \
    -jar start.jar
Just to make sure everything's working properly, run a query:
and check the SolrCloud admin page:
Now you can go ahead and kill the server on 8983, but ZooKeeper will still work, because you have more than half of the original servers still running. To verify, open the SolrCloud admin page on another server, such as:



Comments

Popular posts from this blog

PostgreSQL bytea and oid

MySQL as Hive metadata store

Microservices Architecture with Spring Boot in 15mins