Bulk elasticsearch file split

I recently needed to split a pretty massive (2M-line) file to bulk load into Elasticsearch. I’m sure others have run into this same problem. Here’s the code:


#!/usr/bin/ruby

filename = ARGV[0]
max_commands = 10000  # Max number of commands per file
lines_per_command = 2 # Each elasticsearch command is 2 lines

# Set up to iterate (don't change these)
count = 0
iteration = 0
outstring = ""

# Take the lines of the file, one command at a time, and add them to a temp string
File.foreach(filename).each_slice(lines_per_command) do |lines|
  count += 1
  outstring << lines.join

  # once we hit our max, write it out
  if count >= max_commands
    outstring << "\n"
    path = "#{filename}.#{iteration}"
    puts "Writing to #{path}"
    File.write(path, outstring)
    iteration += 1

    # ... and reset for the next iteration
    count = 0
    outstring = ""
  end
end

# ... and don't drop the final partial chunk
unless outstring.empty?
  path = "#{filename}.#{iteration}"
  puts "Writing to #{path}"
  File.write(path, outstring << "\n")
end


You can call this with: ruby ./es_split.rb [file_to_split.bulk]
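For context, each command in the bulk payload is a pair of newline-delimited JSON lines (an action line followed by a document line), which is why lines_per_command is 2. A hypothetical pair, with made-up index and fields:

{ "index" : { "_index" : "intrigue", "_type" : "entity" } }
{ "name" : "example.com", "type" : "DnsRecord" }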

Since the file was actively being written to, I then needed to continually send the new data to Elasticsearch. To do this, I used the following script:

USER=[username]
PASS=[password]
HOST=[hostname]
FILE=[filename]

while true; do
  echo "[x] Splitting file"
  ruby ./es_split.rb "$FILE"

  echo "[x] Sending files to Elasticsearch"
  for x in "$FILE".*; do
    echo "[x] File: $x"
    curl -s -XPOST "https://$USER:$PASS@$HOST/_bulk" --data-binary "@$x"
    sleep 10
  done

  rm "$FILE".*
  sleep 90
done

Find the latest version of the code in the intrigue-core repo.


5 Responses to Bulk elasticsearch file split

  1. konrads3000 says:

    Any reason split(1) didn’t work for you?

    • jcran says:

      You know, looking at it now, split would work just fine for this problem. Thanks for making me feel like an idiot! 🙂
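      For the record, something like the following (GNU coreutils split; -l 20000 is 10,000 two-line commands per chunk, and -d gives numeric suffixes) should produce equivalent output:

      split -l 20000 -d file_to_split.bulk file_to_split.bulk.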

      • konrads3000 says:

        Here’s your x-mas reading list: dpkg -L coreutils | grep man | xargs -L 1 man
        🙂
        On a serious note, you could have used your multi-threaded code and piped data directly from ruby into elasticsearch thereby reducing one write/read cycle of on-disk data. If you add ProgressBar, then you get an estimate of how far you are 🙂
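        A minimal sketch of that piped approach, assuming the same two-line bulk format (the endpoint, credentials, and chunk size below are placeholders, not the original setup):

        #!/usr/bin/ruby
        require "net/http"
        require "uri"

        # Placeholder endpoint and chunk size -- adjust for your cluster
        uri = URI("https://username:password@localhost:9200/_bulk")
        commands_per_request = 10000

        # Read two-line commands in chunks and POST each chunk directly,
        # skipping the intermediate on-disk files entirely
        File.foreach(ARGV[0]).each_slice(2 * commands_per_request) do |lines|
          Net::HTTP.start(uri.host, uri.port, use_ssl: true) do |http|
            request = Net::HTTP::Post.new(uri)
            request.basic_auth(uri.user, uri.password)
            request["Content-Type"] = "application/x-ndjson"
            request.body = lines.join + "\n" # the bulk API wants a trailing newline
            response = http.request(request)
            puts "[x] Sent #{lines.size / 2} commands: #{response.code}"
          end
        end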

  2. jcran says:

    At the scale at which I was grabbing data, I found it was much quicker to drop to bulk vs going directly into elasticsearch one-by-one. See: elasticsearch_bulk.rb vs elasticsearch.rb (https://github.com/intrigueio/intrigue-core/tree/develop/lib/handlers)
